The ML project failure funnel
Anyone who has worked in tech in the data science / MLE / applied ML field knows that most ML projects will fail. Newcomers to the field have super high expectations about what they can do, and, sadly, this leads to a lot of them leaving the space altogether. ML project failure is an inevitable part of life in this field and it's happened to me many times at different stages of my career. In the last couple of years, I've begun to think about the building of ML projects as a sort of funnel process: ideas are cheap and plentiful, but only a small number of them reach the end of the funnel and become productionized, solid, used, effective, and maintained ML applications.
Read more here...
ChatGPT and the end of software engineering?
One of the hottest ChatGPT takes I've heard, mostly from non-technical, tech-enthusiast folks, is that it, or one of its near-term successors, will effectively put software engineers into the dustbin of history. The following is my attempt to collect my thoughts around this and present my case to a lay audience on why this will not be the case.
Read more here...
Misc October updates 🍂 🎃 and PyData NYC
Fall is here and you really ought to buy some of those pumpkin pancake/waffle mixes at the store. I'm there for it. I've been thinking a lot recently about data-driven culture and who is responsible for building it and maintaining it in organizations. Some people say this responsibility lies primarily with the executive leadership. Others say it's primarily on data teams to show their value, stop being a passive service team, and evangelize the value of data to product teams. I do think that as a generation of data practitioners moves up the leadership ladder and enters the C-suite, strong data culture will be easier to build. I'm not sure where I come out on this, but I'd like to formalize my thoughts around this and write it up sometime soon. Also, the November PyData NYC conference is around the corner. I'll be running another causal inference tutorial session and I'm excited to learn about the projects attendees will be putting this stuff towards. Beyond that, I'm really looking forward to sessions like "20 ideas to build social capital in the Data Science ecosystem" and "Causal machine learning for a smart paywall at The New York Times". I remember that when I was a poor grad student the paywall was less smart and we could just freely visit NYTimes.com while in private browser mode. Fall in NYC is gorgeous and I'm looking forward to seeing the colorful foliage in Central Park and Prospect Park, eating at Joe's Pizza in the Village, checking out old haunts, and then getting a parting slice at Joe's Pizza.
Teaching causal inference at SciPy 2022
I recently had the privilege of giving a talk and tutorial session at SciPy 2022 in Austin.
Besides rediscovering how hot central Texas is in the summer (the sun is trying to kill you),
I walked away with some useful insights from the audience: (1) People are hungry to learn more about this topic.
(2) Most people came in thinking that causal inference was a way to improve ML predictions, rather than something more closely related to A/B testing and decision science.
(3) One of the more controversial points was that people should not interpret variable importance measures (e.g. SHAP values) as causal.
You can watch the first half of the talk here and if you'd like to look through the materials / try your hand at the exercises, you can find them here on GitHub.
Keeping up with blogs made less annoying
I love Medium and Substack. They're so great that only a non-English word can truly describe them: fantastico.
However, not all tech leadership, data, and engineering blogs will let you subscribe
and receive email updates on new posts. And some of the best personal blogs out there only drop a new post
a few times a year. You're then forced to periodically and manually check these websites for updates.
I've got a list of 50+ of these non-subscribable blogs that I care about, and checking in on them sucks.
So, I built a tool named Blog Checker (clever name, I know), which automates the process of checking personal websites and blogs for updates.
I have a script on my OS X desktop that I double-click to run this, and within 10 seconds I can find out if
any of my favorite sites has changed since the last time I read it.
Hopefully you can get some use out of it too!
As usual, the code is up on GitHub...
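To make the idea concrete, here is a minimal sketch of one way a checker like this could work. It is not the Blog Checker code from the repo; the function names, the JSON state file, and the hash-the-whole-page approach are all assumptions made for illustration. Each page is fetched, fingerprinted with SHA-256, and compared against the fingerprint saved on the previous run.

```python
import hashlib
import json
import urllib.request
from pathlib import Path
from typing import Callable, Dict, List


def page_fingerprint(content: bytes) -> str:
    """Return a SHA-256 hex digest of the raw page bytes."""
    return hashlib.sha256(content).hexdigest()


def fetch_with_urllib(url: str) -> bytes:
    """Download a page with the standard library (hypothetical default fetcher)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


def find_changed_sites(
    urls: List[str],
    fetch: Callable[[str], bytes],
    state_path: Path,
) -> List[str]:
    """Return the URLs whose content differs from the previous run.

    Assumption: state is a JSON file mapping URL -> digest. A URL seen for
    the first time is recorded but not reported as changed.
    """
    previous: Dict[str, str] = {}
    if state_path.exists():
        previous = json.loads(state_path.read_text())

    current = {url: page_fingerprint(fetch(url)) for url in urls}

    # Changed means: we have an old digest and it no longer matches.
    changed = [u for u in urls if u in previous and previous[u] != current[u]]

    # Keep digests for URLs not checked this run, overwrite the rest.
    state_path.write_text(json.dumps({**previous, **current}, indent=2))
    return changed
```

Calling `find_changed_sites(my_urls, fetch_with_urllib, Path("state.json"))` would then list the sites worth revisiting. One caveat with hashing the whole page: anything dynamic (timestamps, rotating widgets) makes a site look changed on every run, so a real tool would probably fingerprint only the part of the page that lists posts.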
Aggregating great articles on data leadership
Like many data folk, I'm subscribed to a bunch of weekly
data analytics, data science, analytics engineering, etc. email services.
They're roundups of great blog posts, papers, and articles all over the web. Pretty regularly
I come across gems in these roundups that give fantastic insights about data leadership,
but usually a week later the email is buried under 6 feet of crap in my inbox. No more!
Here is my attempt at collecting the crème de la crème posts on data leadership and management, covering topics like
hiring, culture, strategy, org structure, and more.
I recently came across the phenomenon of the awesome-list, and thought it was a format that would work.
Check out the list here...
Obesity, causality, and agent-based modeling
Obesity, as a public health problem, has an enormous number of "causes": the types of farmed foods we tend to subsidize on a national level,
our policies around public transit, the walkability of neighborhoods, the
presence of food deserts, our social networks and their attitudes toward obesity, the media, etc.
All of these complex, interconnected things make it really challenging to perform a causal analysis of potential solutions.
I recently came across a great paper that takes a stab at addressing this problem through simulations, and I think the lessons from this
are very much applicable to some problems we face in the data industry.
Ode to Edward Hopper
Not in any way data-related: my favorite artist is American painter Edward Hopper.
There is a beautiful stillness and moodiness in his work that I love. I was recently at the MoMA
and had a chance to see a few of his works up close. Not sure why, but I felt inspired to do a bit of a deep-dive into one of them.
How do baby names come and go?
I’m in my mid-thirties and as many of my friends are starting their own families, I'm having to learn lots of baby names.
I’ve heard lots of people say that “older” names are becoming popular and in hearing
these baby names I feel like there is something to this. What kind of name trends exist out there?
LSTMs and DNA
First of all, go easy on me, I wrote this way back when LSTMs were big and the original paper on the Transformer model was 6 months old 😉!
You can represent DNA as a sequence of unique symbols (e.g. the four DNA bases T, A, C, and G). That means
machines can learn from them just like they would from sequences of words in documents. Let's see if we can make predictions from these sequences...
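As a small illustration of that representation step (a sketch only, not the code from the post), a DNA string can be turned into a one-hot matrix with one row per base, which is the kind of input a sequence model such as an LSTM could consume. The `ACGT` column order and the function name are arbitrary choices made here.

```python
from typing import List

# Assumed column order for the encoding; any fixed order works.
BASES = "ACGT"
BASE_TO_INDEX = {base: i for i, base in enumerate(BASES)}


def one_hot_encode(sequence: str) -> List[List[int]]:
    """Encode a DNA string as a list of 4-element one-hot rows."""
    encoded = []
    for base in sequence.upper():
        if base not in BASE_TO_INDEX:
            # Ambiguity codes such as N are not handled in this sketch.
            raise ValueError(f"Unexpected base: {base!r}")
        row = [0] * len(BASES)
        row[BASE_TO_INDEX[base]] = 1
        encoded.append(row)
    return encoded


if __name__ == "__main__":
    for base, row in zip("TACG", one_hot_encode("TACG")):
        print(base, row)
```

Each position in the sequence plays the role that a word plays in a document, just with a vocabulary of four symbols instead of thousands.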
Machine learning and art
In 2015 Leon Gatys and colleagues wrote a paper describing an algorithm that
could "separate and recombine content and style of arbitrary images." If you ever wondered what
the Mona Lisa would look like if done in the style of van Gogh's Starry Night,
it turns out this is something an algo can do quite well!
Preparing for the transition to data science
Back when I was a Data Scientist and Program Director at Insight, we helped hundreds of academics with PhDs transition into the data science industry. A colleague and I recently wrote this blog post on what skills one should cultivate to become a data scientist. It includes some wonderful resources that everyone should know about. Check it out!
Visualizing socioeconomic disadvantage across US counties
When we create maps to view the spatial variation of
socioeconomic status, we are typically only viewing the variation of
one factor at a time (e.g. just income or just unemployment rate). I thought it would be useful to create and visualize a summary score of overall "socioeconomic disadvantage" from many socioeconomic
Predicting basketball winners through simulation
Back in the day, a friend and I tried to see if we could create a simulation of a
professional basketball game to predict the winner. We learned a lot on the journey, even
if the destination wasn't that great!
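For a flavor of what simulating a game can mean, here is a deliberately toy Monte Carlo sketch. It is not the model from the post: the per-possession scoring probabilities, the fixed 100 possessions, the two-points-per-basket rule, and the half-credit for ties are all made-up simplifications.

```python
import random
from typing import Tuple


def simulate_game(
    p_score_a: float, p_score_b: float, possessions: int, rng: random.Random
) -> Tuple[int, int]:
    """Play one game: each possession yields 2 points with a fixed probability."""
    score_a = sum(2 for _ in range(possessions) if rng.random() < p_score_a)
    score_b = sum(2 for _ in range(possessions) if rng.random() < p_score_b)
    return score_a, score_b


def estimate_win_probability(
    p_score_a: float,
    p_score_b: float,
    possessions: int = 100,
    n_games: int = 2000,
    seed: int = 0,
) -> float:
    """Estimate team A's win probability by simulating many games."""
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    wins = 0.0
    for _ in range(n_games):
        score_a, score_b = simulate_game(p_score_a, p_score_b, possessions, rng)
        if score_a > score_b:
            wins += 1.0
        elif score_a == score_b:
            wins += 0.5  # toy rule: split ties instead of simulating overtime
    return wins / n_games


if __name__ == "__main__":
    print(estimate_win_probability(0.55, 0.45))
```

The hard part of a real version is everything this sketch assumes away: estimating those probabilities from data and modeling how the two teams interact.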
© Roni Kobrosly 2022