The ML project failure funnel


Anyone who has worked in data science, ML engineering, or applied ML knows that most ML projects will fail. Newcomers to the field often arrive with very high expectations about what they can do, and, sadly, this leads many of them to leave the space altogether. ML project failure is an inevitable part of life in this field, and it has happened to me many times at different stages of my career. In the last couple of years, I've begun to think about building ML projects as a sort of funnel: ideas are cheap and plentiful, but only a small number of them reach the end of the funnel and become productionized, solid, used, effective, and maintained ML applications.
Read more here...

ChatGPT and the end of software engineering?


One of the hottest ChatGPT takes I've heard, mostly from non-technical, tech-enthusiast folks, is that it, or one of its near-term successors, will effectively consign software engineers to the dustbin of history. The following is my attempt to collect my thoughts on this and explain to a lay audience why this will not be the case.
Read more here...

Misc October updates 🍂 🎃 and PyData NYC


Fall is here and you really ought to buy some of those pumpkin pancake/waffle mixes at the store. I'm there for it. I've been thinking a lot recently about data-driven culture and who is responsible for building and maintaining it in organizations. Some people say this responsibility lies primarily with executive leadership. Others say it's primarily on data teams to show their value, stop being a passive service team, and evangelize the value of data to product teams. I do think that as a generation of data practitioners moves up the leadership ladder and enters the C-suite, strong data culture will be easier to build. I'm not sure where I come out on this, but I'd like to formalize my thoughts and write them up sometime soon. Also, the November PyData NYC conference is around the corner. I'll be running another causal inference tutorial session, and I'm excited to learn about the projects attendees will be putting this stuff towards. Beyond that, I'm really looking forward to sessions like "20 ideas to build social capital in the Data Science ecosystem" and "Causal machine learning for a smart paywall at The New York Times". I remember that when I was a poor grad student, the paywall was less smart and we could just freely visit while in private browser mode. Fall in NYC is gorgeous and I'm looking forward to seeing the colorful foliage in Central Park and Prospect Park, eating at Joe's Pizza in the Village, checking out old haunts, and then getting a parting slice at Joe's Pizza.

Teaching causal inference at SciPy 2022


I recently had the privilege of giving a talk and tutorial session at SciPy 2022 in Austin. Besides rediscovering how hot central Texas is in the summer (the sun is trying to kill you), I walked away with some useful insights from the audience: (1) People are hungry to learn more about this topic. (2) Most people came in thinking that causal inference was a way to improve ML predictions, rather than something more closely related to A/B testing and decision science. (3) One of the more controversial points was that people should not interpret variable importance measures (e.g. SHAP values) as causal.
You can watch the first half of the talk here, and if you'd like to look through the materials / try your hand at the exercises, you can find them here on GitHub.

Keeping up with blogs made less annoying


I love Medium and Substack. They're so great that only a non-English word can truly describe them: fantastico. However, not all tech leadership, data, and engineering blogs will let you subscribe and receive email updates on new posts. And some of the best personal blogs out there only drop a new post a few times a year. You're then forced to periodically and manually check these websites for updates. I've got a list of 50+ of these non-subscribable blogs that I care about, and checking in on them sucks. So, I built a tool named Blog Checker (clever name, I know), which automates the process of checking personal websites and blogs for updates. I have a script on my OS X desktop that I double-click to run it, and within 10 seconds I can find out whether any of my favorite sites has changed since the last time I read it. Hopefully you can get some use out of it too!
As usual, the code is up on GitHub...
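
For a rough idea of how this kind of update check can work, here is a minimal sketch. It is not the actual Blog Checker code: the function names and the `seen_hashes.json` state file are hypothetical, and it simply assumes that a change in a page's raw bytes means there is something new to read.

```python
import hashlib
import json
from pathlib import Path
from urllib.request import urlopen


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 hex digest of a page's raw bytes."""
    return hashlib.sha256(content).hexdigest()


def find_changed(pages: dict, seen: dict) -> list:
    """Return URLs whose content differs from the stored fingerprint.

    `pages` maps URL -> raw page bytes; `seen` maps URL -> last stored
    fingerprint and is updated in place. A URL that has never been seen
    before is reported as changed.
    """
    changed = []
    for url, content in pages.items():
        digest = fingerprint(content)
        if seen.get(url) != digest:
            changed.append(url)
            seen[url] = digest
    return changed


def check_blogs(urls, state_file="seen_hashes.json"):
    """Download each URL, compare it against saved state, and save new state."""
    path = Path(state_file)  # hypothetical location for the saved fingerprints
    seen = json.loads(path.read_text()) if path.exists() else {}
    pages = {}
    for url in urls:
        with urlopen(url, timeout=10) as response:
            pages[url] = response.read()
    changed = find_changed(pages, seen)
    path.write_text(json.dumps(seen, indent=2))
    return changed
```

Hashing the whole page is the bluntest option: a site that embeds timestamps or rotating widgets will look "changed" on every run, so a sturdier version would hash only the part of the page that lists posts.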

Aggregating great articles on data leadership


Like many data folk, I'm subscribed to a bunch of weekly email newsletters on data analytics, data science, analytics engineering, etc. They're roundups of great blog posts, papers, and articles from all over the web. Pretty regularly I come across gems in these roundups that give fantastic insights about data leadership, but usually a week later the email is buried under 6 feet of crap in my inbox. No more! Here is my attempt at collecting the crème de la crème posts on data leadership and management, covering topics like hiring, culture, strategy, org structure, and more. I recently came across the phenomenon of the awesome-list, and thought it was a format that would work. Please contribute!
Check out the list here...

Obesity, causality, and agent-based modeling


Obesity, as a public health problem, has an enormous number of "causes": the types of farmed foods we tend to subsidize on a national level, our policies around public transit, the walkability of neighborhoods, the presence of food deserts, our social networks and their attitudes toward obesity, the media, etc. All of these complex, interconnected things make it really challenging to perform a causal analysis of potential solutions. I recently came across a great paper that takes a stab at addressing this problem through simulations, and I think its lessons are very much applicable to some problems we face in the data industry.

Ode to Edward Hopper


Not in any way data-related: my favorite artist is American painter Edward Hopper. There is a beautiful stillness and moodiness in his work that I love. I was recently at the MoMA and had a chance to see a few of his works up close. Not sure why, but I felt inspired to do a bit of a deep-dive into one of them.

Causal Curve 1.0.0! 🎉


What started as a project to pass the time during that initial 2020 COVID summer when nothing was going on grew into a proper Python package! I just released a major, 3500-line overhaul of the causal-curve package.

How do baby names come and go?


I'm in my mid-thirties, and as many of my friends start their own families, I'm having to learn lots of baby names. I've heard lots of people say that "older" names are becoming popular, and in hearing these baby names I feel like there is something to this. What kind of name trends exist out there?

Automating away the "elbow method"


Sometimes when you're tuning a parameter in a machine learning model, you end up needing to look at something like a scree plot to determine the best parameter value. It feels annoying and subjective. Here's a simple way to automate this away.
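
To make the idea concrete, here is a minimal sketch of one common heuristic, which is not necessarily the approach taken in the post: draw a straight line from the first point of the curve to the last, and pick the point that sits farthest from that line. The function name `find_elbow` and the numbers in the usage example are made up for illustration.

```python
def find_elbow(xs, ys):
    """Return the x value of the point farthest from the chord joining the
    first and last points of the curve (one heuristic for the "elbow").

    Assumes xs is sorted in strictly increasing order.
    """
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need at least three (x, y) points")
    if any(b <= a for a, b in zip(xs, xs[1:])):
        raise ValueError("xs must be strictly increasing")

    # Rescale both axes to [0, 1] so neither one dominates the distance.
    x_span = xs[-1] - xs[0]
    y_min, y_max = min(ys), max(ys)
    y_span = (y_max - y_min) or 1.0  # avoid dividing by zero on a flat curve
    points = [((x - xs[0]) / x_span, (y - y_min) / y_span) for x, y in zip(xs, ys)]

    # Perpendicular distance from each point to the first-to-last chord.
    (x0, y0), (x1, y1) = points[0], points[-1]
    dx, dy = x1 - x0, y1 - y0
    chord_length = (dx * dx + dy * dy) ** 0.5
    distances = [abs(dy * (px - x0) - dx * (py - y0)) / chord_length for px, py in points]

    best_index = max(range(len(points)), key=distances.__getitem__)
    return xs[best_index]


# Made-up example: within-cluster error for k = 1..8 clusters.
ks = [1, 2, 3, 4, 5, 6, 7, 8]
errors = [100, 60, 30, 12, 10, 9, 8.5, 8]
elbow_k = find_elbow(ks, errors)  # 4 for these illustrative values
```

Rescaling the axes first matters: without it, whichever axis has the larger numeric range decides the answer.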



First of all, go easy on me, I wrote this way back when LSTMs were big and the original paper on the Transformer model was 6 months old 😉! You can represent DNA as a sequence of unique symbols (e.g. the four DNA bases T, A, C, and G). That means machines can learn from them just like they would from sequences of words in documents. Let's see if we can make predictions from these sequences...
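
As an illustration of the "DNA as words" idea, here is a minimal sketch, assuming one common encoding that the post itself may not use: slide a window of length k along the sequence and map each overlapping k-mer to an integer ID, much like tokenizing a sentence. The names `build_vocab` and `encode` are hypothetical.

```python
from itertools import product

BASES = "ACGT"


def build_vocab(k):
    """Map every possible k-mer over A, C, G, T to an integer ID."""
    return {"".join(kmer): idx for idx, kmer in enumerate(product(BASES, repeat=k))}


def encode(sequence, k=3):
    """Turn a DNA string into a list of integer IDs, one per overlapping k-mer.

    Raises KeyError if the sequence contains anything other than A, C, G, or T.
    """
    vocab = build_vocab(k)
    seq = sequence.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return [vocab[kmer] for kmer in kmers]


token_ids = encode("AAAC", k=3)  # k-mers "AAA" and "AAC" -> [0, 1]
```

The resulting integer lists can be fed to an embedding layer the same way word IDs would be in a text model.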

Machine learning and art


In 2015 Leon Gatys wrote a paper describing an algorithm that could "separate and recombine content and style of arbitrary images." If you ever wondered what the Mona Lisa would look like if done in the style of van Gogh's Starry Night, it turns out this is something an algo can do quite well!

Preparing for the transition to data science


Back when I was a Data Scientist and Program Director at Insight, we helped hundreds of academics with PhDs transition into the data science industry. A colleague and I recently wrote this blog post on what skills one should cultivate to become a data scientist. It includes some wonderful resources that everyone should know about. Check it out!

Visualizing socioeconomic disadvantage across US counties


When we create maps to view the spatial variation of socioeconomic status, we are typically only viewing the variation of one factor at a time (e.g. just income or just unemployment rate). I thought it would be useful to create and visualize a summary score of overall "socioeconomic disadvantage" from many socioeconomic indicators.

Predicting basketball winners through simulation


Back in the day, a friend and I tried to see if we could create a simulation of a professional basketball game to predict the winner. We learned a lot on the journey, even if the destination wasn't that great!