Blog | A month ago | 6-8 mins
Why Jupyter delights Data Scientists and terrifies Machine Learning Engineers.
Jupyter notebooks are the absolute worst thing ever. That is, until I imagine trying to do data exploration and manipulation in some other platform and then they become the best thing ever. Of course, that's only until I think about how to take the models and features created in a notebook and productionize them - then they're back to the worst thing ever. And the cycle continues.
I realize that my love-hate relationship with notebooks has to do with the multiple hats that I typically wear in my roles. While I like to call myself a "full-stack" data scientist, at the small startups that I've been to, I often have to play the role of both Data Scientist and Machine Learning Engineer. As a Data Scientist, I haven't found something that fits my needs better than the Jupyter notebook. As a Machine Learning Engineer - I want those notebooks nowhere near my production systems. Let's talk about why.
When I'm wearing my Data Scientist hat, I'm usually in a messy, exploratory, unknown environment. I'm probably looking at data that I've either never seen before or don't fully understand. Jupyter provides the flexibility and interactivity that is essential for me to dive in and understand what's happening.
These explorations are often non-linear: diving deep into the data, resurfacing part-way then exploring from a new angle. Being able to interact and explore the data in this way has saved my butt multiple times. I often find:
Data fields that weren't behaving the way that I initially thought they would
Unique IDs that aren't as unique as promised
"Boolean" columns that only take on values of "1" or "null"
Outliers that need to be identified, inspected, and understood...
The list goes on, all enabled by the easy access to and interaction with the data that Jupyter makes possible.
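Checks like the ones above are cheap to encode once you've been burned by them. A minimal sketch in plain Python (the column names and row data are hypothetical) that flags two of these surprises: IDs that aren't unique, and a "boolean" column that never actually holds True or False:

```python
from collections import Counter

def audit(rows, id_col, bool_col):
    """Run two cheap sanity checks over a list of row dicts.

    Returns the IDs that appear more than once and the set of
    values actually observed in the supposedly boolean column.
    """
    id_counts = Counter(row[id_col] for row in rows)
    duplicate_ids = sorted(k for k, v in id_counts.items() if v > 1)
    bool_values = {row[bool_col] for row in rows}
    return {"duplicate_ids": duplicate_ids, "bool_values": bool_values}

# Hypothetical sample: 'user_id' repeats, 'is_active' is 1-or-null.
rows = [
    {"user_id": "a1", "is_active": 1},
    {"user_id": "a1", "is_active": None},
    {"user_id": "b2", "is_active": 1},
]
findings = audit(rows, "user_id", "is_active")
print(findings["duplicate_ids"])  # ['a1']
print(findings["bool_values"])    # {1, None}, never a real True/False
```

In a real project these assertions would live in a test or a data-validation step rather than a throwaway cell, which is exactly the gap between exploration and production discussed below.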
These same virtues of non-linear exploration and code execution become a nightmare when it comes time to productionize. I've opened up more than my fair share of notebooks (mostly my own) that I needed to productionize which looked like the fever dreams of a madman. The non-linearity that was so beneficial for the data scientist turns into a nightmarish choose-your-own-adventure book for the machine learning engineer trying to recreate the final path that was taken. All the failures, mistakes, and dead ends that were useful for the data scientist to hold onto make the machine learning engineer's job closer to archaeology: trying to discover and interpret the hidden meaning behind stray for loops, boolean flags, and dropped columns.
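That archaeology does have one machine-readable clue: each code cell in the saved .ipynb JSON records when it was last run in its execution_count field. A small sketch (pure stdlib, structure per the .ipynb format) that flags cells saved out of top-to-bottom execution order:

```python
import json

def out_of_order_cells(notebook_json: str) -> list:
    """Return indices of code cells whose execution_count is lower
    than an earlier cell's, i.e. cells last run out of order."""
    nb = json.loads(notebook_json)
    flagged, highest = [], 0
    for i, cell in enumerate(nb["cells"]):
        if cell.get("cell_type") != "code":
            continue
        count = cell.get("execution_count")
        if count is None:  # cell was never executed
            continue
        if count < highest:
            flagged.append(i)
        highest = max(highest, count)
    return flagged

# Hypothetical notebook: the middle cell was re-run after the last one.
nb = json.dumps({"cells": [
    {"cell_type": "code", "execution_count": 1, "source": "x = 1"},
    {"cell_type": "code", "execution_count": 3, "source": "y = x + 1"},
    {"cell_type": "code", "execution_count": 2, "source": "print(y)"},
]})
print(out_of_order_cells(nb))  # [2]
```

A check like this in CI won't untangle the fever dream, but it can at least refuse to merge a notebook whose saved state can't be reproduced by running it top to bottom.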
Visualizations are critical for a data scientist to do their job quickly and effectively. Finding hidden patterns, outliers, or other points of interest is much easier when the data can be viewed from the right angle. Visualizations help us not miss the forest (the distribution) for the trees (individual data points). Jupyter makes it easy to craft visualizations that then inform the data scientist's decisions about further feature engineering.
Jupyter is also incredibly extensible, which allows a Data Scientist to add in useful functionality for themselves. There are a number of extensions available which capture and display useful metadata, like when code was last executed and how long it took to run. Jupyter can also churn out some stunningly beautiful reports. There are extensions that automatically add a navigable, searchable table of contents whose hyperlinks guide a reader through the experience. There's nothing quite as satisfying as walking through a beautifully crafted laboratory notebook, reading a scientist's experiments and findings presented with care. Being able to easily add, format, and edit your thoughts, observations, and graphs helps guide your audience through your reasoning to your conclusions.
That same flexibility and reliance on visualizations leads to one of the areas where Jupyter becomes a struggle: reviewability and stability. Jupyter notebooks can be incredibly difficult to version control, review, collaborate on, and reproduce. Ask most programmers today to code without version control and code review, and they'd look at you in shocked horror. Tools like Git are industry standard and make lives easier.
When I'm wearing my Machine Learning Engineer hat, I want to put everything in version control and have it go through some form of review process - mostly to avoid my own stupid mistakes. Being able to easily track what has changed and when that change occurred is a vital element of a stable codebase. Most tools out there are built around the ability to easily review code - not the analysis, visualizations and other outcomes.
Jupyter notebooks inherently make finding the differences between two versions incredibly challenging. Notebooks aren't purely code; they carry data tables and visualizations (which can be non-deterministic). Most code review systems weren't built to handle this, so visualizations largely appear unrendered, with no indication of what changed between them - making the review of a changing visualization effectively impossible. Additionally, since Jupyter persists notebooks as JSON blobs full of outputs and metadata, diff programs can mark nearly everything as changed while surfacing no meaningful difference. Features like last-run timestamps and the navigable table of contents are useful for the data scientist, but that metadata becomes additional noise in review when the diff tool can't understand it. For that reason, unlike normal code review, most successful notebook reviews I've done have been two people side-by-side, over-the-shoulder, looking at the same screen. In these reviews we don't just want to review the code itself, but also the data, visualizations, and observations. The tools to do that easily just aren't there yet.
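Tools like nbstripout and nbdime exist for exactly this problem, and the core idea behind stripping fits in a few lines of stdlib Python. This is a simplified sketch of that idea, not the real tool: clear outputs, execution counts, and per-cell metadata before committing, and serialize deterministically, so diffs only show code and markdown.

```python
import json

def strip_notebook(raw: str) -> str:
    """Remove outputs, execution counts, and per-cell metadata from
    an .ipynb JSON string, leaving a diff-friendly skeleton."""
    nb = json.loads(raw)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
        cell.pop("metadata", None)  # timings, collapsed state, etc.
    # sort_keys makes the serialization deterministic for diffing
    return json.dumps(nb, indent=1, sort_keys=True)

# Hypothetical dirty cell with output blobs and run metadata.
dirty = json.dumps({"cells": [{
    "cell_type": "code",
    "execution_count": 42,
    "metadata": {"last_run": "2021-03-01T12:00:00"},
    "outputs": [{"output_type": "stream", "text": "huge blob"}],
    "source": "df.describe()",
}]})
clean = json.loads(strip_notebook(dirty))
print(clean["cells"][0]["outputs"])          # []
print(clean["cells"][0]["execution_count"])  # None
```

Wired up as a git pre-commit filter, a stripper like this makes the code reviewable - but notice what it does to the data scientist's workflow: the tables, plots, and run history that made the notebook valuable are exactly what gets thrown away.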
Ok, so we've recognized the issue. Jupyter Notebooks: can't productionize with 'em, can't train without 'em. So what do you do? I wish I had a good answer here. There are practices that can help mitigate some of these issues (containerization) and understanding some of the anti-patterns that come up with Jupyter Notebooks certainly helps. But even knowing the issues, it’s still easy to fall into a notebook of ever-increasing size and iterations.
While technologies like SageMaker and Papermill exist to make it easier to productionize notebooks, the challenges in easily reviewing changes aren't well addressed. Reproducibility issues, since notebooks allow for non-linear execution of code, typically make these tools non-starters for me. (Though I do recognize that both Papermill and SageMaker have success stories - if you've used them in your production pipelines, I'd love to hear from you.)
These challenges in reviewability and reproducibility keep me up at night as a Machine Learning Engineer. There are some tools making reviewability better - nbdime, ReviewNB, and Neptune, to name a few - but they still don't solve the whole problem. Most of these solutions are trying to solve a data science problem with a solution built specifically for software engineering. I have strong feelings about this, but I can't say it any better than Charna did in her article for DevOps.com: MLOps is more than just automation.
One of the things that excites me about being at Kaskada is the opportunity to help build a solution that addresses both the data scientist's need for flexible iteration and the machine learning engineer's need for reviewable and reproducible work. I'm excited for a solution that lets my inner data scientist go wild, without traumatizing my inner machine learning engineer. Jupyter won't go away, and I will always love it, but it's time for newer, better tools that let data scientists be data scientists while serving the needs of the machine learning engineer. Let's serve the whole problem.
Written by Max Boyd, Data science lead