
Part 2: What data scientists could learn from software engineers

"If you're not embarrassed by your first product release, you've released it too late" - Reid Hoffman

Last time we talked about the software developer’s mantra “Make it work, make it right, make it fast” and started an overview of how data scientists could learn from it and adapt it for their own use. In his podcast “Masters of Scale”, Reid Hoffman, LinkedIn co-founder and partner at Greylock Partners, focuses on embracing rapid iteration. He encourages startups and entrepreneurs to ship products in an imperfect state, then rapidly iterate to improve them. “Make it work” for data science is precisely about applying this principle: ship an imperfect model and set yourself up for rapid experimentation with production models. In this post, we'll talk about the drawbacks of some of the more traditional data science workflows and how building out a simple production system can save you a lot of pain in the long run.

CRISP-DM

The "CRoss-Industry Standard Process for Data Mining" (CRISP-DM) has been around for over 20 years and is widely used and referenced.

Process diagram showing the relationship between the different phases of CRISP-DM

CRISP-DM gets a lot right. It’s an iterative process and it’s often non-linear, jumping back and forth between steps as needed. However, for modern data scientists focused on shipping ML models to production, following this standard creates challenges, because deployment happens at the end. Here are some of the common problems that arise:

  • Time. If you wait until the model is perfect before planning or starting your deployment, you could end up waiting months (I’ve been there). A perfect model that’s sitting on the shelf doesn’t provide value.

  • Problems. Every time you take something from the lab to production, there end up being problems of some kind. We've detailed some of these before in "Common problems taking ML from lab to production". The longer you spend on development before you deploy, the more these problems will accumulate.

  • Perfect is the enemy of Good. It's common to spend more time trying to create a 'perfect' model, or to hold back a model that's just a little better than the current one, rather than go through the trouble of deploying it.

Why start with deploying to production?

One of Stephen Covey's “The 7 Habits of Highly Effective People” is to "Begin with the end in mind." That is, to envision the end-state that you want to attain and build out from there. This advice is good for life, and also good for ML systems. Now imagine that we've built a simple production system. Let’s look at how that addresses the problems that often come up due to deploying at the end:

  • Time. You'll no longer spend months building deployment infrastructure after your model is 'perfect'. You can start with a very simple, basic architecture, and improve your deployment architecture alongside your model.

  • Problems. You can circumvent many of the common problems of moving from the lab to production. You're already in production, so it's much easier to tailor your lab experiments to match.

  • Perfect is the enemy of Good. If you already have a model in production, then _any_ model that is better than the current one is worth shipping. This leads to continuous development and improvement.

These problems are circumvented by starting with a simple production system and a stupid model.
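To make the "any better model is worth shipping" idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: `Production`, `maybe_promote`, and `fit_baseline` are hypothetical names, the "production system" is just an in-memory object, and mean absolute error on a small held-out set is an assumed comparison metric, not something this post prescribes.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Optional, Sequence

# A "model" here is just a function from one feature value to a prediction.
Model = Callable[[float], float]


def mean_absolute_error(model: Model, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Average absolute difference between predictions and targets."""
    return mean(abs(model(x) - y) for x, y in zip(xs, ys))


def fit_baseline(ys: Sequence[float]) -> Model:
    """The 'stupid' first model: always predict the training mean."""
    avg = mean(ys)
    return lambda x: avg


@dataclass
class Production:
    """Hypothetical stand-in for a deployed model slot."""
    model: Optional[Model] = None
    error: float = float("inf")  # nothing deployed yet, so anything beats it

    def maybe_promote(self, candidate: Model, xs: Sequence[float], ys: Sequence[float]) -> bool:
        """Ship the candidate only if it beats the current model on held-out data."""
        candidate_error = mean_absolute_error(candidate, xs, ys)
        if candidate_error < self.error:
            self.model, self.error = candidate, candidate_error
            return True
        return False


# Toy data (made up for illustration): the target is twice the feature.
train_y = [2.0, 4.0, 6.0]
holdout_x = [1.0, 2.0, 3.0]
holdout_y = [2.0, 4.0, 6.0]

prod = Production()

# Step 1: get *something* into production, even if it is embarrassing.
shipped_baseline = prod.maybe_promote(fit_baseline(train_y), holdout_x, holdout_y)
baseline_error = prod.error

# Step 2: any candidate that beats the current model gets shipped.
shipped_better = prod.maybe_promote(lambda x: 2.0 * x, holdout_x, holdout_y)

# A candidate that is worse than what is deployed is simply not promoted.
shipped_worse = prod.maybe_promote(lambda x: 0.0, holdout_x, holdout_y)
```

In this sketch the bar for shipping is always "better than what is live", never "perfect", which is the point of the third bullet above. A real system would swap the in-memory object for an actual deployment step and would likely use a metric suited to the problem.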

It leads to continuous development

Continuous development is one of the more pronounced breakthroughs and improvements in the software development community. Most teams are practicing some form of Agile development, where smaller, incremental improvements are delivered to a project, rather than huge monolithic updates. These smaller improvements also tend to be easier to review and understand (and if there are issues, you don't lose nearly as much time and work). By continuously pushing smaller updates, you end up with a much more manageable system.

The same holds true for model development. Once your model is in production, the work becomes updating and improving on that model. This couples the state of the lab and the state of production much more closely, so that they're never far apart. We'll talk about the virtues of continuous development more in the forthcoming "Make it Right" post.

If you're starting to feel convinced, stay tuned and I'll jump into how you should go about making your very first production system, and why, to put an ML spin on Reid Hoffman’s quote, "if you're not embarrassed by your first model release, you've released too late".


Written by Max Boyd, Data science lead