
A look into disappearing data and degraded performance preventing ML models from shipping

Surprising but true: 80% of your models never make it out of the lab and into production, and those that do more often than not become stale and hard to update. Today, we’ll cover two common problems you might have hit recently: disappearing data and degraded performance. They’re common regardless of the size of your company, whether you work for a “tech” or “non-tech” company, or how many teams are dedicated to shipping your product.

Disappearing data

Data is constantly changing. Our data warehouses, data lakes, streaming data sources, etc. are constantly growing. New features in the product create new telemetry; you’ve bought a new data source to supplement a new model; an existing database has just gone through a migration; someone accidentally began initializing a counter at 0 instead of 1 in the last deploy... the list could go on. What could possibly go wrong?

Any one of the above changes brings about challenges in ML. First, let’s tackle data availability between your online and offline data sources. You’ve made it all the way through feature engineering, model training, and cross-validation, iterated several times, and you’re finally ready to productionize your model. It turns out, however, that the data you had access to during training and validation is different from what’s available in production.
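One cheap way to surface this mismatch before you invest in training is to compare the features your training set uses against the features production can actually supply at prediction time. The sketch below assumes you can list both sets by name; the function name and the feature names are made up for illustration.

```python
def missing_in_production(training_features, production_features):
    """Return the training features that production cannot supply, sorted by name."""
    return sorted(set(training_features) - set(production_features))


# Hypothetical feature names, for illustration only.
training_features = ["basket_size", "days_since_last_order", "loyalty_tier"]
production_features = ["basket_size", "days_since_last_order"]

gaps = missing_in_production(training_features, production_features)
print(gaps)
```

Any name in `gaps` is a feature your model was trained on but will not see when it is live, which is worth knowing before the model is finished rather than after.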

Now what? More often than not, your model is shelved, sometimes for months or even forever. It may take multiple sprints to make another feature available in production at the moment of prediction. One way to solve these challenges is to use a development environment that supports read-only connections directly to your production data pipeline instead of pulling from offline sources. This, however, may only move the problem upstream: you’ll need to make the business case for adding new data to the pipeline before you can experiment on whether it’s worth the data engineering effort to do so.

Another possible solution is to build the data pipeline yourself, but you’re likely to make rookie mistakes, as data engineering is challenging and isn’t your specialty. A third and emerging solution is to develop your features and register them in a feature store. There are different approaches to feature stores. Some offload the computation and the productionizing of APIs for you, and these can help solve this particular challenge: you iterate on experimental models locally, and the feature store serves APIs that your data engineering team can use directly in production.
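To make the idea concrete, here is a toy, in-memory sketch of the "define once, use in both places" pattern behind a feature store. It is not the API of any real feature store product; `FEATURES`, `register_feature`, `compute_features`, and the example feature are all hypothetical names, and a real system would also handle storage, backfills, and serving.

```python
from typing import Callable, Dict, List

# Toy registry mapping a feature name to the function that computes it.
FEATURES: Dict[str, Callable[[dict], float]] = {}


def register_feature(name: str):
    """Decorator that records a feature function under the given name."""
    def decorator(fn: Callable[[dict], float]) -> Callable[[dict], float]:
        FEATURES[name] = fn
        return fn
    return decorator


@register_feature("avg_item_price")
def avg_item_price(order: dict) -> float:
    """Average price of the items in an order, or 0.0 for an empty order."""
    prices = order["item_prices"]
    return sum(prices) / len(prices) if prices else 0.0


def compute_features(order: dict, names: List[str]) -> Dict[str, float]:
    """Compute the named features for one record.

    The same call is meant to serve offline training rows and online requests,
    so both paths share one definition of each feature.
    """
    return {name: FEATURES[name](order) for name in names}


row = compute_features({"item_prices": [2.0, 4.0]}, ["avg_item_price"])
print(row)
```

The point of the design is that the feature logic lives in one registered function, so training and production cannot quietly drift apart.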

Degraded performance

OK, you’ve solved the disappearing data problem, but now that the model is live, you aren’t saving as much money as you projected with the update. What happened? Oftentimes, this means that something has changed about the parameters of the data that your model relies on, such as the average, minimum, or maximum value the model expects.

To verify, you could take a new validation set from a smaller, more recent time window to see if your results are different from what you expected. Suppose that was it: somehow the price of toilet paper has skyrocketed! But how do you ensure this degradation doesn’t happen in the future? One way is to normalize your data so that you’re training and predicting on normalized data. However, it can be tricky to find the right value to normalize with, and the normalized data might not show the relationship to your target that the raw data once had.
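As a rough illustration of both ideas, the sketch below fits a simple z-score scaler on training values and then applies it to a recent window. Using the training mean and standard deviation is only one possible choice of normalizer, and the prices are invented numbers. If the recent z-scores sit far from zero, the data has shifted.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Return the mean and population standard deviation of the training values."""
    return mean(values), pstdev(values)


def z_scores(values, mu, sigma):
    """Normalize values with the training mean and standard deviation."""
    if sigma == 0:
        raise ValueError("standard deviation is zero; cannot normalize")
    return [(v - mu) / sigma for v in values]


# Invented prices, for illustration only.
train_prices = [4.0, 5.0, 6.0]
recent_prices = [9.0, 10.0, 11.0]

mu, sigma = fit_scaler(train_prices)
recent_z = z_scores(recent_prices, mu, sigma)
print(recent_z)
```

Here every recent price lands several standard deviations above the training mean, which is the kind of signal the smaller, more recent validation window is meant to reveal.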

Another solution is to use event-based data and separate your training set from your validation set by time. Keep a held-out set of your most recent events and validate the accuracy of the model against it. This can catch the problem early, so that you know if the shape of your data is changing faster than you expect.
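A time-based split can be as simple as picking a cutoff and sending everything on or after it to the held-out set. This is a minimal sketch that assumes each event carries a timestamp under a `ts` key; the key name, the cutoff, and the events themselves are hypothetical.

```python
from datetime import date


def split_by_time(events, cutoff):
    """Events strictly before the cutoff go to training; the rest are held out."""
    train = [e for e in events if e["ts"] < cutoff]
    holdout = [e for e in events if e["ts"] >= cutoff]
    return train, holdout


# Invented events, for illustration only.
events = [
    {"ts": date(2021, 1, 5), "price": 4.0},
    {"ts": date(2021, 2, 10), "price": 5.0},
    {"ts": date(2021, 3, 15), "price": 9.0},
]

train, holdout = split_by_time(events, cutoff=date(2021, 3, 1))
print(len(train), len(holdout))
```

Unlike a random split, this guarantees the model is never validated on events older than the ones it trained on, which is closer to how it will be used once it is live.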

An additional step is saving the parameters that your model depends on with the version of the API you’ve released and writing monitoring tests against new values using something like Great Expectations. For instance, maybe you’re looking for 90% of new data to be within a certain range of your parameters. This will allow you to know when performance is likely to have degraded below what is expected and it’s time to retrain.
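Here is a hand-rolled sketch of that check in plain Python. It does not use the Great Expectations API; a tool like that would give you a more complete version of the same idea. The 90% threshold comes from the example above, while the function names, the `model_version` field, and the prices are assumptions for illustration.

```python
import json


def summarize(values):
    """Capture the range the model saw during training."""
    return {"min": min(values), "max": max(values)}


def fraction_in_range(values, params):
    """Fraction of new values that fall inside the saved training range."""
    if not values:
        raise ValueError("no values to check")
    inside = sum(1 for v in values if params["min"] <= v <= params["max"])
    return inside / len(values)


def needs_retraining(values, params, threshold=0.9):
    """True when fewer than `threshold` of the new values are in range."""
    return fraction_in_range(values, params) < threshold


# Invented prices, for illustration only.
training_prices = [4.0, 4.5, 5.0, 5.5, 6.0]

# Save the parameters alongside the released version (hypothetical layout).
saved = json.dumps({"model_version": "v1", "price": summarize(training_prices)})

# Later, in monitoring: load the saved parameters and test new data.
params = json.loads(saved)["price"]
new_prices = [5.0, 5.5, 9.0, 10.0]
alert = needs_retraining(new_prices, params)
print(alert)
```

Storing the parameters with the version matters because the acceptable range belongs to a specific trained model, and it should change whenever you retrain.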

These are only two of several common problems you might encounter when trying to take your ML model out of the lab and into production. Is there a problem you keep running into that you’d like to talk about or one you’d like help solving? Send me a note and we’ll get into it!


Written by Charna Parkey, Ph.D., Data science lead