
How data treated as "facts" degrades model accuracy and a proposed solution using event streams

I'm not the same person I was 5 years ago. By whatever metric you choose, socially, emotionally, professionally, or physically, I have made substantial changes (hopefully progress!) in each of these areas. Unfortunately, many companies aren't set up to account for this kind of change, and that can lead to their ML models making mistakes ranging from annoying to deeply hurtful.

The State Of The World

Most analytical databases are built on a model that deals with "facts" and "dimensions." Facts typically represent the transactions that occur in the company's OLTP system, whereas dimensions describe the entities involved in those transactions. Unfortunately, these systems are often designed to reflect only the current state of the world and don't handle changing dimensions particularly well. Let's take a look at an example customer.

An example dimension table
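The original table isn't reproduced here, but a row in such a customer dimension might look roughly like the following sketch; the column names and values are illustrative, drawn from the examples below:

```python
# A hypothetical row in a customer dimension table.
# Column names and values are illustrative only.
customer_row = {
    "customer_id": 42,
    "first_name": "Max",        # a preferred name, not necessarily a legal one
    "city": "Seattle",
    "state": "WA",
    "gender": "M",
    "birth_year": 1887,         # suspiciously old -- probably a typo
}
```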

These values may seem innocuous, but every one of them can change, whether to correct an inaccuracy or because of a genuine shift in my customer data. Here are a few examples:

  • "Max" isn't my legal first name.

  • I could move.

  • It's not uncommon for gender to change.

  • I wasn't _actually_ born in 1887, that might be a typo that needs fixing.

Can your data systems handle these changes? And if they can, do they still retain the history of what information used to be there? Let’s talk about why you want to have systems in place to handle changing data and data errors.

Changing Data

Incorrect past feature vectors can cause serious issues. In the world of housing prices, they lead to large errors. I've built hilariously bad predictions for buildings that used to be houses but are now apartment complexes, because the database was never updated with the change. And vice versa - I've scratched my head at how some pretty great houses sold for so little in the past, because their old feature vector had been overwritten. Switching to a model that lets entities evolve over time (and captures those changes) will vastly improve the quality of your models.

It really comes down to keeping your training data clean and keeping false positives and false negatives out. Suppose that I respond well to ads (I don't, but for the sake of the example let's say I do) and that next week I move to Austin, TX. Historically I've responded well to ads for Seattle-based restaurants, but once I move to Austin, I won't. There are three likely scenarios here:

  1. My address is updated in a dimension table and my Austin address is associated with Seattle transactions.

  2. My address is not updated in a dimension table and my Seattle address is associated with Texas transactions.

  3. A new row is created for me with the new address, and I’m represented as two separate entities in the system.

Each of these scenarios comes with its own set of problems:

  1. My model will learn that ads for Seattle-area restaurants work well in Austin - subjecting Austin residents to ads that they can never use. Additionally, the model will lose fidelity in geographic information, causing model degradation for that feature.

  2. If my address is not changed, then any current model won’t be able to serve me accurate recommendations (instead taunting me with ads for Dick’s Drive-In). If the change in location isn’t recorded - the model won’t be able to learn the cause of my change in behavior, and may misattribute it to other factors.

  3. Making two copies of me in the system will cause numerous issues. Your user metrics will clearly be wrong: one version of me appears to be a lost customer, while the other looks like a recently acquired one. The list goes on.
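One way to sidestep all three traps is to keep a single customer entity with a dated address history and join each transaction to the address that was valid at the time. Here's a minimal sketch using pandas; the table layout, column names, and values are all made up for illustration:

```python
import pandas as pd

# One customer entity with an address history, instead of a single field that
# gets overwritten. valid_from / valid_to bound the period each address held.
address_history = pd.DataFrame([
    {"customer_id": 1, "city": "Seattle", "valid_from": "2015-01-01", "valid_to": "2024-06-30"},
    {"customer_id": 1, "city": "Austin",  "valid_from": "2024-07-01", "valid_to": "2100-01-01"},  # open-ended
])

transactions = pd.DataFrame([
    {"customer_id": 1, "txn_date": "2023-03-15", "restaurant": "Dick's Drive-In"},
    {"customer_id": 1, "txn_date": "2024-08-02", "restaurant": "Some Austin BBQ Joint"},
])

for df, cols in [(address_history, ["valid_from", "valid_to"]), (transactions, ["txn_date"])]:
    for col in cols:
        df[col] = pd.to_datetime(df[col])

# Join each transaction to the address that was valid when it happened, so
# Seattle purchases stay attached to the Seattle address and Austin ones to Austin.
joined = transactions.merge(address_history, on="customer_id")
joined = joined[(joined["txn_date"] >= joined["valid_from"]) & (joined["txn_date"] <= joined["valid_to"])]
print(joined[["txn_date", "restaurant", "city"]])
```

There is one entity, no overwritten history, and the model sees the geography that was actually true at the time of each transaction.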

Handling Data Errors

The ability to understand why a model made the predictions it made in the past will save your butt at some point. Let's imagine that your model just made a prediction that cost $100,000. Was it a flaw in the model? Was it an error in the data? Without some way of either snapshotting the feature vector fed to your model or rebuilding the point-in-time feature vector, you're going to end up with an unsatisfactory "I don't know" (never the ideal answer to "Why did we lose $100,000?"). This is the main reason you don't want to delete data errors, but instead correct them with a separate event. Acknowledging that something was an error will absolutely help you understand your system (and how others interact with it) going forward. Covering up the error just means losing data and information.
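One lightweight way to get there, sketched below with invented event names and fields, is to log the exact feature vector alongside every prediction and to record corrections as their own events rather than as edits to old records:

```python
import json
from datetime import datetime, timezone

def log_event(path, event):
    """Append one event per line to an append-only log; nothing is edited in place."""
    event["recorded_at"] = datetime.now(timezone.utc).isoformat()
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")

# Snapshot exactly what the model saw when it made the expensive call.
log_event("events.jsonl", {
    "type": "prediction",
    "model_version": "pricing-v3",        # hypothetical identifiers throughout
    "entity_id": "building-123",
    "features": {"bedrooms": 3, "sq_ft": 1800, "property_type": "house"},
    "predicted_price": 412000,
})

# Later we learn the "house" is actually an apartment complex. Don't delete or
# overwrite the old record; append a correction event that says what was wrong.
log_event("events.jsonl", {
    "type": "correction",
    "entity_id": "building-123",
    "field": "property_type",
    "old_value": "house",
    "new_value": "apartment",
    "reason": "county records update",
})
```

With a log like that, "Why did we lose $100,000?" becomes a matter of reading back the prediction event and every correction recorded since.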

How To Do This

There are two main ways to do this: (1) maintain a database and log all changes made to that database, or (2) log all events and construct a database from that event log. There's a great article by Jay Kreps about how these two approaches end up being complementary, and I would be remiss if I didn't mention it.

Personally, I prefer the paradigm where all incoming data and events are stored in event logs or event streams, and databases are computed from those events. This isn't a new concept by any means, but advances in cloud computing, cheap storage, and data transfer have made it much more practical. If you build your data structures from these event streams, you can define processes that read and interpret the events. I've found this approach significantly better for programmatically reconciling two event streams that provide conflicting information.
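As a rough sketch of that idea, reusing the invented event shapes from above, a "table" becomes nothing more than a fold over the log:

```python
import json

def build_current_state(event_log_path):
    """Replay an append-only event log into the latest known state per entity.
    Event shapes follow the made-up examples above, not any standard."""
    state = {}
    with open(event_log_path) as f:
        for line in f:
            event = json.loads(line)
            entity = state.setdefault(event["entity_id"], {})
            if event["type"] == "attribute_set":
                entity[event["field"]] = event["value"]
            elif event["type"] == "correction":
                entity[event["field"]] = event["new_value"]
            # other event types (predictions, transactions, ...) would feed
            # other derived tables built from the same log
    return state
```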

The main benefit of this approach is that you can build multiple databases and database views from a single event stream. You can have one database designed for auditing, which answers questions about what you knew at the time, and another that reflects what the actual state of the world was. This gives you the flexibility to understand the impact of data errors and why you made the predictions you did in the past, while ensuring that you have accurate, reliable data for training your current machine learning models.
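For example, still with invented event shapes, those two databases are just two different folds over the same log: one that ignores everything recorded after a given date, and one that applies every correction we know about today:

```python
# A tiny in-memory, illustrative event log. `recorded_at` is when we learned
# something, not when it became true.
events = [
    {"recorded_at": "2023-01-10", "entity_id": "cust-1", "field": "city",       "value": "Seattle"},
    {"recorded_at": "2023-01-10", "entity_id": "cust-1", "field": "birth_year", "value": 1887},
    {"recorded_at": "2024-02-01", "entity_id": "cust-1", "field": "birth_year", "value": 1987},  # typo corrected
    {"recorded_at": "2024-07-01", "entity_id": "cust-1", "field": "city",       "value": "Austin"},
]

def view(events, as_of=None):
    """Fold events into per-entity state. With `as_of`, ignore anything we had not
    yet recorded by that date -- i.e. answer 'what did we know at the time?'"""
    state = {}
    for e in sorted(events, key=lambda e: e["recorded_at"]):
        if as_of is not None and e["recorded_at"] > as_of:
            continue
        state.setdefault(e["entity_id"], {})[e["field"]] = e["value"]
    return state

print(view(events, as_of="2023-06-01"))  # audit view: birth_year still shows as 1887
print(view(events))                      # current view: typo fixed, city is Austin
```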

Any system that makes repeated predictions for the same target over time should ensure that the attributes of that target can change, while the historic state remains available. Building your models and data systems off of event streams empowers you to capture the changing state of the entities in your system accurately and to make the best predictions possible.


Written by Max Boyd, Data science lead