
The impact of the right answer to the wrong problem on data science problems

"If I had an hour to save the world, I'd spend the first 55 minutes defining the problem." - attributed to Albert Einstein

In my experience, the most expensive mistake in data science is spending days, weeks or even months collecting data and building a solution to the wrong problem. Previously, I talked about false positives and false negatives, and how important it is to carefully consider the impact that your model's predictions will have (and the decisions you're making based on them). In statistics, these errors are called (uncreatively) Type I and Type II errors. In 1957, Allyn W. Kimball described "the error committed by giving the right answer to the wrong problem" as a Type III error, or what he called "an error of the third kind".

To illustrate this point, let’s talk about dating. Dating in general is hard, and in the times of COVID-19, it's even harder. People, including myself, have turned to online dating to try to meet people in a socially distanced way. These systems all have a number of filters and levers for setting preferences, and it's difficult to tell how loose or strict I should be with them.

To help with this, I might take inspiration from Peter Backus' 2010 paper, "Why I don't have a girlfriend," and apply the Drake Equation to dating. He started with the total number of people in the UK, then filtered down by gender, location, educational level, age and a placeholder estimate of attraction - exactly what I might do for a dating app. While Backus came up with 26 potential matches, let's say I used a modified version of this model to adjust my criteria so that I had 1000.
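The filter-down arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the Drake-style approach: the function name `estimate_matches`, the starting population and every fraction below are hypothetical placeholders chosen for the example, not the figures from Backus' paper.

```python
from math import prod
from typing import Mapping


def estimate_matches(population: int, fractions: Mapping[str, float]) -> float:
    """Multiply a starting population by a chain of filter fractions."""
    for name, value in fractions.items():
        # Each filter keeps some share of the remaining pool, so it must be in [0, 1].
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return population * prod(fractions.values())


# Hypothetical inputs: none of these numbers come from Backus' paper.
population = 1_000_000
filters = {
    "gender": 0.5,
    "location": 0.1,
    "age_range": 0.2,
    "education": 0.25,
    "mutual_attraction": 0.05,
}

matches = estimate_matches(population, filters)
print(f"Estimated potential matches: {matches:.0f}")
```

With these made-up inputs the estimate comes out at roughly 125 people. Loosening or tightening any one fraction scales the result proportionally, which is what makes it tempting to keep turning the levers until the number looks right.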

Let's fast forward to Valentine's Day, where I'm sitting across from my date at a romantic restaurant and they ask me, "Am I the only one for you?" Using my previously built model, I quickly respond, "No! You're one out of 1000..." Before I know it, I'm seated alone, adding a manual "-1" adjustment term to my model. I've just made a classic error: I gave the correct solution to the wrong problem.

Where do these errors come from?

OK, so if this is the most expensive issue that we encounter, how can we avoid making this error? In general, I've seen it stem from one common issue: the solution wasn't built by data scientists who fully understand the business problem and context. Where does this go awry? I've seen it happen largely in two different ways:

  • The solution was designed by management (either data science management or business management)

  • The solution was designed by data scientists who were separated from the business problem

Data science and modeling problems are highly data- and context-specific, so solutions tend to require specific knowledge of both. It's a common occurrence, especially in NLP and image classification, that models trained in one context fail to perform when applied to another. Subtle changes in the question being asked can have drastic impacts on what a successful model looks like. When solutions are designed without a full understanding of the context those models exist in, you'll build the wrong solution to the problem.

I've worked on several systems to automate work, and oftentimes the ask is for a fully automated solution. These systems are hard to build and often unrealistic, yet it’s all too common for directors and executives to demand them. We don't know what's going to work ahead of time, and often the best solution is one we discover along the way, not one that we had planned out in the beginning. Take to heart what Sir Alexander Fleming said about his discovery of penicillin: “One sometimes finds what one is not looking for. When I woke up just after dawn on Sept. 28, 1928, I certainly didn’t plan to revolutionize all medicine by discovering the world’s first antibiotic, or bacteria killer. But I guess that was exactly what I did.”

What's easy vs. what's hard

One of the common themes in data science is that people (including data scientists) often don't have a great grasp of what's easy and what's hard. I've often been asked to build a single model for tasks that are "simple" for a human to do, but that actually require several stacked models to complete. There’s a reason that “identify all the pictures with stop lights” tends to be a reasonably effective method of detecting bots.

The reason that we call it data "science" and base our process around running experiments is that we don't know for sure what will and won't work. Successful model design and data analysis typically require being as close to the data as possible, a role that data science management seldom fills. Management (both data science management and overall business management) is great at identifying business objectives, but in the data science space, they aren't particularly great at designing the solutions that achieve them. Data science is a creative endeavor, which needs the latitude to try different modeling paths to achieve success.

To build effective solutions, data scientists need to understand and empathize with the end users of the model. Failing to do so leads to solving problems for the wrong person. I've seen this happen in two ways: either the data scientist thinks they know the problem as well as (or even better than!) their users, or they're denied access to and face time with their users and instead only interact with managers and go-betweens.

I've spent months building a model to improve a process as described by a manager, only to discover that the way the manager interacted with the problem was drastically different from the way the workers interacted with it. Business leaders and managers (and to some extent data scientists) tend to view and deal with problems in the aggregate, whereas end users deal with the one problem in front of them right now.

Pretend you were working for a dating app company, and you were asked to improve their matchmaking model. The business leaders may see things in terms of the number of “daily average likes” or “daily average right-swipes” or subscriptions. The needs and desires of your users - to find companionship - may never come up. You’d end up building a model that may work for the business leaders in the short term, but fail to meet the needs of your users in the long run.

How to solve the right problem

So, what can you do? First, you need to make sure that the solutions aren't coming from the top down, but are emerging from the conversations with your business partners and your observations of the data. This can mean uncomfortable conversations with your boss or other business leaders. If model designs are coming from Directors, VPs or C-levels, your models are probably being designed by the wrong people.

Second, you need to build rapport and empathy with the end users of your model. What is the problem that they are trying to solve? How do they think about it? If you can, walk a mile in their proverbial shoes to understand what their goals, dreams, and pains are. What information is most important to them (and how will they use the information you give them)? Do this early and often.

One of my most successful modeling endeavors occurred because I spent an hour a week with a team for about 6 months before I ever started building a model for them. By the time I had bandwidth to build the model, I’d already spent 26 hours understanding the problem domain, and had a much surer grasp of the problem space. If I had waited until the problem had already arrived on my desk, it would have taken me significantly longer than 26 hours to get up to speed and understand the domain. Even if I could have spent 26 hours in that first week, my domain partners probably would not have been able to take 26 hours away from their jobs to teach me.

How can your model best help them answer the questions that they're asking themselves, and not just the ones they're vocalizing to you? In other words, when they ask you "Am I the only one for you?", do they want your number of potential matches, or are they asking "Do you love me?"

Written by Max Boyd, Data science lead