
Calculating the tradeoffs between false positives and false negatives

COVID-19 False Positives vs. False Negatives

With the current state of the world, public attention has centered on testing. Governor Mike DeWine of Ohio recently tested positive for COVID-19, and then subsequently tested negative. This sparked debate and articles about the accuracy of tests, and if you're anything like me, you've probably had conversations with friends and family about the intricacies of testing, false positives, false negatives, and specificity.

In general, it seems like an opportunity to talk about how we design tests and how those tests should be guided by the context they're in. Most folks don't study or understand statistics in a nuanced way, so it's part of our job as data scientists to do that translation and help frame those conversations in a manner that is understood. Coincidentally, this is also the number one reason why I would pass on a data science candidate: they understand the models and formulas, but haven't dug into the reasoning behind why you might make different choices. Let's talk about what the different kinds of errors are, what they mean in context, and then how to use them to make intelligent, ethical decisions.

Understand your errors


Most folks approach experimental design formulaically. Typically they'll set a null hypothesis of "no effect," set a significance threshold of 0.05, and conduct a two-tailed normal distribution test (even if that test isn't appropriate). Setting aside the choice of test and assumed distribution, there are deep implications in the selection of the null hypothesis and the significance threshold that need to be grappled with.

what choice will you make in the absence of evidence to the contrary?

With the null hypothesis, you're stating your default stance - "what choice will you make in the absence of evidence to the contrary?" Imagine that you need to save 200 ms in compute time for your model, and you've found a change in your feature engineering that will accomplish this. Instead of a default position of “only publish this change if it improves model accuracy,” you may want a default position of "let's push this change to improve model performance, unless it significantly diminishes model accuracy."

Possibly the easiest way to think about it is: are you "innocent until proven guilty," or "guilty until proven innocent?" Every time you establish a default position, you are expressing your values as an individual and as an organization. Too many data scientists abdicate this responsibility by just using defaults, and not grappling with the harder question: "what do you value?" There is no easy answer to this question; it requires understanding the impacts of your model when it makes incorrect predictions.

False Positives

False positives happen. There's no getting around it. One of the things that I like about statistics is that we always allow a chance that what we're seeing is just a fluke in the data. If you set a significance threshold of 5%, you're allowing yourself a 5% chance of a false positive. But before you go setting that significance threshold, you need to grapple with "what will happen when a false positive occurs?"
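To see that a 5% significance threshold really does buy you roughly a 5% false-positive rate, here's a minimal simulation (a sketch using toy data and the standard library only): we run many two-sample tests where the null hypothesis is true by construction, and count how often the p-value dips below 0.05 anyway.

```python
import math
import random

random.seed(0)

def two_sided_p(sample_a, sample_b):
    """Two-sample z-test p-value, assuming known unit variance for simplicity."""
    n = len(sample_a)
    mean_diff = sum(sample_a) / n - sum(sample_b) / n
    se = math.sqrt(2.0 / n)  # standard error of the difference in means
    z = mean_diff / se
    # two-tailed p-value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

ALPHA = 0.05
TRIALS = 2000
false_positives = 0
for _ in range(TRIALS):
    # both samples come from the SAME distribution: the null is true
    a = [random.gauss(0, 1) for _ in range(50)]
    b = [random.gauss(0, 1) for _ in range(50)]
    if two_sided_p(a, b) < ALPHA:
        false_positives += 1

print(false_positives / TRIALS)  # hovers near 0.05
```

Every "significant" result here is a fluke by construction, and yet about one in twenty tests produces one. That's the chance you sign up for when you pick alpha.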

If you're building a classification model, it's vital that you understand the impact of your errors. Different models have different repercussions. If we're building a fraud detection model, and we mark a transaction as potentially fraudulent, the worst case scenario for us as a business is that we lose a customer, whereas the worst case scenario for a customer is that they aren’t able to make a purchase that they might desperately need.

If we're testing for COVID-19 and we get a false positive, we have someone quarantine for 14 days, or until they get 3 days of subsequent negative tests (which shouldn't be too difficult given they're negative for COVID-19). That quarantine could impact their ability to work and provide for themselves and their family. If we're in the justice system and we get a false positive, then we'd be sentencing an innocent person to prison, or even to death.

False Negatives

False negatives also happen; sometimes there's not enough data to definitively rule out the null hypothesis. No model is perfect at detection, and sometimes things will slip by. So what happens if we miss something?

In the case of fraud, we will lose money; whether that's $100, $1,000, or $10,000, we can quantify the cost fairly well. In the case of COVID, a false negative is trickier because we're not measuring it in dollars but in lives. Catching the disease late may mean that someone gets too sick before they get the help they need. Even if they remain asymptomatic, they can spread it to others and get them sick. In the justice system, a false negative means that a guilty person goes free.

Calculating the Tradeoffs

False positives and false negatives are always a trade-off. By increasing the sensitivity of your test, you can reduce false negatives at the risk of more false positives; conversely, reducing the sensitivity reduces false positives at the cost of more false negatives. From here, we can actually start to have the harder, more impactful conversations.
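In a classification model, the decision threshold is one concrete knob for this tradeoff: lowering it flags more cases (more sensitive, more false positives), raising it flags fewer (more false negatives). A toy sketch, with made-up score distributions:

```python
import random

random.seed(1)

# Toy scores: true positives (e.g. actual fraud) score higher on average
# than true negatives, but the distributions overlap.
positives = [random.gauss(0.7, 0.15) for _ in range(500)]
negatives = [random.gauss(0.3, 0.15) for _ in range(500)]

def errors_at(threshold):
    """Count both error types if we flag everything at or above the threshold."""
    fn = sum(1 for s in positives if s < threshold)   # missed positives
    fp = sum(1 for s in negatives if s >= threshold)  # wrongly flagged negatives
    return fp, fn

for t in (0.3, 0.5, 0.7):
    fp, fn = errors_at(t)
    print(f"threshold={t}: false positives={fp}, false negatives={fn}")
```

As the threshold rises, false positives fall and false negatives climb; no threshold drives both to zero while the distributions overlap. The interesting question isn't which point on that curve is "best" mathematically, but which errors you can live with.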

In the fraud case, the conversation either becomes "how much money are we willing to lose to avoid inconveniencing and losing customers" or "how much are we willing to inconvenience customers in order to avoid losing money?" (notice that these are different default stances). The contrast between these statements reveals which values we prioritize.
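One way to make the fraud conversation concrete is to put a dollar figure on each error type and pick the threshold that minimizes expected cost. The costs and score distributions below are invented for illustration; in practice the hard part is agreeing on those numbers, because they encode exactly the values discussed above.

```python
import random

random.seed(2)

# Hypothetical per-error costs -- illustrative numbers, not from real data.
COST_FP = 25     # an inconvenienced (and possibly lost) customer
COST_FN = 1000   # average loss from a missed fraudulent transaction

# Toy fraud scores: fraud tends to score higher than legitimate traffic.
fraud = [random.gauss(0.7, 0.15) for _ in range(100)]
legit = [random.gauss(0.3, 0.15) for _ in range(900)]

def cost_at(threshold):
    """Total expected cost if we flag every transaction scoring >= threshold."""
    fp = sum(1 for s in legit if s >= threshold)
    fn = sum(1 for s in fraud if s < threshold)
    return fp * COST_FP + fn * COST_FN

# Sweep candidate thresholds and pick the cheapest one.
best_threshold = min((t / 100 for t in range(5, 100, 5)), key=cost_at)
print(best_threshold, cost_at(best_threshold))
```

Because a missed fraud here costs 40 times as much as a wrongly flagged customer, the cheapest threshold sits low: the model tolerates plenty of false positives to avoid false negatives. Flip the cost ratio and the "optimal" threshold flips with it, which is the point: the math only formalizes a value judgment someone still has to make.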

In the COVID case, we're asking "how much are we willing to inconvenience people in order to save lives?" or "how many lives are we willing to sacrifice in order to avoid inconveniencing people?" Finally, in the justice system case it's "how many innocent people are we willing to sacrifice to catch the guilty?" compared to "how many guilty people are we willing to let free to spare the innocent?" These are moral and ethical questions that we as data scientists cannot avoid.

We are responsible for the models that we build, and for the assumptions and choices that we make. Understanding how you're framing these questions reveals your values and baseline assumptions. Every time you run a test, you're making an ethical decision; keep that in mind and make an intentional choice.

Written by Max Boyd, Data science lead