Models are Imperfect, but Some are Useful

All Models Are Wrong (But Some Are Useful)


As I write this, we’re all hunkered down “flattening the curve” of the COVID-19 global pandemic.

The good news: Lots of Very Smart People have created many predictive analytics models to help us manage the pandemic. We’re all being exposed to these predictive analytics far more often than most of us have been in our pre-COVID-19 lives.

The bad news: Many of these models use different inputs and different heuristics, and they come to different conclusions, some slightly and some significantly.

The differences in these models brought to mind the aphorism that inspired the title, generally attributed to the statistician George Box. There’s lots of wisdom there, so let’s unpack it.

What do we mean by “wrong”?

Analytics models are “wrong” in the same way that maps are “wrong”—that is to say, they’re necessarily simplified and idealized.

“I have a map of the United States... Actual size. It says, 'Scale: 1 mile = 1 mile.' I spent last summer folding it. I hardly ever unroll it. People ask me where I live, and I say, ‘E6’.” ― Steven Wright

Everyone understands that “the map is not the territory”. Similarly, the analytic is not the data—an analytic model requires the best-available data (just as maps do), but the path from data to analytic is a lossy process.
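To make “lossy” concrete, here is a minimal sketch using made-up numbers (the two regions and their counts are hypothetical, not real data). Two very different datasets collapse into the same summary number, and nothing in that number lets you reconstruct the data it came from.

```python
from statistics import mean, pstdev

# Hypothetical daily counts for two imaginary regions (illustrative only).
REGION_A = [10, 10, 10, 10]
REGION_B = [0, 0, 20, 20]


def summarize(values):
    """Collapse raw values into one number, the way a simple analytic might."""
    return mean(values)


def spread(values):
    """Population standard deviation: one of the details the mean throws away."""
    return pstdev(values)


if __name__ == "__main__":
    for name, data in (("A", REGION_A), ("B", REGION_B)):
        print(f"Region {name}: mean={summarize(data)}, spread={spread(data)}")
```

Both regions report the same mean, yet one is perfectly steady and the other swings between extremes. The summary is useful, but it is not the data.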

Models are also like maps in that there are many types. Different types of analytic models can be applied to the same source data, and each may offer different kinds of insight, helping to uncover a larger “truth” that can be used to help improve the chance of reaching desired outcomes.

Wrong can be better

It might be tempting to think of model simplification and idealization as a shortcut. However, sometimes it’s necessary for usability and decision-making.

Subway maps are a great example of this. They’re often radically different from the actual geometry of the system, as shown in this animated morph between (1) the map of the Paris Metro subway system and (2) the real-world geometry of the system.

Importantly, the approximation of the subway map adds clarity and reveals the internal logic of the system in a way that the geometric map does not.

All models are approximations

To get from data to an analytic, models must make assumptions. This means that all analytics, to a greater or lesser degree, are approximations.


Explicit assumptions

Some assumptions are explicit, meaning that a human has made decisions in the process of creating an analytic.

As a simple example, consider the “views” count of a video advertisement—a seemingly simple metric used as an input for lots of analytic models. Do we count it as a view only when the entire ad plays? Or must it play for a minimum time, or for a minimum percentage of the ad duration? And during that time, must every pixel of the video advertisement be in view 100% of the time?

Interestingly, even in the case of this incredibly simple metric, there is no universal standard. YouTube counts an ad as “viewed” if 30 seconds of an un-skippable ad plays, while Facebook happily counts three seconds of playback as a view. LinkedIn lowers that to two seconds, and only requires 50% of the video to be in view.
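A small sketch shows how much an explicit assumption like this can move a number. The playback records below are invented, and the three rules are simplified stand-ins loosely modeled on the thresholds mentioned above (they are not any platform’s actual counting logic). The names `Playback`, `ViewRule`, and `view_counts` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Playback:
    seconds_played: float
    fraction_in_view: float  # share of the video's pixels on screen, 0.0 to 1.0


@dataclass(frozen=True)
class ViewRule:
    name: str
    min_seconds: float
    min_fraction_in_view: float = 0.0

    def counts(self, p: Playback) -> bool:
        """Return True if this playback qualifies as a 'view' under the rule."""
        return (p.seconds_played >= self.min_seconds
                and p.fraction_in_view >= self.min_fraction_in_view)


# Simplified, assumed rules; real platform definitions have more conditions.
RULES = [
    ViewRule("30-second rule", min_seconds=30),
    ViewRule("3-second rule", min_seconds=3),
    ViewRule("2-second, half-in-view rule", min_seconds=2,
             min_fraction_in_view=0.5),
]

# Invented playback records: the same raw data is fed to every rule.
PLAYBACKS = [
    Playback(45.0, 1.0),
    Playback(12.0, 0.8),
    Playback(3.5, 0.4),
    Playback(2.2, 0.6),
    Playback(2.5, 0.9),
    Playback(1.0, 1.0),
]


def view_counts(playbacks=PLAYBACKS, rules=RULES):
    """Count qualifying views per rule for the same set of playbacks."""
    return {rule.name: sum(rule.counts(p) for p in playbacks) for rule in rules}


if __name__ == "__main__":
    for name, n in view_counts().items():
        print(f"{name}: {n} views")
```

With these six playbacks, the three rules report 1, 3, and 4 views respectively. Nothing about the audience changed; only the definition did.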

Implicit assumptions

Other assumptions are implicit, meaning that they haven’t been expressed and may not even be known. They may be a side-effect of the algorithm used, or even of the data used by an AI algorithm.

One critical aspect of data science is attempting to understand the implicit assumptions that the data, the process, and the algorithms may be making. This is often hard, since it can be difficult (and sometimes impossible) to determine how an AI algorithm arrived at a result.

For example, an AI algorithm may be trained on decades of data. If that data has bias represented in it, the AI algorithms trained on it will also be biased. Famously, in a 2018 test by the ACLU, the face recognition system that Amazon sells to police departments falsely matched 28 members of Congress with mugshots, and a disproportionate share of those false matches were Congresspeople of color.

The false authority of exact numbers

According to a 2015 study of mergers and acquisitions, investors who offer “precise” bids for company shares yield better market outcomes than those who provide round-numbered bids.

Models generally result in exact numbers instead of rounded ones. The problem with that is that our pattern-seeking brains interpret those exact numbers as “more authoritative” than rounded ones, even though both are estimates.

So, beware exact numbers. As a reader, remind yourself that any numbers representing things that haven’t happened yet are estimates. As a model creator, help people who will consume the output of your analytic model understand that although a model’s output may appear to be precise, it’s a guess which is sure to be wrong (but hopefully right enough to be useful).
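One way to present a forecast without borrowing false authority is to round it and show a range. The sketch below is only an illustration: the three estimates are made-up outputs from imaginary models, and using the spread between them as a “range” is a crude assumption, not a proper statistical interval. The helper names `round_sig` and `honest_summary` are hypothetical.

```python
import math

# Made-up outputs from three imaginary models forecasting the same quantity.
ESTIMATES = [48213.7, 52960.2, 61544.9]


def round_sig(x: float, sig: int = 2) -> float:
    """Round x to `sig` significant figures."""
    if x == 0:
        return 0.0
    digits = sig - 1 - math.floor(math.log10(abs(x)))
    return round(x, digits)


def honest_summary(estimates, sig: int = 2) -> str:
    """Describe the estimates as a rounded range instead of one exact number."""
    low, high = min(estimates), max(estimates)
    return f"roughly {round_sig(low, sig):,.0f} to {round_sig(high, sig):,.0f}"


if __name__ == "__main__":
    average = sum(ESTIMATES) / len(ESTIMATES)
    print(f"Falsely precise: {average:.1f}")
    print(f"More honest:     {honest_summary(ESTIMATES)}")
```

The second line carries less apparent precision than the first, and that is the point: it signals to the reader that the number is an estimate.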

Assumptions and “black swan” events

A black swan event is a difficult-to-predict event after which “normal” is no longer normal. As I write this, a recent example is the ongoing 2019–20 coronavirus pandemic.

The societal changes we’ve made as a result of COVID-19 were unimaginable by most just a few months earlier. Although larger companies (Teradata included) had pandemic playbooks, the pandemic has been a litmus test for the explicit and implicit assumptions baked into analytic models.

Many models continued to “just work”, through foresight and (sometimes) luck. Many simply broke, resulting in some of the temporary operational chaos we’ve seen in early 2020.

Models are useful

Yes, they’re imperfect. Models are approximations and depend on assumptions, implicit and explicit. It’s important to never forget that all models are “wrong”, and that this is not only okay, but desirable.

And yet, analytic models are incredibly useful and important, and a primary tool for getting from “data” to “insight”. Analytic models are how our customers extract value from a truly incredible amount of data, turning that data into actionable business practices to get the outcomes they want.

(Author):
Charles Wiltgen

As part of Teradata’s Technology & Innovation Office's Communications team, Charles contributes to technical communications with a focus on IoT/IIoT/AoT. He brings to Teradata 20+ years’ industry experience building and marketing products, technologies, and tech ecosystems for consumers and developers. At Apple, he enabled developers and high-value media partners to blaze the digital media trail with professional and consumer digital media products. At Qualcomm and Kyocera, Charles helped pioneer mobile media services and apps. At Marvell, he helped unlock a new generation of Internet of Things products with an open source embedded device platform. He created and hosts Teradata's "Datacast" podcast, available in your favorite podcast app.
