More and more, I’ve been running into the idea that “just train yourself a blackbox model with AI if you want prediction”.

I’ll refer to the “blackbox” as an “opaquebox”1 and dive into why this modeling strategy is not useful for real world inference.

Real World Inference

Real World Inference (RWI) is typically what we all care about. In medicine, it’s definitely the thing we’re most interested in - even though scientific studies rarely give it to us.

If inference is defined as “determining whether an independent variable $X$ causally affects a dependent variable $Y$”, then the way we typically do science is through an experiment where we keep all other variables fixed, or somehow accounted for2. We’ll refer to these “other variables” as unknowns 3.

RWI is focused on answering that same core question, except without keeping all other variables fixed. Ideally, we don’t even really need to measure all the other variables - we’d find a clever way to account for them regardless of what they are 4.

Opaque Box

The strategy of “throwing all of your observations into an AI algorithm and letting it predict” boils down to estimating a map between your variables of interest. Sometimes that’s just $\{X, Y\}$, but sometimes it’s a whole bunch of other variables too 5.

But when you dump all those other variables in, you’re learning a mapping given those observations. That, alone, doesn’t magically tell you how your variables relate to each other when those other variables are different from anything you observed, or even outside the window your observations lived in 6.
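
As a minimal, self-contained sketch of that point (the toy data-generating process $Y = \alpha \cdot w \cdot X$, the numbers, and the choice of a gradient-boosted regressor are all mine, purely for illustration):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

def simulate(n, w_low, w_high, alpha=2.0):
    """Toy world: Y = alpha * w * X + noise, with w a third observed variable."""
    x = rng.uniform(0, 10, n)
    w = rng.uniform(w_low, w_high, n)
    y = alpha * w * x + rng.normal(0, 0.5, n)
    return np.column_stack([x, w]), y

# Training data: w was only ever observed in a narrow band around 1.
X_train, y_train = simulate(5_000, w_low=0.8, w_high=1.2)
model = GradientBoostingRegressor().fit(X_train, y_train)

# Same regime as training: the opaque box looks great.
X_in, y_in = simulate(1_000, w_low=0.8, w_high=1.2)
print("R^2 inside the observed window: ", model.score(X_in, y_in))

# w has drifted outside anything we observed: the very same learned
# mapping now gives badly biased answers, with no warning from the model.
X_out, y_out = simulate(1_000, w_low=2.8, w_high=3.2)
print("R^2 outside the observed window:", model.score(X_out, y_out))
```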

Unknown Unknowns

Even if you somehow account for the above 7, we’re still left with the inconvenient truth: there are millions of unknown unknowns. That is, there are all sorts of variables that likely interact with $\{X, Y\}$ that we didn’t measure, or that we’re not even aware of!

Each of those variables potentially affects how $Y \sim \alpha X$ behaves - think $Y \sim \alpha \cdot w \cdot X$ or $Y \sim \alpha \cdot (w - w_c) \cdot X$, where $w$ is the unknown unknown in our inference.
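
To put a number on that, here’s a tiny sketch (again my own toy numbers, assuming the simplest multiplicative case $Y = \alpha \cdot w \cdot X$ with $\alpha = 2$): the plain $Y \sim X$ fit quietly returns $\alpha$ times whatever $w$ happened to be in your sample.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 2.0
x = rng.uniform(0, 10, 5_000)

# w is the unknown unknown: it shapes Y but never enters the model.
for w_typical in (0.5, 1.0, 3.0):
    w = rng.normal(w_typical, 0.1, x.size)
    y = alpha * w * x + rng.normal(0, 0.5, x.size)
    slope = np.polyfit(x, y, 1)[0]          # plain Y ~ X fit, w ignored
    print(f"w centered near {w_typical}: estimated 'effect of X' = {slope:.2f}")

# The recovered "effect of X" is really alpha * (typical w in this sample),
# so it travels poorly to any setting where w is different.
```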

So the opaquebox model you learned is only really reliable in a narrow range of values for your unknowns (known or unknown). How, exactly, did you learn anything about your IV ~ DV relationship that is robust enough for RWI?


  1. There are some downstream analogies where I use color, and ‘opacity’ is more congruent with the concept here. ↩︎

  2. Fixing a variable happens in the empirical space - the real world. Accounting for a variable happens in the analytical space - the pen-and-paper work we do before the compute step. ↩︎

  3. There are two types of unknowns - known unknowns and unknown unknowns. ↩︎

  4. This is where geometry comes in, and concepts like invariances and symmetries become powerful. If we understand certain properties of the equations underlying our observations, then we can do a clever “division” and remove the effects of a nuisance variable without ever having to measure it (there’s a small sketch of this after these notes). This cannot be done in the idealized NHST context, because we hyperfocus on just two parts while putting a massive inferential burden on the empirical side. ↩︎

  5. This is where architecture becomes crucial. We get nothing for free, and DL itself comes with architectural constraints that affect what family of functions it is able to approximate. That is, when you also dump in all the other variables, what interactions amongst them can you actually express? That’s not a trivial question (a toy illustration also follows these notes)… ↩︎

  6. The term “convex hull” is useful here, but I’ll avoid it. ↩︎

  7. AI can actually be very useful for accounting for empirical uncertainty. We can indeed come in and learn some higher-order interactions between all observables. The issue is that AI is not magic - it can’t tell us how unknown unknowns insert themselves into the mix. That is where, counterintuitively enough, I think science as a social effort comes in and helps. Much more on that later… ↩︎
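
On footnote 4’s “clever division”: this is not necessarily the geometric machinery the footnote has in mind, but here is one concrete, hedged instance of the idea under a purely multiplicative toy model (all names and numbers are mine): a within-unit ratio cancels an unmeasured per-unit factor $w_i$ without it ever being measured.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000

# Toy multiplicative model: every unit i carries an unmeasured factor w_i.
alpha_control, alpha_treated = 2.0, 3.0
w = rng.lognormal(0.0, 1.0, n)      # never observed, varies wildly
x = rng.uniform(1, 10, n)

y_control = alpha_control * w * x * rng.lognormal(0, 0.05, n)
y_treated = alpha_treated * w * x * rng.lognormal(0, 0.05, n)

# Raw within-unit differences still carry w, so they are all over the place...
print("std of differences:", np.std(y_treated - y_control))

# ...but the within-unit ratio cancels w (and x) by construction, recovering
# alpha_treated / alpha_control without w ever appearing in the analysis.
print("median of ratios:  ", np.median(y_treated / y_control))   # ~1.5
```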
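
And on footnote 5’s question about which interactions an architecture can express: a deliberately crude stand-in (linear regression in place of a deep net, on my own toy data) still makes the point - an additive-in-$(X, w)$ hypothesis class cannot fully represent a multiplicative $X \cdot w$ interaction no matter how much data you feed it, while the same estimator with the interaction term built into its feature set can.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 5_000)
w = rng.uniform(0, 10, 5_000)
y = 2.0 * w * x + rng.normal(0, 0.5, x.size)   # purely multiplicative truth

# Additive architecture: features (x, w) only - the x*w interaction is
# outside the family of functions this model can represent.
additive = LinearRegression().fit(np.column_stack([x, w]), y)
print("additive-only R^2:   ", additive.score(np.column_stack([x, w]), y))

# Same estimator, but the feature set now includes the interaction term.
features = np.column_stack([x, w, x * w])
interacting = LinearRegression().fit(features, y)
print("with x*w feature R^2:", interacting.score(features, y))
```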