Meme culture is a many-splendored thing, but the memes that spoke to them, I learned, were the despairing “failson” memes that underline the futility of our efforts towards things like mastery, respectability, etc.
hello internet i would like you to all know the extremely important information that i am VERY CUTE and also i cleaned the living room when my wife was being sad and stuff
Tldr: there are circumstances (which might only occur with infinitesimal probability, which would be a relief) under which a perfect Bayesian reasoner with an accurate model and reasonable priors – that is to say, somebody doing everything right – will become more and more convinced of a very wrong conclusion, approaching certainty as they gather more data.
Thanks for this post, it helped me understand an interesting-seeming paper that I’ve also found tough to read.
Digression
Freedman and Diaconis published a whole bunch of Bayesian consistency counterexamples like this over the course of their careers. I’m honestly not sure whether any of them have clear practical significance, although I think they have theoretical significance by showing that Bayesian inference is harder to write down as a complete and satisfactory piece of mathematics than some might think.
Specifically, I get a “Counterexamples in Analysis” flavor from them (for one thing, they are literally counterexamples in analysis). They are symptoms of the fact that the natural mathematical setting for probability is a setting with a lot of counter-intuitive pathologies. So, it shouldn’t be surprising that these examples exist: if they didn’t, then the formalization of Bayesian inference would have gone unusually smoothly.
End digression
Here are some thoughts about this example specifically.
Null sets
It’s crucial that the true parameter be exactly 0.25, not just close to 0.25. Otherwise the inconsistency would violate Doob’s consistency theorem, which says the Bayesian is consistent except on a set of prior measure zero. The example can work the way it does because {θ: θ=0.25}, like any singleton set, is a set of measure zero (a null set) in this prior.
IMO, the intuition that the example is troubling actually conflicts with the prior in the example. The prior makes {θ: θ=0.25} a null set, which means it views things that happen in that set and only there as negligible. For example, the behavior there won’t influence any expectation values, so it won’t influence any decisions made by maximizing expected utility over the posterior.
The prior is saying we can “write off” arbitrary pathologies happening only at this point (or only happening at any given point). If we don’t think the exact value θ=0.25 can be written off like this, we should put a point mass there in our prior. To put it another way, while it’s theoretically interesting to explore what can go wrong for a Bayesian on one of their null sets, if you think it’s important what happens on the null sets then you are effectively saying they aren’t null sets (in your opinion). The Bayesian who does view them as null sets actually doesn’t mind the pathologies, and behaves consistently given that.
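The “put a point mass there” fix can be sketched numerically. The model below is a hypothetical stand-in, not the paper’s construction: a coin with bias θ, true θ = 0.25 exactly, and a prior that mixes a point mass at 0.25 with a uniform density. The mixture weight and sample sizes are arbitrary illustration choices.

```python
# Toy illustration (not the paper's exact model): under a purely continuous
# prior, {theta = 0.25} is a null set; mixing in a point mass there lets the
# posterior actually concentrate on the exact value when the data support it.
from math import comb

def posterior_point_mass(k, n, w=0.5):
    """Posterior weight on the hypothesis theta = 0.25 exactly, given k heads
    in n flips, when the prior is w * delta_{0.25} + (1 - w) * Uniform[0, 1]."""
    like_point = comb(n, k) * 0.25**k * 0.75**(n - k)
    # Marginal likelihood under Uniform[0,1]:
    # integral of C(n,k) * theta^k * (1-theta)^(n-k) d(theta) = 1 / (n + 1),
    # a standard Beta-integral identity.
    like_uniform = 1.0 / (n + 1)
    return w * like_point / (w * like_point + (1 - w) * like_uniform)

print(posterior_point_mass(250, 1000))  # data near 0.25: mass piles onto the point
print(posterior_point_mass(400, 1000))  # data far from 0.25: mass drains away
```

With a purely continuous prior (w = 0), the point hypothesis gets posterior weight zero no matter what the data say, which is exactly the “write-off” described above.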
Could something go wrong in practice?
Now on to something a bit more interesting. At the end, you write:
But… just because this effect can’t mislead you literally forever doesn’t mean it can’t mislead you for a very long time.
That is: if we look at some non-null set like {θ : |θ − 0.25| < ε}, then yeah, for (prior-)almost all θ in the set, we will eventually converge. But as we make ε small, the convergence will take longer and longer as we are “fooled” by more large observations. Is this bad?
I don’t think so. One way to describe the situation is as follows. Let E_n be the event (in observation space) that “an observation demonstrates the threshold is ≥ n”. Then we’ve defined a sequence of events {E_n} with these properties:
(i) For each event E_n, there are “two fundamentally different ways” the event could happen, corresponding to the θ~0.25 and θ~0.75 regions. We have two “types” of hypotheses: I’ll call these hypotheses of “the first class” (θ~0.25) and “the second class” (θ~0.75).
(ii) For any fixed n, both of the ways for E_n to happen have non-zero prior mass.
(iii) For large n, the prior mass of the first way E_n could happen (θ~0.25) is small relative to the prior mass of the second way (θ~0.75). As n goes to infinity, this ratio goes to zero.
Now, for any specific value of n, these don’t seem problematic at all. We have two classes of hypotheses, both capable of explaining events of type E_n. But as we follow the sequence E_n, letting n grow large, we’re considering types of observations that can only be explained by more and more (prior-)unlikely variants of the first hypothesis class.
It doesn’t seem bad at all that these observations push us toward the second hypothesis class. The observations can be explained two ways: either θ~0.25 and θ is very closely fine-tuned (where the extremity of the “very” grows with n), or θ~0.75 and more generic. All else being equal, this really does weigh toward θ~0.75.
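Property (iii) can be made concrete with a toy assumption (mine, not the paper’s): suppose the first-class θ values that can produce E_n form a window around 0.25 whose width shrinks like 2^−n, while the second-class values form a fixed window of width 0.1 around 0.75, all under a uniform prior.

```python
# Hedged toy version of (i)-(iii). The 2**-n shrinkage and the 0.1 width are
# illustrative assumptions; under a uniform prior, prior mass is just width.
def prior_mass_ratio(n, second_class_width=0.1):
    first_class_width = 2.0 ** -n  # fine-tuned sliver around theta ~ 0.25
    return first_class_width / second_class_width

for n in (1, 5, 10, 20):
    print(n, prior_mass_ratio(n))  # ratio shrinks toward 0 as n grows
```

The ratio going to zero is property (iii): explaining E_n via the first class requires an ever more finely tuned (hence ever less prior-probable) θ.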
So there’s nothing wrong with the updates on any specific E_n. What still feels worrying, if anything does, is something about the limit in (iii).
After all, for every n, there is a positive-prior-mass set of hypotheses in the “first class” that would yield E_n if actually true. Yet as n grows large, we find E_n to be more and more overwhelming evidence against the first class in favor of the second. Isn’t that weird?
Actually, it’s completely normal. Again, we must take the prior seriously; otherwise we’re only quibbling with the prior, not with “Bayes” itself. (Or perhaps we are pointing out that Bayes can be tricky in practice, but not undermining it in theory.)
So: it is true that for any n, the event E_n could occur due to either a first-class or a second-class situation. But for very large n, we should be very surprised to see a first-class hypothesis causing E_n: the stars have to really align for that to happen.
As we follow E_n into the limit, the cases where the truth has θ~0.25 get more and more inconvenient for the Bayesian. But they also get more and more improbable, in terms of prior mass. That’s why the Bayesian updates away from θ~0.25: as n grows large, an increasingly (prior-)unlikely coincidence is necessary to preserve the belief that we’re near θ=0.25 and not near θ=0.75. So, yes, if a very unlikely situation occurs and mostly resembles some very likely situation, the Bayesian is going to have a bad time, but they’re having a bad time because they rationally conclude they’re in the likely situation and just happen to be wrong by (increasingly unlikely) construction.
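The update in the paragraph above is just an odds calculation, sketched here with hypothetical numbers (not the paper’s model): give both classes equal prior mass, assume a first-class θ only produces E_n with probability falling like 2^−n (the fine-tuning), and a second-class θ produces it with a fixed probability of 0.9.

```python
# Posterior odds (first class : second class) after observing E_n, under the
# toy assumptions above. All three numbers are illustrative, not from the paper.
def posterior_odds_first_vs_second(n, prior_odds=1.0):
    p_En_given_first = 2.0 ** -n  # "the stars have to align" for the first class
    p_En_given_second = 0.9       # generic for the second class
    return prior_odds * p_En_given_first / p_En_given_second

for n in (1, 5, 10, 20):
    print(n, posterior_odds_first_vs_second(n))
```

The odds collapse toward zero as n grows: the Bayesian isn’t misbehaving, they’re correctly pricing in how much of a coincidence a first-class explanation of E_n would have to be.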
That’s not to say that this was immediately obvious to me, and I think it’s a useful example of how a prior can imply things you don’t realize it implies. This behavior is rational given a reasonable-looking continuous prior over values of θ. If there’s something weird going on, it’s possibly that you don’t think the “reasonable-looking” prior is actually reasonable, once you consider everything it implies. Or, on the other hand, that you do find it reasonable upon reflection but don’t find all of its consequences immediately intuitive, even though it (or things like it) is supposed to capture your real state of prior knowledge. But now I’m slipping into an argument I’m much less confident in, so I should stop here.
I was having fun changing the editor font in Atom to various obviously inappropriate choices, but then I hit upon this font “Trattatello” and … help, this is so pretty???