bayes: a kinda-sorta masterpost
I have written many many words about “Bayesianism” in this space over the years, but the closest thing to a comprehensive “my position on Bayes” post to date is this one from three years ago, which I wrote when I was much newer to this stuff. People sometimes link that post or ask me about it, which almost never happens with my other Bayes posts. So I figure I should write a more up-to-date “position post.”
I will try to make this at least kind of comprehensive, but I will omit many details and sometimes state conclusions without the corresponding arguments. Feel free to ask me if you want to hear more about something.
I ended up including a whole lot of preparatory exposition here – the main critiques start in section 6, although there are various critical remarks earlier.
10. It’s just regularization, dude
(N.B. the below is hand-wavey and not quite formally correct, I just want to get the intuition across)
My favorite way of thinking about statistics is the one they teach you in machine learning.
You’ve got data. You’ve got an “algorithm,” which takes in data on one end, and spits out a model on the other. You want your algorithm to spit out a model that can predict new data, data you didn’t put in.
“Predicting new data well” can be formally decomposed into “bias” and “variance” (plus irreducible noise, which no algorithm can do anything about). If your algorithm is biased, that means it tends to make models that do a certain thing no matter what the data does. Like, if your algorithm is linear regression, it’ll make a model that’s linear, whether the data is linear or not. It has a bias.
“Variance” is the sensitivity of the model to fluctuations in the data. Any data set is gonna have some noise along with the signal. If your algorithm can come up with really complicated models, then it can fit whatever weird nonlinear things the signal is doing (low bias), but also will tend to misperceive the noise as signal. So you’ll get a model exquisitely well-fitted to the subtle undulations of your dataset (which were due to random noise) and it’ll suck at prediction.
There is a famous “tradeoff” between bias and variance, because the more complicated you let your models get, the more freedom they have to fit the noise. But reality is complicated, so you don’t want to just restrict yourself to something super simple like linear models. What do you do?
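Here’s a toy numerical sketch of the tradeoff (my numbers, nothing from any particular source): the same noisy data, fit by a “biased” straight line and by a very flexible polynomial, then scored on fresh data from the same source.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a smooth nonlinear signal plus noise.
def make_data(n=20):
    x = np.sort(rng.uniform(-1, 1, size=n))
    y = np.sin(3 * x) + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = make_data()
x_test, y_test = make_data()

def train_and_test_mse(degree):
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train_mse, test_mse

line = train_and_test_mse(1)     # high bias: a line can't follow sin(3x)
wiggly = train_and_test_mse(15)  # low bias, high variance: 16 params, 20 points

# The flexible fit always does at least as well on the training set;
# on fresh data it will typically do worse, having chased the noise.
print("train MSE (line, wiggly):", line[0], wiggly[0])
print("test MSE (line, wiggly):", line[1], wiggly[1])
```

The degree-15 fit is guaranteed to win on training error (it can do everything the line can do, and more); the interesting question is what happens on the fresh draw.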
A typical answer is “regularization,” which starts out with an algorithm that can produce really complex models, and then adds in a penalty for complexity alongside the usual penalty for bad data fits. So your algorithm “spends points” like an RPG character: if adding complexity helps fit the data, it can afford to spend some complexity points on it, but otherwise it’ll default to the less complex one.
This point has been made by many people, but Shalizi made it well in the very same post I linked earlier: Bayesian conditionalization is formally identical to a regularized version of maximum likelihood inference, where the prior is the regularizing part. That is, rather than just choosing the hypothesis that best fits the data, full stop, you mix together “how well does this fit the data” with “how much did I believe this before.”
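To make the formal identity concrete, here’s a one-coefficient check (my own toy construction): penalized maximum likelihood with penalty strength sigma²/tau² lands on the same number as the mode of the Bayesian posterior under a Gaussian prior with standard deviation tau.

```python
import numpy as np

rng = np.random.default_rng(2)

# One-coefficient regression: y = w*x + Gaussian noise (sd sigma),
# with a Gaussian prior on w (sd tau).
x = rng.normal(size=50)
y = 1.7 * x + rng.normal(scale=0.8, size=50)
sigma, tau = 0.8, 2.0

# Penalized maximum likelihood, with lam = sigma^2 / tau^2 ...
lam = sigma ** 2 / tau ** 2
w_penalized = (x @ y) / (x @ x + lam)

# ... versus the posterior mode, found by brute-force grid search over the
# log posterior = log likelihood + log prior (constants dropped).
grid = np.linspace(0.0, 3.0, 30001)
log_post = (-(x @ x) * grid ** 2 + 2 * (x @ y) * grid) / (2 * sigma ** 2) \
           - grid ** 2 / (2 * tau ** 2)
w_map = grid[np.argmax(log_post)]

print("difference:", abs(w_penalized - w_map))  # tiny: same estimate, two stories
```

Same arithmetic either way — the question the rest of this post turns on is which story about that arithmetic makes sense.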
But hardly anyone has strong beliefs about models before they even see the data. Like, before I show you the data, what is your “degree of belief” that a regression coefficient will be between 1 and 1.5? What does that even mean?
Eliezer Yudkowsky, strong Bayesian extraordinaire, spins this correspondence as a win for Bayesianism:
So you want to use a linear regression, instead of doing Bayesian updates? But look to the underlying structure of the linear regression, and you see that it corresponds to picking the best point estimate given a Gaussian likelihood function and a uniform prior over the parameters.
You want to use a regularized linear regression, because that works better in practice? Well, that corresponds (says the Bayesian) to having a Gaussian prior over the weights.
But think about it. In the bias/variance picture, L2 regularization (what he’s referring to) is used because it penalizes variance; we can figure out the right strength of regularization (i.e. the variance of the Gaussian prior) by seeing what works best in practice. This is a concrete, grounded, practical story that actually explains why we are doing the thing. In the Bayesian story, we supposedly have beliefs about our regression coefficients which are represented by a Gaussian. What sort of person thinks “oh yeah, my beliefs about these coefficients correspond to a Gaussian with variance 2.5”? And what if I do cross-validation, like I always do, and find that variance 200 works better for the problem? Was the other person _wrong?_ But how could they have known?
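The cross-validation story in code (a hypothetical setup, with a single held-out split standing in for full cross-validation): nobody asks about my beliefs — I try a few penalty strengths and keep whichever predicts held-out data best.

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up regression problem with 20 coefficients.
n, d = 60, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + rng.normal(size=n)

X_tr, y_tr, X_val, y_val = X[:40], y[:40], X[40:], y[40:]

def ridge(X, y, lam):
    # Closed-form solution of ||Xw - y||^2 + lam * ||w||^2.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Sweep penalty strengths; score each on the held-out data.
lams = [0.01, 0.1, 1.0, 10.0, 100.0]
val_mse = {lam: np.mean((X_val @ ridge(X_tr, y_tr, lam) - y_val) ** 2)
           for lam in lams}
best_lam = min(val_mse, key=val_mse.get)
print("selected penalty strength:", best_lam)
```

On the Bayesian reading, this procedure is quietly overwriting the “prior” with whatever the held-out data prefers — which is exactly the tension the paragraph above is pointing at.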
It gets worse. Sometimes you don’t do L2 regularization. Sometimes you do L1 regularization, because (talking in real-world terms) you want sparse coefficients. In Bayes land, this
can be interpreted as a Bayesian posterior mode estimate when the regression parameters have independent Laplace (i.e., double-exponential) priors
Even ignoring the mode vs. mean issue, I have never met anyone who could tell whether their beliefs were normally distributed vs. Laplace distributed. Have you?
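And the real-world reason for L1 is visible in a few lines of code. This is my own toy sketch (a soft-thresholding coordinate-descent lasso, not anything from the quoted source): the L1 penalty drives most coefficients to exactly zero, which L2 never does.

```python
import numpy as np

rng = np.random.default_rng(4)

# Sparse ground truth: only 2 of 10 coefficients are actually nonzero.
n, d = 100, 10
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:2] = [3.0, -2.0]
y = X @ w_true + rng.normal(scale=0.1, size=n)

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, sweeps=200):
    # Coordinate descent for (1/2)||Xw - y||^2 + lam * ||w||_1.
    w = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(sweeps):
        for j in range(X.shape[1]):
            r_j = y - X @ w + X[:, j] * w[j]  # residual ignoring feature j
            w[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return w

w_l1 = lasso_cd(X, y, lam=20.0)
print("exact zeros among 10 coefficients:", int(np.sum(w_l1 == 0.0)))
```

You choose L1 because you want that sparsity — interpretability, feature selection, cheap deployment — not because you introspected and discovered your credences were double-exponential.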
tl;dr: Regularization is not the point of the prior. Even when we’re not regularizing, the prior is an indispensable part of useful machinery for producing “hedged” estimates, which are good in all plausible worlds.
OK, here’s the whole post.
The quoted section is about whether Bayesians can explain regularization. We know regularization helps, and we’re going to do it in any case, but Bayesians purport to explain why and when it helps. See, for example, the above @yudkowsky quote, as well as this one:
The point of Bayesianism isn’t that there’s a toolbox of known algorithms like max-entropy methods which are supposed to work for everything. The point of Bayesianism is to provide a coherent background epistemology which underlies everything; when a frequentist algorithm works, there’s supposed to be a Bayesian explanation of why it works. I have said this before many times but it seems to be a “resistant concept” which simply cannot sink in for many people.
nostalgebraist is making Yudkowsky very happy here, by arguing with Yudkowsky’s actual belief: the status of Bayesianism as a background epistemology. nostalgebraist’s point is that Bayesianism doesn’t explain why or how we regularize, and more generally that we shouldn’t try to judge inferential methods by how Bayesian they are. nostalgebraist is summarizing this as “Bayesianism is just regularization,” which is a not entirely serious inversion of a common Bayesian position, that “regularization is just Bayesian statistics.”
I disagree with nostalgebraist about all this, and I’m going to write a post about why, maybe next week. This current post, which will be quite long, is absolutely not about the issue of whether Bayesianism explains regularization. I start by describing this issue just to show that I understand the real point of the OP, and that I am being quite deliberate when I completely ignore it in the following.
What I want to focus on is nostalgebraist’s half-joking statement that Bayesian inference is just regularization. While he’s not being entirely serious, he may be partly serious, and in any case it’s what a lot of people actually believe. For example, in replies framed as defenses of the Bayesian framework, @4point2kelvin writes “You can definitely think of anything Bayesian as ‘maximum likelihood with a prior.’ But even though the prior has to be (somewhat) arbitrary when the hypothesis-space is infinite, I still think it’s useful.” Plus, once I’ve shown Bayes isn’t just regularization, then I get to say what else it is.
I’m going to start with some technicalities, focusing on the mode vs mean issue nostalgebraist alluded to. Then I’m going to show an example where Bayesian estimation improves on maximum likelihood, without any of the increase in bias that Shalizi suggests is necessary, and explain what’s going on.
Reblogging because this is good and I want to have it on my blog + remind myself to read it more closely so I can actually say something about the issues it raises
(via raginrayguns)
