bayes: a kinda-sorta masterpost
I have written many, many words about “Bayesianism” in this space over the years, but the closest thing to a comprehensive “my position on Bayes” post to date is this one from three years ago, which I wrote when I was much newer to this stuff. People sometimes link that post or ask me about it, which almost never happens with my other Bayes posts. So I figure I should write a more up-to-date “position post.”
I will try to make this at least kind of comprehensive, but I will omit many details and sometimes state conclusions without the corresponding arguments. Feel free to ask me if you want to hear more about something.
I ended up including a whole lot of preparatory exposition here – the main critiques start in section 6, although there are various critical remarks earlier.
I like this post. I myself would say that I’m only a “weak Bayesian”, and that while I do solidly believe in various “Bayesian brain” theories, those theories are *muuuuuch* more philosophically pragmatist than the Strong Bayesian epistemological program.
My big question is whether anyone knows how to “replace” probability theory. What I really want is a way of predicting stuff that lets information flow top-down *and* bottom-up, allows for continuously graded inferences, and allows for arbitrarily complicated structures and connections. Most statistical and machine-learning methods, outside of those described below, *don’t* allow for that! This is why I stick by my Weak Bayesianism even when it visibly sucks.
That said, there are some formal developments Nostalgebraist has missed here.
* Nonparametrics! It’s not as if nobody has ever thought about the Problem of New Ideas before. There’s a whole subfield of Bayesian nonparametric statistics devoted to handling exactly this. The idea is that you start with a “nonparametric” prior model (a probabilistic model of an infinite-dimensional sample space). Sure, this model will assign probabilities over objects that are formally infinite, but you only ever have to actually deal with finite portions of them that talk about your finite data. Whenever new data appears to require a New Idea, though, the model will summon one up with approximately the right shape. You can Monte Carlo sample increasingly large/complex finite elements of the posterior, and you never have to hold the infinite object in your head to be doing probabilistic inference with it.
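To make the “summon a New Idea” behavior concrete, here’s a minimal sketch (plain Python, everything below is my own illustrative code with made-up names) of the Chinese Restaurant Process, the clustering structure induced by a Dirichlet process prior. A priori there are infinitely many possible clusters, but we only instantiate the finitely many the data actually touches; each new item either joins an existing cluster or opens a brand-new one.

```python
import random

def crp_partition(n, alpha, seed=0):
    """Sample a partition of n items from a Chinese Restaurant Process.

    Item i joins an existing cluster with probability proportional to
    that cluster's current size, or starts a brand-new cluster with
    probability proportional to the concentration parameter alpha.
    """
    rng = random.Random(seed)
    clusters = []     # sizes of the clusters instantiated so far
    assignments = []  # which cluster each item ended up in
    for i in range(n):
        # Unnormalized probabilities: existing cluster sizes, plus
        # alpha for the "open a new cluster" option.
        weights = clusters + [alpha]
        r = rng.uniform(0, sum(weights))
        for k, w in enumerate(weights):
            r -= w
            if r <= 0:
                break
        if k == len(clusters):
            clusters.append(1)   # a New Idea: open a fresh cluster
        else:
            clusters[k] += 1
        assignments.append(k)
    return assignments

print(crp_partition(10, alpha=1.0))
```

With small alpha you mostly reuse old clusters; with large alpha you keep spawning new ones. The full Dirichlet process also attaches parameters to each cluster, which this sketch omits.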
* Probabilistic programming! This one’s related to nonparametrics, since part of its purpose is to make nonparametrics easy to handle computationally. In a probabilistic programming language, we can perform inference (both conditionalization and marginalization) in any model whose conditional-dependence structure corresponds to some program. In practice, this means writing programs that flip coins, and then conditioning on observed flips to find the weights. It’s actually surprisingly intuitive for having so much mathematical and computational machinery behind it. It’s also Turing-universal: any distribution from which a computer can sample in finite time corresponds to some probabilistic program. So we have a model class including everything we think a physical machine can cope with!
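As an illustrative sketch of the “flip coins, then condition on observed flips” idea (this is toy code of mine, not any particular probabilistic programming language), here’s inference on a tiny generative program by brute-force rejection sampling: run the program many times, keep only the runs whose simulated flips match the data, and the surviving coin weights are samples from the posterior.

```python
import random

def posterior_coin_weight(observed, n_samples=100_000, seed=0):
    """Infer a coin's weight from observed flips by rejection sampling.

    Generative program: draw a weight uniformly from [0, 1], then flip
    a coin with that weight once per observation.  Conditioning on the
    data means keeping only runs that reproduce the observed flips.
    """
    rng = random.Random(seed)
    kept = []
    for _ in range(n_samples):
        w = rng.random()                      # prior: weight ~ Uniform(0, 1)
        flips = [rng.random() < w for _ in observed]
        if flips == observed:                 # condition on the data
            kept.append(w)
    return kept

samples = posterior_coin_weight([True, True, True, False])
# Posterior mean; the exact answer is 4/6 ~ 0.667 (a Beta(4, 2) posterior).
print(sum(samples) / len(samples))
```

Real probabilistic programming systems replace the hopelessly wasteful rejection step with Monte Carlo or variational machinery, but the semantics are the same: any program that flips coins defines a distribution you can condition on.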
* Divergences are universal performance metrics. Any predictive model, frequentist or Bayesian, can be *considered* to give an approximate posterior-predictive distribution. An information divergence (usually a Kullback-Leibler divergence) then defines a “loss function” between the empirical distribution of held-out sample data and the model’s predictive distribution. The higher the loss, the worse the predictive model, and the actual number can be (AFAIU) approximately calculated (certainly I’ve handled code that calculates approximate sample divergences). A good frequentist model will have a low divergence (loss), and a bad Bayesian model will have a high divergence (loss). This gives a good definition of a *bad* Bayesian model: one whose posterior predictive doesn’t predict well. This technique is regularly used in Bayesian statistics to evaluate and criticize models.
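As a toy illustration of divergence-as-loss (all the distributions and names below are made up for the example), here’s the K-L divergence between a held-out empirical distribution over a discrete outcome and two candidate predictive distributions, one good and one bad:

```python
import math
from collections import Counter

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as dicts."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)

def predictive_loss(held_out, predictive):
    """Divergence from the empirical distribution of held-out labels
    to the model's predictive distribution.  Lower is better."""
    n = len(held_out)
    empirical = {x: c / n for x, c in Counter(held_out).items()}
    return kl_divergence(empirical, predictive)

held_out = ["a"] * 7 + ["b"] * 3
good_model = {"a": 0.7, "b": 0.3}
bad_model = {"a": 0.3, "b": 0.7}
print(predictive_loss(held_out, good_model))  # 0.0: matches the data exactly
print(predictive_loss(held_out, bad_model))   # positive: penalized for poor fit
```

Whether the predictive distribution came from a posterior or from a frequentist fitting procedure makes no difference to the loss, which is the point.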
What’s important here is that sample spaces like, “Countable-dimensional probability distributions” (Dirichlet processes), “Uncountable-dimensional continuous functions” (Gaussian processes), and “all stochastic computer programs” seem to give us increasingly broad classes of probability models. We would like to then do the reverse of old-fashioned Bayesian statistics: instead of starting with a restricted model, we can start with a very broad model and restrict it using our domain knowledge about the problem at hand. We then plug-and-play some computational stuff to perform inference.
Of course, it doesn’t yet work well in practice, but these things are regularly used to model really complex stuff, up to and including thought. Again, those are Weak Bayesian theories, and we care more about a Monte Carlo or variational posterior with a low predictive loss than about finding God’s own posterior distribution.
Another important choice to make is indeed how you interpret probability. I’ve actually liked the more measure-y way, once it was explained to me. “Propositions” are then interpreted as subspaces of the sample space. This seems like the Right Thing: you can start with a very complex model defined by some program or some infinite object or whatever, and then treat finite events within it as logical propositions. Those propositions will obey Boolean logic, but their logical relations will come from the model, rather than the other way around. An infinite-dimensional model will then also allow for an infinite number of propositions.
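A toy sketch of this picture (my own example, nothing deep): take a small sample space, treat propositions as subsets of it, and the Boolean connectives fall out as set operations, with probabilities assigned on top by whatever model you like.

```python
# Sample space: outcomes of two coin flips.
omega = {("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")}

# "Propositions" are subsets (events) of the sample space.
first_is_heads = {w for w in omega if w[0] == "H"}
second_is_heads = {w for w in omega if w[1] == "H"}

# Boolean connectives become set operations:
conj = first_is_heads & second_is_heads   # AND = intersection
disj = first_is_heads | second_is_heads   # OR  = union
neg = omega - first_is_heads              # NOT = complement

def prob(event, model):
    """Probability of an event under a distribution over outcomes."""
    return sum(model[w] for w in event)

uniform = {w: 0.25 for w in omega}
print(prob(conj, uniform))   # 0.25
print(prob(disj, uniform))   # 0.75
```

The logical relations among the propositions (e.g. that `conj` entails `first_is_heads`) are facts about the model’s sample space, not axioms imposed from outside, which is the “model first, logic second” point above.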
I consider this a fairly good example of how sometimes you should build your philosophy *on top of* the math and science that you know can work, rather than the other way around. Philosophy is an *output* of thought, so if you want new philosophy, you need new thoughts to think, and if you want new thoughts to think, you need to get them from the world.
This is an extremely interesting response, thank you.
I was totally ignorant of Bayesian nonparametrics until now and it is the sort of thing I should (and want to) know about. Do you have any recommendations about what to read first? Seems like there are a lot of references out there.
Any links about probabilistic programming that you think are especially good + relevant would be appreciated too.
I’m not sure I agree with your paragraph about divergences (or perhaps I don’t understand it). I’m aware of the K-L divergence, and it’s true that you can get a “posterior distribution” of some kind out of any predictive model. (In classification tasks, this is straightforward because the predictions are usually probabilistic anyway; it’s a little less clear to me how this works with regression, since the point estimates we make in regression don’t attempt to match the intrinsic/noise variance in the data, which would affect the K-L divergence.)
But there’s more than one way to compare two probability distributions, and I don’t see that “K-L divergence from empirical distribution of validation set” is the one best loss function for probabilistic modeling. For one thing, we’re presumably going to want to use the joint distributions of all our variables (so that the model has to get the relation of X to Y right, not just match the overall relative counts for Y). But that’s a potentially high-dimensional distribution which we’re sparsely sampling, so the literal empirical distribution will have spurious peaks centered at each data point, and we’d need to do some density reconstruction to get something more sensible – at which point it’s not clear that we trust this reference distribution more than our model’s posterior, since both involve approximate inference from the data.
Also, I know the K-L divergence has a bunch of special properties, but I’ve always been wary when people say that it is the one correct way to compare two distributions (or that there is one correct way). To make that case, it seems like you’d need some link between the special properties and the thing you want to do. And in practice we use various loss functions (various proper scoring rules for classification, say) that aren’t (obviously?) the K-L div in disguise; is this wrong?
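(One partial answer I can sketch myself, with made-up numbers: the log score, at least, *is* the K-L div in disguise. Expected log loss decomposes as entropy plus K-L divergence, and the entropy term is the same constant for every model, so ranking models by held-out log loss ranks them exactly by K-L divergence from the data distribution. The Brier score, by contrast, corresponds to a different, squared-error-style divergence. A quick numerical check of the decomposition:)

```python
import math

def cross_entropy(p, q):
    """Expected log loss when the data follow p and the model predicts q."""
    return -sum(p[x] * math.log(q[x]) for x in p if p[x] > 0)

def entropy(p):
    return cross_entropy(p, p)

def kl(p, q):
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)

p = {"a": 0.7, "b": 0.3}   # "true" data distribution
q = {"a": 0.5, "b": 0.5}   # model's predictive distribution

# H(p, q) = H(p) + KL(p || q): log loss is KL plus a model-independent constant.
lhs = cross_entropy(p, q)
rhs = entropy(p) + kl(p, q)
print(abs(lhs - rhs) < 1e-12)  # True
```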
(via principioeternus)
