@raginrayguns I know you’ve written before about the idea of “getting Occam’s razor for free” in Bayesian stats via the use of Bayes factors, without an explicit simplicity prior
I was reading about this in the book Information Theory, Inference, and Learning Algorithms by MacKay, and something seemed very dissatisfying to me about it. The idea is that if you are comparing two models that each have some parameters, you should compute the likelihood of the data for each model by integrating over your prior for all possible values of the parameters (the “evidence,” or marginal likelihood). If a model is more complex (more parameters), then on average a smaller fraction of the volume in parameter space is going to be compatible with any given data set, because that parameter space has to spread its predictions over more possible data sets. So even if your overall prior probability of M_1 and M_2 is the same, and you have similar priors over the parameters for both, you’ll naturally penalize the more complex one.
So for instance if you have a linear model y = a*x + b (+ noise), and a quadratic model y = a*x^2 + b*x + c (+ noise), the latter can always fit a given data set a bit better if you choose the optimal parameter values. But if you integrate over a whole space of possible parameter values, and your data looks linear, the quadratic model will only fit well in a small region of parameter space near a = 0, and the integral will include the low likelihoods outside that area. This will favor the simpler linear model, even if you didn’t give it a higher overall prior probability to start with.
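To make this concrete, here’s a rough numerical sketch of that computation (all the specifics – the data, the noise level, the N(0, 2^2) priors on the coefficients – are made up for illustration). It estimates the evidence p(D | M) for each model by simple Monte Carlo: draw parameters from the prior, average the likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data that lies exactly on a line; both models still assume
# Gaussian observation noise with standard deviation sigma.
x = np.linspace(-1.0, 1.0, 10)
y = 2.0 * x + 1.0
sigma = 0.5

def log_evidence(design, n_samples=200_000):
    """Monte Carlo estimate of log p(D | M) = log of the integral of
    p(D | theta, M) p(theta | M) over theta.

    design: (n_points, n_params) matrix so that y_pred = design @ theta.
    Prior on each parameter: N(0, 2^2). The additive likelihood constant
    -n/2 * log(2*pi*sigma^2) is shared by both models and dropped.
    """
    thetas = rng.normal(0.0, 2.0, size=(n_samples, design.shape[1]))
    log_liks = -0.5 * np.sum((y - thetas @ design.T) ** 2, axis=1) / sigma**2
    m = log_liks.max()  # log-mean-exp for numerical stability
    return m + np.log(np.mean(np.exp(log_liks - m)))

linear = np.column_stack([x, np.ones_like(x)])           # y = a*x + b
quadratic = np.column_stack([x**2, x, np.ones_like(x)])  # y = a*x^2 + b*x + c

log_ev_lin = log_evidence(linear)
log_ev_quad = log_evidence(quadratic)
print(log_ev_lin, log_ev_quad)
```

On this linear-looking data the linear model comes out with the higher evidence, even though the quadratic model’s best-fit likelihood is at least as good – the integral charges the quadratic model for all the prior mass it spends on a far from 0.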
This is a way of doing the thing that I’m used to thinking of in bias/variance terms.
What seems unsatisfying is that there is this appearance of naturalness – of not having to make any arbitrary choices – but really everything depends on what you treat as “same model, different parameters” vs. “different models.” You do have an arbitrary choice, the choice you make when you break down the possibility space into a set of models, and then further into models-with-given-parameter-values.
For instance, in the above example, including the linear model as a distinct hypothesis is arguably redundant, since every model in that class also appears in the quadratic class, along the surface with a = 0. But in the quadratic class, this set (a plane in a 3D space) is going to have prior probability 0 unless you put a lump of probability mass (Dirac delta) on that plane. So the above setup, where you have equal prior probability for “linear model” and “quadratic model,” is equivalent to only having “quadratic model” … with this weird, non-obvious lump in the prior. And indeed, the latter is the correct way of seeing this prior, unless we want to allow our probability space to have two copies of the same outcome. So the idea that our prior is not privileging the linear model over the quadratic is not quite true, and the whole thing feels like sleight of hand.
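You can check this equivalence numerically. Below is a sketch (again with made-up data and priors) of the quadratic model with a spike-and-slab prior: with probability 1/2 the coefficient a is clamped to exactly 0, which is the “lump” on the a = 0 plane. Its evidence should match the equal-weight average of the two separate models’ evidences, since p(D) = (1/2) p(D | linear) + (1/2) p(D | quadratic).

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup (all numbers made up): exactly-linear data, Gaussian noise
# model, broad N(0, 2^2) priors on the coefficients.
x = np.linspace(-1.0, 1.0, 10)
y = 2.0 * x + 1.0
sigma = 0.5

def log_evidence(design, n_samples=200_000, spike_on_first=False):
    thetas = rng.normal(0.0, 2.0, size=(n_samples, design.shape[1]))
    if spike_on_first:
        # Spike-and-slab: with probability 1/2, clamp the first coefficient
        # (the quadratic term) to exactly 0 -- the lump on the a = 0 plane.
        thetas[rng.random(n_samples) < 0.5, 0] = 0.0
    log_liks = -0.5 * np.sum((y - thetas @ design.T) ** 2, axis=1) / sigma**2
    m = log_liks.max()  # log-mean-exp for numerical stability
    return m + np.log(np.mean(np.exp(log_liks - m)))

quadratic = np.column_stack([x**2, x, np.ones_like(x)])
linear = quadratic[:, 1:]  # drop the x^2 column

ev_lin = log_evidence(linear)
ev_quad = log_evidence(quadratic)
ev_spike = log_evidence(quadratic, spike_on_first=True)

# "Equal prior on two models" is the same object as one quadratic model
# with a spike-and-slab prior on a: these two numbers should agree up to
# Monte Carlo noise.
print(ev_spike, np.logaddexp(ev_lin, ev_quad) - np.log(2.0))
```

So the two-model comparison really is just one model with a deliberately lumpy prior – the “no simplicity prior needed” framing hides the lump in how the hypothesis space was carved up.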
The idea of a prior based on Kolmogorov complexity solves this issue, but at the cost of introducing something uncomputable. Maybe minimum description length also solves the issue? But MacKay says it “has no apparent advantages over the direct probabilistic approach.”
You could also avoid this problem by insisting that all your models be mutually exclusive, but this is not true for many model comparisons we would want to do in practice (like the above), and it also isn’t true for some of MacKay’s examples.
