Install Theme

the-moti:

nostalgebraist:

Thanks to “GPT-3″ i’ve been reading a bunch of ML papers again.  For some reason, this pretty good one got me thinking about an Bayesian statistics issue that strikes me as important, but which I haven’t seen discussed much.

——

Here I’m talking about “Bayesianism” primarily as the choice to use priors and posteriors over hypotheses rather than summarizing beliefs as point estimates.

To have a posterior distribution, you need to feed in a prior distribution.  It’s deceptively easy to make a prior distribution feel natural in one dimension: point to any variable whatsoever in the real world, and say:

“Are you sure about that?  Perfectly sure, down the the last micron/microsecond/whatever?  Or are you fairly agnostic between some values?  Yeah, it’s the latter.  Okay, why not average over the predictions from those, rather than selecting one in a purely arbitrary way?”

This is very convincing!

However, when you add in more variables, this story breaks down.  It’s easy enough to look at one variable and have an intuitive sense, not just that you aren’t certain about it, but what a plausible range might be.  But with N variables, a “plausible range” for their joint distribution is some complicated N-dimensional shape, expressing all their complex inter-dependencies.

For large N, this becomes difficult to think about, both:

  • combinatorially: there is an exploding number of pairwise, three-way, etc. interactions to separately check in your head or – to phrase it differently – an exploding number of volume elements where the distribution might conceivably deviate from its surrounding shape

  • intellectually: jointly specifying your intuitions over a larger number of variables means expressing more and more complete account of how everything in the world relates to everything else (according to your current beliefs) – eventually requiring the joint specification of complex world-models that meet, then exceed, the current claims of all academic disciplines

——

Rather than thinking about fully “Bayesian” and “non-Bayesian” approaches to the same N variables, it can be useful to think of a spectrum of choice to “make a variable Bayesian,” which means taking something you previously viewed as constant and assigning it a prior distribution.

In this sense, a Bayesian statistician is still keeping most variables non-Bayesian.  Even if they give distributions to their parameters, they make hold the model’s form constant.  Even if they express a prior over model forms (say a Gaussian Process) they still may hold constant various assumptions about the data-collecting process, indeed they may treat the data as “golden” and absolute.  And even if they make that Bayesian, there are still the many background assumptions needed to make modern scientific reasoning possible, few of which are jointly questioned in any one research project.

So, the choice is not really about whether to have 0 Bayesian variables or >0.  The choice is which variables to make Bayesian.  Your results are (effectively) a joint distribution over the Bayesian variables, conditional on fixed values of all the non-Bayesian variables.

We usually have strong intuitions about plausible values for individual variables, but weak or undefined ones for joint plausibility.  This is almost the definition of “variable”: we usually parameterize our descriptions in terms of the things we can most directly observe.  We have many memories of directly observing many directly-observable-things (variables), and hence for any given one, we can easily poll our memories to get a distribution sample over it.

So, “variables” are generally the coordinates on which our experience gives us good estimates of the true marginals (not the marginals of any model, but the real ones).  If we compute a conditional probability, conditioned on the value of some “variables” – i.e. if we make those variables non-Bayesian – this gives us something that’s plausible if and only if all the conditioning variables are all independently plausible, which is the kind of fact we find it easy to check intuitively.

If we make the variable Bayesian, we instead get a plausibility condition involving the prior joint distribution over it and the rest.  But this is the kind of thing we don’t have intuitions over.

——

But that’s all too extreme, you say!  We have some joint intuitions over variables.   (Our direct observations aren’t optimized for independence, and have many obvious redundancies.)  In these cases, what prior captures our knowledge?

Let’s run with the idea from above, that our 1D intuitions come from memories of many individual observations along that direction.  That is, they are a distribution statistically estimated from data somehow.  The Bayesian way to do that would be to take some very agnostic prior, and update it with the data.

When you’ve noticed patterns across more than one dimension, the story is the same: you have a dataset in N dimensions, you have some prior, and you compute the posterior. 

In other words, “determining the exact prior that expresses your intuitions” is equivalent to “performing statistical inference over everything you’ve ever observed.”  The more dimensions are involved, the more difficult this becomes just as a math problem – inference is hard in high dimensions.

So there’s a perfectly good Bayesian story explaining why we have a good sense of 1D plausibilities but not joint ones.  (1D inference is easier.)  A practical Bayesian knows about these relative difficulties when they’re wrangling with their prior now and their posterior after the new data.

But the same difficulties call into question their prior now, and would encourage relaxing it to something that only requires estimating 1D plausibilities, if possible.  But that’s just a non-Bayesian model, one that conditions on its variables.  Recognizing the difficulty structure of Bayesian inference as applied to the past can motivate modeling choices we would call “non-Bayesian” in the present.

Frequentist methods, rather than taking a variable to be constant, also try to obtain guaranteed accuracy regardless of the value of the variable. One can view this as trying to optimize accuracy in the worst case of the variable. It’s often equivalent to optimize accuracy in the worst case over probability distributions of the variable.

Could one try to optimize accuracy of some prediction, in the worst case over all probability distributions of the n variables with given marginals?

This sounds mathematically very complicated to compute but maybe there is a method to approximate certain versions of it which has some nice properties. 


Could one try to optimize accuracy of some prediction, in the worst case over all probability distributions of the n variables with given marginals?

This sounds like an interesting topic, but it isn’t really what I was going for in the OP.

But the difference wasn’t very clear in what I wrote – possibly not even in my head as I wrote it – so I should write it out more clearly now.

—-

I’m considering situations like, say, you have variables (x_1, x_2, x_3, y) and maybe your primary goal is to predict y.  You don’t have a good prior sense of how the variables affect either other, but you can draw empirical samples from their joint distribution.

(If the variables are properties of individuals in a population, this is sampling from the population.  If the variables are “world facts” with only a single known realization, like constants of fundamental physics, you can at least get the best known estimate for each one, an N=1 sample from the joint [insofar as the joint exists at all in this case].)

Compare two approaches:

(1) The “fully Bayesian” approach.  Start by constructing a joint prior

P_prior(x_1, x_2, x_3, y)

then use data to update this to

P_posterior(x_1, x_2, x_3, y)

and finally make predictions for y from the marginal

P_posterior(y) = ∫ P_posterior(x_1, x_2, x_3, y) dx_1 dx_2 dx_3

(2) A “non-Bayesian” approach.  Compute a conditional probability:

P(y | x_1, x_2, x_3)

Then make predictions for y by simply plugging in observed values for x_1, x_2, x_3.

——

In (2), you defer to reality for knowledge of the joint over (x_1, x_2, x_3).  This guarantees you get a valid conditional probability no matter what that joint is, and without knowing anything about it.  Because any values you plug in for (x_1, x_2, x_3) are sampled from reality, you don’t have to know how likely these values were before you observed them, only that they have in fact occurred.  Since they’ve occurred, the probability conditioned on them is just what you want.

As an extreme example, suppose in reality x_1 = x_2, although you aren’t aware of this.

Any time you take an empirical measurement, it will just so happen to have x_1  x_2 (approximate due to measurement error).  Your predictions for y, whatever other problems they might have, will never contain contributions from impossible regions where |x_1 - x_2| is large.

In (1), however, your posterior may still have significant mass in the impossible regions.  Your prior will generally have significant mass there (since you don’t know that x_1 = x_2 yet).  In the infinite-data limit your posterior will converge to one placing zero mass there, but your finite data will at best just decrease the mass there.  Thus your predictions for y have error due to sampling from impossible regions, and only in the infinite-data limit do you obtain the guarantee which (2) provides in all cases.

——

I want to emphasize that both approaches have a way of “capturing your uncertainty” over (x_1, x_2, x_3) – often touted as an advantage of the Bayesian approach.

In the Bayesian approach (1):

Uncertainty is captured by marginalization.  At the end you report a single predictive distribution P(y), which averages over a joint that is probably wrong in some unknown way.

When you learn new things about the joint, such as “x_1 = x_2,″ your previously reported P(y) is now suspect and you have to re-do the whole thing to get something you trust.

In the non-Bayesian approach (2):

Uncertainty is captured by sensitivity analysis.  You can see various plausible candidates for (x_1, x_2, x_3), so you evaluate P(y | x_1, x_2, x_3) across these and report the results.

So, rather than one predictive distribution, you get N = number of candidates you tried.  If it turns out later that some of the candidates are impossible, you can simply ignore those ones and keep the rest (this is Bayesian conditionalization on the new information).

——

In summary, marginals as predictive distributions for a target y only reflect your true state of belief insofar as you have good prior knowledge of the joint over the predictors X.

When you don’t have that, it’s better not to integrate for P(y) over volume elements for X, but instead just to compute the integrand at volume elements for X.

This provides something you can query any time you see a sample having some particular value for X, and lets you gradually ignore or emphasize volume elements as you gain knowledge about their mass.  (If you eventually gain full knowledge of the joint over X, you are now in position to integrate if you want, getting the same result as the Bayesian would with the same knowledge.)

I still feel like there’s a way to state this all more simply, but it still eludes me, so here we are.