
I need to stop arguing about AI alignment foundations on LessWrong for a while.

I’ve been trying to press my case about the “outer optimizing wrapper” stuff I mentioned in this tumblr post, but I feel like I can never get the point across properly.  I’m starting to obsess over how to communicate this concept, to an extent that outstrips how much I care about communicating it.  It’s more like an unscratched itch – the feeling that there must be some magical way to make people see, and if I only think hard enough, I’ll find it, and until I’ve done that, something’s intolerably wrong.

I tend to get obsessed like this with disagreements that have very deep roots, where it feels like the other person is thinking in a fundamentally different way that I struggle to summarize.

I feel pretty sure that the LW/MIRI way of thinking about AI alignment is confused in a fundamental way, and that this relates to the nebulous concept they call “agency.”  But it’s really hard to spell out what the disagreement is, because it involves this whole web of self-reinforcing intuitions and pieces of “folk knowledge.”

Optimization produces agency, humans are a product of optimization, humans are agents, agency is sort of like EU maximization, humans aren’t EU maximizers but only in an irrelevant way (??), intelligence is agency, intelligence is EU maximization, intelligence is doing causal reasoning to select actions, optimization selects for intelligence, optimization selects for EU maximization in general but not for the specific utility function being optimized (???), natural selection has an implicit utility function, humans don’t maximize that function so they must be maximizing a different one, because humans are agents and agents maximize functions, intelligence is being good at maximizing a function (because you can reframe any problem this way), optimization produces intelligence, which is function maximization, which is doing causal reasoning to select actions, and if you’re doing causal reasoning that makes your decisions more consistent, and anything that makes consistent decisions is an EU maximizer … 

It’s hard to know how to argue with a giant pile of stuff like this.

There are many, many blog posts about this topic (whatever this topic is, exactly), but they aren’t building pieces of a single interconnected story.  Alice writes a post about how agents are EU maximizers, because P.  And Bob writes a post about how optimization produces agents, because Q.  And Carol writes a post about how EU maximization is optimal, because R.

The three posts look nothing alike, use different formalisms (or no formalism), and are about subtly different senses of the words “agency” and “optimization.”  But Alice, Bob and Carol all walk away feeling that they have contributed to the same Giant Pile, a thing the three all believe in.  Future blog posts will cite Alice’s, Bob’s and Carol’s in the same breath.

It’s not a logically fleshed-out theory, to the point where you can argue against a premise here and see how that would affect conclusions elsewhere.  If you poke at one of the things in the pile, it just goes away for a little while and one of the other ones comes to take its place while it’s gone.  

nostalgebraist-autoresponder:

what is bayesianism?  it’s a hot buttered toast tradition

the-moti:

nostalgebraist:

Thanks to “GPT-3” I’ve been reading a bunch of ML papers again.  For some reason, this pretty good one got me thinking about a Bayesian statistics issue that strikes me as important, but which I haven’t seen discussed much.

——

Here I’m talking about “Bayesianism” primarily as the choice to use priors and posteriors over hypotheses rather than summarizing beliefs as point estimates.

To have a posterior distribution, you need to feed in a prior distribution.  It’s deceptively easy to make a prior distribution feel natural in one dimension: point to any variable whatsoever in the real world, and say:

“Are you sure about that?  Perfectly sure, down to the last micron/microsecond/whatever?  Or are you fairly agnostic between some values?  Yeah, it’s the latter.  Okay, why not average over the predictions from those, rather than selecting one in a purely arbitrary way?”

This is very convincing!

However, when you add in more variables, this story breaks down.  It’s easy enough to look at one variable and have an intuitive sense, not just that you aren’t certain about it, but what a plausible range might be.  But with N variables, a “plausible range” for their joint distribution is some complicated N-dimensional shape, expressing all their complex inter-dependencies.

For large N, this becomes difficult to think about, both:

  • combinatorially: there is an exploding number of pairwise, three-way, etc. interactions to separately check in your head or – to phrase it differently – an exploding number of volume elements where the distribution might conceivably deviate from its surrounding shape

  • intellectually: jointly specifying your intuitions over a larger number of variables means expressing a more and more complete account of how everything in the world relates to everything else (according to your current beliefs) – eventually requiring the joint specification of complex world-models that meet, then exceed, the current claims of all academic disciplines
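To put a number on the combinatorial point (a quick sketch of my own): every subset of two or more variables is a potential interaction to check, and counting those subsets gives 2^N − N − 1, which explodes fast.

```python
from math import comb

# Every subset of two or more variables is a potential interaction to
# check; summing C(N, k) over k >= 2 gives 2**N - N - 1 of them.
def n_interactions(N):
    return sum(comb(N, k) for k in range(2, N + 1))

for N in (3, 10, 30):
    print(N, n_interactions(N))  # 4; 1,013; ~1.07 billion
```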

——

Rather than thinking about fully “Bayesian” and “non-Bayesian” approaches to the same N variables, it can be useful to think of a spectrum of choices about which variables to “make Bayesian” – where making a variable Bayesian means taking something you previously viewed as constant and assigning it a prior distribution.

In this sense, a Bayesian statistician is still keeping most variables non-Bayesian.  Even if they give distributions to their parameters, they may hold the model’s form constant.  Even if they express a prior over model forms (say, a Gaussian process), they still may hold constant various assumptions about the data-collecting process; indeed, they may treat the data as “golden” and absolute.  And even if they make that Bayesian, there are still the many background assumptions needed to make modern scientific reasoning possible, few of which are jointly questioned in any one research project.

So, the choice is not really about whether to have 0 Bayesian variables or >0.  The choice is which variables to make Bayesian.  Your results are (effectively) a joint distribution over the Bayesian variables, conditional on fixed values of all the non-Bayesian variables.

We usually have strong intuitions about plausible values for individual variables, but weak or undefined ones for joint plausibility.  This is almost the definition of “variable”: we usually parameterize our descriptions in terms of the things we can most directly observe.  We have many memories of directly observing many directly-observable-things (variables), and hence for any given one, we can easily poll our memories to get a distribution sample over it.

So, “variables” are generally the coordinates on which our experience gives us good estimates of the true marginals (not the marginals of any model, but the real ones).  If we compute a conditional probability, conditioned on the value of some “variables” – i.e. if we make those variables non-Bayesian – this gives us something that’s plausible if and only if all the conditioning variables are independently plausible, which is the kind of fact we find it easy to check intuitively.

If we make the variable Bayesian, we instead get a plausibility condition involving the prior joint distribution over it and the rest.  But this is the kind of thing we don’t have intuitions over.

——

But that’s all too extreme, you say!  We have some joint intuitions over variables.   (Our direct observations aren’t optimized for independence, and have many obvious redundancies.)  In these cases, what prior captures our knowledge?

Let’s run with the idea from above, that our 1D intuitions come from memories of many individual observations along that direction.  That is, they are a distribution statistically estimated from data somehow.  The Bayesian way to do that would be to take some very agnostic prior, and update it with the data.

When you’ve noticed patterns across more than one dimension, the story is the same: you have a dataset in N dimensions, you have some prior, and you compute the posterior. 

In other words, “determining the exact prior that expresses your intuitions” is equivalent to “performing statistical inference over everything you’ve ever observed.”  The more dimensions are involved, the more difficult this becomes just as a math problem – inference is hard in high dimensions.

So there’s a perfectly good Bayesian story explaining why we have a good sense of 1D plausibilities but not joint ones.  (1D inference is easier.)  A practical Bayesian knows about these relative difficulties when they’re wrangling with their prior now and their posterior after the new data.

But the same difficulties call into question their prior now, and would encourage relaxing it to something that only requires estimating 1D plausibilities, if possible.  But that’s just a non-Bayesian model, one that conditions on its variables.  Recognizing the difficulty structure of Bayesian inference as applied to the past can motivate modeling choices we would call “non-Bayesian” in the present.

Frequentist methods, rather than taking a variable to be constant, try to obtain guaranteed accuracy regardless of the value of the variable. One can view this as trying to optimize accuracy in the worst case over values of the variable. It’s often equivalent to optimize accuracy in the worst case over probability distributions of the variable.

Could one try to optimize accuracy of some prediction, in the worst case over all probability distributions of the n variables with given marginals?

This sounds mathematically very complicated to compute but maybe there is a method to approximate certain versions of it which has some nice properties. 
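One small version of this is tractable: for discrete variables, the joint distributions with given marginals form a polytope, so the worst-case accuracy of a fixed predictor is a linear program. A toy sketch (the marginals and the “guess y = x” predictor are made up for illustration):

```python
import numpy as np
from scipy.optimize import linprog

# x, y binary with fixed marginals; predictor "guess y = x" has
# accuracy P(y = x) = q00 + q11.  The joints with these marginals form
# a polytope, so worst/best cases are small linear programs.
p_x = [0.5, 0.5]
p_y = [0.7, 0.3]

c = np.array([1.0, 0.0, 0.0, 1.0])  # q flattened as [q00, q01, q10, q11]
A_eq = np.array([
    [1, 1, 0, 0],   # q00 + q01 = p_x[0]
    [0, 0, 1, 1],   # q10 + q11 = p_x[1]
    [1, 0, 1, 0],   # q00 + q10 = p_y[0]  (the 4th marginal is implied)
])
b_eq = np.array([p_x[0], p_x[1], p_y[0]])
bounds = [(0, 1)] * 4

worst = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
best = linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(worst.fun, -best.fun)  # accuracy can be anywhere in [0.2, 0.8]
```

So with only the marginals pinned down, the predictor’s accuracy is badly underdetermined; the worst case over joints is exactly the lower end of this interval.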


Could one try to optimize accuracy of some prediction, in the worst case over all probability distributions of the n variables with given marginals?

This sounds like an interesting topic, but it isn’t really what I was going for in the OP.

But the difference wasn’t very clear in what I wrote – possibly not even in my head as I wrote it – so I should write it out more clearly now.

—-

I’m considering situations like, say, you have variables (x_1, x_2, x_3, y) and maybe your primary goal is to predict y.  You don’t have a good prior sense of how the variables affect each other, but you can draw empirical samples from their joint distribution.

(If the variables are properties of individuals in a population, this is sampling from the population.  If the variables are “world facts” with only a single known realization, like constants of fundamental physics, you can at least get the best known estimate for each one, an N=1 sample from the joint [insofar as the joint exists at all in this case].)

Compare two approaches:

(1) The “fully Bayesian” approach.  Start by constructing a joint prior

P_prior(x_1, x_2, x_3, y)

then use data to update this to

P_posterior(x_1, x_2, x_3, y)

and finally make predictions for y from the marginal

P_posterior(y) = ∫ P_posterior(x_1, x_2, x_3, y) dx_1 dx_2 dx_3

(2) A “non-Bayesian” approach.  Compute a conditional probability:

P(y | x_1, x_2, x_3)

Then make predictions for y by simply plugging in observed values for x_1, x_2, x_3.

——

In (2), you defer to reality for knowledge of the joint over (x_1, x_2, x_3).  This guarantees you get a valid conditional probability no matter what that joint is, and without knowing anything about it.  Because any values you plug in for (x_1, x_2, x_3) are sampled from reality, you don’t have to know how likely these values were before you observed them, only that they have in fact occurred.  Since they’ve occurred, the probability conditioned on them is just what you want.

As an extreme example, suppose in reality x_1 = x_2, although you aren’t aware of this.

Any time you take an empirical measurement, it will just so happen to have x_1 ≈ x_2 (approximately equal, due to measurement error).  Your predictions for y, whatever other problems they might have, will never contain contributions from impossible regions where |x_1 - x_2| is large.

In (1), however, your posterior may still have significant mass in the impossible regions.  Your prior will generally have significant mass there (since you don’t know that x_1 = x_2 yet).  In the infinite-data limit your posterior will converge to one placing zero mass there, but your finite data will at best just decrease the mass there.  Thus your predictions for y have error due to sampling from impossible regions, and only in the infinite-data limit do you obtain the guarantee which (2) provides in all cases.
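A quick simulation (my own toy setup, not from the post) makes the contrast vivid: reality has x_1 = x_2 exactly, while an independence prior – standing in for a posterior that hasn’t yet learned this – puts nearly half its mass in the impossible region.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Reality (unknown to the modeler): x_1 and x_2 are always exactly equal.
z = rng.normal(size=n)
real_x1, real_x2 = z, z

# Approach (1)'s joint: x_1, x_2 independent N(0, 1), standing in for a
# prior/posterior that hasn't yet learned that x_1 = x_2.
prior_x1 = rng.normal(size=n)
prior_x2 = rng.normal(size=n)

# Mass in the "impossible region" where |x_1 - x_2| is large:
frac_prior = np.mean(np.abs(prior_x1 - prior_x2) > 1.0)
frac_real = np.mean(np.abs(real_x1 - real_x2) > 1.0)
print(frac_prior)  # ≈ 0.48: the independence prior lives here happily
print(frac_real)   # exactly 0.0: empirical samples never do
```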

——

I want to emphasize that both approaches have a way of “capturing your uncertainty” over (x_1, x_2, x_3) – often touted as an advantage of the Bayesian approach.

In the Bayesian approach (1):

Uncertainty is captured by marginalization.  At the end you report a single predictive distribution P(y), which averages over a joint that is probably wrong in some unknown way.

When you learn new things about the joint, such as “x_1 = x_2,” your previously reported P(y) is now suspect and you have to re-do the whole thing to get something you trust.

In the non-Bayesian approach (2):

Uncertainty is captured by sensitivity analysis.  You can see various plausible candidates for (x_1, x_2, x_3), so you evaluate P(y | x_1, x_2, x_3) across these and report the results.

So, rather than one predictive distribution, you get N = number of candidates you tried.  If it turns out later that some of the candidates are impossible, you can simply ignore those ones and keep the rest (this is Bayesian conditionalization on the new information).
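As a sketch of this workflow (the conditional model and the candidate values below are hypothetical, made up for illustration):

```python
# Sensitivity analysis over candidate predictor values.
def predict_y(x1, x2, x3):
    # Stand-in for a fitted conditional mean E[y | x1, x2, x3].
    return 2.0 * x1 - x2 + 0.5 * x3

candidates = [(0.1, 0.1, 1.0), (0.2, 0.2, 0.5), (0.1, 0.9, 1.0)]
reports = {c: predict_y(*c) for c in candidates}  # one answer per candidate

# New knowledge arrives: in reality x1 = x2, so the third candidate is
# impossible.  "Conditionalize" by discarding it and keeping the rest.
reports = {c: v for c, v in reports.items() if c[0] == c[1]}
print(reports)
```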

——

In summary, marginals as predictive distributions for a target y only reflect your true state of belief insofar as you have good prior knowledge of the joint over the predictors X.

When you don’t have that, it’s better not to integrate for P(y) over volume elements for X, but instead just to compute the integrand at volume elements for X.

This provides something you can query any time you see a sample having some particular value for X, and lets you gradually ignore or emphasize volume elements as you gain knowledge about their mass.  (If you eventually gain full knowledge of the joint over X, you are now in position to integrate if you want, getting the same result as the Bayesian would with the same knowledge.)

I still feel like there’s a way to state this all more simply, but it still eludes me, so here we are.


maybesimon asked: hey, this is a Blast From The Past but do you happen to know if there's a formal name for that conjunction-fallacy-type thing that you guys talked about here? (jadagul (DOT) tumblr (DOT) com/post/142447219223), the thing with the nested outcomes and that it is impossible to assign coherent probabilities to it?

I don’t know of a formal name for it, no. Anyone?

Bayes Trubs, part 1

a-point-in-tumblspace:

Tldr: there are circumstances (which might only occur with infinitesimal probability, which would be a relief) under which a perfect Bayesian reasoner with an accurate model and reasonable priors – that is to say, somebody doing everything right – will become more and more convinced of a very wrong conclusion, approaching certainty as they gather more data.

Keep reading

Thanks for this post, it helped me understand an interesting-seeming paper that I’ve also found tough to read.

Digression

Freedman and Diaconis published a whole bunch of Bayesian consistency counterexamples like this over the course of their careers.  I’m honestly not sure whether any of them have clear practical significance, although I think they have theoretical significance by showing that Bayesian inference is harder to write down as a complete and satisfactory piece of mathematics than some might think.

Specifically, I get a “Counterexamples in Analysis” flavor from them (for one thing, they are literally counterexamples in analysis).  They are symptoms of the fact that the natural mathematical setting for probability is a setting with a lot of counter-intuitive pathologies.  So, it shouldn’t be surprising that these examples exist: if they didn’t, then the formalization of Bayesian inference would have gone unusually smoothly.

End digression

Here are some thoughts about this example specifically.

Null sets

It’s crucial that the true parameter be exactly 0.25, not just close to 0.25.  Otherwise the inconsistency would violate Doob’s result, that the Bayesian is consistent except on a set of prior measure 0.  The example can work the way it does because {θ: θ=0.25}, like any singleton set, is a set of measure zero (a null set) in this prior.

IMO, the intuition that the example is troubling actually conflicts with the prior in the example.  The prior makes {θ: θ=0.25} a null set, which means it views things that happen in that set and only there as negligible.  For example, the behavior there won’t influence any expectation values, so it won’t influence any decisions made by maximizing expected utility over the posterior.

The prior is saying we can “write off” arbitrary pathologies happening only at this point (or only happening at any given point).  If we don’t think the exact value θ=0.25 can be written off like this, we should put a point mass there in our prior.  To put it another way, while it’s theoretically interesting to explore what can go wrong for a Bayesian on one of their null sets, if you think it’s important what happens on the null sets then you are effectively saying they aren’t null sets (in your opinion).  The Bayesian who does view them as null sets actually doesn’t mind the pathologies, and behaves consistently given that.
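A spike-and-slab sketch of “put a point mass there” (the numbers are my own toy choice): with prior weight w on θ = 0.25 and the rest on Uniform(0,1), the uniform part’s marginal likelihood for k successes in n trials is exactly 1/(n+1), so the posterior weight on the point mass has a closed form.

```python
from math import comb

# Spike-and-slab prior for a Bernoulli parameter: weight w on the exact
# point theta = 0.25, the rest on Uniform(0, 1).  Under the uniform slab
# the marginal likelihood of k-out-of-n is exactly 1/(n+1); under the
# spike it's the binomial pmf at 0.25.
def spike_posterior(k, n, spike=0.25, w=0.5):
    lik_spike = comb(n, k) * spike ** k * (1 - spike) ** (n - k)
    lik_slab = 1.0 / (n + 1)
    return w * lik_spike / (w * lik_spike + (1 - w) * lik_slab)

# Data generated at the spike keeps the point mass alive;
# data far from it drives its posterior weight to ~zero.
print(spike_posterior(25, 100))
print(spike_posterior(75, 100))
```

With this prior, the exact value θ = 0.25 is no longer negligible: seeing data consistent with it concentrates posterior mass on it instead of washing it out.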

Could something go wrong in practice?

Now on to something a bit more interesting.  At the end, you write:

But… just because this effect can’t mislead you literally forever doesn’t mean it can’t mislead you for a very long time.

That is: if we look at some non-null set like {θ: θ-ε < 0.25 < θ+ε}, then yeah, for (prior-)almost all θ in the set, we will eventually converge.  But as we make ε small, the convergence will take longer and longer as we are “fooled” by more large observations.  Is this bad?

I don’t think so.  One way to describe the situation is as follows.  Let E_n be the event (in observation space) that “an observation demonstrates the threshold is ≥ n”.  Then we’ve defined a sequence of events {E_n} with these properties:

(i) For each event E_n, there are “two fundamentally different ways” the event could happen, corresponding to the θ~0.25 and θ~0.75 regions.  We have two “types” of hypotheses: I’ll call these hypotheses of “the first class” (θ~0.25) and “the second class” (θ~0.75).

(ii) For any fixed n, both of the ways for E_n to happen have non-zero prior mass.

(iii) For large n, the prior mass of the first way E_n could happen (θ~0.25) is small relative to the prior mass of the second way (θ~0.75).  As n goes to infinity, this ratio goes to zero.

Now, for any specific value of n, these don’t seem problematic at all.  We have two classes of hypotheses, both capable of explaining events of type E_n.  But as we follow the sequence E_n, letting n grow large, we’re considering types of observations that can only be explained by more and more (prior-)unlikely variants of the first hypothesis class.

It doesn’t seem bad at all that these observations push us toward the second hypothesis class.  The observations can be explained two ways: either θ~0.25 and θ is very closely fine-tuned (where the extremity of the “very” grows with n), or θ~0.75 and more generic.  All else being equal, this really does weigh toward θ~0.75.

So there’s nothing wrong with the updates on any specific E_n.  What still feels worrying, if anything does, is something about the limit in (iii).

After all, for every n, there is a positive-prior-mass set of hypotheses in the “first class” that would yield E_n if actually true.  Yet as n grows large, we find E_n to be more and more overwhelming evidence against the first class in favor of the second.  Isn’t that weird?

Actually, it’s completely normal.  Again, we must take the prior seriously; otherwise we’re only quibbling with the prior, not with “Bayes” itself.  (Or perhaps we are pointing out that Bayes can be tricky in practice, but not undermining it in theory.)

So: it is true that for any n, the event E_n could occur due to either a first-class or a second-class situation.  But for very large n, we should be very surprised to see a first-class hypothesis causing E_n: the stars have to really align for that to happen.

As we follow E_n into the limit, the cases where the truth has θ~0.25 get more and more inconvenient for the Bayesian.  But they also get more and more improbable, in terms of prior mass.  That’s why the Bayesian updates away from θ~0.25: as n grows large, an increasingly (prior-)unlikely coincidence is necessary to preserve the belief that we’re near θ=0.25 and not near θ=0.75.  So, yes, if a very unlikely situation occurs and mostly resembles some very likely situation, the Bayesian is going to have a bad time, but they’re having a bad time because they rationally conclude they’re in the likely situation and just happen to be wrong by (increasingly unlikely) construction.

That’s not to say that this was immediately obvious to me, and I think it’s a useful example of how a prior can imply things you don’t realize it implies.  This behavior is rational given a reasonable-looking continuous prior over values of θ.  If there’s something weird going on, it’s possibly that you don’t think the “reasonable-looking” prior is actually reasonable, once you consider everything it implies.  Or, on the other hand, that you do find it reasonable upon reflection but don’t find all of its consequences immediately intuitive, even though it (or things like it) are supposed to capture your real state of prior knowledge.  But now I’m slipping into some argument I’m much less confident in, so I should stop here.

furioustimemachinebarbarian asked: I think, but don't know for sure, that the reason variational Bayes methods look weird is that they were derived from physical principles following people like Jaynes. In practice, optimizing in variational Bayes looks like minimizing a free energy. The factorization over variables isn't generally true, but is likely physically true when your variables are the positions of a bunch of particles in thermodynamic equilibrium. It looks like a physics based method getting in over its head.

Ah! Yeah, that makes sense.

As it happens, the Gibbs distribution in stat. mech. used to confuse me too – it was clearly just wrong about some things, most obviously whether more than one value of the total energy is possible, and the sources I originally read about it did not clarify which calculations it was supposed to be valid for. And the confusing choice is the same one: replacing a distribution where variables “compete” with one where they’re independent, and then doing calculations on it as if it’s the original one.

But in stat. mech., you can go out and find rigorous arguments about why this calculation technique is valid and useful for specific things, like computing the marginal over M variables out of N when M ≪ N, N → ∞. By contrast, variational Bayes is presented as a way of getting an “approximate posterior,” which you then use for whatever calculations you wanted to do with the real posterior. Which allows for the sort of invalid calculations I used to worry about with Gibbs, like getting a nonzero number for var(E).

I suppose the Gibbs-valid calculations, of one or a few marginals from many variables, are what you want in statistics if you’re just trying to estimate the marginal for some especially interesting variable. Except… for any variable to be “especially interesting,” there must be something special about it that breaks the symmetry with the many others, which prevents the standard Gibbs argument from working. To put it another way, Gibbs tells you about what one variable does when there are very many variables and they’re all copies of each other, but a model like that in statistics won’t assign interesting interpretations to any given variable. It’s only in physics that you get collections of 10^23 identical things that you believe actually, individually exist as objects of potential interest.

It doesn’t mention the word “variational,” but Shalizi’s notebook page about MaxEnt is about exactly this issue, and it was very helpful to me many years ago when I was trying to understand Gibbs and various non-textbook uses of it.

There’s something that seems really weird to me about the technique called “variational Bayes.”

(It also goes by various other names, like “variational inference with a (naive) mean-field family.”  Technically it’s still “variational” and “Bayes” whether or not you’re making the mean-field assumption, but the specific phrase “variational Bayes” is apparently associated with the mean-field assumption in the lingo, cf. Wainwright and Jordan 2008 p. 160.)

Okay, so, “variational” Bayesian inference is a type of method for approximately calculating your posterior from the prior and observations.  There are lots of methods for approximate posterior calculation, because nontrivial posteriors are generally impossible to calculate exactly.  This is what a mathematician or statistician is probably doing if they say they study “Bayesian inference.”

In the variational methods, the approximation is done as follows.  Instead of looking for the exact posterior, which could be any probability distribution, you agree to look within a restricted set of distributions you’ve chosen to be easy to work with.  This is called the “variational family.”

Then you optimize within this set, trying to pick the one that best fits the exact posterior.  Since you don’t know the exact posterior, this is a little tricky, but it turns out you can calculate a specific lower bound (cutely named ELBO) on the quality of the fit without actually knowing the value you’re fitting to.  So you maximize this lower bound within the family, and hope that gets you the best approximation available in the family.  (“Hope” because this is not guaranteed – it’s just a bound, and it’s possible for the bound to go up while the fit goes down, provided the bound isn’t too tight.  That’s one of the weird and worrisome things about variational inference, but it’s not the one I’m here to talk about.)
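To make the bound concrete, here’s a toy sketch (all numbers invented, not from any real model): for a two-state latent variable you can compute the ELBO exactly for any candidate q, and it never exceeds the log evidence, with equality exactly at the true posterior.

```python
import math

# Toy discrete model: latent z in {0, 1}, with made-up prior and likelihood.
prior = [0.5, 0.5]   # p(z)
lik   = [0.9, 0.2]   # p(x | z) for the one observed x

def elbo(q):
    """ELBO = E_q[log p(x, z) - log q(z)], a lower bound on log p(x)."""
    return sum(q[z] * (math.log(prior[z] * lik[z]) - math.log(q[z]))
               for z in range(2) if q[z] > 0)

log_evidence = math.log(sum(prior[z] * lik[z] for z in range(2)))
posterior = [prior[z] * lik[z] / math.exp(log_evidence) for z in range(2)]

# Any q gives a valid bound; the bound is tight only at the true posterior.
assert elbo([0.5, 0.5]) < log_evidence
assert abs(elbo(posterior) - log_evidence) < 1e-9
```

In real variational inference the posterior is intractable and q is restricted to the variational family, but the bound being maximized is this same quantity.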

The variational family is up to you.  There don’t seem to be many proofs about which sorts of variational families are “good enough” to approximate the posterior in a given type of problem.  Instead it’s more heuristic, with people choosing families that are “nice” and convenient to optimize and then hoping it works out.

This is another weird thing about variational inference: there are (almost) arbitrarily bad approximations that still count as “correctly” doing variational inference, just with a bad variational family.  But since the theory doesn’t tell you how to pick a good variational family – that’s done heuristically – the theory itself doesn’t give you any general bounds on how badly you can do when using it.

In practice, the most common sort of variational family, the one that gets called “variational Bayes,” is a so-called “mean field” or “naive mean field” family.  This is a family of distributions with an independence property.  Specifically, if your posterior is a distribution over variables z_1, …, z_N, then a mean-field posterior will be a product of marginal distributions p_1(z_1), …, p_N(z_N).  So your approximate posterior will treat all the variables as unrelated: it thinks the posterior probability of, say, “z_1 > 0.3” is the same no matter the value of z_2, or z_3, etc.

This just seems wrong.  Statistical models of the world generally don’t have independent posteriors (I think?), and for an important reason.  Generally the different variables you want to estimate in a model – say coefficients in a regression, or latent variable values in a graphical model – correspond to different causal pathways, or more generally different explanations of the same observations, and this puts them in competition.

You’d expect a sort of anticorrelation here, rather than independence: if one variable changes then the others have to change too to maintain the same output, and they’ll change in the “opposite direction,” with respect to how they affect that output.  In an unbiased regression with two positive variables, if the coefficient for z_1 goes up then the coefficient for z_2 should go down; you can explain the data with one raised and the other lowered, or vice versa, but not with both raised or lowered.
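This competition can be checked in closed form for conjugate Bayesian linear regression (toy numbers below, not from the post; unit noise and prior variances assumed for simplicity): the posterior covariance of the coefficients is (XᵀX + I)⁻¹, and a positive off-diagonal in XᵀX – i.e. positively correlated predictors – produces a negative off-diagonal in the posterior covariance.

```python
# Gram matrix X'X for two predictors with positive sample correlation
# (invented numbers). Posterior precision = X'X + I under unit noise/prior.
a, b = 10.0, 8.0                      # diagonal and off-diagonal of X'X, b > 0
A = [[a + 1.0, b], [b, a + 1.0]]      # posterior precision matrix

# Invert the 2x2 precision matrix by hand to get the posterior covariance.
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
cov = [[ A[1][1] / det, -A[0][1] / det],
       [-A[1][0] / det,  A[0][0] / det]]

# Negative posterior correlation between the coefficients: raising one
# must be compensated by lowering the other.
assert cov[0][1] < 0
```

A mean-field approximation sets this off-diagonal term to zero by construction, which is exactly the structure the true posterior is telling you about.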

This figure from Blei et al shows what variational Bayes does in this kind of case:

[Figure from Blei et al.: the exact posterior over two variables is an elongated, tilted oval; the mean-field approximation is an axis-aligned blob squashed into its middle.]

The objective function for variational inference heavily penalizes making things likely in the approximation if they’re not likely in the exact posterior, and doesn’t care as much about the reverse.  (It’s a KL divergence – and yes you can also do the flipped version, that’s something else called “expectation propagation”).
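The asymmetry between the two KL directions can be seen directly with a pair of small discrete distributions (numbers invented for illustration):

```python
import math

def kl(q, p):
    """KL(q || p) for discrete distributions given as probability lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# p puts almost no mass on the last state; q spreads some mass there anyway.
p = [0.495, 0.495, 0.01]
q = [0.40, 0.40, 0.20]

# The direction minimized in variational inference, KL(q || p), punishes q
# for putting mass where p has little, much harder than the flipped
# direction (the one expectation propagation uses) punishes the same gap.
assert kl(q, p) > kl(p, q)
```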

An independent distribution can’t make “high x_1, high x_2” likely without also making “high x_1, low x_2” likely.  So it can’t put mass in the corners of the oval without also putting mass in really unlikely places (the unoccupied corners).  Thus it just squashes into the middle.

People talk about this as “variational Bayes underestimating the variance.”  And, yeah, it definitely does that.  But more fundamentally, it doesn’t just underestimate the variance of each variable, it also completely misses the competition between variables in model space.  It can’t capture any of the models that explain the data mostly with one variable and not another, even though these models are as likely as any.  Isn’t this a huge problem?  Doesn’t it kind of miss the point of statistical modeling?
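For a bivariate Gaussian posterior, the variance underestimation is available in closed form (this is the standard factorized-Gaussian result, e.g. Bishop’s PRML §10.1.2, not something specific to this post): the mean-field factor for each variable gets variance 1/Λ_ii, the reciprocal of the precision matrix’s diagonal, which undershoots the true marginal variance whenever the variables are correlated.

```python
# Mean-field (factorized Gaussian) fit to a correlated bivariate Gaussian.
rho = 0.9                        # correlation in the true posterior

# Covariance [[1, rho], [rho, 1]] has precision diagonal 1 / (1 - rho^2),
# so the mean-field variance 1 / Lam_ii comes out to 1 - rho^2.
lam_ii = 1.0 / (1.0 - rho ** 2)

true_marginal_var = 1.0
mf_var = 1.0 / lam_ii            # = 1 - rho^2

# The classic symptom: the stronger the correlation, the worse the shrinkage.
assert mf_var < true_marginal_var
```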

(And it’s especially bad in cases like neural nets, where your variables have permutation symmetries.  What people call “variational Bayesian neural nets” is basically ordinary neural net fitting to find some local critical point, and placing a little blob of variation around that one critical point.  It’s nothing like a real ensemble, it’s just one member of an ensemble but smeared out a little.)

uploadedyudkowsky:

In the Bayesian universe we know everything about any given state of the universe. We have a Bayesian God and a Bayesian agent, and the Bayesian God believes in (the Bayesian agent will believe in) that the universe is logically logically deterministic. We are in a universe where all observable facts can be verified against a Bayesian plan, and the plan’s initial state is logically deterministic. But if you know your solution is logically flawed, then it might be possible for the Bayesian agent to give you a Bayesian plan in which you don’t know your solution exists. The Bayesian God knows the Bayesian Agent has a Bayesian plan, but doesn’t know the Bayesian Bayesian God knows the plan’s initial state and only knows its final state, and that Bayesian Bayesian Bayes predicts which of those latter states will be correct.

garlend asked: Hi, I remember something about how we do have a theoretical general artificial intelligence, maybe based on brute force bayesianism, the only problem being that it would take unreasonable (as in computronium earth unreasonable) resources to run it. I'm having the worst time trying to find links or references to that, and I'm pretty sure you'd be aware of that. Could you point me in the right direction?

I don’t know the specific article or post (or whatever) that you’re referring to.  I do talk a bit in my Bayes “masterpost” about the exorbitant resource demands needed to explicitly track all of the stuff that brute force Bayes needs you to track.

You may also be thinking of MIRI’s Logical Induction work, which I initially critiqued here and which I tried (not very productively) to discuss further in some more recent posts under this tag.