
bayes: a kinda-sorta masterpost

I have written many many words about “Bayesianism” in this space over the years, but the closest thing to a comprehensive “my position on Bayes” post to date is this one from three years ago, which I wrote when I was much newer to this stuff.  People sometimes link that post or ask me about it, which almost never happens with my other Bayes posts.  So I figure I should write a more up-to-date “position post.”

I will try to make this at least kind of comprehensive, but I will omit many details and sometimes state conclusions without the corresponding arguments.  Feel free to ask me if you want to hear more about something.

I ended up including a whole lot of preparatory exposition here – the main critiques start in section 6, although there are various critical remarks earlier.

[Edit 5/30/22: I later wrote two posts critiquing Bayesianism from a totally different – and more original? – angle than any of the ones covered here.  Check them out too, if this subject interests you.]


(long OP snipped for space, but may be worth looking at for context)

I thought about this more, and concluded that Jaynes’ propositional formulation of probability is more different from Kolmogorov than I thought, and really does have an advantage in this case.

Both the Jaynes and Kolmogorov formulations have some sort of Boolean algebra structure, which the probabilities are constrained to respect.  In Kolmogorov, this arises because probabilities get assigned to sets, and you get a Boolean algebra from the unions/intersections of the sets.  In Jaynes, probabilities are assigned to propositions and the Boolean algebra comes from logical and/or/not.

The difference is in the “basic atoms” of the algebra.  In Kolmogorov, the sets contain things, called “outcomes,” and the “smallest” set you can make (besides the empty set) is a singleton with just one of the outcomes in it.  All of these distinct singletons are mutually exclusive events, so they can’t be proper subsets of one another, and subset-hood is exactly the structure you need if you want to encode material conditionals (i.e. the kind of “A→B” that is equivalent to “¬A or B”).  In Kolmogorov, material conditionals only arise from the subset relationships between bigger, non-atomic events.

But in Jaynes, the basic atoms are atomic propositions (i.e. propositions like “A” which are not aliases for compounds like “B or C,” but are undefined within the algebra itself).  For two atoms A and B, the material conditional “¬A or B” is a perfectly good compound proposition, and we can (say) begin by being unsure of its truth value and later update after we prove it (by conditioning on it), as @eclairsandsins​ was suggesting upthread.

If we try to translate this back to the Kolmogorov picture, the compound propositions follow the same algebra as Kolmogorov events (sets), but the atomic propositions are weird “sets of unknown structure”: their unions, intersections, etc. exist in the algebra but we don’t know anything about their contents.  This is exactly the sort of thing I needed in the OP to encode material conditionals, and generally seems preferable to the “you have to already know everything” feel of the Kolmogorov setup.
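A minimal sketch of this picture in Python (the atoms, the uniform prior, and the act of “proving” the conditional are all my illustrative assumptions, not anything from Jaynes): treat each truth assignment to the atomic propositions as a world, put a distribution over worlds, and condition on the material conditional “¬A or B” once it is established.

```python
from fractions import Fraction
from itertools import product

atoms = ["A", "B"]

# Each world is a truth assignment to the atomic propositions.
worlds = [dict(zip(atoms, vals)) for vals in product([True, False], repeat=2)]

# Start maximally unsure: a uniform prior over the four worlds (illustrative choice).
prior = {i: Fraction(1, 4) for i in range(len(worlds))}

def prob(dist, pred):
    """Probability of a compound proposition, given as a predicate on worlds."""
    return sum(p for i, p in dist.items() if pred(worlds[i]))

def condition(dist, pred):
    """Bayesian update on learning that a proposition is true."""
    z = prob(dist, pred)
    return {i: (p / z if pred(worlds[i]) else Fraction(0)) for i, p in dist.items()}

material = lambda w: (not w["A"]) or w["B"]   # the compound "not-A or B", i.e. A->B

print(prob(prior, material))                  # 3/4: we start out unsure of the conditional
post = condition(prior, material)             # ...then we "prove" it and update
print(prob(post, lambda w: w["B"]))           # P(B) rises from 1/2 to 2/3
```

Note that the material conditional here is just another event with an intermediate probability, which is exactly what the Kolmogorov atomic-outcome picture makes awkward.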

If we move from material conditionals to counterfactual conditionals, things get harder.  (Counterfactual conditionals are the type I mentioned earlier that can be false even if antecedent and consequent are false: e.g. “if I were in Mexico, I’d be in Asia” is false even though I am in neither.)  The Boolean algebra can tell us that certain things are logically possible even if they are contingently false (e.g. A and ¬A are both in the algebra, but we happen to assign P(A) = 0), but I don’t think this does the whole job, for which I think we need predicate calculus or modal logic rather than just propositional logic.

For instance, even if Linda is not a bankteller at all, I believe it is false (in the counterfactual conditional sense) that “(Linda is a bankteller) → (Linda is a feminist bankteller).”  Not just because non-feminist banktellers are logically possible, but actually possible.  For this I need the “there exists” of predicate calculus, or (better?) the “is possible” of modal logic.  David Chapman has a post saying that Jaynes’ setup doesn’t generalize to predicate calculus; I linked it a while back and remember getting pushback from @raginrayguns​ about it, but I don’t remember the details.

Anyway, the motivation for all this was thinking about logical induction.  If we want to reason sensibly about the probabilities of logical statements, we ought to be able to do it about (at least) statements of first-order predicate calculus, and possibly higher-order (?) as well.  We’re talking about trying to assign probabilities to math conjectures here; we at least need to be able to deal with Peano arithmetic.

The MIRI formalism outsources this task to the e.c. traders, which doesn’t tell us how efficiently it can actually be done; if for instance we need a separate trader for each case of a “for all” sentence, that would still get the job done in MIRI’s asymptotic sense, but effectively means that we can’t do it in practice (one would like to be able to take in the sentence “all at once”).  Quantifiers can be parcelled out into conjunctions/disjunctions of countably many propositions, but we’d like to avoid that if we can (one should be able to do mathematical induction without needing infinite storage space to hold P(n) for every integer.)  IDK, it just seems like “can we do this sort of probability in practice” is the real question, and the MIRI work sidesteps that.

(via nostalgebraist)

eclairsandsins:

nostalgebraist:

I’ve been thinking a bit about how to get a “uniform rather than pointwise” version of the logical induction stuff.

It seems like a lot of the challenge of the problem is generic to Bayesianism, and not particular to “logical” or “mathematical” outcomes.  Anyone who reads my posts on this stuff knows I have an axe to grind about how, outside of specialized small domains, you don’t have a complete sigma-algebra to put probabilities on.  (Because you aren’t logically omniscient, you don’t know all the logical relations between hypotheses, which means you don’t know all of the subset relationships between sets in your algebra.)

The case of “logical induction” forces Bayesians to think about this even if they wouldn’t otherwise, since the prototypical/motivating examples involve math, a world in which we are continually discovering facts of the form “A implies B.”  So assuming logical omniscience would be assuming we know all the theorems at the outset, in which case we wouldn’t need LI to begin with.

But the problem that “A may imply B even though you don’t know it does” is generic and comes up for Bayesian inference about real-world events, too.  A good solution to this problem would be very interesting and important (?) even if it didn’t, in itself, handle the “logical” aspects of “logical induction” (like “what counts as evidence for a logical sentence”).


I have to imagine there is work on this problem out there, but I have had a hard time finding it.

The basic mathematical setup would have to involve some “incomplete” version of a sigma-algebra (generically, a field of sets), where not all of the union/intersection information is “known.”  This is a bit weird because when we talk about a collection of sets, we usually mean we know what is in the sets, and that information contains all the relations like “A is a subset of B” (i.e. A implies B), whereas here we want to make some of them go away.

A Boolean algebra is like a field of sets where we forget what the sets contain, and just leave them as blank symbols that happen to have union/intersection (AKA “join/meet”) relations with one another.  That seems closer to what we want, except that we need some of the join/meet operations to give undefined results.  There are Boolean algebras where not everything has a join/meet (those that aren’t complete, in the complete lattice sense), but this seems like a thing having to do with inf/sup stuff in infinite spaces and isn’t really what we want.  (Despite my username, I know very little about algebra and am just flying blind on Wikipedia here.)

An example of the sort of thing I want to do is the following.  Say we are assigning probabilities in (0,1) to P(A), P(B), P(A=>B), and P(B=>A).  Suppose P(A=>B) > P(B=>A), that is, we think it’s more likely that A implies B than the reverse (and in particular, more likely than A<=>B).

Now consider P(A) and P(B).  The probabilities above say we’re most likely to be in a world where A=>B and not vice versa, in which case we should have P(A) > P(B), or we’ll be incoherent.  So it seems like we should have P(A) > P(B) right now.  Of course, this will make us incoherent if it turns out that we are in the B=>A or A<=>B worlds, but we think those are less likely.  In betting terms, the losses we might incur from incoherence in a likely world should outweigh those we’d incur from incoherence in an unlikely world.

What we’re really doing here, I guess, is treating the implication (i.e. subset relation) as a random event, so implicitly there is a second, complete probability space whose events (or outcomes?) include the subset relations on the first, incomplete probability space (the one discussed above).  Maybe you could just do the whole thing this way?  I haven’t tried it, I’m curious what would happen
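The two-level idea can be sketched concretely (everything here is made up for illustration: the three candidate structures, their weights, and the uniform inner distributions): the outer space assigns probabilities to logical structures, each structure forbids certain worlds, and unconditional probabilities come from averaging over structures.

```python
from fractions import Fraction

# Worlds: truth values (A, B).
worlds = [(True, True), (True, False), (False, True), (False, False)]

# Outer space: candidate logical structures and our credence in each (made-up numbers).
# Each structure forbids the worlds that would violate its subset relation.
structures = {
    "A=>B only": (Fraction(6, 10), lambda a, b: not (a and not b)),
    "B=>A only": (Fraction(3, 10), lambda a, b: not (b and not a)),
    "neither":   (Fraction(1, 10), lambda a, b: True),
}

def marginal(event):
    """P(event), averaging over structures; within each structure, a uniform
    distribution over the worlds that structure allows (another made-up choice)."""
    total = Fraction(0)
    for weight, allowed in structures.values():
        ok = [w for w in worlds if allowed(*w)]
        total += weight * sum(Fraction(1, len(ok)) for w in ok if event(*w))
    return total

p_a = marginal(lambda a, b: a)
p_b = marginal(lambda a, b: b)
print(p_a, p_b)   # with these weights: P(A) = 9/20, P(B) = 11/20
```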

Anyway, I can’t help but think there must be the right math tools out there for doing this kind of thing, and I just don’t know about it.  Anyone have pointers?

Um, if P(A → B) > P(B → A) then P(A) < P(B), not greater. Imagine if P(A → B) = 90% and P(B → A) = 30%. The only difference between the truth tables of A → B vs B → A is that the former is false only when A is true and B is false, and the latter is false only when B is true and A is false. So, P(A and ¬B) = 10% and P(B and ¬A) = 70%. The latter tells you that P(B) ≥ 70% and P(¬A) ≥ 70%, aka P(A) ≤ 30%. Therefore P(B) > P(A). To get a more general proof, use variables instead of 90% and 30%.

By the way, “A → B” is just another way of saying “¬A or B,” and you just apply normal probability to the latter.
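The truth-table argument can be checked by brute force (a quick sketch; the 90%/30% numbers are from the example above): fixing P(A∧¬B) = 0.1 and P(¬A∧B) = 0.7 and sweeping over the remaining freedom, P(B) exceeds P(A) by the same margin in every coherent assignment.

```python
p_tf = 0.1            # P(A, not B): forced by P(A -> B) = 90%
p_ft = 0.7            # P(not A, B): forced by P(B -> A) = 30%

# The remaining 0.2 of probability mass is split freely between the
# (A, B) and (not A, not B) worlds; sweep over that freedom.
for k in range(201):
    p_tt = 0.2 * k / 200
    p_ff = 0.2 - p_tt
    p_a = p_tt + p_tf
    p_b = p_tt + p_ft
    # P(B) - P(A) = P(not A, B) - P(A, not B) = 0.6, whatever the split.
    assert abs((p_b - p_a) - 0.6) < 1e-9

print("P(B) exceeds P(A) by 0.6 in every coherent assignment")
```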

Ah, I think we are using two different meanings for the → sign.

In the Kolmogorov defn. of a probability space, we have to have a sigma-algebra (which specifies sets [“events”] and their union/intersection relations) before we assign any probabilities.  If A, B are in the sigma-algebra and A is a subset of B, this is interpreted as “if event A happens, event B must also happen.”  If we are taking A and B to be propositions, this means “A implies B.”

In the usual Bayesian framework, the events are propositions, but (our beliefs about) truth and falsehood are represented by probability assignments we give to the events, and we can only make these assignments if we have the sigma-algebra already.  So the sigma-algebra encodes implication relationships which we are supposed to assent to before we take the step where we say certain propositions are true (P=1) or false (P=0).

To use the classic example, the sigma-algebra will have “Linda is a feminist bankteller” (A) as a subset of “Linda is a bankteller” (B).  Then when we go and assign probabilities, the probability axioms tell us that we must respect this implication (A→B).  Among other things this will mean that we assign probability 1 to “¬A or B,” for the trivial reason that “¬A or B” is the set of all outcomes.  But this is not the sort of thing that the framework allows us to not know, and then figure out: it is fixed by the sigma-algebra at the outset.

So when I write things like P(A → B), I am talking about the sort of relation we normally get from the sigma-algebra.  Such a relation goes beyond the truth tables: the sigma-algebra normally tells us things like “if Linda is a feminist bankteller, Linda is a bankteller” which are true (in the relevant sense) even if Linda is neither of those things in reality (in which case the truth tables are mute).  There’s a connection to math progress here, in that often mathematicians are concerned about the consequences of assuming certain axioms but agnostic about the truth of the axioms; “the well-ordering theorem is equivalent to the axiom of choice” is interesting, even though you will be hard pressed to find people who think they’re both true or both false (it is contested what that would even mean!).

It sounds like you’re coming from Jaynes’ approach to probability, while I’m used to Kolmogorov; the two are close to equivalent, but I’ll have to think more about whether Jaynes’ version makes this problem easier.


raginrayguns replied to your post “I rag a lot on Bayesianism, but when I think about it, there’s a…”

determining the variance of the prior from the data actually looks a lot like what the bayesian method does, when you use a hyperprior over the variance parameter, and think about the updates on each individual data point (only the data points at the beginning of the sequence will affect the hyperparameter, I’m thinking; the hyperparameter will stabilize before the actual parameters do. Hmm, I guess I don’t actually know this.) However I don’t see any bayesian justification for WHY you would use a hyperprior, whereas the algorithmic modeling framework… may provide an understandable justification for setting the hyperparameter from the data? I actually don’t get it. I don’t actually know whether adding a model selection step to cross validation where you choose a ridge regression lambda gives better results than just making one up before you look at the data. Somebody knows this but not me
am I overcomplicating this? Let’s say you divide your dataset into training data X and test data Y. The log likelihood function decomposes as log p(X) + log p(Y|X). The algorithmic modeling approach is essentially just dropping log p(X).

I was thinking about this on the plane today, in relation to this paper I mentioned in a reply earlier.  The authors use a Gaussian prior over regression weights beta_i like in ridge regression, but instead of setting specific values for prior variance (tau_i), they introduce a hyperprior for it, a Jeffreys prior (p(tau_i) proportional to 1/tau_i).

They marginalize out the hyperprior, which makes the effective prior on the betas non-Gaussian, so you wouldn’t look at it and think “oh, a Gaussian prior.”  And then they take MAP estimates of the betas because they want sparsity (it’s supposed to be a competitor to LASSO).  And… it does as well or better than standard regularization techniques, despite having no free parameters.  Which is pretty spooky.
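You can see the non-Gaussianity directly by marginalizing numerically (a sketch; the quadrature scheme is my own, and I’m relying on the standard fact that a zero-mean Gaussian with a Jeffreys 1/τ hyperprior on its variance marginalizes to a density proportional to 1/|β|):

```python
import math

def marginal_density(beta, t_lo=-60.0, t_hi=60.0, n=24000):
    """Numerically integrate p(beta) = integral of N(beta; 0, tau) * (1/tau) dtau,
    using the substitution t = log(tau) and the trapezoid rule."""
    h = (t_hi - t_lo) / n
    total = 0.0
    for i in range(n + 1):
        t = t_lo + i * h
        # Integrand after substitution: (2*pi)^(-1/2) * exp(-t/2 - beta^2 / (2 e^t)).
        f = math.exp(-t / 2.0 - beta * beta / (2.0 * math.exp(t))) / math.sqrt(2 * math.pi)
        total += f * (0.5 if i in (0, n) else 1.0)
    return total * h

# The marginal is (proportional to) 1/|beta|: sharply peaked at zero and
# heavy-tailed, i.e. nothing like the Gaussian we started from.
for b in (0.5, 1.0, 2.0):
    print(b, marginal_density(b) * abs(b))   # each product is ~1.0
```

The 1/|β| shape is why MAP estimation under this marginal produces sparsity, much like LASSO’s penalty does.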

It also is hard to compare to what the algorithmic modeler would do.  They would make all the taus equal to a single regularization parameter and select the optimal value by cross-validation, grid-searching over some interval [a,b] of their choice.  This is close to having a uniform hyperprior on [a,b] that’s zero outside, except you aren’t averaging over all models in [a,b], you’re picking the best one.  So it’s like having that hyperprior and using MAP to estimate the taus.  Whereas in that paper, they average over the hyperprior and then do MAP at the end.
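The algorithmic modeler’s procedure described here can be sketched end-to-end (a toy 1-D version; the synthetic data, the grid, and the leave-one-out scheme are all illustrative choices of mine):

```python
import random

random.seed(0)

# Toy data: y = 2x + noise (all numbers illustrative).
xs = [i / 2.0 for i in range(20)]
ys = [2.0 * x + random.gauss(0.0, 1.0) for x in xs]

def ridge_beta(xs, ys, lam):
    """1-D ridge regression (no intercept): beta = sum(x*y) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def loo_error(lam):
    """Leave-one-out cross-validation error for a given regularization strength."""
    err = 0.0
    for i in range(len(xs)):
        tr_x = xs[:i] + xs[i + 1:]
        tr_y = ys[:i] + ys[i + 1:]
        b = ridge_beta(tr_x, tr_y, lam)
        err += (ys[i] - b * xs[i]) ** 2
    return err / len(xs)

# Grid search over a hand-picked interval, as in the text: pick the best
# single lambda rather than averaging over all of them.
grid = [0.01, 0.1, 1.0, 10.0, 100.0]
best = min(grid, key=loo_error)
print("chosen lambda:", best, " beta:", round(ridge_beta(xs, ys, best), 3))
```

Picking the argmin over the grid is the “MAP over taus” step; the paper’s procedure would instead integrate the taus out and only do MAP at the level of the betas.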

The usual idea in algorithmic modeling is that you should always have at least one parameter that controls model complexity, so you can optimize it with cross-validation.  If you don’t have such a parameter, it intuitively seems like you’re being wasteful – there is some optimal level of complexity, you usually can’t confidently derive it a priori, so if you don’t learn it from your data you’re making an assumption for no reason.  The algorithm in the paper, though, does learn the complexity from the data, by having a prior over it and then updating.  It just does this automatically, in a way that apparently works well, but it’s mysterious why.

As you said, it’s not clear what Bayesian reason there is to use a hyperprior here.  In the two examples I can find of Jaynes doing regression with an explicit prior for the coefficients (the seasonal trends in Ch 17 and the model comparison in Ch 20), he just uses a Gaussian and seems to think this choice doesn’t need justification.  You could maybe argue that the hyperprior is correct because it lets you use the principle of indifference – you introduce this variance, but you don’t actually know what it should be, so in order to be more agnostic, you use an uninformative prior on it, and now you have no more parameters to set.

But that’s still an a priori argument.  Why does it work in practice?  The arguments about indifference make more sense when you’re talking about possibilities you could actually measure (you don’t want to prematurely focus on some possibilities over others).  But our taus here are features of our uncertainty, not the real world.  You can’t measure them by updating on data (if the model is correct, the posterior variance will asymptotically go to zero).  So it’s still mysterious why it works.

About the log p(X) + log p(Y|X) thing – how does that work with cross-validation, where every data point gets to be part of Y in some fold?

I rag a lot on Bayesianism, but when I think about it, there’s a similar orthodoxy that I accept without thinking about it, one that presumably has its flaws.  It is roughly equivalent to Breiman’s “algorithmic modeling,” and is the standard view in machine learning and data science.

As far as I can tell, although algorithmic modeling is the perspective used by many working statisticians (data scientists are working statisticians), people rarely mention it along with frequentism and Bayesianism as a grand-scale perspective on statistics and probability.  One possible reason is that it is hard to prove theoretical results about.  Another is that it is not obvious it has its own “interpretation of probability,” although I think it might (I need to think more about this).


Here is how I think the three schools view the classic subject of linear regression:

Classical.  (This is what Jaynes calls “orthodox statistics,” and is frequentist, although the term “frequentism” doesn’t cover all of it.)

The classical statistician does what Breiman calls “data modeling.”  To do linear regression, they first postulate that in reality, y is related to x by a linear equation.  (They also assume the usual stuff about uncorrelated errors, etc.)  Supposing this is true, they now want to estimate the true values of the coefficients in the equation – the coefficients that, figuratively, nature uses when it generates the data.

This means the classical statistician cares about unbiased estimators.  It is often possible to obtain lower variance – roughly, less sensitivity to noise in the data – by using a biased estimator.  But the classical statistician doesn’t find this very interesting: if they can’t get a good unbiased estimate of the true coefficients, that means they don’t have enough data.
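The classical preference can be illustrated numerically (a sketch with made-up numbers): averaged over many simulated datasets from the postulated model, the ordinary-least-squares estimate sits on the true coefficient, whatever the noise does on any single draw.

```python
import random

random.seed(1)

TRUE_BETA = 2.0
xs = list(range(1, 21))
s_xx = sum(x * x for x in xs)

def ols_beta(ys):
    """Ordinary least squares through the origin: unbiased under the model."""
    return sum(x * y for x, y in zip(xs, ys)) / s_xx

# Simulate many datasets from the postulated "true" linear model...
estimates = []
for _ in range(2000):
    ys = [TRUE_BETA * x + random.gauss(0.0, 1.0) for x in xs]
    estimates.append(ols_beta(ys))

# ...and check that the estimator is centered on the truth.
mean_est = sum(estimates) / len(estimates)
print(round(mean_est, 2))   # ~2.0
```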

Bayesian.  What distinguishes the Bayesian approach is that it introduces a prior distribution for the parameters it wants to estimate.  This can be justified in various ways.  A Bayesian data-modeler would say that there are true values for the parameters, and the prior represents their state of prior knowledge about them.  But a Bayesian who does not want to posit true values might say the prior merely represents the expectations they have about the results of the regression itself.

In any event, the prior typically makes the Bayesian’s estimators biased.  A Gaussian prior on the coefficients, even one with very large variance, will bias estimates toward the prior mean.  This is not necessarily a bad thing.  Jaynes rightly points out that a biased estimator can achieve lower mean-squared error (MSE), because the reduction in variance can more than make up for the bias.
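This bias-variance trade can be demonstrated directly (a toy simulation; all the numbers are illustrative): shrinking an unbiased estimate toward zero introduces bias but cuts variance, and when the true coefficient is small relative to the noise, the trade is worth it.

```python
import random

random.seed(2)

TRUE_BETA = 0.2          # small true effect (illustrative)
N, SIGMA = 10, 1.0
xs = [1.0] * N           # so Var(OLS estimate) = SIGMA^2 / N = 0.1
SHRINK = 0.3             # a deliberately biased estimator: 0.3 * beta_hat

mse_ols = mse_shrunk = 0.0
trials = 5000
for _ in range(trials):
    ys = [TRUE_BETA * x + random.gauss(0.0, SIGMA) for x in xs]
    beta_hat = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    mse_ols += (beta_hat - TRUE_BETA) ** 2
    mse_shrunk += (SHRINK * beta_hat - TRUE_BETA) ** 2

# The biased estimator wins on MSE: roughly 0.029 vs 0.1 in expectation.
print(mse_ols / trials, mse_shrunk / trials)
```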

Importantly, the Bayesian’s prior is set before the data come in.  (At least in orthodox versions of Bayesianism; in practice some people do empirical Bayes.)  If the bias introduced by the prior makes the model fit better, that is nice, but it is not the fundamental rationale for using the prior.  For an Objective Bayesian like Jaynes, there is a unique correct prior for any regression, derived from first principles and prior knowledge before you see the data.  If another prior happened to obtain a lower MSE, the Bayesian would reject this, because MSE is not their criterion for evaluating priors.


Algorithmic modeling.  The algorithmic modeler may use one of several estimators, but a very popular one, ridge regression, is equivalent to what the Bayesian would do, with a prior a Bayesian would see as pretty sensible.  It is a Gaussian prior on the coefficients, with mean zero, no correlations between coefficients, and equal variance for all the coefficients.

But the algorithmic modeler determines the variance of this prior from the data.  Their sole goal is making the model do as well as possible when shown x/y pairs it hasn’t seen before.  Jaynes pointed out that biased estimators can lower MSE, and the algorithmic modeler takes this a step further by determining the bias from the data itself.  The algorithmic modeler calls this process “regularization” (or more generally “dealing with the bias-variance tradeoff”), and they do it every time they fit a model.
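The ridge/Gaussian-prior equivalence mentioned above is easy to verify in one dimension (a sketch; in 1-D both the ridge solution and the Bayesian posterior mean have closed forms, and they coincide when lambda = sigma^2 / tau^2):

```python
# 1-D regression through the origin, closed forms (illustrative numbers).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
s_xx = sum(x * x for x in xs)
s_xy = sum(x * y for x, y in zip(xs, ys))

sigma2 = 1.0   # noise variance
tau2 = 0.5     # prior variance on the coefficient
lam = sigma2 / tau2

# Ridge: minimize sum (y - b*x)^2 + lam * b^2.
beta_ridge = s_xy / (s_xx + lam)

# Bayesian: posterior mean under prior N(0, tau2) and Gaussian noise.
beta_post = (s_xy / sigma2) / (s_xx / sigma2 + 1.0 / tau2)

print(beta_ridge, beta_post)   # identical
```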

The algorithmic modeler’s relationship to “nature” is something like: “I never know enough at the outset to just postulate a model for what I think nature is doing.  What I do is use an algorithm which takes in data, and returns a predictive model, from a given model class.  This model has some parameters, but I don’t claim that there are true, platonic values for those parameters.  If I want to understand nature, I will first try out algorithms until one spits out a very predictive model of the data, then inspect the resulting model to see what it is doing.”

This approach differs from both the Bayesian and classical approaches in that it emphasizes the search for a good model after the data has already been seen.  In principle, the Bayesian approach is supposed to handle this by assigning prior probabilities to different models or model classes, then updating in response to the data.  But this still precludes an open-ended search for models: one must specify every model under consideration before the data is seen, to ensure that the prior probabilities add up to one.  To add a new model after the fact, the Bayesian would have to change the prior probability of all the other models, and change all their posteriors accordingly.  The algorithmic modeler is free to try out any new model they want at any time, without any accounting work to “incorporate” it.

[less certain of the following than the above]

The algorithmic modeler also has their own interpretation of probability.  It is more classical than Bayesian in that it says probabilities are frequencies.  But it doesn’t equate a probability to the limiting frequency in some hypothetical string of infinitely many trials.  It equates the probability to the frequencies in the data available.  By making generalization error the gold standard of quality, and always assessing generalization error by cross-validation or train-test splits, the algorithmic modeler is saying that they want to minimize expected error where the expectation is computed over the frequency distribution in the data.

Of course, the algorithmic modeler does not literally believe their data are more real than nature.  If they have reason to believe the frequencies in their data are not representative of the frequencies in the real population, they will try to collect better data.  In this sense they are just orthodox frequentists.  But they still place fundamental importance on expected values computed using the sample frequencies as probabilities, something that neither of the other schools do.

I’ve been reading some E. T. Jaynes lately (parts of PT:LoS I hadn’t read plus some of his papers).  I think I may have overestimated his philosophical ambitions in the past, probably because I didn’t separate him and Yudkowsky clearly enough in my mind.

Jaynes is unusual in that he’s very pragmatic-minded yet also very anti-eclectic.  A lot of pragmatic people will pick and choose methods from different schools of thought, using one here and another there, on the principle of “whatever works.”  Jaynes also adopts the principle of “whatever works,” but he is convinced that his preferred method always works best in every case.  Unlike many texts on Bayesianism, his big book is not focused on arguments like Dutch Books that try to establish Bayesian superiority in the general case once and for all; instead he gives the reader an endless succession of concrete, quantitative “problems,” and shows again and again how the (Objective) Bayesian methods are faster, cleaner, easier, more robust, etc. than some alternatives.  (“If you juggle the variables and get the right priors / it’ll pull your butt out of many a fire…”)

This focus on “problems” should give pause to anyone who, like Yudkowsky, wants to base a whole epistemology on Jaynes.  The more I read Jaynes, the more it seems like he was interested in giving practical advice to working scientists, rather than in giving a systematic account of “how science works.”  The title “Probability Theory: The Logic of Science” makes the book sound like it’s trying to tell you how science works, but what he means by “the logic of science” is really “the logic of working scientists”: he gives a systematic and rigorous account of the kind of reasoning scientists use in practice when they have to estimate a spectrum or derive a specific heat or whatever, without saying this can be patched together to form a full picture of the scientific enterprise.

This is not always clear in his writing about Bayesianism per se, but it’s very clear in his writing about Maximum Entropy methods.  Jaynes was an Objective Bayesian, meaning he thought that prior distributions were not a matter of personal choice, that they could be deduced objectively and that two people “with the same information” ought to have the same distribution.  His recipe for making prior distributions had two parts: non-informative priors and MaxEnt.

Non-informative priors are a really cool and kind of spooky thing where you can deduce the exact form of a distribution just from the transformation properties it must have, and thus deduce a unique (!) prior distribution compatible with the information “I know this is a standard deviation of something, but I have no clue what it is.”  So that’s what you start out with when you know as little as you possibly could.  When you know more than that, Jaynes says you should incorporate this by using MaxEnt, which tells you (roughly speaking) how to form the “equivalent of” a uniform distribution if you’re restricted to only use distributions with certain constraints.
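The “spooky” transformation argument can be made concrete for the scale case (a sketch; the point being checked is that the density p(σ) ∝ 1/σ is the shape left unchanged by rescaling the unit of σ): the mass that 1/σ assigns to an interval [a, b] depends only on the ratio b/a, so re-expressing σ in different units leaves the prior alone.

```python
import math

def jeffreys_mass(a, b):
    """Mass the (improper) scale prior p(sigma) = 1/sigma gives to [a, b]:
    the integral of dsigma/sigma is log(b/a)."""
    return math.log(b / a)

# Measure sigma in meters, then re-measure in centimeters (x100): the
# interval moves, but the prior mass is unchanged.
print(jeffreys_mass(1.0, 2.0))        # log 2
print(jeffreys_mass(100.0, 200.0))    # log 2, whatever the units
```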

So far, so good, but where do the constraints come from?  Jaynes always assumes that our prior knowledge comes in the form of exact constraints on the mean of our prior distribution.  This is a very natural thing to do in statistical mechanics, which Jaynes wrote a lot about, but as many people have noted, it is very strange as a principle of general inference.  Our prior distribution (as Jaynes keeps reminding us) is meant to represent our state of knowledge about the world, not some feature of the world itself (except incidentally).  It is hard to imagine a case in which we have evidence saying that, although the world could be many different ways, our internal expression of our knowledge about it must make a certain average prediction.  Indeed, Jaynes belabors this very point on p. 40 of this article, while arguing against a claim that MaxEnt and Bayes were inconsistent: he says they cannot be inconsistent because the empirical information which goes into Bayes – observed frequency counts, for instance – does not take the form of an assertion about your distribution.  I agree!  But this only makes it more mysterious where these assertions do come from.

In practice, when Jaynes solves a problem with MaxEnt, he either chooses a textbook-ish problem in which the constraint is simply asserted as part of the problem, or he chooses problems where your prior is supposed to match observed frequencies so that the constraint rule is less bizarre.  Here’s an example of the latter.  On pp. 48-63 of the same paper, he analyzes empirical frequency counts from a possibly-biased 6-sided die by first making physical arguments about the sorts of bias that are likely to arise in a die.  These take the form of constraints on functions of the probabilities assigned to the six faces, with some undetermined parameters corresponding to the extent to which the die is weighted.  These physical arguments are not about states of knowledge; they only happen to carry over to the prior in this case because our “state of knowledge” about the result of a given roll is supposed to line up in a particular way with the physical form of the die.  He then tries MaxEnt with one constraint, then with two, in each case estimating the parameters by using sample means as exact constraints, and doing a chi-squared test for goodness of fit; once he has imposed two constraints, the test doesn’t reject the MaxEnt distribution and he declares success.

He immediately addresses the obvious concerns with this procedure, for instance about the interpretation of sample means as constraints.  He shows that among probability distributions with some constraint on these means, the one which gives the data the highest likelihood is the one where the constraints are set equal to the sample means.  This is not surprising (a model perfectly tuned to the observations will assign them high likelihood), but it assumes at the outset that we are supposed to set constraints on means, which is not obvious.  Indeed, this approach falls prey to the same problem with maximum likelihood that Jaynes identifies in Section 13.9 of PT:LoS, where he shows that it is equivalent to estimation with an all-or-nothing loss function:

The maximum-likelihood criterion is the one in which we care only about the chance of being exactly right; and, if we are wrong, we don’t care how wrong we are. This is just the situation we have in shooting at a small target, where ‘a miss is as good as a mile’. But it is clear that there are few other situations where this would be a rational way to behave; almost always, the amount of error is of some concern to us, and so maximum likelihood is not the best estimation criterion.

Typically Jaynes prefers the estimates given by a squared-error loss function, which are means rather than modes of the posterior.  This has a nice regularizing effect, and corresponds to the idea of mixing pure likelihood with background knowledge so that you don't make overly radical, overfitting jumps based on small amounts of data.  But making exactly those jumps is what Jaynes advocates when using MaxEnt: he specifically asserts that the sample means can be used as constraints whether they are taken over many observations or just a few.
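A conjugate toy case (my own illustration, not Jaynes') makes the contrast concrete: after one Bernoulli success under a uniform Beta(1,1) prior, the posterior is Beta(2,1), and the two loss functions give different estimates of the success probability:

```python
from fractions import Fraction

# One success in one trial, uniform Beta(1, 1) prior -> posterior Beta(2, 1).
a, b = Fraction(2), Fraction(1)

mode_estimate = (a - 1) / (a + b - 2)  # posterior mode = MLE here: estimates p = 1
mean_estimate = a / (a + b)            # posterior mean: estimates p = 2/3

print(mode_estimate, mean_estimate)
```

The all-or-nothing loss picks the mode and jumps straight to the extreme p = 1; squared-error loss picks the mean, which is pulled back toward the prior.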

A very simple example illustrates the problem with this.  Suppose we roll a six-sided die once and observe a 6.  Taking literally the idea of using small-N sample means as constraints, we are forced to pick our MaxEnt distribution from those distributions with mean 6, and there is only one such distribution, the one that assigns probability 1 to the 6 face and 0 to the other faces.  Obviously this is absurd to take as a prior (if you ever roll your die and get anything but a six, you will have to divide by zero in Bayes’ rule).  I am sure that in this case Jaynes would say that we really know additional prior information about dice, e.g. that “dice in general” have means around 3.5 and so we should use that in our constraint.  But this does not have to be about a die; it could be some totally abstract multinomial process which we know nothing about at the outset, and still this would be a rash and bad inference.

(Jaynes says that uncertainty about the mean <f> can be supplied to MaxEnt via a constraint on <f^2>, but that doesn’t help here, as our N=1 sample has zero variance, and anyway you can’t get a mean of 6 without zero variance.)
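To see the collapse numerically, here's a minimal sketch (assuming scipy; `maxent_die` is my own name) that solves the mean-constrained MaxEnt problem for the die – whose solution is the exponential family p_i ∝ exp(λi) – and pushes the constrained mean toward 6:

```python
import numpy as np
from scipy.optimize import brentq

faces = np.arange(1, 7)

def maxent_die(target_mean):
    """MaxEnt distribution on {1..6} subject to a fixed mean: p_i ∝ exp(lam*i)."""
    def gap(lam):
        w = np.exp(lam * (faces - 6))  # shifted by the max face for numerical stability
        return (w * faces).sum() / w.sum() - target_mean
    lam = brentq(gap, 0.0, 30.0)       # bracket covers means in (3.5, ~6)
    w = np.exp(lam * (faces - 6))
    return w / w.sum()

for m in (5.0, 5.9, 5.99):
    print(m, maxent_die(m)[-1])        # p(6) climbs toward 1 as the mean approaches 6
```

At a mean of exactly 6, the feasible set is the single point mass on the 6 face, so the "prior" assigns zero probability to five of the six outcomes.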

This all makes more sense in light of Jaynes' pragmatic focus.  If we want to provide a completely systematic set of rules for inference – so that they could be programmed into the hypothetical "robot" Jaynes frequently discusses – then we must worry about "obviously bad" inferences like the one above.  The robot will follow the rules we give it even when the results are "obviously bad" (it has no notion of obviousness outside of the rules).  But if we are giving practical advice to human scientists, we don't need to give them a rule telling them not to do the thing I just described, because they wouldn't try it in the first place.  They would gather more than one data point before using MaxEnt – which contradicts the claim that MaxEnt works for any N, but can be justified by demanding that we write down a "sensible" problem, like the textbook-ish ones Jaynes uses, before we apply his methods.

Indeed, I don’t think Jaynes ever clarifies what specific mixture of Bayes and MaxEnt his robot is supposed to use; there might be some recipe which would avoid pitfalls like the above, but Jaynes does not seem very interested in it.  As PT:LoS goes on, he says less and less about systematic rules and the robot, and focuses more and more on “solving problems” in the manner of a physics lecturer doing board work.  The real intended user of his methods is a human physicist, not a robot, and he is satisfied with methods that work well when judiciously applied, even if they are not foolproof (and thus not a complete theory of inference).

A final example of Jaynes' pragmatism: I was surprised to find, in a late (1991) paper, Jaynes happily conceding something which I had always thought was a knock-down point against the Bayesian approach.  I had always made a big fuss over the fact that the approach doesn't tell you how to modify your hypothesis space if it is not complete to start with, or how to re-distribute probabilities after modifications.  Jaynes advocates using MaxEnt to deal with "refinements," i.e. breakdowns of the possibilities into finer and finer details.  At one level of description you might apply MaxEnt over possible structures of a crystal; then, to the possible arrangements of molecules within the crystal; then to possible arrangements of atoms, and so on (cf. p. 15 here).  But this doesn't work if you do not have exhaustive knowledge of the possibilities on any one level.  In Section 6 of the 1991 paper, Jaynes admits he finds it awkward that MaxEnt requires a hypothesis space, and hopes for a development that will extend his theory to cases without one:

Our bemusement is at the fact that in problems where we do not have even an hypothesis space, we have at present no officially approved way of applying probability theory; yet intuition may still give us a strong preference for some conclusions over others. Is this intuition wrong; or does the human brain have hidden principles of reasoning as yet undiscovered by our conscious minds? We could use some new creative thinking here.

I was talking to someone yesterday about my usual objections to representing beliefs/credences as probabilities, specifically the stuff about how IRL you don’t fully know the sample space and event space, and probability theory doesn’t tell you what to do about this.  

For instance, if you encounter an argument that “A implies B” – where A and B are the kind of ideas which you’d be assigning credences to – and the argument convinces you, you now know that A (as a set in the event space) is a subset of B.  You didn’t know that before.  Yet you had some concept of what “A” and “B” were, or you wouldn’t have gotten anything out of the argument – you needed to know which sets in your event space corresponded to the ones in the argument.  But although you knew about those sets, you didn’t know about that subset relation.  How do you “update” on this information, or formalize this kind of uncertainty at all?  It’s conceivable that you could and it would be very cool to do it, but probability theory itself doesn’t include this case – which to me is an argument (one of many) that probability theory is not the right set of tools for formalizing belief and inference.

Anyway, the person I was talking to mentioned this recent (Sept. 2016) preprint from MIRI, “Logical Induction,” which tackles the problem just mentioned.  (There’s also an abridged version, 20 pp. instead of 130 pp.)  I have not read it yet, beyond the first few pages, but it looks cool.  (Reportedly there’s a lot of cool math in there but the method for doing the thing is absurdly inefficient, doubly exponential time complexity or something.)

My understanding is that MIRI people want to formalize “logical uncertainty” in order to make TDT work (because TDT invokes the notion without formalizing it), not to make Bayesianism/Jaynesianism work.  But it’s refreshing to see people interested in this kind of problem, because, from my perspective, it is the sort of new math Bayesians/Jaynesians would need to have in order to make their perspective compelling.  There’s this giant looming problem with trying to apply results about “ideal Bayesians” / “Jaynes’ robot” to finite beings that keep learning new things about their sample and event spaces, and I would have expected people to notice this long ago and get to work developing new formalisms to deal with it.  And maybe that’d result in some super-powerful reasoning method, or maybe it’d result in something useless because it turns out the computational complexity is necessarily very high, but in any event there’d be cool math and an interesting line of thought to follow.

(I keep saying there is very little work about this stuff out there.  Maybe I'm wrong?  I haven't been able to find it, in any event.)

ETA: this also doesn't have the problem @jadagul identified with earlier MIRI papers – that they read like crosses between papers and research proposals.  They prove a whole bunch of different properties/implications of the criterion they define at the start, and I'd imagine there are at least several Least Publishable Units in there.

napoleonchingon:

nostalgebraist:

There’s this thing in statistical mechanics that I’ve never really understood.  Specifically, in the application of statistical mechanics to fluids, although it seems like a fundamental issue that would also come up outside of that particular case.

(Cut for length and because not everyone is interested in this.  Pinging @bartlebyshop and @more-whales because I suspect they understand this kind of thing – but don’t feel any obligation to read this unless you want to)

Keep reading

Am probably misunderstanding this, so please be cautious. Also, am not a theorist and understand little to nothing about numerical modelling.

But. In the second approach, is it actually true that the ensemble of microstates is specified beforehand? Aren’t you optimizing the ensemble of microstates to get maximum entropy given the (let’s say) energy expectation value constraint? If you started with your microstates already set and they were non-interacting, they’d just be propagating according to dynamical laws and you wouldn’t be optimizing anything. You’d just specify all the individual microstates beforehand and just watch them evolve and you’re not going to get any information that you didn’t put in.

But if you have a collection of microstates, you can’t look at all the microstates at once, so you look at individual microstates from the ensemble of microstates you have and then try to get a probabilistic picture of the entire ensemble. This would be like having one tank in your lab that you occasionally can take really good photos of, and trying to generalize to what is happening in the tank in general from those photos.

If I understand your last sentence correctly, it’s a description of ergodicity: the time average of a function over a single trajectory is equal to its ensemble average (by which we mean the expectation value of the function over the invariant measure for the system).  This is often taken as a postulate in these kinds of papers, and it is one justification for thinking about hypothetical ensembles even if you only ever have one copy of the system in reality.
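A minimal sketch of that postulate (my own example, not from the papers): an irrational rotation of the circle is ergodic with respect to Lebesgue measure, so the time average of an observable along one trajectory matches its space average:

```python
import numpy as np

# Irrational rotation x -> x + alpha (mod 1): ergodic w.r.t. Lebesgue measure.
alpha = np.sqrt(2) - 1
f = lambda x: np.cos(2 * np.pi * x)    # observable; its space average over [0,1) is 0

x, total, N = 0.1, 0.0, 200_000
for _ in range(N):
    total += f(x)
    x = (x + alpha) % 1.0

time_avg = total / N
print(time_avg)                        # close to the ensemble average, 0
```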

What I am questioning here is the process used to determine the invariant measure (which we then use to compute the expectation values).  An invariant measure is any measure (i.e. probability distribution, roughly) that is constant under the dynamics, and generally it won’t be unique.  For instance, if the dynamics conserves energy, and has the Liouville property (preserves phase space volume), then any measure that depends on the energy alone is an invariant measure: it’s constant on every energy surface, and the dynamics just carry volume around on energy surfaces.

So the Gibbs ("canonical") measure, which has probability density proportional to exp(-const*E), happens to be an invariant measure.  But so is the "microcanonical" measure, which puts all the probability mass on one energy surface and has zero probability density everywhere else.  Or, pick any function f(E) that integrates to one over phase space, make that the probability density, and you've got an invariant measure.  If you need an invariant measure that also has a certain expectation value <E>, fine, just rescale your f appropriately (this is what the constant in the Gibbs measure does).

Now, if you have a single copy of the system with a known energy E_0, and you're trying to use ergodicity to predict time averages over its trajectory, then it seems clear to me that only one of these measures will give you exactly accurate results: the "microcanonical" one.  That's because the actual system never has any energy value besides E_0, so any distribution that puts mass on any other energy surface will give you some wrong answers.  For instance, define g to be a function of the state which is 1 when E > E_0 and 0 otherwise.  The time average of g is zero, because the energy is always E_0, never greater.  But the expectation of g with respect to the Gibbs measure, say, is positive (if perhaps small).
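A toy discrete system (my own illustration, assuming scipy) makes the point about g concrete: take n two-level units with total energy equal to the number of excited units, fix the true energy at E_0, and tune the canonical (Gibbs) measure so that <E> = E_0:

```python
from scipy.stats import binom

n, E0 = 20, 5                 # 20 two-level units, true energy E0 = 5

# Canonical measure with temperature tuned so <E> = E0: the units are
# independent, each excited with probability E0/n, so E ~ Binomial(n, E0/n).
gibbs_P_E_above_E0 = binom.sf(E0, n, E0 / n)

# Microcanonical measure: all mass on the E = E0 shell, so P(E > E0) = 0,
# matching the time average of g along the actual trajectory.
micro_P_E_above_E0 = 0.0

print(gibbs_P_E_above_E0)     # strictly positive, unlike the true time average
```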

Nonetheless, the Gibbs measure often works very well as an approximation to the microcanonical measure.  Specifically, if you just look at the marginal distribution of some finite set of state variables (say, if the variables are x_1 through x_n, we look at the marginal distribution of x_1 through x_k, where k < n), then in the limit as n goes to infinity (with k fixed), this marginal distribution is the same for both measures.  There are reasons why this works, which involve large deviations theory and which I only half understand (from looking over these notes when I was trying to understand this stuff years ago).  And the reasons do say that maximizing entropy is the right thing to do to get this property (I think).
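Here's a toy version of that equivalence (my own illustration): for n two-level units with exactly m = n/4 excited, the microcanonical measure is uniform on the energy shell, so the marginal of a fixed pair of units is hypergeometric; the matching canonical measure makes the units independent with excitation probability m/n.  The two marginals agree in the n → ∞ limit:

```python
from fractions import Fraction

def marginal_gap(n):
    """Gap between the two measures' P(units 1 and 2 both excited), with m = n/4."""
    m = n // 4
    micro = Fraction(m * (m - 1), n * (n - 1))  # hypergeometric, shell of energy m
    canon = Fraction(m, n) ** 2                 # independent units, <E> = m
    return abs(micro - canon)

gaps = [marginal_gap(n) for n in (8, 80, 800)]
print([float(g) for g in gaps])                 # shrinks roughly like 1/n
```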

However, none of this means that the Gibbs measure is uniquely correct, or that it is justified by some principle of indifference, or that it is the “least biased” or (god help us) “most probable” probability distribution consistent with physics, or that it works because it maximizes entropy and physical systems tend to maximize entropy – which are all things that people say about it all the time.  It is good only insofar as it approximates the microcanonical measure, and (as far as I can tell) if you can work with the latter you always should.  As Ellis says in those notes:

Among other reasons, the canonical ensemble was introduced by Gibbs in the hope that in the limit n → ∞ the two ensembles are equivalent; i.e., all macroscopic properties of the model obtained via the microcanonical ensemble could be realized as macroscopic properties obtained via the canonical ensemble. However, as we will see, this in general is not the case.

(In other words, the whole point is approximating the microcanonical measure.)

Treating the “entropy maximization with expectation value constraints” procedure as the axiomatically correct, “least biased” thing to do would lead one to conclude that the Gibbs measure is better than the microcanonical measure – for instance, that the function g described earlier should have a nonzero expectation, and that it is somehow “biased” to say otherwise.

I guess if this is all a way to get tractable approximations to the microcanonical distribution – which is in turn just the distribution that says "we don't know anything except that the energy (or whatever) has this value" – then that's fine by me.  But the rhetoric surrounding it all is frustrating and confusing.  Maximizing entropy makes physical sense if you're comparing different macrostates and asking which is more likely, but the Gibbs measure just so happens to "maximize entropy" for a fixed macrostate, which then leads people to say it's the "most probable" distribution because they've mentally associated "most probable" with "maximum entropy," and then Jaynes comes along and says that it's also the least biased, as if it captures the principle of indifference, when if you know the macrostate the principle of indifference is expressed in the microcanonical measure, not the Gibbs measure … argh!!  It feels like all this terminology was invented by some sadist to maximize confusion.

(via sungodsevenoclock)
