
bayes: a kinda-sorta masterpost

I have written many many words about “Bayesianism” in this space over the years, but the closest thing to a comprehensive “my position on Bayes” post to date is this one from three years ago, which I wrote when I was much newer to this stuff.  People sometimes link that post or ask me about it, which almost never happens with my other Bayes posts.  So I figure I should write a more up-to-date “position post.”

I will try to make this at least kind of comprehensive, but I will omit many details and sometimes state conclusions without the corresponding arguments.  Feel free to ask me if you want to hear more about something.

I ended up including a whole lot of preparatory exposition here – the main critiques start in section 6, although there are various critical remarks earlier.

0. What is “Bayesianism”?

Like most terms ending in -ism, this can mean a number of different things.  In its most limited sense, “Bayesianism” is a collection of technical/mathematical machinery, analogous to a tool or toolbox.  This collection, which I will call “the Bayesian machinery,” uses a particular way of representing knowledge, and if you can represent your knowledge in that way, the machinery tells you how to alter it when presented with new evidence.

The Bayesian machinery is frequently used in statistics and machine learning, and some people in these fields believe it is very frequently the right tool for the job.  I’ll call this position “weak Bayesianism.”  There is a more extreme and more philosophical position, which I’ll call “strong Bayesianism,” that says that the Bayesian machinery is the single correct way to do not only statistics, but science and inductive inference in general – that it’s the “aspirin in willow bark” that makes science, and perhaps all speculative thought, work insofar as it does work.  (I.e., if you’re doing these things right, you’re being Bayesian, whether you realize it or not.)

Strong Bayesianism is what E. T. Jaynes and Eliezer Yudkowsky mean when they say they are Bayesians.  It is usually what I am talking about when I say I “don’t like” Bayesianism.  I think strong Bayesianism is dead wrong.  I think weak Bayesianism may well be true, in that the Bayesian machinery may well be a very powerful set of tools – but I want to understand why, in a way that defines the power of a tool by some metric other than how Bayesian it is.

Weak Bayesianism is like preferring to use a hammer in many cases when someone else would use another tool.  Strong Bayesianism is like saying that a hammer is the only tool you ever need, and that other tools only work insofar as they are somehow hammer-like.  To put it in these terms, I think the Bayesian “hammer” might be a great tool, but if so it is great because there happen to be a lot of nails in the world, not because everything is and must philosophically be a nail.

(Some accounts focus on the idea that Bayesians have a different “interpretation of probability” than non-Bayesians, as if this is the crucial difference.  I think this is a bit misleading, although not false, so I won’t lean on it.  More on such things below.)

1. What is the Bayesian machinery?

The machinery contains two things: a framework for representing the knowledge you have right now, and a rule for updating the representation when new evidence arrives.

A crucial point here is that the framework can be taken as prescriptive, but doesn’t have to be.  That is, one could treat the machinery as a specialized tool for handling knowledge that happens to be representable in this way, or one could (with the strong Bayesians) say that all knowledge should be representable in this way.

1a. Synchronic Bayesian machinery

Anyway, here’s what the Bayesian knowledge-representation looks like.  You start with some fixed collection of “ideas” or “hypotheses.”  These are assertions about how the world is or might later be, things like “the earth is not the center of the universe” or “it will rain tomorrow.”  There are (at least) two ways to formalize these: E. T. Jaynes makes them propositions in a propositional logic, and the more standard Kolmogorov approach makes them sets (“events” in probability jargon).  The key thing is that these hypotheses can be combined to make new ones, via logical “and”/“or”/“not” (equivalently: set intersection/union/complement).  Thus, there is some structure to the collection of hypotheses, some internal relations between them: if the hypothesis “A and B” turns out to be true, then the hypotheses “A” and “B” are also true.

Knowledge about the world (or some part of it) is represented, in this setup, by “degrees of belief” assigned to hypotheses.  The idea is that you may be certain of some hypotheses, and then you may be not completely certain of others but still have a sense of whether they’re more likely to be true than false, and if so, by how much.

These degrees of belief are formalized as numbers between 0 and 1, where 0 is “certainly false” and 1 is “certainly true.”  Additionally, these numbers are supposed to satisfy some rules to make them compliant with the logical relations between hypotheses.  For instance, your degree-of-belief number for “A and B” shouldn’t be higher than your number for “A” alone.  (One can make elaborate justifications for rules like this, but they’re common sense anyway.)

Now, a magical thing happens here: this whole setup is formally identical to the one used to represent probabilities in math, where the hypotheses correspond to events, and the degree-of-belief numbers correspond to the probabilities that they happen.  And conveniently, this lets us put anything we know about “ordinary” probabilistic situations (like dice rolls) into this framework along with everything else.  “The probability of rolling an even number on this die is 0.5″ becomes “my degree-of-belief in ‘this die will produce an even number when rolled’ is 0.5,” and that degree of belief sits happily alongside your degrees of belief in Copernicanism and life after death and everything else.

(This is where the idea about the “Bayesian interpretation of probability” comes from.  But that gets things backwards, I think: the machinery doesn’t start with “what is probability?” and then decide it ought to mean “degree of belief,” it starts by formalizing degrees of belief and then notices the resulting formalism is just probability.)

1b. Diachronic Bayesian machinery

Everything in 1a up there is just about representing your beliefs at a single time.  What about changing your mind?

This part is actually really weird.  At any given time, the Bayesian’s knowledge is in shades of grey: many (most?) hypotheses are not deemed true or false, but somewhere in between.  However, the Bayesian machinery only tells you how to account for new knowledge if it is known for sure.

That is, the machinery expects that you will learn things of the form “A is true” – absolutely true, degree-of-belief 1.  It then gives you a very sensible rule for what to do in response.  It’s easiest to understand if you think in the Kolmogorov (set) formulation, where hypotheses are like regions in space, which may overlap or contain each other: for instance, the hypothesis “A and B” is just the region of overlap between hypotheses A and B.  The degrees of belief are like the “areas” of the sets (in a precise sense: the math of probability is a special case of the math of areas/volumes).

The machinery tells you that when you learn “A is true,” you discard all the area outside the set “A,” and keep all the rest.  If a set wasn’t fully contained in A, then part of it has been “chopped off,” and its area decreases.  Any set completely outside A (i.e. logically inconsistent with A) now has area 0.  Now some sets have lower area, while some others (the ones completely inside A) have the same area.  This means the total area has gone down, and to keep it the same, you multiply all areas by a constant factor; this has the effect of increasing the areas of the sets completely inside A.

The technical term for this whole operation is “computing the conditional probability given A.”  The Bayes machinery says that when you learn A, you should change all your probabilities to these conditional probabilities.  In the philosophical literature, this is called “conditionalization.”

Concrete example: consider your degree of belief in the hypothesis H, which says “it will rain tomorrow, or God is a giant purple mouse.”  (That’s an inclusive or.)  You really doubt the thing about God, but tomorrow’s forecast calls for rain, so you grudgingly assign high probability to this weird composite statement.

But now tomorrow arrives, and it’s not raining.  So you discard all the probability you were assigning to anything on account of “it will rain tomorrow.”  Since that’s where almost all of the probability for H was coming from, H now has way lower probability; most of it has been “chopped off.”
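Here’s that operation as a minimal numerical sketch, with four atomic outcomes built from the two claims and completely made-up numbers of my own (not anything from a real forecast):

```python
# Toy conditionalization sketch: atomic outcomes are (rain?, purple-mouse-god?).
# The probabilities below are invented purely for illustration.
prior = {
    ("rain", "mouse-god"):       0.0001,
    ("rain", "no-mouse-god"):    0.8999,
    ("no-rain", "mouse-god"):    0.0001,
    ("no-rain", "no-mouse-god"): 0.0999,
}

def probability(hypothesis, dist):
    """Degree of belief in a hypothesis = total 'area' of the outcomes it contains."""
    return sum(p for outcome, p in dist.items() if hypothesis(outcome))

def conditionalize(evidence, dist):
    """Chop off everything outside the evidence, then rescale so the areas sum to 1."""
    kept = {o: p for o, p in dist.items() if evidence(o)}
    total = sum(kept.values())
    return {o: p / total for o, p in kept.items()}

H = lambda o: o[0] == "rain" or o[1] == "mouse-god"   # "rain tomorrow, OR purple mouse god"

print(probability(H, prior))                            # ~0.90: high, thanks to the forecast
posterior = conditionalize(lambda o: o[0] == "no-rain", prior)
print(probability(H, posterior))                        # ~0.001: most of H was chopped off
```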

3.  What is Bayes’ Theorem?

Bayes’ Theorem is just the mathematical definition of conditional probability, except with some terms rearranged.  (It takes two lines or so of high school algebra to get from the definition to Bayes’ Theorem.)
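For the record, here are those two lines in standard notation (nothing here beyond the definition itself and one substitution):

```latex
% Definition of conditional probability (for P(B) > 0):
P(A \mid B) = \frac{P(A \cap B)}{P(B)}
% Since P(A \cap B) = P(B \mid A)\,P(A), substituting gives Bayes' Theorem:
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
```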

The rearranged form is often easier to work with than the original definition.  But mathematically and conceptually, it isn’t any fancier than the fact, which you may have memorized in school, that (a+b)^2 = a^2 + 2ab + b^2.  Accordingly, it is sometimes called by the more pedestrian name “Bayes’ rule.”

The phrase “Bayes’ Theorem” has caused a fair amount of confusion, by creating the nebulous sense that the Bayesian machinery is rigorously grounded in some single fundamental mathematical result, some deep inviolable idea like conservation of energy.  There are nontrivial mathematical arguments that purport to show why the Bayesian machinery is good, but they have nothing to do with the theorem-ness of Bayes’ Theorem.

4. How does the Bayesian machinery relate to the more classical approach to statistics?

If you don’t like the toolbox analogy, I’m so sorry, because I’m going to run wild with it here.

The classical approach to statistics tried to create a big box of tools that each individually had desirable properties, without aiming to make the whole set complete or even necessarily consistent, and without giving any One True Way to use the tools.  If a classical statistician wants to unscrew a particular screw, say, they will look around in the toolbox for a screwdriver that has the right shape, and which won’t break under the necessary torque, or melt under the ambient temperature, etc.  If there isn’t a suitable one in the box, they may use the closest one they can find, or try to build one themselves.

The Bayesian machinery is just a single tool, which is supposed to be able to unscrew any screw, hammer any nail, etc.  If you believe this, or believe it’s at least close to being true, this has the great advantage that you don’t need to fumble around in the toolbox every single time you want to do something.

The classical toolbox also has a lot of oddities.  It is not too hard to find cases where the tools will do something misleading, bizarre, or even illogical, such as treating two fundamentally identical problems as somehow different (e.g. violating the likelihood principle).  This is because the tools were designed to guarantee certain desirable things to the user, and nothing beyond that.  The labels on the tools say things like “won’t melt below 300° F,” and you are in fact guaranteed that, but the same screwdriver might turn out to instantly vaporize when placed in water, or when held in the left hand.  Whatever is not guaranteed on the label is possible, however dangerous or just plain dumb it may be.

(Concretely, these guarantees are things like the famous and much-maligned 5% criterion for significance.  Classical hypothesis tests are designed so that if you use them right (which is not easy!), they will only give you false positives some percentage of the time, and false negatives some other percentage of the time.  In principle, you can choose these percentages, but increasing one will decrease the other.  In practice, the standard is to just set the false positive rate to 5%.  From a tool-building perspective, this is fine: it’s like saying the tool won’t melt below 300° F.  But there is something unsatisfyingly ad hoc about this: why 300° and not 400°?)
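To make the “label on the tool” concrete, here is a quick simulation sketch (numpy and scipy, data invented by me): when the null hypothesis really is true, a standard t-test at the 5% level rejects about 5% of the time, and that is the entire content of the guarantee.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_trials, alpha = 10_000, 0.05

false_positives = 0
for _ in range(n_trials):
    # Two samples drawn from the *same* distribution: the null hypothesis is true.
    a = rng.normal(loc=0.0, scale=1.0, size=30)
    b = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p_value = stats.ttest_ind(a, b)
    if p_value < alpha:
        false_positives += 1

print(false_positives / n_trials)   # ~0.05: the guarantee printed on the label, nothing more
```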

It is a lot harder to get bizarre behavior out of the Bayesian one-perfect-tool, but this is not necessarily a good thing.  The Bayesian machinery requires you to formalize the problem you are solving in a specific way, with a space of hypotheses and their logic-conforming probabilities.  If it is easy and straightforward to do this, then the tool usually works well; if it is not, then you can’t use the tool at all, in which case you can’t even get to the point of getting a wrong or strange answer out of it.

In this case, it’s hard to tell whether the fault is with the tool (for not solving the problem) or with you (for not figuring out how to phrase the problem “correctly”).  There is a temptation to say the problem is always with the user, not with the tool; that the Bayesian machinery is this advanced alien device that can do anything, if only we humans were smart enough to press the right combination of buttons.  But in the real world, either the screw gets unscrewed or it doesn’t.

5. Why is the Bayesian machinery supposed to be so great?

This still confuses me a little, years after I wrote that other post.  A funny thing about the Bayesian machinery is that it doesn’t get justified in concrete guarantees like “can unscrew these screws, can tolerate this much torque, won’t melt below this temperature.”  Instead, one hears two kinds of justifications:

(a) Formal arguments that if one has some of the machinery in place, one will be suboptimal unless one has the other parts too

(b) Demonstrations that on particular problems, the machinery does a slick job (easy to use, self-consistent, free of oddities, etc.) while the classical tools all fail somehow

E. T. Jaynes’ big book is full of type (b) stuff, mostly on physics and statistics problems that are well-defined and textbook-ish enough that one can straightforwardly “plug and chug” with the Bayesian machinery.  The problem with these demos, as arguments, is that they only show that the tool has some applications, not that it is the only tool you’ll ever need.

Examples of type (a) are Cox’s Theorem and Dutch Book arguments.  These all start with the hypotheses and logical relations already set up, and try to convince you that (say) if you have degrees of belief, they ought to conform to the logical relations.  This is something of a straw man argument, in that no one actually advocates using the rest of the setup but not imposing these relations.  (Although there are interesting ideas surprisingly close to that territory.)

The real competitors to Bayes (e.g. the classical toolbox) do not have the “hypothesis space + degrees of belief” setup at all, so these arguments cannot touch them.

6. Get to the goddamn point already.  What’s wrong with Bayesianism?

Here we need to distinguish weak from strong again.  As I’ve noted multiple times above, the Bayes machinery tends to work well in cases that are easy to express in its formalism.  My contention is that many cases cannot be easily expressed in that formalism, and that blindly trying to “squeeze them into the right shape” is a bad approach.

7. The problem of ignored hypotheses with known relations

The biggest problem is with the “hypotheses and logical relations” setup.

The setup is deceptively easy to use in toy problems where you can actually list all of the possible hypotheses.  The classic example is a single roll of a fair six-sided die.  There is a finite list of distinct hypotheses one could have about the outcome, and they are all generated by conjunction/disjunction of the six “smallest” hypotheses, which assert that the die will land on one specific face.  Using the set formalism, we can write these as

{1}, {2}, {3}, {4}, {5}, {6}

Any other hypothesis you can have is just a set with some of these numbers in it.  “2 or 5″ is {2, 5}.  “Less than 3” is just {1, 2}, and is equivalent to “1 or 2.”  “Odd number” is {1, 3, 5}.

Since we know the specific faces are mutually exclusive and exhaustive, and we know their probabilities (all 1/6), it’s easy to compute the probability of any other hypothesis: just count the number of elements.  {2, 5} has probability 2/6, and so forth.  Conditional probabilities are easy too: conditioning on “odd number” means the possible faces are {1, 3, 5}, so now {2, 5} has conditional probability 1/3, because only one of the three possibilities is in there.

Because we were building sets out of individual members, here, we automatically obeyed the logical consistency rules, like not assigning “A or B” a smaller probability than “A.”  We assigned probability 2/6 to “2 or 5″ and probability 1/6 to “2,” but we didn’t do that by thinking “hmm, gotta make sure we follow the consistency rules.”  We could compute the probabilities exactly from first principles, and of course they followed the rules.
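Just to emphasize how mechanical this is when you can enumerate the atoms, here is the whole computation as a few lines of my own illustrative code:

```python
from fractions import Fraction

atoms = {1, 2, 3, 4, 5, 6}                  # the six "smallest" hypotheses

def prob(hyp):
    """Probability = (number of atoms in the hypothesis) / (total number of atoms)."""
    return Fraction(len(hyp & atoms), len(atoms))

def cond_prob(hyp, given):
    """Condition on 'given': count only within the surviving atoms, then renormalize."""
    return Fraction(len(hyp & given & atoms), len(given & atoms))

two_or_five = {2, 5}
odd = {1, 3, 5}

print(prob(two_or_five))              # 1/3 (i.e. 2/6)
print(prob(odd))                      # 1/2
print(cond_prob(two_or_five, odd))    # 1/3: only one of the three odd faces is in {2, 5}

# The consistency rules come for free: prob({2, 5}) can't be smaller than prob({2}),
# because a union can only gain atoms.
print(prob({2}) <= prob(two_or_five))   # True
```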

In most real-world cases of interest, though, we are not building up hypotheses from atomic outcomes in this exact way.  Doing that is equivalent to stating exact necessary and sufficient conditions in terms of the finest-grained events we can possibly imagine; to do it for a hypothesis like “Trump will be re-elected in 2020,” we’d have to write down all the possible worlds where Trump wins, and the ones where he doesn’t, in terms of subatomic physics.

Instead, what we have in the real world is usually a vast multitude of conceivable hypotheses, very few of which we have actively considered (or will ever consider), and – here’s the kicker – many of these unconsidered hypotheses have logical relations to the hypotheses under consideration which we’d know if we considered them.

That sentence was probably unintelligible, so here’s a great example, courtesy of @jadagul​ (quoted from here):

I think a better example is the statement: “California will (still) be a US state in 2100.” Where if you make me give a probability I’ll say something like “Almost definitely! But I guess it’s possible it won’t. So I dunno, 98%?”

But if you’d asked me to rate the statement “The US will still exist in 2100”, I’d probably say something like “Almost definitely! But I guess it’s possible it won’t. So I dunno, 98%?”

And of course that precludes the possibility that the US will exist but not include California in 2100.

And for any one example you could point to this as an example of “humans being bad at this”. But the point is that if you don’t have a good sense of the list of possibilities, there’s no way you’ll avoid systematically making those sorts of errors.

Consider the following list of statements: 1) in 2100, the US will exist. 2) In 2100, the US will contain states. 3) In 2100, the US will contain states west of the Mississippi. 4) In 2100, the US will contain states west of the Rockies. 5) In 2100, the US will contain California.

In my judgment, all of those statements are “almost certainly true.” And there’s content to that, as a matter of “giving credence to propositions about the future.” But if you want me to assign “probabilities” then you want me to assign numbers to all of those statements in a way that’s consistent across all those statements. And there’s no possible way to do that unless you have a list of all the possible propositions.

Try it. And then ask what you think the probability is that in 2100, the US contains any states bordering the Pacific.

Once you are thinking about both statements, it’s obvious that you should assign lower probability to “California will (still) be a US state in 2100″ than to “the US will still exist in 2100.”  The logical relationship here is something you already (implicitly) know.  Likewise with the relations between all the other statements jadagul listed.  But trying to respect these relationships in practice is impossible, because there are endless matryoshka-doll sequences of nested statements like this stretching in every direction, and you’re only thinking about some tiny subset of this multitude when you try to assign a probability.

This issue is invisible in the usual positive presentations of Bayesianism, which take the hypothesis space for granted, and motivate it by toy examples like dice where this problem doesn’t come up.

7b. Okay, but why is this a problem?

I said earlier that many problems of interest don’t fit well into the Bayesian setup.  This is an example.  Whatever the actual knowledge representation inside our brains looks like, it doesn’t seem like it can be easily translated into the structure of “hypothesis space, logical relations, degrees of belief.”

It seems like we know implicitly about a lot of logical relations that we are incapable of simultaneously considering.  We can only apply these logical relations once we’ve started thinking about them.  So if you ask a human a sequence of questions like jadagul’s, you will see them awkwardly trying to “shoehorn in” each relation as they are made aware of it.  “I said the probability of A was 95% and the probability of B was 94%, but now you’ve asked me about C, and shit, I know the probability of C has to be between those two.  Uh, how about 94.5%?”  It’s not like “94.5%” is a number this person would have naturally come up with, if you’d asked them about C first.  They just knew they needed to squeeze it between 94 and 95.

Some people like to talk about how this is a symptom of human irrationality.  But it’s not clear to me what is irrational about this.  In the above paragraph, the human was actively trying to obey the laws of logic, and in fact succeeding.  The real problem was that they were unable to simultaneously think about every conceivable hypothesis with implicitly-known logical relations to A and B.  But this would be a problem for any finite being.

If your method requires infinite storage space and computation speed to use, the problem is not that finite beings aren’t sufficiently rational, it’s that your method doesn’t work.

8. The problem of new ideas

Closely related to the issues in Section 7 is the fact that Bayesianism does not tell you how to come up with new, good hypotheses.

For a mere statistical method, this is not much of a problem; the classical toolbox doesn’t do this either.  But for a purported complete theory of rational inductive inference?  Well … 

The strong-Bayesian folklore includes tales about the amazing powers of a “perfect Bayesian,” a creature formalized by Solomonoff induction (and Marcus Hutter’s AIXI, which gives it a utility function so it can make decisions).  This mythological giant is truly logically omniscient, aware of every hypothesis, scientific and otherwise.  Thus, it does not need to think, per se; it merely jiggles its degrees-of-belief by conditionalization, and the best ideas bubble to the top, since they were already there at the start.  (There are some cool theorems about how surprisingly well this creature would do given that it has to consider every hypothesis, even the silly ones.)

A finite being, though, ought (one would think) to have new ideas sometimes.  Strong Bayesianism claims that Bayes is the active ingredient in science, but if you look at the history of science, theory plays a major role which is inextricable from that played by experiment.  The genius of Newton and Einstein was not that they were “rational” enough to recognize that their theories were assigned high “conditional probability” given the evidence; their genius was the ability to come up with their theories in the first place, to explain evidence that had already been observed (something with no place in Bayesianism).

The folktale of the “perfect Bayesian” is misleading here, because it encourages us to mimic an ideal creature who never needs to invent new theories.  If we try our hardest to be like it, we will write down all the theories we can come up with now, and then sit back and conditionalize and never think again.  There is no reason to think this is a good idea for finite beings.

8b. The natural selection analogy

I owe this one to Cosma Shalizi, who noted (see here) that Bayesian conditionalization is formally identical to the replicator equation from mathematical biology.  In this formal correspondence, hypotheses are “genotypes,” probabilities are frequencies within the population, and the conditional likelihood of a hypothesis (given the latest piece of evidence) is its “fitness” in the current “environment.”  If a hypothesis is “fit” (explains the evidence relatively well), it grows to occupy more and more of the population, while the less “fit” hypotheses die off.
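A minimal sketch of that correspondence (my notation, not Shalizi’s): the discrete replicator update and the Bayesian posterior update are literally the same line of code, with likelihood playing the role of fitness.

```python
import numpy as np

def replicator_step(freqs, fitness):
    """One generation of discrete replicator dynamics: p_i <- p_i * f_i / (mean fitness)."""
    weighted = freqs * fitness
    return weighted / weighted.sum()

# Read as Bayes: 'freqs' are prior probabilities over hypotheses, 'fitness' is the
# likelihood of the latest observation under each hypothesis, and the output is the
# posterior.  Note that no new "genotypes" ever appear: a hypothesis you didn't start
# with has frequency 0 forever.
prior      = np.array([0.5, 0.3, 0.2])
likelihood = np.array([0.1, 0.7, 0.4])
print(replicator_step(prior, likelihood))   # the posterior, renormalized
```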

This is half of the usual natural selection story.  The other half is mutation and other sources of continually generated variation, which produce “new ideas” (new genotypes) in each generation.

Why does Bayesianism lack an equivalent of mutation, sex, and the like?  Because if you were a perfect Bayesian, you’d know every hypothesis beforehand.

If nature were “perfect” in this sense, there would indeed be no need for mutation.  Every possible organism would already exist, and the fittest would survive.  But we recognize that in the real world, mutation is crucially important.  Likewise, an account of how to generate new ideas is crucially important, even if the “perfect Bayesian” wouldn’t need it.  The perfect Bayesian is just a fairy story, like the hypothetical earth created with every possible organism on it.

9. Where do priors come from?

Well, honey, when a misguided ideal and a need for mathematical tractability love each other very much … 

Seriously, though.  This is a major focus of many arguments over Bayes (perhaps too much of a focus?).  Once you have your initial degrees-of-belief in place, fine, you can update them by conditionalization.  But where do you get the initial ones (the “priors”) from?

In principle, there are two schools of thought about this.  “Subjective Bayesians” think that the degrees of belief are a private matter, that you can believe whatever you want as long as it’s logically consistent and subject to conditionalization.  “Objective Bayesians” think there is a One True Way to assign the priors.  Jaynes was one of the latter, and favored the “maximum entropy” approach, which says that you should choose the prior by maximizing a measure of uncertainty, subject to constraints specifying things you know to be true.

For Jaynes, these constraints took the form of various functionals of the distribution (typically the mean, plus sometimes the variance).  How you are supposed to derive the exact mean value of your degree-of-belief function from evidence has never been made precisely clear by anyone, AFAIK.  And as far as I can tell, not many Bayesian statisticians use maximum entropy in practice.
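For what it’s worth, here is a sketch of the maximum-entropy recipe on Jaynes’ favorite kind of toy problem, with a mean constraint I picked arbitrarily: the maximum-entropy distribution subject to a mean constraint has an exponential (Gibbs) form, and you solve for its single parameter numerically.

```python
import numpy as np
from scipy.optimize import brentq

faces = np.arange(1, 7)
target_mean = 4.5            # the "constraint": a mean you've somehow decided you know

def gibbs(lam):
    """Max-entropy distribution subject to a mean constraint has this exponential form."""
    w = np.exp(lam * faces)
    return w / w.sum()

# Find the lambda whose Gibbs distribution hits the target mean.
lam = brentq(lambda l: gibbs(l) @ faces - target_mean, -10, 10)
p = gibbs(lam)
print(p)               # the maximum-entropy prior over the six faces
print(p @ faces)       # ~4.5, as required
```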

But once you admit some degree of choice into the matter, you lose the “just one tool” aspect that was appealing in the first place.  If people had subjective states of knowledge that unproblematically mapped onto probabilities, we might be able to take the subjective Bayesian proposal seriously, but we don’t (see 7-7b above).  In actual, practical statistics – and remember, that means the problems where things are easiest to write down and justify in mathematical terms – the priors are often chosen for computational convenience, not for objective correctness or because they’re what the practitioner “actually believes.”  (Conjugate priors, for instance.)
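“Computational convenience” here means something like the following sketch (my own toy numbers): with a Beta prior on a coin’s bias, conditionalizing on flips never requires any real computation at all, because the posterior is again a Beta whose parameters are just the prior’s parameters plus the observed counts.

```python
from scipy import stats

# Beta(a, b) prior on a coin's bias; these numbers are a convenient choice, not a belief report.
a, b = 2.0, 2.0

flips = [1, 0, 1, 1, 0, 1, 1, 1]           # observed data: 6 heads, 2 tails

# Conjugacy: the posterior is again a Beta, with the counts simply added on.
a_post = a + sum(flips)
b_post = b + len(flips) - sum(flips)

posterior = stats.beta(a_post, b_post)
print(posterior.mean())                     # posterior mean of the bias: 8/12
```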

Bayesians like to snipe at the classical toolbox for being a random assortment of gizmos rather than a unified mathematical approach.  But anyone doing Bayes in practice has to choose a prior, and each prior effectively generates a distinct “gizmo” with its own properties.  Sometimes people even choose the priors for the properties of the tool they generate, which is indistinguishable from the classical approach.  Speaking of which … 

10. It’s just regularization, dude

(N.B. the below is hand-wavey and not quite formally correct, I just want to get the intuition across)

My favorite way of thinking about statistics is the one they teach you in machine learning.

You’ve got data.  You’ve got an “algorithm,” which takes in data on one end, and spits out a model on the other.  You want your algorithm to spit out a model that can predict new data, data you didn’t put in.

“Predicting new data well” can be formally decomposed into two parts, “bias” and “variance.”  If your algorithm is biased, that means it tends to make models that do a certain thing no matter what the data does.  Like, if your algorithm is linear regression, it’ll make a model that’s linear, whether the data is linear or not.  It has a bias.

“Variance” is the sensitivity of the model to fluctuations in the data.  Any data set is gonna have some noise along with the signal.  If your algorithm can come up with really complicated models, then it can fit whatever weird nonlinear things the signal is doing (low bias), but also will tend to misperceive the noise as signal.  So you’ll get a model exquisitely well-fitted to the subtle undulations of your dataset (which were due to random noise) and it’ll suck at prediction.

There is a famous “tradeoff” between bias and variance, because the more complicated you let your models get, the more freedom they have to fit the noise.  But reality is complicated, so you don’t want to just restrict yourself to something super simple like linear models.  What do you do?

A typical answer is “regularization,” which starts out with an algorithm that can produce really complex models, and then adds in a penalty for complexity alongside the usual penalty for bad data fits.  So your algorithm “spends points” like an RPG character: if adding complexity helps fit the data, it can afford to spend some complexity points on it, but otherwise it’ll default to the less complex one.
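A hand-wavey numerical sketch of that “spending points” idea, with data I made up (numpy only): a high-degree polynomial fit with a complexity penalty added on, where the penalty weight is the exchange rate on the points.  The general pattern one expects is that too little penalty lets the fit chase the noise and too much washes out the signal.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)   # signal plus noise

degree = 12

def fit_poly(x, y, penalty):
    """Least squares on polynomial features, plus a penalty on coefficient size,
    solved as an augmented least-squares problem."""
    X = np.vander(x, degree + 1)
    A = np.vstack([X, np.sqrt(penalty) * np.eye(degree + 1)])
    b = np.concatenate([y, np.zeros(degree + 1)])
    return np.linalg.lstsq(A, b, rcond=None)[0]

x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)

for penalty in [0.0, 1e-3, 1.0]:
    w = fit_poly(x, y, penalty)
    test_mse = np.mean((np.vander(x_test, degree + 1) @ w - y_test) ** 2)
    print(penalty, test_mse)   # held-out error for each penalty strength
```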

This point has been made by many people, but Shalizi made it well in the very same post I linked earlier: Bayesian conditionalization is formally identical to a regularized version of maximum likelihood inference, where the prior is the regularizing part.  That is, rather than just choosing the hypothesis that best fits the data, full stop, you mix together “how well does this fit the data” with “how much did I believe this before.”
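Here is that identity as a quick numerical check, a sketch with arbitrary made-up numbers of my own: the ridge-regression solution and the posterior mode under independent Gaussian priors on the coefficients come out to the same vector.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
true_w = np.array([1.0, -2.0, 0.5])
sigma = 0.5                                    # noise standard deviation (assumed known here)
y = X @ true_w + rng.normal(scale=sigma, size=50)

lam = 3.0                                      # ridge penalty strength, chosen arbitrarily
ridge_w = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Bayesian version: Gaussian likelihood with variance sigma^2, and independent Gaussian
# priors on the weights with variance sigma^2 / lam.  Maximize the log posterior.
def neg_log_posterior(w):
    log_lik   = -0.5 * np.sum((y - X @ w) ** 2) / sigma**2
    log_prior = -0.5 * lam * np.sum(w ** 2) / sigma**2
    return -(log_lik + log_prior)

map_w = minimize(neg_log_posterior, x0=np.zeros(3)).x
print(ridge_w)
print(map_w)                                   # same numbers, up to optimizer tolerance
```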

But hardly anyone has strong beliefs about models before they even see the data.  Like, before I show you the data, what is your “degree of belief” that a regression coefficient will be between 1 and 1.5?  What does that even mean?

Eliezer Yudkowsky, strong Bayesian extraordinaire, spins this correspondence as a win for Bayesianism:

So you want to use a linear regression, instead of doing Bayesian updates?  But look to the underlying structure of the linear regression, and you see that it corresponds to picking the best point estimate given a Gaussian likelihood function and a uniform prior over the parameters.

You want to use a regularized linear regression, because that works better in practice?  Well, that corresponds (says the Bayesian) to having a Gaussian prior over the weights.

But think about it.  In the bias/variance picture, L2 regularization (what he’s referring to) is used because it penalizes variance; we can figure out the right strength of regularization (i.e. the variance of the Gaussian prior) by seeing what works best in practice.  This is a concrete, grounded, practical story that actually explains why we are doing the thing.  In the Bayesian story, we supposedly have beliefs about our regression coefficients which are represented by a Gaussian.  What sort of person thinks “oh yeah, my beliefs about these coefficients correspond to a Gaussian with variance 2.5″?  And what if I do cross-validation, like I always do, and find that variance 200 works better for the problem?  Was the other person wrong?  But how could they have known?

It gets worse.  Sometimes you don’t do L2 regularization.  Sometimes you do L1 regularization, because (talking in real-world terms) you want sparse coefficients.  In Bayes land, this

can be interpreted as a Bayesian posterior mode estimate when the regression parameters have independent Laplace (i.e., double-exponential) priors

Even ignoring the mode vs. mean issue, I have never met anyone who could tell whether their beliefs were normally distributed vs. Laplace distributed.  Have you?
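The formal point being referenced, for the record (a sketch in my own notation): up to constants that don’t depend on the coefficients, the negative log of a Gaussian prior is an L2 penalty and the negative log of a Laplace prior is an L1 penalty, which is the entire content of the “your regularizer is secretly your beliefs” translation.

```python
import numpy as np

w = np.array([0.0, 0.5, -2.0])     # some regression coefficients

scale = 1.0
# Negative log densities, dropping the normalizing constants (they don't depend on w):
neg_log_gaussian = 0.5 * np.sum(w ** 2) / scale**2     # an L2 penalty
neg_log_laplace  = np.sum(np.abs(w)) / scale           # an L1 penalty

print(neg_log_gaussian, neg_log_laplace)
```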

11. Bayesian “Occam factors”

These are supposed to show that Occam’s razor naturally happens in the Bayesian machinery even if you don’t explicitly try to put it in.  They don’t work.  I’m too tired by now to explain this, I wrote a post about it a while ago.  (Edit: the post is here)
