
@raginrayguns I know you’ve written before about the idea of “getting Occam’s razor for free” in Bayesian stats via the use of Bayes factors, without an explicit simplicity prior

I was reading about this in the book Information Theory, Inference, and Learning Algorithms by MacKay, and something seemed very dissatisfying to me about it.  The idea is that if you are comparing two models that each have some parameters, you should compute the likelihood of the data for each model by integrating over your prior for all possible values of the parameters.  If a model is more complex (more parameters), then on average a smaller fraction of its parameter-space volume is going to be compatible with any given data set, because that parameter space can explain more things.  So even if your overall prior probability of M_1 and M_2 is the same, and you have similar priors over the parameters for both, you’ll naturally penalize the more complex one.

So for instance if you have a linear model y = a*x + b (+ noise), and a quadratic model y = a*x^2 + b*x + c (+ noise), the latter can always fit a given data set a bit better if you choose the optimal parameter values.  But if you integrate over a whole space of possible parameter values, and your data looks linear, the quadratic model will only fit well in a small region of parameter space near a = 0, and the integral will include the low likelihoods outside that area.  This will favor the simpler linear model, even if you didn’t give it a higher overall prior probability to start with.
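This Occam effect is easy to exhibit numerically.  The sketch below is my own construction, not MacKay’s worked example: it uses the fact that for linear-Gaussian models the marginal likelihood is available in closed form, so no numerical integration is needed.  With independent Gaussian priors on the coefficients (sd tau) and Gaussian noise (sd sigma), the data vector is marginally distributed as y ~ N(0, tau^2 * Phi @ Phi.T + sigma^2 * I).  All the numbers (n, sigma, tau, the true coefficients) are arbitrary illustration values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data that is genuinely linear: y = 2x + 1 + noise
n, sigma, tau = 50, 1.0, 3.0        # noise sd, prior sd on each coefficient
x = np.linspace(-3, 3, n)
y = 2 * x + 1 + rng.normal(0, sigma, n)

def log_evidence(Phi, y, sigma, tau):
    """Closed-form log marginal likelihood of a linear-Gaussian model:
    y = Phi @ w + eps,  w ~ N(0, tau^2 I),  eps ~ N(0, sigma^2 I),
    which makes y ~ N(0, tau^2 Phi Phi^T + sigma^2 I) marginally."""
    n = len(y)
    C = tau**2 * Phi @ Phi.T + sigma**2 * np.eye(n)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(C, y))

Phi_lin = np.column_stack([x, np.ones(n)])          # y = a*x + b
Phi_quad = np.column_stack([x**2, x, np.ones(n)])   # y = a*x^2 + b*x + c

ev_lin = log_evidence(Phi_lin, y, sigma, tau)
ev_quad = log_evidence(Phi_quad, y, sigma, tau)
print(f"log evidence, linear:    {ev_lin:.2f}")
print(f"log evidence, quadratic: {ev_quad:.2f}")
# Integrating over the extra a-dimension penalizes the quadratic model,
# so on linear-looking data the linear model's evidence comes out higher.
```

The penalty is exactly the “wasted prior volume” effect from the text: the quadratic model spreads prior mass over values of a that the data rule out.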

This is a way of doing the thing that I’m used to thinking of in bias/variance terms.

What seems unsatisfying is that there is this appearance of naturalness – of not having to make any arbitrary choices – but really everything depends on what you treat as “same model, different parameters” vs. “different models.”  You do have an arbitrary choice, the choice you make when you break down the possibility space into a set of models, and then further into models-with-given-parameter-values.

For instance, in the above example, including the linear model as a distinct hypothesis is arguably redundant, since every model in that class also appears in the quadratic class, along the surface with a = 0.  But in the quadratic class, this set (a plane in a 3D space) is going to have prior probability 0 unless you put a lump of probability mass (Dirac delta) on that plane.  So the above setup, where you have equal prior probability for “linear model” and “quadratic model” is equivalent to only having “quadratic model” … with this weird, non-obvious lump in the prior.  And indeed, the latter is the correct way of seeing this prior, unless we want to allow our probability space to have two copies of the same outcome.  So the idea that our prior is not privileging the linear model over the quadratic is not quite true, and the whole thing feels like sleight of hand.
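The “weird, non-obvious lump” can be written out explicitly.  Assigning prior probability 1/2 to each model class, with parameter priors π_lin and π_quad within the classes, is the same as using the quadratic model alone with this mixed prior on (a, b, c) (notation mine):

```latex
% Equal prior odds on the two classes = a single prior over the quadratic
% parameter space with a Dirac-delta lump on the a = 0 plane:
p(a, b, c) \;=\; \tfrac{1}{2}\,\delta(a)\,\pi_{\mathrm{lin}}(b, c)
           \;+\; \tfrac{1}{2}\,\pi_{\mathrm{quad}}(a, b, c)
```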

The idea of a prior based on Kolmogorov complexity solves this issue, but at the cost of introducing something uncomputable.  Maybe minimum description length also solves the issue?  But MacKay says it “has no apparent advantages over the direct probabilistic approach.”

You could also avoid this problem by insisting that all your models be mutually exclusive, but this is not true for many model comparisons we would want to do in practice (like the above), and it also isn’t true for some of MacKay’s examples.

plain-dealing-villain:

nostalgebraist:

plain-dealing-villain:

nostalgebraist:

Posting to say that I should stay out of the latest Bayes epistemology exchange bc I have nothing new to say about it, but also to link “Solvitur ambulando” again and reiterate that I don’t understand assigning/updating probabilities when you only have an incomplete, time-dependent picture of the sigma-algebra

It’s been looking increasingly likely that Thompson sampling is actually the correct approach to approximate for bounded agents. But it’s still Bayes in the limit, and doesn’t throw out terribly different conclusions.

I don’t understand how Thompson sampling helps here.  It still has a known and fixed sigma-algebra, just one with more dimensions.  Do you have a link to a resource on this?

No, but it keeps coming up that in situations where Bayesian induction is optimal for unbounded agents, bounded agents get better answers faster by using the Thompson sampling approach. (The two main ones I was thinking of are this paper on the grain of truth, from MIRI and the standard wisdom about running multi-armed bandit problems in A/B testing, where the Thompson approach is asymptotically the same but has much better constants.)

There hasn’t been anyone, AFAIK, who’s set out to show that Thompson variants of Bayesian inference are optimal for bounded agents in the same way that Bayesian inference is optimal for unbounded ones. But I’m increasingly expecting that it’s true, and is just waiting for us to discover the proof. This proof would probably deal with your issues, assuming it exists to be discovered.

I don’t think this is addressing my issue?  I’m not talking about whether or not Thompson sampling is good if you can do it, I’m talking about whether you can do it.

In a Bayesian approach to the multi-armed bandit, you know the probability space you are assigning probabilities on (each outcome consists of an expected payout value for all the bandits at once, and each event is just some set of those outcomes, and usually the bandit payouts are known to be independent so the event space is really simple).  The MIRI paper is quite complicated but seems to assume that the probability space is known (otherwise you couldn’t do Thompson sampling).
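For concreteness, here is the standard Beta-Bernoulli version of Thompson sampling, which is the setting described above: the probability space is fixed in advance (one payout rate per arm, independent priors), and all the algorithm ever does is sample from the posterior over that known space.  The payout rates and round count are arbitrary illustration values.

```python
import numpy as np

def thompson_bandit(true_probs, n_rounds=5000, seed=0):
    """Beta-Bernoulli Thompson sampling.  The sigma-algebra is fixed up
    front: each arm has an unknown payout rate with a Beta(1, 1) prior."""
    rng = np.random.default_rng(seed)
    k = len(true_probs)
    alpha = np.ones(k)              # Beta posterior parameters per arm
    beta = np.ones(k)
    pulls = np.zeros(k, dtype=int)
    for _ in range(n_rounds):
        # Sample a payout rate for each arm from its posterior,
        # then play the arm whose sampled rate is highest.
        arm = int(np.argmax(rng.beta(alpha, beta)))
        reward = rng.random() < true_probs[arm]
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

pulls = thompson_bandit([0.3, 0.5, 0.7])
print(pulls)  # the 0.7 arm ends up pulled far more than the others
```

Note that nothing here relaxes the assumption at issue: the hypothesis space (one rate per arm) is written down before any data arrives.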

I don’t understand what you mean by “bounded” here.  The general sort of problem I am referring to is this: “you are trying to assign probabilities to individual events, but you don’t have complete information about the event space they belong to.”

(via jiskblr)

Frequentist statistics is just an unsystematic grab bag of assorted tools.

Bayesian statistics is just an unsystematic grab bag of assorted tools, except with different labels.  Instead of saying what the tools are good and bad for doing, the labels say things about priors.  Usually no one is sure what their prior is so they just use the ones that are good for what they’re doing.

raginrayguns replied to your post “I find Leah Libresco’s conversion frustrating.  She always stresses…”

from http://americamagazine.org/content/all-things/my-journey-atheist-catholic-11-questions-leah-libresco: “Like Alasdair MacIntyre, I kind of found my way into virtue ethics and, once I was convinced that was the most likely model for how morality works, I had to figure out what virtue ethics implied, and I wound up believing that it was the Catholic faith.” ??? ??? ??
like, does virtue ethics imply that Jesus Christ historically rose from the dead and spoke to his disciples after being crucified? if so, hooowwwwwww
oh ok I read her original post. Moral law looks created by a person, and the knowability of moral law by humans looks like it was created by a person?
ok yeah idgi but it’s a separate thing from what you’re saying
oh no it ISNT a separate thing from what you’re saying. It sounds like you’re saying that she never expands on this! Never talks in detail about why she thinks what she (thinks she) knows about morality is only explainable by the existence of god (as described by Catholicism)

Yeah.  I don’t know if she really never expands on it, because it’s hard to prove a nonexistence claim, but she tends to talk as though it doesn’t really need addressing?

I listened to part of this yesterday and there were points where she is asked about some part of the ethics-to-Catholicism bridge, but her answers were really unsatisfying (to me).  At 7:00 Mehta asks about why Catholicism appealed to her specifically, and she talks about liking the social structure of the Catholic church (likening it, strangely, to democracy), not about the doctrines seeming true.  At 20:58 he asks “why not Deism?” and she gives a confusing answer about how the God of Deism doesn’t “have any explanatory power”, doesn’t make any predictions or affect how you live your life, so “you might as well have not added it at all.”  She compares it to phlogiston, clearly thinking of that one LW post.

There are a few things I don’t get here.  (Most pedantically, phlogiston did make some predictions – and was falsified – but I get the concept, it’s the “virtus dormativa” thing.)  “You might as well have not added it at all” confuses me because she was facing a problem that was solved by introducing some Godlike being – wouldn’t introducing a Deist God at least solve that problem?  Then doesn’t it have some “explanatory power”?  Maybe it’s true that it doesn’t make predictions, but that could just be a responsible expression of uncertainty – “I think there has to be some element like this in the system, but I can’t pin down any of its properties yet.”

Mehta, naturally, asks her what she can predict now that she couldn’t predict before, and she gives some answers about human psychology.  Which I guess is fair given that the original question was “how do humans come to know about morality?”  But it’s still like, man, Catholic doctrine has a whole lot of implications, and you’re taking that all on board this lightly?

Now that I think about it, it kinda reminds me of the adoption of Bayesianism/Jaynesianism in the LW community – “this theory has a few really cool selling points” plus “all these cool smart people subscribe to this theory” leads to wholesale acceptance of the theory, sidestepping all the ink that has already been spilled on the specific question of whether you should accept the theory, what it gets you, what it costs you.

raginrayguns:

in the past i’ve (implicitly?) taken the position that bayesianism is the “right” paradigm to explain the success of inferential procedures. Whatever “right” means. And, as far as it goes–the scientific method is an inferential procedure, but logical uncertainty and difficulty thinking of the correct hypotheses are critical parts of the scientific method, whereas Bayesianism assumes logical omniscience (shouldn’t have to, but no one’s shown how to actually do the calculations when you relax that assumption) and a complete hypothesis space. Jaynes (PT:LoS 5.5.2): “…in practice, the situation faced by the scientist is so complicated that there is little hope of applying Bayes’ theorem to give quantitative results about the relative status of theories. Also there is no need to do this, because the real difficulty of the scientist is not in the reasoning process itself; his common sense is quite adequate for that. The real difficulty is in learning how to formulate new alternatives which better fit the facts. Usually, when one succeeds in doing this, the evidence for the new theory soon becomes so overwhelming that nobody needs probability theory to tell him what conclusions to draw.” So when I’m arguing for Bayesianism as a paradigm for explaining inferential procedures, the procedures I’m talking about have not usually been in science, but rather statistical estimation and hypothesis testing procedures.

This position was a little different from what you’d think a Bayesian arguing about statistics would say. You might think a Bayesian would be saying, “use Bayesian statistics.” But I’m not, I’m saying “use Bayesianism to explain the success/failure of statistical methods.” Use whatever’s convenient, understand with Bayesianism. Well, understand with whatever’s convenient too, but only Bayesianism will explain all the successes and all the failures, and I have tended to look for Bayesian explanations. Yudkowsky: “The point of Bayesianism isn’t that there’s a toolbox of known algorithms like max-entropy methods which are supposed to work for everything. The point of Bayesianism is to provide a coherent background epistemology which underlies everything; when a frequentist algorithm works, there’s supposed to be a Bayesian explanation of why it works.” In fact, in my own statistical work, I actually avoid Bayesian solutions for some applications (hypothesis testing, confidence intervals). Though for others it’s fine (estimation).

The justification, to me, was pretty simple. What you want to explain is why some conclusion you got from the data–let’s call it f(data) where f maps data to conclusions of some form–is the correct one. What Bayesianism can provide is an explanation of why f(data) is likely, through a calculation of p(f(data)|data). And there you go–you’ve explained why it’s the right conclusion.

So why the past tense? (position was, justification was…) Well, it occurred to me yesterday that this explains the wrong thing. We’re not trying to explain why f(data) is the best conclusion. In fact, we don’t even know that it is! Really we’re trying to explain why a statistical procedure has worked many times in the past. This requires calculating that p(f(data)=truth|truth) is high, for the values of “truth” that have been encountered in the past–a frequentist calculation. Or, that p(f(data)) is high, where we’ve marginalized over p(truth) which is a frequency distribution of truths that have been encountered in the past. This would be empirical Bayes (Bayes with an empirically determined prior), and still essentially frequentist.
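The “track record” calculation described here is easy to sketch as a simulation: draw truths from an empirical frequency distribution of past cases, simulate the data each truth would have generated, and count how often the procedure lands near the truth.  All distributions, sample sizes, and the success threshold below are illustrative assumptions, not anything from the post.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(data):
    # The procedure under scrutiny: estimate a location by the sample mean.
    return data.mean()

# "Truths encountered in the past": an empirical frequency distribution.
past_truths = rng.normal(10.0, 2.0, size=20_000)

# For each past truth, simulate the data it generated and check whether the
# procedure landed close to it -- a frequentist track-record calculation,
# i.e. p(f(data) ~ truth) marginalized over the frequency of past truths.
hits = 0
for truth in past_truths:
    data = rng.normal(truth, 1.0, size=25)  # 25 observations, noise sd 1
    hits += abs(f(data) - truth) < 0.4      # "worked" = within 0.4 of truth
rate = hits / len(past_truths)
print(f"procedure succeeded in {rate:.1%} of past cases")
```

With these numbers the standard error of the mean is 0.2, so the 0.4 threshold is two standard errors and the success rate comes out near 95% – a statement about frequencies over past cases, not about any one conclusion.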

But, I still do believe that Bayesianism explains which answer is right in your particular application. As I’ve put it before, “Bayesianism is a paradigm for explaining why a conclusion H is rational, given evidence E.” In practice, you’re often going to do something kind of frequentist and say “this inferential procedure usually works, so I’ll use it here and its conclusions should be trustworthy.” But frequentism only covers the “usually works” part: if you want to explain why the conclusion you get is likely in this particular situation you need a Bayesian explanation.

So, then, it seems there’s no conflict between Bayesianism and frequentism. They apply to two different domains, which leaves no intersection where one can choose between the two. No choice, nothing to argue about. Right?

Actually, it turns out I have plenty to argue about. Here are my theses:

  1. Bayesian estimation is not just regularized maximum likelihood, and its success is not always explained by bias-variance trade-offs. Sometimes the Bayesian estimator has no bias, just lower variance. Witness: in estimating Cauchy location, the Bayesian posterior mean has lower mean-squared error than maximum likelihood, despite a uniform prior and no bias. Likewise, maximum likelihood is not, as Bayesians sometimes say, just Bayesian estimation with a uniform prior. How could it be, if it’s demonstrably worse?
  2. Part of the above problem is conflation of Bayesian estimation with MAP. So thesis 2, by Peter Cheeseman: “MAP is crap”. More detail: it’s Bayesian estimation with a loss function that is parametrization-dependent and not at all what you want, since it considers answers that are close to and far from the truth as equal losses.
  3. Bayesian model selection can select simple hypotheses over complex ones, even when they have the same prior probability. Witness this in variable selection in linear regression, in Peter Hoff’s A First Course in Bayesian Statistical Methods, section 9.3.1, “Bayesian Model Comparison.” This is related to thesis 1: Bayesian statistics is not just putting penalties on stuff. But it’s also its own thesis: Bayesianism provides an explanation for some cases of Occam’s razor. Kevin Kelly calls this the “argument from Bayes factors”, and provides a good explanation and a counter-argument.
  4. The Bayesian answer is the best given your data, and if you think you can do better, it’s because your beliefs don’t match the prior. The optional stopping paradox is sleight of hand: use a prior that puts no mass on θ = 0, then demonstrate you get bad performance when (and only when) θ = 0 – if that were actually your prior, this wouldn’t bother you. The Diaconis-Freedman inconsistent Bayesian location estimate comes from getting data from a continuous distribution while using a prior that assumes it is discrete.
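The parametrization-dependence in thesis 2 is easy to exhibit numerically.  A minimal sketch (my example, not Cheeseman’s): take a Beta(3, 7) posterior on a rate p, whose mode is (a−1)/(a+b−2) = 0.25.  Pushing the same posterior through φ = logit(p) multiplies the density by the Jacobian dp/dφ = p(1−p), which moves the mode to a/(a+b) = 0.3 – same beliefs, different MAP.

```python
import numpy as np

# Posterior on a rate p: Beta(3, 7), proportional to p^2 (1-p)^6.
a, b = 3, 7
p = np.linspace(1e-6, 1 - 1e-6, 200_001)
post_p = p**(a - 1) * (1 - p)**(b - 1)

# MAP in the p-parametrization: the analytic mode (a-1)/(a+b-2) = 0.25.
map_p = p[np.argmax(post_p)]

# The same posterior in phi = logit(p): the density picks up the Jacobian
# dp/dphi = p(1-p), so we maximize p^a (1-p)^b instead, with mode a/(a+b).
post_phi = post_p * p * (1 - p)        # phi-density, evaluated at p(phi)
map_phi_as_p = p[np.argmax(post_phi)]  # mode in phi, mapped back to p

print(map_p, map_phi_as_p)  # ~0.25 vs ~0.30: MAP depends on parametrization
```

The posterior mean under a sensible loss function does not have this problem in the same way, which is the point of distinguishing Bayesian estimation from MAP.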

None of these four are obvious, it’s all stuff that people disagree with me about, on Tumblr and among professional stats and philosophy writers.

Good stuff, your points 1-4 are interesting and I want to think about them more

Re: underlying Bayesian explanations, I guess I tend to think in terms of making a model for your data (the “statistical learning perspective” or something like that), in which case the sense of “correct” I want doesn’t correspond intuitively to “likely.”  If I’m estimating a model parameter, I want it to produce a good (predictive) model, rather than having a high probability of being the “true value” of the parameter.  (The parameter lives in my model, not in reality)

I guess if you’re estimating something non-parametric like quantiles you might really not have a model?  (If you are estimating a moment, you are at least making the “modeling assumption” that the moment exists)

To talk about how good my model is, relative to other models, I can talk about its expected performance on some loss function – and that calculation may involve its own modeling choices, but I at least have an intuitive sense of what I am trying to model there.  By contrast I’m not sure what “how likely is my model?” means

raginrayguns replied to your post “Deborah Mayo’s “error statistics” sounds very interesting, and looks…”
Wait superior for what? You bring up frequentism so I think you’re talking about data analysis, but then when someone asks about bayesianism you start talking about like, Bayesian confirmation theory in philosophy of science, which is not really opposed to frequentism is it? I mean it’s a paradigm for a totally different application

Oh I meant for philosophy of science, hopefully with implications down the line for experimental design and data analysis.

Contrasting “frequentism” and “Bayesianism” in the OP was confusing, sorry.  Error statistics is a theory of scientific inference in which “frequentist” concepts like hypothesis tests play a central role, and presents itself as an alternative to Bayesian theories of scientific inference

inferentialdistance:

nostalgebraist:

Deborah Mayo’s “error statistics” sounds very interesting, and looks like it could be a superior alternative to both frequentism and Bayesianism.  Sort of a version of frequentism that explicitly incorporates the process of data generation and experimental design, and thus (reportedly) avoids some of the ugly oddities of frequentism.  But when I try to read her book on it, it’s just … so … boring.

I wonder whether this is why it has not gotten very much attention.  Error statistics needs an advocate with flair – a Jaynes.

What’s wrong with Bayesianism?

This is a big question.  At some point I should write up a post collecting my opinions on the matter in one place

But basically (focusing on the critiques Mayo is interested in): the scientific method seems to “work,” i.e. scientists continually produce and agree upon new theories that turn out to be very fruitful prediction-wise.  Suppose that what we want, here, is an account of “what makes science work” that lets us both understand science better and suggest ways to improve existing scientific practice.  (Perhaps this could be sort of like the discovery that aspirin is what makes willow bark work, allowing us to make aspirin tablets, etc.)

The critique is then that Bayesianism does not really provide this.  It provides a framework for agents which have complete sets of pre-existing credences (“prior probabilities”), but actual scientific practice doesn’t seem to involve such things.  One can “reconstruct” episodes in scientific history by postulating sets of credences that would have caused the scientists to behave as they did, but this does not imply that “update on your pre-existing credences” is a good rule for new practice.  (Mayo compares this to someone who shows that any famous painting could have been produced by using some paint-by-numbers kit, and then claims that using paint-by-numbers kits is the correct way to make great art.)

There are various technical problems with implementing Bayesianism in practice (having to do with lacking complete priors, lacking epistemic closure, etc.), but even without these, there is the problem that it is not clear what Bayesianism is advising us to do.  We can’t snap our fingers and suddenly have prior credences in every conceivable hypothesis, and it’s not clear what incremental steps in that direction might look like, nor is it clear that they are advisable, since there’s been no clear demonstration that prior credences have been instrumental in the success of science.

(via inferentialdistance)

metagorgon replied to your post “Deborah Mayo’s “error statistics” sounds very interesting, and looks…”

where *is* a good reference? this is also a very unsearchable name.
are you referring to a 43-page paper titled Error Statistics on the vtech website, cited by 55?

http://www.phil.vt.edu/dmayo/personal_website/Error_Statistics_2011.pdf

Ah, I was referring to her ~500-page book Error and the Growth of Experimental Knowledge.  I haven’t read that paper but it looks like a better, or at least briefer, introduction to the idea.  If you do read the paper (or anyone else reading this does) I’d love to hear thoughts on it.  (I will be busy today and won’t get around to it soon, if ever)