
Unscrambling the second law of thermodynamics →

Just to complicate things, and because this is like the Platonic form of a post I would make:

Cosma Shalizi is skeptical of Jaynes’ interpretation of the second law, saying it would actually imply that entropy decreases over time.

And Eliezer Yudkowsky is skeptical of Shalizi’s argument.  (See discussion here).

(I remember being confused by Shalizi’s argument when I first read it.  I should read it again.)

nostalgebraist:

vaniver:

nostalgebraist:

So: what’s the deal with the Akaike information criterion vs. the Bayesian information criterion?  “Information theory” and “Bayesianism” are both things with a lot of very devoted adherents, and here they appear superficially to give different answers.

They correspond to different priors. AIC has a bit better underlying framework (from an information theory point of view) and I believe better empirical validation.

Ah, OK.  I found this paper through Wikipedia, about AIC as Bayesian with a different (better?) prior, which looks good.

BIC has the advantage that it will converge asymptotically to the true model if the true model lies in the set of models being fitted, although it’s disputable how important this is.  And BIC can be derived using a minimum description length approach (can you get AIC this way too?).
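For reference, the two criteria differ only in their complexity penalty, which is what drives the disagreement above. A minimal sketch (the formulas are standard; the toy log-likelihood numbers are made up):

```python
import math

def aic(log_likelihood, k):
    # AIC = 2k - 2 ln L: the penalty is a constant per parameter
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    # BIC = k ln(n) - 2 ln L: the penalty grows with sample size,
    # which is what drives BIC's asymptotic consistency
    return k * math.log(n) - 2 * log_likelihood

# Toy comparison: a 3-parameter model that fits slightly better
# than a 2-parameter one, at n = 1000 observations.
small = (aic(-520.0, 2), bic(-520.0, 2, 1000))
big = (aic(-517.0, 3), bic(-517.0, 3, 1000))
print(small, big)
```

At n = 1000, AIC prefers the bigger model and BIC prefers the smaller one, which is exactly the kind of split being discussed.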

One of the things I am wary of here is the sense that “information theory is magic” – e.g. in the paper linked above:

Their celebrated result, called Kullback-Leibler information, is a fundamental quantity in the sciences […] Clearly, the best model loses the least information relative to other models in the set […]

Using AIC, the models are then easily ranked from best to worst based on the empirical data at hand. This is a simple, compelling concept, based on deep theoretical foundations (i.e., entropy, K-L information, and likelihood theory).

Maybe I just don’t understand information theory, but I’m confused why I should care that the K-L divergence is “deep” and “fundamental,” here.  The question at hand is how to select a model based on some sort of estimate of how the model will generalize from the training set.  In practice I hear people justify using things like AIC by saying “well, obviously, you want the most information,” where “most information” is just a verbal tag we’ve associated with the K-L divergence and I’m not sure what mathematical weight I should give to it.  If AIC does well, and this is because it is based on information theory, I would like to understand this in a nonverbal way – what property of K-L divergence made it a good choice here, ignoring suggestive words like “information”?
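For whatever it’s worth, one non-verbal way to cash out the claim: D(p‖q) is exactly the expected extra log-loss you incur by predicting with q when the data actually come from p, so ranking models by estimated K-L divergence is ranking them by expected predictive log-loss, not by anything mystical about “information.” A toy sketch (distributions made up):

```python
import math

def kl(p, q):
    # D(p || q) = sum_i p_i * log(p_i / q_i): the expected extra
    # log-loss you pay for predicting with q when data come from p.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]          # "truth"
q_good = [0.45, 0.35, 0.2]   # nearby model
q_bad = [0.1, 0.1, 0.8]      # distant model

# The model that predicts better has the smaller divergence.
print(kl(p, q_good), kl(p, q_bad))
```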

Reblogging because I’m really curious about this – I’ve been aware of information theory for a long time but I’ve never been sure how it justified choices like this, and I feel like I must be just missing something major / “obvious.”

@su3su2u1, @lambdaphagy, @raginrayguns, et al.?

(via nostalgebraist)


slatestarscratchpad:

nostalgebraist:

slatestarscratchpad:

Hey @nostalgebraist, did you see Clark Glymour got dragged into saying something vaguely positive about CFAR?

Also, even though my brain knows OZY Magazine is not actually written by Ozy, my heart refuses to believe it.

Not sure why this is surprising?  My main experience with Glymour is as someone who complains about causal inference through regression and favors causal inference via the Causal Markov Condition, none of which seems inconsistent with anything CFAR does.

My main association with him was he wrote “Why I Am Not A Bayesian”. That was him, right?

He did, but it’s mostly about the usual stuff involving Bayesianism in philosophy of science – very interesting if you want to know whether Eliezer Yudkowsky is right in claiming that Bayesianism is the key to all methodologies and the source of magic powers, not as interesting if you are just trying to teach people about decision-making and cognitive biases.

That is, telling ordinary people to think about “Bayes’ Theorem” in the context of stuff like medical statistics, in order to get them to avoid the Base Rate Fallacy, is something that every statistician in the academy can probably sign onto – they all agree that the Base Rate Fallacy is a fallacy.  Whether you class this as “just another cognitive bias to avoid” or “the secret of the universe” probably matters to some degree, but I’m not sure which way CFAR currently goes, and I guess I imagine the consequences of the latter choice would be “make people think avoiding the Base Rate Fallacy is really really important,” which doesn’t seem obviously pernicious.
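To make the medical-statistics version concrete (the test characteristics below are hypothetical), here is the Base Rate Fallacy calculation via Bayes’ Theorem:

```python
def posterior(prior, sensitivity, false_positive_rate):
    # P(disease | positive test) via Bayes' theorem:
    # posterior = P(+|D) P(D) / P(+)
    p_pos = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_pos

# A 99%-sensitive test with a 5% false-positive rate, for a
# disease with 1-in-1000 prevalence:
p = posterior(0.001, 0.99, 0.05)
print(round(p, 3))  # ~0.019: a positive result still means <2% chance
```

The fallacy is ignoring the 1-in-1000 prior and reading the test’s 99% sensitivity as the posterior.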

Unless those people are philosophers of science, or people who happen to be thinking about precisely the sort of problems where these philosophical issues are practically important.  But I’m not sure that’s who CFAR is training?  Like, I can’t imagine Glymour (or Deborah Mayo, or Cosma Shalizi, et al.) exclaiming, “but these people coming out of CFAR are going to be erroneously convinced they can reconstruct the justification for adopting Kepler’s Laws in Kepler’s own time better than a hypothetico-deductivist would be able to!”  That isn’t their job, and if it is their job, a seminar on Bayes’ Theorem is not going to affect them one way or the other.

(via slatestarscratchpad)


I just remembered that there’s a website called “Bayesian Bodybuilding”

su3su2u1-deactivated20160226 asked: Let me know if that answered your quantum question, I can try to expand a bit more if I need to.

I think you answered it, yeah.  Mostly the complex number thing just threw me for a loop.

(Also I think I have a bunch of confusions about exactly how the probabilities in QM can be thought of – I remember picking up something from John Earman’s book on determinism that ended up in my mind as “the probabilistic nature of QM can’t be interpreted as mere subjective uncertainty about a set of objective possibilities, because of interference,” but Quantum Bayesianism works somehow, so apparently that isn’t quite right.  At the time I thought it somehow ruled out Quantum Bayesianism and was disappointed because “collapse is a Bayesian update” had seemed like a cool idea when I first heard about it.  Obviously, I do not understand any of this very well)
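The interference point can at least be made concrete with two lines of arithmetic: amplitudes add before being squared, so the outcome probability is not a mixture over “which path really happened.” A toy sketch with made-up amplitudes:

```python
import cmath

# Two paths to the same outcome, with complex amplitudes of
# opposite phase (toy values, not any particular experiment).
a1 = 1 / cmath.sqrt(2)
a2 = -1 / cmath.sqrt(2)

p_quantum = abs(a1 + a2) ** 2              # amplitudes interfere first
p_classical = abs(a1) ** 2 + abs(a2) ** 2  # "one path or the other"

# ~0.0 vs ~1.0: no ignorance-over-paths mixture reproduces
# the interference term.
print(p_quantum, p_classical)
```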

it’s been far too long: a bayes effortpost

scientiststhesis:

antisquark:

jadagul:

su3su2u1:

scientiststhesis:

That last sentence is the fundamental disagreement, I’d say. To a Bayesian, the way you measure the quality of your inferences is by seeing how close that inference is to what Bayes says it should be.

And I’m saying that this is a bad way to measure the quality of an inference.  You need to measure it based on “does this inference do what I need it to do.”  The reason people use Dempster-Shafer instead of Bayes in sensor fusion problems is that it matches reality better.

The Bayesian thesis, in this case, says that when you split the data into two sets, and draw inferences about the second based on the first, then using Bayes’ Theorem on the first will always give you the best inferences about the second, and whenever it looks like some other method is doing better, it’s because you injected information into this other method that you didn’t allow Bayes.

But you just said best inference was defined by Bayes, in which case Bayesian inference is always best by definition. 

I assume you don’t actually believe that, and do think you need to validate the model- in which case your choice of validation metric matters.  Should you use a validation metric informed by the problem you are working on, i.e. one that accounts for how the inference will be used?

Well, for what it’s worth, I’m using a Dynamic Bayesian Network at my lab right now to model the production of certain chemical compounds as a function of transcriptional information from a microbial ecosystem, so at the very least I put my money where my mouth is. And the problem is being solved.

(And I’m doing exactly the thing you said, about splitting the data into two sets, naturally, that’s how you validate models :P)

So that last aside seems to indicate you don’t actually trust in Bayes- you are validating your model!  What are you using as a metric?  

Let’s say someone working with you came along with a non-Bayesian model and it turned out to work much better, based on this metric.  Would you switch to the new model, or insist that there must be some information you can put into Bayes that will let you do as well?  Is it worth your time to try various Bayesian analyses until you get the same result?  

Now, imagine a situation where you don’t have any reasonable way to construct a prior- something like ‘you have N samples from an unknown distribution F.  Estimate F’  

Frequentists have a good, non-Bayesian method to estimate F based on the law of large numbers.  I don’t know a clean Bayesian way to do this.  If Bayes hits a wall at such a simple problem, why should I expect it to do perfectly elsewhere?
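For concreteness, the frequentist method being alluded to is presumably the empirical distribution function, which the Glivenko-Cantelli theorem says converges uniformly to F. A sketch:

```python
import random

def ecdf(samples):
    # Empirical CDF: F_hat(x) = fraction of samples <= x.
    # By Glivenko-Cantelli, sup_x |F_hat(x) - F(x)| -> 0 as n grows,
    # with no prior over distributions required.
    s = sorted(samples)
    n = len(s)
    def f_hat(x):
        # count of samples <= x (a linear scan is fine for a sketch)
        return sum(1 for v in s if v <= x) / n
    return f_hat

random.seed(0)
data = [random.gauss(0, 1) for _ in range(10000)]  # unknown F = N(0, 1)
f_hat = ecdf(data)
print(f_hat(0.0))  # close to the true value Phi(0) = 0.5
```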

So I think the useless-but-philosophical point is this.

Any admissible decision rule is equivalent to a Bayesian inference with some prior. You’ve often asked the reasonable question “Wait, if you’re picking your prior to give a frequentist result, why is that Bayesian instead of frequentist?”

But from the Bayesian perspective, your prior represents the information you have about your data that isn’t your data. (See the Gelman post I mentioned recently). So if you have reason to think that your frequentist analysis (or whatever) is outperforming your Bayes-with-uniform-prior or whatever, that is “information about the problem” that should be incorporated in your prior.

In other words–everyone agrees the hard problem of Bayesian analysis is “where do I get a prior?” The validation step isn’t to validate the idea of doing Bayesian inference, it’s to validate the prior.

So ignoring computational difficulty (which I know jack all about), a philosophical Bayesian would say, oh, you shouldn’t do a Bayesian analysis and a frequentist analysis and…; you should be doing Bayesian analyses with various priors, one of which is uniform and one of which is exponential and one of which mimics frequentism etc. And when you figure out which prior works well, that tells you what model and prior you should be using for your situation.

In an ideal theoretical approach you don’t actually do Bayesian analysis with several different priors. You use one prior, which might be a convex linear combination of several models among which you are uncertain. The most radical example of this is the Solomonoff prior, which includes *all* computable models.

Yeah that’s exactly it. Bayesianism-as-a-philosophy is… pretty contentless, in practice, at least right now. We don’t have a Solomonoff prior, since it is literally uncomputable. We can try to create reasonable approximations to it, but we’ll always have P(actually it’s something I haven’t thought of) ~ 5% or so.

In practice, I don’t think there’s any disagreement between @su3su2u1 (who I can’t mention for some reason) and me. In scientific practice in the real world, we’ll probably do very similar things. Bayes just gives me the background theoretical/philosophical framework for it all.

But I agree with jadagul, the point is mostly philosophical. It’s not completely philosophical because sometimes this informs what models I’m going to try first, but well. Yeah.

(This got longer than I expected, and surely far longer than it needs to be.  Sorry about that, and sorry if it’s pedantic / belaboring points everyone knows.)

There is a subtlety here, which is both philosophical and practical (I think).  It’s that when you define an ideal you’re trying to approximate, the way you describe that ideal may produce a non-unique metric of comparison to the ideal.  Even if the ideal would be perfect if you could actually do it, you also want the ideal to have the property that “small deviations from this ideal produce small deviations from perfection.”

That’s a pretty hand-wavy statement, so here’s a concrete example.  In math, it’s possible to write down multiple infinite series that all converge to the same thing.  All of these are equally good “ideals,” so to speak – if you could actually add up all the terms, you would get the exact answer.  However, if you want to approximately compute an answer, you’re only going to be adding up a finite number of terms, so it matters how fast the series converges.

For instance, “tan(pi/4) = 1, so pi is 4 times the arctangent of 1” is a perfectly good characterization of pi.  (It’s an “ideal,” in the above language.)  In a certain sense one could say the following: “pi is 4 times the arctangent of 1, so an approximation of pi is good exactly insofar as it strives to approximate 4 times the arctangent of 1.”  But one has to be careful when interpreting that statement!  If you interpret it as “to approximate pi, I should write down a formula for 4 times the arctangent of 1, and then approximate that,” you’ll get the Leibniz formula, which converges very slowly, and isn’t useful for approximation.

Instead, if you want to approximate pi, you should use one of the known series that converges fast.  And in one sense these are perfectly consistent with the “arctangent ideal”: that is, using these series gets you closer to the number “4 times the arctangent of 1,” and one can justify them on that basis.  But compared to the Leibniz formula, they look less like an approximation of “4 times the arctangent of 1.”  If you sat down and simply thought “I want to approximate 4 times the arctangent of 1, what should I do,” your first stab would probably get you the Leibniz formula, which wouldn’t be good.

On the other hand, the series that work better are based on much less intuitive characterizations of pi.  You’d never get them by just sitting down and thinking “hmm, what’s a nice, simple, natural way to characterize the exact value of pi?”  The intuitive ideal gives you a bad practical method, and less intuitive ideals (which are equally good as ideals, i.e. they all equal pi exactly) give you better ones.
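The convergence gap is easy to exhibit numerically: the Leibniz series (the naive truncation of “4 times the arctangent of 1”) against Machin’s identity pi/4 = 4 arctan(1/5) − arctan(1/239), both of which are exact characterizations of pi.

```python
import math

def leibniz(n_terms):
    # pi/4 = 1 - 1/3 + 1/5 - ...  (the "intuitive" truncation)
    return 4 * sum((-1) ** k / (2 * k + 1) for k in range(n_terms))

def machin(n_terms):
    # pi/4 = 4*arctan(1/5) - arctan(1/239), each arctan as a Taylor series
    def arctan_series(x, n):
        return sum((-1) ** k * x ** (2 * k + 1) / (2 * k + 1) for k in range(n))
    return 4 * (4 * arctan_series(1 / 5, n_terms) - arctan_series(1 / 239, n_terms))

print(abs(leibniz(10) - math.pi))  # ~0.1: wrong in the first decimal place
print(abs(machin(10) - math.pi))   # already near machine precision
```

Ten terms of either series “approximate 4 times the arctangent of 1,” but one of them is off by a tenth and the other by about 1e-14.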

The analogy here is that “Solomonoff induction” is (or could be like) the “arctangent ideal.”  It’s definitely a clean, intuitive way to characterize ideal reasoning, such that you can look at it and immediately say “yep, if I could do that exactly I certainly would,” just as anyone who knows trigonometry can sit down and say “yep, pi sure is 4 times the arctangent of 1.”  But if you try to characterize the quality of a method by looking at it as a truncation of Solomonoff induction, there’s no guarantee that you aren’t doing the same thing as someone who uses the Leibniz formula.  In other words, “Bayesian methods” (roughly, truncations of Solomonoff induction) may be worse approximations of ideal Bayesian reasoning than certain “non-Bayesian” methods, just as truncations of the intuitive formula for “4 times the arctangent of 1” aren’t very good approximations of the number “4 times the arctangent of 1.”

Are there reasons to think this might be true?  Well, you mention the issue of P(actually it’s something I haven’t thought of).  Solomonoff induction doesn’t have this problem, but “Bayesian methods” in the real world do, so we have to check how much this deviation from perfection costs us.

And what it costs us is basically: “if there’s a really good theory you haven’t thought of, its probability won’t go up when you observe all the great evidence in its favor, and likewise the probability of the theories you have thought of will be too high.”  (There is a paper which makes this claim formal, although I’m not entirely satisfied with the presentation.)

Now, you could justly object that this is an impossible problem to get around and that any statistical method (in the usual sense of the term) will have it.  If you have not thought of general relativity yet, then even if you have seen every observation in its favor, you won’t be able to say “I’m much more confident in GR than Newtonian gravity” (because you don’t know about GR).  At best, using the observations, you’ll become less confident in Newtonian gravity and more confident in “stuff I haven’t thought of.”  But with Bayesian methods this can sometimes go very wrong – the paper linked above gives toy examples where (say) you observe new evidence that supports a theory you haven’t thought of, and your probability for your old theory should go down to 10^(-3), but instead it stays at 0.99999.  If you want to use probabilities as degrees of belief – and, say, make decisions on their basis – this is pretty bad!  Even if you don’t have a good theory yet, you’d want to at least know not to bet super-confidently on the existing one.  You’d want to know how little you know.
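This is not the paper’s example, but a toy version of the pathology is easy to construct: leave the true hypothesis out of a coin-bias model’s hypothesis space, and the posterior becomes supremely confident in the least-bad wrong theory.

```python
import math

def log_lik(theta, heads, tails):
    # Bernoulli log-likelihood of a coin with bias theta
    return heads * math.log(theta) + tails * math.log(1 - theta)

def posterior(thetas, heads, tails):
    # Bayes over a *finite* hypothesis list with a uniform prior,
    # computed in log space for numerical stability
    logs = [log_lik(t, heads, tails) for t in thetas]
    m = max(logs)
    w = [math.exp(l - m) for l in logs]
    return [x / sum(w) for x in w]

heads, tails = 5, 95  # data that look like they came from a ~0.05 coin

# Hypothesis space that omits anything close to the true theory:
p_missing = posterior([0.5, 0.6], heads, tails)
print(p_missing)  # ~[1.0, 0.0]: supreme confidence in a badly wrong theory

# Add the "unthought-of" hypothesis and the confidence collapses:
p_full = posterior([0.5, 0.6, 0.05], heads, tails)
print(p_full)  # ~[0.0, 0.0, 1.0]
```

The near-certain posterior in the first case is exactly the “bad bet” problem: nothing in the output warns you that every hypothesis under consideration is terrible.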

(And these terrible bets would not be made by a perfect Solomonoff inductor, so this really is a case where we’re using something like the Leibniz formula – doing a naive truncation of the ideal, and ending up with a really bad approximation of the ideal.  Could we get better betting odds with another method?  Perhaps – but if so, it might not look like a “Bayesian method,” just as good series for pi may not look like a “series for 4 times the arctangent of 1.”  Strange business!  To be closer to the perfect Solomonoff inductor you may in fact need to ditch “Bayesian methods.”)


A different look at why “Bayesian methods” might not be good approximations is provided in this Cosma Shalizi post (you didn’t think I was going to let you off without one of those, did you?).  The post says a lot of stuff, but the part I’m referring to here is the idea that Bayesian updating is formally identical to the (discrete-time) Replicator equation, which models natural selection in a population of fixed types (no mutation).
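The formal identity is easy to exhibit: one Bayesian update is one step of discrete replicator dynamics, with likelihood playing the role of fitness and probability the role of population share. A sketch:

```python
def replicator_step(frequencies, fitnesses):
    # Discrete replicator dynamics: each type's share is reweighted
    # by its fitness and renormalized by the population-average fitness.
    avg = sum(f * w for f, w in zip(frequencies, fitnesses))
    return [f * w / avg for f, w in zip(frequencies, fitnesses)]

def bayes_update(prior, likelihoods):
    # Bayes' theorem: posterior proportional to prior times likelihood.
    evidence = sum(p * l for p, l in zip(prior, likelihoods))
    return [p * l / evidence for p, l in zip(prior, likelihoods)]

prior = [0.25, 0.25, 0.5]
lik = [0.9, 0.4, 0.1]  # likelihood of the observed datum under each hypothesis

# Same map, two names: "fitness" = likelihood, "population share" = probability.
print(bayes_update(prior, lik))
print(replicator_step(prior, lik))
```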

If we continue the analogy, we can think of Solomonoff as a bizarre sort of “perfect evolution” in which there is no need for mutation because every possible species already exists, and so the fittest simply take over more and more of the population.  This is indeed one possible way of characterizing “ideal evolution” (like “4 times the arctangent of 1” is a way of characterizing pi) – it would indeed do better than actual evolution, and actual evolution could be said to be good insofar as it’s approximating it.

But now, in the analogy, “Bayesian methods” are not the natural selection we know and love, but instead a type of process that says “okay, let’s not have mutation (since it isn’t there in our characterization of the ideal), but since we can’t think of all possible organisms, let’s just list all the organisms we can think of, and then let the fittest survive among those.”  This might be an okay idea, depending on what you’re trying to do, but it isn’t what got us our endless forms most beautiful, and it is surely misleading to say that this is “the only correct way to evolve organisms” because it approximates “ideal evolution.”  At least in this context, it’s obvious that your ideal is misleading you by not including mutation, and that including mutation in your practical methods might be a good idea.

(Back on the statistics side of the analogy, this would correspond to letting your hypotheses randomly mutate – that is, a genetic algorithm, with the fitness given by the conditional likelihood.  This is not generally what people have in mind when they think of “Bayesian methods,” but hey, it might actually be a better approximation of Solomonoff induction.  Shalizi speculates a little about this at the end of the post, but AFAIK he hasn’t done more work on this idea, which is a shame, because it’s really interesting.  I would be kind of surprised if no one has done this kind of thing, and would love a pointer to the literature.)
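For what it’s worth, the mutation variant can be sketched as a tiny genetic algorithm over coin-bias hypotheses, with log-likelihood as fitness. Everything here is a made-up toy, not Shalizi’s proposal: mutation lets hypotheses appear that were never in the initial “prior” population.

```python
import math
import random

def log_lik(theta, heads, tails):
    # Bernoulli log-likelihood: the fitness function
    return heads * math.log(theta) + tails * math.log(1 - theta)

def mutate_and_select(pop, heads, tails, rounds=50, sigma=0.05):
    # Tiny elitist genetic algorithm: jitter each hypothesis, then
    # keep the fittest half of parents + children.
    random.seed(1)  # deterministic toy run
    for _ in range(rounds):
        children = [min(0.99, max(0.01, t + random.gauss(0, sigma)))
                    for t in pop]
        both = pop + children
        both.sort(key=lambda t: log_lik(t, heads, tails), reverse=True)
        pop = both[:len(pop)]
    return pop

start = [0.5, 0.6, 0.7]  # no hypothesis anywhere near the truth
final = mutate_and_select(start, heads=5, tails=95)
print(final)  # the population drifts toward theta near 0.05
```

Unlike the fixed-hypothesis-list Bayes of the previous toy, this process can reach a good theory it never started with, which is exactly the mutation point.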


ETA: I realized I finished this post without really clarifying how I thought I was disagreeing with anyone.  I think where I disagree with scientiststhesis is that I think “ideal Bayes always works” is only a useful statement when you’re talking about really ideal Bayes, where your sample space includes every computable hypothesis.  Outside of that context, which never shows up IRL, we can’t even say that Bayes would be best even if we had perfectly formulated all the information we have into a prior.  If it’s not a prior over all computable hypotheses, it’s still just a truncation, and it might be a badly-behaved truncation.

(via scientiststhesis-at-pillowfort)

uncrediblehallq:

After writing my big-ass post about LessWrong, I’m realizing I am confused both about others’ and my own reaction to the Sequences.

On the one hand, it seems totally obvious that one of their main themes is attacking mainstream experts & scientific methodology. On the other, few people seem to talk about them that way.

My own experience is this: I discovered Overcoming Bias when Yudkowsky was blogging there, I *think* during the zombies sequence because I was big into the philosophy blogosphere and other people in the philosophy blogosphere were linking to it. And I just treated it as another blog for a long time, reading only the posts that looked interesting, not always paying close attention to whether it was Yudkowsky or Robin Hanson who’d written a particular post.

I gradually got more into the issue of AI. And I do mean gradually–like it was maybe in 2012 that I first spent a significant part of my time thinking about it, that was a few years after the “Sequences” ended. I did some research assistant work (on a contract basis) for MIRI, and only *after* that decided to systematically read through the Sequences, so I could better participate in conversations on LessWrong (where they were taken as gospel). But even when I did *that*, I somehow didn’t really notice the anti-mainstream-science stuff.

Maybe it was because while there was some subtle hostility to mainstream experts on LessWrong, it wasn’t really overt compared to, say, all the talk about Bayesianism. So when I got to the “Science vs. Bayesianism” stuff, I focused on the “Bayesianism” side of the equation, and came away with, “Okay, I’m mostly sticking with the methods of science, and Bayesian reasoning looks impossible to apply directly most of the time, but I can see Bayes as a useful thing to mine for heuristics, at least.”

But this is just my particular experience–hardly anybody seems to zero in on the anti-mainstream science stuff. Maybe they notice it and all make the same decision not to talk about it too much? Or are their perceptions of the Sequences too colored by the more reasonable stuff towards the beginning? I dunno. The way people talk about the Sequences is still very strange to me. Other people’s thoughts?

Personally, the anti-mainstream-science stuff has never stood out to me as an entire class because I disagree with parts of it but pretty much agree with other parts, and didn’t realize that the unobjectionable part might be there for the sake of the objectionable part.

I am with you on the MWI and cryonics bits, but I didn’t really connect them to Yudkowsky’s swipes at falsificationism, which I took as a set of correct (if kind of obvious) points made with the usual bombast.

That is, I think he has a real point that the naive falsificationist picture of “just make up whatever you want and see what you can’t disprove” cannot account for the way scientists seem to be able to continually home in on good hypotheses within the giant space of conceivable ideas (most of them bad).  We are clearly doing something right in our idea-generating process, and that means we might be able to distill what we are doing right and improve it.  This all seems true to me, although I’m not sure if anyone really believes in naive falsificationism.

But you’re right – EY seems to believe that “scientists” really do believe in naive falsificationism, and that this is what makes them “too slow” at exploring idea space (they are not really just testing arbitrary ideas but they could be doing better at testing the right ideas), and that he has a better method which has gotten him to cryonics, MWI, etc.  This is ludicrous.

But I do find the “there has to be a good way to do this” sentiment appealing, even if I don’t think EY has found it.

(Relatedly I have been meaning to read Deborah Mayo’s book Error for many years now, but I have never gotten access to a copy I could read on anything but a computer screen, and I have trouble reading books on computer screens.  I would be really curious what the LW crowd would make of it, since it seems to be about this sort of thing, and draws non-Bayesian or even anti-Bayesian conclusions.)