
There’s this thing in statistical mechanics that I’ve never really understood.  Specifically, in the application of statistical mechanics to fluids, although it seems like a fundamental issue that would also come up outside of that particular case.

(Cut for length and because not everyone is interested in this.  Pinging @bartlebyshop and @more-whales because I suspect they understand this kind of thing – but don’t feel any obligation to read this unless you want to)


automatic-ally asked: hey man, did you have a post about Nick Bostrom's Superintelligence a while back? I've been reading him recently and was hoping other people had better articulated some of the issues I've had with him. I thought I remembered seeing you post about it, but couldn't find anything when I searched.

I did, although I only read a small bit of the book, and other people’s takes (shlevy’s goodreads review, or su3’s review which is still online somewhere) are probably more useful.

My main post is here, there’s another post here and a follow-up to the latter here.  There’s some ensuing debate in the notes on all of those.

Just a quickie, riffing on @anosognosicredux’s post here:

The way I’d personally sum up Meditations on Moloch is:

“Capitalism produces horrors.  It does so in the same way as a lot of other things that produce horrors, namely, the horrific logic of natural selection: the things that appear in the world tend to appear because they have the quality of Being Better At Propagating Themselves Than Other Things That Were Around.  Companies that try to make a compromise between ‘competing’ and ‘being not terrible’ get ‘outcompeted’ by companies that just focused on ‘competing’ and thus we see the latter everywhere.

We want to fight against this, and we have various ways.  It would be great to institute Fully Automated Luxury Communism.  But a country/region/group/whatever that does Fully Automated Luxury Communism will care about ‘being not terrible’ and thus will not devote 100% of its time and resources to ‘competing.’  Something terrible which cares only about ‘competing’ will arise and ‘outcompete’ it, and then we will see the latter, not the former.  And our FALC period will be a great time for people who lived in it, but those with the bad luck to be born too late will not see it.  They might try again themselves, and succeed for a while, and then get ‘outcompeted’ again.

Horrifically, this is just the underlying logic of reality, and on a long enough timescale, it will always get you.  You will build something that cares about ‘not being terrible,’ and your great-grandchildren will look around and not see it, because it was ‘outcompeted’ by something terrible.  You are in fact the great-(great-etc.)-grandchildren of many people who have done this.

And even worse, it is getting harder to not be terrible.  Technological advance is going to unleash more and more ‘competitive’ (although inane, ugly, and destructive) forces.

There appears to be no way out of this, not without something that can oversee and guide the entire cosmos, overruling every possible manifestation of the horrific logic everywhere it might arise – that is, God.  If you don’t believe in God, you might at least believe it’s not impossible to create God.  This sounds both implausible and hubristic, and it is, but it’d be even more hubristic to think that we can ever create anything else that won’t be ‘outcompeted’ and leave our great-grandchildren in yet another shithole.  So: let’s create God.”

(N.B. Scott puts the last bit as “killing God” because in that section he’s using “God” to mean “the horrific logic,” but the point is the same)

Anyway, that’s the hip monster on the block these days, if you were wondering.

raginrayguns:

lambdaphagy:

nostalgebraist:

nostalgebraist:

vaniver:

nostalgebraist:

So: what’s the deal with the Akaike information criterion vs. the Bayesian information criterion?  “Information theory” and “Bayesianism” are both things with a lot of very devoted adherents, and here they appear superficially to give different answers.

They correspond to different priors. AIC has a bit better underlying framework (from an information theory point of view) and I believe better empirical validation.

Ah, OK.  I found this paper through Wikipedia, about AIC as Bayesian with a different (better?) prior, which looks good.

BIC has the advantage that it will converge asymptotically to the true model if the true model lies in the set of models being fitted, although it’s disputable how important this is.  And BIC can be derived using a minimum description length approach (can you get AIC this way too?).
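For concreteness, here’s a minimal sketch of the two criteria in action (Python with numpy; the Gaussian-noise assumption is mine, chosen so the maximized log-likelihood has a closed form).  It ranks polynomial fits to synthetic data whose true generating model is quadratic:

```python
import numpy as np

def gaussian_log_likelihood(residuals):
    """Maximized Gaussian log-likelihood of a least-squares fit."""
    n = len(residuals)
    sigma2 = np.mean(residuals ** 2)  # MLE of the noise variance
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

def aic(log_lik, k):
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    return k * np.log(n) - 2 * log_lik

rng = np.random.default_rng(0)
n = 200
x = np.linspace(-1, 1, n)
y = 1.0 + 2.0 * x - 1.5 * x ** 2 + rng.normal(0, 0.3, n)  # true model: degree 2

results = {}
for degree in range(6):
    resid = y - np.polyval(np.polyfit(x, y, degree), x)
    ll = gaussian_log_likelihood(resid)
    k = degree + 2  # polynomial coefficients plus the noise variance
    results[degree] = (aic(ll, k), bic(ll, k, n))
    print(degree, results[degree])
```

Note that BIC’s penalty is k·ln(n) per parameter versus AIC’s flat 2k, so BIC punishes complexity harder whenever n > e² ≈ 7.4 — which is where the asymptotic-consistency claim comes from.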

One of the things I am wary of here is the sense that “information theory is magic” – e.g. in the paper linked above:

Their celebrated result, called Kullback-Leibler information, is a fundamental quantity in the sciences […] Clearly, the best model loses the least information relative to other models in the set […]

Using AIC, the models are then easily ranked from best to worst based on the empirical data at hand. This is a simple, compelling concept, based on deep theoretical foundations (i.e., entropy, K-L information, and likelihood theory).

Maybe I just don’t understand information theory, but I’m confused why I should care that the K-L divergence is “deep” and “fundamental,” here.  The question at hand is how to select a model based on some sort of estimate of how the model will generalize from the training set.  In practice I hear people justify using things like AIC by saying “well, obviously, you want the most information,” where “most information” is just a verbal tag we’ve associated with the K-L divergence and I’m not sure what mathematical weight I should give to it.  If AIC does well, and this is because it is based on information theory, I would like to understand this in a nonverbal way – what property of K-L divergence made it a good choice here, ignoring suggestive words like “information”?

Reblogging because I’m really curious about this – I’ve been aware of information theory for a long time but I’ve never been sure how it justified choices like this, and I feel like I must be just missing something major / “obvious.”

@su3su2u1, @lambdaphagy, @raginrayguns, et al.?

Oops, didn’t have a chance to get to this earlier.  Others have already chimed in with sensible responses, but here’s another way to think about it non-verbally, especially if you want to ask “why KL divergence in the first place?” rather than “why AIC?”

KL divergence arises naturally when you ask the question “what does it mean for two distributions in a parametric family to be ‘close’ to one another?”  Take univariate Gaussians parametrized by mu and sigma, and consider each measure as a point in a 2-D parameter space.  Consider some plausible things we’d like to say about this space.  First, for any two measures (mu1, sigma1) and (mu2, sigma2), the distance between them should vary with the difference between mu1 and mu2: the further apart the means are, the “further apart” the distributions are.  But secondly, as sigma1 and sigma2 grow larger, the difference in the means should matter less.  As sigma goes to infinity, the value of the mean washes out and there is really only one Gaussian distribution left, with its density smeared out over the entire real line.

If we think about what this means for the geometry of the parameter space, we realize that it’s not Euclidean.  In fact it’s hyperbolic: we’ve got a half-plane that collapses to a single point as sigma goes to infinity.  This motivates us to ask what the appropriate metric tensor is.  It turns out (and here you must imagine my hands waving hard enough to achieve lift-off) that if you take the Hessian of the KL divergence with respect to the parameters, you get the Fisher information matrix and that does the job quite nicely.  The KL divergence is then, roughly, measuring our surprisal about the samples coming off of our distribution of interest as we move through parameter space.

(This is backwards from the usual presentation, and I’m not sure what you’d get if you went through this exercise with some other notion of distance between distributions, like L1, L2 or TV.  KL divergence has so many other useful properties that I would expect the Fisher-Rao metric to be canonical in some sense, but I don’t know which.)
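For what it’s worth, the hand-waving about the Hessian checks out numerically in the Gaussian case.  A sketch (the closed-form KL between univariate Gaussians, differentiated twice with finite differences; the base-point values are arbitrary):

```python
import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    """Closed-form KL( N(mu1, s1^2) || N(mu2, s2^2) )."""
    return np.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2 * s2 ** 2) - 0.5

mu0, s0 = 1.0, 2.0
h = 1e-3

def f(dmu, ds):
    # KL from the base point to a perturbed point in parameter space
    return kl_gauss(mu0, s0, mu0 + dmu, s0 + ds)

# Finite-difference Hessian of the KL divergence at the base point
cross = (f(h, h) - f(h, -h) - f(-h, h) + f(-h, -h)) / (4 * h ** 2)
H = np.array([
    [(f(h, 0) - 2 * f(0, 0) + f(-h, 0)) / h ** 2, cross],
    [cross, (f(0, h) - 2 * f(0, 0) + f(0, -h)) / h ** 2],
])

fisher = np.diag([1 / s0 ** 2, 2 / s0 ** 2])  # Fisher info of N(mu, sigma)
print(H)
print(fisher)
```

The Hessian comes out matching the familiar Fisher information of N(mu, sigma) in (mu, sigma) coordinates, diag(1/sigma², 2/sigma²).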

Okay, I tried to think this through a bit with L2 distance, and I think I’m dropping several levels in HabitRPG as a consequence, I really need to be writing a fellowship application, anyway….

so here’s the formula I got for L2 distance between two normals

[image: formula for the L2 distance between two normal densities]

So, as for the properties you described.

  1. Increases with |mu1 - mu2|. Yes it does.
  2. Rate of increase with |mu1-mu2| is lower with higher sigmas. I set sigma1=sigma2 and plotted it, and yup, the plot is less steep when sigma1=sigma2 is higher.
  3. Is zero when the sigmas are infinity. Yup.
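Those three checks can be reproduced numerically from the closed-form squared L2 distance between two normal densities (a sketch — this should agree with the formula in the image above):

```python
import numpy as np

def l2sq_normals(mu1, s1, mu2, s2):
    """Squared L2 distance between the densities of N(mu1, s1^2) and N(mu2, s2^2).

    Uses the identity: the integral over x of N(x; a, s^2) * N(x; b, t^2)
    equals the normal density with variance s^2 + t^2 evaluated at a - b.
    """
    def overlap(a, sa, b, sb):
        v = sa ** 2 + sb ** 2
        return np.exp(-((a - b) ** 2) / (2 * v)) / np.sqrt(2 * np.pi * v)
    return (overlap(mu1, s1, mu1, s1) + overlap(mu2, s2, mu2, s2)
            - 2 * overlap(mu1, s1, mu2, s2))

# 1. Increases with |mu1 - mu2| (sigmas fixed)
print(l2sq_normals(0, 1, 1, 1), l2sq_normals(0, 1, 2, 1))
# 2. The same mean gap counts for less when the sigmas are larger
print(l2sq_normals(0, 3, 1, 3))
# 3. Goes to zero as the sigmas blow up, whatever the means are
print(l2sq_normals(0, 1e6, 5, 1e6))
```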

So…. I guess the same argument… applies? You lost me at hyperbolic geometry ‘cause idk what that is. But definitely L2 fits the picture you painted as well as KL.

There’s a difference though, which is that KL distance between two normals is a convex function of |mu1-mu2|, right? The bigger the difference already is, the more increasing it counts? L2 distance on the other hand is not. So, like, if we set mu1 to 0, and consider positive mu2, then d/dmu2 KL is an increasing function of mu2. But d/dmu2 L2 looks like this:

[image: plot of d/dmu2 of the L2 distance, which rises and then falls as mu2 increases]

so, what’s that all mean? idk.
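One way to make the convexity comparison concrete (a sketch, with mu1 fixed at 0 and equal sigmas, so KL reduces to mu2²/(2 sigma²)): the derivative of KL in mu2 grows without bound, while the derivative of the squared L2 distance peaks around mu2 = sqrt(2)·sigma and then decays.

```python
import numpy as np

sigma = 1.0
mus = np.linspace(0.1, 5.0, 50)   # mu2 values, with mu1 fixed at 0
h = 1e-5

def kl(mu2):
    # KL between N(0, sigma^2) and N(mu2, sigma^2): just mu2^2 / (2 sigma^2)
    return mu2 ** 2 / (2 * sigma ** 2)

def l2sq(mu2):
    # Squared L2 distance between the same pair of densities
    v = 2 * sigma ** 2
    self_overlap = 1 / (2 * sigma * np.sqrt(np.pi))
    cross = np.exp(-mu2 ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)
    return 2 * self_overlap - 2 * cross

# Central-difference derivatives with respect to mu2
d_kl = (kl(mus + h) - kl(mus - h)) / (2 * h)
d_l2 = (l2sq(mus + h) - l2sq(mus - h)) / (2 * h)

# d_kl is monotonically increasing; d_l2 peaks and then decays
print(mus[np.argmax(d_l2)])
```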

This is all very interesting.  Another property you’d want is invariance under general changes of variables, which L2 doesn’t have, but K-L has (the scaling cancels in the fraction, and outside the fraction it gets cancelled by dx).
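That invariance can be checked numerically: pushing two Gaussians through y = exp(x) turns them into lognormals, and the KL divergence between the lognormals should equal the KL between the original normals.  A sketch using simple trapezoidal quadrature (grid limits and parameter values are arbitrary):

```python
import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    """Closed-form KL( N(mu1, s1^2) || N(mu2, s2^2) )."""
    return np.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2 * s2 ** 2) - 0.5

def lognormal_pdf(y, mu, s):
    # Density of exp(X) where X ~ N(mu, s^2)
    return np.exp(-((np.log(y) - mu) ** 2) / (2 * s ** 2)) / (y * s * np.sqrt(2 * np.pi))

mu1, s1, mu2, s2 = 0.0, 1.0, 0.5, 1.5

# Trapezoidal quadrature of KL between the two pushed-forward (lognormal) densities
y = np.linspace(1e-6, 200.0, 2_000_000)
p = lognormal_pdf(y, mu1, s1)
q = lognormal_pdf(y, mu2, s2)
integrand = p * np.log(p / q)
kl_transformed = np.sum((integrand[1:] + integrand[:-1]) / 2) * (y[1] - y[0])

print(kl_transformed, kl_gauss(mu1, s1, mu2, s2))
```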

(via raginrayguns)

@reddragdiva​ linked (here) to a post about something called Perceptual Control Theory, and how it ostensibly conflicts with both (1) Bayesianism and (2) the “stimulus-response” view of behavior.

The post claims that the stimulus-response theory is refuted by tasks in which people respond to external stimuli in a way that continually corrects for outside disturbances, like a thermostat does.  One finds that the pattern of behavior (over time) is highly correlated with the disturbance, but has a very low correlation with the stimulus itself (composed, at any time, of the disturbance plus the person’s correction).

The post’s author quotes numbers that appear to be Pearson correlation coefficients, but then makes the startling jump to mutual information (which can measure nonlinear dependence as well):

So in a control task, the “stimulus” – the perception – is uncorrelated with the “response” – the behaviour. To put that in different terminology, the mutual information between them is close to zero. But the behaviour is highly correlated with something that the subject cannot perceive.

This claim seemed startling to me, and I got kind of nerd-sniped by it.  (I mean, my air conditioner is a control system like this, and presumably there’s some dependence between its responses and the current temperature?? It shuts off when the temperature gets low enough!)  And I concluded that the statement above didn’t make sense.

The post includes a link to a java demo where you can do such a task yourself.  A line moves around on the screen and you try to move your mouse to keep it fixed at a reference point.  At the end, you get a plot like this

[image: demo output plot showing traces C (red), D (blue), M (green), and the ideal mouse position (black)]

The red trace, C, is where the line was relative to the reference point (my goal was to keep it at zero).  The blue trace D was the imposed offset I was trying to correct for, the green trace M is the position of my mouse.  (The black trace is where M would be if I’d done perfectly.)

Correlations (I assume Pearson – nothing about mutual information on the page, anyway) are listed along the top.  The correlation between M and D is nearly -1, indicating I was doing a good job counteracting the disturbance.  OTOH, the correlation between C and M is only 0.198, which the demo page says is surprising:

When you are able to control the distance between cursor and target, keeping that controlled variable equal to zero, you will see that the cursor-mouse (C-M) correlation is rather low (usually between -.2 and .2). This is surprising if you think of cursor movements as the stimulus for the mouse movements (the response). All you can see in this task is cursor movement, which is at all times a combined result of disturbance and mouse movements. Nevertheless, mouse movements are strongly (negatively) correlated with the invisible disturbance rather than with the visible cursor movements.

This is also what the blog post means about stimulus and response being unrelated.

Does this interpretation make sense?  First, note that I definitely felt like I was responding to an immediate stimulus when I was playing the game – when the line moved right, I moved left, and vice versa.  Describing this in terms of the above variables is a little difficult, though.  When C (cursor position) changed, M (my mouse position) changed in response.  But C itself is the sum of M and D, so my own movements influence it, and you don’t want a measure of the relationship that thinks my own movements are responses to themselves.

What I was actually responding to was not where the cursor was, but how fast the cursor seemed to be moving when you subtracted out my own movements.  This is simply the time derivative of D, and you could model my behavior by writing dM/dt = -dD/dt.  But in the interpretation above, this is supposed to be something magical, because supposedly I “can’t see” D.  But of course I can see D, or rather its time derivative – it is precisely what I’m seeing when I say “hey, the cursor’s drifting off to the left now, time to move right.”

So what does the C-M correlation actually represent?  If I’m doing well, C is close to zero.  M, on the other hand, spans a large range.  Looking at the image, we see that for one part of the game it was positive (canceling negative D) and for the other part it was negative. 

A strong negative C-M correlation would then mean that I tended to err on the left side of correct when my mouse was on the right of center, and vice versa for right/left.  This is a sort of tendency one could conceivably have, but it has nothing to do with stimulus and response.  When my mouse was far to the right, say, this was not because I thought “ah, the Stimulus is left, my Response will be right!”, it’s because I’d drifted over to the right as my previous motions were summed up.  (This would be the integral part of a PID controller.  The blog poster mentions PID controllers, but doesn’t seem to have realized how they refute what they’re saying.)
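The point can be made with a toy simulation (a sketch; the gain, filter constant, and random seed are arbitrary).  A controller that responds only to the visible cursor C, via simple proportional correction, still ends up with a near-zero C-M correlation and a strongly negative M-D correlation:

```python
import numpy as np

rng = np.random.default_rng(1)
n_steps, dt, gain = 20_000, 0.01, 5.0

# Slowly drifting random disturbance: heavily low-pass-filtered white noise
noise = rng.normal(size=n_steps)
D = np.zeros(n_steps)
for t in range(1, n_steps):
    D[t] = 0.999 * D[t - 1] + 0.05 * noise[t]

# The "subject": pure proportional control, responding ONLY to the visible cursor C
M = np.zeros(n_steps)
C = np.zeros(n_steps)
for t in range(1, n_steps):
    C[t - 1] = M[t - 1] + D[t - 1]
    M[t] = M[t - 1] - gain * C[t - 1] * dt
C[-1] = M[-1] + D[-1]

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

print("corr(M, D) =", corr(M, D))   # strongly negative
print("corr(C, M) =", corr(C, M))   # near zero
```

Even though M is generated at each step purely from the current C, the cursor-mouse correlation comes out small while the mouse tracks the disturbance it “can’t see” almost perfectly – which is exactly the demo’s pattern, so the low correlation can’t be evidence against stimulus-response.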

Nonetheless, the blog poster seems quite confident in their radical conclusions, which apparently overturn much of mainstream psychology:

This is 180 degrees around from the behavioural stimulus-response view, in which you apply a stimulus (a perception) to the organism, and that causes it to emit a response (a behaviour). I shall come back to why this is wrong below. But there is no doubt that it is wrong. Completely, totally wrong. To this audience I can say, as wrong as theism. That wrong. Cognitive psychology just adds layers of processing between stimulus and response, and fares little better.

Ah, good old Less Wrong!

Another thought I had while driving around my old neighborhood today: “oh, it’s that house.  I don’t know who lives there and have never been inside IRL, but a lot of weird shit has happened there in my dreams”