
napoleonchingon:

A better “Mathematician’s Apology” 

sorry about the math

Seriously, that would be a better book than “A Mathematician’s Apology” is currently.

#this is a mathematician’s apology hateblog

(via sungodsevenoclock)

I was talking to someone yesterday about my usual objections to representing beliefs/credences as probabilities, specifically the stuff about how IRL you don’t fully know the sample space and event space, and probability theory doesn’t tell you what to do about this.  

For instance, if you encounter an argument that “A implies B” – where A and B are the kind of ideas which you’d be assigning credences to – and the argument convinces you, you now know that A (as a set in the event space) is a subset of B.  You didn’t know that before.  Yet you had some concept of what “A” and “B” were, or you wouldn’t have gotten anything out of the argument – you needed to know which sets in your event space corresponded to the ones in the argument.  But although you knew about those sets, you didn’t know about that subset relation.  How do you “update” on this information, or formalize this kind of uncertainty at all?  It’s conceivable that you could and it would be very cool to do it, but probability theory itself doesn’t include this case – which to me is an argument (one of many) that probability theory is not the right set of tools for formalizing belief and inference.
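To make that concrete, here’s a toy sketch (made-up numbers, nothing deep): two different ways of “repairing” your credences after learning that A is a subset of B, both perfectly coherent, and probability theory itself doesn’t pick between them.

```python
# Toy illustration with made-up numbers: credences over six "worlds", with
# events A and B as sets of worlds.  You then become convinced that "A implies
# B", i.e. that the worlds in A-but-not-B are impossible.  The calculus says
# the repaired credences must put zero mass there, but not where it should go.

worlds = range(6)
prior = {0: 0.25, 1: 0.25, 2: 0.20, 3: 0.10, 4: 0.10, 5: 0.10}
A = {0, 1, 2}          # worlds where A holds
B = {1, 2, 3}          # worlds where B holds
impossible = A - B     # worlds ruled out by the argument: {0}

# Repair 1: renormalize over the remaining worlds (treat it like conditioning).
Z = sum(p for w, p in prior.items() if w not in impossible)
repair1 = {w: (0.0 if w in impossible else p / Z) for w, p in prior.items()}

# Repair 2: shift the orphaned mass onto A-and-B instead (keeps P(A) fixed).
orphan = sum(prior[w] for w in impossible)
AB = A & B
repair2 = {w: prior[w] for w in worlds}
for w in impossible:
    repair2[w] = 0.0
for w in AB:
    repair2[w] += orphan * prior[w] / sum(prior[v] for v in AB)

# Both repairs are coherent and both satisfy P(A \ B) = 0, but they disagree:
print(sum(repair1[w] for w in A))   # P(A) drops to 0.6
print(sum(repair2[w] for w in A))   # P(A) stays at 0.7
```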

Anyway, the person I was talking to mentioned this recent (Sept. 2016) preprint from MIRI, “Logical Induction,” which tackles the problem just mentioned.  (There’s also an abridged version, 20 pp. instead of 130 pp.)  I have not read it yet, beyond the first few pages, but it looks cool.  (Reportedly there’s a lot of cool math in there but the method for doing the thing is absurdly inefficient, doubly exponential time complexity or something.)

My understanding is that MIRI people want to formalize “logical uncertainty” in order to make TDT work (because TDT invokes the notion without formalizing it), not to make Bayesianism/Jaynesianism work.  But it’s refreshing to see people interested in this kind of problem, because, from my perspective, it is the sort of new math Bayesians/Jaynesians would need to have in order to make their perspective compelling.  There’s this giant looming problem with trying to apply results about “ideal Bayesians” / “Jaynes’ robot” to finite beings that keep learning new things about their sample and event spaces, and I would have expected people to notice this long ago and get to work developing new formalisms to deal with it.  And maybe that’d result in some super-powerful reasoning method, or maybe it’d result in something useless because it turns out the computational complexity is necessarily very high, but in any event there’d be cool math and an interesting line of thought to follow.

(I keep saying there is very little work about this stuff out there.  Maybe I’m wrong?  I haven’t been able to find it, in any event.)

ETA: this also doesn’t have the problem @jadagul identified with earlier MIRI papers, that they read like crosses between papers and research proposals – they prove a whole bunch of different properties/implications of the criterion they define at the start, and I’d imagine there are at least several Least Publishable Units in there.

[image] Cool plot that resulted when I was testing something

*sits here for 30 minutes trying to think up a joke about how the large cardinals in set theory are like the star wars expanded universe*

Just remembered a conversation with my old undergrad thesis adviser (early 2010), where I was getting close to done, and he remarked “you may be discovering a certain compulsive element in yourself”

That was very much the sort of thing that sounded natural and non-patronizing coming out of his mouth (physics prof in his 70s, with the reputation of being a “deep thinker” and the elderly gravitas to match), and it was also clearly true.  I was thinking about this as I wrote more labored exposition for my graduate thesis.  I was thinking my writing probably sounded like it did in my undergrad thesis (7 years ago!), and on a whim I re-read some of my undergrad thesis, and it did.  Because I always want to clarify everything conceptually, spelling out “there is this and then there is this and here’s an analogy that helped me understand how they are different,” saying why we’re doing X and not X’ by launching into a list of all the ways X and X’ differ or might differ, breaking everything down into discrete distinct cells with explicitly explained boundaries between each and its neighbors.

I’m not a wizard with mathematical manipulations or with vaulting abstraction, and I’m not especially “creative” in mathy stuff, either.  If I am good at anything it is taking the stuff people have already said, thinking over it for way more time than anyone else would need to understand it “for practical purposes,” and then spitting out a lovingly compulsive account of the stuff they said as a formal structure – one presented in long-winded prose, not in actual specs.

I’m just not very good at attaining the “for practical purposes” sense of any technical concept that people in the mathematical sciences are expected to attain – if I try to, I get a concept formed the wrong way, and I make bad inferences from it, and I end up the dim bulb in the room.  (I do try, and that is what happens.)  Either I don’t get it, or I get it, and once I get to the latter state, I can at least explain the thing in a way that doesn’t make me feel like a dim bulb.  A way I’m proud of.  I explain the fuck out of those things.

In the folder for my undergrad thesis, I opened up a text file, and it was full of many earnest notes to myself, categorizing the papers I was reading (in terms of “software,” “problem,” and “sophistication,” the latter referring to a three-point, precisely defined scale for model complexity I had invented earlier in the notes), trying to reconcile statements I’d read in papers or textbooks that seemed contradictory (I have since learned that academics sometimes write things that are not strictly true, because the intended reader will “know what they mean”), noting down conceptual things I didn’t understand and then the solutions I’d figured out.

I forgot just how much systematic thought I put into that thing!  I wonder if anyone ever read it.  (I’m not whining about this; I knew at the time that no one reads undergrad theses.)  I mean, I didn’t even finish doing the scientific thing I actually set out to do anyway.  But the exposition of the background material (which comprises 80% of the text) was good, in that compulsive way!  When I had to, say, describe the models I’d been reading about in the literature, I did invent a precise definition of the “type of model” they were and give it a made-up name and corresponding acronym, but you know, that was a sensible thing to do!  It helped!

Just in the course of writing this post I remembered that, when I’d been horribly confused by the very informal textbook for a first-year graduate course, I briefly took to writing up my own set of TeXed notes, written like a textbook with an authorial voice (“so future students won’t have to be confused,” I thought bombastically).  I only got 11 pages in before giving up, and the thing is full of stuff like

Scaling arguments are formally similar to perturbation theory arguments. The difference is that in a scaling argument, we do not consider small deviations from an “unperturbed problem,” but instead cause certain terms in a problem to become small by pulling out factors that reflect estimates of the size of each term in the problem. Typically there will be no identifiable “basic” state relative to which “small deviations” are occurring.

(Later, “the scaling procedure” is specified as a 6-point list.)
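(A concrete example of the kind of thing that passage is gesturing at, using the standard advection-diffusion equation rather than anything from those notes:)

```latex
% Scaling, not perturbing: start from the advection-diffusion equation
\partial_t c + u\,\partial_x c = \kappa\,\partial_x^2 c ,
% pull out size estimates x = L\hat{x},\; t = (L/U)\hat{t},\; u = U\hat{u},\; c = C\hat{c},
% and the equation becomes
\partial_{\hat{t}}\hat{c} + \hat{u}\,\partial_{\hat{x}}\hat{c}
    = \frac{1}{\mathrm{Pe}}\,\partial_{\hat{x}}^{2}\hat{c},
\qquad \mathrm{Pe} = \frac{UL}{\kappa}.
% The diffusive term is now "small" when Pe >> 1, not because it is a small
% deviation from some basic state, but because the size estimates made it so.
```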

Of course, there’s a certain group of people in this territory who love to insist on strict conceptual clarity – the pure mathematicians.  But I like applications (and, TBH, am just not that abstract of a thinker).  You can get books and papers that present a given application “for mathematicians,” but with those the conceptual clarity brings a bunch of other baggage along for the ride.

I’m starting to feel more and more affinity for the programming/CS world, because it tends to be compulsive about concepts and their boundaries the way I am – in my case just by nature, but in programming you need to be that way, both (trivially) because you have to state things precisely to program a computer, but (more interestingly) because bad abstractions quickly start causing real consequences and real headaches for real people, and “carelessly conflating two very similar categories” is not a pedant’s pet issue but the kind of thing that actually makes things break.

nostalgebraist:

“Then we see that”

More fun stuff from the same book:

[image: “Then we see that”]

napoleonchingon:

nostalgebraist:

There’s this thing in statistical mechanics that I’ve never really understood.  Specifically, in the application of statistical mechanics to fluids, although it seems like a fundamental issue that would also come up outside of that particular case.

(Cut for length and because not everyone is interested in this.  Pinging @bartlebyshop and @more-whales because I suspect they understand this kind of thing – but don’t feel any obligation to read this unless you want to)

Keep reading

Am probably misunderstanding this, so please be cautious. Also, am not a theorist and understand little to nothing about numerical modelling.

But. In the second approach, is it actually true that the ensemble of microstates is specified beforehand? Aren’t you optimizing the ensemble of microstates to get maximum entropy given the (let’s say) energy expectation value constraint? If you started with your microstates already set and they were non-interacting, they’d just be propagating according to dynamical laws and you wouldn’t be optimizing anything. You’d just specify all the individual microstates beforehand and just watch them evolve and you’re not going to get any information that you didn’t put in.

But if you have a collection of microstates, you can’t look at all the microstates at once, so you look at individual microstates from the ensemble of microstates you have and then try to get a probabilistic picture of the entire ensemble. This would be like having one tank in your lab that you occasionally can take really good photos of, and trying to generalize to what is happening in the tank in general from those photos.

If I understand your last sentence correctly, it’s a description of ergodicity: the time average of a function over a single trajectory is equal to its ensemble average (by which we mean the expectation value of the function over the invariant measure for the system).  This is often taken as a postulate in these kinds of papers, and it is one justification for thinking about hypothetical ensembles even if you only ever have one copy of the system in reality.
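(Toy illustration of that postulate, with a system that has nothing to do with fluids: the logistic map x -> 4x(1-x) is ergodic, with invariant density 1/(pi*sqrt(x(1-x))) on (0,1), so time averages along one long trajectory should match averages over that density, which are 1/2 for x and 3/8 for x^2.  In floating point this is only approximate, but it gets the idea across.)

```python
import numpy as np

# Toy ergodicity check: time-average x and x^2 along one long trajectory of
# the logistic map and compare against the invariant-density averages
# (1/2 and 3/8 respectively).
x = 0.3141592            # generic initial condition (any non-special seed works)
n_steps = 1_000_000
samples = np.empty(n_steps)
for i in range(n_steps):
    x = 4.0 * x * (1.0 - x)
    samples[i] = x

print("time average of x  :", samples.mean())         # ~0.5
print("time average of x^2:", (samples ** 2).mean())  # ~0.375
```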

What I am questioning here is the process used to determine the invariant measure (which we then use to compute the expectation values).  An invariant measure is any measure (i.e. probability distribution, roughly) that is constant under the dynamics, and generally it won’t be unique.  For instance, if the dynamics conserves energy, and has the Liouville property (preserves phase space volume), then any measure that depends on the energy alone is an invariant measure: it’s constant on every energy surface, and the dynamics just carry volume around on energy surfaces.
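(Spelling that out, since it’s the kind of thing I like spelled out: for a divergence-free flow that conserves E, any density of the form rho(x) = F(E(x)) satisfies the stationary Liouville equation.)

```latex
% Why any function of the energy gives an invariant measure, assuming
% \nabla\cdot v = 0 (Liouville) and v\cdot\nabla E = \dot{E} = 0 (conservation),
% where \dot{x} = v(x) is the flow and \rho(x) = F(E(x)):
\nabla\cdot\big(\rho\,v\big)
  = \rho\,\nabla\cdot v + v\cdot\nabla\rho
  = 0 + F'(E)\,\big(v\cdot\nabla E\big)
  = 0 ,
% so the measure \rho(x)\,dx is carried into itself by the flow.
```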

So the Gibbs (“macrocanonical”) measure, which has probability density proportional to exp(-const*E), happens to be an invariant measure.  But so is the “microcanonical” measure, which puts all the probability mass on one energy sphere and has zero probability density everywhere else.  Or, pick any function f(E) such that f(E(x)) integrates to one over phase space, make that the probability density, and you’ve got an invariant measure.  If you need an invariant measure that also has a certain expectation value <E>, fine, just rescale your f appropriately (this is what the constant in the Gibbs measure does).

Now, if you have a single copy of the system with a known energy E_0, and you’re trying to use ergodicity to predict time averages over its trajectory, then it seems clear to me that only one of these measures will give you exactly accurate results: the “microcanonical” one.  That’s because the actual system never has any energy value besides E_0, so any distribution that puts mass on any other energy surface will give you some wrong answers.  For instance, define g to be a function of the state which is 1 when E > E_0 and 0 otherwise.  The time average of g is zero, because the energy is always E_0, never greater.  But the expectation of g with respect to the Gibbs measure, say, is positive (if perhaps small).
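(Here’s that g example with numbers attached, for a toy system of my own choosing: n quadratic degrees of freedom with E = (1/2) * sum of x_i^2 and the constant in the Gibbs density set to 1, so that Gibbs samples are just independent standard normals.)

```python
import numpy as np

# Gibbs expectation of g = 1{E > E_0} for n quadratic degrees of freedom,
# where E_0 is chosen to be the Gibbs-mean energy n/2.  The real system's
# energy is always exactly E_0, so its time average of g is 0.
rng = np.random.default_rng(0)
n, n_samples = 10, 1_000_000
x = rng.standard_normal((n_samples, n))   # Gibbs samples: each x_i ~ N(0, 1)
E = 0.5 * (x ** 2).sum(axis=1)
E0 = n / 2.0

print("Gibbs expectation of g:", (E > E0).mean())  # ~0.44, decidedly nonzero
print("time average of g     :", 0.0)              # the energy never exceeds E_0
```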

Nonetheless, the Gibbs measure often works very well as an approximation to the microcanonical measure.  Specifically, if you just look at the marginal distribution of some finite set of state variables (say, if the variables are x_1 through x_n, we look at the marginal distribution of x_1 through x_k, where k < n), then in the limit as n goes to infinity (with k fixed), this marginal distribution is the same for both measures.  There are reasons why this works, which involve large deviations theory and which I only half understand (from looking over these notes when I was trying to understand this stuff years ago).  And the reasons do say that maximizing entropy is the right thing to do to get this property (I think).
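(Same toy system as above: for E = (1/2) * sum of x_i^2, the microcanonical measure at E_0 = n/2 is the uniform measure on the sphere of radius sqrt(n), and the Gibbs measure with matching <E> is the standard Gaussian.  You can watch the single-coordinate marginals converge to each other as n grows, e.g. via the fourth moment of x_1, which is 3n/(n+2) microcanonically and 3 under Gibbs.)

```python
import numpy as np

# Equivalence-of-ensembles check: compare E[x_1^4] under the microcanonical
# measure (uniform on the sphere of radius sqrt(n)) and under the Gibbs
# measure (standard Gaussian).  The two agree only as n -> infinity.
rng = np.random.default_rng(0)
n_samples = 100_000

for n in (3, 30, 300):
    z = rng.standard_normal((n_samples, n))
    micro = np.sqrt(n) * z / np.linalg.norm(z, axis=1, keepdims=True)  # uniform on sphere
    gibbs = rng.standard_normal(n_samples)
    print(f"n={n:4d}  micro E[x_1^4] = {np.mean(micro[:, 0] ** 4):.3f}"
          f"  gibbs E[x_1^4] = {np.mean(gibbs ** 4):.3f}")
# exact microcanonical values are 3n/(n+2): 1.80, 2.81, 2.98; the Gibbs value is 3.
```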

However, none of this means that the Gibbs measure is uniquely correct, or that it is justified by some principle of indifference, or that it is the “least biased” or (god help us) “most probable” probability distribution consistent with physics, or that it works because it maximizes entropy and physical systems tend to maximize entropy – which are all things that people say about it all the time.  It is good only insofar as it approximates the microcanonical measure, and (as far as I can tell) if you can work with the latter you always should.  As Ellis says in those notes:

Among other reasons, the canonical ensemble was introduced by Gibbs in the hope that in the limit n → ∞ the two ensembles are equivalent; i.e., all macroscopic properties of the model obtained via the microcanonical ensemble could be realized as macroscopic properties obtained via the canonical ensemble. However, as we will see, this in general is not the case.

(In other words, the whole point is approximating the microcanonical measure.)

Treating the “entropy maximization with expectation value constraints” procedure as the axiomatically correct, “least biased” thing to do would lead one to conclude that the Gibbs measure is better than the microcanonical measure – for instance, that the function g described earlier should have a nonzero expectation, and that it is somehow “biased” to say otherwise.
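(For reference, since I keep alluding to it: the procedure is to maximize the entropy functional subject to normalization and the <E> constraint, and the standard Lagrange-multiplier calculation is what hands you the Gibbs form.)

```latex
% Maximize S[\rho] = -\int \rho \ln\rho \, dx subject to \int \rho \, dx = 1
% and \int \rho E \, dx = \langle E \rangle, with multipliers \lambda and \beta:
\frac{\delta}{\delta\rho}\!\left[ -\!\int\!\rho\ln\rho\,dx
    - \lambda\!\left(\int\!\rho\,dx - 1\right)
    - \beta\!\left(\int\!\rho E\,dx - \langle E\rangle\right)\right]
  = -\ln\rho - 1 - \lambda - \beta E = 0
\;\;\Longrightarrow\;\;
\rho(x) \propto e^{-\beta E(x)} ,
% i.e. the Gibbs measure, with \beta fixed by the \langle E \rangle constraint.
```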

I guess if this is all a way to get tractable approximations to the microcanonical distribution, which is in turn just the distribution that says “we don’t know anything except that the energy (or whatever) has this value” – then I guess that’s fine by me.  But the rhetoric surrounding it all, I guess, is frustrating and confusing.  Maximizing entropy makes physical sense if you’re comparing different macrostates and asking which is more likely, but the Gibbs measure just so happens to “maximize entropy” for a fixed macrostate, which then leads people to say it’s the “most probable” distribution because they’ve mentally associated “most probable” with “maximum entropy,” and then Jaynes comes along and says that it’s also the least biased, as if it captures the principle of indifference, when if you know the macrostate the principle of indifference is expressed in the microcanonical measure, not the Gibbs measure … argh!!  It feels like all this terminology was invented by some sadist to maximize confusion.

(via sungodsevenoclock)

I’m working on the introduction to my thesis now, and it’s actually been really fun, because I’ve had to do a super-speedy tour of a bunch of tangentially relevant literature (much of which I’d only read a very long time ago, if at all), and I have a much better understanding now of various things that confused me on my first encounters with the same material.  Some of that is more “scientific maturity” or w/e, but some of it is just that if you breeze through enough papers in a short enough time, you get a clearer sense of how vague some of the terminology is.

Some terms get used in different ways by different authors, and if you’re ingesting papers at a slower rate then it’s easy to chalk this up to some confusion of your own.  But when you blast through a bunch of papers and get them simultaneously in tabs/windows you start to notice that, no, one author clearly means something different by this term than some other author – once you place a paragraph from one paper next to a paragraph from the other, it’s inarguable.  In one case today, I even found an author carping about the ambiguity of one of these terms in a footnote!  And it sure feels good to know that sometimes it’s not me, it’s them.