What’s a mathematical phantom? According to Wraith, it’s an object that doesn’t exist within a given mathematical framework, but nonetheless “obtrudes its effects so convincingly that one is forced to concede a broader notion of existence”.

Like a genie that talks its way out of a bottle, a sufficiently powerful mathematical phantom can talk us into letting it exist by promising to work wonders for us. Great examples include the number zero, irrational numbers, negative numbers, imaginary numbers, and quaternions. At one point all these were considered highly dubious entities. Now they’re widely accepted. They “exist”. Someday the field with one element will exist too!

John Baez

This Week’s Finds in Mathematical Physics, Week 259

(via eka-mark)

Oh cool it’s a name for the things I talked about here!


Everyone who takes basic statistics has it drilled into them that “correlation is not causation.” (When I took psych. 1, the professor said he hoped that, if he were to come to us on our death-beds and prompt us with “Correlation is,” we would all respond “not causation.”) This is a problem, because one can infer correlation from data, and would like to be able to make inferences about causation. There are typically two ways out of this. One is to perform an experiment, preferably a randomized double-blind experiment, to eliminate accidental sources of correlation, common causes, etc. That’s nice when you can do it, but impossible with supernovae, and not even easy with people. The other out is to look for correlations, say that of course they don’t equal causations, and then act as if they did anyway. The technical names for this latter course of action are “linear regression” and “analysis of variance,” and they form the core of applied quantitative social science, e.g., The Bell Curve.

Graphical models are, in part, a way of escaping from this impasse.

(Cosma Shalizi’s notebook on Graphical Causal Models)

Shalizi talks quite a bit (not just in the IQ context) about how amassing many correlations is not the right way to do inference about causation, and how there are good statistical methods for causal inference, but most social scientists don’t know about them.

If you like to read and talk about correlations this is probably stuff worth reading.  See e.g. Shalizi’s notebook on Causal Inference.  Shalizi recommends work by Clark Glymour on this subject, including a book (“The Mind’s Arrows: Bayes Nets and Graphical Causal Models in Psychology”), and a paper which is available online (controversy-related enticement: it is framed as a critique of The Bell Curve).

If you have been following this blog for long enough, I’m sure you know that I am a Cosma Shalizi fanboy, but for newcomers: his website is full of wonders involving just about every subject, and especially likely to be worth reading if you are interested in physics, statistics, and/or quantitative social science.

object and meta

This is a follow-up to a few posts from last night and this morning, explaining why I don’t really understand or trust the distinction between “object level” and “meta level.”

Only click the readmore if that sounds interesting.


raginrayguns asked: I could be misinterpreting Jaynes, in the part where I think he's saying you need the problem to have enough invariances before you can use maximum entropy. Because, now that I'm looking through the book, I'm not finding any applications of this combined transformation group+maximum entropy strategy. And I AM finding parts where he's talking about normal distributions and just casually says "well you know the general magnitude of the error so you think about that as a variance and do maxent"

Hmm, that would accord with what I said in my most recent post (just edited) — that people seem to just jump from “mean and variance” to “Gaussian” in practice without worrying about these things.  So maybe the resolution is that you always have the invariances in practice?  I’ll think about it

raginrayguns replied to your post “Oh also raginrayguns if you could point me to the particular place…”

Chapter 12 of Probability Theory. 12.3: Continuous distributions.

Thanks.  I didn’t realize that his issue was how to specify the background distribution (what I called q(x)).  If I have any thoughts after reading the chapter and thinking about it, I’ll post them.

This is irrelevant to the issue at hand (which is “the best version of” MaxEnt), but I think this problem is often (or at least sometimes) ignored by MaxEnt practitioners in practice.  I’ve heard the cliche (or canard?) “the MaxEnt distribution with a given mean and variance is a Gaussian” a number of times in seminars, to the point that it seems to be the one thing about MaxEnt that everyone in my department remembers, even if they don’t use it at all themselves.
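For what it’s worth, the cliche itself does check out: among distributions with a fixed variance, the Gaussian has the largest differential entropy.  A minimal sketch using closed-form entropy expressions (the comparison distributions here, uniform and Laplace, are my own illustrative choices, not anything from the seminars in question):

```python
import math

# Differential entropies (in nats) of some common distributions, each
# scaled to the same variance, from their closed-form expressions.
# The Gaussian should come out on top, matching the MaxEnt result.

sigma2 = 1.0  # common variance

# Gaussian: H = 0.5 * log(2*pi*e*sigma^2)
h_gauss = 0.5 * math.log(2 * math.pi * math.e * sigma2)

# Uniform on [a, b]: variance (b-a)^2/12 = sigma2, so b - a = sqrt(12*sigma2)
width = math.sqrt(12 * sigma2)
h_unif = math.log(width)

# Laplace with scale b: variance 2*b^2 = sigma2, so b = sqrt(sigma2/2)
b = math.sqrt(sigma2 / 2)
h_laplace = 1 + math.log(2 * b)

print(h_gauss, h_unif, h_laplace)  # Gaussian ~1.419, uniform ~1.242, Laplace ~1.347
```

Of course, this only says the Gaussian wins *given* that differential entropy is the right thing to maximize, which is exactly what’s at issue below.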

Oh also raginrayguns if you could point me to the particular place you’re thinking of where Jaynes talks about entropy in continuous spaces, that would be helpful for me.  I think he’s talked about these things in several places and they might not all contain the same ideas

raginrayguns:

nostalgebraist:

raginrayguns replied to your post “Quick response to hot-gay-rationalist’s most recent post — I have to…”

I think you’re actually just equivocating independence in the subjective prob dist for a single observation with independence in the empirical frequency dist. And moments of the 1-obs subjective prob dist with moments of the empirical frequency dist

In the former case, maybe — I don’t understand this stuff well enough yet

In the latter case, my worry is about something like: I can press a button to cause one random event x distributed with mean mu and variance sigma^2 to happen.  I know nothing else about the distribution of x.  I have a utility function that is positive for small |x| but negative for large |x|.

With the MaxEnt Gaussian, my expected utility is positive and I should push the button.  In real life (if this were a scenario that could really happen), I might be worrying that the real distribution has heavy tails and that could cause the expected utility to be negative.  Is this worry “irrational”?  Must I press the button to be rational?

I have to leave now so I can’t write anymore
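The button scenario above can be made concrete with a quick Monte Carlo sketch.  All the specific numbers here (mu = 0, sigma = 1, the threshold a, the penalty c, and a rescaled Student-t as the heavy-tailed alternative) are my own illustrative choices, not anything from the posts:

```python
import math
import random

# Hypothetical sketch of the button scenario: two distributions with the
# SAME mean (0) and variance (1), but different tails.  Utility is +1 when
# |x| < a and -c when |x| >= a ("positive for small |x|, negative for large").
random.seed(0)
a, c, n = 3.0, 100.0, 200_000

def utility(x):
    return 1.0 if abs(x) < a else -c

def sample_t(df=3):
    # Student-t via N(0,1) / sqrt(chi2_df / df), then rescaled so the
    # variance is exactly 1 (a t with df=3 has variance df/(df-2) = 3).
    z = random.gauss(0, 1)
    chi2 = sum(random.gauss(0, 1) ** 2 for _ in range(df))
    t = z / math.sqrt(chi2 / df)
    return t / math.sqrt(df / (df - 2))

eu_gauss = sum(utility(random.gauss(0, 1)) for _ in range(n)) / n
eu_t = sum(utility(sample_t()) for _ in range(n)) / n
print(eu_gauss, eu_t)
```

With these (cherry-picked) numbers, the MaxEnt Gaussian says press the button (expected utility roughly +0.7), while the heavy-tailed distribution with the very same mean and variance says don’t (roughly -0.4).  Which is exactly the worry: the moment constraints alone don’t pin down the tails.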

This is an interesting problem.

(Though I’ve modified it slightly in working on it - instead of utility=|X|, I’m doing utility is positive if |X|>a, 0 otherwise.)

Some thoughts right now:

  1. I think the answer would be clearer with a discrete distribution, where the different possibilities were defined in some non-arbitrary way
  2. You can’t actually apply maximum entropy with just the assumptions that have been given. Jaynes points out that if you take a limit from a continuous to a discrete process, the maximum entropy distribution depends on the limiting density of discrete points around each possible value of X. And says that to define a maximum entropy distribution when it’s not a limit, when it’s continuous to begin with, you need to also have invariance to certain kinds of transformation of the problem, or something. What followed was beyond me.

So…. you brought this up as an example of a case where your beliefs don’t seem well defined by probabilities. And it may be that you’re right, but in a… different sense than intended? In the sense that the background info doesn’t lead to an assignment by the principle of maximum entropy, so there’s nobody who really claims to have a rule for assigning probabilities in this case.

(although it must be said that I conveniently forgot about the limitations on maxent in continuous spaces until you pointed out a flaw in the result of maxent! Who knows how often I do stuff like that)

My (not full) understanding of the issue you’re raising is as follows:

The discrete version of the Shannon entropy, (-sum p_i log(p_i)), works fine in discrete cases.

The “naive” generalization of this to a continuous space is just to replace the sum with an integral and get (-integral p(x) log(p(x)) dx).  This is what Shannon originally did, and it goes by the name of “differential entropy.”

Unfortunately, this is not invariant under changes of coordinates.  If we change coordinates, p and dx will rescale in ways that cancel out, but the rescaled p will also appear inside the log, which changes the overall result.
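To see the non-invariance concretely: take X ~ N(0, 1) and rescale coordinates to Y = 2X.  Nothing about the underlying uncertainty has changed, but the differential entropy shifts by log 2.  A sketch using the closed-form Gaussian entropy (the choice of N(0, 1) and the factor of 2 are just for illustration):

```python
import math

# Differential entropy of N(0, s^2) in closed form: 0.5 * log(2*pi*e*s^2).
def diff_entropy_gauss(s):
    return 0.5 * math.log(2 * math.pi * math.e * s**2)

h_x = diff_entropy_gauss(1.0)  # X ~ N(0, 1)
h_y = diff_entropy_gauss(2.0)  # Y = 2X ~ N(0, 4): same information, new coordinates

print(h_y - h_x)  # shifts by log(2) ~ 0.693, purely from the change of variables
```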

There is something similar called “relative entropy” or “Kullback-Leibler divergence,” which compares two distributions p and q.  It looks like (integral p(x) log(p(x)/q(x)) dx).  This is invariant under changes of coordinates, because the transformations of p(x) and q(x) will cancel inside the log.
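And the same rescaling leaves the K-L divergence alone.  A sketch with two Gaussians, using the closed-form Gaussian-to-Gaussian divergence (the particular pair N(0, 1) and N(1, 1) is again just an arbitrary illustration):

```python
import math

# Closed-form KL divergence KL( N(m1, s1^2) || N(m2, s2^2) ).
def kl_gauss(m1, s1, m2, s2):
    return math.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

# p = N(0, 1), q = N(1, 1) in the original x-coordinates:
kl_x = kl_gauss(0, 1, 1, 1)

# Under y = 2x, both densities transform together: N(0, 4) and N(2, 4).
kl_y = kl_gauss(0, 2, 2, 2)

print(kl_x, kl_y)  # both 0.5: the transformations cancel inside the log
```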

It turns out you can “rescue” differential entropy by writing something like the K-L divergence formula, where “q(x)” is not necessarily a probability density (it doesn’t have to integrate to 1), but is a density in the sense that it transforms like one under changes of coordinates.  Then, again, the densities will cancel inside the log.

You can then show that this is the appropriate limit of a sequence of discrete distributions, where q(x) represents the limiting “density of points” at x.  Note that this could just be q(x) = 1, giving us back the original differential entropy formula.  The point is that if we change coordinates to y(x), q(x) will change along with it.  I think Jaynes was the first to have this insight (?).

This is all in the service of having a measure of uncertainty that doesn’t care how you write your coordinates.  You wouldn’t want your choice for the best prior distribution to be different for different choices of coordinates – that choice shouldn’t matter.  I don’t think this means that the problem itself needs to satisfy any specific conditions.  It just means that your opinions about it shouldn’t depend on how you parameterize the outcome space, which really should be true for any problem, I’d think?

(via raginrayguns)

Well, he also delivered us from GOTO, so it’s a wash, really.

I have a hard time giving him credit for that one because I have a hard time imagining an era in which GOTO wasn’t “considered harmful”

What were they thinking

Right from the beginning, and all through the course, we stress that the programmer’s task is not just to write down a program, but that his main task is to give a formal proof that the program he proposes meets the equally formal functional specification. While designing proofs and programs hand in hand, the student gets ample opportunity to perfect his manipulative agility with the predicate calculus. Finally, in order to drive home the message that this introductory programming course is primarily a course in formal mathematics, we see to it that the programming language in question has not been implemented on campus so that students are protected from the temptation to test their programs. And this concludes the sketch of my proposal for an introductory programming course for freshmen.

This is a serious proposal, and utterly sensible. Its only disadvantage is that it is too radical for many, who, being unable to accept it, are forced to invent a quick reason for dismissing it, no matter how invalid.

this is what edsger dijkstra actually believed