“Where do Bayesians get their numbers from anyway,” installment (n+1)
[cut cut]
[snup]
[snop]
[snip]
[snrp]
[snep]
I think it’s because the uniform A_p distribution does not represent a state of “full uncertainty.” It actually represents the statement “I think any probability is just as plausible as any other,” and for most possible hypotheses, that is not a good description of most agents’ state of uncertainty about them! And once again we run into the problem of conflicting intuitions, because I think that states of really total uncertainty are precisely the ones that most need to obey the rules of probability, since they’re the easiest to fuck up.
And this problem is just isomorphic to the “finding priors” problem, which is the greatest weakness of the method. The Solomonoff solution, a prior proportional to 2^(−K) where K is the hypothesis’s complexity in bits, is one that also appeals to me intuitively and mathematically: the basic argument, that adding one bit makes twice as many hypotheses available and therefore should cut each one’s probability in half, looks very good to me.
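A toy version of that bit-counting argument (my own sketch, not from the discussion; hypotheses are raw bitstrings standing in for program codes):

```python
# Toy illustration of the complexity-prior argument: give each bitstring
# hypothesis an unnormalized weight of 2^-(description length in bits).

def weight(hypothesis: str) -> float:
    """Unnormalized prior weight, proportional to 2^-(length in bits)."""
    return 2.0 ** -len(hypothesis)

# Adding one bit halves the weight of any single hypothesis:
assert weight("1011") == weight("101") / 2

# There are 2^n bitstrings of length n, so every length class carries the
# same total unnormalized mass -- which is exactly why the per-hypothesis
# weight has to halve with each extra bit.
def class_mass(n: int) -> float:
    return (2 ** n) * 2.0 ** -n

assert class_mass(3) == class_mass(10) == 1.0

# (An actual Solomonoff prior uses prefix-free program encodings so the
# total mass converges; this sketch only shows the halving argument.)
```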
But there is no such thing as total uncertainty. Even in the ice cream case, you know something about the present human population and how it’s likely to change over the next 25 years, you know something about the prevalence of ice cream, etc. And even if you were completely ignorant, I hardly think you’d believe that “it’s as likely as not that that proposition is true,” or that literally any probability assignment is as justifiable as any other. The uniform A_p distribution is not special, nor does it represent full uncertainty, because there is no such thing as full uncertainty, and because different amounts of information are represented by different A_p distributions (another example Jaynes uses is the existence of life on Mars at some point in the past, for which he says his own A_p distribution would look something like the Haldane prior).
—EDIT to add:
But also, as you and raginrayguns pointed out, it’s in practice impossible for humans to really use the same heuristic rule of assigning more “reasonable” plausibilities to things when the thing is murky enough. But… none of my argumentation is supposed to apply to humans :P I’m not concerned with what we can actually accomplish in the physical universe; I’m concerned with the optimal way of reasoning, the gold standard, the hyper-ideal, the Platonic Form of reasoning. Whether we can implement that, and what we can do when we can’t, is a completely different topic. But if I can determine that Bayes is indeed the ideal, perfect, unachievable gold standard of reasoning, then I can move on to see what approximations I can make, and when and how I should deviate from the strict uncomputable solution.
OK, first of all a response to the last point: this conversation started with me objecting to some statements by a human, which are still floating up there above all the [snip]s and [snop]s.
As I’ve been interpreting it, this conversation has been about human approximations and not ideal reasoning, and to the extent that you’re only talking about ideal reasoning, you aren’t addressing the original question, which was “is the behavior of Robin Hanson (who is a human, not a JaynesBot) justified here? Why or why not?”
Second: ultimately, I think the conflict here comes down to this question:
“Is it right to ‘add’ information we don’t feel like we know in order to make our representation of our uncertainty obey the probability axioms?”
It seems like our intuitions are diametrically opposed here. I actively think this extra information shouldn’t be added, while you think that rationality demands that we add such information, and refer to not doing so as “fucking up.”
I’m going to give two examples of what I mean by “adding information.” Note that I’m using “information” in an informal sense, not in the Shannon sense or anything. And also that, as always, I’m an amateur in these things and I wouldn’t ever want to imply that I’m the first to have ever thought of these objections – merely that I don’t know what the expert responses to them look like (though they presumably exist).
A thing I was getting at with my “conjunction” points yesterday is that if you know a probability distribution, you know information about the dependence structure of the events in it. However, sometimes I don’t feel like I know the dependence structure of a set of “murky” events. Representing my uncertainty as a distribution requires me to choose one, and this feels wrong.
For instance, let’s look at your example of the machine with two lights. In that example there was no problem because the dependence structure was given (we know P(blue|red) = 0 and vice versa).
But now suppose I have the same machine, just as mysterious, except now I know any combination of the lights would be possible. The possibilities are now {neither, red only, blue only, both}. We could split this into various events like “red” which would be the set {red only, both}. (I’ll give the event {both} the name “red&blue.”)
Knowing nothing about the machine, I don’t know what the dependence structure is. Maybe the two lights are independent like two flipped coins, and P(red&blue) = P(red)*P(blue). Or maybe they have some kind of dependence: maybe only “red only” and “blue only” are possible, or maybe only “neither” and “both” are possible.
What should my A_p distributions be here? They can’t be uniform for each outcome because there are 4 outcomes and the means have to sum to 1, not 2. (They still should have support over the whole interval [0,1] because maybe the machine just does blue every time or w/e.) There’s probably a MaxEnt answer here?
In any case, whatever answer I choose, it will imply a dependence structure. If I have a probability distribution over the space {neither, red only, blue only, both} then I can compute things like P(red|blue).
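To make that concrete (my own sketch, not part of the original machine example): take the uniform, maximum-entropy joint distribution over the four outcomes and look at the dependence structure that falls out of it.

```python
# The mystery machine's four joint outcomes under the uniform (MaxEnt)
# distribution -- the obvious "know nothing" assignment.
joint = {"neither": 0.25, "red only": 0.25, "blue only": 0.25, "both": 0.25}

# Marginal events: "red" = {red only, both}, "blue" = {blue only, both}.
p_red = joint["red only"] + joint["both"]    # 0.5
p_blue = joint["blue only"] + joint["both"]  # 0.5
p_both = joint["both"]                       # 0.25

# The uniform joint silently asserts that the lights are independent...
assert abs(p_both - p_red * p_blue) < 1e-12

# ...and yields definite conditionals that were never given to us:
p_red_given_blue = p_both / p_blue
print(p_red_given_blue)  # 0.5
```

So the “know nothing” joint commits me to the coin-flip dependence structure, ruling out (or rather, averaging over) the “only one light ever” and “always both or neither” machines.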
But actually I feel totally uncertain about those things, which I was not informed of at the outset. They are “extra information,” and the idea that an object involving this information is the “right” representation of my state of uncertainty seems strange to me.
Is there any one dependence structure here that “correctly” represents my total lack of knowledge about the dependence structure of the machine’s behavior? It feels counter-intuitive that there would be, though maybe there is. Maybe I’m missing something here?
Here’s a second example of “extra information.”
Famously, if all you’re given is a mean and a variance, the MaxEnt probability distribution on R is a Gaussian.
In going from “mean mu, variance sigma^2” to “N(mu,sigma^2)” I become able to compute many new things. For instance, I can now compute any moment of this distribution. I could tell you its fourth moment, say (and it would be finite).
However, the information provided is also consistent with other distributions, such as a Student’s t with nu > 2 (technically, a non-standardized Student’s t). But the Student’s t does not have a defined nth moment for n >= nu. So the information provided is consistent with a Student’s t with nu = 3, yet the fourth moment of that distribution is undefined (in the sense of being “infinite,” i.e. the integral diverges to +infinity).
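That divergence is easy to check numerically. Here’s a pure-Python sketch of mine (not from the discussion): truncate the fourth-moment integral at ±L and watch what happens as L grows. For the standard Gaussian the truncated integral settles at 3*sigma^4 = 3; for the t with nu = 3, whose tails fall off like x^-4, it just keeps growing.

```python
import math

def norm_pdf(x):
    """Standard Gaussian density (mu = 0, sigma = 1)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def t3_pdf(x):
    """Student's t density with nu = 3; tails fall off like x^-4."""
    return 2.0 / (math.pi * math.sqrt(3.0)) * (1.0 + x * x / 3.0) ** -2

def truncated_fourth_moment(pdf, L, n=100_000):
    """Midpoint-rule approximation of the integral of x^4 * pdf(x) over [-L, L]."""
    h = 2.0 * L / n
    return h * sum((-L + (i + 0.5) * h) ** 4 * pdf(-L + (i + 0.5) * h)
                   for i in range(n))

for L in (10, 100, 1000):
    print(L, truncated_fourth_moment(norm_pdf, L),
          truncated_fourth_moment(t3_pdf, L))
# The Gaussian column settles at 3.0 (= 3 * sigma^4); the t(3) column grows
# roughly in proportion to L, i.e. the integral diverges.
```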
So, suppose you come along and say, “Rob, there’s a distribution with mean mu and variance sigma^2.” And I think, okay, there's some nonzero chance it’s a Student’s t. After all, that’s a distribution that comes up in real life.
Now you ask “okay, Rob, what’s its fourth moment?” And if I were a MaxEnt machine I’d happily spit out the fourth moment of N(mu,sigma^2). But I, Rob, know that Student’s t is out there in the world, and that its fourth moment is infinite! How do I incorporate this knowledge into a probability distribution? If I have, say, some distribution over distributions in which the relevant Student’s t has a nonzero probability epsilon of being the right one, even if epsilon is tiny, the “expected value” of the fourth moment will still be infinite (infinity * epsilon = infinity), and I’ll spit out “infinity,” not “fourth moment of N(mu,sigma^2).”
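The epsilon arithmetic itself is one line, if you let IEEE infinity stand in for the divergent moment (my illustration; the value of eps is made up):

```python
import math

# Mixture over "which distribution is it": probability (1 - eps) on the
# Gaussian answer (fourth moment 3 * sigma^4, here sigma = 1), and a tiny
# probability eps on a t(nu = 3)-style distribution whose fourth moment
# diverges (represented here as math.inf).
eps = 1e-12  # arbitrarily tiny, for illustration
expected_fourth_moment = (1 - eps) * 3.0 + eps * math.inf
print(expected_fourth_moment)  # inf: any nonzero eps swamps the finite part
```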
So it seems like doing MaxEnt makes me forget that Student’s t is a possibility, in that it leads me to draw conclusions that seem unreasonable unless I think a Student’s t is actually impossible. The information “mean mu, variance sigma^2” doesn’t just fail to tell me the fourth moment; it fails to tell me that it’s even finite. The state of uncertainty I’m in, having received that information, seems very poorly represented by a Gaussian.
Again, I don’t doubt that people have thought about these issues and come up with answers to them; I just don’t know what the answers are. And in sum, the disagreement here seems to come down to the question of whether states of uncertainty should or shouldn’t be “altered” to give me a prior that acts like a probability distribution.



