
Something cool I found out about in that Agent Foundations conversation was this paper on the “speed prior,” which is like Solomonoff but with probabilities inversely proportional to the time it takes to compute things.  Does away with uncomputability issues, and you can get some “excellent” (the authors’ word) bounds for it.  (Don’t really feel qualified to evaluate the paper, plus I just haven’t looked at it in much detail, but it seems cool)

bayes: a kinda-sorta masterpost

raginrayguns:

@nostalgebraist:

5. Why is the Bayesian machinery supposed to be so great?

This still confuses me a little, years after I wrote that other post.  A funny thing about the Bayesian machinery is that it doesn’t get justified in concrete guarantees like “can unscrew these screws, can tolerate this much torque, won’t melt below this temperature.”  Instead, one hears two kinds of justifications:

(a) Formal arguments that if one has some of the machinery in place, one will be suboptimal unless one has the other parts too

(b) Demonstrations that on particular problems, the machinery does a slick job (easy to use, self-consistent, free of oddities, etc.) while the classical tools all fail somehow

E. T. Jaynes’ big book is full of type (b) stuff, mostly on physics and statistics problems that are well-defined and textbook-ish enough that one can straightforwardly “plug and chug” with the Bayesian machinery.  The problem with these demos, as arguments, is that they only show that the tool has some applications, not that it is the only tool you’ll ever need.

Examples of type (a) are Cox’s Theorem and Dutch Book arguments.  These all start with the hypotheses and logical relations already set up, and try to convince you (say) if you have degrees of belief, they ought to conform to the logical relations.  This is something of a straw man argument, in that no one actually advocates using the rest of the setup but not imposing these relations.  (Although there are interesting ideas surprisingly close to that territory.)

The real competitors to Bayes (e.g. the classical toolbox) do not have the “hypothesis space + degrees of belief” setup at all, so these arguments cannot touch them.

Yeah, Jaynes starts with Cox’s theorem, which I think of as a sort of filter: you drop a system through it and see where it gets stuck, and if it doesn’t get stuck and makes it all the way through, it’s probability theory. But he doesn’t really present any other systems that you can drop through the filter. He mostly criticizes orthodox statistics, which you can’t really drop through it at all.

When I first read Jaynes, the example I dropped through Cox’s theorem was fuzzy logic, defining Belief(A and B) = min(Belief(A), Belief(B)), and disjunction as the maximum. This gets stuck because you can hold Belief(A) constant and increase Belief(B) without necessarily increasing Belief(A and B). That’s not allowed. I was very impressed with Cox’s theorem for excluding this, since I had not even noticed this property, and when it was brought to my attention it was in fact unreasonable.
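A quick sketch of that filter check in code. The min/max rules are the standard fuzzy connectives described above; the specific belief values are made up for illustration:

```python
# Standard fuzzy-logic connectives: conjunction as min, disjunction as max.
def fuzzy_and(bel_a, bel_b):
    return min(bel_a, bel_b)

def fuzzy_or(bel_a, bel_b):
    return max(bel_a, bel_b)

# Hold Belief(A) fixed and raise Belief(B) substantially:
bel_a = 0.3
before = fuzzy_and(bel_a, 0.4)
after = fuzzy_and(bel_a, 0.9)

# Cox's desiderata require Belief(A and B) to respond to changes in the
# conjuncts' beliefs (strictly, in the usual statements of the theorem);
# min() can ignore one argument entirely, so fuzzy logic "gets stuck".
assert before == after == 0.3
```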

It makes me wonder if I would have been less impressed had I started by using Dempster-Shafer theory as an example. Dempster-Shafer theory is the “interesting idea” that nostalgebraist linked to above. I’m writing this post to discuss it more thoroughly. tl;dr summary: Dempster-Shafer theory can be thought of as breaking the rule that there’s a “negation function” mapping Bel(A) to Bel(~A), and it makes you wonder whether we really need such a function.

So, as everyone in the internet Bayesianism discourse knows, Dempster-Shafer theory gives every proposition two numbers: the belief, Bel(A), and the plausibility, Plaus(A). Belief is how much A is supported by the evidence, and plausibility is the degree to which A is allowed by the evidence. Plausibility is always at least as high as belief.

As few discoursers seem to realize, Plaus(A) is just 1-Bel(~A), so in a sense Bel is all you need. It’s interesting, then, to drop Bel through Cox’s theorem, and see where it gets stuck.

And the first place I notice is at the following desideratum in Cox’s theorem:

There exists a function S such that, for all A, Bel(~A) = S(Bel(A)).

Bel(A) breaks this rule, supposedly ruling it out as a quantification of confidence. But how bad is it, really?

Suppose I’m happily using Dempster-Shafer theory for, I don’t know, assessment of fraud risk, when strawman!Cox bursts into my office, and declares “I’ve come to save you from your irrational degrees of belief!”

As the perfectly reasonable foil to this hysterical and unreasonable strawman, I reply in a tone of pure, innocent curiosity: “What do you mean? I’d love any opportunity to improve my fraud detection.”

“Well,” Cox begins, filliping a coin and covering it, “your Bel(Heads)=0.5, and your Bel(~Heads)=0.5, right?”

“Certainly,” I reply.

“And this case you’re reviewing, Bel(Fraud) = 0.5, correct?”

“Absolutely.”

“And your Bel(~Fraud)?”

“0.2.”

“That’s irrational!” he shrieks, throwing his hands in the air and revealing that the coin came up heads. “Let S be the function that maps Bel(A) to Bel(~A). What’s S(0.5)? Is it 0.5, or 0.2?” He puts his hands on my desk, leans forward, and demands, “Which is it?”

“There is no such function,” I reply. “Why should there be?”

So, what can Cox do to convince me my assignments are irrational? Or that my fraud detection would be more efficient if there existed this negation function S?

So, that’s where I end up when I drop Dempster-Shafer Bel through Cox’s theorem, and this time I don’t feel I’ve revealed any flaw in the system.
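The dialogue above can be reproduced with a toy mass function. This is a minimal sketch of standard Dempster-Shafer bookkeeping; the frames and mass assignments are hypothetical, chosen to match the 0.5/0.2 numbers in the story:

```python
# Dempster-Shafer: mass is assigned to *subsets* of the frame of
# discernment, not to individual outcomes.

def bel(mass, A):
    """Belief in A: total mass on subsets that entail A."""
    return sum(m for S, m in mass.items() if S <= A)

def plaus(mass, A):
    """Plausibility of A: total mass on subsets consistent with A."""
    return sum(m for S, m in mass.items() if S & A)

frame = frozenset({"fraud", "legit"})
fraud_mass = {
    frozenset({"fraud"}): 0.5,   # evidence pointing to fraud
    frozenset({"legit"}): 0.2,   # evidence pointing to legit
    frame: 0.3,                  # uncommitted mass
}
coin_mass = {
    frozenset({"heads"}): 0.5,
    frozenset({"tails"}): 0.5,   # no uncommitted mass: fully Bayesian
}

A = frozenset({"fraud"})
H = frozenset({"heads"})

# Plaus(A) = 1 - Bel(~A), so Bel alone carries all the information:
assert plaus(fraud_mass, A) == 1 - bel(fraud_mass, frame - A)

# Bel(Fraud) = Bel(Heads) = 0.5, yet Bel(~Fraud) = 0.2 while
# Bel(~Heads) = 0.5 -- so no single function S with Bel(~A) = S(Bel(A)).
assert bel(fraud_mass, A) == bel(coin_mass, H) == 0.5
assert bel(fraud_mass, frame - A) == 0.2
assert bel(coin_mass, frozenset({"tails"})) == 0.5
```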

Shafer himself says the same thing, actually:

Glenn Shafer:

Most of my own scholarly work has been devoted to representations of uncertainty that depart from the standard probability calculus, beginning with my work on belief functions in the 1970s and 1980s and continuing with my work on causality in the 1990s [18] and my current work with Vladimir Vovk on game-theoretic probability ([19], www.probabilityandfinance.com). I undertook all of this work after a careful reading, as a graduate student in the early 1970s, of Cox’s paper and book. His axioms did not dissuade me. As Van Horn notes, with a quote from my 1976 book [17], I am not on board even with Cox’s implicit assumption that reasonable expectation can normally be expressed as a single number. I should add that I am also unpersuaded by Cox’s two explicit axioms. Here they are in Cox’s own notation:

1. The likelihood ∼b|a is determined in some way by the likelihood b|a: ∼b|a = S(b|a), where S is some function of one variable.

2. The likelihood c·b|a is determined in some way by the two likelihoods b|a and c|b·a: c·b|a = F(c|b·a, b|a), where F is some function of two variables.

I have never been able to appreciate the normative claims made for these axioms. They are abstractions from the usual rules of the probability calculus, which I do understand. But when I try to isolate them from that calculus and persuade myself that they are self-evident in their own terms, I draw a blank. They are too abstract—too distant from specific problems or procedures—to be self-evident to my mind.

Shafer goes on to quote and respond to Cox’s argument that there should exist F, but since I’m talking about S, I’m gonna look up how Jaynes argued for it.

ET Jaynes:

Since the propositions now being considered are of the Aristotelian logical type which must always be either true or false, the logical product AA̅ is always false, the logical sum A+A̅ always true. The plausibility that A is false must depend in some way on the plausibility that it is true. If we define u ≡ w(A|B), v ≡ w(A̅|B), there must exist some functional relation

v = S(u)

And that’s it. To explain notation: w is the function that is eventually shown to have a correspondence with a probability mass function, the overbar means “not”, and logical “sums” and “products” are disjunctions and conjunctions, respectively.

So, why must there exist this functional relation? Perhaps instead the belief in A could change without altering the belief in ~A? That can happen in Dempster-Shafer, I think, and it does seem kind of crazy. But even disallowing that, and granting that there must be some function relating the belief in A to the belief in ~A, is it really the same function for every A? Why should it be?

Anyway, yeah. So, idk if I’d say, like nostalgebraist does, that Dempster-Shafer theory is surprisingly close to having the hypothesis space + beliefs setup but without the same constraints. I’d say instead that it’s exactly that. But I’m not totally sure since I’ve only read the basics and maybe things change in more complex applications.

Good stuff!!

To be completely honest, when I was writing that part you quoted, I was like “oh shit wait, D-S does have the same setup, so how does it get around the Cox and Dutch Book type stuff, or maybe it doesn’t? um….” and then in the interests of getting on with the rest of the post, I just hedged by being vague (“surprisingly close to that territory”)

So thanks for answering the question I was curious about but had to ignore.

I started wondering about the equivalent of the above in the measure-theoretic picture (i.e. why D-S doesn’t define a probability measure).  If you translate “logical negation” to “set complement” like usual, then Bel violates additivity: A and ~A are disjoint, and together they make up the whole space, so additivity would force area(A) = area(whole space) - area(~A).  This seems easier to understand than the Cox S thing, which fits with what Shafer said.

(Apparently, instead of a measure, it’s a “fuzzy measure.”  Instead of additivity, a fuzzy measure just needs to get the correct order on what I was calling “obviously-nested” sets earlier)

I can see the strong intuition behind the Cox S desideratum.  You should be able to take the negation of everything without changing any of the content.  Like, when we talk about A and ~A, neither has the intrinsic property of “being the one with the tilde.”  (Likewise with sets A, A^c.)  You can see the desideratum as a relatively weak way of trying to make things symmetric under negation – everything goes through the same function, so hopefully every property of b|a will have an equivalent for S(b|a).

So, if there’s an asymmetry between one side and the other, what broke the initial symmetry?  How do you decide which side is which?  (That’s what I imagine the strawman!Cox figure saying)

But then, A and ~A are always distinct, even if not because “one has the tilde.”  So for the D-S-using fraud protection worker, it is easy to break the symmetry because “Fraud” and “not Fraud” are different things.  (Thus if they’d flipped all their tildes at the start, the symmetry would have broken the same way, “not Fraud” getting 0.2 and “Fraud” getting 0.5.)

Still, if we are understanding the “not” here either as logical negation or as set complement, this is still nonsensical.  Because in both those frameworks, the negation doesn’t contain any information not contained in the original.  Except …

If I think of “the information” used to specify sets S or S^c as a boundary, then S is “everything inside here” and S^c is “everything outside of here.”  Of course this visual picture depends on topological notions not present in the sets alone, but it suggests something true about spaces of ideas/hypotheses: we can draw a boundary around some ideas we know about, and the “inside here” set is all stuff we know about, but the “outside of here” set includes all other ideas, including ones we haven’t thought of.  So this is a very natural distinction in practice.

How would you formalize that?  I guess you’d have set theory in a universe (=“outcome space”) that wasn’t fully known, so you could say stuff like “I know 1 and 2 are in the universe, and I can make the set {1, 2}, but I don’t know if 3 is in the universe.”  This probably exists but I don’t know what it’s called.

(via raginrayguns)

bayes: a kinda-sorta masterpost

principioeternus:

nostalgebraist:

I have written many many words about “Bayesianism” in this space over the years, but the closest thing to a comprehensive “my position on Bayes” post to date is this one from three years ago, which I wrote when I was much newer to this stuff.  People sometimes link that post or ask me about it, which almost never happens with my other Bayes posts.  So I figure I should write a more up-to-date “position post.”

I will try to make this at least kind of comprehensive, but I will omit many details and sometimes state conclusions without the corresponding arguments.  Feel free to ask me if you want to hear more about something.

I ended up including a whole lot of preparatory exposition here – the main critiques start in section 6, although there are various critical remarks earlier.

Keep reading

I like this post.  I myself would say that I’m only a “weak Bayesian”, and that while I do solidly believe in various “Bayesian brain” theories, those theories are *muuuuuch* more philosophically pragmatist than the Strong Bayesian epistemological program.

My big request would be whether anyone knows how to “replace” probability theory.  What I really want is a way of predicting stuff that lets information flow top-down *and* bottom-up, allows for continuously graded inferences, and allows for arbitrarily complicated structures and connections.  Most statistical and machine-learning methods, outside of those described below, *don’t* allow for that!  This is why I stick by my Weak Bayesianism even when it visibly sucks.

That said, there are some formal developments Nostalgebraist has missed here.

* Nonparametrics!  It’s not as if nobody has ever thought about the Problem of New Ideas before.  There’s a whole subfield of Bayesian nonparametric statistics devoted to handling exactly this.  The idea is that you start with a “nonparametric” prior model (a probabilistic model of an infinite-dimensional sample space).  Sure, this model will assign probabilities over objects that are formally infinite, but you only ever have to actually deal with finite portions of them that talk about your finite data.  Whenever new data appears to require a New Idea, though, the model will summon one up with approximately the right shape.  You can Monte Carlo sample increasingly large/complex finite elements of the posterior, and you never have to hold the infinite object in your head to be doing probabilistic inference with it.

* Probabilistic programming!  This one’s related to nonparametrics, since part of its purpose is to make nonparametrics easy to handle computationally.  In a probabilistic programming language, we can perform inference (both conditionalization and marginalization) in any model whose conditional-dependence structure corresponds to some program.  In practice, this means writing programs that flip coins, and then conditioning on observed flips to find the weights.  It’s actually surprisingly intuitive for having so much mathematical and computational machinery behind it.  It’s also Turing-universal: any distribution from which a computer can sample in finite time corresponds to some probabilistic program.  So we have a model class including everything we think a physical machine can cope with!

* Divergences are universal performance metrics.  Any predictive model - frequentist or Bayesian - can be *considered* to give an approximate posterior-predictive distribution.  An information divergence (usually a Kullback-Leibler divergence) then defines a “loss function” between the true empirical distribution over held-out sample data and an equivalent sample from the predictive distribution.  The higher the loss, the worse the predictive model, and the actual number can be (AFAIU) approximately calculated (certainly I’ve handled code that calculates approximate sample divergences).  A good frequentist model will have a low divergence (loss), and a bad Bayesian model will have a high divergence (loss).  This gives a good definition for a *bad* Bayesian model: one in which the posterior predictive doesn’t predict well.  This technique is regularly used in Bayesian statistics to evaluate and criticize models.
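As a sanity check on the divergence-as-loss idea, here is a minimal sketch. The held-out data and both models are invented; the point is only that the confidently-wrong predictive distribution gets the higher KL loss:

```python
import math
from collections import Counter

def kl_divergence(p, q):
    """KL(p || q) over a shared finite support; q must be positive where p is."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)

# Hypothetical held-out labels and their empirical distribution:
held_out = ["fraud", "legit", "legit", "legit", "fraud", "legit", "legit", "legit"]
counts = Counter(held_out)
n = len(held_out)
empirical = {label: c / n for label, c in counts.items()}

good_model = {"fraud": 0.25, "legit": 0.75}  # matches the empirical rates
bad_model = {"fraud": 0.9, "legit": 0.1}     # confidently wrong

# The better predictive distribution gets the lower loss:
assert kl_divergence(empirical, good_model) < kl_divergence(empirical, bad_model)
```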

What’s important here is that sample spaces like, “Countable-dimensional probability distributions” (Dirichlet processes), “Uncountable-dimensional continuous functions” (Gaussian processes), and “all stochastic computer programs” seem to give us increasingly broad classes of probability models.  We would like to then do the reverse of old-fashioned Bayesian statistics: instead of starting with a restricted model, we can start with a very broad model and restrict it using our domain knowledge about the problem at hand.  We then plug-and-play some computational stuff to perform inference.
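The “summon a New Idea when the data demands it” behavior from the nonparametrics bullet can be sketched with the Chinese restaurant process, the clustering prior induced by a Dirichlet process. The concentration parameter and sample size here are arbitrary:

```python
import random

def crp_table_assignments(n_customers, alpha, rng):
    """Seat customers: each joins an existing table w.p. proportional to
    its size, or opens a brand-new table w.p. proportional to alpha."""
    tables = []        # tables[k] = number of customers at table k
    assignments = []
    for i in range(n_customers):
        weights = tables + [alpha]   # existing clusters, plus "a new idea"
        total = sum(weights)
        r = rng.random() * total
        for k, w in enumerate(weights):
            r -= w
            if r < 0:
                break
        if k == len(tables):
            tables.append(1)         # a New Idea is summoned up
        else:
            tables[k] += 1
        assignments.append(k)
    return tables, assignments

rng = random.Random(0)
tables, _ = crp_table_assignments(200, alpha=2.0, rng=rng)
# Only finitely many clusters are ever instantiated, though the prior
# supports unboundedly many; more data can always summon more.
assert sum(tables) == 200
assert 1 <= len(tables) <= 200
```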

Of course, it doesn’t yet work well in practice, but these things are regularly used to model really complex stuff, up to and including thought.  Again, those are Weak Bayesian theories, and we care more about a Monte Carlo or variational posterior with a low predictive loss than about finding God’s own posterior distribution.
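And a minimal sketch of the probabilistic-programming workflow mentioned above: a generative program that flips coins, conditioned on observations by brute-force rejection sampling. The prior and the observed data are made up, and real systems use far cleverer inference, but the semantics are the same:

```python
import random

def generative_program(rng):
    """A tiny probabilistic program: pick a coin weight, flip it 10 times."""
    weight = rng.random()                        # uniform prior on the weight
    flips = [rng.random() < weight for _ in range(10)]
    return weight, flips

observed = [True] * 8 + [False] * 2              # hypothetical data: 8 of 10 heads

def posterior_samples(n_samples, rng):
    """Rejection sampling: rerun the program, keep runs matching the data."""
    kept = []
    while len(kept) < n_samples:
        weight, flips = generative_program(rng)
        if sum(flips) == sum(observed):          # condition on the observation
            kept.append(weight)
    return kept

rng = random.Random(0)
samples = posterior_samples(2000, rng)
mean = sum(samples) / len(samples)
# The exact posterior here is Beta(9, 3), with mean 0.75; the Monte Carlo
# estimate should land nearby.
assert 0.7 < mean < 0.8
```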

Another important choice to make is indeed how you interpret probability.  I’ve actually liked the more measure-y way, once it was explained to me.  “Propositions” are then interpreted as subspaces of the sample space.  This seems like the Right Thing: you can start with a very complex model defined by some program or some infinite object or whatever, and then treat finite events within it as logical propositions.  Those propositions will obey Boolean logic, but their logical relations will come from the model, rather than the other way around.  An infinite-dimensional model will then also allow for an infinite number of propositions.

I consider this a fairly good example of how sometimes you should build your philosophy *on top of* the math and science that you know can work, rather than the other way around.  Philosophy is an *output* of thought, so if you want new philosophy, you need new thoughts to think, and if you want new thoughts to think, you need to get them from the world.

This is an extremely interesting response, thank you.

I was totally ignorant of Bayesian nonparametrics until now and it is the sort of thing I should (and want to) know about.  Do you have any recommendations about what to read first?  Seems like there are a lot of references out there.

Any links about probabilistic programming that you think are especially good + relevant would be appreciated too.

I’m not sure I agree with your paragraph about divergences (or perhaps I don’t understand it).  I’m aware of the K-L divergence, and it’s true that you can get a “posterior distribution” of some kind out of any predictive model.  (In classification tasks, this is straightforward because the predictions are usually probabilistic anyway; it’s a little less clear to me how this works with regression, since the point estimates we make in regression don’t attempt to match the intrinsic/noise variance in the data, which would affect the K-L divergence.)

But there’s more than one way to compare two probability distributions, and I don’t see that “K-L divergence from empirical distribution of validation set” is the one best loss function for probabilistic modeling.  For one thing, we’re presumably going to want to use the joint distributions of all our variables (so that the model has to get the relation of X to Y right, not just match the overall relative counts for Y).  But that’s a potentially high-dimensional distribution which we’re sparsely sampling, so the literal empirical distribution will have spurious peaks centered at each data point, and we’d need to do some density reconstruction to get something more sensible – at which point it’s not clear that we trust this reference distribution more than our model’s posterior, since both involve approximate inference from the data.

Also, I know the K-L divergence has a bunch of special properties, but I’ve always been wary when people say that it is the one correct way to compare 2 distributions (or that there is one correct way).  To make the case it seems like you’d need some link between the special properties and the thing you want to do.  And in practice we use various loss functions (various proper scoring rules for classification, say) that aren’t (obviously?) the K-L div in disguise; is this wrong?

(via principioeternus)

identicaltomyself:

nostalgebraist:

Having thought about this for a few more minutes:

It seems like things are much easier to handle if, instead of putting any actual numbers (probabilities) in, we just track the partial order generated by the logical relations.  Like, when you consider a new hypothesis you’ve never thought about, you just note down “has to have lower probability than these ones I’ve already thought about, and higher probability than these other ones I’ve already thought about.”

At some point, you’re going to want to assign some actual numbers, but we can think of this step as more provisional and revisable than the partial order.  You can say “if I set P(thing) = whatever, what consequences does that have for everything else?” without committing to “P(thing) = whatever” once and for all, and if you retract it, the partial order is still there.

In fact, we can (I think) do conditionalization without numbers, since it just rules out subsets of hypothesis space.  I’m not sure how the details would work but it feels do-able.

The big problem with this is trying to do decision theory, because there you’re supposed to integrate over your probabilities for all hypotheses, whereas this setup lends itself better to getting bounds on individual hypotheses (“P(A) must be less than P(B), and I’m willing to say P(B) is less than 0.8, so P(A) is less than 0.8”).  I wonder if a sensible (non-standard) decision theory could be formulated on the basis of these bounds?
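A sketch of how the numbers-free bookkeeping might look: track only “≤” relations between hypotheses, and treat any numeric commitments as revisable annotations on top. The hypothesis names and the 0.8 bound are made up, loosely following the examples in the thread:

```python
# "A <= B" edges: A's probability can't exceed B's (e.g. A entails B).
order = {
    ("calif_secedes", "us_breaks_up"),
    ("us_breaks_up", "major_upheaval"),
}

# A provisional, revisable numeric commitment:
upper_bounds = {"us_breaks_up": 0.8}

def derived_upper_bound(hypothesis, order, upper_bounds):
    """Best upper bound reachable by following '<=' edges upward."""
    best = upper_bounds.get(hypothesis, 1.0)
    for lo, hi in order:
        if lo == hypothesis:
            best = min(best, derived_upper_bound(hi, order, upper_bounds))
    return best

# P(calif_secedes) <= P(us_breaks_up) <= 0.8, so:
assert derived_upper_bound("calif_secedes", order, upper_bounds) == 0.8
# Retracting the number leaves the partial order intact:
assert derived_upper_bound("calif_secedes", order, {}) == 1.0
```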

I’ve seen papers on doing reasoning, based on propositions being more or less likely than other propositions, but without assigning numbers to the probabilities. Unfortunately, a half hour of poking around doesn’t turn up the papers I’m thinking of. The general area is called “valuation algebras on semirings”. In the case I remember, the semiring is Boolean algebra on propositions, which induces a partial order on the extent to which they are believed.

Anyway, that’s a not-very-useful half-assed reference. Now I’m going to switch to a more common mode of Tumblr discourse, i.e. talking about how what you say shows you’re thinking wrong (I may be misunderstanding what you say, but this being Tumblr, I will ignore that possibility.)

You’re operating on the principle that the goal of reasoning is to put probabilities on propositions. Then you find various problems involving e.g. what if you suddenly think of a new proposition, or realize that two propositions you thought were different are actually the same. But it seems to me that propositions are not the best thing to assign probabilities to.

What we want to find is a probability distribution over states of the world. Turning that into a probability for some proposition is a matter of adding up the probabilities of all the states of the world where that proposition is true. This is bog-standard measure theoretic probability theory, so it’s not just something I made up. You might find that thinking this way dissolves some of the perplexities you’ve been pondering in your last two posts.

Thanks for the pointer about valuation algebras on semirings.

About world states – I addressed that in my original post, when I contrasted the die roll example (where we really can describe world states) to real-world claims like “Trump will be re-elected in 2020.”

If we actually want to specify states of the real world at the level of measure-theoretic outcomes (set elements, rather than sets), either we’ll throw away some of what we know about the world, or the outcomes would have to be things like quantum field configurations down to the subatomic scale.  (Indeed, even that would be throwing away knowledge, since we don’t have a unified theory of fundamental physics and aren’t fully committed to any of the theories we do have; the outcome-level description would have to involve different candidate laws of physics plus states in terms of them.)

The natural reflex is to do some sort of coarse-graining, where we abstract away from the smallest-level description, but at that point we’re basically doing Jaynes’ propositional framework, since we’re allowing that our most basic units of description could be refined further (we don’t specify O(10^23) variables for every mole of matter, but we allow that we might learn some of those variables later).

TBH, I think I am so skeptical of Bayes in part because I am used to thinking in the measure-theoretic framework, and it just seems so obvious that we can’t do practical reasoning with descriptions that are required to be that complete.  Jaynes’ propositional framework seems like an attempt to avoid this problem, or at least hide it, which is why I’m focusing on it – it’s less clear that it’s unworkable.

(via identicaltomyself)

bayes: a kinda-sorta masterpost

lostpuntinentofalantis:

notthedarklord42:

nostalgebraist:

@lostpuntinentofalantis

I don’t think the fact that humans are bad at thinking up logical implications is a very strong argument against bayes, any more than “But Harold, you said you loved Chocolate earlier!” is an argument against having preferences.

So, I will agree that there’s this non-monotonic thing. This is indeed a very good point against using Bayes as a mental tool! I am not disagreeing with that!

What I do disagree with is the idea that it’s ipso facto problematic. I think the correct way to do this is to treat your first estimate as a preliminary one, and then use the other logical-implication questions as a way to generate a battery of knowledge in a kinda organic fashion. To use the original “California secession” thing, let’s say I think it’s unlikely, so I throw out 98% as my likelihood; then someone else asks me the “USA still together” question, so I also generically throw out 98%, but A HA!!!!!! THIS SEEMS WRONG, because the set of situations involving the US staying together but California leaving seems, I dunno, small or whatever, so I end up adjusting the probabilities, repeating until I’ve thought of all the “relevant” probabilities.

But logically speaking isn’t this troublesome? Isn’t it terrible that in theory an adversary can choose a sequence of questions which allows them to set my probabilities? Well, not really. My claim is that thoughts of these logical implication things provide information because humans are really bad at accessing all the information they have, and that, yeah sure if the adversary controls how a person accesses their information, of course the person is screwed? So you hope that people have good internal “implication generating”  machinery, such that by the time that they have worked through a bunch of subset questions, they have dumped out all relevant information, and the ordering effects are washed out.

Which is a much more elaborate way of saying “guys stop throwing out random probabilities and sticking to them if you don’t have good intuition/facts doing cognitive work aaaaaaaahh”

I guess I can agree that nothing I said above is specifically motivated by Bayes, except for this vague feeling of “well, shit it turns out I’m actually really bad at incorporating all relevant information” and I think it’s really just unavoidable.

I don’t think this is a problem with humans; I think it’s much more fundamental.  The real issue is that these kinds of “obviously nested” statements have an “easy to check, hard to find” property, like NP-complete problems.

Let’s define “A is obviously nested in B” as “if you describe both A and B to me, it’ll be immediately obvious to me that A is sufficient but not necessary for B.”  And let’s define an “obviously nested pair” as A, B where one is obviously nested in the other.

The “US in 2100″ statements mentioned earlier are all obviously nested pairs with one another.  But the ones mentioned are just a few examples; there are infinitely many statements of the same form, asking about slightly bigger or smaller regions of the US, that also form obviously-nested pairs with all other such statements.

And that whole infinite chain is just one “direction” in hypothesis space.  You can think about any other subject – existence of various markets and sub-markets (will candy be sold?  will lollipops?), demographics and sub-demographics, scientific ideas and special cases thereof, you name it – and produce an infinite obviously-nested chain like this.

In finite time (much less polynomial time), you can only explicitly think about some vanishingly small subset of these statements.  Yet you implicitly know infinitely many facts about them (about each chain, in fact, of which there are infinitely many).  There’s no way to sit down and think enough beforehand that all of the obvious-nesting information has been dumped out into an explicit representation (and that representation would take infinite space anyway).
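One way to make the “easy to check, hard to enumerate” point concrete in code. The representation of statements here is entirely hypothetical (statements as the set of states claimed to be in the surviving region):

```python
def statement(states):
    """'All of these states are in the surviving region in 2100.'"""
    return frozenset(states)

def obviously_nested(a, b):
    """Checking nesting is cheap: a proper-subset test."""
    return a < b or b < a

a = statement({"CA"})
b = statement({"CA", "OR"})
c = statement({"CA", "OR", "WA"})

# Any two links of the chain form an obviously-nested pair...
assert obviously_nested(a, b) and obviously_nested(b, c) and obviously_nested(a, c)

# ...and the chain extends without bound (add any further region), so the
# ordering facts can never all be written out one by one. The single rule
# "subset => entailment" encodes infinitely many of them in finite space.
```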

Now, maybe there is a way to handle this in practice so that it doesn’t hurt you too much, or something.  Such a theory would be very interesting, but as far as I know it doesn’t exist, and it would have to exist for us to begin talking about how a finite being could faithfully represent its implicit knowledge in a prior.

(This is a human problem in the sense that you could make a machine which would lack all this implicit knowledge.  That machine would not have this problem, but it would know less than we do, so we’d be throwing away information if we tried to imitate it.)

Yet you implicitly know infinitely many facts about them (about each chain, in fact, of which there are infinitely many).  There’s no way to sit down and think enough beforehand that all of the obvious-nesting information has been dumped out into an explicit representation (and that representation would take infinite space anyway).

Now, maybe there is a way to handle this in practice so that it doesn’t hurt you too much, or something.

This sounds like a natural continuity/limits problem. It does seem like there could be infinite nesting like this, and that you do know information about each step of the chain. However, I’m not sure this necessarily needs infinitely many facts to describe: perhaps an overarching fact could sum them all up, or the facts could get ‘smaller’ as the chain goes on, so that together they form a finite total fact. Thinking about the obvious-nesting information sounds very much like taking a limit.

The geographic example has very literal continuity, with larger and smaller regions of the US. I’m actually quite surprised there isn’t such a theory already! Hypothesis space, even when infinite, is continuous, and that makes a big difference.
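The “finite total fact” idea can be made concrete with a convergence sketch (my own framing, not anything stated in the thread): if the k-th step of a chain contributes at most geometrically shrinking information, the whole infinite chain is summarized by a finite bound.

```latex
% If step k of an obviously-nested chain adds information I_k with
% I_k \le C \cdot 2^{-k}, the whole infinite chain carries finite content:
\sum_{k=1}^{\infty} I_k \;\le\; C \sum_{k=1}^{\infty} 2^{-k} \;=\; C .
```

Of course, whether real chains of hypotheses actually shrink like this is exactly the open question.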


On a separate note, I’m not convinced that we couldn’t make do with a model where we only consider a finite universe, with discrete rather than continuous space. That would mean you could not take infinitely many different regions of the US. And it would mean that only finitely many events could possibly occur in a given time period, which intuitively seems like saying there will only be finitely many such different chains of hypotheses to worry about.

While it seems a bit artificial at times, I don’t think it’s too unreasonable to allow a theory like this to only cover finite cases, not when the finite case can approximate the infinite case arbitrarily closely. Then it seems we could reasonably represent our priors. 

I am a bad pun blog and I endorse this message as elaborating on my “eh it probably converges” intuition earlier.

I think we can afford to agree to disagree unless @nostalgebraist can help me intuition pump this a bit further on why doing the subset enumeration problem doesn’t (eventually) converge.

I will say that this substantially downgraded my belief that Bayes is complete; there is much more work to be done, and I think it’s totally reasonable to call out the “unfounded intuition” parts of *the bayes memeplex* from the more proper Edwin and Eliezer’s Excellent Adventure canon.

The continuity thing is interesting.

Re this

I’m actually quite surprised there isn’t such a theory already! Hypothesis space, even when infinite, is continuous, and that makes a big difference.

What immediately came to my mind is that the Bayes setup doesn’t demand that your prior be continuous in any underlying variable, so this doesn’t come up in proving “for all” and “there exist” statements about Bayesian agents, and is easy to dismiss as “just a special case” if you think like that.  On the more practical side, concrete applications of Bayes always tend to have continuous priors (bc they use familiar probability distributions that have PDFs); it’s easy to forget that you don’t necessarily have to do this, and so you don’t really think about how it might give you extra properties.

(And indeed, you don’t always want continuity even in spatial examples, since the real world has state lines and other borders, for instance.)

Anyway, even if you assume your prior is always continuous in one or more underlying variables (space, time?, etc?), that still leaves the functional form open.  One worry about these kinds of cases is that your contortions to squeeze things in will give you a prior with lots of unmotivated variations in slope (flat for a while, then steep for a while).  So in addition to continuity, you’d need some general assumption like “I think things tend to vary linearly (or whatever) w/r/t space,” which would get you most of the way to being able to pull consistent probabilities out of the air in any order.  Although you still have to deal with things that are not nested but not independent either, and make sure all those relations work out … IDK, if someone’s worked this all out in detail I’d love to see it, but it sounds really hard.
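The “things tend to vary linearly (or whatever) w/r/t space” assumption can be sketched concretely: instead of eliciting each nested statement separately, pick one monotone function of region size and read every probability off it. Everything here, the exponential form and the rate, is a made-up illustration, not something proposed in the thread.

```python
import math

def p_intact(area_fraction, rate=0.06):
    """Hypothetical prior: P(region still intact in 2100) as a smooth,
    monotone-decreasing function of the region's share of US area.
    The exponential form and the rate value are arbitrary illustrations."""
    return math.exp(-rate * area_fraction)

# Because the function is monotone, every obviously-nested pair is
# automatically coherent: a bigger region never gets a higher probability.
assert p_intact(1.0) <= p_intact(0.1) <= p_intact(0.01)
```

One functional form encodes infinitely many pairwise coherence facts at once; the harder part flagged above, hypotheses that are neither nested nor independent, is untouched by this trick.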

bayes: a kinda-sorta masterpost

@lostpuntinentofalantis

I don’t think the fact that humans are bad at thinking up logical implications is a very strong argument against bayes, any more than “But Harold, you said you loved chocolate earlier!” is an argument against having preferences.

So, I will agree that there’s this non-monotonic thing. This is indeed a very good point against using Bayes as a mental tool! I am not disagreeing with that!

What I do disagree with is the idea that it’s ipso facto problematic. I think the correct way to do this is throw out your first estimate as a preliminary one, and then use the other logical-implication questions as a way to generate a battery of knowledge in a kinda organic fashion. To use the original “California secession” thing, let’s say I think it’s unlikely, so I throw out 98% as my likelihood; then someone else asks me the “USA still together” question, so I also generically throw out 98%, but A HA!!!!!! THIS SEEMS WRONG, because the set of situations involving the US together but California leaving seems, I dunno, small or whatever, so I end up adjusting the probabilities accordingly, repeating until I’ve thought of all “relevant” probabilities.

But logically speaking, isn’t this troublesome? Isn’t it terrible that, in theory, an adversary can choose a sequence of questions which allows them to set my probabilities? Well, not really. My claim is that these logical-implication prompts provide information because humans are really bad at accessing all the information they have, and that, yeah, sure, if the adversary controls how a person accesses their information, of course the person is screwed. So you hope that people have good internal “implication generating” machinery, such that by the time they have worked through a bunch of subset questions, they have dumped out all relevant information, and the ordering effects are washed out.

Which is a much more elaborate way of saying “guys stop throwing out random probabilities and sticking to them if you don’t have good intuition/facts doing cognitive work aaaaaaaahh”

I guess I can agree that nothing I said above is specifically motivated by Bayes, except for this vague feeling of “well, shit it turns out I’m actually really bad at incorporating all relevant information” and I think it’s really just unavoidable.
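The answer, notice-the-clash, adjust loop described a few paragraphs up can be sketched as a clamping procedure. The specific clamping rule and all the numbers below are my own illustration, not anything specified in the posts.

```python
def elicit_consistent(questions, gut):
    """Answer nested questions one at a time, clamping each raw gut
    number into the interval forced by earlier answers. 'strength'
    orders the chain: a stronger statement entails every weaker one,
    so it can never receive a higher probability. The clamping rule
    is a stand-in for the informal 'A HA, adjust' step in the post."""
    answers = {}  # strength -> adjusted probability
    for name, strength in questions:
        # Can't exceed any already-answered weaker statement...
        hi = min((p for s, p in answers.items() if s < strength), default=1.0)
        # ...and can't fall below any already-answered stronger one.
        lo = max((p for s, p in answers.items() if s > strength), default=0.0)
        answers[strength] = min(max(gut[name], lo), hi)
    return answers

# Raw whole-number-ish gut answers, asked out of logical order:
gut = {"US fully together": 0.94,
       "CA still in US": 0.95,
       "west coast still in US": 0.93}
order = [("US fully together", 3),
         ("CA still in US", 1),
         ("west coast still in US", 2)]

final = elicit_consistent(order, gut)
# The raw 0.93 for the intermediate statement gets squeezed up to 0.94,
# leaving a coherent chain: 0.94 <= 0.94 <= 0.95.
```

Note that a different asking order can still land on different final numbers, which is exactly the order-dependence worry raised earlier in the thread.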

I don’t think this is a problem with humans, I think it’s much more fundamental.  The real issue is that these kinds of “obviously nested” statements have an “easy to check, hard to find” property, like NP-complete problems.

Let’s define “A is obviously nested in B” as “if you describe both A and B to me, it’ll be immediately obvious to me that A is sufficient but not necessary for B.”  And let’s define an “obviously nested pair” as A, B where one is obviously nested in the other.

The “US in 2100″ statements mentioned earlier are all obviously nested pairs with one another.  But the ones mentioned are just a few examples; there are infinitely many statements of the same form, asking about slightly bigger or smaller regions of the US, that also form obviously-nested pairs with all other such statements.

And that whole infinite chain is just one “direction” in hypothesis space.  You can think about any other subject – existence of various markets and sub-markets (will candy be sold?  will lollipops?), demographics and sub-demographics, scientific ideas and special cases thereof, you name it – and produce an infinite obviously-nested chain like this.

In finite time (much less polynomial time), you can only explicitly think about some vanishingly small subset of these statements.  Yet you implicitly know infinitely many facts about them (about each chain, in fact, of which there are infinitely many).  There’s no way to sit down and think enough beforehand that all of the obvious-nesting information has been dumped out into an explicit representation (and that representation would take infinite space anyway).

Now, maybe there is a way to handle this in practice so that it doesn’t hurt you too much, or something.  Such a theory would be very interesting, but as far as I know it doesn’t exist, and it would have to exist for us to begin talking about how a finite being could faithfully represent its implicit knowledge in a prior.

(This is a human problem in the sense that you could make a machine which would lack all this implicit knowledge.  That machine would not have this problem, but it would know less than we do, so we’d be throwing away information if we tried to imitate it.)

(via lostpuntinentofalantis)

bayes: a kinda-sorta masterpost

lostpuntinentofalantis:

nostalgebraist:

I have written many many words about “Bayesianism” in this space over the years, but the closest thing to a comprehensive “my position on Bayes” post to date is this one from three years ago, which I wrote when I was much newer to this stuff.  People sometimes link that post or ask me about it, which almost never happens with my other Bayes posts.  So I figure I should write a more up-to-date “position post.”

I will try to make this at least kind of comprehensive, but I will omit many details and sometimes state conclusions without the corresponding arguments.  Feel free to ask me if you want to hear more about something.

I ended up including a whole lot of preparatory exposition here – the main critiques start in section 6, although there are various critical remarks earlier.

Keep reading

This isn’t convincing to me (and I guess everything of this genre isn’t convincing to me) because, like, it seems to me that the infinite hypothesis thing is just a problem for every kind of thinking?
You can claim that frequentist tools only work in limited domains or whatever, but in my mind all you’ve done is sweep the “oh no, what if I didn’t think of a relevant hypothesis??!??” problem into the “well, yeah, you’re going to get burnt by this if you use it out of bounds” bucket.

To (ab)use the tool analogy, it turns out that all human made tools cannot survive in the middle of a supernova, and yes you’re technically correct that all the omnitool fanboys have been overselling the utility of omnitool usage in Exotic Space Environments, but the fact that all the non-omnitools have warnings about “cannot be used in supernovae” is not going to convince me that omnitools don’t exist, or are necessarily worse in all cases.

If you’re talking about Section 7, I’m not just saying that “there might be relevant hypotheses you hadn’t thought of,” I’m saying that it’s really hard to encode what you do know in a prior without throwing away some information.

In jadagul’s examples with the different regions in 2100, you already know (before you think about any of it) that those statements have a certain logical implication structure.  But you only start thinking about each relation as the relevant statement is brought to your attention.  Like, if you ask someone those questions in a non-monotonic order, they’ll have to take care to squeeze some probabilities inside others they’ve already stated, and this will make things clearly depend on the order of asking.  (In my example, the person said “94.5%” because they know they needed something between 94 and 95, even though they were giving whole-number answers at first, and would have given a whole number answer to the intermediate case if asked about it first.)

(BTW I once actually asked these questions sequentially to a rationalist meetup group as a way of making this point)

So the problem isn’t “your knowledge is finite” but “you can’t encode exactly what you know (and nothing else) in a prior, or at least I know of no way to do it.”

You could say this is just another thing warning that should go on the label, but it suggests that we’re actually using the wrong representation for our prior knowledge, and so we have a “garbage in, garbage out” type problem: Bayes is somehow failing to capture what we know, and we don’t (AFAIK) have any bounds or guarantees on what problems this will or won’t cause.  Whereas in the frequentist procedures, we can at least describe what it would look like for a human to use them correctly, and guarantee certain things for that human.

bayes: a kinda-sorta masterpost

4point2kelvin:

nostalgebraist:

I have written many many words about “Bayesianism” in this space over the years, but the closest thing to a comprehensive “my position on Bayes” post to date is this one from three years ago, which I wrote when I was much newer to this stuff.  People sometimes link that post or ask me about it, which almost never happens with my other Bayes posts.  So I figure I should write a more up-to-date “position post.”

I will try to make this at least kind of comprehensive, but I will omit many details and sometimes state conclusions without the corresponding arguments.  Feel free to ask me if you want to hear more about something.

I ended up including a whole lot of preparatory exposition here – the main critiques start in section 6, although there are various critical remarks earlier.

Keep reading

Finding the realio truilo bestio hypothesis by simple application of Bayes’ theorem requires infinite computing power: this is a true and important point. But you can also find the best hypothesis within the set of hypotheses you’ve actually thought of. The probability isn’t “right” - it neither matches the hypercomputing limit nor even tries to account for your own fallibility - but you can find the best hypothesis of those available (up to a magical prior).

I think this task, of finding the best hypothesis among some you’ve thought of, is a useful one for grounding the discussion and allowing comparison between different problem-solving methods. I think that solving this problem provides space for a Bayesianism that’s more substantive than just a collection of machinery, but is still part of a larger system for understanding human reasoning.

(Of course, choice of this goal [identify the best hypothesis] is itself not Bayesian - a more natural thing to do would be to frame this in terms of making empirical predictions based on the set of imagined hypotheses, in which case the Bayesian approach still gets some nice guarantees for the same reason that minimum message length prediction is expected to work [even if you don’t do anything uncomputable, you can still piggyback off of the nice properties of Solomonoff induction].)

One can still criticize the case of choosing between a list of hypotheses, given some data, as too abstract and not engaging enough with human limitations. But now I think this criticism is about equally deflationary for all the tools in all the toolboxes, and so it’s more emotionally appealing to reject it.

On the topic of regularization: Whenever you see the adjective “just” or “mere” in anything remotely philosophical, you can guess that that poor word is about to do some heavy lifting. So you can imagine what I anticipated upon reading that “Bayesianism is just regularization, dude.”

Funnily enough, I think the problem with the simple Bayesian interpretation of regularization (as you point out: who the heck has a prior that your model parameters are Gaussian-distributed with known variance?) is that they are insufficiently Bayesian. By this I mean that they tunnel-vision on a particular model, instead of trying to assign weights to a whole bunch of possible models and choosing between them based on what the data says, which involves applying Bayes’ rule way more, so it must be more Bayesian (:P). And of course, this isn’t an original idea: plenty of people are trying to do Bayesian hyperparameter optimization.
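The standard correspondence behind “regularization is just Bayes” is worth writing down explicitly: with Gaussian observation noise of variance \(\sigma^2\) and a zero-mean Gaussian prior of variance \(\tau^2\) on the weights, the MAP estimate is exactly ridge regression with \(\lambda = \sigma^2/\tau^2\).

```latex
\hat{w}_{\text{MAP}}
  = \arg\max_w \, p(w \mid y)
  = \arg\min_w \left[ \frac{1}{2\sigma^2}\lVert y - Xw \rVert^2
                      + \frac{1}{2\tau^2}\lVert w \rVert^2 \right]
  = \arg\min_w \left[ \lVert y - Xw \rVert^2 + \lambda \lVert w \rVert^2 \right],
\qquad \lambda = \frac{\sigma^2}{\tau^2}.
```

The “who has that prior?” complaint is precisely that \(\tau^2\) (and hence \(\lambda\)) is rarely something anyone actually believes in advance.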

Interesting stuff.

When you talk about finding the best hypothesis (i.e. getting the order of the probabilities right, if not the numerical value), why do you think Bayes gives the right answer?  You say “up to a magical prior,” but if we ignore the prior, we just have the likelihood, and we’re talking about “best hypothesis = maximum likelihood hypothesis.”  This isn’t exactly a bad idea but it’s neither uniquely Bayesian nor a good encapsulation of what we mean by “best” here.

One reason it isn’t a good encapsulation is that maximum likelihood may work better with some regularization, which a good prior would do.  But then, people seem to have a lot of trouble coming up with and using coherent priors, plus this gives us enough freedom that we can often change the result (which is best) by changing the prior … I’m just not seeing why Bayes does the job we want here in some assured, or uniquely good, way.
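The freedom being pointed at here, same data and likelihoods but a different prior yielding a different “best” hypothesis, is easy to exhibit. The numbers are made up for illustration.

```python
def posterior_winner(prior, likelihood):
    """Return the hypothesis with the largest (unnormalized) posterior."""
    return max(likelihood, key=lambda h: prior[h] * likelihood[h])

likelihood = {"A": 0.30, "B": 0.20}   # A is the maximum-likelihood pick

flat   = {"A": 0.5, "B": 0.5}         # flat prior keeps the ML ordering
skewed = {"A": 0.2, "B": 0.8}         # a defensible-looking prior flips it

assert posterior_winner(flat, likelihood) == "A"    # 0.15 vs 0.10
assert posterior_winner(skewed, likelihood) == "B"  # 0.06 vs 0.16
```

Nothing here says the skewed prior is wrong; that is exactly the problem when “best hypothesis” is supposed to come out in some assured way.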

a more natural thing to do would be to frame this in terms of making empirical predictions based on the set of imagined hypotheses, in which case the Bayesian approach still gets some nice guarantees for the same reason that minimum message length prediction is expected to work [even if you don’t do anything uncomputable, you can still piggyback off of the nice properties of Solomonoff induction]

I agree about the first part (mean vs. mode, right?), but I don’t think I’m familiar with the guarantees you refer to here – link?

About “just”: that was meant as semi-joking payback for all of the gotchas about how other methods are “just” Bayes in disguise.  Regularization is just Bayes, huh?  Well, guess what: Bayes is just regularization!!!

(via 4point2kelvin)

bayes: a kinda-sorta masterpost

derplefurf:

nostalgebraist:

I have written many many words about “Bayesianism” in this space over the years, but the closest thing to a comprehensive “my position on Bayes” post to date is this one from three years ago, which I wrote when I was much newer to this stuff.  People sometimes link that post or ask me about it, which almost never happens with my other Bayes posts.  So I figure I should write a more up-to-date “position post.”

I will try to make this at least kind of comprehensive, but I will omit many details and sometimes state conclusions without the corresponding arguments.  Feel free to ask me if you want to hear more about something.

I ended up including a whole lot of preparatory exposition here – the main critiques start in section 6, although there are various critical remarks earlier.

Keep reading

Worth noting that your Section 8 (considering more hypotheses as you go along, not enumerating an infinite hypothesis space at the start or using infinite computational power) highlights a problem that Eliezer and company have acknowledged for years, worked hard on, and last year actually found a novel answer to. (The best way to understand the paper, currently, is probably this 90-minute lecture.)

https://www.youtube.com/watch?v=UOddW4cXS5Y

Computable approximate Bayesian reasoners, e.g. logical inductors (which provably converge to perfect Bayesian reasoning in the limit, and have a bunch of nice properties as they go along), are indeed weirder to ponder than Solomonoff Induction. The objection about priors has an interesting answer here (with some edge cases), but I really can’t explain it out of context. And of course, this is a computable algorithm, but not an efficiently computable one.

But I’d like to note that while non-Bayesians were pointing out the issue as a “see, this is why Bayesian reasoning can’t do anything without infinite computation, might as well scrap that endeavor”, Eliezer and company were actually working on that issue.

I’m aware of that paper.  Here are my thoughts on it.

Re: your last paragraph – people tend to work on approaches they find relatively promising, so it shouldn’t be surprising that Bayesians worked on fixing problems with Bayes while non-Bayesians worked on improving other approaches.

(via profound-yet-trivial)