
bayes: a kinda-sorta masterpost

principioeternus:

nostalgebraist:

I have written many many words about “Bayesianism” in this space over the years, but the closest thing to a comprehensive “my position on Bayes” post to date is this one from three years ago, which I wrote when I was much newer to this stuff.  People sometimes link that post or ask me about it, which almost never happens with my other Bayes posts.  So I figure I should write a more up-to-date “position post.”

I will try to make this at least kind of comprehensive, but I will omit many details and sometimes state conclusions without the corresponding arguments.  Feel free to ask me if you want to hear more about something.

I ended up including a whole lot of preparatory exposition here – the main critiques start in section 6, although there are various critical remarks earlier.

Keep reading

I like this post.  I myself would say that I’m only a “weak Bayesian”, and that while I do solidly believe in various “Bayesian brain” theories, those theories are *muuuuuch* more philosophically pragmatist than the Strong Bayesian epistemological program.

My big question is whether anyone knows how to “replace” probability theory.  What I really want is a way of predicting stuff that lets information flow top-down *and* bottom-up, allows for continuously graded inferences, and allows for arbitrarily complicated structures and connections.  Most statistical and machine-learning methods, outside of those described below, *don’t* allow for that!  This is why I stick by my Weak Bayesianism even when it visibly sucks.

That said, there are some formal developments Nostalgebraist has missed here.

* Nonparametrics!  It’s not as if nobody has ever thought about the Problem of New Ideas before.  There’s a whole subfield of Bayesian nonparametric statistics devoted to handling exactly this.  The idea is that you start with a “nonparametric” prior model (a probabilistic model of an infinite-dimensional sample space).  Sure, this model will assign probabilities over objects that are formally infinite, but you only ever have to actually deal with finite portions of them that talk about your finite data.  Whenever new data appears to require a New Idea, though, the model will summon one up with approximately the right shape.  You can Monte Carlo sample increasingly large/complex finite elements of the posterior, and you never have to hold the infinite object in your head to be doing probabilistic inference with it.

* Probabilistic programming!  This one’s related to nonparametrics, since part of its purpose is to make nonparametrics easy to handle computationally.  In a probabilistic programming language, we can perform inference (both conditionalization and marginalization) in any model whose conditional-dependence structure corresponds to some program.  In practice, this means writing programs that flip coins, and then conditioning on observed flips to find the weights.  It’s actually surprisingly intuitive for having so much mathematical and computational machinery behind it.  It’s also Turing-universal: any distribution from which a computer can sample in finite time corresponds to some probabilistic program.  So we have a model class including everything we think a physical machine can cope with!

* Divergences are universal performance metrics.  Any predictive model - frequentist or Bayesian - can be *considered* to give an approximate posterior-predictive distribution.  An information divergence (usually a Kullback-Leibler divergence) then defines a “loss function” between the true empirical distribution over held-out sample data and an equivalent sample from the predictive distribution.  The higher the loss, the worse the predictive model, and the actual number can be (AFAIU) approximately calculated (certainly I’ve handled code that calculates approximate sample divergences).  A good frequentist model will have a low divergence (loss), and a bad Bayesian model will have a high divergence (loss).  This gives a good definition for a *bad* Bayesian model: one in which the posterior predictive doesn’t predict well.  This technique is regularly used in Bayesian statistics to evaluate and criticize models.
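As a toy illustration of the divergence-as-loss idea (a discrete sketch I’m adding here, not anything from the original posts), here’s a KL divergence between a held-out empirical distribution and two candidate predictive distributions:

```python
import math
from collections import Counter

def kl_divergence(p, q):
    """KL(p || q) over a shared finite support; higher = worse predictive fit."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)

# Held-out sample and two candidate posterior-predictive distributions.
held_out = ["a"] * 70 + ["b"] * 30
n = len(held_out)
empirical = {x: c / n for x, c in Counter(held_out).items()}

good_model = {"a": 0.72, "b": 0.28}  # close to the data
bad_model = {"a": 0.30, "b": 0.70}   # way off

# The worse predictive model gets the higher loss, regardless of whether
# it was fit by Bayesian or frequentist machinery.
assert kl_divergence(empirical, good_model) < kl_divergence(empirical, bad_model)
```

(This only works cleanly on discrete support; the continuous case needs density estimation or sample-based approximation, which is where the “approximately calculated” caveat above comes in.)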

What’s important here is that sample spaces like “Countable-dimensional probability distributions” (Dirichlet processes), “Uncountable-dimensional continuous functions” (Gaussian processes), and “all stochastic computer programs” seem to give us increasingly broad classes of probability models.  We would then like to do the reverse of old-fashioned Bayesian statistics: instead of starting with a restricted model, we can start with a very broad model and restrict it using our domain knowledge about the problem at hand.  We then plug-and-play some computational stuff to perform inference.
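To make the Dirichlet-process behavior concrete, here is a minimal sketch of the Chinese restaurant process, the sequential construction behind Dirichlet process mixtures (the function name and defaults are my own):

```python
import random

def crp_assignments(n_points, alpha=1.0, seed=0):
    """Chinese restaurant process: sequentially assign points to clusters.

    A new cluster (a "New Idea") is opened with probability
    alpha / (alpha + i) at step i, so the model can always summon a
    fresh component when the data seem to demand one.
    """
    rng = random.Random(seed)
    assignments = []
    counts = []  # counts[k] = number of points currently in cluster k
    for i in range(n_points):
        # Join existing cluster k with prob counts[k] / (alpha + i),
        # open a new cluster with prob     alpha     / (alpha + i).
        r = rng.uniform(0, alpha + i)
        acc = 0.0
        for k, c in enumerate(counts):
            acc += c
            if r < acc:
                assignments.append(k)
                counts[k] += 1
                break
        else:
            assignments.append(len(counts))  # a brand-new cluster
            counts.append(1)
    return assignments

print(crp_assignments(10))
```

Each run only ever touches finitely many clusters, even though the prior formally allows infinitely many — exactly the “you only deal with finite portions of the infinite object” point above.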

Of course, it doesn’t yet work well in practice, but these things are regularly used to model really complex stuff, up to and including thought.  Again, those are Weak Bayesian theories, and we care more about a Monte Carlo or variational posterior with a low predictive loss than about finding God’s own posterior distribution.
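A hypothetical minimal example of the “write a program that flips coins, then condition on observed flips” workflow, using brute-force rejection sampling rather than a real probabilistic programming language (all names are mine):

```python
import random

def coin_model(rng):
    """A tiny probabilistic program: draw a weight, then flip coins with it."""
    weight = rng.random()  # uniform prior on the coin's bias
    flips = [rng.random() < weight for _ in range(10)]
    return weight, flips

def infer_weight(observed_heads, n_samples=100_000, seed=0):
    """Rejection sampling: run the program, keep runs matching the data."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_samples):
        weight, flips = coin_model(rng)
        if sum(flips) == observed_heads:  # condition on the observation
            kept.append(weight)
    return sum(kept) / len(kept)  # Monte Carlo posterior mean of the weight

# Observing 8 heads out of 10 should pull the posterior mean toward the
# exact Beta(9, 3) answer of 0.75.
print(infer_weight(8))
```

Real systems replace the brute-force rejection loop with MCMC or variational inference, but the conditioning semantics is the same — and this is very much the “low predictive loss” Weak Bayes posture rather than God’s own posterior.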

Another important choice to make is indeed how you interpret probability.  I’ve actually liked the more measure-y way, once it was explained to me.  “Propositions” are then interpreted as subspaces of the sample space.  This seems like the Right Thing: you can start with a very complex model defined by some program or some infinite object or whatever, and then treat finite events within it as logical propositions.  Those propositions will obey Boolean logic, but their logical relations will come from the model, rather than the other way around.  An infinite-dimensional model will then also allow for an infinite number of propositions.
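A tiny sketch of that interpretation (my own toy example): the sample space is two coin flips, propositions are subsets of it, and Boolean logic on propositions is just set algebra weighted by the measure:

```python
from fractions import Fraction

# A tiny sample space: two fair coin flips, with a probability measure.
measure = {("H", "H"): Fraction(1, 4), ("H", "T"): Fraction(1, 4),
           ("T", "H"): Fraction(1, 4), ("T", "T"): Fraction(1, 4)}

def prob(event):
    """A proposition is a subset of the sample space; P sums the measure."""
    return sum(measure[w] for w in event)

# Propositions as events (subsets of the sample space):
first_heads = {w for w in measure if w[0] == "H"}
second_heads = {w for w in measure if w[1] == "H"}

# Boolean logic on propositions falls out of set operations on events,
# with the logical relations coming from the model:
assert prob(first_heads & second_heads) == Fraction(1, 4)  # "and"
assert prob(first_heads | second_heads) == Fraction(3, 4)  # "or"
assert prob(set(measure) - first_heads) == Fraction(1, 2)  # "not"
```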

I consider this a fairly good example of how sometimes you should build your philosophy *on top of* the math and science that you know can work, rather than the other way around.  Philosophy is an *output* of thought, so if you want new philosophy, you need new thoughts to think, and if you want new thoughts to think, you need to get them from the world.

This is an extremely interesting response, thank you.

I was totally ignorant of Bayesian nonparametrics until now and it is the sort of thing I should (and want to) know about.  Do you have any recommendations about what to read first?  Seems like there are a lot of references out there.

Any links about probabilistic programming that you think are especially good + relevant would be appreciated too.

I’m not sure I agree with your paragraph about divergences (or perhaps I don’t understand it).  I’m aware of the K-L divergence, and it’s true that you can get a “posterior distribution” of some kind out of any predictive model.  (In classification tasks, this is straightforward because the predictions are usually probabilistic anyway; it’s a little less clear to me how this works with regression, since the point estimates we make in regression don’t attempt to match the intrinsic/noise variance in the data, which would affect the K-L divergence.)

But there’s more than one way to compare two probability distributions, and I don’t see that “K-L divergence from empirical distribution of validation set” is the one best loss function for probabilistic modeling.  For one thing, we’re presumably going to want to use the joint distributions of all our variables (so that the model has to get the relation of X to Y right, not just match the overall relative counts for Y).  But that’s a potentially high-dimensional distribution which we’re sparsely sampling, so the literal empirical distribution will have spurious peaks centered at each data point, and we’d need to do some density reconstruction to get something more sensible – at which point it’s not clear that we trust this reference distribution more than our model’s posterior, since both involve approximate inference from the data.

Also, I know the K-L divergence has a bunch of special properties, but I’ve always been wary when people say that it is the one correct way to compare 2 distributions (or that there is one correct way).  To make the case it seems like you’d need some link between the special properties and the thing you want to do.  And in practice we use various loss functions (various proper scoring rules for classification, say) that aren’t (obviously?) the K-L div in disguise; is this wrong?

(via principioeternus)

identicaltomyself:

nostalgebraist:

Having thought about this for a few more minutes:

It seems like things are much easier to handle if, instead of putting any actual numbers (probabilities) in, we just track the partial order generated by the logical relations.  Like, when you consider a new hypothesis you’ve never thought about, you just note down “has to have lower probability than these ones I’ve already thought about, and higher probability than these other ones I’ve already thought about.”

At some point, you’re going to want to assign some actual numbers, but we can think of this step as more provisional and revisable than the partial order.  You can say “if I set P(thing) = whatever, what consequences does that have for everything else?” without committing to “P(thing) = whatever” once and for all, and if you retract it, the partial order is still there.

In fact, we can (I think) do conditionalization without numbers, since it just rules out subsets of hypothesis space.  I’m not sure how the details would work but it feels do-able.

The big problem with this is trying to do decision theory, because there you’re supposed to integrate over your probabilities for all hypotheses, whereas this setup lends itself better to getting bounds on individual hypotheses (“P(A) must be less than P(B), and I’m willing to say P(B) is less than 0.8, so P(A) is less than 0.8”).  I wonder if a sensible (non-standard) decision theory can be formulated on the basis of these bounds?
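A rough sketch of how the numberless bookkeeping might work (all names hypothetical; this just propagates provisional upper bounds through stated implications):

```python
def upper_bounds(implications, numeric_bounds):
    """Propagate upper bounds through a partial order of propositions.

    implications: list of (a, b) meaning "a implies b", so P(a) <= P(b).
    numeric_bounds: provisional numeric commitments, e.g. {"B": 0.8}.
    Returns the tightest derivable upper bound for each proposition.
    """
    props = {p for pair in implications for p in pair} | set(numeric_bounds)
    bounds = {p: numeric_bounds.get(p, 1.0) for p in props}
    changed = True
    while changed:  # propagate until a fixed point
        changed = False
        for a, b in implications:
            if bounds[b] < bounds[a]:  # P(a) <= P(b) <= bounds[b]
                bounds[a] = bounds[b]
                changed = True
    return bounds

# "A implies B implies C": provisionally saying P(C) <= 0.8 bounds A and B too.
print(upper_bounds([("A", "B"), ("B", "C")], {"C": 0.8}))
```

Lower bounds propagate the same way in the other direction, and retracting a provisional number just means rerunning the propagation without it — the partial order itself survives.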

I’ve seen papers on doing reasoning, based on propositions being more or less likely than other propositions, but without assigning numbers to the probabilities. Unfortunately, a half hour of poking around doesn’t turn up the papers I’m thinking of. The general area is called “valuation algebras on semirings”. In the case I remember, the semiring is Boolean algebra on propositions, which induces a partial order on the extent to which they are believed.

Anyway, that’s a not-very-useful half-assed reference. Now I’m going to switch to a more common mode of Tumblr discourse, i.e. talking about how what you say shows you’re thinking wrong (I may be misunderstanding what you say, but this being Tumblr, I will ignore that possibility.)

You’re operating on the principle that the goal of reasoning is to put probabilities on propositions. Then you find various problems involving e.g. what if you suddenly think of a new proposition, or realize that two propositions you thought were different are actually the same. But it seems to me that propositions are not the best thing to assign probabilities to.

What we want to find is a probability distribution over states of the world. Turning that into a probability for some proposition is a matter of adding up the probabilities of all the states of the world where that proposition is true. This is bog-standard measure theoretic probability theory, so it’s not just something I made up. You might find that thinking this way dissolves some of the perplexities you’ve been pondering in your last two posts.

Thanks for the pointer about valuation algebras on semirings.

About world states – I addressed that in my original post, when I contrasted the die roll example (where we really can describe world states) to real-world claims like “Trump will be re-elected in 2020.”

If we actually want to specify states of the real world at the level of measure theoretic outcomes (set elements, rather than sets), either we’ll throw away some of what we know about the world, or the outcomes would have to be things like quantum field configurations down to the subatomic scale.  (Indeed, even that would be throwing away knowledge, since we don’t have a unified theory of fundamental physics and aren’t fully committed to any of the theories we do have; the outcome-level description would have to involve different candidate laws of physics plus states in terms of them.)

The natural reflex is to do some sort of coarse-graining, where we abstract away from the smallest-level description, but at that point we’re basically doing Jaynes’ propositional framework, since we’re allowing that our most basic units of description could be refined further (we don’t specify O(10^23) variables for every mole of matter, but we allow that we might learn some of those variables later).

TBH, I think I am so skeptical of Bayes in part because I am used to thinking in the measure-theoretic framework, and it just seems so obvious that we can’t do practical reasoning with descriptions that are required to be that complete.  Jaynes’ propositional framework seems like an attempt to avoid this problem, or at least hide it, which is why I’m focusing on it – it’s less clear that it’s unworkable.

(via identicaltomyself)

bayes: a kinda-sorta masterpost

lostpuntinentofalantis:

notthedarklord42:

nostalgebraist:

@lostpuntinentofalantis

I don’t think the fact that humans are bad at thinking up logical implications is a very strong argument against bayes, in the same way that “But Harold, you said you loved Chocolate earlier!” is an argument against preferences.

So, I will agree that there’s this non-monotonic thing. This is indeed a very good point against using Bayes as a mental tool! I am not disagreeing with that!

What I do disagree with is the idea that it’s ipso facto problematic. I think the correct way to do this is to throw out your first estimate as a preliminary one, and then use the other logical implication questions as a way to generate a battery of knowledge in a kinda organic fashion. To use the original “California secession” thing, let’s say I think it’s unlikely, so I throw out 98% as my likelihood, then someone else asks me the “USA still together” question so I also generically throw out 98% but A HA!!!!!! THIS SEEMS WRONG, because the set of situations involving the US together but California leaving seems I dunno small or whatever, so I end up adjusting the probabilities, repeating until I’ve thought of all “relevant” probabilities.

But logically speaking isn’t this troublesome? Isn’t it terrible that in theory an adversary can choose a sequence of questions which allows them to set my probabilities? Well, not really. My claim is that thoughts of these logical implication things provide information because humans are really bad at accessing all the information they have, and that, yeah sure if the adversary controls how a person accesses their information, of course the person is screwed? So you hope that people have good internal “implication generating”  machinery, such that by the time that they have worked through a bunch of subset questions, they have dumped out all relevant information, and the ordering effects are washed out.

Which is a much more elaborate way of saying “guys stop throwing out random probabilities and sticking to them if you don’t have good intuition/facts doing cognitive work aaaaaaaahh”

I guess I can agree that nothing I said above is specifically motivated by Bayes, except for this vague feeling of “well, shit it turns out I’m actually really bad at incorporating all relevant information” and I think it’s really just unavoidable.

I don’t think this is a problem with humans, I think it’s much more fundamental.  The real issue is that these kinds of “obviously nested” statements have an “easy to check, hard to find” property, like with NP-complete problems.

Let’s define “A is obviously nested in B” as “if you describe both A and B to me, it’ll be immediately obvious to me that A is sufficient but not necessary for B.”  And let’s define an “obviously nested pair” as A, B where one is obviously nested in the other.

The “US in 2100” statements mentioned earlier are all obviously nested pairs with one another.  But the ones mentioned are just a few examples; there are infinitely many statements of the same form, asking about slightly bigger or smaller regions of the US, that also form obviously-nested pairs with all other such statements.

And that whole infinite chain is just one “direction” in hypothesis space.  You can think about any other subject – existence of various markets and sub-markets (will candy be sold?  will lollipops?), demographics and sub-demographics, scientific ideas and special cases thereof, you name it – and produce an infinite obviously-nested chain like this.

In finite time (much less polynomial time), you can only explicitly think about some vanishingly small subset of these statements.  Yet you implicitly know infinitely many facts about them (about each chain, in fact, of which there are infinitely many).  There’s no way to sit down and think enough beforehand that all of the obvious-nesting information has been dumped out into an explicit representation (and that representation would take infinite space anyway).

Now, maybe there is a way to handle this in practice so that it doesn’t hurt you too much, or something.  Such a theory would be very interesting, but as far as I know it doesn’t exist, and it would have to exist for us to begin talking about how a finite being could faithfully represent its implicit knowledge in a prior.

(This is a human problem in the sense that you could make a machine which would lack all this implicit knowledge.  That machine would not have this problem, but it would know less than we do, so we’d be throwing away information if we tried to imitate it.)

Yet you implicitly know infinitely many facts about them (about each chain, in fact, of which there are infinitely many).  There’s no way to sit down and think enough beforehand that all of the obvious-nesting information has been dumped out into an explicit representation (and that representation would take infinite space anyway).

Now, maybe there is a way to handle this in practice so that it doesn’t hurt you too much, or something.

This sounds like a natural continuity/limits problem. It does seem like there could be infinite nesting like this, and that you do know information about each step of the chain. However, I’m not sure this necessarily needs infinitely many facts to describe; perhaps an overarching fact could sum them up, or the facts get ‘smaller’ as the chain does, so that together they form a finite total fact. Thinking about the obvious-nesting information sounds very much like taking a limit.

The geographic example has very literal continuity, with larger and smaller regions of the US. I’m actually quite surprised there isn’t such a theory already! Hypothesis space, even when infinite, is continuous, and that makes a big difference.


On a separate note, I’m not convinced that we couldn’t make do with a model where we only consider a finite universe, with discrete rather than continuous space. That would mean you could not take infinitely many different regions of the US. And it would mean that only finitely many events could possibly occur in a given time period, which intuitively seems like saying there will only be finitely many such different chains of hypotheses to worry about.

While it seems a bit artificial at times, I don’t think it’s too unreasonable to allow a theory like this to only cover finite cases, not when the finite case can approximate the infinite case arbitrarily closely. Then it seems we could reasonably represent our priors. 

I am a bad pun blog and I endorse this message as elaborating on my “eh it probably converges” intuition earlier.

I think we can afford to agree to disagree unless @nostalgebraist can help me intuition pump this a bit further on why doing the subset enumeration problem doesn’t (eventually) converge.

I will say that this substantially downgraded my belief that Bayes is complete; there is much more work to be done, and I think it’s totally reasonable to call out the “unfounded intuition” parts of *the bayes memeplex* from the more proper Edwin and Eliezer’s Excellent Adventure canon.

The continuity thing is interesting.

Re this

I’m actually quite surprised there isn’t such a theory already! Hypothesis space, even when infinite, is continuous, and that makes a big difference.

What immediately came to my mind is that the Bayes setup doesn’t demand that your prior be continuous in any underlying variable, so this doesn’t come up in proving “for all” and “there exist” statements about Bayesian agents, and is easy to dismiss as “just a special case” if you think like that.  On the more practical side, concrete applications of Bayes always tend to have continuous priors (bc they use familiar probability distributions that have PDFs); it’s easy to forget that you don’t necessarily have to do this, and so you don’t really think about how it might give you extra properties.

(And indeed, you don’t always want continuity even in spatial examples, since the real world has state lines and other borders, for instance.)

Anyway, even if you assume your prior is always continuous in one or more underlying variables (space, time?, etc?), that still leaves the functional form open.  One worry about these kinds of cases is that your contortions to squeeze things in will give you a prior with lots of unmotivated variations in slope (flat for a while, then steep for a while).  So in addition to continuity, you’d need some general assumption like “I think things tend to vary linearly (or whatever) w/r/t space,” which would get you most of the way to being able to pull consistent probabilities out of the air in any order.  Although you still have to deal with things that are not nested but not independent either, and make sure all those relations work out … IDK, if someone’s worked this all out in detail I’d love to see it, but it sounds really hard.


bayes: a kinda-sorta masterpost

lostpuntinentofalantis:


This isn’t convincing to me (and I guess everything of this genre isn’t convincing to me) because, like, it seems to me that the infinite hypothesis thing is just a problem for every kind of thinking?  You can claim that frequentist tools only work in limited domains or whatever, but in my mind all you’ve done is sweep the “oh no what if I didn’t think of a relevant hypothesis??!??” problem into the “well yeah, you’re going to get burnt by this if you use it out of bounds” bucket.

To (ab)use the tool analogy, it turns out that all human made tools cannot survive in the middle of a supernova, and yes you’re technically correct that all the omnitool fanboys have been overselling the utility of omnitool usage in Exotic Space Environments, but the fact that all the non-omnitools have warnings about “cannot be used in supernovae” is not going to convince me that omnitools don’t exist, or are necessarily worse in all cases.

If you’re talking about Section 7, I’m not just saying that “there might be relevant hypotheses you hadn’t thought of,” I’m saying that it’s really hard to encode what you do know in a prior without throwing away some information.

In jadagul’s examples with the different regions in 2100, you already know (before you think about any of it) that those statements have a certain logical implication structure.  But you only start thinking about each relation as the relevant statement is brought to your attention.  Like, if you ask someone those questions in a non-monotonic order, they’ll have to take care to squeeze some probabilities inside others they’ve already stated, and this will make things clearly depend on the order of asking.  (In my example, the person said “94.5%” because they knew they needed something between 94 and 95, even though they were giving whole-number answers at first, and would have given a whole-number answer to the intermediate case if asked about it first.)
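The squeeze effect can be sketched mechanically (a toy model I’m adding, not anything from the thread): each previously stated answer for a nested region boxes in what a coherent answer to an intermediate question can be.

```python
def admissible_interval(stated, new_key):
    """Given answers already stated for nested regions (higher nesting
    rank = logically weaker claim, so higher probability), return the
    interval a coherent answer for new_key must be squeezed into.

    stated: dict mapping nesting rank -> stated probability (in %).
    """
    lower = max((p for k, p in stated.items() if k < new_key), default=0.0)
    upper = min((p for k, p in stated.items() if k > new_key), default=100.0)
    return lower, upper

# Asked in a non-monotonic order: ranks 1 and 3 got whole-number answers.
stated = {1: 94.0, 3: 95.0}
# The intermediate case (rank 2) is now boxed in between 94 and 95,
# forcing a fractional answer like 94.5 that depends on the asking order.
print(admissible_interval(stated, 2))
```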

(BTW I once actually asked these questions sequentially to a rationalist meetup group as a way of making this point)

So the problem isn’t “your knowledge is finite” but “you can’t encode exactly what you know (and nothing else) in a prior, or at least I know of no way to do it.”

You could say this is just another thing warning that should go on the label, but it suggests that we’re actually using the wrong representation for our prior knowledge, and so we have a “garbage in, garbage out” type problem: Bayes is somehow failing to capture what we know, and we don’t (AFAIK) have any bounds or guarantees on what problems this will or won’t cause.  Whereas in the frequentist procedures, we can at least describe what it would look like for a human to use them correctly, and guarantee certain things for that human.

bayes: a kinda-sorta masterpost

4point2kelvin:


Finding the realio truilo bestio hypothesis by simple application of Bayes’ theorem requires infinite computing power: this is a true and important point. But you can also find the best hypothesis within the set of hypotheses you’ve actually thought of. The probability isn’t “right” - it neither matches the hypercomputing limit nor even tries to account for your own fallibility - but you can find the best hypothesis of those available (up to a magical prior).

I think this task, of finding the best hypothesis among some you’ve thought of, is a useful one for grounding the discussion and allowing comparison between different problem-solving methods. I think that solving this problem provides space for a Bayesianism that’s more substantive than just a collection of machinery, but is still part of a larger system for understanding human reasoning.

(Of course, choice of this goal [identify the best hypothesis] is itself not Bayesian - a more natural thing to do would be to frame this in terms of making empirical predictions based on the set of imagined hypotheses, in which case the Bayesian approach still gets some nice guarantees for the same reason that minimum message length prediction is expected to work [even if you don’t do anything uncomputable, you can still piggyback off of the nice properties of Solomonoff induction].)

One can still criticize the case of choosing between a list of hypotheses, given some data, as too abstract and not engaging enough with human limitations. But now I think this criticism is about equally deflationary for all the tools in all the toolboxes, and so it’s more emotionally appealing to reject it.

On the topic of regularization: Whenever you see the adjective “just” or “mere” in anything remotely philosophical, you can guess that that poor word is about to do some heavy lifting. So you can imagine what I anticipated upon reading that “Bayesianism is just regularization, dude.”

Funnily enough, I think the problem with the simple Bayesian interpretation of regularization (as you point out: who the heck has a prior that your model parameters are Gaussian-distributed with known variance?) is that it is insufficiently Bayesian. By this I mean that it tunnel-visions on a particular model, instead of trying to assign weights to a whole bunch of possible models and choosing between them based on what the data says, which involves applying Bayes’ rule way more, so it must be more Bayesian (:P). And of course, this isn’t an original idea: plenty of people are trying to do Bayesian hyperparameter optimization.
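The “regularization is a Gaussian prior” identity is easy to check numerically: ridge regression coincides with the MAP estimate when the prior variance is tied to the regularization strength.  A minimal sketch, with data, λ, and noise scale all made up:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# Ridge: minimize ||y - Xw||^2 + lam * ||w||^2
lam = 2.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# MAP with Gaussian noise of variance s2 and prior w ~ N(0, (s2/lam) I):
# minimizing (1/s2)||y - Xw||^2 + (lam/s2)||w||^2 has the same argmin.
s2 = 0.01
w_map = np.linalg.solve(X.T @ X / s2 + (lam / s2) * np.eye(3), X.T @ y / s2)

assert np.allclose(w_ridge, w_map)
```

The catch is the one named above: nobody actually believes that particular prior, so the Bayesian gloss is more a relabeling than an explanation.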

Interesting stuff.

When you talk about finding the best hypothesis (i.e. getting the order of the probabilities right, if not the numerical value), why do you think Bayes gives the right answer?  You say “up to a magical prior,” but if we ignore the prior, we just have the likelihood, and we’re talking about “best hypothesis = maximum likelihood hypothesis.”  This isn’t exactly a bad idea but it’s neither uniquely Bayesian nor a good encapsulation of what we mean by “best” here.

One reason it isn’t a good encapsulation is that maximum likelihood may work better with some regularization, which a good prior would provide.  But then, people seem to have a lot of trouble coming up with and using coherent priors, plus this gives us enough freedom that we can often change the result (which hypothesis counts as best) by changing the prior … I’m just not seeing why Bayes does the job we want here in some assured, or uniquely good, way.
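That “change the result by changing the prior” freedom is easy to exhibit with a two-hypothesis toy (all numbers invented):

```python
# Same likelihoods, two different priors, different "best" hypothesis.
like = {"H1": 0.30, "H2": 0.20}  # P(data | H)

def best_under(prior):
    post = {h: prior[h] * like[h] for h in like}
    return max(post, key=post.get)

flat_best = best_under({"H1": 0.5, "H2": 0.5})  # matches maximum likelihood
skew_best = best_under({"H1": 0.2, "H2": 0.8})  # the prior flips the order
```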

a more natural thing to do would be to frame this in terms of making empirical predictions based on the set of imagined hypotheses, in which case the Bayesian approach still gets some nice guarantees for the same reason that minimum message length prediction is expected to work [even if you don’t do anything uncomputable, you can still piggyback off of the nice properties of Solomonoff induction]

I agree about the first part (mean vs. mode, right?), but I don’t think I’m familiar with the guarantees you refer to here – link?

About “just”: that was meant as semi-joking payback for all of the gotchas about how other methods are “just” Bayes in disguise.  Regularization is just Bayes, huh?  Well, guess what: Bayes is just regularization!!!

(via 4point2kelvin)

bayes: a kinda-sorta masterpost

derplefurf:

nostalgebraist:

I have written many many words about “Bayesianism” in this space over the years, but the closest thing to a comprehensive “my position on Bayes” post to date is this one from three years ago, which I wrote when I was much newer to this stuff.  People sometimes link that post or ask me about it, which almost never happens with my other Bayes posts.  So I figure I should write a more up-to-date “position post.”

I will try to make this at least kind of comprehensive, but I will omit many details and sometimes state conclusions without the corresponding arguments.  Feel free to ask me if you want to hear more about something.

I ended up including a whole lot of preparatory exposition here – the main critiques start in section 6, although there are various critical remarks earlier.

Keep reading

Worth noting that your Section 8 (considering more hypotheses as you go along, not enumerating an infinite hypothesis space at the start or using infinite computational power) highlights a problem that Eliezer and company have acknowledged for years, worked hard on, and last year actually found a novel answer to. (The best way to understand the paper, currently, is probably this 90-minute lecture.)

https://www.youtube.com/watch?v=UOddW4cXS5Y

Computable approximate Bayesian reasoners, e.g. logical inductors (which provably converge to perfect Bayesian reasoning in the limit, and have a bunch of nice properties as they go along), are indeed weirder to ponder than Solomonoff Induction. The objection about priors has an interesting answer here (with some edge cases), but I really can’t explain it out of context. And of course, this is a computable algorithm but not an efficiently computable one.

But I’d like to note that while non-Bayesians were pointing out the issue as a “see, this is why Bayesian reasoning can’t do anything without infinite computation, might as well scrap that endeavor”, Eliezer and company were actually working on that issue.

I’m aware of that paper.  Here are my thoughts on it.

Re: your last paragraph – people tend to work on approaches they find relatively promising, so it shouldn’t be surprising that Bayesians worked on fixing problems with Bayes while non-Bayesians worked on improving other approaches.

(via profound-yet-trivial)

eclairsandsins:

nostalgebraist:

I’ve been thinking a bit about how to get a “uniform rather than pointwise” version of the logical induction stuff.

It seems like a lot of the challenge of the problem is generic to Bayesianism, and not particular to “logical” or “mathematical” outcomes.  Anyone who reads my posts on this stuff knows I have an axe to grind about how, outside of specialized small domains, you don’t have a complete sigma-algebra to put probabilities on.  (Because you aren’t logically omniscient, you don’t know all the logical relations between hypotheses, which means you don’t know all of the subset relationships between sets in your algebra.)

The case of “logical induction” forces Bayesians to think about this even if they wouldn’t otherwise, since the prototypical/motivating examples involve math, a world in which we are continually discovering facts of the form “A implies B.”  So assuming logical omniscience would be assuming we know all the theorems at the outset, in which case we wouldn’t need LI to begin with.

But the problem that “A may imply B even though you don’t know it does” is generic and comes up for Bayesian inference about real-world events, too.  A good solution to this problem would be very interesting and important (?) even if it didn’t, in itself, handle the “logical” aspects of “logical induction” (like “what counts as evidence for a logical sentence”).


I have to imagine there is work on this problem out there, but I have had a hard time finding it.

The basic mathematical setup would have to involve some “incomplete” version of a sigma-algebra (generically, a field of sets), where not all of the union/intersection information is “known.”  This is a bit weird, because when we talk about a collection of sets, we usually mean we know what is in the sets, and that information determines all the relations like “A is a subset of B” (i.e. A implies B), whereas we want to make some of them go away.

A Boolean algebra is like a field of sets where we forget what the sets contain, and just leave them as blank symbols that happen to have union/intersection (AKA “join/meet”) relations with one another.  That seems closer to what we want, except that we need some of the join/meet operations to give undefined results.  There are Boolean algebras where not everything has a join/meet (those that aren’t complete, in the complete lattice sense), but this seems like a thing having to do with inf/sup stuff in infinite spaces and isn’t really what we want.  (Despite my username, I know very little about algebra and am just flying blind on Wikipedia here.)

An example of the sort of thing I want to do is the following.  Say we are assigning probabilities in (0,1) to P(A), P(B), P(A=>B), and P(B=>A).  Suppose P(A=>B) > P(B=>A), that is, we think it’s more likely that A implies B than the reverse (and in particular, more likely than A<=>B).

Now consider P(A) and P(B).  The probabilities above say we’re most likely to be in a world where A=>B and not vice versa, in which case we should have P(A) > P(B), or we’ll be incoherent.  So it seems like we should have P(A) > P(B) right now.  Of course, this will make us incoherent if it turns out that we are in the B=>A or A<=>B worlds, but we think those are less likely.  In betting terms, the losses we might incur from incoherence in a likely world should outweigh those we’d incur from incoherence in an unlikely world.

What we’re really doing here, I guess, is treating the implication (i.e. subset relation) as a random event, so implicitly there is a second, complete probability space whose events (or outcomes?) include the subset relations on the first, incomplete probability space (the one discussed above).  Maybe you could just do the whole thing this way?  I haven’t tried it; I’m curious what would happen.
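For what it’s worth, a crude version of “doing the whole thing this way” can be written down: put probabilities on the implication relations themselves, pick some conditional distribution over truth-value worlds given each relation, and mix.  Everything below is invented for illustration, and the choice of uniform conditionals is doing real work:

```python
# Worlds are (A, B) truth-value pairs; each relation rules out the world
# that would falsify it, and we put a uniform conditional on the rest.
worlds = [(a, b) for a in (False, True) for b in (False, True)]

def uniform_over(allowed):
    return {w: (1 / len(allowed) if w in allowed else 0.0) for w in worlds}

conditional = {
    "A=>B": uniform_over([w for w in worlds if not (w[0] and not w[1])]),
    "B=>A": uniform_over([w for w in worlds if not (w[1] and not w[0])]),
}
relation_prob = {"A=>B": 0.7, "B=>A": 0.3}  # P(A=>B) > P(B=>A)

mixed = {w: sum(relation_prob[r] * conditional[r][w] for r in relation_prob)
         for w in worlds}
P_A = sum(p for (a, _), p in mixed.items() if a)
P_B = sum(p for (_, b), p in mixed.items() if b)
# With these choices the mixture comes out P(B) > P(A): 17/30 vs. 13/30.
```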

Anyway, I can’t help but think there must be the right math tools out there for doing this kind of thing, and I just don’t know about it.  Anyone have pointers?

Um, if P(A → B) > P(B → A), then P(A) < P(B), not greater. Imagine if P(A → B) = 90% and P(B → A) = 30%. The only difference between the truth tables of A → B vs. B → A is that the former is false only when A is true and B is false, and the latter is false only when B is true and A is false. So, P(A and ¬B) = 10% and P(B and ¬A) = 70%. The latter tells you that P(B) ≥ 70% and P(¬A) ≥ 70%, aka P(A) ≤ 30%. Therefore P(B) > P(A). To get a more general proof, use variables instead of 90% and 30%.

By the way, “A → B” is just another way of saying “¬A or B,” and you just apply normal probability to the latter.
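Spelling that arithmetic out as a check (the 10% and 70% cells are forced by the premises; how the remaining 20% splits between the A∧B and ¬A∧¬B worlds is a free choice):

```python
# Distribution over the four (A, B) truth-value worlds consistent with
# P(A -> B) = 0.9 and P(B -> A) = 0.3 under the material-conditional reading.
p = {
    (True, False): 0.10,   # 1 - P(A -> B): the only world falsifying A -> B
    (False, True): 0.70,   # 1 - P(B -> A): the only world falsifying B -> A
    (True, True): 0.15,    # free split of the remaining 0.20
    (False, False): 0.05,
}
assert abs(sum(p.values()) - 1.0) < 1e-12

P_A = sum(v for (a, _), v in p.items() if a)  # 0.25
P_B = sum(v for (_, b), v in p.items() if b)  # 0.85
# P(B) - P(A) = 0.70 - 0.10 = 0.60, regardless of the free split.
```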

Ah, I think we are using two different meanings for the → sign.

In the Kolmogorov defn. of a probability space, we have to have a sigma-algebra (which specifies sets [“events”] and their union/intersection relations) before we assign any probabilities.  If A, B are in the sigma-algebra and A is a subset of B, this is interpreted as “if event A happens, event B must also happen.”  If we are taking A and B to be propositions, this means “A implies B.”

In the usual Bayesian framework, the events are propositions, but (our beliefs about) truth and falsehood are represented by probability assignments we give to the events, and we can only make these assignments if we have the sigma-algebra already.  So the sigma-algebra encodes implication relationships which we are supposed to assent to before we take the step where we say certain propositions are true (P=1) or false (P=0).

To use the classic example, the sigma-algebra will have “Linda is a feminist bankteller” (A) as a subset of “Linda is a bankteller” (B).  Then when we go and assign probabilities, the probability axioms tell us that we must respect this implication (A→B).  Among other things this will mean that we assign probability 1 to “¬A or B,” for the trivial reason that “¬A or B” is the set of all outcomes.  But this is not the sort of thing that the framework allows us to not know, and then figure out: it is fixed by the sigma-algebra at the outset.

So when I write things like P(A → B), I am talking about the sort of relation we normally get from the sigma-algebra.  Such a relation goes beyond the truth tables: the sigma-algebra normally tells us things like “if Linda is a feminist bankteller, Linda is a bankteller” which are true (in the relevant sense) even if Linda is neither of those things in reality (in which case the truth tables are mute).  There’s a connection to math progress here, in that often mathematicians are concerned about the consequences of assuming certain axioms but agnostic about the truth of the axioms; “the well-ordering theorem is equivalent to the axiom of choice” is interesting, even though you will be hard pressed to find people who think they’re both true or both false (it is contested what that would even mean!).

It sounds like you’re coming from Jaynes’ approach to probability, while I’m used to Kolmogorov; the two are close to equivalent, but I’ll have to think more about whether Jaynes’ version makes this problem easier.

on MIRI’s “Logical Induction” paper

jadagul:

nostalgebraist:

When I first saw this paper, I said it looked impressive, and seemed a lot more substantial than MIRI’s other work, but I never really looked at it in much detail.  In the past week I’ve been having an extended conversation with a friend about it, and that spurred me to read it more closely.  I now think it’s much less impressive than it seems at first glance.

It occurred to me that the criticisms I made to my friends might be of wider interest, so I’ll write a post about them.

Summary:

The authors state something they call the “logical induction criterion,” meant to formalize a kind of ideal inference.  To satisfy this criterion, an inference procedure only needs to have certain asymptotic properties in the infinite-time limit.  Rather than being sufficient for good inference but too strong for practical computation (as the paper suggests), the criterion is too weak for good inference: it tolerates arbitrarily bad performance for arbitrarily long times.

The easiest way to see why this is problematic in practice is to consider the criterion-satisfying procedure constructed in the paper, called LIA.  Speaking very roughly, LIA makes a countably-infinite list (in unspecified/arbitrary order) of all the “mistakes” it could possibly make, and at discrete time n, does a brute force search for a set of beliefs which avoids the first n mistakes in the list.

Depending on the ordering of the list, LIA can make an arbitrarily bad mistake for an arbitrarily long time (some mistake has to go in slot number 3^^^3, etc.)  Nonetheless, LIA can be proven to asymptotically avoid every mistake, since for every mistake there is some time N at which it begins to avoid it.  Thus, LIA enjoys a very large number of nice-sounding asymptotic properties, but it converges for very different reasons than most algorithms one is used to hearing about: rather than converging to nice properties because it moves toward them, it simply exploits the fact that these properties can only fail in countably many ways, and ticks off those ways one by one.
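The convergence pattern can be caricatured in a few lines.  This toy stand-in is not the paper’s construction, just the shape of the guarantee: each mistake is eventually avoided, but at every finite time some mistake is still live:

```python
def avoided(n, k):
    """Toy LIA-style learner: at time n, mistake k (in some arbitrary
    enumeration) is avoided iff it sits in the first n slots."""
    return k < n

# Pointwise guarantee: every mistake k is avoided from time k + 1 onward.
assert all(avoided(n, 7) for n in range(8, 100))

# But no uniform guarantee: at any time n, slot n's mistake is still made,
# and the enumeration can put an arbitrarily bad mistake in that slot.
assert not any(avoided(n, n) for n in range(100))
```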

Thus, LIA is like an “author-bot” which generates all possible character strings in some arbitrary order, and at each step consults a panel of readers to weigh in on its best work so far.  One could argue that this bot’s “asymptotic best work” will have any aspect of writerly brilliance imaginable, but that does not mean that the bot satisfies some “ideal writing criterion,” and indeed we’d expect its output in practice to lack any aspects of writerly brilliance.  (LIA differs from author-bot in that it has an objective, if very slow, way to find its “best work so far.”  But garbage in means garbage out.)

More detail under the cut

Keep reading

Good post, for those who are interested in such things.

I wanted to pull out one bit that I thought was interesting. Content warning for analysis.

Keep reading

Oh, yeah, it totally is a pointwise vs. uniform convergence thing!  Thanks, that is illuminating.

In the OP I wrote “it [the LI criterion] tolerates arbitrarily bad performance for arbitrarily long times,” but now I realize that’s not stating the full case, since it could be true even if we had uniform (but arbitrarily slow) convergence.  I linked your post in a Discord chat and said this:

i think “it’ll take an extremely long time, but eventually, the probabilities will be sensible” is implicitly imagining that the convergence is uniform, when it’s really pointwise.  like, we imagine that there’s some huge N such that if we wait until then, the probabilities will be generally “good,” i.e. will have a whole bunch of great properties at once.

but what we really have are just guarantees that each good property arrives sometime.  it doesn’t have to arrive simultaneously with properties that seem related, and we’re not guaranteed an overall “package deal” at any point (unless we can show that there’s an e.c. trader that gets us that whole package at once).
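This is the textbook pointwise-vs-uniform distinction from analysis.  The classic example f_n(x) = x^n on [0, 1) shows the same “no package deal” behavior: every point eventually gets close to the limit, but no single N works for all points at once:

```python
def f(n, x):
    return x ** n  # converges to 0 pointwise on [0, 1), but not uniformly

# Pointwise: each fixed x is eventually small.
for x in (0.5, 0.9, 0.99):
    assert f(10_000, x) < 1e-6

# Not uniform: for every n there is an x where f_n is still near 1.
for n in (10, 100, 1000):
    assert f(n, 0.999999) > 0.9
```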

(via jadagul)

compulsive liars

bambamramfan:

nostalgebraist:

@diss-this-coarse-discourse

your reasons seem like mostly “because these people are basically just normal otherwise” (i.e. normally distributed) and I infer that, having seen the margins of innocuous, trivial lying, you have developed the opinion that truth is sort of less all-important than most others think. Most importantly, you hint that all people do these sort of barely-false, slight exaggerations in day to day conversation, but that that’s fine.

Even knowing some statements are pointless (pragmatically speaking) exaggerations, we can still trust people in some way. The method you suggest is limiting trust evaluations until they refer to something of substance. Otherwise we’re sort of just asking people to be relentlessly self-effacing, or worse, actually fully deceptive.

But like, I couldn’t do that personally; I can’t tolerate these quasi-white-lies. I don’t know any compulsive liars personally, but I’m quite familiar with white lying, and I pretty much hate it, terminally.

In the least charitable sense I find it deeply selfish and antagonistic, but, in a more self-interested sense, I dislike and wish to discourage the social environment it produces. In a way, it’s almost the opposite of what you suggest: zero tolerance of lies, but absolute tolerance of conduct.

I would much rather people be fully honest about literally everything, and self-sort / develop accurate models of themselves and the world that way, than the alternative, which I believe basically always exists: frequent expectations of deceit and exaggeration.

Why do you prefer otherwise? If I had to guess, it would be pessimism about human nature (+ inability to actually enforce truthfulness perfectly), but I want to know for real

p.s. (please no exaggeration or white lying)
p.p.s just kidding

I didn’t mean to imply I prefer an environment with deceit and the expectation of deceit to one without those things.  I was talking about the world I actually find myself in, where there are going to be some deceitful people (some moreso, some less so) in any given environment.

Given this fact, how should I react to these people?  Should I shun them?  What I was saying was, most people who lie do so in somewhat predictable patterns – say, on specific topics or in specific social atmospheres – and if I know this, I can generally recognize the lies pretty well.  I still dislike it if people lie when it’s socially expected that they will be telling the truth (sometimes it isn’t, as in “bull sessions”).  But it’s just a character flaw like any other, and I’m not out there looking for people with no flaws.  I just look at the flaws and the upsides, and decide.

I guess I’m not understanding the other fork of your question.  You’re talking about an environment where there’s no deceit and no expectation of deceit?  But that would be unstable – any unscrupulous person could come in and be deceptive and abuse people’s lack of vigilance.  Is it pessimism about human nature to think that people might do this?  It strikes me as extremely optimistic to think they might not.  It’s like saying we’ll solve the Prisoner’s Dilemma by making it so no one ever defects and no one ever expects anyone to defect.  (Which requires that no one who would defect would ever enter the group of players, and that no player in the group will ever change their minds about defecting.)

So I guess I should ask you: what might a hypothetical community with no deceit and no expectation of deceit look like?  How might those things be enforced?

The sentiment nost is expressing, which I find very true, might be best explained to rationalists in terms of empirically valid Schelling Points:

There are two models:

1. We have a Schelling Point against lying. Anyone who does lie, especially who lies a lot, in any field is likely to lie in other fields too, and ignore moral limits. Therefore we don’t trust them in any interaction.

2. The Schelling Point is best set up between spheres. Someone who lies to socially impress people is no more likely to lie under testimony, or in order to steal money, or to convince you to sleep with them even though they are in a committed monogamous relationship.

Nost finds the latter model a better fit for the world around us. My experience (I’ve known liars of all stripes; in particular, the type OP cited was a housemate once, and boy was he hilarious to the rest of us) agrees with that.

You have every right to the preference “I don’t want to be around social liars.” You may even have some good logic about why social liars deteriorate the community around them if tolerated. (People may disagree.)

But what would be wrong, and what I think the OP is opposing, is the empirical claim that these people will do things that they in fact don’t do any more than normal people. (Said roommate did pay his rent on time, AFAIK.)

And a great deal of hostile exclusion requires that empirical jump. People aren’t satisfied to say “this is my preference to avoid them”, but must insist on many other sins the target is likely to commit if we tolerate them at all, and to treat social lying merely as a “red flag” to other issues, issues that are so serious no community could function around them. Which is why this Schelling Point is an important question to answer.

(via rasienna)