
raginrayguns replied to your quote “@nostalgebraist​ I interpreted your LI post as arguing that it wasn’t…”

I think I've said this before, but if I were in this field, I'd be approaching it from the opposite direction. I think Bayesianism came from: “hmm, why does probability theory seem to reproduce a lot of qualitative good advice about reasoning, when it's just counting possibilities? Is there another way to look at it?” I'd be trying to get something that works in SOME situations, something else that works in OTHER situations, etc., and start from there toward a general solution.

This sounds right, although some cases are harder because you don’t have anything sitting around that does mysteriously well at the problem.  Like, clearly there are some things (say, humans) that can “do practical logical induction” pretty well, but we don’t have any prototype case that’s simple enough to analyze.

Maybe related: after I made that post, I thought “hmm, but do I have any positive examples of how to think about ideal behavior under constraints?”  And I thought, well, VC theory and PAC learning seem like examples of what I want.  These ideas start out with the observation that real “learners” (humans, ML algorithms) can do pretty well at inferring functions from finitely many examples, even though you can’t really be good at this in the fully general case where a function is just an arbitrary mapping between sets and f(x) tells you nothing about f(x’).  So they ask, how would the functions have to be restricted to make learning feasible?  And what sort of success metric captures the success of the real learners, without being too forgiving or too stringent?
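To make the PAC idea concrete, here's a toy example (mine, not from any textbook): if the target functions are restricted to thresholds on [0,1], a naive "take the smallest positive example" learner generalizes at a rate you can actually quantify, which is exactly the kind of guarantee PAC bounds formalize. An arbitrary 0/1 labeling of [0,1] would give you nothing to work with.

```python
import random

def pac_threshold_demo(n_samples, true_t=0.3, trials=200, seed=0):
    """Learn a threshold function f(x) = 1 iff x >= t from labeled samples.

    Because the hypothesis class has one parameter, the generalization
    error shrinks roughly like 1/n as samples come in; this is the sort
    of restriction-plus-guarantee that PAC learning studies.  Returns
    the average generalization error over many trials.
    """
    rng = random.Random(seed)
    errs = []
    for _ in range(trials):
        xs = [rng.random() for _ in range(n_samples)]
        ys = [x >= true_t for x in xs]
        # consistent learner: guess the smallest positive example seen
        pos = [x for x, y in zip(xs, ys) if y]
        t_hat = min(pos) if pos else 1.0
        # under uniform x, the error rate is just the gap between thresholds
        errs.append(abs(t_hat - true_t))
    return sum(errs) / trials
```

Running this with growing `n_samples` shows the error dropping, which is the whole point: restricting the function class is what makes learning from finitely many examples feasible.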

So, these ideas start out by noting that there are versions of your problem that are too hard or too easy, and trying to find the “right scale” where some practical methods look better than others, rather than looking equally bad or good, or too coarsely distinguished.

Actually… now that I think about it, these are theories of online supervised learning, and logical induction is trying to do online supervised learning for logic. So, this isn’t just an analogy — these things are directly comparable. Just like with PAC learning, you could say: “if logic were just an arbitrary assignment of binary labels to sentences, then there would be no patterns to learn and the problem would be impossible. So, what properties does logic have that could make it more learnable? And how weak does our notion of learnability have to be?” Maybe you’ll find that PA is fundamentally too hard but some more restricted system is okay, maybe you’ll find that PAC is too hard but a weaker learning concept is appropriate, that sort of thing.

Viewed from the online supervised learning perspective, MIRI’s criterion essentially says, “you should use a nonparametric method, because a parametric method will have some blind spots that persist no matter how much data has come in, and someone without these blind spots could make winning trades against your parametric model forever.”

And this is … a fair if limited point about machine learning, although it has nothing really to do with logic!  Like, in a nonlinear online learning problem, someone using a random forest that grows with N could pump money out of someone using logistic regression, and they could do this forever.  This is true but it’s not the only fact that’s ever relevant about these two methods.  You can’t make a good general theory of online supervised learning out of just this one distinction, and it’s not clear why this distinction would be any more (or less) important for predicting logic than for anything else.
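A minimal illustration of that blind-spot point (my own toy, not MIRI's setup): on an XOR-shaped stream, a linear model has a permanent blind spot, while a "memorize everything" learner that grows with the data eventually gets every case right.

```python
import itertools
import random

def xor_stream(n, seed=0):
    """Yield ((a, b), a XOR b) pairs with random binary inputs."""
    rng = random.Random(seed)
    for _ in range(n):
        a, b = rng.randint(0, 1), rng.randint(0, 1)
        yield (a, b), a ^ b

def best_linear_accuracy():
    """XOR is famously not linearly separable, so no linear threshold rule
    gets all 4 input patterns right; a small grid search over weights
    finds the best achievable, 3/4.  That 1/4 is the permanent blind spot."""
    pts = [((a, b), a ^ b) for a in (0, 1) for b in (0, 1)]
    best = 0
    for w1, w2, c in itertools.product((-1, 0, 1), repeat=3):
        acc = sum((w1 * a + w2 * b + c > 0) == bool(y) for (a, b), y in pts)
        best = max(best, acc)
    return best / 4

def table_learner_accuracy(n=1000):
    """Nonparametric 'memorize everything' learner: predict the label last
    seen for each input (guessing 0 on unseen inputs).  After one pass over
    the 4 input patterns it is never wrong again."""
    table = {}
    correct = 0
    for x, y in xor_stream(n):
        correct += table.get(x, 0) == y
        table[x] = y
    return correct / n
```

Someone betting with the lookup table pumps money out of the linear modeler forever, which is true and worth knowing, but, as said above, it's hardly the only relevant fact about the two methods.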

So this is picking the “wrong scale” for the problem, sorta – your evaluation is too strict in one way (you have to do well forever) and too forgiving in another (you only have to do well in the long run), and this ends up making a distinction between parametric and nonparametric and not getting any finer-grained than that, with the weird implication that the worst (consistent) nonparametric methods are better than the best parametric methods.  This is so simple, I’m really not sure why I didn’t think about it before.


One of the OpenPhil reviewers actually said something about the relation between PAC learning and MIRI’s framework (this was about the paper “Inductive Coherence”):

At a very high level, there is some overlap between the type of work considered here and work on making decisions optimally in the face of computational constraints, since making such decision might involve approximating probability. […] In spirit, the paper is also close to work going back to the 1960s on language identification in the limit. Work on this topic has been largely superseded by a weaker model, Valiant’s notion of PAC (probably approximately correct) learning. (As an aside, if I were a referee of this paper for a journal, I would ask the authors to compare their work to Gold’s work on language identification.)

I remember reading this and wanting to understand it, but when I looked up Gold's work I found it technically challenging.  Still, I suspect that understanding this context, where people tried “language identification in the limit” and then moved to PAC learning, would be very illuminating.

The new MIRI blog post on “Embedded World-Models” says some of the same things that I (among various others) have been saying for a long time about the problems with standard Bayesian rationality.

Not sure what to make of that – did they change their minds about this stuff?  Were they always closer to my position than to the people who would argue against me when I said stuff like (direct quotes from the post follow):

Imagine a computer science theory person who is having a disagreement with a programmer. The theory person is making use of an abstract model. The programmer is complaining that the abstract model isn’t something you would ever run, because it is computationally intractable. The theory person responds that the point isn’t to ever run it. Rather, the point is to understand some phenomenon which will also be relevant to more tractable things which you would want to run.

I bring this up in order to emphasize that my perspective is a lot more like the theory person’s. I’m not talking about AIXI to say “AIXI is an idealization you can’t run”. The answers to the puzzles I’m pointing at don’t need to run. I just want to understand some phenomena.

However, sometimes a thing that makes a theoretical model less tractable also makes that model too different from the phenomenon we’re interested in.

The way AIXI wins games is by assuming we can do true Bayesian updating over a hypothesis space, assuming the world is in our hypothesis space, etc. So it can tell us something about the aspect of realistic agency that’s approximately doing Bayesian updating over an approximately-good-enough hypothesis space. But embedded agents don’t just need approximate solutions to that problem; they need to solve several problems that are different in kind from that problem.

[…]

Uncertainty about the consequences of your beliefs is logical uncertainty. In this case, the agent might be empirically certain of a unique mathematical description pinpointing which universe she’s in, while being logically uncertain of most consequences of that description.

Logic and probability theory are two great triumphs in the codification of rational thought. However, the two don’t work together as well as one might think.

Probability is like a scale, with worlds as weights. An observation eliminates some of the possible worlds, removing weights and shifting the balance of beliefs.

Logic is like a tree, growing from the seed of axioms. For real-world agents, the process of growth is never complete; you never know all the consequences of each belief.

Not knowing the consequences of a belief is like not knowing where to place the weights on the scales of probability. If we put weights in both places until a proof rules one out, the beliefs just oscillate forever rather than doing anything useful.

This forces us to grapple directly with the problem of a world that’s larger than the agent. We want some notion of boundedly rational beliefs about uncertain consequences; but any computable beliefs about logic must have left out something, since the tree will grow larger than any container.

[…]

In a traditional Bayesian framework, “learning” means Bayesian updating. But as we noted, Bayesian updating requires that the agent start out large enough to consider a bunch of ways the world can be, and learn by ruling some of these out.

Embedded agents need resource-limited, logically uncertain updates, which don’t work like this.

furioustimemachinebarbarian asked: Just read the post on Occam's razor linked from your big Bayes post and it's a very clear example of a nebulous idea I've hand-waved at in the past. Nothing much to add, just wanted to say bravo.

Thanks!

For those who are into that sort of thing: my big Bayes post from a while back got linked on LessWrong recently, and I’ve gotten involved in a few arguments in the comments.

Covers some of the same stuff we discussed over here when I first made the post, but there was an interesting (to me) dive into what actually happens if you use “zeroing out cells of truth tables” as a proxy for “noticing material conditionals you already believed but hadn’t taken into account.”

endecision:

singular-they:

The same author has, according to my professor, alluded to being criticized by his colleagues for making the cover of the book four tastefully presented puppies, because they thought it wasn’t serious enough.

Okay I had to google this and it did not disappoint:

image

And here is an even more amazing image from the author’s blog:

image

(via solsticehappiness)

nostalgebraist:

nostalgebraist:

In need once again of melodic/dopaminergic academic motivation music, I return to Eternal Sonata soundtrack and remember that one of the boss themes is called “I Bet My Belief”

[paranoid whisper] the Bayesians … they're everywhere

Three and a half years later, I still wonder about this

And then on the same soundtrack, for those occasions when no amount of data can overcome a difference in priors: “Your Truth Is My False”

(via nostalgebraist)

notgrantpeters asked: Curious: what resources are you using to study Bayes-as-practiced? I've been putting it off for years (ever since Wasserman's All of Nonparametric Statistics tantalized me by excluding all of Nonparametric Bayes) but, especially if you'll be writing about it, now seems like a good time to learn

I haven’t looked into it in any organized way yet, so mostly random papers / blog posts / Wikipedia.  I’ll probably look into Gelman’s book sometime.

For nonparametric Bayes, I’ve just been reading random tutorial articles on Gaussian and Dirichlet processes (of which there are zillions).  Also, after @somervta mentioned David Duvenaud recently in another context, I’ve been looking into his research, esp. his PhD thesis (available on that page) about fancy things you can do with Gaussian processes by automatically building their kernels.

I’ve also been reading about variational autoencoders, which are a nice point of overlap between neural net stuff (finding a good low-dimensional encoding of a signal) and Bayes.  This post was helpful, although I don’t like some of the expositional choices there.

(The upshot is basically: in the Bayes perspective, you have some latent variable model where you’re willing to assume a distribution for the latent variables, but you don’t know the function that maps from them to the observed variables, so you have to do something fancy to learn that function while not knowing, for any particular data point, what value of the latent variables was actually realized.  In the neural net perspective, this is like training an autoencoder, where the “assumed distribution for the latent variables” appears as a regularizer encouraging your learned encoding to spread the training data out in a nice uniform way in the lower-dimensional space)
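For concreteness, here's the loss that paragraph is describing, in stripped-down form (my own sketch; a real VAE computes the reconstruction term through a decoder network, and the encoder outputs `mu` and `log_var`):

```python
import math

def vae_loss(x, x_recon, mu, log_var):
    """Negative ELBO for a VAE with a standard-normal prior on the latents.

    The first term rewards reconstructing the input.  The second is the
    KL divergence between the encoder's Gaussian q(z|x) = N(mu, diag(var))
    and the assumed prior N(0, I).  That KL term is the "regularizer"
    described above: it pulls the learned encoding toward the prior,
    spreading the training data out nicely in the latent space.
    """
    recon = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    kl = 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                   for m, lv in zip(mu, log_var))
    return recon + kl
```

When the encoder's distribution exactly matches the prior (mu = 0, log_var = 0), the KL term vanishes; any deviation from the prior costs loss, which is the regularization pressure.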

brief ignorant notes on bayesian methods

I have written a lot on this tumblr about (mostly against) “strong Bayesianism” or “Jaynesianism,” but I have mostly been silent about the pros and cons of Bayesian methods as they are actually practiced.  This is, honestly, because I don’t know much about Bayesian methods as they are actually practiced, although I am trying to learn more.

Back when I wrote that Bayes masterpost, @raginrayguns​ rightly took me to task for ignoring “hedging” as a virtue of Bayesian modeling.  Something that stands out to me when I read about Bayes-in-practice is that hedging is seen as extremely important – indeed, often as the whole point of the exercise.

This is quite different from the Jaynesian perspective, where both prior and posterior are representations of real beliefs, and hence it is important to get the prior “right” (through MaxEnt or something).  In practical Bayesian work, the prior is treated more as a way to do model averaging; what matters is not whether it philosophically “reflects our beliefs in the absence of evidence” but whether it leads to averaging over models in a way we like.

You have probably seen it before, but that Gelman/Shalizi paper is relevant here – it says you should do hypothetico-deductivism with Bayesian models, where both the model class and the prior are falsifiable hypotheses.

One very intuitive (to me) justification for model averaging is automatic quantification of variance (and its consequences).  If you just fit one “best” model, you can happily chug along making predictions with it, but you ought to worry about how much each of these predictions would have varied if you had fitted the model on slightly different data (with different noise, say).  Since a Bayesian method effectively uses every model in the model class and averages over them, it perhaps captures this variability?  I am used to seeing this done with the bootstrap, which directly generates “different data”; there is supposedly a connection between the bootstrap and the Bayesian thing (which uses only the real data but still uses multiple models), but I don’t fully understand it yet.
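Here's the bootstrap half of that, just to have something concrete (a minimal sketch of the standard procedure):

```python
import random
import statistics

def bootstrap_se(data, stat=statistics.mean, n_boot=2000, seed=0):
    """Estimate the sampling variability of `stat` by refitting it on
    resampled-with-replacement copies of the data, i.e. by directly
    generating the "slightly different data" mentioned above and seeing
    how much the fitted quantity moves around."""
    rng = random.Random(seed)
    n = len(data)
    reps = [stat([data[rng.randrange(n)] for _ in range(n)])
            for _ in range(n_boot)]
    return statistics.stdev(reps)
```

For the sample mean this should land near the analytic standard error, sigma/sqrt(n); the appeal is that it works the same way for statistics with no analytic formula.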

A superficially obvious “gotcha” argument goes like this: “even if some averaging is being done under the hood, the Bayesian model still just outputs conditional probabilities, like any probabilistic model.  Thus ‘Bayesian averaging over model class C’ produces a single model for each training data set, and is thus choosing a single ‘best’ model from some other model class (call it C-prime).  One could then argue that it would be better to average models from C-prime according to some prior, obtaining C-prime-prime, and so on ad infinitum.”

I haven’t really worked that through and I don’t know whether it truly makes sense.  It also seems misleading in that it dismisses “averaging under the hood” as though this is a mere computational choice and can’t be discerned from the resulting conditional probabilities, but that doesn’t seem like it’s true.  Except for special cases (involving Gaussianity/linearity), I have a hard time thinking of apparently non-Bayesian methods that can be re-written as Bayesian averages in a nontrivial way.  (Random forests might be Monte Carlo sampling from trees according to likelihood? not sure.)

This suggests that there may be special features conferred by the Bayesian averaging process which can be read off of the results even if you didn’t know there was averaging under the hood, but if so, I don’t know what they are (or how to look for info on this).

In a machine learning context, Bayesian methods (relative to others) feel less solidly rooted in Breiman’s “algorithmic modeling” culture – like they still have one foot in the “data modeling” culture.  There is a great deal of focus on technical methods for sampling from ~*~*the posterior*~*~, with the implication that it is clearly this great amazing thing and we are justified in going to great lengths to approximately compute it.  This is a bit confusing to me since the posterior is just a combination of a model class and a prior, and the prior is often just some computationally convenient distribution (Gaussian, Dirichlet), so it seems like we’re working very hard to compute something whose definition we chose for our own convenience rather than its optimality.

Discussions of the Dirichlet process, for instance, often start out with talk of “adaptively choosing the number of clusters” – leading me to say “great, so what’s the best way to do that?” – and then jump into discussions of the Chinese restaurant process without telling me why the clusters should be generated in this way rather than any other.

(Actually, if someone can point me to a justification of the Dirichlet distribution that isn’t “it’s a conjugate prior, which is computationally convenient,” that would be helpful)
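For reference, the Chinese restaurant process itself is only a few lines (my own sketch; this shows what the rule is, not why it should be preferred over any other way of generating clusters):

```python
import random

def crp(n, alpha, seed=0):
    """Sample cluster assignments for n points from a Chinese restaurant
    process with concentration alpha.

    Point i joins an existing cluster of size n_k with probability
    n_k / (i + alpha), and starts a new cluster with probability
    alpha / (i + alpha).  So the number of clusters grows with the
    data (the "adaptively choosing the number of clusters" part),
    roughly like alpha * log(n), with a rich-get-richer bias toward
    big clusters.
    """
    rng = random.Random(seed)
    assignments, sizes = [], []
    for i in range(n):
        r = rng.random() * (i + alpha)  # total unnormalized mass
        acc = 0.0
        for k, nk in enumerate(sizes):
            acc += nk
            if r < acc:                 # landed in existing cluster k
                assignments.append(k)
                sizes[k] += 1
                break
        else:                           # leftover alpha mass: new cluster
            assignments.append(len(sizes))
            sizes.append(1)
    return assignments
```
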

bayes: a kinda-sorta masterpost

raginrayguns:

nostalgebraist:

I have written many many words about “Bayesianism” in this space over the years, but the closest thing to a comprehensive “my position on Bayes” post to date is this one from three years ago, which I wrote when I was much newer to this stuff.  People sometimes link that post or ask me about it, which almost never happens with my other Bayes posts.  So I figure I should write a more up-to-date “position post.”

I will try to make this at least kind of comprehensive, but I will omit many details and sometimes state conclusions without the corresponding arguments.  Feel free to ask me if you want to hear more about something.

I ended up including a whole lot of preparatory exposition here – the main critiques start in section 6, although there are various critical remarks earlier.

Keep reading

10. It’s just regularization, dude

(N.B. the below is hand-wavey and not quite formally correct, I just want to get the intuition across)

My favorite way of thinking about statistics is the one they teach you in machine learning.

You’ve got data.  You’ve got an “algorithm,” which takes in data on one end, and spits out a model on the other.  You want your algorithm to spit out a model that can predict new data, data you didn’t put in.

“Predicting new data well” can be formally decomposed into two parts, “bias” and “variance.”  If your algorithm is biased, that means it tends to make models that do a certain thing no matter what the data does.  Like, if your algorithm is linear regression, it’ll make a model that’s linear, whether the data is linear or not.  It has a bias.

“Variance” is the sensitivity of the model to fluctuations in the data.  Any data set is gonna have some noise along with the signal.  If your algorithm can come up with really complicated models, then it can fit whatever weird nonlinear things the signal is doing (low bias), but also will tend to misperceive the noise as signal.  So you’ll get a model exquisitely well-fitted to the subtle undulations of your dataset (which were due to random noise) and it’ll suck at prediction.

There is a famous “tradeoff” between bias and variance, because the more complicated you let your models get, the more freedom they have to fit the noise.  But reality is complicated, so you don’t want to just restrict yourself to something super simple like linear models.  What do you do?

A typical answer is “regularization,” which starts out with an algorithm that can produce really complex models, and then adds in a penalty for complexity alongside the usual penalty for bad data fits.  So your algorithm “spends points” like an RPG character: if adding complexity helps fit the data, it can afford to spend some complexity points on it, but otherwise it’ll default to the less complex one.
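In the simplest possible case (a toy of mine, one slope parameter with an L2 penalty), the closed form makes the "spending points" tradeoff visible:

```python
def ridge_slope(xs, ys, lam):
    """One-parameter regularized least squares: minimize
        sum_i (y_i - w * x_i)^2  +  lam * w^2.
    Setting the derivative to zero gives the closed form
        w = sum(x*y) / (sum(x^2) + lam).
    lam = 0 is plain least squares; a bigger lam "charges" for slope,
    shrinking w toward 0 (less variance, more bias)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)
```

With lam = 0 you recover the ordinary least-squares slope; as lam grows the fitted slope gets cheaper to leave near zero, which is exactly the bias you're buying in exchange for less sensitivity to noise.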

This point has been made by many people, but Shalizi made it well in the very same post I linked earlier: Bayesian conditionalization is formally identical to a regularized version of maximum likelihood inference, where the prior is the regularizing part.  That is, rather than just choosing the hypothesis that best fits the data, full stop, you mix together “how well does this fit the data” with “how much did I believe this before.”

But hardly anyone has strong beliefs about models before they even see the data.  Like, before I show you the data, what is your “degree of belief” that a regression coefficient will be between 1 and 1.5?  What does that even mean?

Eliezer Yudkowsky, strong Bayesian extraordinaire, spins this correspondence as a win for Bayesianism:

So you want to use a linear regression, instead of doing Bayesian updates?  But look to the underlying structure of the linear regression, and you see that it corresponds to picking the best point estimate given a Gaussian likelihood function and a uniform prior over the parameters.

You want to use a regularized linear regression, because that works better in practice?  Well, that corresponds (says the Bayesian) to having a Gaussian prior over the weights.

But think about it.  In the bias/variance picture, L2 regularization (what he’s referring to) is used because it penalizes variance; we can figure out the right strength of regularization (i.e. the variance of the Gaussian prior) by seeing what works best in practice.  This is a concrete, grounded, practical story that actually explains why we are doing the thing.  In the Bayesian story, we supposedly have beliefs about our regression coefficients which are represented by a Gaussian.  What sort of person thinks “oh yeah, my beliefs about these coefficients correspond to a Gaussian with variance 2.5”?  And what if I do cross-validation, like I always do, and find that variance 200 works better for the problem?  Was the other person wrong?  But how could they have known?

It gets worse.  Sometimes you don’t do L2 regularization.  Sometimes you do L1 regularization, because (talking in real-world terms) you want sparse coefficients.  In Bayes land, this

can be interpreted as a Bayesian posterior mode estimate when the regression parameters have independent Laplace (i.e., double-exponential) priors

Even ignoring the mode vs. mean issue, I have never met anyone who could tell whether their beliefs were normally distributed vs. Laplace distributed.  Have you?
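Still, the practical difference between the two penalties is easy to see in the one-coefficient case (my own sketch, minimizing the penalized squared error for a single observation z): the L2 penalty shrinks estimates smoothly toward zero, while the L1 penalty snaps small estimates to exactly zero, which is the sparsity point.

```python
def l2_estimate(z, lam):
    """argmin_w (z - w)^2 + lam * w^2.
    Setting the derivative to zero gives w = z / (1 + lam):
    shrinks toward 0 but is never exactly 0 for nonzero z."""
    return z / (1 + lam)

def l1_estimate(z, lam):
    """argmin_w (z - w)^2 + lam * |w|.
    The solution is the soft-threshold rule: anything within
    lam/2 of zero is set to exactly zero (sparsity)."""
    return (1 if z > 0 else -1) * max(abs(z) - lam / 2, 0.0)
```

In the Bayesian reading these are the posterior modes under Gaussian and Laplace priors respectively, but the behavioral story ("I want sparse coefficients") is the one people can actually act on.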

tl;dr: Regularization is not the point of the prior. Even when we’re not regularizing, the prior is an indispensable part of useful machinery for producing “hedged” estimates, which are good in all plausible worlds.

OK, here’s the whole post.

The quoted section is about whether Bayesians can explain regularization. We know regularization helps, and we’re going to do it in any case, but Bayesians purport to explain why and when it helps. See, for example, the above @yudkowsky quote, as well as this one:

Eliezer_Yudkowsky:

The point of Bayesianism isn’t that there’s a toolbox of known algorithms like max-entropy methods which are supposed to work for everything. The point of Bayesianism is to provide a coherent background epistemology which underlies everything; when a frequentist algorithm works, there’s supposed to be a Bayesian explanation of why it works. I have said this before many times but it seems to be a “resistant concept” which simply cannot sink in for many people.

nostalgebraist is making Yudkowsky very happy in his post, by arguing with his actual belief in the status of Bayesianism as a background epistemology. nostalgebraist’s point is that Bayesianism doesn’t explain why or how we regularize, and more generally that we shouldn’t try to judge inferential methods by how Bayesian they are. nostalgebraist is summarizing this as “Bayesianism is just regularization,” which is a not entirely serious inversion of a common Bayesian position, that “regularization is just Bayesian statistics.”

I disagree with nostalgebraist about all this, and I’m going to write a post about why, maybe next week. This current post, which will be quite long, is absolutely not about the issue of whether Bayesianism explains regularization. I start by describing this issue just to show that I understand the real point of the OP, and that I am being quite deliberate when I completely ignore it in the following.

What I want to focus on is nostalgebraist’s half-joking statement that Bayesian inference is just regularization. While he’s not being entirely serious, he may be partly serious, and in any case it’s what a lot of people actually believe. For example, in replies framed as defenses of the Bayesian framework, @4point2kelvin writes “You can definitely think of anything Bayesian as ‘maximum likelihood with a prior.’ But even though the prior has to be (somewhat) arbitrary when the hypothesis-space is infinite, I still think it’s useful.” Plus, once I’ve shown Bayes isn’t just regularization, then I get to say what else it is.

I’m going to start with some technicalities, focusing on the mode vs mean issue nostalgebraist alluded to. Then I’m going to show an example where Bayesian estimation improves on maximum likelihood, without any of the increase in bias that Shalizi suggests is necessary, and explain what’s going on.

Keep reading

Reblogging because this is good and I want to have it on my blog + remind myself to read it more closely so I can actually say something about the issues it raises

(via raginrayguns)