
bayes: a kinda-sorta masterpost

raginrayguns:

nostalgebraist:

I have written many many words about “Bayesianism” in this space over the years, but the closest thing to a comprehensive “my position on Bayes” post to date is this one from three years ago, which I wrote when I was much newer to this stuff.  People sometimes link that post or ask me about it, which almost never happens with my other Bayes posts.  So I figure I should write a more up-to-date “position post.”

I will try to make this at least kind of comprehensive, but I will omit many details and sometimes state conclusions without the corresponding arguments.  Feel free to ask me if you want to hear more about something.

I ended up including a whole lot of preparatory exposition here – the main critiques start in section 6, although there are various critical remarks earlier.


10. It’s just regularization, dude

(N.B. the below is hand-wavey and not quite formally correct, I just want to get the intuition across)

My favorite way of thinking about statistics is the one they teach you in machine learning.

You’ve got data.  You’ve got an “algorithm,” which takes in data on one end, and spits out a model on the other.  You want your algorithm to spit out a model that can predict new data, data you didn’t put in.

“Predicting new data well” can be formally decomposed into two parts, “bias” and “variance.”  If your algorithm is biased, that means it tends to make models that do a certain thing no matter what the data does.  Like, if your algorithm is linear regression, it’ll make a model that’s linear, whether the data is linear or not.  It has a bias.

“Variance” is the sensitivity of the model to fluctuations in the data.  Any data set is gonna have some noise along with the signal.  If your algorithm can come up with really complicated models, then it can fit whatever weird nonlinear things the signal is doing (low bias), but also will tend to misperceive the noise as signal.  So you’ll get a model exquisitely well-fitted to the subtle undulations of your dataset (which were due to random noise) and it’ll suck at prediction.

There is a famous “tradeoff” between bias and variance, because the more complicated you let your models get, the more freedom they have to fit the noise.  But reality is complicated, so you don’t want to just restrict yourself to something super simple like linear models.  What do you do?
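To make the tradeoff concrete, here's a toy Monte Carlo (my own sketch, numbers made up, nothing from the quoted post): two estimators of a population mean, the plain sample mean and a "shrunk" version that halves it. The shrinkage buys lower variance at the price of bias.

```python
import random
import statistics

random.seed(0)
TRUE_MEAN = 2.0

def simulate(n_sims=2000, n=20):
    """Return plain and shrunk estimates across many simulated datasets."""
    plain, shrunk = [], []
    for _ in range(n_sims):
        data = [random.gauss(TRUE_MEAN, 1.0) for _ in range(n)]
        m = statistics.fmean(data)
        plain.append(m)         # unbiased, but jumps around with the noise
        shrunk.append(0.5 * m)  # biased toward 0, but only 1/4 the variance
    return plain, shrunk

plain, shrunk = simulate()
bias_plain = abs(statistics.fmean(plain) - TRUE_MEAN)
bias_shrunk = abs(statistics.fmean(shrunk) - TRUE_MEAN)
var_plain = statistics.pvariance(plain)
var_shrunk = statistics.pvariance(shrunk)
```

Whether the biased estimator predicts better overall depends on how big the noise is relative to the signal, which is the whole game.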

A typical answer is “regularization,” which starts out with an algorithm that can produce really complex models, and then adds in a penalty for complexity alongside the usual penalty for bad data fits.  So your algorithm “spends points” like an RPG character: if adding complexity helps fit the data, it can afford to spend some complexity points on it, but otherwise it’ll default to the less complex one.
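The "spending points" picture has a closed form in the simplest possible case, a one-parameter regression through the origin with an L2 penalty. This is a toy illustration of my own, not anything from the post:

```python
def ridge_slope(xs, ys, lam):
    """Minimize sum((y - b*x)^2) + lam * b^2; closed-form solution."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]  # roughly slope 2, with a little noise

b_unpenalized = ridge_slope(xs, ys, lam=0.0)   # ~1.99: fits the data freely
b_penalized = ridge_slope(xs, ys, lam=10.0)    # shrunk toward zero
```

The penalty term in the denominator drags the slope toward zero unless the data "pays" enough (via a large `sum(x*y)`) to overcome it.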

This point has been made by many people, but Shalizi made it well in the very same post I linked earlier: Bayesian conditionalization is formally identical to a regularized version of maximum likelihood inference, where the prior is the regularizing part.  That is, rather than just choosing the hypothesis that best fits the data, full stop, you mix together “how well does this fit the data” with “how much did I believe this before.”
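The formal identity can be checked numerically in the simplest case, estimating a Gaussian mean: the posterior mode under a N(0, τ²) prior coincides with the minimizer of an L2-penalized least-squares objective with λ = σ²/τ². This is a standard identity, sketched here with made-up data:

```python
# Data assumed drawn from N(mu, sigma^2); prior on mu is N(0, tau^2).
ys = [1.2, 0.7, 1.5, 0.9, 1.1]
sigma2, tau2 = 1.0, 4.0
lam = sigma2 / tau2  # the prior variance sets the penalty strength

# Closed-form posterior mode (everything Gaussian, so mode == mean):
mu_map = (sum(ys) / sigma2) / (len(ys) / sigma2 + 1.0 / tau2)

# Brute-force minimum of the regularized least-squares objective:
def penalized_loss(mu):
    return sum((y - mu) ** 2 for y in ys) + lam * mu ** 2

grid = [i / 10000 for i in range(0, 20000)]  # mu in [0, 2)
mu_ridge = min(grid, key=penalized_loss)
```

The two answers agree to grid precision: the prior is doing exactly the job of the regularizer.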

But hardly anyone has strong beliefs about models before they even see the data.  Like, before I show you the data, what is your “degree of belief” that a regression coefficient will be between 1 and 1.5?  What does that even mean?

Eliezer Yudkowsky, strong Bayesian extraordinaire, spins this correspondence as a win for Bayesianism:

So you want to use a linear regression, instead of doing Bayesian updates?  But look to the underlying structure of the linear regression, and you see that it corresponds to picking the best point estimate given a Gaussian likelihood function and a uniform prior over the parameters.

You want to use a regularized linear regression, because that works better in practice?  Well, that corresponds (says the Bayesian) to having a Gaussian prior over the weights.

But think about it.  In the bias/variance picture, L2 regularization (what he’s referring to) is used because it penalizes variance; we can figure out the right strength of regularization (i.e. the variance of the Gaussian prior) by seeing what works best in practice.  This is a concrete, grounded, practical story that actually explains why we are doing the thing.  In the Bayesian story, we supposedly have beliefs about our regression coefficients which are represented by a Gaussian.  What sort of person thinks “oh yeah, my beliefs about these coefficients correspond to a Gaussian with variance 2.5”?  And what if I do cross-validation, like I always do, and find that variance 200 works better for the problem?  Was the other person _wrong_?  But how could they have known?
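"Seeing what works best in practice" is just cross-validation. Here's a toy leave-one-out version for the one-parameter ridge problem, with made-up data and an arbitrary λ grid of my own choosing:

```python
import random

random.seed(1)
xs = [i / 10 for i in range(1, 21)]
ys = [2.0 * x + random.gauss(0, 0.5) for x in xs]  # true slope 2, noisy

def ridge_slope(xs, ys, lam):
    """Closed-form L2-penalized slope for regression through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def loocv_error(xs, ys, lam):
    """Leave-one-out CV: average squared error on each held-out point."""
    err = 0.0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        b = ridge_slope(train_x, train_y, lam)
        err += (ys[i] - b * xs[i]) ** 2
    return err / len(xs)

grid = [0.0, 0.1, 1.0, 10.0, 100.0]
errors = {lam: loocv_error(xs, ys, lam) for lam in grid}
best_lam = min(errors, key=errors.get)  # the lambda that "works best"
```

Nothing in this procedure asks anyone what their prior beliefs were; λ is chosen because of how it predicts, full stop.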

It gets worse.  Sometimes you don’t do L2 regularization.  Sometimes you do L1 regularization, because (talking in real-world terms) you want sparse coefficients.  In Bayes land, this

can be interpreted as a Bayesian posterior mode estimate when the regression parameters have independent Laplace (i.e., double-exponential) priors

Even ignoring the mode vs. mean issue, I have never met anyone who could tell whether their beliefs were normally distributed vs. Laplace distributed.  Have you?
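Whatever one's "beliefs," the two penalties behave differently in a way that matters practically: in the one-parameter case the L1 solution is a soft-threshold, which can be exactly zero, while the L2 solution only ever shrinks. A minimal check (my own illustration, not from the thread):

```python
def lasso_1d(y, lam):
    """argmin over b of (y - b)^2 + lam * |b|  ->  soft-thresholding."""
    if y > lam / 2:
        return y - lam / 2
    if y < -lam / 2:
        return y + lam / 2
    return 0.0  # L1 sets small coefficients exactly to zero: sparsity

def ridge_1d(y, lam):
    """argmin over b of (y - b)^2 + lam * b^2  ->  proportional shrinkage."""
    return y / (1 + lam)

b_l1 = lasso_1d(0.3, lam=1.0)   # zeroed out
b_l2 = ridge_1d(0.3, lam=1.0)   # shrunk to 0.15, but never exactly zero
```

That behavioral difference (sparse vs. dense coefficients), not introspection about Laplace-shaped beliefs, is why people reach for one penalty over the other.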

tl;dr: Regularization is not the point of the prior. Even when we’re not regularizing, the prior is an indispensable part of useful machinery for producing “hedged” estimates, which are good in all plausible worlds.

OK, here’s the whole post.

The quoted section is about whether Bayesians can explain regularization. We know regularization helps, and we’re going to do it in any case, but Bayesians purport to explain why and when it helps. See, for example, the above @yudkowsky quote, as well as this one:

Eliezer_Yudkowsky:

The point of Bayesianism isn’t that there’s a toolbox of known algorithms like max-entropy methods which are supposed to work for everything. The point of Bayesianism is to provide a coherent background epistemology which underlies everything; when a frequentist algorithm works, there’s supposed to be a Bayesian explanation of why it works. I have said this before many times but it seems to be a “resistant concept” which simply cannot sink in for many people.

nostalgebraist is making Yudkowsky very happy in his post, by arguing with his actual belief in the status of Bayesianism as a background epistemology. nostalgebraist’s point is that Bayesianism doesn’t explain why or how we regularize, and more generally that we shouldn’t try to judge inferential methods by how Bayesian they are. nostalgebraist is summarizing this as “Bayesianism is just regularization,” which is a not entirely serious inversion of a common Bayesian position, that “regularization is just Bayesian statistics.”

I disagree with nostalgebraist about all this, and I’m going to write a post about why, maybe next week. This current post, which will be quite long, is absolutely not about the issue of whether Bayesianism explains regularization. I start by describing this issue just to show that I understand the real point of the OP, and that I am being quite deliberate when I completely ignore it in the following.

What I want to focus on is nostalgebraist’s half-joking statement that Bayesian inference is just regularization. While he’s not being entirely serious, he may be partly serious, and in any case it’s what a lot of people actually believe. For example, in replies framed as defenses of the Bayesian framework, @4point2kelvin writes “You can definitely think of anything Bayesian as ‘maximum likelihood with a prior.’ But even though the prior has to be (somewhat) arbitrary when the hypothesis-space is infinite, I still think it’s useful.” Plus, once I’ve shown Bayes isn’t just regularization, then I get to say what else it is.

I’m going to start with some technicalities, focusing on the mode vs mean issue nostalgebraist alluded to. Then I’m going to show an example where Bayesian estimation improves on maximum likelihood, without any of the increase in bias that Shalizi suggests is necessary, and explain what’s going on.


Reblogging because this is good and I want to have it on my blog + remind myself to read it more closely so I can actually say something about the issues it raises

(via raginrayguns)

Among the upper class in 18th century pre-Victorian Britain, pubic hair from one’s lover was frequently collected as a souvenir.

blackblocberniebros:

kirins-forrest:

‘Hello land dog, I am water dog.’

Mammals just love to snuggle. That’s what mammals do. I love it so much.

(via blackblocberniebros-deactivated)

About fifteen years ago, Greg Hjorth began proving theorems on this topic. He apparently can’t stop, and now has over thirty publications.

Adweek magazine described Keiser as “the most visible character in an underground movement that has spurred hundreds of blog posts and videos, and played some small part in driving up the price of precious metals”.

Speaking of economic systems, have you guys heard about the Lange Model?  It’s p wild

Because prices are set by the central planning board “artificially” aiming to achieve planned growth objectives, it is unlikely that supply and demand will be in equilibrium at first. To produce the correct amount of goods and services, the Lange model suggests a trial-and-error method. If there is a surplus of a particular good, the central planning board lowers the price of that good. Conversely, if there is a shortage of a good, the board raises the price. This process of price adjustments takes place until equilibrium between supply and demand is achieved.
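The trial-and-error rule described above is basically a feedback loop. Here's a toy simulation with made-up linear demand and supply curves, just to show the price walk converging (the curves, step size, and tolerance are all my own assumptions):

```python
def demand(p):
    return 100.0 - 4.0 * p  # made-up linear demand curve

def supply(p):
    return 10.0 + 2.0 * p   # made-up linear supply curve

def lange_adjust(p=1.0, eta=0.05, tol=1e-6, max_steps=10_000):
    """Raise the price on a shortage, lower it on a surplus, until balanced."""
    for _ in range(max_steps):
        gap = demand(p) - supply(p)  # shortage if positive, surplus if negative
        if abs(gap) < tol:
            break
        p += eta * gap
    return p

p_star = lange_adjust()  # converges to the market-clearing price of 15
```

With linear curves and a small enough step size this is a contraction, so it settles at the equilibrium price; whether a real planning board could do the same across millions of goods is, of course, the actual debate.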

I need to get off tumblr and work.  Yell at me if I post here in the next ~5 hours

stumpyjoepete:

xhxhxhx:

I would appreciate it if some AI enthusiast would get mad at me

right now I’m objecting to a diffuse and incoherent set of fears about the future, but someone out there’s gotta have a theory of what mass technological unemployment actually looks like and a modestly granular account of the mechanisms by which artificial intelligence takes us there

I don’t have that model or that account, but y’all really seem to believe that the machines are right around the corner, so it’d be nice if someone laid it out somewhere

I wrote a lengthy harangue to @peopleneedaplacetogo about a week back, which appears here lightly edited:

If I were a smarter or better-informed person, would I feel differently about the intelligence explosion thesis? What do its better-informed advocates know that I don’t? What intuitions do they have that I lack? 

I guess you’d have to know what I believe before you could tell me why I’m wrong, but as a person who’s much closer to the technology than I am, what are the sources of the rationalist belief in artificial intelligence more generally?

Because, from the outside, with the little understanding of the technology that I have, it seems like intelligence is harder and progress more limited than the boosters are telling me. 

From the outside, throwing more processing power at the problem doesn’t seem to address the lack of sound concepts underpinning general machine intelligence, as opposed to specific intelligence.

The ‘machine learning’ we have, where we train algorithms on large data sets to sort the data and identify the patterns, is impressive, sure, but the strengths and limitations of ML suggest that we need more and more innovative conceptualizations and operationalizations of the problems we want the machines to address before we can apply machine power to any effect.

I apologize for my technological illiteracy; I’m sure I’m missing something crucial. I guess I just don’t have a good sense of what the conceptual paradigm for general intelligence would look like – “ML applied to the conceptualization of problems in the world”?

To which he replied:

I don’t have any specific knowledge of the topic either. I think a big intuition is just “don’t make strong predictions about what AI can or can’t do”.

What am I missing?

I’m very skeptical of an “intelligence explosion”, especially in the near future. On the other hand, there are plenty of people’s current jobs that are probably going to be automated soon.

Some random, unsourced ideas for tasks to be automated: reading those radiology images to diagnose stuff (image classification is getting real good), web design and other building-formulaic-websites-for-small-business (as good looking website-in-a-box services grow), certain types of translation (as machine translation allows fewer and less skilled translators to blow through many more documents), a lot of document related Charlie-work that is currently done by people in law firms (how has this not already happened?), and all of the tasks that pharmacists do that are visible from a consumer’s perspective (although legal barriers are probably much harder to overcome than technical ones).

I won’t speculate on the details of the broader technological unemployment story, but I do want to point out that the jobs being replaced are “knowledge worker” jobs, which are exactly the thing always being pitched as a replacement for the disappearing manufacturing jobs. As far as I can tell, the main jobs that are safe are on-site skilled labor (e.g., plumbers) and unskilled, low paying service jobs. This doesn’t seem like it will be that great for labor.

I am generally a skeptic of near-future AI worries, and particularly those based on recent successes in machine learning.  However,

(1) I second what @stumpyjoepete said.  Even the limited ML stuff we have now is enough to do some significant automation.  The automatic radiology thing is really happening, and I’m very curious how it will play out.  (Radiologists make, and thus have, a lot of money, and presumably they won’t go down without a political fight.  There are going to be big battles over licensing for medical software.)

(2) There is a real open question as to how far our current approaches to ML will scale up.  I am on the side that says they will not scale up “all the way,” and that many more theoretical advances will be necessary if we want to reproduce facets of human intelligence like language and abstract generalization.  But I don’t think this is an obvious or foregone conclusion.  Everyone’s looking at the current state of the field and trying to read it like tea leaves; no one has the information necessary to be confident.

I recently gushed about the 2016 academic book “Cerebral Cortex: Principles of Operation.”  One of the exciting/spooky things about reading this book, as someone who has worked with ML, is that the book describes much of the structure and function of the cortex in terms that are intelligible in an engineering sense to someone used to artificial neural networks, even though the latter are usually said to be biologically unrealistic.  I say “in an engineering sense” because, while we’ve known the relevant facts for a long time, we have just now come to the point where people have experience doing practical engineering with similar structures.  So when we talk about the cortex having sequentially connected layers, or about its excitatory cells having local connectivity, I can look at that and say “oh, yeah, I’ve built things like that before, and I know why I did so.”

I recently watched a lecture on the immune system by a doctor, in which he casually described some of its mechanisms with terms like “this is a fail-safe.”  He was taking a perspective where, above and beyond merely describing mechanisms, you can look at the mechanisms and say “oh yeah, I can imagine building that.”

If one runs with this line of thought, one could imagine that the cortex (and probably the whole brain) is essentially a very complicated assemblage of basic building blocks which, individually, we are already using in engineering.  (This Paul Christiano post describes one version of this perspective.)  While it may or may not be feasible to make something with the same complexity and functionality in silico, this would be simply a technical challenge, not a theoretical one.

There are reasons to doubt this perspective.  The engineering ideas we are using were themselves inspired (though loosely and cartoonishly) by the brain, so perhaps we have simply taught ourselves to ignore all aspects of brain function that don’t look like the broad cartoon we have taken from neuroscience and used for engineering.  People will point to the fast rate at which the field of artificial neural nets is advancing, but it’s risky to extrapolate such rates.  (Theoretical physics moved similarly fast in the 20th century, going from old-school QM to QFT, explaining the weak and strong and electromagnetic forces, and it must have really looked like it was going to explain everything until it stalled out in the 70s because quantum gravity turns out to be way harder than everything else.)

But the perspective could be true.  It is defensible.

nostalgebraist:

voxette-vk replied to your post

Surely the generalization to other markets is “being good at satisfying demand”?

Ohhh, duh.  I am dumb :P  (Thanks to @mbwheats for also pointing this out)

I have to be somewhere soon so I shouldn’t write too much, but yes – this is a real and important tradeoff.  @furioustimemachinebarbarian said something good about this in this reblog, in that they framed it explicitly as a tradeoff

If you want the capitalist mode of production to work, people need to be able to reap returns from their activities that they can reinvest in capital.  But capital investment is just another element of the bundle of goods someone buys, so my argument as stated ought to apply to it as much as to anything else.  So my argument, as stated, was too broad.

I hope it was clear that my argument, as stated, was trying to establish the existence of a particular mechanism rather than provide a proposal.  I don’t actually want everyone’s wealth to be literally the same at all times (trying to cause this would break all sorts of other things too, I’d expect).  Rather, the point was that when the “initial endowments” are closer to equal, supply and demand (which I called “markets,” and which are a distinct desideratum from “capitalism”) work better.

Distinguishing capitalism from supply and demand is important.  I should have done it more clearly in the OP, but I am also not sure @neoliberalism-nightly was doing it sufficiently in their ask – as far as I can tell prediction markets are supposed to work because of supply and demand, even without capitalism (which is not yet having a non-negligible internal effect in them).

I’m no longer in a hurry, so let me expand on this a bit.

To be completely precise, the target of my post was the tradition in economics of distinguishing “efficiency” from “distribution.”  This distinction encourages economists to treat distribution (i.e. wealth [in]equality) as an outside concern that can be ignored when considering the market mechanism as a system.

The attitude is that the market “works” (in some “efficiency” sense) no matter what is going on with distribution, and insofar as we care about distribution, this is a separate value which we will in general have to trade off against “efficiency” / “the market working.”  (Although it may be possible in principle to alter distribution without introducing market distortions, it is not generally possible in near-term political practice.)

This story is internally consistent if you define “efficiency” in the usual way, which is Pareto optimality.  We know thanks to Arrow and Debreu (et al.) that under some idealized assumptions, supply and demand will get us to a Pareto optimal outcome (First Theorem of Welfare Economics), and this is frequently viewed (see e.g. Stiglitz here) as a successful formalization of the views popularly associated with Adam Smith.  Even work that is critical of the invisible hand, such as Stiglitz’s, has tended to concede Pareto optimality as the correct formal desideratum, arguing only that markets do not achieve it in practice as much as the First Theorem would lead one to think.

By contrast, my position is that Pareto optimality does not capture the good things we wanted out of the invisible hand in the first place.  I first started thinking about this stuff after reading Brad deLong’s very entertaining post “A Non-Socratic Dialogue on Social Welfare Functions,” which I recommend reading.  (I am largely just repeating deLong here, and less stylishly at that.)

As in the OP, I think what we want out of the invisible hand is (at least) a market that “gives the people what they want” in some intuitively recognizable sense.

A Pareto optimal outcome is defined to be an outcome in which no one can be made better off without making anyone else worse off.  The phrase “can be made” should be interpreted as “by physically achievable means,” like transferring goods from one person to another.  That sounds obvious, but has significant implications.

The richer you are, the less marginal utility you will get (on average) from goods you acquire.  This is implicit in standard economic assumptions, to the extent that you cannot deny it without being very heterodox at best, and talking nonsense at worst.  (You can get it from the usual assumption of convex preferences, plus the idea that individuals have utility functions, since convex preferences correspond to [quasi-]concave utility functions.  Or, if you like, you can get concave utility functions from the assumption of loss aversion, without which finance makes no sense whatsoever.)

In practice, if people do deny it, they tend to do it by rejecting the utility concept as a whole (as the Austrians do).  But without some way to do interpersonal utility comparisons, I’m not sure how you can even state the invisible hand idea.  (How can individual self-interest serve the common good if there is no valid concept of “the common good”?)

OK, enough sidenotes.  As I said, the richer you are, the less marginal utility you will get (on average) from goods you acquire.  Thus, when there are large wealth inequalities, Pareto optimality is compatible with large sub-optimalities in sum-aggregated utility, in that it allows transfers (from rich to poor) which would increase summed utility a lot.  The bigger and more widespread the inequalities, the more sub-optimality we can have (in this sense) even if everything is still Pareto optimal.
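A two-person toy with log utility (concave, so diminishing marginal utility is built in) makes the point numeric. The wealth levels and transfer size here are arbitrary choices of my own:

```python
import math

def utility(wealth):
    return math.log(wealth)  # concave: each extra dollar is worth less

rich, poor = 100.0, 10.0
before = utility(rich) + utility(poor)

transfer = 10.0  # move $10 from the rich person to the poor person
after = utility(rich - transfer) + utility(poor + transfer)

summed_utility_rises = after > before                       # yes
rich_is_worse_off = utility(rich - transfer) < utility(rich)  # also yes
```

The transfer raises summed utility, but it is not a Pareto improvement, since the rich person loses; so the pre-transfer allocation can be Pareto optimal while leaving large gains in summed utility on the table.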

There are much more rhetorically forceful ways to put this.  deLong puts it this way: if we say that the market’s desirable property is its tendency to produce Pareto optima, we are saying it optimizes a certain social welfare function, and if this function is a weighted sum of individual utilities, then it gives rich people bigger weights than poor people.  (He derives this formally here.)

In other words, by saying “we will consider efficiency first and worry about distribution later,” and defining efficiency as Pareto optimality, we are implicitly saying that what we really ask the market to do is “give the people what they want, weighted by wealth.”  This is pretty clearly not what we originally wanted out of the invisible hand, and not something that one would ever come up with as a natural desideratum.  If the First Theorem vindicates the invisible hand, it is only by moving the goalposts.

Another way of putting it is that, by over-valuing the utility of the wealthy, the Pareto optimality desideratum treats the wealthy as utility monsters.