
A little while ago, I was talking to someone about AI futurism stuff, and it seemed like we disagreed about how to interpret fast progress in deep learning.  The other person thought that since deep learning has been moving fast, it’s plausible that it will continue to move fast, and so [some challenging problem] is likely to be solved soon, even if it looks hard to us.  (Because similarly hard-looking problems have been overcome in quick succession in the recent past – that’s why we say the field is moving fast.)

I was wary of this, in part because I wasn’t sure that “many challenges overcome in a short time” actually meant the field was moving fast.  Even if discoveries were just happening at some constant rate, we’d see some “clusters” like that.  This is the sort of possibility that one should always keep in mind explicitly, because our brains seem bad at accounting for it (the “clustering illusion”).

In other words, I had a “null model” in mind that was just a Poisson process.  And I wanted to know whether the appearance of clustering (“the pace is fast now”) could just be explained away by this null model.
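
To make the null model concrete, here’s a minimal simulation sketch (the rate, window size, and random seed are arbitrary choices of mine, not anything from the conversation).  Even with a perfectly constant discovery rate, some windows look like a burst of progress and others like a lull:

```python
import numpy as np

rng = np.random.default_rng(0)

# Null model: "discoveries" arrive as a homogeneous Poisson process,
# here at an average rate of one every two years, over a 50-year span.
rate_per_year = 0.5
years = 50
n_events = rng.poisson(rate_per_year * years)
event_times = np.sort(rng.uniform(0, years, size=n_events))

# Count discoveries in consecutive 5-year windows.
window = 5
counts, _ = np.histogram(event_times, bins=np.arange(0, years + window, window))
print(counts)  # typically a mix of "slow" windows (0-1 events) and "fast" ones (4+),
               # even though the underlying rate never changes
```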

This seems like the sort of thing that would have been studied, right?  I’ve seen people ask this question in other places: Lewis Fry Richardson studying a data set on war and peace and finding no more clumpiness than a Poisson process (see this fascinating article); R. D. Clarke showing the same about WWII German bomb strikes on London; Shalizi and others on the clumpiness of British novel genres.  And the nature of scientific progress is a really important thing, so surely someone must have asked the same question about scientific advances?

Yet I couldn’t find anything on Google Scholar.  Everything I could find was by (or about) one researcher, who mostly studied the rate of discoveries by an individual across their lifespan, rather than the rate of discoveries by a field.  Anyone know of sources on this?

The zeitgeist of science and engineering in the twenty-first century is the integration of disciplines - that is, the bridging of the gaps between the formerly fragmented and distinct scientific disciplines, and the grappling with the many remaining grand challenge problems that lie at their intersection. There is thus an emerging need for educational institutions to distill and relate these scientific disciplines for the new generation of scientists who will ultimately accomplish their seamless integration. Towards this end, Professor Thomas Bewley has written Numerical Renaissance, which aims to provide a systematic, integrated, succinct presentation of efficient techniques for solving a wide range of practical problems on modern digital computers.

Ooh look at this cool numerics textbook with a publicly available draft version (PDF)

(N.B. I’ve looked over it for all of 30 seconds, could be terrible for all I know; just sounds like an unusually comprehensive and unified introduction to the subject)

bayes: a kinda-sorta masterpost

raginrayguns:

nostalgebraist:

I have written many many words about “Bayesianism” in this space over the years, but the closest thing to a comprehensive “my position on Bayes” post to date is this one from three years ago, which I wrote when I was much newer to this stuff.  People sometimes link that post or ask me about it, which almost never happens with my other Bayes posts.  So I figure I should write a more up-to-date “position post.”

I will try to make this at least kind of comprehensive, but I will omit many details and sometimes state conclusions without the corresponding arguments.  Feel free to ask me if you want to hear more about something.

I ended up including a whole lot of preparatory exposition here – the main critiques start in section 6, although there are various critical remarks earlier.

Keep reading

10. It’s just regularization, dude

(N.B. the below is hand-wavey and not quite formally correct, I just want to get the intuition across)

My favorite way of thinking about statistics is the one they teach you in machine learning.

You’ve got data.  You’ve got an “algorithm,” which takes in data on one end, and spits out a model on the other.  You want your algorithm to spit out a model that can predict new data, data you didn’t put in.

The error you make when predicting new data can be formally decomposed into two parts, “bias” and “variance.”  If your algorithm is biased, that means it tends to make models that do a certain thing no matter what the data does.  Like, if your algorithm is linear regression, it’ll make a model that’s linear, whether the data is linear or not.  It has a bias.

“Variance” is the sensitivity of the model to fluctuations in the data.  Any data set is gonna have some noise along with the signal.  If your algorithm can come up with really complicated models, then it can fit whatever weird nonlinear things the signal is doing (low bias), but also will tend to misperceive the noise as signal.  So you’ll get a model exquisitely well-fitted to the subtle undulations of your dataset (which were due to random noise) and it’ll suck at prediction.

There is a famous “tradeoff” between bias and variance, because the more complicated you let your models get, the more freedom they have to fit the noise.  But reality is complicated, so you don’t want to just restrict yourself to something super simple like linear models.  What do you do?

A typical answer is “regularization,” which starts out with an algorithm that can produce really complex models, and then adds in a penalty for complexity alongside the usual penalty for bad data fits.  So your algorithm “spends points” like an RPG character: if adding complexity helps fit the data, it can afford to spend some complexity points on it, but otherwise it’ll default to the less complex one.
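
As a concrete sketch of the “spending points” picture (a toy example I’m adding, not something from the post): the objective below is ordinary least squares plus an L2 charge for coefficient size, and turning the charge up shrinks the high-order polynomial coefficients unless the data really pays for them.

```python
import numpy as np

def fit_penalized(X, y, lam):
    """Minimize ||X w - y||^2 + lam * ||w||^2: data-fit penalty plus complexity penalty."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Toy data: a noisy linear signal, fit with a degree-9 polynomial.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2 * x + rng.normal(scale=0.3, size=x.size)
X = np.vander(x, N=10, increasing=True)  # columns 1, x, x^2, ..., x^9

w_loose = fit_penalized(X, y, lam=1e-9)  # essentially unregularized: free to chase the noise
w_tight = fit_penalized(X, y, lam=1.0)   # complexity is expensive: coefficients stay small

print(np.round(np.abs(w_loose), 1))  # usually much larger coefficients that trace the noise
print(np.round(np.abs(w_tight), 1))  # small coefficients, a much smoother fit
```

How hard to charge for complexity (the value of lam) is usually chosen by seeing what predicts held-out data best, e.g. cross-validation, which is relevant to the discussion below.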

This point has been made by many people, but Shalizi made it well in the very same post I linked earlier: Bayesian conditionalization is formally identical to a regularized version of maximum likelihood inference, where the prior is the regularizing part.  That is, rather than just choosing the hypothesis that best fits the data, full stop, you mix together “how well does this fit the data” with “how much did I believe this before.”

But hardly anyone has strong beliefs about models before they even see the data.  Like, before I show you the data, what is your “degree of belief” that a regression coefficient will be between 1 and 1.5?  What does that even mean?

Eliezer Yudkowsky, strong Bayesian extraordinaire, spins this correspondence as a win for Bayesianism:

So you want to use a linear regression, instead of doing Bayesian updates?  But look to the underlying structure of the linear regression, and you see that it corresponds to picking the best point estimate given a Gaussian likelihood function and a uniform prior over the parameters.

You want to use a regularized linear regression, because that works better in practice?  Well, that corresponds (says the Bayesian) to having a Gaussian prior over the weights.

But think about it.  In the bias/variance picture, L2 regularization (what he’s referring to) is used because it penalizes variance; we can figure out the right strength of regularization (i.e. the variance of the Gaussian prior) by seeing what works best in practice.  This is a concrete, grounded, practical story that actually explains why we are doing the thing.  In the Bayesian story, we supposedly have beliefs about our regression coefficients which are represented by a Gaussian.  What sort of person thinks “oh yeah, my beliefs about these coefficients correspond to a Gaussian with variance 2.5”?  And what if I do cross-validation, like I always do, and find that variance 200 works better for the problem?  Was the other person *wrong*?  But how could they have known?

It gets worse.  Sometimes you don’t do L2 regularization.  Sometimes you do L1 regularization, because (talking in real-world terms) you want sparse coefficients.  In Bayes land, this

can be interpreted as a Bayesian posterior mode estimate when the regression parameters have independent Laplace (i.e., double-exponential) priors

Even ignoring the mode vs. mean issue, I have never met anyone who could tell whether their beliefs were normally distributed vs. Laplace distributed.  Have you?
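
For reference, here is the correspondence this section keeps invoking, written out as a sketch (it involves the posterior mode, i.e. MAP estimation, which is the “mode vs. mean issue” mentioned above):

```latex
\hat{w}_{\mathrm{MAP}}
  = \arg\max_{w}\; p(w \mid \mathcal{D})
  = \arg\min_{w}\; \left[ -\log p(\mathcal{D} \mid w) \;-\; \log p(w) \right]
```

With a Gaussian prior proportional to exp(-||w||² / 2σ²), the -log p(w) term is an L2 penalty with strength 1/(2σ²); with a Laplace prior proportional to exp(-||w||₁ / b), it is an L1 penalty with strength 1/b.  So the regularization strength you’d tune by cross-validation and the prior width you supposedly believed in beforehand are the same knob under different names.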

tl;dr: Regularization is not the point of the prior. Even when we’re not regularizing, the prior is an indispensable part of useful machinery for producing “hedged” estimates, which are good in all plausible worlds.

OK, here’s the whole post.

The quoted section is about whether Bayesians can explain regularization. We know regularization helps, and we’re going to do it in any case, but Bayesians purport to explain why and when it helps. See, for example, the above @yudkowsky quote, as well as this one:

Eliezer_Yudkowsky:

The point of Bayesianism isn’t that there’s a toolbox of known algorithms like max-entropy methods which are supposed to work for everything. The point of Bayesianism is to provide a coherent background epistemology which underlies everything; when a frequentist algorithm works, there’s supposed to be a Bayesian explanation of why it works. I have said this before many times but it seems to be a “resistant concept” which simply cannot sink in for many people.

nostalgebraist is making Yudkowsky very happy in his post, by arguing with his actual belief in the status of Bayesianism as a background epistemology. nostalgebraist’s point is that Bayesianism doesn’t explain why or how we regularize, and more generally that we shouldn’t try to judge inferential methods by how Bayesian they are. nostalgebraist is summarizing this as “Bayesianism is just regularization,” which is a not entirely serious inversion of a common Bayesian position, that “regularization is just Bayesian statistics.”

I disagree with nostalgebraist about all this, and I’m going to write a post about why, maybe next week. This current post, which will be quite long, is absolutely not about the issue of whether Bayesianism explains regularization. I start by describing this issue just to show that I understand the real point of the OP, and that I am being quite deliberate when I completely ignore it in the following.

What I want to focus on is nostalgebraist’s half-joking statement that Bayesian inference is just regularization. While he’s not being entirely serious, he may be partly serious, and in any case it’s what a lot of people actually believe. For example, in replies framed as defenses of the Bayesian framework, @4point2kelvin writes “You can definitely think of anything Bayesian as ‘maximum likelihood with a prior.’ But even though the prior has to be (somewhat) arbitrary when the hypothesis-space is infinite, I still think it’s useful.” Plus, once I’ve shown Bayes isn’t just regularization, then I get to say what else it is.

I’m going to start with some technicalities, focusing on the mode vs mean issue nostalgebraist alluded to. Then I’m going to show an example where Bayesian estimation improves on maximum likelihood, without any of the increase in bias that Shalizi suggests is necessary, and explain what’s going on.

Keep reading

Reblogging because this is good and I want to have it on my blog + remind myself to read it more closely so I can actually say something about the issues it raises

(via raginrayguns)

@argumate has been talking recently about hypothetical problems with ancap/libertarian-paradise-world, and it’s making me think about a very basic issue that I don’t see talked about enough.

Namely: all of the usual arguments about how markets are great (“aggregating information” and all that stuff) also say that wealth inequality makes markets worse at doing those things.  This is not a knock-down argument for having a state that redistributes wealth at gunpoint, but it is a reason to see wealth equality as a relevant concern even if you don’t have it as a terminal value.  Even if you don’t care about it, the market needs it to achieve the things you do care about.


I’ll generalize this in a moment, but first, let’s look at an especially clean example case: prediction markets.  Prediction markets are nice here because we don’t have to worry about thorny problems about aggregating utility to construct a “social welfare function.”  There isn’t a role for disagreements about values.  Everyone agrees about what we want out of a prediction market.  (Or rather, the disagreements that exist are technical rather than ethical.)

What we want out of prediction markets is a price that corresponds to the actual observed frequency of events.  Of course, this is not always possible – sometimes there is relevant information that no one in the market knows, so even a perfect information-aggregator (say, a rational being that knows everything that anyone in the market knows) would not get the right answer.

So at best, we can only ask for some sort of information-aggregating property, something like “prices reflect the average (i.e. mean) belief.”  This is desirable because we expect many sources of individual error to be uncorrelated, and these will wash out when we take the average.

But the prices in prediction markets reflect, at best, a wealth-weighted average of beliefs.  (For “wealth” here, read “quantity of money the individual is willing to invest in this market,” which is obviously constrained by wealth in a straightforward way.)  This is easy to see informally: if there are 1000 people who are only willing to buy $1 worth of shares each, and 1 person willing to buy $1000 worth of shares, the market mechanism will get an equally large signal from the one big spender as from the 1000 small spenders.

A formal version of this is derived in this paper: with logarithmic utility, prices equal the wealth-weighted mean of beliefs.  (If you’re worried about the log utility assumption, note that this is arguably the most favorable possible result for prediction markets, and much of that paper is dedicated to showing that other plausible utility functions do not yield very large deviations from it.)
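
In symbols (my notation, glossing over the paper’s setup): if trader i brings wealth w_i and belief π_i about the event, the log-utility result has the form

```latex
p \;=\; \frac{\sum_i w_i \, \pi_i}{\sum_i w_i}
```

i.e. exactly a wealth-weighted mean of beliefs; in the “1 vs. 1000” example above, the single big spender supplies half of the total weight.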

Is it a problem that the results are wealth-weighted?  Well, not necessarily.  But it’s important to note that there are two different reasons it might be a problem.

First, assume (as in the paper) that we’re in the “many traders” limit, so there is a continuous distribution of beliefs, we have integrals rather than sums, etc.  In this case, what matters is the (Pearson) correlation of belief and wealth.  (If they are uncorrelated, the wealth-weighting will be invisible.)  This correlation will either help or hurt depending on whether the bigger spenders have more accurate beliefs in any given case; it seems hard to argue that they’ll have less accurate beliefs in general, which makes this concern easy to dismiss.

But second, suppose we are not in the “many traders” limit.  The worry with finitely many traders is a situation like the “1 vs. 1000” example mentioned earlier, where the intuition that we are getting an average becomes misleading because the prices are so heavily affected by a small number of people.

Recall that the whole reason we’re interested in getting the average belief is that we expect uncorrelated errors to wash out if we average over a large number of people.  In situations like the “1 vs. 1000” example, the inequality is making the effective population size smaller, i.e. making our law-of-large-numbers argument weaker.  From basic statistics, we’d expect the uncorrelated errors to get smaller by a factor of sqrt(N) when we average over N people.  That corresponds to the errors getting about 32 times smaller for N = 1001.  But in the 1 vs. 1000 case, half of the answer comes from the belief held by the single big spender, which (by hypothesis) carries random errors of the same size as everyone else’s, so the error is only cut down by (approximately) a factor of 2, not 32.
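
Here’s a quick simulation sketch of that arithmetic (the true probability, noise level, and number of trials are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
true_prob, noise_sd, n_trials = 0.6, 0.1, 2000

# Each of 1001 traders holds the true probability plus independent noise of the same size.
beliefs = true_prob + rng.normal(scale=noise_sd, size=(n_trials, 1001))

# Case 1: everyone stakes $1, so the price is a plain average of 1001 beliefs.
equal_weighted = beliefs.mean(axis=1)

# Case 2: trader 0 stakes $1000 and the other 1000 stake $1 each.
weights = np.ones(1001)
weights[0] = 1000.0
wealth_weighted = beliefs @ weights / weights.sum()

print(noise_sd / equal_weighted.std())   # ~31.6: errors shrink by about sqrt(1001)
print(noise_sd / wealth_weighted.std())  # ~2.0: half the weight on one trader undoes most of the averaging
```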


Now let’s extend this to more general markets.

This case is harder, because we don’t have an analogous law-of-large-numbers argument for the claim that the price should reflect an unweighted population average.  To argue for that sort of claim in general, we must (horror of horrors!) introduce some sort of ethical assumption, say about no one being inherently more important than anyone else.

I was being facetious in the last sentence when I said “horror of horrors,” but there are real difficulties here.  The problem is not that some people might really be inherently more important than others, but that we are trying to do some sort of utility aggregation, and this is a famously thorny area.  So it may help to be more concrete.

The basic intuitive appeal of “invisible hand” type ideas is that the market will learn to provide what people want.  The phrase “what people want” has the same thorny issue just mentioned – how do we translate statements about what individual people want into a general statement about “what people want,” so that we can judge whether it is being provided (relatively well or poorly)?

The core of the idea is nonetheless pretty clear.  If a bunch of people want something, but not enough to buy it at the prevailing market price, someone will see the opportunity to make a profit by selling it at a lower price that these people will take.  After they take that opportunity, everyone else who produces the product will notice and lower their prices, and (after some equilibration) the market price will be low enough that people get the thing they want.  Likewise, if there is more demand for something than the low market price suggests, everyone will buy until there’s none of it left, at which point the suppliers will produce more because they can afford to do so by charging a higher price (assuming that supply curves slope upwards, which is not obvious and which I’ve heard is not always true IRL, but let’s grant it).  If you don’t allow these things to happen, you get Soviet bread lines and shortages of rent controlled housing.  Or so the argument goes.

OK, so here’s a brain-teaser for you: how much are homeless people willing to pay for housing?


Although there may be some exceptions (crust punks?), people do not generally become homeless because they simply value having a roof over their heads less than the average person does.  Many homeless people would be perfectly happy to pay the market price for housing if they could.  They just don’t have the money to.

In other words, the signal received by the market isn’t “preferences,” it’s “willingness to (actually) pay.”  It’s startling how rarely I see the distinction made between “willingness to pay” and “ability/capacity to pay”; in the academic literature it seems to be mainly made by economists interested in healthcare.  (See e.g. this paper, which presents the distinction as a novel modeling contribution, and has gotten only 2 citations since it was published in 2008, and this one, 3 citations since 2006.  If I am missing some large body of research here, let me know.)

Talking about this presents some technical difficulties, since there is no well-defined concept of “what someone would pay if they didn’t have to worry about their budget.”  For instance, what one is willing to pay in principle for vital necessities will scale up with budget in an unbounded fashion: I’m sure you could get Bill Gates to pay billions for a loaf of bread if the alternative was starvation, but this does not mean that a loaf of bread is “really” worth billions, and in fact does not mean much at all.  Even for non-essential goods, things can be pretty elastic, since many goods that are provably non-essential for human satisfaction can nonetheless feel effectively essential once one has satiated to them.  (You could extract a lot of my money by threatening to separate me from my internet connection, for instance.)

But it’s not as if spending patterns are unrelated to preferences.  If you give someone any fixed budget, they will buy some bundle of goods with it (for simplicity, you can view savings as just another good people may buy, so that everyone always “spends” their whole budget).  To determine someone’s preferences, give them a series of decreasing budgets, and watch which goods they are priced out of first and which they hold onto until the very end.  (If two people have different preferences, one person will buy more of some good than the other given a fixed budget of sufficient size, and as we decrease the budget, there will be some level at which one person is still buying some of it and the other isn’t.)

Thus, the market receives a signal about “what the people want” in the following form: for each good, it observes the extent to which people have been priced out of buying that good by their budgets.

To clarify what this means, consider an example.  Suppose that everyone has the same budget.  Their spending patterns will vary, because preferences vary, but there will be trends.  For instance, there are some goods that almost everyone will be willing to pay you money for if they don’t have it (food, housing), and some goods that many people will happily do without.  Demand curves will be generated by people successively pricing themselves out (in) in response to price increases (decreases).  Few people will ever be willing to price themselves out of food or housing, so these goods will have nearly flat demand curves (low price elasticity of demand) with high intercepts, while goods that people will happily price themselves out of (yachts, tchotchkes) will have steep demand curves (high price elasticity of demand) with low intercepts.  If some good has a given supply curve, it will be produced in a large quantity if it is of the former type (food, housing), and in a small quantity if it is of the latter type (yachts, tchotchkes) – interestingly, this is true no matter which way the supply curve slopes.

Thus, in this hypothetical world, a lot of resources go into producing food (or more relevantly, distributing food), and not as much into manufacturing yachts.  Because people – you, me, even Bill Gates – value food more than yachts, and the market mechanism responds to preferences.  The invisible hand works!  Chew on that, socialist planners!

But in our world, many resources are allocated to the production of bizarre luxury goods while billions go hungry.  Is this because “the people” want the former more than the latter?  Of course not.  No one wants the former more than the latter.  If you gave me the choice between food and my MacBook Air, I’d take the food, and so would you and Tim Cook and everyone else alive.

Why are resources misallocated in this way?  Because the starving have been priced out of food, while I have not been priced out of buying a MacBook Air, and the market only sees preferences in the form of the “what have people been priced out of” signal.

When people’s budgets are all the same (or similar), this signal results in production patterns that track people’s relative preferences about different goods.  When people’s budgets are wildly dissimilar, this does not occur.  The production patterns don’t even reflect rich people’s preferences, since they prefer essentials over luxuries just like everyone else.  (It satisfies rich people’s preferences, which is not the same thing as reflecting them.  Being rich means having the opportunity to buy things which have incredibly low, although still positive, marginal value to you.)

Does this mean we have to spread the wealth around at gunpoint?  Well, I don’t know.  We don’t need to do anything.  But the market cannot do its Adam Smithy magic if the wealth is very unevenly distributed.  Maybe you value not having a state more than you value the market doing its Adam Smithy magic!  But it is worth being clear that these values are in conflict.

selfreplicatingquinian:

nostalgebraist:

eka-mark:

The fact that fluid systems can behave in a large number of qualitatively different ways is reflected in the large number of dimensionless quantities used to characterize them.

And they’re all* named after people instead of what they mean

I’ve always wondered why this was the case, since it seems so bad for communication/understanding.  Maybe a trend got started and then kept going once fluid dynamicists noticed this was a promising way to make their mark on the field’s history?

*(almost)

Reynolds number is the only non-trivial quantity on that list I know of that provides a good heuristic picture of a system without context. Knowing the Reynolds number of your system and nothing else still tells me a *lot* about what you’re likely working on (e.g. biophysics vs. plasma vs. everyday hydrodynamics), or at least what it’s similar to.

They all seem to fall into two categories pretty cleanly to me: simple ratios named mostly for convenience (2 words instead of 5, each of the 20 times you’re talking about it in your paper), and those mostly applicable to narrow circumstances, whose names will only be used in high-context communications among those naturally selected to care about those narrow circumstances. I suspect that most of these don’t get used in ways that are overly likely to disrupt communication, except by those who are otherwise overusing jargon *anyway*. And there are even ones in the first group, like the Mach number, which are well known enough that they get used in common parlance to save a few syllables: “Mach 3” vs. “three times the speed of sound”.

I think it’s so common in part because fluid dynamics is one of the only fields complicated and general enough to have so many explanatory coefficients fall out of the equations. Now that I think on it, you can practically break physics down into “fields where fluid dynamics sometimes applies” and “mechanics of only solid objects” and not leave much out (the former includes E&M and Quantum in case that’s non-obvious). With the former being so broad it’s no wonder there’s so many named quantities, it has such a large portion of the quantities period.

FWIW, I did fluid mechanics in grad school, and I found the naming scheme frustrating then.  It wasn’t that I had trouble literally remembering which was which – the ones that were relevant to me became second nature because I was immersed (no pun) in the subject – but I still found that the uninformative names added a slight but significant overhead when thinking about the subject.

The best way to convey this might be to look at some more informative names, and imagine the alternative.  An ideal example of an informative name might be “limit” – more or less the colloquial word for the idea formalized by the analysis concept.  Imagine if we called limits something like “Bolzano numbers.”  We’d all get used to it, of course, but something would be lost.  A description of a theorem using the “limit” terminology has a certain intuitive transparency, which would require a mental translation step with the “Bolzano number” terminology (“and what that means is … ”).

Or imagine if we always called the “stress tensor” the “Cauchy tensor” instead.  “Stress tensor” reminds you that you’re talking about stress without you having to do anything; with “Cauchy tensor,” we’d need to make the connection to stress inside our heads every single time, and although this process would become fast over time it would still add a certain extra opacity to things (especially when thinking while tired, etc.)

Admittedly, I think I’m a relatively “verbal” thinker as people who do math go, so this may be more of an issue for me than for others.

I think your point about “narrow circumstances” is partially true, but because – as you note – fluid mechanics is so broad, “narrow circumstances” within it can still be quite broad in other senses.

I did a lot of geophysical fluid mechanics, where two body forces were very important (the Coriolis force and gravity).  So some of the numbers that came up everywhere were the Rossby number (roughly, how important Coriolis is), the Froude number (roughly, the analogue of the Mach number for gravity-based waves), and the Richardson number (roughly, how strongly gravity is suppressing shear instability).  All of these are “specialized” in the sense that there are many fluids/flows of interest where they won’t be relevant – but the ones where they are relevant are not some esoteric sub-category, they’re “the earth’s atmosphere and ocean,” which are a pretty big deal.  More precisely, these are numbers that come up constantly when you work in this subfield, not just when you are studying a particular phenomenon.

Something cool I found out about in that Agent Foundations conversation was this paper on the “speed prior,” which is like Solomonoff but with probabilities inversely proportional to the time it takes to compute things.  Does away with uncomputability issues, and you can get some “excellent” (the authors’ word) bounds for it.  (Don’t really feel qualified to evaluate the paper, plus I just haven’t looked at it in much detail, but it seems cool)

nostalgebraist:

@somervta helpfully linked my post about LIs on agentfoundations.org, MIRI’s forum for technical discussion, and it’s getting some comments over there – here’s the thread if you’re interested

(I didn’t link it myself because you can only register there via Facebook and I didn’t want to do that)

I clicked through with Facebook just to see what would happen, and instead of forcing me to go by my real name like I expected, it forced me to go by “240” (because I’m user number 240).  I was like, OK, fine, and I wrote a really long comment, only to find that it’s only visible when I’m logged in?  Anyway, here’s confirmation that “240” is really me in case anyone wanted it.

bayes: a kinda-sorta masterpost

raginrayguns:

@nostalgebraist:

5. Why is the Bayesian machinery supposed to be so great?

This still confuses me a little, years after I wrote that other post.  A funny thing about the Bayesian machinery is that it doesn’t get justified in concrete guarantees like “can unscrew these screws, can tolerate this much torque, won’t melt below this temperature.”  Instead, one hears two kinds of justifications:

(a) Formal arguments that if one has some of the machinery in place, one will be suboptimal unless one has the other parts too

(b) Demonstrations that on particular problems, the machinery does a slick job (easy to use, self-consistent, free of oddities, etc.) while the classical tools all fail somehow

E. T. Jaynes’ big book is full of type (b) stuff, mostly on physics and statistics problems that are well-defined and textbook-ish enough that one can straightforwardly “plug and chug” with the Bayesian machinery.  The problem with these demos, as arguments, is that they only show that the tool has some applications, not that it is the only tool you’ll ever need.

Examples of type (a) are Cox’s Theorem and Dutch Book arguments.  These all start with the hypotheses and logical relations already set up, and try to convince you that (say) if you have degrees of belief, they ought to conform to the logical relations.  This is something of a straw man argument, in that no one actually advocates using the rest of the setup but not imposing these relations.  (Although there are interesting ideas surprisingly close to that territory.)

The real competitors to Bayes (e.g. the classical toolbox) do not have the “hypothesis space + degrees of belief” setup at all, so these arguments cannot touch them.

Yeah, Jaynes starts with Cox’s theorem, which I think of as a sort of filter: you can drop a system through it and see where it gets stuck, and if it doesn’t get stuck and makes it all the way through, it’s probability theory. But he doesn’t really present any other systems that you can drop through the filter. He mostly criticizes orthodox statistics, which you can’t really drop through it at all.

When I first read Jaynes, the example I dropped through Cox’s theorem was fuzzy logic, defining Belief(A and B) = min(Belief(A), Belief(B)), and disjunction as the maximum. This gets stuck because you can hold Belief(A) constant and increase Belief(B) without necessarily increasing Belief(A and B). That’s not allowed. I was very impressed with Cox’s theorem for excluding this, since I hadn’t even noticed the property, and once it was brought to my attention it did seem unreasonable.
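
A two-line sketch of the property in question, just to make it concrete:

```python
# Fuzzy-logic conjunction: Belief(A and B) = min(Belief(A), Belief(B)).
def bel_and(bel_a, bel_b):
    return min(bel_a, bel_b)

# Hold Belief(A) at 0.3 and raise Belief(B) from 0.5 to 0.9: the conjunction doesn't move,
# which is exactly the "not allowed" behavior described above.
print(bel_and(0.3, 0.5))  # 0.3
print(bel_and(0.3, 0.9))  # 0.3
```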

It makes me wonder whether I would have been less impressed if I had started by using Dempster-Shafer theory as an example. Dempster-Shafer theory is the “interesting idea” that nostalgebraist linked to above. I’m writing this post to discuss it more thoroughly. tl;dr summary: Dempster-Shafer theory can be thought of as breaking the rule that there’s a “negation function” mapping Belief(A) to Belief(~A), and it makes you wonder why we really need such a function.

So, as everyone in the internet Bayesianism discourse knows, Dempster-Shafer theory gives every proposition two numbers. These are the belief, Bel(A), and the plausibility, Plaus(A). Belief is the degree to which A is supported by the evidence, and plausibility is the degree to which A is allowed by the evidence. Plausibility is always at least as high as belief.

As few discoursers seem to realize, Plaus(A) is just 1-Bel(~A), so in a sense Bel is all you need. It’s interesting, then, to drop Bel through Cox’s theorem, and see where it gets stuck.
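
To make the two numbers concrete, here’s a minimal sketch of how they come out of a mass function (the two-element frame and the particular numbers are mine, chosen to reproduce the fraud example in the dialogue below):

```python
# Dempster-Shafer in miniature: mass is assigned to *sets* of possibilities,
# and mass on the whole frame represents uncommitted belief.
mass = {
    frozenset({"fraud"}): 0.5,
    frozenset({"ok"}): 0.2,
    frozenset({"fraud", "ok"}): 0.3,  # evidence that doesn't discriminate
}

def bel(event):
    """Belief: total mass committed to subsets of the event."""
    return sum(m for s, m in mass.items() if s <= event)

def plaus(event):
    """Plausibility: total mass not committed against the event; Plaus(A) = 1 - Bel(not A)."""
    return sum(m for s, m in mass.items() if s & event)

fraud, ok = frozenset({"fraud"}), frozenset({"ok"})
print(round(bel(fraud), 3), round(bel(ok), 3))  # 0.5 0.2 -- they don't sum to 1, and neither determines the other
print(round(plaus(fraud), 3))                   # 0.8 == 1 - bel(ok)
```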

And the first place I notice is at the following desideratum in Cox’s theorem:

There exists a function S such that, for all A, Bel(~A) = S(Bel(A)).

Bel breaks this rule, which supposedly rules it out as a quantification of confidence. But how bad is that, really?

Suppose I’m happily using Dempster-Shafer theory for, I don’t know, assessment of fraud risk, when strawman!Cox bursts into my office, and declares “I’ve come to save you from your irrational degrees of belief!”

As the perfectly reasonable foil to this hysterical and unreasonable strawman, I reply in a tone of pure, innocent curiosity: “What do you mean? I’d love any opportunity to improve my fraud detection.”

“Well,” Cox begins, filliping a coin and covering it, “your Bel(Heads)=0.5, and your Bel(~Heads)=0.5, right?”

“Certainly,” I reply.

“And this case you’re reviewing, Bel(Fraud) = 0.5, correct?”

“Absolutely.”

“And your Bel(~Fraud)?”

“0.2.”

“That’s irrational!” he shrieks, throwing his hands in the air and revealing that the coin was a heads. “Let S be the function that maps from Bel(A) to Bel(~A). What’s S(0.5)? Is it 0.5, or 0.2?” He puts his hands on my desk, leans forward, and demands, “Which is it?”

“There is no such function,” I reply. “Why should there be?”

So, what can Cox do to convince me my assignments are irrational? Or that my fraud detection would be more efficient if there existed this negation function S?

So, that’s where I end up when I drop Dempster-Shafer Bel through Cox’s theorem, and this time I don’t feel I’ve revealed any flaw in the system.

Shafer himself says the same thing, actually:

Glenn Shafer:

Most of my own scholarly work has been devoted to representations of uncertainty that depart from the standard probability calculus, beginning with my work on belief functions in the 1970s and 1980s and continuing with my work on causality in the 1990s [18] and my current work with Vladimir Vovk on game-theoretic probability ([19], www.probabilityandfinance.com). I undertook all of this work after a careful reading, as a graduate student in the early 1970s, of Cox’s paper and book. His axioms did not dissuade me. As Van Horn notes, with a quote from my 1976 book [17], I am not on board even with Cox’s implicit assumption that reasonable expectation can normally be expressed as a single number. I should add that I am also unpersuaded by Cox’s two explicit axioms. Here they are in Cox’s own notation:

1. The likelihood ∼b|a is determined in some way by the likelihood b|a: ∼b|a = S(b|a), where S is some function of one variable.

2. The likelihood c·b|a is determined in some way by the two likelihoods b|a and c|b·a: c·b|a = F(c|b·a, b|a), where F is some function of two variables.

I have never been able to appreciate the normative claims made for these axioms. They are abstractions from the usual rules of the probability calculus, which I do understand. But when I try to isolate them from that calculus and persuade myself that they are self-evident in their own terms, I draw a blank. They are too abstract—too distant from specific problems or procedures—to be self-evident to my mind.

Shafer goes on to quote and respond to Cox’s argument that there should exist F, but since I’m talking about S, I’m gonna look up how Jaynes argued for it.

ET Jaynes:

Since the propositions now being considered are of the Aristotelian logical type which must always be either true or false, the logical product AA̅ is always false, the logical sum A+A̅ always true. The plausibility that A is false must depend in some way on the plausibility that it is true. If we define u ≣ w(A|B), v ≣ w(A̅|B), there must exist some functional relation

v = S(u)

And that’s it. To explain the notation: w is the function that is eventually shown to have a correspondence with a probability mass function, the overbar means “not”, and logical “products” and “sums” are conjunctions and disjunctions, respectively.

So, why must there exist this functional relation? Perhaps instead, the belief in A could change without altering the belief in ~A? That can happen in Dempster-Shafer I think, and it does seem kind of crazy. But even disallowing that, and allowing that there must be a function between belief in A and ~A, is it really the same function for every A? Why should it be?

Anyway, yeah. So, idk if I’d say, like nostalgebraist does, that Dempster-Shafer theory is surprisingly close to having the hypothesis space + beliefs setup but without the same constraints. I’d say instead that it’s exactly that. But I’m not totally sure since I’ve only read the basics and maybe things change in more complex applications.

Good stuff!!

To be completely honest, when I was writing that part you quoted, I was like “oh shit wait, D-S does have the same setup, so how does it get around the Cox and Dutch Book type stuff, or maybe it doesn’t? um….” and then in the interests of getting on with the rest of the post, I just hedged by being vague (“surprisingly close to that territory”)

So thanks for answering the question I was curious about but had to ignore.

I started wondering about the equivalent of the above in the measure-theoretic picture (i.e. why D-S doesn’t define a probability measure).  If you translate “logical negation” to “set complement” like usual, then Bel violates additivity: A and ~A are disjoint, and together they make the whole space, so additivity would require area(A) = area(whole space) - area(~A), which Bel doesn’t satisfy.  This seems easier to understand than the Cox S thing, which fits with what Shafer said.

(Apparently, instead of a measure, it’s a “fuzzy measure.”  Instead of additivity, a fuzzy measure just needs to get the correct order on what I was calling “obviously-nested” sets earlier)

I can see the strong intuition behind the Cox S desideratum.  You should be able to take the negation of everything without changing any of the content.  Like, when we talk about A and ~A, neither has the intrinsic property of “being the one with the tilde.”  (Likewise with sets A, A^c.)  You can see the desideratum as a relatively weak way of trying to make things symmetric under negation – everything goes through the same function, so hopefully every property of b|a will have an equivalent for S(b|a).

So, if there’s an asymmetry between one side and the other, what broke the initial symmetry?  How do you decide which side is which?  (That’s what I imagine the strawman!Cox figure saying)

But then, A and ~A are always distinct, even if not because “one has the tilde.”  So for the D-S-using fraud protection worker, it is easy to break the symmetry because “Fraud” and “not Fraud” are different things.  (Thus if they’d flipped all their tildes at the start, the symmetry would have broken the same way, “not Fraud” getting 0.2 and “Fraud” getting 0.5.)

Still, if we understand the “not” here either as logical negation or as set complement, this is nonsensical.  Because in both those frameworks, the negation doesn’t contain any information not contained in the original.  Except …

If I think of “the information” used to specify sets S or S^c as a boundary, then S is “everything inside here” and S^c is “everything outside of here.”  Of course this visual picture depends on topological notions not present in the sets alone, but it suggests something true about spaces of ideas/hypotheses: we can draw a boundary around some ideas we know about, and the “inside here” set is all stuff we know about, but the “outside of here” set includes all other ideas, including ones we haven’t thought of.  So this is a very natural distinction in practice.

How would you formalize that?  I guess you’d have set theory in a universe (=“outcome space”) that wasn’t fully known, so you could say stuff like “I know 1 and 2 are in the universe, and I can make the set {1, 2}, but I don’t know if 3 is in the universe.”  This probably exists but I don’t know what it’s called.

(via raginrayguns)

identicaltomyself:

nostalgebraist:

Having thought about this for a few more minutes:

It seems like things are much easier to handle if, instead of putting any actual numbers (probabilities) in, we just track the partial order generated by the logical relations.  Like, when you consider a new hypothesis you’ve never thought about, you just note down “has to have lower probability than these ones I’ve already thought about, and higher probability than these other ones I’ve already thought about.”

At some point, you’re going to want to assign some actual numbers, but we can think of this step as more provisional and revisable than the partial order.  You can say “if I set P(thing) = whatever, what consequences does that have for everything else?” without committing to “P(thing) = whatever” once and for all, and if you retract it, the partial order is still there.

In fact, we can (I think) do conditionalization without numbers, since it just rules out subsets of hypothesis space.  I’m not sure how the details would work but it feels do-able.

The big problem with this is trying to do decision theory, because there you’re supposed to integrate over your probabilities for all hypotheses, whereas this setup lends itself better to getting bounds on individual hypotheses (“P(A) must be less than P(B), and I’m willing to say P(B) is less than 0.8, so P(A) is less than 0.8”).  I wonder if a sensible (non-standard) decision theory can be formulated on the basis of these bounds?
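
Here’s a minimal sketch of that bookkeeping (the data structure and names are mine, and it only handles upper bounds): record “at most as probable as” relations with no numbers attached, then propagate a provisional numeric cap through the partial order; the cap can be retracted without disturbing the relations.

```python
from collections import defaultdict

# at_most[a] = hypotheses known to be at least as probable as a
at_most = defaultdict(set)

def note(less_probable, more_probable):
    at_most[less_probable].add(more_probable)

def upper_bound(hypothesis, provisional):
    """Smallest provisional cap reachable by walking upward in the order.
    `provisional` maps some hypotheses to revisable numeric caps."""
    best, stack, seen = 1.0, [hypothesis], set()
    while stack:
        h = stack.pop()
        if h in seen:
            continue
        seen.add(h)
        best = min(best, provisional.get(h, 1.0))
        stack.extend(at_most[h])
    return best

note("A", "B")  # P(A) <= P(B)
note("B", "C")  # P(B) <= P(C)
print(upper_bound("A", {"C": 0.8}))  # 0.8, since P(A) <= P(B) <= P(C) <= 0.8
print(upper_bound("A", {}))          # 1.0: retract the number and the partial order survives
```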

I’ve seen papers on doing reasoning, based on propositions being more or less likely than other propositions, but without assigning numbers to the probabilities. Unfortunately, a half hour of poking around doesn’t turn up the papers I’m thinking of. The general area is called “valuation algebras on semirings”. In the case I remember, the semiring is Boolean algebra on propositions, which induces a partial order on the extent to which they are believed.

Anyway, that’s a not-very-useful half-assed reference. Now I’m going to switch to a more common mode of Tumblr discourse, i.e. talking about how what you say shows you’re thinking wrong (I may be misunderstanding what you say, but this being Tumblr, I will ignore that possibility.)

You’re operating on the principle that the goal of reasoning is to put probabilities on propositions. Then you find various problems involving e.g. what if you suddenly think of a new proposition, or realize that two propositions you thought were different are actually the same. But it seems to me that propositions are not the best thing to assign probabilities to.

What we want to find is a probability distribution over states of the world. Turning that into a probability for some proposition is a matter of adding up the probabilities of all the states of the world where that proposition is true. This is bog-standard measure theoretic probability theory, so it’s not just something I made up. You might find that thinking this way dissolves some of the perplexities you’ve been pondering in your last two posts.
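
A toy version of that picture (the worlds and numbers are made up by me): probability lives on fully specified states, and a proposition’s probability is the total probability of the states where it holds.

```python
# A tiny outcome space: each world fully specifies the weather and the traffic.
worlds = {
    ("rain", "traffic"): 0.3,
    ("rain", "no traffic"): 0.1,
    ("dry", "traffic"): 0.2,
    ("dry", "no traffic"): 0.4,
}

def prob(proposition):
    """P(proposition) = sum of the probabilities of the worlds where it's true."""
    return sum(p for world, p in worlds.items() if proposition(world))

print(round(prob(lambda w: w[0] == "rain"), 3))                       # 0.4
print(round(prob(lambda w: w[0] == "rain" or w[1] == "traffic"), 3))  # 0.6
```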

Thanks for the pointer about valuation algebras on semirings.

About world states – I addressed that in my original post, when I contrasted the die roll example (where we really can describe world states) to real-world claims like “Trump will be re-elected in 2020.”

If we actually want to specify states of the real world at the level of measure-theoretic outcomes (set elements, rather than sets), either we’ll throw away some of what we know about the world, or the outcomes would have to be things like quantum field configurations down to the subatomic scale.  (Indeed, even that would be throwing away knowledge, since we don’t have a unified theory of fundamental physics and aren’t fully committed to any of the theories we do have; the outcome-level description would have to involve different candidate laws of physics plus states in terms of them.)

The natural reflex is to do some sort of coarse-graining, where we abstract away from the smallest-level description, but at that point we’re basically doing Jaynes’ propositional framework, since we’re allowing that our most basic units of description could be refined further (we don’t specify O(10^23) variables for every mole of matter, but we allow that we might learn some of those variables later).

TBH, I think I am so skeptical of Bayes in part because I am used to thinking in the measure-theoretic framework, and it just seems so obvious that we can’t do practical reasoning with descriptions that are required to be that complete.  Jaynes’ propositional framework seems like an attempt to avoid this problem, or at least hide it, which is why I’m focusing on it – it’s less clear that it’s unworkable.

(via identicaltomyself)