
“Flattening the Curve” is a deadly delusion →

[EDIT: hello SSC readers!  This is a post I wrote quickly and with the expectation that the reader would fill in some of the unstated consequences of my argument.  So it’s less clear than I’d like.  My comment here should hopefully clarify things somewhat.]

———————–

[EDIT2: people seem really interested in my critique of the Gaussian curve specifically.

To be clear, Bach’s use of a Gaussian is not the core problem here, it’s just a symptom of the core problem.

The core problem is that his curves do not come from a model of how disease is acquired, transmitted, etc.  Instead they are a convenient functional form fitted to some parameters, with Bach making the call about which parameters should change – and how much – across different hypothetical scenarios.

Having a model is crucial when comparing one scenario to another, because it “keeps your accounting honest”: if you change one thing, everything causally downstream from that thing should also change.

Without a model, it’s possible to “forget” and not update a value after you change one of the inputs to that value.

That is what Bach does here: He assumes the number of total cases over the course of the epidemic will stay the same, whether or not we do what he calls “mild mitigation measures.”  But the estimate he uses for this total – like most if not all such estimates out there – was computed directly from a specific value of the replication rate of the disease.  Yet, all of the “mild mitigation measures” on the table right now would lower the replication rate of the disease – that’s what “slowing it down” means – and thus would lower the total.

I am not saying this necessarily means Bach is wrong, either in his pessimism about the degree to which slowing measures can decrease hospital overloading, or in his preference for containment over mitigation.  What I am saying is this: Bach does not provide a valid argument for his conclusions.

His conclusions could be right.  Since I wrote this, he has updated his post with a link to the recent paper from Imperial College London, whose authors are relatively pessimistic on mitigation.

I had seen this study yesterday, because an acquaintance in public health research linked it to me along with this other recent paper from the EPIcx lab in France, which is more optimistic on mitigation.  My acquaintance commented that the former seemed too pessimistic in its modeling assumptions and the latter too optimistic.  I am not an epidemiologist, but I get the impression that the research community has not converged to any clear conclusion here, and that the range of plausible assumptions is wide enough to drive a wide range of projected outcomes.  In any case, both these papers provide arguments that would justify their conclusions if their premises were true – something Bach does not do.

P. S. if you’re still curious what I was on about w/r/t the Gaussian, I recommend reading about thin-/heavy-/exponential-tailed distributions, and the logistic distribution as a nice example of the latter.]
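If you want a quick numeric feel for the difference, here’s a small sketch (mine, not Bach’s) comparing the tails of a normal and a logistic distribution matched to the same mean and variance, assuming scipy is available:

```python
# Minimal sketch: tail probabilities of a normal vs. a logistic distribution,
# matched to the same mean (0) and variance (sigma^2).
import numpy as np
from scipy import stats

sigma = 1.0
logistic_scale = sigma * np.sqrt(3) / np.pi   # logistic variance = scale^2 * pi^2 / 3

for k in (2, 3, 5):
    p_norm = stats.norm.sf(k * sigma, scale=sigma)               # P(X > k*sigma), normal
    p_logi = stats.logistic.sf(k * sigma, scale=logistic_scale)  # same, logistic
    print(f"P(X > {k} sigma): normal = {p_norm:.2e}, logistic = {p_logi:.2e}")

# The normal's tails fall off like exp(-x^2/2), the logistic's like exp(-x/s):
# a few sigma out, the logistic already has orders of magnitude more mass.
```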

———————–

I’ve seen this medium post going around, so I’ll repost here what I wrote about it in a Facebook comment.

This article simply does not make sense.  Here are some of its flaws:

- It assumes the time course of the epidemic will have a Gaussian functional form.  This is not what exponential growth looks like, even approximately.  Exponential growth is y ~ e^x, while a Gaussian goes like y ~ e^(-x^2), with a slower onset – the famous “light tails” of the normal distribution – and a narrow, sudden peak.  I don’t know why you’d model something that infamously looks like y ~ e^x as though it were y ~ e^(-x^2), even as an approximation, and the author provides no justification.

- Relative to a form that actually grows exponentially, most of the mass of a Gaussian is concentrated right around the peak.  So the top of the peak is higher, to compensate for the mass that’s absent from the light tails.  Since his conclusions depend entirely on how high the peak goes, the Gaussian assumption is doing a lot of work. [EDIT: I no longer think Bach would have drawn a different qualitative conclusion if he had used a different functional form.  See the step function argument from ermsta here.]

- No citation is provided for the 40%-to-70% figure, just the names and affiliations of two researchers.  As far as I can tell, the figure comes from Marc Lipsitch (I can’t find anything linking it to Christian Drosten).  Lipsitch originally derived this estimate in mid-February with some back-of-the-envelope math based on R0, and has since revised it downward as lower R0 estimates have emerged – see here for details.

- In that Lipsitch thread, he starts out by saying “Simple math models with oversimple assumptions would predict far more than that given the R0 estimates in the 2-3 range (80-90%),” and goes on to justify a somewhat lower number.

The “simple math” he refers to here would be something like the SIR model, a textbook model under which the fraction S_inf of people never infected during an epidemic obeys the equation R_0 * (S_inf - 1) - ln(S_inf) = 0.  (Cf. page 6 of this.)

Indeed, with R_0=2 we get S_inf=0.2 (80% infected), and with R_0=3 we get S_inf=0.06 (94% infected).  So I’m pretty sure Lipsitch’s estimate takes the SIR model as a point of departure, and goes on to postulate some extra factors driving the number down.
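(If you want to check those numbers, here’s a minimal sketch that solves the final-size equation numerically, assuming scipy:)

```python
# Minimal sketch: solve the SIR final-size equation
# R_0 * (S_inf - 1) - ln(S_inf) = 0 for the never-infected fraction S_inf.
import numpy as np
from scipy.optimize import brentq

def final_size(r0):
    f = lambda s: r0 * (s - 1.0) - np.log(s)
    # Bracket below the trivial root at S_inf = 1; f is positive near 0, negative just below 1.
    return brentq(f, 1e-12, 1.0 - 1e-9)

for r0 in (2.0, 3.0):
    s_inf = final_size(r0)
    print(f"R_0 = {r0}: S_inf = {s_inf:.3f}  ({1 - s_inf:.0%} infected)")
# R_0 = 2 gives roughly 80% infected; R_0 = 3 gives roughly 94%.
```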

But the SIR model, like any textbook model of an epidemic, produces solutions with actual exponential growth, not Gaussians!  There is no justification for taking a number like this and finding a Gaussian that matches it.  If you believe the assumptions behind the number, you don’t actually believe in the Gaussian; if you believe in the Gaussian (for some reason), you ought to ignore the number and compute your own, under whatever non-standard assumptions you used to derive the Gaussian.
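For contrast, here’s what a curve that does come from dynamics looks like: the textbook SIR system, integrated numerically.  This is a minimal sketch with illustrative parameter values, not estimates for this disease; the point is that changing beta (which is what mitigation does) changes both the peak and the total, because both are downstream of the same dynamics.

```python
# Minimal sketch: a textbook SIR model integrated numerically (illustrative parameters).
import numpy as np
from scipy.integrate import solve_ivp

def sir(t, y, beta, gamma):
    s, i, r = y
    return [-beta * s * i, beta * s * i - gamma * i, gamma * i]

beta, gamma = 0.4, 0.2                      # R_0 = beta / gamma = 2 (illustrative)
sol = solve_ivp(sir, (0, 300), [0.999, 0.001, 0.0],
                args=(beta, gamma), dense_output=True)
t = np.linspace(0, 300, 3001)
s, i, r = sol.sol(t)

print(f"peak infected fraction: {i.max():.3f}")
print(f"total ever infected:    {1 - s[-1]:.3f}")   # ~0.80, matching the final-size equation
# Lower beta (mitigation) and re-run: the peak drops *and* the total drops,
# because both come out of the same dynamics.
```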

- What’s more, he doesn’t say how his plotted Gaussian curves were derived from his other numbers.  Apparently he used the 40%-70% figure together with a point estimate of how long people spend in the ICU.  How do these numbers lead to the curves he plotted?  What does ICU duration determine about the parameters of a Gaussian?  Ordinarily we’d have some (simplified) dynamic model like SIR with a natural place for such a number, and the curve would be a solution to the model.  Here we appear to have a curve with no dynamics, somehow estimated from dynamical facts like ICU duration.

- Marc Lipsitch, on his twitter, is still pushing for social distancing and retweeting those “flatten the curve” infographics.  I suppose it’s conceivable that he doesn’t recognize the implications of his own estimate.  But that is a strong claim and requires a careful argument.

I don’t know if Lipsitch has read this article, but if he has, I imagine he experienced that special kind of discomfort that happens when someone takes a few of your words out of context and uses them to argue against your actual position, citing your own reputation and credibility as though it were a point against you.

@necarion​ (thread clipped for space)

If you had a hyperbolic latent space model *(pun brain, being hyperbolic: absolutely the best possible approach, there is no latent space model better)*, where the encodings and relationships were learned by the classifier, isn’t there a problem where “depth” in the hyperbolic space would start to become an overwhelming factor in the distance metric? Like, if you allowed for lots of space between “oncologist” and “dermatologist”, wouldn’t you also end up with a lot of space between either and “doctor”? I could see some silly results, like there being a smaller distance between “doctor of philosophy” and “doctor” than between “doctor” and “oncologist”. Or am I getting the approach wrong?

I think you’re right about how the distance metric behaves (not completely sure), but you’re assuming we want the distance metric to measure conceptual similarity, and we don’t necessarily need that.

Intuitively, what makes concepts similar or dissimilar has a lot to do with the kind of thing they point to (their position on the non-depth axis), and not as much to do with the specificity level of their pointing (position on the depth axis).

This is like a continuous/fuzzy version of the child/ancestor relations in the underlying tree structure: “oncologist” is inherently similar to “medical doctor” because it’s a child of “medical doctor” in the tree, a property enjoyed by any sub-sub-subtype of doctor but not by any kind of non-doctor.  But if you can embed trees in a continuous space, hopefully you can also derive useful continuous versions of important tree relations like parent/child, and you can use this rather than just distance when needed.  IIUC, “hyperbolic entailment cones” purport to provide just this.

So, if the hyperbolic metric doesn’t correspond better to intuitive similarity, what advantage am I claiming for it?  Well, the distances between things matter in NN training even before we impose any interpretation on them, because they affect gradients / interact with regularization.  This is hand-wavey, but IMO it’s bad if your parameters require tuning at too many different scales at once, and it will tend to leave some scales neglected by the optimizer in favor of others.

(Fine-tuning weights is costlier than setting them anywhere within a coarser range of values; learning a new fine-scale distinction costs about as much as refining the details of a coarse-scale distinction you mostly know already.  So it might never learn “oncologist,” preferring to invest ever further in refining the exact edge of the doctor vs. non-doctor boundary.  We think those aren’t equally important, and we need to convey that in the metric.)

(via necarion)

@marlemane (thread snipped for length)

Is it really necessary to embed your graph in a space? There’s a perfectly fine notion of distance on graphs you can define without respect to any embedding.

As I understand it, you’re mainly using the embedding of the graph into space so that you can classify by separation with hyperplanes, right? That is, a NN is a nonlinear map on your space that sends “x-like” to one side of the plane and everything else to the other side.

But couldn’t you hypothetically have a concept tree and a NN that tracks down branches based on the input? Something like object —> animate —> mammalish —> dog —> husky. A directed graph can even accommodate partially overlapping categories in a way that metric embedding necessarily cannot, so that you can also get to husky by object —> animate —> soft animate! —> husky.

As a purely anecdotal point, this model feels much more like how my daughter learned the world. She first learned objects, then animals as a category, then dogs, then specific breeds.

I’m not sure I understand your argument, but here are some stray comments:

The most interesting thing here is not picking nodes from graphs already known in advance, but learning graph structure automatically from data.  Although something that helps you do the latter will generally help with the former too.

There’s inherent value here in knowing that you can embed something in a differentiable manifold, because an NN is a machine for “learning” mappings between differentiable manifolds.  (They have to be differentiable because the “learning” involves using derivatives.)

Of course, lots of NNs have outputs that don’t live on manifolds.  Like discrete labels, or just True vs. False.  But if you look under the hood, these are really just compositions of two pieces:

  1. A map X -> Y between two manifolds, which is learned from data in a complicated way (with 99% of research energy going into the complications of this step)

  2. A simple, fixed, user-supplied map Y -> Z between the output manifold of step 1 and the actual output space Z

In classification by hyperplanes, for example, step #1 is everything up until the point where you have all the signed distances from the hyperplanes, and then step #2 is where you read off which of those distances is highest.
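As a toy illustration of that decomposition (my own sketch, with made-up sizes and random weights):

```python
# Toy sketch of the two-step decomposition (made-up sizes, random weights).
import numpy as np

rng = np.random.default_rng(0)

# Step 1: learned map X -> Y.  Here, a tiny net whose last layer outputs the
# signed distances of the representation from 4 hyperplanes (rows of W2).
W1, b1 = rng.normal(size=(16, 8)), np.zeros(16)
W2, b2 = rng.normal(size=(4, 16)), np.zeros(4)

def step1(x):
    h = np.tanh(W1 @ x + b1)   # nonlinear features
    return W2 @ h + b2         # a point on Y = R^4: the signed distances

# Step 2: fixed, user-supplied map Y -> Z (discrete labels).
def step2(y):
    return int(np.argmax(y))   # read off which signed distance is highest

x = rng.normal(size=8)
print(step2(step1(x)))         # the composite "classifier"
```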

Thus, “under the hood,” an NN is always learning to select points on a manifold.  There may be an additional step of translation/interpretation which converts the thing the NN naturally does (“I have selected this point”) to a judgment we care about (“the picture is a dog,” or something).

But this only works insofar as these judgments are actually well-modeled by selecting points on a manifold.  If your output space Z has some property you care about, it matters whether that property can be “translated” into some property defined on manifolds.

——

Here’s an example.  Imagine the elements of Z are truth-assignments on a boolean algebra.  In principle, for your map Y -> Z from the manifold Y, you could choose anything whatsoever; you could carve up Y into whatever subsets you want and give each one some arbitrary truth-assignment.  But you’d have to make sure that all these truth-assignments were consistent, obeying the rules of Boolean algebra – this would be “your job,” and not something that happens automatically.

On the other hand, suppose you choose Y -> Z in a particular way, with conjunctions in the algebra always translating into set intersections on the manifold, and disjunctions translating into set unions.  Then the rules of the algebra will always be obeyed, “for free,” in the output you get.  A Boolean-algebraic structure was already there in the manifold, so the outputs of the manifold-learner already had that structure, even before you did any interpretation.
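A tiny toy version of the “for free” point, with finite sets standing in for regions of the manifold (my own example):

```python
# Toy version: propositions as subsets of Y, conjunction as intersection,
# disjunction as union.  Boolean laws then hold in the output automatically.
Y = set(range(100))                      # stand-in for the manifold Y
A = {y for y in Y if y % 2 == 0}         # "y is even"
B = {y for y in Y if y < 50}             # "y is small"

# De Morgan's law, checked on the interpreted propositions:
lhs = Y - (A & B)                        # not (A and B)
rhs = (Y - A) | (Y - B)                  # (not A) or (not B)
assert lhs == rhs                        # holds "for free", no extra bookkeeping
```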

——

Likewise, in the case of graphs, you can always find some way to map a manifold Y onto some particular graph Z.

But if you know graphs of some kind can be embedded in Y without distortion, that means the structure is “already there” in Y, like the Boolean-algebraic one.

So, you can have hope that a generically powerful manifold learner for Y will also be a generically powerful learner for those graphs – by virtue of its manifold-learning powers alone.  You can have hope that manifold learning will naturally and automatically pick up on this kind of pattern in the data (because it is a kind of pattern natural to manifolds, which a good manifold learner ought to care about).  You no longer need to worry about the tension between the problem you care about and the “manifold version” of it which the learner cares about – the “manifold version” of the problem just is the problem.

(via marlemane)

the-real-numbers:

necarion:

nostalgebraist:

jadagul:

This looks cool and I need to read it later.

the-real-numbers:

Just, uh, gonna leave this here for… reasons

https://arxiv.org/pdf/1610.08401.pdf

(Tagging @stumpyjoepete​ since he tagged me on this post)

This is definitely a cool result.

It’s an extension of previous adversarial example work, showing that you can find a single adversarial perturbation  – i.e. a very faint, nearly imperceptible pattern you can layer on top of an image that will cause neural net classifiers to mis-classify it – that works generically for any image in the standard ImageNet challenge dataset.  These even generalize across different classifiers, to some extent.

My strong hunch is that this is a “feature, not a bug,” and reflects the inherent mismatch between the ImageNet challenge and real vision, rather than reflecting a flaw in neural net image classifiers.

The paper doesn’t draw this conclusion, but it contains various pieces of evidence pointing in that direction, IMO.  Namely:

  • As mentioned, if you design one of these “universal perturbations” to target one classifier, it will also tend to fool other classifiers, even those with very different architectures.

    This increases the burden of proof for someone arguing that this reflects a flaw in how these models classify images: this person would not be arguing just that some architecture has a blind spot, they’d be arguing that many apparently distinct architectures somehow have the exact same blind spot.

    On the other hand, the different architectures have this in common: they’re all good at the ImageNet challenge.  So if “susceptibility to universal perturbations” is actually a natural result of being good at ImageNet, it’s no surprise that all the architectures have that property.  (Humans find the ImageNet challenge difficult without special training, so it’s not a problem for this hypothesis that humans aren’t thus susceptible.)

  • The authors do a finetuning experiment that tried to teach the VGG-F architecture not to misclassify the perturbed images.  This helped a little, but could not get the model below a “fooling rate” of 76.2%, which is still high.

    To explain this as a defect in the architecture, one would have to imagine that the universal perturbations are somehow “invisible” to it in a way that prevents it from learning a signal correlated with them; this seems implausible.  [ETA: of course the perturbations aren’t invisible to the models, otherwise they wouldn’t work.]  But if “don’t misclassify the perturbed images” actually competes with “do well at ImageNet,” then of course you won’t get very far on the former while still trying to preserve the latter.  (In this connection, note also the following: “This fine-tuning procedure moreover led to a minor increase in the error rate on the validation set […]”)

  • The incorrect class labels given to perturbed images tend to come from some very small set of “dominant” labels, as visualized in the directed graph.

    This made me think of a hypothesis like “there are a few classes in the ImageNet challenge that have certain distinctive visual patterns not shared by any other classes, and so the optimal way to identify these classes (in the context of the challenge) is just to check for these patterns.”

    This seems a priori plausible.  The ImageNet challenge asks for classification at a very fine-grained level, without partial credit for getting the right general sort of thing.  Many of the 1000 ImageNet challenge classes are specific species (or other low-level taxonomic group) of animal.  The images themselves, largely scraped from Flickr, are photographs of the animals (or other things) from numerous angles, in numerous contexts, sometimes partially obscured, etc.  In this context, developing a high-level concept like “bird” is actually quite difficult, and of limited value (no partial credit for knowing it’s a bird unless you can tell exactly what kind of bird).  But identifying the distinctive markings that are the hallmark of one exact kind of bird will work.

    When you get points for saying “African grey” but not for another kind of parrot, and you have to do this across diverse pictures of African greys, and you’re a neural net that doesn’t know anything at the outset, of course you’re going to develop a detector for some exact textural feature that only African greys have and use that as your African grey detector, and skip over the much harder task of developing detectors for “parrot” or “bird.”

    (African grey is in fact one of the dominant labels.  Macaw is another.)

The authors do this other thing where they look at singular values of a matrix of vectors from images to the nearest decision boundaries, and show that these vectors have some orientations much more often than others.  I’m not sure I understand this part – isn’t it just a restatement of the result, not an explanation of it?  (If this were false, wouldn’t the result be impossible?)

Anyway, this way of describing the situation – “the nearest decision boundary is frequently in a specific direction” – needs to be interpreted in light of the dominant-labels thing.  It would be different, and arguably more interesting, if there weren’t dominant labels, or if they weren’t quite so dominant; in that case the result would mean that the models identify certain textural differences as inherently “salient for distinctions.”

Instead, it just means that the models make some distinctions differently than others.  Some distinctions are made in a more “realistic” way, on the basis of higher-level features that correspond to different pixel-level variations depending on what base image you’re varying.  And then, some are just simple pattern detectors that always look about the same on the pixel level.  And again, that’s not really surprising.  Distinguishing bird from non-bird is a high-level judgment, but distinguishing one species within birds really is a matter of looking for one telltale pattern that’s relatively stable across orientations.

Now, if you’re a human who has to track objects over time, understand salient categories like “is this animate?”, and so on, you will tend to make the “YES-bird” and “YES-African-grey” judgments simultaneously.  Thus it sounds bizarre for something to say “YES-African-grey” when it’s looking at a bathtub that happens to have a bit of the African grey texture sprinkled on top.  But if you’re an ImageNet challenge machine, the “YES-bird” judgment doesn’t even exist in your universe.  In the toy ImageNet universe, in fact, it is arguably not even wrong to classify that bathtub as an African grey – for in that universe, there are no birds as such, and there is no such thing as a bird for a bathtub to be distinctively not.

Are there CNN training sets that include these hierarchies? So something could be an African Grey and a Parrot and Bird? Or modifying the network to go through some sort of word embedding, so results that are particularly closely clustered might be “partly” acceptable to the training?

There are CNN data sets that have hierarchical classes in the DSP/ML space. I’m not sure how available they are to laypeople. Sometimes you can handle the subclass superclass problem by classifying on the subclasses and have an additional loss factor for superclasses/categories, although I imagine you could try having one head CNN for superclasses that passes off the processed images to various trunks for subclassing.

But say for example if it’s hard to tell the difference between a titmouse and a pug. The traditional superclass may send titmice to the wrong subclass net and you’re guaranteed to get a wrong answer.

Although, you may find that you might want to superclass based on the most confused subclasses, which could mean training a subclassifier and determining superclasses with a mutual information approach or eyeballing a confusion matrix, then trying again.

A relevant, fairly new area of research that I find exciting is hyperbolic embeddings.  Some key papers are

  1. The original paper introducing them (or the one everyone cites, anyway)
  2. This paper which provided an important conceptual advance over #1
  3. This one which builds up some of the necessary building blocks for neural nets over these spaces

The idea behind hyperbolic embeddings is… hmm, let me describe it this way.  Suppose you have some hierarchically nested categories, and you’re trying to model them in Euclidean space in some way.

There are two (?) ways to do this (this distinction is mine, not from the above papers):

  • “Map” model: each category is a region of R^n, and the hierarchy’s nesting relation is represented by the R^n subset relation.  Like, “human” might be some blob of R^n, and “doctor” is a proper subset of that blob, and then “oncologist” is a proper subset of “doctor,” and so forth.

    This is like a map, where “doctor” is inside “human” the way “Colorado” is inside “U.S.”

  • “Tree” model: each category is a point in R^n, and the points are arranged like a literal picture of a branching tree structure.   If the tree(s) start at the origin, the nesting relation is represented by R^n vector magnitude, with more specific categories further from the origin.

Now, a downside of the “map” model is that finer-grained category distinctions are encoded as smaller distances in R^n.  This might sound natural (aren’t they “smaller” distinctions?), but the practical importance of a distinction doesn’t necessarily scale down with its specificity.  (Sometimes it’s very important whether a doctor is an oncologist or not, even though that’s a “fine-grained” distinction if your perspective also captures doctor vs. non-doctor and human vs. non-human.)

One might hope that the “tree” model could solve this problem: you can have each level “fan out” from the previous level in space, making its nodes just as far apart from one another as the nodes in the previous level.

But, in Euclidean space, there isn’t enough room to do this.  Deeper levels in the tree have exponentially more nodes, so you need exponentially more volume to put them in, but going further from the origin in R^n only gives you polynomially more volume.
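Here’s a quick back-of-the-envelope version of that counting argument (my numbers, purely illustrative):

```python
# Back-of-the-envelope: nodes of a b-ary tree at depth d (exponential in d)
# vs. the volume of a Euclidean ball of radius d in R^n (polynomial in d).
import math

def ball_volume(radius, n):
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1) * radius ** n

b, n = 3, 10    # branching factor and embedding dimension (illustrative choices)
for depth in range(5, 40, 5):
    nodes = b ** depth
    volume = ball_volume(depth, n)
    print(f"depth {depth:2d}: nodes = {nodes:.2e}, ball volume in R^{n} = {volume:.2e}")
# Past some depth the node count dwarfs the available volume: if same-depth nodes
# must stay a fixed distance apart, there is simply no room left in R^n.
```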

However, hyperbolic space gives you just what you want: exponentially more volume as you go out.  Like in the famous Escher illustrations (visualizing the Poincare disk model of 2D hyperbolic space):

[image: one of Escher’s illustrations of the Poincare disk]

In the actual hyperbolic metric, the bats are all the same size.  A tree embedded in the Poincare disk model might look like (figure from the Poincare Embeddings paper):

[image: a tree embedded in the Poincare disk, from the Poincare Embeddings paper]

where again, things don’t actually get closer together near the rim, they’re just visualized like that.
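For concreteness, this is the Poincare-ball distance used in the Poincare embeddings paper; the example points below are mine:

```python
# The Poincare-ball distance:
#   d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
import numpy as np

def poincare_distance(u, v):
    u, v = np.asarray(u, float), np.asarray(v, float)
    num = 2.0 * np.sum((u - v) ** 2)
    den = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + num / den)

origin = [0.0, 0.0]
near_rim_a = [0.0, 0.99]
near_rim_b = [0.0995, 0.985]      # visually right next to near_rim_a in the disk picture

print(poincare_distance(origin, near_rim_a))      # already large (~5.3)
print(poincare_distance(near_rim_a, near_rim_b))  # also large (~4.6), despite the picture
```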

OK, so what does that have to do with the original topic?

Well, almost any classifier you encounter these days is going to do two things: map its inputs onto a (Euclidean) latent space in some complicated non-linear fashion, and then divide up that latent space into regions for the different labels.  (Usually the latter step is done with hyperplanes.)

We’re discussing ways of letting the classifier “know” that the labels have a hierarchical structure, with some of them “going together” as part of a larger group, which might then be part of an even bigger group etc.

If we do this by allowing “partial credit” for labels in the same coarse class (as in @necarion​‘s word embedding proposal), this will encourage the network to put these labels close together in the latent space.  Which is like the “map” model: all the types of bird will get assigned to adjacent regions, and you could draw a big shape around them and say “this is ‘bird’.”  So at best we end up with the “map” model, with its “oncologist problem” as described above.
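A minimal sketch of what such a partial-credit objective could look like (my own toy construction, not anyone’s actual proposal): add a coarse-label cross-entropy term whose class probabilities are just sums of the fine-label probabilities.

```python
# Toy "partial credit" loss: fine-label cross-entropy plus a coarse-label term
# whose probabilities are sums of fine-label probabilities within each superclass.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

fine_labels = ["african_grey", "macaw", "pug", "bathtub"]
superclass_of = {"african_grey": "bird", "macaw": "bird", "pug": "dog", "bathtub": "object"}
superclasses = ["bird", "dog", "object"]

def partial_credit_loss(logits, true_fine, alpha=0.5):
    p_fine = softmax(logits)
    fine_ce = -np.log(p_fine[fine_labels.index(true_fine)])
    p_coarse = np.array([sum(p for lbl, p in zip(fine_labels, p_fine)
                             if superclass_of[lbl] == sc) for sc in superclasses])
    coarse_ce = -np.log(p_coarse[superclasses.index(superclass_of[true_fine])])
    return fine_ce + alpha * coarse_ce

logits = np.array([0.5, 2.0, 0.1, -1.0])            # the net is confident it's a macaw
print(partial_credit_loss(logits, "african_grey"))  # wrong bird: partial credit, lower loss
print(partial_credit_loss(logits, "pug"))           # wrong superclass too: higher loss
```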

Alternately, you can actually change the model to explicitly encode the hierarchy – like what @the-real-numbers​ describes, where you have different classifiers for different levels.  This can let you get around the downsides of the Euclidean “map” model, because the different classifiers can operate only on their own scales: the coarse classifier that just has to output “bird” is free to squash lots of bird types close together in its latent space, while the intra-bird classifier gets a whole latent space just for birds, so it can make them far apart.

Suppose – as the hyperbolic embedding work suggests – that the discriminations we want out of the model cannot be mapped well onto distances in Euclidean space.

Then:

  • The partial-credit approach says “let’s just do the best we can in Euclidean space, with the nesting relation of an arbitrary hierarchy modeled by the subset relation on a Euclidean space learned from data with that hierarchy.”

    This provides an intrinsic model for “nesting” as a generic concept, but distances inside the same model don’t behave in all the ways we’d like (oncologist problem).

  • The multiple-classifier approach says “let’s give up on modeling the nesting relation of an arbitrary hierarchy; instead let’s tie ourselves down to one specific hierarchy, and design N copies of Euclidean space tailored for it.”

    This does not provide an intrinsic model for “nesting” as a concept – you’re tied to one particular case of nesting, expressed by the output code that maps the various latent spaces to parts of your specific hierarchy.

With hyperbolic latent space, hopefully you can model the nesting relation as a relation in the space (intrinsic) and still have the distinctions you want to make map naturally onto distances in the space (no oncologist problem).

maybesimon asked: hey, this is a Blast From The Past but do you happen to know if there's a formal name for that conjunction-fallacy-type thing that you guys talked about here? (jadagul (DOT) tumblr (DOT) com/post/142447219223), the thing with the nested outcomes and that it is impossible to assign coherent probabilities to it?

I don’t know of a formal name for it, no. Anyone?

Bayes Trubs, part 1

a-point-in-tumblspace:

Tldr: there are circumstances (which might only occur with infinitesimal probability, which would be a relief) under which a perfect Bayesian reasoner with an accurate model and reasonable priors – that is to say, somebody doing everything right – will become more and more convinced of a very wrong conclusion, approaching certainty as they gather more data.

Keep reading

Thanks for this post, it helped me understand an interesting-seeming paper that I’ve also found tough to read.

Digression

Freedman and Diaconis published a whole bunch of Bayesian consistency counterexamples like this over the course of their careers.  I’m honestly not sure whether any of them have clear practical significance, although I think they have theoretical significance by showing that Bayesian inference is harder to write down as a complete and satisfactory piece of mathematics than some might think.

Specifically, I get a “Counterexamples in Analysis” flavor from them (for one thing, they are literally counterexamples in analysis).  They are symptoms of the fact that the natural mathematical setting for probability is a setting with a lot of counter-intuitive pathologies.  So, it shouldn’t be surprising that these examples exist: if they didn’t, then the formalization of Bayesian inference would have gone unusually smoothly.

End digression

Here are some thoughts about this example specifically.

Null sets

It’s crucial that the true parameter be exactly 0.25, not just close to 0.25.  Otherwise the inconsistency would violate Doob’s result, that the Bayesian is consistent except on a set of prior measure 0.  The example can work the way it does because {θ: θ=0.25}, like any singleton set, is a set of measure zero (a null set) in this prior.

IMO, the intuition that the example is troubling actually conflicts with the prior in the example.  The prior makes {θ: θ=0.25} a null set, which means it views things that happen in that set and only there as negligible.  For example, the behavior there won’t influence any expectation values, so it won’t influence any decisions made by maximizing expected utility over the posterior.

The prior is saying we can “write off” arbitrary pathologies happening only at this point (or only happening at any given point).  If we don’t think the exact value θ=0.25 can be written off like this, we should put a point mass there in our prior.  To put it another way, while it’s theoretically interesting to explore what can go wrong for a Bayesian on one of their null sets, if you think it’s important what happens on the null sets then you are effectively saying they aren’t null sets (in your opinion).  The Bayesian who does view them as null sets actually doesn’t mind the pathologies, and behaves consistently given that.

Could something go wrong in practice?

Now on to something a bit more interesting.  At the end, you write:

But… just because this effect can’t mislead you literally forever doesn’t mean it can’t mislead you for a very long time.

That is: if we look at some non-null set like {θ: θ-ε < 0.25 < θ+ε}, then yeah, for (prior-)almost all θ in the set, we will eventually converge.  But as we make ε small, the convergence will take longer and longer as we are “fooled” by more large observations.  Is this bad?

I don’t think so.  One way to describe the situation is as follows.  Let E_n be the event (in observation space) that “an observation demonstrates the threshold is ≥ n”.  Then we’ve defined a sequence of events {E_n} with these properties:

(i) For each event E_n, there are “two fundamentally different ways” the event could happen, corresponding to the θ~0.25 and θ~0.75 regions.  We have two “types” of hypotheses: I’ll call these hypotheses of “the first class” (θ~0.25) and “the second class” (θ~0.75).

(ii) For any fixed n, both of the ways for E_n to happen have non-zero prior mass.

(iii) For large n, the prior mass of the first way E_n could happen (θ~0.25) is small relative to the prior mass of the second way (θ~0.75).  As n goes to infinity, this ratio goes to zero.

Now, for any specific value of n, these don’t seem problematic at all.  We have two classes of hypotheses, both capable of explaining events of type E_n.  But as we follow the sequence E_n, letting n grow large, we’re considering types of observations that can only be explained by more and more (prior-)unlikely variants of the first hypothesis class.

It doesn’t seem bad at all that these observations push us toward the second hypothesis class.  The observations can be explained two ways: either θ~0.25 and θ is very closely fine-tuned (where the extremity of the “very” grows with n), or θ~0.75 and more generic.  All else being equal, this really does weigh toward θ~0.75.

So there’s nothing wrong with the updates on any specific E_n.  What still feels worrying, if anything does, is something about the limit in (iii).

After all, for every n, there is a positive-prior-mass set of hypotheses in the “first class” that would yield E_n if actually true.  Yet as n grows large, we find E_n to be more and more overwhelming evidence against the first class in favor of the second.  Isn’t that weird?

Actually, it’s completely normal.  Again, we must take the prior seriously; otherwise we’re only quibbling with the prior, not with “Bayes” itself.  (Or perhaps we are pointing out that Bayes can be tricky in practice, but not undermining it in theory.)

So: it is true that for any n, the event E_n could occur due to either a first-class or a second-class situation.  But for very large n, we should be very surprised to see a first-class hypothesis causing E_n: the stars have to really align for that to happen.

As we follow E_n into the limit, the cases where the truth has θ~0.25 get more and more inconvenient for the Bayesian.  But they also get more and more improbable, in terms of prior mass.  That’s why the Bayesian updates away from θ~0.25: as n grows large, an increasingly (prior-)unlikely coincidence is necessary to preserve the belief that we’re near θ=0.25 and not near θ=0.75.  So, yes, if a very unlikely situation occurs and mostly resembles some very likely situation, the Bayesian is going to have a bad time, but they’re having a bad time because they rationally conclude they’re in the likely situation and just happen to be wrong by (increasingly unlikely) construction.

That’s not to say that this was immediately obvious to me, and I think it’s a useful example of how a prior can imply things you don’t realize it implies.  This behavior is rational given a reasonable-looking continuous prior over values of θ.  If there’s something weird going on, it’s possibly that you don’t think the “reasonable-looking” prior is actually reasonable, once you consider everything it implies.  Or, on the other hand, that you do find it reasonable upon reflection but don’t find all of its consequences immediately intuitive, even though it (or things like it) are supposed to capture your real state of prior knowledge.  But now I’m slipping into some argument I’m much less confident in, so I should stop here.

Strategies and techniques for proving theorems have a funny status in mathematics.

If “mathematics” is a collection of precisely defined structures and truths about them – and it’s, at least, very common to talk as though it is – then proof strategies are not really “mathematics.”  If a certain theorem is especially useful/convenient for proving another sort of theorem, that may be interesting, but it’s not a mathematical fact.

There might be a mathematical fact lurking behind it, like a theorem precisely characterizing the class of things the useful theorem can be used for.  But at this point, we’d no longer be in the domain of proof techniques.  “Applying Theorem T is one of the standard strategies to try when dealing with an object of type P” is a statement about technique, but “Theorem T can reduce all Ps having property Q into Rs” is just another theorem like T, even if it’s the entire reason why the standard strategy became standard.

But if the proof techniques are not a part of mathematics, then what does mathematics look like on its own, if you try to forget every piece of knowledge about devising proofs that you can’t distill into mere mathematics?  I don’t think I’ve seen much attention given to this question.

I guess it isn’t very compatible with the practice of mathematics itself.  You can’t push mathematics forward when taking this perspective.  Still, it seems worth knowing how much of the usual mental picture of mathematics – especially all the connections between different subfields and objects – is really depicting the structure of proof-usefulness, rather than the structure of the objects themselves.

(For example, it doesn’t feel very illuminating about the real numbers to say that they can be embedded in the complex plane.  But this embedding can be used to do contour integration and prove lots of facts about real numbers.  Is this telling us anything about real numbers apart from the value of the integrals?  Is there a way to make this “connection” between R and C interesting while only citing facts about those objects, not about your work?

IDK, in this case there might be an affirmative answer that I just don’t know about [if so please tell me].  But this sort of proof-based gut feeling is a thing that happens a lot in a lot of areas, I think)

furioustimemachinebarbarian asked: I think, but don't know for sure, that the reason variational Bayes methods look weird is that they were derived from physical principles following people like Jaynes. In practice, optimizing in variational Bayes looks like minimizing a free energy. The factorization over variables isn't generally true, but is likely physically true when your variables are the positions of a bunch of particles in thermodynamic equilibrium. It looks like a physics based method getting in over its head.

Ah! Yeah, that makes sense.

As it happens, the Gibbs distribution in stat. mech. used to confuse me too – it was clearly just wrong about some things, most obviously whether more than one value of the total energy is possible, and the sources I originally read about it did not clarify which calculations it was supposed to be valid for. And the confusing choice is the same one: replacing a distribution where variables “compete” with one where they’re independent, and then doing calculations on it as if it’s the original one.

But in stat. mech., you can go out and find rigorous arguments about why this calculation technique is valid and useful for specific things, like computing the marginal over M variables out of N when M<<N,  N –> ∞. By contrast, variational Bayes is presented as a way of getting an “approximate posterior,” which you then use for whatever calculations you wanted to do with the real posterior. Which allows for the sort of invalid calculations I used to worry about with Gibbs, like getting a nonzero number for var(E).

I suppose the Gibbs-valid calculations, of one or a few marginals from many variables, are what you want in statistics if you’re just trying to estimate the marginal for some especially interesting variable. Except… for any variable to be “especially interesting,” there must be something special about it that breaks the symmetry with the many others, which prevents the standard Gibbs argument from working. To put it another way, Gibbs tells you about what one variable does when there are very many variables and they’re all copies of each other, but a model like that in statistics won’t assign interesting interpretations to any given variable. It’s only in physics that you get collections of 10^23 identical things that you believe actually, individually exist as objects of potential interest.

It doesn’t mention the word “variational,” but Shalizi’s notebook page about MaxEnt is about exactly this issue, and it was very helpful to me many years ago when I was trying to understand Gibbs and various non-textbook uses of it.

There’s something that seems really weird to me about the technique called “variational Bayes.”

(It also goes by various other names, like “variational inference with a (naive) mean-field family.”  Technically it’s still “variational” and “Bayes” whether or not you’re making the mean-field assumption, but the specific phrase “variational Bayes” is apparently associated with the mean-field assumption in the lingo, cf. Wainwright and Jordan 2008 p. 160.)

Okay, so, “variational” Bayesian inference is a type of method for approximately calculating your posterior from the prior and observations.  There are lots of methods for approximate posterior calculation, because nontrivial posteriors are generally impossible to calculate exactly.  This is what a mathematician or statistician is probably doing if they say they study “Bayesian inference.”

In the variational methods, the approximation is done as follows.  Instead of looking for the exact posterior, which could be any probability distribution, you agree to look within a restricted set of distributions you’ve chosen to be easy to work with.  This is called the “variational family.”

Then you optimize within this set, trying to pick the one that best fits the exact posterior.  Since you don’t know the exact posterior, this is a little tricky, but it turns out you can calculate a specific lower bound (cutely named ELBO) on the quality of the fit without actually knowing the value you’re fitting to.  So you maximize this lower bound within the family, and hope that gets you the best approximation available in the family.  (“Hope” because this is not guaranteed – it’s just a bound, and it’s possible for the bound to go up while the fit goes down, provided the bound isn’t too tight.  That’s one of the weird and worrisome things about variational inference, but it’s not the one I’m here to talk about.)
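To make “a bound you can compute without knowing the thing you’re fitting to” concrete, here’s a toy conjugate example (mine, chosen so everything is checkable in closed form): the Monte Carlo ELBO equals the log evidence minus the KL to the true posterior.

```python
# Toy conjugate model: z ~ N(0,1), x | z ~ N(z,1).  Then p(z|x) = N(x/2, 1/2) and
# p(x) = N(0, 2), so we can check the identity ELBO(q) = log p(x) - KL(q || p(z|x)).
import numpy as np

rng = np.random.default_rng(0)
x = 1.3                                   # one observation (illustrative)

def elbo(m, s, n=200_000):
    z = rng.normal(m, s, size=n)          # samples from q = N(m, s^2)
    log_joint = (-0.5 * z**2 - 0.5 * np.log(2 * np.pi)            # log p(z)
                 - 0.5 * (x - z)**2 - 0.5 * np.log(2 * np.pi))    # log p(x | z)
    log_q = -0.5 * ((z - m) / s)**2 - np.log(s) - 0.5 * np.log(2 * np.pi)
    return np.mean(log_joint - log_q)     # computable without knowing p(z|x)

def kl_gauss(m1, s1, m2, s2):             # KL( N(m1, s1^2) || N(m2, s2^2) )
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

log_evidence = -0.5 * x**2 / 2 - 0.5 * np.log(2 * np.pi * 2)      # log N(x; 0, 2)
m, s = 0.3, 1.0                           # some q, deliberately not the optimum
print(elbo(m, s))                                            # Monte Carlo ELBO
print(log_evidence - kl_gauss(m, s, x / 2, np.sqrt(0.5)))    # same, up to MC noise
```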

The variational family is up to you.  There don’t seem to be many proofs about which sorts of variational families are “good enough” to approximate the posterior in a given type of problem.  Instead it’s more heuristic, with people choosing families that are “nice” and convenient to optimize and then hoping it works out.

This is another weird thing about variational inference: there are (almost) arbitrarily bad approximations that still count as “correctly” doing variational inference, just with a bad variational family.  But since the theory doesn’t tell you how to pick a good variational family – that’s done heuristically – the theory itself doesn’t give you any general bounds on how badly you can do when using it.

In practice, the most common sort of variational family, the one that gets called “variational Bayes,” is a so-called “mean field” or “naive mean field” family.  This is a family of distributions with an independence property.  Specifically, if your posterior is a distribution over variables z_1, …, z_N, then a mean-field posterior will be a product of marginal distributions p_1(z_1), …, p_N(z_N).  So your approximate posterior will treat all the variables as unrelated: it thinks the posterior probability of, say, “z_1 > 0.3” is the same no matter the value of z_2, or z_3, etc.

This just seems wrong.  Statistical models of the world generally don’t have independent posteriors (I think?), and for an important reason.  Generally the different variables you want to estimate in a model – say coefficients in a regression, or latent variable values in a graphical model – correspond to different causal pathways, or more generally different explanations of the same observations, and this puts them in competition.

You’d expect a sort of antisymmetry here, rather than independence: if one variable changes then the others have to change too to maintain the same output, and they’ll change in the “opposite direction,” with respect to how they affect that output.  In an unbiased regression with two positive variables, if the coefficient for z_1 goes up then the coefficient for z_2 should go down; you can explain the data with one raised and the other lowered, or vice versa, but not with both raised or lowered.
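A quick numeric illustration of that competition (my own toy example): with two nearly collinear predictors, the exact (Gaussian) posterior over their coefficients is strongly negatively correlated.

```python
# Two nearly collinear predictors: the posterior over their coefficients is
# negatively correlated, because they compete to explain the same output.
import numpy as np

rng = np.random.default_rng(0)
n = 500
z1 = rng.normal(size=n)
z2 = 0.9 * z1 + 0.1 * rng.normal(size=n)      # nearly collinear with z1
X = np.column_stack([z1, z2])

# Conjugate Bayesian linear regression, unit prior precision, unit noise variance:
# the posterior covariance of the coefficients is (X^T X + I)^(-1).
post_cov = np.linalg.inv(X.T @ X + np.eye(2))
corr = post_cov[0, 1] / np.sqrt(post_cov[0, 0] * post_cov[1, 1])
print(corr)   # strongly negative: raise one coefficient, the other must drop
# A naive mean-field approximation sets this correlation to exactly zero.
```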

This figure from Blei et al shows what variational Bayes does in this kind of case:

[image: figure from Blei et al., showing a mean-field approximation squashed into the middle of a correlated, oval-shaped exact posterior]

The objective function for variational inference heavily penalizes making things likely in the approximation if they’re not likely in the exact posterior, and doesn’t care as much about the reverse.  (It’s a KL divergence – and yes you can also do the flipped version, that’s something else called “expectation propagation”).

An independent distribution can’t make “high x_1, high x_2” likely without also making “high x_1, low x_2” likely.  So it can’t put mass in the corners of the oval without also putting mass in really unlikely places (the unoccupied corners).  Thus it just squashes into the middle.

People talk about this as “variational Bayes underestimating the variance.”  And, yeah, it definitely does that.  But more fundamentally, it doesn’t just underestimate the variance of each variable, it also completely misses the competition between variables in model space.  It can’t capture any of the models that explain the data mostly with one variable and not another, even though these models are as likely as any.  Isn’t this a huge problem?  Doesn’t it kind of miss the point of statistical modeling?
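You can see both effects in the simplest possible case, a correlated 2D Gaussian “posterior.”  It’s a standard result that the best naive mean-field Gaussian (in the KL(q || p) sense) keeps the means, drops the correlation entirely, and shrinks each marginal variance to the reciprocal of the corresponding diagonal entry of the precision matrix.  A minimal numeric sketch:

```python
# Exact posterior: a correlated 2D Gaussian.  Best naive mean-field Gaussian
# (minimizing KL(q || p)): independent marginals with variance 1 / Lambda_ii,
# where Lambda is the precision matrix.
import numpy as np

rho = 0.9
Sigma = np.array([[1.0, rho],
                  [rho, 1.0]])          # exact posterior covariance (illustrative)
Lambda = np.linalg.inv(Sigma)           # precision matrix

true_marginal_var = np.diag(Sigma)          # [1.0, 1.0]
mean_field_var = 1.0 / np.diag(Lambda)      # [1 - rho**2, 1 - rho**2] = [0.19, 0.19]

print("true marginal variances:", true_marginal_var)
print("mean-field variances:   ", mean_field_var)
# The approximation keeps the means, throws away the correlation entirely, and
# shrinks each marginal: it sits in the middle of the oval.
```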

(And it’s especially bad in cases like neural nets, where your variables have permutation symmetries.  What people call “variational Bayesian neural nets” is basically ordinary neural net fitting to find some local critical point, and placing a little blob of variation around that one critical point.  It’s nothing like a real ensemble, it’s just one member of an ensemble but smeared out a little.)