OK yeah, that thing I was talking to @raginrayguns about is way simpler than I thought

The Kelly criterion maximizes the rate of exponential growth, which is just

log(final / initial)

up to a constant.

Like if you have w(t) = exp(rate * t), and you end at t = T, then

rate = (1/T) log(w(T) / w(0))

and T is a constant.

So the Kelly criterion really is nothing but maximizing log wealth, only phrased equivalently as “maximizing exponential growth rate.”

And this phrasing is confusing, because “maximizing exponential growth rate” sounds sort of generically good. Like why wouldn’t you want that?

But the equivalence goes both ways: it’s the same thing as maximizing log wealth, and it’s easy to see you may not want that.
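
(For concreteness, here’s a tiny numerical sketch – my own toy numbers, nothing from the conversation. With even-odds bets and win probability p, the fraction of wealth that maximizes expected log growth per bet is the Kelly fraction 2p - 1:)

```python
import numpy as np

# Toy check: for even-odds bets with win probability p, the fraction f of
# wealth staked per bet that maximizes E[log growth] is the Kelly fraction 2p - 1.
p = 0.6
f = np.linspace(0, 0.99, 10_000)
expected_log_growth = p * np.log(1 + f) + (1 - p) * np.log(1 - f)

print(f[np.argmax(expected_log_growth)])  # ~0.20
print(2 * p - 1)                          # 0.20
```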

—-

I made a mistake in my original post about geometric averages – I linked to a twitter thread about the Kelly criterion, and a blog post by the same person, as if they were making the same point.

The thread was how I found the post. But in fact, the thread is both wrong and not really about geometric averages being confusing. The post, however, is mostly good and doesn’t mention Kelly at all.

Why did the thread link back to the post, then? The author is conflating several things.

Here are some things you can compute:

  1. The expected growth in wealth from n sequential bets, E[ w_n / w_0 ]. This is what you want to maximize if you have linear utility.
  2. The expected arithmetic average over the growth in wealth from the individual bets.

    This is E[ (w_1 / w_0) + (w_2 / w_1) + … + (w_n / w_{n-1}) ] / n.

    This is meaningless, there’s no reason to do this. However, this gets reported in financial news all the time – I’ve seen it in the WSJ, for example.
  3. The expected geometric average over the growth in wealth from the individual bets.

    This is E[ ((w_1 / w_0) * (w_2 / w_1) * … )^(1/n) ], or after cancelling, E[ (w_n / w_0)^(1/n) ]. So this is (1.), but with a power of 1/n inside the E[].
  4. Like (3.), but with a logarithm inside the E[]: E[ log((w_n / w_0)^(1/n)) ]. This is the exponential growth rate.

Everything except (1.) has dubious importance at best, IMO.

(1.) is for linear utility, but if you have nonlinear utility U, you would just maximize a variant of (1.), E[ U(w_n / w_0) ], instead.
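
(For concreteness, a quick Monte Carlo sketch with toy parameters of my own – nothing from the post or thread – showing that (1.) through (4.) really are four different numbers:)

```python
import numpy as np

rng = np.random.default_rng(0)
p, f, n, trials = 0.6, 0.5, 20, 200_000   # win prob, fraction staked, bets, MC samples

# Per-bet growth factors R_i = w_i / w_{i-1}: 1+f on a win, 1-f on a loss.
wins = rng.random((trials, n)) < p
R = np.where(wins, 1 + f, 1 - f)
total_growth = R.prod(axis=1)             # w_n / w_0 for each simulated run

q1 = total_growth.mean()                  # (1.) E[w_n / w_0]
q2 = R.mean(axis=1).mean()                # (2.) E[arithmetic mean of the R_i]
q3 = (total_growth ** (1 / n)).mean()     # (3.) E[(w_n / w_0)^(1/n)]
q4 = np.log(total_growth).mean() / n      # (4.) E[log(w_n / w_0)] / n, the growth rate

print(q1, q2, q3, q4)   # four different numbers (roughly 6.7, 1.1, 0.97, -0.03 here)
```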

In the blog post, Hollerbach is essentially talking about the confusing relationship between (1.) and terms like (w_1 / w_0). You have to multiply these terms to get (1.), and multiplication is confusing.

However, in the post he conflates this product (1.) with the geometric average (3.). They’re not equivalent because the power doesn’t commute with expectation. But I guess they both involve multiplication, and multiplication is confusing.

In the twitter thread, he sort of conflates the geometric average (3.) with the exponential growth rate (4.). Then he pits these against the arithmetic average (2.), which is bad, but is not what SBF was advocating.

Then, since the blog post has already conflated the geometric average with the expected wealth growth, he ends up conflating together everything except the bad one, (2.). In fact, all four are different. And only (1.), or a nonlinear-utility variant of it, is what matters.

raginrayguns:

After n bets from initial wealth 1, your wealth is about

exp(E[log R] n)

where R is new/old wealth in one bet. That’s the appeal of the kelly criterion

But (assuming for now betting at even odds), if you bet it all at each step, expected wealth is

p^n 2^n - (1 - p^n)

exp(log(2p) n) - (≈1)

The weird thing is

log(2p) > max E[log R]

so in terms of expected value, you’re doing better than the original approximation allows

It seemed paradoxical to me at first. But it makes sense after unpacking “about”, considering what kind of convergence, which is

total wealth / exp(E[log R] n) → 1

EDIT: ↑ probably wrong

Betting everything every time is 0/0 on the left. so maybe there’s no real contradiction?

@nostalgebraist why i dont agree with that matt hollerbach thread btw. Not the only person on twitter who was saying SBF was making some elementary mistake… kelly in a certain sense maximizes the growth rate of your money, but it does NOT maximize the growth rate of the expected value of your money

I think you’re right, yeah …

  • Kelly maximizes the expected growth rate.
  • Betting everything maximizes the expectation of your wealth at any given period n.

And, as you say in the OP,

  • E[wealth] grows exponentially in both cases
  • It grows faster if you bet everything than if you bet Kelly

Which makes it sound better to bet everything, if you care about E[wealth].
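
(A quick simulation of that contrast, with made-up parameters of my own – even odds, p = 0.6, 10 bets. Betting everything wins on expectation; Kelly wins on the typical, i.e. median, outcome:)

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, trials = 0.6, 10, 200_000
kelly = 2 * p - 1                      # Kelly fraction for an even-odds bet

wins = rng.random((trials, n)) < p

# Kelly bettor: wealth multiplies by 1 + kelly on a win, 1 - kelly on a loss.
w_kelly = np.where(wins, 1 + kelly, 1 - kelly).prod(axis=1)

# Bet-everything bettor: doubles on every win, hits zero at the first loss.
w_all_in = np.where(wins.all(axis=1), 2.0 ** n, 0.0)

print("E[wealth]  kelly: %.2f   all-in: %.2f" % (w_kelly.mean(), w_all_in.mean()))
print("median     kelly: %.2f   all-in: %.2f" % (np.median(w_kelly), np.median(w_all_in)))
```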

EDIT 2: everything after this line is totally wrong lol

However, consider the event “exponential growth happens up to n,” i.e. “wealth at n ~ exp(n).” At each n, this is either true or false. In the large n limit:

  • If you bet Kelly, I think this has probability 1? Haven’t checked but I can’t see how that would fail to be true
  • If you bet everything, this has probability 0. Your wealth goes to 0 at some n and stays there.

OK, why would we care? Well, I think these two results apply in two different scenarios we might be in.

  1. You fix some n in advance, and commit to making n bets and then “cashing out.”
    You want to maximize this cash received at n. Here, you want to bet everything.
  2. You want to keep betting indefinitely, while regularly “cashing out” a <100% fraction of the money used for betting, over and over again.
    You want to maximize the expected total you will cash out. (With some time discounting thing so it’s not infinity.)

In case 2, I think maybe you want to bet Kelly? At least, I’m pretty sure you don’t want to bet everything:

  • If you bet everything, you cash out some finite number of times M, making some finite amount of cash ~M. Then your betting wealth goes to zero.
  • If you bet Kelly, then with probability 1 (?), you can cash out arbitrarily many times.
    If you have zero time preference, then you make infinite cash, which is obv. better than the previous case.
    If you do time discounting, I guess it depends on the details of the time discounting? You get a finite amount, and it might be less than the above if you discount aggressively, but then it might not be.

The punchline is, I think “case 2” is more representative of doing actual investing. (Including anything that SBF could reasonably believe himself to be doing, but also like, in general.)

You don’t have some contract with yourself to be an investor for some exact amount of time, and then cash out and stop. (I mean, this is an imaginable thing someone could do, but generally people don’t.)

You have money invested (i.e. continually being betted) indefinitely, for the long term. You want to take it out, sometimes, in the future, but you don’t know when or how many times. And even if you die, you can bequeath your investments to others, etc.

And maybe you do exponential time discounting, behaviorally, for yourself. But once your descendants, or future generations, come into the picture, well – I mean there are economists who do apply exponential time discounting across generations, it’s kind of hard to avoid it. But it’s very unnatural to think this way, and especially if you’re a “longtermist” (!), I doubt it feels morally correct to say your nth-generation descendants matter an amount that decays exponentially in n.

What would make you prefer the finite lump sum from betting everything here?

Well, if you think the world has some probability of entirely ending in every time interval, and these are independent events, then you get exponential discounting. (This is sort of the usual rationale/interpretation for discounting across generations, in fact.)

So if you think p(doom) in each interval is pretty high, in the near term, maybe you’d prefer to bet everything over Kelly.

Which amusingly gets back to the debate about whether it makes sense to call near-term X-risk concerns “longtermist”! Like, there is a coherent view where you believe near-term X-risk is really likely, and this makes you have unusually low time preference, and prefer short term cash in hand to long-term growth. And for all I know, this is what SBF believes! It’s a coherent thing you can believe, it’s just that “longtermism” is exactly the wrong name for it.

ETA: after more thought I don’t think the above is fully correct.

I don’t think the “event” described above is well-defined. At a single n, your wealth (if it’s nonzero) is always “~ exp(n),” for some arbitrary growth rate. Unless it’s zero.

Betting everything is a pathological edge case, b/c your wealth can go to 0 and get stuck there. If you are any amount more conservative than that, you still “get exponential growth” in some sense, it’s just that you’ll regularly have periods of very low wealth (with this low value, itself, growing exponentially in expectation).

If you are cashing out at every n individually, for all n, then I guess you want to maximize the time-discounted sum over n of wealth at each n … need to work that out explicitly I guess.

The idea that “geometric averages are counter-intuitive” came up in two different things I read this week:

  1. Matt Hollerbach’s review of his argument with SBF about the Kelly criterion for betting, see also his blog post. (EDIT: on reflection, I think the Hollerbach twitter thread is not correct, and also not very relevant; it’s just how I found the blog post.)
  2. Froolow’s argument that AI catastrophe scenarios have low probability, once you take parameter uncertainty into account

(I’m trying not to read stuff like the latter, so this was a slip-up. Sorry.)

Both cases involve multiplying numbers together, where some of the numbers are larger than others.

This sounds simple. But it’s more common in real life to deal with numbers that add up, rather than numbers that multiply together, and this means we come to the problem with misleading intuitions. The Hollerbach post has some nice examples.

I’m sure I’ve heard people talk about this before – like, sometime in the distant past, as an abstract curiosity – but I don’t think the lesson had fully “stuck.” I’m starting to wonder whether this is an important bias affecting a lot of otherwise-numerate people.

on “ai forecasting: one year in”

AI is improving so fast, even expert forecasters are surprised!

… wait, is that true?

Who are these experts? And what exactly was it that surprised them?

If you have been following along with the LessWrong-adjacent conversation about AI, you have probably heard some form of the bolded claim at the top. You might have heard it via

Ajeya Cotra:

As a result, I didn’t closely track specific capabilities advances over the last two years; I’d have probably deferred to superforecasters and the like about the timescales for particular near-term achievements. But progress on some not-cherry-picked benchmarks was notably faster than what forecasters predicted, so that should be some update toward shorter timelines for me.

or Dan Hendrycks et al:

Capability advancements have surprised many in the broader ML community: as they have made discussion of AGI more possible, they can also contribute to making discussion of existential safety more possible.

or Scott Alexander:

Jacob Steinhardt describes the results of his AI forecasting contest last year. Short version: AI is progressing faster than forecasters expected, safety is going slower. Uh oh.

All of these people cite the same blog post as their source.

In the last example, Scott is … well, just linking to a blog post, and it’s clear that his “short version” is a summary of the blog post, not necessarily of what’s-actually-true.

But in the other two examples, the claim is being treated as a “stylized fact,” a generalization about reality on the basis of an empirical result. It’s not about Jacob Steinhardt and his contest, but about “forecasters” and “capabilities,” as entire categories.

This is a pretty striking conclusion to draw. “Big if true,” as they say. So a lot is resting on the shoulders of that one blog post and contest. Do they justify the stylized fact?

—-

In August 2021, Jacob Steinhardt organized a forecasting contest on the platform Hypermind.

In July 2022, he summarized the results up to that point, in the blog post everyone’s citing.

Here’s how Steinhardt begins his summary:

Last August, my research group created a forecasting contest to predict AI progress on four benchmarks. Forecasters were asked to predict state-of-the-art performance (SOTA) on each benchmark for June 30th 2022, 2023, 2024, and 2025. It’s now past June 30th, so we can evaluate the performance of the forecasters so far.

That is:

  • Forecasters were asked to predict 4 numbers, each one at various times in the future.
  • The earliest of those times has come and gone, so we have something to compare their predictions to.
  • The predictions we can evaluate were about what would happen a little under a year in the future.
  • Each of these predictions is about the best published result on some ML benchmark, as of the date in question.

How did the forecasters do on those one-year-ahead questions?

  • On two of the four questions (MATH and MMLU), the actual value was at the extreme high end of the forecasters’ probability distribution.
  • On one of the questions (Something Something v2), the actual value was on the high end of the distribution, but not to the same extreme extent.
  • On the other question (adversarial CIFAR-10), the actual value was on the low end of the distribution.

(By “the forecasters’ probability distribution” here I mean Hypermind’s aggregated crowd forecast, though you should also read Eli Lifland’s personal notes on his own predictions. Lifland was surprised in the same direction as the crowd on MATH and MMLU.)

—-

The first thing I want to point out here is that this is not a large sample! There are lots of important benchmarks in ML, not just these 4.

This is most relevant to the stronger version of the claim – that “capabilities” are moving fast, but “safety” is moving slow. Here, a single benchmark is being used as a proxy for the entirety of “AI safety progress.” Did safety really move slowly, or did people just not care that much about adversarial CIFAR-10 over the last year?

(Do people care about adversarial CIFAR-10? I mean, obviously people care about adversarial robustness, there are thousands of papers on it, but is it really a good proxy for AI safety as a whole? When you ask yourself what the most promising AI safety researchers are doing these days, is the answer really “trying to get better numbers on adversarial CIFAR-10”?)

This contest is, at best, really weak/noisy evidence for the stronger version of the claim. Definitely not “stylized fact”-caliber evidence IMO.

With that aside, let’s think about what these forecasts actually mean.

These are forecasts about state-of-the-art (SOTA) results.

A SOTA result is not something that gradually creeps upward in a smooth, regular way. It’s a maximum over every result that’s been published. So it stays constant most of the time, and then occasionally jumps upward instantaneously when a new high score is published.

How often are new SOTA results published? This varies, of course, but the typical frequency is on the order of once a year, give or take.

For example, here’s 5 and a half years of Something Something v2 SOTAs:

[image: SOTA on Something Something v2 over time]

The value only changed 7 times in those 5+ years. And the spacing was uneven: judged by SOTA results, absolutely no “progress” occurred over the entire two-year interval from late 2018 to late 2020.

This means that a well-calibrated forecast with a single-year time horizon should always under-predict progress, if any progress actually occurs!

When you answer one of these questions on a one-year horizon, you’re not actually saying “here is how much progress is likely to happen.” You’re effectively making a binary guess about whether anyone will publish any progress – which doesn’t happen every year – and then combining that with an estimate about the new high score, conditional on its existence.

If there is progress (and even moreso if there’s significant progress), it will look like relatively fast progress according to your distribution, because your distribution had to spend some of its mass on the possibility of no progress.

Indeed, any serious forecast distribution on these questions ought to have some amount of point mass at the current value, since it’s possible that no one will report a new SOTA. The distribution would have at least two modes, one at the current value and one above it.

But Hypermind’s interface constrains you to unimodal distributions. So the forecasters – if they were doing the task right – had to approximate the bimodal truth by tweaking a unimodal distribution so it puts significant mass near the current value. And since these distributions are nice and smooth, that inevitably drags everything down and makes any actual progress look “surprisingly high.”
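
(To make that concrete, a toy example with invented numbers – none of this is Hypermind’s or Steinhardt’s actual data. Suppose your true belief is “probably no new SOTA, otherwise a jump of uncertain size,” but the interface forces a single smooth bump with its mode at “no progress” and a capped standard deviation:)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Invented "true" belief about next year's SOTA on some benchmark:
# 60% chance nobody publishes an improvement (score stays at 50.0),
# 40% chance of a jump, spread roughly uniformly between 55 and 70.
current = 50.0
truth = np.where(rng.random(1_000_000) < 0.6,
                 current,
                 rng.uniform(55, 70, 1_000_000))

# The kind of unimodal stand-in the interface forces on you: mode at the single
# most likely outcome (no progress), with a capped standard deviation.
forced = stats.norm(loc=current, scale=5.0)   # scale=5.0 mimics a max-SD limit

# Suppose a jump to 62 actually happens.  Under the true bimodal belief it was
# a fairly ordinary outcome; under the forced unimodal forecast it looks like
# an extreme surprise.
print("true   P(SOTA >= 62): %.3f" % (truth >= 62).mean())    # ~0.21
print("forced P(SOTA >= 62): %.3f" % (1 - forced.cdf(62)))    # ~0.008
```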

(Sidenote: if I understand Hypermind’s scoring function correctly, it actually encourages you to report a distribution more concentrated around the mode of the true distribution than the true distribution itself. So if I were in this contest and just trying to maximize winnings, I’d probably just predict “no progress” with the highest confidence they’ll allow me to use. I don’t know if anyone did that, though.)

Still, though … even if you can only use unimodal distributions, hopefully at least you can flatten them out so they capture both the “no progress” side of the coin and the “how much progress, if any?” side – right?

Well, no. Apparently Hypermind has a maximum on the std. dev. of your distribution, and in Eli Lifland’s case this was too low to let him express his true distribution! He writes:

I didn’t run up to the maximum standard deviation [on MATH], but I probably would have given more weight to larger values if I had been able to forecast a mixture of components like on Metaculus. […]

I think [the std. dev. limit] maybe (40% for my forecast) would have flipped the MMLU forecast to be inside the 90th credible interval, at least for mine and perhaps for the crowd.

In my notes on the MMLU forecast I wrote “Why is the max SD so low???”

And indeed, his notes reveal that this is a pretty severe issue, affecting many of the questions:

The max SD it will let me input is 10… want a bit higher here and obviously would want even higher for later dates. The interface is fairly frustrating compared to Metaculus in general tbh.

Why is the max SD 7.5… I want it larger. Still think most likely is in teens but want a really long tail. Have to split the difference

Ok the max SD still being 2.5 is incredibly frustrating… still think 49.8 should be the modal outcome but want my mean to be able be higher. I guess for 2023 I’ll settle on the 49.8 modal prediction and for 2024 start going higher.

Wow, the max SD is insanely low yet again… my actual mean is higher, probably in the 68-70 range

Really wish the SD could be higher (and ditto for below).

Remember that we’re already doing stats on a tiny sample – at most 4 data points, and if we insist on forming “capabilities” and “safety” subgroups then we only have 3 data points and 1 respectively.

And remember that, because progress is “spiky” and only spikes ~1 time a year, we know one of two things is going to happen:

  1. either there will be ~0 progress and it will look qualitatively like people “overpredicted” progress, or
  2. there will be > 0 progress, and it will look qualitatively like people “underpredicted” progress

And now consider that – although the above is true even in the best case, where everyone reports their real distribution – we are not in the best case. The forecasters in this contest are literally saying stuff like “my actual mean is higher [but the interface won’t let me say that]”!

—-

Above, I showed a screenshot of SOTA progress over time on one of the benchmarks.

That example actually understated how severe the ~1-result-per-year problem is. I picked the benchmark with the longest history, to make a point about how often SOTAs arrive, on average over a multi-year interval. But that was also the benchmark where progress was the most incremental, the least “spiky” – and it’s one the forecasters did relatively well on.

MATH and MMLU – the two where the forecasters really lowballed it – look different. Here’s MMLU:

[image: SOTA on MMLU over time]

Except this graph is sort of a lie, because MMLU was only introduced in 2021. The earlier data points come from going back and evaluating earlier models on the benchmark, in some cases fine-tuning them on some of its data.

But fine, let’s imagine counterfactually that MMLU existed this whole time. In 3 years, there have essentially been 3 big jumps:

  • UnifiedQA and GPT-3 in May 2020
  • Gopher in Jan 2022
  • Chinchilla in Apr 2022

What would this look like, if you ran this contest at various points in the (counterfactual) past?

It all depends on precisely when you do it!

It took about a year from GPT-2 to the next milestone, so if you start forecasting at GPT-2, there’s either “zero progress” or substantial progress – depending on whether the May 2020 milestone “sneaks in at the last second” or not.

And it took ~2 years from that to the following milestone. If you started in the first half of that interval, there would be “zero progress,” and forecasters would qualitatively overpredict. If you started in the second half, there would be “substantial progress” and the forecasters would qualitatively underpredict (since, again, they have to give some mass to the zero-progress case).

You’d get diametrically opposed qualitative results depending on when you run the contest, not because progress was slow in one interval and fast in another, but wholly because of these “edge effects.”
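
(A toy illustration of those edge effects, with a made-up jump process rather than real benchmark data: if new SOTAs arrive about once a year on average, a randomly placed one-year contest window has roughly a 1-in-3 chance of catching nothing at all.)

```python
import numpy as np

rng = np.random.default_rng(0)

# New SOTA results arrive as a Poisson process, about one per year on average.
n_jumps = 20_000
jump_times = np.cumsum(rng.exponential(1.0, size=n_jumps))   # in years

# Place a one-year "contest window" at many random start dates and count how
# many jumps land inside each window.
starts = rng.uniform(0, jump_times[-1] - 1.0, size=100_000)
jumps_in_window = (np.searchsorted(jump_times, starts + 1.0)
                   - np.searchsorted(jump_times, starts))

# About exp(-1) ~ 37% of windows see zero progress; the rest see one or more
# jumps, which (given the forced hedge toward "no progress") will read as
# "faster than the forecasters expected."
print("P(no progress in the window):", (jumps_in_window == 0).mean())
```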

OK, that’s MMLU. What about MATH? I’ve saved the big punchline for last:

[image: SOTA on MATH over time]

Yep, there are only two data points. We’re trying to (effectively) estimate the slope of a line from literally two points.

More importantly, though: when exactly did that jump happen? Steinhardt writes:

Interestingly, the 50.3% result on MATH was released on the exact day that the forecasts resolved. I’m told this was purely coincidental, but it’s certainly interesting that a 1-day difference in resolution date had such a big impact on the result.

These “edge effects” are not hypothetical: the MATH result “snuck in at the last second” in literally the most extreme possible way.

If the Minerva paper had come out one day later, the contest would have resolved with “no progress on MATH,” and we would have observed qualitative overprediction.

What would have happened in that world? I imagine Steinhardt would (quite reasonably) have said something like:

“Yes, technically the contest resolved with no progress, and I’ll use that for deciding payouts and stuff. But for drawing conclusions about the world, it’s ‘more true’ to count Minerva as being inside the window. It was only one day off, and it was a huge gain, after all.”

But then this would have forced people to confront the topic of edge effects, and there would have been a whole discussion on it, and I wouldn’t have to belabor the point in a post of my own.

Did you notice the asymmetry? In the world where Minerva came one day too late, I don’t think anyone would feel comfortable just writing it off and saying “yep, no progress on MATH this year, end of story.” People would have decided that Minerva at least “partly counted” toward their estimate of progress.

But in our world, no one is doing the reverse. No one is saying that Minerva “only partly counts.” Steinhardt notes the edge effect, but doesn’t say anything about it casting doubt on the implications. He gives it as much weight as the other results, and it ends up being a major driver of his qualitative conclusion.

—-

Yes, yes, you are saying. You’re right, of course, about all these fiddly statistical points.

But (you continue) isn’t the qualitative conclusion just … like, true?

We did in fact get surprising breakthrough results on MATH and MMLU in the last year. What exactly are you saying – that these results didn’t happen? That “forecasters” somehow did see them in advance? Which forecasters?

You are right, reader. If the claim is about these specific benchmarks, MMLU and MATH, then it is true that they over-performed expectations over the last year.

Where things go wrong is the leap from that to the stylized fact, about “capabilities moving faster than expected” as a purported real phenomenon.

You’ve seen the graphs. “These benchmarks over-performed expectations this year” is like saying “the stock market did unusually well this past week.” Some years, the SOTAs overperform; some years they underperform (because they don’t move at all); which kind of year you’re in depends sensitively on where you set the edges. At this time scale, trying to extract a trend is futile.

What’s more, you also have to be careful about double-counting. If you’re following this area enough to have seen this claim, you probably also heard independently about the Minerva result, and about how it came out of nowhere and surprised everyone.

From this knowledge alone, you could have inferred that expert forecasters wouldn’t have guessed it in advance, either. (If not, then where were the voices of those expert forecasters back when the result was announced? Who said “yeah, called it”?)

You would have known this information already, even if this contest had never happened. Now, reading Steinhardt’s post, what exactly is it that you learn? What new information is here that you hadn’t already priced into your thinking?

I think what people are “learning” from this post is something about the systematic nature of the phenomenon.

You can see Minerva and think “huh, that could be a fluke, I don’t know enough to know for sure.” But when you hear Steinhardt say “progress on ML benchmarks happened significantly faster than forecasters expected,” with supporting numbers and charts, it feels like you’re learning that things like Minerva are not flukes: that when you look at the aggregate, this is the trend you see.

But in fact, this result is just Minerva and Chinchilla – which you’d already seen – repackaged in a new form, so it’s hard to tell you’re double-counting.

Viewed as statistical evidence, this stuff is too noisy to be any good, as I detailed above. Viewed as a collection of a few anecdotal stories, well, these stories are noteworthy ones – but you’ve already heard them.

I feel like I’ve seen several cases of this recently, this process of “laundering” existing observations into seemingly novel results pointing in the same direction. I get the same vibe from that one Metaculus AI forecast everyone keeps talking about.

The “forecasters” are not magic – in many cases they are literally the same people who later go on to interpret the forecasts, or at least socially adjacent to them! They are using publicly available information, and making judgments that are reasonable but routine, based on familiar arguments you’ve heard before. If you already follow this area, then what really separates you from the “forecasters”? A bit of calibration training, maybe, at most?

And so we have forecasters boggling at each other’s forecasts, and updating on each other’s updates, and doing statistics to each other’s stated opinions and re-updating on the “statistical versions” of the same arguments they’ve already incorporated into their inside views, and so on, creating this unstable feedback system that can easily spiral one way or another even in the absence of any new information coming in.

Criminal Georg skews recidivism statistics

stumpyjoepete:

michaelkeenan:

Have you ever seen those concerning statistics about criminal recidivism? Like: 44% are re-arrested within a year, and 83% within nine years (source: this Department of Justice report).

I’d seen those statistics before, and been concerned. There’s a great case for shortening prison sentences for deterrence reasons, because likelihood of punishment is much more deterring than severity, but at least prison incapacitates criminals from plundering society while they’re imprisoned. Why hasten prison release if they’ll be back soon anyway? “Once a criminal, always a criminal?”, asks one headline about recidivism.

But today I learned that there’s a huge caveat to those statistics. The more often you go to prison, the more you’re counted in recidivism statistics.

Consider five people who go to prison. Four of them never commit another crime, but one of them was Criminal Georg, who is imprisoned ten times. Out of the fourteen prison sentences (ten for Georg, four for the others), nine of them are followed by recidivism (Georg’s first nine). The proportion of these people who are serial criminals is 20%, but the recidivism rate is 64%.

When considering people rather than prison releases, the recidivism rate is lower than I thought.

see this thread for more examples

Thank you, I hadn’t seen it and it’s a great resource!

I knew I’d seen this pattern before, but I didn’t have a name for it. The linked post by Elizabeth Wrigley-Field tells me it’s called “length-biased sampling.”

The post mentions several examples with real-world importance, incl. the recidivism one, and argues the concept should be more widely known.

(It also makes an argument that “length-biased sampling is the deep structure of nested categories” which sounds interesting but which I am not awake enough rn to wrap my head around)

Ah! I know what that recidivism post reminded me of.

When you’re prompting GPT-2, putting an end-of-text separator at the start of your prompt will (all else being equal) bias the model toward shorter documents.

But, just as in the recidivism case, this doesn’t sound prima facie obvious the first time you hear the claim. You have to think about it first, and only then does it seem obvious.

ETA: apparently this is called “length-biased sampling”
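
(A toy numerical version of the common pattern – invented document lengths, nothing GPT-2-specific. Sampling a document uniformly, which is what conditioning on a document boundary does, gives you a very different “typical document” than sampling a random position in the concatenated stream, which weights each document by its length – just as each prison release weights Georg ten times:)

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented corpus: lots of short documents plus a minority of very long ones.
doc_lengths = np.concatenate([rng.integers(50, 200, 9_000),         # short docs
                              rng.integers(5_000, 20_000, 1_000)])  # long docs

# Sampling a document uniformly, by count (what a document boundary gives you):
uniform_mean = doc_lengths.mean()

# Sampling a random token position in the concatenated stream, which weights
# each document by its length (length-biased sampling):
length_weighted_mean = np.average(doc_lengths, weights=doc_lengths)

print(uniform_mean, length_weighted_mean)   # the length-weighted mean is far larger
```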

slatestarscratchpad:

I think I’ve been looking for something like https://www.researchgate.net/profile/Nick_Haslam/publication/341912127_Dimensions_over_categories_a_meta-analysis_of_taxometric_research/links/5edee8c9a6fdcc476890a131/Dimensions-over-categories-a-meta-analysis-of-taxometric-research.pdf my entire life. Now that I’ve found it, I’m confused and angry.

This is a meta-analysis of “taxometrics”, the study of figuring out which things are distinct bimodal groups and which things are just a dimensional variation along a spectrum or a normal distribution or something. It looks at a lot of personality variables, but focuses on psychiatric disease. It finds that most psychiatric conditions are probably just dimensional spectrum variation, which matches my impression.

But it does find a few things it says hint at maybe being real honest-to-goodness objective categories. It can’t prove any of them, and all of them are sort of ambiguous, but it thinks this might be true of autism, pedophilia, intermittent explosive disorder, alcohol/nicotine/gambling addiction, and biological sex.

I will give them pedophilia - pedophiles really do seem to be a separate group who work very differently from everyone else. Everything else on there is utterly bizarre.

Take intermittent explosive disorder. I thought everyone agreed it was the fakest of fake psychiatric conditions - just a fancy word for people who are often very angry. Yet this study suggests it’s one of the only ones that gets its own taxon - a completely real, utterly separate from the rest of the population stamp of approval.

And what about autism? Just when everybody finally accepted that autism existed on a spectrum, this study claims it’s one of the only psychiatric disorders that *doesn’t*! You’re either autistic or non-autistic, end of story, no shades of gray, and autism is supposedly one of the only things that works like that!

Alcoholism, nicotine addiction, and gambling addiction, same story! I think maybe the explanation here is that this isn’t measuring *tendency toward* alcohol addiction, it’s measuring whether you’re actually addicted to alcohol right now. And there are lots of teetotalers and other people who are definitely not addicted to alcohol, so maybe it’s easier to make categories out of this? Smoking is probably an even easier one - you’re either a nonsmoker or a smoker, that’s a real difference. I guess gambling and stuff work the same way.

The biological sex finding is bizarre for the opposite reason. I don’t mean to wade into any kind of weird political weeds when I say that should just be clearly bimodal, end of story, no ambiguity. I agree intersex people exist and so on, but the question isn’t whether there’s some overlap or ambiguity, the question is whether there’s anything *other* than overlap or ambiguity - that is, whether there’s any tendency at all for things to be other than uniform. I think even the most fervent queer theorist should admit this is obviously true in the case of biological sex. And yet this study cannot do more than say it detects signs this might be true, same as gambling addiction or something.

(there are only two genders: addicted to gambling, and not addicted to gambling.)

Equally annoying is what’s *not* on here. Most of the stuff I’ve read speculating about this sort of thing has always said that if there’s one really real binary-division psychiatric disorder out there, it’s schizophrenia. This meta-analysis utterly fails to find evidence for that.

At some point I am going to look at the individual studies and see whether they’re completely flawed - garbage in, garbage out. Until then, I am just going to sit around being confused and angry.

Some comments on this.

I had never heard of this body of research before, and apparently there’s a lot of it.

——

The papers being meta-analyzed here all used one particular statistical approach.  This is what one expects in a meta-analysis, but the statistical technique here is pretty unusual, specialized to this problem, and apparently the brainchild of this one guy named Paul Meehl who was very opinionated about it and advocated for it against the alternatives.

That isn’t necessarily bad, in itself, but it means I take the whole thing with a bigger grain of salt than usual.  Meehl and his followers seem like statistically sophisticated people, and Meehl’s basic idea makes sense, but nonetheless it’s an obscure idea and it doesn’t look like that many people have independently evaluated it.

For example, there’s a single book-length treatment on it (co-authored by Meehl), and I can find exactly one academic review of that book, and it’s written in this odd catty (?) tone that alternates between ambiguous praise, noting that much of the approach was invented earlier by the reviewer, and talking about how the reviewer has taken the same idea in what (naturally) he believes is a superior direction since inventing it.

Relatedly, it seems important to distinguish between the question addressed by this technique (a general question of general interest) and Meehl’s preferred technique for answering it.  Unfortunately, “taxometrics” refers to the latter, when it sounds like it ought to refer to the former.

——

Looking up the papers they cite led me down a bit of a rabbit hole.  There are many papers explaining and defending Meehl’s technique, many of them by Meehl himself.  This one is a good example of Meehl’s own rather grandiose style.

Much of this is very dense (I am resisting the urge to quote some particularly opaque Meehl passages). 

Although the idea is simple, there are at least 3 variants of it used in practice, and there are different ways of doing each of those.

Originally, the 3 would produce graphs, and you’d look at the graphs and judge how peaked or flat they look.  To make that less subjective, people started computing the root-mean-squared error between the graphs and each of two comparison graphs, one thought to be “what the graph would look like if these data were ‘taxonic’,” the other “what the graph would look like if the same data were ‘dimensional.’”  (Root-mean-squared error seems like a strange choice when you mostly care about how peaked the curve is?)

And, to generate the comparison graphs, you use bootstrap samples.  And using bootstrap sampling in this case requires inventing a custom iterative algorithm involving 14 complicated steps.

Needing an approximate, iterative algorithm isn’t unusual in itself, but this adds to the sense that this technique comes with a lot of baggage: to be sure these people are doing things right, I have to understand the algorithm, its justification, and the original idea and its justification.  If any of this is wrong, the whole ship sinks.  Indeed, this community looks small enough, I expect they are all using the same bits of R code to execute the algorithm – so the ship might sink if there’s a bug in that code, even if the algorithm is solid.

——

Meehl’s basic idea goes like this.

Suppose some trait really is categorical, with a “high group” and a “low group.”  That doesn’t mean our measurements of it (test scores or something) will be bimodally distributed.  Psychometric measures have a ton of noise, and the noise will tend to smear together the two peaks, so the measure itself might look unimodal.

However, suppose we have a whole bunch of different measures of the trait, like different subtest scores.  Each one gives you some independent info about the true value of the trait.

Let’s arbitrarily choose one of these scores, call it “X,” and select people who have different values for it.  We’ll call all the other scores collectively “Y.”

If we look at really, really, low values of X, we’re probably looking at people in the “low group.”  Yes, there is noise, but there’s only so much noise.   Likewise for the high end: go high enough on this one measure X, and you’re probably looking at members of the “high group.”

Whereas, if the value of X is somewhere in the middle, you might be looking at a member of either group.

This means that if X is somewhere in the middle, we will learn a lot by observing one of the other scores bundled under “Y.”  We aren’t certain which group the person is in, just from X.  So if we observe one score in Y and it’s really low, the others in Y are probably very low too.

Whereas, if X is at the extremes, we don’t learn as much from seeing the scores in Y.  We already know the person is (say) in the low group.  We can already predict that all the Y are probably low.  Observing one of the Y isn’t likely to change our opinion.

In Meehl’s approach, you use this intuition as follows.  You compute some estimate of how related the different Y variables are.  You look at how this varies, as a function of X.  If the story above holds, it should be highest near the middle (when the Y variables are maximally informative about one another), and lower at the ends.

Turning this into a formal methodology involves a bunch of essentially arbitrary choices, hence the different variants.  Removing the part where a human looks at a curve and judges whether it’s “peaked enough” involves additional choices. I don’t know whether the advocates of taxometrics made all these choices sensibly enough, and I doubt anyone knows with the level of confidence I’d like to have.
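
(A bare-bones simulation of the idea – entirely made up by me, and much cruder than the actual taxometric procedures. Generate “taxonic” and “dimensional” data with three noisy indicators each, then watch how the covariance of the Y indicators changes across windows of X:)

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50_000

def noisy_indicators(latent):
    """Three noisy indicators of the same latent quantity (columns: X, Y1, Y2)."""
    return latent[:, None] + rng.normal(0.0, 1.0, (len(latent), 3))

# Taxonic data: a genuine 50/50 mixture of a "low group" (0) and a "high group" (2).
taxonic = noisy_indicators(2.0 * rng.integers(0, 2, N))

# Dimensional data: one continuous latent trait, no groups at all.
dimensional = noisy_indicators(rng.normal(1.0, 1.0, N))

def cov_of_Y_along_X(data, n_windows=10):
    """Covariance of Y1 and Y2 within equal-count windows of X."""
    x, y1, y2 = data[:, 0], data[:, 1], data[:, 2]
    order = np.argsort(x)
    return np.round([np.cov(y1[idx], y2[idx])[0, 1]
                     for idx in np.array_split(order, n_windows)], 2)

print("taxonic:    ", cov_of_Y_along_X(taxonic))      # peaked in the middle windows
print("dimensional:", cov_of_Y_along_X(dimensional))  # roughly flat across windows
```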

——

After all that, you still have the data you have, which in psychiatric contexts will be sampled from the general population in a very non-uniform way.

The idea makes sense in an idealized world where your research sample is drawn randomly from the population of All Possible Humans.  But psychiatric samples are very unlike that.  You can try to remedy that by introducing some control people from the general population, but then you’re introducing a two-category structure into the data (controls vs. patients)!

Also, Meehl’s idea is supposed to solve the problem where measurement noise makes things look unimodal, even though they’re not.  Is this really what we expect for abnormal psychology?  It’s not like most people look “roughly half schizophrenic” on a test, and we have to do mathematical wizardry to discover they’re really closer to 0% or 100%!

But I could imagine lots of populations where most people look “roughly half schizophrenic”: psychiatric patient populations, where some but not all of the patients are schizophrenic, possibly with general population people mixed in.

If this is the only kind of sample you can construct, I guess you have to use Meehl’s worryingly tall Jenga tower of math tricks to extract a signal from it.  But if it’s possible to improve the sampling itself, that seems better.

covid-19 notes, 4/19/20

Brain-dumping some miscellaneous Covid-19 thoughts.  (Not going to respond to responses to this – this post uses up all my bandwidth for this topic for the moment)

“Mind viruses”

[if you’re skimming, this is the least interesting part of the post IMO]

In late March I wrote this big, long dramatic proclamation about information cascades and stuff.

Back then, it felt like the situation in the US was at an inflection point – at various kinds of inflection point – and I was feeling this particular combination of anxiety and passion about it.  A do-or-die emotion: something was happening quickly, we had a limited window in which to think and act, and I wanted to do whatever I could to help.  (”Whatever I could do” might be little or nothing, but no harm in trying, right?) 

I felt like the intellectual resources around me were being under-applied – the quality of the discussion simply felt worse than the quality of many discussions I’d seen in the past, on less important and time-sensitive topics.  I did my best to write a post urging “us” to do better, and I’m not sure I did very well.

In any event, those issues feel less pressing to me now.  I don’t think I was wrong to worry about epistemically suspect consensus-forming, but right now the false appearance of a consensus no longer feels like such a salient obstacle to good decision-making.  We’ve seen a lot of decisions made in the past month, and some of them have been bad, but the bad ones don’t reflect too much trust in a shaky “consensus,” they reflect some other failure mode.

Bergstrom

Carl Bergstrom’s twitter continues to be my best source for Covid-19 news and analysis.

Bergstrom follows the academic work on Covid-19 pretty closely, generally discussing it before the press gets to it, and with a much higher level of intellectual sophistication while still being accessible to non-specialists.

He’s statistically and epistemically careful to an extent I’ve found uncommon even among scientists: he’s comfortable saying “I’m confused” when he’s confused, happily acknowledges his own past errors while leaving the evidence up for posterity, eloquently critiques flawed methodologies without acting like these critiques prove that his own preferred conclusions are 100% correct, etc.

I wish he’d start writing this great stuff down somewhere that’s easier to follow than twitter, but when I asked him about starting a blog he expressed a preference to stay with twitter. 

I was actually thinking about doing a regular “Bergstrom digest” where I blog about what I’ve learned from his twitter, but I figured it’d be too much work to keep up.  I imagine I’ll contribute more if I write up the same stuff in a freeform way when I feel like it, as I’m doing now.

So, if you’re following Covid-19 news, be sure to read his twitter regularly, if you aren’t already.

IHME

The Covid-19 projections by the IHME, AKA “the Chris Murray model,” are a hot topic right now.

  • On the one hand, they have acquired a de facto “official” status.

    CNN called it “the model that is often used by the White House.”  In other news stories it’s regularly called “influential” or “prominent.”  I see it discussed at work as though it’s simply “the” expert projection, full stop.  StatNews wrote this about it:

    The IHME projections were used by the Trump administration in developing national guidelines to mitigate the outbreak. Now, they are reportedly influencing White House thinking on how and when to “re-open” the country, as President Trump announced a blueprint for on Thursday.

    I don’t know how much the IHME work is actually driving decision-making, but if anyone’s academic work is doing so, it’s the IHME’s.

I find this situation frustrating in a specific way I don’t know the right word for.  The IHME model isn’t interestingly bad.  It’s not intellectually contrarian, it’s just poorly executed.  The government isn’t trusting a weird but coherent idea, they’re just trusting shoddy work.

And this makes me pessimistic about improving the situation.  It’s easy to turn people against a particular model if you can articulate a specific way that the model is likely to misdirect our actions.  “It’s biased in favor of zigging, but everything else says we should zag.  Will we blindly follow this model off a cliff?”  That’s the kind of argument you can imagine making its way to the news.

But the real objection to the IHME’s model isn’t like this.  Because it’s shoddy work, it sometimes makes specific errors identifiable as such, and you can point to these.  But this understates the case: the real concern is that trusting shoddy work will produce bad consequences in general – a whole set of bad consequences, past and future, of which the ones that have already occurred are just a subset.

I feel like there’s a more general point here.  I care a lot about the IHME’s errors for the same reason I cared so much about Joscha Bach’s bad constant-area assumption.  The issue isn’t whether or not these things render their specific conclusions invalid – it’s what it says about the quality of their thinking and methodology.

When someone makes a 101-level mistake and doesn’t seem to realize it, it breaks my trust in their overall competence – the sort of trust required in most nontrivial intellectual work, where methodology usually isn’t spelled out in utterly exact detail, and one is either willing to assume “they handled all the unmentioned stuff sensibly,” or one isn’t.

IHME (details)

Quick notes on some of the IHME problems (IHME’s paper is here, n.b. the Supplemental Material is worth reading too):

They don’t use a dynamic model, they use curve-fitting to a Gaussian functional form.  They fit these curves to death counts.  (Technically, they fit a Gaussian CDF – which looks sigmoid-like – to cumulative deaths, and then recover a bell curve projection for daily deaths by taking the derivative of the fitted curve.)
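
(A stripped-down sketch of that setup, with synthetic numbers – this is just the functional form described above, not IHME’s actual code, which adds covariates and a more elaborate fitting procedure:)

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def cumulative_deaths_model(t, total, peak_day, width):
    """Gaussian-CDF functional form: cumulative deaths = total * Phi((t - peak_day) / width)."""
    return total * norm.cdf(t, loc=peak_day, scale=width)

# Synthetic stand-in for one region's observed cumulative death counts (not real data).
rng = np.random.default_rng(0)
days = np.arange(60)
observed = np.maximum.accumulate(
    cumulative_deaths_model(days, total=8_000, peak_day=35, width=9)
    + rng.normal(0, 150, days.size))

# Fit the three curve parameters to the observed cumulative counts...
(total, peak_day, width), _ = curve_fit(cumulative_deaths_model, days, observed,
                                        p0=(5_000, 30, 10))

# ...and the projection for *daily* deaths is the derivative of the fitted curve:
# a symmetric bell that declines exactly as fast as it rose.
future = np.arange(120)
daily_projection = total * norm.pdf(future, loc=peak_day, scale=width)
print(round(total), round(peak_day, 1), round(width, 1))
```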

Objection 1.  Curve fitting to a time series is a weird choice if you want to model something whose dynamics change over time as social distancing policies are imposed and lifted.  IHME has a state-by-state model input that captures differences in when states implemented restrictions (collapsed down to 1 number), but it isn’t time-dependent, just state-dependent.  So their model can learn that states with different policies will tend to have differently shaped or shifted curves overall – but it can’t modify the shape of the curves to reflect the impacts of restrictions when they happened.

Objection 2.  Curve fitting produces misleading confidence bands.

Many people quickly noticed something weird about the IHME’s confidence bands: the model got more confident the further out in the future you looked.

How can that be possible?  Well, uncertainty estimates from a curve fit aren’t about what will happen.  They’re about what the curve looks like.

With a bell-shaped curve, it’s “harder” to move the tails of the curve around than to move the peak around – that is, you have to change the curve parameters more to make it happen.  (Example: the distribution of human heights says very confidently that 100-foot people are extremely rare; you have to really shift or squash the curve to change that.)

To interpret these bands as uncertainty about the future, you’d need to model the world like this: reality will follow some Gaussian curve, plus noise.  Our task is to figure out which curve we’re on, given some prior distribution over the curves.  If the curves were a law of nature, and their parameters the unknown constants of this law, this would be exactly the right thing to do.  But no one has this model of reality.  The future is the accumulation of past effects; it does not simply trace out a pre-determined arc, except in science fiction or perhaps Thomism.

Objection 3. Curve symmetry causes perverse predictions.

Bergstrom brought this up recently.  The use of a symmetric bell curve means the model will always predict a decline that exactly mirrors the ascent.

This creates problems when a curve has been successfully flattened and is being held at a roughly flat position.  The functional form can’t accommodate that – it can’t make the peak wider without changing everything else – so it always notices what looks like a peak, and predicts an immediate decline.  If you stay flat 1 more day, the model extends its estimated decline by 1 day.  If you stay flat 7 more days, you get 7 more days on the other side.  If you’re approximately flat, the model will always tell you tomorrow will look like yesterday, and 3 months from now will look like 3 months ago.

(Put another way, the model under-predicts its own future estimates, again and again and again.)

This can have the weird effect of pushing future estimates down in light of unexpectedly high current data: the latter makes the model update to an overall-steeper curve, which means a steeper descent on the other side of it.

(EDIT 4/20: wanted to clarify this point.

There are two different mechanisms that can cause the curve to decline back to zero: either R0 goes below 1 [i.e. a “suppressed” epidemic], or the % of the population still susceptible trends toward 0 [i.e. an “uncontrolled” epidemic reaching herd immunity and burning itself out].

If you see more cases than expected, that should lower your estimate of future % susceptible, and raise your estimate of future R0.  That is, the epidemic is being less well controlled than you expected, so you should update towards more future spread and more future immunity.

In an uncontrolled epidemic, immunity is what makes the curve eventually decline, so in this case the model’s update would make sense.  But the model isn’t modeling an uncontrolled epidemic – if its projections actually happen, we’ll be way below herd immunity at the end.

So the decline seen in the model’s curves must be interpreted as a projection of successful “suppression,” with R0 below 1.  But if it’s the lowered R0 that causes the decline, then the update doesn’t make sense: more cases than expected means higher R0 than expected, which means a less sharp decline than expected, not more.)

This stuff has perverse implications for forecasts about when things end, which unfortunately IHME is playing up a lot – they’re reporting estimates of when each state will be able to lift restrictions (!) based on the curve dipping below some threshold.  (Example)

EDIT 4/20: forgot to link this last night, but there’s a great website http://www.covid-projections.com/ that lets you see successive forecasts from the IHME on one axis.  So you can evaluate for yourself how well the model updates over time.

Words

I remain frustrated with the amount of arguing over whether we should do X or Y, where X and Y are ambiguous words which different parties define in conflicting ways.

FlattenTheCurve is still causing the same confusions.  Sure, whatever, I’ve accepted that one.  But in reading over some of the stuff I argued about in March, I’ve come to realize that a lot of other terms aren’t defined consistently even across academic work.

Mitigation and suppression

To Joscha Bach, “mitigation” meant the herd immunity strategy.  Bergstrom took him to task for this, saying it wasn’t what it meant in the field.

But the Imperial College London papers (1, 2) also appear to mean “herd immunity” by “mitigation.”  They form their “mitigation scenarios” by assuming a single peak with herd immunity at the end, and then computing the least-bad scenario consistent with those constraints.

When they come out in favor of “suppression” instead of “mitigation,” they are really saying that we must lower R0 far enough that we don’t have a plan to get herd immunity and are basically waiting for a vaccine, either under permanent restrictions or trigger-based on/off restrictions.

But the “mitigation” strategy imagined here seems like either a straw man, or possibly an accurate assessment of the bizarre bad idea they were trying to combat in the UK at that exact moment.

Even in the “mitigation scenarios,” some NPI is done.  Indeed, the authors consider the same range of interventions as in the “suppression scenarios.”  The difference is that, in “mitigation,” the policies are kept light enough that the virus still infects most of the population.  Here are some stats from their second paper:

If mitigation including enhanced social distancing is pursued, for an R0 of 3.0, we estimate a maximum reduction in infections in the range […] These optimal reductions in transmission and burden were achieved with a range of reductions in the overall rate of social contact between 40.0%- 44.9% (median 43.9%) […]

We also explored the impact of more rigorous social distancing approaches aimed at immediate suppression of transmission. We looked at 6 suppression scenarios […] the effects of widespread transmission suppression were modelled as a uniform reduction in contact rates by 75%, applied across all age-groups

In other words, if you still want herd immunity at the end, you can ask people to reduce their social contact ~43% (which is a lot!), but not more.  The “mitigation” strategy as imagined here is bizarre: you have to be open to non-trivial NPI, open to asking your population to nearly halve their social interaction, but not willing to go further – specifically because you want the whole population to get infected.

Meanwhile, I’ve seen other academic sources use “mitigation” in closer to Bergstrom’s sense, as a general term for NPI and any other measures that slow the spread.  (That paper also uses “flatten” in this same generic way.)

Containment

When Bach writes “containment,” he seems to mean the thing called “suppression” by ICL.  (I.e. the good thing everyone wants, where you impose measures and don’t mysteriously stop them short of what would curtail herd immunity.)

When ICL write “containment,” they appear to mean something different.  Among their suppression scenarios, they compare one confusingly labelled “suppression” to another labelled “containment” – see their Fig. 3 in the 3/16 paper.  The difference is that, among interventions, “containment” lacks school closure but adds household quarantine.  This agrees with the intuitive meaning of “containment,” but differs both from Bach’s usage and from Bergstrom’s (which is different again).

Lockdown

I have no idea what this means.  Apparently I’m in one right now?  To Bach, it appears to mean (at least) city-level travel restrictions, a key component of Bach!Containment but not considered by ICL or other academics I’ve read.

While trying to Google this, I found this Vox article, which, well:

“The term ‘lock-down’ isn’t a technical term used by public health officials or lawyers,” Lindsay Wiley, a health law professor at the Washington College of Law, said in an email. “It could be used to refer to anything from mandatory geographic quarantine (which would probably be unconstitutional under most scenarios in the US), to non-mandatory recommendations to shelter in place (which are totally legal and can be issued by health officials at the federal, state, or local level), to anything in between (e.g. ordering certain events or types of businesses to close, which is generally constitutional if deemed necessary to stop the spread of disease based on available evidence).”

Hammer, dance

These probably mean something, but I cite all of the above as justification for my preemptive wariness about getting into any kind of argument about “whether we should do the hammer,” or who’s currently “doing the dance.”

mind viruses about body viruses

@slatestarscratchpad (thread clipped for length, responding to this)

First of all, thank you for the thoughtful and charitable response.

Re: my overall message

Second of all, yeah, my post is not too clear on a lot of things and went through some message drift as I was writing.  The message I had in mind when I started was 100% about being more careful in curation, not about doing independent work.

Then I ended up spinning this big theory of why curation was not being done carefully.   Roughly, I hypothesized that – although there is a large volume of material being produced – very little of it would qualify for curation under normal circumstances.  Either because the quality is too low (e.g. obviously bad amateur pet theories) or because the format is too indigestible (e.g. convoluted high-context twitter threads that are hard to even permalink clearly).  Hence, some of us are lowering our usual curation bars just to let anything through.

Since “maybe don’t curate anything at all” felt underwhelming as a recommendation, I added a suggestion that we could try improving the supply side.  I didn’t really mean that more independent work of any sort is good, since as you say we are glutted with independent work.  I meant more independent work good enough to pass even “peacetime” thresholds for curation, stuff that very clearly shows its work, collects scattered expert observations into an easily digestible whole without oversimplifying, doesn’t rely on misleading inflammatory phrases to get your attention, etc.

(I do think your masks post falls in this category, and thank you for writing it.)

Maybe the supply-side point is wrong – maybe, as you say in your final para, there are enough good takes out there and the limiting factor is finding and spreading them.  I don’t have a strong opinion either way there.  What I do see is the signal-boosting of stuff which I personally find “iffy” but would maybe provisionally endorse in the absence of anything better.  If better work is being done, we really need to start curating that instead.  If not, then whoever is capable of producing better work needs to produce it, and then we need to curate it.

Re: my objections to recent SSC posts (big picture)

Like I said, I got carried away with grand theorizing as I wrote.  But the original impetus for me writing the post was very simple and concrete: I read the “Hammer and dance” section in your latest post and was frustrated by it.

Taken together with my frustration about your previous discussion of Bach, it felt like there was a pattern where you were both sharing and endorsing some things without clearly understanding them or being able to summarize them adequately.

I worried that these endorsements would aid an information cascade.  But also, “an information cascade is happening” seemed like a relatively charitable option among potential explanations for the pattern.  That is, conditional on “Scott is endorsing this thing he doesn’t really understand,” your action is more defensible if it’s supported by an impression that many independent observers are converging on the same endorsement, rather than if it’s completely based on your (by hypothesis, insufficient) personal assessment.

But this “more defensible” reading still isn’t defensible enough.  When these decisions are being made on intellectual trust, and some of that trust is not well founded (e.g. the trust I suspect many people place in SSC on this topic), we are likely to see quick formation of consensus far beyond what is epistemically licensed.

Okay, you might say, but what’s the alternative – just sharing nothing?  I agree with what you wrote here:

If I stay inside and don’t spread the actual coronavirus, I’ve trivially made everyone’s lives better. If I shut up and don’t spread any intellectual memes, then that just means that people’s thoughts are being shaped by the set of everyone except me. This is good if I’m worse than average, bad if I’m better than average. Or to put it another way, I’m making a net contribution if I signal-boost true/important things disproportionately often compared to their base rate […].

This is true if we model you as a “pure transmitter” who propagates ideas without modifying them in the process.  What I’m worried about, though, is ideas acquiring an ever-growing halo of credibility/consensus as they’re endorsed by individually credible people who cite all the other credible people who believe them, etc.

As I’m writing this, I realize this is a key thing I didn’t adequately emphasize in OP: the concern isn’t about mere passing on of information, it’s about the side effects that can occur as it’s passed on.  This means my metaphor of an “information epidemic” just like a disease was, although entertainingly meta, not actually accurate or helpful. 

I would be happy with a bare link to Pueyo’s or even Bach’s pieces, without explicit endorsement, perhaps just with a note like “seems interesting but I can’t evaluate it.”  (You have said roughly that about many other things, and I approve of that.)  I would also be happy with a detailed “more than you want to know” type analysis of any of these pieces.

What I am not happy with is a link with a rider saying you endorse it, that the smart people you’re reading endorse it, that it’s the new consensus, etc., without an accompanying deep dive or evidence of good individual vetting.  When iterated, this is a cascade.

Re: my objections to recent SSC posts (specifics)

Here are the concrete cases I object to, which made me think I was seeing a bad pattern.

First, here is how you originally glossed Bach’s article in the 3/19 links post:

An article called Flattening The Curve Is A Deadly Delusion has been going around this part of the Internet, saying that it’s implausible to say R0 will ever be exactly 1, so you’re either eradicating the disease (good) or suffering continued exponential growth (bad) without a “flat curve” being much of a possibility.

I won’t explain here why this is not accurate, since I already wrote an SSC comment to that effect.  Shortly after I posted my comment, you modified what’s in the post to say something more accurate which also sounded much like the gloss I wrote in my comment.  (I guessed that this was a reaction to my comment, although I could be wrong.)

Although I appreciate that you made the correction, the damage was done: I was convinced that you had shared the Bach article without understanding it.  If you later came to understand it and still thought it was share-worthy, that’s fine in itself, but understanding was apparently not necessary for sharing.  Further, this called the other Coronalinks into question a la Gell-Mann amnesia: if there’s an error in the one case I happen to have already scrutinized for my own reasons, there are likely some errors in those I haven’t.

Then, in the 3/27 links post, you wrote:

I relayed some criticism of a previous Medium post, Flattening The Curve Is A Deadly Delusion, last links post. In retrospect, I was wrong, it was right (except for the minor math errors it admitted to), and it was trying to say something similar to this. There is no practical way to “flatten the curve” except by making it so flat that the virus is all-but-gone, like it is in South Korea right now. I think this was also the conclusion of the Imperial College London report that everyone has been talking about.

This appears to be an explicit endorsement of the entire article, except the “minor math errors.”  That is, “it was right (except for the minor math errors it admitted to)” implies “everything that was not one of the minor math errors was right.”

I don’t know how to square this with your comments on Bach in the post I’m responding to (I broadly agree with those comments, FWIW).  You describe being initially confused by Bach’s article, then only understanding it after reading other things that made the same point better.  If Bach’s article is confusing, and there are better substitutes, why continue to tout Bach’s article as something “right” and worth reading?

Perhaps a more useful way to say that is: it sounds like you are doing two separate things.  You’re reading articles, and you’re forming a mental model of the situation.  The model can update even when re-reading the same article, if it happens you come to understand it better.  If Bach’s article confused you, but it and things like it eventually caused a useful update to your mental model, then the valuable piece of information you have to transmit is the content of that model update, not the confusing and misleading texts from which you eventually, with effort, distilled that update.  Sharing the texts with endorsement will force others through the same confusion at best, and permanently confuse them at worst.

Remember, there is a lot of stuff in the Bach article beyond the one fact about how low the line is.  I too did not know how low the line was until I read Bach, and in that sense Bach’s meme – including its inflammatory, thus viral, title – was a kind of success.  But it’s a success at transmitting one fact which we didn’t know but every epidemiologist did.

We can take this fact on board and proceed, without – for instance – co-signing an article that explicitly advocates lockdown to stop geographic spread (i.e. creating effectively disease-free zones) as the only solution that will work, something not recommended in any of the ICL or Harvard papers, insofar as I’ve read and understood them.

Closing comments

I realize this is likely to sound like I’m picking nits with phrasing, or perhaps like I’m fixating on a case where you said I was wrong and bloviating until you concede I was “right.”

If I’m kind of unduly fixated on Bach’s article, well … I guess I just think Bach’s article was really bad, although it happened to teach many of us a 101-level fact for the first time.  I may be more confident in this judgment than you, but it doesn’t sound like you were incredibly impressed either – Bach was the first person you saw saying a true thing you didn’t understand until people said it less badly.  

If the best sources for basic information are this polluted with badness, then the supply-side is really messed up and someone less inadequate needs to step up and fix it.  Meanwhile, we should acknowledge the badness and accord no points for merely showing up, because that will mislead people and redistribute a maxed-out attention budget towards the consumption of misleading material.

Or, if there are better sources out there, they really need to be boosted and actively suggested as substitutes for their worse counterparts.  Until Carl Bergstrom gets a Medium account, the best distiller/synthesizer available who writes in a digestible format might well be Pueyo, and his confidence + lack of domain background make me wary.  And he’s the best – there are worse ones.  In relative terms these people may be the best we have, but absolute terms are the ones that matter, and the ones we should apply and communicate.

You are already forming your own model, distinct from these writers’, and in my opinion almost certainly better.  That model could be valuable.  Promoting worse models as stand-ins for it is not valuable.  If your defense of Bach is that he caused you to update a piece of your model, then you are not saying Bach is right – you’re saying, like it or not, that you are.