on “ai forecasting: one year in”
AI is improving so fast, even expert forecasters are surprised!
… wait, is that true?
Who are these experts? And what exactly was it that surprised them?
If you have been following along with the LessWrong-adjacent conversation about AI, you have probably heard some form of the bolded claim at the top. You might have heard it via
As a result, I didn’t closely track specific capabilities advances over the last two years; I’d have probably deferred to superforecasters and the like about the timescales for particular near-term achievements. But progress on some not-cherry-picked benchmarks was notably faster than what forecasters predicted, so that should be some update toward shorter timelines for me.
or Dan Hendrycks et al:
Capability advancements have surprised many in the broader ML community: as they have made discussion of AGI more possible, they can also contribute to making discussion of existential safety more possible.
or Scott Alexander:
Jacob Steinhardt describes the results of his AI forecasting contest last year. Short version: AI is progressing faster than forecasters expected, safety is going slower. Uh oh.
All of these people cite the same blog post as their source.
In the last example, Scott is … well, just linking to a blog post, and it’s clear that his “short version” is a summary of the blog post, not necessarily of what’s-actually-true.
But in the other two examples, the claim is being treated as a “stylized fact,” a generalization about reality on the basis of an empirical result. It’s not about Jacob Steinhardt and his contest, but about “forecasters” and “capabilities,” as entire categories.
This is a pretty striking conclusion to draw. “Big if true,” as they say. So a lot is resting on the shoulders of that one blog post and contest. Do they justify the stylized fact?
—-
In August 2021, Jacob Steinhardt organized a forecasting contest on the platform Hypermind.
In July 2022, he summarized the results up to that point, in the blog post everyone’s citing.
Here’s how Steinhardt begins his summary:
Last August, my research group created a forecasting contest to predict AI progress on four benchmarks. Forecasters were asked to predict state-of-the-art performance (SOTA) on each benchmark for June 30th 2022, 2023, 2024, and 2025. It’s now past June 30th, so we can evaluate the performance of the forecasters so far.
That is:
- Forecasters were asked to predict 4 numbers, each one at various times in the future.
- The earliest of those times has come and gone, so we have something to compare their predictions to.
- The predictions we can evaluate were about what would happen a little under a year in the future.
- Each of these predictions is about the best published result on some ML benchmark, as of the date in question.
How did the forecasters do on those one-year-ahead questions?
- On two of the four questions (MATH and MMLU), the actual value was at the extreme high end of the forecasters’ probability distribution.
- On one of the questions (Something Something v2), the actual value was on the high end of the distribution, but not to the same extreme extent.
- On the other question (adversarial CIFAR-10), the actual value was on the low end of the distribution.
(By “the forecasters’ probability distribution” here I mean Hypermind’s aggregated crowd forecast, though you should also read Eli Lifland’s personal notes on his own predictions. Lifland was surprised in the same direction as the crowd on MATH and MMLU.)
—-
The first thing I want to point out here is that this is not a large sample! There are lots of important benchmarks in ML, not just these 4.
This is most relevant to the stronger version of the claim – that “capabilities” are moving fast, but “safety” is moving slow. Here, a single benchmark is being used as a proxy for the entirety of “AI safety progress.” Did safety really move slowly, or did people just not care that much about adversarial CIFAR-10 over the last year?
(Do people care about adversarial CIFAR-10? I mean, obviously people care about adversarial robustness, there are thousands of papers on it, but is it really a good proxy for AI safety as a whole? When you ask yourself what the most promising AI safety researchers are doing these days, is the answer really “trying to get better numbers on adversarial CIFAR-10”?)
This contest is, at best, really weak/noisy evidence for the stronger version of the claim. Definitely not “stylized fact”-caliber evidence IMO.
With that aside, let’s think about what these forecasts actually mean.
These are forecasts about state-of-the-art (SOTA) results.
A SOTA result is not something that gradually creeps upward in a smooth, regular way. It’s a maximum over every result that’s been published. So it stays constant most of the time, and then occasionally jumps upward instantaneously when a new high score is published.
How often are new SOTA results published? This varies, of course, but the typical frequency is on the order of once a year, give or take.
For example, here’s 5 and a half years of Something Something v2 SOTAs:
The value only changed 7 times in those 5+ years. And the spacing was uneven: judged by SOTA results, absolutely no “progress” occurred over the entire two-year interval from late 2018 to late 2020.
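To make the “staircase” structure concrete, here’s a minimal sketch (with made-up dates and scores, not the real Something Something v2 leaderboard) of how a SOTA curve is just a running maximum over published results:

```python
from datetime import date

# Hypothetical published results: (publication date, benchmark score).
# These numbers are illustrative only, not any real leaderboard.
published = [
    (date(2017, 6, 1), 48.2),
    (date(2018, 3, 1), 51.6),
    (date(2018, 9, 1), 50.9),   # below the running max: the SOTA doesn't move
    (date(2020, 11, 1), 55.3),
    (date(2021, 6, 1), 54.0),   # again, no change to the SOTA
    (date(2022, 2, 1), 63.5),
]

# The SOTA curve is the running maximum of published scores.
sota_curve = []
best = float("-inf")
for when, score in sorted(published):
    if score > best:
        best = score
        sota_curve.append((when, best))

print(sota_curve)
# Only the entries where the running max actually increased appear here;
# between those dates the SOTA is flat, however many papers come out.
```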
This means that, in any year where progress actually occurs, even a well-calibrated forecast with a single-year time horizon will look like it under-predicted that progress!
When you answer one of these questions on a one-year horizon, you’re not actually saying “here is how much progress is likely to happen.” You’re effectively making a binary guess about whether anyone will publish any progress – which doesn’t happen every year – and then combining that with an estimate about the new high score, conditional on its existence.
If there is progress (and even moreso if there’s significant progress), it will look like relatively fast progress according to your distribution, because your distribution had to spend some of its mass on the possibility of no progress.
Indeed, any serious forecast distribution for these questions ought to include a point mass at the current value, since it’s possible that no one will report a new SOTA. The true distribution has at least two modes: one at the current value and one above it.
But Hypermind’s interface constrains you to unimodal distributions. So the forecasters – if they were doing the task right – had to approximate the bimodal truth by tweaking a unimodal distribution so it puts significant mass near the current value. And since these distributions are nice and smooth, that inevitably drags everything down and makes any actual progress look “surprisingly high.”
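Here’s a toy illustration of that effect, with entirely made-up numbers: the “true” forecast is a mixture of a point mass at the current SOTA and a spread-out conditional-on-progress component, and squashing it into a single normal (crudely, by matching mean and variance) makes the same realized jump look like a rarer event than it actually was under the forecaster’s real beliefs:

```python
import numpy as np
from scipy import stats

current_sota = 10.0          # hypothetical current best score
p_progress = 0.5             # chance anyone publishes a new SOTA this year
jump = stats.uniform(loc=current_sota, scale=15.0)  # new SOTA, if there is one

# Survival function of the "true" bimodal forecast: point mass + jump component.
def true_sf(x):
    return p_progress * jump.sf(x) if x >= current_sota else 1.0

# Crude unimodal approximation: a normal matching the mixture's mean and variance.
mix_mean = (1 - p_progress) * current_sota + p_progress * jump.mean()
mix_var = ((1 - p_progress) * current_sota**2
           + p_progress * (jump.var() + jump.mean()**2)
           - mix_mean**2)
approx = stats.norm(loc=mix_mean, scale=np.sqrt(mix_var))

realized = 22.0              # a solid but unremarkable jump, given that one happened
print(f"P(result > {realized}) under the true mixture: {true_sf(realized):.3f}")   # 0.100
print(f"P(result > {realized}) under the forced normal: {approx.sf(realized):.3f}")  # ~0.044
# The same outcome looks like a rarer, more "surprising" event under the
# unimodal approximation than under the forecaster's actual beliefs.
```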
(Sidenote: if I understand Hypermind’s scoring function correctly, it actually encourages you to report a distribution more concentrated around the mode of the true distribution than the true distribution itself. So if I were in this contest and just trying to maximize winnings, I’d probably just predict “no progress” with the highest confidence they’ll allow me to use. I don’t know if anyone did that, though.)
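I don’t know Hypermind’s actual scoring rule, so treat this as a generic sketch of the kind of incentive I mean: under a payoff that is simply proportional to the probability you placed on the realized bin (a classic improper rule), the expected-payoff-maximizing report piles everything onto the modal bin instead of reporting your true distribution:

```python
import numpy as np

# Your true beliefs over binned outcomes (illustrative numbers only).
true_probs = np.array([0.45, 0.30, 0.15, 0.07, 0.03])

# Candidate reports: honest, and fully concentrated on the modal bin.
honest = true_probs
concentrated = np.zeros_like(true_probs)
concentrated[np.argmax(true_probs)] = 1.0

# Under a linear payoff (you earn the probability you assigned to the
# realized bin), expected payoff is the dot product with your true beliefs.
print("honest report:      ", true_probs @ honest)        # ~0.32
print("concentrated report:", true_probs @ concentrated)  # 0.45
# The improper rule pays more, in expectation, for over-concentration.
```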
Still, though … even if you can only use unimodal distributions, hopefully at least you can flatten them out so they capture both the “no progress” side of the coin and the “how much progress, if any?” side – right?
Well, no. Apparently Hypermind has a maximum on the std. dev. of your distribution, and in Eli Lifland’s case this was too low to let him express his true distribution! He writes:
I didn’t run up to the maximum standard deviation [on MATH], but I probably would have given more weight to larger values if I had been able to forecast a mixture of components like on Metaculus. […]
I think [the std. dev. limit] maybe (40% for my forecast) would have flipped the MMLU forecast to be inside the 90th credible interval, at least for mine and perhaps for the crowd.
In my notes on the MMLU forecast I wrote “Why is the max SD so low???”
And indeed, his notes reveal that this is a pretty severe issue, affecting many of the questions:
The max SD it will let me input is 10… want a bit higher here and obviously would want even higher for later dates. The interface is fairly frustrating compared to Metaculus in general tbh.
Why is the max SD 7.5… I want it larger. Still think most likely is in teens but want a really long tail. Have to split the difference
Ok the max SD still being 2.5 is incredibly frustrating… still think 49.8 should be the modal outcome but want my mean to be able be higher. I guess for 2023 I’ll settle on the 49.8 modal prediction and for 2024 start going higher.
Wow, the max SD is insanely low yet again… my actual mean is higher, probably in the 68-70 range
Really wish the SD could be higher (and ditto for below).
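To get a feel for how much the cap matters, here’s a back-of-the-envelope check. The center and cap below are borrowed loosely from the quoted notes (“most likely is in teens,” a max SD of 7.5) and the realized value is the 50.3% MATH result; I’m not claiming these were anyone’s actual parameters, just showing how fast a capped normal tail vanishes:

```python
from scipy.stats import norm

center = 15.0      # illustrative: "still think most likely is in teens"
capped_sd = 7.5    # illustrative: one of the max-SD values complained about above
realized = 50.3    # the Minerva MATH result

z = (realized - center) / capped_sd
print(f"z-score: {z:.1f}")                       # ~4.7
print(f"P(result >= 50.3): {norm.sf(z):.2e}")    # ~1e-6
# With the SD capped, the realized result is forced to look like a
# multiple-sigma miracle, whatever the forecaster's real tail beliefs were.
```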
Remember that we’re already doing stats on a tiny sample – at most 4 data points, and if we insist on forming “capabilities” and “safety” subgroups then we have only 3 and 1 data points, respectively.
And remember that, because progress is “spiky” and only spikes ~1 time a year, we know one of two things is going to happen:
- either there will be ~0 progress and it will look qualitatively like people “overpredicted” progress, or
- there will be > 0 progress, and it will look qualitatively like people “underpredicted” progress
And now consider that – although the above is true even in the best case, where everyone reports their real distribution – we are not in the best case. The forecasters in this contest are literally saying stuff like “my actual mean is higher [but the interface won’t let me say that]”!
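Here’s a quick simulation of even that best case, with made-up parameters: a perfectly calibrated forecaster who knows the true process (a new SOTA arrives with some fixed probability each year, with a known jump-size distribution) and reports exactly that distribution. Restrict attention to the years where progress actually happens, and the realized value beats the forecaster’s median most of the time:

```python
import numpy as np

rng = np.random.default_rng(0)
n_years = 100_000
p_progress = 0.6          # chance of any new SOTA in a given year (made up)
current = 40.0            # current SOTA (made up)

# Simulate realized outcomes under the true process.
has_progress = rng.random(n_years) < p_progress
jumps = rng.exponential(scale=8.0, size=n_years)      # jump size, if any
realized = np.where(has_progress, current + jumps, current)

# The calibrated forecaster reports this exact distribution, so their
# median is just the median of the true process (estimated from samples).
forecast_median = np.quantile(realized, 0.5)

progressed = realized[has_progress]
frac_above = np.mean(progressed > forecast_median)
print(f"forecast median: {forecast_median:.1f}")
print(f"fraction of progress-years beating the median: {frac_above:.2f}")  # ~0.8
# Even with perfect calibration, most years that see any progress at all
# look like years where the forecaster "underpredicted" it.
```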
—-
Above, I showed a screenshot of SOTA progress over time on one of the benchmarks.
That example actually understated how severe the ~1-result-per-year problem is. I picked the benchmark with the longest history, to make a point about how often SOTAs arrive, on average over a multi-year interval. But that was also the benchmark where progress was the most incremental, the least “spiky” – and it’s one the forecasters did relatively well on.
MATH and MMLU – the two where the forecasters really lowballed it – look different. Here’s MMLU:
Except this graph is sort of a lie, because MMLU was only introduced in 2021. The earlier data points come from going back and evaluating earlier models on the benchmark, in some cases after fine-tuning them on some of its data.
But fine, let’s imagine counterfactually that MMLU existed this whole time. In 3 years, there have essentially been 3 big jumps:
- UnifiedQA and GPT-3 in May 2020
- Gopher in Jan 2022
- Chinchilla in Apr 2022
What would this look like, if you ran this contest at various points in the (counterfactual) past?
It all depends on precisely when you do it!
It took about a year from GPT-2 to the next milestone, so if you start forecasting at GPT-2, there’s either “zero progress” or substantial progress – depending on whether the May 2020 milestone “sneaks in at the last second” or not.
And it took ~2 years from that to the following milestone. If you started in the first half of that interval, there would be “zero progress,” and forecasters would qualitatively overpredict. If you started in the second half, there would be “substantial progress” and the forecasters would qualitatively underpredict (since, again, they have to give some mass to the zero-progress case).
You’d get diametrically opposed qualitative results depending on when you run the contest, not because progress was slow in one interval and fast in another, but wholly because of these “edge effects.”
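To put a rough number on those edge effects, here’s a sketch that slides a one-year contest window across the (counterfactual) MMLU timeline, using only the three jump dates listed above at month granularity, and counts how many start dates would have resolved with zero progress:

```python
from datetime import date

# The three big MMLU jumps from the list above (month granularity).
jumps = [date(2020, 5, 1), date(2022, 1, 1), date(2022, 4, 1)]

def month_range(start, end):
    """Yield the first of each month from start up to (not including) end."""
    y, m = start.year, start.month
    while (y, m) < (end.year, end.month):
        yield date(y, m, 1)
        m += 1
        if m > 12:
            y, m = y + 1, 1

# Try every possible contest start month from GPT-2 (Feb 2019) to mid-2021.
results = {}
for start in month_range(date(2019, 2, 1), date(2021, 7, 1)):
    end = date(start.year + 1, start.month, 1)
    results[start] = any(start <= j < end for j in jumps)

no_progress = [s for s, hit in results.items() if not hit]
print(f"{len(no_progress)} of {len(results)} possible start months "
      f"resolve with zero MMLU progress after one year")
# Whether the contest "shows" fast or zero progress depends almost
# entirely on where the one-year window happens to fall.
```

With these dates, roughly 40% of the possible start months land in a dead zone and the rest catch at least one jump, even though nothing about the underlying rate of progress differs between them.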
OK, that’s MMLU. What about MATH? I’ve saved the big punchline for last:
Yep, there are only two data points. We’re trying to (effectively) estimate the slope of a line from literally two points.
More importantly, though: when exactly did that jump happen? Steinhardt writes:
Interestingly, the 50.3% result on MATH was released on the exact day that the forecasts resolved. I’m told this was purely coincidental, but it’s certainly interesting that a 1-day difference in resolution date had such a big impact on the result.
These “edge effects” are not hypothetical: the MATH result “snuck in at the last second” in literally the most extreme possible way.
If the Minerva paper had come out one day later, the contest would have resolved with “no progress on MATH,” and we would have observed qualitative overprediction.
What would have happened in that world? I imagine Steinhardt would (quite reasonably) have said something like:
“Yes, technically the contest resolved with no progress, and I’ll use that for deciding payouts and stuff. But for drawing conclusions about the world, it’s ‘more true’ to count Minerva as being inside the window. It was only one day off, and it was a huge gain, after all.”
But then this would have forced people to confront the topic of edge effects, and there would have been a whole discussion on it, and I wouldn’t have to belabor the point in a post of my own.
Did you notice the asymmetry? In the world where Minerva came one day too late, I don’t think anyone would feel comfortable just writing it off and saying “yep, no progress on MATH this year, end of story.” People would have decided that Minerva at least “partly counted” toward their estimate of progress.
But in our world, no one is doing the reverse. No one is saying that Minerva “only partly counts.” Steinhardt notes the edge effect, but doesn’t say anything about it casting doubt on the implications. He gives it as much weight as the other results, and it ends up being a major driver of his qualitative conclusion.
—-
Yes, yes, you are saying. You’re right, of course, about all these fiddly statistical points.
But (you continue) isn’t the qualitative conclusion just … like, true?
We did in fact get surprising breakthrough results on MATH and MMLU in the last year. What exactly are you saying – that these results didn’t happen? That “forecasters” somehow did see them in advance? Which forecasters?
You are right, reader. If the claim is about these specific benchmarks, MMLU and MATH, then it is true that they over-performed expectations over the last year.
Where things go wrong is the leap from that to the stylized fact, about “capabilities moving faster than expected” as a purported real phenomenon.
You’ve seen the graphs. “These benchmarks over-performed expectations this year” is like saying “the stock market did unusually well this past week.” Some years, the SOTAs overperform; some years they underperform (because they don’t move at all); which kind of year you’re in depends sensitively on where you set the edges. At this time scale, trying to extract a trend is futile.
What’s more, you also have to be careful about double-counting. If you’re following this area enough to have seen this claim, you probably also heard independently about the Minerva result, and about how it came out of nowhere and surprised everyone.
From this knowledge alone, you could have inferred that expert forecasters wouldn’t have guessed it in advance, either. (If not, then where were the voices of those expert forecasters back when the result was announced? Who said “yeah, called it”?)
You would have known this information already, even if this contest had never happened. Now, reading Steinhardt’s post, what exactly is it that you learn? What new information is here that you hadn’t already priced into your thinking?
I think what people are “learning” from this post is something about the systematic nature of the phenomenon.
You can see Minerva and think “huh, that could be a fluke, I don’t know enough to know for sure.” But when you hear Steinhardt say “progress on ML benchmarks happened significantly faster than forecasters expected,” with supporting numbers and charts, it feels like you’re learning that things like Minerva are not flukes: that when you look at the aggregate, this is the trend you see.
But in fact, this result is just Minerva and Chinchilla – which you’d already seen – repackaged in a new form, so it’s hard to tell you’re double-counting.
Viewed as statistical evidence, this stuff is too noisy to be any good, as I detailed above. Viewed as a collection of a few anecdotal stories, well, these stories are noteworthy ones – but you’ve already heard them.
I feel like I’ve seen several cases of this recently, this process of “laundering” existing observations into seemingly novel results pointing in the same direction. I get the same vibe from that one Metaculus AI forecast everyone keeps talking about.
The “forecasters” are not magic – in many cases they are literally the same people who later go on to interpret the forecasts, or at least socially adjacent to them! They are using publicly available information, and making judgments that are reasonable but routine, based on familiar arguments you’ve heard before. If you already follow this area, then what really separates you from the “forecasters”? A bit of calibration training, maybe, at most?
And so we have forecasters boggling at each other’s forecasts, and updating on each other’s updates, and doing statistics to each other’s stated opinions and re-updating on the “statistical versions” of the same arguments they’ve already incorporated into their inside views, and so on, creating this unstable feedback system that can easily spiral one way or another even in the absence of any new information coming in.