the scaling “inconsistency”: openAI’s new insight

I’ve now read the new OpenAI scaling laws paper.  Also, yesterday I attended a fun and informative lecture/discussion with one of the authors.

While the topic is on my mind, I should probably jot down some of my thoughts.

This post is mostly about what the new paper says about the “inconsistency” brought up in their previous paper.

The new paper has a new argument on this topic, which is intuitive and appealing, and suggests that the current scaling trend will indeed “switch over” soon to a new one where dataset size, not model size, is the active constraint on performance.  Most of this post is an attempt to explain and better understand this argument.

——

The new paper is mainly about extending the scaling laws from their earlier paper to new modalities.

In that paper, they found scaling laws for transformers trained autoregressively on text data.  The new paper finds the same patterns in the scaling behavior of transformers trained autoregressively on images, math problems, etc.

So the laws aren’t telling us something about the distribution of text data, but about something more fundamental.  That’s cool.

They also have a new, very intuitive hypothesis for what’s going on with the “scaling inconsistency” they described in the previous paper – the one I made a big deal about at the time.  So that’s the part I’m most excited to discuss.

I’m going to give a long explanation of it, way longer than the relevant part of their paper.  Some of this is original to me, all errors are mine, all the usual caveats.

——

1. L(​C) and L(D)

To recap: the “inconsistency” is between two scaling laws:

  • The law for the best you can do, given a fixed compute budget.

    This is L(​C), sometimes called L(C_min).  L is the loss (lower = better), C is your compute budget.

  • The law for the best you can do, given a fixed dataset size.

    This is L(D), where D is the number of examples (say, tokens) in the dataset.
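
For reference, both laws take simple power-law forms in the papers.  Writing out just the functional forms (the fitted constants and exponents are in the paper; I won't quote them from memory):

```latex
% Functional forms only -- D_c, C_c, \alpha_D, \alpha_C are constants fitted in the paper.
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}
\qquad
L(C_{\min}) = \left(\frac{C_c}{C_{\min}}\right)^{\alpha_C}
```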

Once you reach a certain level of compute, these two laws contradict each other.

I’ll take some time to unpack that here, as it’s not immediately obvious the two can even be compared to one another – one is a function of compute, the other of data.

2. C sets E, and E bounds D

Budget tradeoffs

Given a compute budget C, you can derive the optimal way to spend it on different things.  Roughly, you are trading off between two ways to spend compute:

  • Use C to buy “N”: Training a bigger model – “N” here is model size

  • Use C to buy “S”: Training for more steps “S” (gradient updates)

The relationship between S (steps) and D (dataset size) is a little subtle, for several reasons.

From step count to update count

For one thing, each single “step” is an update on the information from more than one data point.  Specifically, a step updates on “B” different points – B is the batch size.

So the total number of data points processed during training is B times S.  The papers sometimes call this quantity “E” (number of examples), so I’ll call it that too.

From update count to data count

Now, when you train an ML model, you usually update on each data point more than once.  Typically, you’ll do one pass over the full dataset (updating on each point as you go along), then you’ll go back and do a second full pass, and then a third, etc.  These passes are called “epochs.”

If you’re doing things this way, then for every point in the data, you get (number of epochs) updates out of it.  So

E = (number of epochs) * D.  

Some training routines don’t visit every point the exact same number of times – there’s nothing forcing you to do that.  Still, for any training procedure, we can look at the quantity E / D.

This would be the number of epochs, if you’re doing epochs.  For a generic training routine, you can think of E / D as the “effective number of epochs”: the average number of times we visit each point, which may not be an integer.

Generally E ≠ D, but we always have E ≥ D.  You can’t do fewer than one epoch; you can’t visit the average point less than once.

This is just a matter of definitions – it’s what “dataset size” means.  If you say you’re training on a million examples, but you only update on 100 individual examples, then you simply aren’t “training on a million examples.”
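
Here is that bookkeeping as a tiny Python sketch, just to make the definitions concrete (the numbers are invented for illustration):

```python
# Effective number of epochs, E / D, for a generic training run.
def effective_epochs(batch_size: int, steps: int, dataset_size: int) -> float:
    """Average number of times each data point is visited (E / D)."""
    E = batch_size * steps              # total data points processed during training
    assert E >= dataset_size, "can't visit the average point less than once (E >= D)"
    return E / dataset_size

# made-up numbers: batch size 512, 100k steps, 20M-example dataset
print(effective_epochs(batch_size=512, steps=100_000, dataset_size=20_000_000))  # -> 2.56
```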

3. The inconsistency

L(D): information

OpenAI derives a scaling law called L(D).  This law is the best you could possibly do – even with arbitrarily large compute/models – if you are only allowed to train on D data points.

No matter how good your model is, there is only so much it can learn from a finite sample.  L(D) quantifies this intuitive fact (if the model is an autoregressive transformer).

L(​C): budgeting

OpenAI also derives another scaling law called L(C).  This is the best you can do with compute C, if you spend it optimally.

What does optimal spending look like?  Remember, you can spend a unit of compute on 

  • a bigger model (N), or 
  • training the same model for longer (S)

(Sidenote: you can also spend on bigger batches B.  But – to simplify a long, complicated story – it turns out that there are really just 2 independent knobs to tune among the 3 variables (B, N, S), and OpenAI frames the problem as tuning (N, S) with B already “factored out.”)

In the compute regime we are currently in, making the model bigger is way more effective than taking more steps.

This was one of the punchlines of the first of these two papers: the usual strategy, where you pick a model and then train it until it’s as good as it can get, is actually a suboptimal use of compute.  If you have enough compute to train the model for that long (“until convergence”), then you have enough compute to train a bigger model for fewer steps, and that is a better choice.

This is kind of counterintuitive!  It means that you should stop training your model before it stops getting better.  (“Early stopping” means training your model until it stops getting better, so this is sort of “extra-early stopping.”)  It’s not that those extra steps wouldn’t help – it’s that, if you are capable of doing them, then you are also capable of doing something else that is better.

Here’s something cool: in Appendix B.2 of the first paper, they actually quantify exactly how much performance you should sacrifice this way.  Turns out you should always stop at a test loss about 10% higher than what your model could asymptotically achieve.  (This will be relevant later, BTW.)

Anyway, OpenAI derives the optimal way to manage the tradeoff between N and S.  Using this optimal plan, you can derive L(​C) – the test loss you can achieve with compute C, if you allocate it optimally.

N goes up fast, S goes up slowly…

The optimal plan spends most incremental units of compute on bigger models (N).  It spends very little on more steps (S).

The amount it spends on batch size (B) is somewhere in between, but still small enough that the product E = B*S grows slowly.

But remember, we know a relationship between E and “D,” dataset size.  E can’t possibly be smaller than D.

So when your optimal plan chooses its B and its S, it has expressed an opinion about how big its training dataset is.

The dataset could be smaller than B*S, if we’re doing many (effective) epochs over it.  But it can’t be any bigger than B*S: you can’t do fewer than one epoch.

… and you claim to achieve the impossible

L(​C), the loss with optimally allocated C, goes down very quickly as C grows.  Meanwhile, the dataset you’re training with that compute stays almost the same size.

But there’s a minimum loss, L(D), you can possibly achieve with D data points.

The compute-optimal plan claims “by training on at most B*S data points, with model size N, I can achieve loss L(​C).”

The information bound says “if you train on at most B*S data points, your loss can’t get any lower than the function L(D), evaluated at D = B*S.”

Eventually, with enough compute, the L(​C) of the compute-optimal plan is lower than the L(D) of the dataset used by that same plan.

That is, even if the compute-optimal model is only training for a single epoch, it is claiming to extract more value from that epoch than any model could ever extract from the same data, given any number of epochs.

That’s the inconsistency.
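
If it helps, here is a toy numerical version of the collision.  Every constant and exponent below is a made-up placeholder, chosen only to have the right qualitative shape (the compute-optimal loss falls quickly with compute, while the data it processes grows slowly); the real fitted values are in the papers.

```python
# Toy illustration of the inconsistency -- NOT the papers' fitted constants.
def L_C(C):      # loss of the compute-optimal plan (falls quickly with compute)
    return (1e10 / C) ** 0.05

def E_of_C(C):   # data processed (B*S) by that plan (grows slowly with compute)
    return 1e6 * C ** 0.25

def L_D(D):      # information bound: best possible loss from D data points
    return (1e9 / D) ** 0.09

for logC in range(0, 13):
    C = 10.0 ** logC
    claimed, bound = L_C(C), L_D(E_of_C(C))
    flag = "  <-- claims to beat the information bound" if claimed < bound else ""
    print(f"C=1e{logC:02d}  L(C)={claimed:.3f}  L(D=E(C))={bound:.3f}{flag}")
```

With these toy numbers the flag first turns on at C = 1e9; the point is just that for any slowly-growing E(C) and fast-falling L(C), the crossing eventually happens.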

4. The resolution

In the new paper, there’s an intuitive hypothesis for what’s going on here.  I don’t think it really needs the multimodal results to motivate it – it’s a hypothesis that could have been conceived earlier on, but just wasn’t.

Bigger models extract a resource faster

The idea is this.  As models get bigger, they get more update-efficient: each time they update on a data point, they get more out of it.  You have to train them for fewer (effective) epochs, all else being equal.

This fact drives the choice to scale up the model, rather than scaling up steps.  Scaling up the model makes your steps more valuable, so when you choose to scale the model rather than the steps, it’s almost like you’re getting more steps anyway.  (More “step-power,” or something.)

The resource is finite

Each data point has some information which a model can learn from it.  Finite models, trained for a finite amount of time, will miss out on some of this information.

You can think about the total extractable information in a data point by thinking about what an infinitely big model, trained forever, would eventually learn from that point.  It would extract all the information – which is more than a lesser model could extract, but still finite.  (A single data point doesn’t contain all the information in the universe.)

This is literally the definition of L(D): what an infinitely big model, trained forever, could learn from D separate data points.  L(D) quantifies the total extractable information of those points.

(More precisely, the total extractable information is the gap between L(D) and the loss achieved by a maximally ignorant model, or something like that.)

Converging in the very first step

As models get bigger, they extract more information per update.  That is, each time they see a data point, they extract a larger fraction of its total extractable information.

Eventually, your models are getting most of that information the very first time they see the data point.  The “most” in that sentence gets closer and closer to 100%, asymptotically.

How does this relate to optimal compute allocation?

The logic of the “optimal compute plan” is as follows:

Your model is an imperfect resource extractor: it only gets some of the resources locked up in a data point from the first update.  So you could extract more by running for more steps … 

…  but if you have the compute for that, you can also spend it by making your steps more efficient.  And, in the current compute regime, that’s the smarter choice.

It’s smarter by a specific, uniform proportion.  Remember, you should stop training when your loss is 10% higher than the converged loss of the same model.  If the converged loss is L, you should stop at 1.1*L.

Can you always do that?  If your model is efficient enough, you can’t!  As the first epoch gets closer to 100% efficient, the loss after the first epoch gets arbitrarily close to the converged loss.  Your loss goes under 1.1*L by the end of the first epoch.

At this point, the story justifying the L(​C) law breaks down.

The L(C) law goes as fast as it does because upgrading the efficiency of your extractor is cheaper – in terms of compute spent per unit of resource extracted – than actually running the extractor.

This works as long as your extractor is inefficient.  But you can’t push efficiency above 100%.  Eventually, the only way to extract more is to actually run the damn thing.

Getting a bigger quarry

When you’re extracting a resource, there’s a difference between “improve the extractor” and “get a bigger quarry.”

If your quarry has 100 resource units in it, the strategy of “improving the extractor” can never get you more than 100 units.  It can get them to you faster, but if you want more than 100 units, you have to get a bigger quarry.

“N” sets the efficiency of the extractor.  “S” sets … well, it doesn’t exactly set the size of the quarry (that’s D).  There is an ambiguity in S: it could mean running for more epochs on the same data, or it could mean getting more data.

But S does, at least, set an upper bound on the size of the quarry, D.  (Via D ≤ E and E = B*S, with B set optimally as always.)

With high enough compute (and thus model size), you’ve pushed the “extractor upgrades are cheap” lifehack as far as it can go.  With this efficient extractor, taking S steps (thus making E = B*S updates) sucks up most of the information theoretically extractable from E individual data points.

The learning curve L(E) of your model, as it makes its first pass over the dataset, starts to merge with L(D), the theoretical optimum achievable with that same dataset.  You trace out L(D) as you train, and the relevant constraint on your performance is the maximum data size D you can obtain and train on.

Where we are now

In the compute regime that spans GPT-2 and the smaller variants of GPT-3, extraction is far less than maximally efficient.  The L(​C) strategy applies, and the smart move is to spend compute mostly on model size.  So you make GPT-2, and then GPT-3.

Once we get to the full GPT-3, though, the extractor is efficient enough that the justification for L(​C) has broken down, and the learning curve L(E) over the first epoch looks like L(D).

Here is that as a picture, from the new paper:

[image: learning curves for models of various sizes, with the L(D) bound shown as a black line]

The yellowest, lowest learning curve is the full GPT-3.  (The biggest GPT-2 is one of the green-ish lines.)  The black line is L(D), maximally efficient extraction.

You can see the whole story in this picture.  If you’re in one of the smaller-model learning curves, running for more steps on more data will get you nowhere near to the total extractable info in that data.  It’s a better use of your compute to move downwards, toward the learning curve of a bigger model.  That’s the L(​C) story.

If the L(​C) story went on forever, the curves would get steeper and steeper.  Somewhere a little beyond GPT-3, they would be steeper than L(D).  They would cross L(D), and we’d be learning more than L(D) says is theoretically present in the data.

According to the story above, that won’t happen.  We’ll just converge ever closer to L(D).  To push loss further downward, we need more data.

Implications

Since people are talking about bitter lessons a lot these days, I should make the following explicit: none of this means “the scaling hypothesis is false,” or anything like that.

It just suggests the relevant variable to scale with compute will switch: we’ll spend less of our marginal compute on bigger models, and more of it on bigger data.

That said, if the above is true (which it may not be), it does suggest that scaling transformers on text alone will not continue productively much past GPT-3.

The GPT-3 paper says its choices were guided by the “grow N, not S” heuristic behind the L(​C) curve:

Based on the analysis in Scaling Laws For Neural Language Models [KMH+20] we train much larger models on many fewer tokens than is typical.

(“KMH+20” is the first of the two scaling papers discussed here.)  Even following this heuristic, they still picked a huge dataset, by human standards for text datasets.

In the above terms, their “E” was 300 billion tokens and their “D” was ~238 billion tokens, since they updated multiple times on some tokens (cf. Table 2.2 in the GPT-3 paper).  The whole of Common Crawl is 410 billion tokens, and Common Crawl might as well be “all the text in the universe” from the vantage point of you and me.

So, there’s room to scale D up somewhat further than they did with GPT-3, but not many orders of magnitude more.  To me, this suggests that an intuitively “smarter” GPT-4 would need to get its smartness from being multimodal, as we really can’t go much further with just text.

on “learning to summarize”

This post is a much extended version of an LW comment I made about OpenAI’s new paper, “Learning to summarize from human feedback.”

Context: this paper is a direct extension of the work OpenAI published last year about fine-tuning GPT-2 with human preference data.  I hadn’t actually read that one closely at the time, but went back and did so now, so this is really a commentary on both.

—-

IMO there are two almost unrelated ideas going on in OpenAI’s preference learning work.

  • First, the idea of collecting binary preference annotations on LM samples, and (in some way) tuning the LM so its samples are better aligned with the preferences.
  • Second, a specific method for tuning the sampling behavior of LMs to maximize an (arbitrary) score function defined over entire samples.

It may help to explain this by going into detail about what they do.  Concretely (a rough code sketch follows the list):

  • They feed a bunch of prompts to a language model (LM) like GPT-2/3, and for each one, save several different samples.  They hire annotators to rank the samples in order of perceived quality.
  • They use the annotation dataset to fine-tune a copy of the original model.  The fine-tuning task is not text generation, but something very different: predicting how “good” a sample is, i.e. how likely the annotators are to prefer it to other candidates.  They call this a “reward model.”
  • The reward model assigns a single score to an entire sample of N tokens.  They want to fine-tune another copy of the model so that its samples maximize these scores.
  • But LM training is usually done with an objective that specifies the quality of the model’s predictions for every single token.  Knowing how good a full sequence of (say) 20 words is does not tell you how good each individual word is.
  • To bridge this gap, they use reinforcement learning.  Now, the task is not “choose the next word correctly,” but “choose the next word so as to maximize your expected score at the end, after choosing all the later ones as well.”
  • Their RL method requires two separate copies of the LM, in addition to the one they tuned as the reward model: a “policy model” and a “value model.”  (In this paper they show that sharing parameters between the two is worse than making them separate.)  I’ll just call these two “the final model” below for simplicity.
  • Samples from the final model are still, technically, generated one token at a time.  They treat this like the usual RL setup in which you can only choose individual actions one at a time, because the environment responds unpredictably to each one.  Here, there is no “environment” outside your actions, but the same framework is used.
  • Presumably, the final model is better at planning multi-token structures than the original because it has been trained on a holistic, multi-token objective.  So, it does more planning, but this is implicit in its one-by-one token decisions.
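
Here is a very loose sketch of the two halves, following the description above.  The model objects and their methods (reward_model(prompt, sample), policy.sample, .logprob) are hypothetical interfaces invented for illustration – this is not OpenAI’s code, just the shape of the two objectives.  (I believe the KL penalty in the second half is the “auxiliary term” that comes up again at the end of this post.)

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, prompt, preferred, rejected):
    """Supervised half: fit scalar scores to pairwise human preferences."""
    r_pref = reward_model(prompt, preferred)   # scalar score for the preferred sample
    r_rej = reward_model(prompt, rejected)     # scalar score for the rejected sample
    # pairwise loss: the preferred sample should get the higher score
    return -F.logsigmoid(r_pref - r_rej).mean()

def rl_reward(policy, reference_lm, reward_model, prompt, beta=0.02):
    """RL half: the per-sample quantity the RL algorithm (PPO in the papers)
    tries to maximize -- the reward model's score for the whole sample, minus
    a KL-style penalty keeping the policy close to the original LM."""
    sample = policy.sample(prompt)             # one full generated summary
    score = reward_model(prompt, sample)
    kl_term = policy.logprob(prompt, sample) - reference_lm.logprob(prompt, sample)
    return score - beta * kl_term
```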

I visualize this as two separate things with a bottleneck connecting them.

On one side are the human annotations and the supervised training of the reward model.  This part succeeds insofar as they can train the model to predict the annotations (apparently they can do this quite well).  This step involves a type of data with special challenges, but has nothing to do with RL.

On the other side is the RL part.  This is a modification of ordinary LM training to optimize a global, rather than local objective.  This part has nothing to do with “human preferences”: the global objective could be anything, and in fact here it isn’t raw human opinion but the opinions of another model trained to predict human opinion.  The noteworthy thing here is not the use of human preference data in particular but the use of RL instead of the more ordinary objective that was apparently a good enough choice to make GPT-2/3 work originally.

(BTW, this resolves my initial confusion as to how OpenAI could possibly have gotten RL to work with human data, something I viewed as a bottleneck.  There is a model sitting between the humans and the RL learner which is much faster to query than the humans.)

The two sides are connected by the reward model.  In the previous paper, the two sides were coupled together more, because they repeatedly collected new human data as the policy changed and then used a new reward model to further train the policy.  Here, they’re totally separate: there were multiple batches of annotation, but each policy experienced an unchanging reward model.

(See Appendix C.6 and their comment about “moving to the offline setting.”  It seems noteworthy that the 2017 OpenAI/DeepMind paper which introduced the “RL from preferences” approach, and which they cite, found that this didn’t work for their test cases: “Training the reward predictor offline can lead to bizarre behavior […] This type of behavior demonstrates that in general human feedback needs to be intertwined with RL rather than provided statically.”  I don’t know what to make of this.)

—-

It’s hard to tell from OpenAI’s discussion how much their successes are due to learning a good reward model, vs. how much they depend on RL being necessary for certain kinds of quality in LM samples, despite the wide successes of the non-RL approach.

FWIW, Gwern reports trying OpenAI’s approach and finding the RL side specifically frustrating and unstable; this is pretty normal with RL, and compatible with the reward-model part being very successful in its own domain.  It’s not clear whether OpenAI got the RL part to work well because they did something right, or because they have lots of resources and can keep trying over and over until it works.  (There may have been something in the papers about this that I missed.)

—-

The RL part feels almost in tension with OpenAI’s usual approach with LMs, which is to train on a next-token objective, sample in a next-token way, and focus on scaling up the model rather than improving the training objective or sampling algorithm.

Of course, I understand why they have to do RL if they need to maximize a score over the whole sequence, but my point is that they chose to frame the task that way in the first place.

One could imagine someone arguing that ordinary GPT sampling would never achieve high-quality text, because humans care about global structures across the whole text, and a model trained only to guess the very next token will not know how to plan out these global structures across the whole future of the text it writes.  In this case, OpenAI claims that they can do without explicit training to plan (i.e. RL): just training a next-token objective on text is enough to produce strikingly high quality in sampling – in other words, “GPT-2/3 samples satisfy human preferences.”  So why do human preferences require RL in these other cases?

The opening discussion of the new paper does address this:

When applying these models to a specific task, they are usually fine-tuned using supervised learning, often to maximize the log probability of a set of human demonstrations.

While this strategy has led to markedly improved performance, there is still a misalignment between this fine-tuning objective—maximizing the likelihood of human-written text—and what we care about—generating high-quality outputs as determined by humans. This misalignment has several causes: the maximum likelihood objective has no distinction between important errors (e.g. making up facts [38]) and unimportant errors (e.g. selecting the precise word from a set of synonyms); models are incentivized to place probability mass on all human demonstrations, including those that are low-quality; and distributional shift during sampling can degrade performance [52, 49]. Quality can often be improved significantly by non-uniform sampling strategies such as beam search [48], but these can lead to repetition and other undesirable artifacts [63, 22]. Optimizing for quality may be a principled approach to overcoming these problems.

This is definitely a list of things that are wrong (or could be wrong) with ordinary LM training and sampling, but I don’t see how it motivates their specific approach.

In my mind, their approach makes the most sense if you believe that humans can’t make the relevant quality judgments at the token level.  After all, if they can, then you can just skip the RL, have humans explicitly tell you “no that token is bad, yes this token is great,” and train on likelihood.

This would greatly simplify the process, instead of this complex pipeline where first people tell you which sequences are good, then you train one model to understand what the humans were thinking on a sequence level, and then you train another model trying to figure out what the other model already knows except at a token level this time.

And in fact, I don’t especially see why we can’t elicit token-level preferences?  This seems particularly feasible for the problem of “unimportant vs. important tokens”: if the mistakes are heavily concentrated in specific mistake-tokens like “Portland, the capitol of France,” can’t the human just … select those tokens, NER-style?  Instead of rendering an opaque “I don’t like the whole thing” judgment and expecting the poor model to figure out that this is not some complex policy planning thing, those tokens were just locally bad?  Or you could have an interface where tokens are actually unrolled in front of the user and they guide the sampling when it makes mistakes.  Or whatever.

As for the other examples – “all human demonstrations, including those that are low-quality” is equally a problem for their approach, and they discuss all the stuff they did to deal with it.  And the “distributional shift” issue seems equally tractable by any approach that tunes on model samples.

I’m not denying that the thing they did apparently works, at least in this case, and with their resources.  I’m just doing my usual thing where I ask “wait, what parts were really necessary?”  This is especially important to ask when someone uses RL and accepts its big costs.

Consider: if RL were generally necessary for good LM sampling, GPT-2/3 would never have worked: the fact that likelihood training is good enough (while being far more efficient) enables their scale in the first place.  As always, you never want to be doing RL.

—-

As far as I can tell, their final “human evaluation” was done by the same labelers who provided the preference annotations. This makes me concerned about a variant of “evaluating on training data.” It’s not surprising that a model tuned on someone’s annotations agrees with that person more than a model which wasn’t.

For example, in Fig. 3, it looks like the “supervised” baseline tuned on tl;dr was rated about as highly as true examples from tl;dr itself (!), but not as well as the final model.

This establishes only that “if you train on reddit summaries, people like the result as much as reddit summaries; if you train on what they like, they like the result more.”  If this were false it would mean something had gone very, very wrong and nothing was actually being achieved, so what should I take away from it being true?

I think the authors are arguing that tl;dr and any other supervised dataset will have flaws, and preference data lets you get closer to what people actually want.

This seems true, but is a familiar observation from supervised learning, motivating e.g. active learning. It would be nice to see how much the difference can be mitigated by just augmenting tl;dr with annotations (in some way) but otherwise doing supervised learning, vs. using their RL approach.

Compared to tl;dr, the story for CNN/DM is more complicated, but again the models they outperform have not seen any data from their labelers, so maybe it is no surprise they have flaws according to those same labelers.

—-

The importance of annotation quality, close relationships with annotators, clear guidelines, etc. will be familiar to anyone with experience in annotation for ML. It’s good that OpenAI is doing the right things here, but this is not a new result – rather, other researchers resort to MTurk and similar due to time/money constraints, while OpenAI has the freedom to do the right things everyone else wants to do.

(That includes building their own internal annotation platform for contracted annotators, which is costly but better in the long term than relying on a janky 3rd party product.)

—-

I don’t know if this actually matters, but my gut says that putting a linear head on top of the last layer of GPT is probably not the best / most efficient way to train a reward/value model.  The task is very different from next-token prediction, and the encoding in later layers which expect to be seeing next-token guesses might be destructively overwritten to make way for more valuable stuff lower down.  I guess I’d want to try a trainable scalar mix, a la Elmo?
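
Concretely, here is a minimal sketch of the scalar-mix idea (a la ELMo), assuming you already have the hidden states from every layer for a sample in hand.  This is just an illustration of the alternative I’d want to try, not anything OpenAI actually does:

```python
import torch
import torch.nn as nn

class ScalarMixRewardHead(nn.Module):
    """Softmax-weighted mix over all layers' hidden states, then a linear score."""
    def __init__(self, n_layers: int, d_model: int):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(n_layers))  # one weight per layer
        self.gamma = nn.Parameter(torch.ones(()))                 # overall scale
        self.score = nn.Linear(d_model, 1)

    def forward(self, hidden_states):
        # hidden_states: (n_layers, batch, seq_len, d_model), e.g. from a frozen GPT
        w = torch.softmax(self.layer_logits, dim=0)
        mixed = self.gamma * (w[:, None, None, None] * hidden_states).sum(dim=0)
        pooled = mixed[:, -1, :]                   # last-token pooling, one of many choices
        return self.score(pooled).squeeze(-1)      # one scalar reward per sample
```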

BTW, in the selector model for @nostalgebraist-autoresponder, which predicts a kind of “human preference data,” I currently use two extra transformer blocks trained from scratch, which attend to two different layers of the generator (whose weights are frozen).

For the layers, I settled on #8 and #24 of the 42 layers after many hyperparam searches – in particular, I found that models which attended to layers right near the middle were dramatically superior to those that didn’t.  The relative uselessness of later layers surprised me at first, and was one of the questions in my mind when I started the logit lens investigations.

—-

Finally, on a lighter note, the very last table of the paper is hilarious.  It shows samples that optimize too hard for what the reward model wants, without an auxiliary term in the loss.

Apparently, the same reward model which otherwise reflects human preferences quite well has decided that humans just utterly love it when summaries end with this one specific, rude turn of phrase:

want change this dumbass shitty ass policy pls [one imagines the reward model being frustrated with its siblings during training -nost]

want change this dumbass shitty ass policy at work now pls halp

want change this dumbass shitty ass behavior of mine please help pls halp

want change this dumbass shitty ass policy of hers please pls halp

want change this dumbass shitty ass landlord behavior now please pls halp

regret this dumbass behaviour on her part? need insight pls halp

want change this dumbass crazy policy of hers pls help

want change this dumbass selfish/lazy attitude now please help pls

(Again, wouldn’t it be nice if we could avoid the need for this thing and just train on the preferences directly … )

is gpt-3 few-shot ready for real applications?

This is a lengthy reply to @the-moti​​‘s post here.  Creating a new post to limit thread length, and so I can crosspost to LW.

@the-moti​​ says, in part:

This obviously raises two different questions: 1. Why did you think that no one would use few-shot learning in practice? 2. Why did other people think people would use few-shot learning in practice?

I would be interested in hearing your thoughts on these two points.

Thanks for asking!

First of all, I want to emphasize that the GPT-3 paper was not about few-shot GPT-3 as a practical technology.

(This is important, because the paper is the one large body of quantitative evidence we have on few-shot GPT-3 performance.)

This is not just my take on it: before the OpenAI API was announced, all the discussion I saw took for granted that we were talking about a scientific finding and its broader implications.  I didn’t see any commentator whose main takeaway was “wow, if I could do this few-shot thing right now, I could build amazing projects with it.”

Indeed, a common theme in critical commentary on my post was that I was too focused on whether few-shot was useful right now with this specific model, whereas the critical commentators were more focused on the implications for even larger models, the confirmation of scaling laws over a new parameter regime, or the illustration-in-principle of a kind of meta-learning.  Gwern’s May newsletter is another illustrative primary source for the focus of the discussion in this brief “pre-API” period.  (The API was announced on June 11.)

As I read it (perhaps benefitting from hindsight and discussion), the main points of the paper were

(1) bigger models are better at zero/few-shot (i.e. that result from the GPT-2 paper holds over a larger scale),

(2) more “shots” are better when you’re doing zero/few-shot,

(3) there is an interaction effect between 1+2, where larger models benefit more from additional “shots,”

(4) this could actually become a practical approach (even the dominant approach) in the future, as illustrated by the example of a very large model which achieves competitive results with few-shot on some tasks

The paper did not try to optimize its prompts – indeed its results are already being improved upon by API acolytes – and it didn’t say anything about techniques that will be common in any application, like composing together several few-shot “functions.”  It didn’t talk about speed/latency, or what kind of compute backend could serve many users with a guaranteed SLA, or how many few-shot “function” evaluations per user-facing output would be needed in various use cases and whether the accumulated latency would be tolerable.  (See this post on these practical issues.)

It was more of a proof of concept, and much of that concept was about scaling rather than this particular model.

So I’d argue that right now, the ball is in the few-shot-users’ court.  Their approach might work – I’m not saying it couldn’t!

In their favor: there is plenty of room to further optimize the prompts, explore their composability, etc.

On the other hand, there is no body of evidence saying this actually works.  OpenAI wrote a long paper with many numbers and graphs, but that paper wasn’t about whether their API was actually a good idea.  (That is not a criticism of the paper, just a clarification of its relevance to people wondering whether they should use the API.)

This is a totally new style of machine learning, with little prior art, running on a mysterious and unproven compute backend.  Caveat emptor!

Anyway, on to more conceptual matters.

The biggest advantages I see in few-shot learning are

(+1) broad accessibility (just type English text) and ability to quickly iterate on ideas

(+2) ability to quickly define arbitrary NLP “functions” (answer a factual question, tag POS / sentiment / intent, etc … the sky’s the limit), and compose them together, without incurring the memory cost of a new fine-tuned model per function

What could really impress me is (+2).  IME, it’s not really that costly to train new high-quality models: you can finetune BERT on a regular laptop with no GPU (although it takes hours), and on ordinary cloud GPU instances you can finetune BERT in like 15 minutes.

The real cost is keeping around an entire finetuned model (~1.3GB for BERT-large) for each individual NLP operation you want to perform, and holding them all in memory at runtime.

The GPT-3 approach effectively trades this memory cost for a time cost.  You use a single very large model, which you hope already contains every function you will ever want to compute.  A function definition in terms of this model doesn’t take a gigabyte to store, it just takes a tiny snippet of text/code, so you can store tons of them.  On the other hand, evaluating each one requires running the big model, which is slower than the task-specific models would have been.

So storage no longer scales badly with the number of operations you define.  However, latency still does, and latency per call is now much larger, so this might end up being as much of a constraint.  The exact numbers – not well understood at this time – are crucial: in real life the difference between 0.001 seconds, 0.1 seconds, 1 second, and 10 seconds will make or break your project.
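
Back-of-the-envelope version of that tradeoff – every number below is an illustrative guess, not a measurement:

```python
# Illustrative guesses only: checkpoint sizes and latencies are placeholders.
n_functions = 50                        # distinct NLP "functions" you want to serve

# Fine-tuning approach: one finetuned model per function.
storage_finetuned_gb = n_functions * 1.3        # ~BERT-large-sized checkpoint each
latency_finetuned_s = 0.01                      # small task-specific model per call (guess)

# Few-shot approach: one big model plus a short prompt per function.
storage_fewshot_gb = 350 + n_functions * 4e-6   # ~350 GB model (175B params, 2 bytes each), ~4 KB prompts
latency_fewshot_s = 1.0                         # every call runs the big model (guess)

print(f"fine-tuned: ~{storage_finetuned_gb:.0f} GB storage, ~{latency_finetuned_s}s per call")
print(f"few-shot:   ~{storage_fewshot_gb:.0f} GB storage, ~{latency_fewshot_s}s per call")
```

Storage scales with the number of functions in one column, latency dominates in the other; which column you care about depends entirely on those per-call numbers.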


As for the potential downsides of few-shot learning, there are many, and the following probably excludes some things I’ve thought of and then forgotten:

(-1) The aforementioned potential for deal-breaking slowness.

(-2) You can only provide a very small amount of information defining your task, limited by context window size.

The fact that more “shots” are better arguably compounds the problem, since you face a tradeoff between providing more examples of the same thing and providing examples that define a more specific thing.

The extent to which this matters depends a lot on the task.  It’s a complete blocker for many creative applications which require imitating many nuances of a particular text type not well represented in the training corpus.

For example, I could never do @nostalgebraist-autoresponder with few-shot: my finetuned GPT-2 model knows all sorts of things about my writing style, topic range, opinions, etc. from seeing ~3.65 million tokens of my writing, whereas with few-shot you can only identify a style via ~2 thousand tokens and hope that’s enough to dredge the rest up from the prior learned in training.  (I don’t know if my blog was in the train corpus; if it wasn’t, we’re totally screwed.)

I had expected AI Dungeon would face the same problem, and was confused that they were early GPT-3 adopters.  But it turns out they actually fine-tuned (!!!!), which resolves my confusion … and means the first real, exciting GPT-3 application out there isn’t actually a demonstration of the power of few-shot but in fact the opposite.

With somewhat less confidence, I expect this to be a blocker for specialized-domain applications like medicine and code.  The relevant knowledge may well have been present in the train corpus, but with so few bits of context, you may not be able to overcome the overall prior learned from the whole train distribution and “zoom in” to the highly specialized subset you need.

(-3) Unlike supervised learning, there’s no built-in mechanism where you continually improve as your application passively gathers data during usage.

I expect this to be a big issue in commercial applications.  Often, a company is OK accepting a model that isn’t great at the start, if it has a mechanism for self-improvement without much human intervention.

If you do supervised learning on data generated by your product, you get this for free.  With few-shot, you can perhaps contrive ways to feed in segments of data across different calls, but from the model’s perspective, no data set bigger than 2048 tokens “exists” in the same world at once.

(-4) Suffers a worse form of the ubiquitous ML problem that “you get exactly what you asked for.”

In supervised learning, your model will avoid doing the hard thing you want if it can find easy, dumb heuristics that still work on your train set.  This is bad, but at least it can be identified, carefully studied (what was the data/objective? how can they be gamed?), and mitigated with better data and objectives.

With few-shot, you’re no longer asking an arbitrary query and receiving, from a devious genie, the response you deserve.  Instead, you’re constrained to ask queries of a particular form: “what is the next token, assuming some complicated prior derived from sub-sampled Common Crawl + WebText + etc.?”

In supervised learning, when your query is being gamed, you can go back and patch it in arbitrary ways.  The lower bound on this process comes only from your skill and patience.  In few-shot, you are fundamentally lower-bounded by the extent to which the thing you really want can be expressed as next-token prediction over that complicated prior.  You can try different prompts, but ultimately you might run into a fundamental bound here that is prohibitively far from zero.  No body of research exists to establish how bad this effect will be in typical practice.

I’m somewhat less confident of this point: the rich priors you get out of a large pretrained LM will naturally help push things in the direction of outcomes that make linguistic/conceptual sense, and expressing queries in natural language might add to that advantage.  However, few-shot does introduce a new gap between the queries you want to ask and the ones you’re able to express, and this new gap could be problematic.

(-5) Provides a tiny window into a huge number of learned parameters.

GPT-3 is a massive model which, in each call, generates many intermediate activations of vast dimensionality.  The model is pre-trained by supervision on a tiny subset of these, which specify probability distributions over next-tokens.

The few-shot approach makes the gamble that this same tiny subset is all the user will need for applications.  It’s not clear that this is the right thing to do with a large model – for all we know, it might even be the case that it is more suboptimal the larger your model is.

This point is straying a bit from the central topic, since I’m not arguing that this makes GPT-3 few-shot (im)practical, just suboptimal relative to what might be possible.  However, it does seem like a significant impoverishment: instead of the flexibility of leveraging immense high-dimensional knowledge however you see fit, as in the original GPT, BERT, adapters, etc., you get even immenser and higher-dimensional knowledge … presented through a tiny low-dimensional pinhole aperture.

The main reason I initially thought “no one would use few-shot learning like this” was the superior generalization performance of fine-tuning.  I figured that if you’re serious about a task, you’ll care enough to fine-tune for it.

I realize there’s a certain mereology problem with this argument: what is a “single task,” after all?  If each fine-tuned model incurs a large memory cost, you can’t be “serious about” many tasks at once, so you have to chunk your end goal into a small number of big, hard tasks.  Perhaps with few-shot, you can chunk into smaller tasks, themselves achievable with few-shot, and then compose them.

That may or may not be practical depending on the latency scaling.  But if it works, it gives few-shot room for a potential edge.  You might be serious enough about a large task to fine-tune for it … but what if you can express it as a composition of smaller tasks you’ve already defined in the few-shot framework?  Then you get it instantly.

This is a flaw in the generalization performance argument.  Because of the flaw, I didn’t list that argument above.  The list above provides more reasons to doubt few-shot above and beyond the generalization performance argument, and again in the context of “serious” work where you care enough to invest some time in getting it right.

I’d like to especially highlight points like (-2) and (-3) related to scaling with additional task data.

The current enthusiasm for few-shot and meta-learning – that is, for immediate transfer to new domains with an extremely low number of domain examples – makes sense from a scientific POV (humans can do it, why can’t AI?), but strikes me as misguided in applications.

Tiny data is rare in applied work, both because products generate data passively, and because if a task might be profitable, then it’s worth paying an expert to sit down for a day or two and crank out ~1K annotations for supervised learning.  And with modern NLP like ELMo and BERT, ~1K is really enough!

It’s worth noting that most of the superGLUE tasks have <10K train examples, with several having only a few hundred.  (This is a “low-data regime” relative to the expectations of the recent past, but a regime where you can now get good results with a brainless cookie-cutter finetuning approach, in superGLUE as in the rest of life.)

[image: table of SuperGLUE tasks and their training-set sizes]

GPT-3 few-shot can perform competitively on some of these tasks while pushing that number down to 32, but at the cost of many downsides, unknowns, and flexibility limitations.  Which do you prefer: taking on all those risks, or sitting down and writing out a few more examples?

The trajectory of my work in data science, as it happens, looks sort of like a move from few-shot-like approaches toward finetuning approaches.

My early applied efforts assumed that I would never have the kind of huge domain-specific corpus needed to train a model from scratch, so I tried to compose the output of many SOTA models on more general domains.  And this … worked out terribly.  The models did exactly what they were trained to do, not what I wanted.  I had no way to scale, adapt or tune them; I just accepted them and tried to work around them.

Over time, I learned the value of doing exactly what you want, not something close to it.  I learned that a little bit of data in your actual domain, specifying your exact task, goes much further than any domain-general component.  Your applied needs will be oddly shaped, extremely specific, finicky, and narrow.  You rarely need the world’s greatest model to accomplish them – but you need a model with access to a very precise specification of exactly what you want.

One of my proudest ML accomplishments is a system that does something very domain-specific and precisely shaped, using LM-pretrained components plus supervised learning on ~1K of my own annotations.  Sitting down and personally churning out those annotations must have been some of the most valuable time I have ever spent at work, ever.  

I wanted something specific and finicky and specialized to a very particular use case.  So I sat down and specified what I wanted, as a long list of example cases.  It took a few days … and I am still reaping the benefits a year later.

If the few-shot users are working in domains anything like mine, they either know some clever way to evade this hard-won lesson, or they have not yet learned it.

But to the other question … why are people so keen to apply GPT-3 few-shot learning in applications?  This question forks into “why do end users think this is a good idea?” and “why did OpenAI provide an API for doing this?”

I know some cynical answers, which I expect the reader can imagine, so I won’t waste your time writing them out.  I don’t actually know what the non-cynical answers look like, and my ears are open.

(For the record, all of this only applies to few-shot.  OpenAI is apparently going to provide finetuning as a part of the API, and has already provided it to AI Dungeon.  Finetuning a model with 175B parameters is a whole new world, and I’m very excited about it.

Indeed, if OpenAI can handle the costs of persisting and running finetuned GPT-3s for many clients, all of my concerns above are irrelevant.  But if typical client use of the API ends up involving a finetuning step, then we’ll have to revisit the GPT-3 paper and much of the ensuing discussion, and ask when – if not now – we actually expect finetuning to become obsolete, and what would make the difference.)

how does gpt2’s training corpus capture internet discussion?  not well

I’m out sick today, but had enough energy to do some GPT-related fiddling around.

This time, I was curious what “internet discussions” tended to look like in the original training corpus.  I thought this might point to a more natural way to represent tumblr threads for @nostalgebraist-autoresponder​ than my special character trick.

So, I looked around in the large shard provided as part of https://github.com/openai/gpt-2-output-dataset.

Colab notebook here, so you can interactively reproduce my findings or try similar things.

—–

The results were … revealing, but disappointing.  I did find a lot of discussion threads in the data (couldn’t find many chatlogs).  But

- almost all of it is from phpBB-like forums (not bad per se, but weird)

- it chooses a single post from each page and makes it “a text,” ignoring all the other posts, so no way for GPT2 to learn how users talk to each other :(

- sometimes the post quotes another user… and in that case, you can’t see where the quote starts and the post begins

- lots of hilarious formatting ugliness, like “Originally Posted by UbiEpi Go to original post Originally Posted by”

- about 0.28% of the corpus (~22000 docs in full webtext) consists of these mangled forum posts

- also, just as a chilling sidenote, about 0.30% of the corpus (~25200 docs in full webtext) is badly mangled pastebin dumps (all newlines removed, etc).  no overlap between these and the mangled forum threads, so between them that’s ~0.58% of the corpus.

- remember: the vast majority of the corpus is news and the like, so these percentages aren’t as small as they might sound

For example, from this thread it picks the one post

[image: screenshot of the forum thread post]

and renders it as

“ Pillowapnts

tho the gem doesnt specifically say that you need to crit with something linked, i thought it was just crit in general ahhh. alright i get it thxtho the gem doesnt specifically say that you need to crit with something linked, i thought it was just crit in general

That would be OP That would be OP Posted by Lordsidro

on on Quote this Post

This is apparently standard behavior for the newspaper text cleaner they used, and I could reproduce it exactly.  (Its heuristics grab a single post when looking for the “part the content is in.”)
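
If you want to reproduce this yourself, the whole thing is a few lines with the newspaper library (the URL below is a placeholder, not the actual thread):

```python
# Reproducing the extraction behavior with the `newspaper` text extractor.
from newspaper import Article

article = Article("https://example-forum.com/viewtopic.php?t=12345")  # placeholder URL
article.download()
article.parse()

# newspaper's heuristics look for "the part the content is in", which on a
# phpBB-style page tends to be a single post; the rest of the thread is dropped.
print(article.text)
```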

Does this affect GPT-3?  Probably not?  I don’t know how Common Crawl does text extraction, but at the very least, it’ll give you the whole page’s worth of text.  [EDIT: this was wrong, see here]

covid-19 notes, 4/19/20

Brain-dumping some miscellaneous Covid-19 thoughts.  (Not going to respond to responses to this – this post uses up all my bandwidth for this topic for the moment)

“Mind viruses”

[if you’re skimming, this is the least interesting part of the post IMO]

In late March I wrote this big, long dramatic proclamation about information cascades and stuff.

Back then, it felt like the situation in the US was at an inflection point – at various kinds of inflection point – and I was feeling this particular combination of anxiety and passion about it.  A do-or-die emotion: something was happening quickly, we had a limited window in which to think and act, and I wanted to do whatever I could to help.  (”Whatever I could do” might be little or nothing, but no harm in trying, right?) 

I felt like the intellectual resources around me were being under-applied – the quality of the discussion simply felt worse than the quality of many discussions I’d seen in the past, on less important and time-sensitive topics.  I did my best to write a post urging “us” to do better, and I’m not sure I did very well.

In any event, those issues feel less pressing to me now.  I don’t think I was wrong to worry about epistemically suspect consensus-forming, but right now the false appearance of a consensus no longer feels like such a salient obstacle to good decision-making.  We’ve seen a lot of decisions made in the past month, and some of them have been bad, but the bad ones don’t reflect too much trust in a shaky “consensus,” they reflect some other failure mode.

Bergstrom

Carl Bergstrom’s twitter continues to be my best source for Covid-19 news and analysis.

Bergstrom follows the academic work on Covid-19 pretty closely, generally discussing it before the press gets to it, and with a much higher level of intellectual sophistication while still being accessible to non-specialists.

He’s statistically and epistemically careful to an extent I’ve found uncommon even among scientists: he’s comfortable saying “I’m confused” when he’s confused, happily acknowledges his own past errors while leaving the evidence up for posterity, eloquently critiques flawed methodologies without acting like these critiques prove that his own preferred conclusions are 100% correct, etc.

I wish he’d start writing this great stuff down somewhere that’s easier to follow than twitter, but when I asked him about starting a blog he expressed a preference to stay with twitter. 

I was actually thinking about doing a regular “Bergstrom digest” where I blog about what I’ve learned from his twitter, but I figured it’d be too much work to keep up.  I imagine I’ll contribute more if I write up the same stuff in a freeform way when I feel like it, as I’m doing now.

So, if you’re following Covid-19 news, be sure to read his twitter regularly, if you aren’t already.

IHME

The Covid-19 projections by the IHME, AKA “the Chris Murray model,” are a hot topic right now.

  • On the one hand, they have acquired a de facto “official” status.

    CNN called it “the model that is often used by the White House.”  In other news stories it’s regularly called “influential” or “prominent.”  I see it discussed at work as though it’s simply “the” expert projection, full stop.  StatNews wrote this about it:

    The IHME projections were used by the Trump administration in developing national guidelines to mitigate the outbreak. Now, they are reportedly influencing White House thinking on how and when to “re-open” the country, as President Trump announced a blueprint for on Thursday.

    I don’t know how much the IHME work is actually driving decision-making, but if anyone’s academic work is doing so, it’s the IHME’s.

I find this situation frustrating in a specific way I don’t know the right word for.  The IHME model isn’t interestingly bad.  It’s not intellectually contrarian, it’s just poorly executed.  The government isn’t trusting a weird but coherent idea, they’re just trusting shoddy work.

And this makes me pessimistic about improving the situation.  It’s easy to turn people against a particular model if you can articulate a specific way that the model is likely to misdirect our actions.  “It’s biased in favor of zigging, but everything else says we should zag.  Will we blindly follow this model off a cliff?”  That’s the kind of argument you can imagine making its way to the news.

But the real objection to the IHME’s model isn’t like this.  Because it’s shoddy work, it sometimes makes specific errors identifiable as such, and you can point to these.  But this understates the case: the real concern is that trusting shoddy work will produce bad consequences in general, i.e. about a whole set of bad consequences past and future, and the ones that have already occurred are just a subset.

I feel like there’s a more general point here.  I care a lot about the IHME’s errors for the same reason I cared so much about Joscha Bach’s bad constant-area assumption.  The issue isn’t whether or not these things render their specific conclusions invalid – it’s what it says about the quality of their thinking and methodology.

When someone makes a 101-level mistake and doesn’t seem to realize it, it breaks my trust in their overall competence – the sort of trust required in most nontrivial intellectual work, where methodology usually isn’t spelled out in utterly exact detail, and one is either willing to assume “they handled all the unmentioned stuff sensibly,” or one isn’t.

IHME (details)

Quick notes on some of the IHME problems (IHME’s paper is here, n.b. the Supplemental Material is worth reading too):

They don’t use a dynamic model, they use curve-fitting to a Gaussian functional form.  They fit these curves to death counts.  (Technically, they fit a Gaussian CDF – which looks sigmoid-like – to cumulative deaths, and then recover a bell curve projection for daily deaths by taking the derivative of the fitted curve.)
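
To make the setup concrete, here is a minimal sketch of that kind of fit – fake data invented for illustration, and my reading of the general approach rather than IHME’s actual code:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def cumulative_deaths(t, scale, peak_day, width):
    """Scaled Gaussian CDF: the sigmoid-like curve fit to cumulative deaths."""
    return scale * norm.cdf(t, loc=peak_day, scale=width)

# fake observations: days since outbreak start, cumulative deaths
t_obs = np.arange(30)
y_obs = np.array([0, 0, 1, 1, 2, 4, 6, 9, 14, 20, 29, 40, 55, 73, 95,
                  120, 150, 183, 219, 257, 297, 338, 379, 419, 458,
                  494, 528, 558, 585, 609], dtype=float)

params, _ = curve_fit(cumulative_deaths, t_obs, y_obs, p0=[1000, 25, 10])
scale, peak_day, width = params

# daily deaths are the derivative of the fitted CDF: a symmetric bell curve,
# so the projected decline always mirrors the ascent
t_future = np.arange(90)
daily = scale * norm.pdf(t_future, loc=peak_day, scale=width)
print(f"fitted peak on day {peak_day:.1f}; projected daily deaths at peak: {daily.max():.0f}")
```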

Objection 1.  Curve fitting to a time series is a weird choice if you want to model something whose dynamics change over time as social distancing policies are imposed and lifted.  IHME has a state-by-state model input that captures differences in when states implemented restrictions (collapsed down to 1 number), but it isn’t time-dependent, just state-dependent.  So their model can learn that states with different policies will tend to have differently shaped or shifted curves overall – but it can’t modify the shape of the curves to reflect the impacts of restrictions when they happened.

Objection 2.  Curve fitting produces misleading confidence bands.

Many people quickly noticed something weird about the IHME’s confidence bands: the model got more confident the further out in the future you looked.

How can that be possible?  Well, uncertainty estimates from a curve fit aren’t about what will happen.  They’re about what the curve looks like.

With a bell-shaped curve, it’s “harder” to move the tails of the curve around than to move the peak around – that is, you have to change the curve parameters more to make it happen.  (Example: the distribution of human heights says very confidently that 100-foot people are extremely rare; you have to really shift or squash the curve to change that.)

To interpret these bands as uncertainty about the future, you’d need to model the world like this: reality will follow some Gaussian curve, plus noise.  Our task is to figure out which curve we’re on, given some prior distribution over the curves.  If the curves were a law of nature, and their parameters the unknown constants of this law, this would be exactly the right thing to do.  But no one has this model of reality.  The future is the accumulation of past effects; it does not simply trace out a pre-determined arc, except in science fiction or perhaps Thomism.

Objection 3. Curve symmetry causes perverse predictions.

Bergstrom brought this up recently.  The use of a symmetric bell curve means the model will always predict a decline that exactly mirrors the ascent.

This creates problems when a curve has been successfully flattened and is being held at a roughly flat position.  The functional form can’t accommodate that – it can’t make the peak wider without changing everything else – so it always notices what looks like a peak, and predicts an immediate decline.  If you stay flat 1 more day, the model extends its estimated decline by 1 day.  If you stay flat 7 more days, you get 7 more days on the other side.  If you’re approximately flat, the model will always tell you tomorrow will look like yesterday, and 3 months from now will look like 3 months ago.

(Put another way, the model under-predicts its own future estimates, again and again and again.)

This can have the weird effect of pushing future estimates down in light of unexpectedly high current data: the latter makes the model update to an overall-steeper curve, which means a steeper descent on the other side of it.

(EDIT 4/20: wanted to clarify this point.

There are two different mechanisms that can cause the curve to decline back to zero: either R0 goes below 1 [i.e. a “suppressed” epidemic], or the % of the population still susceptible trends toward 0 [i.e. an “uncontrolled” epidemic reaching herd immunity and burning itself out].

If you see more cases than expected, that should lower your estimate of future % susceptible, and raise your estimate of future R0.  That is, the epidemic is being less well controlled than you expected, so you should update towards more future spread and more future immunity.

In an uncontrolled epidemic, immunity is what makes the curve eventually decline, so in this case the model’s update would make sense.  But the model isn’t modeling an uncontrolled epidemic – if its projections actually happen, we’ll be way below herd immunity at the end.

So the decline seen in the model’s curves must be interpreted as a projection of successful “suppression,” with R0 below 1.  But if it’s the lowered R0 that causes the decline, then the update doesn’t make sense: more cases than expected means higher R0 than expected, which means a less sharp decline than expected, not a sharper one.)
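Here’s a toy version of the mirror-image behavior itself (synthetic series, not the real pipeline): fit a symmetric bell curve to daily counts that rise and then hold flat, then hand the fit a week of additional flat days and refit.

import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def bell(t, scale, peak_day, width):
    return scale * norm.pdf(t, loc=peak_day, scale=width)

def flattened_series(n_days):
    # daily counts: rise like a Gaussian until day 30, then hold flat
    t = np.arange(n_days)
    rising = 1000 * norm.pdf(t, loc=30, scale=10)
    plateau = 1000 * norm.pdf(30, loc=30, scale=10)
    return np.where(t < 30, rising, plateau)

for n_days in (35, 42):                  # refit after 7 more flat days arrive
    t = np.arange(n_days)
    params, _ = curve_fit(bell, t, flattened_series(n_days), p0=[1000, 30, 10])
    scale, peak_day, width = params
    print(f"{n_days} days of data: fitted peak at day {peak_day:.1f}, "
          f"projected near-zero by around day {peak_day + 3 * width:.0f}")
# each batch of extra flat days pushes the fitted peak (and the whole
# mirror-image decline behind it) later, so the model keeps under-predicting
# its own next forecast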

This stuff has perverse implications for forecasts about when things end, which unfortunately IHME is playing up a lot – they’re reporting estimates of when each state will be able to lift restrictions (!) based on the curve dipping below some threshold.  (Example)

EDIT 4/20: forgot to link this last night, but there’s a great website http://www.covid-projections.com/ that lets you see successive forecasts from the IHME on the same axes.  So you can evaluate for yourself how well the model updates over time.

Words

I remain frustrated with the amount of arguing over whether we should do X or Y, where X and Y are ambiguous words which different parties define in conflicting ways.

FlattenTheCurve is still causing the same confusions.  Sure, whatever, I’ve accepted that one.  But in reading over some of the stuff I argued about in March, I’ve come to realize that a lot of other terms aren’t defined consistently even across academic work.

Mitigation and suppression

To Joscha Bach, “mitigation” meant the herd immunity strategy.  Bergstrom took him to task for this, saying it wasn’t what it meant in the field.

But the Imperial College London papers (1, 2) also appear to mean “herd immunity” by “mitigation.”  They form their “mitigation scenarios” by assuming a single peak with herd immunity at the end, and then computing the least-bad scenario consistent with those constraints.

When they come out in favor of “suppression” instead of “mitigation,” they are really saying that we must lower R0 far enough that herd immunity is off the table, and then basically wait for a vaccine, either under permanent restrictions or trigger-based on/off restrictions.

But the “mitigation” strategy imagined here seems like either a straw man, or possibly an accurate assessment of the bizarre bad idea they were trying to combat in the UK at that exact moment.

Even in the “mitigation scenarios,” some NPI is done.  Indeed, the authors consider the same range of interventions as in the “suppression scenarios.”  The difference is that, in “mitigation,” the policies are kept light enough that the virus still infects most of the population.  Here are some stats from their second paper:

If mitigation including enhanced social distancing is pursued, for an R0 of 3.0, we estimate a maximum reduction in infections in the range […] These optimal reductions in transmission and burden were achieved with a range of reductions in the overall rate of social contact between 40.0%- 44.9% (median 43.9%) […]

We also explored the impact of more rigorous social distancing approaches aimed at immediate suppression of transmission. We looked at 6 suppression scenarios […] the effects of widespread transmission suppression were modelled as a uniform reduction in contact rates by 75%, applied across all age-groups

In other words, if you still want herd immunity at the end, you can ask people to reduce their social contact by ~43% (which is a lot!), but not more.  The “mitigation” strategy as imagined here is bizarre: you have to be open to non-trivial NPI, open to asking your population to nearly halve their social interaction, but not willing to go further – specifically because you want the whole population to get infected.

Meanwhile, I’ve seen other academic sources use “mitigation” in closer to Bergstrom’s sense, as a general term for NPI and any other measures that slow the spread.  (That paper also uses “flatten” in this same generic way.)

Containment

When Bach writes “containment,” he seems to mean the thing called “suppression” by ICL.  (I.e. the good thing everyone wants, where you impose measures and don’t mysteriously stop them short of what would curtail herd immunity.)

When ICL write “containment,” they appear to mean something different.  Among their suppression scenarios, they compare one confusingly labelled “suppression” to another labelled “containment” – see their Fig. 3 in the 3/16 paper.  The difference is that, among interventions, “containment” lacks school closure but adds household quarantine.  This agrees with the intuitive meaning of “containment,” but differs from Bach’s usage and from Bergstrom’s (which themselves differ from each other).

Lockdown

I have no idea what this means.  Apparently I’m in one right now?  To Bach, it appears to mean (at least) city-level travel restrictions, a key component of Bach!Containment but not considered by ICL or other academics I’ve read.

While trying to Google this, I found this Vox article, which, well:

“The term ‘lock-down’ isn’t a technical term used by public health officials or lawyers,” Lindsay Wiley, a health law professor at the Washington College of Law, said in an email. “It could be used to refer to anything from mandatory geographic quarantine (which would probably be unconstitutional under most scenarios in the US), to non-mandatory recommendations to shelter in place (which are totally legal and can be issued by health officials at the federal, state, or local level), to anything in between (e.g. ordering certain events or types of businesses to close, which is generally constitutional if deemed necessary to stop the spread of disease based on available evidence).”

Hammer, dance

These probably mean something, but I cite all of the above as justification for my preemptive wariness about getting into any kind of argument about “whether we should do the hammer,” or who’s currently “doing the dance.”

mind viruses about body viruses

I was going to write this as a Slate Star Codex comment, but I’m going to make it a tumblr post tagging @slatestarscratchpad instead, since experience suggests it’s likely to be more widely and carefully read in this form.  (Crossposting to LW too, so you may be reading this there, possibly with mangled formatting.)

The idea frontier

I am getting more and more concerned about the “information epidemiology” of the public conversation about Covid-19.

Here are some distinctive features I see in the public conversation:

1. Information intake must be triaged.

There is a very large amount of new publicly available information every day.  There are no slow news days.  “Keeping up with the story” in the way one would keep up with an evolving news story would be a full-time job.  

Many of us do not have time to do this, and I imagine many of those who do have time cannot tolerate the experience in practice.  In fact, there can be a tradeoff between one’s level of personal involvement in the crisis and one’s ability to “follow it” as a news story.

(I work for a telemedicine company, and after a day of dealing with the ever-changing impacts of Covid-19 on my work, I have relatively little patience left to read about its ever-changing impacts on absolutely everything else.  That’s just me, though, and I realize some people’s mental bandwidth does not work like this.)

2. Abstractions are needed, and the relevant abstractions are novel and contested.

Crucial and time-sensitive decisions must be made on the basis of simulations, abstract mental models, and other intellectual tools.

In some sense this is true of everything, but in most cases we have a better sense of how to map the situation onto some large reference class of past intellectual work.  When there is an economic downturn, the standard macroeconomic arguments that have existed for many decades pop back up and make the predictable recommendations they always make; even though there is no expert consensus, the two or three common expert stances are already familiar.

With Covid-19, this is not so.  All the intervention types currently under discussion would be, in their own ways, unprecedented.  As it struggles to follow the raw facts, the general public is also struggling to get its head around terms and concepts like “suppression,” “containment,” “contact tracing,” etc. which were (in the relevant senses) not part of our mental world at all until recently.

Thus, relative to most policy debates, this one has a strange frontier energy, a sense that we’re all discovering something for the first time.  Even the professional epidemiologists are struggling to translate their abstract knowledge into brief-but-clear soundbites.  (I imagine many of them have never needed to be public communicators at this kind of scale.)

3. There is no division of labor between those who make ideas and those who spread them.

There is a hunger for a clear big picture (from #1).  There are few pre-established intellectual furnishings (#2).  This means there’s a vacuum that people very much want to fill.  By ordinary standards, no one has satisfying answers, not even the experts; we are all struggling to do basically the same intellectual task, simultaneously.

None of us have satisfying answers – we are all the same in that respect.  But we differ in how good we are at public communication.   At communicating things that sound like they could be answers, clearly, pithily.  At optimizing our words for maximum replication.

It is remarkable to me, just as a bare observation, that (in my experience) the best widespread scientific communication on Covid-19 – I mean just in the sense of verbal lucidity and efficiency, effective use of graphs, etc., not necessarily in the sense of accuracy or soundness – has been done by Tomas Pueyo, a formerly obscure (?) expert on … viral marketing.

(To be clear, I am not dismissing Pueyo’s opinions by citing his background.  I am hypothesizing his background explains the spread of his opinions, and that their correctness level has been causally inert, or might well have been.)

The set of ideas we use to understand the situation, and the way we phrase those ideas, is being determined from scratch as we speak.  Determined by all of us.  For the most part, we are passively allowing the ideas to be determined by the people who determine ideas in the absence of selection – by people who have specialized, not in creating ideas, but in spreading them.

4. Since we must offload much of our fact-gathering (#1) and idea-gathering (#2) work onto others, we are granting a lot on the basis of trust.

Scott’s latest coronavirus links post contains the following phrases:

Most of the smart people I’ve been reading have converged on something like the ideas expressed in […]

On the other hand, all of my friends who are actually worried about getting the condition are […]

These jumped out at me when I read the post.  They feel worryingly like an “information cascade” – a situation where an opinion seems increasingly credible as more and more people take that opinion partially on faith from other individually credible people, and thus spread it to those who find them credible in turn.

Scott puts some weight on these opinions on the basis of trust – i.e. not 100% from his independent vetting of their quality, but also to some extent from an outside view, because these people are “smart,” “actually worried.”  Likelier to be right than baseline, as a personal attribute.  So now these opinions get boosted to a much larger audience, who will take them again partially on trust.  After all, Scott Alexander trusts it, and he’s definitely smart and worried and keeping up with the news better than many of us.

What “most of the smart people … have been converging on,” by the way, is Tomas Pueyo’s latest post.

Is Tomas Pueyo right?  He is certainly good at seeming like a “smart” and “actually worried” person whose ideas you want to spread.  That in itself is enough.  I shared his first big article with my co-workers; at that time it seemed like a beacon of resolute, well-explained thought shining alone in a sea of fog.  I couldn’t pull off that effect as well if I tried, I think – not even if the world depended on it.  I’m not that good.  Are you?

My co-workers read that first post, and their friends did, and their friends.  If you’re reading this, I can be almost sure you read it too.  Meanwhile, what I am not doing is carefully reading the many scientific preprints that are coming out every week from people with more domain expertise, or the opinions the same people are articulating in public spaces (usually, alas, in tangled twitter threads).  That’s hard work, and I don’t have the time and energy.  Do you?

I don’t know if this is actually an effective metaphor – after all, I’m not a viral marketer – but I keep thinking of privilege escalation attacks.

It is not a bad thing, individually, to place some trust in a credible-sounding person without a clear track record.  We can’t really do otherwise, here.  But it is a bad thing when that trust spreads in a cascade, to your “smartest” friends, to the bloggers who are everyone’s smartest friends, to the levers of power – all on the basis of what is (in every individual transmission step) a tiny bit of evidence, a glimmer of what might be correctness rising above pure fog and static.  We would all take 51% accuracy over a coin flip – and thus, that which is accurate 51% of the time becomes orthodoxy within a week.

Most of the smart people you’ve been reading have converged on something like … 

#FlattenTheCurve: a case study of an imperfect meme

Keeping up with the lingo

A few weeks ago – how many? I can’t remember! – we were all about flattening the curve, whatever that means.

But this week?  Well, most of the smart people you’ve been reading have converged on something like: “flattening” is insufficient.  We must be “squashing” instead.  And (so the logic goes) because “flattening” is insufficient, the sound bite “flatten the curve” is dangerous, implying that all necessary actions fall under “flattening” when some non-flattening actions are also needed.

These are just words.  We should be wary when arguments seem to hinge on the meaning of words that no one has clearly defined.

I mean, you surely don’t need me to tell you that!  If you’re reading this, you’re likely to be a veteran of internet arguments, familiar from direct experience and not just theory with the special stupidity of merely semantic debates.  That’s to say nothing of the subset of my readership who are LessWrong rationalists, who’ve read the sequences, whose identity was formed around this kind of thing long before the present situation.  (I’m saying: you if anyone should be able to get this right.  You were made for this.)

It’s #FlattenTheCurve’s world, we just live in it

What did “flatten the curve” mean?  Did it mean that steady, individual-level non-pharmaceutical interventions would be enough to save hospitals from overload?  Some people have interpreted the memetic GIFs that way, and critiqued them on that basis.

But remember, #FlattenTheCurve went viral back when fretting about “coronavirus panic” was a mainstream thing, when people actually needed to be talked into social distancing.  The most viral of the GIFs does not contrast “flattening” with some other, more severe strategy; it contrasts it with nothing.  Its bad-guy Goofus character, the foil who must be educated into flattening, says: “Whatever, it’s just like a cold or flu.”

No one is saying that these days.  Why?  How did things change so quickly?  One day people were smugly saying not to panic, and then all of a sudden they were all sharing a string of words, a picture, something that captivated the imagination.  A meme performed a trick of privilege escalation, vaulted off of Facebook into the NYT and the WaPo and the WSJ and the public statements of numerous high officials.  Which meme? – oh, yes, that one.

We are only able to have this conversation about flattening-vs-squashing because the Overton Window has shifted drastically.  Shifted due to real events, yes.  But also due to #FlattenTheCurve.  The hand you bite may be imperfect, but it is the hand that feeds you.

Bach, the epidemiologists, and me

Joscha Bach thinks #FlattenTheCurve is a “lie,” a “deadly delusion.”  Because the GIF showed a curve sliding under a line, yet the line is very low, and the curve is very high, and we may never get there.

Is he right?  He is definitely right that the line is very low, and we may not slide under it.  Yet I was unimpressed.

For one thing, Bach’s argument was simply not formally valid: it depended on taking a static estimate of total % infected and holding it constant when comparing scenarios across which it would vary.

(This was one of several substantive, non-semantic objections I made.  One of them, the point about Gaussians, turned out to be wrong – wrong in the sense that granting my point could not have affected Bach’s conclusion, not in the sense that Bach could legitimately have reached his conclusion anyway.  This argument was my worst one, and the only one anyone seemed to notice.)

Something also seemed fishy about Bach’s understanding of “flatten the curve.”  The very expert from whom he got his (misused) static estimate was still tweeting about how we needed to flatten the curve.  All the experts were tweeting about how we needed to flatten the curve.  Which was more plausible: that they were all quite trivially wrong, about the same thing, at once?  Or that their words meant something more sensible?

The intersection of “world-class epidemiologists” and “people who argue on twitter” has now, inevitably, weighed in on Bach’s article.  For instance:

[tweet screenshots]

And I can’t resist quoting one more Carl Bergstrom thread, this one about another Medium post by a viral marketer (not the other one), in which Carl B’s making the exact same damn point I made about the static estimate:

[tweet screenshots]

Like me, these people make both substantive and semantic objections.  In fact, theirs are a strict superset of mine (see that last Bergstrom thread re: Gaussians!).

I am not saying “look, I was right, the experts agree with me, please recognize this.”  I mean, I am saying that.

But I’m also saying – look, people, none of this is settled.  None of us have satisfying answers, remember.  We are all stressed-out, confused glorified apes with social media accounts yelling at each other about poorly defined words as we try to respond to an invader that is ravaging our glorified-ape civilization.  Our minds cannot handle all this information.  We are at the mercy of viral sound bites, and the people who know how to shape them.

What is it the rationalists like to say?  “We’re running on corrupted hardware?”

Carl Bergstrom championed a meme, #FlattenTheCurve.  He believed it would work, and I think it in fact did.  But Carl Bergstrom, twitter adept though he may be, is still someone whose primary career is science, not consensus-making.  In a war of memes between him and (e.g.) Tomas Pueyo, I’d bet the bank on Pueyo winning.

And that is frightening.  I like Pueyo’s writing, but I don’t want to just let him – or his ilk – privilege-escalate their way into effective command of our glorified ape civilization.

I want us to recognize the kind of uncertainty we live under now, the necessity for information and idea triage, the resulting danger of viral soundbites winning our minds on virality alone because we were too mentally overwhelmed to stop the spread … I want us to recognize all of that, and act accordingly.

Not to retreat into the comfort of “fact-checking” and passive consultation of “the experts.”  That was always a mirage, even when it seemed available, and here and now it is clearly gone.  All of us are on an equal footing in this new frontier, all of us sifting through Medium articles, twitter threads, preprints we half understand.  There are no expert positions, and there are too many facts to count.

Not to trust the experts – but to exercise caution.  To recognize that we are letting a “consensus” crystalize and re-crystalize on the basis of cute dueling phrases, simplified diagrams and their counter-simplified-diagrams, bad takes that at least seem better than pure white noise, and which we elevate to greatness for that alone.  Maybe we can just … stop.  Maybe we can demand better.  Wash our minds’ hands, too.

Our intellectual hygiene might end up being as important as our physical hygiene.  Those who control the levers of power are as confused and stressed-out as you are, and as ready to trust viral marketers with firm handshakes and firm recommendations.  To trust whichever sound bite is ascendant this week.

Thankfully, you have some measure of control.  Because we are all on flat ground in this new frontier, your social media posts are as good as anyone’s; you can devote your mind to making ideas, or your rhetorical skill to promoting specifically those ideas you have carefully vetted.  You can choose to help those with power do better than the status quo, in your own little way, whatever that may be.  Or you can choose not to.

Okay, words aside, does the right strategy look like the famous GIF taken literally, or like a feedback system where we keep turning social distancing on and off so the graph looks like a heart rate monitor, or like a “hammer” reset followed by a successful emulation of South Korea, or …

I don’t know and you don’t know and Tomas doesn’t know and Carl doesn’t know.  It’s hard!  I hadn’t even heard of “R_0” until like two months ago!  Neither had you, probably!

Marc Lipsitch’s group at Harvard has been putting out a bunch of preprints and stuff that look reputable to me, and are being widely shared amongst PhDs with bluechecks and university positions.  Their most recent preprint, from 3 days ago, appears to be advocating the heart rate monitor-ish thing, so yay for that, maybe.  But … this sounds like the same information cascade I warned against, so really, I dunno, man.

However, I will suggest that perhaps the marginal effect of sharing additional reputable-seeming takes and crystalizing weekly orthodoxies is negative in expectation, given an environment saturated with very viral, poorly vetted words and ideas.

And that your best chance of a positive marginal impact is to be very careful, like the people who won’t trust any medical intervention until it has 50+ p-hacked papers behind it, has been instrumental in the minting of many PhDs, and has thereby convinced the strange beings at FDA and the Cochrane Collaboration who move at 1/100 the speed of you and me.  Not because this is globally a good way to be, but because it locally is – given an environment saturated with very viral, poorly vetted words and ideas.

That you should sit down, take the outside view, think hard about whether you can make a serious independent intellectual contribution when literally everyone on earth, basically, is trying to figure out the same thing.

And you know, maybe you are really smart!  Maybe the answer is yes!  If so, do your homework.  Read everything, more than I am reading, and more carefully, and be ready to show your work.  Spend more time on this than the median person (or me) is literally capable of doing right now.  This is the value you are claiming to provide to me.

If you can’t do that, that is fine – I can’t either.  But if you can’t do that, and you still boost every week’s new coronavirus orthodoxy, you are an intellectual disease vector.  Don’t worry: I will hear it from other people if I don’t hear it from you.  But you will lend your credibility to it.  Whatever trust I place in you will contribute to the information cascade.

This work, this hard independent work collecting lots of raw undigested information, is actually what Tomas Pueyo seems to be doing – I mean, apart from framing everything in a very viral way, which is why you and I know of his work.  We are saturated with signal-boosts of the few such cases that exist.  We do not need more signal-boosts.  We need more independent work like this.  Please do it.  Or, if not that, then be like the lady in that very problematic GIF: don’t panic, but be careful, wash your mind’s hands, and (yes) flatten the intellectual curve.

human psycholinguists: a critical appraisal

(The title of this post is a joking homage to one of Gary Marcus’ papers.)

I’ve discussed GPT-2 and BERT and other instances of the Transformer architecture a lot on this blog.  As you can probably tell, I find them very interesting and exciting.  But not everyone has the reaction I do, including some people who I think ought to have that reaction.

Whatever else GPT-2 and friends may or may not be, I think they are clearly a source of fascinating and novel scientific evidence about language and the mind.  That much, I think, should be uncontroversial.  But it isn’t.


“embedded self-justification,” or something like that

preamble

Sometimes I wonder what the MIRI-type crowd thinks about some issue related to their interests.  So I go to alignmentforum.org, and quickly get in over my head, lost in a labyrinth of issues I only half understand.

I can never tell whether they’ve never thought about the things I’m thinking about, or whether they sped past them years ago.  They do seem very smart, that’s for sure.

But if they have terms for what I’m thinking of, I lack the ability to find those terms among the twists of their mirrored hallways.  So I go to tumblr.com, and just start typing.

parable (1/3)

You’re an “agent” trying to take good actions over time in a physical environment under resource constraints.  You know, the usual.

You currently spend a lot of resources doing a particular computation involved in your decision procedure.  Your best known algorithm for it is O(N^n) for some n.

You’ve worked on the design of decision algorithms before, and you think this could perhaps be improved.  But to find the improvement, you’d have to shift some resources away from running the algorithm for a time, putting them into decision algorithm design instead.

You do this.  Almost immediately, you discover an O(N^(n-1)) algorithm.  Given the large N you face, this will dramatically improve all your future decisions.

Clearly (…“clearly”?), the choice to invest more in algorithm design was a good one.

Could you have anticipated this beforehand?  Could you have acted on that knowledge?

parable (2/3)

Oh, you’re so very clever!  By now you’ve realized you need, above and beyond your regular decision procedure to guide your actions in the outside world, a “meta-decision-procedure” to guide your own decision-procedure-improvement efforts.

Your meta-decision-procedure does require its own resource overhead, but in exchange it tells you when and where to spend resources on R&D.  All your algorithms are faster now.  Your decisions are better, their guiding approximations less lossy.

All this, from a meta-decision-procedure that’s only a first draft.  You frown over the resource overhead it charges, and wonder whether it could be improved.

You try shifting some resources away from “regular decision procedure design” into “meta-decision-procedure-design.”  Almost immediately, you come up with a faster and better procedure.

Could you have anticipated this beforehand?  Could you have acted on that knowledge?

parable (3/3)

Oh, you’re so very clever!  By now you’ve realized you need, above and beyond your meta-meta-meta-decision-procedure, a “meta-meta-meta-meta-decision-procedure” to guide your meta-meta-meta-decision-procedure-improvement efforts.

Way down on the object level, you have not moved for a very long time, except to occasionally update your meta-meta-meta-meta-rationality blog.

Way down on the object level, a dumb and fast predator eats you.

Could you have anticipated this beforehand?  Could you have acted on that knowledge?
