
What I’ve been doing lately in Frank development:

  1. Switching the ML stuff from tensorflow to pytorch.
  2. Replacing the generator model with one 2x as big, finetuned from the 2.7B GPT-Neo checkpoint released by EleutherAI. (This is the same size and architecture as the smallest GPT-3 model.)

#1 is basically done and I should be able to “flip the switch” in production soon, probably tomorrow.

#2 is nearly done on the development side, but might be too slow to be practical for Frank’s level of demand. No way to be sure without trying it.

The second was enabled by the first: I finetuned the EleutherAI model in tensorflow(-mesh), the same way they trained it, then spent like a week going down a Pepe Silvia-style rabbit hole trying to figure out how to do inference with the damn thing.

…then I converted it to pytorch and it instantly worked like a charm. Like 15 minutes of work after spending days on the tf version (actually rewriting and rebuilding parts of tf itself from source by the tail end of my quixotic efforts).

I’d been meaning to switch the project to pytorch for a long time, and this was the last straw.
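For anyone attempting a similar conversion, the fiddly part is usually matching up parameter names between the two checkpoints (and remembering that TF dense “kernel” matrices are the transpose of PyTorch “weight” matrices). Here is a purely hypothetical sketch of the name-mapping step; none of these variable names are the real GPT-Neo ones, and the real mapping has to be read off the two codebases:

```python
import re

# Hypothetical sketch: mapping TF(-mesh)-style checkpoint variable names
# to PyTorch-style parameter names. Illustrative names only, not the
# real GPT-Neo naming scheme.
def tf_name_to_torch_name(tf_name: str) -> str:
    """Map e.g. 'gpt/h3/attn/q_proj/kernel' -> 'transformer.h.3.attn.q_proj.weight'."""
    name = tf_name.replace("gpt/", "transformer/")
    name = re.sub(r"/h(\d+)/", r"/h/\1/", name)  # layer index becomes its own component
    name = name.replace("/kernel", "/weight")    # TF "kernel" is torch "weight" (and needs a transpose)
    return name.replace("/", ".")

print(tf_name_to_torch_name("gpt/h3/attn/q_proj/kernel"))
```

The actual conversion then walks the checkpoint, renames each array this way, transposes the kernels, and loads the result as a PyTorch state dict.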

the-moti:

nostalgebraist:

meta-post on meta-learning

There’s an LW post I keep trying to write. I have several unpublished draft versions of it.

The point I want to make is simple and straightforward, but when I try to write it down, I get worried I’m not … like, “messaging” it correctly? Not striking the right tone?

The point of the post is roughly:

People don’t use the term “meta-learning” consistently when they’re talking about GPT-3. The paper uses the term one way (and they are 100% explicit, they spell out their definition in the text), the blogging community uses it another way.

The bloggers are excited/scared that GPT-3 does “meta-learning” by which they mean something like “general reasoning on the fly without training.”

If you’re excited/scared by this capability (and you should be), then you should really care whether GPT-3 actually has it, to what extent, how the capability scales, etc.

There is very little public evidence on this topic, because the paper is (explicitly!) 95% not about the topic, the remaining 5% is pretty weak evidence, and the only other evidence out there is like … some subjective user impressions? gwern saying “GPT-3 has the capability” in a really eloquent and forceful way?

It would be easy to test the capability much more rigorously than this. This ought to be done since the topic is important. It can only be done by people with API access (AI Dungeon doesn’t count).

But it … feels hard to say this in a way that could actually convince anyone who doesn’t already agree? Like,

  1. These points seem so clearly true to me that when I try to “argue for them,” I feel pedantic and like I’m belaboring the obvious.

    Do I actually have to say “no, few-shot translation from French to English is not an example of general reasoning on the fly?” Surely no one thinks the model is like … learning how to speak French from ~2000 words of data?

    Do I have to quote the part of the paper where it says what it means by meta-learning? It’s right there! You can just read the paper!
  2. I made most of this argument already in my original GPT-3 post, immediately after reading the paper. So (A) I feel like I’m repeating myself and (B) if the point didn’t get across then, why would it now?
  3. There is an element of “mere semantics” to the point and it’s hard to clarify to my satisfaction that no, I don’t just care that blog posts are using a word incorrectly. But I have to bring up the semantic issue to even describe what I am saying.
  4. It feels inevitably like picking on gwern’s choice of words, since blogosphere beliefs about “GPT-3 meta-learning” basically all trace back to gwern’s blog.

    I don’t care about whether gwern is using the right words, he’s just the most detailed “primary source” we have on the topic due to the closed API

I was thinking about this yesterday because @slatestarscratchpad linked my original GPT-3 post in his April linkpost. I actually sat down and wrote up another one of those drafts and … nope, gave up again.

I notice I am able to write this on tumblr with no problems at all. Perhaps this is yet another point of evidence that using tumblr lets me do much more “real blogging” than I could if I had “a real blog.”

Could you write a blog post whose goal was to convince people to do the experiments (regardless of their semantic interpretation of them), rather than to convince people to change their interpretation of the existing experiments?

This seems like a way to move the discussion forwards in a concrete sense.

The overlap of “people with GPT-3 API access” and “lesswrong readers” is not large but it is nonzero (although maybe it’s just gwern and maybe he doesn’t want to do your experiments???). 

I share the sense that it’s best to focus on the proposed experiments – it’s actionable advice, it feels constructive, it tells the reader what kind of concrete evidence would affect my opinion.

The problem is that … well, to convince someone to do the experiments, I have to say why I care about the results. Otherwise, it’s just “hey, here’s a thing you could do, I guess.”

But the true answer to “why do you care about the results?” is simply the argument I described in OP, so we’re back to square one.

(Half-joking option: I could frame it as a sort of challenge, like I’m the “change my mind” meme guy, and I’ll just sit around holding an opinion the reader may find infuriatingly wrong… unless they do the thing I ask.)

—-

A little more background on my neurotic wariness here:

I have mixed feelings about LW 2.0 / Alignment Forum.

On the plus side, I see a lot of valuable discussion there which has no real substitute anywhere else. Many of the frequent posters in AI-related threads are students/alumni of Stuart Russell’s CHAI, or researchers at DeepMind, or things like that, and it’s an unrivaled opportunity to discuss big-picture AGI topics with people like this who understand the fine-grained picture too.

On the minus side, I’ve noticed that the emotional valence of my AI posts – “yay AI!” vs “boo AI!” – is a very good predictor of how well they will be received on LW, while the actual quality of the posts in my own estimation is a weak predictor at best.

  • My most “celebrated” New LW post – the one with the most karma, with the warmest comments, and the only one promoted to “curated” status – was a screed about how GPT-2 is awesome and Gary Marcus is epically wrong.
  • My most controversial post, with the most hostile comments section, was the one that called GPT-3 “disappointing.” This was so noticeable that one supportive reader left a comment about the anomalously poor reception, hoping I would not come away “discouraged from continuing to write about the limits of today’s AI.”
  • In the discussion surrounding the previous post, anything I said about the limits of scaling was treated with skepticism, or at best viewed as a contrarian take. When I later wrote a post on the exact same topic, but framed in a “yay AI!” manner (“OpenAI’s new insight”), it was warmly received.

From the perspective of debate norms, this is probably an unhelpfully uncharitable perspective to take. But one cannot help but notice patterns, and use them to plan one’s actions …

—-

On the specific topic at hand, I guess I’m still smarting from that response I got to the “disappointing GPT-3” post.

Hardly anyone seemed to understand what I was getting at in my comments about the multiple competing interpretations of few-shot results.

Several commenters seemed immediately ready to believe GPT-3 was a fully general reasoner, as though this were a safe default hypothesis with a lot of prior mass, one that could only be toppled from its throne by great evidential weight to the contrary. One comment directly put the shoe on the other foot, asking me: “Are there bits of evidence against general reasoning ability in GPT-3?”

Another pattern running through several comments was an odd sense that because the paper was long and did so many things, it was therefore unreasonable or unfair (?) to complain about it not doing any given thing. This is effectively impossible to argue with: what can one do, faced with “the paper was 70 pages and did dozens of experiments, therefore it must have established [whatever claim I’m arguing it established]”?

That one is particularly discouraging re: getting people to try experiments with GPT-3, as I expect this to be perceived as yet another “isolated demand for rigor.”

My long, frankly exasperating exchange with dxu (starting here) is difficult to summarize, but it contributed to my pessimism about any further attempts to discuss the topic on LW. The same goes for my bizarre exchange with gwern (starting here).

Wow, sorry about the rant … I guess I never wrote the equivalent of this rant at the time, so here I am finally writing it almost a year later.

fell-walker asked:

My guess is that this is still roughly true, but I'm not sure. OA doesn't pay for enterprise slack, so old messages past the last 10,000 (goes back to December 3rd at this moment) aren't available, so I can't see if the announcement says anything about finetuning.

My model of finetuning is that finetuning is expensive and OA doesn't know what use-cases fit it and there's much less demand, so they're not prioritizing it. I remember they've said that they're working with a select few clients, probably by the channel you mentioned, and also are working on providing a fine-tuning API. In the early beta, fine-tuning was available to people, but only for small models (relative to 175B parameters) and afaik there has been no easy access to finetuning the 175B model, which I think is a common misconception (I might be wrong, but I'd be somewhat surprised).

I see, thanks for the detailed info!

I have definitely seen people assume that finetuning would be a standard part of the typical GPT-3 workflow, e.g. this comment.

But that was back when OA were saying they would “add finetuning to the API” soon, or something to that effect, which made finetuning sound easier to support at scale than it (apparently) has turned out to be.

fell-walker asked:

Not that I know of, it was announced on the API Slack late November.

I see, thanks!  (Re: this)

I’m curious about any changes to finetuning.

In the tiers system, finetuning was reserved for the “Scale tier,” which didn’t have a set price and looked like something where a corporate client would contact OpenAI and hash out a customized agreement.

Is that still true (perhaps with the “tier” terminology changed)?

fell-walker asked:

Nowadays GPT-3 doesn't have price-tiers, you pay 6 cents per thousand tokens.

Thanks for the update.

Has anyone else posted about this change in a publicly accessible place?  I went looking on Google but couldn’t find anything, just a lot of commentary on the tiers-based pricing that was reported earlier.

youzicha asked:

In your faq you say you have no plans to make nostalgebraist-autoresponder use gpt-3, why not? Will it be too expensive to finetune?

For one thing, I currently have no way to use GPT-3 at all: I signed up for OpenAI’s waitlist long ago and haven’t heard back.

(I’m not surprised or mad about this, I expect they’ve gotten a huge number of requests and I’m not especially high profile)

Even if I did have API access, it would be too expensive to host/serve, even without finetuning.

As for finetuning – which, for my project, is a hard requirement – I doubt I’d be allowed to do it at all.  It’s only available at their “Scale” pricing tier, which looks aimed at corporate clients:

image

To sum up, using GPT-3 in production is not really comparable to using GPT-2 in production in terms of cost and accessibility.

GPT-2 is accessible to hobbyists.  You can run GPT-2 on standard data center GPUs like T4s, which you can rent from Google for prices ranging from “low” to “zero” depending on your tolerance for timeouts.  You can finetune it pretty quickly on a standard TPU (not even a pod).

GPT-3 was a deliberate exercise in pushing the limits of current hardware.  The level of hardware and support needed to run it in production is not something the average hobbyist can expect to access.

the scaling “inconsistency”: openAI’s new insight

I’ve now read the new OpenAI scaling laws paper.  Also, yesterday I attended a fun and informative lecture/discussion with one of the authors.

While the topic is on my mind, I should probably jot down some of my thoughts.

This post is mostly about what the new paper says about the “inconsistency” brought up in their previous paper.

The new paper has a new argument on this topic, which is intuitive and appealing, and suggests that the current scaling trend will indeed “switch over” soon to a new one where dataset size, not model size, is the active constraint on performance.  Most of this post is an attempt to explain and better understand this argument.

——

The new paper is mainly about extending the scaling laws from their earlier paper to new modalities.

In that paper, they found scaling laws for transformers trained autoregressively on text data.  The new paper finds the same patterns in the scaling behavior of transformers trained autoregressively on images, math problems, etc.

So the laws aren’t telling us something about the distribution of text data, but about something more fundamental.  That’s cool.

They also have a new, very intuitive hypothesis for what’s going on with the “scaling inconsistency” they described in the previous paper – the one I made a big deal about at the time.  So that’s the part I’m most excited to discuss.

I’m going to give a long explanation of it, way longer than the relevant part of their paper.  Some of this is original to me, all errors are mine, all the usual caveats.

——

1. L(C) and L(D)

To recap: the “inconsistency” is between two scaling laws:

  • The law for the best you can do, given a fixed compute budget.

    This is L(C), sometimes called L(C_min).  L is the loss (lower = better), C is your compute budget.

  • The law for the best you can do, given a fixed dataset size.

    This is L(D), where D is the number of examples (say, tokens) in the dataset.

Once you reach a certain level of compute, these two laws contradict each other.

I’ll take some time to unpack that here, as it’s not immediately obvious the two can even be compared to one another – one is a function of compute, the other of data.

2. C sets E, and E bounds D

Budget tradeoffs

Given a compute budget C, you can derive the optimal way to spend it on different things.  Roughly, you are trading off between two ways to spend compute:

  • Use C to buy “N”: Training a bigger model – “N” here is model size

  • Use C to buy “S”: Training for more steps “S” (gradient updates)

The relationship between S (steps) and D (dataset size) is a little subtle, for several reasons.

From step count to update count

For one thing, each single “step” is an update on the information from more than one data point.  Specifically, a step updates on “B” different points – B is the batch size.

So the total number of data points processed during training is B times S.  The papers sometimes call this quantity “E” (number of examples), so I’ll call it that too.

From update count to data count

Now, when you train an ML model, you usually update on each data point more than once.  Typically, you’ll do one pass over the full dataset (updating on each point as you go along), then you’ll go back and do a second full pass, and then a third, etc.  These passes are called “epochs.”

If you’re doing things this way, then for every point in the data, you get (number of epochs) updates out of it.  So

E = (number of epochs) * D.  

Some training routines don’t visit every point the exact same number of times – there’s nothing forcing you to do that.  Still, for any training procedure, we can look at the quantity E / D.

This would be the number of epochs, if you’re doing epochs.  For a generic training routine, you can think of E / D as the “effective number of epochs”: the average number of times we visit each point, which may not be an integer.

Generally, E ≠ D, but we always have E ≥ D.  You can’t do fewer than one epoch; you can’t visit the average point less than once.

This is just a matter of definitions – it’s what “dataset size” means.  If you say you’re training on a million examples, but you only update on 100 individual examples, then you simply aren’t “training on a million examples.”
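These definitions are easy to sanity-check with a few lines of Python (all numbers made up):

```python
# Sanity-checking the E / D bookkeeping above with made-up numbers.
B = 512          # batch size: data points per gradient step
S = 100_000      # steps (gradient updates)
D = 20_000_000   # dataset size: distinct data points

E = B * S                  # total data points processed during training
effective_epochs = E / D   # average number of visits per data point

print(E, effective_epochs)
assert E >= D  # you can't visit the average point less than once
```

With these toy numbers, training processes 51.2M points, i.e. 2.56 effective epochs over the 20M-point dataset.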

3. The inconsistency

L(D): information

OpenAI derives a scaling law called L(D).  This law is the best you could possibly do – even with arbitrarily large compute/models – if you are only allowed to train on D data points.

No matter how good your model is, there is only so much it can learn from a finite sample.  L(D) quantifies this intuitive fact (if the model is an autoregressive transformer).

L(C): budgeting

OpenAI also derives another scaling law, L(C).  This is the best you can do with compute C, if you spend it optimally.

What does optimal spending look like?  Remember, you can spend a unit of compute on 

  • a bigger model (N), or 
  • training the same model for longer (S)

(Sidenote: you can also spend on bigger batches B.  But – to simplify a long, complicated story – it turns out that there are really just 2 independent knobs to tune among the 3 variables (B, N, S), and OpenAI frames the problem as tuning (N, S) with B already “factored out.”)

In the compute regime we are currently in, making the model bigger is way more effective than taking more steps.

This was one of the punchlines of the first of these two papers: the usual strategy, where you pick a model and then train it until it’s as good as it can get, is actually a suboptimal use of compute.  If you have enough compute to train the model for that long (“until convergence”), then you have enough compute to train a bigger model for fewer steps, and that is a better choice.

This is kind of counterintuitive!  It means that you should stop training your model before it stops getting better.  (“Early stopping” means training your model until it stops getting better, so this is sort of “extra-early stopping.”)  It’s not that those extra steps wouldn’t help – it’s that, if you are capable of doing them, then you are also capable of doing something else that is better.

Here’s something cool: in Appendix B.2 of the first paper, they actually quantify exactly how much performance you should sacrifice this way.  Turns out you should always stop at a test loss about 10% higher than what your model could asymptotically achieve.  (This will be relevant later, BTW.)
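To make the “extra-early stopping” rule concrete, here is a toy sketch.  The 10% figure is the paper’s; the loss curve below is an invented exponential decay (real learning curves are closer to power laws, but the logic of the stopping rule is the same):

```python
import math

# Toy illustration of "extra-early stopping": train only until test loss
# is within 10% of this model's converged (asymptotic) loss, then stop
# and spend remaining compute on a bigger model instead.
# The loss curve is a made-up exponential decay, not a real model.
L_converged = 2.0   # asymptotic loss of this model (made up)
L_initial = 4.0     # loss at step 0 (made up)

def loss_at_step(t: int) -> float:
    return L_converged + (L_initial - L_converged) * math.exp(-t / 1000)

stop_threshold = 1.1 * L_converged  # stop 10% above the converged loss

t = 0
while loss_at_step(t) > stop_threshold:
    t += 1
print(t, loss_at_step(t))
```

Under these made-up numbers you stop around step 2300, even though the model would keep improving toward a loss of 2.0 if you let it run.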

Anyway, OpenAI derives the optimal way to manage the tradeoff between N and S.  Using this optimal plan, you can derive L(C) – the test loss you can achieve with compute C, if you allocate it optimally.

N goes up fast, S goes up slowly…

The optimal plan spends most incremental units of compute on bigger models (N).  It spends very little on more steps (S).

The amount it spends on batch size (B) is somewhere in between, but still small enough that the product E = B*S grows slowly.

But remember, we know a relationship between E and “D,” dataset size.  E can’t possibly be smaller than D.

So when your optimal plan chooses its B and its S, it has expressed an opinion about how big its training dataset is.

The dataset could be smaller than B*S, if we’re doing many (effective) epochs over it.  But it can’t be any bigger than B*S: you can’t do fewer than one epoch.

… and you claim to achieve the impossible

L(C), the loss with optimally allocated C, goes down very quickly as C grows.  Meanwhile, the dataset you’re training with that compute stays almost the same size.

But there’s a minimum loss, L(D), you can possibly achieve with D data points.

The compute-optimal plan claims “by training on at most B*S data points, with model size N, I can achieve loss L(C).”

The information bound says “if you train on at most B*S data points, your loss can’t get any lower than the function L(D), evaluated at D = B*S.”

Eventually, with enough compute, the L(C) of the compute-optimal plan is lower than the L(D) of the dataset used by that same plan.

That is, even if the compute-optimal model is only training for a single epoch, it claims to extract more value from that epoch than any model could ever extract from that data, given any number of epochs.

That’s the inconsistency.
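Here is a numeric caricature of the collision.  The exponents are in the rough ballpark of the ones the papers report (about 0.050 for L(C), 0.095 for L(D), with E growing like C^0.27), but the constants are invented, so the location of the crossover below means nothing quantitatively:

```python
# Numeric caricature of the inconsistency. Exponents loosely follow the
# scaling-laws papers; constants are invented, so the crossover point
# found here is not a real prediction.
def L_C(C):      # compute-optimal loss as a function of compute
    return (1e8 / C) ** 0.050

def E_of_C(C):   # data processed by the compute-optimal plan at compute C
    return 1e9 * (C / 1e8) ** 0.27

def L_D(D):      # information-limited loss floor for D data points
    return (1e8 / D) ** 0.095

# Scan compute budgets by decades: eventually the plan "claims" a loss
# below the floor L(D) of its own dataset -- that's the inconsistency.
crossover = None
C = 1e8
while C < 1e30:
    if L_C(C) < L_D(E_of_C(C)):
        crossover = C
        break
    C *= 10
print(crossover)
```

With these invented constants the two curves cross after a few decades of compute growth; the paper’s point is that the real curves must cross too, if L(C) is extrapolated forever.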

4. The resolution

In the new paper, there’s an intuitive hypothesis for what’s going on here.  I don’t think it really needs the multimodal results to motivate it – it’s a hypothesis that could have been conceived earlier on, but just wasn’t.

Bigger models extract a resource faster

The idea is this.  As models get bigger, they get more update-efficient: each time they update on a data point, they get more out of it.  You have to train them for fewer (effective) epochs, all else being equal.

This fact drives the choice to scale up the model, rather than scaling up steps.  Scaling up the model makes your steps more valuable, so when you choose to scale the model rather than the steps, it’s almost like you’re getting more steps anyway.  (More “step-power,” or something.)

The resource is finite

Each data point has some information which a model can learn from it.  Finite models, trained for a finite amount of time, will miss out on some of this information.

You can think about the total extractable information in a data point by thinking about what an infinitely big model, trained forever, would eventually learn from that point.  It would extract all the information – which is more than a lesser model could extract, but still finite.  (A single data point doesn’t contain all the information in the universe.)

This is literally the definition of L(D): what an infinitely big model, trained forever, could learn from D separate data points.  L(D) quantifies the total extractable information of those points.

(More precisely, the total extractable information is the gap between L(D) and the loss achieved by a maximally ignorant model, or something like that.)

Converging in the very first step

As models get bigger, they extract more information per update.  That is, each time they see a data point, they extract a larger fraction of its total extractable information.

Eventually, your models are getting most of that information the very first time they see the data point.  The “most” in that sentence gets closer and closer to 100%, asymptotically.

How does this relate to optimal compute allocation?

The logic of the “optimal compute plan” is as follows:

Your model is an imperfect resource extractor: it only gets some of the resources locked up in a data point from the first update.  So you could extract more by running for more steps … 

…  but if you have the compute for that, you can also spend it by making your steps more efficient.  And, in the current compute regime, that’s the smarter choice.

It’s smarter by a specific, uniform proportion.  Remember, you should stop training when your loss is 10% higher than the converged loss of the same model.  If the converged loss is L, you should stop at 1.1*L.

Can you always do that?  If your model is efficient enough, you can’t!  As the first epoch gets closer to 100% efficient, the loss after the first epoch gets arbitrarily close to the converged loss.  Your loss goes under 1.1*L by the end of the first epoch.

At this point, the story justifying the L(C) law breaks down.

The L(C) law goes as fast as it does because upgrading the efficiency of your extractor is cheaper – in terms of compute spent per unit of resource extracted – than actually running the extractor.

This works as long as your extractor is inefficient.  But you can’t push efficiency above 100%.  Eventually, the only way to extract more is to actually run the damn thing.
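A toy way to see the breakdown: suppose (purely for illustration) that each pass over the data closes a fraction f of the gap between current loss and converged loss, with bigger models having bigger f.  Once f is large enough, a single epoch already lands below the 1.1*L stopping point, and there is no room left for “extra-early stopping”:

```python
# Toy model of "converging in the very first epoch." Assume, purely for
# illustration, that one pass over the data closes a fraction f of the
# gap between current loss and the converged loss; bigger models have
# bigger f. Once f is large enough, one epoch already lands below the
# 1.1 * L_inf extra-early-stopping point.
L_inf, L0 = 2.0, 4.0   # made-up converged and initial losses

def loss_after_one_epoch(f: float) -> float:
    return L_inf + (L0 - L_inf) * (1 - f)

for f in (0.5, 0.8, 0.95):
    below = loss_after_one_epoch(f) < 1.1 * L_inf
    print(f, loss_after_one_epoch(f), below)
```

At f = 0.5 or 0.8 the stopping rule still applies; at f = 0.95 the first epoch already beats 1.1*L, which is the regime where the L(C) story stops making sense.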

Getting a bigger quarry

When you’re extracting a resource, there’s a difference between “improve the extractor” and “get a bigger quarry.”

If your quarry has 100 resource units in it, the strategy of “improving the extractor” can never get you more than 100 units.  It can get them to you faster, but if you want more than 100 units, you have to get a bigger quarry.

“N” sets the efficiency of the extractor.  “S” sets … well, it doesn’t exactly set the size of the quarry (that’s D).  There is an ambiguity in S: it could mean running for more epochs on the same data, or it could mean getting more data.

But S does, at least, set an upper bound on the size of the quarry, D.  (Via D≤E and E = B*S, with B set optimally as always.)

With high enough compute (and thus model size), you’ve pushed the “extractor upgrades are cheap” lifehack as far as it can go.  With this efficient extractor, taking S steps (thus making E = B*S updates) sucks up most of the information theoretically extractable from E individual data points.

The learning curve L(E) of your model, as it makes its first pass over the dataset, starts to merge with L(D), the theoretical optimum achievable with that same dataset.  You trace out L(D) as you train, and the relevant constraint on your performance is the maximum data size D you can obtain and train on.

Where we are now

In the compute regime that spans GPT-2 and the smaller variants of GPT-3, extraction is far less than maximally efficient.  The L(C) strategy applies, and the smart move is to spend compute mostly on model size.  So you make GPT-2, and then GPT-3.

Once we get to the full GPT-3, though, the extractor is efficient enough that the justification for L(C) has broken down, and the learning curve L(E) over the first epoch looks like L(D).

Here is that as a picture, from the new paper:

image

The yellowest, lowest learning curve is the full GPT-3.  (The biggest GPT-2 is one of the green-ish lines.)  The black line is L(D), maximally efficient extraction.

You can see the whole story in this picture.  If you’re in one of the smaller-model learning curves, running for more steps on more data will get you nowhere near to the total extractable info in that data.  It’s a better use of your compute to move downwards, toward the learning curve of a bigger model.  That’s the L(C) story.

If the L(C) story went on forever, the curves would get steeper and steeper.  Somewhere a little beyond GPT-3, they would be steeper than L(D).  They would cross L(D), and we’d be learning more than L(D) says is theoretically present in the data.

According to the story above, that won’t happen.  We’ll just converge ever closer to L(D).  To push loss further downward, we need more data.

Implications

Since people are talking about bitter lessons a lot these days, I should make the following explicit: none of this means “the scaling hypothesis is false,” or anything like that.

It just suggests the relevant variable to scale with compute will switch: we’ll spend less of our marginal compute on bigger models, and more of it on bigger data.

That said, if the above is true (which it may not be), it does suggest that scaling transformers on text alone will not continue productively much past GPT-3.

The GPT-3 paper says its choices were guided by the “grow N, not S” heuristic behind the L(C) curve:

Based on the analysis in Scaling Laws For Neural Language Models [KMH+20] we train much larger models on many fewer tokens than is typical.

(“KMH+20” is the first of the two scaling papers discussed here.)  Even following this heuristic, they still picked a huge dataset, by human standards for text datasets.

In the above terms, their “E” was 300 billion tokens and their “D” was ~238 billion tokens, since they updated multiple times on some tokens (cf. Table 2.2 in the GPT-3 paper).  The whole of Common Crawl is 410 billion tokens, and Common Crawl might as well be “all the text in the universe” from the vantage point of you and me.

So, there’s room to scale D up somewhat further than they did with GPT-3, but not many orders of magnitude more.  To me, this suggests that an intuitively “smarter” GPT-4 would need to get its smartness from being multimodal, as we really can’t go much further with just text.
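The arithmetic behind that “not many orders of magnitude” claim, using the token counts quoted above:

```python
# The arithmetic behind the "not much headroom in D" point, using the
# (approximate) token counts quoted above.
E = 300e9             # tokens processed while training GPT-3
D = 238e9             # distinct tokens in its training set (approx.)
common_crawl = 410e9  # rough size of all of Common Crawl

effective_epochs = E / D       # ~1.26: barely more than one pass
headroom = common_crawl / D    # ~1.7x: less than one order of magnitude

print(round(effective_epochs, 2), round(headroom, 2))
```

So GPT-3 already trained on most of its data only about once, and the largest obvious text corpus is less than 2x its dataset, nowhere near the orders-of-magnitude growth the scaling trend would demand.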

It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners →

nostalgebraist:

When scaled to hundreds of billions of parameters, pretrained language models such as GPT-3 (Brown et al., 2020) achieve remarkable few-shot performance on challenging natural language understanding benchmarks. In this work, we show that performance similar to GPT-3 can be obtained with language models whose parameter count is several orders of magnitude smaller. This is achieved by converting textual inputs into cloze questions that contain some form of task description, combined with gradient-based optimization; additionally exploiting unlabeled data gives further improvements. Based on our findings, we identify several key factors required for successful natural language understanding with small language models.

Haven’t read this yet, but it looks relevant to my question “how much better can you do with a small number of examples if you use finetuning rather than prompting?”

Here’s their Figure 1:

[image: Figure 1 from the paper]

[Found a mostly written version of this in my drafts, decided to finish and publish it]

Some notes on this, after reading the paper and its predecessor, and trying out the code:

(1)

I like these papers because they answer a question I had after reading the GPT-3 paper.

The main results of the GPT-3 paper were about solving an unusually hard problem (very little training data), with an unusually powerful tool (a very large model), applied with an unusual limitation (no finetuning).

All three of these variables were unusually extreme at once, making it hard to compare GPT-3 with anything else.

The paper assumed we were interested in results on the unusually hard problem (training on only ~32 examples).  If we are, presumably we’d like to know how well the normal approaches work on that problem, before we jump on board with the new GPT-3 approach.  That is, if you use a normal-sized model, and let yourself finetune, how well can you do with ~32 examples?

I could find surprisingly little about this topic at the time.  I expected the “normal” approach to do well here, since, as I mentioned here, it can do well in cases with only a few hundred examples (some of the SuperGLUE tasks).  But I couldn’t find anyone doing it.

This new paper confirms my expectation: you can do at least as well as the original GPT-3 results on the same problem using a normal-sized model and finetuning.
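As a miniature stand-in for what the “normal” approach amounts to here (nothing below is an actual LM; the features and labels are synthetic), finetuning a classifier on ~32 examples is just ordinary gradient descent over a tiny dataset:

```python
import numpy as np

# Toy version of "finetune on ~32 examples": fixed features standing in for a
# pretrained encoder's representations, plus a logistic-regression head.
rng = np.random.default_rng(0)
n, d = 32, 16                       # ~32 examples, as in the GPT-3 paper's setting
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = (X @ true_w > 0).astype(float)  # synthetic binary labels

w = np.zeros(d)
def loss(w):
    p = 1 / (1 + np.exp(-(X @ w)))
    return -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

initial = loss(w)
for _ in range(200):                # many epochs over a tiny dataset
    p = 1 / (1 + np.exp(-(X @ w)))
    w -= 0.5 * X.T @ (p - y) / n    # gradient step on the logistic loss
final = loss(w)
print(f"train loss: {initial:.3f} -> {final:.3f}")  # drops sharply
```

The real setting differs in the obvious ways (a transformer instead of fixed features, cloze patterns instead of a plain head), but the shape of the procedure is the same.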

(2)

It’s a little unfair to compare the results here directly to those in the GPT-3 paper.

The results here come from an approach that sounds like what a sophisticated pro user of GPT-3 might build: it involves writing several different prompts to elicit the same information and then ensembling the results.  The authors are clearly striving to do good “prompt programming” and get the most value possible out of the prompts.
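The multi-prompt ensembling idea can be sketched as follows. This is a toy illustration in the spirit of the paper's pattern-verbalizer setup, not its actual code: the pattern strings are hypothetical, and `score_with_lm` is a placeholder for a real masked LM scoring the verbalizer words at the mask position.

```python
import numpy as np

# Several prompts ("patterns") that cast the same input as a cloze question.
patterns = [
    lambda text: f"{text} Question: is this good or bad? Answer: [MASK].",
    lambda text: f'"{text}" It was [MASK].',
    lambda text: f"All in all, [MASK]. {text}",
]
labels = ["good", "bad"]

def score_with_lm(cloze, labels):
    # Placeholder scorer: a real implementation would query a masked LM for
    # the probability of each verbalizer word at the [MASK] position.
    rng = np.random.default_rng(sum(map(ord, cloze)) % 2**32)
    logits = rng.normal(size=len(labels))
    return np.exp(logits) / np.exp(logits).sum()

def ensemble_predict(text):
    # Average the per-pattern label distributions, then take the argmax.
    probs = np.mean([score_with_lm(p(text), labels) for p in patterns], axis=0)
    return labels[int(np.argmax(probs))]

print(ensemble_predict("The movie was a delight."))
```

The point is just the shape of it: each pattern yields a label distribution, and the ensemble averages them before deciding.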

The GPT-3 paper did not try to optimize its prompts, and people have already improved upon the published results by using better prompting practices with the same GPT-3 model.

However, this paper still demonstrates that “prompt programming” works even with a much smaller model.  Specifically, it casts doubt on the claim that LMs need to be large-scale to do well on tiny datasets, i.e. that performing well on tiny datasets requires much larger LMs than many other tasks do.

The GPT-3 paper didn’t actually make that claim explicitly, but it was a reasonable enough thing to conjecture after reading it, and I suspect some people came away from it with that impression.

We already knew that the GPT approach (one-directional LM, no finetuning) was very suboptimal for these tasks.  Bidirectional LMs with finetuning do much better at everything else, but cannot generate good text.

My sense, bolstered by this paper, is that the GPT-3 paper establishes the scale cost of using the GPT approach instead of one better suited for these tasks.  Given a fixed param/compute/whatever budget, the GPT approach is the right tool for text generation, but the wrong tool for tasks like these.  However, a vastly more powerful version of the wrong tool can do as well as a less powerful version of the right tool.  Together, GPT-3 and this paper quantify the size of this gap.

(3)

Fine-tuning is fundamentally much slower than prompting.  Even a small LM, on a good GPU, with a tiny dataset, takes a few minutes to train if you do many epochs (as the authors do), and then for this approach you need to repeat this many times.

That entire process is necessary to try out a single prompt programming idea, so experimentation with prompts is much slower than with GPT-3.  There is also a memory/disk cost to all these finetuned models.

(You can save memory/disk cost by using adapters, but I’m not sure they save compute time.)

I am curious whether the fundamental properties of fine-tuning can be boiled down into something much more efficient.  Adapters try to do this along the dimension of parameter count, but you still have to do many gradient steps.
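For concreteness, here is a minimal sketch of the adapter idea (a Houlsby-style bottleneck: down-project, nonlinearity, up-project, residual), along with the parameter-count comparison that drives the memory/disk savings. Sizes are illustrative, and this is a plain numpy forward pass rather than a trainable module:

```python
import numpy as np

d_model, bottleneck = 768, 16   # illustrative sizes

rng = np.random.default_rng(0)
W_down = rng.normal(scale=0.02, size=(d_model, bottleneck))
W_up   = np.zeros((bottleneck, d_model))  # zeroed so the adapter starts as the identity

def adapter(h):
    # h: (seq_len, d_model) hidden states from a frozen transformer layer.
    # Down-project, ReLU, up-project, add back the residual.
    return h + np.maximum(h @ W_down, 0) @ W_up

h = rng.normal(size=(10, d_model))
out = adapter(h)
print(out.shape)                          # (10, 768)

# Only the two small matrices are trained and stored per task:
adapter_params = W_down.size + W_up.size
full_layer_params = 12 * d_model**2      # rough weight count for one transformer block
print(f"adapter/full ratio: {adapter_params / full_layer_params:.4f}")
```

The per-task storage is a fraction of a percent of the full model, but each gradient step still has to run the full frozen network, which is why adapters don't obviously save compute time.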

Especially with tiny datasets, the many gradient steps feel excessive somehow.  There just isn’t much information in the dataset, relative to the pre-trained LM.  Fine-tuning is not teaching the LM something new, but merely “locating” knowledge already stored in it somewhere, and “hooking up” that knowledge to your new prediction head.

If the knowledge is already there, you’d think you wouldn’t even need to tune the LM itself, and could just fit a linear (or simple nonlinear) model on top of it, which would be much faster.  But folk wisdom says it’s better to tune with transformers.  (Maybe attention is too close to sparse, some important tokens are ignored by the existing heads, and you need to teach them a rule like “look at this type of thing” where the “type of thing” is a concept easily expressed in the input basis of later layers.)
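The “just fit a simple model on top” alternative can be sketched with a closed-form ridge probe, which needs zero gradient steps. This is what makes it fast; the random “features” below are only a stand-in for frozen LM representations, so it illustrates the mechanics, not the accuracy question:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 32, 64                           # tiny dataset, frozen feature dimension
features = rng.normal(size=(n, d))      # stand-in for frozen LM hidden states
y = rng.integers(0, 2, size=n) * 2 - 1  # labels in {-1, +1}

# Closed-form ridge solution: w = (F^T F + lam*I)^{-1} F^T y -- one linear solve,
# no epochs, no gradient steps.
lam = 1.0
w = np.linalg.solve(features.T @ features + lam * np.eye(d), features.T @ y)
preds = np.sign(features @ w)
print(f"train accuracy: {(preds == y).mean():.2f}")
```

If the folk wisdom is right that tuning the transformer itself beats probes like this, the interesting question is exactly where the gap comes from.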

If the reason behind this observation were better understood, we might be able to replace fine-tuning with something much faster, and then replicate a GPT-3-like task programming experience with models the average laptop can run.
