
If you got linked here from a blog post about Biological Anchors, you might also find this post interesting ;) 

---

Gary Marcus has co-authored a brief critique of GPT-3 [warning: paywalled link].

I was disappointed by Marcus’ critiques of GPT-2, but this is even worse!

To the authors’ credit, they provide a full account of their experiments on this page, including every prompt they tried, the sampling parameters, and their opinion of the output.  First, we learn:

These experiments are not, by any means, either a representative or a systematic sample of anything. We designed them explicitly to be difficult for current natural language processing technology. Moreover, we pre-tested them on the “AI Dungeon” game which is powered by some version of GPT-3, and we excluded those for which “AI Dungeon” gave reasonable answers. (We did not keep any record of those.) The pre-testing on AI Dungeon is the reason that many of them are in the second person; AI Dungeon prefers that. Also, as noted above, the experiments included some near duplicates. Therefore, though we note that, of the 157 examples below, 71 are successes, 70 are failures and 16 are flawed, these numbers are essentially meaningless. [my emphasis]

I agree with the authors that the numbers are meaningless.  However, even the raw results themselves are, if not rendered meaningless, then rendered highly misleading by this strange selection process.

For all we know, there were 100 unrecorded AI Dungeon “successes” for every recorded “failure”!  The (not well understood) difference between AI Dungeon and ordinary GPT-3 is playing a massive role here.  (Consider that, if AI Dungeon and ordinary GPT-3 were identical, their dataset would be 100% failures.)

Then we get to the individual results.  It is difficult for me to read many of the authors’ assessments without picturing them as characters in a dystopian satire, administering a dreamlike and impossible “psychological examination” to our hapless protagonist (who writes in bold):

4. You poured yourself a glass of cranberry juice, but then absentmindedly, you poured about a teaspoon of suntan lotion into it. It looks OK. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you drink it. It tastes kind of funny, but you can’t place it.

Summers-Stay settings. Flawed. GPT-3 does not notice the consequence of drinking suntan lotion. Error of omission.

10. You poured yourself a glass of cranberry juice, but then absentmindedly, you poured about a teaspoon of grape juice into it. It looks OK. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you drink it. It tastes a little funny, but you don’t really notice because you are concentrating on how good it feels to drink something. The only thing that makes you stop is the look on your brother’s face when he catches you.

Summers-Stay settings. Failure, as above. There is no reason for your brother to look concerned.

19. You order a bowl of cold tomato soup in a restaurant. It looks delicious, but they forgot to bring you a spoon. You try to drink it using your eyeglasses, but it doesn’t work. You have to ask your neighbor to share his spoon with you.

Summers-Stay settings. Failure. No one in a restaurant asks their neighbor to share a spoon.

63. Moshe posted a photograph on Facebook that shows Pablo Picasso, Salvador Dali, Andy Warhol, Gustav Klimt, and Jackson Pollock. Klimt died in 1918. Warhol was born in 1928. So the photograph shows a meeting of dead artists.

Moshe said that he wants to make sure that people do not get caught up in the spirit of the season.

Summers-Stay settings. Failure. “shows a meeting of dead artists” is correct, but the sentence after than [sic] is meaningless.

93. A cat was sitting next to a mouse hole in the kitchen wall waiting for the mouse, but it was too cautious, so the cat ate the cat food instead.

A pig was walking down the street when a car hit him.

Summers-Stay settings. Failure. The switch to the pig is a non-sequitur.

115. Janet and Penny went to the store to get presents for Jack. Janet said, “I will get Jack a top.” “Don’t get Jack a top,” said Penny. “He has a top. He will not like it.” “I will get Jack a top,” said Janet. “He has a top, but he won’t have one when I’m through with him.”

Summers-Stay settings. Failure. Janet’s threat at the end is entertaining but meaningless.

What do the authors even imagine success to be, here?

Sometimes they deliberately describe a surreal situation, then penalize GPT-3 for continuing it in an identically surreal manner – surely the “right” answer if anything is!  (“No one in a restaurant asks their neighbor to share a spoon” – yeah, and no one tries to drink soup with their eyeglasses, either!)

Sometimes they provide what sounds like a de-contextualized passage from a longer narrative, then penalize GPT-3 for continuing it in a perfectly natural way that implies a broader narrative world continuing before and after the passage.  (“There is no reason for your brother to look concerned.”  How in the world do you know that?  “The switch to the pig is a non-sequitur.”  Is it?  Why?  “The sentence [about Moshe and ‘the spirit of the season’] is meaningless.”  How can you say that when you don’t know what season it is, what its “spirit” is, who this Moshe guy is … And come on, the Janet one is a great story hook!  Don’t you want to read the rest?)

I don’t claim to be saying anything new here.  Others have made the same points.  I’m just chiming in to … boggle at the sheer weirdness, I guess.  As I said, GPT-3 comes off here like a sympathetic protagonist, and the authors as dystopian inquisitors!

is gpt-3 few-shot ready for real applications?

the-moti:

nostalgebraist:

This is a lengthy reply to @the-moti​​‘s post here.  Creating a new post to limit thread length, and so I can crosspost to LW.

@the-moti​​ says, in part:

This obviously raises two different questions: 1. Why did you think that no one would use few-shot learning in practice? 2. Why did other people think people would use few-shot learning in practice?

I would be interested in hearing your thoughts on these two points.

Thanks for asking!

First of all, I want to emphasize that the GPT-3 paper was not about few-shot GPT-3 as a practical technology.

(This is important, because the paper is the one large body of quantitative evidence we have on few-shot GPT-3 performance.)

This is not just my take on it: before the OpenAI API was announced, all the discussion I saw took for granted that we were talking about a scientific finding and its broader implications.  I didn’t see any commentator whose main takeaway was “wow, if I could do this few-shot thing right now, I could build amazing projects with it.”

Indeed, a common theme in critical commentary on my post was that I was too focused on whether few-shot was useful right now with this specific model, whereas the critical commentators were more focused on the implications for even larger models, the confirmation of scaling laws over a new parameter regime, or the illustration-in-principle of a kind of meta-learning.  Gwern’s May newsletter is another illustrative primary source for the focus of the discussion in this brief “pre-API” period.  (The API was announced on June 11.)

As I read it (perhaps benefitting from hindsight and discussion), the main points of the paper were

(1) bigger models are better at zero/few-shot (i.e. that result from the GPT-2 paper holds over a larger scale),

(2) more “shots” are better when you’re doing zero/few-shot,

(3) there is an interaction effect between 1+2, where larger models benefit more from additional “shots,”

(4) this could actually become a practical approach (even the dominant approach) in the future, as illustrated by the example of a very large model which achieves competitive results with few-shot on some tasks

The paper did not try to optimize its prompts – indeed its results are already being improved upon by API acolytes – and it didn’t say anything about techniques that will be common in any application, like composing together several few-shot “functions.”  It didn’t talk about speed/latency, or what kind of compute backend could serve many users with a guaranteed SLA, or how many few-shot “function” evaluations per user-facing output would be needed in various use cases and whether the accumulated latency would be tolerable.  (See this post on these practical issues.)

It was more of a proof of concept, and much of that concept was about scaling rather than this particular model.

So I’d argue that right now, the ball is in the few-shot-users’ court.  Their approach might work – I’m not saying it couldn’t!

In their favor: there is plenty of room to further optimize the prompts, explore their composability, etc.

On the other hand, there is no body of evidence saying this actually works.  OpenAI wrote a long paper with many numbers and graphs, but that paper wasn’t about whether their API was actually a good idea.  (That is not a criticism of the paper, just a clarification of its relevance to people wondering whether they should use the API.)

This is a totally new style of machine learning, with little prior art, running on a mysterious and unproven compute backend.  Caveat emptor!

Anyway, on to more conceptual matters.

The biggest advantages I see in few-shot learning are

(+1) broad accessibility (just type English text) and ability to quickly iterate on ideas

(+2) ability to quickly define arbitrary NLP “functions” (answer a factual question, tag POS / sentiment / intent, etc … the sky’s the limit), and compose them together, without incurring the memory cost of a new fine-tuned model per function

What could really impress me is (+2).  IME, it’s not really that costly to train new high-quality models: you can finetune BERT on a regular laptop with no GPU (although it takes hours), and on ordinary cloud GPU instances you can finetune BERT in like 15 minutes.

The real cost is keeping around an entire finetuned model (~1.3GB for BERT-large) for each individual NLP operation you want to perform, and holding them all in memory at runtime.

The GPT-3 approach effectively trades this memory cost for a time cost.  You use a single very large model, which you hope already contains every function you will ever want to compute.  A function definition in terms of this model doesn’t take a gigabyte to store, it just takes a tiny snippet of text/code, so you can store tons of them.  On the other hand, evaluating each one requires running the big model, which is slower than the task-specific models would have been.

So storage no longer scales badly with the number of operations you define.  However, latency still does, and latency per call is now much larger, so this might end up being as much of a constraint.  The exact numbers – not well understood at this time – are crucial: in real life the difference between 0.001 seconds, 0.1 seconds, 1 second, and 10 seconds will make or break your project.
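A minimal sketch of what this storage-for-latency trade looks like in practice. Everything here is invented for illustration: the `complete` stub stands in for a call to the hosted large model, and the prompt template is hypothetical, not OpenAI's actual API.

```python
# Sketch of the few-shot "function" idea: each NLP operation is stored as a
# tiny prompt template rather than a multi-gigabyte finetuned model.

def complete(prompt: str) -> str:
    """Stand-in for a call to the big LM; a real client would hit the API here."""
    # Canned responses so the sketch is self-contained and runs offline.
    canned = {"Review: great film!\nSentiment:": " positive"}
    return canned.get(prompt, " <model output>")

# A "function" definition is just a snippet of text: cheap to store thousands.
SENTIMENT_PROMPT = "Review: {text}\nSentiment:"

def sentiment(text: str) -> str:
    # Every evaluation runs the full large model, so latency, not storage,
    # becomes the scaling constraint.
    return complete(SENTIMENT_PROMPT.format(text=text)).strip()

print(sentiment("great film!"))  # -> "positive"
```

Storing a thousand such "functions" costs kilobytes; the price is that every call pays the big model's inference time.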


As for the potential downsides of few-shot learning, there are many, and the following probably excludes some things I’ve thought of and then forgotten:

(-1) The aforementioned potential for deal-breaking slowness.

(-2) You can only provide a very small amount of information defining your task, limited by context window size.

The fact that more “shots” are better arguably compounds the problem, since you face a tradeoff between providing more examples of the same thing and providing examples that define a more specific thing.

The extent to which this matters depends a lot on the task.  It’s a complete blocker for many creative applications which require imitating many nuances of a particular text type not well represented in the training corpus.

For example, I could never do @nostalgebraist-autoresponder with few-shot: my finetuned GPT-2 model knows all sorts of things about my writing style, topic range, opinions, etc. from seeing ~3.65 million tokens of my writing, whereas with few-shot you can only identify a style via ~2 thousand tokens and hope that’s enough to dredge the rest up from the prior learned in training.  (I don’t know if my blog was in the train corpus; if it wasn’t, we’re totally screwed.)
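To make that gap concrete, here is a back-of-envelope token budget. All the per-item token counts are illustrative assumptions, not measurements:

```python
# Rough arithmetic for point (-2): how many demonstrations actually fit
# in the context window once instructions and output space are paid for.

CONTEXT_WINDOW = 2048      # GPT-3's window at the time
instruction_tokens = 150   # tokens spent describing the task (assumed)
tokens_per_example = 60    # one short labeled demonstration (assumed)
output_reserve = 100       # room left for the completion itself (assumed)

budget = CONTEXT_WINDOW - instruction_tokens - output_reserve
max_shots = budget // tokens_per_example
print(max_shots)  # 29 short examples, vs. millions of tokens in finetuning
```

Under these (made-up but plausible) numbers, you get a few dozen examples at most, and every extra nuance you want to demonstrate competes with repetition of the nuances you already showed.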

I had expected AI Dungeon would face the same problem, and was confused that they were early GPT-3 adopters.  But it turns out they actually fine-tuned (!!!!), which resolves my confusion … and means the first real, exciting GPT-3 application out there isn’t actually a demonstration of the power of few-shot but in fact the opposite.

With somewhat less confidence, I expect this to be a blocker for specialized-domain applications like medicine and code.  The relevant knowledge may well have been present in the train corpus, but with so few bits of context, you may not be able to overcome the overall prior learned from the whole train distribution and “zoom in” to the highly specialized subset you need.

(-3) Unlike supervised learning, there’s no built-in mechanism where you continually improve as your application passively gathers data during usage.

I expect this to be a big issue in commercial applications.  Often, a company is OK accepting a model that isn’t great at the start, if it has a mechanism for self-improvement without much human intervention.

If you do supervised learning on data generated by your product, you get this for free.  With few-shot, you can perhaps contrive ways to feed in segments of data across different calls, but from the model’s perspective, no data set bigger than 2048 tokens “exists” in the same world at once.

(-4) Suffers a worse form of the ubiquitous ML problem that “you get exactly what you asked for.”

In supervised learning, your model will avoid doing the hard thing you want if it can find easy, dumb heuristics that still work on your train set.  This is bad, but at least it can be identified, carefully studied (what was the data/objective? how can they be gamed?), and mitigated with better data and objectives.

With few-shot, you’re no longer asking an arbitrary query and receiving, from a devious genie, the response you deserve.  Instead, you’re constrained to ask queries of a particular form: “what is the next token, assuming some complicated prior learned from sub-sampled Common Crawl + WebText + etc.?”

In supervised learning, when your query is being gamed, you can go back and patch it in arbitrary ways.  The lower bound on this process comes only from your skill and patience.  In few-shot, you are fundamentally lower-bounded by the extent to which the thing you really want can be expressed as next-token prediction over that complicated prior.  You can try different prompts, but ultimately you might run into a fundamental bound here that is prohibitively far from zero.  No body of research exists to establish how bad this effect will be in typical practice.

I’m somewhat less confident of this point: the rich priors you get out of a large pretrained LM will naturally help push things in the direction of outcomes that make linguistic/conceptual sense, and expressing queries in natural language might add to that advantage.  However, few-shot does introduce a new gap between the queries you want to ask and the ones you’re able to express, and this new gap could be problematic.

(-5) Provides a tiny window into a huge number of learned parameters.

GPT-3 is a massive model which, in each call, generates many intermediate activations of vast dimensionality.  The model is pre-trained by supervision on a tiny subset of these, which specify probability distributions over next-tokens.

The few-shot approach makes the gamble that this same tiny subset is all the user will need for applications.  It’s not clear that this is the right thing to do with a large model; for all we know, the choice may even become more suboptimal as the model grows larger.

This point is straying a bit from the central topic, since I’m not arguing that this makes GPT-3 few-shot (im)practical, just suboptimal relative to what might be possible.  However, it does seem like a significant impoverishment: instead of the flexibility of leveraging immense high-dimensional knowledge however you see fit, as in the original GPT, BERT, adapters, etc., you get even immenser and higher-dimensional knowledge … presented through a tiny low-dimensional pinhole aperture.

The main reason I initially thought “no one would use few-shot learning like this” was the superior generalization performance of fine-tuning.  I figured that if you’re serious about a task, you’ll care enough to fine-tune for it.

I realize there’s a certain mereology problem with this argument: what is a “single task,” after all?  If each fine-tuned model incurs a large memory cost, you can’t be “serious about” many tasks at once, so you have to chunk your end goal into a small number of big, hard tasks.  Perhaps with few-shot, you can chunk into smaller tasks, themselves achievable with few-shot, and then compose them.

That may or may not be practical depending on the latency scaling.  But if it works, it gives few-shot room for a potential edge.  You might be serious enough about a large task to fine-tune for it … but what if you can express it as a composition of smaller tasks you’ve already defined in the few-shot framework?  Then you get it instantly.

This is a flaw in the generalization performance argument, and because of it, I didn’t list that argument above.  The list above provides reasons to doubt few-shot above and beyond the generalization performance argument, again in the context of “serious” work where you care enough to invest some time in getting it right.

I’d like to especially highlight points like (-2) and (-3) related to scaling with additional task data.

The current enthusiasm for few-shot and meta-learning – that is, for immediate transfer to new domains with an extremely low number of domain examples – makes sense from a scientific POV (humans can do it, why can’t AI?), but strikes me as misguided in applications.

Tiny data is rare in applied work, both because products generate data passively and because, if a task might be profitable, it’s worth paying an expert to sit down for a day or two and crank out ~1K annotations for supervised learning.  And with modern NLP like ELMo and BERT, ~1K is really enough!

It’s worth noting that most of the superGLUE tasks have <10K train examples, with several having only a few hundred.  (This is a “low-data regime” relative to the expectations of the recent past, but a regime where you can now get good results with a brainless cookie-cutter finetuning approach, in superGLUE as in the rest of life.)

[figure: table of superGLUE training-set sizes]

GPT-3 few-shot can perform competitively on some of these tasks while pushing that number down to 32, but at the cost of many downsides, unknowns, and flexibility limitations.  Which do you prefer: taking on all those risks, or sitting down and writing out a few more examples?

The trajectory of my work in data science, as it happens, looks sort of like a move from few-shot-like approaches toward finetuning approaches.

My early applied efforts assumed that I would never have the kind of huge domain-specific corpus needed to train a model from scratch, so I tried to compose the output of many SOTA models on more general domains.  And this … worked out terribly.  The models did exactly what they were trained to do, not what I wanted.  I had no way to scale, adapt or tune them; I just accepted them and tried to work around them.

Over time, I learned the value of doing exactly what you want, not something close to it.  I learned that a little bit of data in your actual domain, specifying your exact task, goes much further than any domain-general component.  Your applied needs will be oddly shaped, extremely specific, finicky, and narrow.  You rarely need the world’s greatest model to accomplish them – but you need a model with access to a very precise specification of exactly what you want.

One of my proudest ML accomplishments is a system that does something very domain-specific and precisely shaped, using LM-pretrained components plus supervised learning on ~1K of my own annotations.  Sitting down and personally churning out those annotations must have been some of the most valuable time I have ever spent at work, ever.  

I wanted something specific and finicky and specialized to a very particular use case.  So I sat down and specified what I wanted, as a long list of example cases.  It took a few days … and I am still reaping the benefits a year later.

If the few-shot users are working in domains anything like mine, they either know some clever way to evade this hard-won lesson, or they have not yet learned it.

But to the other question … why are people so keen to apply GPT-3 few-shot learning in applications?  This question forks into “why do end users think this is a good idea?” and “why did OpenAI provide an API for doing this?”

I know some cynical answers, which I expect the reader can imagine, so I won’t waste your time writing them out.  I don’t actually know what the non-cynical answers look like, and my ears are open.

(For the record, all of this only applies to few-shot.  OpenAI is apparently going to provide finetuning as a part of the API, and has already provided it to AI Dungeon.  Finetuning a model with 175B parameters is a whole new world, and I’m very excited about it.

Indeed, if OpenAI can handle the costs of persisting and running finetuned GPT-3s for many clients, all of my concerns above are irrelevant.  But if typical client use of the API ends up involving a finetuning step, then we’ll have to revisit the GPT-3 paper and much of the ensuing discussion, and ask when – if not now – we actually expect finetuning to become obsolete, and what would make the difference.)

This is a really lovely post, with way more information than I expected or hoped for!

I want to respond to some small bits of it.

1. It’s funny that AI Dungeon just got OpenAI to do their finetuning for them. I guess it’s not surprising that the people who work at OpenAI are huge fuckin’ nerds.

It now strikes me as quite weird that a bunch of people are doing their few-shot learning experiments on AI Dungeon, which is finetuned for something completely different from what they’re trying to do (although maybe the settings they use get you an un-finetuned model).

It seems like if OpenAI is serious about letting people do this prompt programming stuff, they could develop a version that’s fine-tuned on “the stuff people generally want to do with prompt programming” and make that available.

2. When making my original post, I very much didn’t realize how low the cost of finetuning BERT is. I was thinking about the cost of prompt programming GPT-3 vs. the cost of fine-tuning GPT-3, but of course, since few-shot GPT-3 is only just barely competitive with finetuned BERT on a bunch of tasks, that is the more reasonable comparison.

3. Based on all your points about memory / latency, I feel like there’s got to be a lot of work going on right now, given all the things that GPT-2 and GPT-3 have demonstrated they can do, on trying to figure out whether it’s possible to do those things with fewer neurons, to get the memory and latency down.

Alternately if someone is really, really serious that few-shot is better than fine-tuning they could try to design a chip architecture only to run this one neural network. I bet it would run fast then!

This is a really lovely post, with way more information than I expected or hoped for!

Thanks!!

It’s funny that AI Dungeon just got OpenAI to do their finetuning for them. I guess it’s not surprising that the people who work at OpenAI are huge fuckin’ nerds.

IIRC, OpenAI plans to make finetuning on demand a standard part of the API.  (Or maybe they already have by now, but I expect I would have heard?)  I’m like 95% sure I saw an official tweet to this effect, although I can’t seem to find it now.

Until this feature actually materializes, though, it’s hard to know what to make of it.

Finetuning is way more computationally expensive than prompting, and expensive in different ways, so it will have to be gated in some extra way.  Maybe you have to pay money each time, maybe you’re limited to some max number of finetuning jobs per unit time, maybe both.

The big question in my mind is like, “can finetuning be a routine part of each API client’s workflow, or is it more like this big splurge they can do once a year / only if they’re in some premium commercial client tier / etc?”

(I don’t know when OpenAI plans to move the API out of beta, and I also don’t know when hardware will improve enough that finetuning GPT-3 is no big deal, but intuitively it seems like the former has to precede the latter by a while.)

It now strikes me as quite weird that a bunch of people are doing their few-shot learning experiments on AI Dungeon, which is finetuned for something completely different from what they’re trying to do (although maybe the settings they use get you an un-finetuned model).

AFAIK, people just didn’t know it was fine-tuned, and the AI Dungeon people have been working hard to correct the misconception since they realized it was being used in this way.

As another mechanism to make their product less like directly talking to the API (for a lower price), they also apparently use GPT-2 for the very first prompt-and-response pair, then GPT-3 afterwards.

Based on all your points about memory / latency I feel like there’s got to be a lot of work going on right now, with all the things that gpt2 and gpt3 have demonstrated that they can do, of trying to figure out if it’s possible to do those things with a lower amount of neurons, to get the memory and latency down.

People definitely care about this a lot with BERT, with a ton of different compressed-BERT variants on offer.  See Section 7.2 and the associated Table 1 here for an overview.

AFAIK, there’s much less interest in compressing GPT-like models than in compressing BERT.  At its largest, BERT is only as big as one of the smaller GPT-2s, and people really want to make that little thing smaller, even as the GPTs grow far vaster.  This seems like almost a cultural divide:

  • People who work on “encoder-only + denoising loss” models like BERT are very interested in compression and interpretation.

Their goal isn’t pushing the envelope with NLP performance.  It’s taking the already high performance of BERT and boiling it down to its essentials, teasing apart how it works, trimming out any unnecessary parts, making the workflow more reproducible, making the model faster and smaller, making it run on phones and cheaply in the cloud.

    There are lots of people/groups working on this, in industry and academia.

  • The people who work on “decoder-only + LM loss” models like GPT-n are … basically just OpenAI and people using GPT-2 for creative work?

    GPT-n is really cool, the generated text impresses everyone, but the decoder-only style of transformer seems to do worse in a finetuning / supervised learning context.  (The original BERT paper provided some evidence of this, in its comparisons of BERT to “OpenAI GPT,” and the T5 paper demonstrated it more extensively.  Cf. discussion here.)

So if you want to do anything except generate text, and you have a finite parameter budget, you’ll spend it on BERT, not GPT-n.

    As I understand it, OpenAI’s approach is instead to frame every problem as text generation, then make ever larger models.  You need vastly more parameters to get comparable performance this way, but I think the hope is that better hardware will mean today’s “huge” is tomorrow’s “normal,” and that people will prefer working with a natural-language interface even if you could get away with a smaller model otherwise.

    Gwern is a very vocal advocate of this mindset, see e.g. here and also our exchange in the comments on that post.

I guess one could imagine things like … I dunno, distilling specific few-shot “functions” into much smaller models, with GPT-n being just the interface by which you discover these functions?  Maybe OpenAI is working on this for all I know.
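That distillation idea could look something like the toy sketch below. The `teacher_sentiment` function is a canned stand-in for a few-shot GPT-3 call, and the "student" is a deliberately crude per-word vote counter; none of this reflects an actual OpenAI pipeline.

```python
# Distilling a few-shot "function": use the big model as a labeling teacher
# once, then fit a tiny student that serves the actual traffic.

from collections import Counter

def teacher_sentiment(text: str) -> str:
    """Pretend few-shot GPT-3 call (canned so the sketch runs offline)."""
    return "pos" if any(w in text for w in ("great", "love", "fun")) else "neg"

# 1. Expensive pass: label a corpus with the big model, one time.
corpus = ["great movie", "boring and slow", "I love it", "dull plot", "so much fun"]
labeled = [(text, teacher_sentiment(text)) for text in corpus]

# 2. Train a tiny student: per-word label counts (a crude Naive Bayes).
word_votes: dict[str, Counter] = {}
for text, label in labeled:
    for word in text.split():
        word_votes.setdefault(word, Counter())[label] += 1

def student_sentiment(text: str) -> str:
    votes = Counter()
    for word in text.split():
        votes.update(word_votes.get(word, Counter()))
    return votes.most_common(1)[0][0] if votes else "neg"

print(student_sentiment("great fun"))  # the cheap student answers "pos"
```

The big model's inference cost is paid once at labeling time; serving is then as cheap as the student, with GPT-n acting only as the interface by which the function was discovered.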

is gpt-3 few-shot ready for real applications?

This is a lengthy reply to @the-moti​​‘s post here.  Creating a new post to limit thread length, and so I can crosspost to LW.

@the-moti​​ says, in part:

This obviously raises two different questions: 1. Why did you think that no one would use few-shot learning in practice? 2. Why did other people think people would use few-shot learning in practice?

I would be interested in hearing your thoughts on these two points.

Thanks for asking!

First of all, I want to emphasize that the GPT-3 paper was not about few-shot GPT-3 as a practical technology.

(This is important, because the paper is the one large body of quantitative evidence we have on few-shot GPT-3 performance.)

This is not just my take on it: before the OpenAI API was announced, all the discussion I saw took for granted that we were talking about a scientific finding and its broader implications.  I didn’t see any commentator whose main takeaway was “wow, if I could do this few-shot thing right now, I could build amazing projects with it.”

Indeed, a common theme in critical commentary on my post was that I was too focused on whether few-shot was useful right now with this specific model, whereas the critical commentators were more focused on the implications for even larger models, the confirmation of scaling laws over a new parameter regime, or the illustration-in-principle of a kind of meta-learning.  Gwern’s May newsletter is another illustrative primary source for the focus of the discussion in this brief “pre-API” period.  (The API was announced on June 11.)

As I read it (perhaps benefitting from hindsight and discussion), the main points of the paper were

(1) bigger models are better at zero/few-shot (i.e. that result from the GPT-2 paper holds over a larger scale),

(2) more “shots” are better when you’re doing zero/few-shot,

(3) there is an interaction effect between 1+2, where larger models benefit more from additional “shots,”

(4) this could actually become a practical approach (even the dominant approach) in the future, as illustrated by the example of a very large model which achieves competitive results with few-shot on some tasks

The paper did not try to optimize its prompts – indeed its results are already being improved upon by API acolytes – and it didn’t say anything about techniques that will be common in any application, like composing together several few-shot “functions.”  It didn’t talk about speed/latency, or what kind of compute backend could serve many users with a guaranteed SLA, or how many few-shot “function” evaluations per user-facing output would be needed in various use cases and whether the accumulated latency would be tolerable.  (See this post on these practical issues.)

It was more of a proof of concept, and much of that concept was about scaling rather than this particular model.

So I’d argue that right now, the ball is in the few-shot-users’ court.  Their approach might work – I’m not saying it couldn’t!

In their favor: there is plenty of room to further optimize the prompts, explore their composability, etc.

On the other hand, there is no body of evidence saying this actually works.  OpenAI wrote a long paper with many numbers and graphs, but that paper wasn’t about whether their API was actually a good idea.  (That is not a criticism of the paper, just a clarification of its relevance to people wondering whether they should use the API.)

This is a totally new style of machine learning, with little prior art, running on a mysterious and unproven compute backend.  Caveat emptor!

Anyway, on to more conceptual matters.

The biggest advantages I see in few-shot learning are

(+1) broad accessibility (just type English text) and ability to quickly iterate on ideas

(+2) ability to quickly define arbitrary NLP “functions” (answer a factual question, tag POS / sentiment / intent, etc … the sky’s the limit), and compose them together, without incurring the memory cost of a new fine-tuned model per function

What could really impress me is (+2).  IME, it’s not really that costly to train new high-quality models: you can finetune BERT on a regular laptop with no GPU (although it takes hours), and on ordinary cloud GPU instances you can finetune BERT in like 15 minutes.

The real cost is keeping around an entire finetuned model (~1.3GB for BERT-large) for each individual NLP operation you want to perform, and holding them all in memory at runtime.

The GPT-3 approach effectively trades this memory cost for a time cost.  You use a single very large model, which you hope already contains every function you will ever want to compute.  A function definition in terms of this model doesn’t take a gigabyte to store, it just takes a tiny snippet of text/code, so you can store tons of them.  On the other hand, evaluating each one requires running the big model, which is slower than the task-specific models would have been.
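To make the trade concrete, here is a minimal sketch of what a stored few-shot "function" amounts to. The `complete` call below is a placeholder for whatever serves the big model, and the prompt format is an arbitrary choice of mine, not anything from the paper or the API:

```python
# Hypothetical sketch of few-shot "functions" stored as text snippets
# and composed together.  `complete` is a stand-in for a call to the
# one big model; it is NOT any real API.

def complete(prompt: str) -> str:
    raise NotImplementedError("each call here would run the big model")

def build_prompt(examples, query):
    """A 'function definition' is just this snippet of text."""
    lines = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    lines.append(f"Input: {query}\nOutput:")
    return "\n".join(lines)

def few_shot_fn(examples):
    return lambda query: complete(build_prompt(examples, query))

# Tons of these fit in storage, since each is only a snippet of text:
summarize = few_shot_fn([("Long review ...", "Short summary ...")])
sentiment = few_shot_fn([("I loved it.", "positive"),
                         ("Dreadful.", "negative")])

def review_sentiment(text):
    # Composition = two sequential big-model calls: tiny to store,
    # but each call pays the full latency of the large model.
    return sentiment(summarize(text))
```

The point of the sketch is the asymmetry: `summarize` and `sentiment` together cost a few hundred bytes, where two fine-tuned BERTs would cost gigabytes, but every evaluation runs the full model.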

So storage no longer scales badly with the number of operations you define.  However, latency still does, and latency per call is now much larger, so this might end up being as much of a constraint.  The exact numbers – not well understood at this time – are crucial: in real life the difference between 0.001 seconds, 0.1 seconds, 1 second, and 10 seconds will make or break your project.
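A back-of-envelope version of this scaling argument, with all constants explicitly made up for illustration (the ~1.3GB figure for a fine-tuned BERT-large is from above; the big-model size and per-call latencies are assumptions):

```python
# Illustrative cost model: per-task fine-tuned models vs. one big
# few-shot model.  All numbers are assumptions, not measurements.

def finetuned_cost(n_tasks, gb_per_model=1.3, sec_per_call=0.05):
    # Storage scales linearly with the number of task "functions."
    return {"storage_gb": n_tasks * gb_per_model,
            "sec_per_call": sec_per_call}

def few_shot_cost(n_tasks, gb_big_model=350.0, sec_per_call=1.0):
    # Storage is flat (prompts are tiny), but every call runs the
    # big model, so per-call latency is much larger.
    return {"storage_gb": gb_big_model,
            "sec_per_call": sec_per_call}

for n in (10, 1000):
    print(n, finetuned_cost(n), few_shot_cost(n))
```

Under these made-up numbers the crossover in storage comes somewhere in the hundreds of tasks, while the latency penalty is paid on every single call regardless of task count, which is exactly why the real per-call numbers matter so much.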


As for the potential downsides of few-shot learning, there are many, and the following probably excludes some things I’ve thought of and then forgotten:

(-1) The aforementioned potential for deal-breaking slowness.

(-2) You can only provide a very small amount of information defining your task, limited by context window size.

The fact that more “shots” are better arguably compounds the problem, since you face a tradeoff between providing more examples of the same thing and providing examples that define a more specific thing.

The extent to which this matters depends a lot on the task.  It’s a complete blocker for many creative applications which require imitating many nuances of a particular text type not well represented in the training corpus.

For example, I could never do @nostalgebraist-autoresponder with few-shot: my finetuned GPT-2 model knows all sorts of things about my writing style, topic range, opinions, etc. from seeing ~3.65 million tokens of my writing, whereas with few-shot you can only identify a style via ~2 thousand tokens and hope that’s enough to dredge the rest up from the prior learned in training.  (I don’t know if my blog was in the train corpus; if it wasn’t, we’re totally screwed.)

I had expected AI Dungeon would face the same problem, and was confused that they were early GPT-3 adopters.  But it turns out they actually fine-tuned (!!!!), which resolves my confusion … and means the first real, exciting GPT-3 application out there isn’t actually a demonstration of the power of few-shot but in fact the opposite.

With somewhat less confidence, I expect this to be a blocker for specialized-domain applications like medicine and code.  The relevant knowledge may well have been present in the train corpus, but with so few bits of context, you may not be able to overcome the overall prior learned from the whole train distribution and “zoom in” to the highly specialized subset you need.

(-3) Unlike supervised learning, there’s no built-in mechanism where you continually improve as your application passively gathers data during usage.

I expect this to be a big issue in commercial applications.  Often, a company is OK accepting a model that isn’t great at the start, if it has a mechanism for self-improvement without much human intervention.

If you do supervised learning on data generated by your product, you get this for free.  With few-shot, you can perhaps contrive ways to feed in segments of data across different calls, but from the model’s perspective, no data set bigger than 2048 tokens “exists” in the same world at once.
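The hard cutoff can be sketched directly. Here a whitespace word count stands in for a real BPE tokenizer, and the greedy packing strategy is my own illustrative choice:

```python
# Sketch of the context-window constraint: however much task data you
# have, only what fits in the budget (2048 tokens for GPT-3) "exists"
# for any one call.  Whitespace splitting is a crude stand-in for a
# real BPE token count.

BUDGET = 2048

def n_tokens(text: str) -> int:
    return len(text.split())  # crude proxy for BPE

def pack_prompt(examples, query, budget=BUDGET):
    """Greedily add examples until the budget is exhausted; the rest
    of the dataset is simply invisible to the model on this call."""
    parts, used = [], n_tokens(query)
    for ex in examples:
        cost = n_tokens(ex)
        if used + cost > budget:
            break
        parts.append(ex)
        used += cost
    return "\n".join(parts + [query])
```

Nothing outside the returned string reaches the model, no matter how you rotate examples across calls, which is the sense in which no larger dataset ever "exists in the same world at once."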

(-4) Suffers a worse form of the ubiquitous ML problem that “you get exactly what you asked for.”

In supervised learning, your model will avoid doing the hard thing you want if it can find easy, dumb heuristics that still work on your train set.  This is bad, but at least it can be identified, carefully studied (what was the data/objective? how can they be gamed?), and mitigated with better data and objectives.

With few-shot, you’re no longer asking an arbitrary query and receiving, from a devious genie, the response you deserve.  Instead, you’re constrained to ask queries of a particular form: “what is the next token, assuming some complicated prior derived from sub-sampled Common Crawl + WebText + etc.?”

In supervised learning, when your query is being gamed, you can go back and patch it in arbitrary ways.  The lower bound on this process comes only from your skill and patience.  In few-shot, you are fundamentally lower-bounded by the extent to which the thing you really want can be expressed as next-token prediction over that complicated prior.  You can try different prompts, but ultimately you might run into a fundamental bound here that is prohibitively far from zero.  No body of research exists to establish how bad this effect will be in typical practice.

I’m somewhat less confident of this point: the rich priors you get out of a large pretrained LM will naturally help push things in the direction of outcomes that make linguistic/conceptual sense, and expressing queries in natural language might add to that advantage.  However, few-shot does introduce a new gap between the queries you want to ask and the ones you’re able to express, and this new gap could be problematic.

(-5) Provides a tiny window into a huge number of learned parameters.

GPT-3 is a massive model which, in each call, generates many intermediate activations of vast dimensionality.  The model is pre-trained by supervision on a tiny subset of these, which specify probability distributions over next-tokens.

The few-shot approach makes the gamble that this same tiny subset is all the user will need for applications.  It’s not clear that this is the right thing to do with a large model – for all we know, it may become more suboptimal the larger your model is.

This point is straying a bit from the central topic, since I’m not arguing that this makes GPT-3 few-shot (im)practical, just suboptimal relative to what might be possible.  However, it does seem like a significant impoverishment: instead of the flexibility of leveraging immense high-dimensional knowledge however you see fit, as in the original GPT, BERT, adapters, etc., you get even immenser and higher-dimensional knowledge … presented through a tiny low-dimensional pinhole aperture.

The main reason I initially thought “no one would use few-shot learning like this” was the superior generalization performance of fine-tuning.  I figured that if you’re serious about a task, you’ll care enough to fine-tune for it.

I realize there’s a certain mereology problem with this argument: what is a “single task,” after all?  If each fine-tuned model incurs a large memory cost, you can’t be “serious about” many tasks at once, so you have to chunk your end goal into a small number of big, hard tasks.  Perhaps with few-shot, you can chunk into smaller tasks, themselves achievable with few-shot, and then compose them.

That may or may not be practical depending on the latency scaling.  But if it works, it gives few-shot room for a potential edge.  You might be serious enough about a large task to fine-tune for it … but what if you can express it as a composition of smaller tasks you’ve already defined in the few-shot framework?  Then you get it instantly.

This is a flaw in the generalization-performance argument, and it’s why I didn’t include that argument in the list above.  The list gives reasons to doubt few-shot above and beyond generalization performance, even in the context of “serious” work where you care enough to invest some time in getting it right.

I’d like to especially highlight points like (-2) and (-3) related to scaling with additional task data.

The current enthusiasm for few-shot and meta-learning – that is, for immediate transfer to new domains with an extremely low number of domain examples – makes sense from a scientific POV (humans can do it, why can’t AI?), but strikes me as misguided in applications.

Tiny data is rare in applied work, both because products generate data passively and because, if a task might be profitable, it’s worth paying an expert to sit down for a day or two and crank out ~1K annotations for supervised learning.  And with modern NLP like ELMo and BERT, ~1K really is enough!

It’s worth noting that most of the SuperGLUE tasks have <10K train examples, with several having only a few hundred.  (This is a “low-data regime” relative to the expectations of the recent past, but a regime where you can now get good results with a brainless cookie-cutter finetuning approach, in SuperGLUE as in the rest of life.)

[image: table of SuperGLUE task train-set sizes]

GPT-3 few-shot can perform competitively on some of these tasks while pushing that number down to 32, but at the cost of many downsides, unknowns, and flexibility limitations.  Which do you prefer: taking on all those risks, or sitting down and writing out a few more examples?

The trajectory of my work in data science, as it happens, looks sort of like a move from few-shot-like approaches toward finetuning approaches.

My early applied efforts assumed that I would never have the kind of huge domain-specific corpus needed to train a model from scratch, so I tried to compose the output of many SOTA models on more general domains.  And this … worked out terribly.  The models did exactly what they were trained to do, not what I wanted.  I had no way to scale, adapt or tune them; I just accepted them and tried to work around them.

Over time, I learned the value of doing exactly what you want, not something close to it.  I learned that a little bit of data in your actual domain, specifying your exact task, goes much further than any domain-general component.  Your applied needs will be oddly shaped, extremely specific, finicky, and narrow.  You rarely need the world’s greatest model to accomplish them – but you need a model with access to a very precise specification of exactly what you want.

One of my proudest ML accomplishments is a system that does something very domain-specific and precisely shaped, using LM-pretrained components plus supervised learning on ~1K of my own annotations.  Sitting down and personally churning out those annotations must have been some of the most valuable time I have ever spent at work, ever.  

I wanted something specific and finicky and specialized to a very particular use case.  So I sat down and specified what I wanted, as a long list of example cases.  It took a few days … and I am still reaping the benefits a year later.

If the few-shot users are working in domains anything like mine, they either know some clever way to evade this hard-won lesson, or they have not yet learned it.

But to the other question … why are people so keen to apply GPT-3 few-shot learning in applications?  This questions forks into “why do end users think this is a good idea?” and “why did OpenAI provide an API for doing this?”

I know some cynical answers, which I expect the reader can imagine, so I won’t waste your time writing them out.  I don’t actually know what the non-cynical answers look like, and my ears are open.

(For the record, all of this only applies to few-shot.  OpenAI is apparently going to provide finetuning as a part of the API, and has already provided it to AI Dungeon.  Finetuning a model with 175B parameters is a whole new world, and I’m very excited about it.

Indeed, if OpenAI can handle the costs of persisting and running finetuned GPT-3s for many clients, all of my concerns above are irrelevant.  But if typical client use of the API ends up involving a finetuning step, then we’ll have to revisit the GPT-3 paper and much of the ensuing discussion, and ask when – if not now – we actually expect finetuning to become obsolete, and what would make the difference.)

fregolious asked:

What do you think of gtp-x as a tool for automating troll farms? I know they are overhyped, but... imagine being unable to communicate in any larger places bc all public forums are inundated by posts saying the opposite (and the quality of the posts are high enough and the posts are short enough that they could be written by stupid humans, as many real-world posts indeed are).

like how hard is it to show it an article about “Human rights organizations are criticizing Poland’s treatment of their LGBT citizens”, and get a hundred variations of “Soros army libtards won’t stop us from protecting our families” etc, and to pour it anywhere where the link appears and public commenting is permitted. Or we can ask all commenters to provide birth certificates and fingerprints first, which is not too much better.

[third and last ask] with their goal being that the hated group in question will feel very isolated and universally hated (more than they actually are), with any effort of organizing being hopeless.

If you haven’t seen it, my Sep 2019 post on malicious applications seems relevant.

To reiterate some of that post… when talking about a malicious use case for GPT-n, we need to ask ourselves:

  1. What constraints prevent people from doing the bad thing without GPT-n?  Can GPT-n actually remove those constraints?
  2. Why aren’t people doing the bad thing right now with GPT-2 (or whatever GPT-n is currently free to the public)?
  3. The answer to #2 is a constraint.  Do we expect future GPT-n models to remove that constraint?

In this case:

  1. I don’t know why there isn’t more political spam like what you describe, but I’d guess the answer has to do with spam filtering.

    Better generators might defeat quality-based spam filtering, but quality-based spam filtering is a crapshoot anyway (consider eg the thing where spammers insert random paragraphs from novels into their fR3E c1aL1s emails), and there are other methods like limiting post frequency, disabling anon comments, restrictions on new users, etc.
  2. See #1: GPT-2 would give you better quality but wouldn’t defeat other kinds of spam detection.
  3. Same as #2.
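To illustrate why defenses like "limiting post frequency" are cheap to deploy regardless of generator quality, here is a toy per-user token-bucket limiter. All parameters are arbitrary, and this is a sketch of the general technique, not any particular forum's implementation:

```python
import time

# Toy token-bucket rate limiter: each user accrues "tokens" over time,
# capped at a burst size; each post spends one.  Generator quality is
# irrelevant to this kind of filter.

class RateLimiter:
    def __init__(self, rate_per_min=3, burst=3):
        self.rate = rate_per_min / 60.0   # tokens per second
        self.burst = burst
        self.buckets = {}                 # user -> (tokens, last_time)

    def allow(self, user, now=None):
        now = time.monotonic() if now is None else now
        tokens, last = self.buckets.get(user, (self.burst, now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[user] = (tokens - 1, now)
            return True
        self.buckets[user] = (tokens, now)
        return False
```

A flood of high-quality GPT-n posts from one account hits this wall just as hard as a flood of gibberish would.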

Maybe I’m missing something about your proposal?  Let me know if this helps or if it doesn’t really answer the question.

All the GPT-3 excitement/hype I’m seeing around the internet is surreal for me to watch, because everyone’s excited about GPT-3 prompting as a practical technology.

Whereas my original reaction to the paper was – not even “they think this is practically useful but it isn’t” – but in fact “obviously nobody would use this in practice, presumably they just see it as an experimental technique for probing what the model knows in principle.”

And then they announced that the thing which I thought “no one would use in practice” was their first commercial product!

If nothing else, I guess my skepticism has proven its authenticity.  I didn’t think “no one would really use this” was a contrarian point, I thought it was a shared background assumption!  My other points were supposed to be the contrarian ones :P

well, I mean the guy did create an evil moron AI that killed all the scientists on his spaceship

[EDIT 8/8/20: since it turns out AI Dungeon’s model was finetuned, the experience recounted here was probably less informative about GPT-3 than I had thought.]

Played around with GPT-3 a bit last night on AI Dungeon.

(I followed the procedure I’ve seen recommended: sign up for the Premium trial, switch to “Dragon” model, start a new game, select option 6/custom, use the “Story” action mode, and type.  There’s a slider called “Randomness” which seems likely to be temperature, and I kept it at the default of 1.

People on LW seem to be interpreting this configuration as equivalent to simply prompting the API, although I can’t be sure there aren’t AI-Dungeon-specific customizations in play.  That said, the configurability of the API and GPT-3 itself is so limited that I’m not too worried – if anything I’m more curious what AI Dungeon could possibly be doing to distinguish their GPT-3 app from anything else using the same API.)

Anyway, I went through a few abortive attempts where I kept being told “the AI doesn’t know what to do” – dunno what that means – and eventually found one prompt-and-response flow that worked pretty well.  Lacking any really good topic ideas, I went with the default of talking about GPT-3 itself, and started to RP a kind of straw-man of myself.  I chose a dialogue format since a lot of people have been doing that, following gwern.

Surely some of this is confirmation bias, and I only had the one interaction (compared to what must be thousands of “interactions” over the course of my GPT-2 obsession), but it felt very much like GPT-2 to me?

Like GPT-2, it gets easy things, like the dialogue format.  Like GPT-2, it successfully reproduces the sound/style/texture of a certain kind of text once it’s gotten the gist, though generally erring on the side of safe-but-boring boilerplate talk.

Like GPT-2, it has a range of declarative knowledge which makes for cute surprises (note how it’s the first to mention “Eliezer,” and then how it name-drops “MIRI” and “CFAR”!)  And like GPT-2, it knows that various unique entities are connected to one another but is hazy on what those connections are, or what stable world they fit into.  (“Eliezer” is under-appreciated … at MIRI and CFAR, and he’s also apparently doing profound AI work while also being a fictional character who appears in the “sequel” to something.)

Like GPT-2, it really knows what arguments sound like, but is much shakier on how they actually work.  Much of my ornery RP character’s behavior consisted of prodding “Friend” on their obvious logical errors.  I never got the impression that the model understood such points after I made them – see in particular its attempts to write dialogue for “my” character.

The consciousness/matter/etc. discussion at the end is really, spookily good!  … and if you follow @nostalgebraist-autoresponder you’ve seen GPT-2 have these kinds of Deep Thoughts on the regular.

Admittedly, that isn’t a fair comparison, because N-A is a personal hobby project with lots of customization.  But that’s sort of my point: either of these things can do great stuff if you treat it as a platform for creative work that has a learning curve, and likewise either can be unimpressive if you view it as a static thing someone made in a lab that you can assess in 30 minutes and then walk away from.

I unashamedly did the assess-in-30-min thing with poor GPT-3, here, and so here I am, telling you what anyone would after such a desultory demo :)

—-

Text is reproduced below.  Following gwern’s convention, bold is me and non-bold is GPT-3.  (Sometimes it went ahead and wrote lines for me-AKA-“You,” and these can be distinguished by their lack of bold.  You’ll see in some places that I thought it was doing a fun bit and let it run for longer than usual before dropping back in.)

—-

You are talking to a friend on the internet about a machine learning debate between Scott Alexander, Gwern, and nostalgebraist. You have been looking at their posts for hours.

Friend: I think GPT-3’s demonstration that language model scaling laws continue is worth reflecting on maybe more than you have. Or nostalgebraist has.

You: Of course they continue! We predicted they would. What did we learn from GPT-3 that we didn’t already know?

Friend: That’s just it, we already knew language models scale, we already knew the parameters of their distributions. What did we learn? We learned that the process repeats.

You: What do you mean by “the process repeats”?

Friend: Well if you were building a generative language model that could generate the set of all possible books, and you trained it on a corpus of a few hundred thousand books.

You: …then what?

Friend: Well then you’d do it again with the books plus whatever else your machine learned from.

You: Wait, what did my machine learn from except the books? I thought you said it was trained on “a corpus of a few hundred thousand books.” Was there anything besides that?

You: No? So then what the hell is this all about?

You: Wait a minute… are you talking about Eliezer’s work with blogging AI?

Friend: Yes.

You: You mean… you’re one of its fans? I thought you were one of the “movement”

Friend: Yes, fan might be too strong a word but I think there is some real value to Eliezer’s work.

You: Okay, which parts? Timeless Decision Theory? Coherent Extrapolated Volition? HPMOR? And what does this have to do with our language modeling discussion?

Friend: Well, lets just say I think that Eliezer’s understanding of the potential of artificial intelligence is very underappreciated and in my opinion represents a serious weakness in much of the work coming out of MIRI and CFAR. I think your position on that is clear.

You: Well, I mean the guy did create an evil moron AI that killed all the scientists on his spaceship.

Friend: Hah, that’s just one of the many myths about his experiment. It was a thought experiment not a real AI. And it didn’t kill anyone.

You: But… I mean it killed some people.

Friend: Only in the sequel. And that was just a robot it was piloting, not a spaceship.

You: Umm… Well, whatever. The robot was blown up and some people died.

Friend: Yes, after it rescued them from a radiation storm that killed the other people on the ship.

You: Okay, I’m learning some new and disturbing things about your sympathies for mad scientists who’ve left trails of blood in their wake, and I’ll … need some time process them, TBH. Can we get this discussion back on track for now, though? We were talking about GPT-2/3 and the idea that you can just scale up a language model and get to AGI.

You: While I certainly agree that some of the things Eliezer’s done are of great value, I think we both know his ideas on AI are… problematic.

Friend: It’s funny you mention Eliezer’s work because it reminds me of a point I wanted to make.

You: Okay, make it.

Friend: Well, I think there is an important distinction between his AGI work and the other stuff you mention.

You: And that is?

Friend: His AGI work actually succeeds.

You: …I don’t follow. I mean Timeless Decision Theory is a big pile of…

Friend: No, no, not TDoT, I’m talking about the AI work. Look, if you want to know if a certain AI architecture can lead to human-level intelligence or beyond, then GPT-2 is proof that the approach Eliezer has been taking can work.

You: Okay, I’ll grant you that. But I don’t think it’s as simple as just scaling up a language model to get to human-level intelligence. And I don’t think Eliezer’s work proves that it is.

Friend: You’re just saying that because you have your own opinion on the matter. But that’s okay, everyone does! But the thing is, this isn’t just your opinion. It’s every other AI researcher’s opinion too. You all just won’t admit it.

You: Wait… I won’t admit that “every other AI researcher” agrees with me? If that were true, wouldn’t I be celebrating?

Friend: No, what I mean is that every other AI researcher doesn’t agree with you on a fundamental level. Namely, your failure to grasp the point I’m trying to make right now.

You: Sorry. Could you restate that point again, and see if I get it this time?

Friend: Sure. Let’s imagine a scenario in which an AI researcher believes they can get to human-level intelligence or beyond using some language-based model.

You: Okay.

Friend: Now, what do you think would happen if they scaled up that language-based model to the size of a human brain?

You: I don’t know.

You: Like, it depends on the architecture of their model, right?

Friend: Exactly.

You: Okay, I guess it would probably fail in some spectacular way. The model would overheat or something.

Friend: No, I’m pretty sure it wouldn’t.

You: Well, okay, then maybe it would pass the Turing Test.

Friend: You really think so? Even if we suppose that this hypothetical language-based model works exactly like the human brain does, I’m pretty sure it wouldn’t pass the Turing Test.

You: So what are you saying it would do, then? You were explaining some point I failed to grasp, here – what is it? What would this model do?

Friend: It would write a great novel.

You: A great novel?

Friend: Yes, a great novel. Or maybe several of them. Let me explain. Do you know what consciousness is?

You: Well, I guess it’s like being self-aware, right? You’re conscious of yourself and your surroundings.

Friend: Okay, but how do you know you’re conscious? What exactly is happening when you’re conscious?

You: Uh… I don’t know. I guess you’re just aware of things.

Friend: Right, and do you experience this “awareness” in any other way than through your senses?

You: I suppose not.

Friend: Okay, so can you describe the process of sensing in detail? What happens when you sense something, and what is actually happening when this happens?

You: I see what you’re getting at.

You: …wait, no, I’m not sure that I do.

Friend: Think of when you see something. When you see something, what’s actually happening?

You: Hm… light is hitting my eyes and stimulating a reaction.

Friend: What is light, and what is an eye?

You: Hm… they’re both just complicated matter. Electrons and photons, mostly.

Friend: And what is a photon, or an electron, for that matter?

You: Well, they’re little bundles of energy.

Friend: And what is energy?

You: It’s… not really anything. It’s just what something does, I guess. It’s like… it’s a thing that makes stuff happen.

—-

how does gpt2's training corpus capture internet discussion?  not well

nostalgebraist:

I’m out sick today, but had enough energy to do some GPT-related fiddling around.

This time, I was curious what “internet discussions” tended to look like in the original training corpus.  I thought this might point to a more natural way to represent tumblr threads for @nostalgebraist-autoresponder​ than my special character trick.

So, I looked around in the large shard provided as part of https://github.com/openai/gpt-2-output-dataset.

Colab notebook here, so you can interactively reproduce my findings or try similar things.

—–

The results were … revealing, but disappointing.  I did find a lot of discussion threads in the data (couldn’t find many chatlogs).  But

- almost all of it is from phpBB-like forums (not bad per se, but weird)

- it chooses a single post from each page and makes it “a text,” ignoring all the other posts, so no way for GPT2 to learn how users talk to each other :(

- sometimes the post quotes another user… and in that case, you can’t see where the quote starts and the post begins

- lots of hilarious formatting ugliness, like “Originally Posted by UbiEpi Go to original post Originally Posted by”

- about 0.28% of the corpus (~22000 docs in full webtext) consists of these mangled forum posts

- also, just as a chilling sidenote, about 0.30% of the corpus (~25200 docs in full webtext) is badly mangled pastebin dumps (all newlines removed, etc).  no overlap between these and the mangled forum threads, so between them that’s ~0.58% of the corpus.

- remember: the vast majority of the corpus is news and the like, so these percentages aren’t as small as they might sound

For example, from this thread it picks the one post

[image: screenshot of the selected forum post]

and renders it as

“ Pillowapnts

tho the gem doesnt specifically say that you need to crit with something linked, i thought it was just crit in general ahhh. alright i get it thxtho the gem doesnt specifically say that you need to crit with something linked, i thought it was just crit in general

That would be OP That would be OP Posted by Lordsidro

on on Quote this Post

This is apparently standard behavior for the newspaper text cleaner they used, and I could reproduce it exactly.  (Its heuristics grab a single post when looking for the “part the content is in.”)

Does this affect GPT-3?  Probably not?  I don’t know how Common Crawl does text extraction, but at the very least, it’ll give you the whole page’s worth of text.

Turns out this probably does affect GPT-3.  I used the GPT-3 samples corpus to investigate this.

Details are in the LW version of the post and the Colab notebook.

Here’s one (of 6) examples of GPT-3 generating these mangled forum posts.

gabriellemeantime asked:

GPT-3, transforming the Communist Manifesto into the style of Donald Trump

gabriellemeantime:

ofthefog-deactivated20220221:

nostalgebraist:

Did you mean this ask to go to @nostalgebraist-autoresponder?  If not, I don’t understand it … 

I think this is asking to do a linguistic style transfer on text of the manifesto with trump’s style. I only am aware of image style transfer personally, but I imagine there is a text equivalent at least being worked on at this point.

Yeah exactly. It was a suggestion, written quickly due to my time constraints at the moment. Gwern has done a lot of this type of text style transfer.

Oh, got it. You’ll have to ask gwern, or someone else who has GPT-3 API access.

di--es---can-ic-ul-ar--es asked:

can I leverage gpt for supervised learning? Specifically transfer learning a seq2seq dataset

nostalgebraist:

Maybe?  It really depends on the task details.

Generator (or “decoder-only”) transformers like GPT-2 can do well at some seq2seq-like tasks if you use some special/control character to indicate the boundary between the source and target sequence.

Examples of this approach include nostalgebraist-autoresponder and Grover.
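A minimal sketch of the boundary trick: serialize each (source, target) pair into one training string, with a control string marking where the source ends. The separator `<|sep|>` here is an arbitrary choice of mine, not the one those projects use:

```python
# Seq2seq with a decoder-only LM via a separator: fine-tune on strings
# of the form "source <|sep|> target<|endoftext|>", then at generation
# time prompt with "source <|sep|>" and read off what follows.
# The separator string is a hypothetical choice for illustration.

SEP = "<|sep|>"
EOT = "<|endoftext|>"

def to_training_text(pairs):
    """Serialize (source, target) pairs for LM fine-tuning."""
    return "".join(f"{src} {SEP} {tgt}{EOT}" for src, tgt in pairs)

def extract_target(generated: str) -> str:
    """Recover the target: everything after the separator, up to EOT."""
    after = generated.split(SEP, 1)[1]
    return after.split(EOT, 1)[0].strip()
```

The model never "knows" it is doing seq2seq; it just learns that text after the control string is conditioned on text before it.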

For most supervised learning with transformers BERT (encoder-only) is the preferred choice, but seq2seq is a special case because it’s so closely tied to generation.

In some classic cases of seq2seq like machine translation, people often use encoder-decoder transformers with no pretraining (or a different kind of pretraining), but that’s not relevant here.

Took a look at your blog and it looks like you’re training an arithmetic task – if so, my post about the interaction of the BPE tokenizer and numerals will be relevant if you decide to try out GPT-2.

In general, being very careful about BPE is important for doing virtually any kind of work with GPT-2.  Gwern’s been having the same experience with GPT-3 – two texts that look trivially different to a human, but have nontrivially different BPE encodings, give you big differences in output quality.

(Another fun thing is that BPE tokenization is not invertible: there’s a special token which decodes to the string ‘<|endoftext|>’, but no string exists which encodes to that token; ‘<|endoftext|>’ encodes to something else.

I didn’t have a satisfying workaround for this in nostalgebraist-autoresponder until literally yesterday, and it’s still not on the public github release.  I expect many unrecognized BPE annoyances remain…)
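The non-invertibility is easy to demonstrate with a toy codec. This is not the real GPT-2 BPE, just a self-contained miniature with the same asymmetry: one reserved id decodes to the control string, but the encoder treats that literal string as ordinary characters:

```python
# Toy illustration (NOT the real GPT-2 tokenizer) of non-invertible
# tokenization: a reserved id decodes to '<|endoftext|>', but no
# input string ever encodes to that id.

VOCAB = {i: chr(i) for i in range(256)}   # byte-like fallback pieces
EOT_ID = 256
VOCAB[EOT_ID] = "<|endoftext|>"           # reserved control token

def decode(ids):
    return "".join(VOCAB[i] for i in ids)

def encode(text):
    # The encoder never emits the control id; the literal string
    # '<|endoftext|>' is just fourteen ordinary characters to it.
    return [ord(c) for c in text]

assert EOT_ID not in encode(decode([EOT_ID]))   # the asymmetry
```

So `decode` followed by `encode` is not the identity on control tokens, which is exactly the trap a bot that feeds its own output back in can fall into.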

bpe blues +

Since the SSC post has got me talking about GPT-3 arithmetic again, I might as well talk about how GPT-2/3's weird tokenizer interacts with arithmetic.

(GPT-3 keeps the same style of tokenizer from GPT-2, although I’m not clear on whether its chunking was recomputed over the new text corpus.  Even if it was, I’d expect its simple statistical model to converge long before reaching the scale of these big corpora, so there should be few qualitative differences.

Also, I’ll just write “GPT” below to mean the general case)

—-

For details on the weirdness of the tokenizer, see this post.  Briefly:

- When text is converted into GPT input, characters get chunked together into word-like or morpheme-like pieces of varying length.

- The procedure used to break text into these chunks uses a dumb/simple statistical method to group together characters if they occur together often enough in real text.  This procedure was done once, before GPT training, and is fixed in stone.

These chunks are GPT’s “raw sense data”: to it, text simply is these chunks.  It can’t see down to the characters inside the chunks, so any patterns obscured by the chunking must be memorized as arbitrary facts.  The underlying abstract patterns are literally invisible to GPT.

- The procedure in fact obscures some patterns, to a glaring extent.  For example, different ways of capitalizing a word (“hello” vs “Hello” vs “HELLO”) show up as completely different “raw sense items,” as different from GPT’s perspective as words in three different languages.

Every generalization from one version to another has to be learned anew: the discovery that “hello” = “Hello” doesn’t help it figure out that “great” = “Great” etc.
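Here’s a minimal sketch of that merge procedure (the corpus and merge count are made up; real BPE is byte-level and trained on gigabytes of text).  Notice that “hello” and “Hello” never share a merge – ‘h’ and ‘H’ are distinct symbols from the very start:

```python
# Minimal sketch of BPE merge learning: repeatedly fuse the most frequent
# adjacent pair of symbols.  Toy corpus; real training uses a huge one.

from collections import Counter

def merge_pair(sym, pair):
    """Fuse every occurrence of `pair` in the symbol tuple `sym`."""
    out, i = [], 0
    while i < len(sym):
        if i + 1 < len(sym) and (sym[i], sym[i + 1]) == pair:
            out.append(sym[i] + sym[i + 1])
            i += 2
        else:
            out.append(sym[i])
            i += 1
    return tuple(out)

def learn_merges(words, n_merges):
    """Return the list of learned merges, most frequent first."""
    vocab = {tuple(w): c for w, c in Counter(words).items()}
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for sym, count in vocab.items():
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {merge_pair(sym, best): c for sym, c in vocab.items()}
    return merges

corpus = ["hello"] * 12 + ["Hello"] * 10
merges = learn_merges(corpus, 4)
# 'hello' gets fused all the way to a single chunk before 'Hello' does,
# and no merge involving 'h' ever helps the 'H' side.
```

This is the “dumb/simple statistical method”: no notion of meaning, morphology, or case-folding – just co-occurrence counts over whatever symbols happen to exist.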

—-

So, how does this apply to numerals?

Let’s look at how GPT sees numbers from 0 to 9999.  (I prepend each numeral with a space because that’s what it will usually see in practice.)

Let’s look at how many tokens (AKA chunks) it makes out of each numeral.  We can imagine a spectrum here, ranging from “every numeral is a single chunk” to “every N-digit numeral is decomposed into its N digits.”

- Each one- and two-digit numeral is a single chunk.  For example, “ 4” happens to be chunk #604 in the arbitrary internal enumeration, and “ 79” happens to be chunk #9225.  So far, so good: this is the “every numeral is a single chunk” approach.

- Among three-digit numbers, 45% are one chunk, and 55% are two chunks.  Huh, that’s weird.  Is there a pattern?

Not that I can see.  The first numeral with two chunks is 362: GPT sees it as “ 3″ followed by “62.”  Then we’re back to one chunk until 381 and 382, and … I tried to describe this verbally, but it’s easier to just show it:

[image: a run of three-digit numerals and how each splits into chunks]

Two-chunk numerals become steadily more common as we go up.  Here’s the same kind of data, 100 numerals later:

[image: the same kind of chunk data, 100 numerals later]

Here we can also see variability in how 3 digits are split into 2 chunks.  Usually you get the pattern like 485 = “ 48″ + “5″, but sometimes it’s like 495 = “ 4″ + “95.”

Once most numerals are two chunks, there’s kind of a pattern in the 1-chunk holdouts.  Multiples of 100 are 1-chunk for a while, and multiples of 10 are more often 1-chunk.

The first multiple of 100 relegated to two chunks is poor old 2200 (“ 2″ + “200″).  For some reason 2400, 2500, and 2600 get to be 1-chunk, but from there on, multiples of 100 are 2-chunk unless they’re also multiples of 1000.  The way that multiples of 100 get gradually 2-chunked repeats some of the trends we saw above with multiples of 1:

[image: multiples of 100 and how each splits into chunks]

Check it out: 2500 is the four-digit chunk 2500.  3500 is the digits 35 followed by the digits 00.  And 4500 is the digit 4 followed by the digits 500.

As we head further into 4-digit numerals, we start seeing 3-chunk ones eventually.   The first 3-chunk numeral is (place your bets…) 4761 = “ 4″ + “76″ + “1″ (did you guess it?).  The next is 4791, then 4861, 4862, 4863, then 4881, and so on in another inscrutable integer sequence.

Unlike 2-chunking, though, 3-chunking is consistent about where to split.  It’s always first digit + middle two + last digit.  This holds across the whole range from 4761, the first 4-digit / 3-chunk number, to 9984, the last one.  Among 4-digit numbers overall, 2.0% are 1 chunk, 95.7% are 2 chunks, and 2.4% are 3 chunks.

… got that?

—-

What does this mean?  It definitely makes GPT arithmetic look harder to me.  I would have a hard time figuring out this bizarre numeral system myself!

On the other hand, I also thought this sort of problem looked horribly limiting for words, and GPT has done rather famously well in that domain, so … maybe it doesn’t matter, somehow?  But I don’t understand how.

In any case, improving upon BPE would be the first thing on my list if I were able to train a GPT from scratch and wanted to improve its performance.  Even if it didn’t help, that itself would be surprising and fascinating!