
raginrayguns:

I’m so happy we live in precisely the right moment of history for @nostalgebraist ’s “what is the nature of mathematical modeling” side to intersect with his interest in literary style and writer’s distinctive “voices”, through neural networks that provide mathematical models of writing styles

:)

dall-e

OpenAI has a new blog post out, about “DALL-E,” a system that generates images from captions.  The paper isn’t out yet.

Copy/pasting an LW comment I wrote about it, since people here may be interested too.

———–

The approach to images here is very different from Image GPT.  (Though this is not the first time OpenAI has written about this approach – see the “Image VQ” results from the multi-modal scaling paper.)

⭑ In Image GPT, an image is represented as a 1D sequence of pixel colors.  The pixel colors are quantized to a palette of size 512, but still represent “raw colors” as opposed to anything more abstract.  Each token in the sequence represents 1 pixel.

⭑ In DALL-E, an image is represented as a 2D array of tokens from a latent code.  There are 8192 possible tokens.  Each token in the sequence represents “what’s going on” in a roughly 8x8 pixel region (because they use 32x32 codes for 256x256 images).

(Caveat: The mappings from pixels → tokens and tokens → pixels are contextual, so a token can influence pixels outside “its” 8x8 region.)

This latent code is analogous to the BPE code used to represent tokens (generally words) for text GPT.  Like BPE, the code is defined before doing generative training, and is presumably fixed during generative training.  Like BPE, it chunks the “raw” signal (pixels here, characters in BPE) into larger, more meaningful units.

This is like a vocabulary of 8192 “image words.”  DALL-E “writes” a 32x32 array of these image words, and then a separate network “decodes” this discrete array to a 256x256 array of pixel colors.
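A shape-level sketch of this representation, with random tokens standing in for what a real model would write (the constants are the ones quoted above; the "decoder" here is just a naive upsampler, nothing like the real one):

```python
import numpy as np

# Shape-level sketch of the DALL-E image representation (not the real model).
# An image is a 32x32 grid of discrete tokens from a vocabulary of 8192
# "image words"; a decoder maps that grid back to 256x256 pixels.

VOCAB_SIZE = 8192   # size of the latent code
GRID = 32           # tokens per side
PIXELS = 256        # pixels per side

rng = np.random.default_rng(0)

# The transformer "writes" a 32x32 array of image words (random stand-in).
image_words = rng.integers(0, VOCAB_SIZE, size=(GRID, GRID))

# Each token nominally covers a (PIXELS // GRID) x (PIXELS // GRID) region.
region = PIXELS // GRID
assert region == 8

# A stand-in "decoder" upsamples the discrete grid to pixel space.
# (The real decoder is contextual, so tokens influence nearby regions too.)
decoded = np.kron(image_words, np.ones((region, region)))
print(decoded.shape)  # (256, 256)
```

The point of the shapes: the generative model never touches pixels directly, only the 32x32 token grid.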

Intuitively, this feels closer than Image GPT to mimicking what text GPT does with text.  Pixels are way lower-level than words; 8x8 regions with contextual information feel closer to the level of words.

As with BPE, you get a head start over modeling the raw signal.  As with BPE, the chunking may ultimately be a limiting factor.  Although the chunking process here is differentiable (a neural auto-encoder), so it ought to be adaptable in a way BPE is not.

(Trivia: I’m amused that one of their visuals allows you to ask for images of triangular light bulbs – the example Yudkowsky used in LOGI to illustrate the internal complexity of superficially atomic concepts.)

helping my bot understand tumblr: the colossal rattumb corpus

This post describes a recent change to @nostalgebraist-autoresponder​.  It’s the one I alluded to in my survey post earlier this week.

tl;dr for people with sufficient background

Previously, I had only fine-tuned GPT-2 on posts from my own tumblr, along with my fiction and Goodreads reviews.

Now, I have collected a much larger tumblr corpus, scraped from 87 tumblr users who I’ve interacted with over the years.

At ~1e8 tokens, this corpus is 17 times larger than the corpus from my tumblr alone, and roughly 1% as big as the WebText corpus used to train GPT-2 originally.

I fine-tuned GPT-2 on this corpus, then fine-tuned the result further on the original narrow corpus (my tumblr + my fiction).

I’ve built several models using this pipeline.  My bot first began using models of this type on 11/22/20, and the most recent such model (a “stable” version I’m pretty happy with) was deployed on 12/7/20.

introducing the problem

For a long time, I’ve wished I could give my bot a clearer understanding of its fundamental task: having conversations with people on a social networking site, and specifically on tumblr dot com.

To recap, here are some basic facts about how my bot works.  (I’m using a star character as the bullets of a list here because I don’t like how bulleted lists look on tumblr.)

⭑ My bot is very complicated, but its core is a text generator.

⭑ There’s a bunch of “onion layers” built around the text generator that determine how to use the generator to create each post, along with lots of other bells and whistles like the mood feature.

⭑ These extra layers have some control over what text gets posted and what doesn’t.  But the generator is the only part that writes the text.  It’s the writer, and the other layers are like its editor.

⭑ So, at the end of the day, any text posted by the bot was written by the generator.  (Except for the standard text accompanying mood graphs, I wrote that.)

⭑ This post is only about the generator.

⭑ The generator is a fine-tuned GPT-2 1.5B.  That means I start out with the GPT-2 model released by OpenAI, and then fine-tune it on tumblr posts.

————-

The generator is tasked with writing text that sounds like a person on tumblr talking to other people on tumblr.  To do this, it needs to know things like:

General facility: knowledge of the English language, of various commonly-known facts, etc.

The GPT-2 model released by OpenAI is already great at this stuff, because it was trained on “WebText,” a large and varied corpus of documents found on the internet.

Conversational facility: ability to work with texts that represent conversations between different speakers.

This kind of text is different from many others that GPT-2 knows how to write.  At each place in the text, someone is speaking.  Different speakers may have different styles and opinions, so when one speaker stops talking and the next starts, the generated text needs to change accordingly.

There are also various conversational norms that should apply.  If speakers A and B are talking, and speaker B says “I don’t agree with your second point,” this doesn’t make sense unless speaker A has made at least two distinct points.

In principle, OpenAI’s GPT-2 should have some understanding of these matters, since WebText ought to contain plenty of fiction and other texts involving conversation.

However, I have found it difficult to leverage this knowledge (if it exists) for the purposes of a social media bot.  Elsewhere I investigated WebText’s coverage of internet discussion and found it was startlingly poor.  And the style and conventions of internet discussion are very different from those that apply to speech by fictional characters wrapped in quotation marks.

Tumblr-specific facility: knowing the tumblr-specific meanings of words like “ask” and “reblog,” understanding the unique social norms of tumblr, etc.

GPT-2 knows none of this, and previously my generator only learned it from my blog and the people I reblogged.  That’s a small and limited window into a whole nuanced social world.

the colossal rattumb corpus

The obvious solution is just to get more data.  If WebText didn’t teach GPT-2 what I need it to know, then I’ll create a corpus that does.

To that end, I scraped a bunch of tumblr blogs, ultimately 87 in total.

I didn’t scrape every blog completely.  (Some people’s blogs have a huge number of posts in total, and I didn’t want to get stuck scraping one blog for days just because it was one of these.)  My rough rules were:

⭑ I scraped the entire blog if it had under 50,000 posts in total

⭑ If a blog had > 50K posts (especially if it had > 100K), I would scrape a subset.  I tried to make these subsets focus on a period from roughly 2014 to 2016, since I was most active then myself, and I thought this would best carve out a coherent social environment.  However, this wasn’t a hard-and-fast rule and I sometimes scraped the most recent ~50K posts of a blog.
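The rough rules above, as a sketch (the 50K threshold is from the post; the helper function and its return strings are hypothetical):

```python
# Sketch of the blog-scraping rule described above.

POST_LIMIT = 50_000

def scrape_plan(total_posts):
    """Decide how much of a blog to scrape, given its total post count."""
    if total_posts < POST_LIMIT:
        return "entire blog"
    # For big blogs, take a subset: prefer posts from roughly 2014-2016,
    # or sometimes just the most recent ~50K posts.
    return "subset (~2014-2016, or most recent ~50K posts)"

print(scrape_plan(12_000))   # entire blog
print(scrape_plan(120_000))  # subset
```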

This means the time span of the corpus is fuzzily defined.  But it stretches back (to some extent) to the early 2010s and stretches forward to late November 2020.

I focused on users I’m familiar with socially from my years on the site, since this felt most promising for generating the kind of content I want.  Also, there were enough users in this category that I ran no risk of running out, given the limits on my scraping speed imposed by the tumblr API and the other API I use for OCR.

So, to a first approximation, it’s a big archive of rationalist-adjacent tumblr.  (In my head I found myself calling it the “Colossal Rattumb Corpus,” a riff on the “Colossal Clean Crawled Corpus” from the T5 paper, although it’s only “colossal” by the standards of my own previous work.)

The entire corpus is around 110 million tokens.  (That’s 1.1 x 10^8, or 1.1e8.)  This is:

⭑ between 16 and 17 times as big as my nostalgebraist-only corpus (6.5e6 tokens)

⭑ about 0.5% the size of the “WebText2” corpus (2.29e10 tokens) used in OpenAI’s scaling papers 

⭑ probably about 1% the size of the “WebText” corpus used to train GPT-2 originally – I say probably because I don’t have an exact token count for WebText, and am inferring from its relative file size (40 GB, vs 96 GB for WebText2)
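The inference in that last bullet takes two lines, assuming tokens-per-byte is roughly comparable across the two corpora (the figures are the ones quoted above):

```python
# Infer WebText's token count from relative file sizes, then compare
# the new corpus against it.

webtext2_tokens = 2.29e10   # reported token count for WebText2
webtext2_gb = 96            # WebText2 file size
webtext_gb = 40             # WebText file size

webtext_tokens_est = webtext2_tokens * (webtext_gb / webtext2_gb)
corpus_tokens = 1.1e8       # the new tumblr corpus

print(f"estimated WebText tokens: {webtext_tokens_est:.2e}")
print(f"corpus / WebText ~ {corpus_tokens / webtext_tokens_est:.1%}")
```

This gives roughly 9.5e9 tokens for WebText, putting the new corpus at a bit over 1% of it.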

I tracked the last of these statistics closely while scraping blogs.  I wanted the corpus to be some appreciable fraction of WebText itself, to give me confidence it would provide “GPT-2 caliber” knowledge of a new domain.

Indeed, given the diversity of WebText, 1% is pretty big: GPT-2 surely “understands” more than 100 distinct kinds of text, and some of these it must have picked up from a subset of WebText smaller than my new corpus.

The new corpus inherits all the text pre-processing choices I’ve made in the course of developing my bot.  This includes:

⭑ a specific natural-language way of delimiting posts and representing usernames  

⭑ blog content is filtered for inclusion in the same way I filtered my own blog for the original corpus – for example, I don’t include reblogs without comment

⭑ throughout, occurrences of the name “nostalgebraist” are replaced with “Frank” (or, in the context of a few of my personal tags, “nostalgebraist-autoresponder”)

personal sidenote

There was something strangely exciting about collecting this large archive of my own past social environment, with the aim of creating a machine that would behave like a new participant in that old environment.  Like I was reversing the flow of entropy.  Like I was resurrecting something.

training a generator on the corpus

Starting with OpenAI’s GPT-2, I fine-tuned in two stages.

⭑ First, I fine-tuned on the large tumblr corpus for 3 epochs.

⭑ Then I took the resulting model and fine-tuned it again on my small nostalgebraist-specific corpus.  This is a strict subset of the big corpus, as I included the non-tumblr nostalgebraist material in the big corpus (why not).  This step also lasted for 3 (much shorter) epochs.
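As a toy sketch, the two-stage pipeline looks like this (`fine_tune` is a stand-in that just records what a real training run would do; the corpus sizes are the post's figures):

```python
# Toy sketch of the two-stage fine-tuning pipeline.

def fine_tune(history, corpus, tokens, epochs):
    """Stand-in: record a fine-tuning stage and how many tokens it sees."""
    return history + [(corpus, epochs, epochs * tokens)]

run = ["gpt2-1.5b (OpenAI pretrained)"]

# Stage 1: 3 epochs on the large tumblr corpus (~1.1e8 tokens).
run = fine_tune(run, "colossal-rattumb-corpus", tokens=1.1e8, epochs=3)

# Stage 2: 3 (much shorter) epochs on the narrow nostalgebraist-only
# corpus (~6.5e6 tokens), a strict subset of the big corpus.
run = fine_tune(run, "nostalgebraist-corpus", tokens=6.5e6, epochs=3)

for stage in run[1:]:
    print(stage)  # (corpus, epochs, tokens processed)
```

Stage 1 processes roughly 50x more tokens than stage 2, which is why the second stage only "nudges" the model rather than overwriting what it learned.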

After the first stage, the model had a decent sense of the styles of different users.  It knew people’s personal tagging schemes.  Asked to write a post by @femmenietzsche​, it would produce something appropriately dry and pithy; asked to do @xhxhxhx​, it would produce an impressive simulacrum of his effortposts, full of fake links to impressive-sounding studies.

Likewise, it could already “do” nostalgebraist to some extent.  But I still thought the second stage was warranted, to nudge the model in the direction of “all else being equal, your text should sound like this.”  Also, to ensure competence in non-tumblr text clusters (my fiction and reviews), since I wanted the bot to be competent in these areas and they’re only a tiny fraction of the big corpus.

effects on the bot

The resulting generator – with selector and sentiment models re-trained to plug into it appropriately – has been live in production for a few weeks now.

I didn’t really know what to expect.  I imagined possibilities ranging from “no visible effect” to “vastly more human-like.”  What actually happened was closer to the former than the latter, but I do notice all kinds of differences, most of which are positive:

⭑ Frank’s responses feel more “conversational.”  She uses more banter that sounds like someone responding in a sociable manner (stuff like “thanks for asking!”).

⭑ Frank stays on topic more.  (Several people said they noticed this in their responses to my survey.)

⭑ Frank has picked up some things that are common on tumblr but not on my own blog – most hilariously, the kink meme format.  I also get the sense she’s more familiar with the stylistic nuances of, e.g., tumblr gender discourse.

⭑ Frank’s stylistic and emotional range seems larger.  She still sounds like my old blog posts sometimes, but at other times she’ll sound like a fandom blogger or something.

⭑ The centers of those ranges have also shifted.  Frank generally seems a bit happier, and a bit sillier.

I suspect this is partially the cause of Frank’s unprecedentedly high mood variable in recent days – some of that was all the penis story asks (lol), but some of it may just be that Frank talks like a happier person now, because she’s less narrowly imitating my old sadposts, and that feeds back into the mood variable.

In response, I have recently (cruel old dad…) raised the “zero point” of Frank’s reactions to input, which should bring her mood variable back down to the zero-centered range where I actually have code in place to support it feeding back into the generated text.  (Those effects top out above/below a certain point.)

⭑ Frank is better-aware of events from 2018-2020.  This was not my primary intent, but is a retrospectively obvious consequence.  I never even had the thought “hey, she’ll know what Covid-19 is now,” but of course, she does.

⭑ There are other positive changes I can’t explain as well.  For example, the “fic override” seems to work better now.

The “fic override” is something I added a while ago, when I noticed people were sending many asks of the form “tell me a story about X.”  This rarely produced actual stories, even though Frank knows how to write fiction.

So I added the fic override, which kicks in for asks containing substrings like “tell me a story.”  It produces a prompt which formats the ask as usual, but then instead of following it with the control segment meaning “now Frank writes a post,” I use the segment meaning “now Frank writes a fiction chapter.”

This used to produce a lot of glitchy output that mis-used my control segments, although it did sometimes have the desired effect.  Now, for some reason, it works perfectly.  “Tell me a story about X” asks are more popular than ever now, I think because they’re getting newly funny, creative responses.
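The fic override logic can be sketched as follows (the trigger substrings and the control-segment strings are hypothetical placeholders; the post doesn't give the real ones):

```python
# Sketch of the "fic override": asks that look like story requests get
# the fiction-chapter control segment instead of the new-post one.

STORY_TRIGGERS = ("tell me a story", "write me a story")

def format_ask(ask_text):
    # Hypothetical delimiters; stand-ins for the bot's real ask format.
    return f"<|ask|>{ask_text}<|endask|>"

def build_prompt(ask_text):
    prompt = format_ask(ask_text)  # format the ask as usual
    if any(t in ask_text.lower() for t in STORY_TRIGGERS):
        # Override: ask the generator for a fiction chapter, not a post.
        return prompt + "<|fiction-chapter|>"
    return prompt + "<|new-post|>"

print(build_prompt("Tell me a story about a lighthouse"))
print(build_prompt("what's your favorite color?"))
```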

p.s. if you don’t want to be in the corpus

If what you’ve read so far makes you think “hey, I might appear in the corpus!”, then quite likely you do in fact appear in the corpus.

So far my bot has not spat out anything that sounds really distinctively like any particular person besides me.  However, someone might not want a large volume of their blog informing the output of a chatbot which exists in their social context.

Hence, if you don’t want to be in the corpus, let me know and I’ll see what I can do.  This would involve removing specific people from the corpus and then retraining the whole model stack again, which would take quite a while, so if there are multiple such requests, I’ll want to wait until I can do the whole process in a batch.

(To be upfront, I’d prefer not to do it at all.  This project consumes a lot of my time and energy as is.

What I’m trying to say is not “I’ll gladly re-do the whole thing if anyone so much as asks.”  What I’m trying to say is, I’m explicitly open to re-doing the whole thing in the event that the “whole thing” as it stands is a severe anxiety trigger for someone in the corpus, or something of that magnitude.  I don’t want my bot project to cause serious harm or distress to anyone.)

Talk about “AI” in the press these days tends to conflate two things:

  1. Machine intelligence in general, a category that includes e.g. hypothetical super-intelligent machines, hypothetical machines based on technologies not yet invented, and robots from science fiction
  2. A specific bundle of technologies which has gotten a lot of hype and investment in the past 5-10 years

#1 is the subject of a rich vein of discussion and speculation going back decades.  Turing, Asimov, et al. did just fine speculating about AI without needing to know about the thing that industry hype currently calls “AI.”

You don’t need to know what a “convolutional neural network” is to worry about what would happen if a machine were smarter than you.

Because I work on #2 professionally, I get a lot of spam emails and targeted ads that say things like “Accelerate your AI development cycle with ProductName” or “ProductName: scalable AI solutions.”

The word “AI” in these ads has a recognizable meaning, and it is not the same meaning used in the sentence “Elon Musk founded OpenAI because he was worried AI might cause human extinction.”

—-

Because the press conflates these things, the average person tends to do so, too.  It’s common even among people who make consequential decisions involving “AI,” like politicians, executives, and economists.

I usually feel like I’m being pedantic about this, but I’m starting to think it’s a real problem.

The conflation encourages people to imagine #2 preceding #1 in time, as though “AI” were some specific thing discovered in a research lab in like 2002, whose properties were later extrapolated to scary hypotheticals.

It’s understandable that AI research companies would make this conflation.  It’s a great marketing trick, if you can pull it off, to convince the public that when they encounter speculation about arbitrarily powerful future technologies, it’s really about the concrete thing your company does right now.

It might be okay for the public to believe this story willingly, but they seem to have bought it (via the press) without realizing what was happening.  They don’t know they’re letting someone else draw the lines inside their heads.  They may learn all sorts of facts about the region in their mental map labeled “AI,” but they don’t attach them to the right nouns, even if each fact is true of some noun.

adam and coordinate dependence

For neural nets, I grudgingly use the Adam optimizer like everyone else.

It’s widely available and commonly used, and clearly works better in practice than any other optimizer with those properties.  “No one ever got fired for choosing Adam,” so to speak.

But its lack of coordinate independence really bothers me.  You write your model down in some arbitrary set of coordinates, and then Adam (effectively) approximates your function as locally quadratic, with your coordinates as the principal axes.

It doesn’t compute any estimate of how good or bad this approximation is, it just adopts it and sticks with it.
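To make the coordinate dependence concrete, here's a stripped-down Adam step (no bias correction, a simplification of the real algorithm): rotating the parameter space rotates the gradient, but because Adam rescales each coordinate by its own second-moment estimate, the rotated update is not the rotated original update, unlike plain SGD:

```python
import numpy as np

# Stripped-down Adam step (no bias correction) to show basis dependence.

def adam_step(grad, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2      # per-coordinate: basis-dependent
    return lr * m / (np.sqrt(v) + eps), m, v

g = np.array([3.0, 4.0])
update, _, _ = adam_step(g, np.zeros(2), np.zeros(2))

# Rotate the coordinate system by 45 degrees and take the same step.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
update_rot, _, _ = adam_step(R @ g, np.zeros(2), np.zeros(2))

# For a rotation-equivariant optimizer these would match; for Adam they don't.
print(np.allclose(R @ update, update_rot))  # False
```

(For plain SGD, `update = lr * grad`, and `R @ update` would equal the update computed from `R @ g` exactly.)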

This is particularly weird when you’re just writing down matrices and initializing them with a coordinate-independent distribution.  The arbitrary basis I use to keep track of these matrices in my computer’s memory has no role in the mathematical problem, but it affects the optimizer!

Perhaps this problem is not so bad with neural nets, because they tend to contain frequent coordinate-wise operations (activation functions).  This tends to “snap things into place,” aligning the otherwise arbitrary basis used for computer memory with the non-arbitrary basis picked out by the coordinate-wise operations.  Thus the Hessian has a reason to become “axis-aligned” – cf. this useful paper.

And where there isn’t a functional role for a coordinate-wise operation, people often add one anyway to “help optimization” (batch norm, layer norm).  This is often discussed as “whitening” (making outputs more like white noise), but it’s clearer in my head if I think about it as trying to snap all the remaining arbitrary vector bases in the problem into alignment with the existing preferred bases.

However, in modern architectures, there are some places that have neither an activation nor an artificially imposed whitening step.

In one part of the transformer, you take a vector v and multiply it by a matrix Q to form the “queries,” and also by a matrix K to form the “keys.”  The output depends only on the dot product of Q*v and K*v, which is coordinate independent.  So nothing in the problem cares about your basis for expressing Q and K in computer memory, but Adam still cares.
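A quick numeric check of this coordinate freedom: the output depends on Q and K only through the product Q-transpose-K, so reparameterizing Q → A·Q and K → inv(A).T·K, for any invertible A, leaves every attention logit unchanged (the matrices here are random toys, not real transformer weights):

```python
import numpy as np

# The logit depends on Q, K only through v^T Q^T K w.  Reparameterizing
# Q -> A Q and K -> inv(A).T K cancels out: Q^T A^T inv(A)^T... wait, more
# precisely (A Q)^T (inv(A).T K) = Q^T A^T inv(A).T K = Q^T K.  So the
# basis used to store Q and K is mathematically arbitrary.

rng = np.random.default_rng(0)
d = 4
Q = rng.normal(size=(d, d))
K = rng.normal(size=(d, d))
A = rng.normal(size=(d, d))          # any invertible reparameterization

Q2 = A @ Q
K2 = np.linalg.inv(A).T @ K

v = rng.normal(size=d)
w = rng.normal(size=d)

logit1 = (Q @ v) @ (K @ w)
logit2 = (Q2 @ v) @ (K2 @ w)
print(np.allclose(logit1, logit2))  # True
```

Adam, however, would see `(Q, K)` and `(Q2, K2)` as entirely different parameter settings and update them differently.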

(Going back to the reasons for axis-alignment – it’s also possible that when you start with coordinate invariant random matrices, but then optimize with Adam, the updates will push your parameters into a region where the Hessian is axis-aligned along your arbitrary axes.  I don’t know why this would be true, if it were true, but it seems possible.

This would be like picking a random vector at initialization, and then only exploring a subset of the solution space, where the subset is somehow parameterized by the vector.)

Maybe someday I’ll switch to natural gradients just so I can sleep better at night … 

I saw a hyped-up science news article about this paper and got briefly nerd sniped trying to figure out what was going on.  I still don’t know.

Both the news article and the paper itself make it sound like this is some … fully general neural net approach for solving PDEs, that works for any PDE, and is fast and accurate … and doesn’t even need to know the PDE, it just learns it from solution data, and then after you “train” it on one solution it knows the PDE, and can produce other solutions.

And I’m like, that can’t be real, right?  You can’t learn an infinite-dimensional operator from a finite sample.  They must be choosing to prefer some operators over others, all else being equal.  Also, what does this look like formally as a statistics problem, what measure are the operators being sampled from … 

I found this near-contemporaneous paper which goes into more detail, but still doesn’t resolve my confusion.  I also found this paper, cited as a competing method (although it shares several co-authors), which goes into much more mathematical detail and proves a general approximation theorem.

I don’t have the energy and interest to read through all these and figure out exactly what they’re actually doing, especially since I suspect it’s not that interesting.

(If PDEs are “generically learnable from finite samples under reasonable conditions” in some non-trivial way, that’s very interesting, but seems like something one could discover with pen and paper, and then go on to win prizes for discovering, without even needing a computer.  I wouldn’t expect such a discovery to look like these papers.)

But if anyone else feels like reading these papers, let me know what you find out!

admiral-craymen asked:

How does Frank know about things you've never posted about? I mentioned Okami in an ask and she knew it was a video game.

bayesic-bitch:

nostalgebraist:

To make Frank’s generator model, I started with a model called GPT-2 1.5B, which you can read about here and here if you aren’t familiar with it.  Then I “fine-tuned” it on my own writing.

The fine-tuning process made the model more likely to say things that sound like my own writing, compared to the original GPT-2 1.5B that I started with.

However, that model “knew” all sorts of things already before I fine-tuned it.  It was originally trained by OpenAI on about 8 million documents taken from all sorts of different places on the web.

For the most part, the fine-tuning process doesn’t remove this knowledge that was already in the model.  So Frank “knows” lots of things because GPT-2 1.5B also “knows” those things.

To what extent does this risk catastrophic forgetting? Does it stop being a problem as models get larger and approach the Neural Tangent Kernel/Gaussian Process limit?

I have no idea.  This is definitely something I’ve worried about.

One way I’ve tried to quantify it is by going back and looking at the magnitude of the typical parameter difference between the original model and my fine-tuned ones.

I’ve found (not very rigorously, take w/ grain of salt) that training with a lower learning rate gets you to the same loss with a smaller change in the param values, and this has motivated me to choose low learning rates, on the assumption that closer params ↔ less catastrophic forgetting.
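That check can be sketched like this, with toy arrays standing in for real model checkpoints (the helper name is hypothetical):

```python
import numpy as np

# Measure the typical parameter difference between the original and
# fine-tuned weights, as a crude proxy for how much could be forgotten.

def mean_abs_param_diff(original, finetuned):
    diffs = [np.mean(np.abs(a - b)) for a, b in zip(original, finetuned)]
    return float(np.mean(diffs))

rng = np.random.default_rng(0)
original = [rng.normal(size=(8, 8)) for _ in range(3)]  # toy "checkpoint"

# Toy stand-ins: lower learning rate -> smaller drift from the original.
drift_small = [p + 0.001 * rng.normal(size=p.shape) for p in original]
drift_large = [p + 0.01 * rng.normal(size=p.shape) for p in original]

print(mean_abs_param_diff(original, drift_small) <
      mean_abs_param_diff(original, drift_large))  # True
```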

With BERT fine-tuning, sometimes people use a loss with a penalty on this parameter difference.  I haven’t tried this because my training is memory-constrained, and I’d need to keep around an extra copy of the model in memory (storing the original params) to compute this loss.

Intuitively, I suspect that Adam and similar methods cause way more forgetting than necessary, because they make huge updates on usually-irrelevant params in the few steps where they become relevant, and those steps could easily reflect noise rather than signal.  In principle Novograd should be better here, but I’ve never gotten it to run at anything better than a painfully slow speed.

the scaling “inconsistency”: openAI’s new insight

I’ve now read the new OpenAI scaling laws paper.  Also, yesterday I attended a fun and informative lecture/discussion with one of the authors.

While the topic is on my mind, I should probably jot down some of my thoughts.

This post is mostly about what the new paper says about the “inconsistency” brought up in their previous paper.

The new paper has a new argument on this topic, which is intuitive and appealing, and suggests that the current scaling trend will indeed “switch over” soon to a new one where dataset size, not model size, is the active constraint on performance.  Most of this post is an attempt to explain and better understand this argument.

——

The new paper is mainly about extending the scaling laws from their earlier paper to new modalities.

In that paper, they found scaling laws for transformers trained autoregressively on text data.  The new paper finds the same patterns in the scaling behavior of transformers trained autoregressively on images, math problems, etc.

So the laws aren’t telling us something about the distribution of text data, but about something more fundamental.  That’s cool.

They also have a new, very intuitive hypothesis for what’s going on with the “scaling inconsistency” they described in the previous paper – the one I made a big deal about at the time.  So that’s the part I’m most excited to discuss.

I’m going to give a long explanation of it, way longer than the relevant part of their paper.  Some of this is original to me, all errors are mine, all the usual caveats.

——

1. L(C) and L(D)

To recap: the “inconsistency” is between two scaling laws:

  • The law for the best you can do, given a fixed compute budget.

    This is L(C), sometimes called L(C_min).  L is the loss (lower = better), C is your compute budget.

  • The law for the best you can do, given a fixed dataset size.

    This is L(D), where D is the number of examples (say, tokens) in the dataset.

Once you reach a certain level of compute, these two laws contradict each other.

I’ll take some time to unpack that here, as it’s not immediately obvious the two can even be compared to one another – one is a function of compute, the other of data.

2. C sets E, and E bounds D

Budget tradeoffs

Given a compute budget C, you can derive the optimal way to spend it on different things.  Roughly, you are trading off between two ways to spend compute:

  • Use C to buy “N”: Training a bigger model – “N” here is model size

  • Use C to buy “S”: Training for more steps “S” (gradient updates)

The relationship between S (steps) and D (dataset size) is a little subtle, for several reasons.

From step count to update count

For one thing, each single “step” is an update on the information from more than one data point.  Specifically, a step updates on “B” different points – B is the batch size.

So the total number of data points processed during training is B times S.  The papers sometimes call this quantity “E” (number of examples), so I’ll call it that too.

From update count to data count

Now, when you train an ML model, you usually update on each data point more than once.  Typically, you’ll do one pass over the full dataset (updating on each point as you go along), then you’ll go back and do a second full pass, and then a third, etc.  These passes are called “epochs.”

If you’re doing things this way, then for every point in the data, you get (number of epochs) updates out of it.  So

E = (number of epochs) * D.  

Some training routines don’t visit every point the exact same number of times – there’s nothing forcing you to do that.  Still, for any training procedure, we can look at the quantity E / D.

This would be the number of epochs, if you’re doing epochs.  For a generic training routine, you can think of E / D as the “effective number of epochs”: the average number of times we visit each point, which may not be an integer.

Generally, E ≠ D, but we always have E≥D.  You can’t do fewer than one epoch; you can’t visit the average point less than once.

This is just a matter of definitions – it’s what “dataset size” means.  If you say you’re training on a million examples, but you only update on 100 individual examples, then you simply aren’t “training on a million examples.”
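The bookkeeping above in one place, with made-up numbers:

```python
# E = B * S examples processed during training; E / D is the
# "effective number of epochs".  All numbers here are illustrative.

B = 512          # batch size
S = 10_000       # gradient steps
D = 1_280_000    # dataset size in examples

E = B * S        # total examples processed during training
effective_epochs = E / D

print(E)                 # 5120000
print(effective_epochs)  # 4.0
assert E >= D            # can't visit the average point less than once
```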

3. The inconsistency

L(D): information

OpenAI derives a scaling law called L(D).  This law is the best you could possibly do – even with arbitrarily large compute/models – if you are only allowed to train on D data points.

No matter how good your model is, there is only so much it can learn from a finite sample.  L(D) quantifies this intuitive fact (if the model is an autoregressive transformer).

L(C): budgeting

OpenAI also derives another scaling law called L(C).  This is the best you can do with compute C, if you spend it optimally.

What does optimal spending look like?  Remember, you can spend a unit of compute on 

  • a bigger model (N), or 
  • training the same model for longer (S)

(Sidenote: you can also spend on bigger batches B.  But – to simplify a long, complicated story – it turns out that there are really just 2 independent knobs to tune among the 3 variables (B, N, S), and OpenAI frames the problem as tuning (N, S) with B already “factored out.”)

In the compute regime we are currently in, making the model bigger is way more effective than taking more steps.

This was one of the punchlines of the first of these two papers: the usual strategy, where you pick a model and then train it until it’s as good as it can get, is actually a suboptimal use of compute.  If you have enough compute to train the model for that long (“until convergence”), then you have enough compute to train a bigger model for fewer steps, and that is a better choice.

This is kind of counterintuitive!  It means that you should stop training your model before it stops getting better.  (“Early stopping” means training your model until it stops getting better, so this is sort of “extra-early stopping.”)  It’s not that those extra steps wouldn’t help – it’s that, if you are capable of doing them, then you are also capable of doing something else that is better.

Here’s something cool: in Appendix B.2 of the first paper, they actually quantify exactly how much performance you should sacrifice this way.  Turns out you should always stop at a test loss about 10% higher than what your model could asymptotically achieve.  (This will be relevant later, BTW.)

Anyway, OpenAI derives the optimal way to manage the tradeoff between N and S.  Using this optimal plan, you can derive L(C) – the test loss you can achieve with compute C, if you allocate it optimally.

N goes up fast, S goes up slowly…

The optimal plan spends most incremental units of compute on bigger models (N).  It spends very little on more steps (S).

The amount it spends on batch size (B) is somewhere in between, but still small enough that the product E = B*S grows slowly.

But remember, we know a relationship between E and “D,” dataset size.  E can’t possibly be smaller than D.

So when your optimal plan chooses its B and its S, it has expressed an opinion about how big its training dataset is.

The dataset could be smaller than B*S, if we’re doing many (effective) epochs over it.  But it can’t be any bigger than B*S: you can’t do fewer than one epoch.
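As a trivial numerical example of the bound (all numbers made up for illustration):

```python
# A compute-optimal plan fixes a batch size B and step count S; it can
# therefore have seen at most E = B * S tokens of distinct data.
batch_size = 512            # sequences per step (hypothetical)
tokens_per_sequence = 1024  # (hypothetical)
steps = 250_000

E = batch_size * tokens_per_sequence * steps  # total tokens processed
print(f"E = {E:.2e} tokens processed; the dataset D is at most this big")
# -> E = 1.31e+11 tokens processed; the dataset D is at most this big
```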

… and you claim to achieve the impossible

L(​C), the loss with optimally allocated C, goes down very quickly as C grows.  Meanwhile, the dataset you’re training with that compute stays almost the same size.

But there’s a minimum loss, L(D), you can possibly achieve with D data points.

The compute-optimal plan claims “by training on at most B*S data points, with model size N, I can achieve loss L(​C).”

The information bound says “if you train on at most B*S data points, your loss can’t get any lower than the function L(D), evaluated at D = B*S.”

Eventually, with enough compute, the L(​C) of the compute-optimal plan is lower than the L(D) of the dataset used by that same plan.

That is, even if the compute-optimal model is only training for a single epoch, it is claiming to extract more value from that epoch than any model could ever achieve, given any number of epochs.

That’s the inconsistency.
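Here is a rough numerical version of the inconsistency.  The exponents come from the first scaling-laws paper; the normalization of E(C) is pinned loosely to GPT-3 (about 3.6e3 PF-days of compute for about 3e11 tokens processed).  Everything is ballpark, so the crossover location is only suggestive.

```python
# L(C): loss under optimal compute allocation (C in PF-days).
# L(D): information bound for a dataset of D tokens.
# E(C): tokens processed by the optimal plan at compute C (E ~ C^0.27),
#       normalized (roughly) so a GPT-3-scale run lies on the curve.

def L_of_C(c):
    return (3.1e8 / c) ** 0.050

def L_of_D(d):
    return (5.4e13 / d) ** 0.095

def E_of_C(c):
    return 3e11 * (c / 3.6e3) ** 0.27

# Scan compute on a log grid for the point where the optimal plan claims
# a loss below the bound implied by its own dataset.
c = 1.0
while c < 1e12:
    if L_of_C(c) < L_of_D(E_of_C(c)):
        print(f"inconsistency sets in around C ~ {c:.1e} PF-days")
        break
    c *= 1.3
```

With these rough constants, the crossover lands within a couple orders of magnitude of GPT-3's compute, not astronomically far away.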

4. The resolution

In the new paper, there’s an intuitive hypothesis for what’s going on here.  I don’t think it really needs the multimodal results to motivate it – it’s a hypothesis that could have been conceived earlier on, but just wasn’t.

Bigger models extract a resource faster

The idea is this.  As models get bigger, they get more update-efficient: each time they update on a data point, they get more out of it.  You have to train them for fewer (effective) epochs, all else being equal.

This fact drives the choice to scale up the model, rather than scaling up steps.  Scaling up the model makes your steps more valuable, so when you choose to scale the model rather than the steps, it’s almost like you’re getting more steps anyway.  (More “step-power,” or something.)

The resource is finite

Each data point has some information which a model can learn from it.  Finite models, trained for a finite amount of time, will miss out on some of this information.

You can think about the total extractable information in a data point by thinking about what an infinitely big model, trained forever, would eventually learn from that point.  It would extract all the information – which is more than a lesser model could extract, but still finite.  (A single data point doesn’t contain all the information in the universe.)

This is literally the definition of L(D): what an infinitely big model, trained forever, could learn from D separate data points.  L(D) quantifies the total extractable information of those points.

(More precisely, the total extractable information is the gap between L(D) and the loss achieved by a maximally ignorant model, or something like that.)

Converging in the very first step

As models get bigger, they extract more information per update.  That is, each time they see a data point, they extract a larger fraction of its total extractable information.

Eventually, your models are getting most of that information the very first time they see the data point.  The “most” in that sentence gets closer and closer to 100%, asymptotically.

How does this relate to optimal compute allocation?

The logic of the “optimal compute plan” is as follows:

Your model is an imperfect resource extractor: it only gets some of the resources locked up in a data point from the first update.  So you could extract more by running for more steps … 

…  but if you have the compute for that, you can also spend it by making your steps more efficient.  And, in the current compute regime, that’s the smarter choice.

It’s smarter by a specific, uniform proportion.  Remember, you should stop training when your loss is 10% higher than the converged loss of the same model.  If the converged loss is L, you should stop at 1.1*L.

Can you always do that?  If your model is efficient enough, you can’t!  As the first epoch gets closer to 100% efficient, the loss after the first epoch gets arbitrarily close to the converged loss.  Your loss goes under 1.1*L by the end of the first epoch.

At this point, the story justifying the L(​C) law breaks down.

The L(​C) law goes as fast as it does because upgrading the efficiency of your extractor is cheaper – in terms of compute spent per unit of resource extracted – than actually running the extractor.

This works as long as your extractor is inefficient.  But you can’t push efficiency above 100%.  Eventually, the only way to extract more is to actually run the damn thing.
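A toy picture of this breakdown, with a completely made-up learning curve: the model approaches some converged loss L_inf, the rule says to stop at 1.1 * L_inf, and a sufficiently update-efficient model crosses that line almost immediately.

```python
import math

# Hypothetical learning curve: exponential approach to the converged
# loss.  `gain` stands in for update efficiency; bigger models would
# have a bigger gain and hit the stopping target sooner.

def toy_loss(step, L_inf=2.0, gain=0.01):
    return L_inf + (4.0 - L_inf) * math.exp(-gain * step)

target = 1.1 * 2.0  # stop 10% above the converged loss
stop_step = next(s for s in range(100_000) if toy_loss(s) <= target)
print(f"stop at step {stop_step}")  # -> stop at step 231
```

Doubling `gain` roughly halves the stopping step; once the target is reached within the first epoch, the L(​C) recipe has nothing left to trade away.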

Getting a bigger quarry

When you’re extracting a resource, there’s a difference between “improve the extractor” and “get a bigger quarry.”

If your quarry has 100 resource units in it, the strategy of “improving the extractor” can never get you more than 100 units.  It can get them to you faster, but if you want more than 100 units, you have to get a bigger quarry.

“N” sets the efficiency of the extractor.  “S” sets … well, it doesn’t exactly set the size of the quarry (that’s D).  There is an ambiguity in S: it could mean running for more epochs on the same data, or it could mean getting more data.

But S does, at least, set an upper bound on the size of the quarry, D.  (Via D≤E and E = B*S, with B set optimally as always.)

With high enough compute (and thus model size), you’ve pushed the “extractor upgrades are cheap” lifehack as far as it can go.  With this efficient extractor, taking S steps (thus making E = B*S updates) sucks up most of the information theoretically extractable from E individual data points.

The learning curve L(E) of your model, as it makes its first pass over the dataset, starts to merge with L(D), the theoretical optimum achievable with that same dataset.  You trace out L(D) as you train, and the relevant constraint on your performance is the maximum data size D you can obtain and train on.

Where we are now

In the compute regime that spans GPT-2 and the smaller variants of GPT-3, extraction is far less than maximally efficient.  The L(​C) strategy applies, and the smart move is to spend compute mostly on model size.  So you make GPT-2, and then GPT-3.

Once we get to the full GPT-3, though, the extractor is efficient enough that the justification for L(​C) has broken down, and the learning curve L(E) over the first epoch looks like L(D).

Here is that as a picture, from the new paper:

[image: learning curves for models of different sizes, with the L(D) bound overlaid]

The yellowest, lowest learning curve is the full GPT-3.  (The biggest GPT-2 is one of the green-ish lines.)  The black line is L(D), maximally efficient extraction.

You can see the whole story in this picture.  If you’re in one of the smaller-model learning curves, running for more steps on more data will get you nowhere near to the total extractable info in that data.  It’s a better use of your compute to move downwards, toward the learning curve of a bigger model.  That’s the L(​C) story.

If the L(​C) story went on forever, the curves would get steeper and steeper.  Somewhere a little beyond GPT-3, they would be steeper than L(D).  They would cross L(D), and we’d be learning more than L(D) says is theoretically present in the data.

According to the story above, that won’t happen.  We’ll just converge ever closer to L(D).  To push loss further downward, we need more data.

Implications

Since people are talking about bitter lessons a lot these days, I should make the following explicit: none of this means “the scaling hypothesis is false,” or anything like that.

It just suggests the relevant variable to scale with compute will switch: we’ll spend less of our marginal compute on bigger models, and more of it on bigger data.

That said, if the above is true (which it may not be), it does suggest that scaling transformers on text alone will not continue productively much past GPT-3.

The GPT-3 paper says its choices were guided by the “grow N, not S” heuristic behind the L(​C) curve:

Based on the analysis in Scaling Laws For Neural Language Models [KMH+20] we train much larger models on many fewer tokens than is typical.

(“KMH+20” is the first of the two scaling papers discussed here.)  Even following this heuristic, they still picked a huge dataset, by human standards for text datasets.

In the above terms, their “E” was 300 billion tokens and their “D” was ~238 billion tokens, since they updated multiple times on some tokens (cf. Table 2.2 in the GPT-3 paper).  The whole of Common Crawl is 410 billion tokens, and Common Crawl might as well be “all the text in the universe” from the vantage point of you and me.
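Worked out explicitly (figures approximate):

```python
# GPT-3's training run, in the terms used here (rough figures from
# Table 2.2 of the GPT-3 paper).
E = 300e9   # tokens processed during training
D = 238e9   # distinct tokens in the dataset (approximate)
common_crawl = 410e9

print(f"effective epochs: {E / D:.2f}")                      # just over one pass
print(f"D vs. all of Common Crawl: {D / common_crawl:.0%}")  # more than half
```

So GPT-3 did about 1.26 effective epochs, over a dataset already more than half the size of Common Crawl.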

So, there’s room to scale D up somewhat further than they did with GPT-3, but not many orders of magnitude more.  To me, this suggests that an intuitively “smarter” GPT-4 would need to get its smartness from being multimodal, as we really can’t go much further with just text.

New OpenAI scaling laws paper.  Haven’t read it yet.

It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners →

nostalgebraist:

When scaled to hundreds of billions of parameters, pretrained language models such as GPT-3 (Brown et al., 2020) achieve remarkable few-shot performance on challenging natural language understanding benchmarks. In this work, we show that performance similar to GPT-3 can be obtained with language models whose parameter count is several orders of magnitude smaller. This is achieved by converting textual inputs into cloze questions that contain some form of task description, combined with gradient-based optimization; additionally exploiting unlabeled data gives further improvements. Based on our findings, we identify several key factors required for successful natural language understanding with small language models.

Haven’t read this yet, but it looks relevant to my question “how much better can you do with a small number of examples if you use finetuning rather than prompting?”

Here’s their Figure 1:

[image: Figure 1 from the paper]

[Found a mostly written version of this in my drafts, decided to finish and publish it]

Some notes on this, after reading the paper and its predecessor, and trying out the code:

(1)

I like these papers because they answer a question I had after reading the GPT-3 paper.

The main results of the GPT-3 paper were about solving an unusually hard problem (very little training data), with an unusually powerful tool (a very large model), applied with an unusual limitation (no finetuning).

All three of these variables were unusually extreme at once, making it hard to compare GPT-3 with anything else.

The paper assumed we were interested in results on the unusually hard problem (training on only ~32 examples).  If we are, presumably we’d like to know how well the normal approaches work on that problem, before we jump on board with the new GPT-3 approach.  That is, if you use a normal-sized model, and let yourself finetune, how well can you do with ~32 examples?

I could find surprisingly little about this topic at the time.  I expected the “normal” approach to do well here, since as I mentioned here, it can do well in cases that only have a few hundred examples (some of the SuperGLUE tasks).  But I couldn’t find anyone doing it.

This new paper confirms my expectation: you can do at least as well as the original GPT-3 results on the same problem using a normal-sized model and finetuning.

(2)

It’s a little unfair to compare the results here directly to those in the GPT-3 paper.

The results here come from an approach that sounds like what a sophisticated pro user of GPT-3 might build: it involves writing several different prompts to elicit the same information and then ensembling the results.  The author is clearly striving to do good “prompt programming” and get the most value possible out of the prompts.

The GPT-3 paper did not try to optimize its prompts, and people have already improved upon the published results by using better prompting practices with the same GPT-3 model.

However, this paper still demonstrates that “prompt programming” works even with a much smaller model.  Specifically, it casts doubt on the claim that LMs need to be large-scale to do well on tiny datasets, i.e. that performing well on tiny datasets demands much bigger LMs than other tasks do.

The GPT-3 paper didn’t actually make that claim explicitly, but it was a reasonable enough thing to conjecture after reading it, and I suspect some people came away from it with that impression.

We already knew that the GPT approach (one-directional LM, no finetuning) was very suboptimal for these tasks.  Bidirectional LMs with finetuning do much better at everything except text generation, the one thing they cannot do well.

My sense, bolstered by this paper, is that the GPT-3 paper establishes the scale cost of using the GPT approach instead of one better suited to these tasks.  Given a fixed param/compute/whatever budget, the GPT approach is the right tool for text generation, but the wrong tool for these types of tasks.  However, a vastly more powerful version of the wrong tool can do as well as a less powerful version of the right tool.  Together, GPT-3 and this paper quantify the size of this gap.

(3)

Fine-tuning is fundamentally much slower than prompting.  Even a small LM, on a good GPU, with a tiny dataset, takes a few minutes to train if you do many epochs (as the authors do), and this approach requires repeating that training many times.

That entire process is necessary to try out a single prompt programming idea, so experimentation with prompts is much slower than with GPT-3.  There is also a memory/disk cost to all these finetuned models.

(You can save memory/disk cost by using adapters, but I’m not sure they save compute time.)

I am curious whether the fundamental properties of fine-tuning can be boiled down into something much more efficient.  Adapters try to do this along the dimension of parameter count, but you still have to do many gradient steps.

Especially with tiny datasets, the many gradient steps feel excessive somehow.  There just isn’t much information in the dataset, relative to the pre-trained LM.  Fine-tuning is not teaching the LM something new, but merely “locating” knowledge already stored in it somewhere, and “hooking up” that knowledge to your new prediction head.

If the knowledge is already there, you’d think you wouldn’t even need to tune the LM itself, and could just fit a linear (or simple nonlinear) model on top of it, which would be much faster.  But folk wisdom says it’s better to tune with transformers.  (Maybe attention is too close to sparse, some important tokens are ignored by the existing heads, and you need to teach them a rule like “look at this type of thing” where the “type of thing” is a concept easily expressed in the input basis of later layers.)

If the reason behind this observation were better understood, we might be able to replace fine-tuning with something much faster, and then replicate a GPT-3-like task programming experience with models the average laptop can run.
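For what it’s worth, the “just fit a linear model on frozen features” idea is easy to sketch.  Everything below is synthetic: random vectors stand in for the LM’s hidden states, and a separable labeling stands in for the task.  A real experiment would swap in actual features from a pretrained model.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 32, 8                        # ~32 examples, toy feature dimension
features = rng.normal(size=(n, d))  # stand-in for frozen LM activations
true_w = rng.normal(size=d)
labels = (features @ true_w > 0).astype(float)  # separable toy task

# One-shot linear probe: ridge-regularized least squares on +/-1 targets.
# No gradient steps, no change to the "LM" itself.
lam = 1e-2
w = np.linalg.solve(features.T @ features + lam * np.eye(d),
                    features.T @ (2 * labels - 1))
preds = (features @ w > 0).astype(float)
acc = (preds == labels).mean()
print("train accuracy:", acc)
```

If the folk wisdom is right, the interesting part is where a probe like this fails on real tasks while full fine-tuning doesn’t.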