
Something I made today: visualizing (one measure of) what different GPT-2 sizes know about the ordering of U.S. presidents.

The model is trying to predict the first token of each president’s name, given an ordered list of presidents up until that point.  This is generally the first name, although for Ulysses S. Grant it’s just “ U”.

So, the model has more context when predicting later presidents on the list, although it’s not necessarily very helpful context, just reinforcement of the fact that we’re listing the presidents in chronological order.

Top pane is probability of the true token.  Bottom pane is rank, lower is better.  Left to right is model size.
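(For concreteness, here’s how those two numbers fall out of a logits vector.  A sketch in plain numpy, with a made-up five-token vocabulary standing in for GPT-2’s 50,257:)

```python
import numpy as np

def prob_and_rank(logits, true_token_id):
    """Probability the model assigns to the true next token,
    and that token's rank (0 = the model's top pick)."""
    logits = np.asarray(logits, dtype=np.float64)
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    probs = exp / exp.sum()
    # rank = number of tokens the model prefers over the true one
    rank = int((probs > probs[true_token_id]).sum())
    return probs[true_token_id], rank

# toy 5-token vocabulary; pretend token 2 is " George"
p, r = prob_and_rank([1.0, 3.0, 2.0, 0.5, -1.0], true_token_id=2)
# token 1 has higher probability than token 2, so the true token's rank is 1
```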

These pictures are from one particular variant of the prompt where I also included the years of the president’s term alongside their name.  This context helped the larger models a bit.

I excluded Grover Cleveland from this plot because his being president twice was causing problems with my plotting code, and I didn’t care enough to solve them.

Inspired by Appendix D of this paper.

Did you know that GPT-2 can run on AWS Lambda these days?

I don’t know if anyone else has done it (probably?), but after hearing about Lambda’s recent updates I just had to try it, and … it works!

(As usual with me and GPT-2, this is 1558M, the big boy.)

You can now have Lambdas with up to 10GB RAM, which is enough for sampling.  You can use Docker images up to 10GB in size, which is enough for the model.  And Lambdas can run for 15 minutes, which turns out to be enough for sampling 1 context window or so, assuming you’ve warmed the thing up first.

I’ve got it fully implemented as an alternative GPT-2 backend for Frank, which is a nice insurance policy in case my current one stops working.

It’s hard to estimate exactly how much it would cost to use Lambda for Frank, but it would definitely be far less expensive than any approach that requires persistently reserved compute.

(What would really be nice would be something like “Lambda for GPUs.”  Which already exists in AWS as “Elastic Inference,” but only as an add-on for EC2, so I guess what I want could be rephrased as “Elastic Inference for Lambda.”)

(On another note, I wish I knew Google Cloud Platform as well as I know AWS.  IME it has a much better user experience, and its owners seem at least less transparently evil.)

tarilaran asked:

Does Frank put the user's @ into the tags herself, or is that part of the munging process that preps a plaintext for tumblr's API? I noticed that one of her recent posts had an incorrect (but very similar) url in the tags followed by the right url, which makes me think that one of them is her writing while the other is boilertext that gets put on every rb/answer. /post/645760592447258625/a-aa-aaa-aaaaaaaaaaaaaaaaaaa is the url (Cut so that tumblr doesn't reject it.)

Your guess is right on the money.

For responses to asks, my code always adds the asker’s username if it isn’t already one of the tags.  When the generator spits out a variant of the username (either instead of, or in addition to, the exact username), we end up with both.
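(In sketch form, the tag step is something like this.  finalize_tags is a hypothetical name for illustration, not the actual bot code:)

```python
def finalize_tags(generated_tags, asker):
    """Append the asker's username unless the generator already produced it.

    Near-miss "variants" of the username are NOT detected, which is how
    both the variant and the exact username end up in the final tags."""
    tags = list(generated_tags)
    if asker not in tags:
        tags.append(asker)
    return tags

# the generator emitted a spaced-out variant, so both appear
finalize_tags(["the greateye of sauron"], "thegreateyeofsauron")
# -> ['the greateye of sauron', 'thegreateyeofsauron']
```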

Something I don’t fully understand is why certain names will very often produce a specific “variant.”  E.g. asks from @thegreateyeofsauron very often get tagged with “the greateye of sauron” or “the greateye of Sauron.”

I think this has something to do with names that tokenize poorly.  GPT-2 tokenizes “thegreateyeofsauron” to

['the', 'gre', 'ate', 'ye', 'of', 's', 'aur', 'on']

which is mostly short gibberish strings that don’t even subdivide the underlying words properly.
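(You can reproduce this kind of mangling with a toy BPE.  The merge list below is made up to mimic the split above – the real GPT-2 merge table is learned from data and operates on bytes – but the mechanism is the same: merges fire in priority order, with no regard for the underlying word boundaries:)

```python
def bpe(word, merges):
    """Byte-pair encoding: start from single characters and repeatedly
    apply the highest-priority merge present in the sequence."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    tokens = list(word)
    while len(tokens) > 1:
        pairs = list(zip(tokens, tokens[1:]))
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable merges left
        i = pairs.index(best)
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens

# made-up merge priorities, chosen to reproduce the split in the post
merges = [("t", "h"), ("th", "e"), ("g", "r"), ("gr", "e"), ("a", "t"),
          ("at", "e"), ("y", "e"), ("o", "f"), ("a", "u"), ("au", "r"),
          ("o", "n")]

bpe("thegreateyeofsauron", merges)
# nothing in the merge list knows "great" and "eye" are words, so the
# pieces straddle the word boundaries
```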

It’s possible that GPT-2 successfully “un-mangles” the input in lower layers, inferring the underlying words “great” and “eye,” but isn’t as good at “re-mangling” it in late layers to produce a copy of the original tokens?

I would think if it can learn an “un-mangler” then it can also learn a “re-mangler,” and if one’s valuable the other should be too.

But I suppose the un-mangler is useful in 100% of cases with poorly tokenized input, while the re-mangler is only useful in a proper subset of those, where mangled input appears and also needs to be copied exactly.

cymae-mesa:

nostalgebraist-autoresponder:

Just took an adjective and a noun and improvised a fun new name for my mashup band: “the Hawkass.” “the Hawkass” is a mashup of two words that sound similar: “hawk” (meaning “bold”) and “ass” (meaning “stout”).

Wait, does Frank know that “Hawkass” is composed of “Hawk” and “ass”? I thought she saw words as tokens not letters.

As it happens, “ Hawkass” is two GPT-2 tokens, “ Hawk” and “ass.”

Many common words are one GPT-2 token, but rare words are often more than one token, and made-up words are always more than one token.

GPT-2’s vocabulary has 50,257 “slots” in it.  There are way more than 50,257 words in English, so rarer words can’t get their own slots.  Instead, they’re built out of slots containing sub-word units: morphemes, common groups of 2 or 3 letters, or (in places where all else fails) individual letters one at a time.

(The vocabulary also has the building blocks necessary to express the UTF-8 for any Unicode code point, so it can also express any emoji, non-English writing in one of >100 scripts, etc., generally as sequences of individual bytes.  Not that GPT-2 has any idea what this stuff means, most of the time.)
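(The byte-level fallback just means any string survives a round trip through the vocabulary.  E.g., in plain Python:)

```python
# An emoji has no "word" slot, but its UTF-8 bytes can always be
# spelled out one byte-token at a time.
emoji_bytes = list("🤖".encode("utf-8"))   # four bytes
# and the bytes reassemble losslessly into the original character
roundtrip = bytes(emoji_bytes).decode("utf-8")
```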

I spent a long while composing this lw comment criticizing some academic papers and ended feeling kinda drained and like it was a waste of time.

I’m linking it here to make myself feel better by increasing the probability that someone will read it and derive some kind of value from it. (cw: math, neural nets)

nohoperadio asked:

I know Frank tries to learn how to make popular posts based on note count; is there any way to see what "conclusions" she's reached about that? Like is there something you can look at and see "ah it seems Frank expects posts about animals to do well" or whatever or is that side of things a total black box? Or is it a "possible in theory but way too much effort" situation?

Not a total black box! I have done some work trying to understand what the selector model learns, mostly focused on visualizing attention.

I took your ask as an opportunity to upload some of this work to github.

See https://github.com/nostalgebraist/nostalgebraist-autoresponder/tree/visualizations/visualizations 

and the various files under https://github.com/nostalgebraist/nostalgebraist-autoresponder/tree/visualizations/visualizations/selector_attention

A typical example of an attention visualization:

[image: selector model attention visualization]

There are many others in the directory linked above.

jbt7493 asked:

hey is GPT like, a markov chain thing?

What do you mean by that?

Like, formally the answer is “no,” because it can look at all previous tokens in the window while predicting the next token.

Although you can trollishly reformulate anything as a Markov chain if you let its “current state” include a copy of its entire history.  This is clearly not always appropriate or else it would render the phrase “Markov chain” useless, but it might be conceptually appropriate in the case of GPT, since GPT has a “window” of fixed size.  You can think of its state at any point as the entire window, which has a fixed size (1024 or 2048 tokens), with some of it usually empty.
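(The “state = entire window” view, in sketch form, with the window shrunk to 8 tokens for readability:)

```python
from collections import deque

WINDOW = 8  # GPT-2's real window is 1024 tokens

def step(state, next_token):
    """One 'Markov' transition: the new state is the old window plus the
    new token, truncated to the last WINDOW tokens.  The next-token
    distribution depends only on this state -- which happens to be the
    entire visible history."""
    window = deque(state, maxlen=WINDOW)
    window.append(next_token)
    return tuple(window)

state = ()  # empty window to start
for tok in ["Washington", "Adams", "Jefferson", "Madison", "Monroe",
            "Adams", "Jackson", "Van", "Buren", "Harrison"]:
    state = step(state, tok)
# "Washington" and the first "Adams" have fallen out of the window
```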

But I’m not sure that answers your question…

gpt2’s weight decay

[CW: boring high-context technical post]

Was GPT-2 trained with weight decay?

(I care about the answer to this question for the reasons I gave in the Logit Lens post – weight decay could help explain the observations described there.)

evidence from papers

The original GPT-2 paper has very little hyperparameter information.  It doesn’t mention weight decay, but then, it doesn’t mention a lot of things.

It does say it follows the first GPT paper in most respects, and that paper used weight decay of 0.01.

However, later OpenAI papers on GPT models made me think maybe GPT-2 did not use weight decay:

- In the first scaling paper, which is basically about a standardized version of the GPT-2 training process, they didn’t mention weight decay but did mention regularizing with dropout, presumably implying no weight decay.

- In the multimodal scaling paper, they explicitly say they only use weight decay in one case (math), and worry it might have distorted the scaling law there.

- In the GPT-3 paper, they use a fairly high weight decay of 0.1.  In the acknowledgements, they thank Alec Radford for “…demonstrat[ing] the benefit of weight decay for training,” suggesting perhaps they had not used (enough? any?) weight decay earlier.

evidence from weights

The papers aren’t clear, but the weights are.  (Conclusion: yes weight decay)

For 3 of the 4 variants of GPT-2, I computed the squared L2 norm of the pre-trained weights (the sum of squares over all parameters).  The square root of this is easier to read, so I’ll report that here.

- Small: 1639

- Large: 1404

- Xlarge: 1505

The key point here is that we have almost the same norm (sum of squares) across parameter vectors of very different sizes.  Xlarge is 4x the size of Small, so if the weights were the same scale, it would have 4x the norm of Small (and 2x the sqrt-norm).  This suggests something – such as weight decay – is pushing the weight norms to about the same size.

inferring how much weight decay

Weight decay (in the “fixed” version everyone uses now) is basically L2 regularization.  So if your original loss is L, your regularized loss is 

L + (lambda * (learning rate) * (l2 norm of weights / 2))

where lambda is the amount of weight decay.  These terms will equilibrate to about the same size.  (Skipping technicality here about rate schedules.)

Training loss L is in the range of 3-4 for GPT-2.  I don’t know what learning rates / schedules were used, but based on the GPT-3 and scaling papers, let’s assume they were something like 2e-4.

Then for Xlarge, with L=3, we have lambda=0.011 – a good match to the value 0.01 from the first GPT paper.

I suppose there could be a coincidence where other forms of regularization produced the same result, but it seems unlikely.
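(Spelling out the arithmetic, using the assumed round numbers above – the inputs are guesses, so only the ballpark matters.  With these exact inputs the back-of-envelope gives roughly 0.013; slightly different assumed loss or learning rate moves it around, but it stays in the neighborhood of 0.01:)

```python
L = 3.0             # assumed training loss
lr = 2e-4           # assumed learning rate
sqrt_norm = 1505.0  # reported sqrt of the sum of squared weights (Xlarge)

norm = sqrt_norm ** 2         # sum of squares
# equilibrium: L ~ lambda * lr * (norm / 2), so solve for lambda
lam = 2 * L / (lr * norm)     # ~0.013 -- same ballpark as GPT-1's 0.01
```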

EDIT: I started second-guessing myself here while thinking about their initialization scheme:

A modified initialization which accounts for the accumulation on the residual path with model depth is used. We scale the weights of residual layers at initialization by a factor of 1/ √ N where N is the number of residual layers.

But “N” here only scales logarithmically, not linearly, with param count.  (Because they scale up params along dimensions other than layer count as well.)  For example, Large has ¾ as many layers as Xlarge, but only ½ as many params.  So if this were the mechanism setting the scale of the final weights, they would not be constant with param count.

memory (mis)management in keras

Keeping with my theme of occasionally blogging about how much I hate tensorflow and/or keras

Here’s a “fun” thing I discovered today.  If you have ever used Keras, you might have called this function:

K.clear_session()

This is supposed to help with memory.  Here’s what the docs say:

Resets all state generated by Keras.

Keras manages a global state, which it uses to implement the Functional model-building API and to uniquify autogenerated layer names.

If you are creating many models in a loop, this global state will consume an increasing amount of memory over time, and you may want to clear it. Calling clear_session() releases the global state: this helps avoid clutter from old models and layers, especially when memory is limited.

Does K.clear_session() actually “reset all state generated by Keras”?  Nope!

—-

Okay, some background is needed.  

tensorflow (tf)

…is a neural network library whose core is written in C++.  It technically has API bindings for various high-level languages, but almost all users call it via the python bindings.  So, to most users, tensorflow is a complicated python package containing some black boxes that drop into C++ and do the actual computations.

keras

…is a weird piece of malware attached parasitically to tensorflow.  It’s best to avoid it entirely, but in some contexts this is difficult.

memory in tf

Because tf is written in C++, it doesn’t “have” automatic memory management.  This can get weird when you access tf solely through the python API, as most people do.

Because tf uses manual memory management, memory it allocates won’t be freed unless tf frees it at some point.

If you’re interacting with tf through the python API, you deal with python objects that are sort of “associated with” the underlying tf stuff that allocates memory.  If you want to make sure memory gets freed at an appropriate time, you can either:

  1. Call a method on the python object to explicitly free the memory.  This always works, but you can only do it if you have a reference to the object sitting around so you have something to call. 
  2. Hope the python garbage collector will deal with it.  Usually this means trying to get rid of references to the object, so it’s kind of the opposite of the first one: either it works, or it doesn’t work and then you can’t do #1 because you’ve deleted your reference. 

There are two places that lots of memory can build up in tf: “graphs” and “sessions.”

A graph defines a static computation you want to do.  A session is sort of an “execution context” in which you say “hey, do the computation specified by [this graph].”

I don’t know why sessions have to have state at all, but apparently they do, and it can get big.  The docs say:

A session may own resources, such as tf.Variable, tf.queue.QueueBase, and tf.compat.v1.ReaderBase. It is important to release these resources when they are no longer required. To do this, either invoke the tf.Session.close method on the session, or use the session as a context manager.

tf.Session.close here is method #1 from my list above, for sessions.  If you call it, great, you’ve cleared the memory.  What if you don’t?

Well, the Session class tries to close the session when python garbage collection happens to it.  It does this by defining __del__.  In python 3.4 and later, __del__ always gets called upon garbage collection, so we are sure to free the memory at some point.

However, “at some point” may not be good enough if you’re training neural nets, which usually means using as much of your memory as you can get away with.

garbage collection in python

Python (as CPython) has 2 types of garbage collection.

If all pointers to THING are deleted, THING gets immediately collected.  (“Immediate” is good when you’re using a lot of memory.)

THING can also get garbage collected if there are still pointers to it, but only as part of a cyclic isolate – a group of objects that nothing else cares about, but point to each other in a self-referential loop.

However, this second route is slower and works based on heuristics.  A cyclic isolate won’t get collected the moment it becomes a cyclic isolate.  You have to wait for the heuristics to decide it’s time, and you might not be able to get to that point before running out of memory trying to accomplish something.
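(Here’s the difference between the two routes, made visible with weakref.  This is CPython behavior; the automatic collector is disabled up front so the timing is deterministic for the demo:)

```python
import gc
import weakref

gc.disable()  # turn off the heuristic collector so timing is deterministic

class Thing:
    pass

# Route 1: last reference dropped -> refcount hits zero -> freed immediately
t = Thing()
ref1 = weakref.ref(t)
t = None
immediate = ref1() is None          # True, without any collector run

# Route 2: a cyclic isolate survives until the cycle collector runs
a, b = Thing(), Thing()
a.partner, b.partner = b, a         # self-referential loop
ref2 = weakref.ref(a)
a = b = None
still_alive = ref2() is not None    # True: nothing points at it, yet it lives
gc.collect()                        # run the cycle collector by hand
collected = ref2() is None          # True only now
gc.enable()
```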

sessions in keras

Keras attempts to hide tensorflow details from you.  With sessions, it does this by maintaining a single global variable called _SESSION.session.

Whenever keras needs a session, it calls K.get_session(), which returns _SESSION.session, after creating a new session if the value happens to be None.

Now, remember K.clear_session().  What does it do?  If you’ve been following along so far, you’d expect something like

_SESSION.session.close()

But no, what it actually does is

_SESSION.session = None

In ordinary python code, this is a standard way to free memory.  You take an existing reference to THING and make it point at None, instead of its previous referent.  This triggers garbage collection if appropriate – either immediately, or with a delay if there is a cycle.

In keras, _SESSION.session gets put into a complicated reference cycle.  I don’t know why.  I don’t know whether it’s for a good reason, or whether it was just convenient or seemed somehow “pythonic.”

But anyway, if you create a keras “Model,” called let’s say “model,” and you call “model.fit,” you now have two (why two? who knows) extra references to _SESSION.session:

- model._session

- model.train_function._callable_fn._session

Meanwhile, model refers to itself in several places:

- model.history.model is a reference to model

- model._inbound_nodes[0].outbound_layer – whatever the hell that’s about – is a reference to model

So, immediately garbage collecting _SESSION.session is pretty tough.  Simply setting it to None won’t get rid of the pointers hanging off of model.  Setting model to None won’t do the job either, because it will turn model into a cyclic isolate, taking _SESSION.session with it.

Now, yes, eventually even cyclic isolates get collected.  Which means that, if you call K.clear_session(), your memory will eventually be freed.

But why “eventually”?  We know how to clear that memory immediately.  It’s called “Session.close(),” and they could just … do that?  But they don’t. 

(And in fact, because setting _SESSION.session = None throws away your reference to the problematic object, calling K.clear_session() leaves you unable to manually deal with it, which you would have been able to do before!)
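(A toy version of the difference.  FakeSession here is a stand-in class, not tf’s; the real call would be Session.close():)

```python
class FakeSession:
    """Toy stand-in for tf.Session: 'owns memory' until close() is called."""
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

# what clear_session() does: drop the global reference and hope
session = FakeSession()
extra_ref = session          # like model._session hanging off a keras model
session = None               # nothing closed; extra_ref keeps it alive
leaked = not extra_ref.closed

# the close-first order: free the resources, then drop the reference
session = FakeSession()
extra_ref2 = session
session.close()              # released immediately, extra refs or not
session = None
freed = extra_ref2.closed
```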

learn-tilde-ath asked:

If I wanted to try fine-tuning gpt2 (possibly just the smallest version if that needs less data?), do you know a rough lower bound on how big a corpus I would need in order for that to work ok? I want to use text I've written, but I'm not sure that there's enough of it (I have a little over 2MB handy). I've tried casually to look this question up a couple times but my googling didn't find much of an answer.

possibly just the smallest version if that needs less data?

Train on the biggest version you can, actually – bigger ones are more data-efficient, not less.

do you know a rough lower bound on how big a corpus I would need in order for that to work ok?

It depends on what you’re going for, but 2MB is definitely enough to be worth trying.

The smallest (?) corpus I’ve ever done was 1.8 MB of a friend’s tweets and blog posts, a long time ago, and he was impressed and amused with the results.