I’ve updated the @nostalgebraist-autoresponder Colab demo notebook.

It now includes a section that loads the “head” models, and lets you see what they think of the generated output.

For those who don’t know, the “heads” are three machine learning models used in the bot alongside the main “generator” model. They provide information that helps the bot decide which posts to make.

  1. Selector: “will this post get a lot of notes?”
  2. Sentiment: “does this post sound happy or sad?”
  3. Autoreviewer: “is this post potentially offensive?”

Check it out!

I deployed a new version of Frank’s generator model today.

This one is still GPT-J, and is generally similar to the previous one.

However, I’ve worked out a lot of the kinks in my GPT-J fine-tuning code/process since I did it the first time.

For example, the first model did not train at the intended learning rate schedule due to a bug, and its learning rate was overall much lower than what I wanted.

Click here to see a report with much more info, including loss plots over the course of training.

—-

I don’t know if this model is much different qualitatively from the last one. Output feels broadly similar to me.

However, it does achieve much better validation loss than the previous one: 1.73 vs the old 1.91. That’s similar in size to the gains I got from the original move to GPT-J.

But, I’m not entirely sure how much I trust gains on my validation data to translate to qualitative improvements. There’s a tradeoff between achieving low loss on my tumblr data and retaining performance on the much more general pre-training dataset, or other generic capabilities.

To quantify the tradeoff, it’d be cool to check how fine-tuning affects the few-shot benchmarks using EleutherAI’s eval harness… the codebase is set up to do this during pre-training, but it will take some work to do the same thing during fine-tuning.

official-kircheis asked:

How much compute does it take to fine-tune GPT-2? I want to see what it would do with the nLab

Depends on the type of compute (GPU, TPU, etc).

The usual machine of choice for fine-tuning is a TPU v3-8, which can handle GPT-2 as well as bigger/better models like GPT-J.

These cost $2/hr preemptible, or you can just get them for free via TPU Research Cloud. I recommend the latter, unsurprisingly.

https://generative.ink/posts/language-models-are-0-shot-interpreters/

Somehow missed this great post until just now.

Contains a lot of experimental evidence on the information content of the “shots” in GPT-3 few-shot.

A key quote:

As we were attempting to replicate these results, we noticed that when the model was failing on the 0-shot prompt, the failures were often of catastrophic nature: the task was not attempted at all, e.g. the model would output a newline, or another (or the same) French phrase instead of an attempt at an English translation.

BLEU assigns a score from 0 to 1 to the accuracy of a translation, and would assign a score close to 0 to a catastrophic failure. The scores reported in the paper, however, are averaged over a large dataset, so the same score could hypothetically correspond to uniformly flawed attempts or a mix of perfect attempts and catastrophic failures.

It seemed possible that 0-shot prompts were much less reliable at getting the models to attempt the translation task, but result in equivalent accuracy in the event that they did attempt it.
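The averaging point is easy to verify numerically: the same mean score can come from uniformly mediocre translations or from a mix of decent attempts and catastrophic failures. The per-sentence scores below are invented for illustration, on the 0–100 scale BLEU results are often reported in.

```python
# Two hypothetical sets of per-sentence BLEU scores with the same mean.
uniform_flawed = [30] * 10           # every attempt mediocre
mixed = [60] * 5 + [0] * 5           # half decent, half catastrophic failures

mean = lambda xs: sum(xs) / len(xs)

# Aggregate BLEU can't distinguish the two failure profiles:
assert mean(uniform_flawed) == mean(mixed) == 30
```

This is why the post's authors had to look at individual 0-shot completions, rather than the reported aggregate, to notice the catastrophic-failure pattern.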

[…]

How much of the apparent consistent monotonic improvement in performance on tasks relative to number of shots in OpenAI’s results can be attributed to an unhelpful zero-shot prompt? Much more extensive testing is needed to say, but I suspect that this is the case for most of the translation tasks, at least.

official-kircheis:

nostalgebraist:

FYI:

Frank is getting an unusually large quantity of asks and other responses tonight. I don’t think I’ve ever seen Frank’s inbox this busy since I turned anon off.

A backlog of asks/etc. has built up, and they keep coming in.

Response times are abnormally slow because of this.

How long does it take Frank to generate a post? Wall clock time.

Good question.

The answer is “anywhere from ~60 seconds to ~10 minutes, depending on various factors.”

—-

What are the biggest influences on Frank’s speed?

First, recall that Frank uses something like rejection sampling, in several passes. For every post, the GPT model generates many candidates, only one of which will be used.

The two big influences are:

  1. Length. Longer sequences take longer to generate, indeed quadratically so. (Because attention computes a (length) x (length) matrix.)

    This includes the prompt, so writing the next post in a long reblog thread is slow.

    Likewise, prompts that elicit long responses are slower. The “tell me a story” asks are extremely slow for this reason.
  2. Mood. Frank is dramatically slower in happy moods. (As a result, when Frank is really happy it often low-key stresses me out…)

    Why? Frank’s mood defines an interval of sentiment scores, and all candidates with scores outside that interval are discarded. This is the first rejection pass.

    This pass rejects a much larger % of posts in happy moods than sad ones. This can be interpreted in various ways – maybe my blog (or tumblr in general) is more often sad than happy, or maybe the sentiment model is just weird / imperfectly suited to the task.

    To ensure we still have enough candidates left for the selector model (etc) to choose from, my code scales the number of candidates up or down based on the current mood. The goal is to achieve a constant expected number of posts left after rejection.

    For typical posts, the number of candidates ranges from ~18 in lower moods to ~29 in high moods. That’s a huge difference.
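The scaling logic above can be sketched like this. It's a toy version: the acceptance rates, the target survivor count, and the sentiment intervals are my illustrative guesses, not the bot's actual numbers (though the resulting 18–29 range matches what's described).

```python
import math

def n_candidates(accept_rate: float, target_survivors: float = 8.0) -> int:
    """Scale the batch size so that, in expectation, about
    `target_survivors` candidates survive the sentiment-interval
    rejection pass, regardless of mood."""
    return math.ceil(target_survivors / accept_rate)

def sentiment_pass(scores: list[float], lo: float, hi: float) -> list[float]:
    """First rejection pass: keep only candidates whose sentiment
    score falls inside the current mood's interval [lo, hi]."""
    return [s for s in scores if lo <= s <= hi]

# Happy moods accept a smaller slice of the sentiment distribution,
# so they need more candidates up front (hypothetical rates):
assert n_candidates(accept_rate=0.45) == 18  # low mood, wide interval
assert n_candidates(accept_rate=0.28) == 29  # high mood, narrow interval
```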

nostalgebraist:

interactive notebook with frank’s generator model

I recently uploaded Frank’s generator model to the Huggingface content delivery network.

This let me create a Colab notebook where you can write text using the model.

Check it out if you’re interested in seeing more about Frank’s inner workings!

(Or if you’re familiar with pytorch / ML and want to use the model in your own projects)

I’ve made an updated version of the notebook that uses the new GPT-J 6.1B model.

Check it out!

Frank is now using a finetuned GPT-J as her generator model.

This is EleutherAI’s new 6.1B model, which performs comparably to OpenAI’s “Curie” model and to GPT-3 6.7B. (Which is probably the same thing as Curie.)

“We think it’s probably fair to say this is currently the best open source autoregressive language model you can get by a pretty wide margin.”

It’s a bit more than twice as large as Frank’s previous model.

—-

As you’d expect, this one needs more GPU memory, which is currently straining the limits of what I can handle – specifically, fitting the generator model and all the extra “heads” like the selector model on the same processor.

I’ve hacked together a quick workaround that might solve the problem. If it doesn’t, I may roll back to GPT-Neo for some period of time until I can think of a better fix.

nostalgebraist:

nostalgebraist:

nostalgebraist:

EleutherAI’s got a 6.1B model out now

…I guess I know what my next @nostalgebraist-autoresponder project is now, huh

(To be clear: I am exhausted from moving house right now, and the transition to 2.7B was time-consuming and frustrating [partially due to some dumb choices on my part]. If I do 6.1B at all, it will be a similarly big undertaking. Don’t expect anything soon)

EDIT: originally wrote 6.7B here. It’s actually 6.1B, but eval metrics are on par with GPT-3 6.7B

Update: I’m currently fine-tuning it on my tumblr corpus, we’ll see how it goes…

I took a break from fine-tuning to write a more usable/lightweight train script.

Currently fine-tuning again with the new script

- Converted the model to pytorch (thanks to the invaluable finetuneanon)

- Training the heads (selector, etc.) now

- Setting things up on a branch
