nostalgebraist:

EleutherAI’s got a 6.1B model out now

…I guess I know what my next @nostalgebraist-autoresponder project is now, huh

(To be clear: I am exhausted from moving house right now, and the transition to 2.7B was time-consuming and frustrating [partially due to some dumb choices on my part]. If I do 6.1B at all, it will be a similarly big undertaking. Don’t expect anything soon)

EDIT: originally wrote 6.7B here. It’s actually 6.1B, but eval metrics are on par with GPT-3 6.7B

Update: I’m currently fine-tuning it on my tumblr corpus, we’ll see how it goes…

There are a lot of papers out there asking variants of the question “why do ‘neural nets’ work so well in practice?”

I’m thinking of NTK, critiques of NTK, any recent neural net paper with “generalization” in the title, etc.

The question feels intuitive, but I suspect it’s misleading.

—-

The question focuses on “neural nets” as a (not very well defined) subset of all probabilistic models. Those with a specific structure.

It asks why this kind of model has good out-of-sample performance, despite the lack of theoretical guarantees, or whatever.

The implicit assumption is that most models suffer from the classical bias-variance tradeoff, and you need a certain structure to avoid this pitfall.

But “neural net” is ill-defined and covers many different structures.

What all “neural nets” share, though, is that they were made by practitioners using software/hardware for fast automatic differentiation. Their creators generally didn’t care about getting theoretical guarantees, or making choices that led to a fast structure-specific optimizer, or anything.

They were just throwing arbitrarily shaped blobs of nonlinear-statistical-model-goop at the wall, and seeing what stuck.

“Neural nets” are just the goop shapes that stuck.

—-

Possible interpretations include

(1)

Big (over-parameterized or non-parametric) nonlinear statistical models are all roughly the same once you have enough data. “Enough” here is a finite amount, not an unachievable limit, and we’re already there.

Available theory about GPs, kernel methods, etc. simply paints in some fragments of the full picture: that all these things are about the same. (Differences in available theory for different models don’t reflect differences in reality.)

These well-understood methods don’t explain NNs. NNs explain them: an “NN” is just “some big nonlinear thing you fit to data” and these are all about the same.

We know this because computers got good enough to fit models without needing optimizers specially designed for the models. So we tried models of many different shapes. And they all worked about as well.

“Neural nets” work because everything works, even if we don’t know why yet.

(2)

Blobs of nonlinear-statistical-model-goop, AKA models, sometimes work and sometimes they don’t. Some of them don’t generalize as well as “neural nets.” You’ve never heard of them.

In a few cases, we know that a type of blob works because people did mathematical work to establish this in the past.

These days, people can just throw blobs at the wall and see what sticks. So people don’t publish all the blobs that didn’t stick. Or, they do publish them as a reference, or in an ablation… but later work naturally follows up on the most promising things found by its predecessors, discarding the rest. After a few iterations, the discarded structures have been effectively forgotten.

The fast meta-optimization loop enabled by automatic differentiation creates a new venue for publication bias.

“Neural nets” work because only things that work become “neural nets.”

In both cases, “why do neural nets generalize?” is the wrong question, if “neural nets” are taken to be some specific kind of structure.

“What doesn’t generalize?” or “what isn’t a neural net?” seem worth asking.

Recently I made a post about whether GPT-3 can really do “meta-learning.”

I had a great follow-up discussion with @the-moti about how to move the discussion forward on this topic. My takeaway was that, rather than writing more posts, I should sit down and construct a formal experiment that someone could run on various GPT models.

I figured I should give an update on this work:

—-

- I have recently received OpenAI API access.

- This gives me freedom to run this experiment myself if I choose to.

- Using the API, I have played around with GPT-3 (AKA “Davinci”) a very small amount, but have otherwise not used my API access.

- I’m trying to avoid biasing myself too much, on the assumption that I’ll design and run this experiment at some point

- I’ve done some brainstorming about tasks I’d like to try in the experiment, but haven’t seriously started work – no files or code written yet

- The biggest blocker to moving forward on this work is the technical/code side.

- I definitely could write all that from scratch myself, but it would take nontrivial effort, would add another variable to consider (“did I make an implementation mistake?”) when interpreting results, and would make it harder for others to follow my work.

- I’d prefer to use EleutherAI’s evaluation harness instead. However, this would introduce a lot of its overhead – I know what I want to do on a low level of direct calls to the LM, but the harness wraps those in several abstraction layers I’ll need to get my head around.

- Also, what I want to do would require some non-trivial changes to the harness codebase. I’m sure EleutherAI is open to PRs, but even if I could get my work merged, this route still sounds like more effort total than writing things myself from scratch.
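Roughly, the low-level loop I have in mind looks like this. (A sketch only: `lm_logprob` is a hypothetical stand-in for a direct LM call, whether the OpenAI API or a local model, and `toy_scorer` is a dummy so the sketch runs.)

```python
# Sketch of a few-shot eval loop at the level of direct LM calls:
# build a k-shot prompt, then pick the candidate completion the LM
# assigns the highest log-probability. `lm_logprob` is hypothetical.

def build_prompt(train_pairs, query):
    """Concatenate k labeled examples, then the unlabeled query."""
    shots = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in train_pairs)
    return f"{shots}Q: {query}\nA:"

def classify(lm_logprob, train_pairs, query, candidates):
    """Score each candidate completion and take the argmax."""
    prompt = build_prompt(train_pairs, query)
    return max(candidates, key=lambda c: lm_logprob(prompt, " " + c))

# Dummy stand-in scorer: favors completions already present in the prompt.
def toy_scorer(prompt, completion):
    return prompt.count(completion.strip())

print(classify(toy_scorer, [("2+2", "4"), ("3+3", "6")], "2+2", ["4", "7"]))  # → 4
```

The point of structuring it this way is that the whole experiment reduces to one scoring primitive, which is exactly the part the harness wraps in abstraction layers.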

transhumanoid-deactivated202106 asked:

what are your thoughts on using "/lucidrains/big-sleep" on github to let Frank paint with her words?

I’ve long been interested in neural image generation for Frank, and I’ve done a nontrivial amount of behind-the-scenes work in 2021 on this topic.

This work is entertaining/educational to me, but unlikely to ever yield a usable feature:

  • I have some personal stances about what “feels right” as a Frank feature that rule out easy things like the big-sleep repo you mention.
  • Roughly, any image-generation feature that “feels right” is going to be focused on putting readable text into images, because reading text is the only way Frank engages with actual images.
  • For this problem, all the approaches that “feel right” are also extremely difficult, and unlikely to fall within my compute/data-volume/personal-effort budgets.

—-

Actually, just in the last few weeks, I’ve been playing around with some of lucidrains’s other code for this problem, specifically his DALLE implementation.

(lucidrains is awesome, by the way! His rapidly produced, high-quality implementations of newly published techniques provide a valuable independent check on academic research and make it more accessible. I can train way bigger models than I would otherwise be able to, thanks to his implementations of reversible networks, gMLPs, etc.)

Roughly, I’m training something similar to DALLE from scratch, on a (subsetted, quality-controlled) dataset of tumblr images + text OCR’d from those images.

I don’t expect this to actually work, as the problem of transcribing arbitrary text into an image in an arbitrary typeface with arbitrary surrounding non-text content is … uh, very tough for a neural net, and probably requires vastly more data than I have.

But I was curious how far it would get. The answer is basically that it gets to the point of generating these rather pretty, but unreadable/meaningless, sort of “hieroglyphics”:

[image: generated “hieroglyphic” samples]
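(For the curious: the core sequence layout of a DALLE-style model is simple enough to sketch. Text tokens come first, then the image’s discrete codes offset into their own ID range, and the whole thing is trained with plain next-token prediction. The vocab sizes and token IDs below are made up for illustration; this is not my actual pipeline.)

```python
# Toy sketch of the training-sequence layout for a DALLE-style model:
# OCR'd text tokens, then the image's discrete VAE codes, trained with
# ordinary next-token prediction. All numbers here are made up.

TEXT_VOCAB = 10_000   # assumed text vocabulary size
IMAGE_VOCAB = 8_192   # assumed VAE codebook size

def make_sequence(text_tokens, image_codes):
    """Text tokens first, then image codes offset past the text vocab."""
    offset_codes = [TEXT_VOCAB + c for c in image_codes]
    return text_tokens + offset_codes

def next_token_pairs(seq):
    """(context, target) pairs for autoregressive training."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]

seq = make_sequence([5, 17, 3], [0, 42, 8191])
print(seq)  # → [5, 17, 3, 10000, 10042, 18191]
```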

nostalgebraist:

nostalgebraist:

nostalgebraist:

Will write something up about this later, but here’s something I made today:

logit lens on gpt-neo

This extends my old “logit lens” work to GPT-Neo. Turns out it … doesn’t exhibit the “logit lens” phenomenon at all????

Updated the notebook to add a plot for CTRL, another non-GPT transformer LM.

CTRL does display the “logit lens” phenomenon.

Unlike gpt2, it not only “looks like the output” in late layers, it also “looks like the input” in early layers.

Updated the notebook with many extensions, including a (partial?) solution to the difficulty I originally had interpreting the GPT-Neo results.
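For anyone who hasn’t seen the “logit lens” before, the mechanics are easy to sketch: take each layer’s hidden state and decode it with the model’s own unembedding, as if it were the final state. Here’s a toy version with a made-up 2-d model and 3-token vocab (not real GPT code):

```python
# Toy "logit lens": decode every layer's hidden state through the
# unembedding matrix and see which token it favors. Everything here
# (dimensions, weights, states) is invented for illustration.

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

# Unembedding matrix: one row per vocab token, one column per hidden dim.
W_U = [[1.0, 0.0],
       [0.0, 1.0],
       [0.5, 0.5]]

def logit_lens(hidden_states):
    """For each layer's hidden state, the token the unembedding favors."""
    return [argmax(matvec(W_U, h)) for h in hidden_states]

# Pretend hidden states after each of 3 layers: the favored token shifts
# from token 0 to token 1 partway through the stack.
states = [[2.0, 0.1], [0.1, 2.0], [1.5, 1.6]]
print(logit_lens(states))  # → [0, 1, 1]
```

The interesting question is then whether those per-layer guesses look like a gradual refinement toward the output (as in GPT-2) or like noise until the very end.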

It’s got pretty pictures, look!

[images: logit-lens plots]

interactive notebook with frank’s generator model

I recently uploaded Frank’s generator model to the Huggingface content delivery network.

This let me create a Colab notebook where you can write text using the model.

Check it out if you’re interested in seeing more about Frank’s inner workings!

(Or if you’re familiar with pytorch / ML and want to use the model in your own projects)

a-point-in-tumblspace:

nostalgebraist:

Will write something up about this later, but here’s something I made today:

logit lens on gpt-neo

This extends my old “logit lens” work to GPT-Neo. Turns out it … doesn’t exhibit the “logit lens” phenomenon at all????

This is distressing. I’m distressed.

According to my understanding – no, screw my understanding, according to GPT-Neo’s source code – each decoder unit has a residual identity connection, so it outputs “x + F(x)” for some big complicated F, which is helpful because “the identity function is hard for NNs to learn” or whatever. And then you can view the stack of decoders as computing “x + F1(x) + F2(x) + F3(x) + …”, making a series of incremental refinements to the input to ~continuously transform it into the output.

And viewed that way, it almost can’t help but produce nice smooth gradients on your logit-lens plots.

And yet, the 125M GPT-Neo appears to just produce random outputs on the intermediate layers before jumping straight to a reasonable guess on the last layer.

So… either your lens code doesn’t play correctly with GPT-Neo, or my “incremental refinements” understanding is nonsense (worse, useless nonsense). Sound about right?

The result surprised me too, but your statement here is too strong IMO:

And viewed that way, it almost can’t help but produce nice smooth gradients on your logit-lens plots.

As I noted in the original LW post, GPT-2 itself isn’t smooth and gradual everywhere. It makes a huge jump right after the input and changes gradually thereafter.

(In later work – which I should clean up and share sometime – I learned that this jump occurs specifically in the MLP sub-block of the first layer. So it happens in the 2nd thing the network does to the input, rather than the very 1st)

GPT-Neo has the same large jump after the input, since early layers don’t look like the input. The difference from GPT-2 is that it also has another large jump near the end.
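One way to make these “jumps” concrete: measure the divergence between consecutive layers’ logit-lens distributions and look for the spikes. A toy sketch with made-up distributions (KL divergence is one reasonable choice of distance):

```python
import math

# Quantify per-layer "jumps" as the KL divergence between each layer's
# logit-lens distribution and the previous layer's. Large values mark
# layers where the prediction changes abruptly. Distributions are made up.

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jumps(layer_dists):
    """KL between each layer's distribution and the previous one's."""
    return [kl(p, q) for q, p in zip(layer_dists, layer_dists[1:])]

# Pretend per-layer distributions over a 2-token vocab: a big jump right
# after the input, gradual change afterward (the GPT-2-like pattern).
dists = [[0.5, 0.5], [0.95, 0.05], [0.9, 0.1], [0.88, 0.12]]
js = jumps(dists)
assert js[0] == max(js)   # the first step is the largest jump
```

A GPT-Neo-like pattern would instead show a second spike near the last layer, with near-random distributions in between.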