Will write something up about this later, but here’s something I made today:

logit lens on gpt-neo

This extends my old “logit lens” work to GPT-Neo. Turns out it … doesn’t exhibit the “logit lens” phenomenon at all????

What I’ve been doing lately in Frank development:

  1. Switching the ML stuff from tensorflow to pytorch.
  2. Replacing the generator model with one 2x as big, finetuned from the 2.7B GPT-Neo checkpoint released by Eleutherai. (This is the same size and architecture as the smallest GPT-3 model)

#1 is basically done and I should be able to “flip the switch” in production soon, probably tomorrow

#2 is nearly done on the development side, but might be too slow to be practical for Frank’s level of demand. No way to be sure without trying it

The second was enabled by the first: I finetuned the Eleutherai model in tensorflow(-mesh), same way they trained it, then spent like a week going down a Pepe Silvia-style rabbit hole trying to figure out how to do inference with the damn thing.

…then I converted it to pytorch and it instantly worked like a charm. Like 15 minutes of work after spending days on the tf version (by the tail end of my quixotic efforts, I was actually rewriting and rebuilding parts of tf itself from source).

I’d been meaning to switch the project to pytorch for a long time, and this was the last straw.

My post “the scikit-learn cargo cults” from earlier in the week got linked on HN.

There aren’t that many comments on the HN post, but every commenter there seemed to read it in roughly the same way. Their reading is very different from what I originally intended to say. It’s like they’re all reading a totally different post from the one I (thought I?) wrote.

I wish I knew whether

  • I was much less clear in the post than I think I was, or
  • the HN comments are not representative of how most/many readers would interpret the post

If anyone with the relevant background wants to offer feedback on whether or where I communicated something badly, I’d be thankful. (The feedback I got from @the-moti in this post is a good example of the kind of thing I’m looking for.)

the scikit-learn cargo cults

People who design machine learning frameworks love the scikit-learn estimator interface. We can tell they love it, because they keep trying to imitate it.

But love and understanding are not the same – and none of these designers seem to understand what the sklearn estimator interface is. This failure is

  • inexplicable, because the concept is very simple
  • utterly disastrous in its consequences

—–

Specifically, no one seems to get that the sklearn estimator interface is … wait for it … an interface.

That is: it specifies a standard way for objects to communicate with one another. It doesn’t specify what the objects are, themselves.

That’s the whole point. Anything can be an sklearn estimator, as long as it conforms to the rules that sklearn lays down for estimators.

Aside from that, it can contain anything, do anything. It’s very easy to write a whole new sklearn estimator that no one has ever thought of before: the docs tell you exactly how an estimator is expected to behave, and as long as your object plays by those simple rules, it’s allowed to join the game. (What’s more, you can get a lot of the rules for free, just by inheriting from the base classes and mixins sklearn provides.)

The simple rules include having a method called “fit,” which takes one or two inputs and ought to set some internal state. For predictors, the most famous type of estimator, you need a method called “predict.” This will matter in a moment.

(Sidenote: the sklearn estimator interface is really not a great example of an interface, because it actually does care about internals. It inspects attribute names and requires them to follow its own rules, and it has a not-fully-explicit expectation that estimators can be serialized with pickle.

However, these requirements are still interface-y in the sense that they only constrain estimators along a few well-defined dimensions, leaving everything else free. Anything that plays by the rules can still join the game, and play it just as well as the “official” estimators built in to sklearn.)
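To make the rules concrete, here’s a toy estimator of my own invention – it follows the documented conventions (fit sets trailing-underscore state and returns self, predict reads that state) and nothing else:

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin

class MeanRegressor(BaseEstimator, RegressorMixin):
    """Toy predictor: ignores X and always predicts the training mean of y."""

    def fit(self, X, y):
        # Fitted state lives in attributes with trailing underscores,
        # per sklearn's naming rule.
        self.mean_ = float(np.mean(y))
        return self  # fit must return self

    def predict(self, X):
        return np.full(len(X), self.mean_)
```

Despite knowing nothing about sklearn’s internals, an object like this can be dropped into Pipeline, GridSearchCV, or cross_val_score alongside the “official” estimators – which is the whole point of an interface.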

—–

Interfaces are great. They are one of the foundations of modern software. You would think people who loved an interface would learn the lesson “interfaces are great, and we should use them.”

Here is what developers of keras, tensorflow, and Sagemaker learned from that beloved estimator interface:

  • Data scientists love typing the words “fit” and “predict.”
  • It is, in fact, possible – one cannot rule it out – that data scientists do not know how to do anything other than type the words “fit” and “predict.”
  • An “easy to use” ML library is one where you can make the work happen by typing “fit” and “predict.” This is basically what usability is; the rest is details.

—–

Keras: patient zero

The first casualty of this odd disease – indeed, perhaps the patient zero from whom all the rest sprang – was François Chollet, creator of Keras.

Chollet says that sklearn was a “huge influence” on keras. “From Sklearn, I borrowed ‘fit’, but more generally best practices around usability.”

(Note that the claim in the first tweet is false: Keras models have never been valid sklearn estimators, because they do not follow the parameter naming rule. In many versions of Keras they are also not pickleable. Indeed, the tweet itself is about a wrapping layer meant to add this missing compatibility, so I have no idea what “compatibility since 2015” is supposed to mean.)

The “Model” objects in Keras look deceptively like sklearn estimators. They have “fit” and “predict.” The methods do roughly the same things they do in sklearn.

But there is no “Keras estimator interface.” There is only one known valid species of the Keras fit/predict gizmo, namely “Model,” the one built into Keras.

The only way to roll your own thing that behaves like “Model” is to subclass “Model.” With sklearn, it’s helpful to inherit from BaseEstimator, but that just helps you follow a few rules, and you can easily follow them on your own. There is no set of rules that “Model” is following. It doesn’t follow the law, it is the law.

“I have in hand an sklearn estimator. What does that mean?” Just read this page: that is literally all there is to know.

“I have in hand a Keras model. What does that mean?” Read this labyrinthine piece of code, and also read everything it imports. That’s what a model does. Yes, you have to read the code — the docs tell you how to subclass Model, not what Model is.

—–

Tensorflow gets a fit/predict gizmo

Keras started out as a 3rd-party library, but was incorporated into tensorflow at some point, and was pushed as the standard way to develop neural nets in tf.

This is unfortunate, because Keras objects are complex beasts and no one really knows how to decompose one fully into primitives of tensorflow (or of anything). Nothing can be a Keras object that was not built as one from the ground up.

Thus, read any tensorflow doc and you’re likely to run into a strange split: “if you’re using Keras, then do X…” “…otherwise, do Y.” There has to be a generic path because you might not be using Keras, and if you aren’t, you’re stuck there. Thus everything gets done twice, often different ways.

All for poor, little “fit” and “predict”!

—–

Tensorflow makes another one

That is not the end of the story. No, at some later date tensorflow decided one fit/predict wasn’t enough. (“The more fit/predict-y a library is, the more usable it is,” to adapt a meme.)

Thus, tensorflow introduced a new thing called – of course – “Estimator.”

What the fuck is an Estimator (tensorflow flavor)? Well, it’s yet another gizmo with “fit” and “predict.”

It’s not a Keras model, but is more generic than a Keras model, and indeed closer to the spirit of sklearn. Its “fit” and “predict” can wrap almost arbitrary tensorflow code.

I suppose this may be one of the reasons they created it in the first place. But they didn’t get rid of Keras’ fit/predict thing, they just confusingly had two at once – and indeed the Keras gizmo both predated Estimator, and outlived it. (Like all reliable tensorflow features, Estimator has been officially deprecated and dis-recommended outside some specific legacy cases; references to Estimator are being slowly scrubbed out of the official guides as we speak.)

Estimator has (had?) its own complex ecosystem of helpers, most of them only “internal” and documented in code, just like Keras, but all over again. (Right before starting this post, I was trying to wrap my head around one called “MonitoredSession.”)

What really made Estimator different, though, was its support for distributed/cloud computing.

Elaborating on the theme that users cannot do anything but type “fit” and “predict,” Estimator aspires to make even such fearsome tasks as “training on multiple GPUs,” “training on cloud TPUs,” and even “deploying to a cloud service” into a call to either “fit” or “predict.”

Amusingly, Estimator was the primary supported way to take these actions for a while, and certainly the least painful. Thus, any code you wanted to distribute had to be wrapped in a “fit” or a “predict,” for the sake of letting an Estimator be the thing that calls it.

Perhaps (?) because the devs have noticed how unnecessary this is, tensorflow is now trying to ditch Estimator in favor of “Strategy,” a more generic wrapper for distributing arbitrary tf code.

Before this, Estimator and Strategy sat alongside one another awkwardly, just like Estimator and Keras did. Indeed, Estimator seems more reliable than Strategy, and continues to see use in official spin-offs like Mesh Tensorflow, presumably because people know it actually works, and know how to use it in real life.

Meanwhile, Strategy … well, the guide for Strategy contains this mind-melting compatibility table:

image

I remember this table from way back in Dec 2019, when I wrote my tensorflow rant. I am perversely pleased to see it still there in April 2021, with about as many “Experimental” and “Limited” cells as I remember.

(Note that this table’s rows include Keras, a model API, and Estimator, a model-and-distribution API, and compare these for compatibility with Strategy, a distribution API.

If you understood that sentence, I fear you.)

I have spent countless hours trying to understand this kind of nonsense. One might find oneself asking where the “usability” has gone, and where it was supposed to come from in the first place.

Sagemaker: a copy of a copy

Sagemaker is one of the zillions of AWS products.

It’s a “platform for machine learning,” which in practice means it’s Yet Another Complicated Wrapper Around Running Docker Containers On EC2™.

Like any AWS product, Sagemaker has API endpoints, and in python you can call these through the generic client boto3. To serve “high-level” “usability” needs, though, there is also a dedicated python SDK.

I bet you can guess what’s in it.

image

Estimator (Sagemaker flavor) takes the cloud computing focus of Estimator (tensorflow flavor) to its logical conclusion.

Sagemaker “Estimators” do not have anything to do with fitting or predicting anything. The SDK is not supplying you with any machine learning code here. The only vestige of the original meanings attached to these words is that “fit” is expected to modify a state (hence it downloads an artifact from the cloud when it completes), while “predict” should be stateless.

Instead, “fit” and “predict” here are wrappers for pushing and running an arbitrary Docker image. “Fit” runs it with an entrypoint called “train,” while “predict” runs it with one called “serve.”

There are some surrounding helpers with an ML flavor, but they are similarly generic. There’s something called “hyperparameters” which actually means “a json dict with string-only values injected into the container as a file before it runs,” and something called “training data” which actually means “an S3 path the container can read.”

It is impossible to understand what’s going on outside of the “built-in” Estimators without remembering that actually “fit” and “predict” are lies and you are just using Docker.

This is the furthest thing from an interface! Anyone who can make their own Estimator (Sagemaker flavor) also has no reason to do so; if you know how to write Dockerfiles for ECS/EC2, you can just do that without tacking on this extra SDK.

Indeed, Estimator (Sagemaker flavor) is so far from the sklearn original that it is hard to imagine its developers had sklearn clearly in mind when they wrote. More likely, they were trying to imitate the earlier imitators.

Epilogue: pytorch

Pytorch is by far the most user-friendly neural network library available in 2021.

Pytorch does not have “fit” or “predict.”

Ah! I know what that recidivism post reminded me of.

When you’re prompting GPT-2, putting an end-of-text separator at the start of your prompt will (all else being equal) bias the model toward shorter documents.

But, just as in the recidivism case, this doesn’t sound prima facie obvious the first time you hear the claim. You have to think about it first, and only then does it seem obvious.
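One way to see it, with toy numbers of my own: a training window anchored at a random token lands in a document with probability proportional to its length, while a window anchored right after an end-of-text separator picks among documents uniformly.

```python
# Toy corpus: one short doc (10 tokens) and one long doc (1000 tokens).
lengths = [10, 1000]
total = sum(lengths)

# A window anchored at a random token rarely falls in the short doc:
p_token_in_short = lengths[0] / total   # 10/1010 ~ 0.0099

# A window anchored at a document start picks each doc equally often:
p_start_in_short = 1 / len(lengths)     # 0.5
```

Conditioning on “we are right after a separator” upweights the short document by ~50x here – the length-biased-sampling effect in miniature.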

ETA: apparently this is called “length-biased sampling”

Pet peeve: when public codebases for machine learning research projects do “the main.py thing”

That is: they come bundled with a single CLI script, usually called “main.py,” which is capable of calling several entirely different code paths.

Training, evaluation, prediction, one or more “experiments” from the paper, each stage of training if there’s more than one, and anything else the authors did – it’s all “main.py” with different arguments.

This is bad for a lot of reasons, including:

  1. It takes the arguments of several conceptually distinct functions, and smooshes them all into one argument namespace.

    This often requires renaming or overloading them to avoid collisions. An argument called, say, “eval_steps” might do different things when it’s controlling evaluation-during-training vs. when it’s controlling evaluation on its own, or it might just control one of those but not the other.

    This problem could be trivially solved by using multiple CLI scripts.
  2. In practice, “main.py”s are rarely just simple wrappers that select a function and pass CLI arguments to it. They usually contain business logic, like calling functions with hardcoded but non-default arguments, or using the script arguments to make branching if/else decisions about function arguments.

    Everything now comes in two flavors, the “CLI flavor” and the “library flavor.” There’s no way to intuitively assign meaning to these distinctions, because there’s no intuitive reason for them to exist at all. When reading/using the code, you feel like you’re watching two sets of intentions argue with each other, both warning you not to trust the other one.

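On point 1, the fix really is as trivial as it sounds: separate entry points get separate argument namespaces, so the hypothetical “eval_steps” collision simply can’t happen. A sketch (the flag names and defaults here are mine):

```python
import argparse

# train.py -- here "--eval-steps" means "evaluate every N steps during training"
def train_parser():
    p = argparse.ArgumentParser(prog="train")
    p.add_argument("--eval-steps", type=int, default=500)
    return p

# evaluate.py -- the same flag name can mean "cap on eval batches", no conflict
def eval_parser():
    p = argparse.ArgumentParser(prog="evaluate")
    p.add_argument("--eval-steps", type=int, default=None)
    return p
```

Two scripts, two parsers, and each flag means exactly one thing in its own context.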
I don’t see any upsides of this approach?

I imagine it’s just a thing people started doing, and then everyone noticed everyone else was doing it, and researchers tend to be risk-averse about everything that’s not related to the meat of their research, so why rock the boat…

(If you’ve done this, don’t feel bad, I’m not annoyed at you. Just at the pattern.)

(… at least it’s not the even worse pattern where there are multiple scripts, but they only run benchmarks or other narrow tasks, and the ability to train/eval/predict generally is technically there but locked behind each script’s tangle of business logic. The name “run_squad.py” still haunts me)

the-moti:

nostalgebraist:

meta-post on meta-learning

There’s an LW post I keep trying to write. I have several unpublished draft versions of it.

The point I want to make is simple and straightforward, but when I try to write it down, I get worried I’m not … like, “messaging” it correctly? Not striking the right tone?

The point of the post is roughly:

People don’t use the term “meta-learning” consistently when they’re talking about GPT-3. The paper uses the term one way (and they are 100% explicit, they spell out their definition in the text), the blogging community uses it another way.

The bloggers are excited/scared that GPT-3 does “meta-learning” by which they mean something like “general reasoning on the fly without training.”

If you’re excited/scared by this capability (and you should be), then you should really care whether GPT-3 actually has it, to what extent, how the capability scales, etc.

There is very little public evidence on this topic, because the paper is (explicitly!) 95% not about the topic, the remaining 5% is pretty weak evidence, and the only other evidence out there is like … some subjective user impressions? gwern saying “GPT-3 has the capability” in a really eloquent and forceful way?

It would be easy to test the capability much more rigorously than this. This ought to be done since the topic is important. It can only be done by people with API access (AI Dungeon doesn’t count).

But it … feels hard to say this in a way that could actually convince anyone who doesn’t already agree? Like,

  1. These points seem so clearly true to me that when I try to “argue for them,” I feel pedantic and like I’m belaboring the obvious.

    Do I actually have to say “no, few-shot translation from French to English is not an example of general reasoning on the fly?” Surely no one thinks the model is like … learning how to speak French from ~2000 words of data?

    Do I have to quote the part of the paper where it says what it means by meta-learning? It’s right there! You can just read the paper!
  2. I made most of this argument already in my original GPT-3 post, immediately after reading the paper. So (A) I feel like I’m repeating myself and (B) if the point didn’t get across then, why would it now?
  3. There is an element of “mere semantics” to the point and it’s hard to clarify to my satisfaction that no, I don’t just care that blog posts are using a word incorrectly. But I have to bring up the semantic issue to even describe what I am saying.
  4. It feels inevitably like picking on gwern’s choice of words, since blogosphere beliefs about “GPT-3 meta-learning” basically all trace back to gwern’s blog.

    I don’t care about whether gwern is using the right words, he’s just the most detailed “primary source” we have on the topic due to the closed API

I was thinking about this yesterday because @slatestarscratchpad linked my original GPT-3 post in his April linkpost. I actually sat down and wrote up another one of those drafts and … nope, gave up again.

I notice I am able to write this on tumblr with no problems at all. Perhaps this is yet another point of evidence that using tumblr lets me do much more “real blogging” than I could if I had “a real blog.”

Could you write a blog post whose goal was to convince people to do the experiments (regardless of their semantic interpretation of them), rather than to convince people to change their interpretation of the existing experiments?

This seems like a way to move the discussion forwards in a concrete sense.

The overlap of “people with GPT-3 API access” and “lesswrong readers” is not large but it is nonzero (although maybe it’s just gwern and maybe he doesn’t want to do your experiments???). 

I share the sense that it’s best to focus on the proposed experiments – it’s actionable advice, it feels constructive, it tells the reader what kind of concrete evidence would affect my opinion.

The problem is that … well, to convince someone to do the experiments, I have to say why I care about the results. Otherwise, it’s just “hey, here’s a thing you could do, I guess.”

But the true answer to “why do you care about the results?” is simply the argument I described in OP, so we’re back to square one.

(Half-joking option: I could frame it as a sort of challenge, like I’m the “change my mind” meme guy, and I’ll just sit around holding an opinion the reader may find infuriatingly wrong… unless they do the thing I ask.)

—-

A little more background on my neurotic wariness here:

I have mixed feelings about LW 2.0 / Alignment Forum.

On the plus side, I see a lot of valuable discussion there which has no real substitute anywhere else. Many of the frequent posters in AI-related threads are students/alumni of Stuart Russell’s CHAI, or researchers at DeepMind, or things like that, and it’s an unrivaled opportunity to discuss big-picture AGI topics with people like this who understand the fine-grained picture too.

On the minus side, I’ve noticed that the emotional valence of my AI posts – “yay AI!” vs “boo AI!” – is a very good predictor of how well they will be received on LW, while the actual quality of the posts in my own estimation is a weak predictor at best.

  • My most “celebrated” New LW post – the one with the most karma, with the warmest comments, and the only one promoted to “curated” status – was a screed about how GPT-2 is awesome and Gary Marcus is epically wrong.
  • My most controversial post, with the most hostile comments section, was the one that called GPT-3 “disappointing.” This was so noticeable that one supportive reader left a comment about the anomalously poor reception, hoping I would not come away “discouraged from continuing to write about the limits of today’s AI.”
  • In the discussion surrounding the previous post, anything I said about the limits of scaling was treated with skepticism, or at best viewed as a contrarian take. When I later wrote a post on the exact same topic, but framed in a “yay AI!” manner (“OpenAI’s new insight”), it was warmly received.

From the perspective of debate norms, this is probably an unhelpfully uncharitable perspective to take. But one cannot help but notice patterns, and use them to plan one’s actions …

—-

On the specific topic at hand, I guess I’m still smarting from that response I got to the “disappointing GPT-3” post.

Hardly anyone seemed to understand what I was getting at in my comments about the multiple competing interpretations of few-shot results.

Several commenters seemed immediately ready to believe GPT-3 was a fully general reasoner, as though this were a safe default hypothesis with a lot of prior mass, only able to be toppled from its throne via great evidential weight to the contrary. One comment directly put the shoe on the other foot, asking me: “Are there bits of evidence against general reasoning ability in GPT-3?”

Another pattern running through several comments was an odd sense that because the paper was long and did so many things, it was therefore unreasonable or unfair (?) to complain about it not doing any given thing. This is effectively impossible to argue with: what can one do, faced with “the paper was 70 pages and did dozens of experiments, therefore it must have established [whatever claim I’m arguing it established]”?

That one is particularly discouraging re: getting people to try experiments with GPT-3, as I expect this to be perceived as yet another “isolated demand for rigor.”

My long, frankly exasperating exchange with dxu (starting here) is difficult to summarize, but it contributed to my pessimism about any further attempts to discuss the topic on LW. The same goes for my bizarre exchange with gwern (starting here).

Wow, sorry about the rant … I guess I never wrote the equivalent of this rant at the time, so here I am finally writing it almost a year later.

breakruns

I wrote earlier about “Mirostat,” an approach to sampling from language models that tries to avoid the infamous phenomenon of “neural text degeneration.”

In fact, I used Mirostat in practice for a long while, in @nostalgebraist-autoresponder. However, some things about it bothered me, e.g.:

It feels overly complicated.

It’s based on the assumption that the LM’s predictive distribution for an individual token is approximately a Zipf distribution.

We expect statistics of entire corpora to be Zipfian, but I don’t see why that implies anything about predictive distributions on the token level.

Indeed, this assumption is not at all true (I checked)! For instance, the model is often very confident, and puts almost all of the mass at one or a few top tokens. Even when it is not very confident, that looks like ~100% of the mass spread across the “reasonable possibilities” and then a long tail that basically doesn’t matter.
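A check along those lines is easy to reproduce: a true Zipf distribution is linear in log-prob vs. log-rank with slope near -1, so you can fit that slope to a predictive distribution and see how far off it is. (This is my own sketch of such a check, not the Mirostat authors’ code.)

```python
import numpy as np

def zipf_slope(probs):
    """Least-squares slope of log(prob) against log(rank).
    A Zipfian distribution gives slope ~ -1; a sharply peaked
    predictive distribution gives a much steeper fit."""
    p = np.sort(np.asarray(probs, dtype=float))[::-1]
    p = p[p > 0]  # drop zero-mass tokens before taking logs
    ranks = np.arange(1, len(p) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(p), 1)
    return slope
```

Run this on a distribution where the model puts ~97% of the mass on one token and the fitted slope is nowhere near -1.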

The Mirostat paper needlessly truncates a sum. When I replace this with the full sum, the results are drastically different.

—–

I was not comfortable with Mirostat, but still wanted to avoid “degeneration.”

So, I thought about the problem a bit, and came up with a new method, which I called “Breakruns.”

It is based on the following ideas:

(1)

There are two kinds of “degeneration”: the repetition trap, and incoherence. The degeneration and Mirostat papers treat these as sort of symmetrical.

However, they’re very different:

Incoherence is basically 100% solved by using top-p. More generally, “incoherence” just feels like what happens when you make too many choices that the model knows are almost certainly wrong; it feels fundamentally avoidable if you just “trust the model enough.”

In other words, the LM knows there’s something wrong with incoherent text, and it will tell you this. That’s just what an LM is, more or less.

The repetition trap, though, is a mistake the model thinks is correct. That’s a much tougher puzzle since the model’s opinions are all you have to go on. (Indeed, the model is arguably not even wrong about this issue – just right in an undesired manner.)

So, everything would go wonderfully if we could just “trust the model” enough, by using conservative sampling parameters like low T and/or low top-p.

The problem with this is supposedly that it produces “overly conservative” text, but IME that isn’t quite right. “Conservative” text from an LM tends to be good text … right up until the point where it becomes unnaturally repetitive.

If we could just solve the repetition trap on its own, everything else might fall into place.
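“Using top-p” above means the standard nucleus-sampling filter: keep the smallest set of tokens whose probabilities sum to at least p, renormalize, and drop the tail. A minimal version:

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Zero out the long tail, keeping the smallest top set of tokens
    whose cumulative probability reaches p, then renormalize."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cum, p)) + 1]  # always keeps >= 1 token
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()
```

The model’s own probabilities decide where the cutoff falls, which is exactly the “trust the model” move: tokens the model considers almost certainly wrong can never be sampled.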

(2)

The repetition trap is fundamentally about the model’s top-1 token.

If we’re in the trap, the sampler is always selecting its top-1 token, and will continue to do so.

Conversely, if we keep selecting the top-1 token for a long time, we might not be in the trap – but even so, trying something other than choice #1, at some point, probably won’t hurt.

This is hard to think about at first, if you’re used to viewing discrete distributions as “merely” approximations to continuous ones. (Probably it can be made into a limit statement? but that’s not relevant for my purposes anyway)

—–

Here’s what Breakruns does.

You use conservative sampling, with low T and top-p. Not absurdly low, but lower than you would normally go.

You keep a running counter. Every time you pick the top-1 token, you increment the counter by 1. Every time you don’t pick the top-1 token, the counter resets to 0.

The counter is the length of the current “run” – an unbroken string of top-1s.

You don’t want to let the runs get too long. So, the longer the run gets, the more you crank up the temperature.

Specifically, if T is your “base” temperature, you actually sample with temperature T + (tau * counter).  You set tau to be 0.01 or 0.02 or something like that, it’s a tunable parameter.

As a run gets longer and longer, the temperature eventually reaches 1.0, then gets even higher. Eventually it’s so high that even the repetition trap can’t overcome it. (That claim is not self-evident, but true in practice, and makes sense when you think about it, I think.)

The moment you sample anything but the top-1 token, we’re now sure we are no longer in the repetition trap. The counter resets to 0 and the temperature immediately snaps back to our nice, conservative base value.
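Put together, a minimal sketch of the method (my own paraphrase, not the bot’s production code; `logits_fn` stands in for the language model, and I’ve left out the top-p step for brevity):

```python
import numpy as np

def breakruns_sample(logits_fn, prompt_ids, n_tokens,
                     base_temp=0.7, tau=0.02, rng=None):
    rng = rng or np.random.default_rng()
    ids = list(prompt_ids)
    run_len = 0  # consecutive top-1 picks so far
    for _ in range(n_tokens):
        logits = np.asarray(logits_fn(ids), dtype=float)
        temp = base_temp + tau * run_len   # T + (tau * counter)
        z = logits / temp
        probs = np.exp(z - z.max())
        probs /= probs.sum()
        tok = int(rng.choice(len(probs), p=probs))
        # extend the run on a top-1 pick; otherwise snap back to base T
        run_len = run_len + 1 if tok == int(np.argmax(logits)) else 0
        ids.append(tok)
    return ids
```

The counter reset is what makes the temperature “snap back” the moment anything other than the top-1 token is sampled.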

—–

I’ve used this for a while now in @nostalgebraist-autoresponder.

Qualitatively, it doesn’t seem obviously better or worse than I got with Mirostat.

However, it’s much simpler, with a motivation I actually believe in, which helps me sleep at night.

the-moti:

nostalgebraist:

Something I made today: visualizing (one measure of) what different GPT-2 sizes know about the ordering of U.S. presidents.

The model is trying to predict the first token of each president’s name, given an ordered list of presidents up until that point.  This is generally the first name, although for Ulysses S. Grant it’s just “ U”.

So, the model has more context when predicting later presidents on the list, although it’s not necessarily very helpful context, just reinforcement of the fact that we’re listing the presidents in chronological order.

Top pane is probability of the true token.  Bottom pane is rank, lower is better.  Left to right is model size.
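Both panes come from one simple computation per president: softmax the next-token logits at the position just before the name, then read off the true token’s probability and its rank. (This is my reconstruction of the measurement for illustration, not the original plotting code.)

```python
import numpy as np

def prob_and_rank(logits, true_id):
    """Probability and rank of the correct next token.
    Rank 0 means the model's top choice was correct (lower is better)."""
    logits = np.asarray(logits, dtype=float)
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    rank = int((probs > probs[true_id]).sum())
    return float(probs[true_id]), rank
```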

These pictures are from one particular variant of the prompt where I also included the years of the president’s term alongside their name.  This context helped the larger models a bit.

I excluded Grover Cleveland from this plot because his being president twice was causing problems with my plotting code, and I didn’t care enough to solve them.

Inspired by Appendix D of this paper.

Cool chart!

Assigning unusually low probability to Abraham Lincoln seems like the reverse of the human behavior. I don’t, obviously, have data, but surely people are going to have an easier time guessing Abraham Lincoln in the correct place than almost any other 19th-century president, simply because they remember his name while most of the rest are so forgettable.

Can you sample the most likely possible continuations for the (if I am reading the chart right) 7 tokens with higher probability than Abraham to see “who gpt2 thinks the President after James Buchanan was”? Or just the single highest-probability continuation?

I was wondering too!

For the biggest model, the leading contenders are mostly immediate successors of Lincoln, with Ulysses S. Grant (as “ U”) well ahead of the rest:

image

As I mentioned, I tried two versions of the task: one with term years listed after the president’s name, one with just the names.

In the version without terms, the results are broadly similar, although “ U” is much further down:

image

Since I’m talking about the impact of including terms, here’s a fun trend I noticed which lines up with the lesson of the GPT-3 paper, that larger models benefit more from added contextual cues:

image

The lines plot the average probability of the right answer (averaged over all the individual presidents), by model.  (There are only 4 points plotted for each line, the lines just connect them.)  The bands are 95% CI for the mean.