Something I made today: visualizing (one measure of) what different GPT-2 sizes know about the ordering of U.S. presidents.
The model is trying to predict the first token of each president’s name, given an ordered list of presidents up until that point. This is generally the first name, although for Ulysses S. Grant it’s just “ U”.
So, the model has more context when predicting later presidents on the list, although it’s not necessarily very helpful context, just reinforcement of the fact that we’re listing the presidents in chronological order.
Top pane is probability of the true token. Bottom pane is rank, lower is better. Left to right is model size.
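For anyone curious how "probability of the true token" and "rank" fall out of a language model's output: given the logits over the vocabulary at the prediction position, it's just a softmax plus a count of tokens scored above the true one. A minimal sketch with a toy 5-token vocabulary (the real setup uses GPT-2's ~50k-token logits):

```python
import math

def prob_and_rank(logits, true_id):
    """Softmax probability of the true token, and its rank (1 = most likely)."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # rank = 1 + number of tokens the model scored strictly higher
    rank = 1 + sum(1 for q in probs if q > probs[true_id])
    return probs[true_id], rank

# Toy logits standing in for the model's output at one position
logits = [2.0, 0.5, 3.0, -1.0, 1.0]
p, r = prob_and_rank(logits, true_id=0)
# token 0 is outscored only by token 2, so its rank is 2
```

"Lower is better" for the bottom pane because rank 1 means the true first-name token was the model's single most likely continuation.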
These pictures are from one particular variant of the prompt where I also included the years of each president’s term alongside their name. This context helped the larger models a bit.
I excluded Grover Cleveland from this plot because his two non-consecutive terms were causing problems with my plotting code, and I didn’t care enough to solve them.
Inspired by Appendix D of this paper.
Cool chart!
Assigning unusually low probability to Abraham Lincoln seems like the reverse of the human behavior. I don’t have data, obviously, but surely people would have an easier time guessing Abraham Lincoln in the correct place than almost any other 19th-century president, simply because they remember his name while most of the rest are so forgettable.
Can you sample the most likely possible continuations for the (if I am reading the chart right) 7 tokens with higher probability than Abraham to see “who gpt2 thinks the President after James Buchanan was”? Or just the single highest-probability continuation?
I was wondering too!
For biggest model, the leading contenders are mostly immediate successors of Lincoln, with Ulysses S. Grant (as “ U”) well ahead of the rest:
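As a toy illustration of that readout (with made-up probabilities, not the model's actual numbers): once you have the next-token distribution at the slot after Buchanan, "who does GPT-2 think came next" is just a sort over (token, probability) pairs:

```python
# Hypothetical next-token probabilities at the position after James Buchanan.
# The numbers are invented for illustration; real values come from GPT-2's
# softmax over its full vocabulary.
probs = {
    " U": 0.31,        # first token of "Ulysses S. Grant"
    " Andrew": 0.12,   # Andrew Johnson
    " Abraham": 0.09,  # the true answer, Abraham Lincoln
    " James": 0.07,
    " John": 0.05,
}

# Top candidates, highest probability first
top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:3]
```

In practice you'd decode each candidate token ID back to a string with the tokenizer, which is where leading-space tokens like " U" come from.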

As I mentioned, I tried two versions of the task: one with term years listed after the president’s name, one with just the names.
In the version without terms, the results are broadly similar, although “ U” is much further down:

Since I’m talking about the impact of including terms, here’s a fun trend I noticed which lines up with the lesson of the GPT-3 paper, that larger models benefit more from added contextual cues:

The lines plot the average probability of the right answer (averaged over all the individual presidents) for each model size. (There are only 4 points plotted per line; the lines just connect them.) The bands are 95% CIs for the mean.
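A sketch of the kind of computation behind those bands, assuming a normal-approximation CI on the mean (z = 1.96 for 95%); the input numbers here are illustrative, not the actual per-president probabilities:

```python
import math

def mean_ci(xs, z=1.96):
    """Mean and half-width of a normal-approximation 95% CI."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)  # sample variance
    half = z * math.sqrt(var / n)                     # z * standard error
    return mean, half

# Illustrative per-president true-token probabilities for one
# (model size, prompt variant) combination
probs = [0.05, 0.20, 0.12, 0.30, 0.08]
m, h = mean_ci(probs)  # plot m with a band from m - h to m + h
```

With only ~40 presidents per point, the bands are wide, which is why the trend is easier to see for the larger models where the gap between prompt variants exceeds the band width.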
