Something I made today: visualizing (one measure of) what different GPT-2 sizes know about the ordering of U.S. presidents.
The model is trying to predict the first token of each president’s name, given an ordered list of presidents up to that point. This is generally the first name, although for Ulysses S. Grant it’s just “ U”.
So the model has more context when predicting later presidents on the list, although that context isn’t necessarily very helpful; it mostly just reinforces the fact that we’re listing the presidents in chronological order.
Top pane shows the probability of the true token. Bottom pane shows its rank (lower is better). Left to right: increasing model size.
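For anyone curious how the two panes are computed: once you have the model’s logits at the position where the president’s name should appear (e.g. from Hugging Face’s GPT-2 via `transformers`), the probability and rank of the true token are a few lines of array math. A minimal sketch, with a toy logits vector standing in for a real model’s output (the function name and toy values are my own, not from the original post):

```python
import numpy as np

def prob_and_rank(logits, true_token_id):
    """Softmax probability of the true token, and its rank (0 = best)."""
    z = logits - logits.max()              # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    # rank = number of vocabulary entries scored strictly higher than the true token
    rank = int((logits > logits[true_token_id]).sum())
    return float(probs[true_token_id]), rank

# Toy "vocabulary" of 4 tokens; the true token (id 2) has the 2nd-highest logit.
logits = np.array([1.0, 3.0, 2.5, -1.0])
p, r = prob_and_rank(logits, true_token_id=2)
# r == 1 here: exactly one token outscores the true one.
```

In the real experiment you’d take `logits` from the final position of the prompt so far, and `true_token_id` from tokenizing the next president’s name.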
These pictures are from one particular variant of the prompt where I also included the years of the president’s term alongside their name. This context helped the larger models a bit.
I excluded Grover Cleveland from this plot because his serving two non-consecutive terms was causing problems with my plotting code, and I didn’t care enough to solve them.
Inspired by Appendix D of this paper.

