I hope to write a post soon explaining the observation about GPT-n which I illustrate in this Colab notebook, but I don’t have the energy, so for now, just have this mysterious link with pretty pictures at the end ;)
(Roughly: GPT’s internal layer outputs are interpretable as logits over its vocab, just like the final output is, and through that lens, most of the layers seem to be gradually refining an initial guess, rather than doing some inscrutable internal processing that only gets turned into a prediction at the “last step”)
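The core trick can be sketched in a few lines: take each layer’s hidden state and push it through the model’s final unembedding matrix to get a vocab distribution per layer. The sketch below uses toy random weights rather than real GPT activations (`W_U` and the hidden states here are made-up stand-ins, just to show the shape of the computation):

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def logit_lens(hidden_states, W_U):
    """Project each layer's hidden state through the unembedding
    matrix W_U, yielding one vocab distribution per layer."""
    return [softmax(h @ W_U) for h in hidden_states]

# Toy stand-ins: 3 "layers", hidden dim 4, vocab size 5.
rng = np.random.default_rng(0)
W_U = rng.normal(size=(4, 5))          # pretend unembedding matrix
hiddens = [rng.normal(size=4) for _ in range(3)]  # pretend layer outputs

for i, probs in enumerate(logit_lens(hiddens, W_U)):
    print(f"layer {i}: top token id = {probs.argmax()}")
```

In a real GPT, you would also apply the model’s final layer norm before the unembedding; watching how the top token (and the full distribution) changes layer by layer is what shows the “progressive refinement” described above.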
EDIT: There is now a better version of the notebook, with improved code and added explanatory text.
The promised post is up: “interpreting GPT: the logit lens”

