I deployed a new version of Frank’s generator model today.
This one is still GPT-J, and is generally similar to the previous one.
However, I’ve worked out a lot of the kinks in my GPT-J fine-tuning code/process since I did it the first time.
For example, due to a bug, the first model did not follow the intended learning rate schedule, and its learning rate ended up much lower overall than I wanted.
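For intuition about what "the intended schedule" means here: fine-tuning runs typically use a warmup-then-decay learning rate curve, and a bug like this leaves the realized LR sitting well below that curve. A minimal sketch of a linear-warmup + cosine-decay schedule (the shape, peak value, and step counts are all hypothetical illustration, not Frank's actual config):

```python
import math

def lr_at_step(step, total_steps, peak_lr=1e-5, warmup_steps=100):
    """Hypothetical warmup + cosine-decay schedule, for illustration only."""
    if step < warmup_steps:
        # linear warmup from 0 up to peak_lr
        return peak_lr * step / warmup_steps
    # cosine decay from peak_lr down to ~0 over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))
```

If the schedule is silently misconfigured (say, the peak is off by an order of magnitude), training still "works", just worse, which is what makes this kind of bug easy to miss.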
Click here to see a report with much more info, including loss plots over the course of training.
----
I don’t know whether this model is qualitatively much different from the last one; its output feels broadly similar to me.
However, it does achieve much better validation loss than the previous one: 1.73 vs the old 1.91. That’s similar in size to the gains I got from the original move to GPT-J.
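For a rough sense of scale, per-token cross-entropy loss can be converted to perplexity, and the drop from 1.91 to 1.73 works out to about a 16% reduction in perplexity (assuming these are natural-log losses, the usual convention):

```python
import math

# Validation losses from the post (assumed to be nats/token)
old_loss, new_loss = 1.91, 1.73

old_ppl = math.exp(old_loss)
new_ppl = math.exp(new_loss)
reduction = 1 - new_ppl / old_ppl  # equivalently, 1 - exp(new_loss - old_loss)

print(f"perplexity: {old_ppl:.2f} -> {new_ppl:.2f} ({reduction:.1%} lower)")
# → perplexity: 6.75 -> 5.64 (16.5% lower)
```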
That said, I’m not entirely sure how well gains on my validation data translate into qualitative improvements: there’s a tradeoff between achieving low loss on my tumblr data and retaining performance on the much more general pre-training distribution, or on other generic capabilities.
To quantify the tradeoff, it’d be cool to check how fine-tuning affects the few-shot benchmarks using EleutherAI’s eval harness. The codebase is set up to do this during pre-training, but it will take some work to do the same thing during fine-tuning.
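In the meantime, the harness can also be run standalone against a saved checkpoint. A sketch of what such an invocation might look like, built as a command in Python; the checkpoint path and task names are placeholders, and the exact flags vary between versions of lm-evaluation-harness, so check the docs for your installed version:

```python
import subprocess

# Hypothetical lm-evaluation-harness invocation against a fine-tuned
# checkpoint. Path and task list are placeholders, not Frank's actual setup.
cmd = [
    "lm_eval",
    "--model", "hf",
    "--model_args", "pretrained=/path/to/finetuned-gpt-j",
    "--tasks", "lambada_openai,hellaswag",
    "--batch_size", "8",
]
print(" ".join(cmd))
# In practice, you'd run it with: subprocess.run(cmd, check=True)
```

Comparing these few-shot scores before and after fine-tuning would put a number on how much generic capability the tumblr specialization costs.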



