breakruns
I wrote earlier about “Mirostat,” an approach to sampling from language models that tries to avoid the infamous phenomenon of “neural text degeneration.”
In fact, I used Mirostat in practice for a long while in @nostalgebraist-autoresponder. However, some things about it bothered me, e.g.:
⭑ It feels overly complicated.
⭑ It’s based on the assumption that the LM’s predictive distribution for an individual token is approximately a Zipf distribution.
We expect statistics of entire corpora to be Zipfian, but I don’t see why that implies anything about predictive distributions on the token level.
Indeed, this assumption is not at all true (I checked)! One way to do such a check is sketched just after this list. For instance, the model is often very confident, and puts almost all of the mass on one or a few top tokens. Even when it is not very confident, what you see is ~100% of the mass spread across the “reasonable possibilities,” followed by a long tail that basically doesn’t matter.
⭑ The Mirostat paper needlessly truncates a sum. When I replace this with the full sum, the results are drastically different.
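As promised above, here’s a minimal sketch of one way to do the distribution-shape check. This is not my original code; the model (GPT-2 via Hugging Face transformers) and the prompt are arbitrary choices for illustration.

```python
# Minimal sketch of one way to inspect the shape of a next-token distribution.
# Model choice and prompt are arbitrary, purely for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(input_ids).logits[0, -1]  # logits for the next token

probs = torch.softmax(logits, dim=-1)
sorted_probs, _ = torch.sort(probs, descending=True)

print("top-1 probability:     ", sorted_probs[0].item())
print("mass in top 10 tokens: ", sorted_probs[:10].sum().item())
# Under a Zipf law with exponent ~1, rank * probability would be roughly
# constant across ranks; for a confident prediction the head is much
# heavier than that.
print("rank * prob, ranks 1-10:",
      [round((i + 1) * sorted_probs[i].item(), 4) for i in range(10)])
```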
—–
I was not comfortable with Mirostat, but still wanted to avoid “degeneration.”
So, I thought about the problem a bit, and came up with a new method, which I called “Breakruns.”
It is based on the following ideas:
(1)
There are two kinds of “degeneration”: the repetition trap, and incoherence. The degeneration and Mirostat papers treat these as sort of symmetrical.
However, they’re very different:
Incoherence is basically 100% solved by using top-p. More generally, “incoherence” just feels like what happens when you make too many choices that the model knows are almost certainly wrong; it feels fundamentally avoidable if you just “trust the model enough.”
In other words, the LM knows there’s something wrong with incoherent text, and it will tell you this. That’s just what an LM is, more or less.
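(In case it’s unfamiliar: top-p, a.k.a. nucleus sampling, keeps only the smallest set of highest-probability tokens whose total probability reaches p, renormalizes, and samples from those. A minimal sketch of the idea, not any particular library’s implementation:)

```python
# Minimal sketch of top-p (nucleus) filtering: keep the smallest set of
# highest-probability tokens whose total probability reaches top_p,
# renormalize, and sample from that set.
import torch

def top_p_sample(logits: torch.Tensor, top_p: float = 0.9) -> int:
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep every token up to and including the first one that pushes the
    # cumulative probability past top_p.
    cutoff = int((cumulative < top_p).sum().item()) + 1
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = int(torch.multinomial(kept, num_samples=1).item())
    return int(sorted_ids[choice].item())
```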
The repetition trap, though, is a mistake the model thinks is correct. That’s a much tougher puzzle since the model’s opinions are all you have to go on. (Indeed, the model is arguably not even wrong about this issue – just right in an undesired manner.)
So, everything would go wonderfully if we could just “trust the model” enough, by using conservative sampling parameters like low T and/or low top-p.
The problem with this is supposedly that it produces “overly conservative” text, but IME that isn’t quite right. “Conservative” text from an LM tends to be good text … right up until the point where it becomes unnaturally repetitive.
If we could just solve the repetition trap on its own, everything else might fall into place.
(2)
The repetition trap is fundamentally about the model’s top-1 token.
If we’re in the trap, the sampler is always selecting its top-1 token, and will continue to do so.
Conversely, if we keep selecting the top-1 token for a long time, we might not be in the trap … but even if we aren’t, trying something other than choice #1 at some point probably won’t hurt.
This is hard to think about at first, if you’re used to viewing discrete distributions as “merely” approximations to continuous ones. (Probably it can be made into a limit statement? but that’s not relevant for my purposes anyway)
—–
Here’s what Breakruns does.
You use conservative sampling, with low T and top-p. Not absurdly low, but lower than you would normally go.
You keep a running counter. Every time you pick the top-1 token, you increment the counter by 1. Every time you don’t pick the top-1 token, the counter resets to 0.
The counter is the length of the current “run” – an unbroken string of top-1s.
You don’t want to let the runs get too long. So, the longer the run gets, the more you crank up the temperature.
Specifically, if T is your “base” temperature, you actually sample with temperature T + (tau * counter). You set tau to 0.01 or 0.02 or something like that; it’s a tunable parameter.
As a run gets longer and longer, the temperature eventually reaches 1.0, then gets even higher. Eventually it’s so high that even the repetition trap can’t overcome it. (That claim is not self-evident, but true in practice, and makes sense when you think about it, I think.)
The moment you sample anything but the top-1 token, you know you’re no longer in the repetition trap. The counter resets to 0 and the temperature immediately snaps back to your nice, conservative base value.
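Here’s a minimal sketch of the whole loop, reusing the top_p_sample helper from the sketch above. The parameter values and the order of operations (temperature boost applied before top-p filtering) are illustrative assumptions, not necessarily what runs in the actual bot.

```python
# Sketch of the Breakruns loop described above, reusing the top_p_sample
# helper from the earlier sketch. Parameter values and the order of
# operations (temperature boost before top-p filtering) are illustrative
# assumptions.
import torch

def breakruns_generate(model, input_ids, max_new_tokens=200,
                       base_temp=0.8, tau=0.015, top_p=0.9):
    counter = 0  # length of the current unbroken run of top-1 picks
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(input_ids).logits[0, -1]
        top1_id = int(torch.argmax(logits).item())

        # Effective temperature grows linearly with the length of the run.
        temp = base_temp + tau * counter
        next_id = top_p_sample(logits / temp, top_p=top_p)

        # A top-1 pick extends the run; anything else breaks it, snapping
        # the temperature back to base_temp on the next step.
        counter = counter + 1 if next_id == top1_id else 0

        input_ids = torch.cat([input_ids, torch.tensor([[next_id]])], dim=-1)
    return input_ids
```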
—–
I’ve used this for a while now in @nostalgebraist-autoresponder.
Qualitatively, the results don’t seem obviously better or worse than what I got with Mirostat.
However, it’s much simpler, with a motivation I actually believe in, which helps me sleep at night.
