Yesterday I found the paper “Mirostat: A Neural Text Decoding Algorithm That Directly Controls Perplexity” and it looked like a neat idea, so I implemented it for @nostalgebraist-autoresponder.
It’s been running since around noon today. I don’t expect drastic differences in quality, but I do hope it will help avoid the repetition traps that happen frequently in longer posts. (I already had a word-counting hack in place that tried to catch repetition traps, but it wasn’t very good.)
The specific algorithm from the paper is kind of complicated, but the basic idea is to set a target for the average perplexity / “surprisingness” of the entire text. When the text written so far is above the target, the sampling becomes more conservative. When it’s below the target, the sampling becomes less conservative. Like a thermostat, AC, or any other control system.
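To make the thermostat analogy concrete, here’s a minimal sketch of the feedback idea — not the paper’s exact algorithm. It keeps a running surprise cutoff `mu`, rules out tokens more surprising than the cutoff, and after each sampled token nudges `mu` toward the target. All names and the learning-rate scheme here are mine, not the paper’s.

```python
import numpy as np

def sample_with_feedback(logprob_fn, n_tokens, target_surprise=3.0, lr=0.1, seed=0):
    """Toy feedback-controlled sampler.

    logprob_fn(context) -> 1-D array of log-probabilities over the vocab.
    """
    rng = np.random.default_rng(seed)
    mu = 2.0 * target_surprise           # current cutoff on per-token surprise
    context = []
    for _ in range(n_tokens):
        logprobs = logprob_fn(context)
        surprise = -logprobs             # -log p, in nats
        allowed = surprise <= mu         # low mu -> conservative sampling
        if not allowed.any():            # never rule out every token
            allowed[np.argmin(surprise)] = True
        p = np.where(allowed, np.exp(logprobs), 0.0)
        p /= p.sum()
        tok = rng.choice(len(p), p=p)
        # thermostat step: surprise above target lowers mu, below raises it
        mu -= lr * (surprise[tok] - target_surprise)
        context.append(tok)
    return context
```

With a real model, `logprob_fn` would be a forward pass over the context; everything else is a few vector ops per token.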
I really like this idea – unlike other approaches (temperature, top-k, top-p), it actually notices repetitive or incoherent text when it occurs and tries to “escape from the hole,” rather than just trying really hard not to fall into a hole in the first place, and then saying “that’s life” when it happens anyway.
The specifics of Mirostat feel weird to me, and I suspect a much simpler version of this idea would do just as well.
The authors of the paper seem confused (??) about what is computationally costly and what isn’t: at one point they truncate a sum from ~50K terms to 100 for speed, when the whole sum is just one matrix multiplication per token and its cost is infinitesimal compared to running GPT-2. Likewise, I suspect the simpler “alternate algorithm” they discuss in Section 5b is actually the right way to go – they reject it as being too slow, but the “slow” step is one you also have to do in top-p, so it should be fine.
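I can’t say for certain which sum the authors truncate, but any per-token sum over the vocabulary – say, the expected surprise – is a single vectorized reduction over ~50K numbers, which is nothing next to a GPT-2 forward pass:

```python
import numpy as np

def expected_surprise(logprobs):
    """Expected surprise sum_i p_i * (-log p_i), computed over the FULL vocab.

    One pass over ~50K floats per token; no truncation needed for speed.
    """
    p = np.exp(logprobs)
    return float(-(p * logprobs).sum())
```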
(The paper strikes me as being the work of people more used to math than programming, and the math parts about the perplexity implications of temperature, top-p, and top-k are cool.)

