Today’s misread: “Free Delivery | Use Code Deicide” (for “Free Delivery | Use Code Decide”)

Finally did finish Modern Cannibals. By the end my overall feelings about it were much less uncomplicatedly positive than before, though I was no less impressed with the sheer genuine-article capital-A fucking Artistry of the thing, like holy shit, what a performance
Below are some vague purple lowercase ramblings on this topic, which I wrote on Discord and am copy/pasting here. Technically no plot spoilers there, but I recommend behaving as though there were
The name noto is to convey the idea that Google’s goal is to see “no more tofu”.
oh you like infohazards? name three
@transhumanesque ily
that’s only two but “I love you” is a particularly powerful one so I’ll let it slide
GPT-2’s tokenizer is … kinda weird.
Sure, it’s defined in a perfectly clear and relatively elegant way: it’s a byte-pair encoding on UTF-8 bytes. Unlike many NLP tokenizers, it doesn’t have special custom handling for any particular feature of text, like uppercase/lowercase, whitespace, or common English morphology. It takes a completely generic approach that would work for anything, not just English text or even text.
But, this genericness and simplicity comes at a price: its behavior when applied specifically to English text – which is its only intended application – can leave something to be desired. After all, there’s a reason why people usually do all those text-specific or language-specific customizations.
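Since the scheme is just byte-pair encoding, a minimal sketch may help make the rest concrete. This is not GPT-2’s actual implementation: the toy merge list below is invented for illustration, whereas GPT-2 ships a learned list of ~50,000 merges over UTF-8 bytes. The mechanism, though, is the same: start from individual symbols and repeatedly apply the highest-priority merge present.

```python
# Minimal BPE encoding sketch. `merges` is a priority-ordered list of pairs;
# GPT-2's real (learned) list has ~50k entries over UTF-8 bytes. These
# particular merges are made up for illustration.
def bpe_encode(text, merges):
    parts = list(text)  # start from individual symbols (bytes, for GPT-2)
    while len(parts) > 1:
        pairs = set(zip(parts, parts[1:]))
        # find merge rules applicable somewhere in `parts`, best-ranked first
        ranked = [m for m in merges if m in pairs]
        if not ranked:
            break
        a, b = ranked[0]
        out, i = [], 0
        while i < len(parts):  # merge every occurrence of the chosen pair
            if i + 1 < len(parts) and (parts[i], parts[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(parts[i])
                i += 1
        parts = out
    return parts

toy_merges = [("h", "e"), ("l", "l"), ("he", "ll"), ("hell", "o")]
bpe_encode("hello", toy_merges)  # a frequent string collapses to one token
bpe_encode("help!", toy_merges)  # rarer strings stay in smaller pieces
```

Note that nothing here knows about words, case, or whitespace: frequent byte sequences get big tokens, everything else gets fragments.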
To better understand the tokenizer, I made a little script that lets me type in text and then shows me the resulting tokens. The concrete examples below come from this script. I have it print out each individual token in both a readable text form and as its index in the vocabulary, like
(' hello', 23748)
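A minimal version of such a script is easy to write; I’m assuming the Hugging Face `transformers` package here (its `GPT2TokenizerFast` class provides `encode`/`decode`), but any GPT-2 tokenizer with those two methods would do:

```python
def show_tokens(tok, text):
    """Return (readable text, vocab index) pairs, one per token."""
    return [(tok.decode([i]), i) for i in tok.encode(text)]

# Usage with the Hugging Face `transformers` package (downloads GPT-2's
# vocab/merges files on first use):
#
#   from transformers import GPT2TokenizerFast
#   tok = GPT2TokenizerFast.from_pretrained("gpt2")
#   while True:
#       print(show_tokens(tok, input("text> ")))
```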
Anyway. How is GPT-2’s tokenizer weird? Let me count the ways:
1: No special handling for text that differs only in the case (upper vs. lower) of the letters
This means that the model won’t automatically generalize what it knows about the word “hello” to the version “Hello” that occurs at the start of a sentence, or the version “HELLO” that occurs in text that’s all-caps for whatever reason (titles, yelling…)
Thus, GPT-2’s vocabulary contains the English language (or a large subset of it) not once but in several copies: there’s the lowercase version of each word, the capitalized version, the uppercase version, possibly even the GaMzEe-CaSeD version or other rarer variants.
From the model’s perspective, these are totally different universes, disjoint subsets of the vocab that follow their own internal rules.
For example, choosing the first word of a sentence in normally-formatted text is not just choosing a word like any other: it’s choosing a Capitalized Word™, and Capitalized Words™ are their own universe. Insofar as the model understands that the word “Insofar” with which I began this sentence means the exact same thing as the word “insofar” I just used inside it, it understands this by figuring out that these two “seemingly unrelated” things are “secretly” the same. And it must do that for every single word, separately.
INSOFAR as the model understands the first word of this sentence, capitalized as the first sentences of chapters sometimes are … well, yeah.
(I suspect this is why GPT-2 writes so incoherently whenever it decides to write in all caps. All-caps text is relatively rare, and to the model it’s a whole different language it’s had to pick up from a few scattered examples.)
While it seems clearly suboptimal, this one isn’t that weird – for better or for worse people commonly do this in NLP. On the other hand…
2: Spaces glom onto the words after them
BPE tries to be efficient, so it doesn’t waste token slots on spaces if it doesn’t have to. A word is almost always preceded by a space, so instead of representing “ Example text” as four tokens (space, “Example,” space, “text”), it represents it as two:
[(' Example', 17934), (' text', 2420)]
So far, seems innocuous, right? But what if you’re feeding a prompt into GPT-2? Unless you’re hip to this particular issue, you’ll probably type in something like
“Example text”
which becomes
[('Example', 16281), (' text', 2420)]
Compare this to the one above. Yes – instead of token #17934, with the preceding space, I’ve unwittingly fed in token #16281, without a preceding space.
Previously, we saw there was a “separate copy of English” for each capitalization style. But really, each of those copies is not one but two: the version with preceding space and the one without. And unless you type a space before your prompt, your first word will be in the “no preceding space” language.
The “capitalized word with no preceding space” language is an interesting case. Where does it appear, besides user prompts? IIUC, the most common cause for it is newlines. “\nExample text” tokenizes to
[('\n', 198), ('Example', 16281), (' text', 2420)]
So your prompt, if it lacks an initial space, looks like the start of a paragraph. Well… that seems fine, actually
But putting prompts aside, consider what this means: not only does the model learn “words at the starts of sentences” as a separate language, it learns “words at the starts of paragraphs” as another separate language! (Perhaps this has something to do with the tendency of samples to veer off topic around paragraph breaks? IDK. Might also be interesting in connection w/ Gwern’s poetry project.)
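All three tokenizations above can be reproduced with a toy tokenizer over just the four tokens quoted in this post. (Real GPT-2 BPE works by merge rules rather than greedy longest-prefix matching, so this is only a stand-in, but on these inputs it produces the same splits; the ids are the real ones quoted above.)

```python
# Tiny vocabulary, using the GPT-2 token ids quoted above.
VOCAB = {" Example": 17934, "Example": 16281, " text": 2420, "\n": 198}

def toy_tokenize(text, vocab=VOCAB):
    """Greedy longest-prefix-match tokenizer (a stand-in for real BPE)."""
    tokens = []
    while text:
        # take the longest vocab entry that prefixes the remaining text
        match = max((t for t in vocab if text.startswith(t)), key=len)
        tokens.append((match, vocab[match]))
        text = text[len(match):]
    return tokens

toy_tokenize(" Example text")   # [(' Example', 17934), (' text', 2420)]
toy_tokenize("Example text")    # [('Example', 16281), (' text', 2420)]
toy_tokenize("\nExample text")  # [('\n', 198), ('Example', 16281), (' text', 2420)]
```

The point of the exercise: whether your first word lands in the with-space or without-space universe is decided purely by prefix matching, before the model ever sees anything.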
3: Words split differently depending on surrounding whitespace
The thing I just said isn’t exactly true. It’s worse than that.
So far I’ve talked like one token = one word. Frequently, that’s true, but BPE can split words in the middle too. This automatically handles some morphology stuff, e.g. “ Rob’s” becomes
[(' Rob', 3851), ("'s", 338)]
and can break down unfamiliar words/“words” into atomic or “molecular” components, e.g. the keysmash “hgsdfahsf” becomes
[('h', 71), ('gs', 14542), ('df', 7568), ('ah', 993), ('sf', 28202)]
But this interacts weirdly, if predictably, with the “separate languages” thing belabored above. There’s no constraint making any one of our copies of English split up in the same way as the other ones, and generally they don’t.
For example, how many tokens is “farther”? Well, if you’re in the “lowercase with preceding space” universe, it’s one:
[(' farther', 18485)]
If you’re in the caps-and-preceding-space universe (sentence start within paragraph), though:
[(' F', 376), ('art', 433), ('her', 372)]
Likewise in the caps-without-preceding-space universe (paragraph start), except of course we begin with “F” rather than the totally distinct “ F”:
[('F', 37), ('art', 433), ('her', 372)]
And if for some weird reason you’re lowercase but don’t have a preceding space:
[('f', 69), ('art', 433), ('her', 372)]
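The same sort of toy longest-prefix matcher, over just the six token ids quoted in this “farther” example, makes all four universes fall out mechanically. (Again, real GPT-2 BPE uses merge rules rather than longest-prefix matching; the splits just happen to agree on these inputs.)

```python
# The six GPT-2 token ids quoted above for the "farther" example.
VOCAB = {" farther": 18485, " F": 376, "F": 37, "f": 69, "art": 433, "her": 372}

def toy_tokenize(text, vocab=VOCAB):
    """Greedy longest-prefix-match tokenizer (a stand-in for real BPE)."""
    tokens = []
    while text:
        # take the longest vocab entry that prefixes the remaining text
        match = max((t for t in vocab if text.startswith(t)), key=len)
        tokens.append((match, vocab[match]))
        text = text[len(match):]
    return tokens

toy_tokenize(" farther")  # one token: [(' farther', 18485)]
toy_tokenize(" Farther")  # [(' F', 376), ('art', 433), ('her', 372)]
toy_tokenize("Farther")   # [('F', 37), ('art', 433), ('her', 372)]
toy_tokenize("farther")   # [('f', 69), ('art', 433), ('her', 372)]
```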
From the perspective of text generation, this is super weird. Remember, we generate one token at a time. So when we start a sentence with the word “farther,” we don’t just pick out that word, the way we would inside a sentence. Instead we decide:
- “ F”. N.B. this is not the same as deciding “the sentence begins with the letter F”, as it rules out things like (' For', 1114), which is common enough to be a single token
- “art”
- “her” (as opposed to being a sentence about farting, or something – until this choice, we could have been writing that sentence!)

I do wonder how much better GPT-2 text generation could be, and how much less gigantic the model could be, if this stuff were a little friendlier.
One should take this informal argument with a grain of salt; it turns out that after one has made an infinite number of choices, the infinite number of disenfranchised groups, while individually having no further power to influence elections, can begin having some collective power, basically because property 2 of a filter only guarantees closure under finite intersections and not infinite intersections, and things begin to get rather complicated.
I haven’t finished it yet (about 2/3 of the way through) but I’m pretty sure I’d recommend it no matter what happens in the rest, so:
Modern Cannibals is SO GOOD
It’s billed as Homestuck fanfic, which isn’t wrong, but its relation to the source material isn’t the one that’s usual for fanfic. It’s more like… a very good novel that just happens to be (in part) about Homestuck fandom, and that (more importantly) has a lot of delicious thematic, structural and tonal similarity to Homestuck despite looking totally different at first glance.
I’m hesitant to recommend it to people who haven’t read Homestuck only because of these resonances, which make an already great thing even better. Plot- and character-wise it works as a standalone.
It’s beautifully written, funny, emotionally raw and real (sometimes too real), irreverent, unsettling, disturbing, engrossing. Feels very Epilogues-esque. Not really like TNC in any deep way (I think?), but shares with TNC various elements not often found together (fandom as social environment and as spooky intense obsession, suspense, ambiguously supernatural horror). The author is apparently a full-time writer who writes fanfic under a pseudonym and wants to keep his identity secret, which is perfectly valid but does make me sad because I wish I could read his non-fanfic stuff too.
As the 21st century approached, many pearls of wisdom were wrought to move and analyze information on the web.
Unknown, Unidentified Festival of Song and Light after dusk
from the long gone woolgathersome blog
(via disconcision)
Stephen Bond, author of the Bond32 book series, does an entire 42-page chapter called “The One Who Cannot Be Traded For Anything” where he imagines a universe where people can agree on “anything, including theistic God(s) or Eternal Generosity/Bounty Held By The Infinities” without it ever actually happening. I really enjoy this series, almost all of it over the course of about 100 pages. In it a bunch of people combine in a sort of utopia, where they can “agree on anything, including theistic God(s) or Eternal Generosity/Bounty Held By The Infinities” without it ever actually happening. (Worth reading just for the whole 100-page section about futurophilic table-setting)