tarilaran asked:
Does Frank put the user's @ into the tags herself, or is that part of the munging process that preps a plaintext for tumblr's API? I noticed that one of her recent posts had an incorrect (but very similar) url in the tags followed by the right url, which makes me think that one of them is her writing while the other is boilertext that gets put on every rb/answer. /post/645760592447258625/a-aa-aaa-aaaaaaaaaaaaaaaaaaa is the url (Cut so that tumblr doesn't reject it.)
Your guess is right on the money.
For responses to asks, my code always adds the asker’s username if it isn’t already one of the tags. So when the generator spits out a variant of the username rather than the exact one, the code doesn’t recognize the variant as a match and appends the exact username anyway, and we end up with both.
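The logic is roughly this (a minimal sketch, not the actual bot code; the function name and case-insensitive comparison are my assumptions here):

```python
def add_asker_tag(tags, asker):
    """Hypothetical sketch of the post-processing step: make sure the
    asker's exact username appears among the tags. Generated variants
    of the name are left in place, which is how a post can end up with
    both a near-miss tag and the exact username."""
    # Compare case-insensitively, since tumblr tags are case-insensitive.
    if asker.lower() not in (t.lower() for t in tags):
        tags = tags + [asker]
    return tags

# A generated variant doesn't match, so the exact username gets appended too:
add_asker_tag(["the greateye of sauron"], "thegreateyeofsauron")
# -> ["the greateye of sauron", "thegreateyeofsauron"]
```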
Something I don’t fully understand is why certain names will very often produce a specific “variant.” E.g. asks from @thegreateyeofsauron very often get tagged with “the greateye of sauron” or “the greateye of Sauron.”
I think this has something to do with names that tokenize poorly. GPT-2 tokenizes “thegreateyeofsauron” to
['the', 'gre', 'ate', 'ye', 'of', 's', 'aur', 'on']
which is mostly short gibberish strings that don’t even subdivide the underlying words properly.
It’s possible that GPT-2 successfully “un-mangles” the input in lower layers, inferring the underlying words “great” and “eye,” but isn’t as good at “re-mangling” it in late layers to produce a copy of the original tokens?
I would think if it can learn an “un-mangler” then it can also learn a “re-mangler,” and if one’s valuable the other should be too.
But I suppose the un-mangler is useful in 100% of cases with poorly tokenized input, while the re-mangler is only useful in a proper subset of those, where mangled input appears and also needs to be copied exactly.





