I’ve made some changes to Frank’s image model in the last few days:
I’ve incrementally improved the model. The new one has more parameters, and can “see” more text per image (up to 384 characters). This may improve image quality.
I’ve added classifier-free guidance. This makes the images more likely to contain the text Frank wants to write, at the cost of potentially making them less varied.
You will see tags like “#guidance scale 2” start to appear on Frank posts with images. This tells you how much classifier-free guidance was used, where 0 means “none.”
I’m not sure what the “sweet spot” for this number is, so for now, I’m having Frank randomly pick from a range of different values.
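For readers curious how the guidance scale enters the math: classifier-free guidance extrapolates from an unconditional denoising prediction toward the text-conditional one. A minimal sketch of the combination step, assuming the convention where a scale of 0 means “no guidance at all” (the function name and the exact sign convention are my guesses, not Frank’s actual code):

```python
import numpy as np

def guided_eps(eps_cond, eps_uncond, scale):
    """Combine conditional and unconditional noise predictions.

    scale = 0 reproduces the plain conditional prediction ("none");
    larger values push the sample further toward agreement with the text,
    at the cost of variety.
    """
    return eps_cond + scale * (eps_cond - eps_uncond)

# e.g. with scale 2, the model "overshoots" past the conditional prediction
eps_c = np.array([1.0, 2.0])
eps_u = np.array([0.5, 0.5])
guided = guided_eps(eps_c, eps_u, 2.0)
```

At sampling time Frank would apply this at every diffusion step, which is why the tag records a single number per image.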
[EDIT 9/6/22: I wrote this post in January 2022. I’ve made a number of improvements to this model since then. See the links above for details on what the latest version looks like.]
Last week, I released a new feature for @nostalgebraist-autoresponder that generates images. Earlier I promised a post explaining how the model works, so here it is.
I’ll try to make this post as accessible as I can, but it will be relatively technical.
Why so technical? The interesting thing (to me) about the new model is not that it makes cool pictures – lots of existing models/techniques can do that – it’s that it makes a new kind of picture which no other model can make, as far as I know. As I put it earlier:
As far as I know, the image generator I made for Frank is the first neural image generator anyone has made that can write arbitrary text into the image!! Let me know if you’ve seen another one somewhere.
The model is solving a hard machine learning problem, which I didn’t really believe could be solved until I saw it work. I had to “pull out all the stops” to do this one, building on a lot of prior work. Explaining all that context for readers with no ML background would take a very long post.
tl;dr for those who speak technobabble: the new image generator is OpenAI-style denoising diffusion, with a 128x128 base model and a 128->256 superresolution model, both with the same set of extra features added. The extra features are: a transformer text encoder with character-level tokenization and T5 relative position embeddings; a layer of image-to-text and then text-to-image cross-attention between each resnet layer in the lower-resolution parts of the U-Net’s upsampling stack, using absolute axial position embeddings in image space; a positional “line embedding” in the text encoder that does a cumsum of newlines; and information about the diffusion timestep injected in two places, as another embedding fed to the text encoder, and injected with AdaGN into the queries of the text-to-image cross-attention. I used the weights of the trained base model to initialize the parts of the superresolution model’s U-Net that deal with resolutions below 256.
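One small piece of the above is concrete enough to show: the “line embedding” assigns each character the index of the line it sits on, computed as a cumulative sum of newline indicators. A sketch under that reading (the helper name, and the off-by-one convention for the newline character itself, are my guesses):

```python
import numpy as np

def line_ids(text):
    """Per-character line indices via a cumsum of newlines.

    Characters on the first line get 0, the second line 1, and so on.
    With an inclusive cumsum, each '\n' is counted with the line it
    starts rather than the line it ends -- that convention is a guess.
    """
    newline_flags = np.array([ch == "\n" for ch in text], dtype=int)
    return np.cumsum(newline_flags)
```

These ids would then index into a learned embedding table, giving the text encoder a notion of vertical position to go with the characters themselves.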
This post is extremely long, so the rest is under a readmore.
I’ve noticed that when Frank sends images of (Twitter, tumblr, etc) posts, they start off pretty coherent but rapidly stop forming real words, let alone the grammatical sentences/complete thoughts that are present in text posts. Is this related to how images are read in training data, or some other quirk of the image generation? Although, I only really remember seeing this like twice, so I could be cherry-picking a bunch here.
A lot of it is the fact that the image model can see a maximum of 192 characters of text.
(I did this to speed up training; I’m now training a “version 2” of the model at a more leisurely pace, which among other changes has a max length of 384)
So for example in this tweet, the actual text Frank wanted to write was
@Nirvash
a very smart man said that women
like long and complicated stories, but
men like short and easy to read. It
should
mean that there are more women
that read and more men that write.
You know what, that’s very much true.
I’m glad that men read because I can read and appreciate the fact that they
liked my work.
The 192nd character is in the middle of a sentence, indeed in the middle of a word (partway through “write”).

So the model can guess that it’s seeing something longer that’s been truncated. It knows there’s more text after what it can see. But it doesn’t know what that text is, so it just spams twitter-font gibberish.
(There may be other mechanisms, but this is the only one I’ve explicitly confirmed by examining an example)
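For concreteness, since the model reads text character by character, the truncation is just a character-level cutoff; a trivial sketch (the constant and function names are mine):

```python
MAX_CHARS = 192  # the old limit; "version 2" raises this to 384

def visible_text(post_text, max_chars=MAX_CHARS):
    # The model only conditions on the first max_chars characters;
    # everything past the cutoff is simply invisible to it, even if
    # the cutoff lands mid-sentence or mid-word.
    return post_text[:max_chars]
```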
hey, I'm not sure what the best method for giving feedback on Frank is, apologies if sending an ask is not it. i just wanted to bring up that there are several mutuals i have who find the neural-blender images that go around upsetting, uncomfortable, or disturbing, because of the general look of images made that way. would you consider a tag that Frank would put on every post where she adds an image that people can blacklist?
Also, i love Frank and it's so cool seeing her grow and develop over time! Thank you so much for making this for us to enjoy!
Sure! I’ll use the tag “#computer generated image.”
The change should be live now (let me know if it’s not working).
The green-tinted spectacles worn by Olds were designed to protect the eyes from the intensity of Argand lamps, a type of indoor light used during the early 1800s. These lamps burned whale oil, and many people worried that their bright flames might damage eyesight.

The painter of this portrait founded the Western Union Telegraph Company in 1854 and soon became one of Cleveland’s wealthiest industrialists. His grandson, Jeptha Wade II, was a founder of the Cleveland Museum of Art and donated the land upon which it stands as a Christmas gift to the city in 1892.

Size: Framed: 87 x 71.8 x 5.7 cm (34 ¼ x 28 ¼ x 2 ¼ in.); Unframed: 76.5 x 61.2 cm (30 1/8 x 24 1/8 in.)
Medium: oil on canvas
I have a playlist that I always listen to when I’m writing Almost Nowhere, at least since sometime in late 2019 or early 2020 (?).
By now, I’ve heard the same songs so many times that their effect is muted – it’s more about the ritual at this point. My mind associates the playlist with “being in the Almost Nowhere writing headspace,” and I listen to induce that headspace, and the more I do this, the stronger the association grows, so it’s self-reinforcing.
Anyway, the funny thing about this playlist, and the reason I’m posting about it at all, is that it’s, like, hilariously cringe.
I’m talking like, “an incoherent mixture of anime-adjacent J-pop, the Doctor Who soundtrack, the FFXV soundtrack, and Emilie Autumn.” Among other things.
Whereas, when I was writing TNC, I listened to stuff like this. A stark contrast!
I don’t know how these things get to be the way they are … it seems undetermined, not inevitable. Maybe it’s all just exponential self-reinforcement magnifying arbitrary small differences, maybe not.