Seeing Frank’s image generator operating in the real world has given me a lot of signal on how it can be improved, so I’m working on that now.
The image generator will hopefully improve a lot over time, like some of Frank’s other features.
In particular, I think the tendency to generate blurry, formless images (e.g.) is due to overly aggressive cropping of the training data.
I cropped roughly to the bounding box of the machine-recognized text in the image, reasoning that I wanted to make the text as legible as possible to maximize my chances of generating legible text at all.
But sometimes the image-to-text model just sees the letter “D” or something in a large image, so I end up with a blurry, ultra-zoomed-in picture of the letter “D” … and the generator has learned to skillfully imitate these boring, blurry pictures when given a sufficiently short text prompt. I guess I got what I asked for!
I'm continuing to train now on a differently cropped version of the data; we'll see how it goes.
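The fix amounts to padding the OCR bounding box before cropping, so a tiny detection (like a lone letter) still keeps plenty of surrounding context. Here's a minimal sketch of that idea; the function name, padding fraction, and minimum size are my own illustrative choices, not taken from the actual training code:

```python
def pad_crop_box(bbox, img_w, img_h, pad_frac=0.5, min_size=128):
    """Expand an OCR bounding box by pad_frac on each side, and further
    so that it is at least min_size pixels in each dimension, clamped
    to the image bounds. Avoids ultra-zoomed crops of tiny detections."""
    left, top, right, bottom = bbox
    w, h = right - left, bottom - top
    pad_w = max(int(w * pad_frac), (min_size - w) // 2, 0)
    pad_h = max(int(h * pad_frac), (min_size - h) // 2, 0)
    left = max(left - pad_w, 0)
    top = max(top - pad_h, 0)
    right = min(right + pad_w, img_w)
    bottom = min(bottom + pad_h, img_h)
    return (left, top, right, bottom)

# A lone "D" detected as a 20x24 box in a 512x512 image: the naive crop
# would be a 20x24 blur; the padded crop keeps at least 128px of context.
box = pad_crop_box((200, 200, 220, 224), img_w=512, img_h=512)
```

The crop box can then be passed straight to whatever image library does the actual cropping.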
I’ve pushed a new build of the image generator, trained for a little while on better-cropped data.
It will probably benefit a lot from further training, but it’s already making things that look more interesting…
Yesterday I made Frank much more likely to generate images than usual, as a demo of the new image generation feature.
I’ve turned that off, so she’ll now generate images at her “natural” rate.
—
This will greatly decrease the success rate of asks like “show me a picture of X.”
If you really want to make Frank generate a picture, think about a context in which a tumblr user would post a picture (with OCR-visible text in it), and try to set up that context with Frank.
—
Note also that the image generator is trying to make a picture containing the text Frank thinks the picture should contain. That’s it.
The image generator can’t see your ask, it can’t see your reblog, all it can see is some lines of text written by Frank – the same ones that used to appear in black-on-white text like this.
This is why things like tweets work so well: a picture of a tweet contains text indicating it’s a tweet, e.g. an @ sign before a username. Whereas trying to get a picture that wouldn’t normally have text in it (e.g. a cat) generates an almost random picture with close to no relationship to your prompt.
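To make the conditioning pathway concrete, here is a toy sketch of what the generator actually receives. The function name and the joining convention are my own assumptions; the point is that the only input is the post's text lines, with no view of the ask or reblog context:

```python
def build_image_conditioning(post_text_lines):
    # The generator's sole input: the lines of text Frank decided the
    # image should contain, joined into one conditioning string.
    # The ask, the reblog chain, usernames, etc. are invisible to it.
    return "\n".join(post_text_lines)

# A tweet-style post works well because the text itself carries the
# visual cues (an @-handle implies a tweet screenshot):
tweet_cond = build_image_conditioning(["@some_user", "just saw a cat. incredible."])

# "show me a cat" fails because the resulting caption carries no cues
# about what the picture should look like:
cat_cond = build_image_conditioning(["cat"])
```

So the reliable strategy is to engineer the *text* of the post, not the request that produced it.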
Still have to make a longer post on Frank’s new image generator, but a few quick comments:
- Many thanks to everyone I talked to about this project in the EleutherAI discord! Special thanks to RiversHaveWings for suggesting I try diffusion for this problem + various helpful tips along the way.
- As far as I know, the image generator I made for Frank is the first neural image generator anyone has made that can write arbitrary text into the image!! Let me know if you’ve seen another one somewhere.
- The model is a text-conditioned denoising diffusion model. Or rather two of them, a 128x128 base model and a 256x256 upsampler.
- Coincidentally, just 3 days ago, OpenAI announced/released their own text-conditioned denoising diffusion model. I guess it’s an idea whose time has come! Their model is structured a little differently, and makes way better-looking images, although without the writing-text aspect.
- My code for this model is a heavily modified fork of OpenAI’s improved-diffusion repo. It’s on this branch. The Files Changed view here gives a clearer sense of what I changed. (Caveat: it’s extremely hacky research code)
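As a rough sketch of how the two-stage setup fits together (the names, shapes, and stand-in models below are mine for illustration, not from the actual fork): the base model generates at 128x128 from the text conditioning, and the upsampler generates at 256x256 conditioned on both the text and the base model's output.

```python
import numpy as np

def upsample_nearest(x, factor=2):
    # Nearest-neighbour upsample (a stand-in for whatever interpolation
    # the real pipeline uses) to feed the base output to the upsampler.
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def cascade_sample(base_model, upsampler, text_cond):
    """Two-stage cascade: base model produces a 128x128 image from the
    text conditioning; the upsampler then produces a 256x256 image,
    conditioned on both the text and the upsampled low-res image.
    Each 'model' here stands in for a full diffusion sampling loop."""
    low_res = base_model(text_cond)            # (128, 128, 3)
    low_res_up = upsample_nearest(low_res)     # (256, 256, 3)
    return upsampler(text_cond, low_res_up)    # (256, 256, 3)

# Dummy stand-ins so the sketch runs end to end:
base = lambda cond: np.zeros((128, 128, 3))
up = lambda cond, lr: lr
img = cascade_sample(base, up, text_cond=None)
```

Splitting generation this way lets each model work at a manageable resolution; the upsampler only has to add detail consistent with the low-res structure it is given.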