The AI projects I’ve found most interesting are GPT-2 (generates text from a prompt) and StyleGAN (can redraw one picture in the style of another picture).
GPT-2’s medium is text, and its purpose is generation. StyleGAN’s medium is images, and its purpose is style change. Is there any necessary reason why each medium is matched to its respective purpose?
Could you have an image generator - a model trained on every image on the Internet - and if you give it most of an image with one part missing, it can fill in the missing part? I mean, obviously you can, this is how your eye fills in the blind spot, but could AI scientists make it today? What about something where if you give it half an image, it can generate the rest of it? A body, given a head? A tree, given a trunk? If not, why not?
And could you have a text style changer? Something that can rewrite Harry Potter in the voice of Ernest Hemingway, or give you The Da Vinci Code in the heroic meter of the Iliad, or the Dao De Jing as written by @nostalgebraist? If not, why not?
As a few other people mentioned, the first of the two things you describe – image continuation from the “prompt” of surrounding image regions – is possible and has been done under the name of “image inpainting.”
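(To make the setup concrete, here's a minimal sketch of what an inpainting model's input typically looks like: the known pixels plus a binary mask marking the hole. The image here is random noise standing in for a real photo, and the exact input format varies by architecture, so treat this as illustrative.)

```python
import numpy as np

# A stand-in 64x64 RGB image, values in [0, 1].
rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))

# Binary mask: 1 = known pixel, 0 = region the model must fill in.
mask = np.ones((64, 64, 1))
mask[16:48, 16:48] = 0.0  # a missing 32x32 square in the middle

# A common inpainting setup: the network sees the masked image
# concatenated with the mask itself, and is trained to reconstruct
# the original pixels inside the hole.
masked_image = image * mask
model_input = np.concatenate([masked_image, mask], axis=-1)

print(model_input.shape)  # (64, 64, 4)
```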
However, I’m not aware of anything like your second example, and I think there is indeed a good reason related to the difference between text and images.
To do style transfer, you need some way to decompose a thing into a “style” component and a “content” component.
In the image domain, people do this by encoding each image as two lists of properties: a list of properties that have single values for the whole image and another list of properties that take different values at different points in the image. They call the first list “style” and the second list “content.”
That is, the style transfer literature depends on the assumption that “style” = “all the interesting things you can say about an image that only refer to it as a whole, not to what is happening in any particular spatial part.” A priori, one might wonder if this is too reductive or something, but as it happens, it just works. To quote the original StyleGAN paper:
This observation is in line with style transfer literature, where it has been established that spatially invariant statistics (Gram matrix, channel-wise mean, variance, etc.) reliably encode the style of an image [20, 39] while spatially varying features encode a specific instance.
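(These "spatially invariant statistics" are easy to compute yourself. Below is a sketch, using a random array as a stand-in for a conv layer's feature map: the Gram matrix and per-channel mean/std all summarize the feature map as a whole, and the final assertion checks the defining property, that scrambling the spatial positions leaves them unchanged.)

```python
import numpy as np

# Stand-in feature map from a conv layer: (channels, height, width).
rng = np.random.default_rng(0)
features = rng.standard_normal((8, 32, 32))

C, H, W = features.shape
flat = features.reshape(C, H * W)

# Spatially invariant "style" statistics:
gram = flat @ flat.T / (H * W)    # (C, C) Gram matrix (Gatys et al.)
channel_mean = flat.mean(axis=1)  # (C,) per-channel mean
channel_std = flat.std(axis=1)    # (C,) per-channel std

# Rearranging the spatial positions changes "what is where" --
# the content -- but leaves all three statistics unchanged:
perm = rng.permutation(H * W)
shuffled = flat[:, perm]
assert np.allclose(shuffled @ shuffled.T / (H * W), gram)
assert np.allclose(shuffled.mean(axis=1), channel_mean)
```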
I don’t know if anyone has tried this for text, but if they haven’t, I know why: a priori it sounds much less promising.
“Style” for text isn’t the list of facts you can give without pointing to specific segments of the text – that list of facts is much bigger than style, arguably including the whole informational content of the text. “Is written in JKR’s style” is this kind of fact, but “is a story about Harry Potter” is equally this kind of fact.
Trying to phrase the key difference clearly:
- In text, spatial variations don’t constitute a well-defined channel for meaning separate from other channels.
“There is a long line of dialogue about ¼ of the way through” is a weird, not very interesting fact about a text: you could imagine a text with the same essential style and information content which puts that line in some other place, or doesn’t have it at all.
- In images, spatial variations do constitute a well-defined channel for meaning separate from other channels, roughly “content” as opposed to “style.”
“There is a human eye about ¼ of the way down and right from the upper left corner” is a perfectly natural sort of fact about an image, conveying precisely one kind of information (what is there) without another kind (what “a human eye” looks like according to the current style). If we have this kind of info plus a single global style for the image, we’ve specified it and can draw it.
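(The "recombine content with a single global style" step has a standard concrete form: adaptive instance normalization, from Huang & Belongie's AdaIN paper, which is also what StyleGAN builds on. A sketch, with random arrays standing in for real feature maps:)

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive instance normalization: keep the content feature map's
    spatial pattern, but impose the style's per-channel mean and std."""
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True)
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True)
    return (content - c_mean) / (c_std + eps) * s_std + s_mean

rng = np.random.default_rng(0)
content = rng.standard_normal((8, 32, 32))  # "what is where"
style = rng.standard_normal((8, 32, 32))    # source of global statistics

out = adain(content, style)
# out keeps the content's spatial arrangement, but its per-channel
# statistics now match the style's.
```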
There may be some other mathematical way of separating style and content in text – if so, I’d love to hear about it – but the very simple one that works for images won’t work for text.