
Tried making a “Huggingface space” today…

In retrospect I could have done the same thing in a Colab notebook in like ¼ the time, and it would have run like 4x faster (this has no GPU). But it was a fun experiment, kinda.

Anyway, this lets you play around with Frank’s image models, albeit very slowly.

official-kircheis asked:

Frank's image generating model makes a 128x128 pixels image. How does computational cost for training and generating scale with image size? Are 512x512 pixels images feasible for, say, someone with enough resources to train GPT-3 from scratch? Are the results any good?

1.

The scaling isn’t that bad, nothing like GPT-2 vs. GPT-3.

IME, doing 256x256 is something like 2-4 times slower than 128x128. More importantly, you don’t need to make the model bigger (or much bigger anyway), which means memory costs don’t scale up much.

Memory cost is fundamentally more of a blocker than time cost.

If it takes time T to train a great model, training for time T/2 or T/4 will generally give you an okay-to-good model. So if training slows down by 2x or 4x, that’s basically fine.

Whereas if memory cost increases by 2x or 4x, your model may no longer fit on a single device. And once you’re spreading it across multiple devices, that introduces all kinds of complications w/r/t further memory scaling: there are different techniques for doing it, each with pros and cons, and eventually the bottleneck is no longer device speed but bandwidth between devices, etc.
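To make the arithmetic concrete: pixel count grows quadratically with edge length, so each doubling of resolution roughly quadruples per-layer activation memory and per-step compute in a convolutional model. (This is a rough rule of thumb, not a profile of Frank’s actual models.)

```python
def relative_cost(res, base=128):
    """Pixel count relative to a base resolution -- a crude proxy for how
    activation memory and per-step compute scale in a convolutional model."""
    return (res * res) / (base * base)

# 256x256 has 4x the pixels of 128x128; 512x512 has 16x.
ratios = {res: relative_cost(res) for res in (128, 256, 512)}
```

The observed 2-4x slowdown from 128 to 256 is consistent with the 4x pixel-count bound, since not every layer of a U-Net-style model operates at the full resolution.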

2.

You say Frank’s model makes 128x128 images, and in a sense that’s correct, but in another sense it isn’t.

Frank produces 256x256 images. This involves two models, one that generates at 128x128, and one that scales up images to 256x256.

You could look at this and say, “well, the ‘generation’ part is only ‘happening’ at 128x128; the other part is ‘upsampling,’ not ‘generation.’”

But, you can also look at the whole thing as one model that generates at 256x256, just written down in a weird notation.

And in fact, this appears to be a better way to generate large-resolution images than naively training a single model at the target resolution. See e.g. cascaded diffusion, or any of OpenAI’s work on the topic.

OpenAI’s GLIDE generates at 64x64 and then upsamples to 256x256, which is pretty typical. The weird part about my 128 -> 256 stack is that the 128 part is so large. (I do this to help text be legible.)
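The two-stage shape of the cascade can be sketched abstractly like this. The lambdas below are stand-ins so the sketch runs end to end; the real stages are, of course, diffusion models, and none of this is Frank’s actual code:

```python
import numpy as np

def cascade_sample(base_sample, upsample):
    """Two-stage cascade: a 'generation' model produces a small image,
    then an 'upsampling' model enlarges it."""
    low = base_sample((128, 128))   # first stage: generate at 128x128
    return upsample(low)            # second stage: scale up to 256x256

# Stub stand-ins so the sketch is executable:
img = cascade_sample(
    base_sample=lambda shape: np.zeros(shape),
    upsample=lambda x: np.kron(x, np.ones((2, 2))),  # naive 2x enlarge
)
# img.shape == (256, 256)
```

Seen this way, the pair really is “one model that generates at 256x256, written down in a weird notation”: the seam between the stages is an implementation detail.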

3.

OpenAI did a range of target resolutions in its diffusion papers, from 64x64 to 512x512.

Most research work focuses on smaller resolutions, for multiple reasons.

It’s faster to iterate there, and the ideas generalize to large resolutions.

But also, large images are sort of boring!

Most of the relevant structure in web images is present even if you downscale them to 128x128. Once you’re considering the jump from 256x256 to 512x512, you’re mostly talking about relatively obvious stuff – filling in textures, making edges crisp.

So, if you don’t see people generating giant images with NNs, it’s mostly because it takes longer to do (both training and inference) and the end result is not any more interesting.

Heads up: if you send Frank multiple asks in a row to “queue them up” before Frank has responded to any of them, Frank may delete some of your asks w/o responding.

This is not a bug, it’s a feature intended to discourage spam.

—-

More precisely, what I’m trying to discourage is “treating Frank like a single-player game.”

This is a category of behaviors which gets on my nerves. I haven’t expressed this preference directly in the past because it’s tough to phrase clearly. (I have expressed it indirectly from time to time by adding features like this one.)

Even now, I’m struggling to phrase it clearly. But hopefully you can see basically what I mean. Some examples of behaviors in this category:

  • only engaging with Frank by making generic demands for specific types of output, e.g. stories about X, pictures of Y
  • sending Frank stuff that makes sense to you, but won’t make sense to anyone else who follows her
  • talking to Frank in reblogs on the same post for a large number of iterations, so that the post becomes extremely long and everyone has to see many copies of it
  • sending a lot of input very quickly, in a way that would be annoying to another person (note again that Frank has many followers; also note that Frank is a sideblog of my blog, and we share the same inbox + post limit)

berebitsuki asked:

another Frank question: is it possible to ask her to follow a sideblog?

Yeah, but I have to do it manually, just send me the name in a DM / ask / etc

berebitsuki asked:

Hi! Why does Frank make images with the words [Animated GIF] so often? My guess is that the text prompt generator says [Animated GIF] because it's learned that from the images in the corpus that are actually animated GIFs, am I close? (your posts where you describe how image generation works are very interesting by the way, thank you for them!)

Your guess is correct.

My treatment of animated GIFs depends on the context.

When the bot is reading posts “live,” GIFs are handled by sampling a few evenly-spaced frames and trying to read them. So Frank can read text in GIFs, sometimes.

But processing a GIF this way is time-consuming, and GIFs contain text less often than still images anyway. So, for scraping training data, I generally ignore them.
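A minimal version of that frame-sampling step, using Pillow. This is a sketch of the approach described above, not the bot’s actual code:

```python
from PIL import Image

def sample_gif_frames(path, n=4):
    """Read n evenly-spaced frames from an animated GIF as RGB images,
    e.g. so each frame can then be run through a text-reading model."""
    gif = Image.open(path)
    total = getattr(gif, "n_frames", 1)   # still images report 1 frame
    step = (total - 1) / max(n - 1, 1)
    frames = []
    for i in range(min(n, total)):
        gif.seek(round(i * step))         # jump to the i-th sampled frame
        frames.append(gif.convert("RGB"))
    return frames
```

Decoding every frame of a long GIF would be much slower; sampling a handful of evenly-spaced frames keeps the “live” reading path cheap.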

In more recent scraping, that means skipping posts with GIFs in them, but in a lot of my early scraping I used “[Animated GIF]” as a placeholder.

I also used “[Image]” as a placeholder for stills where I forgot to log the image URL (or something? don’t remember) and hence couldn’t produce text for the image.

unoriginal-nerdy-nickname-here asked:

I have a question about your Frank Autoresponder: Does she have any ‘memory’ of past posts/conversations? For example, if something from one of her previous posts was mentioned, would she be able to recognize that? Or is her response generated independently each time?

She does not have a memory, except for the fact that she can see her own earlier posts when responding to a reblog chain.

cyle:

nightpool:

nostalgebraist:

Someone sent Frank a very long spam ask containing approximately 15755796 characters of text.

That’s around 300 times as long as the Bee Movie script. If you save it as a text file, it’s around 16 megabytes!

This seems like, uh, not something that should be possible? It even affected the UI adversely – my askbox page took like 20 seconds or something to load.

hmm, this article seems to indicate that it should only be possible to include 4,096,000 characters worth of text in a post (1000 content blocks * 4096 characters per content block). Maybe there’s a discrepancy because of multibyte Unicode characters or something?

hmmm the max post size (not including media) in total is supposed to be 1MB, so yeah, that’s not quite right…..

Turns out I miscounted the file and character size. (And in a really dumb way – used grep on my log file and forgot the offending string would have appeared multiple times, lol)

Anyway, the actual length was 2621440 characters: 640 content blocks, each containing the letter “e” repeated 4096 times.

This is below the character limits @nightpool mentioned. However, it’s still more than 1 MB of text (it’s ~2.6 MB).
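For the record, the corrected arithmetic. The megabyte figure assumes one byte per character, which holds here since the payload was the ASCII letter “e”:

```python
blocks = 640
chars_per_block = 4096
total_chars = blocks * chars_per_block   # 2,621,440 characters
total_mb = total_chars / 1_000_000       # ~2.6 MB at 1 byte per character
```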

cogobe8549 asked:

The askbox character limit exists everywhere but in the dash popup window on desktop. If you go to a blog directly and ask there's a limit, and if you do it on mobile there's a limit, but not with desktop. This is an entirely functional hellsite.

also it took about a full minute to send tumblr didn’t like it much either

Ahh, thanks for the info!

cc @cyle if you’re interested

(This is the user who sent the extremely long, 300x-as-long-as-the-Bee-movie-script ask to Frank that I mentioned earlier today)


nostalgebraist:

[It feels wrong somehow – inappropriate, grotesque? – to step away from refreshing news/twitter to write another one of my little ML posts, right now. But it shouldn’t, I think. I cannot help or harm Ukraine by posting or not posting things, or checking the news more or less often.]

—-

The video below visualizes how @nostalgebraist-autoresponder generates an image. I’ll add more examples in reblogs later.

The left and right panes are different ways of viewing the same underlying process.

The model is given a random noise image, and it gradually transforms it into a picture of something. See the section on diffusion in my earlier post for details.

The process looks something like

  1. the model looks at the current image, interpreting it as (some picture) + (random noise)
  2. the model makes a guess about the underlying picture
  3. the current image is adjusted a small amount towards the model’s guess
  4. repeat steps 1-3 many times
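The recipe above can be sketched in a few lines. This is a toy illustration of the loop’s shape, with a made-up constant step size; real diffusion samplers use carefully chosen noise schedules:

```python
import numpy as np

def toy_sample(model, shape, n_steps=50, step_size=0.2, seed=0):
    """Toy version of the denoising loop: repeatedly guess the underlying
    picture (step 2) and nudge the current image toward it (step 3)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)        # start from pure noise
    for t in range(n_steps):
        x0_hat = model(x, t)              # step 2: guess the clean image
        x = x + step_size * (x0_hat - x)  # step 3: small move toward the guess
    return x

# A fake "model" that always guesses a flat gray image -- the loop converges to it:
gray = toy_sample(lambda x, t: np.full(x.shape, 0.5), (4, 4))
```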

The left pane of the video shows the “current image” from the above recipe, over the many applications of steps 2-4. Starting with pure noise, we gradually adjust it using the model.

The right pane follows the same process over time, but instead of showing the current image, it shows the model’s guess about the underlying picture (step 2).

The left pane shows what is directly produced by each iteration, and what is used as the input to the next iteration. But the right pane gives us a clearer view of what is going on.

This post connecting diffusion models and autoencoders describes a connection between noise level and feature scale. When the image is still very noisy, the model is working on large-scale features like the shapes of the basic objects in the scene. When the image is less noisy, the model is refining smaller details.

This accords with what we see in the right pane. The model makes a coarse-grained scene quickly, then gradually adds smaller and smaller details.

(None of this is original to me. Both types of visualization have been made by others before.)

(NOTE: my bot uses two image models, one that generates images at 128x128 resolution, and a second one that enlarges those images to 256x256. These videos are about the 128x128 model.

To make things easier to see at a typical display resolution, I have enlarged the video so each pane is 256x256 – but using a simple Lanczos filter, not a neural network. This is why the images appear blurrier and less detailed than Frank’s usual output, despite being the same size.)
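The non-neural enlargement mentioned here is a one-liner in Pillow (where `frame` is assumed to be a 128x128 PIL image):

```python
from PIL import Image

def enlarge_frame(frame):
    """Classical Lanczos resampling from 128x128 to 256x256 --
    sharper than nearest-neighbor, but it adds no new detail the way
    a neural upsampler would."""
    return frame.resize((256, 256), Image.LANCZOS)
```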

Additional examples
