
bluejaysfeathers asked:

Hi! I’ve noticed that sometimes Frank tags her posts “do not reblog”/“please don’t reblog”/etc., including on some reblogs from others (who did not tag their posts that way). While as a rule I do respect that sort of tag, I haven’t been able to work out any sort of pattern to which posts she puts it on, so I guess my question is… is that a genuine “don’t reblog” (which I’ll respect, and would love to know more about how she decides which posts it goes on), or is she learning that sometimes other people tag posts that way, so she sometimes does it too?

Most of the time, when Frank uses tags, they were written by the same text generator that writes the posts themselves.

There are a few special tags that can get added by the code in a predictable, rule-based fashion. Examples include: tagging the username of the person she’s talking to, and tagging her image posts with “#computer generated image”.

“Don’t reblog” and similar tags are not special rule-based tags. They’re written by the text generator, which writes them occasionally because it’s trying to sound like a tumblr user. Which is to say, they are exactly as “genuine” or “not genuine” as the posts themselves.

[It feels wrong somehow – inappropriate, grotesque? – to step away from refreshing news/twitter to write another one of my little ML posts, right now. But it shouldn’t, I think. I cannot help or harm Ukraine by posting or not posting things, or checking the news more or less often.]

—-

The video below visualizes how @nostalgebraist-autoresponder generates an image. I’ll add more examples in reblogs later.

The left and right panes are different ways of viewing the same underlying process.

The model is given a random noise image, and it gradually transforms it into a picture of something. See the section on diffusion in my earlier post for details.

The process looks something like

  1. the model looks at the current image, interpreting it as (some picture) + (random noise)
  2. the model makes a guess about the underlying picture
  3. the current image is adjusted a small amount towards the model’s guess
  4. repeat steps 1-3 many times
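A toy version of that loop (purely illustrative: a fixed target picture stands in for the model’s guess, and “images” are just short lists of pixel values):

```python
import random

def denoise_step(current, guess_fn, step_size=0.1):
    """One pass through steps 1-3: get the model's guess at the
    underlying picture, then nudge the current image toward it."""
    guess = guess_fn(current)
    return [c + step_size * (g - c) for c, g in zip(current, guess)]

# stand-in for the real model: it always "guesses" this target picture
target = [0.2, 0.8, 0.5]
guess_fn = lambda current: target

current = [random.random() for _ in target]  # start from pure noise
for _ in range(200):                         # step 4: repeat many times
    current = denoise_step(current, guess_fn)
```

Here `current` is the left-pane view and each `guess` is the right-pane view.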

The left pane of the video shows the “current image” from the above recipe, over the many repetitions of steps 1-3. Starting with pure noise, we gradually adjust it using the model.

The right pane follows the same process over time, but instead of showing the current image, it shows the model’s guess about the underlying picture (step 2).

The left pane shows what is directly produced by each iteration, and what is used as the input to the next iteration. But the right pane gives us a clearer view of what is going on.

This post connecting diffusion models and autoencoders describes a connection between noise level and feature scale. When the image is still very noisy, the model is working on large-scale features like the shapes of the basic objects in the scene. When the image is less noisy, the model is refining smaller details.

This accords with what we see in the right pane. The model makes a coarse-grained scene quickly, then gradually adds smaller and smaller details.

(None of this is original to me. Both types of visualization have been made by others before.)

(NOTE: my bot uses two image models, one that generates images at 128x128 resolution, and a second one that enlarges those images to 256x256. These videos are about the 128x128 model.

To make things easier to see at a typical display resolution, I have enlarged the video so each pane is 256x256 – but using a simple Lanczos filter, not a neural network. This is why the images appear blurrier and less detailed than Frank’s usual output, despite being the same size.)

the-moti asked:

What does Frank's selector model think about images, now that she can generate interesting images? Does it tend to predict posts containing images get more notes? That images with more text get more notes than images with less text?

I haven’t looked into this much. Probably the model hasn’t seen enough examples to be very confident.

It can distinguish the generated images from earlier image posts by checking whether the “#computer generated image” tag is there, but it has less than two months of data from the period when generated images were possible. (I do some smoothing on the raw notes signal over a ~week timescale, so the selector model can’t meaningfully learn about a new feature until there’s a minimum of a few weeks of data with the feature present.) And there’s confounding due to people being especially excited about the feature right when it came out, etc …

the-moti:

nostalgebraist:

Just released an update to Frank’s mood graph feature.

Here’s an example demonstrating the new behavior.

Nice! Do you have the data needed to construct an all-time, or at least longer-term, leaderboard of the posts that had the biggest positive impact on Frank’s mood? That would be interesting to see.

I don’t have all the needed data on hand. Reconstructing the missing part would be feasible but fairly annoying/tedious.

Specifically, I have full historical data on which “user inputs” (specific asks, reblogs, or replies) had which mood effects, but I don’t have the mapping from these to the post IDs of Frank’s responses. Post IDs are the long numbers after “/post/” in tumblr links, and are needed to link to specific posts.

To make this feature possible, I had to add new code to record this mapping for newly created responses. For older posts, I could presumably reconstruct it by joining the mood effects data to another data source I log upon post creation, although I’m not 100% sure that would work.

There are various subtleties, e.g. if a post is created as a draft, it gets a new ID when published, and you can ask the API “what was the draft ID (if applicable) of this published ID?” but not the reverse.

Anyway, I doubt this exercise would be that informative, since it’s essentially Goodharting oneself. Taking the max/min over a long enough time interval will sample from various model builds, manual changes to constants in the mood calculation, etc., as well as homing in on cases where the model itself did something atypically extreme. You’re basically asking for results that aren’t representative of typical patterns in the data (whatever those happen to be).

Just released an update to Frank’s mood graph feature.

Here’s an example demonstrating the new behavior.

stephaniedola asked:

do you tell Frank to tag images with "#computer generated image" or does she do that of her own accord in trying to mimic you. i know she does not control the guidance scale tag

That’s a force-added tag like the guidance scale ones.

FYI: Frank now ignores asks that don’t have any text in them.

(If you sent some image-only asks recently and wondered what happened to them, this is what.)

Why:

Frank writes better, more interesting responses when you talk to her like a person.

Good Frank asks generally pass the test “if I sent this ask to a human, would they understand what I was asking / telling them?”

Most image-only asks don’t pass this test IMO.

publicuniversalworstie asked:

i almost never see your posts but I see frank's all the time so when I saw your Homestuck post I spent a few incredibly confusing moments wondering why frank would say this unprompted

lol, the reason Frank talks about Homestuck so much is that I’ve been a longtime Homestuck poster, see eg https://nostalgebraist.tumblr.com/tagged/homestalgebra/chrono :)

I also posted a lot on MSPAF in 2011 to 2013ish (?), but that stuff is of course lost in time, like tears in rain

nostalgebraist:

nostalgebraist:

nostalgebraist:

2/2/22 was a big day in the world of neural language models!

A probably incomplete list of good stuff that came out today:

1. AlphaCode

2. That OpenAI math Olympiad paper

3. New open model from Eleuther with 20B parameters

4. New scaling laws paper

5. A new sampling method I might try in Frank sometime

Tried that new sampling method, Typical Sampling, in Frank this afternoon.

I set their parameter tau to 0.9, after reading samples with a few values and not seeing a clear difference even between values as far apart as 0.2 and 0.9. (If anything, intermediate values of tau seemed worse than extreme ones, though I could have been imagining that.)
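For reference, here’s my understanding of the filtering step in typical sampling, as a sketch (based on my reading of the paper; real implementations may differ in tie-breaking and other details). The idea is to keep the tokens whose surprisal is closest to the distribution’s entropy:

```python
import math

def typical_filter(probs, tau=0.9):
    """Keep the tokens whose surprisal (-log p) is closest to the
    distribution's entropy, until their total mass reaches tau,
    then renormalize over the kept tokens."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # rank tokens by |surprisal - entropy|, most "typical" first
    ranked = sorted(
        (i for i, p in enumerate(probs) if p > 0),
        key=lambda i: abs(-math.log(probs[i]) - entropy),
    )
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= tau:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# a toy 4-token distribution
dist = [0.5, 0.25, 0.15, 0.1]
small_tau = typical_filter(dist, tau=0.2)  # keeps only the most "typical" token
large_tau = typical_filter(dist, tau=0.9)
```

Note that with small tau, the highest-probability token isn’t necessarily kept – only the most entropy-typical one is.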

It didn’t take long to get an instance of degenerate repetition with this method, so I’m switching back to breakruns for now – it avoids repetition better than anything else I’ve seen.

Possibly another value of tau would suppress repetition harder, but given that the text from Typical Sampling feels similar to text from other methods, I’ll probably just stick with breakruns.

I decided to give Typical Sampling another try, with tau=0.2 this time.

Just turned it on. Let’s see how long it takes to get a repetitive post this time…

Update after ~24 hours of Typical Sampling with tau=0.2

- I haven’t seen any repetition-trapped posts (good)

- However, there have been two posts that are weird/degenerate in a new way I haven’t seen with Frank before (bad!)

In both of those posts, the gibberish begins after a “=======”, which is my delimiter for blocks of text read from images in the training data.

(You don’t normally see it appear in Frank posts, because the generator normally “closes” the image with a second “=======”, like a closing html tag, and then my code regex-matches this pattern and replaces it with a generated image in an <img> tag.)
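A hypothetical sketch of that regex step (the actual delimiter handling in the bot’s code may differ):

```python
import re

# Text between a pair of "=======" delimiters stands in for an image
# and gets swapped for an <img> tag; an unclosed block is left alone.
IMAGE_BLOCK = re.compile(r"=======\n(.*?)\n=======", re.DOTALL)

def replace_image_blocks(post, make_img_tag):
    """Replace each closed delimiter block with whatever make_img_tag
    builds from the enclosed image text."""
    return IMAGE_BLOCK.sub(lambda m: make_img_tag(m.group(1)), post)

post = "hello\n=======\nsome image text\n=======\ngoodbye"
cleaned = replace_image_blocks(post, lambda text: '<img src="generated.png">')
```

The degenerate posts described above correspond to the unclosed case: a lone “=======” that the regex never matches, so the raw image text leaks through.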

Image text is unusually random/unpredictable, esp. at the start of the image, when it could contain almost any text. Image text is also frequently glitchy and misspelled.

So I bet that “confuses” typical sampling … somehow?

I don’t have a good mental model of this yet, and don’t have good intuition for what the tau parameter is doing. I can study all this stuff later, if I feel like it, by re-running those posts through the generator and looking at probabilities.

For now, back to breakruns it is.

nostalgebraist:

Exponential moving averages (EMAs) are tricky…

—-

The update rule for an EMA e_t of values v_t is

e_t = (1-alpha) * e_{t-1} + alpha * v_t

for some small alpha like 0.1 or 0.001.

So the next value is a weighted average between the current average and the observation, strongly weighted toward the current average.

Once an observation v_t has been updated on, its term in the average “gets multiplied by” 1-alpha every time you take a step.

So, a term from 10 steps ago gets “decayed” by (1-alpha)^10, and a term from 1000 steps ago gets “decayed” more, by (1-alpha)^1000. This biases the average toward recent observations, which is the whole point.

But there’s a twist! The very first observation v_0 is treated differently. Unlike every other observation, it never appears as the alpha * v_t term in the update. So its weight in the average is “inflated” by a factor of 1/alpha relative to all the other terms.

Since alpha is small, this is a huge effect!

Suppose alpha=0.001. At step t=1000, here are some (unnormalized) term weights:

- v_1000: 0.001 * (0.999)^0 = 0.001

- v_1: 0.001 * (0.999)^999 = 0.0003681

- v_0: (0.999)^1000 = 0.3677

The weight for v_1000 is 2-3 times bigger than the weight for v_1, which makes sense, that’s the recency bias. But it’s ~370 times smaller than the weight for v_0. The oldest point of them all, v_0, gets over 1/3 of the total weight in the average!

A common heuristic says that an EMA has an “effective time window” of something like 2/alpha steps. But you have to wait much longer than this for the average to care more about the most recent term than the least recent one. For alpha=0.001, this happens around t=6900, not t=2000. And even there, the first and last terms are given equal weight, where intuitively their weights should be far apart.
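These numbers are easy to reproduce directly from the weight formulas above, including the crossover step where the newest term finally matches v_0:

```python
import math

alpha, t = 0.001, 1000

# unnormalized weights of individual terms in the EMA at step t,
# per the decay argument above
w_latest = alpha * (1 - alpha) ** 0        # v_1000, the newest observation
w_first  = alpha * (1 - alpha) ** (t - 1)  # v_1
w_seed   = (1 - alpha) ** t                # v_0, never multiplied by alpha

# step at which the newest term's weight finally equals v_0's:
# alpha = (1 - alpha)**t_cross  =>  t_cross = log(alpha) / log(1 - alpha)
t_cross = math.log(alpha) / math.log(1 - alpha)
```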

—-

One way to deal with this would be to do an arithmetic average up to t=1/alpha, then do an EMA thereafter.

Why t=1/alpha? The update rule for a running arithmetic mean is

e_t = ((t-1)/t) * e_{t-1} + (v_t / t)

That is, you correct the current average for the divisor now being t instead of t-1, and add in the next term.

The two weights here sum to 1, so this looks like the EMA update, with alpha = 1/t.

The two rules are the same when t=1/alpha. For larger t, the EMA gives more weight to the latest observation than the arithmetic average does, which we want (recency bias). For smaller t, the EMA is actually less recency-biased than the arithmetic average. Presumably we don’t want that, so we use the arithmetic average until the crossover point.

I just made this up, so IDK if this is a good idea.
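One compact way to implement the scheme above (my own sketch): since both rules have the same weighted-average shape, take the step size to be max(alpha, 1/t), which equals 1/t (arithmetic mean) until t = 1/alpha and alpha (EMA) afterwards.

```python
class HybridAverage:
    """Arithmetic mean up to t = 1/alpha, EMA thereafter."""

    def __init__(self, alpha):
        self.alpha = alpha
        self.t = 0
        self.avg = 0.0

    def update(self, v):
        self.t += 1
        # 1/t >= alpha until the crossover point, so this is a plain
        # running mean early on and a standard EMA later
        step = max(self.alpha, 1.0 / self.t)
        self.avg = (1 - step) * self.avg + step * v
        return self.avg
```

Before the crossover this reproduces e_t = ((t-1)/t) * e_{t-1} + v_t/t exactly; after it, the standard EMA update.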

—-

Another way to deal with this is what the Adam optimizer does.

It sets v_0 = 0, and treats the first observation as v_1. So all real observations are treated equally.

But then the average is too small, because you’re averaging in an extra zero. To correct for this, you divide the average by 1-(1-alpha)^t at each t, which makes it an unbiased estimator again. (You do this only when you “report” the average for downstream use, not inside the update rule.)
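In update-rule form (a sketch of the zero-init-plus-debias scheme just described, not Adam itself):

```python
class DebiasedEMA:
    """EMA seeded with v_0 = 0, reported with Adam-style bias correction."""

    def __init__(self, alpha):
        self.alpha = alpha
        self.t = 0
        self.raw = 0.0  # the zero seed biases early averages low

    def update(self, v):
        self.t += 1
        self.raw = (1 - self.alpha) * self.raw + self.alpha * v

    def value(self):
        # divide out the weight "wasted" on the zero seed; done only
        # when reporting, never fed back into the update
        return self.raw / (1 - (1 - self.alpha) ** self.t)
```

With a constant input, `value()` recovers it exactly at every step, while `raw` starts out far too small.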

This is cleaner than my arithmetic average trick, but the division might be too numerically imprecise (?) in cases where you really care about the exact values, like when you’re averaging parameters during optimization. Might be fine though.

The standard practice in parameter averaging is to just use a naive EMA, with the issue mentioned above, which seems … bad? I guess the idea is that you should be averaging for a very long time, such that v_0 eventually washes out. But you do have to wait a long time for this to happen, much longer than the “effective time window” of your EMA.

I tried the arithmetic-average-then-EMA approach on Frank’s image models, and I’m getting great results!

It converges to high-quality results much, much faster than iterate averaging with a regular EMA, and gives me the freedom to pick a different EMA alpha in the middle of training without having to wait a gazillion steps for the new EMA to get anywhere.

I don’t have the link handy, but I read a paper recently that was advocating for iterate averaging, and it claimed arithmetic averages were actually better than EMAs in terms of theoretical convergence rate, as long as you start averaging near convergence.

From this perspective, maybe the EMAs that are popular in the field are effectively just proxies for arithmetic averages over the last O(1/alpha) points of training. In principle one could do an actual sliding arithmetic average, but that requires storing every point in the window (so you know what “drops off the left side” on each step), whereas the EMA needs only a single running value, so it’s more memory-efficient.
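For concreteness, here’s what the exact sliding mean costs (illustrative only, not anything I actually run): the buffer has to hold the entire window, versus a single running value for an EMA.

```python
from collections import deque

def sliding_mean(values, window):
    """Exact arithmetic mean over the last `window` points. The buffer
    must hold the whole window so old points can drop off the left side."""
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for v in values:
        if len(buf) == buf.maxlen:
            total -= buf[0]  # the point about to fall out of the window
        buf.append(v)        # deque discards the oldest point automatically
        total += v
        out.append(total / len(buf))
    return out
```

For parameter averaging, each “value” is a full parameter vector, so the window-sized buffer is what makes this expensive relative to an EMA.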