raginrayguns asked:

I've heard you describe neural networks as learning differentiable maps from R^j to R^k, I think? It just occurred to me that they have to be differentiable with respect to the weights in order to be trained with gradient descent, but do they actually have to be differentiable with respect to the input features? Like, are these connected, or do they just happen to both be true?

They don’t have to be differentiable with respect to the input, no.

Indeed, in a case like GPT, the input space is discrete, and you can’t be differentiable with respect to it.

On the other hand… most of the time, a neural network is a long string of composed functions, where each one is parameterized by some “weights”.

Like y(x) = p(q(h(g(f(x))))), where x is the input. Except each of the functions p, q, … f takes an extra second argument: its weights.

You want the derivatives w/r/t the weights of p, q, … f. Doing the chain rule, you get terms like p’(q(…)), q’(h(…)), etc.

So every one of these functions has to be differentiable w/r/t its “input argument” – except for the very first one, f, which doesn’t have to be differentiable with respect to x (the input of the whole network).

That is, neural networks are usually differentiable “most of the way through,” from the output to the first function in the composition, which is usually very simple. Even if dy/dx doesn’t exist, dy/df has to exist. And that derivative spans all of the nontrivial calculation done by the network. In the case of GPT, “f(x)” is a lookup table that maps discrete tokens to vectors, and everything after that is differentiable w/r/t the vectors.
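To make this concrete, here's a toy version of the GPT situation (made-up numbers, plain Python): a discrete token goes through a lookup table, and everything downstream is differentiable with respect to the looked-up vector, which we can verify with a finite difference on the embedding entries.

```python
E = {0: [0.1, -0.3], 1: [0.8, 0.5]}  # embedding table: the "weights" of f
w = [0.4, -0.2]                      # weights of the differentiable part

def forward(token):
    v = E[token]                              # f(x): a lookup, not differentiable in x
    s = sum(wi * vi for wi, vi in zip(w, v))  # differentiable in v
    return s * s                              # y

# dy/dv_i = 2 * (w . v) * w_i exists even though dy/dtoken does not.
# Check the analytic derivative against a finite difference taken on the
# embedding vector (not on the token id, where no such thing exists).
token, i, eps = 1, 0, 1e-6
s = sum(wi * vi for wi, vi in zip(w, E[token]))
analytic = 2 * s * w[i]

base = forward(token)
E[token][i] += eps
numeric = (forward(token) - base) / eps
E[token][i] -= eps

assert abs(analytic - numeric) < 1e-4
```

The gradient reaches the embedding table itself (so the table can be trained), even though no derivative with respect to the token id exists.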

In this sense, it feels natural to speak of neural networks as “learning differentiable functions” even if it’s not always strictly true.

(P.S. there is an unrelated nuance with commonly used functions like ReLU that are differentiable almost everywhere, but not everywhere. In practice this doesn't really matter, as you'd probably expect.)

nostalgebraist:

2/2/22 was a big day in the world of neural language models!

A probably incomplete list of good stuff that came out today:

1. AlphaCode

2. That OpenAI math Olympiad paper

3. New open model from Eleuther with 20B parameters

4. New scaling laws paper

5. A new sampling method I might try in Frank sometime

Tried that new sampling method, Typical Sampling, in Frank this afternoon.

I set their parameter tau to 0.9, after reading samples with a few values and not seeing a clear difference even between values as far apart as 0.2 and 0.9. (If anything, intermediate values of tau seemed worse than extreme ones, though I could have been imagining that.)

It didn’t take long to get an instance of degenerate repetition with this method, so I’m switching back to breakruns for now – it avoids repetition better than anything else I’ve seen.

Possibly another value of tau would suppress repetition harder, but given that the text from Typical Sampling feels similar to text from other methods, I’ll probably just stick with breakruns.
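For reference, the core of Typical Sampling as I read the paper (my own minimal sketch, not Frank's actual sampling code): keep the tokens whose surprisal is closest to the entropy of the next-token distribution, using the smallest such set with cumulative mass at least tau.

```python
import math

def typical_filter(logits, tau=0.9):
    """Minimal sketch of Typical Sampling's filtering step.

    Keep the tokens whose surprisal (-log p) is closest to the entropy of
    the whole next-token distribution, taking the smallest such set with
    cumulative probability >= tau, then renormalize.
    """
    # softmax over the logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    p = [e / z for e in exps]
    entropy = -sum(pi * math.log(pi) for pi in p)
    # rank tokens by |surprisal - entropy|, most "typical" first
    order = sorted(range(len(p)), key=lambda i: abs(-math.log(p[i]) - entropy))
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += p[i]
        if mass >= tau:
            break
    return [p[i] / mass if i in kept else 0.0 for i in range(len(p))]
```

You'd then sample from the filtered distribution as usual; lower tau keeps fewer tokens.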

I decided to give Typical Sampling another try, with tau=0.2 this time.

Just turned it on. Let’s see how long it takes to get a repetitive post this time…

Update after ~24 hours of Typical Sampling with tau=0.2

- I haven’t seen any repetition-trapped posts (good)

- However, there have been two posts that are weird/degenerate in a new way I haven’t seen with Frank before (bad!)

In both of those posts, the gibberish begins after a “=======”, which is my delimiter for blocks of text read from images in the training data.

(You don’t normally see it appear in Frank posts, because the generator normally “closes” the image with a second “=======”, like a closing html tag, and then my code regex-matches this pattern and replaces it with a generated image in an <img> tag.)
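The delimiter handling is roughly like this (a hypothetical sketch with a made-up pattern and names, not Frank's actual code): a closed pair of "=======" lines gets swapped for a generated image, while an unclosed one is left alone, which is when the raw delimiter leaks into a post.

```python
import re

# Text between a pair of "=======" lines is image text; DOTALL lets the
# captured text span multiple lines.
IMAGE_BLOCK = re.compile(r"=======\n(.*?)\n=======", re.DOTALL)

def replace_image_blocks(post, render_image):
    """Replace each closed ======= ... ======= block with whatever
    render_image produces (e.g. an <img> tag). An unclosed =======
    doesn't match and survives into the output."""
    return IMAGE_BLOCK.sub(lambda m: render_image(m.group(1)), post)
```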

Image text is unusually random/unpredictable, esp. at the start of the image, when it could contain almost any text. Image text is also frequently glitchy and misspelled.

So I bet that “confuses” typical sampling … somehow?

I don’t have a good mental model of this yet, and don’t have good intuition for what the tau parameter is doing. I can study all this stuff later, if I feel like it, by re-running those posts through the generator and looking at probabilities.

For now, back to breakruns it is.

nostalgebraist:

Exponential moving averages (EMAs) are tricky…

—-

The update rule for an EMA e_t of values v_t is

e_t = (1-alpha) * e_{t-1} + alpha * v_t

for some small alpha like 0.1 or 0.001.

So the next value is a weighted average between the current average and the observation, strongly weighted toward the current average.

Once an observation v_t has been updated on, its term in the average “gets multiplied by” 1-alpha every time you take a step.

So, a term from 10 steps ago gets “decayed” by (1-alpha)^10, and a term from 1000 steps ago gets “decayed” more, by (1-alpha)^1000. This biases the average toward recent observations, which is the whole point.

But there’s a twist! The very first observation v_0 is treated differently. Unlike every other observation, it never appears as the alpha * v_t term in the update. So its weight in the average is “inflated” by a factor of 1/alpha relative to all the other terms.

Since alpha is small, this is a huge effect!

Suppose alpha=0.001. At step t=1000, here are some (unnormalized) term weights:

- v_1000: 0.001 * (0.999)^0 = 0.001

- v_1: 0.001 * (0.999)^999 = 0.0003681

- v_0: (0.999)^1000 = 0.3677

The weight for v_1000 is 2-3 times bigger than the weight for v_1, which makes sense, that’s the recency bias. But it’s ~370 times smaller than the weight for v_0. The oldest point of them all, v_0, gets over 1/3 of the total weight in the average!

A common heuristic says that an EMA has an “effective time window” of something like 2/alpha steps. But you have to wait much longer than this for the average to care more about the most recent term than the least recent one. For alpha=0.001, this happens around t=6900, not t=2000. And even there, the first and last terms are given equal weight, when intuitively their weights should be far apart.
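These numbers are easy to check directly. A quick script, using the convention that the EMA is initialized with e_0 = v_0:

```python
import math

def ema_weights(t, alpha):
    """Weight of each observation v_0 .. v_t in the EMA at step t,
    under the convention e_0 = v_0.

    v_0 enters as the initial value, so it is never multiplied by alpha;
    every later v_k enters as alpha * v_k and then decays by (1 - alpha)
    per step. (These weights happen to sum to exactly 1.)
    """
    w0 = (1 - alpha) ** t
    return [w0] + [alpha * (1 - alpha) ** (t - k) for k in range(1, t + 1)]

w = ema_weights(1000, 0.001)
print(w[1000])  # 0.001
print(w[1])     # ~0.000368
print(w[0])     # ~0.3677 -- over 1/3 of the total weight

# First step at which the newest term outweighs v_0:
# alpha > (1 - alpha)^t  <=>  t > log(alpha) / log(1 - alpha)
t_cross = math.ceil(math.log(0.001) / math.log(0.999))
print(t_cross)  # ~6905, not 2000
```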

—-

One way to deal with this would be to do an arithmetic average up to t=1/alpha, then do an EMA thereafter.

Why t=1/alpha? The update rule for a running arithmetic mean is

e_t = ((t-1)/t) * e_{t-1} + (v_t / t)

That is, you correct the current average for the divisor now being t instead of t-1, and add in the next term.

The two weights here sum to 1, so this looks like the EMA update, with alpha = 1/t.

The two rules are the same when t=1/alpha. For larger t, the EMA gives more weight to the latest observation than the arithmetic average does, which we want (recency bias). For smaller t, the EMA is actually less recency-biased than the arithmetic average. Presumably we don’t want that, so we use the arithmetic average until the crossover point.

I just made this up, so IDK if this is a good idea.
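As a sketch, the hybrid rule might look like this (my own made-up scheme, so treat the names and structure as illustrative):

```python
class HybridAverage:
    """Arithmetic mean until t = 1/alpha, EMA afterward.

    A sketch of the made-up rule above, not a standard technique
    with an established API.
    """
    def __init__(self, alpha):
        self.alpha = alpha
        self.t = 0
        self.avg = None

    def update(self, v):
        self.t += 1
        if self.avg is None:
            self.avg = v
        elif self.t <= 1 / self.alpha:
            # running arithmetic mean: equivalent to an EMA step with alpha = 1/t
            self.avg += (v - self.avg) / self.t
        else:
            # ordinary EMA step: fixed alpha, recency-biased
            self.avg += self.alpha * (v - self.avg)
        return self.avg
```

Before the crossover this reproduces the exact arithmetic mean, so v_0 gets the same weight as every other early observation instead of an inflated one.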

—-

Another way to deal with this is what the Adam optimizer does.

It sets v_0 = 0, and treats the first observation as v_1. So all real observations are treated equally.

But then the average is too small, because you’re averaging in an extra zero. To correct for this, you divide the average by 1-(1-alpha)^t at each t, which makes it an unbiased estimator again. (You do this only when you “report” the average for downstream use, not inside the update rule.)
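A minimal sketch of that Adam-style version (zero init plus report-time bias correction; this illustrates the idea rather than reproducing Adam's code):

```python
class DebiasedEMA:
    """EMA initialized at zero, with Adam-style bias correction applied
    only when the value is reported, not inside the update rule."""
    def __init__(self, alpha):
        self.alpha = alpha
        self.t = 0
        self.raw = 0.0  # e_0 = 0: a fake observation, corrected for below

    def update(self, v):
        self.t += 1
        self.raw = (1 - self.alpha) * self.raw + self.alpha * v

    @property
    def value(self):
        # divide out the total weight assigned to the fake zero init
        return self.raw / (1 - (1 - self.alpha) ** self.t)
```

On a constant stream of observations this reports the constant exactly from the very first step, which is the sense in which it's unbiased.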

This is cleaner than my arithmetic average trick, but the division might be too numerically imprecise (?) in cases where you really care about the exact values, like when you’re averaging parameters over the course of optimization. Might be fine though.

The standard practice in parameter averaging is to just use a naive EMA, with the issue mentioned above, which seems … bad? I guess the idea is that you should be averaging for a very long time, such that v_0 eventually washes out. But you do have to wait a long time for this to happen, much longer than the “effective time window” of your EMA.

I tried the arithmetic-average-then-EMA approach on Frank’s image models, and I’m getting great results!

It converges to high-quality results much, much faster than iterate averaging with a regular EMA, and gives me the freedom to pick a different EMA alpha in the middle of training without having to wait a gazillion steps for the new EMA to get anywhere.

I don’t have the link handy, but I read a paper recently that was advocating for iterate averaging, and it claimed arithmetic averages were actually better than EMAs in terms of theoretical convergence rate, as long as you start averaging near convergence.

From this perspective, maybe the EMAs that are popular in the field are effectively just proxies for arithmetic averages over the last O(1/alpha) points of training. In principle one could do an actual running arithmetic average over a fixed window, but this requires storing every value in the window (so you know what “drops off the left side” on each step), so the EMA is more memory-efficient.

nostalgebraist:

Broke: “matrix multiplication”

Woke: “a two-layer MLP with orthogonally parameterized weight matrices, weight normalization, a linear activation function, and no bias”

catchaspark asked:

what do you think of the alphago coding result? it wigs me out and i'm wondering if that is warranted

I learned about this one from the EleutherAI discord (a bunch of very knowledgeable ML people) and have been following the discussion in there.

I’d describe the overall reaction in there as “impressed with the raw performance, not massively surprised, not sold that it implies anything big in particular,” which is also my reaction (so my summary might be biased). See, e.g., this guy’s take.

IMO, the most important fact about this paper is that they didn’t make the underlying neural net better at coding. As far as I can tell, it’s pretty similar to OpenAI’s existing Codex model. If you just throw it at the competitive coding problems “directly,” it does very badly: its output only passes the provided example tests ~1% of the time, and passing those tests is only a first step toward a correct submission.

What is new in the paper is an aggressive expansion of the “prune” side of the classic “babble-and-prune” approach – generate lots of texts, then keep only the best ones according to some heuristic.

Rather than making the model smarter, they make it faster to run, and then they generate a very large number of candidate solutions. If only 1% of them pass tests, that’s OK because they generate >>100 at once and can discard all the dross. And then after that, they have a separate step where another model invents additional tests for more pruning (which is a neat idea). And finally, they hedge their bets and try to pick 10 maximally diverse programs for the 10 submissions they get to send, to maximize the chance that one of them works.
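Schematically, the pipeline is something like the following. Here `generate`, `passes_tests`, and `signature` are hypothetical stand-ins for sampling a program from the model, running it against the provided plus model-invented tests, and computing a crude behavioral fingerprint used to keep the final submissions diverse; none of this is the paper's actual code.

```python
def babble_and_prune(generate, passes_tests, signature,
                     n_candidates=1000, n_submit=10):
    """Sketch of the generate-then-filter pipeline described above."""
    # babble: sample many candidate programs
    candidates = [generate() for _ in range(n_candidates)]
    # prune: keep only candidates that pass every available test
    survivors = [c for c in candidates if passes_tests(c)]
    # hedge: submit at most one program per behavioral cluster
    seen, picks = set(), []
    for c in survivors:
        sig = signature(c)
        if sig not in seen:
            seen.add(sig)
            picks.append(c)
        if len(picks) == n_submit:
            break
    return picks
```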

This does get the job done, I can’t argue with that. However, it has some limitations:

1. It feels closely specialized to this one problem, competitive programming. In this context, it’s OK to use a lot of compute, because we’re solving hard problems, not building a consumer application; it’s OK that current models can’t see enough text at once to understand a whole codebase, because you only have to care about a verbal problem description and your solution; it’s OK to assume there are high-quality tests available, because there are; they get an edge out of heuristics designed on the assumption they get to guess 10 times, which they do; etc. It’s not clear how to generalize this stuff to other coding tasks.

2. The novel work on the pruning step was hand-designed by humans for this problem; the neural network doesn’t “know” it’s being used as part of this larger routine and can’t “learn” to do it better in the way it “learns” to write code originally. Again, this limits generalization to other problems: if you want to do something other than competitive programming, you may have to go back to the human AI experts and get them to design a different specialized pruning setup.

Because of (2), I’m actually more excited by the OpenAI paper that came out the same day, on an AlphaGo-like technique – where the neural net learns to “internalize” the logic of a search system built around it – applied to theorem proving. (That paper actually has much more of a connection to AlphaGo than AlphaCode does.)

What I’m really curious about now is applying expert iteration (the AlphaGo internalize-the-search thing) to code generation. It seems like an obvious idea, but I can’t remember seeing it tried.

the-moti:

nostalgebraist:

@the-moti

Today’s new OpenAI result on solving math Olympiad problems (blogpost, paper) seems relevant to our discussion a while back about using RL / self-play / AlphaGo-like techniques for math.

If I understand correctly, the core idea (section 4.5) is similar to AlphaGo, where you do tree search guided by a neural network, then train the network to imitate the results of the search, then search again with the updated network, etc.

To do this, you need a list of true statements to run the proof search on, and you need them to span a range of difficulties-to-prove. But formalized true statements are much easier to generate than formalized proofs, and there’s prior art on how to do this.

The impressive-sounding results on Olympiad problems required also training on some manually formalized Olympiad problems (section 6), and the resulting model still wasn’t very good at advanced math in any absolute sense (section 7.2), but this seems like an important proof of concept.
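The loop, with toy stand-ins for the search and training steps (hypothetical names, nothing from the paper's code), looks something like:

```python
def expert_iteration(network, statements, n_rounds, search, train):
    """Search-then-imitate, schematically: `search` runs a proof search
    guided by the current network; `train` updates the network to imitate
    whatever the search found; repeat with the updated network."""
    for _ in range(n_rounds):
        attempts = [search(network, s) for s in statements]  # guided proof search
        found = [p for p in attempts if p is not None]       # keep successful proofs
        network = train(network, found)                      # imitate the search results
    return network
```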

Thanks!

Are they really searching on a tree? For the expansion process they refer to the GPT-f paper, which describes sampling a bunch of tactics and evaluating them, but not going beyond depth 1.

It might be that this is not necessary if many steps are reversible or harmless. If the network suggests tactics A and B, and evaluates the goal state after A as more promising than the goal state after B, the program tries A and then generates a bunch of tactics; on realizing that none of them achieve the potential it saw in A, it might not need to backtrack and undo A, since it can just do B at this point.

But maybe this produces in the training data a lot of proofs like the mathd_algebra_140 example on page 24 of the paper, which apparently begin with a bunch of steps that don’t do very much. Using such bloated proofs as material for the next stage of training can’t be helpful…

TBH, I only have a fuzzy understanding of the GPT-f paper, and proof assistants are not my area in the first place. That said…

They definitely do search a tree (Fig 1 in that paper), but not the same way that game-playing agents do.

In game-playing, the agent walks an entire tree “in its head,” then takes a single step “in the real world.” Then walks another entire tree, then takes another single step, etc.

Whereas here, we just walk the tree once, and then the tree just is the proof (if it contained the theorem).

The way I think about it is: since there’s no adversary, and the agent isn’t trying to predict the dynamics of an environment, we don’t have a clear distinction (which we have in games) between steps that are “part of the real trajectory” and steps that are “merely simulated.” You simply roll out the search tree and check if it contains the theorem or not.

If it does, it will likely contain many subtrees that are irrelevant to how the theorem was eventually proven. “Removing” these subtrees from some sort of “final copy” would be useful for readability, but it’s distinct from the sort of backtracking an agent would want to do in an environment where time and opportunity costs matter (making a useless move = giving your adversary a free move). In a sense, this removal step would happen “outside the game,” not affecting whether the proof worked and not freeing up space the agent could use for anything else.

(ETA: to be more explicit, the width axis of the tree is always whatever size the hyperparameters set it to be. It’s better to use the fixed space along this axis to do more useful things vs. less useful ones, but once it’s been used to assess a possible avenue, there’s no additional notion of “deciding whether or not to go there in reality.” If it’s been imagined, it’s been done.)

I could be wrong about what’s going on, though, I’m not very confident.

Anyway, all that’s about GPT-f – the new paper adds a part where the model learns to predict proof length (to reach a [sub]goal) during training and then uses this to prefer shorter proofs during inference. I imagine this helps generate more readable proofs, but again, the decision to do this happens “outside the game” (it redefines “the game” so that time matters internally, which was not true in GPT-f’s game).
