
@adzolotl replied to this post

Changing one weight by 1,000 units does seem like it could break things, and is rotationally similar to changing 1,000,000 weights by one unit. (A “unit” here is, like, the stdev of weights, and I’m assuming this is similar for different blocks of weights, which is sketchy but makes things more Euclidean.)

Hmm… part of why I think “changing one weight can’t break things” is that NN weights are often “inactive” for a large fraction of inputs.

Like, if it’s an operation before a relu, and that relu is doing useful work, then the relu output is zero for some nontrivial fraction of inputs, and for those it doesn’t matter what the weight is.

So the full loss is an average over those inputs, where screwing up this weight doesn’t change the problem, and the other inputs, where it does. For the “bad” inputs, the changed weight might produce large gradients in other weights, but in each mini-batch you still have “good” inputs with gradients that won’t like moving away from what currently works for those inputs.

I haven’t tried this, though, maybe it’s worse than I think. You probably could mess things up by changing a weight in something like a layer norm or batch norm, which is active for all inputs.
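(Here’s roughly the kind of toy check I have in mind. Everything in it is made up for illustration: a random gaussian weight matrix, random gaussian inputs, one weight bumped by ~1,000 of its own stdevs, and only a single relu layer rather than a real training run.)

    import numpy as np

    rng = np.random.default_rng(0)
    n_inputs, d_in, d_out = 10_000, 256, 256

    X = rng.normal(size=(n_inputs, d_in))                # random inputs
    W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)   # weights with stdev 1/sqrt(d_in)

    post = np.maximum(X @ W.T, 0.0)                      # relu(W x)

    W_bad = W.copy()
    W_bad[0, 0] += 1000.0 / np.sqrt(d_in)                # bump one weight by ~1,000 "units"
    post_bad = np.maximum(X @ W_bad.T, 0.0)

    # For inputs where unit 0's relu was off both before and after the bump,
    # this layer's output is literally unchanged.
    unchanged = np.all(post_bad == post, axis=1)
    print("fraction of inputs with identical relu output:", unchanged.mean())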

Anyway, that reasoning is all specialized to the weight basis.

So you might think: if I need that argument to explain why a step in the weight basis is OK, then maybe a step of the same magnitude in an arbitrary basis would be not-OK.

But again, in N-d, most vectors are orthogonal. If you have a trained weight matrix before an activation fn., its rows (“filters”) are specially chosen to be not-orthogonal-to-the-input a nontrivial fraction of the time (picking out inputs which the filter matches). If you add a random vector to each row, you don’t screw up that property; the random vector is orthogonal to most inputs (b/c everything is).

So again, there’s a “good” part of the input space where the perturbation doesn’t matter. The “bad” inputs (where the perturbation is harmful) give you gradients telling you to get rid of the perturbation, but they’re noisy gradients and by themselves might scramble the original filter. But if they try to do that, the gradients from the “good” part of the input will say “hey, no, that thing’s useful” and ensure it stays.

I’m sure there are adversarial directions that do destroy your progress by compounding the same type of error across layers, just as there are adversarial directions where you change one weight in a layer/batch norm. They just aren’t most directions.

I’m just thinking out loud here… I guess this sounds like a half-baked theory of why NNs are easy to optimize, dunno how well it holds up.

In an N-dimensional space, if you pick any vector, there’s an (N-1)-dimensional subspace orthogonal to that vector.

So if you move in some direction, there is a basis in which you’ve changed 1 element of your position vector while leaving the other (N-1) elements unchanged.

When N=3, the difference is not too big: 1 thing changes, 2 things don’t.

Whenever you move from one point to another in 3D, there is some coordinate system where only 1 of your 3 coordinates has changed. This observation doesn’t feel weird or surprising to me, since I can immediately imagine a picture: there’s the vector, there’s the plane orthogonal to it.

When N is large, though, (N-1) is also large. So if you move from any point A to any other point B, there is some coordinate system in which the coordinates of your position vector look mostly the same.
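(A quick numpy sketch of that, with the dimension chosen arbitrarily: build an orthonormal basis whose first axis points along the step from A to B, and look at the step’s coordinates in that basis.)

    import numpy as np

    rng = np.random.default_rng(0)
    N = 1000
    A, B = rng.normal(size=N), rng.normal(size=N)
    d = B - A                                  # the step you took

    # QR on a matrix whose first column is d gives an orthonormal basis
    # whose first vector points along d (up to sign).
    M = rng.normal(size=(N, N))
    M[:, 0] = d
    Q, _ = np.linalg.qr(M)

    coords = Q.T @ d                           # the step, written in the new basis
    print(coords[0])                           # ~ +/- ||d||: the one coordinate that changed
    print(np.linalg.norm(coords[1:]))          # ~ 0: the other N-1 coordinates didn't move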

Well, duh, you might say. But, thinking about this fact seems to help me get intuition about high-dimensional spaces. For example:

• In a high-dimensional space, two randomly chosen vectors are almost exactly orthogonal with high probability. This is often described as a counter-intuitive difference from 3D space.

But it seems intuitive when I think about it this way. If someone gives you a vector v1, the subspace orthogonal to v1 is most of the vector space. Projecting most vectors onto that subspace will barely change them.
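(This is easy to check numerically; the sizes here are arbitrary.)

    import numpy as np

    rng = np.random.default_rng(0)
    for N in (3, 100, 10_000, 1_000_000):
        v1, v2 = rng.normal(size=N), rng.normal(size=N)
        cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
        print(N, cos)   # cosine similarity shrinks like ~1/sqrt(N)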

• In a high-dimensional space, volumes scale like r^N, so most of the volume in an N-ball is close to the surface, most of the volume in an N-cube is close to the corners, etc. The consequences – e.g. that an N-ball packed inside an N-cube occupies almost none of the cube’s volume – are often described as weird and marvelous.

But this also seems intuitive when I think about it this way.

Imagine you’re in a maze, and you can send people (or robots or whatever) out to explore it along different paths.

From your starting point, there are 2N different directions that your explorers can head out in. (For each axis, they can go one way along that axis, or the opposite way.) A choice of any one direction is a choice to not explore the other (2N - 1) “exits” of the room.

After this step, the explorers are now in adjacent rooms. And each of them has 2N exits. When the explorers take their second steps, they’re again choosing not to explore most of the adjacent area they could be exploring. As you get further from the starting point, you accumulate this exploding backlog of un-chosen directions and un-seen stuff.

This only has a hand-wavey connection to volume – I’m counting the number of possible paths, not the volume around those paths, and the number of paths grows exponentially which is faster than volume. But, it gives me an intuitive picture of a “space with a lot of space in it,” so to speak, where it feels natural that volume grows very super-linearly.
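(The ball-in-a-cube fact itself is a one-liner to check, using the standard formula for the volume of an N-ball. Here the ball has radius 1/2, the biggest that fits inside a unit cube.)

    import math

    for N in (2, 3, 10, 50, 100):
        ball = math.pi ** (N / 2) / math.gamma(N / 2 + 1) * 0.5 ** N
        print(N, ball)   # fraction of the unit cube filled by the inscribed ball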

• This one is vaguer, but looking at things this way makes neural net training feel less strange to me.

Like, I used to imagine the loss function as this complicated hilly landscape in 3D (because I can only imagine 3D). I.e. a scalar function of a 2D vector. Everyone knows this is a misleading picture, but it’s hard to do otherwise.

One way that the picture misleads is that… OK, this is hand-wavey, but in 2D, you have to be careful because your state is so small.

All the optimization progress you’ve made so far lies in the 2 numbers specifying the current point. When you take a step, the space orthogonal to that step only has 1 dimension, the same as the step itself. If, by accident, you take a step that’s way too big and end up somewhere else in the landscape, you’ve basically erased all your progress.

Putting that a different way: if you take a large step and end up “somewhere else on the landscape,” the place you end up is completely different. In the best case scenario, the cross section orthogonal to the large step is approx. constant over the whole step, so you can find your way back without screwing up the “other” coordinate in the process. But that only happens if the hills have a convenient “simple” structure; “complicated” 3D hills generally won’t; so if you screw up one coordinate, you’re as good as starting from scratch.

But in an N-dimensional space, there are lots of directions, and a bad step only takes you along one of them. There’s plenty of room for many axes to look like the “best case scenario” for the one remaining axis you get in 2D, while still having enough spatial variation in the eigenbasis of the Hessian (or whatever) for the surface to count as “complicated.” There are a lot of ways the problem could factor into independent subspaces without being so thoroughly factored it’s trivial.

And in fact, this is perfectly “obvious” if you think about directions aligned with specific network parameters. If you pick one weight in a network with millions or billions of weights, and mess up its value, this seems like a small change rather than a drastic one; you expect the network to recover.

But how is this different from taking a step in any direction? I guess this argument has to stop somewhere, since if you take the “step” from a fully trained network to its initialization point, it’s hard to get back to the trained network again (since it took the entire training run to do precisely this). Still, this puts the shoe on the other foot – instead of asking why a given step doesn’t ruin everything, we’re asking why it would, which seems like the right question in N-d.

raginrayguns asked:

I've heard you describe neural networks learning differentiable maps from R^j to R^k I think? it just occurred to me that they have to be differentiable with respect to the weights in order to be trained with gradient descent, but do they actually have to be differentiable with respect to the input features? Like are these connected or just happen to both be true

They don’t have to be differentiable with respect to the input, no.

Indeed, in a case like GPT, the input space is discrete, and you can’t be differentiable with respect to it.

On the other hand… most of the time, a neural network is a long string of composed functions, where each one is parameterized by some “weights”.

Like y(x) = p(q(h(g(f(x))))), where x is the input. Except each of the functions p, q, … f has an extra second argument: its weights.

You want the derivatives w/r/t the weights of p, q, … f. Doing the chain rule, you get terms like p’(q(…)), q’(h(…)), etc.

So every one of these functions has to be differentiable w/r/t its “input argument” – except for the very first one, f, which doesn’t have to be differentiable with respect to x (the input of the whole network).

That is, neural networks are usually differentiable “most of the way through,” from the output to the first function in the composition, which is usually very simple. Even if dy/dx doesn’t exist, dy/df has to exist. And that derivative spans all of the nontrivial calculation done by the network. In the case of GPT, “f(x)” is a lookup table that maps discrete tokens to vectors, and everything after that is differentiable w/r/t the vectors.
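(A tiny PyTorch sketch of this; the sizes and layers are arbitrary, just enough to show where differentiability starts.)

    import torch
    import torch.nn as nn

    vocab_size, d_model = 100, 16
    embed = nn.Embedding(vocab_size, d_model)        # the lookup table, "f(x)"
    body = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1))

    tokens = torch.tensor([3, 17, 42])               # discrete input: no dy/dx here
    vecs = embed(tokens)                             # from here on, everything is differentiable
    y = body(vecs).sum()
    y.backward()

    print(tokens.requires_grad)      # False: you can't even ask for dy/dx on integer tokens
    print(embed.weight.grad.shape)   # but dy/d(weights) exists, all the way down to the lookup table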

In this sense, it feels natural to speak of neural networks as “learning differentiable functions” even if it’s not always strictly true.

(P. S. there is an unrelated nuance with commonly used functions like relu that are differentiable almost everywhere, but not everywhere. In practice this doesn’t really matter, as you’d probably expect)

the-moti:

nostalgebraist:

Exponential moving averages (EMAs) are tricky…

—-

The update rule for an EMA e_t of values v_t is

e_t = (1-alpha) * e_{t-1} + alpha * v_t

for some small alpha like 0.1 or 0.001.

So the next value is a weighted average between the current average and the observation, strongly weighted toward the current average.

Once an observation v_t has been updated on, its term in the average “gets multiplied by” 1-alpha every time you take a step.

So, a term from 10 steps ago gets “decayed” by (1-alpha)^10, and a term from 1000 steps ago gets “decayed” more, by (1-alpha)^1000. This biases the average toward recent observations, which is the whole point.

But there’s a twist! The very first observation v_0 is treated differently. Unlike every other observation, it never appears as the alpha * v_t term in the update. So its weight in the average is “inflated” by a factor of 1/alpha relative to all the other terms.

Since alpha is small, this is a huge effect!

Suppose alpha=0.001. At step t=1000, here are some (unnormalized) term weights:

- v_1000: 0.001 * (0.999)^0 = 0.001

- v_1: 0.001 * (0.999)^999 = 0.0003681

- v_0: (0.999)^1000 = 0.3677

The weight for v_1000 is 2-3 times bigger than the weight for v_1, which makes sense, that’s the recency bias. But it’s ~370 times smaller than the weight for v_0. The oldest point of them all, v_0, gets over 1/3 of the total weight in the average!

A common heuristic says that an EMA has an “effective time window” of something like 2/alpha steps. But you have to wait much longer than this for the average to care more about the most recent term than the least recent one. For alpha=0.001, this happens around t=6900, not t=2000. And even there, the first and last terms are given equal weight, where intuitively their weights should be far apart.
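(These numbers, and the crossover point, take a couple of lines of Python to check:)

    import math

    alpha, t = 0.001, 1000

    # weights under e_t = (1-alpha) * e_{t-1} + alpha * v_t, with e_0 = v_0
    print(alpha * (1 - alpha) ** 0)          # weight of v_1000: 0.001
    print(alpha * (1 - alpha) ** (t - 1))    # weight of v_1:    ~0.000368
    print((1 - alpha) ** t)                  # weight of v_0:    ~0.368

    # crossover: smallest t where the newest term outweighs v_0, i.e. alpha >= (1-alpha)^t
    print(math.log(alpha) / math.log(1 - alpha))   # ~6904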

—-

One way to deal with this would be to do an arithmetic average up to t=1/alpha, then do an EMA thereafter.

Why t=1/alpha? The update rule for a running arithmetic mean is

e_t = ((t-1)/t) * e_{t-1} + (v_t / t)

That is, you correct the current average for the divisor now being t instead of t-1, and add in the next term.

The two weights here sum to 1, so this looks like the EMA update, with alpha = 1/t.

The two rules are the same when t=1/alpha. For larger t, the EMA gives more weight to the latest observation than the arithmetic average does, which we want (recency bias). For smaller t, the EMA is actually less recency-biased than the arithmetic average. Presumably we don’t want that, so we use the arithmetic average until the crossover point.

I just made this up, so IDK if this is a good idea.
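(As a sketch, with a dummy stream of observations standing in for real data:)

    import numpy as np

    def hybrid_update(e, v, t, alpha):
        # running arithmetic mean for the first 1/alpha steps, then an ordinary EMA
        # (t is 1-indexed: v is the t-th observation)
        if t <= 1 / alpha:
            return ((t - 1) / t) * e + v / t
        return (1 - alpha) * e + alpha * v

    values = np.random.default_rng(0).normal(size=10_000)   # dummy stream of observations
    alpha, e = 0.001, 0.0
    for t, v in enumerate(values, start=1):
        e = hybrid_update(e, v, t, alpha)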

—-

Another way to deal with this is what the Adam optimizer does.

It sets v_0 = 0, and treats the first observation as v_1. So all real observations are treated equally.

But then the average is too small, because you’re averaging in an extra zero. To correct for this, you divide the average by 1-(1-alpha)^t at each t, which makes it an unbiased estimator again. (You do this only when you “report” the average for downstream use, not inside the update rule.)
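(Concretely, the zero-init-plus-correction scheme looks something like this, again with a dummy stream standing in for real observations:)

    import numpy as np

    values = np.random.default_rng(0).normal(size=10_000)   # dummy stream of observations
    alpha, e = 0.001, 0.0                                   # e_0 = 0, like Adam

    for t, v in enumerate(values, start=1):
        e = (1 - alpha) * e + alpha * v                     # plain EMA update with zero init
        e_corrected = e / (1 - (1 - alpha) ** t)            # bias-corrected value for downstream use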

This is cleaner than my arithmetic average trick, but the division might be too numerically imprecise (?) in cases where you really care about the exact values, like when you’re doing parameter averaging over optimization. Might be fine though.

The standard practice in parameter averaging is to just use a naive EMA, with the issue mentioned above, which seems … bad? I guess the idea is that you should be averaging for a very long time, such that v_0 eventually washes out. But you do have to wait a long time for this to happen, much longer than the “effective time window” of your EMA.

I think this arithmetic mean approach is good. I tried to come up with a philosophically correct approach, which ends up being the same:

We want to assign weight alpha to the most recent observation, alpha * (1-alpha) to the next most recent, and so on. But because there are not infinitely many past observations, this doesn’t add up to one. We have “extra weight” we need to assign. Since this weight was supposed to arise from observations older than the oldest observation, it makes sense to assign it to the oldest one - up until the point where the oldest one has equal weight to the next-oldest, as it doesn’t make sense to put more weight on older observations than more recent ones. So instead we assign the weight evenly to the oldest two, until the oldest two become equal to the third oldest.

It’s not hard to see that when we do this, we always end up putting equal weight on the oldest 1/alpha observations, regardless of how many observations there are in total (unless there are fewer than 1/alpha, in which case we put equal weight on all of them). So this matches exactly what you said.

Broke: “matrix multiplication”

Woke: “a two-layer MLP with orthogonally parameterized weight matrices, weight normalization, a linear activation function, and no bias”

the-moti:

nostalgebraist:

@the-moti

Today’s new OpenAI result on solving math Olympiad problems (blogpost, paper) seems relevant to our discussion a while back about using RL / self-play / AlphaGo-like techniques for math.

If I understand correctly, the core idea (section 4.5) is similar to AlphaGo, where you do tree search guided by a neural network, then train the network to imitate the results of the search, then search again with the updated network, etc.

To do this, you need a list of true statements to run the proof search on, and you need them to span a range of difficulties-to-prove. But formalized true statements are much easier to generate than formalized proofs, and there’s prior art on how to do this.

The impressive-sounding results on Olympiad problems required also training on some manually formalized Olympiad problems (section 6), and the resulting model still wasn’t very good at advanced math in any absolute sense (section 7.2), but this seems like an important proof of concept.

Thanks!

Are they really searching on a tree? For the expansion process they refer to the GPT-f paper, which describes sampling a bunch of tactics and evaluating them, but not going beyond depth 1.

It might be that this is not necessary if many steps are reversible or harmless. If the network suggests tactics A and B, evaluates the goal state after A as more promising than the goal state after B, tries A, and then generates a bunch of tactics before realizing that none of them achieve the potential it saw in A, it might not need to backtrack and undo A, since it can just do B at this point.

But maybe this produces in the training data a lot of proofs like the mathd_algebra_140 example on page 24 of the paper, which apparently begin with a bunch of steps that don’t do very much. Using such bloated proofs as material for the next stage of training can’t be helpful…

TBH, I only have a fuzzy understanding of the GPT-f paper, and proof assistants are not my area in the first place. That said…

They definitely do search a tree (Fig 1 in that paper), but not the same way that game-playing agents do.

In game-playing, the agent walks an entire tree “in its head,” then takes a single step “in the real world.” Then walks another entire tree, then takes another single step, etc.

Whereas here, we just walk the tree once, and then the tree just is the proof (if it contained the theorem).

The way I think about it is: since there’s no adversary, and the agent isn’t trying to predict the dynamics of an environment, we don’t have a clear distinction (which we have in games) between steps that are “part of the real trajectory” and steps that are “merely simulated.” You simply roll out the search tree and check if it contains the theorem or not.

If it does, it will likely contain many subtrees that are irrelevant to how the theorem was eventually proven. “Removing” these subtrees from some sort of “final copy” would be useful for readability, but it’s distinct from the sort of backtracking an agent would want to do in an environment where time and opportunity costs matter (making a useless move = giving your adversary a free move). In a sense, this removal step would happen “outside the game,” not affecting whether the proof worked and not freeing up space the agent could use for anything else.

(ETA: to be more explicit, the width axis of the tree is always whatever size the hyperparameters set it to be. It’s better to use the fixed space along this axis to do more useful things vs. less useful ones, but once it’s been used to assess a possible avenue, there’s no additional notion of “deciding whether or not to go there in reality.” If it’s been imagined, it’s been done.)

I could be wrong about what’s going on, though, I’m not very confident.

Anyway, all that’s about GPT-f – the new paper adds a part where the model learns to predict proof length (to reach a [sub]goal) during training and then uses this to prefer shorter proofs during inference. I imagine this helps generate more readable proofs, but again, the decision to do this happens “outside the game” (it redefines “the game” so that time matters internally, which was not true in GPT-f’s game).

loving-n0t-heyting asked:

Are we ever going to get an expodump on the relevant math about spinors etc for Almost Nowhere? Or, er, a reading list?

It depends on what you mean, but … I would guess probably not?

Almost Nowhere makes a lot of references to mathematical concepts. It also has some moments of undisguised math/physics pedagogy, aimed at the reader as much as the characters, a la Egan or Stephenson.

All of this is present for a reason – for various reasons – but the exact reasons vary from one instance to the next. The reasons often include

* technobabble verisimilitude: when discussing fictional science, the characters should sound like people discussing real science

* the nature of anomaling rhetoric: the anomalings see their own intuitions mirrored in the structure of reality itself, and often justify those intuitions by saying something like “that’s just how geometry is, so why would you expect otherwise.”

To make this legible, I have to explain that, yes, geometry (or whatever) is in fact like that

* the origins of anomaling rhetoric: the shared conventions used by the anomalings/shades, when they “speak” in English, derive from Azad trying to “translate” data handed to him by physicists in relatively undigested form – and from a period when both species would have been leaning heavily on mathematics as a source of shared reference points.

Everyone got crashed before this “creole” could mature beyond this phase, and it still retains some residues of its origins which freer, more sustained communication would have eventually pruned away

—-

The reader’s instinct for parsimony may lead them to imagine more unity of purpose than I intend. (And, probably, more depth of knowledge than I possess)

As if (to strawman a bit) there were some unpublished mathematical monograph sitting on my computer’s flash drive which is the key to everything, which all the references in the text are gesturing towards, and of which they are mere flickering cave-shadows.

My flash drive doesn’t contain that monograph, any more than it contains Salby’s full 4000-page TNC. More importantly, AN isn’t really about that hypothetical monograph, the way that my TNC is about Salby’s TNC.

Rather, AN is “about mathematics (and physics)” in roughly the same way that Floornight was.

Meaning: it’s a story about made-up science, where the made-up science was inspired by real math and physics concepts, and hence I have to explain the latter to get the former across, sometimes. Also, the characters often talk about the made-up science in a highly technical register, because that’s how scientists would talk if it were a real object of study.

In Floornight, a lot of the made-up science was inspired by MWI and philosophical issues related to it.

And, when I think about it, a reader with that background would probably have an easier time grasping a lot of things at first blush: that an “eigensoul decomposition” might be some way of writing a state in a basis of “eigensouls”; that, when this same term refers to a physical process, it might be something like decoherence into a preferred basis of minimally interacting “eigensouls”; that splitting a soul into such parts is a special case of splitting the whole universe into branches; that, in the implied kind of multiverse, it might be difficult to sensibly count entities/persons for the sake of assigning moral weight; that some physical notion of “branch size,” likely couched in terms of measure theory, might therefore make its way into discussions of human worth; etc, etc.

But if you don’t have these concepts going in to Floornight, the story breezily glosses them for you anyway. Anything much deeper than these glosses is not really useful for understanding the story itself, as opposed to understanding how I came up with it. Reading about eigenbases in QM will tell you where I got the phrase “eigensoul decomposition,” but (if I did my job correctly) all the relevant details should be clear from the text itself when read in full, and hence all the details not on the page should be irrelevant.

Likewise, with Almost Nowhere, a reader who knows the measure theoretic term in the title will probably have an intuition for how it might relate to discreteness and continua, which are main themes in the story. Insofar as this is important, it will be glossed in the text itself – but the glosses may look less like a math tutorial, and more like the brief discussion of related ideas in Floornight, which you might not even recognize as “mathematical” unless you already knew its inspiration.