argmin-gravitas-blog asked:
My friend made an unusual kind of Floornight fanfic: a data science puzzle! lesswrongdotcom/ posts/ NZvFostjy8gu5mEnc/ d-and-d-sci-fi-june-2021-the-duel-with-earwax
Ooh, thanks for letting me know!

Have you ever seen those concerning statistics about criminal recidivism? Like: 44% are re-arrested within a year, and 83% within nine years (source: this Department of Justice report).
I’d seen those statistics before, and been concerned. There’s a great case for shortening prison sentences for deterrence reasons, because likelihood of punishment is much more deterring than severity, but at least prison incapacitates criminals from plundering society while they’re imprisoned. Why hasten prison release if they’ll be back soon anyway? “Once a criminal, always a criminal?”, asks one headline about recidivism.
But today I learned that there’s a huge caveat to those statistics. The more often you go to prison, the more you’re counted in recidivism statistics.
Consider five people who go to prison. Four of them never commit another crime, but one of them was Criminal Georg, who is imprisoned ten times. Out of the fourteen prison sentences (ten for Georg, four for the others), nine of them are followed by recidivism (Georg’s first nine). The proportion of these people who are serial criminals is 20%, but the recidivism rate is 64%.
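Here’s the same arithmetic as a few lines of Python, counting a release as “followed by recidivism” if that person is imprisoned again afterwards:

```python
# The Criminal Georg example from above: number of prison sentences per person.
sentences_per_person = [10, 1, 1, 1, 1]  # Georg, plus four people who never return

people = len(sentences_per_person)
releases = sum(sentences_per_person)
# Every sentence except a person's last one is followed by another imprisonment.
recidivist_releases = sum(n - 1 for n in sentences_per_person)

print(f"share of people who reoffend: {1 / people:.0%}")                      # 20%
print(f"recidivism rate per release:  {recidivist_releases / releases:.0%}")  # 64%
```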
When considering people rather than prison releases, the recidivism rate is lower than I thought.
Thank you, I hadn’t seen it and it’s a great resource!
I knew I’d seen this pattern before, but I didn’t have a name for it. The linked post by Elizabeth Wrigley-Field tells me it’s called “length-biased sampling.”
The post mentions several examples with real-world importance, including the recidivism one, and argues the concept should be more widely known.
(It also makes an argument that “length-biased sampling is the deep structure of nested categories” which sounds interesting but which I am not awake enough rn to wrap my head around)
Ah! I know what that recidivism post reminded me of.
When you’re prompting GPT-2, putting an end-of-text separator at the start of your prompt will (all else being equal) bias the model toward shorter documents.
But, just as in the recidivism case, it doesn’t sound prima facie obvious the first time you hear the claim. You have to think about it first, and only then does it seem obvious.
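Here’s a tiny simulation (mine, just to convince myself it really is the same pattern): compare the length of the document containing a uniformly random token in the training stream with the length of a uniformly random document, which is roughly what conditioning on “right after an end-of-text token” gives you.

```python
# Toy corpus: lots of short documents and lots of long ones.
import random

random.seed(0)
doc_lengths = [random.choice([10, 1000]) for _ in range(10_000)]

# (a) Pick a random token position in the concatenated stream: length-biased,
#     so the containing document is almost always a long one.
stream = [length for length in doc_lengths for _ in range(length)]
print(sum(random.choices(stream, k=100_000)) / 100_000)       # roughly 990

# (b) Pick a random document start (what an end-of-text separator implies):
#     short documents now get their full per-document weight.
print(sum(random.choices(doc_lengths, k=100_000)) / 100_000)  # roughly 505
```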
ETA: apparently this is called “length-biased sampling”
I spent a long while composing this lw comment criticizing some academic papers and ended up feeling kinda drained and like it was a waste of time.
I’m linking it here to make myself feel better by increasing the probability that someone will read it and derive some kind of value from it. (cw: math, neural nets)
For neural nets, I grudgingly use the Adam optimizer like everyone else.
It’s widely available and commonly used, and clearly works better in practice than any other optimizer with those properties. “No one ever got fired for choosing Adam,” so to speak.
But its lack of coordinate independence really bothers me. You write your model down in some arbitrary set of coordinates, and then Adam (effectively) approximates your function as locally quadratic, with your coordinates as the principal axes.
It doesn’t compute any estimate of how good or bad this approximation is, it just adopts it and sticks with it.
This is particularly weird when you’re just writing down matrices and initializing them with a coordinate independent distribution. The arbitrary basis I use to keep track of these matrices in my computer’s memory has no role in the mathematical problem, but it affects the optimizer!
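Here’s a tiny numpy sketch of what I mean (my own illustration, not from any paper): plain gradient descent commutes with an orthogonal change of basis, while an Adam-style per-coordinate rescaling (roughly what Adam’s first step does) does not.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_step(g, lr=0.1):
    return -lr * g

def adam_like_step(g, lr=0.1, eps=1e-8):
    # On its first (bias-corrected) step, Adam divides each gradient coordinate
    # by the square root of its own square, which is roughly lr * sign(g).
    return -lr * g / (np.sqrt(g ** 2) + eps)

g = rng.normal(size=5)                        # a gradient in the original basis
R, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # a random orthogonal change of basis

print(np.allclose(R @ sgd_step(g), sgd_step(R @ g)))              # True: basis-independent
print(np.allclose(R @ adam_like_step(g), adam_like_step(R @ g)))  # False: the basis matters
```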
Perhaps this problem is not so bad with neural nets, because they tend to contain frequent coordinate-wise operations (activation functions). This tends to “snap things into place,” aligning the otherwise arbitrary basis used for computer memory with the non-arbitrary basis picked out by the coordinate-wise operations. Thus the Hessian has a reason to become “axis-aligned,” cf. this useful paper.
And where there isn’t a functional role for a coordinate-wise operation, people often add one anyway to “help optimization” (batch norm, layer norm). This is often discussed as “whitening” (making outputs more like white noise), but it’s clearer in my head if I think about it as trying to snap all the remaining arbitrary vector bases in the problem into alignment with the existing preferred bases.
However, in modern architectures, there are some places that have neither an activation nor an artificially imposed whitening step.
In one part of the transformer, you take a vector v and multiply it by a matrix Q to form the “queries,” and also by a matrix K to form the “keys.” The output depends only on the dot product of Q*v and K*v, which is coordinate independent. So nothing in the problem cares about your basis for expressing Q and K in computer memory, but Adam still cares.
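A quick numpy check of that (a toy I’m adding, with single-vector queries and keys, ignoring the usual scaling and softmax): rotating the query/key space by any orthogonal matrix leaves every attention score unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 8, 4

Q = rng.normal(size=(d_head, d_model))   # query projection
K = rng.normal(size=(d_head, d_model))   # key projection
v1, v2 = rng.normal(size=d_model), rng.normal(size=d_model)

R, _ = np.linalg.qr(rng.normal(size=(d_head, d_head)))  # orthogonal change of basis

score_before = (Q @ v1) @ (K @ v2)
score_after = ((R @ Q) @ v1) @ ((R @ K) @ v2)
print(np.allclose(score_before, score_after))  # True: the model can't see this basis, but Adam can
```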
(Going back to the reasons for axis-alignment – it’s also possible that when you start with coordinate invariant random matrices, but then optimize with Adam, the updates will push your parameters into a region where the Hessian is axis-aligned along your arbitrary axes. I don’t know why this would be true, if it were true, but it seems possible.
This would be like picking a random vector at initialization, and then only exploring a subset of the solution space, where the subset is somehow parameterized by the vector.)
Maybe someday I’ll switch to natural gradients just so I can sleep better at night …
I saw a hyped-up science news article about this paper and got briefly nerd sniped trying to figure out what was going on. I still don’t know.
Both the news article and the paper itself make it sound like this is some … fully general neural net approach for solving PDEs, that works for any PDE, and is fast and accurate … and doesn’t even need to know the PDE, it just learns it from solution data, and then after you “train” it on one solution it knows the PDE, and can produce other solutions.
And I’m like, that can’t be real, right? You can’t learn an infinite-dimensional operator from a finite sample. They must be choosing to prefer some operators over others, all else being equal. Also, what does this look like formally as a statistics problem, what measure are the operators being sampled from …
I found this near-contemporaneous paper which goes into more detail, but still doesn’t resolve my confusion. I also found this paper, cited as a competing method (although it shares several co-authors), which goes into much more mathematical detail and proves a general approximation theorem.
I don’t have the energy and interest to read through all these and figure out exactly what they’re actually doing, especially since I suspect it’s not that interesting.
(If PDEs are “generically learnable from finite samples under reasonable conditions” in some non-trivial way, that’s very interesting, but seems like something one could discover with pen and paper, and then go on to win prizes for discovering, without even needing a computer. I wouldn’t expect such a discovery to look like these papers.)
But if anyone else feels like reading these papers, let me know what you find out!
I’ve been reading the Li papers. They are pretty standard incremental-progress-in-NN stuff. True, finding optimal infinite-dimensional operators is even nastier than finding finite-dimensional maps, in principle. But practically, as in much NN research, we are far from the regime of proving ultimate awesomeness and actually in the regime of going “oh cool this architecture works surprisingly well in practice”. The architecture in this case is a frankenstein mashup of kernel-learning/basis-function decomposition smashed onto (1,1) convolutions, which has an efficient implementation in practice. As for what types of PDEs this biases us towards, the jury is still out, but heuristically, some low-frequency waves plus some wiggly bits at each layer sounds a lot like it might encode classic PDE solvers.
If you are prepared to give up the resolution-“independence” of the Li papers, there is a classic series of papers by Haber and Ruthotto that develop a sort-of duality between ResNets and PDE solvers, which claim that you can translate between the two viewpoints and between different resolutions, sort of: http://arxiv.org/abs/1703.02009 and http://arxiv.org/abs/1804.04272 are a good entry point.
Practically, I quite like the Li papers and will probably use them, especially if I can find a Bayesian interpretation, but I don’t think there is anything Next Level here, just a Sweet-Hack. I blogged some more on this theme.
Thanks!
But practically, as in much NN research, we are far from the regime of proving ultimate awesomeness and actually in the regime of going “oh cool this architecture works surprisingly well in practice”.
I’m fine with this line of reasoning for most NN applications, but I don’t think it makes sense for PDE solving.
Consider some properties of more typical NN applications:
1. The problem is inherently statistical – the goal is to estimate a conditional p(y|X) or a joint p(y, X) from data. We would frame the problem this way even if we weren’t using NNs, and NNs are just a particular estimator.
2. We don’t “know the physics.” That is, our prior knowledge doesn’t put many constraints on the answer (i.e. p(y|X) or p(y, X)).
3. In terms of accuracy (ignoring speed), NNs empirically seem to perform better than other known methods. There’s no “gold standard” method to assess NNs against, or rather, NNs are the gold standard.
4. We don’t have proofs about accuracy/convergence/etc. for NNs … but we usually don’t have them for other competing methods, either.
These considerations make the heuristic argument convincing.
The problem is already statistical, so there’s no risk of “distorting it” by putting it in the statistical form NNs require. Since we don’t know the physics, we aren’t worried that the generic, opaque function classes learned by NNs will fail to respect the physics. NNs are no worse than other methods in terms of proofs/guarantees, and they are better than other methods in their accuracy.
None of these are true in the case of PDE solving:
1. “PDE solving” usually means solving a known PDE with known data – not a statistical problem.
When the data is noisy, that’s the statistical problem of “filtering” (as in Kalman filters), but that is not what the Li papers are doing. They’re treating the data and the PDE as noisy. This is not really what most people solving PDEs are trying to do, and I’m not yet convinced that it’s even a well-defined problem, much less one I want to solve.
2. Again, most “PDE solving” has the PDE already known. So I care whether the NN’s function class can express the actual PDE I know I’m solving – that’s the whole point.
Whether or not we know the PDE, we generally have prior constraints like conservation laws, and it’s desirable for a solver to respect these. I’m not convinced the Li approach even could handle conservation laws appropriately, since it appears to treat the PDE as fundamental (rather than the associated integral equation). I.e. it views solutions as functions and discretization as evaluation at points, rather than viewing solutions as densities and discretization as integration over volume elements.
3. We have gold-standard methods (standard PDE solvers) and there’s no evidence the NNs do better than these. The claim is that they’re faster, which is very different from the usual claim in favor of NNs.
4. The gold-standard methods all have known rates of convergence; that’s what it normally means to say you’ve created a PDE solver. Thus, the lack of guarantees about NNs is a disadvantage here, unlike elsewhere.
—-
Other comments:
The method is very complicated to write down (“a frankenstein mashup” as you put it). I’m not convinced the authors have successfully motivated something so complicated, given that their problem definition is unusual and the points in favor of their method are a few numerical experiments.
In your blog post, shouldn’t “each Vt is a map V→U” be “each Vt is a map V→V”?
(via howthebodyworks)
I’ve now read the new OpenAI scaling laws paper. Also, yesterday I attended a fun and informative lecture/discussion with one of the authors.
While the topic is on my mind, I should probably jot down some of my thoughts.
This post is mostly about what the new paper says about the “inconsistency” brought up in their previous paper.
The new paper has a new argument on this topic, which is intuitive and appealing, and suggests that the current scaling trend will indeed “switch over” soon to a new one where dataset size, not model size, is the active constraint on performance. Most of this post is an attempt to explain and better understand this argument.
——
The new paper is mainly about extending the scaling laws from their earlier paper to new modalities.
In that paper, they found scaling laws for transformers trained autoregressively on text data. The new paper finds the same patterns in the scaling behavior of transformers trained autoregressively on images, math problems, etc.
So the laws aren’t telling us something about the distribution of text data, but about something more fundamental. That’s cool.
They also have a new, very intuitive hypothesis for what’s going on with the “scaling inconsistency” they described in the previous paper – the one I made a big deal about at the time. So that’s the part I’m most excited to discuss.
I’m going to give a long explanation of it, way longer than the relevant part of their paper. Some of this is original to me, all errors are mine, all the usual caveats.
——
To recap: the “inconsistency” is between two scaling laws:
- L(D), the best loss any model (however big, trained however long) could achieve on a dataset of D points
- L(C), the best loss achievable with a compute budget C, allocated optimally
Once you reach a certain level of compute, these two laws contradict each other.
I’ll take some time to unpack that here, as it’s not immediately obvious the two can even be compared to one another – one is a function of compute, the other of data.
Budget tradeoffs
Given a compute budget C, you can derive the optimal way to spend it on different things. Roughly, you are trading off between two ways to spend compute:
- making the model bigger (more parameters, N)
- training for more steps (S)
The relationship between S (steps) and D (dataset size) is a little subtle, for several reasons.
From step count to update count
For one thing, each single “step” is an update on the information from more than one data point. Specifically, a step updates on “B” different points – B is the batch size.
So the total number of data points processed during training is B times S. The papers sometimes call this quantity “E” (number of examples), so I’ll call it that too.
From update count to data count
Now, when you train an ML model, you usually update on each data point more than once. Typically, you’ll do one pass over the full dataset (updating on each point as you go along), then you’ll go back and do a second full pass, and then a third, etc. These passes are called “epochs.”
If you’re doing things this way, then for every point in the data, you get (number of epochs) updates out of it. So
E = (number of epochs) * D.
Some training routines don’t visit every point the exact same number of times – there’s nothing forcing you to do that. Still, for any training procedure, we can look at the quantity E / D.
This would be the number of epochs, if you’re doing epochs. For a generic training routine, you can think of E / D as the “effective number of epochs”: the average number of times we visit each point, which may not be an integer.
Generally, E ≠ D, but we always have E ≥ D. You can’t do fewer than one epoch; you can’t visit the average point less than once.
This is just a matter of definitions – it’s what “dataset size” means. If you say you’re training on a million examples, but you only update on 100 individual examples, then you simply aren’t “training on a million examples.”
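As a toy example of the bookkeeping (the numbers here are invented for illustration, not taken from the papers):

```python
B = 512             # batch size
S = 100_000         # optimizer steps
D = 10_000_000      # distinct examples in the dataset

E = B * S                  # examples processed during training = 51,200,000
effective_epochs = E / D   # 5.12 "effective epochs"
print(E, effective_epochs)
```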
L(D): information
OpenAI derives a scaling law called L(D). This law is the best you could possibly do – even with arbitrarily large compute/models – if you are only allowed to train on D data points.
No matter how good your model is, there is only so much it can learn from a finite sample. L(D) quantifies this intuitive fact (if the model is an autoregressive transformer).
L(C): budgeting
OpenAI also derives another scaling law called L(C). This is the best you can do with compute C, if you spend it optimally.
What does optimal spending look like? Remember, you can spend a unit of compute on
- a bigger model (N), or
- more training steps (S).
(Sidenote: you can also spend on bigger batches B. But – to simplify a long, complicated story – it turns out that there are really just 2 independent knobs to tune among the 3 variables (B, N, S), and OpenAI frames the problem as tuning (N, S) with B already “factored out.”)
In the compute regime we are currently in, making the model bigger is way more effective than taking more steps.
This was one of the punchlines of the first of these two papers: the usual strategy, where you pick a model and then train it until it’s as good as it can get, is actually a suboptimal use of compute. If you have enough compute to train the model for that long (“until convergence”), then you have enough compute to train a bigger model for fewer steps, and that is a better choice.
This is kind of counterintuitive! It means that you should stop training your model before it stops getting better. (“Early stopping” means training your model until it stops getting better, so this is sort of “extra-early stopping.”) It’s not that those extra steps wouldn’t help – it’s that, if you are capable of doing them, then you are also capable of doing something else that is better.
Here’s something cool: in Appendix B.2 of the first paper, they actually quantify exactly how much performance you should sacrifice this way. Turns out you should always stop at a test loss about 10% higher than what your model could asymptotically achieve. (This will be relevant later, BTW.)
Anyway, OpenAI derives the optimal way to manage the tradeoff between N and S. Using this optimal plan, you can derive L(C) – the test loss you can achieve with compute C, if you allocate it optimally.
N goes up fast, S goes up slowly…
The optimal plan spends most incremental units of compute on bigger models (N). It spends very little on more steps (S).
The amount it spends on batch size (B) is somewhere in between, but still small enough that the product E = B*S grows slowly.
But remember, we know a relationship between E and “D,” dataset size. E can’t possibly be smaller than D.
So when your optimal plan chooses its B and its S, it has expressed an opinion about how big its training dataset is.
The dataset could be smaller than B*S, if we’re doing many (effective) epochs over it. But it can’t be any bigger than B*S: you can’t do fewer than one epoch.
… and you claim to achieve the impossible
L(C), the loss with optimally allocated C, goes down very quickly as C grows. Meanwhile, the dataset you’re training with that compute stays almost the same size.
But there’s a minimum loss, L(D), you can possibly achieve with D data points.
The compute-optimal plan claims “by training on at most B*S data points, with model size N, I can achieve loss L(C).”
The information bound says “if you train on at most B*S data points, your loss can’t get any lower than the function L(D), evaluated at D = B*S.”
Eventually, with enough compute, the L(C) of the compute-optimal plan is lower than the L(D) of the dataset used by that same plan.
That is, even if the compute-optimal model is only training for a single epoch, it is claiming to extract more value from that epoch than any model could ever achieve, given any number of epochs.
That’s the inconsistency.
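To compress that into symbols (my notation; both laws are power laws in the papers, and I’m leaving the fitted constants out): the compute-optimal plan trains on at most B(C)*S(C) data points, which grows only slowly with C, while its claimed loss L(C) falls quickly. The contradiction is that, for large enough C,

$$
L(C) \;<\; L(D)\Big|_{D = B(C)\,S(C)} ,
$$

i.e. the plan claims to beat the information bound of the very dataset it says it will train on.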
In the new paper, there’s an intuitive hypothesis for what’s going on here. I don’t think it really needs the multimodal results to motivate it – it’s a hypothesis that could have been conceived earlier on, but just wasn’t.
Bigger models extract a resource faster
The idea is this. As models get bigger, they get more update-efficient: each time they update on a data point, they get more out of it. You have to train them for fewer (effective) epochs, all else being equal.
This fact drives the choice to scale up the model, rather than scaling up steps. Scaling up the model makes your steps more valuable, so when you choose to scale the model rather than the steps, it’s almost like you’re getting more steps anyway. (More “step-power,” or something.)
The resource is finite
Each data point has some information which a model can learn from it. Finite models, trained for a finite amount of time, will miss out on some of this information.
You can think about the total extractable information in a data point by thinking about what an infinitely big model, trained forever, would eventually learn from that point. It would extract all the information – which is more than a lesser model could extract, but still finite. (A single data point doesn’t contain all the information in the universe.)
This is literally the definition of L(D): what an infinitely big model, trained forever, could learn from D separate data points. L(D) quantifies the total extractable information of those points.
(More precisely, the total extractable information is the gap between L(D) and the loss achieved by a maximally ignorant model, or something like that.)
Converging in the very first step
As models get bigger, they extract more information per update. That is, each time they see a data point, they extract a larger fraction of its total extractable information.
Eventually, your models are getting most of that information the very first time they see the data point. The “most” in that sentence gets closer and closer to 100%, asymptotically.
How does this relate to optimal compute allocation?
The logic of the “optimal compute plan” is as follows:
Your model is an imperfect resource extractor: it only gets some of the resources locked up in a data point from the first update. So you could extract more by running for more steps …
… but if you have the compute for that, you can also spend it by making your steps more efficient. And, in the current compute regime, that’s the smarter choice.
It’s smarter by a specific, uniform proportion. Remember, you should stop training when your loss is 10% higher than the converged loss of the same model. If the converged loss is L, you should stop at 1.1*L.
Can you always do that? If your model is efficient enough, you can’t! As the first epoch gets closer to 100% efficient, the loss after the first epoch gets arbitrarily close to the converged loss. Your loss goes under 1.1*L by the end of the first epoch.
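In symbols (my shorthand, writing L∞(N) for the converged loss of a size-N model and L_1ep(N) for its loss after a single pass over the data): the “stop at 10% above convergence” rule has nothing left to prescribe once

$$
L_{\mathrm{1ep}}(N) \;<\; 1.1 \, L_{\infty}(N) .
$$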
At this point, the story justifying the L(C) law breaks down.
The L(C) law drops as fast as it does because upgrading the efficiency of your extractor is cheaper – in terms of compute spent per unit of resource extracted – than actually running the extractor.
This works as long as your extractor is inefficient. But you can’t push efficiency above 100%. Eventually, the only way to extract more is to actually run the damn thing.
Getting a bigger quarry
When you’re extracting a resource, there’s a difference between “improve the extractor” and “get a bigger quarry.”
If your quarry has 100 resource units in it, the strategy of “improving the extractor” can never get you more than 100 units. It can get them to you faster, but if you want more than 100 units, you have to get a bigger quarry.
“N” sets the efficiency of the extractor. “S” sets … well, it doesn’t exactly set the size of the quarry (that’s D). There is an ambiguity in S: it could mean running for more epochs on the same data, or it could mean getting more data.
But S does, at least, set an upper bound on the size of the quarry, D. (Via D≤E and E = B*S, with B set optimally as always.)
With high enough compute (and thus model size), you’ve pushed the “extractor upgrades are cheap” lifehack as far as it can go. With this efficient extractor, taking S steps (thus making E = B*S updates) sucks up most of the information theoretically extractable from E individual data points.
The learning curve L(E) of your model, as it makes its first pass over the dataset, starts to merge with L(D), the theoretical optimum achievable with that same dataset. You trace out L(D) as you train, and the relevant constraint on your performance is the maximum data size D you can obtain and train on.
Where we are now
In the compute regime that spans GPT-2 and the smaller variants of GPT-3, extraction is far less than maximally efficient. The L(C) strategy applies, and the smart move is to spend compute mostly on model size. So you make GPT-2, and then GPT-3.
Once we get to the full GPT-3, though, the extractor is efficient enough that the justification for L(C) has broken down, and the learning curve L(E) over the first epoch looks like L(D).
Here is that as a picture, from the new paper:

The yellowest, lowest learning curve is the full GPT-3. (The biggest GPT-2 is one of the green-ish lines.) The black line is L(D), maximally efficient extraction.
You can see the whole story in this picture. If you’re in one of the smaller-model learning curves, running for more steps on more data will get you nowhere near to the total extractable info in that data. It’s a better use of your compute to move downwards, toward the learning curve of a bigger model. That’s the L(C) story.
If the L(C) story went on forever, the curves would get steeper and steeper. Somewhere a little beyond GPT-3, they would be steeper than L(D). They would cross L(D), and we’d be learning more than L(D) says is theoretically present in the data.
According to the story above, that won’t happen. We’ll just converge ever closer to L(D). To push loss further downward, we need more data.
Since people are talking about bitter lessons a lot these days, I should make the following explicit: none of this means “the scaling hypothesis is false,” or anything like that.
It just suggests the relevant variable to scale with compute will switch: we’ll spend less of our marginal compute on bigger models, and more of it on bigger data.
That said, if the above is true (which it may not be), it does suggest that scaling transformers on text alone will not continue productively much past GPT-3.
The GPT-3 paper says its choices were guided by the “grow N, not S” heuristic behind the L(C) curve:
Based on the analysis in Scaling Laws For Neural Language Models [KMH+20] we train much larger models on many fewer tokens than is typical.
(“KMH+20” is the first of the two scaling papers discussed here.) Even following this heuristic, they still picked a huge dataset, by human standards for text datasets.
In the above terms, their “E” was 300 billion tokens and their “D” was ~238 billion tokens, since they updated multiple times on some tokens (cf. Table 2.2 in the GPT-3 paper). The whole of Common Crawl is 410 billion tokens, and Common Crawl might as well be “all the text in the universe” from the vantage point of you and me.
So, there’s room to scale D up somewhat further than they did with GPT-3, but not many orders of magnitude more. To me, this suggests that an intuitively “smarter” GPT-4 would need to get its smartness from being multimodal, as we really can’t go much further with just text.
[post about high-context trivia]
For practical reasons, I’ve been reading papers recently about minor architectural details in transformers.
People mostly vary these things to make training more stable, rather than for final performance, which barely cares about the architecture (e.g. you can do GPT-2 with only 6 layers, maybe even only 2, if you make it wider to compensate).
Here’s an example paper that cites a lot of the others. These papers are mostly about the placement and function of the layer norm operations – for example it helps a lot if you move them so they don’t block the residual connections from working as intended (“pre-norm”), which they did in the original transformer (“post-norm”).
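As a quick schematic of the two placements (my own sketch, with f standing for the attention or MLP sublayer and ln for layer norm):

```python
def post_norm_block(x, f, ln):
    # Original-transformer placement: the norm sits on the residual path itself.
    return ln(x + f(x))

def pre_norm_block(x, f, ln):
    # "Pre-norm": the identity path from x to the output is left untouched.
    return x + f(ln(x))
```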
This made me think about layer norm again, which had always bothered me, because it’s not coordinate invariant! I had figured “oh it probably doesn’t matter,” but apparently you get better performance if you remove the part that is not coordinate invariant (“RMSNorm” and “ScaleNorm”), so maybe the lack of coordinate invariance really is harmful.
Layer norm is weird and I don’t understand why it got off the ground in the first place. It’s an operation that takes in a vector, subtracts off its “mean,” and then rescales the result to have unit variance across its components. What is the “mean” of a vector? Well, it’s the mean in whatever basis your computer happens to be using.
This might be less bad if you imagine it being applied right after the activation function, which selects a particular basis anyway (and layer norm would operate in that basis). However, in transformers it’s applied after embedding and projection steps that have no preferred basis.
When you think about what this actually does, it seems pointless? Subtracting “the mean” is equivalent to choosing some direction and projecting out that component. So, after layer norm your N-dim vectors will always live in an (N-1)-dim subspace; otherwise everything’s the same, so it’s similar to reducing your hidden size by 1. (Though not exactly the same.) I don’t see how this would stabilize anything.
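Here’s a minimal numpy sketch of that (just the core operation; the learned per-component gain discussed next, and the bias, would be applied afterwards), plus a check of the subspace claim:

```python
import numpy as np

def layer_norm_core(x, eps=1e-5):
    xc = x - x.mean()                            # "subtract the mean": project out the all-ones direction
    return xc / np.sqrt((xc ** 2).mean() + eps)  # rescale to unit variance across components

rng = np.random.default_rng(0)
x = rng.normal(size=8)
y = layer_norm_core(x)

# The output always sums to zero, i.e. it lives in the (N-1)-dim subspace
# orthogonal to (1, ..., 1).
print(np.isclose(y.sum(), 0.0))  # True
```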
Layer norm also does another thing in the preferred basis later, multiplying each component by a learned “gain.” Not sure what this accomplishes.
The authors of the original layer norm paper try to justify it using information geometry (!) … I don’t know what to make of talk about Riemannian manifolds and metrics when you haven’t written a coordinate-independent function to begin with.
When used properly in transformers (“pre-norm”), it gets applied to the input of each residual branch, i.e. when we compute x + f(x) we change it to x + f(LN(x)). Among other things, this means there’s this one component of the input which nothing can see, but which is preserved all the way to the output through the identity branch. In GPT specifically there’s another layer norm at the end, which will delete this component, so it just does nothing. In other cases, it will affect the output logits, but the input is a learned embedding vector anyway, so this can’t matter much.
@stumpyjoepete replied to your post “I don’t think (?) I’ve said this before, even though it seems…”
Is there a reason they’ve been so successful at apparently hard problems with this technique? I wouldn’t generally expect that “apply wholly generic optimization” would ever lead to advances in the state of the art of anything. So was the secret sauce actually elsewhere in what they did, and the RL was just a boring part people latched onto? If so, what was it?
Good question. First off, two things that are important here:
1. Again, RL isn’t a technique, it’s a problem formulation. Some problem domains are inherently hard to formulate in any terms less generic than RL, so in these domains, any machine-learned/statistical approach will look like “RL.”
This exerts a conditioning/selection effect on the comparisons we make. The impressive results demonstrated for DeepMind’s series of game-players (AlphaGo, AlphaGo Zero, AlphaZero, MuZero) were “beating top humans” and “beating top non-ML programs that use hardcoded rules/functions.”
There is no slot there for “beating ML programs that didn’t ‘use RL’,” because if you make the usual reductions away from RL in this domain, you have to accept prohibitive limits on training data size (see below).
2. There is a distinction between “doing RL” and “applying wholly generic optimization.” What makes something RL is the fully generic problem statement, but the technique / model architecture can be as specialized as you like.
In the last part of my post, I critiqued work on domain-general RL, because that work can’t specialize either the model or the problem description, so it really is “wholly generic optimization.” But in actual applications like the DeepMind game-players, you phrase the problem as a wholly generic “do a thing well” but then construct a thing-doer in full awareness of what specific thing you’re trying to do.
(DeepMind’s players have successfully removed more and more of the baked-in domain knowledge while still performing well, with their latest one – MuZero – being pretty generic across the domain of transparently scored games with 2D spatial states and sufficiently non-huge [?] action spaces, but that’s still far away from “do a generic thing.”)
—-
I’ve said the domains where the SOTA looks like RL are the domains where statistical learning cannot be put in a simpler form than RL. Which are these?
My impression is that “doing RL” has led to impressive SOTA results mostly in board/computer games. (This may be out of date – I think it was at least true in 2018.)
So, what’s special about games?
Relevant features of the problem defn. for games
Objective evaluation of quality happens at the full game level (win/loss or total points), and a game comprises many successive moves.
This is the big thing that makes this inherently an “RL” domain. In some domains, there is a natural, objective quality metric for single actions – for example, in language modeling, the task is “predict the next word/token,” there is always a correct answer (the true next word/token), and there isn’t some other real metric like “winning the game” for which this is a mere proxy.
In a game, we can invent move-quality metrics, like predicting the next move of a skilled player, but these are proxies. The true, objective definition of “a good chess move” is one that wins games, whatever that means, period.
Any program has to pick its moves one by one, so it has some (at least implicit) function for scoring moves. Either this function is hardcoded (so not statistical learning), or it’s learned from a proxy (like imitating skilled players), or it’s learned from the true quality metric (this is RL).
So, in statistical learning, we either optimize a move-level proxy or optimize at game-level. The statement that “RL works for games” = the statement that the latter is superior.
Relevant facts about data generation for games
We can optimize at the move level or at the game level. The latter matches what we actually care about, but is extremely inefficient:
- An entire board game, played to the end, gives us a single bit of signal (did we win?)
- And, even this is not a direct signal about what the move-quality metric ought to be, but an indirect signal about all the moves at once. We must (in some sense) statistically learn an attribution function that decides what the win/loss implies for individual moves. Such a function could look many different ways, and we must spend many bits of information setting its parameters above and beyond the ones we spend setting the parameters of the move-scoring function.
But in games, you can be inefficient as long as you’re only playing against computers. It’s cheap to generate enormous amounts of example data with gold-standard scores attached, by just playing the game inside the computer. This allows training on arbitrary numbers of examples, limited only by compute.
Meanwhile, if you want to train on a move-quality signal, you must use data from human players – and at high skill level, there’s only a finite and tiny quantity of that. So we’re comparing an efficient method on a finite resource to an inefficient method on a resource only bounded by compute. As compute grows, eventually the latter wins.
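A back-of-envelope toy of that comparison (every number below is invented purely for illustration; the only point is which quantity grows):

```python
import math

expert_games = 1e6              # finite corpus of high-skill human games (made up)
moves_per_game = 80
bits_per_move = math.log2(50)   # crude: log2 of a typical branching factor

fixed_supervised_bits = expert_games * moves_per_game * bits_per_move

for selfplay_games in [1e6, 1e8, 1e10]:   # grows as compute grows
    selfplay_bits = selfplay_games * 1.0  # one win/loss bit per game
    print(f"{selfplay_games:.0e} self-play games: {selfplay_bits:.1e} bits "
          f"vs {fixed_supervised_bits:.1e} supervised bits, fixed forever")
```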
Other facts making RL less infeasible for games
Via self-play, it’s possible to generate large amounts of adversarial data that probe the model’s current weaknesses. However good the program is, when it faces itself, it faces a worthy opponent. Thus we can avoid overfitting to incidental features of some single fixed environment, which is a big problem in other RL work.
Quality, although only defined at the game level, is unambiguous where it’s defined, so we don’t have misspecification/paperclip-maximizer issues, which is another big problem in other RL work.
—-
To conclude, the cases where the best solution looks like RL are cases where, roughly:
- There is no natural quality metric at the single-action level
- There is an unambiguous quality metric over larger chains of actions
- Our source of quality-scored action chains as training data is only limited by compute
- Some other properties that let you avoid common pitfalls
- The task is simple enough in terms of the size of the action space, complexity of the dynamics, etc. (No one knows exactly what “simple enough” means, but no one thinks that the DeepMind players won’t eventually break as you scale up the problem. For example, they’re finite-sized convnets with finite model capacity, and you can imagine an environment generated by a dynamics with so many parameters that you can run something of that size forwards on current hardware, but not backpropagate over it.)
It’s a narrow regime where – almost “perversely” – much more data becomes available when you formulate the problem in the least data-efficient manner, so that the former trend dominates the latter, and learning fewer bits per step still leaves you having learned more bits at the end. It’s cool that this regime includes a few tasks considered among the pinnacles of human achievement, but it’s still a narrow regime, with not much else of interest in it.
I really like this analysis!
I have some questions/comments/blathering.
1. There are a lot of domains that are like games in the sense that they have defined moves and win states that can be simulated on a computer at basically arbitrary speed, but humans don’t always know the best moves that lead to a win, and we have a limited data supply of good human moves. The most extreme example I can think of is formal theorem proving, where “winning” is proving the theorem, and “losing” is everything else.
These domains, IMO, do involve quite a lot of things that people care about. It seems that RL has not been as effective in these domains. Do you have a sense of why that is?
One possibility is that these domains are either so intrinsically hard that machine learning overall has made no progress, or, for easier tasks, that existing combinatorial optimization routines are strong enough that machine learning can’t beat them.
On the other hand, in game-like tasks, it seems that machine learning beats a pure algorithmic combinatorial optimization approach - perhaps because we don’t have good algorithms for adversarial settings.
It’s maybe an interesting data point that in some domains, the top machine learning algorithms are GANs, which basically are designed to take a task that humans would think of as totally unlike chess or go and treat it as a game.
2. For these games, an additional flaw in the human data, beyond the fact that it is limited in quantity, is that humans may just not play these games very well, all things considered! It’s easy to see how a pure supervised learning algorithm could become a little better than humans by imitating the top humans but avoiding blunders, but it’s hard to see how a pure supervised learning algorithm could become a lot better than the top humans. (Well, in chess, you could generate a bunch of Stockfish games and train on those, but then you would be unlikely to become much better than Stockfish, if at all.)
On the other hand RL bots do play better than humans, to the point that these are some of the only domains where humans have taken ideas generated by ML algorithms and applied them in their own tasks (!!!). Imagine if image recognition algorithms taught us a better way to look at a picture and figure out if it was a picture of a dog, or if GPT-4 came out and taught novelists new literary techniques!
3. It’s perhaps relevant that the striking success of the AlphaZero and MuZero algorithms in part comes from the fact that the approach is basically as unlike traditional reinforcement learning techniques as possible. In fact I told people that it wasn’t really reinforcement learning until I found out that reinforcement learning refers to a problem statement and not a class of techniques (it lacks the feature where actions that lead to success are reinforced…)
Instead you basically do supervised learning, trying to predict data (moves and game outcomes), which are generated by a combination of the algorithm itself, and a traditional combinatorial algorithm (Monte Carlo tree search) which you know has good mathematical properties.
So I don’t know how much this should be seen as the same kind of thing as e.g. AlphaStar, which to my knowledge uses much more crude “do more of the things we did in the games that we won” RL strategies, and which hasn’t (IIRC) developed strategies that humans have used.
4. Maybe the overall lesson is something we more-or-less already knew - machine learning algorithms are very, very hungry for data, and so if you want to apply a machine learning algorithm to a problem domain you should first figure out how to obtain or generate the most relevant data for these hungry, hungry boys, and then figure out a way to formulate a gradient descent process that uses that data, rather than deciding initially whether reinforcement learning or supervised learning is the best and then searching for the relevant type of data.
Interesting stuff, thanks!
Re: #1
I’m not too familiar with the area (of theorem proving), but I happened to bump into it when I was interested in graph NNs a while ago.
At that time, I remember finding this paper, with interesting results on SAT solving (I was mainly interested in it as an application of graph NNs). They treated the whole thing as supervised learning, though.
Looking around now, I found this paper which uses RL to train a graph NN that computes one specific heuristic used in an otherwise standard solver algorithm. Their section 5 has what looks like a good lit review of the area. (Outside of SAT, I see plenty of papers when I search for theorem proving and RL, but don’t feel confident opining on them…)
Anyway, here are some random comments on this:
- It seems possible that the similarity you mention (between proving and games) really does mean these approaches will go far? Maybe “AlphaMath” or whatever is just a year or two of routine work away.
- A mathematically “correct” (i.e. invariance-respecting) input encoding for math stuff requires newer NN architectures, with less prior art / tooling to build on. Terms in a formula are permutation invariant, and people want an encoding that captures that intrinsically, hence the use of graph NNs.
In domains like board games where your elements have an order, it feels “less bad” to use CNNs or RNNs or whatever, and then you can build on tons of past work with those. (The DeepMind players use CNNs.)
Two caveats to that, though. First, DeepMind’s players have gotten less careful about invariances (they stopped doing data augmentation for rotation/reflection symmetry in AlphaZero, and have used the same CNN they “designed” with Go in mind for an increasing range of games). So maybe this issue just doesn’t matter so much.
Second, if humans understand formulas during proving by repurposing linguistic faculties, then our own encoding is “wrong” in the same way a CNN/RNN’s would be. So that’s at least a bound on how much this issue could hurt.
- Some of the work on SAT is structured like the DeepMind players, where you have a “traditional” search algorithm, but with an NN supplying the heuristics for how promising different search avenues are. This gives you various freedoms: which search algorithm, which parts the NN computes, etc. Researchers are doing a meta-search over these options, and it may take time to find the best one.
- Our standards may just be higher for proving than for games. Games are generally timed, while proofs generally aren’t, so proofs are really closer to solving chess problems than playing chess.
- Putting that another way, in a game you only have to do better than the adversary, who is doing a lossy search just like you; there presumably are vastly better moves in the search space that neither of you can find, but they don’t matter for beating the adversary.
I think this also provides a certain helpful continuity when learning move scoring: during self-play, you face an adversary about as strong as yourself, so your win/loss signal is pretty balanced and tells you how to slightly modify yourself to be better. In math, to get better, you need to find problems just at the edge of your capacity so that the signal isn’t just an unhelpful string of wins (too easy) or losses (too hard); in games, self-play finds this regime automatically. Perhaps, a la your GAN comment, we need ways to make the proving domain even more like a game.
Re #3
I’m not sure I understand what you mean?
Definitely it’s different from conventional RL because search is used in the outer loop, and because of self-play.
Also, except in MuZero for Atari, there’s only one reward per episode, so time discounting isn’t a thing. We’re not managing explore/exploit within episodes, but just trying to judge how good different moves are based on win/loss statistics, which is what any ML approach to the problem would have to do.
Also also, the loss doesn’t directly say “learn a policy that maximizes expected discounted reward,” it says “learn a policy that imitates a smarter version of yourself (one who can use search)” and “learn a value function that captures expected reward,” and then combines these in search scoring.
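Concretely, the loss in the AlphaGo Zero / AlphaZero papers (writing z for the game outcome, π for the search-improved move distribution, and p_θ, v_θ for the network’s policy and value heads) is

$$
\ell(\theta) \;=\; \big(z - v_\theta(s)\big)^2 \;-\; \pi^\top \log p_\theta(\cdot \mid s) \;+\; c \,\lVert \theta \rVert^2 ,
$$

which is exactly an “imitate what search did” term plus a “predict the outcome” term (plus weight decay).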
I think this is closest to what you were getting at? The learned policy will play well (“get rewards”) if used directly even without search (see Fig. 6 of the AGZ paper, see also discussion in this thread). But this “learned policy” has a convoluted relationship with the true behavioral policy; it’s trained to imitate what search would do, where search has access to itself and also the separate value function.
The presence of the value function means the “learned policy” isn’t even just “the raw policy that search augments,” it’s a smaller piece of the picture than that. The relationship between all the pieces is very complicated and self-referential.
Having thought about it while writing this section, it does seem like a mistake to group this in with traditional RL that uses Q-learning or whatever. Even if we say “RL is a problem formulation,” this stuff distinctively (!) doesn’t quite fit in that formulation, since (again, ignoring Atari) the environment dynamics is entirely self-generated, with no way to write down the environment without terms that refer to the agent. And the methods used are very different.
(They’re apparently very powerful methods, so all the more reason to give them some name and study them as such, instead of lumping them uselessly under “RL”…)