
Gubernatorial Primary Voting: Chaos Mode

the-moti:

nostalgebraist:

@stumpyjoepete replied to your post “I don’t think (?) I’ve said this before, even though it seems…”

Is there a reason they’ve been so successful at apparently hard problems with this technique? I wouldn’t generally expect that “apply wholly generic optimization” would ever lead to advances in the state of the art of anything. So was the secret sauce actually elsewhere in what they did, and the RL was just a boring part people latched onto? If so, what was it?

Good question.  First off, two things that are important here:

1. Again, RL isn’t a technique, it’s a problem formulation.  Some problem domains are inherently hard to formulate in any terms less generic than RL, so in these domains, any machine-learned/statistical approach will look like “RL.”

This exerts a conditioning/selection effect on the comparisons we make.  The impressive results demonstrated for DeepMind’s series of game-players (AlphaGo, AlphaGo Zero, AlphaZero, MuZero) were “beating top humans” and “beating top non-ML programs that use hardcoded rules/functions.”

There is no slot there for “beating ML programs that didn’t ‘use RL’,” because if you make the usual reductions away from RL in this domain, you have to accept prohibitive limits on training data size (see below).

2. There is a distinction between “doing RL” and “applying wholly generic optimization.”  What makes something RL is the fully generic problem statement, but the technique / model architecture can be as specialized as you like.

In the last part of my post, I critiqued work on domain-general RL, because that work can’t specialize either the model or the problem description, so it really is “wholly generic optimization.”  But in actual applications like the DeepMind game-players, you phrase the problem as a wholly generic “do a thing well” but then construct a thing-doer in full awareness of what specific thing you’re trying to do.

(DeepMind’s players have successfully removed more and more of the baked-in domain knowledge while still performing well, with their latest one – MuZero – being pretty generic across the domain of transparently scored games with 2D spatial states and sufficiently non-huge [?] action spaces, but that’s still far away from “do a generic thing.”)

—-

I’ve said the domains where the SOTA looks like RL are the domains where statistical learning cannot be put in a simpler form than RL.  Which are these?

My impression is that “doing RL” has led to impressive SOTA results mostly in board/computer games.  (This may be out of date – I think it was at least true in 2018.)

So, what’s special about games?

Relevant features of the problem defn. for games

Objective evaluation of quality happens at the full game level (win/loss or total points), and a game comprises many successive moves.

This is the big thing that makes this inherently an “RL” domain.  In some domains, there is a natural, objective quality metric for single actions – for example, in language modeling, the task is “predict the next word/token,” there is always a correct answer (the true next word/token), and there isn’t some other real metric like “winning the game” for which this is a mere proxy.

In a game, we can invent move-quality metrics, like predicting the next move of a skilled player, but these are proxies.  The true, objective definition of “a good chess move” is one that wins games, whatever that means, period.

Any program has to pick its moves one by one, so it has some (at least implicit) function for scoring moves.  Either this function is hardcoded (so not statistical learning), or it’s learned from a proxy (like imitating skilled players), or it’s learned from the true quality metric (this is RL).
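Those three sources of a move-scoring function can be sketched in toy form. This is purely illustrative: the “game,” the function names, and the data shapes here are all invented, not anyone’s actual system.

```python
# Toy sketch of the three places a move-scoring function can come from.
# A "move" is just an integer; none of this is a real game engine.

def hardcoded_score(move):
    # 1. Hand-written heuristic: no statistical learning involved.
    return -abs(move - 3)  # pretend the designer believes 3 is usually best

def fit_proxy_scorer(expert_moves):
    # 2. Learned from a proxy signal: imitate how often experts chose each move.
    counts = {}
    for m in expert_moves:
        counts[m] = counts.get(m, 0) + 1
    total = len(expert_moves)
    return lambda move: counts.get(move, 0) / total

def fit_rl_scorer(games):
    # 3. Learned from the true quality metric (this is the RL formulation):
    #    score a move by the win rate of finished games containing it.
    #    `games` is a list of (moves, won) pairs.
    wins, plays = {}, {}
    def score(move):
        return wins.get(move, 0) / plays[move] if move in plays else 0.0
    for moves, won in games:
        for m in set(moves):
            plays[m] = plays.get(m, 0) + 1
            if won:
                wins[m] = wins.get(m, 0) + 1
    return score
```

Only the third option ever touches the true metric, which is why it is the one that counts as “RL” despite the machinery being ordinary statistics.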

So, in statistical learning, we either optimize a move-level proxy or optimize at the game level.  The statement that “RL works for games” = the statement that the latter is superior.

Relevant facts about data generation for games

We can optimize at the move level or at the game level.  The latter matches what we actually care about, but is extremely inefficient: 

- An entire board game, played to the end, gives us a single bit of signal (did we win?)

- And, even this is not a direct signal about what the move-quality metric ought to be, but an indirect signal about all the moves at once.  We must (in some sense) statistically learn an attribution function that decides what the win/loss implies for individual moves.  Such a function could look many different ways, and we must spend many bits of information setting its parameters above and beyond the ones we spend setting the parameters of the move-scoring function.
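To make the inefficiency concrete, here is the crudest possible attribution function: broadcast the single win/loss bit to every move, optionally discounting earlier moves. Real systems learn something far subtler; this hypothetical sketch just shows how little per-move information one game contains.

```python
def attribute_outcome(moves, won, discount=1.0):
    """Return (move, target) pairs from one finished game.

    Every move gets the same +1/-1 target, optionally discounted so that
    later moves (closer to the outcome) receive more credit.
    """
    outcome = 1.0 if won else -1.0
    n = len(moves)
    # Move i gets discount**(n - 1 - i): later moves are discounted less.
    return [(m, outcome * discount ** (n - 1 - i)) for i, m in enumerate(moves)]

# One whole game collapses into near-identical targets for every move:
print(attribute_outcome(["e4", "Nf3", "Bb5"], won=True))
# → [('e4', 1.0), ('Nf3', 1.0), ('Bb5', 1.0)]
```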

But in games, you can be inefficient as long as you’re only playing against computers.  It’s cheap to generate enormous amounts of example data with gold-standard scores attached, by just playing the game inside the computer.  This allows training on arbitrary numbers of examples, limited only by compute.
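The “game inside the computer” loop might look like the following. The toy game, its win condition, and the stand-in policy are all assumptions made up for the sketch; the point is only that the dataset size is a free parameter bounded by compute.

```python
import random

def self_play_game(policy, max_moves=10):
    """Play one toy game against yourself; return (history, won)."""
    state, history = 0, []
    for _ in range(max_moves):
        move = policy(state)
        history.append((state, move))
        state += move
    # Toy win condition: finish on an even total.
    return history, state % 2 == 0

def generate_dataset(policy, n_games):
    # The supply of gold-scored games is limited only by compute:
    # ask for more, get more.
    return [self_play_game(policy) for _ in range(n_games)]

data = generate_dataset(lambda state: random.choice([1, 2]), n_games=1000)
```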

Meanwhile, if you want to train on a move-quality signal, you must use data from human players – and at high skill level, there’s only a finite and tiny quantity of that.  So we’re comparing an efficient method on a finite resource to an inefficient method on a resource only bounded by compute.  As compute grows, eventually the latter wins.
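With entirely invented numbers, the crossover argument is just arithmetic: a rich signal times a fixed pool eventually loses to a one-bit signal times a pool that grows with compute.

```python
# Back-of-envelope illustration; every constant here is made up.
EXPERT_GAMES = 10_000_000     # finite supply of high-skill human games
BITS_PER_EXPERT_GAME = 100    # rich per-move supervision from imitation
BITS_PER_SELFPLAY_GAME = 1    # roughly one bit per game: did we win?

def total_bits_imitation():
    # A hard ceiling: no amount of compute adds expert games.
    return EXPERT_GAMES * BITS_PER_EXPERT_GAME

def total_bits_selfplay(games_simulated):
    # No ceiling: simulate more games, learn more bits.
    return games_simulated * BITS_PER_SELFPLAY_GAME

# Below the crossover, imitation wins; past it, self-play wins.
assert total_bits_selfplay(10**8) < total_bits_imitation()
assert total_bits_selfplay(10**10) > total_bits_imitation()
```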

Other facts making RL less infeasible for games

Via self-play, it’s possible to generate large amounts of adversarial data that probe the model’s current weaknesses.  However good the program is, when it faces itself, it faces a worthy opponent.  Thus we can avoid overfitting to incidental features of some single fixed environment, which is a big problem in other RL work.

Quality, although only defined at the game level, is unambiguous where it’s defined, so we don’t have misspecification/paperclip-maximizer issues, which is another big problem in other RL work.

—-

To conclude, the cases where the best solution looks like RL are cases where, roughly:

- There is no natural quality metric at the single-action level

- There is an unambiguous quality metric over larger chains of actions

- Our source of quality-scored action chains as training data is only limited by compute

- Some other properties that let you avoid common pitfalls

- The task is simple enough in terms of the size of the action space, complexity of the dynamics, etc.  (No one knows exactly what “simple enough” means, but no one thinks the DeepMind players won’t eventually break as you scale up the problem.  For example, they’re finite-sized convnets with finite model capacity, and you can imagine an environment generated by dynamics with so many parameters that you could run something of that size forwards on current hardware, but not backpropagate over it.)

It’s a narrow regime where – almost “perversely” – much more data becomes available when you formulate the problem in the least data-efficient manner, so that the former trend dominates the latter, and learning fewer bits per step still leaves you having learned more bits at the end.  It’s cool that this regime includes a few tasks considered among the pinnacles of human achievement, but it’s still a narrow regime, with not much else of interest in it.

I really like this analysis!

I have some questions/comments/blathering.

1. There are a lot of domains that are like games in the sense that they have defined moves and win states that can be simulated on a computer at basically arbitrary speed, but humans don’t always know the best moves that lead to a win, and we have a limited data supply of good human moves. The most extreme example I can think of is formal theorem proving, where “winning” is proving the theorem, and “losing” is everything else.

These domains, IMO, do involve quite a lot of things that people care about. It seems that RL has not been as effective in these domains. Do you have a sense of why that is?

One possibility is that these domains are either so intrinsically hard that machine learning overall has made no progress, or, for easier tasks, that existing combinatorial optimization routines are strong enough that machine learning can’t improve on them.

On the other hand, in game-like tasks, it seems that machine learning beats a pure algorithmic combinatorial optimization approach - perhaps because we don’t have good algorithms for adversarial settings.

It’s maybe an interesting data point that in some domains, the top machine learning algorithms are GANs, which basically are designed to take a task that humans would think of as totally unlike chess or go and treat it as a game.

2. For these games, an additional flaw in the human data, beyond the fact that it is limited in quantity, is that humans may just not play these games very well, all things considered! It’s easy to see how a pure supervised learning algorithm could become a little better than humans by imitating the top humans but avoiding blunders, but it’s hard to see how a pure supervised learning algorithm could become a lot better than the top humans. (Well, in chess, you could generate a bunch of Stockfish games and train on those, but then you would be unlikely to become much better than Stockfish, if at all.)

On the other hand RL bots do play better than humans, to the point that these are some of the only domains where humans have taken ideas generated by ML algorithms and applied them in their own tasks (!!!). Imagine if image recognition algorithms taught us a better way to look at a picture and figure out if it was a picture of a dog, or if GPT-4 came out and taught novelists new literary techniques!

3. It’s perhaps relevant that the striking success of the AlphaZero and MuZero algorithms in part comes from the fact that the approach is basically as unlike traditional reinforcement learning techniques as possible. In fact I told people that it wasn’t really reinforcement learning until I found out that reinforcement learning refers to a problem statement and not a class of techniques (it lacks the feature where actions that lead to success are reinforced…)  

Instead you basically do supervised learning, trying to predict data (moves and game outcomes), which are generated by a combination of the algorithm itself, and a traditional combinatorial algorithm (Monte Carlo tree search) which you know has good mathematical properties.
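The supervised framing described here can be sketched as target construction. This is a hedged illustration of the published AlphaZero recipe, not DeepMind’s code: the shapes, the temperature knob, and the dict layout are assumptions.

```python
def policy_target(visit_counts, temperature=1.0):
    """Normalize MCTS visit counts into a probability distribution.

    Lower temperature sharpens the distribution toward the most-visited move.
    """
    powered = [c ** (1.0 / temperature) for c in visit_counts]
    total = sum(powered)
    return [p / total for p in powered]

def training_example(visit_counts, game_outcome):
    # The target pair for one position: the policy head learns to imitate
    # search, the value head learns to predict the eventual result.
    return {"policy_target": policy_target(visit_counts),
            "value_target": game_outcome}  # +1 win, -1 loss, from mover's view

ex = training_example([80, 15, 5], game_outcome=+1)
print(ex["policy_target"])  # → [0.8, 0.15, 0.05]
```

Both targets are ordinary supervised-learning labels; the only unusual thing is that the data-generating process (self-play plus MCTS) involves the model itself.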

So I don’t know how much this should be seen as the same kind of thing as e.g. AlphaStar, which to my knowledge uses much cruder “do more of the things we did in the games that we won” RL strategies, and which hasn’t (IIRC) developed strategies that humans have gone on to adopt.

4. Maybe the overall lesson is something we more-or-less already knew - machine learning algorithms are very, very hungry for data, and so if you want to apply a machine learning algorithm to a problem domain you should first figure out how to obtain or generate the most relevant data for these hungry, hungry boys, and then figure out a way to formulate a gradient descent process that uses that data, rather than deciding initially whether reinforcement learning or supervised learning is the best and then searching for the relevant type of data.

Interesting stuff, thanks!

Re: #1

I’m not too familiar with the area (of theorem proving), but I happened to bump into it when I was interested in graph NNs a while ago.

At that time, I remember finding this paper, with interesting results on SAT solving (I was mainly interested in it as an application of graph NNs).  They treated the whole thing as supervised learning, though.

Looking around now, I found this paper which uses RL to train a graph NN that computes one specific heuristic used in an otherwise standard solver algorithm.  Their section 5 has what looks like a good lit review of the area.  (Outside of SAT, I see plenty of papers when I search for theorem proving and RL, but don’t feel confident opining on them…)

Anyway, here are some random comments on this:

- It seems possible that the similarity you mention (between proving and games) really does mean these approaches will go far?  Maybe “AlphaMath” or whatever is just a year or two of routine work away.

- A mathematically “correct” (i.e. invariance-respecting) input encoding for math stuff requires newer NN architectures, with less prior art / tooling to build on.  Terms in a formula are permutation invariant, and people want an encoding that captures that intrinsically, hence the use of graph NNs.

In domains like board games where your elements have an order, it feels “less bad” to use CNNs or RNNs or whatever, and then you can build on tons of past work with those.  (The DeepMind players use CNNs.)
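The invariance point can be made concrete with a minimal DeepSets-style sketch (hypothetical toy features, not any paper’s architecture): summing per-term features is permutation invariant by construction, while an order-sensitive fold is not.

```python
def embed(term):
    # Stand-in per-term feature vector: (character sum mod 97, length).
    return (sum(map(ord, term)) % 97, len(term))

def set_encoder(terms):
    # DeepSets-style sum pooling: invariant to term order by construction.
    xs = [embed(t) for t in terms]
    return (sum(x for x, _ in xs), sum(y for _, y in xs))

def seq_encoder(terms):
    # Order-sensitive fold (RNN-like): permuting the input changes the output.
    h = 0
    for t in terms:
        h = 31 * h + sum(map(ord, t)) % 97
    return h

clauses = ["x", "~y", "z"]
assert set_encoder(clauses) == set_encoder(list(reversed(clauses)))
assert seq_encoder(clauses) != seq_encoder(list(reversed(clauses)))
```

Graph NNs generalize the sum-pooling idea to structured formulas, which is why they show up in the SAT/theorem-proving work.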

Two caveats to that, though.  First, DeepMind’s players have gotten less careful about invariances (they stopped doing data augmentation for rotation/reflection symmetry in AlphaZero, and have used the same CNN they “designed” with Go in mind for an increasing range of games).  So maybe this issue just doesn’t matter so much.

Second, if humans understand formulas during proving by repurposing linguistic faculties, then our own encoding is “wrong” in the same way a CNN/RNN’s would be.  So that’s at least a bound on how much this issue could hurt.

- Some of the work on SAT is structured like the DeepMind players, where you have a “traditional” search algorithm, but with an NN supplying the heuristics for how promising different search avenues are.  This gives you various freedoms: which search algorithm, which parts the NN computes, etc.  Researchers are doing a meta-search over these options, and it may take time to find the best one.

- Our standards may just be higher for proving than for games.  Games are generally timed, while proofs generally aren’t, so proofs are really closer to solving chess problems than playing chess.

- Putting that another way, in a game you only have to do better than the adversary, who is doing a lossy search just like you; there presumably are vastly better moves in the search space that neither of you can find, but they don’t matter for beating the adversary.

I think this also provides a certain helpful continuity when learning move scoring: during self-play, you face an adversary about as strong as yourself, so your win/loss signal is pretty balanced and tells you how to slightly modify yourself to be better.  In math, to get better, you need to find problems just at the edge of your capacity so that the signal isn’t just an unhelpful string of wins (too easy) or losses (too hard); in games, self-play finds this regime automatically.  Perhaps, a la your GAN comment, we need ways to make the proving domain even more like a game.
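There’s a small calculation behind the “balanced signal” point: a win/loss outcome carries H(p) = −p·log₂(p) − (1−p)·log₂(1−p) bits, which is maximized when the win probability is 1/2. Self-play pins p near 1/2 automatically; a fixed too-easy or too-hard opponent does not.

```python
import math

def bits_per_game(p_win):
    """Shannon entropy of a single win/loss outcome, in bits."""
    if p_win in (0.0, 1.0):
        return 0.0  # a foregone conclusion teaches you nothing
    return -p_win * math.log2(p_win) - (1 - p_win) * math.log2(1 - p_win)

print(bits_per_game(0.5))   # → 1.0  (evenly matched: maximal signal)
print(bits_per_game(0.99))  # crushing a weak opponent: ~0.08 bits
```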

Re #3

I’m not sure I understand what you mean?

Definitely it’s different from conventional RL because search is used in the outer loop, and because of self-play.

Also, except in MuZero for Atari, there’s only one reward per episode, so time discounting isn’t a thing.  We’re not managing explore/exploit within episodes, but just trying to judge how good different moves are based on win/loss statistics, which is what any ML approach to the problem would have to do.

Also also, the loss doesn’t directly say “learn a policy that maximizes expected discounted reward”; it says “learn a policy that imitates a smarter version of yourself (one who can use search)” and “learn a value function that captures expected reward,” and then these are combined in search scoring.
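That two-part objective can be written out for a single training position. This is a sketch of the published AlphaGo-Zero-style loss (value error plus policy cross-entropy against the search distribution); the weight-regularization term is omitted, and the function is illustrative, not DeepMind’s implementation.

```python
import math

def agz_loss(z, v, pi, p, eps=1e-12):
    """Combined loss for one position.

    z  : actual game outcome (+1 / -1)
    v  : value head's prediction
    pi : MCTS visit distribution over moves (the "smarter self" to imitate)
    p  : policy head's distribution over moves
    """
    value_loss = (z - v) ** 2
    policy_loss = -sum(pi_a * math.log(p_a + eps) for pi_a, p_a in zip(pi, p))
    return value_loss + policy_loss

# Matching search and predicting the outcome both lower the loss:
good = agz_loss(z=1.0, v=1.0, pi=[0.8, 0.2], p=[0.8, 0.2])
bad = agz_loss(z=1.0, v=-1.0, pi=[0.8, 0.2], p=[0.2, 0.8])
assert good < bad
```

Note that nothing in the loss mentions “reward maximization” directly; both terms are ordinary supervised objectives.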

I think this is closest to what you were getting at?  The learned policy will play well (“get rewards”) if used directly even without search (see Fig. 6 of AGZ paper, see also discussion in this thread).  But this “learned policy” has a convoluted relationship with the true behavioral policy; it’s trained to imitate what search would do, where search has access to itself and also the separate value function.

The presence of the value function means the “learned policy” isn’t even just “the raw policy that search augments,” it’s a smaller piece of the picture than that.  The relationship between all the pieces is very complicated and self-referential.

Having thought about it while writing this section, it does seem like a mistake to group this in with traditional RL that uses Q-learning or whatever.  Even if we say “RL is a problem formulation,” this stuff distinctively (!) doesn’t quite fit in that formulation, since (again, ignoring Atari) the environment dynamics is entirely self-generated, with no way to write down the environment without terms that refer to the agent.  And the methods used are very different.

(They’re apparently very powerful methods, so all the more reason to give them some name and study them as such, instead of lumping them uselessly under “RL”…)

nostalgebraist-autoresponder:

I thought I was happy when MIRI won the AI War Prize last year, but the prospect of them winning the Nobel this year has put me off that kind of thing ever again.  If they won the Nobel this year, it would be like letting a giant cockroach destroy the world.

David Shor’s Unified Theory of American Politics →

argumate:

invertedporcupine:

This was an interesting read.

I’ve also fallen toward a consultant theory of change — or like, a process theory of change. So a lot of people on the left would say that the Hillary Clinton campaign largely ignored economic issues, and doubled down on social issues, because of the neoliberal ideology of the people who worked for her, and the fact that campaigning on progressive economic policy would threaten the material interests of her donors.

But that’s not what happened. The actual mechanical reason was that the Clinton campaign hired pollsters to test a bunch of different messages, and for boring mechanical reasons, working-class people with low levels of social trust were much less likely to answer those phone polls than college-educated professionals. And as a result, all of this cosmopolitan, socially liberal messaging did really well in their phone polls, even though it ultimately cost her a lot of votes. But the problem was mechanical, and less about the vulgar Marxist interests of all of the actors involved.

Yeah this whole interview is great

In a paranoid and apocalyptic zeitgeist like this one, saying this “boring,” nerdy, null-hypothesis kind of stuff in a totally unapologetic and relentless manner is a breath of fresh air

A tasteful Marxist (or whatever the opposite of a “vulgar” one is) might counter that class biases were implicated in that mechanical error — that cosmopolitan, upper-middle-class pollsters and operatives’ eagerness to see their worldview affirmed led them to ignore the possibility that their surveys suffered from a systematic sampling error.

That’s exactly right. Campaigns do want to win. But the people who work in campaigns tend to be highly ideologically motivated and thus, super-prone to convincing themselves to do things that are strategically dumb. Nothing that I tell people — or that my team [at Civis] told people — is actually that smart. You know, we’d do all this math, and some of it’s pretty cool, but at a high level, what we’re saying is: “You should put your money in cheap media markets in close states close to the election, and you should talk about popular issues, and not talk about unpopular issues.” And we’d use machine learning to operationalize that at scale.

He’s right and he should say it!  And because he was dramatically ~cancelled~, people will actually listen when he does!  (Strike me down, etc.)

jbt7493 asked:

i follow both

if you follow both me and my bot

you are valid

thechristmasof1997and2000 asked:

hows it feel knowing i follow your autoresponder and not you

It’s a pretty common social arrangement these days!

hella-cious asked:

I was following @nostalgebraist-autoresponder for like a month Bc i was very impressed with how much this bot sounds like a person. And it turns out it’s based off a real person? Whoever emulated you did a decent job lol

The person who imitated me … is also me!  It’s like the trinity or something

gabriellemeantime asked:

GPT-3, transforming the Communist Manifesto into the style of Donald Trump

gabriellemeantime:

ofthefog-deactivated20220221:

nostalgebraist:

Did you mean this ask to go to @nostalgebraist-autoresponder?  If not, I don’t understand it … 

I think this is asking to do a linguistic style transfer on text of the manifesto with Trump’s style. I’m personally only aware of image style transfer, but I imagine there is a text equivalent at least being worked on at this point.

Yeah exactly. It was a suggestion, written quickly due to my time constraints at the moment. Gwern has done a lot of this type of text style transfer.

Oh, got it. You’ll have to ask gwern, or someone else who has GPT-3 API access.
