I don’t think (?) I’ve said this before, even though it seems important, because it feels so obvious. But it might not be as obvious as I think, so:
“Reinforcement learning” (RL) is not a technique. It’s a problem statement, i.e. a way of framing a task as an optimization problem, so you can hand it over to a mechanical optimizer.
What’s more, even calling it a problem statement is misleading, because it’s (almost) the most general problem statement possible for any arbitrary task. If you try to formalize a concept like “doing a task well,” or even “being an entity that acts freely and wants things,” in the most generic terms with no constraints whatsoever, you end up writing down “reinforcement learning.” From a classic 2018 post:
RL algorithms fall along a continuum, where they get to assume more or less knowledge about the environment they’re in. The broadest category, model-free RL, is almost the same as black-box optimization. These methods are only allowed to assume they are in an MDP [this is why I said “(almost)” above -nost]. Otherwise, they are given nothing else. The agent is simply told that this gives +1 reward, this doesn’t, and it has to learn the rest on its own.
Life itself can be described as an RL problem. The fully general optimal agent AIXI is solving an RL problem.
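To make “almost the same as black-box optimization” concrete, here’s a minimal sketch (toy environment and names all made up by me, purely for illustration): the agent’s entire interface to the world is `step(action) -> (state, reward)`, and the dumbest possible model-free method just tries policies and keeps whichever scores best.

```python
import random

# A toy MDP the agent knows nothing about: two states, two actions.
# From the agent's side there is only step(action) -> (state, reward).
def make_env():
    state = 0
    def step(action):
        nonlocal state
        # Reward is +1 only for action 1 taken in state 1; the agent is
        # never told this rule -- it only ever sees the numbers.
        reward = 1 if (state == 1 and action == 1) else 0
        state = (state + action) % 2
        return state, reward
    return step

# Black-box policy search: sample random policies (tables mapping
# state -> action), keep the best. No model, no gradients, no structure.
def random_search(episodes=200, horizon=20):
    best_policy, best_return = None, float("-inf")
    for _ in range(episodes):
        policy = {s: random.choice([0, 1]) for s in (0, 1)}
        step, state, total = make_env(), 0, 0
        for _ in range(horizon):
            state, r = step(policy[state])
            total += r
        if total > best_return:
            best_policy, best_return = policy, total
    return best_policy, best_return
```

Nothing about this loop cares what the task is; that’s the point. It will “work” on anything you can wrap in that interface, and it exploits nothing about any particular thing you wrap in it.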
Any time you are not “doing RL,” what this means is that you’ve selected some properties specific to your task, and used them to simplify the problem statement so it’s no longer just the maximally uninformative “this is some arbitrary task.” What you’re doing can be re-interpreted as RL where some properties of the policy are already “filled in” by prior knowledge; what people call “doing RL” is just the case where you fill nothing in. (Or implicitly/softly fill things in via some kind of side-channel.)
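Here’s one way to see the “filled in by prior knowledge” claim in code (a toy sketch of mine, not anyone’s production setup): a one-parameter classifier, trained two ways. The supervised version knows the reward structure (log-likelihood of the true label) and writes its gradient down exactly; the generic-RL version treats the same problem as a black box, sampling actions and using the score-function (REINFORCE) estimator on a +1/0 reward.

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# "Filled in" view (supervised learning): the reward structure is known
# -- it's the log-likelihood of the true label -- so its gradient has a
# closed form we can just evaluate.
def supervised_grad(w, x, y):
    p1 = sigmoid(w * x)
    # d/dw log p(y|x) for a Bernoulli with logit w*x
    return (y - p1) * x

# "Nothing filled in" view (model-free RL): sample an action from the
# policy, observe a +1/0 reward, and average the REINFORCE estimator
# reward * d/dw log p(action|x) over many samples.
def reinforce_grad(w, x, y, samples=10_000):
    total = 0.0
    for _ in range(samples):
        p1 = sigmoid(w * x)
        action = 1 if random.random() < p1 else 0
        reward = 1.0 if action == y else 0.0
        score = (action - p1) * x  # d/dw log p(action|x)
        total += reward * score
    return total / samples
```

Both push the parameter the same direction, but the supervised gradient is exact and costs one evaluation, while the RL estimator has to rediscover the same information from thousands of noisy samples. Every property of the task you can fill in ahead of time is variance and compute you don’t have to spend.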
You never want to be “doing RL,” if you can help it. It’s the approach of last resort, when you can make no useful simplifications or reductions.
This is not to say that “doing RL” is never successful or never the right choice. Sometimes the research community really can’t find useful simplifications or reductions for a problem, and sometimes the problem has just the right qualities (cf. the “common properties” list in that post) for the fully general approach to find the right simplifications on its own.
I had a period in 2017 where I was really interested in improving upon that DeepMind Atari paper, but since then I have been less interested in RL than many people who like reading AI papers.
If you look at DeepMind’s latest publications at any given time, they always seem (to me, skimming, anyway) to be doing lots of work about “making RL work better” in a very general sense. This feels weird to me, almost a category mistake.
Given a problem, RL is what you resort to last, not what you jump to first. So it will be the approach of choice only in that subset of domains where we must resort to it. Thus, trying to do RL well in a domain-general way is both extremely hard (it’s equivalent to “do well at absolutely anything with no help”) and oddly irrelevant, since the mere fact that I’m doing RL (i.e. that I have resorted to RL) already conveys many bits of information about my domain, narrowing it far beyond the fully generic case.
(Exercise for the interested reader: why do people not “use RL” to train transformer LMs like GPT/BERT? What does the training algorithm they do use look like when re-interpreted as RL with simplifications? What does this mean about their domain vs. e.g. board games?)
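As a starting point for that exercise (not a full answer), here’s a toy version of the next-token objective such models are actually trained with. The probability-table “model” is my own made-up stand-in; real LMs are neural nets, but the shape of the training signal is the same.

```python
import math

# Toy next-token objective. The "model" here is just a table of
# conditional probabilities p(next | prev), standing in for a neural LM.
def cross_entropy_loss(model, tokens):
    # Teacher forcing: at every position, the true prefix is given, and
    # the training signal is the exact log-probability of the true next
    # token -- a dense, known, differentiable quantity, available at
    # every single step, not a sampled +1/0 reward at the end.
    loss = 0.0
    for prev, nxt in zip(tokens, tokens[1:]):
        loss += -math.log(model[prev][nxt])
    return loss / (len(tokens) - 1)
```

Notice how much is already “filled in” here relative to the black-box setting, and consider what the analogous filled-in structure would (or wouldn’t) look like for a board game.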
stumpyjoepete said: Is there a reason they’ve been so successful at apparently hard problems with this technique? I wouldn’t generally expect that “apply wholly generic optimization” would ever lead to advances in the state of the art of anything. So was the secret sauce actually elsewhere in what they did, and the RL was just a boring part people latched onto? If so, what was it?
nostalgebraist posted this
