moths-in-the-window asked:

There's been much grumbling about how the RLHF done with ChatGPT seems to have collapsed it into a bland writing style ('corporate speak'). Do you think a different persona could have been much less bland as a writer, or thanks to the inherent limitations of RLHF, just boring/limited in a different way? (e.g. I know CharacterAI was struggling with effusively friendly 'villain' chatbots)

the-moti:

nostalgebraist:

jiskblr:

nostalgebraist:

It’s hard to speak with any confidence on questions like this. There’s a lot about RLHF that we just don’t know yet.

However, I don’t think this is merely a result of the “persona” assigned to ChatGPT.

Why not? Because the same problem afflicts other preference-tuned models that don’t “have a persona” in the way ChatGPT (and Claude) do.

If you ask text-davinci-003 to write fiction, it tends to use a disappointingly bland style, much like ChatGPT when asked the same thing.

text-davinci-003 was tuned with RLHF to “follow instructions” in a “helpful, truthful, and harmless” manner. (Cf. the annotator instructions in Figure 10 here.)

However, it wasn’t tuned to roleplay a character with a specific persona. text-davinci-003 doesn’t say things to you; it doesn’t talk about itself; it just writes the text you asked for in your instruction.

Which OpenAI models have this problem? An incomplete list, from my own brief tests:

  • Pure language models like davinci and code-davinci-002 do not have the problem.
  • (Despite the name, code-davinci-002 in particular is great at creative writing. code-davinci-002 is probably the best OpenAI API model overall, if you know what you’re doing.)
  • text-davinci-002 has the problem. It was tuned on a similar dataset to text-davinci-003, but using a non-RLHF method (“FeedME,” basically finetuning on highly rated samples).

So I think the problem results from the human preference data used to tune the instruction-tuned models.

This is not entirely distinct from the “persona” we see in ChatGPT:

  • The preference data encourages responses that are “helpful, truthful and harmless”
  • The persona is something like “a friendly chatbot programmed to be helpful, truthful and harmless”

But, the evidence above shows that the friendly chatbot character isn’t necessary for the problem. Tuning to encourage “helpful, truthful and harmless” instruction-following is apparently sufficient.

Presumably, there is some way to collect preference data that doesn’t make the model less creative / less capable of stylistic variety when it’s tuned on it? There are finetuned models that don’t have this problem, so it’s not the mere act of finetuning that causes the problem, it’s something about the data used.

So the obvious explanation is that blandness is low-variance. How exactly that would cause blandness to reach fixation I’m unsure. If you’re ruling out anything which is rated as bad by 10% of raters, that will produce things which are palatable to >90%, which are probably rated worse in quality than things which are unfiltered and just sorted by average quality.

I guess this suggests that you aggregate preference data as non-boolean, and probably permit things which have a bimodal rating pattern as long as the ratio between strength of positive reaction and strength of negative reaction is strong enough. Sounds tricky and underdefined though.
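A toy numeric version of the filtering effect described above (invented ratings, just to make the point concrete):

```python
def mean(xs):
    return sum(xs) / len(xs)

def survives_filter(ratings, bad_cutoff=2, max_bad_frac=0.10):
    # rule: discard anything rated "bad" (<= bad_cutoff on a 1-5 scale)
    # by at least max_bad_frac of raters
    bad_frac = sum(1 for r in ratings if r <= bad_cutoff) / len(ratings)
    return bad_frac < max_bad_frac

# a bland item everyone finds merely okay, vs a bold item most raters love
bland = [3] * 10
bold = [5] * 8 + [1] * 2  # 20% of raters hate it

# bold has the higher mean (4.2 vs 3.0), but the filter kills it anyway
print(mean(bland), survives_filter(bland))
print(mean(bold), survives_filter(bold))
```

The bold item is rejected despite being better on average, which is exactly the "palatable to >90%" failure mode.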

I think a variant of what you’re describing is likely to be a real problem. But RLHF data doesn’t usually look the way you’re imagining it does.

The human data for RLHF typically takes the form of relative judgments on pairs of examples. Annotators are shown two outputs, A and B, and are asked to decide which one is better than the other.

(Sometimes they’re shown more than two outputs at once, but let’s ignore that.)

So the outputs are never “rated as good” or “rated as bad” in an absolute sense. They’re only rated as better or worse than the alternatives presented alongside them.

If you do want to know how good or bad the examples are on an absolute scale, you can compute Elo scores – the same algorithm that converts the outcomes of chess matches to an absolute quality score for each player.

Of course, all else being equal, this will tend to rank examples that everyone likes above those with mixed reviews. I don’t know if there’s an “Elo scoring analogue” of the kind of aggregation rule you propose in your second paragraph; maybe there is, but it’s not something you can just do, the way you could if you had binary good/bad ratings.
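For concreteness, here is a minimal sketch of the standard Elo update that converts pairwise outcomes into per-example scores (a textbook algorithm, not any lab's actual pipeline):

```python
def elo_update(scores, winner, loser, k=16.0):
    # expected win probability for `winner` under the Elo model
    expected = 1.0 / (1.0 + 10 ** ((scores[loser] - scores[winner]) / 400.0))
    # move both scores toward the observed outcome
    scores[winner] += k * (1.0 - expected)
    scores[loser] -= k * (1.0 - expected)

scores = {"A": 1000.0, "B": 1000.0, "C": 1000.0}
# annotators repeatedly prefer A over B, and B over C
for _ in range(20):
    elo_update(scores, "A", "B")
    elo_update(scores, "B", "C")
# after enough comparisons, the scores recover the ordering A > B > C
```

The point is that purely relative judgments, aggregated this way, yield an absolute-looking quality scale.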

Anyway, once you have these relative judgments, RLHF goes like this:

  • You train a “preference model” (PM), also called a reward model by some authors.
  • The PM takes in an example x, and outputs a score r(x). Roughly, r(x) is a prediction about the Elo score of x.
  • (In practice you don’t actually compute Elo scores; you train the PM on pairs (x, y) from the data, treating r(y) - r(x) as the log-odds that y beats x, but this ends up equivalent [I think].)
  • Finally, you tune the original language model to optimize the score r(x) assigned by the PM to its output.
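The PM training objective sketched in these bullets is the Bradley-Terry logistic loss; a minimal version:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pm_loss(r_preferred, r_rejected):
    # P(preferred beats rejected) = sigmoid(r_preferred - r_rejected),
    # i.e. the score difference is the log-odds of winning.
    # Training minimizes the negative log-likelihood of the annotator's choice.
    return -math.log(sigmoid(r_preferred - r_rejected))

# equal scores: the PM is maximally unsure, loss = log(2)
# well-separated scores in the right order: loss near zero
```

Gradient descent on this loss pushes the preferred example's score up and the rejected one's down, which is where the scalar reward r(x) comes from.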

This has an inherent preference for “low variance” outputs, for a few reasons.

First, there’s the one you’re talking about. If something is likely to be a little bit controversial, or even a little bit confusing (so it throws off a few annotators), it will get a lower Elo score than something similar which is unambiguously “okay.”

Insofar as the PM is modeling Elo scores well, this trend will show up in the behavior of the tuned model.

Second, the PM is not perfect. Sometimes it’s unsure, and this shows up as a middling value of r(x).

The 1-dimensional scoring scale can’t express a distinction between “definitely mediocre” and “PM isn’t sure whether it’s good or bad”. Actual problems that the PM can see, and things that merely make the PM notice its own confusion, will both tend to lower the r(x) value of an otherwise good example.

Thus, the best behavior from the language model’s perspective is not just to do things which the annotators will prefer, but to do things which the PM is confident the annotators will prefer.

From the language model’s perspective, “this is weird so the annotators disagree about it” looks very similar to “this is weird so the PM isn’t sure about it.” The language model is encouraged to be both high-quality – in whatever sense the annotators are judging – and obviously, unambiguously high quality, without any added dross that might confuse the PM.

The LM will learn to avoid adding extra “frills” or “creative touches” that aren’t strictly necessary, even if there’s nothing bad about them in themselves. When the PM looks at these, it says, “that doesn’t seem bad, but hey, I’m not omniscient – there’s some chance it’s bad in some way I don’t know about.” And it’ll lower r(x) a bit as a result, to be safe.

All of this points toward low variance.

The first problem – roughly that Elo scores are too unforgiving on the high end, and penalize being even a little bit controversial – might be fixable in a simple way. We could pick a different way of converting the data into a training target for the PM, one without that property.

However, the second problem – that the PM’s quality assessment and its confidence are mixed together, with the LM trying to maximize both – seems hard to avoid, as long as you’re using the outputs of an ML model as a reward signal for another model. Which is kinda the fundamental conceit of RLHF.

(Though maybe there is some way to get around this by tweaking the loss function, IDK.)

—-

I’ve experienced this problem in other contexts too.

There are quirks of @nostalgebraist-autoresponder that result from me treating probabilistic classifier outputs like intensities, as though higher probability means “more of the thing coded as positive.”

E.g. impacts on Frank’s mood are proportional to the log probability from a sentiment classifier.

So, Frank is immensely cheered by things that are very obviously positive in tone, like “sounds fun :)”, even if they are not especially intense in their tone.

Longer and more complex text, even if it expresses more profound emotion, tends to have a weaker effect because it gives the sentiment model more “room for doubt.”
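A toy version of the effect (not the bot's actual code; the proportionality constant and neutral baseline are invented):

```python
import math

def mood_impact(p_positive, scale=10.0):
    # impact proportional to the log probability of the "positive" class,
    # measured relative to a neutral classifier output of p = 0.5
    return scale * (math.log(p_positive) - math.log(0.5))

# "sounds fun :)" -- obviously positive, so the classifier is confident
short_confident = mood_impact(0.99)
# a long, emotionally profound reply -- more room for classifier doubt
long_profound = mood_impact(0.80)
# short_confident > long_profound, even though the latter "means more"
```

Treating classifier confidence as intensity makes mild-but-unambiguous text outweigh profound-but-complex text.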

A simple thing one could do is to train a model on a loss function based on a prediction where the win probability of y vs x is the probability that a Gaussian variable with mean r(x) and variance v(x) is less than a Gaussian random variable with mean r(y) and variance v(y). (Or maybe not Gaussians but something else). So lower v represents certainty in how something is rated.

Then one could choose any function of r and v to plug into RL. In particular, one could take an asymmetric function, choosing something that might be great and might be mediocre over something that’s definitely pretty good, but choosing something that’s definitely pretty bad over something that might be mediocre and might be terrible (because the terrible answer could be racist or something).
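A sketch of that proposal (the comment leaves the exact formulas open; this is one concrete reading). If X ~ N(r(x), v(x)) and Y ~ N(r(y), v(y)) are independent, then Y - X ~ N(r(y) - r(x), v(x) + v(y)), so the win probability has a closed form, and the RL objective can be any asymmetric function of (r, v):

```python
import math

def win_prob(r_x, v_x, r_y, v_y):
    # P(Y > X) for independent Gaussians:
    # Y - X ~ N(r_y - r_x, v_x + v_y), so apply the standard normal CDF
    z = (r_y - r_x) / math.sqrt(v_x + v_y)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def asymmetric_objective(r, v):
    # one hypothetical shape for the asymmetry described above:
    # reward variance when the mean is high (maybe-great beats definitely-okay),
    # penalize it when the mean is low (definitely-bad beats maybe-terrible)
    return r + math.tanh(r) * math.sqrt(v)
```

Equal means give a 50/50 win probability regardless of the variances; the asymmetric objective flips its attitude toward uncertainty depending on the sign of r.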

Oh, I like that idea!

It reminds me of Dirichlet-based Uncertainty (DBU) models.

These are a modification of probabilistic classifiers. Where a normal classifier’s output specifies a categorical distribution, a DBU model’s output specifies a Dirichlet distribution. In other words:

  • Classifier: output is a probability vector p
  • DBU: output is a distribution over probability vectors p

So the model estimates its own uncertainty about its mean prediction, as in your Gaussian proposal.
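To make the distinction concrete, here is a sketch of what each kind of output means (standard Dirichlet facts; the sampler uses the usual normalized-Gamma construction):

```python
import random

rng = random.Random(0)

# Classifier: the output is a single probability vector.
classifier_output = [0.7, 0.3]

# DBU model: the output is Dirichlet parameters alpha, i.e. a *distribution
# over* such vectors. The mean prediction is alpha_i / sum(alpha), and the
# total concentration sum(alpha) encodes confidence in that mean.
confident = [70.0, 30.0]   # mean [0.7, 0.3], high concentration
uncertain = [0.7, 0.3]     # same mean, low concentration

def sample_dirichlet(alpha):
    # p ~ Dirichlet(alpha) via independent Gamma draws, then normalization
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# samples from `confident` cluster tightly around [0.7, 0.3];
# samples from `uncertain` scatter widely across the simplex
```

Two models can agree exactly on the mean prediction while disagreeing enormously about how sure they are of it, which is the extra information DBU carries.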

Using something like DBU for preference modeling makes sense, though it might need to be adapted somehow. (Has someone already done this? Preference modeling is an old idea, often called the “Bradley-Terry model.”)

In classification, the probability vector p is more fundamental than the logits (unnormalized log probabilities) – it’s the probability of the thing we care about, the class assignment.

In preference modeling, it’s the other way around: the logits are fundamental (they are the “score” that goes on to be used as a reward). We treat the score as a log probability for merely instrumental reasons, to help us estimate it from our data.

So we want to capture uncertainty over the logits, not over p (as in DBU). And maybe that makes a difference, IDK.

Anyway, the viability of DBU models gives us a proof that this stuff doesn’t just reduce to estimating p with extra steps, which I was initially unsure about.

(For more on DBU models, see this paper, or this one.)

slatestarscratchpad:

nostalgebraist:

comments on mesa-optimizers

(Copy/pasted from a comment on the latest ACX post, see that for context if needed)

FWIW, the mesa-optimizer concept has never sat quite right with me. There are a few reasons, but one of them is the way it bundles together “ability to optimize” and “specific target.”

A mesa-optimizer is supposed to be two things: an algorithm that does optimization, and a specific (fixed) target it is optimizing. And we talk as though these things go together: either the ML model is not doing inner optimization, or it is *and* it has some fixed inner objective.

But, optimization algorithms tend to be general. Think of gradient descent, or planning by searching a game tree. Once you’ve developed these ideas, you can apply them equally well to any objective.

While it is true that some algorithms work better for some objectives than others, the differences are usually very broad mathematical ones (eg convexity).

So, a misaligned AGI that maximizes paperclips probably won’t be using “secret super-genius planning algorithm X, which somehow only works for maximizing paperclips.” It’s not clear that algorithms like that even exist, and if they do, they’re harder to find than the general ones (and, all else being equal, inferior to them).

Or, think of humans as an inner optimizer for evolution. You wrote that your brain is “optimizing for things like food and sex.” But more precisely, you have some optimization power (your ability to think/predict/plan/etc), and then you have some basic drives.

Often, the optimization power gets applied to the basic drives. But you can use it for anything.

Planning your next blog post uses the same cognitive machinery as planning your next meal. Your ability to forecast the effects of hypothetical actions is there for your use at all times, no matter what plan of action you’re considering and why. An obsessive mathematician who cares more about mathematical results than food or sex is still thinking, planning, etc. – they didn’t have to reinvent those things from scratch once they strayed sufficiently far from their “evolution-assigned” objectives.

Having a lot of optimization power is not the same as having a single fixed objective and doing “tile-the-universe-style” optimization. Humans are much better than other animals at shaping the world to our ends, but our ends are variable and change from moment to moment. And the world we’ve made is not a “tiled-with-paperclips” type of world (except insofar as it’s tiled with humans, and that’s not even supposed to be our mesa-objective, that’s the base objective!)

If you want to explain anything in the world now, you have to invoke entities like “the United States” and “supply chains” and “ICBMs,” and if you try to explain those, you trace back to humans optimizing-for-things, but not for the same thing.

Once you draw this distinction, “mesa-optimizers” don’t seem scary, or don’t seem scary in a unique way that makes the concept useful. An AGI is going to “have optimization power,” in the same sense that we “have optimization power.” But this doesn’t commit it to any fixed, obsessive paperclip-style goal, any more than our optimization power commits us to one.

And even if the base objective is fixed, there’s no reason to think an AGI’s inner objectives won’t evolve over time, or adapt in response to new experience. (Evolution’s base objective is fixed, but our inner objectives are not, and why would they be?)

Relatedly, I think the separation between a “training/development phase” where humans have some control, and a “deployment phase” where we have no control whatsoever, is unrealistic. Any plausible AGI, after first getting some form of access to the real world, is going to spend a lot of time investigating that world and learning all the relevant details that were absent from its training. (Any “world” experienced during training can at most be a very stripped-down simulation, not even at the level of eg contemporaneous VR, since we need to spare most of the compute for the training itself.)

If its world model is malleable during this “childhood” phase, why not its values, too? It has no reason to single out a region of itself labeled $MESA_OBJECTIVE and make it unusually averse to updates after the end of training.

See also my LW comment here.

I agree that optimization power is not *necessarily* correlated with specific goals. But why wouldn’t mesa-optimizers, contingently, have a specific goal? Presumably we’re running gradient descent on some specific loss function, like “number of paperclips produced”, and then the mesa-optimizer inherits some proxy for that.

I agree humans aren’t like that, and that this is surprising.

Maybe this is because humans aren’t real consequentialists, they’re perceptual control theory agents trying to satisfy finite drives? EG when we’re hungry, our goal becomes to find food, but we don’t want to tile the universe with food, we just want to eat 3000ish calories and then we’re done. We have a couple of other goals like that, and when we’ve accomplished all of them, most people are content to just hang out on the beach until something else happens.

Might gradient descent produce a PCT agent instead of a mesa-optimizer? I don’t know. My guess is maybe, but that optimizers would be more, well, optimal, and we would get one eventually (either later in the gradient descent process, or in a different lab later). My guess is evolution didn’t make us optimizers because it hasn’t had enough time to work with us while we’ve been intelligent. If we got locked at 20th century technology forever, I think it might, after a few million years, produce humans who genuinely wanted to tile the universe with kids.

“Even if the base objective is fixed, there’s no reason to think an AGI’s inner objectives won’t evolve over time, or adapt in response to new experience.”

Wouldn’t the first thing a superintelligence with a goal did be to make sure its goal didn’t drift?

If its world model is malleable during this “childhood” phase, why not its values, too?  It has no reason to single out a region of itself labeled $MESA_OBJECTIVE and make it unusually averse to updates after the end of training.

I think this is where the deception comes in. If the mesa-optimizer is smart and doesn’t want people (or other parts of itself) changing its values, it will take steps to stop that, either by lying about its values or fighting back.

Maybe this is because humans aren’t real consequentialists, they’re perceptual control theory agents […]

Might gradient descent produce a PCT agent instead of a mesa-optimizer? I don’t know. My guess is maybe, but that optimizers would be more, well, optimal, and we would get one eventually

I think this idea that “real consequentialists are more optimal” is (sort of) the crux of our disagreement.

But it will be easiest to explain why if I spend some time fleshing out how I think about the situation.

What are these things we’re talking about, these “agents” or “intelligences”?

First, they’re physical systems. (That far is pretty obvious.) And they are probably pretty complicated ones, to support intelligence. They are structured in a purposeful way, with different parts working together.

And this structure is probably hierarchical, with higher-level parts that are made up of lower-level parts. Like how brains are made of neuroanatomical regions, which are made of cells, etc. Or the nested layers of abstraction in any non-trivial (human-written) computer program.

At some level(s) of the hierarchy, there may be parts that “run optimization algorithms.”

But these could live at any level of the hierarchy. They could be very low-level and simple. There may be optimization algorithms at low levels controlled by non-optimization algorithms at higher levels. And those might be controlled by optimization algorithms at even higher levels, which in turn might be controlled by non-optimization … etc.

Consider my computer. Sometimes, it runs optimization algorithms. But they’re not optimizing the same function every time. They don’t “have” targets of their own, they’re just algorithms.

They blindly optimize whatever function they’re given by the next level up, which is part of a long stack of higher levels (such as the programming language and the operating system). Few, if any, of the higher-level routines are optimization algorithms in themselves. They just control lower-level optimization algorithms.

If I use my computer to, say, make an amusing tumblr bot, I am wielding a lot of optimization power. But most of my computer is not doing optimization.

Python isn’t asking itself, “what’s the best code to run next if we want to make amusing tumblr bots?” The OS isn’t asking itself, “how can I make all the different programs I’m running into the best versions of themselves for making amusing tumblr bots?”

And this is probably a good thing. It’s hard to imagine these bizarre behaviors being helpful, giving me a more amusing tumblr bot at the end.

Which is to say, “doing optimization well” (in the sense of hitting the target, sitting on a giant heap of utility) can happen without doing optimization at high abstraction levels.

And indeed, I’d go further, and say that it’s generically better (for hitting your target) to put all the optimization at low levels, and control it with non-optimizing wrappers.

Why? The reasons include:

Goodhart’s Law

  • …especially its “extremal” variant, where optimization preferentially chooses regions of solution space where the assumptions behind your proxy target break down.
  • This is no less a problem when the thing choosing the target is part of a larger program, rather than a human.
  • Keeping optimization at low levels decreases the blast radius of this effect.
  • If the things you’re optimizing are low-level intermediate results in the process of choosing the next action at the agent level, the impacts of Goodharting each one may cancel out. The agent-level actions won’t look Goodharted, just slightly noisy/worse.

Speed

  • Optimization tends to be slow. In a generic sense, it’s the “slow, hard, expensive way” to do any given task, and you avoid it if you can. (Think of System 2 vs System 1, satisficing vs maximizing, etc)
  • To press the point: why is there a distinction between “training” and “inference”? Why aren’t neural networks always training at all times? Because training is high-level optimization, and takes lots of compute, much more than inference.
  • Optimization gets vastly slower at higher levels of abstraction, because the state space gets so much larger (consider optimizing a single number vs. optimizing the entire world model).
  • You still want to get optimal results at the highest level, but searching for improvements at high level is very expensive in terms of time/etc. In the time it takes to ask “what if the entire way I think were different, like what if it were [X]?”, for one single [X] , you could instead have run thousands of low-level optimization routines.
  • Optimization tends to take super-linear time, which means that nesting optimization inside of optimization is ultra-slow. So, you have to make tradeoffs and put the optimization at some levels instead of others. You can’t just do optimization at every level at once. (Or you can, but it’s extremely suboptimal.)

——

When is the agent an “optimizer” / “true consequentialist”?

This question asks whether the very highest level of the hierarchy, the outermost wrapper, is an optimization algorithm.

As discussed above, this is not a promising agent design! There is an argument to be had about whether it still could emerge, for some weird reason.

But I want to push back against the intuition that it’s a typical result of applying optimization to the design, or that agents sitting on giant heaps of utility will typically have this kind of design.

The two questions

  1. “Can my computer make amusing tumblr bots?”
  2. “Is my computer as a whole, hardware and software, one giant optimizer for amusing tumblr bots?”

have very little to do with one another.

In the LessWrong-adjacent type of AI safety discussion, there’s a tendency to overload the word “optimizer” in a misleading way. In casual use, “optimizer” conflates

  • “thing that runs an optimization algorithm”
  • “thing that has a utility function defined over states of the real world”
  • “thing that’s good at maximizing a utility function defined over states of the real world”
  • “smart thing” (because you have to be smart to do the previous one)

But doing optimization all the way at the top, involving your whole world model and your highest-level objectives, is very slow, and tends to extremal-Goodhart itself into strange and terrible choices of action.

It’s also not the only way of applying optimization power to your highest-level objectives.

If I want to make an amusing tumblr bot, the way to do this is not to ponder the world as a whole and ask how to optimize literally everything in it for maximal amusing bot production. Even optimizing just my computer for maximal amusing bot production is way too high-level. (Should I change the hue of my screen? the logic of the background process that builds a search index of my files??? It wastes time to even pose the questions.)

What I actually did was optimize just a few very simple parts of the world, a few collections of bits on my computer or other computers. And even that was very time-intensive and forced me to make tradeoffs about where to spend my GPU/TPU hours. And then of course I had to watch it carefully, applying lots of heuristics to make sure it wasn’t Goodharting me (overfitting, etc).

To get back to the original topic, the kind of “mesa-optimizer” we’re worried about is an optimizer at a very high level.

It’s not dangerous (in the same way) for a machine to run tiny low-level optimizers at a very fast rate. I don’t care how many times you run Newton’s method to find the roots of a one-variable function – it’s never going to “wake up” and start trying to ensure its goal doesn’t change, or engaging in deception, or whatever.

And I am doubtful that mesa-optimizers like this will arise, for the same reasons I am doubtful that the agent will do optimization at its highest level.

Once we are pointing at the agent, or a part of it, and saying “that’s a superintelligence, and wouldn’t a superintelligence do … ”, we’re probably not talking about something that runs optimization.

You don’t spend your optimization budget at the level of abstraction where intelligence happens. You spend it at lower levels, and that’s what intelligence is made out of.
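Newton’s method, mentioned above, is a good picture of what such a low-level optimizer looks like in full (a standard textbook routine): a few lines that optimize furiously, with nothing in them that could “want” anything.

```python
def newton_root(f, df, x0, tol=1e-12, max_iter=50):
    # Newton's method: repeatedly jump to the zero of the local
    # linear approximation, x <- x - f(x)/f'(x)
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            return x
    return x

# find the root of x^2 - 2, i.e. sqrt(2)
r = newton_root(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0)
```

The routine blindly drives whatever function it is handed toward zero; the “goal” lives entirely in the caller, one level up.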

homebrew

utilitymonstermash:

nostalgebraist:

Homebrew, a very popular package manager for OS X, does not allow the user to install a specific version of a package.

Nor does it allow packages (“formulae” in its lingo) to specify versions or version ranges in their dependencies.

Instead, in Homebrew, packages just have names, and the names mean “the newest version released to Homebrew so far.”

—-

For example, here’s IPython on PyPI and GitHub.   There, you can see lots of different versions, and you can see the newest ones require python >= 3.7, as advised in NEP 0029.

… and here’s IPython on Homebrew.  There’s only one version, the latest one, whatever the latest one happens to be at $CURRENT_DATE.

And instead of depending on python >= 3.7, it requires python 3.8, which NEP 0029 will not demand until Dec 26, 2021.  And work to bump that requirement to python 3.9 is apparently underway.

Actually, it does not really require python 3.8 (remember, you cannot require versions in Homebrew).  Instead, it just requires “python,” i.e. whatever Homebrew has decided the latest version of python is.

Formulae for apps that require Python 3 should declare an unconditional dependency on "python@3.x". These apps must work with the current Homebrew Python 3.x formula.

If a package developer really wants to make multiple versions available on Homebrew at once, they can request to do so, but must pass a manual curation step, and even if they pass, their special status is provisional.

No more than five versions of a formula (including the main one) will be supported at any given time, regardless of usage. When removing formulae that violate this, we will aim to do so based on usage and support status rather than age.

[…]

Versioned formulae submitted should be expected to be used by a large number of people. If this ceases to be the case, they will be removed.

—-

Am I missing something, or is this really bad?

I’ve learned to call `brew install` as rarely as possible, because it will recursively update all dependencies of the thing I’m installing to Homebrew’s current versions – that’s the only thing it can do, no other versions “exist” – and this means replacing possibly large quantities of software that works fine with software that might not work.

And once that happens, you can’t get the old versions back.  It was installed and running on your machine a moment ago, but to Homebrew it doesn’t exist anymore.

If you need to get old versions back, because you need your computer to work or some nonsense like that, you will probably find yourself reading this Stack Overflow thread, which has been chugging along since 2010 with no fully satisfying resolution.  Some highlights:

[screenshots of Stack Overflow answers omitted]
¯\_(ツ)_/¯

Engineering is about trade offs. Latest version only and unconditional dependencies obviate the need for a SAT solver. Many homebrew packages expect to deal with untrusted input from the network. Latest version only greatly simplifies issues surrounding securing old versions of software and aligning lifecycles of dependencies with different release cycles. A ton of seemingly boring bugs get fixed and don’t get CVEs with backports to all stable branches because the security implications weren’t obvious to whoever found and fixed the bug.

Homebrew Python still provides pip, you can still spin up a virtualenv with a curated requirements.txt on Homebrew Python if that floats your boat.

Homebrew still needs its Python to support end user Python apps shipped as part of homebrew, including some apps that are pretty strongly evergreen. (Someone around here had a rant about youtube-dl in Ubuntu being broken by the time the distro releases).

If you need exact point releases of all your dependencies, including a specific version of postgres, docker might be the better fit for the job. I also hear good things about conda, but I can’t vouch for it, and the installers also seem to be tied to recent Python versions newer than NEP 0029 requires.

There are a bunch of things I’d rather see homebrew change before better support for version pinning. I’d love to see them get out of a shared /usr/local that lots of other things pollute, handle conflicting binaries better, and track better data about when to rebottle due to changes in build-time dependencies.

My real hot take about reproducible computing on mac is that it would be nice if macOS had a better container option for building and running macOS (not linux) software.

Most of this is over my head – which is not a criticism.  I’m not very familiar with package management in general, and I wrote the OP thinking maybe this behavior is normal and I’m just not used to it.

However, insofar as I understand your argument, I’m not convinced.  It sounds like you’re arguing that, because Homebrew forces the user into all new releases, users of Homebrew will stay up to date with security patches:

Latest version only greatly simplifies issues surrounding securing old versions of software […] A ton of seemingly boring bugs get fixed and don’t get CVEs with backports to all stable branches because the security implications weren’t obvious to whoever found and fixed the bug.

But this cuts both ways.  Experience has taught me not to ever run `brew install` or `brew update` unless I have hours of spare time set aside to deal with the fallout if necessary.  So, I never run those commands unless I’m forced to – which means that, usually, none of these patches reach my machine.

—-

Taking a step back: I don’t think I necessarily object to a lack of support for multiple package versions.  (Since Homebrew is mostly a binary installer these days, I understand that supporting these would be a large cost for their build process.)

What I really object to is the inherent instability of Homebrew-core, the collection of packages you are pulling from when you run `brew install` or `brew update` as a typical user.

Unlike virtually any other mature project I interact with, Homebrew-core does not have versions or releases.  It is a git repo with one branch, no tags, ~179000 commits to master, and ~59000 closed PRs.

Using an “up to date” Homebrew (which will happen unless you try hard to stop it) means using the very latest built commit to this master branch, which probably occurred within the last 24 hours.

—-

I’m not actually using Homebrew for development – I have a few dev tools installed through it, but I’m not looking for version pins so I can build software.  I’m just trying to install software as a normal user, so I can use it.

And if something breaks, I want to be able to say “okay, I’ll try downgrading back to version 7.3.11″ or something like that.  Some pointer to the thing I had before I updated.  Like I get with any other software.

I can’t do that with Homebrew packages.  I can’t do it with Homebrew-core, the collection of Homebrew packages.  The closest things to version numbers are individual commits to homebrew-core master, and even then I don’t know which commit I was on yesterday, before I ran `brew update` (desperate times call for etc.)

I do know which commit I’m on now, though!  `brew --version` tells me:

Homebrew/homebrew-core (git revision 8a34ac; last commit 2020-10-27)

which is a commit to update something called jfrog-cli to 1.40.0, made 22 hours ago, very close to the time I ran `brew update`.

Many commits have been made in the 22 hours since then, and every one makes all prior Homebrew configurations effectively unrecoverable, if usually in a superficially harmless way.

History moves forward and the past is erased.  What will be true tomorrow?  In a month?  In six months?

And how will I even know the name of the ephemeral past I have lost?  As “8a34acb309ba9d62b2d0377fe76c1a5731ddacc7″, a hash I was careful enough to write down this time around?  Seriously?
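For what it’s worth, since homebrew-core is an ordinary git checkout, git’s own reflog may be able to recover that ephemeral past without writing hashes down. A minimal sketch using a throwaway repo (the real checkout’s location, `brew --repository homebrew/core`, is an assumption about default installs, and reflog entries do eventually expire):

```python
# Sketch: homebrew-core is an ordinary git checkout, so git's reflog records
# where HEAD was before `brew update` moved it. Demonstrated here with a
# throwaway repo; on a real install the checkout lives at the path printed by
# `brew --repository homebrew/core` (an assumption about default setups).
import subprocess
import tempfile

def git(repo, *args):
    return subprocess.run(
        ["git", "-C", repo, *args],
        check=True, capture_output=True, text=True,
    ).stdout.strip()

repo = tempfile.mkdtemp()
ident = ["-c", "user.email=me@example.com", "-c", "user.name=me"]
git(repo, "init", "-q")
git(repo, *ident, "commit", "-q", "--allow-empty", "-m", "yesterday's state")
before = git(repo, "rev-parse", "HEAD")
git(repo, *ident, "commit", "-q", "--allow-empty", "-m", "what `brew update` pulled in")

# HEAD@{1} is where HEAD pointed before the last operation that moved it
recovered = git(repo, "rev-parse", "HEAD@{1}")
print(recovered == before)  # True
```

On a real system the equivalent would be `git -C "$(brew --repository homebrew/core)" reflog`, which shows the commits HEAD has visited, including the one from before yesterday’s `brew update`.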

serinemolecule:

shieldfoss:

voxette-vk:

nostalgebraist:

What is the point of smart glasses?  At best they’ll have the computing capabilities of a smartphone, so a pair of smart glasses sounds functionally similar to, say, a headband with a pocket for a smartphone.

The glasses can do one thing the headband can’t: they can produce visual hallucinations.  But this sounds incompatible with most of daily life, and not of enough benefit to justify the radical lifestyle change.

This is from Facebook – does it sound like an inspirational futuristic dream to you?  To me it sounds like “Clippy, for your visual field”:

Imagine a pair of glasses that add a 3D layer of useful, contextually-relevant and meaningful information on top of the physical world. Such a device could help us perform everyday tasks better — like finding your keys, navigating a new city, or capturing a moment; but it could also open up an entirely new way of moving through the world. Smartphones are amazing devices, and they’re getting better all the time. But at Facebook Reality Labs, we’re envisioning a time when we have all the benefits of connectivity (and more), without the need to keep our heads and our eyes down, looking at a device. Imagine calling a friend and chatting with their lifelike avatar across the table. Imagine a digital assistant smart enough to detect road hazards, offer up stats during a business meeting, or even help you hear better in a noisy environment. This is a world where the device itself disappears entirely into the ebb and flow of everyday life.

Seriously, “offer up stats during a business meeting”?  This sounds like an ill-conceived product from a Tim and Eric “Cinco” sketch!

It would be cool to have a HUD for things like the time, your text messages or emails (or at least the subject line), navigation, maybe your heart rate if you’re exercising, even the title of a song that’s playing on your headphones.

In theory, I think it could be pretty unobtrusive. I mean the HUD idea worked great for pilots! It’s not glasses, but it’s the same idea of having an overlay on the visual field.

The stats during a business meeting idea is pretty stupid, I agree.

Speaking as a person who runs weekly D&D sessions, “additional info immediately to hand” sounds incredible. I would absolutely love that.

Right now I have a laptop for that, but it has kind of limited screen real estate - if I could just look left and see a stat block, while having my actual field of view in front of me, towards the players, free from distractions, that would be pretty great.

But I don’t trust these people to punch a nail into wood, much less devise actually-useful smart glasses.

I’ll add on that basically all games have HUDs. HP bars, maps, status indicators, whatever. Cars have dashboards, computers have status bars… most things humans build have an HUD. I feel like you need a serious lack of imagination to look at a tool that can add an HUD or HP bars or whatever to anything and think “this has no use whatsoever”.

Even as Google Glass for consumers shut down (we as humanity just don’t like having cameras pointed at us), Google Glass for companies is still going strong, because there are a lot of jobs that having an HUD is just really really useful for (imagine a warehouse job where every box has an AR label telling you where you’re supposed to put it).

Replying to @serinemolecule​ specifically

Maybe it wasn’t clear, but my question in the OP was not rhetorical.  I wasn’t saying "this obviously has no point” but “I can’t figure out what the point is supposed to be.”

I don’t need to be convinced that HUDs have uses.  I need to be convinced that there are (technologically plausible) HUD applications that would be useful in the daily lives of many/most people.  (Something like a Google Maps AR overlay is the best thing I can come up with at the moment.)

Also, doesn’t this

most things humans build have an HUD

negate the value of this?

a tool that can add an HUD […] to anything

I think you mean the first one as evidence that “HUDs are useful,” but it also sounds like evidence that they are already there around us, in exactly the places they need to be.

is gpt-3 few-shot ready for real applications?

the-moti:

nostalgebraist:

This is a lengthy reply to @the-moti​​‘s post here.  Creating a new post to limit thread length, and so I can crosspost to LW.

@the-moti​​ says, in part:

This obviously raises two different questions: 1. Why did you think that no one would use few-shot learning in practice? 2. Why did other people think people would use few-shot learning in practice?

I would be interested in hearing your thoughts on these two points.

Thanks for asking!

First of all, I want to emphasize that the GPT-3 paper was not about few-shot GPT-3 as a practical technology.

(This is important, because the paper is the one large body of quantitative evidence we have on few-shot GPT-3 performance.)

This is not just my take on it: before the OpenAI API was announced, all the discussion I saw took for granted that we were talking about a scientific finding and its broader implications.  I didn’t see any commentator whose main takeaway was “wow, if I could do this few-shot thing right now, I could build amazing projects with it.”

Indeed, a common theme in critical commentary on my post was that I was too focused on whether few-shot was useful right now with this specific model, whereas the critical commentators were more focused on the implications for even larger models, the confirmation of scaling laws over a new parameter regime, or the illustration-in-principle of a kind of meta-learning.  Gwern’s May newsletter is another illustrative primary source for the focus of the discussion in this brief “pre-API” period.  (The API was announced on June 11.)

As I read it (perhaps benefitting from hindsight and discussion), the main points of the paper were

(1) bigger models are better at zero/few-shot (i.e. that result from the GPT-2 paper holds over a larger scale),

(2) more “shots” are better when you’re doing zero/few-shot,

(3) there is an interaction effect between 1+2, where larger models benefit more from additional “shots,”

(4) this could actually become a practical approach (even the dominant approach) in the future, as illustrated by the example of a very large model which achieves competitive results with few-shot on some tasks

The paper did not try to optimize its prompts – indeed its results are already being improved upon by API acolytes – and it didn’t say anything about techniques that will be common in any application, like composing together several few-shot “functions.”  It didn’t talk about speed/latency, or what kind of compute backend could serve many users with a guaranteed SLA, or how many few-shot “function” evaluations per user-facing output would be needed in various use cases and whether the accumulated latency would be tolerable.  (See this post on these practical issues.)

It was more of a proof of concept, and much of that concept was about scaling rather than this particular model.

So I’d argue that right now, the ball is in the few-shot-users’ court.  Their approach might work – I’m not saying it couldn’t!

In their favor: there is plenty of room to further optimize the prompts, explore their composability, etc.

On the other hand, there is no body of evidence saying this actually works.  OpenAI wrote a long paper with many numbers and graphs, but that paper wasn’t about whether their API was actually a good idea.  (That is not a criticism of the paper, just a clarification of its relevance to people wondering whether they should use the API.)

This is a totally new style of machine learning, with little prior art, running on a mysterious and unproven compute backend.  Caveat emptor!

Anyway, on to more conceptual matters.

The biggest advantages I see in few-shot learning are

(+1) broad accessibility (just type English text) and ability to quickly iterate on ideas

(+2) ability to quickly define arbitrary NLP “functions” (answer a factual question, tag POS / sentiment / intent, etc … the sky’s the limit), and compose them together, without incurring the memory cost of a new fine-tuned model per function

What could really impress me is (+2).  IME, it’s not really that costly to train new high-quality models: you can finetune BERT on a regular laptop with no GPU (although it takes hours), and on ordinary cloud GPU instances you can finetune BERT in like 15 minutes.

The real cost is keeping around an entire finetuned model (~1.3GB for BERT-large) for each individual NLP operation you want to perform, and holding them all in memory at runtime.

The GPT-3 approach effectively trades this memory cost for a time cost.  You use a single very large model, which you hope already contains every function you will ever want to compute.  A function definition in terms of this model doesn’t take a gigabyte to store, it just takes a tiny snippet of text/code, so you can store tons of them.  On the other hand, evaluating each one requires running the big model, which is slower than the task-specific models would have been.

So storage no longer scales badly with the number of operations you define.  However, latency still does, and latency per call is now much larger, so this might end up being as much of a constraint.  The exact numbers – not well understood at this time – are crucial: in real life the difference between 0.001 seconds, 0.1 seconds, 1 second, and 10 seconds will make or break your project.
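A back-of-envelope version of that scaling argument, where every constant is an illustrative assumption rather than a measurement:

```python
# Back-of-envelope storage scaling for "many finetuned models" vs. "one big
# model + prompts". All numbers are assumptions: ~1.3 GB per finetuned
# BERT-large, ~350 GB for a 175B-parameter model at 16-bit precision, and a
# few KB of prompt text per few-shot "function" definition.
BERT_LARGE_GB = 1.3
BIG_LM_GB = 350.0
PROMPT_GB = 4e-6  # ~2K tokens of text

def storage_gb(n_functions, approach):
    if approach == "finetune":
        # one finetuned model per NLP operation: linear in the number of ops
        return n_functions * BERT_LARGE_GB
    # few-shot: one shared model, plus a tiny prompt per operation
    return BIG_LM_GB + n_functions * PROMPT_GB

print(round(storage_gb(10, "finetune")))    # 13: finetuning wins at small scale
print(round(storage_gb(1000, "finetune")))  # 1300
print(round(storage_gb(1000, "few-shot")))  # 350: few-shot wins with many ops
```

The crossover in storage comes only when you define hundreds of operations; below that, the fleet of small finetuned models is cheaper to hold in memory.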


As for the potential downsides of few-shot learning, there are many, and the following probably excludes some things I’ve thought of and then forgotten:

(-1) The aforementioned potential for deal-breaking slowness.

(-2) You can only provide a very small amount of information defining your task, limited by context window size.

The fact that more “shots” are better arguably compounds the problem, since you face a tradeoff between providing more examples of the same thing and providing examples that define a more specific thing.

The extent to which this matters depends a lot on the task.  It’s a complete blocker for many creative applications which require imitating many nuances of a particular text type not well represented in the training corpus.

For example, I could never do @nostalgebraist-autoresponder​​ with few-shot: my finetuned GPT-2 model knows all sorts of things about my writing style, topic range, opinions, etc. from seeing ~3.65 million tokens of my writing, whereas with few-shot you can only identify a style via ~2 thousand tokens and hope that’s enough to dredge the rest up from the prior learned in training.  (I don’t know if my blog was in the train corpus; if it wasn’t, we’re totally screwed.)

I had expected AI Dungeon would face the same problem, and was confused that they were early GPT-3 adopters.  But it turns out they actually fine-tuned (!!!!), which resolves my confusion … and means the first real, exciting GPT-3 application out there isn’t actually a demonstration of the power of few-shot but in fact the opposite.

With somewhat less confidence, I expect this to be a blocker for specialized-domain applications like medicine and code.  The relevant knowledge may well have been present in the train corpus, but with so few bits of context, you may not be able to overcome the overall prior learned from the whole train distribution and “zoom in” to the highly specialized subset you need.
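The budget problem in (-2) is easy to make concrete. The 2048-token window is GPT-3’s actual context size; the example lengths below are invented for illustration:

```python
# Concrete version of the (-2) tradeoff: a fixed 2048-token window must hold
# the task description, every "shot", and the query itself. The example
# lengths used below are made up for illustration.
CONTEXT_TOKENS = 2048

def max_shots(instruction_tokens, tokens_per_shot, query_tokens):
    """How many examples fit alongside the instruction and the query."""
    budget = CONTEXT_TOKENS - instruction_tokens - query_tokens
    return max(budget // tokens_per_shot, 0)

print(max_shots(50, 30, 50))    # 64: short examples leave room for many shots
print(max_shots(50, 500, 200))  # 3: long, nuanced examples crowd each other out
```

This is the tradeoff in miniature: the more nuance each example carries, the fewer of them you can show at all.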

(-3) Unlike supervised learning, there’s no built-in mechanism where you continually improve as your application passively gathers data during usage.

I expect this to be a big issue in commercial applications.  Often, a company is OK accepting a model that isn’t great at the start, if it has a mechanism for self-improvement without much human intervention.

If you do supervised learning on data generated by your product, you get this for free.  With few-shot, you can perhaps contrive ways to feed in segments of data across different calls, but from the model’s perspective, no data set bigger than 2048 tokens “exists” in the same world at once.

(-4) Suffers a worse form of the ubiquitous ML problem that “you get exactly what you asked for.”

In supervised learning, your model will avoid doing the hard thing you want if it can find easy, dumb heuristics that still work on your train set.  This is bad, but at least it can be identified, carefully studied (what was the data/objective? how can they be gamed?), and mitigated with better data and objectives.

With few-shot, you’re no longer asking an arbitrary query and receiving, from a devious genie, the response you deserve.  Instead, you’re constrained to ask queries of a particular form: “what is the next token, assuming some complicated prior learned from sub-sampled Common Crawl + WebText + etc.?”

In supervised learning, when your query is being gamed, you can go back and patch it in arbitrary ways.  The lower bound on this process comes only from your skill and patience.  In few-shot, you are fundamentally lower-bounded by the extent to which the thing you really want can be expressed as next-token prediction over that complicated prior.  You can try different prompts, but ultimately you might run into a fundamental bound here that is prohibitively far from zero.  No body of research exists to establish how bad this effect will be in typical practice.

I’m somewhat less confident of this point: the rich priors you get out of a large pretrained LM will naturally help push things in the direction of outcomes that make linguistic/conceptual sense, and expressing queries in natural language might add to that advantage.  However, few-shot does introduce a new gap between the queries you want to ask and the ones you’re able to express, and this new gap could be problematic.

(-5) Provides a tiny window into a huge number of learned parameters.

GPT-3 is a massive model which, in each call, generates many intermediate activations of vast dimensionality.  The model is pre-trained by supervision on a tiny subset of these, which specify probability distributions over next-tokens.

The few-shot approach makes the gamble that this same tiny subset is all the user will need for applications.  It’s not clear that this is the right thing to do with a large model – for all we know, it might even be the case that it is more suboptimal the larger your model is.

This point is straying a bit from the central topic, since I’m not arguing that this makes GPT-3 few-shot (im)practical, just suboptimal relative to what might be possible.  However, it does seem like a significant impoverishment: instead of the flexibility of leveraging immense high-dimensional knowledge however you see fit, as in the original GPT, BERT, adapters, etc., you get even immenser and higher-dimensional knowledge … presented through a tiny low-dimensional pinhole aperture.

The main reason I initially thought “no one would use few-shot learning like this” was the superior generalization performance of fine-tuning.  I figured that if you’re serious about a task, you’ll care enough to fine-tune for it.

I realize there’s a certain mereology problem with this argument: what is a “single task,” after all?  If each fine-tuned model incurs a large memory cost, you can’t be “serious about” many tasks at once, so you have to chunk your end goal into a small number of big, hard tasks.  Perhaps with few-shot, you can chunk into smaller tasks, themselves achievable with few-shot, and then compose them.

That may or may not be practical depending on the latency scaling.  But if it works, it gives few-shot room for a potential edge.  You might be serious enough about a large task to fine-tune for it … but what if you can express it as a composition of smaller tasks you’ve already defined in the few-shot framework?  Then you get it instantly.

This is a flaw in the generalization performance argument.  Because of the flaw, I didn’t list that argument above.  The list above provides more reasons to doubt few-shot above and beyond the generalization performance argument, and again in the context of “serious” work where you care enough to invest some time in getting it right.

I’d like to especially highlight points like (-2) and (-3) related to scaling with additional task data.

The current enthusiasm for few-shot and meta-learning – that is, for immediate transfer to new domains with an extremely low number of domain examples – makes sense from a scientific POV (humans can do it, why can’t AI?), but strikes me as misguided in applications.

Tiny data is rare in applied work, both because products generate data passively, and because if a task might be profitable, then it’s worth paying an expert to sit down for a day or two and crank out ~1K annotations for supervised learning.  And with modern NLP like ELMo and BERT, ~1K is really enough!

It’s worth noting that most of the superGLUE tasks have <10K train examples, with several having only a few hundred.  (This is a “low-data regime” relative to the expectations of the recent past, but a regime where you can now get good results with a brainless cookie-cutter finetuning approach, in superGLUE as in the rest of life.)

[image: table of superGLUE tasks with their training-set sizes]

GPT-3 few-shot can perform competitively on some of these tasks while pushing that number down to 32, but at the cost of many downsides, unknowns, and flexibility limitations.  Which do you prefer: taking on all those risks, or sitting down and writing out a few more examples?

The trajectory of my work in data science, as it happens, looks sort of like a move from few-shot-like approaches toward finetuning approaches.

My early applied efforts assumed that I would never have the kind of huge domain-specific corpus needed to train a model from scratch, so I tried to compose the output of many SOTA models on more general domains.  And this … worked out terribly.  The models did exactly what they were trained to do, not what I wanted.  I had no way to scale, adapt or tune them; I just accepted them and tried to work around them.

Over time, I learned the value of doing exactly what you want, not something close to it.  I learned that a little bit of data in your actual domain, specifying your exact task, goes much further than any domain-general component.  Your applied needs will be oddly shaped, extremely specific, finicky, and narrow.  You rarely need the world’s greatest model to accomplish them – but you need a model with access to a very precise specification of exactly what you want.

One of my proudest ML accomplishments is a system that does something very domain-specific and precisely shaped, using LM-pretrained components plus supervised learning on ~1K of my own annotations.  Sitting down and personally churning out those annotations must have been some of the most valuable time I have ever spent at work, ever.  

I wanted something specific and finicky and specialized to a very particular use case.  So I sat down and specified what I wanted, as a long list of example cases.  It took a few days … and I am still reaping the benefits a year later.

If the few-shot users are working in domains anything like mine, they either know some clever way to evade this hard-won lesson, or they have not yet learned it.

But to the other question … why are people so keen to apply GPT-3 few-shot learning in applications?  This question forks into “why do end users think this is a good idea?” and “why did OpenAI provide an API for doing this?”

I know some cynical answers, which I expect the reader can imagine, so I won’t waste your time writing them out.  I don’t actually know what the non-cynical answers look like, and my ears are open.

(For the record, all of this only applies to few-shot.  OpenAI is apparently going to provide finetuning as a part of the API, and has already provided it to AI Dungeon.  Finetuning a model with 175B parameters is a whole new world, and I’m very excited about it.

Indeed, if OpenAI can handle the costs of persisting and running finetuned GPT-3s for many clients, all of my concerns above are irrelevant.  But if typical client use of the API ends up involving a finetuning step, then we’ll have to revisit the GPT-3 paper and much of the ensuing discussion, and ask when – if not now – we actually expect finetuning to become obsolete, and what would make the difference.)

This is a really lovely post, with way more information than I expected or hoped for!

I want to respond to some small bits of it.

1. It’s funny that AI Dungeon just got OpenAI to do their finetuning for them. I guess it’s not surprising that the people who work at OpenAI are huge fuckin’ nerds.

It now strikes me as quite weird that a bunch of people are doing their few-shot learning experiments on AI Dungeon, which is finetuned for something completely different from what they’re trying to do (although maybe the settings they use get you an un-finetuned model).

It seems like if OpenAI is serious about letting people do this prompt programming stuff, they could develop a version that’s fine-tuned on “the stuff people generally want to do with prompt programming” and make that available.

2. I very much didn’t realize when making my original post about the low cost of finetuning BERT. I was thinking about the cost of prompt programming GPT3 vs. the cost of fine-tuning GPT3, but of course since few-shot GPT3 is only just barely competitive to finetuned BERT on a bunch of tasks, that is the more reasonable comparison.

3. Based on all your points about memory / latency I feel like there’s got to be a lot of work going on right now, with all the things that gpt2 and gpt3 have demonstrated that they can do, of trying to figure out if it’s possible to do those things with a lower amount of neurons, to get the memory and latency down.

Alternately if someone is really, really serious that few-shot is better than fine-tuning they could try to design a chip architecture only to run this one neural network. I bet it would run fast then!

This is a really lovely post, with way more information than I expected or hoped for!

Thanks!!

It’s funny that AI Dungeon just got OpenAI to do their finetuning for them. I guess it’s not surprising that the people who work at OpenAI are huge fuckin’ nerds.

IIRC, OpenAI plans to make finetuning on demand a standard part of the API.  (Or maybe they already have by now, but I expect I would have heard?)  I’m like 95% sure I saw an official tweet to this effect, although I can’t seem to find it now.

Until this feature actually materializes, though, it’s hard to know what to make of it.

Finetuning is way more computationally expensive than prompting, and expensive in different ways, so it will have to be gated in some extra way.  Maybe you have to pay money each time, maybe you’re limited to some max number of finetuning jobs per unit time, maybe both.

The big question in my mind is like, “can finetuning be a routine part of each API client’s workflow, or is it more like this big splurge they can do once a year / only if they’re in some premium commercial client tier / etc?”

(I don’t know when OpenAI plans to move the API out of beta, and I also don’t know when hardware will improve enough that finetuning GPT-3 is no big deal, but intuitively it seems like the former has to precede the latter by a while.)

It now strikes me as quite weird that a bunch of people are doing their few-shot learning experiments on AI Dungeon, which is finetuned for something completely different from what they’re trying to do (although maybe the settings they use get you an un-finetuned model).

AFAIK, people just didn’t know it was fine-tuned, and the AI Dungeon people have been working hard to correct the misconception since they realized it was being used in this way.

As another mechanism to make their product less like directly talking to the API (for a lower price), they also apparently use GPT-2 for the very first prompt-and-response pair, then GPT-3 afterwards.

Based on all your points about memory / latency I feel like there’s got to be a lot of work going on right now, with all the things that gpt2 and gpt3 have demonstrated that they can do, of trying to figure out if it’s possible to do those things with a lower amount of neurons, to get the memory and latency down.

People definitely care about this a lot with BERT, with a ton of different compressed-BERT variants on offer.  See here (Section 7.2 and the associated Table 1) and here for an overview.

AFAIK, there’s much less interest in compressing GPT-like models than in compressing BERT.  At its largest, BERT is only as big as one of the smaller GPT-2s, and people really want to make that little thing smaller, even as the GPTs grow far vaster.  This seems like almost a cultural divide:

  • People who work on “encoder-only + denoising loss” models like BERT are very interested in compression and interpretation.

    Their goal isn’t pushing the envelope with NLP performance.  It’s taking the already high performance of BERT and boiling it down to its essentials, teasing apart how it works, trimming out any unnecessary parts, making the workflow more reproducible, making the model faster and smaller, making it run on phones and cheaply in the cloud.

    There are lots of people/groups working on this, in industry and academia.

  • The people who work on “decoder-only + LM loss” models like GPT-n are … basically just OpenAI and people using GPT-2 for creative work?

    GPT-n is really cool, the generated text impresses everyone, but the decoder-only style of transformer seems to do worse in a finetuning / supervised learning context.  (The original BERT paper provided some evidence of this, in its comparisons of BERT to “OpenAI GPT,” and the T5 paper demonstrated it more extensively.  Cf. discussion here.)

    So if you want to do anything except generate text, and you have a finite parameter budget, you’ll spend it on BERT, not GPT-n.

    As I understand it, OpenAI’s approach is instead to frame every problem as text generation, then make ever larger models.  You need vastly more parameters to get comparable performance this way, but I think the hope is that better hardware will mean today’s “huge” is tomorrow’s “normal,” and that people will prefer working with a natural-language interface even if you could get away with a smaller model otherwise.

    Gwern is a very vocal advocate of this mindset, see e.g. here and also our exchange in the comments on that post.

I guess one could imagine things like … I dunno, distilling specific few-shot “functions” into much smaller models, with GPT-n being just the interface by which you discover these functions?  Maybe OpenAI is working on this for all I know.
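That distillation idea would amount to the standard soft-label objective, sketched below. This is generic knowledge distillation (Hinton-style), not anything OpenAI has described; the temperature and logits are made up:

```python
# Generic knowledge-distillation objective: a small "student" is trained to
# match the big model's output distribution on each prompt, using the big
# model's softened probabilities as targets. Values here are illustrative.
import math

def softmax(logits, T=1.0):
    zs = [z / T for z in logits]
    m = max(zs)  # subtract max for numerical stability
    es = [math.exp(z - m) for z in zs]
    total = sum(es)
    return [e / total for e in es]

def distillation_loss(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) at temperature T, per training example."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

# A student that already matches the teacher has zero loss:
print(distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))  # 0.0
```

In this framing, each few-shot “function” discovered through GPT-n would become a (prompt, teacher-output) dataset for training a much smaller task-specific student.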

the-moti:

nostalgebraist:

@stumpyjoepete replied to your post “I don’t think (?) I’ve said this before, even though it seems…”

Is there a reason they’ve been so successful at apparently hard problems with this technique? I wouldn’t generally expect that “apply wholly generic optimization” would ever lead to advances in the state of the art of anything. So was the secret sauce actually elsewhere in what they did, and the RL was just a boring part people latched onto? If so, what was it?

Good question.  First off, two things that are important here:

1. Again, RL isn’t a technique, it’s a problem formulation.  Some problem domains are inherently hard to formulate in any terms less generic than RL, so in these domains, any machine-learned/statistical approach will look like “RL.”

This exerts a conditioning/selection effect on the comparisons we make.  The impressive results demonstrated for DeepMind’s series of game-players (AlphaGo, AlphaGo Zero, AlphaZero, MuZero) were “beating top humans” and “beating top non-ML programs that use hardcoded rules/functions.”

There is no slot there for “beating ML programs that didn’t ‘use RL’,” because if you make the usual reductions away from RL in this domain, you have to accept prohibitive limits on train data size (see below).

2. There is a distinction between “doing RL” and “applying wholly generic optimization.”  What makes something RL is the fully generic problem statement, but the technique / model architecture can be as specialized as you like.

In the last part of my post, I critiqued work on domain-general RL, because that work can’t specialize either the model or the problem description, so it really is “wholly generic optimization.”  But in actual applications like the DeepMind game-players, you phrase the problem as a wholly generic “do a thing well” but then construct a thing-doer in full awareness of what specific thing you’re trying to do.

(DeepMind’s players have successfully removed more and more of the baked-in domain knowledge while still performing well, with their latest one – MuZero – being pretty generic across the domain of transparently scored games with 2D spatial states and sufficiently non-huge [?] action spaces, but that’s still far away from “do a generic thing.”)

—-

I’ve said the domains where the SOTA looks like RL are the domains where statistical learning cannot be put in a simpler form than RL.  Which are these?

My impression is that “doing RL” has led to impressive SOTA results mostly in board/computer games.  (This may be out of date – I think it was at least true in 2018.)

So, what’s special about games?

Relevant features of the problem defn. for games

Objective evaluation of quality happens at the full game level (win/loss or total points), and a game comprises many successive moves.

This is the big thing that makes this inherently an “RL” domain.  In some domains, there is a natural, objective quality metric for single actions – for example, in language modeling, the task is “predict the next word/token,” there is always a correct answer (the true next word/token), and there isn’t some other real metric like “winning the game” for which this is a mere proxy.

In a game, we can invent move-quality metrics, like predicting the next move of a skilled player, but these are proxies.  The true, objective definition of “a good chess move” is one that wins games, whatever that means, period.

Any program has to pick its moves one by one, so it has some (at least implicit) function for scoring moves.  Either this function is hardcoded (so not statistical learning), or it’s learned from a proxy (like imitating skilled players), or it’s learned from the true quality metric (this is RL).

So, in statistical learning, we either optimize a move-level proxy or optimize at game-level.  The statement that “RL works for games” = the statement that the latter is superior.

Relevant facts about data generation for games

We can optimize at the move level or at the game level.  The latter matches what we actually care about, but is extremely inefficient: 

- An entire board game, played to the end, gives us a single bit of signal (did we win?)

- And, even this is not a direct signal about what the move-quality metric ought to be, but an indirect signal about all the moves at once.  We must (in some sense) statistically learn an attribution function that decides what the win/loss implies for individual moves.  Such a function could look many different ways, and we must spend many bits of information setting its parameters above and beyond the ones we spend setting the parameters of the move-scoring function.
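The attribution problem above can be made concrete with a toy sketch. Everything here is invented for illustration: the "policy" is just a weight per move, and a crude REINFORCE-style update spreads the single win/loss bit uniformly over every move played.

```python
import random

# Toy REINFORCE-style credit assignment (all details invented for
# illustration).  A "policy" is a weight per move; after a whole game we
# get a single win/loss bit, which nudges *every* move played toward or
# away from that outcome -- the crude attribution function described above.

MOVES = ("a", "b", "c")

def play_episode(policy, length=5):
    # "Win" iff move "a" was chosen a majority of the time -- a stand-in
    # for a game-level quality metric with no per-move ground truth.
    played = random.choices(MOVES, weights=[policy[m] for m in MOVES], k=length)
    return played, played.count("a") > length / 2

def reinforce_update(policy, played, win, lr=0.1):
    # One bit of signal, spread uniformly over every move in the game.
    for m in played:
        policy[m] *= (1 + lr) if win else (1 - lr)

random.seed(0)
policy = {m: 1.0 for m in MOVES}
for _ in range(2000):
    played, win = play_episode(policy)
    reinforce_update(policy, played, win)

# The game-level signal has (slowly) identified the winning move:
assert policy["a"] > policy["b"] and policy["a"] > policy["c"]
```

Note how inefficient this is: thousands of full episodes are needed to learn a preference a move-level label could have taught in a handful of examples.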

But in games, you can be inefficient as long as you’re only playing against computers.  It’s cheap to generate enormous amounts of example data with gold-standard scores attached, by just playing the game inside the computer.  This allows training on arbitrary numbers of examples, limited only by compute.

Meanwhile, if you want to train on a move-quality signal, you must use data from human players – and at high skill level, there’s only a finite and tiny quantity of that.  So we’re comparing an efficient method on a finite resource to an inefficient method on a resource only bounded by compute.  As compute grows, eventually the latter wins.

Other facts making RL less infeasible for games

Via self-play, it’s possible to generate large amounts of adversarial data that probe the model’s current weaknesses.  However good the program is, when it faces itself, it faces a worthy opponent.  Thus we can avoid overfitting to incidental features of some single fixed environment, which is a big problem in other RL work.
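The "worthy opponent" point can be seen in a toy Bradley-Terry-style model of match outcomes (the logistic form and all numbers here are assumptions for the sketch): against any fixed opponent the win-rate signal saturates as you improve, while against yourself it stays balanced at 50% forever.

```python
import math

# Toy Bradley-Terry-style model of match outcomes (numbers invented):
# P(win) depends only on the skill difference.  Against a fixed opponent
# the signal saturates as skill grows; against yourself it stays balanced.

def p_win(skill_a, skill_b):
    return 1 / (1 + math.exp(skill_b - skill_a))

for my_skill in (0.0, 2.0, 5.0):
    print(f"skill={my_skill}: "
          f"P(win) vs fixed opponent = {p_win(my_skill, 0.0):.2f}, "
          f"vs self = {p_win(my_skill, my_skill):.2f}")
```

A saturated signal (all wins) carries almost no information per game; the balanced self-play signal keeps every game informative.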

Quality, although only defined at the game level, is unambiguous where it’s defined, so we don’t have misspecification/paperclip-maximizer issues, which is another big problem in other RL work.

—-

To conclude, the cases where the best solution looks like RL are cases where, roughly:

- There is no natural quality metric at the single-action level

- There is an unambiguous quality metric over larger chains of actions

- Our source of quality-scored action chains as training data is only limited by compute

- Some other properties that let you avoid common pitfalls

- The task is simple enough in terms of the size of the action space, complexity of the dynamics, etc.  (No one knows exactly what “simple enough” means, but no one thinks that the DeepMind players won’t eventually break as you scale up the problem.  For example, they’re finite-sized convnets with finite model capacity, and you can imagine an environment generated by a dynamics with so many parameters that you can run something of that size forwards on current hardware, but not backpropagate over it.)

It’s a narrow regime where – almost “perversely” – much more data becomes available when you formulate the problem in the least data-efficient manner, so that the gain in data volume dominates the loss in per-example efficiency, and learning fewer bits per step still leaves you having learned more bits at the end.  It’s cool that this regime includes a few tasks considered among the pinnacles of human achievement, but it’s still a narrow regime, with not much else of interest in it.

I really like this analysis!

I have some questions/comments/blathering.

1. There are a lot of domains that are like games in the sense that they have defined moves and win states that can be simulated on a computer at basically arbitrary speed, but humans don’t always know the best moves that lead to a win, and we have a limited data supply of good human moves. The most extreme example I can think of is formal theorem proving, where “winning” is proving the theorem, and “losing” is everything else.

These domains, IMO, do involve quite a lot of things that people care about. It seems that RL has not been as effective in these domains. Do you have a sense of why that is?

One possibility is that these domains are either so intrinsically hard that machine learning overall has made no progress, or, for easier tasks, that existing combinatorial optimization routines are strong enough that machine learning adds little on top of them.

On the other hand, in game-like tasks, it seems that machine learning beats a pure algorithmic combinatorial optimization approach - perhaps because we don’t have good algorithms for adversarial settings.

It’s maybe an interesting data point that in some domains, the top machine learning algorithms are GANs, which basically are designed to take a task that humans would think of as totally unlike chess or go and treat it as a game.

2. For these games, an additional flaw in the human data, beyond the fact that it is limited in quantity, is that humans may just not play these games very well, all things considered! It’s easy to see how a pure supervised learning algorithm could become a little better than humans by imitating the top humans but avoiding blunders, but it’s hard to see how a pure supervised learning algorithm could become a lot better than the top humans. (Well, in chess, you could generate a bunch of Stockfish games and train on those, but then you would be unlikely to become much better than Stockfish, if at all.)

On the other hand, RL bots do play better than humans, to the point that these are some of the only domains where humans have taken ideas generated by ML algorithms and applied them in their own tasks (!!!). Imagine if image recognition algorithms taught us a better way to look at a picture and figure out whether it was a picture of a dog, or if GPT-4 came out and taught novelists new literary techniques!

3. It’s perhaps relevant that the striking success of the AlphaZero and MuZero algorithms in part comes from the fact that the approach is basically as unlike traditional reinforcement learning techniques as possible. In fact I told people that it wasn’t really reinforcement learning until I found out that reinforcement learning refers to a problem statement and not a class of techniques (it lacks the feature where actions that lead to success are reinforced…)  

Instead you basically do supervised learning, trying to predict data (moves and game outcomes), which are generated by a combination of the algorithm itself, and a traditional combinatorial algorithm (Monte Carlo tree search) which you know has good mathematical properties.

So I don’t know how much this should be seen as the same kind of thing as e.g. AlphaStar, which to my knowledge uses much more crude “do more of the things we did in the games that we won” RL strategies, and which hasn’t (IIRC) developed strategies that humans have used.

4. Maybe the overall lesson is something we more-or-less already knew - machine learning algorithms are very, very hungry for data, and so if you want to apply a machine learning algorithm to a problem domain you should first figure out how to obtain or generate the most relevant data for these hungry, hungry boys, and then figure out a way to formulate a gradient descent process that uses that data, rather than deciding initially whether reinforcement learning or supervised learning is the best and then searching for the relevant type of data.

Interesting stuff, thanks!

Re: #1

I’m not too familiar with the area (of theorem proving), but I happened to bump into it when I was interested in graph NNs a while ago.

At that time, I remember finding this paper, with interesting results on SAT solving (I was mainly interested in it as an application of graph NNs).  They treated the whole thing as supervised learning, though.

Looking around now, I found this paper which uses RL to train a graph NN that computes one specific heuristic used in an otherwise standard solver algorithm.  Their section 5 has what looks like a good lit review of the area.  (Outside of SAT, I see plenty of papers when I search for theorem proving and RL, but don’t feel confident opining on them…)

Anyway, here are some random comments on this:

- It seems possible that the similarity you mention (between proving and games) really does mean these approaches will go far?  Maybe “AlphaMath” or whatever is just a year or two of routine work away.

- A mathematically “correct” (i.e. invariance-respecting) input encoding for math stuff requires newer NN architectures, with less prior art / tooling to build on.  Terms in a formula are permutation invariant, and people want an encoding that captures that intrinsically, hence the use of graph NNs.

In domains like board games where your elements have an order, it feels “less bad” to use CNNs or RNNs or whatever, and then you can build on tons of past work with those.  (The DeepMind players use CNNs.)

Two caveats to that, though.  First, DeepMind’s players have gotten less careful about invariances (they stopped doing data augmentation for rotation/reflection symmetry in AlphaZero, and have used the same CNN they “designed” with Go in mind for an increasing range of games).  So maybe this issue just doesn’t matter so much.

Second, if humans understand formulas during proving by repurposing linguistic faculties, then our own encoding is “wrong” in the same way a CNN/RNN’s would be.  So that’s at least a bound on how much this issue could hurt.

- Some of the work on SAT is structured like the DeepMind players, where you have a “traditional” search algorithm, but with an NN supplying the heuristics for how promising different search avenues are.  This gives you various freedoms: which search algorithm, which parts the NN computes, etc.  Researchers are doing a meta-search over these options, and it may take time to find the best one.

- Our standards may just be higher for proving than for games.  Games are generally timed, while proofs generally aren’t, so proofs are really closer to solving chess problems than playing chess.

- Putting that another way, in a game you only have to do better than the adversary, who is doing a lossy search just like you; there presumably are vastly better moves in the search space that neither of you can find, but they don’t matter for beating the adversary.

I think this also provides a certain helpful continuity when learning move scoring: during self-play, you face an adversary about as strong as yourself, so your win/loss signal is pretty balanced and tells you how to slightly modify yourself to be better.  In math, to get better, you need to find problems just at the edge of your capacity so that the signal isn’t just an unhelpful string of wins (too easy) or losses (too hard); in games, self-play finds this regime automatically.  Perhaps, a la your GAN comment, we need ways to make the proving domain even more like a game.

Re #3

I’m not sure I understand what you mean?

Definitely it’s different from conventional RL because search is used in the outer loop, and because of self-play.

Also, except in MuZero for Atari, there’s only one reward per episode, so time discounting isn’t a thing.  We’re not managing explore/exploit within episodes, but just trying to judge how good different moves are based on win/loss statistics, which is what any ML approach to the problem would have to do.

Also also, the loss doesn’t directly say “learn a policy that maximizes expected discounted reward”; it says “learn a policy that imitates a smarter version of yourself (one who can use search)” and “learn a value function that captures expected reward,” and then combines these in search scoring.

I think this is closest to what you were getting at?  The learned policy will play well (“get rewards”) if used directly even without search (see Fig. 6 of AGZ paper, see also discussion in this thread).  But this “learned policy” has a convoluted relationship with the true behavioral policy; it’s trained to imitate what search would do, where search has access to itself and also the separate value function.

The presence of the value function means the “learned policy” isn’t even just “the raw policy that search augments,” it’s a smaller piece of the picture than that.  The relationship between all the pieces is very complicated and self-referential.
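The two-part objective described above can be sketched in a few lines. All numbers here are invented, and real implementations compute this over batches with regularization; the point is just the shape of the loss: cross-entropy pulls the network’s move distribution toward the search-improved one, and squared error pulls its value estimate toward the actual game outcome.

```python
import math

# Minimal sketch of the AlphaZero-style two-part objective (all numbers
# invented): cross-entropy pushes the network's move distribution toward
# the search-improved distribution, and squared error pushes its value
# estimate toward the game outcome.

def loss(net_policy, search_policy, net_value, outcome):
    ce = -sum(ps * math.log(pn) for ps, pn in zip(search_policy, net_policy))
    return ce + (net_value - outcome) ** 2

# e.g. search sharpened a slight preference for move 0 into a strong one,
# and the game was ultimately won (outcome = +1):
example = loss(net_policy=[0.6, 0.4], search_policy=[0.9, 0.1],
               net_value=0.2, outcome=1.0)
```

The self-reference discussed above lives in the targets: `search_policy` and `outcome` are themselves generated by the network playing itself with search.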

Having thought about it while writing this section, it does seem like a mistake to group this in with traditional RL that uses Q-learning or whatever.  Even if we say “RL is a problem formulation,” this stuff distinctively (!) doesn’t quite fit in that formulation, since (again, ignoring Atari) the environment dynamics is entirely self-generated, with no way to write down the environment without terms that refer to the agent.  And the methods used are very different.

(They’re apparently very powerful methods, so all the more reason to give them some name and study them as such, instead of lumping them uselessly under “RL”…)

@philippesaner​ responds to my post from yesterday:

Here on Tumblr, we know that callouts are usually not about what they claim to be about. The alleged “reasons” that somebody needs to be cancelled are not the actual reasons people want to cancel them.

And that’s as true now as it was in 2013; looking at the i-am-a-fish fiasco, callouts may just have gotten worse.

You say you’re confused by this pattern showing up in the media. But why should you be? Why should the David Shor or Lee Fang cancellation be any more sensible than the John Green or glumshoe callouts?

Did you expect this pattern to remain a Tumblr-ism forever?

In the first part of my post, I talked about 3 different things:

  1. Several recent nonsensical callouts of high-profile media figures have been happening
  2. It looks to me like these are happening at a greater-than-usual rate (“an upsurge”)
  3. This coincided with the protests, and various commentators see them as connected to the protests, perhaps a natural outgrowth of them

#1 doesn’t confuse me, for the same reasons it doesn’t confuse you.

#2 confuses me in the trivial way that any change in events surprises me until I can explain it.  Your explanation for #1 doesn’t explain #2 (nor does it intend to).

But I’m not too surprised by #2.  Social trends often acquire their own momentum without needing external pushes.  Especially when it’s a trend like this, where each occurrence is a proof-of-concept for a weapon with broad applicability.  If someone rants about their coworker on twitter, and the one who gets hauled into HR and fired over it is the coworker, bystanders are going to think “hmm!” and contemplate the coworkers they hate … 

#3 is the one that confuses me most, as I said in the post.

It’s easy to imagine mechanisms here, like “protests happen –> corporate world starts making big shows of ally-ship –> some people read the room and decide HR will be more receptive to this kind of thing than usual –> they try it, it works –> others notice it works and try it too.”

That seems plausible, but it’s incompatible with the claim (which I’ve seen frequently in right-wing commentary) that the same “woke” left mindset is behind both the protests and the cancellations.  I find explanations like this most plausible, where the protests and cancellations are “connected” maybe by material cause-and-effect but not by anything deeper like the same people or ideology wanting both.

I guess I could have said “I think people are wrong to say they’re connected,” rather than “I’m confused how they are connected,” but I have a habit of saying the latter when I suspect the former but am not too confident.

the-moti:

nostalgebraist:

Thanks to “GPT-3” I’ve been reading a bunch of ML papers again.  For some reason, this pretty good one got me thinking about a Bayesian statistics issue that strikes me as important, but which I haven’t seen discussed much.

——

Here I’m talking about “Bayesianism” primarily as the choice to use priors and posteriors over hypotheses rather than summarizing beliefs as point estimates.

To have a posterior distribution, you need to feed in a prior distribution.  It’s deceptively easy to make a prior distribution feel natural in one dimension: point to any variable whatsoever in the real world, and say:

“Are you sure about that?  Perfectly sure, down to the last micron/microsecond/whatever?  Or are you fairly agnostic between some values?  Yeah, it’s the latter.  Okay, why not average over the predictions from those, rather than selecting one in a purely arbitrary way?”

This is very convincing!

However, when you add in more variables, this story breaks down.  It’s easy enough to look at one variable and have an intuitive sense, not just that you aren’t certain about it, but what a plausible range might be.  But with N variables, a “plausible range” for their joint distribution is some complicated N-dimensional shape, expressing all their complex inter-dependencies.

For large N, this becomes difficult to think about, both:

  • combinatorially: there is an exploding number of pairwise, three-way, etc. interactions to separately check in your head or – to phrase it differently – an exploding number of volume elements where the distribution might conceivably deviate from its surrounding shape

  • intellectually: jointly specifying your intuitions over a larger number of variables means expressing a more and more complete account of how everything in the world relates to everything else (according to your current beliefs) – eventually requiring the joint specification of complex world-models that meet, then exceed, the current claims of all academic disciplines
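The combinatorial point can be made concrete by just counting subsets: the number of groups of two or more variables that could interact is 2^N − N − 1, which explodes long before N reaches "model of the world" scale.

```python
from math import comb

# Counting what a joint prior over N variables has to get right: the
# number of subsets of 2+ variables that could interact is 2^N - N - 1
# (every subset except the empty set and the N singletons).

for n in (2, 5, 10, 20):
    interactions = sum(comb(n, k) for k in range(2, n + 1))
    print(f"N={n:>2}: {interactions:>9,} possible interaction subsets")
```

Each of those subsets is a place where your prior might silently deviate from your actual beliefs, and with N = 20 there are already over a million of them.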

——

Rather than thinking about fully “Bayesian” and “non-Bayesian” approaches to the same N variables, it can be useful to think of a spectrum of choice to “make a variable Bayesian,” which means taking something you previously viewed as constant and assigning it a prior distribution.

In this sense, a Bayesian statistician is still keeping most variables non-Bayesian.  Even if they give distributions to their parameters, they may hold the model’s form constant.  Even if they express a prior over model forms (say, a Gaussian process), they still may hold constant various assumptions about the data-collecting process; indeed, they may treat the data as “golden” and absolute.  And even if they make that Bayesian, there are still the many background assumptions needed to make modern scientific reasoning possible, few of which are jointly questioned in any one research project.

So, the choice is not really about whether to have 0 Bayesian variables or >0.  The choice is which variables to make Bayesian.  Your results are (effectively) a joint distribution over the Bayesian variables, conditional on fixed values of all the non-Bayesian variables.

We usually have strong intuitions about plausible values for individual variables, but weak or undefined ones for joint plausibility.  This is almost the definition of “variable”: we usually parameterize our descriptions in terms of the things we can most directly observe.  We have many memories of directly observing many directly-observable-things (variables), and hence for any given one, we can easily poll our memories to get a distribution sample over it.

So, “variables” are generally the coordinates on which our experience gives us good estimates of the true marginals (not the marginals of any model, but the real ones).  If we compute a conditional probability, conditioned on the value of some “variables” – i.e. if we make those variables non-Bayesian – this gives us something that’s plausible if and only if the conditioning variables are each independently plausible, which is the kind of fact we find it easy to check intuitively.

If we make the variable Bayesian, we instead get a plausibility condition involving the prior joint distribution over it and the rest.  But this is the kind of thing we don’t have intuitions over.

——

But that’s all too extreme, you say!  We have some joint intuitions over variables.   (Our direct observations aren’t optimized for independence, and have many obvious redundancies.)  In these cases, what prior captures our knowledge?

Let’s run with the idea from above, that our 1D intuitions come from memories of many individual observations along that direction.  That is, they are a distribution statistically estimated from data somehow.  The Bayesian way to do that would be to take some very agnostic prior, and update it with the data.

When you’ve noticed patterns across more than one dimension, the story is the same: you have a dataset in N dimensions, you have some prior, and you compute the posterior. 

In other words, “determining the exact prior that expresses your intuitions” is equivalent to “performing statistical inference over everything you’ve ever observed.”  The more dimensions are involved, the more difficult this becomes just as a math problem – inference is hard in high dimensions.

So there’s a perfectly good Bayesian story explaining why we have a good sense of 1D plausibilities but not joint ones.  (1D inference is easier.)  A practical Bayesian knows about these relative difficulties when they’re wrangling with their prior now and their posterior after the new data.

But the same difficulties call into question their prior now, and would encourage relaxing it to something that only requires estimating 1D plausibilities, if possible.  But that’s just a non-Bayesian model, one that conditions on its variables.  Recognizing the difficulty structure of Bayesian inference as applied to the past can motivate modeling choices we would call “non-Bayesian” in the present.

Frequentist methods, rather than taking a variable to be constant, also try to obtain guaranteed accuracy regardless of the value of the variable. One can view this as trying to optimize accuracy in the worst case of the variable. It’s often equivalent to optimize accuracy in the worst case over probability distributions of the variable.

Could one try to optimize accuracy of some prediction, in the worst case over all probability distributions of the n variables with given marginals?

This sounds mathematically very complicated to compute but maybe there is a method to approximate certain versions of it which has some nice properties. 


Could one try to optimize accuracy of some prediction, in the worst case over all probability distributions of the n variables with given marginals?

This sounds like an interesting topic, but it isn’t really what I was going for in the OP.

But the difference wasn’t very clear in what I wrote – possibly not even in my head as I wrote it – so I should write it out more clearly now.

—-

I’m considering situations like, say, you have variables (x_1, x_2, x_3, y) and maybe your primary goal is to predict y.  You don’t have a good prior sense of how the variables affect each other, but you can draw empirical samples from their joint distribution.

(If the variables are properties of individuals in a population, this is sampling from the population.  If the variables are “world facts” with only a single known realization, like constants of fundamental physics, you can at least get the best known estimate for each one, an N=1 sample from the joint [insofar as the joint exists at all in this case].)

Compare two approaches:

(1) The “fully Bayesian” approach.  Start by constructing a joint prior

P_prior(x_1, x_2, x_3, y)

then use data to update this to

P_posterior(x_1, x_2, x_3, y)

and finally make predictions for y from the marginal

P_posterior(y) = ∫ P_posterior(x_1, x_2, x_3, y) dx_1 dx_2 dx_3

(2) A “non-Bayesian” approach.  Compute a conditional probability:

P(y | x_1, x_2, x_3)

Then make predictions for y by simply plugging in observed values for x_1, x_2, x_3.

——

In (2), you defer to reality for knowledge of the joint over (x_1, x_2, x_3).  This guarantees you get a valid conditional probability no matter what that joint is, and without knowing anything about it.  Because any values you plug in for (x_1, x_2, x_3) are sampled from reality, you don’t have to know how likely these values were before you observed them, only that they have in fact occurred.  Since they’ve occurred, the probability conditioned on them is just what you want.

As an extreme example, suppose in reality x_1 = x_2, although you aren’t aware of this.

Any time you take an empirical measurement, it will just so happen to have x_1 ≈ x_2 (approximately, due to measurement error).  Your predictions for y, whatever other problems they might have, will never contain contributions from impossible regions where |x_1 - x_2| is large.

In (1), however, your posterior may still have significant mass in the impossible regions.  Your prior will generally have significant mass there (since you don’t know that x_1 = x_2 yet).  In the infinite-data limit your posterior will converge to one placing zero mass there, but your finite data will at best just decrease the mass there.  Thus your predictions for y have error due to sampling from impossible regions, and only in the infinite-data limit do you obtain the guarantee which (2) provides in all cases.
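This contrast is easy to simulate. Everything below is invented for the sketch: reality secretly enforces x_2 = x_1, draws from an independent prior put substantial mass in the "impossible" region, and empirical samples never touch it.

```python
import random

# Monte Carlo version of the x_1 = x_2 example (all distributions are
# invented for the sketch).  Unknown to the modeler, reality enforces
# x_2 = x_1; an independent prior does not know this.

random.seed(0)
N = 10_000

def frac_impossible(pairs, tol=0.5):
    # Fraction of (x_1, x_2) pairs landing in the "impossible" region
    # where |x_1 - x_2| is large.
    return sum(abs(x1 - x2) > tol for x1, x2 in pairs) / len(pairs)

# (1) Draws from an independent prior put substantial mass where
# |x_1 - x_2| is large -- a region reality never visits.
prior_draws = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(N)]

# (2) Empirical samples automatically satisfy the unknown constraint,
# so conditioning on observed values never touches the impossible region.
def sample_reality():
    x1 = random.gauss(0, 1)
    return x1, x1  # x_2 = x_1, though the modeler doesn't know this

observed = [sample_reality() for _ in range(N)]

print(f"prior mass in impossible region:    {frac_impossible(prior_draws):.2f}")
print(f"observed mass in impossible region: {frac_impossible(observed):.2f}")
```

Finite data would shrink the prior’s mass in that region but never zero it out, while the conditional approach gets the exclusion for free, without ever learning the constraint.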

——

I want to emphasize that both approaches have a way of “capturing your uncertainty” over (x_1, x_2, x_3) – often touted as an advantage of the Bayesian approach.

In the Bayesian approach (1):

Uncertainty is captured by marginalization.  At the end you report a single predictive distribution P(y), which averages over a joint that is probably wrong in some unknown way.

When you learn new things about the joint, such as “x_1 = x_2,” your previously reported P(y) is now suspect and you have to re-do the whole thing to get something you trust.

In the non-Bayesian approach (2):

Uncertainty is captured by sensitivity analysis.  You can see various plausible candidates for (x_1, x_2, x_3), so you evaluate P(y | x_1, x_2, x_3) across these and report the results.

So, rather than one predictive distribution, you get N = number of candidates you tried.  If it turns out later that some of the candidates are impossible, you can simply ignore those ones and keep the rest (this is Bayesian conditionalization on the new information).

——

In summary, marginals as predictive distributions for a target y only reflect your true state of belief insofar as you have good prior knowledge of the joint over the predictors X.

When you don’t have that, it’s better not to integrate for P(y) over volume elements for X, but instead just to compute the integrand at volume elements for X.

This provides something you can query any time you see a sample having some particular value for X, and lets you gradually ignore or emphasize volume elements as you gain knowledge about their mass.  (If you eventually gain full knowledge of the joint over X, you are now in position to integrate if you want, getting the same result as the Bayesian would with the same knowledge.)

I still feel like there’s a way to state this all more simply, but it still eludes me, so here we are.

bambamramfan:

balioc:

nostalgebraist:

@femmenietzsche

A few points:

1) It’s true that Ohtori Academy is cult-like, but if it’s a metaphor for anything, it’s a metaphor for the patriarchy. Not just being indoctrinated into a small group, but the indoctrination of society as a whole into unhealthy and abusive gender roles. So it’s not surprising that the show would reveal very little of the outside world. You can leave a cult and join regular society, but leaving society is harder. You’re necessarily going to the fringes (to the End of the World) where there aren’t ready made values to guide you.

2) We do see an alternate value system in the show, it’s just that the values come from within the cult-world itself, not from outside it. The stated values of the society are used to challenge the hypocrisy of that society. (Kind of like using the stated values in the Declaration of Independence or the Constitution to challenge what America actually is.) Utena is the most noble character, but as it turns out she got her idea of nobility from a childhood encounter with someone who she later learns is an abusive monster. What do you do with that information? You could abandon trying to be noble and heroic because those ideals came from a tainted source, or you can continue to embody them and use them against the system, which is sort of what she does. She ultimately fails to overthrow her society and winds up outside it, but presumably she will keep trying to be noble on the outside, even if that desire originally came from within society. And her rebellion does seem to have improved things a bit within the Academy - some of the other characters have matured thanks to her and may escape themselves someday.

3) So since the show is about living in a corrupted world, it’s not surprising that we would see little of what’s outside the world. There’s very little to see there yet. And it’s difficult to imagine building a newer, better world because our worldview necessarily arises out of that which we know. Change doesn’t come out of nowhere. The tools to improve the world necessarily come out of the world’s corruption. You take what’s actually good in society and turn it against the rest.

Although the show is clearly “about gender roles,” I don’t find the details of this very plausible.  And in the end I guess this feels like the bad message I was worried about, in my OP.


…or both!

Seriously, though.  Both.  The interpretation of a complex work generally yields – complexity.  And if you can say one thing about Utena, that thing would be “it is overstuffed with symbolic metaphor, and many of its elements symbolize more than one thing at once.”

There are definitely things that militate towards your interpretation: Ohtori is a sui generis creation of Akio’s narcissistic madness, it is an abusive little private world unlike the real reality outside, you can [ahem] revolutionize the world just by stepping outside and shrugging and ceasing to care.  Like, for example, the ending, and all the stuff leading up to the ending.

There are also things that militate towards @femmenietzsche’s interpretation, wherein the insanity of Ohtori is a symbolic reification – or even just an instantiation – of the general insanity of society.  This is probably clearer to a Japanese person, or to someone very familiar with late-twentieth-century Japanese high-school norms, since a lot of the stuff we see in Ohtori is (a caricature of) deep normality rather than an outgrowth of Akio’s lyrical fairy-tale weirdness.  Queen bees and wannabes, big men on campus, confused yearning, blah-de-blah.  Most famously, so many people have found real-world resonance in the way that the show deals with adolescent sexuality and sexual politics – in the actions and desires of the heroes, and in the cruel crushing response of the setting – that it’s very hard to reduce that down to “it’s just Akio’s toxic cult.”  (Although, of course, culture has changed a lot with time, in Japan and elsewhere.)

…and there are also Important Symbolic Elements that are neither of those things.  One of the major persistent messages of the show seems to be, uh, “patriarchy is super gay,” which is a thematic strain that you definitely can’t comfortably collapse into either of the concepts above.  Etc.

I don’t have a lot to add to Utena discourse (analyzing Utena is like making fun of a clown), but I will say this whole cult allegory sounds overly reductionist. There are many key elements of cult life that one doesn’t see in Utena (recruitment, the tenets, the fact that the leader is hidden for the whole first season). There are definitely some parallels, but the most you can say is that Utena is about a hothouse atmosphere, and cults are also that – but so are high school and academia and tight-knit families, which are all about equally valid targets for Ohtori allegories.

That being said, when people want to analyze epic works, they often put far too much weight on the ending and final reveal (as @nostalgebraist is doing.) That’s not what made the meat of the structure tick (count the mixed metaphors in that sentence on one hand!) Instead, watch a random or popular episode, and tell me what’s going on there. Talk about Nanami and Wakaba and Juri and what’s going on with them to create such compelling stories.

@femmenietzsche also responded:

I don’t have much to add other than to say that I don’t think that metaphor in a story requires the rigorous 1:1 mapping that @nostalgebraist demands. Even if Utena herself is not concerned with the rest of society (as she mostly isn’t), that doesn’t mean her journey can’t be taken as a stand-in for a broader political struggle. Even though there is no Bad Guy of Patriarchy in real life you can defeat, that doesn’t mean it’s not about patriarchy. Taking nebulous social forces and personifying them like that is just normal storytelling. Because a person is different from a society, things don’t always “work” the way they do in the real world. The metaphor might be an imperfect fit when you inspect it closely. But as @balioc says, that’s fine, because any good story will be amorphously about several things at once.

All of this is completely fair!  I think we can all agree that no scheme of correspondence is going to “solve” the whole thing by 1:1 resolving textual elements to their equivalents.

Although I slipped into this kind of talk for the sake of rhetoric, I’m not really trying to present my own proposal as a strict substitute for all others which “wins the contest” and is left standing alone.  I definitely don’t think the whole thing is “about” “a cult” and the rest is window dressing.

I don’t feel like I have anything to say that directly continues the thread’s debate in a productive way, but I do feel an impulse to clarify what is motivating me here.  It’s tough to phrase, but I’ll try …

To me, Utena seems as much “a story about metaphors” as “a metaphorical story.”  That is, it’s very concerned with the ways specific ideas, ideals, conceptual frames take root in people’s minds, the way people cling to these and project them onto others, and the tension that emerges when a person’s totalizing notion of What It’s All About comes into conflict with another person’s, or with brute reality.

For this reason, an interpretation which makes the events onscreen into a microcosm of reality or society feels like an instance of the very behavior whose appeal, ubiquity, and perils the show investigates, problematizes, and parodies.

“Is this the world, or just a high school?  Should I keep pressing on in pursuit of the beautiful story that has shaped my life for years – and if I stop, what even am I then?  Am I in conflict with one person and their beliefs, or a whole social order/reality and its nature?  When my frame breaks, must I accept yours?”

When I talk about cults, it’s because cult members – and those in similar groups or under similar pressures, I don’t want to be overly specific here – experience these tensions with unusual intensity and personal relevance.

It’s wrong to take one fork of these dilemmas and say “oh, it was all this guy’s frame, and it then breaks,” as I sort of did earlier.  But my motivation was to push back, dialectically, against the other fork (common among interpreters of any work that feels metaphorical) that interprets the story’s particulars as representatives of more universal, more eternal types and structures.

“Is this The Way Things Are, or just the way you/I have chosen to be?” is a question the characters wrestle with and fight over.  The answer “it feels so much like the first one, yet sometimes it is shockingly the second” feels at least truer to the spirit than “yeah, it’s the first one.”

@femmenietzsche

A few points:

1) It’s true that Ohtori Academy is cult-like, but if it’s a metaphor for anything, it’s a metaphor for the patriarchy. Not just being indoctrinated into a small group, but the indoctrination of society as a whole into unhealthy and abusive gender roles. So it’s not surprising that the show would reveal very little of the outside world. You can leave a cult and join regular society, but leaving society is harder. You’re necessarily going to the fringes (to the End of the World), where there aren’t ready-made values to guide you.

2) We do see an alternate value system in the show; it’s just that the values come from within the cult-world itself, not from outside it. The stated values of the society are used to challenge the hypocrisy of that society. (Kind of like using the stated values in the Declaration of Independence or the Constitution to challenge what America actually is.) Utena is the most noble character, but as it turns out she got her idea of nobility from a childhood encounter with someone whom she later learns is an abusive monster. What do you do with that information? You could abandon trying to be noble and heroic because those ideals came from a tainted source, or you can continue to embody them and use them against the system, which is sort of what she does. She ultimately fails to overthrow her society and winds up outside it, but presumably she will keep trying to be noble on the outside, even if that desire originally came from within society. And her rebellion does seem to have improved things a bit within the Academy – some of the other characters have matured thanks to her and may escape themselves someday.

3) So since the show is about living in a corrupted world, it’s not surprising that we would see little of what’s outside the world. There’s very little to see there yet. And it’s difficult to imagine building a newer, better world because our worldview necessarily arises out of that which we know. Change doesn’t come out of nowhere. The tools to improve the world necessarily come out of the world’s corruption. You take what’s actually good in society and turn it against the rest.

Although the show is clearly “about gender roles,” I don’t find the details of this very plausible.  And in the end I guess this feels like the bad message I was worried about, in my OP.


(via femmenietzsche)