I need to stop arguing about AI alignment foundations on LessWrong for a while.
I’ve been trying to press my case about the “outer optimizing wrapper” stuff I mentioned in this tumblr post, but I feel like I can never get the point across properly. I’m starting to obsess over how to communicate this concept, to an extent that outstrips how much I care about communicating it. It’s more like an unscratched itch – the feeling that there must be some magical way to make people see, and if I only think hard enough, I’ll find it, and until I’ve done that, something’s intolerably wrong.
I tend to get obsessed like this with disagreements that have very deep roots, where it feels like the other person is thinking in a fundamentally different way that I struggle to summarize.
I feel pretty sure that the LW/MIRI way of thinking about AI alignment is confused in a fundamental way, and that this relates to the nebulous concept they call “agency.” But it’s really hard to spell out what the disagreement is, because it involves this whole web of self-reinforcing intuitions and pieces of “folk knowledge.”
Optimization produces agency, humans are a product of optimization, humans are agents, agency is sort of like EU maximization, humans aren’t EU maximizers but only in an irrelevant way (??), intelligence is agency, intelligence is EU maximization, intelligence is doing causal reasoning to select actions, optimization selects for intelligence, optimization selects for EU maximization in general but not for the specific utility function being optimized (???), natural selection has an implicit utility function, humans don’t maximize that function so they must be maximizing a different one, because humans are agents and agents maximize functions, intelligence is being good at maximizing a function (because you can reframe any problem this way), optimization produces intelligence, which is function maximization, which is doing causal reasoning to select actions, and if you’re doing causal reasoning that makes your decisions more consistent, and anything that makes consistent decisions is an EU maximizer …
It’s hard to know how to argue with a giant pile of stuff like this.
There are many, many blog posts about this topic (whatever this topic is, exactly), but they aren’t building pieces of a single interconnected story. Alice writes a post about how agents are EU maximizers, because P. And Bob writes a post about how optimization produces agents, because Q. And Carol writes a post about how EU maximization is optimal, because R.
The three posts look nothing alike, use different formalisms (or no formalism), and are about subtly different senses of the words “agency” and “optimization.” But Alice, Bob and Carol all walk away feeling that they have contributed to the same Giant Pile, a thing the three all believe in. Future blog posts will cite Alice’s, Bob’s and Carol’s in the same breath.
It’s not a theory fleshed out to the point where you can argue against a premise here and see how that affects conclusions elsewhere. If you poke at one of the things in the pile, it just goes away for a little while, and one of the other ones comes to take its place while it’s gone.