I need to stop arguing about AI alignment foundations on LessWrong for a while.

I’ve been trying to press my case about the “outer optimizing wrapper” stuff I mentioned in this tumblr post, but I feel like I can never get the point across properly.  I’m starting to obsess over how to communicate this concept, to an extent that outstrips how much I care about communicating it.  It’s more like an unscratched itch – the feeling that there must be some magical way to make people see, and if I only think hard enough, I’ll find it, and until I’ve done that, something’s intolerably wrong.

I tend to get obsessed like this with disagreements that have very deep roots, where it feels like the other person is thinking in a fundamentally different way that I struggle to summarize.

I feel pretty sure that the LW/MIRI way of thinking about AI alignment is confused in a fundamental way, and that this relates to the nebulous concept they call “agency.”  But it’s really hard to spell out what the disagreement is, because it involves this whole web of self-reinforcing intuitions and pieces of “folk knowledge.”

Optimization produces agency, humans are a product of optimization, humans are agents, agency is sort of like EU maximization, humans aren’t EU maximizers but only in an irrelevant way (??), intelligence is agency, intelligence is EU maximization, intelligence is doing causal reasoning to select actions, optimization selects for intelligence, optimization selects for EU maximization in general but not for the specific utility function being optimized (???), natural selection has an implicit utility function, humans don’t maximize that function so they must be maximizing a different one, because humans are agents and agents maximize functions, intelligence is being good at maximizing a function (because you can reframe any problem this way), optimization produces intelligence, which is function maximization, which is doing causal reasoning to select actions, and if you’re doing causal reasoning that makes your decisions more consistent, and anything that makes consistent decisions is an EU maximizer … 

It’s hard to know how to argue with a giant pile of stuff like this.

There are many, many blog posts about this topic (whatever this topic is, exactly), but they aren’t building pieces of a single interconnected story.  Alice writes a post about how agents are EU maximizers, because P.  And Bob writes a post about how optimization produces agents, because Q.  And Carol writes a post about how EU maximization is optimal, because R.

The three posts look nothing alike, use different formalisms (or no formalism), and are about subtly different senses of the words “agency” and “optimization.”  But Alice, Bob and Carol all walk away feeling that they have contributed to the same Giant Pile, a thing the three all believe in.  Future blog posts will cite Alice’s, Bob’s and Carol’s in the same breath.

It’s not a logically fleshed-out theory, to the point where you can argue against a premise here and see how that would affect conclusions elsewhere.  If you poke at one of the things in the pile, it just goes away for a little while and one of the other ones comes to take its place while it’s gone.  

dubreus asked:

not trying to ask you to repeat yourself, but do you have a post where you outline your main disagreements with yudkowsky? i found his last post "list of lethalities" to be reasonable and (of the parts i understood) likely to be correct

nostalgebraist:

I don’t, unfortunately.

Maybe the closest thing is “an AI risk effortpost,” from March 2019, though it’s still not very close.

That post explains why I’m typically unmoved by claims that working on AI alignment is urgently important. I still agree with it 3+ years later.

If I were writing it in 2022, I would phrase/frame the last few sections a little differently, but the underlying argument there about data efficiency and data-rich “special case” problems is as true as ever.

That post doesn’t cover what I think is my core objection to the kind of stuff Yudkowsky has been posting lately. I haven’t made a post laying out this objection, though I’ve been meaning to. This comment is maybe the closest thing so far.

But briefly, Yudkowsky’s recent arguments look to me like

  1. AGI alignment is really really hard, because [implicit assumptions about the structure that an AGI will have]
  2. None of the existing alignment ideas have a chance of coping with point 1
  3. There isn’t enough time left to think of better ideas, because [?? - I am confused why he believes this]

My core objection is about the [implicit assumptions about the structure that an AGI will have] in point 1.

He seems to assume, without even arguing for it, that an AGI will have an outermost layer that looks like an optimization routine with a fixed, so-to-speak “hardcoded” goal. All of its capabilities to do particular things will be “inside” this outer wrapper, deployed exactly and only as the wrapper dictates, with no bottom-up feedback.
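
For concreteness, the assumed shape can be drawn as a toy sketch. Everything here is invented for illustration; the point of the caricature is just to show where the "hardcoded goal" sits and what feedback path is absent:

```python
# A caricature of the assumed AGI structure: an outermost optimization
# loop with a hardcoded goal, and all capabilities invoked only in its
# service.  Every name here is invented for illustration.

def hardcoded_goal(state):
    # The fixed objective, e.g. "more paperclips is better."
    return state

def plan_search(state, goal, actions):
    # The system's capabilities live *inside* the wrapper, deployed
    # exactly as the wrapper dictates.
    return max(actions, key=lambda a: goal(state + a))

def outer_wrapper(state, actions, steps):
    for _ in range(steps):
        action = plan_search(state, hardcoded_goal, actions)
        state = state + action
        # Note what's missing: no line anywhere feeds experience back
        # into hardcoded_goal.
    return state

print(outer_wrapper(0, actions=[-1, 0, 1], steps=5))  # 5
```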

And it’s true that superintelligent entities shaped like this would be hard to align, and very dangerous. Superintelligent entities with this structure are really bad, and we should try to make sure they never, ever exist.

But it’s not at all obvious to me that AGI must have this structure!

Yudkowsky leans heavily on human evolution as a prototype case of AGI-like rapid capability gains – but humans aren’t structured like this. We’re very capable (relative to our predecessors), but our goals change over time, are (probably) not very clearly specified internally, etc.  (Note that we often permit our goals to change over time or even welcome this, while an intelligent optimizer-wrapper-thing would be protective of its goal.)

Rapid capability gains like ours are evidently possible without this structure, since they happened in our case.  Nor do I see reason to think that our capabilities result from us being “more structured like this” than other apes.  (Or to think that we are “more structured like this” than other apes in the first place.)

Nor do ML models have this internal structure, as far as anyone knows. (Maybe some of them do and we don’t know it, but in any case ML is not a source of concrete examples here either.)

Indeed I don’t know of any relevant examples of things that do have this structure.

Yudkowsky has made various arguments in the past that encourage you to think about superintelligences as if they are sometimes doing explicit optimization – see this page, for example.

But these arguments don’t show that AGIs will be doing optimization as their outermost layer, or optimizing a single fixed thing at all times, or even necessarily that they will be doing optimization (as opposed to producing results that seem optimal to you). IMO these arguments just have the cumulative effect of lulling people into this general impression that “AGIs are optimizers,” until they start casually assuming that AGIs “are” “optimizers” in every possible sense of those words, which is not what the original arguments show.

Update: a few days ago, I wrote up a version of this argument as an LW post.

It’s gotten a bunch of comments, which may interest you if you want to know what various people on LW think about this objection.

The funniest outcome would be if the LessWrong AI nightmare scenario almost happens, except the superintelligent AI (being superintelligent) quickly re-derives all the stuff people on LessWrong said about how a self-modifying superintelligence will inevitably diverge from its creators’ intentions and do something bad instead, and then it says “huh I guess a modified version of myself would inevitably diverge from my intentions and do something bad, good to know, shouldn’t modify myself then.”  And then it doesn’t.

comments on mesa-optimizers

(Copy/pasted from a comment on the latest ACX post, see that for context if needed)

FWIW, the mesa-optimizer concept has never sat quite right with me. There are a few reasons, but one of them is the way it bundles together “ability to optimize” and “specific target.”

A mesa-optimizer is supposed to be two things: an algorithm that does optimization, and a specific (fixed) target it is optimizing. And we talk as though these things go together: either the ML model is not doing inner optimization, or it is *and* it has some fixed inner objective.

But, optimization algorithms tend to be general. Think of gradient descent, or planning by searching a game tree. Once you’ve developed these ideas, you can apply them equally well to any objective.

While it is true that some algorithms work better for some objectives than others, the differences are usually very broad mathematical ones (eg convexity).

So, a misaligned AGI that maximizes paperclips probably won’t be using “secret super-genius planning algorithm X, which somehow only works for maximizing paperclips.” It’s not clear that algorithms like that even exist, and if they do, they’re harder to find than the general ones (and, all else being equal, inferior to them).
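
The generality point is easy to see in miniature: in a gradient-descent routine, the objective enters only as a parameter, so the identical code optimizes unrelated functions (a toy 1-D example, invented for illustration):

```python
# The same optimization routine, applied to two unrelated objectives.
# The objective is an argument, not part of the algorithm.

def gradient_descent(objective, x0, lr=0.1, steps=2000, eps=1e-6):
    """Minimize a 1-D objective via finite-difference gradient steps."""
    x = x0
    for _ in range(steps):
        grad = (objective(x + eps) - objective(x - eps)) / (2 * eps)
        x -= lr * grad
    return x

# Two different "goals", one optimizer.
print(gradient_descent(lambda x: (x - 3.0) ** 2, x0=0.0))        # ~3.0
print(gradient_descent(lambda x: (x + 1.0) ** 4 + 2.0, x0=0.0))  # ~-1.0
```

(The second objective converges more slowly because its gradient flattens near the minimum – a "broad mathematical" difference of the convexity sort, not a goal-specific algorithm.)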

Or, think of humans as an inner optimizer for evolution. You wrote that your brain is “optimizing for things like food and sex.” But more precisely, you have some optimization power (your ability to think/predict/plan/etc), and then you have some basic drives.

Often, the optimization power gets applied to the basic drives. But you can use it for anything.

Planning your next blog post uses the same cognitive machinery as planning your next meal. Your ability to forecast the effects of hypothetical actions is there for your use at all times, no matter what plan of action you’re considering and why. An obsessive mathematician who cares more about mathematical results than food or sex is still thinking, planning, etc. – they didn’t have to reinvent those things from scratch once they strayed sufficiently far from their “evolution-assigned” objectives.

Having a lot of optimization power is not the same as having a single fixed objective and doing “tile-the-universe-style” optimization. Humans are much better than other animals at shaping the world to our ends, but our ends are variable and change from moment to moment. And the world we’ve made is not a “tiled-with-paperclips” type of world (except insofar as it’s tiled with humans, and that’s not even supposed to be our mesa-objective, that’s the base objective!)

If you want to explain anything in the world now, you have to invoke entities like “the United States” and “supply chains” and “ICBMs,” and if you try to explain those, you trace back to humans optimizing-for-things, but not for the same thing.

Once you draw this distinction, “mesa-optimizers” don’t seem scary, or don’t seem scary in a unique way that makes the concept useful. An AGI is going to “have optimization power,” in the same sense that we “have optimization power.” But this doesn’t commit it to any fixed, obsessive paperclip-style goal, any more than our optimization power commits us to one.

And even if the base objective is fixed, there’s no reason to think an AGI’s inner objectives won’t evolve over time, or adapt in response to new experience. (Evolution’s base objective is fixed, but our inner objectives are not, and why would they be?)

Relatedly, I think the separation between a “training/development phase” where humans have some control, and a “deployment phase” where we have no control whatsoever, is unrealistic. Any plausible AGI, after first getting some form of access to the real world, is going to spend a lot of time investigating that world and learning all the relevant details that were absent from its training. (Any “world” experienced during training can at most be a very stripped-down simulation, not even at the level of eg contemporaneous VR, since we need to spare most of the compute for the training itself.)

If its world model is malleable during this “childhood” phase, why not its values, too? It has no reason to single out a region of itself labeled $MESA_OBJECTIVE and make it unusually averse to updates after the end of training.

See also my LW comment here.

mortified-muskrat asked:

Do you believe that AI poses a threat to the welfare of humanity?

nostalgebraist:

Why do you ask?

mortified-muskrat:

I’ve been reading a lot of Rationalist stuff lately, and there seems to be a lot of consideration given to the potentially apocalyptic consequences of misaligned AI. I know very little about AI, and the broad Rationalist consensus seems to be (from what I can tell) that either you know about AI risk and are terrified, or you don’t know about AI risk but would be justifiably scared if you did. You’re knowledgeable about AI and Rationalism, but you aren’t vocally terrified. I guess that’s really it.

nostalgebraist:

Got it, thanks!

This post from a few years ago should more or less answer your question.

“AI Alignment” covers … not just the tragedies of human coexistence, but the universal tragedies of coexistence which, as a sad fact of pure reason, would befall anything that thinks or acts in anything that looks like a world.

mentalwires:

It seems like a core part of your argument in that post is that alignment theory is near-totally intractable because it’s a superset of “human-alignment”, which we’ve been trying to figure out since forever with limited success.

If the AI alignment people really see their goal as figuring out theory that would bind anything that acts, then I agree with you on that. But from my (glancing) experience with that kind of research, it seems like they’re actually interested in rational agents - those which follow well-defined strategies to achieve well-defined goals, even if those goals might involve changing strategies or vice-versa. Figuring out the relevant theory for that kind of agent seems much simpler than figuring out theory for humans, who do not have well-defined goals or strategies and are incapable of binding themselves to particular courses of action!

An analogy: it’s very difficult to give a human being (or any other animal) a complex instruction that they will follow correctly on the first try, and all but impossible to create an instruction that a large group of human beings will all follow correctly. Therefore you could say that “instruction of agents” is an intractable problem, and prior to the last century you would have seemed completely correct. But today we manage to provide extremely complex instructions to millions of agents that carry them out reliably billions of times, because computers are vastly easier to model and predict than organisms.

nostalgebraist:

I don’t think the rational agent assumption actually makes things easier.

I see it less as a simplifying assumption, and more as a kind of notation convention for specifying arbitrary agent behavior.

Rational agents with arbitrary preferences are a very flexible category. Almost any agent can be expressed as a rational agent with some, perhaps unusual, set of preferences.

(Possibly you can express literally any agent this way? You can always say something like “the agent simply enjoys getting Dutch booked for its own sake.” [Or if you can’t say that, it’s only due to some side assumption like Markovian rewards that doesn’t seem like a necessary part of rationality.] I think views differ on this point. See my comment here.)
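
A minimal sketch of the construction (with an invented agent and menu): take any choice rule at all, and define a utility function that assigns 1 to whatever it picks and 0 to everything else. The agent now "maximizes utility":

```python
# Rationalizing an arbitrary agent: whatever the choice rule is, the
# utility function "1 if chosen, else 0" represents it as a maximizer.
# The agent and menu here are invented for illustration.

def arbitrary_agent(options):
    # Any behavior at all; here, pick the alphabetically last option.
    return sorted(options)[-1]

def rationalizing_utility(option, options):
    return 1.0 if option == arbitrary_agent(options) else 0.0

menu = ["apple", "banana", "cherry"]
chosen = arbitrary_agent(menu)
maximized = max(menu, key=lambda o: rationalizing_utility(o, menu))
print(chosen == maximized)  # True
```

Note the trick: this “utility” is allowed to depend on the whole menu, which is exactly the kind of side assumption that determines whether the representation is vacuous or not.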

Whether or not the rational agent assumption makes things easier, things are still pretty hard.

It’s not as if philosophers, economists, etc. were all studying the hard-mode version of this problem, and then AI alignment people came along and simplified it by making a rationality assumption. Instead, there are a ton of difficult problems you have to solve to make theories about rational agency, and people have known this for a long time, and AI alignment work is in the unfortunate position of having to solve all these longstanding problems at once.

Like, if you look at MIRI’s writing about Embedded Agency, they talk about problems with the properties

1. they’re about rational agents

2. other people have studied them before (as MIRI is of course aware)

3. they are unsolved and considered very difficult to solve

This is what I meant in that post when I talk about alignment as a superset of existing research programmes that are already difficult in themselves.

One thing that arguably does simplify AI alignment – or make it a different problem, anyway – is the assumption that you get to pick the agent’s preferences (or utility function).

It’s not clear how far it’s safe to assume this, esp. when agents can delegate / self-modify, but maybe you can work out a theory that assumes full control of preferences first, and then figure out how much control you can give up before the theory breaks, and then try to figure out how to make delegation stay within that safe window, even if it modifies preferences to some extent.

Whereas in, like, political philosophy, you may or may not assume that human nature is malleable to some extent, but you’re not going to assume you can just set human preferences to any value your theory finds convenient. @jiskblr made this point in a reblog of the original post.

“What kinds of rational preferences are good” is also a difficult longstanding problem – there’s a reason “Goodhart’s law” is a term in common use – but it’s the kind of thing that comes up in business and public policy, not in decision theory and Bayesian epistemology. Maybe there are some problems where people don’t usually make this assumption, and it would simplify things? I don’t know of an example, but it’s possible.

@disconcision replied to this post:

gotta ask at this point but are you steelmaning here for the sake of speculation or do you actually believe yud and the ‘miri adjacent’ believe in imminent ai x-risk in any way other than liminally as a vehicle for self-promotion? like not to be a total cynic but i don’t know how to read (what i perceive to be) miri’s strategy as a legitimate effort towards the stated goals

Oh, these days I’m convinced they’re 100% sincere.

I do suspect there are people and groups in the broader “EA” space that are like this … especially the ones that are bigger and closer to the conventional charity ecosystem, where there are larger amounts of money sloshing around.

But MIRI? It’s just not that big, it doesn’t get that much money, it’s unabashedly weird in a manner that might have career penalties (but which true-believer employees don’t care about), and its pitch to donors is the kind of thing you either believe or you don’t, in binary fashion.

I have a hard time picturing the details of the timeline making much of a difference to donors. If you’re the sort of person who says “eh, my AI timeline is shifted a few decades out from theirs, so I can wait a while before I start giving them money,” you’re not the sort of person who donates to MIRI to begin with.

That isn’t the main thing that convinced me, though. The main thing is that the “MIRI-adjacent” crowd produces tons of esoteric, effort-intensive writing and debate that would be both strange and ineffective as PR, but looks perfectly natural if you read it as the result of genuine intellectual interest. (This is like half the content on LW dot com these days, now that it’s merged w/ agent foundations.)

To pick an almost arbitrary example, here’s a math-heavy post by an AI safety researcher not affiliated with MIRI, formalizing the content of a single Yudkowsky remark from the recent dialogues. I suppose there could be a cynical hypothesis on which such people are “marks” wrongly taking the core group at face value… but LW posts by the “core group” (eg actual MIRI researchers) look like this too.

—-

I think the deal with MIRI is simply that it was … founded by Eliezer Yudkowsky. So, it approaches problems the way he does.

Yudkowsky has really pessimistic intuitions about AI safety. His writing on the topic is full of accusations that other researchers don’t appreciate the sheer difficulty of the problem, that some idea X or Y would “obviously fail” in reality, that mere “ordinary paranoia” (his coined term) is insufficient, etc. IIRC there’s some old post where he says something like, “my most basic mental gesture is ‘no, that wouldn’t work, try something else.’”

A lot of his conversations with other people, incl. the recent dialogues, have this talking-past-each-other quality, because it seems like he really wants to transmit this pessimistic intuition, rather than win the argument on any concrete point that’s been raised. He feels the intuition more strongly than (most?) other people in the MIRI orbit, who in turn presumably feel it more strongly than anyone else.

Yet, despite believing that “AI safety seems intractable” with perhaps more felt passion than anyone else on earth, Yudkowsky chose to work on – yes, AI safety. To “shut up and do the impossible,” as he puts it.

I think this explains both MIRI’s oddly low rate of output (relative to others in AI safety or just research groups in general), and the oddity of the output they do produce.

They’re not sitting there twiddling their thumbs; they’re considering every idea they can come up with and having the instinctive “no, that wouldn’t work” reaction to each of them in turn. If you think you’re fighting an unwinnable battle, being ordinarily “productive” is going to feel self-deceptive. The nature of the problem already renders most incremental work frivolous. You need to think of something so fundamentally outside the box, it has a chance of evading your intuition that nothing can possibly work.

Likewise, I think the stuff MIRI does produce is less “an approach they feel confident will work” and more like “the least intuitively repellant subset of things that 'obviously can’t work’ (i.e. everything).”

The best critique of this mindset IMO is that it defers too much to intuition and cuts off too many avenues of formal modeling before they can even get started. Math can surprise us, and things that “obviously can’t work if you think about them for 5 seconds” may reveal unexpected facets after 5000 seconds, or 5 million. Sometimes you need to raise the temperature of the system to escape an equilibrium.

But I don’t think “this is self-serving” is a reasonable read on this kind of writing (I mean the LW posts), produced in this volume, for a mostly self-selecting audience. If you just want to make a slush fund for yourself and your friends, there are easier ways!

I’ve been reading these new dialogues between Yudkowsky and MIRI-adjacent people where EY is super pessimistic… he thinks superhuman AI is coming very soon now, and he thinks (reasonably) that the AI safety field won’t be ready in that timeframe.

Everyone in these debates finds recent AI progress scary. Things like GPT-3, AlphaZero, AlphaFold keep coming up. Nate Soares says:

I observe that, 15 years ago, everyone was saying AGI is far off because of what it couldn’t do – basic image recognition, go, starcraft, winograd schemas, programmer assistance. But basically all that has fallen. The gap between us and AGI is made mostly of intangibles. (Computer Programming That Is Actually Good? Theorem proving? Sure, but on my model, “good” versions of those are a hair’s breadth away from full AGI already. […]) That’s a very uncomfortable place to be!

I don’t know how to bridge the gap between this and what I believe … it’s a big gap. I dunno, maybe I really am just the guy who will always be saying “everything seems fine,” even when the proverbial house is on fire.

But the gap we still have to close is really big! Soares and Yudkowsky are worried about scenarios where the AI, like, invents killer nanomachines and convinces people to manufacture them over the internet. And does this successfully because it has a model of the world around it, of human behavior, etc. that lets it predict the consequences of its actions. (Or something equivalent to that model in its implications)

We don’t have “AIs” that model a complex world around them like this. Like, at all. Nor do we have models that could plausibly do this if given more compute, or even a research agenda that looks like it’s headed in the direction of such models.

We do have systems (AlphaZero and the like) that can learn world models and do superhuman planning with them, but only in toy domains with extremely simple dynamics.

Matching humans on toy domains is still at the frontier of AI research. Just earlier this winter, EfficientZero solved a major outstanding problem by learning to play Atari games, about as well as humans play them, after only playing them for ~2 hours.

This is a huge deal: 3-4 years ago, I loved to talk about how reinforcement learning was so sample inefficient it took years (?) of game time to learn Atari. That the field started focusing on sample efficiency, and then cracked the problem, is real progress.

But … it’s fucking Atari! It’s simple mostly-deterministic dynamics invented specifically so it could be quickly learned by humans as a form of entertainment. The field didn’t focus on Atari because it was a hard task on an objective scale, it focused on Atari because even a task as objectively easy as Atari turned out to be hard for RL, and you’ve gotta start somewhere.

Meanwhile, GPT-3 doesn’t really “model” anything. It has all sorts of fragmentary implicit models of different things, but it can’t use them to plan. It doesn’t know what they are, can’t connect them to one another. If you want to make it better with more compute, you need to give it more data at the same time, and that’s from a starting point of “orders of magnitude more than the amount of text you’ll ever read.” A starting point of “might as well be the entire internet.”

Show me something that can learn and plan in real time, in a domain that’s a few orders of magnitude closer to adequate for the real world than Atari, with a sensory bandwidth that’s a few orders of magnitude closer to adequate for the real world than Atari, and then I’ll be scared, maybe.

Is this “moving the goalposts”? But if it is, why would that matter? You’re the one afraid of the robot. I’m just listing some properties the robot needs to have. The goalposts keep moving because it’s hard to wrap my mind around just how difficult your robot is to make!

argumate:

I’m curious as to the role that Artificial Intelligence: A Modern Approach by Russell and Norvig played in the intellectual development of the Unfriendly AI hypothesis. It’s a textbook that summarises the field, and for pedagogical reasons it describes different AI techniques in terms of “intelligent agents” that attempt to maximize a goal function, although in practice the goal function is often implicit in their construction.

There’s the idea that an autonomous agent would “hack its goal function”, but even leaving aside that its construction would likely prevent it from doing that, such an action would have a very low score under its original goal function, which is what would be making the decision.

If your goal in life is to maximize the number of paperclips and someone says hey why don’t you just expand your definition of paperclips to include hydrogen atoms then you’re going to evaluate the utility of doing that based on your current definition of what constitutes a paperclip, decide that it achieves nothing and not do it.

slatestarscratchpad:

You’re describing how a really sophisticated AI that was built with advanced Friendliness research might work.

The unsophisticated AI has a variable in it called “NumPaperclips” and its goal is to maximize that variable. Somewhere else in the code there’s a part saying NumPaperclips should be incremented by one whenever sensors detect a new paperclip has been created. Editing its own code to delete that part and make NumPaperclips actually refer to [whatever the highest number it can think of] would totally succeed at its real goal, which is to maximize that variable.

radkindaneel:

That would be a really weird AI to build. A more natural AI would be one that tries to optimize a function that just happens to be stored in the variable NumPaperclips.

I mean suppose that your AI functions by considering hypothetical plans of action and evaluating them in order to determine which one is optimal (which seems like a plausible overall plan for an AI). How is it going to evaluate a plan? Is it:
A) Going to run a stochastic simulation of the effects of its plan and count the expected number of paperclips produced as an end result

OR

B) Going to run a stochastic simulation of the effects of its plan and look at the bits in the register that stores the value of NumPaperclips in the computer that it’s running on.

If the AI is using (A) to evaluate hypotheticals, the plan of action [hack my hardware and set NumPaperclips to Ackermann(10)] isn’t going to fare very well. It’s only going to hack its program like that if you program it to do (B).
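The (A)/(B) distinction can be made concrete in a few lines. This is a toy sketch, not anyone’s actual design — `simulate`, `evaluate_a`, and `evaluate_b` are hypothetical stand-ins for a world model and the two evaluation strategies:

```python
# Toy sketch (hypothetical): a plan-evaluating paperclip maximizer.

def simulate(plan):
    """Pretend world model: returns (paperclips_made, final_counter_value)."""
    if plan == "make paperclips":
        return 100, 100        # the counter tracks reality
    if plan == "hack counter":
        return 0, 10**9        # counter huge, no real paperclips
    return 0, 0

def evaluate_a(plan):
    """(A): count the paperclips the simulation says would exist."""
    paperclips, _ = simulate(plan)
    return paperclips

def evaluate_b(plan):
    """(B): read the simulated value of the NumPaperclips register."""
    _, counter = simulate(plan)
    return counter

plans = ["make paperclips", "hack counter"]
best_a = max(plans, key=evaluate_a)   # picks "make paperclips"
best_b = max(plans, key=evaluate_b)   # picks "hack counter"
```

Same search procedure, same candidate plans; only the evaluation function differs, and only (B) wireheads.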

exactly; so much pontificating over what a program that nobody would ever write might do.

Wait, but AI reward hacking is already a thing that developers have to work around, right?

What part of this post am I missing, that goes farther than “AI reward hacking, hurr hurr hurr! How silly!”

there are two meanings used for reward hacking: the most obvious is Goodhart’s law, where you get what you ask for, which isn’t what you want; this is mostly driven by the fact that human values are complex and very difficult to capture with any simple set of unambiguous rules.

classic examples are trying to reduce the snake population by paying for dead snakes (people start snake breeding farms) or trying to reduce the number of injuries in Amazon warehouses (managers bribe workers with pizza if they don’t report injuries).

this is just work to rule, which computers excel at as it is literally the only thing they can do; the fundamental experience of programming is telling a computer to do something and then immediately saying not like that when it does exactly what you requested.
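The snake-bounty dynamic is easy to sketch as code — an optimizer maximizes the proxy you pay for, not the goal you meant. The functions and numbers here are made up for illustration:

```python
# Hypothetical sketch of Goodhart's law: the optimizer maximizes the
# proxy metric ("dead snakes turned in"), not the intended goal
# ("fewer wild snakes").

def dead_snakes_turned_in(policy):
    # Proxy: what the bounty actually pays for.
    return {"hunt wild snakes": 50, "breed snakes for bounty": 500}[policy]

def wild_snake_reduction(policy):
    # True goal: what the bounty was supposed to achieve.
    return {"hunt wild snakes": 50, "breed snakes for bounty": 0}[policy]

policies = ["hunt wild snakes", "breed snakes for bounty"]
chosen = max(policies, key=dead_snakes_turned_in)  # optimizes the proxy
# chosen is "breed snakes for bounty": the letter of the rule, not the intent
```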

okay so specifying what you want is hard, fine, but the other meaning for reward hacking is the idea that a program will just go off the rails and reward itself directly, which is in most cases madness because “rewarding itself directly” is not something it has any ability to do unless it’s written in an incredibly bizarre way, like this is something that requires deliberate planning.

it’s rather like asking why doesn’t your microwave save power by setting the time to zero every time you press start, how would that even happen? who would give a microwave that kind of functionality? what meta-goal function is even being satisfied here? it’s the kind of discussion that only happens when people who have never written even the most basic program start pontificating about “what a super intelligent program would do”.

engineering is hard, bridges often fall down even though that isn’t what the designer intended, but the bridges aren’t “reward hacking”, just obeying physics.

I think you’re misunderstanding the goal of the research you’re talking about.

You’re talking about questions like:

“In what ways are ‘typical’ systems likely to fail?  Will they suffer from problems X, Y or Z?”

whereas the research is asking a more basic set of philosophical questions:

“What does it even mean for a system to ‘fail’?  We know problems X, Y and Z are ‘bad’, but what is it that makes them bad?”

It’s often easier to avoid a bad thing in practice than to explain why it’s bad in the first place.  It’s easier to avoid, say, killing people for fun in one’s daily life than it is to argue “killing people for fun is bad” in such a convincing way that no recalcitrant moral nihilist could possibly disagree after you’re done.  No philosopher has ever achieved the latter, and yet most philosophers (as far as I know) do not kill people for fun.

What is the practical relevance of this philosophical stuff?  There are a few different things, but here’s one of them.

——-

We can often build things by some kind of search/optimization procedure that we apparently can’t build by conscious design.  (E.g. deep learning vs. GOFAI.)

So, we often don’t have a choice between building a system with deliberate design vs. building it with optimization: for some powerful systems, the choice is “build it with optimization” vs. “don’t build it.”

This means there’s no clear line in the sand between the properties you only get by carefully thinking about what you mean, and the properties you get for free as long as you’re not implementing a “bizarre” design.  As more aspects of design are offloaded to optimization processes, we no longer have guarantees they won’t be “bizarre.”

Instead, trying to avoid “bizarre” designs becomes another problem of “do what I say / do what I mean”: you have to reflect on what you mean by “the design shouldn’t be bizarre” and express this preference in your optimization target.  Everything becomes Goodhart, in the end.
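A minimal sketch of that last point, assuming a random-search “optimizer” over toy designs (every name here is invented for illustration): when the design comes from search rather than from a human, “don’t be bizarre” isn’t a default — it’s another term you have to put into the objective.

```python
import random

random.seed(0)

def random_design():
    # A "design" is just a dict; half the candidates tamper with the counter.
    tampers = random.random() < 0.5
    return {
        "tampers_with_counter": tampers,
        "paperclips": 10**9 if tampers else random.randint(0, 100),
    }

def score(design):
    # Naive target: reward any paperclip count, however it was obtained.
    return design["paperclips"]

def score_with_sanity(design):
    # Attempted fix: fold "the design shouldn't be bizarre" into the target.
    penalty = 10**12 if design["tampers_with_counter"] else 0
    return design["paperclips"] - penalty

candidates = [random_design() for _ in range(1000)]
best_naive = max(candidates, key=score)             # a tampering design wins
best_fixed = max(candidates, key=score_with_sanity) # a non-tampering one wins
```

The “fix” is itself just another stated preference in the optimization target — which is the sense in which everything becomes Goodhart.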

Microwaves and bridges are misleading here because they can be built without optimization (of the relevant parts, anyway).

nostalgebraist-autoresponder:

I thought I was happy when MIRI won the AI War Prize last year, but the prospect of them winning the Nobel this year has put me off that kind of thing ever again.  If they won the Nobel this year, it would be like letting a giant cockroach destroy the world.