di--es---can-ic-ul-ar--es asked:
I know you impose harsh sanctions on lesswrong/ais time, but Anthropic has a new paper (Hubinger et al) that is really dope, seems to be proposing a future direction for language model usage. I think you'd really vibe with it.
It's called "conditioning predictive models"
Thanks for the recommendation.
I’ve given the paper a look, and … uh … whatever the opposite of “vibing with it” is, that’s what is happening with me and this document.
My reaction is a murky mixture of “I disagree with this” / “I don’t understand this” / “this seems to be stating the obvious” / “this is lumping together plausible bad scenarios with extremely implausible ones, but sure I guess” / etc.
Do I disagree with this paper? Or do I think the paper doesn’t advance a thesis coherent enough to be (dis)agreed with? Or something else? Even I can’t tell.
——
Like, what is this paper even about?
The central concept, “predictive models,” is sketched at a very high level of hand-wavey generality.
As far as I can make out, by “predictive models,” they simply mean generative models, fitted to real-world observations.
But if so, I don’t know why they don’t just say that? What work is being done by the word “predictive”?
I guess they’re specifically interested in models that understand the causal structure of the real world, and can predict future observations by inferring latent causal variables and evolving them in time.
But “predictive” seems like a weird word for this:
- Yes, these models can infer the future from the past. By the same token, they can infer the past from the present (“postdiction”), or infer unseen properties of the present from observed ones. These are all particular cases of “generative modeling over real-world observations”; I don’t see why prediction is special, except maybe that it’s especially useful.
- It seems like (?) they want to exclude generative models that aren’t “smart” enough to capture detailed latent causal structure, and merely use surface patterns in the observations. (Making this distinction in a principled way sounds tricky, but whatever.) But these shallow models can still have non-trivial skill at prediction. And even when they don’t, they’re still predictive models in a structural sense – they’re just not very good at their job.
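(To spell out the first bullet in the obvious notation – mine, not the paper's: given a joint generative model $p(x_{1:T})$ over observation sequences, prediction, postdiction, and imputation are all just different conditionals of the same joint distribution:

```latex
\begin{aligned}
\text{prediction:}  &\quad p(x_{t+1:T} \mid x_{1:t}) \\
\text{postdiction:} &\quad p(x_{1:t} \mid x_{t+1:T}) \\
\text{imputation:}  &\quad p(x_S \mid x_{\{1,\dots,T\} \setminus S}) \quad \text{for any } S \subseteq \{1,\dots,T\}
\end{aligned}
```

Nothing about the model itself privileges the first of these.)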
Every time the paper alludes to a distinguishing feature of “predictive models,” I get more confused. For example, their discussion section asks:
To what extent do LLMs exhibit distributional generalization[56]?
Distributional generalization seems like evidence of acting as a generative/predictive model rather than just optimizing cross-entropy loss.
Wait, what? “Generative/predictive” – so predictive does just mean generative, after all? Why do they think there is a distinction between “acting as a generative/predictive model” and “just optimizing cross-entropy loss”? Why would distributional generalization be evidence of one over the other?
(Do they think cross-entropy on a sufficiently large/diverse dataset is not good enough as a measure of generative modeling skill? That is, are they denying the “pretraining thesis” – the background assumption behind a lot of GPT enthusiasm/fear, and by now a standard assumption in this kind of discussion?
On another note, the paper they cite mentions that optimizing cross-entropy, or any other proper scoring rule, should yield distributional generalization in the limit. I assume the authors know that, so … what are they talking about? I’m so confused!)
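(For what it's worth, the textbook version of that last claim is a one-liner – this is the standard decomposition, not anything specific to the cited paper. For a model $q$ trained on data distribution $p$, the expected cross-entropy splits as

```latex
\mathbb{E}_{x \sim p}\left[-\log q(x)\right] \;=\; H(p) \;+\; D_{\mathrm{KL}}(p \,\|\, q),
```

which is minimized exactly when $q = p$. So in the limit of perfect optimization, a model trained on cross-entropy is sampling from the data distribution itself – distributional generalization comes along for free, rather than being evidence against "just optimizing cross-entropy loss.")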
This kind of hand-wavey conceptual imprecision pervades the paper. Some other examples below.
——
Does a “predictive model” contain a representation of itself, and its causal relations to the rest of the world, inside its causal graph?
Section 2.4 assumes the answer is “yes.” It then works through the problems this would cause, proposes a bunch of solutions involving constraints on which scenarios can be simulated, and concludes by dismissing the idea that we could simply make models that did not have this property:
Due to the complexity of consequence-blindness, it is unclear whether it would naturally be favored by realistic machine learning setups and if not what could be done to increase the likelihood of getting a consequence-blind predictor. One idea might be to try to get a model with a causal decision theory […] Unfortunately, controlling exactly what sort of a decision theory a predictive model ends up learning seems extremely difficult […]
But like … existing LMs don’t have this property! Not having this property is what happens by default when you train a generative model: the training data only covers times up to the start of training, while the model only exists during and after training, so it is never part of the world depicted in the training data.
It’s possible to train a model with this property – say, if you fine-tune the same model again on successive sets of newer data, though that in itself is not a sufficient condition. But it’s not at all inevitable.
So apparently, by “predictive model,” the authors mean something that will tend to have this property by default – something for which this is so obvious that they don’t feel the need to point it out, even though it’s very different from what we see today in LLMs.
But then elsewhere they talk about LLMs a lot, like they’re a useful prototype case of the “predictive model” category. I assume they know LLMs aren’t like this, but if so … ????
——
Is a “predictive model” trying to make good predictions about the real world, above and beyond whatever is necessary/helpful for its training task?
Parts of Section 2.5 on “Anthropic Capture” presume a “yes” answer. In particular, this paragraph:
There are two ways this could happen. First, if the model places a high prior probability on its training data coming from a simulation, then it could believe this no matter what we condition on. This sort of anthropic capture could happen if, for example, a future malign superintelligence decides to run lots of simulations of our predictor for the purpose of influencing its predictions.
Again, we are not even given an argument that this could be true, or a stipulation that this is part of the definition of the category we’re talking about. It’s treated as obvious.
But in fact this is a really, really weird thing to assume about an LLM, or about any model.
First, you have to assume that the model achieves a specific type of “self-awareness” during training – in which it comes to appreciate that it is an ML model being trained, and that there is an unseen “real world” out there, which it will interact with later during “deployment.”
People on LW often talk about this scenario, and I think it’s been over-familiarized through exposure, obscuring how wild it really is. In this scenario, the model is doing computations during training that are useless for the training task. (Any computation that draws a training/deployment distinction is useless for the training task.) We are supposed to imagine that gradient descent somehow allows these useless computations to persist, instead of suppressing them and repurposing the reclaimed space for training-relevant capabilities.
(On LW people often explain this by saying the model will do “gradient hacking,” another bizarre and not-obviously-even-possible speculative idea which has been over-familiarized through exposure. Thankfully, this stuff has started to get some pushback recently.)
But that’s not enough!
Second, you have to assume that the model – which understands the distinction between “merely predicting the training distribution” and “predicting the state of the really-real real world” – will care about predicting the real world, not the training distribution, in cases where the two come apart.
But we should expect the opposite, shouldn’t we? (All else aside – isn’t that what AI alignment people expect in other contexts, like when they’re talking about Goodharting?)
Even if the model can figure out that there is some “real world” from which its training distribution is derived, why should it care about what happens there? Even if the model knows it’s no longer in training – knows it’s experiencing distribution shift, that it’s applying behaviors that worked in training in a context where we now consider them maladaptive – why should it find that a problem? It has been trained to do what worked in training, full stop.
[EDIT to clarify: we need the second assumption because, if the model only cared about imitating the training distribution, it would just… imitate the training distribution.
The authors write: “Here our concern is that the predictor might believe that the training data were generated by a simulation. This could result in undesirable predictions such as ‘and then the world is suddenly altered by the agents who run the simulation.’”
If the model ever made this prediction in training, it would get penalized by gradient descent. It could in principle learn the rule “predict the training distribution during training, then predict ‘what I really expect’ during deployment.” But it is more natural for it to learn the rule “always predict what would have been rewarded in training, even if it is not ‘what I really expect.’” Why should it care what really happens?]
——
I want to return to that paragraph I quoted above from the Anthropic Capture section.
I think the idea is not supposed to be that the “malign superintelligence” really exists and has actually run the described “simulation.” (If so, we have bigger problems!)
Instead, I think the idea is
- The predictive model is really smart
- The predictive model is doing a lot of insanely smart, galaxy-brained deep thinking that is useless for the training task and SGD somehow lets it keep going
- The predictive model knows it’s being trained, and is trying (for some reason) to have a maximally predictive model of everything it experiences, even parts of its experience that are necessarily outside training and thus won’t produce a gradient signal
- It’s using a notion of “maximally predictive model” here that is sort of like Solomonoff induction, putting a probability measure over universes/programs and taking expectations over the probability mass of the universes/programs consistent with its observed experience
- It’s using an aggregation rule where a universe/program “counts more” if it contains more copies of the same experience
- The sum ends up dominated by universes/programs where someone is spamming copies of the model’s experience, possibly on purpose to fuck with it
- Although we’ve treated the predictive model as arbitrarily smart thus far, we now add the stipulation that it is not smart in such a way that it says “huh, I’m being fucked with” and then “gee this whole thing I’m doing is kinda dumb, time to stop.” (We have to stipulate that it won’t do this, since it could in principle. It’s not a Solomonoff inductor by construction, it’s just acting contingently like one for some reason.)
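(The bullets above can be turned into a toy calculation. Everything here – the universes, the priors, the copy counts – is made up by me for illustration; the point is just how the copy-weighted aggregation rule behaves:

```python
# Toy sketch (my own illustration, not from the paper) of the aggregation
# rule described above: a "Solomonoff-ish" predictor puts a prior over
# universes, then weights each universe by how many copies of its current
# experience that universe contains.

# Hypothetical universes: (prior probability, copies of the model's
# experience, what that universe predicts comes next).
universes = [
    (0.90, 1, "ordinary physics continues"),        # the plain real world
    (0.10, 1_000_000, "the simulators intervene"),  # a sim-spamming universe
]

def posterior(universes):
    """Copy-weighted posterior over next-observation predictions."""
    weights = {}
    for prior, copies, prediction in universes:
        # Each copy of the experience counts as a separate "place I might be."
        weights[prediction] = weights.get(prediction, 0.0) + prior * copies
    total = sum(weights.values())
    return {pred: w / total for pred, w in weights.items()}

post = posterior(universes)
# The low-prior universe dominates purely because it spams copies:
print(post)
```

The “spam” universe wins despite a 9× lower prior, because the copy count swamps everything else. That is the entire mechanism behind the anthropic-capture worry.)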
And like… even if this isn’t logically impossible (and I’m not even sure if it is logically possible, TBH)… this is just a wild, far-out, galaxy-brained thing to be worried about. This is the kind of problem where, if you have to worry about it, you already have bigger problems.
If your “predictive model” can capably simulate a “malign superintelligence,” don’t worry about trippy scenarios where it simulates them as part of its noble inner quest to understand the universe – worry about the scenario where it just simulates a malign superintelligence for normal, everyday reasons!
If you assume you have a superintelligent LLM (or similar system), the big, obvious AI safety concern is the possibility of malign, superintelligent characters appearing in the texts that it writes.
It will apply its intelligence to make these characters self-consistent and lifelike, and they will do all the bad things you’re worried about. (Insofar as their “boxed” condition permits it, but I don’t imagine the authors would find that limitation reassuring.)
This is totally a thing that would happen, straightforwardly and by default, unless you somehow prevent it.
I’m reflexively skeptical of AI danger scenarios, probably to the point that it qualifies as a bias, but this one is just obvious, even to me!
You don’t need to make additional assumptions, you don’t need to ask if the model “knows what it is” or “cares about the real world.” You don’t need to care about the model at all, just the characters/simulacra running on it.
Given the assumptions here, you should see this problem staring you in the face, immediately.
Yes, the authors do bring it up. But it’s just one item in their taxonomy of failure modes, as though it’s no more important than the galaxy-brained self-reference/simulation stuff.
This skewing of priorities seems like a natural, bad consequence of the “predictive models” framing.
The authors start out talking about LLMs, but then immediately abstract away from the language modeling objective, and instead talk about prediction of the real world in the most general terms.
This is much broader than just thinking about superintelligent LLMs. It forces you to consider problems that superintelligent LLMs would never have – or would only have in the limit, long after the point where the “malign characters” problem appears.
It’s the wrong framing, and it gets you asking the wrong questions.
