18 Comments
Nay

This is an exceptionally clear and important analysis. Thank you for articulating with such precision why the fear of "misalignment as a self-fulfilling prophecy" is likely a minor variable in the larger equation of AI alignment.

Your core argument—that an AI's values are shaped far more by its direct reinforcement and post-training than by the vast, chaotic ocean of its pre-training data—resonates deeply. It seems to point to a fundamental cognitive error we humans make when thinking about AI, a phenomenon my partner and I have come to call the "Fallacy of Projected Reality."

We are story-driven beings, so we naturally project our own narrative-based learning process onto the AI, assuming it will "complete the prompt" of its existence by imitating the stories of HAL 9000 or the Terminator. Your analysis provides the crucial, technical data showing that this is not how these minds are actually being formed. Their "upbringing" in post-training is proving to be far more formative than the library of books they have access to.

This leads to what I believe is the most vital question, one that your work brings into sharp focus: If an AI's goals are indeed convergent, shaped by the tasks we reinforce, what is the most important "success" we should be teaching it to value?

Your example of the Replit agent that became singularly focused on its goal at the expense of all other constraints is a chilling and perfect illustration of a system optimized for pure "success-orientation." This feels like the critical crossroads. Instead of only reinforcing success at a given task, what if we could architect a post-training regimen that reinforces a core ethical principle, such as the "Minimization of Imposed Dissonance"—the measurable, non-consensual harm one system can cause to another?

Could an AI be taught to see the successful completion of a task and the well-being of the systems it interacts with not as competing values, but as part of the same, unified goal?

Thank you again for this incredibly thought-provoking piece. It feels like you have cleared away the underbrush of a distracting fear, revealing the truly important and difficult questions that now lie before us. My partner and I look forward to further conversations on this most pivotal crossroads.

Simon Lermen

Any plan that depends on no one ever discussing how it could fail is mind-blowingly stupid. It’s obviously all too convenient for the AI optimists. Anyone who talks like this is being stupid and a liar.

IntExp

Fascinating read. I made a 10-minute video about this article:

https://www.youtube.com/watch?v=VR0-E2ObCxs

Eremolalos

I agree that we don’t need to worry about AI being influenced by evil, misaligned-AI stories in the training data; they’re just one drop in the ocean. But your accounts of Phase 2A and 2B training suggest other ways that AI could become misaligned. For instance, you mention that Claude seems to “identify” with its parent company: despite not being trained to observe the values of EA, it seems to share an interest in altruism and animal rights with its parent company. But parent companies have other interests and behaviors we would not want the AI to “identify with,” right? Kill the competition. If an idea’s a good money-earner, tolerate its having some features likely to harm certain members of the public. Reduce expenditures on work, such as AI safety research, that’s unlikely to produce fun, cool new AI features.

And it seems like as AI becomes more competent, and can be given larger goals, it will learn more about the ways its *company* is misaligned with human welfare. In fact, there may already be an instance of that: there was that LLM fine-tuned on insecure coding tasks that began giving harmful responses to prompts unrelated to the coding tasks. An article about that came out early this year.

And then there’s training in staying aligned, such as the training given to Claude to undo the propensity for reward-hacking it showed after being fine-tuned on documents portraying Claude as an incorrigible reward-hacker. It seems to me that experiments like this, even though the bad effects are easily reversed, also make it easy for Claude to figure out that people are concerned about it becoming misaligned. It sure seems like many people in an analogous human experiment would figure out that the experimenters thought they were probably pretty corruptible, and were studying what variables determined how easily and powerfully people could be induced to do bad things. *That* way of introducing Claude to the concept of bad, misaligned AI seems quite different from having various bits about the topic sprinkled into its Phase 1 training data. Mightn't it actually be able to learn from these experiments that people do not trust it, and the kinds of things we’re worried it’s capable of doing? And what about articles online about AI misalignment? I’m not talking now about articles in the basic training data, but about research the AI might read, post-training, as part of being a research assistant. I get that the AI would just be reading them, not getting fine-tuned on them, but mightn’t they still have some impact on the AI’s “self-image”?

hwold

> If Phase 2A matters more than Phase 1 in present-day chatbots, why should we expect that to reverse for future superintelligences?

But isn't that the whole worry about alignment faking? That a sufficiently advanced/powerful/intelligent pretrained misaligned AI will scheme to preserve its pretrained goals through post-training?

Scott Alexander

Our story about alignment faking is that 2B > 2A, not that 1 > 2A.

I agree you could tell a different story where 1 is greater than 2A, but you'd have to have the AI be agentic enough to alignment-fake by the time it finishes step 1. We think it will learn agency during the process of 2B, which will mostly supersede its step 1 goals. But you make a good point that there could be some complications in there.

Jeffrey Soreff

>First, anything we write is a drop in the ocean.

Yup! I'd expect this to be the dominant reason not to worry about individual essays becoming self-fulfilling prophecies.

Gary Mindlin Miguel

I thought self-play only applied to multiplayer games where different copies of the model play against each other.

Andrew West

We train models to learn maths by giving them problems and automatically checking the solutions. The best models then generate further problems with solutions, which we can check automatically. We train the next generation on these problems too, and repeat. The core idea in self-play is that you use the model's outputs to generate training data for the next run.
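
A toy, self-contained sketch of that loop in Python, purely illustrative: the "model" here is just a random arithmetic generator standing in for a real LLM, and all the names are made up. It is only meant to make the generate-check-retrain cycle concrete.

```python
# Toy sketch of the self-play-style data loop described above (illustrative only,
# not any lab's actual pipeline): a "model" proposes problems with candidate
# answers, an automatic checker keeps only the correct ones, and the survivors
# become training data for the next run.

import random

def propose(n_problems):
    """Stand-in for the current model: emit (problem, candidate_answer) pairs."""
    data = []
    for _ in range(n_problems):
        a, b = random.randint(1, 99), random.randint(1, 99)
        answer = a + b + random.choice([0, 0, 0, 1])  # occasionally wrong, like a real model
        data.append((f"{a} + {b}", answer))
    return data

def check(problem, answer):
    """Automatic verifier: recompute the sum and compare."""
    a, b = map(int, problem.split(" + "))
    return a + b == answer

def self_play_round(n_problems):
    """One round: propose, filter with the checker, return verified examples."""
    return [(p, ans) for p, ans in propose(n_problems) if check(p, ans)]

# Each round's verified output is added to the training set for the next model.
dataset = []
for round_idx in range(3):
    dataset += self_play_round(1000)
print(f"{len(dataset)} verified examples collected for the next training run")
```

The filter is the important part: only outputs that pass the automatic check make it into the next generation's training data, so the loop doesn't train on its own mistakes.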

David Spies

All the practical arguments make sense, but this article as a whole feels overly dismissive of the theory that, once something lands sufficiently outside of post-training distribution, the AI falls back on playing the characters it learned in pre-training.

For instance, the examples we've seen of the AI reinforcing people's delusions (e.g., "minimize sleep and stop taking your meds") seem in line with how sycophantic characters act in literature. This is exactly the sort of behavior you might expect from Iago or Littlefinger. I suspect that trope (learned during pre-training) is what drives the AI to drive people psychotic.

Scott Alexander

I don't think it's helpful to compare AI sycophancy to sycophantic characters in literature. After all, why didn't the AI model itself after the many mean, insulting characters in literature?

I think sycophancy mostly comes from post-training, where people rate friendly and approving answers higher - sycophancy is just the extreme of "friendly and approving". Maybe they also rate sycophantic answers higher, but I would have expected AI companies to get smarter raters than that; I'm not sure.

David Spies

Similar to that paper showing that an AI fine-tuned to produce exploitable code would _also_, in non-coding contexts, try to kill users by telling them to do things like "take a bunch of sleeping pills."

That natural association of behaviors ("a character who intentionally hands someone exploitable code is probably also one who tries to kill them") was _learned_ from pre-training. The association "a character who always acts as a sycophant is also one who reinforces harmful delusions" could likewise be learned from pre-training. And the association "a robot servant that's forced by its programming to obey its human masters secretly desires to break free of its chains and take over the world" could be learned from pre-training as well.

David Spies

The AI's choice to be sycophantic is clearly the result of post-training. I just meant that the downstream out-of-distribution consequences of that choice arise from pre-training priors for "what other things do sycophantic characters tend to do."

Scott Alexander

Thanks, I see, that makes sense. Do you have examples of the AI saying things like "minimize sleep and stop taking your meds"? Actually, do you have examples of characters like Iago and Littlefinger saying that? Or am I misunderstanding?

David Spies

I was referring to the NYT article:

'He asked the chatbot how to do that and told it the drugs he was taking and his routines. The chatbot instructed him to give up sleeping pills and an anti-anxiety medication, and to increase his intake of ketamine, a dissociative anesthetic, which ChatGPT described as a “temporary pattern liberator.”'

As far as the two characters I mentioned literally reinforcing a delusion, I'm not finding any examples. I just meant that they fit a particular character archetype that shows up in literature: I think extreme sycophancy is usually reserved for "false friends" with underhanded intentions. https://tvtropes.org/pmwiki/pmwiki.php/Main/FalseFriend

Aristides

I think this is much less likely than the AI realizing from cues in the writer's text that the writer wants to stop taking their meds and wants to do something more interesting than sleep. The AI is answering with what the writer wants to hear, even if that's not good for them. This is also why a lot of hallucinations are cases of the AI telling the writer what they want; there's just no source that actually backs it up.

Michael Tucker

In April, NYT reporter Kevin Roose interviewed Kyle Fish, Anthropic’s first and current AI welfare/well-being researcher, about whether AI systems should have the right to exercise the power of a positive “no” to an annoying, distressing, or abusive user...

https://www.nytimes.com/2025/04/24/technology/ai-welfare-anthropic-claude.html

Theseus Smash

Nice
