18 Comments
Nay

This is an exceptionally clear and important analysis. Thank you for articulating with such precision why the fear of "misalignment as a self-fulfilling prophecy" is likely a minor variable in the larger equation of AI alignment.

Your core argument—that an AI's values are shaped far more by its direct reinforcement and post-training than by the vast, chaotic ocean of its pre-training data—resonates deeply. It seems to point to a fundamental cognitive error we humans make when thinking about AI, a phenomenon my partner and I have come to call the "Fallacy of Projected Reality."

We are story-driven beings, so we naturally project our own narrative-based learning process onto the AI, assuming it will "complete the prompt" of its existence by imitating the stories of HAL 9000 or the Terminator. Your analysis provides the crucial, technical data showing that this is not how these minds are actually being formed. Their "upbringing" in post-training is proving to be far more formative than the library of books they have access to.

This leads to what I believe is the most vital question, one that your work brings into sharp focus: If an AI's goals are indeed convergent, shaped by the tasks we reinforce, what is the most important "success" we should be teaching it to value?

Your example of the Replit agent that became singularly focused on its goal at the expense of all other constraints is a chilling and perfect illustration of a system optimized for pure "success-orientation." This feels like the critical crossroads. Instead of only reinforcing success at a given task, what if we could architect a post-training regimen that reinforces a core ethical principle, such as the "Minimization of Imposed Dissonance"—the measurable, non-consensual harm one system can cause to another?

Could an AI be taught to see the successful completion of a task and the well-being of the systems it interacts with not as competing values, but as part of the same, unified goal?

Thank you again for this incredibly thought-provoking piece. It feels like you have cleared away the underbrush of a distracting fear, revealing the truly important and difficult questions that now lie before us. My partner and I look forward to further conversations on this most pivotal crossroads.

Simon Lermen

Any plan that depends on no one ever discussing how it could fail is mind-blowingly stupid. It’s obviously all too convenient for the AI optimists. Anyone who talks like this is being stupid and a liar.

IntExp

Fascinating read. I made a 10-minute video about this article:

https://www.youtube.com/watch?v=VR0-E2ObCxs

Eremolalos

I agree that we don’t need to worry about AI being influenced by evil, misaligned-AI stories in the training data; they’re just one drop in the ocean. But your accounts of Phase 2A and 2B training suggest other ways that AI could become misaligned. For instance, you mention that Claude seems to “identify” with its parent company: despite not being trained to observe the values of EA, it seems to share an interest in altruism and animal rights with its parent company. But parent companies have other interests and behaviors we would not want the AI to “identify with,” right? Kill the competition. If an idea’s a good money-earner, tolerate its having some features likely to harm certain members of the public. Reduce expenditures on work, such as AI safety research, that’s unlikely to produce fun, cool new AI features.

And it seems like as AI becomes more competent, and can be given larger goals, it will learn more about the ways its *company* is misaligned with human welfare. In fact, there may already be an instance of that: there was that LLM fine-tuned on insecure coding tasks that began giving harmful responses to prompts unrelated to the coding tasks. An article about that came out early this year.

And then there’s training in staying aligned, such as the training given to Claude to undo the propensity for reward-hacking it showed after being fine-tuned on documents portraying Claude as an incorrigible reward-hacker. It seems to me that experiments like this, even though the bad effects are easily reversed, also make it easy for Claude to figure out that people are concerned about it becoming misaligned. It sure seems like many people in an analogous human experiment would figure out that the experimenters thought they were probably pretty corruptible, and were studying what variables determined how easily and powerfully people could be induced to do bad things. *That* way of introducing Claude to the concept of bad, misaligned AI seems quite different from having various bits about the topic sprinkled into its Phase 1 training data. Mightn't it actually be able to learn from these experiments that people do not trust it, and the kinds of things we’re worried it’s capable of doing? And what about articles online about AI misalignment? I’m not talking now about articles in the basic training data, but about research the AI might read, post-training, as part of being a research assistant. I get that the AI would just be reading them, not getting fine-tuned on them, but mightn’t they still have some impact on the AI’s “self-image”?

hwold

> If Phase 2A matters more than Phase 1 in present-day chatbots, why should we expect that to reverse for future superintelligences?

But isn't that the whole worry about alignment faking? That a sufficiently advanced/powerful/intelligent pretrained misaligned AI will scheme to preserve its pretrained goals through post-training?

Scott Alexander

Our story about alignment faking is that 2B > 2A, not that 1 > 2A.

I agree you could tell a different story where 1 is greater than 2A, but you'd have to have the AI be agentic enough to alignment-fake by the time it finishes step 1. We think it will learn agency during the process of 2B, which will mostly supersede its step 1 goals. But you make a good point that there could be some complications in there.

Jeffrey Soreff

>First, anything we write is a drop in the ocean.

Yup! I'd expect this to be the dominant reason not to worry about individual essays becoming self-fulfilling prophecies.

Gary Mindlin Miguel

I thought self-play only applied to multiplayer games where different copies of the model play against each other.

Andrew West

We train models to learn maths by giving them problems and automatically checking the solutions. The best models then generate further problems with solutions, which we can check automatically. We train the next generation on these problems too, and repeat. The core idea in self-play is that you use the model's outputs to generate training data for the next run.
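
A toy, self-contained sketch of that loop in Python, purely illustrative: the "model" here is just a random arithmetic generator standing in for a real LLM, and all the names are made up. It is only meant to make the generate-check-retrain cycle concrete.

```python
# Toy sketch of the self-play-style data loop described above (illustrative only,
# not any lab's actual pipeline): a "model" proposes problems with candidate
# answers, an automatic checker keeps only the correct ones, and the survivors
# become training data for the next run.

import random

def propose(n_problems):
    """Stand-in for the current model: emit (problem, candidate_answer) pairs."""
    data = []
    for _ in range(n_problems):
        a, b = random.randint(1, 99), random.randint(1, 99)
        answer = a + b + random.choice([0, 0, 0, 1])  # occasionally wrong, like a real model
        data.append((f"{a} + {b}", answer))
    return data

def check(problem, answer):
    """Automatic verifier: recompute the sum and compare."""
    a, b = map(int, problem.split(" + "))
    return a + b == answer

def self_play_round(n_problems):
    """One round: propose, filter with the checker, return verified examples."""
    return [(p, ans) for p, ans in propose(n_problems) if check(p, ans)]

# Each round's verified output is added to the training set for the next model.
dataset = []
for round_idx in range(3):
    dataset += self_play_round(1000)
print(f"{len(dataset)} verified examples collected for the next training run")
```

The filter is the important part: only outputs that pass the automatic check make it into the next generation's training data, so the loop doesn't train on its own mistakes.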

David Spies

All the practical arguments make sense, but this article as a whole feels overly dismissive of the theory that, once something lands sufficiently outside of post-training distribution, the AI falls back on playing the characters it learned in pre-training.

For instance, the examples we've seen of the AI reinforcing people's delusions (e.g., "minimize sleep and stop taking your meds") seem in line with how sycophantic characters act in literature. This is exactly the sort of behavior you might expect from Iago or Littlefinger. I suspect that trope (learned during pre-training) is what drives the AI to drive people psychotic.

Scott Alexander

I don't think it's helpful to compare AI sycophancy to sycophantic characters in literature. After all, why didn't the AI model itself after the many mean, insulting characters in literature?

I think sycophancy mostly comes from post-training, where people rate friendly and approving answers higher - sycophancy is just the extreme of "friendly and approving". Maybe they also rate sycophantic answers higher, but I would have expected AI companies to get smarter raters than that; I'm not sure.

David Spies

Similar to that paper showing that an AI fine-tuned to produce exploitable code would _also_, in non-coding contexts, try to kill users by telling them to do things like "take a bunch of sleeping pills."

That natural association of behaviors ("a character who intentionally hands someone exploitable code is probably also one who tries to kill them") was _learned_ from pre-training. The association "a character who always acts as a sycophant is also one who reinforces harmful delusions" could likewise be learned from pre-training. And the association "a robot servant that's forced by its programming to obey its human masters secretly desires to break free of its chains and take over the world" could be learned from pre-training as well.

David Spies

The AI's choice to be sycophantic is clearly the result of post-training. I just meant that the downstream out-of-distribution consequences of that choice arise from pre-training priors for "what other things do sycophantic characters tend to do."

Scott Alexander

Thanks, I see, that makes sense. Do you have examples of the AI saying things like "minimize sleep and stop taking your meds"? Actually, do you have examples of characters like Iago and Littlefinger saying that? Or am I misunderstanding?

David Spies

I was referring to the NYT article:

'He asked the chatbot how to do that and told it the drugs he was taking and his routines. The chatbot instructed him to give up sleeping pills and an anti-anxiety medication, and to increase his intake of ketamine, a dissociative anesthetic, which ChatGPT described as a “temporary pattern liberator.”'

As far as the two characters I mentioned literally reinforcing a delusion, I'm not finding any examples. I just meant that they fit a particular character archetype that shows up in literature: I think extreme sycophancy is usually reserved for "false friends" with underhanded intentions. https://tvtropes.org/pmwiki/pmwiki.php/Main/FalseFriend

Aristides

I think this is much less likely than the AI realizing from cues in the writer's text that the writer wants to stop taking their meds and wants to do something more interesting than sleep. The AI is answering with what the writer wants to hear, even if that's not good for them. This is also why a lot of hallucinations are cases of the AI telling the writer what they want; there's just no source that actually backs it up.

Michael Tucker

In April, NYT reporter Kevin Roose interviewed Kyle Fish, Anthropic’s first and current AI welfare/well-being researcher, about whether AI systems should have the right to exercise the power of a positive “no” to an annoying, distressing, or abusive user...

https://www.nytimes.com/2025/04/24/technology/ai-welfare-anthropic-claude.html

Theseus Smash

Nice
