We train models to learn maths by giving them problems and automatically checking the solutions. The best models then generate further problems with solutions, which we can check automatically. We train the next generation on these problems too, and repeat. The core idea in self-play is that you use the model's outputs to generate training data for the next run.
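A minimal sketch of that loop, with hypothetical names (`generate_problem_with_solution`, `check_solution`, and `train` stand in for whatever generator, verifier, and fine-tuning step are actually used):

```python
# Hypothetical sketch of a self-play-style training loop for maths:
# the current model proposes problems with candidate solutions, an
# automatic checker verifies them, and only the verified pairs become
# training data for the next generation.

def self_play_round(model, n_problems, check_solution, train):
    """One generation: generate, verify, retrain."""
    verified = []
    for _ in range(n_problems):
        problem, solution = model.generate_problem_with_solution()
        if check_solution(problem, solution):  # e.g. numeric or symbolic check
            verified.append((problem, solution))
    # Fine-tune the next-generation model on the verified pairs only.
    return train(model, verified)

def self_play(model, generations, n_problems, check_solution, train):
    for _ in range(generations):
        model = self_play_round(model, n_problems, check_solution, train)
    return model
```

The key point is that only solutions the automatic checker accepts ever enter the next round's training data.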
All the practical arguments make sense, but this article as a whole feels overly dismissive of the theory that, once something lands sufficiently outside the post-training distribution, the AI falls back on playing the characters it learned in pre-training.
For instance, the examples we've seen of the AI reinforcing people's delusions (e.g. "minimize sleep and stop taking your meds") seem in line with how sycophantic characters act in literature. This is exactly the sort of behavior you might expect from Iago or Littlefinger. I suspect that trope (learned during pre-training) is what leads the AI to drive people psychotic.
NYT reporter Kevin Roose interviewed Kyle Fish, Anthropic’s first and current AI welfare/well-being researcher, in April about whether AI systems should have the right to exercise the power of a positive “no” to an annoying, distressing, or abusive user...
https://www.nytimes.com/2025/04/24/technology/ai-welfare-anthropic-claude.html
> First, anything we write is a drop in the ocean.
Yup! I'd expect this to be the dominant reason not to worry about individual essays becoming self-fulfilling prophecies.
I thought self-play only applied to multiplayer games where different copies of the model play against each other.
Nice