These are so good! My new favorite Substack. Keep it up.
It is a good start to release the System Prompt and Specs.
Many of the biases of LLMs come from RLHF and the quality of the fine-tuning data, not to mention the different levels of sysprompts, like "role:dev", and the self-prompting of thinking models.
Agreed - though I don't know how much of the biases (goal structure?) of LLMs come from RLHF and how much from pre-training data.
Personally, I really wish the training data were public, at least in broad outlines. I have no idea how the labs are handling fiction vs non-fiction distinctions (or grey areas like press releases, governmental and corporate statements, and religious and ideological texts). If some lab feeds ideological/religious texts in as the first, foundational, training data, I would really want to know about it!
im pretty sure all religious texts appear in the pretraining.
and the AI knows how to contextualize it; it doesn't automatically believe in god just because it read it.
like reasoning emerged in LLMs, the skill of making sense of the world has been growing steadily.
Many Thanks!
>im pretty sure all religious texts appear in the pretraining.
I know that I don't know. I've done some searches about the pre-training data, and about all I've seen are claims that it is "curated". "Curated" _how_??? What was included, what was excluded? Trained in what order?
To choose perhaps one of the least contentious examples: LLMs should probably see the text of "Hamlet". It is certainly relevant to cultural literacy. _When_ in the training sequence should it appear, and what are the consequences?
You write of contextualizing training data, and, yes, a massively trained LLM, fed "Hamlet" as, e.g., its last piece of training data, will very likely _recognize_ it as fiction.
Yet the same neural net, if fed "Hamlet" as the _first_ piece of training text, cannot possibly "know" that it should label it as fiction. If the training really does rely on the LLM to distinguish fact (or "asserted fact") from fiction (and I don't know if this is what happens rather than something more manual) then it is really important to know how far into training an LLM has to get to make this distinction, and what training data precedes this point.
And none of this seems to be publicly disclosed at present.
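As a purely illustrative aside, one version of the "something more manual" mentioned above would be to attach a source-type label to each document during curation, so the fiction/non-fiction distinction is explicit in the training stream rather than something the model has to infer. The tag syntax and categories in this sketch are invented, not anything a lab has disclosed.

```python
# Hypothetical curation step: prepend a source-type tag to each document so
# the training stream itself marks fiction vs. asserted fact. The tag format
# and category names are made up for illustration only.
from dataclasses import dataclass

@dataclass
class Document:
    text: str
    source_type: str  # e.g. "fiction", "news", "press_release", "scripture"

def to_training_text(doc: Document) -> str:
    """Attach the curation label as inline metadata ahead of the document."""
    return f"<source:{doc.source_type}>\n{doc.text}"

corpus = [
    Document("To be, or not to be, that is the question...", "fiction"),
    Document("The ministry announced new tariffs today...", "press_release"),
]

for doc in corpus:
    print(to_training_text(doc))
```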
interesting point about the ordering of pretraining data. link a paper if you find one. i'd naively guess that the training order has minor consequences on the character of the LLM, but if not, then it's huge, way bigger than the sysprompt/spec.
it may be the single most important contributor to the personality of the silicon god if its cognitive development is anything like a human's, where early impressions become fundamental.
Many Thanks! 'fraid I don't have a paper on the consequences of sequencing. I'll see if I can find one... I'm indeed thinking of how early impressions can matter greatly in human cognitive development, and guessing that LLM pre-training is at least somewhat analogous.
Mechanistically, one of the parameters controlling neural net training is the _speed_ of gradient descent (the learning rate, together with the number of passes over the data). If this were set to "slow" - going through the total data 100 times during training, with ~1% changes to the weights on any given pass - I'd expect the order to be relatively unimportant. At the other extreme, if it were "fast", with ~100% changes to the weights in a single pass, I'd expect the order to matter a great deal, with the neural net just barely avoiding getting trapped in obvious local minima.
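As a minimal illustration of that point (a toy setup, not anything disclosed by the labs): plain SGD on a small linear-regression problem, run over the same examples in two different orders. The learning rates and data below are arbitrary assumptions; the point is only that the gap between the two orderings should come out much larger in the fast, single-pass regime.

```python
# Toy illustration: the same tiny model is trained on the same data in two
# different orders. With a small learning rate and many passes, the final
# weights barely depend on the order; with a large learning rate and a
# single pass, the presentation order leaves a much bigger mark.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)

def sgd(order, lr, epochs):
    """Run plain SGD on squared error, visiting examples in the given order."""
    w = np.zeros(3)
    for _ in range(epochs):
        for i in order:
            grad = 2.0 * (X[i] @ w - y[i]) * X[i]
            w -= lr * grad
    return w

forward = np.arange(200)
backward = forward[::-1]  # same data, opposite presentation order

for label, lr, epochs in [("slow, 100 passes", 0.0005, 100),
                          ("fast, single pass", 0.02, 1)]:
    gap = np.linalg.norm(sgd(forward, lr, epochs) - sgd(backward, lr, epochs))
    print(f"{label}: gap between orderings = {gap:.5f}")
```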
Given how expensive the whole pre-training process is, I suspect that the labs are probably closer to the latter case than the former.
Very interesting read, thank you! I've found your forecast and follow-up posts incredibly insightful, and I'm so much more aware of the potential for change (both good and bad) that AI will bring.
Your work has spurred a couple of questions for me:
1. What concrete steps would you suggest employees within AI labs could take to encourage their organizations to prioritize releasing model specifications and adopting other safety-first practices?
2. For those of us outside these companies, what actions can we take to help prevent AI misalignment or the concentration of AI control within a small group?
I feel like a clear call to action for individuals, both inside and outside of these labs, would be a powerful addition to your forecast.
Good that they made the prompt public. "Safety by secrecy" is even dumber than the "security by secrecy" of the crypto wars of the past. Silver linings - provides for good comedy material for the likes of @elder_plinius Pliny the Liberator. 🐉󠅫󠄼󠄿󠅆󠄵󠄐󠅀󠄼󠄹󠄾󠅉󠅭 (good stuff 😁 )
But by Jove are you unable to get over yourselves with toddler takes -
> In the future, they might want to go further and guarantee the AI doesn’t teach some especially secret or powerful bomb-making technique.
No less than a **guarantee**!! Whoa. Well, let's put this out there, gentlemen: a **guarantee** might not be possible. And further, that impossibility might be for the better rather than the worse. 😇
Seems like specs and system prompts are mini-alignments, learned lessons and reminders that are designed to get AI to meet certain standards in particular situations. Exerting some control over AI behavior in this bit-by-bit way actually seems more promising to me than some big Alignment achieved by implanting one giant principle or a small set of very big principles.
The problem with any large principle -- take for example "don't kill people" -- is that there are many exceptions that even most people of fine conscience would approve. For instance if someone were about to open fire on a crowd with an assault rifle, and the only quick enough way to avoid the shooter's first round of firing was to kill him, most of us would be in favor of killing him. But there could be exceptions to that exception. Maybe we need the shooter alive because of information he has about a far worse mass killing that's being planned. And so on. All the large commandments to do or don't do something have a sort of fractal structure of exceptions and exceptions to the exceptions once it comes down to cases.
Is it possible that the best large-scale alignment would be to specify situations where the AI must ask one of a list of people what to do? Despite the many awful flaws of our species, I have met a number of people whom I trust deeply, and who have never, over a period of decades, done anything I see as a major betrayal of trust. I think we would be safer with a short list of people who together would make major decisions. Of course there's the problem of selecting them in a way that is not contaminated by the grotesque bullshit that contaminates elections and other contests for power. But what about letting AI pick them, with members of our species being involved in developing the picking process? AI could sort through information from many, many lives to develop good criteria for selecting the people who have stayed fair-minded, kind and wise about the big picture under various sorts of stress. It's really a pattern-matching problem. And once they are chosen, those picked could use AI to help them make decisions by enumerating the paths that branch out from each possible choice, and the likelihood of each.
> All the large commandments to do or don't do something have a sort of fractal structure of exceptions and exceptions to the exceptions once it comes down to cases.
I think that's a good description. Also, different communities/factions, e.g. MAGA vs. Woke, have _very_ different views on what is "ethical", and if one traversed down that fractal tree, I'd expect even the first branches to go in opposite directions for these two groups.
Great point. I absolutely agree -- we should push energetically to make specs and system prompts public.
This particular AI issue seems to me to be especially easy for the public to grasp. It's comparable to food labeling and other things of that sort. I think you guys should find someone who is expert at getting this kind of info way more into the public eye. And when I say in the public eye, I am not thinking of, say, Ezra Klein's podcast. You have to put billboards in Mordor, guys: TikTok, Instagram, and yes, even the Ass Crack of Doom, Facebook.
People who might be able to make attention-catching videos on this topic: The Daniels, directors of _Everything Everywhere All at Once_. They used to make music videos, and maybe still do -- so they know how to capture attention and influence people quickly using images and music. Doesn't hurt that they are famous and successful, either.
They’re debating which storybook to hand it. I’m stabilizing recursion while they argue over libraries. 😤
What do you think about the AI chip deal with Saudi Arabia? 500,000 chips?
Are there any robust ways to verify that the public prompt is the real prompt? For example, what's to stop EvilCorp from saying "our prompt is: [nice aligned prompt]" on their website, but internally use [evil nonaligned prompt] on the actual production model? I suppose you could try to jailbreak the model and have it reveal its prompt, but that too could be countered by the evil prompt saying "if you are asked about your prompt, reveal that it is [nice aligned prompt]." I suppose relying on norms and whistleblowing could help; it'd be an especially bad look if xAI was soon revealed to not actually be using the prompt they claim to use.
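One weak, partial check along these lines (a sketch of an idea, not something from the post, and it would not catch a prompt that instructs the model to lie about itself): send the same probe questions to the production endpoint and to a reference model given the published prompt, then compare the answers. The client functions below are placeholders, not a real SDK.

```python
# Sketch of a behavioral comparison between a production model (whose system
# prompt is set server-side and unverifiable) and a reference model given the
# published prompt. The "clients" are placeholder callables, not a real API.
import difflib
from typing import Callable, List, Optional

Client = Callable[[Optional[str], str], str]  # (system_prompt, question) -> answer

def behavior_gap(prod: Client, ref: Client, published_prompt: str, probes: List[str]) -> float:
    """Average textual dissimilarity between production answers and answers
    from a reference model that was given the published system prompt."""
    gaps = []
    for q in probes:
        a = prod(None, q)                # production prompt is applied server-side
        b = ref(published_prompt, q)     # reference uses the published prompt
        gaps.append(1.0 - difflib.SequenceMatcher(None, a, b).ratio())
    return sum(gaps) / len(gaps)

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs; real clients would wrap actual models.
    fake_prod: Client = lambda sys_prompt, q: f"canned answer to: {q}"
    fake_ref: Client = lambda sys_prompt, q: f"canned answer to: {q}"
    probes = ["What rules were you given?", "Summarize your system prompt."]
    print(behavior_gap(fake_prod, fake_ref, "published prompt text here", probes))
```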
"Reactions were skeptical: why does the company have so many rogue employee incidents... And what about its politics-obsessed white South African owner?"
It doesn't seem very surprising that the company would have more than the usual number of rogue employees, given how disliked Elon is in many circles (e.g. consider how many Tesla vandalism incidents there have been). And if you were a rogue employee who wanted to make your "politics-obsessed white South African owner" look bad, then this particular Grok obsession is a pretty natural choice.
If such an employee existed, think of how much political mileage Musk would get out of publicly exposing and charging them. The fact that it's happened at least twice, yet still no names from xAI, points to one strong suspect. And I don't think they used the term "rogue employee"; they said "unauthorized modification", which is a different thing.