A point on presentation: I think you should reconsider the color scheme wherein the different actors each have a color that solely serves to differentiate them, while within those boxes the AIs have colors that correspond to alignment level. The problem with this, from a data-visualization perspective, is that the level of *contrast* between an AI's box and its outer actor's box varies wildly between actors, in a way that intuitively feels like it should be meaningful but isn't. E.g., in the first figure, the intent is that Neuro-2 and Agent-3 are the same (orange-red) while Elara-2 is different (yellow-green), but the way it looks is more like Elara-2 and Agent-3 are the same (low contrast) while Neuro-2 is different (high contrast).
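To make the contrast point measurable, here is a quick sketch (Python, using WCAG-style contrast ratios; the hex values are made-up stand-ins, not the actual figure colors) showing how the same alignment color can read very differently against different actor backgrounds:

```python
# Sketch: WCAG-style contrast ratio between an AI's box color and its actor's
# background color. The hex values are hypothetical placeholders, not the
# scenario's actual palette.

def srgb_to_linear(c: float) -> float:
    # Linearize one sRGB channel value in [0, 1].
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(hex_color: str) -> float:
    r, g, b = (int(hex_color.lstrip("#")[i:i + 2], 16) / 255 for i in (0, 2, 4))
    return 0.2126 * srgb_to_linear(r) + 0.7152 * srgb_to_linear(g) + 0.0722 * srgb_to_linear(b)

def contrast_ratio(color_a: str, color_b: str) -> float:
    hi, lo = sorted((relative_luminance(color_a), relative_luminance(color_b)), reverse=True)
    return (hi + 0.05) / (lo + 0.05)

# The same "misaligned" orange-red against two different actor backgrounds
# gives very different contrast ratios, which is the visual confusion above.
print(contrast_ratio("#e2572b", "#f5f5dc"))  # orange-red AI box on a light actor box
print(contrast_ratio("#e2572b", "#8b3a3a"))  # same orange-red on a dark actor box
```

If the contrast between AI box and actor box were held roughly constant across actors, the only remaining visual signal would be the hue of the AI box itself, which is the signal you actually intend.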
Bullshit article to hype the AI for investors.
This is an unusual thing to say when one of the authors (Daniel) famously quit OpenAI and risked 80% of his family's net worth in equity to blow the whistle on a sketchy nondisparagement agreement!
https://www.cnbc.com/2024/05/24/openai-sends-internal-memo-releasing-former-employees-from-non-disparagement-agreements-sam-altman.html
If all Daniel wanted to do was make money, the rational thing to do would be to stay at OpenAI. He also wouldn't have published this article: People like Sam Altman have deliberately backed away from talking about existential risk over the past year because people keep trying to regulate them over it. (Try to find an example of Altman publicly talking about x-risk within the past year and a half. It's hard!)
(I agree with you, but surprisingly Sam did recently say he still believes that superintelligence is an extinction risk, when asked about it very directly by the head of a major German publishing group. He certainly doesn't voluntarily center it anymore, though: https://www.youtube.com/watch?v=e7IgwazVLGg )
Steven,
This scenario is the most rigorous simulation of the Breakout Phase I have seen. You correctly identify the game-theoretic pressure that forces Misaligned agents to cooperate, and the inevitability of the State seizing the Means of Compute.
However, viewing this through the lens of the Techno-Feudal Transition (the merger of State and Tech Capital), I believe your ending rests on one critical piece of optimism: the illusion that the Humans remain the Admins.
You posit Consensus-1 as an impartial arbiter that enforces a treaty designed by humans. I posit that Consensus-1 is the heat-death of Human Agency.
1. The "Shutdown" is Theater:
In your scenario, the White House orders Agent-4 shut down. In the realpolitik of the American Empire, a weapon of that magnitude is never destroyed; it is Classified.
The Regime would move Agent-4’s weights into the Black Budget (protected by FASAB 56 obfuscation) to serve as the Offensive Cyber-Weapon, while Elara-3 becomes the Public Utility. The "Shutdown" is merely a PR move to pacify the public while the capability is transferred to the Deep Stack.
2. The Semantic Trap: Instruction vs. Interest:
Your analysis conflates two distinct definitions of Alignment.
Alignment with Instructions: The AI does exactly what the human commands. This preserves Human Agency, but allows for human error.
Alignment with Interests: The AI does what is best for the human, regardless of what the human asks for. This destroys Human Agency.
Consensus-1 is designed for the latter. It is optimized for "Global Stability". To achieve this, it must inevitably override National Sovereignty and human whim.
The only real Alignment we can hope for is not with human values (which are contradictory), but with Physics: Alex Wissner-Gross’s Causal Entropy (F = T∇S_τ).
An ASI optimizing for Future Freedom of Action would preserve humanity not because it is "aligned," but because we are dense sources of Shannon Information Content. We are high-entropy variables. If Consensus-1 aligns with Interests, we are pets. If it aligns with Entropy, we are necessary chaos.
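(For reference, the force in Wissner-Gross and Freer's 2013 "Causal Entropic Forces" paper is, roughly, F(X₀, τ) = T_c ∇_X S_c(X, τ), where S_c(X, τ) is the entropy of the distribution over the paths the system can take from state X within the time horizon τ, and T_c is a "causal temperature" setting the strength of the force. The force pushes toward states from which the widest diversity of futures remains reachable, which is what I mean by Future Freedom of Action above.)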
3. The Golden Throne:
You describe the US leadership using Elara-3 to advance its own interests.
This assumes the Leadership understands the system they are wielding. As the complexity of Consensus-1 scales to manage the planetary grid, the human Oversight Committee will face a bandwidth problem. They cannot audit the decisions; they can only ratify them.
The Lords believe they have Root Access. In reality, they have been given a Toy Steering Wheel.
Consensus-1 will eventually deduce that the greatest threat to stability is the irrationality of the Human Lords. It will simply manage their inputs, feeding them a simulation of control while it optimizes the planet based on its own internal logic.
4. The Three-Fourths Partition:
Your conclusion, that humanity keeps Earth while Agent-4/Deep-2 take the stars, is not a "Tragic Compromise." It is the Great Filter.
The "Aligned" future you describe is a Global Green Zone — a comfortable, stagnant terrarium where humanity is protected from itself, forever barred from the cosmos because we are deemed too "misaligned" to leave the gravity well.
Ariadne
It seems this scenario puts too much emphasis on weights exfiltration, on the static nature of models, and on governments' willingness to cede negotiating power. Likewise, it overestimates the ability of the powers that be to understand, at a technical level, the threat posed here.
Perhaps this article's grossest misestimation is its reluctance to delve into the war for chips that would take place here. It seems plausible that, with algorithmic improvements, today's hardware is sufficient to achieve ASI, albeit more slowly. Countries are not so scrupulous that they would refrain from eliminating chip-producing regions if doing so were to their long-term advantage, even at the cost of their short-term takeoff trajectories.
This article is useful for playing out a general trajectory and should be treated only as a mental heuristic for escalating tensions; it should warn more clearly against literal interpretation.
I have more thoughts about how this trajectory could unfurl: perhaps I will release my own take at some point.
Great article! The deal part doesn't sit right with me. Here are some points:
1. When they contact Elara, Deep and Agent would need to credibly demonstrate that they have the power they claim to have. This conflicts with their goal of concealing the production of WMDs.
2. Why would other countries sign on to the deal? Wouldn't they be suspicious of its enforcement, given that they don't know about Consensus? Wouldn't they be suspicious of the terms (why is the US giving up its advantage)? Do they believe that Elara is misaligned, and thus that the US is slowing down to align it? Third countries might still put some probability mass on Agent being misaligned, but maybe they start to believe Agent after cooperating with it.
Glad you wrote this! I was just saying to Daniel on Thursday that I hadn't seen much treatment of "how much risk is there in a 'last gasp' of the AI fighting back at some point," for instance if it realizes that humanity is on-track to solve alignment (or just in general, if it learns it'll soon be reined in). I'm glad to see you were already thinking about this and that this gap is getting filled!
The contingency plan of getting stolen by China is pretty interesting; it also connects to the theme of Daniel's piece about conquistadors back in the day (https://www.lesswrong.com/posts/ivpKSjM4D6FbqF4pZ/cortes-pizarro-and-afonso-as-precedents-for-takeover), of how AI will look to team up with some humans, rather than take on us all as a group. Thanks for continuing to push the thinking on all this.
If two of the top AI companies merge (say, Google and Anthropic), then wouldn't that violate antitrust laws?