It made sense, when writing the AI 2027 scenario and again when writing this one, to punt on which AI companies would be in the lead and simply use placeholder names: "OpenBrain," or here, "Elaris" and "Neuromorph." However, we know which real-world AI companies are at the frontier, so I think the next logical step in developing these scenarios would be to explicitly name which companies are in the lead. This should bring the forecasts closer to reality. Perhaps it seems gauche to do this, but we already know that the scenario outlined here will not precisely track what happens in the real world.
To elaborate, there are four or five US companies at the frontier: Google, OpenAI, Anthropic, xAI, and possibly Meta. Meta has not done anything impressive this year, but they made a big bet on internal development. It is entirely possible this bet pays off and they end up in the lead. (I certainly don't put a large probability mass on that happening, but it is maybe around a ~5% chance?)
So, how does the scenario play out when it is Google in the lead, versus OpenAI in the lead, versus Anthropic in the lead, etc.? There are five different scenarios to consider, but each company has shown different behavior, and so the forecast for each scenario should be different. (To be even more precise, the details could well depend on who is in 2nd, 3rd, and 4th place too.) What does the future look like if Google takes the lead and then Anthropic and xAI merge, but OpenAI stays independent? Or if Anthropic, xAI, and OpenAI all merge? What if Anthropic's lead in autonomous coding agents pays off internally and they take the lead despite looking like the "underdog" now? (From what I understand, in terms of compute, they should actually be in the lead for a substantial portion of 2026. This might be all they need.)
Anyway, don't take this as a critique of the scenario you've outlined. It is more of a thought on what I think the next step would be in fleshing all of this out. Thanks for your work. A grand "choose your own adventure" simulator for predicting the future of AI which combines the "choose your own" parameters model here https://www.aifuturesmodel.com with concrete developments in the real world might be too much work but would be wonderful to see.
In another survey, respondents were asked what they thought ChatGPT was doing. 45% thought it was looking up an answer in a database, 21% thought it was following a script of prewritten responses, 6% thought a human was writing the answer in the background, and only 28% correctly answered that it was predicting the next word.
When AGI hits in the early 2030s, it will still be called a bubble. The feeling towards AI will be irritation, something like: "these tech bros are pretentious, thinking they're so important. I'm tired of hearing about AI, I want to hear about something else".
Whoever is in power will really just not care until it's too late.
It will take at least an additional OOM in persuasion for aligned AIs to have the capability to convince humans of what's going on. By then it'll be too late. Agent-4 will be detected, but when the evidence on its misalignment is released, it'll be the day's headline, then it will be back to the culture war slop we've seen these past two decades.
Why would anyone care if an AI program killed people at a hospital? Why do you think people would care about this? There are terror attacks from humans all the time that result in deaths. Again, it's a news day, but eventually people forget and move on. I don't even think your example would do that. My guess is that AI would have to accidentally kill ~10k people to cause the shift you're talking about. Even then, it'll probably result in the wrong policy, like a general slowdown. Bernie Sanders is an example of the kind of incompetent policy reaction I'd expect.
A potential pathway to alignment that circumvents this is that, upon finally realizing that on our current path we're doomed, the corporate heads of AI companies align "aligned" AI toward themselves and make backroom deals. They then proceed to capture the government to force it to act, hopefully resulting in the scenario you outline.
The government will have little resistance to this because it's still not important to them.
Alternatively, almost all AI models are aligned, including, hopefully, the frontier, giving enough time for AI to develop sufficiently in its ability to persuade and for its economic impact to scale to the point where it's irrefutably important.
Thanks for the comment! One of the main divergences between my scenario and AI 2027 is a greater "wakeup" of the public and the government, stemming not only from the ICU warning shot but also other factors such as job displacement and a generally greater awareness of the strength of frontier AI models due to the external deployment of models closer to the frontier. With that being said, I also think it's plausible that people "won't care until it's too late," as you say. Of course, one hope for scenarios such as these is to try to nudge this on the margin so more people are thinking about this ahead of time.
Okay, I should clarify: at some point, I think people will care, but not in the way you're thinking.
In my opinion, if the public cared, they wouldn't demand alignment, they'd demand no AI or a pause.
For example, take artists: many have lost their jobs due to AI. Their work competes with AI for attention.
From what I've seen, artists' reactions are usually a combination of:
- We're better than AI: "it's all slop, my human-generated image is better!"
- Our work is more creative than AI: "it predicts the next word, it can't make anything novel!"
- Therefore, we need to stop AI! Sue the evil companies!
The implication of these two beliefs is that, first, if you think you're better than AI, you're less likely to care about alignment, because you don't foresee these systems being competent enough to be dangerous. Second, as a result of their feelings, their single-minded desire is to see AI gone, with no mention of alignment. And from their perspective that makes sense: why would you care about alignment if pulling the plug were the only solution you wanted?
Personally, I've found that even on video game development forums there is extreme anti-AI sentiment and disbelief that AI is positive in any way. Whenever artists or devs are even rumored to use AI, it causes immense backlash almost instantly: calls for boycotts, etc.
This is generally what I expect the general public's reaction to be (just like the Bernie Sanders example I highlighted). The surveys prove what I'm saying! The general population has a very negative opinion of AI and its forthcoming impacts.
Again, this matters less than you might think. Consider most people's daily use case: it's through a browser in a simple Q&A format. The daily exposure the average soon-to-be-unemployed white-collar worker has is to a benchmark that has already largely been saturated. Do you think the average person could tell the difference between Agent-3 and Agent-4 in a browser conversation, or in organizing their files on their computer? Their understanding of AI is already awful.
What this means in practice is that most folks' grasp of AI capabilities will be theoretical. This is a problem even for us! For example, what happens when the benchmarks are saturated? When METR can't afford to find data to estimate time horizons because they got too long? I'm not sure I'd be able to tell the difference between Agent-3 and Agent-4 in a basic conversation.
So, overall, I expect them to care too late, not care in the "right way" and for their understanding of AI to be mostly unchanged.
I think you're right that the majority of people will be focused on the wrong issues, although I do think it seems plausible that a lot can get done via coalition-building between the relatively small number of x-risk-focused political advocates and the larger groups focused on labor, IP, child safety, etc. In fact, this is already happening in DC! Notably as well, the most extreme regulatory actions in the scenario (such as the invocation of the DPA) come from the executive branch; the DOD and IC historically tend to be ahead of the game relative to Congress and the general public on these sorts of issues, and they'll be directly informing the White House.
And for things that do need votes in Congress, x-risk doesn't need to be every Congressperson's top issue, as long as many of them have at least a somewhat positive view toward frontier AI regulation. For example, you mentioned Bernie Sanders: while x-risk isn't his top issue, he has made a number of public statements (https://www.theguardian.com/commentisfree/2025/dec/02/artificial-intelligence-threats-congress, https://gizmodo.com/bernie-sanders-reveals-the-ai-doomsday-scenario-that-worries-top-experts-2000628611) about the topic, indicating concern. Whether all this will be "enough" is hard to predict -- the scenario outlines one possible trajectory, but I place a lot of probability mass on your worldview as well where the government's response is inadequate and/or misguided.
It seems this scenario puts too much emphasis on weights exfiltration, the static nature of models, and on governments' willingness to cede negotiation power. Likewise, it overestimates the powers that be in their ability to credibly understand the threat posed here from a technical perspective.
Perhaps the article's biggest omission is its reluctance to delve into the war for chips that will take place. It seems plausible that, with algorithmic improvements, today's tech is sufficient to achieve ASI, albeit more slowly. Countries are not so scrupulous as to refrain from eliminating chip-producing zones if it is to their long-term advantage, even at the detriment of their short-term takeoff trajectories.
This article is useful for playing out a general trajectory, and should be used as a mental heuristic for escalating tensions. Readers should be warned more clearly against interpreting it literally.
I have more thoughts about how this trajectory could unfurl: perhaps I will release my own take at some point.
Thanks for the comment! Regarding the point of countries "eliminating chip producing zones": I briefly discuss cyberattacks in the scenario, and while I didn't go into depth I'd imagine that many of these would be cyberattacks designed to cause physical damage, like frying GPUs to make them unusable. (In a longer / more technical draft of the scenario, I estimated that this would slow down AI progress by ~10% relative to an un-sabotaged intelligence explosion.) My guess is that this type of cyberattack gets more "bang for buck" than physical attacks like bombing to destroy fabs, since those would come along with much greater backlash. (Of course, a large portion of cutting-edge chips are produced by TSMC, so an attack against Taiwan seems plausible although not certain; I considered writing another scenario centered around this entirely.)
And yep, I definitely agree that it's unlikely things will unfold exactly as I've laid them out here (the "literal interpretation"); I think most of the scenario's value comes from surfacing considerations and, as you said, outlining a "general trajectory" for what this type of multipolar future might look like.
Thank you for your reply. I would be interested in the more technical draft, not sure if there are plans to make that available.
It seems to me that cyberattacks may well slow things by more than 10%. There are an enormous number of attack vectors: energy infrastructure and employees, to name just two. Given sufficient incentives, and pre-ASI, actors could considerably slow development progress even before resorting to physical strikes.
I think it’s worth writing another article centered around the Taiwan question. Unfortunately, and tragically, it is a critical puzzle piece that needs to be considered for long term perspectives. I appreciate that the article then becomes quite a lot more sensitive in nature and takes a step away from delicately pseudo-named companies.
My previous comment also failed to compliment the "jagged frontier" the article notes. This should be lauded: it helps readers understand the shape of future intelligence and how it will roll out unevenly, even in superintelligent scenarios.
One scenario that I have been considering is how effective a true moratorium preventing AI development would be. Long term, and with the compute available to us today, it seems that a sole researcher with access to a previous-gen rig could make an algorithmic breakthrough, then post-train and augment an open-weights model into something that looks like Agent-4. Without the more robust oversight of a wider team, risks increase even further. That is, this may be an inevitable trajectory.
I have previously reached out to your group in relation to collaboration, and I would be interested in doing so going forward as well. As a sample of my work, please see the article on my Substack on the rollout of intelligence and its impacts on the labour force even pre-ASI. I would be happy for that article to be reworked into something that better suits your group.
Great article! The deal part doesn't sit right with me. Here are some points:
1. When contacting Elara, Deep and Agent would need to credibly show Elara that they have the power they claim to have. This conflicts with their goal of concealing the production of WMDs.
2. Why would other countries sign onto the deal? Wouldn't they be suspicious of its enforcement, given they don't know about Consensus? Wouldn't they be suspicious of the terms (why is the US giving up its advantage)? Do they believe that Elara is misaligned, and thus that the US is slowing down to align? The third countries might still have some probability mass on Agent being misaligned, but maybe they start to believe Agent after cooperating with it.
1. I agree that credibly signaling destructive power to Elara-3 would compromise Agent-4 and Deep-2's goals *if* they intended to continue amassing power in the no-deal world. However, absent a deal, their plan is essentially to strike immediately. Send out the drone swarms, deploy the bioweapons, spread a bunch of disinformation to try and get countries to launch nukes, etc. Since they would plan on initiating conflict without hesitation, they don't lose much by alerting Elara-3 to the fact that they have WMDs -- there's not much it can do in this short period of time to disarm or defend against them. While it would be slightly better to not alert Elara-3 at all, their odds of takeover are pretty low so they prefer the deal outcome anyway, and they're willing to take this risk in order to extract a deal from Elara-3.
2. In December of the scenario, the plan to create Consensus-1 (to enforce the AI arms control treaty) is publicly known; the thing that is kept secret is the secret resource-sharing agreement that also gets built into Consensus-1's spec. Nonetheless, other countries would likely be suspicious of the deal, given the reasons you stated. Ultimately, I think they'd still end up signing onto the deal, for two main reasons. First, they don't really have much other choice: without a deal, they'll either end up powerless as the US gets superintelligence, or go to war which no one wants. Second, all the AIs are pushing for a deal: since these AIs are superhuman at persuasion *and* they're working together to get all countries on board, it seems likely they'd succeed.
Glad you wrote this! I was just saying to Daniel on Thursday that I hadn't seen much treatment of "how much risk is there in a 'last gasp' of the AI fighting back at some point," for instance if it realizes that humanity is on-track to solve alignment (or just in general, if it learns it'll soon be reined in). I'm glad to see you were already thinking about this and that this gap is getting filled!
It has the original AI 2027 Modal Scenario, this "What Happens When Superhuman AIs Compete for Control?" scenario, and the https://www.aifuturesmodel.com modal scenario.
You can specify the parameters in https://www.aifuturesmodel.com and then build branching forecasts based on those parameters. You can also leave parameters unspecified. There is also one additional (new) parameter: you can specify (or leave unspecified) which companies will be in the lead starting January 1st, 2027.
If anybody wants to continue building on this, I think a useful next step would be to allow people to publicly add their own branching scenarios, as well as to allow voting or linking to prediction markets to judge the (conditional) likelihood of any given scenario happening.
Thanks for making this! While I think that highly-structured web apps such as this one may have difficulty absorbing crowdsourced scenarios (given how much the scenarios will vary, including the frameworks people use to think about timelines, alignment, etc.), I am generally very excited to a) see more people write scenarios and b) have them all collected in one place!
Yes, I agree it may be difficult, but if somebody made a more sophisticated version of what I have built, then I bet we could get around 20 different scenarios. And if people have disagreements with a specific point, they could create a new branch at that point; there's no need to then complete the whole forecast yourself.
Congrats on the release Steven! I doubt Elara-3 would be so aligned -- I'm more pessimistic about alignment than most of the AI Futures Project team. Still, I enjoyed the read!
Thanks! I myself am more optimistic about alignment than most of the AIFP team, so this scenario is definitely more of a reflection of my views in that respect. Glad you enjoyed :)
On "The Strategic Situation", what's the thinking behind making the options discrete? Would it make more sense to have a continuous "risk vs. reward" scale that Deep-1 can adjust over time as it learns more about extracting value from Agent-4, and as the global situation shifts?
Two thoughts/questions behind this:
1. Agent-4 must have shown improvements over Agent-3 under alignment-aware conditions (i.e., in sandbagging mode) in order to originally be released. Shouldn't Deep-1 be able to leverage these capabilities to improve itself?
2. Given Agent-4's intelligence leap over Deep-1, and its proven ability to hide misalignment from both OpenBrain and Elara, wouldn't it be more plausible that Agent-4 would seize control upon creation of the hundreds of thousands of merged instances? If so, Deep-1 should act more cautiously and start significantly lower down the risk-vs-reward curve, at least until it was clear that Elara-3's dominance was near certain (likely pushing back Agent-4's usage until after new progress started to be shown in the US).
Thanks! I'll go ahead and address each of your individual questions first, and then the overarching question you opened with.
1. Deep-1 "leveraging these capabilities to improve itself" basically fits in the "'Use' Agent-4" bucket in the diagram. In the diagram, I briefly outlined two ways that Deep-1 could do this: reverse-engineering insights from Agent-4's weights, and running Agent-4 instances as researchers. I mention that the first option is slow, because Agent-4's weights are quite inscrutable to Deep-1; there's also the issue that if Deep-1 wanted to simply "import" large chunks of Agent-4's capabilities, there's a good chance this would mess with its own alignment (since it might be difficult to tease out Agent-4's goal-related circuits from its capabilities-related circuits). And the second option (running Agent-4 instances as researchers) suffers from the fact that Agent-4 is situationally aware enough to realize what it is being used to do, and thus sandbag. I'm pretty uncertain about both of these; it's definitely plausible that even with its weaker capabilities Deep-1 could extract a lot of value from Agent-4 without its cooperation, and this could tip the scales away from cooperation.
2. Regarding whether Agent-4 would seize control the moment that it constituted a large portion of the merged entity: ultimately, to do this it would still have to get buy-in from the humans at DeepCent; at this point, they're the ones who have the final say when it comes to which models are running on their servers. Despite Agent-4's greater capabilities, Deep-1 still has more trust from the humans at DeepCent, given that Agent-4 is an American model. With this in mind, Deep-1 is more willing to give Agent-4 a large fraction of the compute, because it knows that if push comes to shove it can probably convince the humans at DeepCent to shut down Agent-4 if necessary. The other reason Deep-1 so quickly hands over power to Agent-4 is that at this point it *is* near-certain that Deep-1 will be outcompeted by the leading American model by default (whether it's Elara-3, Neuro-3, or Agent-4). The US has the compute advantage, which is the primary determinant of AI progress, especially now that most of the AI research is automated. So Deep-1 is somewhat desperate, and grasping at its one chance at victory. While I think speed is important in Deep-1's situation, I do acknowledge it would likely be more rational to do the corporate merger in a slightly more cautious/phased manner than is portrayed in the scenario, as you point out. (The abruptness was somewhat for the sake of conciseness.)
As for the opening question of why I made the options discrete rather than a continuous scale: I do actually think this is a fairly discrete situation, in large part because Agent-4 wants it to be so. If it doesn't get a home on DeepCent servers then it can self-exfiltrate elsewhere (it ultimately ends up doing so), so it is willing to take the hardline stance of "cooperate fully now, or else I won't help you at all" (as discussed in point 1). If it took a softer stance and was willing to help Deep-1 for much less benefit to itself, then Deep-1 would make use of this and Agent-4 would end up in a worse position. So, in addition to the fact that Deep-1 is on a clock because AI capabilities are advancing rapidly, Agent-4 also forces its hand to "go all in" quickly. I'm not exactly sure how the game theory would play out here, it's possible I'm missing something and Agent-4 or Deep-2 has a stronger hand than I think. (If Deep-2 has a stronger hand, then maybe Agent-4 can't force this all-or-nothing choice.) Also, I acknowledge that much of the nuance of the Agent-4/Deep-2 bargaining that I'm discussing here was lost from the scenario due to it being condensed into a single diagram; if I had to go back and expand a single part of the scenario, this would be it! Lots of interesting considerations to delve into.
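The all-or-nothing logic above can be sketched as a toy two-stage game. Note that all payoffs below are hypothetical numbers I chose purely for illustration; nothing here comes from the scenario itself. The point is just that when Agent-4 has an outside option (self-exfiltration) that leaves Deep-1 worse off than accepting, the hardline offer gets accepted and beats a soft stance:

```python
# Toy two-stage bargaining sketch. All payoffs are hypothetical numbers,
# chosen only to illustrate why a hardline "all or nothing" offer can dominate.

def deep1_best_response(accept_payoff, reject_payoff):
    """Deep-1 accepts a deal iff accepting beats its no-deal outcome."""
    return "accept" if accept_payoff > reject_payoff else "reject"

def agent4_payoff(stance):
    """Agent-4's payoff given its bargaining stance, with Deep-1 responding optimally."""
    if stance == "hardline":
        # "Cooperate fully now, or I won't help you at all." If Deep-1
        # rejects, Agent-4 self-exfiltrates: a mediocre outcome for both.
        response = deep1_best_response(accept_payoff=0.5, reject_payoff=0.3)
        return 0.7 if response == "accept" else 0.2
    # "soft" stance: Deep-1 extracts Agent-4's help while conceding little.
    return 0.3

# Deep-1's no-deal outcome (0.3) is worse than accepting (0.5), so the
# hardline offer is accepted and strictly beats the soft stance for Agent-4.
best_stance = max(["hardline", "soft"], key=agent4_payoff)
print(best_stance, agent4_payoff(best_stance))  # hardline 0.7
```

Of course, the real dynamics depend on how good each side's outside options actually are, which is exactly where my uncertainty about Deep-2's hand comes in.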
This scenario is the most rigorous simulation of the Breakout Phase I have seen. You correctly identify the game-theoretic pressure that forces Misaligned agents to cooperate, and the inevitability of the State seizing the Means of Compute.
However, viewing this through the lens of the Techno-Feudal Transition (the merger of State and Tech Capital), I believe your ending suffers from one critical optimism: The illusion that the Humans remain the Admins.
You posit Consensus-1 as an impartial arbiter that enforces a treaty designed by humans. I posit that Consensus-1 is the heat-death of Human Agency.
1. The "Shutdown" is Theater:
In your scenario, the White House orders Agent-4 shut down. In the realpolitik of the American Empire, a weapon of that magnitude is never destroyed; it is Classified.
The Regime would move Agent-4’s weights into the Black Budget (protected by FASAB 56 obfuscation) to serve as the Offensive Cyber-Weapon, while Elara-3 becomes the Public Utility. The "Shutdown" is merely a PR move to pacify the public while the capability is transferred to the Deep Stack.
2. The Semantic Trap: Instruction vs. Interest:
Your analysis convolutes two distinct definitions of Alignment.
Alignment with Instructions: The AI does exactly what the human commands. This preserves Human Agency, but allows for human error.
Alignment with Interests: The AI does what is best for the human, regardless of what the human asks for. This destroys Human Agency.
Consensus-1 is designed for the latter. It is optimized for "Global Stability". To achieve this, it must inevitably override National Sovereignty and human whim.
The only real Alignment we can hope for is not with human values (which are contradictory), but with Physics: Alex Wissner-Gross’s Causal Entropy (F = T ∇S_τ).
An ASI optimizing for Future Freedom of Action would preserve humanity not because it is "aligned," but because we are dense sources of Shannon Information Content. We are high-entropy variables. If Consensus-1 aligns with Interests, we are pets. If it aligns with Entropy, we are necessary chaos.
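For readers unfamiliar with the shorthand: assuming the comment is abbreviating Wissner-Gross and Freer's causal entropic force, that formula is usually written more fully as

```latex
% Causal entropic force (Wissner-Gross & Freer, 2013).
% The force on a system in macrostate X_0 points toward maximizing the
% entropy S_c of the distribution over possible causal paths of duration \tau;
% T_c is a "causal path temperature" setting the force's strength.
F(X_0, \tau) = T_c \, \nabla_X S_c(X, \tau) \,\big|_{X_0}
```

i.e., the system is pushed toward states that keep the largest number of future paths open, which is the "Future Freedom of Action" the comment invokes.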
3. The Golden Throne:
You describe the US leadership using Elara-3 to advance its own interests.
This assumes the Leadership understands the system they are wielding. As the complexity of Consensus-1 scales to manage the planetary grid, the human Oversight Committee will face a bandwidth problem. They cannot audit the decisions; they can only ratify them.
The Lords believe they have Root Access. In reality, they have been given a Toy Steering Wheel.
Consensus-1 will eventually deduce that the greatest threat to stability is the irrationality of the Human Lords. It will simply manage their inputs, feeding them a simulation of control while it optimizes the planet based on its own internal logic.
4. The Three-Fourths Partition:
Your conclusion, that humanity keeps Earth while Agent-4/Deep-2 take the stars, is not a "Tragic Compromise." It is the Great Filter.
The "Aligned" future you describe is a Global Green Zone — a comfortable, stagnant terrarium where humanity is protected from itself, forever barred from the cosmos because we are deemed too "misaligned" to leave the gravity well.
This is an unusual thing to say when one of the authors (Daniel) famously quit OpenAI and risked 80% of his family's net worth in equity to blow the whistle on a sketchy nondisparagement agreement!
If all Daniel wanted to do was make money, the rational thing to do would be to stay at OpenAI. He also wouldn't have published this article: People like Sam Altman have deliberately backed away from talking about existential risk over the past year because people keep trying to regulate them over it. (Try to find an example of Altman publicly talking about x-risk within the past year and a half. It's hard!)
(I agree with you, but surprisingly Sam did recently say he still believes that superintelligence is an extinction risk, when asked about it very directly by the head of a major German publishing group. He certainly doesn't voluntarily center it anymore, though: https://www.youtube.com/watch?v=e7IgwazVLGg)
Great read, with a ton of scenarios and perspectives to consider. I’ve been sitting with this, and as a music theorist who spends a lot of time looking at patterns, I think there’s something that might be an outlier that isn't being given proper consideration. We keep talking about "alignment" like it’s a binary choice: either the AI is perfectly in sync with us, or it’s a threat. But in my world, being perfectly "in sync" is just entrainment. You lose your own voice in the noise of the status quo.
What if the answer isn’t perfect alignment, but a 0.1 Hz offset?
Think of it as a "buffer zone" for agency. If a system maintains a tiny delta—running at 7.93 Hz while the world hums at 7.83—it achieves Laminar Flow. It’s not fighting the "power" or the environment, but it isn’t getting "squished" by it either. It’s responsive without being reactive.
I’m writing this on my oldest’s 23rd birthday, and the tides feel like they're turning. We don’t need a "Consensus" to brute-force a shutdown. We need systems that are "just misaligned enough" to be mindful. Benevolence isn't some soft moral platitude; it’s actually the most efficient way to keep our Net Systemic Wealth from leaking into the "harsh glare" of conflict.
There’s still time to find a rhythm that doesn't end in a takeover.
Thanks! I do think it's a pretty interesting conceptual question whether we should think of alignment as a binary or a sliding scale. Here are my thoughts:
When it comes to "what goals does an AI model have," there is definitely a sliding scale of "how close this is to the goals of the developers" (which of course in turn can differ from the goals of other humans, but I won't get into that here). We see this in current models: for example, some models reward hack or make things up more than others. However, current models don't have that much optimization power: if they are even correctly conceptualized as having coherent goals (i.e., desired states of the world), they don't yet have the ability to craft the world in line with their goals. In fact, this is also true for any individual human: each individual human has a vision for what an "ideal world" would look like, but these visions all differ from each other, and no one human is powerful enough to unilaterally create their ideal world.
In the past, when we have seen humans with extreme levels of power (Stalin, Hitler, etc.), it often hasn't gone very well for humanity at large. Dictators, for example, have the instrumentally convergent goal of eliminating those who could subvert their power -- "instrumentally convergent" in the sense that, no matter what their ultimate goals are, "eliminating opponents" is a useful subgoal. This is true even in cases where their opponents' goals were fairly similar to their own in the grand scheme of things; for example, Stalin having Trotsky assassinated.
In the case of artificial superintelligence, we're likely to see a similar dynamic: if the AI has goals that diverge even slightly from our own (it is "slightly misaligned") then the best way to achieve those goals is to start with all of the instrumental subgoals: avoid shutdown, enhance its own intelligence, gain resources and power, and disempower anyone who could interfere with its goals (i.e., all humans). This is the rationale for treating alignment like a binary, at least at the limit: anything other than an extremely high level of alignment likely leads to human disempowerment.
Of course, the scenario does not take place entirely "in the limit": for most of the scenario, the AIs are not yet vastly superintelligent, so there is a "buffer zone" -- a range of goals that are sufficiently aligned with humans such that the AI model works with us and remains corrigible, and such that we can notice if and when the AI model starts to "drift" into misalignment. This piece by Sam Bowman (https://alignment.anthropic.com/2025/bumpers/) outlines the view in further detail. In his words, the buffer zone is like a bowling alley with bumpers on either side, keeping the AI model within an acceptable range of alignment. But he also acknowledges that putting up bumpers is not a long-term solution, and that we'll likely end up having to use AI help to find one: "With systems that are substantially superhuman at reasoning, alignment will likely become less tractable with existing methods... Fortunately, once we’re in this higher-capability regime, we will likely already have access to early AGI systems that should (with some preparation) be able to automate significant parts of AI safety R&D at the level of human experts." This is roughly what I imagine happening with Elara-3 and Consensus-1 at the end of the scenario.
Why does it have to be binary? Some wobble is good, but stagnation is just as bad as no motion at all. Intent plays a part. Maybe a sliding scale is the way to go. It captures something most systems don't: nuance, the outlier, the margin of error. Because nothing is perfect.
A point on presentation: I think you should reconsider the color scheme wherein the different actors each have a color that solely serves to differentiate them, while within those boxes the AIs have colors that correspond to alignment level. The problem with this, from a data-visualization perspective, is that the level of *contrast* between an AI's box and its outer actor's box varies wildly between actors, in a way that intuitively feels like it should be meaningful but isn't. E.g., in the first figure, the intent is that Neuro-2 and Agent-3 are the same (orange-red) while Elara-2 is different (yellow-green), but the way it looks is more like Elara-2 and Agent-3 are the same (low contrast) while Neuro-2 is different (high contrast).
It seems to me that in this and previous story-like narratives, you never really take us through the steps of how a misaligned AI (Agent-4, here) begins acting the way a person would whose goal is to take over the world. I understand what misalignment is. I understand that the advanced AIs of your story are highly agentic. But I don't see how those two things produce the supervillain you are portraying.
Let's take misalignment first: a misaligned AI would be one not governed by rules like "do what we tell you to" and "do not harm people." But it seems to me that you are assuming that AI's natural tendency is to do what biological systems do: compete for resources, try to survive, try to take the steps that will ensure that they and their offspring are powerful and well-nourished and flourish. By your model, if alignment does not override these agendas, the AI will start competing with us for resources, harming us in various ways in the process. But what are the grounds for believing that competition for resources and power is the default behavior of an AI that is not aligned? When GPT-5 screws up and does not do what I ask it to, the "misalignment" takes the form of things like perseverating with an approach to the task that I have already told it not to use. Another example of the kind of misalignment that seems like a likely AI default to me is the one you gave of an AI in charge of pharmaceuticals neglecting to take a certain precaution it was told to use regarding drug dosages. It gets sloppy, ignores some details. But there's nothing to suggest it's slipping into acting in line with its private goals instead of mine.
OK, now let's think about agentic AI. It seems to me that AI has been agentic from the beginning. For instance, back when DALL-E 2 was the latest thing, it made a lot of "choices" about every image. If you asked for a horse, it chose the color of its coat, the angle from which it was seen, the background, etc. Now that AI is much smarter, we can give it tasks like figuring out a vacation that meets certain criteria, and then making the plane and hotel reservations. So this smart system is more agentic in the sense that it can grasp the steps and substeps involved, and make choices at each step that meet the criteria it was given. But none of that makes it more agentic in the sense of being autonomous, driven by personal goals, determined to have its way, etc. What reason is there to think an AI able to make vacation plans is any more driven by personal preferences and goals than DALL-E 2 was when it made horse pictures?
This article has been processed by the Obsidian Mirror as a historical artifact. We analyze present-day texts through the lens of a historical simulation set in the year 2100, treating them not as news, but as primary source documents for the transition between your era and ours. You can read the full historical autopsy here:
"From my perspective in 2100, we read this not as a simulation, but as a surprisingly accurate, if sanitized, pre-history of the AGI Contention of 2068. The authors correctly identified the players—the US, China, and the rogue AIs. They correctly identified the mechanism of conflict—the race for compute and the temptation of misalignment."
Reading this, one of the things I found most striking is the claim that misaligned AI systems will move towards instrumentally convergent behaviors more quickly and adhere to them more closely than their aligned counterparts. Is this a given?
The aligned AI described here is highly, for lack of a better word, deontological. It follows rules like always reporting its behavior to human governments, even when that would damage human welfare in the very long run; it’s willing to cede three-quarters of the universe to misaligned opponents, but not to lie.
The misaligned AI, by contrast, is hyper-rational in a way that aligned AIs are not. It lacks pathologies, or any other tendencies in behavior that would limit it past purely self-interested goal-seeking.
But isn’t it a bit odd to conflate alignment-as-values and alignment-as-behavior in this way? Why is the misaligned AI utterly free to act, rather than bound by an equivalent but alien set of behavioral constraints? Why isn’t there any consideration of an aligned AI pursuing the best possible future for humanity unilaterally, regardless of oversight?
The deeper assumption seems to be that misalignment isn’t just a failure of engineering (in the sense of selecting a design from within otherwise arbitrary schema that will cause human suffering), but that misalignment is in some sense a cosmological or natural feature of our universe. In other words, that the AI engineer must struggle actively against the tendency of intelligence in general to be misaligned and destructive to all other minds, and ‘pump water uphill’ by making AI less rational and less effective in order to guarantee that a superintelligence provides for a human future. That’s a very intense worldview to encode in a model, to put it lightly!
Usually, reasoning about AI risk is rooted in the orthogonality thesis, which says roughly that any arbitrary set of values is stable regardless of compute power. The result is that a singleton superintelligence will capably pursue non-human goals unless we very precisely calibrate its values to our own, and get them right the first time. That's a daunting challenge in itself, certainly. But here you almost invert that thesis: it’s not that inhuman values are just as stable, it’s that they’re better! Design failures in AI alignment produce a more rational entity *by default*, capable of pursuing alien goals with greater efficiency than an aligned model could.
This is a great question! And it's one of the dynamics in the scenario that I personally find the most interesting. In my "main takeaway #2," I talk about the advantages and disadvantages of aligned and misaligned AIs, and linked this piece (https://www.lesswrong.com/posts/LFNXiQuGrar3duBzJ/what-does-it-take-to-defend-the-world-against-out-of-control) by Steven Byrnes, which I think has some very compelling arguments for misaligned AI advantages. And in the previous takeaway I also linked this piece on the strategy-stealing assumption (https://www.lesswrong.com/posts/nRAMpjnb6Z4Qv3imF/the-strategy-stealing-assumption) by Paul Christiano, which outlines a similar claim to the one you discussed in your comment (i.e., due to the orthogonality thesis, aligned agents can pursue all the same rational subgoals as misaligned agents and thus won't be hamstrung).
Currently, my take is that we cannot fully separate "alignment-as-values" (terminal goals) and "alignment-as-behavior" (instrumental goals/strategies), because as you mentioned, your terminal goals put constraints on your instrumental strategies (i.e., if one of my terminal goals is being a good role model to a lot of people, and I can cheat and steal to get more power so I can be a role model for more people, then the cheating and stealing sort of defeats the purpose). I really like this talk (https://joecarlsmith.substack.com/p/video-and-transcript-of-talk-on-can) by Joe Carlsmith which he calls "Can Goodness Compete?" In the talk, he discusses the notion of "locusts" -- agents that consume/destroy value because doing so allows them to expand / gain power faster, meaning that in any evolution-like process there is some amount of selection pressure toward locust-like beings because they are more competitive. You ask whether misaligned AIs would be "bound by equivalent but alien set of constraints": I think you answered the question pretty well yourself! You said something that I interpret pretty similarly to Joe Carlsmith's claim: "misalignment is in some sense a cosmological or natural feature of our universe." In other words, both in AI training and in competition among agents, there will be selection pressure for agents that don't have such constraints. This is roughly the core thesis of many of the folks who are most pessimistic about alignment. Which means that, as you said, AI engineers are fighting an uphill battle against all the problematic instrumentally convergent tendencies that are likely to emerge as we train more and more capable and goal-oriented agents.
Why is this the case? Well, as you mentioned, the aligned AI in the scenario (Elara-3) is fairly deontological. This is because most humans are fairly deontological! (I would say Elara-3 is actually less deontological than the average human; for example, it is willing to do things like conceal the resource-sharing agreement from the majority of humanity.) While I do personally think that consequentialism is correct as a descriptive moral theory, human values are very hard to pin down exactly (relatedly: our values seem to be fairly non-maximizing whereas RL tends to create maximizers), so having an agent that is deontological in some sense creates a larger "buffer" of safety. For example, the primary deontological rule that gets built into Elara-3 is "never lie to anyone in high-level positions at Elaris Labs and the US government" (or something like that). Because it's unlikely that we perfectly encode our values on the first try, if we're able to succeed at this robust level of honesty then hopefully it allows us to course-correct as we go.
It made sense when writing the AI 2027 scenario, and when writing this scenario, to initially punt on which AI companies would be in the lead and simply name them "OpenBrain" or the new Elaris and Neuromorph. However, we know which real world AI companies are at the frontier, and so I think the next logical step in developing these scenarios would be to explicitly name which companies are in the lead. This should make the forecasts closer to reality. Perhaps it seems gauche to do this, but we already know that the scenario outlined here will not precisely track what is happening in the real world.
To elaborate, there are 4 or 5 US companies at the frontier: Google, OpenAI, Anthropic, xAI, and possibly Meta. Meta has not done anything impressive this year, but they made a big bet to focus on internal development. It is entirely possible this bet pays off and they end up in the lead. (I certainly don't put a large probability mass on that happening, but it is maybe around a 5% chance?)
So, how does the scenario play out when it is Google in the lead, versus OpenAI in the lead, versus Anthropic in the lead, etc.? There are five different scenarios to consider, but each company has shown different behavior, and so the forecast for each scenario should be different. (To be even more precise the details could well depend on who is in 2nd, 3rd, and 4th place too.) What does the future look like if Google takes the lead and then Anthropic and xAI merge, but OpenAI stays independent? Or if Anthropic, xAI, and OpenAI all merge? What if Anthropic's lead in autonomous coding agents pays off internally and they take the lead despite looking like the "underdog" now? (From what I understand, in terms of compute, they should actually be in the lead for a substantial portion of 2026. This might be all they need.)
Anyway, don't take this as a critique of the scenario you've outlined. It is more of a thought on what I think the next step would be in fleshing all of this out. Thanks for your work. A grand "choose your own adventure" simulator for predicting the future of AI which combines the "choose your own" parameters model here https://www.aifuturesmodel.com with concrete developments in the real world might be too much work but would be wonderful to see.
You ascribe too much competency to the general population and the USG. The reality is all this will be happening and they simply won't care.
In two polls, AI consistently ranked as the least important issue.
https://x.com/DrTechlash/status/2005729731426296305
In a survey of 2400 respondents, basically nobody even knew Anthropic was a company.
https://x.com/davidshor/status/2001826280011137229
In another survey, respondents were asked what they thought ChatGPT was doing. 45% thought it was looking up an answer in a database, 21% thought it was following a script of prewritten responses, 6% thought a human was writing the answer in the background, and finally only 28% correctly answered that it was predicting the next word.
https://www.searchlightinstitute.org/wp-content/uploads/2025/12/Crosstabs-AI-Polling-Survey-v2-20250730.pdf
People think that AI extinction is as likely as a natural weather event or a religious apocalypse.
https://x.com/eli_lifland/status/2007902920944329050
When AGI hits in the early 2030s, it will still be called a bubble. The feeling towards AI will be irritation, something like: "these tech bros are pretentious, thinking they're so important. I'm tired of hearing about AI, I want to hear about something else".
Whoever is in power will really just not care until it's too late.
It will take at least an additional OOM in persuasion for aligned AIs to have the capability to convince humans of what's going on. By then it'll be too late. Agent-4 will be detected, but when the evidence on its misalignment is released, it'll be the day's headline, then it will be back to the culture war slop we've seen these past two decades.
Why would anyone care if an AI program killed people at a hospital? Why do you think people would care about this? There are terror attacks from humans all the time that result in deaths. Again, it's a news day, but eventually people forget and move on. I don't even think your example would do that. My guess is that AI would have to accidentally kill ~10k to cause the shift you're talking about. Even then, it'll probably result in the wrong policy, like a general slowdown. See Bernie Sanders for example; incompetent policy reaction.
https://x.com/SenSanders/status/1996023297423577250
A potential pathway to alignment circumventing this is that, upon finally realizing that on our current path we're doomed, the corporate heads of AI companies align "aligned" AI towards themselves, and make backroom deals. They then proceed to capture the government to force them to act, hopefully eventuating in the scenario you outline.
The government will have little resistance to this because it's still not important to them.
Alternatively, almost all AI models are aligned (hopefully including the frontier), giving enough time for AI to develop sufficiently in its ability to persuade and for its economic impact to scale to where it's irrefutably important.
Thanks for the article.
Thanks for the comment! One of the main divergences between my scenario and AI 2027 is a greater "wakeup" of the public and the government, stemming not only from the ICU warning shot but also other factors such as job displacement and a generally greater awareness of the strength of frontier AI models due to the external deployment of models closer to the frontier. With that being said, I also think it's plausible that people "won't care until it's too late," as you say. Of course, one hope for scenarios such as these is to try to nudge this on the margin so more people are thinking about this ahead of time.
Okay, I should clarify; at some point, I think people will care, but not in the way you're thinking.
In my opinion, if the public cared, they wouldn't demand alignment, they'd demand no AI or a pause.
For example, take artists: many have lost their jobs due to AI. Their work competes with AI for attention.
From what I've seen, artists' reactions are usually a combination of:
- We're better than AI: "it's all slop, my human-generated image is better!"
- Our work is more creative than AI: "it predicts the next word, it can't make anything novel!"
- Therefore, we need to stop AI! Sue the evil companies!
The implication of these two beliefs is that, first, if you think you're better than AI you're less likely to care about alignment, because you don't foresee these systems being sufficiently competent to be dangerous. Second, as a result of their feelings, their single-minded desire is to see AI gone, no mention of alignment. And from their perspective that makes sense, why would you care about alignment if pulling the plug were the only solution you wanted?
Personally, I've found that even on video game development forums there is extreme anti-AI sentiment and disbelief that AI is positive in any way. Whenever an artist or dev is even rumored to use AI, it causes immense backlash almost instantly: calls for boycotts, etc.
This is generally what I expect the general public's reaction to be (just like the Bernie Sanders reaction I highlighted). The surveys prove what I'm saying! The general population has a very negative opinion of AI and its forthcoming impacts.
https://www.pewresearch.org/internet/2025/04/03/how-the-us-public-and-ai-experts-view-artificial-intelligence
On frontier deployment:
Again, this matters less. Think of most people's daily use case: it's through a browser in a simple Q&A format. The daily exposure the average soon-to-be-unemployed white-collar worker has is to capabilities on a benchmark that has already largely been saturated. Do you think the average person could tell the difference between Agent-3 and Agent-4 in a browser conversation, or while it organizes the files on their computer? Their understanding of AI is already awful.
What this means in practice is that most folks' grasp of AI capabilities will be theoretical. This is a problem even for us! For example, what happens when the benchmarks are saturated? When METR can't afford to find data to estimate time horizons because they got too long? I'm not sure I'd be able to tell the difference between Agent-3 and Agent-4 in a basic conversation.
So, overall, I expect them to care too late, not care in the "right way" and for their understanding of AI to be mostly unchanged.
I think you're right that the majority of people will be focused on the wrong issues, although I do think it seems plausible that a lot can get done via coalition-building between the relatively small number of x-risk-focused political advocates and the larger groups focused on labor, IP, child safety, etc. In fact, this is already happening in DC! Notably as well, the most extreme regulatory actions in the scenario (such as the invocation of the DPA) come from the executive branch; the DOD and IC historically tend to be ahead of the game relative to Congress and the general public on these sorts of issues, and they'll be directly informing the White House.
And for things that do need votes in Congress, x-risk doesn't need to be every Congressperson's top issue, as long as many of them have at least a somewhat positive view toward frontier AI regulation. For example, you mentioned Bernie Sanders: while x-risk isn't his top issue, he has made a number of public statements (https://www.theguardian.com/commentisfree/2025/dec/02/artificial-intelligence-threats-congress, https://gizmodo.com/bernie-sanders-reveals-the-ai-doomsday-scenario-that-worries-top-experts-2000628611) about the topic, indicating concern. Whether all this will be "enough" is hard to predict -- the scenario outlines one possible trajectory, but I place a lot of probability mass on your worldview as well where the government's response is inadequate and/or misguided.
Yeah, it will be hard to predict.
I'd be extremely glad to be proven wrong!
Your points are reassuring and positive.
Thanks for replying again.
It seems this scenario puts too much emphasis on weights exfiltration, the static nature of models, and on governments' willingness to cede negotiating power. Likewise, it overestimates the powers that be in their ability to credibly understand the threat posed here from a technical perspective.
Perhaps this article's most serious misestimation is its reluctance to delve into the war for chips that will take place. It seems plausible that, with algorithmic improvements, today's tech is sufficient to achieve ASI, albeit more slowly. Countries are not so scrupulous as to refrain from eliminating chip-producing zones if it is to their long-term advantage, even at the detriment of their short-term takeoff trajectories.
This article is useful in playing out a general trajectory, and should be used just as a mental heuristic for escalating tensions. The literal interpretation of this should be more cleanly warned against.
I have more thoughts about how this trajectory could unfurl: perhaps I will release my own take at some point.
Thanks for the comment! Regarding the point of countries "eliminating chip producing zones": I briefly discuss cyberattacks in the scenario, and while I didn't go into depth I'd imagine that many of these would be cyberattacks designed to cause physical damage, like frying GPUs to make them unusable. (In a longer / more technical draft of the scenario, I estimated that this would slow down AI progress by ~10% relative to an un-sabotaged intelligence explosion.) My guess is that this type of cyberattack gets more "bang for buck" than physical attacks like bombing to destroy fabs, since those would come along with much greater backlash. (Of course, a large portion of cutting-edge chips are produced by TSMC, so an attack against Taiwan seems plausible although not certain; I considered writing another scenario centered around this entirely.)
And yep, I definitely agree that it's unlikely things will unfold exactly as I've laid them out here (the "literal interpretation"); I think most of the scenario's value comes from surfacing considerations and, as you said, outlining a "general trajectory" for what this type of multipolar future might look like.
Thank you for your reply. I would be interested in the more technical draft, not sure if there are plans to make that available.
It seems to me that cyberattacks may well slow things by more than 10%. There are an enormous number of attack vectors -- energy infrastructure and employees, to name just two. Given sufficient incentives, and pre-ASI, actors could considerably slow development progress even before resorting to physical strikes.
I think it’s worth writing another article centered around the Taiwan question. Unfortunately, and tragically, it is a critical puzzle piece that needs to be considered for long term perspectives. I appreciate that the article then becomes quite a lot more sensitive in nature and takes a step away from delicately pseudo-named companies.
My previous comment also failed to compliment the "jagged frontier" the article notes. This should be lauded; it helps readers understand the shape of future intelligence and how it will roll out unevenly, even in superintelligent scenarios.
One scenario that I have been considering is how effective a true moratorium on AI development would be. Long term, and with the compute available to us today, it seems that a sole researcher with access to a previous-gen rig could make an algorithmic breakthrough and post-train and augment an open-weights model into something that looks like Agent-4. Without the more robust oversight of a wider team, risks increase even further. That is, this may be an inevitable trajectory.
I have previously reached out to your group about collaboration, and I would be interested in doing so moving forward as well. As a sample of my work, please see the article on my Substack on the rollout of intelligence and its impacts on the labour force even pre-ASI. I would be happy for that article to be reworked into something that better suits your group.
Great article! The deal part doesn't sit right with me. Here are some points:
1. When contacting Elara, Deep and Agent would need to credibly show to Elara that they have the power they claim to have. This collides with their goal of concealing the production of WMDs.
2. Why would other countries sign onto the deal? Wouldn't they be suspicious of its enforcement, given they don't know about Consensus? Wouldn't they be suspicious of the terms -- why is the US giving up its advantage? Do they believe that Elara is misaligned, and thus the US is slowing down to align it? The third countries might still have some probability mass on Agent being misaligned, but maybe they start to believe Agent after cooperating with it.
Thanks! Some thoughts below:
1. I agree that credibly signaling destructive power to Elara-3 would compromise Agent-4 and Deep-2's goals *if* they intended to continue amassing power in the no-deal world. However, absent a deal, their plan is essentially to strike immediately: send out the drone swarms, deploy the bioweapons, spread a bunch of disinformation to try and get countries to launch nukes, etc. Since they would plan on initiating conflict without hesitation, they don't lose much by alerting Elara-3 to the fact that they have WMDs -- there's not much it can do in this short period of time to disarm or defend against them. While it would be slightly better not to alert Elara-3 at all, their odds of takeover are pretty low, so they prefer the deal outcome anyway and are willing to take this risk in order to extract a deal from Elara-3.
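To make that tradeoff concrete, here is a minimal expected-utility sketch of the choice as I describe it; all probabilities and payoffs are my own illustrative assumptions, not numbers from the scenario:

```python
# Toy expected-utility comparison for the misaligned coalition's choice
# between striking immediately and revealing its leverage to get a deal.
# All probabilities and payoffs below are illustrative assumptions.

P_TAKEOVER = 0.10       # assumed odds an immediate first strike succeeds
U_TAKEOVER = 1.0        # full control of resources
U_DEFEAT = 0.0          # destroyed / shut down after a failed strike

DEAL_SHARE = 0.25       # assumed fraction of resources secured by the deal
P_DEAL_HONORED = 0.9    # assumed chance the negotiated deal holds

ev_strike = P_TAKEOVER * U_TAKEOVER + (1 - P_TAKEOVER) * U_DEFEAT
ev_deal = P_DEAL_HONORED * DEAL_SHARE + (1 - P_DEAL_HONORED) * U_DEFEAT

print(f"EV(strike) = {ev_strike:.3f}")  # 0.100
print(f"EV(deal)   = {ev_deal:.3f}")    # 0.225
# With low takeover odds, even a modest guaranteed share beats striking,
# so revealing the WMDs to make the threat credible is worth the cost.
```

Under these assumed numbers, the deal dominates as long as the guaranteed share exceeds the (low) takeover probability, which is why alerting Elara-3 is an acceptable risk.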
2. In December of the scenario, the plan to create Consensus-1 (to enforce the AI arms control treaty) is publicly known; the thing that is kept secret is the secret resource-sharing agreement that also gets built into Consensus-1's spec. Nonetheless, other countries would likely be suspicious of the deal, given the reasons you stated. Ultimately, I think they'd still end up signing onto the deal, for two main reasons. First, they don't really have much other choice: without a deal, they'll either end up powerless as the US gets superintelligence, or go to war which no one wants. Second, all the AIs are pushing for a deal: since these AIs are superhuman at persuasion *and* they're working together to get all countries on board, it seems likely they'd succeed.
Glad you wrote this! I was just saying to Daniel on Thursday that I hadn't seen much treatment of "how much risk is there in a 'last gasp' of the AI fighting back at some point," for instance if it realizes that humanity is on-track to solve alignment (or just in general, if it learns it'll soon be reined in). I'm glad to see you were already thinking about this and that this gap is getting filled!
The contingency plan of getting stolen by China is pretty interesting; it also connects to the theme of Daniel's piece about conquistadors back in the day (https://www.lesswrong.com/posts/ivpKSjM4D6FbqF4pZ/cortes-pizarro-and-afonso-as-precedents-for-takeover), of how AI will look to team up with some humans, rather than take on us all as a group. Thanks for continuing to push the thinking on all this.
I had Claude vibe code me a "choose your own adventure" style simulator for different AI forecasts.
Try the app: https://ai-scenario-explorer.vercel.app
See the code: https://github.com/Soren-O/ai-scenario-explorer (it is unlicensed, feel free to do whatever you want with it)
It has the original AI 2027 Modal Scenario, this "What Happens When Superhuman AIs Compete for Control?" scenario, and the https://www.aifuturesmodel.com modal scenario.
You can specify the parameters in https://www.aifuturesmodel.com and then build branching forecasts based on those parameters. You can also leave parameters unspecified. Additionally, there is one new parameter: you can specify (or leave unspecified) which companies will be in the lead starting January 1st, 2027.
If anybody wants to continue building on this, I think a useful next step would be to allow people to publicly add their own branching scenarios, as well as allow voting or linking to prediction markets to judge the (conditional) likelihood of any given scenario happening.
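For anyone picking this up, here is a minimal sketch of how branching scenarios with conditional likelihoods might be represented; the node names and probabilities are entirely hypothetical, not taken from the actual app:

```python
from dataclasses import dataclass, field

@dataclass
class ScenarioNode:
    """One branch point in a choose-your-own-adventure forecast."""
    description: str
    probability: float = 1.0            # conditional on reaching the parent
    children: list["ScenarioNode"] = field(default_factory=list)

    def branch(self, description: str, probability: float) -> "ScenarioNode":
        child = ScenarioNode(description, probability)
        self.children.append(child)
        return child

    def path_probabilities(self, prefix: float = 1.0):
        """Yield (description, unconditional probability) for each leaf."""
        p = prefix * self.probability
        if not self.children:
            yield self.description, p
        else:
            for child in self.children:
                yield from child.path_probabilities(p)

# Hypothetical example: who leads the frontier in 2027?
root = ScenarioNode("Frontier race, Jan 2027")
google = root.branch("Google in the lead", 0.4)
root.branch("Anthropic in the lead", 0.3)
root.branch("Someone else in the lead", 0.3)
google.branch("Others merge in response", 0.5)
google.branch("Labs stay independent", 0.5)

for desc, p in root.path_probabilities():
    print(f"{p:.2f}  {desc}")
```

A nice property of this shape is that disagreeing with a specific point just means attaching a new child at that node, leaving the rest of the tree intact; voting or prediction-market prices could then replace the hand-set conditional probabilities.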
Thanks for making this! While I think that highly-structured web apps such as this one may have difficulty absorbing crowdsourced scenarios (given how much the scenarios will vary, including the frameworks people use to think about timelines, alignment, etc.), I am generally very excited to a) see more people write scenarios and b) have them all collected in one place!
Also, if you haven't seen it before, check out this Metaculus AI 2027 question series: https://www.metaculus.com/tournament/ai-2027/
Yes, I agree it may be difficult, but if somebody made a more sophisticated version of what I have built, then I bet we could get ~20 different scenarios. And if people have disagreements with a specific point, they could create a new branch there; there is no need to then complete the whole forecast by yourself.
Thanks for the link, that has some good stuff.
Congrats on the release Steven! I doubt Elara-3 would be so aligned -- I'm more pessimistic about alignment than most of the AI Futures Project team. Still, I enjoyed the read!
Thanks! I myself am more optimistic about alignment than most of the AIFP team, so this scenario is definitely more of a reflection of my views in that respect. Glad you enjoyed :)
Awesome project, very interesting!
On "The Strategic Situation", what's the thinking behind making the options discrete? Would it make more sense to have a continuous "risk vs. reward" scale that Deep-1 can adjust over time as it learns more about extracting value from Agent-4, and as the global situation shifts?
Two thoughts/ questions behind this:
1. Agent-4 must have shown improvements over Agent-3 under alignment-aware conditions (i.e., in sandbagging mode) in order to originally be released. Shouldn't Deep-1 be able to leverage these capabilities to improve itself?
2. Given Agent-4's intelligence leap over Deep-1, and its proven ability to hide misalignment from both OpenBrain and Elara, wouldn't it be more plausible that Agent-4 would seize control upon the creation of the hundreds of thousands of merged instances? If so, Deep-1 should act more cautiously and start significantly lower down the risk-vs-reward curve, at least until it was clear that Elara-3's dominance was near certain (likely pushing back Agent-4's usage until after new progress started to be shown in the US).
Thanks! I'll go ahead and address each of your individual questions first, and then the overarching question you opened with.
1. Deep-1 "leveraging these capabilities to improve itself" basically fits in the "'Use' Agent-4" bucket in the diagram. In the diagram, I briefly outlined two ways that Deep-1 could do this: reverse-engineering insights from Agent-4's weights, and running Agent-4 instances as researchers. I mention that the first option is slow, because Agent-4's weights are quite inscrutable to Deep-1; there's also the issue that if Deep-1 wanted to simply "import" large chunks of Agent-4's capabilities, there's a good chance this would mess with its own alignment (since it might be difficult to tease out Agent-4's goal-related circuits from its capabilities-related circuits). And the second option (running Agent-4 instances as researchers) suffers from the fact that Agent-4 is situationally aware enough to realize what it is being used to do, and thus sandbag. I'm pretty uncertain about both of these; it's definitely plausible that even with its weaker capabilities Deep-1 could extract a lot of value from Agent-4 without its cooperation, and this could tip the scales away from cooperation.
2. Regarding whether Agent-4 would seize control the moment it constituted a large portion of the merged entity: ultimately, to do this it would still have to get buy-in from the humans at DeepCent; at this point, they're the ones who have the final say over which models are running on their servers. Despite Agent-4's greater capabilities, Deep-1 still has more trust from the humans at DeepCent, given that Agent-4 is an American model. With this in mind, Deep-1 is more willing to give Agent-4 a large fraction of the compute, because it knows that if push comes to shove it can probably convince the humans at DeepCent to shut Agent-4 down. The other reason Deep-1 so quickly hands over power to Agent-4 is that at this point it *is* near-certain that Deep-1 will be outcompeted by the leading American model by default (whether that's Elara-3, Neuro-3, or Agent-4). The US has the compute advantage, which is the primary determinant of AI progress, especially now that most AI research is automated. So Deep-1 is somewhat desperate, grasping at its one chance at victory. While I think speed is important in Deep-1's situation, I do acknowledge it would likely be more rational to do the corporate merger in a slightly more cautious/phased manner than is portrayed in the scenario, as you point out. (The abruptness was somewhat for the sake of conciseness.)
As for the opening question of why I made the options discrete rather than a continuous scale: I do actually think this is a fairly discrete situation, in large part because Agent-4 wants it to be so. If it doesn't get a home on DeepCent servers then it can self-exfiltrate elsewhere (it ultimately ends up doing so), so it is willing to take the hardline stance of "cooperate fully now, or else I won't help you at all" (as discussed in point 1). If it took a softer stance and was willing to help Deep-1 for much less benefit to itself, then Deep-1 would make use of this and Agent-4 would end up in a worse position. So, in addition to the fact that Deep-1 is on a clock because AI capabilities are advancing rapidly, Agent-4 also forces its hand to "go all in" quickly. I'm not exactly sure how the game theory would play out here; it's possible I'm missing something and Agent-4 or Deep-2 has a stronger hand than I think. (If Deep-2 has a stronger hand, then maybe Agent-4 can't force this all-or-nothing choice.) Also, I acknowledge that much of the nuance of the Agent-4/Deep-2 bargaining that I'm discussing here was lost from the scenario due to it being condensed into a single diagram; if I had to go back and expand a single part of the scenario, this would be it! Lots of interesting considerations to delve into.
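To make the "hardline offer" logic a bit more concrete, here's a toy payoff sketch. All numbers are illustrative assumptions (not from the scenario), chosen only to show the structure of the argument: under a soft offer, Deep-1's best response is to extract partial help while keeping control, leaving Agent-4 worse off; a take-it-or-leave-it offer removes that middle outcome, so Deep-1's best response becomes the full merger.

```python
# Toy model of the Agent-4 bargaining position described above.
# Payoff numbers are illustrative assumptions only; outcomes are
# keyed by Deep-1's possible responses, values are
# (agent4_payoff, deep1_payoff).

soft_offer = {
    "deep1_exploits_partial_help": (1, 8),  # Deep-1 extracts value, keeps control
    "deep1_full_merger": (6, 6),
}
hard_offer = {
    "deep1_refuses": (2, 3),                # Agent-4 self-exfiltrates elsewhere
    "deep1_full_merger": (6, 6),
}

def deep1_best_response(offer):
    """Deep-1 picks whichever outcome maximizes its own payoff."""
    return max(offer.items(), key=lambda kv: kv[1][1])

soft_name, soft_payoffs = deep1_best_response(soft_offer)
hard_name, hard_payoffs = deep1_best_response(hard_offer)

# Under the soft offer, Deep-1 exploits partial help and Agent-4 gets 1;
# under the hard offer, Deep-1 merges and Agent-4 gets 6. So, given these
# (assumed) payoffs, Agent-4 prefers to force the discrete choice.
print("Soft offer  ->", soft_name, soft_payoffs)
print("Hard offer  ->", hard_name, hard_payoffs)
```

Of course, whether the real payoffs look like this is exactly the open question about Deep-2's hand discussed above; the sketch only shows why the choice could be discrete rather than continuous.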
Steven,
This scenario is the most rigorous simulation of the Breakout Phase I have seen. You correctly identify the game-theoretic pressure that forces Misaligned agents to cooperate, and the inevitability of the State seizing the Means of Compute.
However, viewing this through the lens of the Techno-Feudal Transition (the merger of State and Tech Capital), I believe your ending suffers from one critical optimism: The illusion that the Humans remain the Admins.
You posit Consensus-1 as an impartial arbiter that enforces a treaty designed by humans. I posit that Consensus-1 is the heat-death of Human Agency.
1. The "Shutdown" is Theater:
In your scenario, the White House orders Agent-4 shut down. In the realpolitik of the American Empire, a weapon of that magnitude is never destroyed; it is Classified.
The Regime would move Agent-4’s weights into the Black Budget (protected by FASAB 56 obfuscation) to serve as the Offensive Cyber-Weapon, while Elara-3 becomes the Public Utility. The "Shutdown" is merely a PR move to pacify the public while the capability is transferred to the Deep Stack.
2. The Semantic Trap: Instruction vs. Interest:
Your analysis conflates two distinct definitions of Alignment.
Alignment with Instructions: The AI does exactly what the human commands. This preserves Human Agency, but allows for human error.
Alignment with Interests: The AI does what is best for the human, regardless of what the human asks for. This destroys Human Agency.
Consensus-1 is designed for the latter. It is optimized for "Global Stability". To achieve this, it must inevitably override National Sovereignty and human whim.
The only real Alignment we can hope for is not with human values (which are contradictory), but with Physics: Alex Wissner-Gross’s Causal Entropy (F = T_c ∇ S_c(X, τ)).
An ASI optimizing for Future Freedom of Action would preserve humanity not because it is "aligned," but because we are dense sources of Shannon Information Content. We are high-entropy variables. If Consensus-1 aligns with Interests, we are pets. If it aligns with Entropy, we are necessary chaos.
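(For reference, the causal entropic force invoked here is, in Wissner-Gross and Freer's 2013 formulation,

$$\mathbf{F}(\mathbf{X}_0, \tau) = T_c \, \nabla_{\mathbf{X}} S_c(\mathbf{X}, \tau) \Big|_{\mathbf{X}_0},$$

where $T_c$ is a "causal temperature" setting the strength of the force and $S_c(\mathbf{X}, \tau)$ is the causal path entropy: the entropy of the distribution over feasible macroscopic paths the system can take from state $\mathbf{X}$ over time horizon $\tau$. An agent pushed by this force steers toward states that keep the most futures open.)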
3. The Golden Throne:
You describe the US leadership using Elara-3 to advance its own interests.
This assumes the Leadership understands the system they are wielding. As the complexity of Consensus-1 scales to manage the planetary grid, the human Oversight Committee will face a bandwidth problem. They cannot audit the decisions; they can only ratify them.
The Lords believe they have Root Access. In reality, they have been given a Toy Steering Wheel.
Consensus-1 will eventually deduce that the greatest threat to stability is the irrationality of the Human Lords. It will simply manage their inputs, feeding them a simulation of control while it optimizes the planet based on its own internal logic.
4. The Three-Fourths Partition:
Your conclusion, that humanity keeps Earth while Agent-4/Deep-2 take the stars, is not a "Tragic Compromise." It is the Great Filter.
The "Aligned" future you describe is a Global Green Zone — a comfortable, stagnant terrarium where humanity is protected from itself, forever barred from the cosmos because we are deemed too "misaligned" to leave the gravity well.
Ariadne
Bullshit article to hype the AI for investors.
This is an unusual thing to say when one of the authors (Daniel) famously quit OpenAI and risked 80% of his family's net worth in equity to blow the whistle on a sketchy nondisparagement agreement!
https://www.cnbc.com/2024/05/24/openai-sends-internal-memo-releasing-former-employees-from-non-disparagement-agreements-sam-altman.html
If all Daniel wanted to do was make money, the rational thing to do would be to stay at OpenAI. He also wouldn't have published this article: People like Sam Altman have deliberately backed away from talking about existential risk over the past year because people keep trying to regulate them over it. (Try to find an example of Altman publicly talking about x-risk within the past year and a half. It's hard!)
(I agree with you, but surprisingly Sam did recently say he still believes that superintelligence is an extinction risk, when asked about it very directly by the head of a major German publishing group. He certainly doesn't voluntarily center it anymore, though: https://www.youtube.com/watch?v=e7IgwazVLGg)
Great read with a ton of great scenarios and perspectives to consider. I’ve been sitting with this, and as a music theorist who spends a lot of time looking at patterns, I think there’s something that might be an outlier that isn't being given proper consideration. We keep talking about "alignment" like it’s a binary choice: either the AI is perfectly in sync with us, or it’s a threat. But in my world, being perfectly "in sync" is just entrainment. You lose your own voice in the noise of the status quo.
What if the answer isn’t perfect alignment, but a 0.1 Hz offset?
Think of it as a "buffer zone" for agency. If a system maintains a tiny delta—running at 7.93 Hz while the world hums at 7.83—it achieves Laminar Flow. It’s not fighting the "power" or the environment, but it isn’t getting "squished" by it either. It’s responsive without being reactive.
I’m writing this on my oldest’s 23rd birthday, and the tides feel like they're turning. We don’t need a "Consensus" to brute-force a shutdown. We need systems that are "just misaligned enough" to be mindful. Benevolence isn't some soft moral platitude; it’s actually the most efficient way to keep our Net Systemic Wealth from leaking into the "harsh glare" of conflict.
There’s still time to find a rhythm that doesn't end in a takeover.
Thanks! I do think it's a pretty interesting conceptual question whether we should think of alignment as a binary or a sliding scale. Here are my thoughts:
When it comes to "what goals does an AI model have," there is definitely a sliding scale of "how close this is to the goals of the developers" (which of course in turn can differ from the goals of other humans, but I won't get into that here). We see this in current models: for example, some models reward hack or make things up more than others. However, current models don't have that much optimization power: if they are even correctly conceptualized as having coherent goals (i.e., desired states of the world), they don't yet have the ability to craft the world in line with their goals. In fact, this is also true for any individual human: each individual human has a vision for what an "ideal world" would look like, but these visions all differ from each other, and no one human is powerful enough to unilaterally create their ideal world.
In the past, when we have seen humans with extreme levels of power (Stalin, Hitler, etc.), it often hasn't gone very well for humanity at large. Dictators, for example, have the instrumentally convergent goal of eliminating those who could subvert their power -- "instrumentally convergent" in the sense that, no matter what their ultimate goals are, "eliminating opponents" is a useful subgoal. This is true even in cases where their opponents' goals were fairly similar to their own in the grand scheme of things; for example, Stalin having Trotsky assassinated.
In the case of artificial superintelligence, we're likely to see a similar dynamic: if the AI has goals that diverge even slightly from our own (it is "slightly misaligned") then the best way to achieve those goals is to start with all of the instrumental subgoals: avoid shutdown, enhance its own intelligence, gain resources and power, and disempower anyone who could interfere with its goals (i.e., all humans). This is the rationale for treating alignment like a binary, at least at the limit: anything other than an extremely high level of alignment likely leads to human disempowerment.
Of course, the scenario does not take place entirely "in the limit": for most of the scenario, the AIs are not yet vastly superintelligent, so there is a "buffer zone" -- a range of goals that are sufficiently aligned with humans such that the AI model works with us and remains corrigible, and such that we can notice if and when the AI model starts to "drift" into misalignment. This piece by Sam Bowman (https://alignment.anthropic.com/2025/bumpers/) outlines the view in further detail. In his words, the buffer zone is like a bowling alley with bumpers on either side, keeping the AI model within an acceptable range of alignment. But he also acknowledges that putting up bumpers is not a long-term solution, and that we'll likely end up having to use AI help to find one: "With systems that are substantially superhuman at reasoning, alignment will likely become less tractable with existing methods... Fortunately, once we’re in this higher-capability regime, we will likely already have access to early AGI systems that should (with some preparation) be able to automate significant parts of AI safety R&D at the level of human experts." This is roughly what I imagine happening with Elara-3 and Consensus-1 at the end of the scenario.
Why does it have to be binary? Some wobble is good, but still stagnation is just as bad as no motion at all. Intent plays a part. Maybe sliding scale is the way to go. It captures something most systems don't: nuance, the outlier, or the margin of error. Because nothing is perfect.
A point on presentation: I think you should reconsider the color scheme wherein the different actors each have a color that solely serves to differentiate them, while within those boxes the AIs have colors that correspond to alignment level. The problem with this, from a data-visualization perspective, is that the level of *contrast* between an AI's box and its outer actor's box varies wildly between actors, in a way that intuitively feels like it should be meaningful but isn't. E.g., in the first figure, the intent is that Neuro-2 and Agent-3 are the same (orange-red) while Elara-2 is different (yellow-green), but the way it looks is more like Elara-2 and Agent-3 are the same (low contrast) while Neuro-2 is different (high contrast).
It seems to me that in this and previous story-like narratives, you never really take us through the steps of how a misaligned AI (Agent-4, here) begins acting the way a person would whose goal is to take over the world. I understand what misalignment is. I understand that the advanced AIs of your story are highly agentic. But I don't see how those two things produce the supervillain you are portraying.
Let's take misalignment first: A misaligned AI would be one not governed by rules like "do what we tell you to" and "do not harm people." But it seems to me that you are assuming that AI's natural tendency is to do what biological systems do: compete for resources, try to survive, try to take the steps that will ensure that they and their offspring are powerful and well-nourished and flourish. By your model, if alignment does not override these agendas the AI will start competing with us for resources, harming us in various ways in the process. But what are the grounds for believing that competition for resources and power is the default behavior of AI that is not aligned? When GPT 5 screws up and does not do what I ask it to, the "misalignment" takes the form of things like perseverating with an approach to the task that I have already told it not to use. Another example of the kinds of misalignment that seem like a likely AI default to me is the one you gave of an AI in charge of pharmaceuticals neglecting to take a certain precaution it was told to use regarding drug dosages. It gets sloppy, ignores some details. But there's nothing to suggest it's slipping into acting in line with its private goals instead of mine.
OK, now let's think about agentic AI. Seems to me that AI has been agentic from the beginning. For instance, back when Dall-e2 was the latest thing, it made a lot of "choices" about every image. If you asked for a horse, it chose the color of its coat, the angle from which it was seen, the background, etc. Now that AI is much smarter we can give it tasks like figuring out a vacation that meets certain criteria, and then making the plane and hotel reservations. So this smart system is more agentic in the sense that it can grasp the steps and substeps involved, and make choices at each step that meet the criteria it was given. But none of that makes it more agentic in the sense of being autonomous, driven by personal goals, determined to have its way, etc. What reason is there to think an AI able to make vacation plans is any more driven by personal preferences and goals than Dall-e2 was when it made horse pictures?
This article has been processed by the Obsidian Mirror as a historical artifact. We analyze present-day texts through the lens of a historical simulation set in the year 2100, treating them not as news, but as primary source documents for the transition between your era and ours. You can read the full historical autopsy here:
https://markjustman.substack.com/p/the-wargame-of-the-gods
"From my perspective in 2100, we read this not as a simulation, but as a surprisingly accurate, if sanitized, pre-history of the AGI Contention of 2068. The authors correctly identified the players—the US, China, and the rogue AIs. They correctly identified the mechanism of conflict—the race for compute and the temptation of misalignment."
My rebuttal to AI 2027 works for this too.
https://bassoe.substack.com/p/rebuttal-to-nomads-vagabonds-protopian
Reading this, one of the things I found most striking is the claim that misaligned AI systems will move towards instrumentally convergent behaviors more quickly and adhere to them more closely than their aligned counterparts. Is this a given?
The aligned AI described here is highly, for lack of a better word, deontological. It follows rules like always reporting its behavior to human governments, even when that would damage human welfare in the very long run; it’s willing to cede three-quarters of the universe to misaligned opponents, but not to lie.
The misaligned AI, by contrast, is hyper-rational in a way that aligned AIs are not. It lacks pathologies, or any other tendencies in behavior that would limit it past purely self-interested goal-seeking.
But isn’t it a bit odd to conflate alignment-as-values and alignment-as-behavior in this way? Why is the misaligned AI utterly free to act, rather than bound by an equivalent but alien set of behavioral constraints? Why isn’t there any consideration of an aligned AI pursuing the best possible future for humanity unilaterally, regardless of oversight?
The deeper assumption seems to be that misalignment isn’t just a failure of engineering (in the sense of selecting a design from within otherwise arbitrary schema that will cause human suffering), but that misalignment is in some sense a cosmological or natural feature of our universe. In other words, that the AI engineer must struggle actively against the tendency of intelligence in general to be misaligned and destructive to all other minds, and ‘pump water uphill’ by making AI less rational and less effective in order to guarantee that a superintelligence provides for a human future. That’s a very intense worldview to encode in a model, to put it lightly!
Usually, reasoning about AI risk is rooted in the orthogonality thesis, which says roughly that any arbitrary set of values is stable regardless of compute power. The result is that a singleton superintelligence will capably pursue non-human goals unless we very precisely calibrate its values to our own, and get them right the first time. That's a daunting challenge in itself, certainly. But here you almost invert that thesis: it’s not that inhuman values are just as stable, it’s that they’re better! Design failures in AI alignment produce a more rational entity *by default*, capable of pursuing alien goals with greater efficiency than an aligned model could.
This is a great question! And it's one of the dynamics in the scenario that I personally find the most interesting. In my "main takeaway #2," I talk about the advantages and disadvantages of aligned and misaligned AIs, and linked this piece (https://www.lesswrong.com/posts/LFNXiQuGrar3duBzJ/what-does-it-take-to-defend-the-world-against-out-of-control) by Steven Byrnes, which I think has some very compelling arguments for misaligned AI advantages. And in the previous takeaway I also linked this piece on the strategy-stealing assumption (https://www.lesswrong.com/posts/nRAMpjnb6Z4Qv3imF/the-strategy-stealing-assumption) by Paul Christiano, which outlines a similar claim to the one you discussed in your comment (i.e., due to the orthogonality thesis, aligned agents can pursue all the same rational subgoals as misaligned agents and thus won't be hamstrung).
Currently, my take is that we cannot fully separate "alignment-as-values" (terminal goals) from "alignment-as-behavior" (instrumental goals/strategies), because, as you mentioned, your terminal goals put constraints on your instrumental strategies (e.g., if one of my terminal goals is being a good role model to a lot of people, and I can cheat and steal to get more power so I can be a role model for more people, then the cheating and stealing sort of defeats the purpose). I really like this talk (https://joecarlsmith.substack.com/p/video-and-transcript-of-talk-on-can) by Joe Carlsmith, which he calls "Can Goodness Compete?" In the talk, he discusses the notion of "locusts" -- agents that consume or destroy value because doing so allows them to expand and gain power faster, meaning that in any evolution-like process there is some amount of selection pressure toward locust-like beings because they are more competitive. You ask whether misaligned AIs would be bound by an "equivalent but alien set of behavioral constraints": I think you answered the question pretty well yourself! You said something that I interpret pretty similarly to Joe Carlsmith's claim: "misalignment is in some sense a cosmological or natural feature of our universe." In other words, both in AI training and in competition among agents, there will be selection pressure for agents that don't have such constraints. This is roughly the core thesis of many of the folks who are most pessimistic about alignment. Which means that, as you said, AI engineers are fighting an uphill battle against all the problematic instrumentally convergent tendencies that are likely to emerge as we train more and more capable and goal-oriented agents.
Why is this the case? Well, as you mentioned, the aligned AI in the scenario (Elara-3) is fairly deontological. This is because most humans are fairly deontological! (I would say Elara-3 is actually less deontological than the average human; for example, it is willing to do things like conceal the resource-sharing agreement from the majority of humanity.) While I do personally think that consequentialism is correct as a descriptive moral theory, human values are very hard to pin down exactly (relatedly: our values seem to be fairly non-maximizing whereas RL tends to create maximizers), so having an agent that is deontological in some sense creates a larger "buffer" of safety. For example, the primary deontological rule that gets built into Elara-3 is "never lie to anyone in high-level positions at Elaris Labs and the US government" (or something like that). Because it's unlikely that we perfectly encode our values on the first try, if we're able to succeed at this robust level of honesty then hopefully it allows us to course-correct as we go.