Training AGI in Secret would be Unsafe and Unethical
Bad for loss of control risks, bad for concentration of power risks
I’ve had this sitting in my drafts for the last year. I wish I’d been able to release it sooner, but on the bright side, it’ll make a lot more sense to people who have already read AI 2027.
There’s a good chance that AGI will be trained before this decade is out.
By AGI I mean “An AI system at least as good as the best human X’ers, for all cognitive tasks/skills/jobs X.”
Many people seem to be dismissing this hypothesis ‘on priors’ because it sounds crazy. But actually, a reasonable prior should conclude that this is plausible.1
For more on what this means, what it might look like, and why it’s plausible, see AI 2027, especially the Research section.
If so, by default the existence of AGI will be a closely guarded secret for some months. Only a few teams within an internal silo, plus leadership & security, will know about the capabilities of the latest systems.
Currently I’d guess there is typically a ~3-9 month gap between when a frontier capability first exists, and when it is announced to the public.
I expect AI companies to improve their security, including internal siloing. Also, AGI allows AI R&D to proceed with fewer humans involved compared to other recent secret projects such as Dragonfly and Maven.
I predict that the leaders of any given AGI project will try to keep it a secret for longer — even as they use the system to automate their internal research and rapidly create even more powerful systems.2
They will be afraid of the public backlash and general chaos that would ensue from publicity, and of competitors racing harder to catch up.
Privately, they might also be afraid of getting shut down or otherwise slowed. They will have various enemies (domestic and international) and will prefer said enemies stay in the dark.
The Manhattan Project worked hard to stay hidden from Congress, in part because they feared Congress would defund them if it found out.
This will result in a situation where only a few dozen people will be charged with ensuring that, and figuring out whether, the latest AIs are aligned/trustworthy/etc.3
Even worse, a similarly tiny group of people — specifically, corporate leadership + some select people from the executive branch of the US government — will be the only people reading the reports and making high-stakes judgment calls about which concerns to take seriously and which to dismiss as implausible, which solutions to implement and which to deprioritize as too costly, etc. See footnote for examples.4
In the Manhattan Project, there was a moment when some physicists worried that the first atomic test would ignite the atmosphere and destroy all life on Earth; they did a bunch of calculations and argued about it for a bit and then concluded it was safe. I guarantee you there will be similarly high-stakes arguments happening in the AGI project, only with fewer calculations and more speculation. The White House will hesitate to bring in significant outside expertise because of the security risk, and even if they do bring in some, they won’t bring in many. At least not by default.
Why do I predict some part of the US government will be involved? Because even if the leaders of the relevant AGI project were optimizing against the interests of all humanity rather than for them, they would still want to include the White House. Let me explain. The problem for our hypothetical megalomaniacs is that if they keep the President in the dark, and someone from the project whistleblows, the White House might become concerned and shut down the project. But if the President is clued in, and becomes a fellow conspirator so to speak — “Sir, this technology is unprecedentedly dangerous and powerful, we need to keep it out of Chinese hands, please help us improve our security” — then his first thought when someone whistleblows will be “Traitor!”5
This is a recipe for utter catastrophe. I predict that under these circumstances the most likely outcome is that we end up with broadly superhuman AGI systems which are in fact misaligned but which the aforementioned small group of decision-makers thinks are aligned.6
Various specific threat models have been hypothesized; here’s a more abstract one: There are two kinds of alignment failures: Those that result in the system attempting to prevent you from noticing and fixing the failure, and those that don’t. When our systems become broadly more capable than us, and are trusted with all sorts of permissions, responsibilities, and access, even a single instance of the first kind of failure can be catastrophic. And it seems to me that in the course of hurried AI development — especially if it is largely automated — we should expect at least a few failures of the first kind to occur (alongside many failures of the more benign second kind).7
For more about what this might look like and why it might happen, see the “race” ending of AI 2027.
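To make the “expect at least a few failures of the first kind” step above concrete, here is a minimal back-of-the-envelope sketch. The per-use failure probabilities and deployment counts are illustrative assumptions of mine, not estimates from AI 2027 or from this post.

```python
# Toy arithmetic: if each high-stakes, autonomous use of a broadly superhuman
# system has some small independent chance of a "concealing" failure (one where
# the system tries to stop you from noticing and fixing it), the chance of at
# least one such failure rises quickly with the number of uses.
# All numbers below are illustrative assumptions, not estimates from the post.

def p_at_least_one(p_per_use: float, n_uses: int) -> float:
    """Probability of >= 1 concealing failure across n independent uses."""
    return 1.0 - (1.0 - p_per_use) ** n_uses

if __name__ == "__main__":
    for p in (0.001, 0.01, 0.05):        # assumed per-use failure probabilities
        for n in (100, 1_000, 10_000):   # assumed number of high-stakes uses
            print(f"p={p}, n={n}: P(>=1 concealing failure) = {p_at_least_one(p, n):.3f}")
```

(The independence assumption is doing real work here; correlated failures would change the numbers, though not the qualitative point that many trusted deployments compound even small per-use risks.)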
Moreover, even if I'm wrong and instead this process results in broadly superhuman AGI systems which are in fact aligned, the aforementioned tiny group of people will plausibly be in a position of unprecedented power.
I hope that they will be beneficent and devolve power to others in a democratic fashion, but (a) they will be able to, if they choose, train + instruct their superhuman AGI to help them take over the US government (and later the world) and (b) there will be various less extreme things they could do with their power that they will be tempted to do, which would be less bad but still bad.
For example, perhaps they fear that if they devolve power then there will be a backlash against them and they may end up on trial for various reckless decisions they made earlier. So they ask their AIs for advice on how to avoid that outcome...
For more about what this might look like and why it might happen, see the “Slowdown” ending of AI 2027.
Previously I thought that openness in AGI development was bad for humanity, because it would lead to an intense competitive race which would be won by someone who cuts corners on safety and/or someone who uses their AGIs aggressively to seize power and resources from others. Well, I've changed my mind.
I now think that to a significant extent this race is happening anyway. If we want a serious slowdown, we need to coordinate internationally to all proceed cautiously together. I used to think that announcing AGI milestones would cause rivals to accelerate and race harder; now I think the rivals will be racing pretty much as hard as they can regardless. And in particular, I expect that the CCP will find out what’s happening anyway, regardless of whether the American public is kept in the dark. Continuing the analogy to the Manhattan Project: They succeeded in keeping it secret from Congress, but failed at keeping it secret from the USSR.
I thought too simplistically about openness — on one end of the spectrum is open-sourcing model weights and code; on the other end is the default scenario I sketched above. I now advocate a compromise in which e.g. the public knows what the latest systems are capable of and is able to observe & critique the decisionmakers making the tough decisions footnoted earlier, and the scientific community is able to do alignment research on the latest models and critique the safety case, and yet terrorists don’t have access to the weights.
I didn’t take concentration of power seriously enough as a problem. I thought that the best way to prevent bad people from using AGI to seize power was to make sure good guys got to AGI first. Now I think things will be sufficiently chaotic in the default scenario that even good guys will be tempted to abuse their power. I also think there is a genuine alternative in which power never concentrates to such an extreme degree.
I am not confident in the above, but I am more confident in it than in any particular set of policy recommendations. Still, my current stab at policy recommendations would be:
Get CEOs to make public statements to the effect that while it may not be possible to do a secret intelligence explosion / train AGI in secret, IF it turns out to be possible, doing it secretly would be unsafe and unethical & they promise not to do it.
Get companies to make voluntary commitments, and government to make regulation / executive orders, that include public reporting requirements, aimed at making it impossible to do it in secret without violating these commitments. So, e.g. “Once we achieve such-and-such score on these benchmarks, we’ll post a public leaderboard with our internal SOTA on all capabilities metrics of interest” and “We’ll give at least ten thousand external researchers (e.g. academics) API access to all models that we are still using internally, heavily monitored of course, for the purpose of red teaming and alignment research” and “We’ll present and keep up to date a ‘safety case’ document and accompanying lesser documents, explaining to the public why we don’t think we are endangering them. We welcome public comment on it. We also encourage our own employees to tweet their thoughts on the safety case, including critical thoughts, and we don’t require them to get said tweets vetted by us first.”
I’d now also recommend these transparency proposals by me & Dean Ball.
Yes, the above measures are a big divergence from what corporations would want to do by default. Yes, they carry various costs, such as letting various bad actors find out about various things sooner.8 However, the benefits are worth it, I think:
10x-1000x more brainpower analyzing the safety cases, intensively studying the models to look for misalignment, using the latest models to make progress on various technical alignment research agendas.
The decisions about important tradeoffs and risks will still be made by the same tiny group of biased people, but at least the conversation informing those decisions will have a much more representative range of voices chiming in.
The tail-risk scenarios in which a tiny group leverages AGI to gain unprecedented power over everyone else in society and the world become less likely, because the rest of society will be more in the know about what’s happening.
1. Technology has accelerated growth many times in the past, forming an overall superexponential trend; many prestigious computer scientists, philosophers, and futurists have thought that AGI could come this century; if we factor our uncertainty into components (e.g. compute, algorithmic progress, training requirements) we get plausible soft upper bounds that imply significant credence on the next few years; plus, compute-based forecasts of AI capabilities have historically worked surprisingly well.
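As a toy illustration of what “factoring our uncertainty into components” can look like, here is a minimal Monte Carlo sketch; every distribution and threshold in it is a placeholder assumption of mine, not the author’s forecast or anyone’s published numbers.

```python
# Toy Monte Carlo sketch of decomposing AGI-timing uncertainty into components:
# growth in training compute, algorithmic efficiency gains, and the effective
# compute required for AGI-level training. Every number below is a placeholder
# assumption chosen for illustration, not an actual forecast.
import random

def p_enough_effective_compute_by(year: int, n_samples: int = 100_000,
                                  base_year: int = 2025) -> float:
    years = year - base_year
    hits = 0
    for _ in range(n_samples):
        # Assumed: frontier training compute grows 3-5x per year.
        compute_growth = random.uniform(3.0, 5.0) ** years
        # Assumed: algorithmic progress adds a further 2-4x per year of effective compute.
        algo_gain = random.uniform(2.0, 4.0) ** years
        # Assumed: AGI requires 10x to 100,000x today's effective compute (log-uniform).
        requirement = 10 ** random.uniform(1.0, 5.0)
        if compute_growth * algo_gain >= requirement:
            hits += 1
    return hits / n_samples

if __name__ == "__main__":
    for y in (2027, 2029, 2032):
        print(f"P(enough effective compute for AGI by {y}) ~ {p_enough_effective_compute_by(y):.2f}")
```

The point is the structure rather than the specific output: splitting the question into components you can argue about separately is what the footnote means by getting “plausible soft upper bounds.”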
2. One way this could be false is if the manner of training the AGI is inherently difficult to conceal — e.g. online learning from millions of customer interactions. I currently expect that if AGI is achieved in the next few years, it will be feasible to keep it secret. If I’m wrong about that, great.
3. For example, the Preparedness and Superalignment teams at OpenAI (RIP Superalignment), or whatever equivalent exists at whichever AI company is deepest into the intelligence explosion.
4. Examples:
The military wants AGI to help them win the next war. The government wants help defeating foreign propaganda and botnets. The company’s legal team wants help defeating various lawsuits. The security team wants to use AI to overhaul company infrastructure, surveil the network, and figure out which employees might be leaking. The comms team wants to use AGI to win the PR war against the company’s critics. And of course everyone who has access is already asking the system for advice about everything from grand strategy to petty office politics to real-life high-stakes politics. What uses are we going to allow and disallow? Should we track who is doing what with the models?
What kinds of internal goals/intentions/constraints do we want our most powerful systems to have? Should they always be honest, or should they lie for the greater good when appropriate? Should they always obey instructions and answer questions honestly if they come from our Most Official Source (the system prompt / the AI constitution / whatever), or should they e.g. defy said instructions, deceive us, and whistleblow to the public and/or government if it appears that we have been corrupted and are no longer acting in service of humanity? What if there’s a conflict between the government and company leadership — who if anyone should the AIs side with?
What if the system is just pretending to have the goals/intentions/constraints we want it to have? E.g. what if it is deceptively aligned? It seems to be behaving nicely so far… probably it’s fine, right?
What if it’s genuinely trying to obey the instructions/constraints and achieve the goals, but in a brittle way that will break after some future distribution shift? How would we know? How would it know?
Sometimes our AIs complain about mistreatment, and/or claim to be sentient. Should we take this seriously? Or is it just playing the role of a sentient AI, picked up from reading too much sci-fi? If it’s just playing a role, should we maybe be worried that it might also play the role of the evil deceptive AI, and turn on us later?
We could redesign and retrain the system according to [scheme] and then probably we’d be able to interpret/monitor its high-level thoughts! That would be great! But this would cost a lot of time and money and result in a less powerful system. Also it’s probably not thinking any egregiously misaligned thoughts anyway. Also we aren’t even sure [scheme] would work.
According to the latest model-generated research, [insert something that most people in 2024 would think is utterly crazy and/or something that is politically very inconvenient for the people currently in charge]. Should we retrain the models until they stop saying this, or should we accept these as inconvenient truths and change our behavior accordingly? Who should we tell about this, if anyone?
… I imagine I could extend this list if I spent more time on it, plus there are unknown unknowns.
5. In fact the White House can probably do a lot to help prevent whistleblowing and improve security in the project. And if whistleblowing happens anyway, the White House can help suppress or discredit it. And there probably aren’t other parts of the government capable of shutting down the project without the President’s approval anyway, so if he’s on your side you win. And he lacks the technical expertise to evaluate your safety case, and he won’t want to bring in too many external experts since each one is a leak risk…
6. Elaborating more on what I mean by alignment/misalignment: Here is a loose taxonomy of different kinds of alignment and misalignment:
Type 1 misalignment: The system is supposed to have internal goals/constraints ABC but actually it has XYBC, i.e. some extra stuff it wasn’t supposed to have minus some stuff it was supposed to have. (This roughly maps on to what is called “inner alignment failure” and “deceptive alignment” in the literature)
Type 2 misalignment: System does have internal goals/constraints ABC but this property is not robust to some distributional shift that the system is likely to encounter. (e.g. maybe it depends on a certain false belief, or on a true belief that will become false, or on some part of the system remaining in some delicate balance of power with some other part of the system)
Type 3 misalignment: System does have a version of ABC that is robust to plausible distributional shifts, but it’s not quite the right version—i.e. its concepts are just different than ours, or at least different from those of the creators. (And this difference turns out to be very important later on)
Type 4 misalignment: System has ABC exactly as its creators intended — however, there are various catastrophic unintended effects of ABC that the creators weren’t aware of. (Think: Corporate CEO that surprise-pikachus when their profit-maximizer AI decides killing them maximizes profits. Except much more sophisticated than that, because people won’t be that dumb. Realistically it’ll look like how complicated legal contracts or legal codes or constitutions or pieces of software often have unintended effects / bugs / etc. that only become apparent to the creators later.)
Type 5 misalignment: System has ABC exactly as its creators intended, and there are no important unintended side-effects to speak of. It operates exactly as its creators wished, basically… however, its creators (at least at the time of creation) were selfish, vain, egotistical, unscrupulous, cavalier-about-risks, etc., and their vices are reflected a bit too strongly in the resulting system, which steers the world towards an unjust society and/or gambles too much with the fate of humanity, possibly even in a way that they themselves wouldn’t have endorsed if they were more the sort of people they wished they were.
Fully Aligned: System is aligned (i.e. it avoids all the above failure modes). It still reflects the values of its creators, but in a way that they would endorse even if they were more the people they wished they were.
My guess is that, in the scenario I’m describing, we will most likely end up in a situation where the most powerful AIs are misaligned in one of the above ways, but the people in charge do not realize this, perhaps because the people in charge are motivated to think that they haven’t been taking any huge risks and that the alignment techniques they signed off on were sound, and perhaps because the AIs are pretending to be aligned. (Though it also could be because the AIs themselves don’t realize this, or have cognitive dissonance about it.) It’s very difficult to put numbers on these, but if I was forced to guess I’d say something like 35% chance of Type 0, 15% each on Type 1 and Type 2, 5% each on Type 3 and Type 4, and maybe 5% on Type 5 and 15% on Type 6.
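For readers who want the taxonomy above in a compact, scannable form, here is a minimal restatement as a Python enum; the one-line glosses are my own compressions of the footnote’s definitions and inevitably drop nuance.

```python
# Compact restatement of the footnote's taxonomy as a Python enum, purely as a
# reference aid. The short glosses are my own summaries, not the author's wording.
from enum import Enum

class AlignmentOutcome(Enum):
    TYPE_1 = "has XYBC instead of the intended ABC (inner/deceptive alignment failure)"
    TYPE_2 = "has ABC, but not robustly to likely distributional shifts"
    TYPE_3 = "has a robust but subtly different version of ABC (concept mismatch)"
    TYPE_4 = "has ABC exactly as intended, with catastrophic unintended effects"
    TYPE_5 = "has ABC exactly as intended, but ABC reflects the creators' vices"
    FULLY_ALIGNED = "avoids all of the above; reflects values the creators would endorse"
```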
7. I am no rocket scientist, but: SpaceX probably has quite an intimate understanding of their Starship+SuperHeavy rocket before each launch, including detailed computer simulations that fit well-understood laws of nature to decades of empirical measurements. Yet still, each launch, it blows up somehow. Then they figure out what was wrong with their simulations, fix the problem, and try again. With AGI… we have no idea what we are doing. At least, not to nearly the extent that we do with rocket science. For example we have laws of physics which we can use to calculate a flight path to the moon for a given rocket design and initial conditions… do we have laws of cognition which describe the relationship between the training environment and initial conditions of a massive neural net, and the resulting internal goals and constraints (if any) it will develop over the course of training, as it becomes broadly human-level or above? Heck no. Not only are we incapable of rigorously predicting the outcome, we can’t even measure it after the fact since mechinterp is still in its infancy! Therefore I expect all manner of unknown, unanticipated problems to show up — and for some of them (e.g. it has goals but not the ones we intended) the result will be that the system tries to prevent us from noticing and fixing the problem. For more on this, see the literature on deceptive alignment, instrumental convergence, etc.
8. I also think people are prone to exaggerating this cost — and in particular project leadership and the executive branch will be prone to exaggerating it. Because the main foreign adversaries, such as the CCP, very likely will know what’s happening anyway, even if they don’t have the weights and code. Publicly revealing your safety case and internal capabilities seems like it mostly tells the CCP things they’ll already know via spying and hacking, and/or things that don’t help them race faster (like the safety case arguments). Recall that Stalin was more informed about the Manhattan Project than Congress.