15 Comments
Konrad

Legends – absolutely killer piece. I shared AI 2027 with a bunch of people because it hit exactly where my head’s been at for the past two years. Not everything will happen that fast, sure, but even 10% of it would be enough to redraw the map. And that’s exactly what’s happening.

AI feels like the internet all over again – just way deeper. It’ll be in everything. What I keep thinking about is: what does this mean for Europe? We talk a lot, regulate fast, but when it comes to actually building? That’s where it gets shaky. Yeah, we’ve got Mistral – but let’s be honest, that’s not enough. The real weight – the models, the infra, the talent – it’s mostly elsewhere.

I’d genuinely love to hear your take on Europe’s position in all this. AI isn’t optional anymore. And we can’t just be users forever. Maybe there’s hope through EU-based compute infrastructure, regulation-aligned data centers, or some unexpected angle. But the window’s closing fast.

Appreciate the piece – made a lot of people around me think.

Ebenezer

Given that AI 2027 ends in likely catastrophe, the obvious thing for people in general to do is to advocate for slower, more responsible AI development to avoid the catastrophe. That goes double for people outside the US/China, since they are less exposed to the possible upsides.

I would love it if Europe could advocate for and referee some sort of AI treaty between the US and China. Maybe they could apply leverage via ASML, for instance. That would be a big win for everyone, as far as I'm concerned.

Every nation outside of the US and China should basically form a bloc advocating responsible AI development. Right now, no country can be trusted to get AI right.

Liron Shapira

Thanks for the shoutout :)

Keep up the great work!

Liface

Did you see the r/slatestarcodex post pointing out an issue with the model assumptions? "The AI 2027 Model would predict nearly the same doomsday if our effective compute was about 10^20 times lower than it is today"

https://old.reddit.com/r/slatestarcodex/comments/1k2up73/the_ai_2027_model_would_predict_nearly_the_same/

Colin Brown

Great piece. Super useful summary!

SorenJ

If you could turn the tabletop exercise(s) into actual board games, I think a lot of people would find that useful and fun, and it would align with your goal of diffusing these ideas into society at large. Something like the board game Pandemic.

AdamB
20h (edited)

I am willing to stipulate that the current paradigm of AIs may well saturate RE-bench in something like the timeline you project.

However, I feel very strongly (80%) that nothing in the current paradigm (transformer-based large language model + reinforcement learning + tool-use + chain-of-thought + external agent harness) will ever, ever achieve the next milestone, described as "Achieving ... software ... tasks that take humans about 1 month ... with 80% reliability, [at] the same cost and speed as humans." Certainly not within Eli's 80% confidence window of 144 months after RE-bench saturation. (I can't rule out a brand new non-transformer paradigm achieving it within 12 years, but I don't think there will be anything resembling continuous steady progress from RE-bench saturation to this milestone.)

I would love to bet on this. Anyone have ideas for operationalizing it?

(In the interest of epistemic honesty, I must admit that if you had asked me a year or two ago whether the current paradigm would ever saturate something like RE-bench, I probably would've predicted 95% not.)

Reasoning: The current crop of companies pushing the current paradigm have proven surprisingly adept at saturating benchmarks. (Maybe we should call this the current meta-paradigm.) RE-bench is not a "good" benchmark, though I agree it is better than nothing. I am calling a benchmark "good" to the extent that it, roughly: (a) fairly represents the milestone; (b) can be evaluated quickly and cheaply; (c) has a robust human baseline; (d) has good test-retest reliability; (e) has a continuous score which increases vaguely linearly with model capability; (f) is sufficiently understandable and impressive to be worth optimizing (for purposes of scientific prestige and/or attracting capital investment). As the target milestone of human-time-horizon increases, the quality of the benchmarks necessarily decreases. I think RE-bench is near the frontier of benchmark possibility. I do not think we will ever see a benchmark for "160 hours of human software engineering" that is "good" enough for the current meta-paradigm to saturate it.

However, my prediction that there will never be a good benchmark for this milestone also makes it hard to adjudicate my desired bet that we will not see continuous progress towards the milestone.

Would the AI 2027 authors be willing to make a prediction and/or a bet about the appearance of a suitable benchmark?

SorenJ

What about a task like, "make a word processor which has all the capabilities of Microsoft Word, and more"? You give that to your AI agent, and it spits out an entire program: MAIcrosoft Word 2.0.

How long would you estimate it would take a human, or a team of humans, to recreate Microsoft Word from scratch? Or a full-fledged modern-day-sized video game, like GTA V? (To be clear, I am not saying that an AI will certainly be able to do that, but this is how you could "benchmark" it.)

AdamB
20h (edited)

I find some support for my claim in Fig 3 of the HCAST paper. There are 189 tasks (not clear to me if this is all tasks or just the ones that had at least one successful baseline). They made an informal prediction of how long it would take a human to solve each one. Looks like 11 tasks were >= 8h predicted. Of tasks with at least one success, looks like ~4 actually took >= 8h.

Double the target timeline to 16h and they had 4 predictions, but only 1 task actually took >= 16h.

Meanwhile across their whole dataset they say only 61% of human baselines were successful. They highlight "practical challenges with acquiring sufficiently many high-quality human baseliners".

Each time you double the horizon, it becomes harder to create tasks, harder to predict the time, more expensive to bounty them, harder to QA the tasks, and harder to validate models. RE-bench is already barely serviceable with 5 (of 7) tasks at ~8h. I predict that with supreme effort and investment, a similarly "barely serviceable" benchmark could maybe be squeaked out at a 32h horizon, with a year of effort and a couple million dollars. I think making a serviceable benchmark with a 64h horizon would not be practical in the current US economy. Making a serviceable benchmark with a 128h horizon may not be possible at all in less than 10 years with anything like our current planetary society.
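
To make that tally concrete, here is a rough Python sketch of the Fig-3-style comparison; the task records are hypothetical placeholders, not HCAST's actual data (the real numbers are in the paper):

```python
# Rough sketch of the predicted-vs-actual horizon tally described above.
# The task records are hypothetical placeholders, NOT HCAST's real data.
# Each entry: (predicted_human_hours, actual_human_hours, or None if no
# human baseliner succeeded on the task).
tasks = [
    (8.0, 9.5),
    (10.0, 6.0),
    (16.0, 20.0),
    (16.0, None),  # no successful human baseline
    (2.0, 1.5),
]

def tally(tasks, horizon_h):
    """Count tasks predicted at or above a horizon vs. tasks that
    actually took that long for a successful human baseliner."""
    predicted = sum(1 for pred, _ in tasks if pred >= horizon_h)
    achieved = sum(1 for _, actual in tasks
                   if actual is not None and actual >= horizon_h)
    return predicted, achieved

for horizon in (8, 16):
    pred, ach = tally(tasks, horizon)
    print(f">= {horizon}h: {pred} predicted, {ach} actually achieved")
```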

Rick H

Thanks for a valuable contribution. Who do you think has done a good job of analyzing the related fat tail risks? What does your team consider the probability of major kinetic conflict due to fear of a competing state's AI progress (real or imagined)?

Nathan Lambert

Now I know my response post is mega late when the reaction roundups are coming too. Regardless, congrats!

Citizen Penrose

One thing I haven't seen anyone ask is: will China's much larger industrial base give them any kind of head start that could offset the US's lead in AI capabilities? I'd guess China could get an industrial robot explosion started much faster, since they're already a few doublings further ahead on the exponential.

Misha

Could you clarify how you arrived at –24 as OpenAI's net approval rating? According to https://today.yougov.com/topics/technology/explore/brand/OpenAI, it is either +21 = (35 – 19) / 0.75 (among people who are familiar) or just +16 = 35 – 19 (treating unfamiliar as neutral).
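
For concreteness, the two conventions work out like this (a minimal Python sketch using the YouGov figures above):

```python
# Net approval under the two conventions, using the YouGov figures above:
# 35% favorable, 19% unfavorable, ~75% of respondents familiar with OpenAI.
favorable, unfavorable, familiar = 0.35, 0.19, 0.75

# Convention 1: restrict to respondents familiar with the brand.
net_among_familiar = (favorable - unfavorable) / familiar  # ~0.213 -> +21

# Convention 2: treat unfamiliar respondents as neutral.
net_overall = favorable - unfavorable  # 0.16 -> +16

print(f"among familiar: {net_among_familiar:+.0%}")  # +21%
print(f"overall:        {net_overall:+.0%}")         # +16%
```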

Scott Alexander

Interesting – I was going off https://drive.google.com/drive/folders/18_hrXchAN42UYhC93YEqaPZEzVWQAc2q . I don't know why these are so different.

Adam

It looks like the AIPI poll has the approval rating question after a bunch of safety-related questions, so maybe that affects people's views / the salience of safety concerns when they're thinking about approval.
