Legends – absolutely killer piece. I shared AI 2027 with a bunch of people because it hit exactly where my head’s been at for the past two years. Not everything will happen that fast, sure, but even 10% of it would be enough to redraw the map. And that’s exactly what’s happening.
AI feels like the internet all over again – just way deeper. It’ll be in everything. What I keep thinking about is: what does this mean for Europe? We talk a lot, regulate fast, but when it comes to actually building? That’s where it gets shaky. Yeah, we’ve got Mistral – but let’s be honest, that’s not enough. The real weight – the models, the infra, the talent – it’s mostly elsewhere.
I’d genuinely love to hear your take on Europe’s position in all this. AI isn’t optional anymore. And we can’t just be users forever. Maybe there’s hope through EU-based compute infrastructure, regulation-aligned data centers, or some unexpected angle. But the window’s closing fast.
Appreciate the piece – made a lot of people around me think.
Given that AI 2027 ends in likely catastrophe, the obvious thing for people in general to do is to advocate slower, more responsible AI development to avoid the catastrophe. This obvious thing to do goes double for people outside of US/China, since they are less exposed to possible upsides.
I would love it if Europe could advocate and referee some sort of AI treaty between the US and China. Maybe they could apply leverage via ASML, for instance. That would be a big win for everyone, as far as I'm concerned.
Every nation outside of the US and China should basically form a bloc advocating responsible AI development. Right now, no country can be trusted to get AI right.
Thanks for the shoutout :)
Keep up the great work!
Did you see the r/slatestarcodex post pointing out an issue with the model assumptions? "The AI 2027 Model would predict nearly the same doomsday if our effective compute was about 10^20 times lower than it is today"
https://old.reddit.com/r/slatestarcodex/comments/1k2up73/the_ai_2027_model_would_predict_nearly_the_same/
Great piece. Super useful summary!
If you could turn the tabletop exercise(s) into actual board games, I think a lot of people would find that useful and fun, and it would align with your goal of trying to diffuse these ideas into broad society. Something like the board game Pandemic.
I am willing to stipulate that the current paradigm of AIs may well saturate RE-bench in something like the timeline you project.
However, I feel very strongly (80%) that nothing in the current paradigm (transformer-based large language model + reinforcement learning + tool use + chain-of-thought + external agent harness) will ever, ever achieve the next milestone, described as "Achieving ... software ... tasks that take humans about 1 month ... with 80% reliability, [at] the same cost and speed as humans." Certainly not within Eli's 80% confidence window of 144 months after RE-bench saturation. (I can't rule out a brand-new non-transformer paradigm achieving it within 12 years, but I don't think there will be anything resembling continuous, steady progress from RE-bench saturation to this milestone.)
I would love to bet on this. Anyone have ideas for operationalizing it?
(In the interest of epistemic honesty, I must admit that if you had asked me a year or two ago whether the current paradigm would ever saturate something like RE-bench, I probably would have put it at 95% that it would not.)
Reasoning: The current crop of companies pushing the current paradigm have proven surprisingly adept at saturating benchmarks. (Maybe we should call this the current meta-paradigm.) RE-bench is not a "good" benchmark, though I agree it is better than nothing. I am calling a benchmark "good" to the extent that it, roughly: (a) fairly represents the milestone; (b) can be evaluated quickly and cheaply; (c) has a robust human baseline; (d) has good test-retest accuracy; (e) has a continuous score which increases vaguely linearly with model capability; (f) is sufficiently understandable and impressive to be worth optimizing (for purposes of scientific prestige and/or attracting capital investment). As the target milestone of human-time-horizon increases, the quality of the benchmarks necessarily decreases. I think RE-bench is near the frontier of benchmark possibility. I do not think we will ever see a benchmark for "160 hours of human software engineering" that is "good" enough for the current meta-paradigm to saturate it.
However, my prediction that there will never be a good benchmark for this milestone also makes it hard to adjudicate my desired bet that we will not see continuous progress towards the milestone.
Would the AI 2027 authors be willing to make a prediction and/or a bet about the appearance of a suitable benchmark?
What about a task like, "make a word processor which has all the capabilities of Microsoft Word, and more"? You give that to your AI agent, and it spits out an entire program: MAIcrosoft Word 2.0.
How long would you estimate it would take a human, or a team of humans, to recreate Microsoft Word from scratch? Or a full-fledged, modern-day-sized video game like GTA V? (To be clear, I am not saying that an AI will certainly be able to do that, but this is how you could "benchmark" it.)
I find some support for my claim in Fig 3 of the HCAST paper. There are 189 tasks (it's not clear to me whether this is all tasks or just the ones that had at least one successful baseline). They made an informal prediction of how long it would take a human to solve each one. It looks like 11 tasks were predicted at >= 8h; of the tasks with at least one success, roughly 4 actually took >= 8h.
Double the target horizon to 16h and they had 4 such predictions, but only 1 task was actually completed in >= 16h.
Meanwhile across their whole dataset they say only 61% of human baselines were successful. They highlight "practical challenges with acquiring sufficiently many high-quality human baseliners".
Each time you double the horizon, it becomes harder to create tasks, harder to predict the time, more expensive to bounty them, harder to QA the tasks, and harder to validate models. RE-bench is already barely serviceable with 5 (of 7) tasks at ~8h. I predict that with supreme effort and investment, a similarly "barely serviceable" benchmark could maybe be squeaked out at a 32h horizon, with a year of effort and a couple million dollars. I think making a serviceable benchmark with a 64h horizon would not be practical in the current US economy. Making a serviceable benchmark with a 128h horizon may not be possible at all in less than 10 years with anything like our current planetary society.
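To make the Fig 3 counting above explicit, here is a minimal sketch with made-up task records standing in for the HCAST data (the field layout and the numbers are illustrative assumptions, not values from the paper):

```python
# Illustrative tally of "predicted >= N hours" vs. "actually took >= N hours",
# in the spirit of the HCAST Fig 3 reading above. Records are invented.

tasks = [
    # (predicted_human_hours, hours_of_a_successful_baseline or None)
    (10.0, 9.5),
    (12.0, 6.0),
    (8.0, None),   # no successful human baseline
    (20.0, 18.0),
    (3.0, 2.5),
]

def tally(records, threshold_hours):
    predicted = sum(1 for pred, _ in records if pred >= threshold_hours)
    achieved = sum(1 for _, actual in records
                   if actual is not None and actual >= threshold_hours)
    return predicted, achieved

for threshold in (8, 16):
    pred, ach = tally(tasks, threshold)
    print(f">= {threshold}h: {pred} predicted, {ach} actually baselined at that length")
```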
Thanks for a valuable contribution. Who do you think has done a good job of analyzing the related fat tail risks? What does your team consider the probability of major kinetic conflict due to fear of a competing state's AI progress (real or imagined)?
Now I know my response post is mega late, given that the reaction roundups are already coming out. Regardless, congrats!
One thing I haven't seen anyone ask is: will China's much larger industrial base give them any kind of head start that could offset the US's lead in AI capabilities? I'd guess China could get an industrial robot explosion started much faster, since they're already a few doublings further ahead on the exponential.
Could you clarify how you arrived at –24 as OpenAI’s net approval rating? According to https://today.yougov.com/topics/technology/explore/brand/OpenAI, it is either +21 = (35 – 19) / .75 (among people who are familiar) or just +16 = 35 – 19 (treating unfamiliar as neutral).
Interesting - I was going off https://drive.google.com/drive/folders/18_hrXchAN42UYhC93YEqaPZEzVWQAc2q . I don't know why these are so different.
It looks like the AIPI poll has the approval rating question after a bunch of safety-related questions, so maybe that affects people's views / the salience of safety concerns when they're thinking about approval.
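For reference, here is a minimal sketch of the two readings of the quoted YouGov figures (using the percentages as given in the comment above, not re-checked against YouGov):

```python
# Two readings of the quoted YouGov OpenAI numbers: 35% approve, 19% disapprove,
# 75% familiar. Figures are taken from the comment above, not re-verified.

approve, disapprove, familiar = 0.35, 0.19, 0.75

net_all_respondents = approve - disapprove           # unfamiliar counted as neutral
net_among_familiar = net_all_respondents / familiar  # restricted to the familiar

print(f"net, unfamiliar as neutral: {net_all_respondents * 100:+.0f}")  # +16
print(f"net, among the familiar:    {net_among_familiar * 100:+.0f}")   # +21
```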