I was just reading AI 2027 yesterday and doing the same thing, seeing how well it had fared so far!
I was surprised that progress on quantitative metrics is at roughly 65%; subjectively, reality felt slower than the forecast, but not by too much.
"(To avoid singling out any one existing company, we’re going to describe a fictional artificial general intelligence company, which we’ll call OpenBrain. We imagine the others to be 3–9 months behind OpenBrain.)
The race appears to be closer than we predicted, more like a 0-2 month lead between the top US AGI companies."
It doesn't even seem accurate to describe any company as being in the lead anymore, because the different companies have focused on different things and are ahead in different areas. You're also not saying which company you think has a 0-2 month lead! How are you grading yourself on this forecast, then?
"So when OpenBrain finishes training Agent-1, a new model under internal development, it’s good at many things but great at helping with AI research."
You didn't comment on this directly, but it seems like the gap in time between what is made available to consumers and what companies have available internally is very small nowadays. Originally when I read AI 2027 I had the impression that a company would have this Agent-1 internally for something like at least 4 months and make a ton of progress behind closed doors. Now it seems like consumers are getting the best coding agents available with only a ~1 month time lag. (The time delay between Opus 4.5 and Opus 4.6 was 2 months and 12 days. It doesn't seem like Anthropic had Opus 4.6 available internally at the time of the Opus 4.5 release? I guess I am just speculating...)
"By this point “finishes training” is a bit of a misnomer; models are frequently updated to newer versions trained on additional data or partially re-trained to patch some weaknesses."
Hmmm, I would give this one a grade of B- to B? You never specified what you meant by "frequently". When originally reading AI 2027, I was under the impression that you meant something more like continual learning: every day or week the model would be retrained on the internet's worth of text that had occurred in that period, so that it was up to date with the news. But that might have just been me misinterpreting. It still feels like we are waiting, for example, for Sonnet 5.0 to "finish training" so that it can be released.
> You didn't comment on this directly, but it seems like the gap in time between what is made available to consumers and what companies have available internally is very small nowadays. Originally when I read AI 2027 I had the impression that a company would have this Agent-1 internally for something like at least 4 months and make a ton of progress behind closed doors. Now it seems like consumers are getting the best coding agents available with only a ~1 month time lag. (The time delay between Opus 4.5 and Opus 4.6 was 2 months and 12 days. It doesn't seem like Anthropic had Opus 4.6 available internally at the time of the Opus 4.5 release? I guess I am just speculating...)
Agree with this! I just forgot to mention this in the post, my bad.
Thanks! I agree with most of your points. Re continual learning: we talk about how it happens in early 2027; therefore, I don't think it is correct to interpret us as predicting it happening by end of 2025.
Writing in November about your project (https://open.substack.com/pub/mwrussell1969/p/hyperstition?r=av0kj&utm_campaign=post&utm_medium=web), I closed with an assessment of its predictive validity to date, which I thought was pretty damn good, if a bit slower than advertised.
I didn't put a % on it, but if pressed to go back and do so, I would have said 75-80%.
I find your grade quite fair, and much more carefully calculated than mine (duh!).
Keep up the good work!
I'm somewhat surprised by the uplift prediction scoring so badly, given that my subjective experience has been one of much more progress than I anticipated, particularly towards the end of the year. I'd intuitively expect uplift to correlate with coding time horizon, so it's definitely weird to see them diverge so much.
In addition to the initial overestimate you mention, I think the data quality makes it hard to judge. The data points are extremely rare. I think only that one METR study collected real time-based data? Everything else is subjective reporting, which that same study demonstrates is of poor quality. Plus, anecdotally, the variance across individual users is extreme, borne out by the Opus 4.6 model card (30% to 700% speed-up reported!).
This particular factor seems pretty load-bearing for takeoff timelines, so the data uncertainty matters a lot. I would love to have better primary data collection on this. My suspicion is that uplift is being underestimated, but the confidence interval is super wide.
We really wish we had better estimates of uplift. What happened in this grading process is that when we published AI 2027, the METR downlift study hadn't come out yet. Not knowing about that study, our estimate in AI 2027 of the uplift happening in early 2025 was higher than it should have been. Our estimate of the uplift happening in late 2025 is about where we thought it was in early 2025.
So yeah, there's definitely been significant progress in uplift over the course of 2025, but it's hard to measure, and we have lots of uncertainty about the absolute level of uplift, which arguably makes AI 2027's forecast get graded unfairly harshly. Arguably.
What was your "cutoff" date for the 65% figure? Were 4.6/5.3 included in that? If not, would it have even mattered?
The cutoff date was exactly the end of 2025, so not including 4.6/5.3. I don't think it would have changed the result by much, though we don't know those models' time horizons yet.
Also really want to know, as that seems relevant.
Great post! Very informative. I thought this post did a great job explaining both of your points of view. That said, I was confused by some of the statements that were made in the post.
(1). Agent-1 gets mentioned quite a few times in the qualitative predictions grading portion of this post. Looking at the AI Futures graph from April, it appears that Agent-1 needs an 80% time horizon that is between 1 week (40 hours) and 1 month (167 hours). This would imply that we don't have Agent-1 yet, let alone Agent-0. If this is the case, then why are the qualitative predictions that center around Agent-1 considered mostly correct?
(2). In the "Looking ahead to 2026 and beyond" part of the post, the coding time horizon section says “A central AI-2027-speed trajectory from the AI 2027 timelines model predicts ~3 work week 80% coding time horizons by the end of 2026. Time horizons also play a large role in our newer AI Futures Model. In this model, a handcrafted AI-2027-speed trajectory achieves time horizons of about a year by the end of 2026.” I am confused by what is being predicted. Is the prediction that by the end of 2026 there will be a 3 work week (120 hours) 80% time horizon, or is the prediction instead that the 80% time horizon will be 1 year? (The work-hour conversions I'm assuming are sketched after question (5) below.)
(3). The post further says “we’d guess that it will unfortunately be difficult to be highly confident by the end of 2026 that AI takeoff will or won’t begin in 2027.” Why is this the case? I would assume that by the end of 2026 we would know whether a takeoff is going to occur.
(4). Is the automated coder the same as Leopold's drop-in remote worker?
(5). In several recent posts by Zvi Mowshowitz (https://substack.com/@thezvi/note/p-187032711?r=6lp84s&utm_medium=ios&utm_source=notes-share-action, https://substack.com/@thezvi/note/p-187443082?r=6lp84s&utm_medium=ios&utm_source=notes-share-action, https://substack.com/@thezvi/note/p-187029171?r=6lp84s&utm_medium=ios&utm_source=notes-share-action and https://thezvi.substack.com/p/ai-155-welcome-to-recursive-self?r=6lp84s&utm_medium=ios) related to Claude Opus 4.6 and GPT 5.3 Codex, Zvi comes to the conclusion that future models, including Anthropic and OpenAI models, will likely be released every two or so months. Dean Ball believes that in-context learning will solve continual learning, and Zvi says “I expect ‘continual learning’ to be solved primarily via skills and context, and for this to be plenty good enough, and for this to be clear within the year.” Daniel and Eli, I am wondering whether you both believe that skills and in-context learning are enough to solve continual learning, or whether a new paradigm shift is needed instead.
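For reference, here is the quick work-hour arithmetic I'm assuming in questions (1) and (2). The 40-hour week and ~2,000 work hours per year are my own assumptions; only the 40/120/167-hour figures come from the post and graph.

```python
# Purely illustrative arithmetic, assuming a 40-hour work week and ~2,000
# work hours per year; these are my assumptions, not numbers from the post.
HOURS_PER_WORK_WEEK = 40
WORK_HOURS_PER_YEAR = 2000                       # ~50 work weeks of 40 hours
WORK_HOURS_PER_MONTH = WORK_HOURS_PER_YEAR / 12  # ~167 hours, matching the graph

horizons_in_hours = {
    "1 week":       1 * HOURS_PER_WORK_WEEK,   # 40 hours
    "3 work weeks": 3 * HOURS_PER_WORK_WEEK,   # 120 hours
    "1 month":      WORK_HOURS_PER_MONTH,      # ~167 hours
    "1 year":       WORK_HOURS_PER_YEAR,       # ~2,000 hours
}

for label, hours in horizons_in_hours.items():
    print(f"{label:>12}: ~{hours:.0f} work hours")
```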
Jimmy Ba recently left xAI on February 10. He claims that recursive self-improvement loops will be live in 12 months, so sometime in February 2027. I am wondering whether this has affected your predictions. How much weight do you put on this statement?
Your 0.65x doesn't sound like mere noise to me. It sounds like, even if capability keeps advancing, the conversion into real-world impact can be capped by exogenous cycle times. This isn't an attempt to explain why your benchmarks are at 0.65x; it's an attempt to explain why the translation from capability to throughput in heavily regulated, heavy industry tends to remain below 1.0x.
I work in Legal and Corporate Affairs at the La Oroya metallurgical complex in Peru, at Metalurgia Business Peru S.A.A., formerly Doe Run Peru. It’s an environment with intense environmental and regulatory load, constant oversight, operational risk, and a governance structure where worker shareholding slows internal alignment.
In that setting, AI uplift runs into diminishing returns with particular clarity. It can reduce errors, organize evidence, prepare filings, and improve traceability. But it doesn't shorten environmental remediation subject to verification, it doesn't change permitting timelines, and it doesn't eliminate internal negotiation when the people executing are also the ones deciding. The limiting unit isn't tokens per second. It's months per permit. If a takeoff scenario aims to speak about global economic transformation and not only capability, it needs to parameterize that decision-to-execution latency.
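As a purely hypothetical sketch of what I mean by parameterizing that latency (the function and all numbers below are invented for illustration, not taken from any real model): even a large uplift on the desk-work portion barely moves the end-to-end timeline when permits and negotiation dominate.

```python
# Hypothetical illustration: AI multiplies the rate of desk work, but
# end-to-end project time is capped by serial, non-compressible steps
# (permits, verified remediation, internal negotiation).
def project_duration_months(desk_work_months: float,
                            ai_uplift: float,
                            permit_latency_months: float,
                            negotiation_months: float) -> float:
    """Total time = accelerated desk work + steps AI does not compress."""
    return desk_work_months / ai_uplift + permit_latency_months + negotiation_months

baseline = project_duration_months(6, ai_uplift=1.0,
                                    permit_latency_months=18, negotiation_months=6)
with_ai = project_duration_months(6, ai_uplift=5.0,
                                   permit_latency_months=18, negotiation_months=6)

print(f"baseline: {baseline:.1f} months, with 5x uplift: {with_ai:.1f} months")
print(f"effective speedup: {baseline / with_ai:.2f}x")  # ~1.2x, far below 5x
```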
I'm not criticizing the analysis of the metric itself, but its application to typical structures and how far it departs from atypical ones, which make up a large part of the current industrial world, especially in developing economies, and which, as I repeat, are not geopolitically irrelevant.
What I mean is that intelligence should not be analyzed in a vacuum, but rather based on the real infrastructure that, if altered, is capable of changing the course of the global economy, and not just benchmarks. I mention this because “takeoff” would not be an isolated event. It is framed as a global inflection point, so incorporating global components like the one I’m describing isn’t arbitrary.
I love you guys' stuff, but it seems like the whole project isn't much more than a game. Like, ok, say you are more accurate than 99% of people over some time horizon. What's the end game?
Like what would be the takeaway or result of your work that would make you the happiest?
They've offered policy recommendations as well. The end game is that by having accurate forecasts we can more wisely steer towards the futures we would like to see.
Recommendations for whom? Who will be doing the steering?
I'm not trolling, these are serious questions. Who is it that both:
Pays attention to what this here substack says, and
Has the power to implement suggestions ("wisely steer")?
I encourage you to read through more of the blog; some of your questions are answered there. And spokespeople like these authors have talked to congresspeople and politicians. Also, public persuasion can have (and has had before) real downstream effects.
I've read the blog. I also asked Daniel directly. His response was that they are really not in business to make political change (I'm paraphrasing).
It doesn't work. Change comes from power. There are two kinds of power; I'll use automotive safety as an example:
External but with teeth: NHTSA can force recalls, etc.
Internal: there are people within, say, Ford, who are responsible for safety because maiming and killing customers is bad business. They have a say in car design from within the organization.
This blog, like other external "AI safety" orgs, is... a well-wisher? They can deliver whatever insightful analysis they want, great recommendations, etc. But they can't make Altman do anything, nor are they affecting things from the inside.
What evidence could I give that would persuade you otherwise?
You know, it's a hard question. An obvious one would be a statement from either a politician (e.g., a congresscreature) or from an AI company that they are considering policy recommendations from / working with AI Futures. But I understand that the absence of an announcement doesn't prove an absence of action.
I'd also trust the AI Futures crew if they came out with an update on their work with the powers that be. Like, "in the past N months, we engaged in discussions with such and such, with these outcomes."
The imbalance still remains, i.e., nobody HAS TO take their recommendations, but at least it'd be good to know they are finding sympathetic ears.