8 Comments
intexp:

Great work!

Nice graphs, but where is the CUMULATIVE probability?

That's the most relevant and easiest-to-read metric for the "when superhuman coder?" question; you could even combine the yearly and cumulative probabilities on one graph.

Nice that you included the 10/50/90th percentile data points, but a curve would be better.

Again, great work!
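For reference, the cumulative curve the comment asks for is just the running sum of the per-year arrival probabilities. A minimal sketch with made-up placeholder numbers (not the post's actual forecast):

```python
# Sketch: converting yearly arrival probabilities into a cumulative curve.
# The probabilities below are made-up placeholders, not the post's data.
years = list(range(2025, 2031))
yearly = [0.05, 0.10, 0.15, 0.20, 0.15, 0.10]  # P(milestone first reached in that year)

cumulative = []
total = 0.0
for p in yearly:
    total += p
    cumulative.append(round(total, 2))

print(dict(zip(years, cumulative)))
# cumulative[-1] here is 0.75; the remaining 0.25 is "after 2030 or never"
```

Both series can then be drawn on one axis, as the comment suggests, since the cumulative curve is monotone and bounded by 1.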

Jonathan Shaw:

"To get a good sense of how AI horizons break down, we recommend watching Claude play Pokemon."

This made sense to me intuitively, but alas it's flawed. As an assessment of agency it is hopelessly confounded with the problem of vision. According to someone who has studied this very carefully (https://www.lesswrong.com/posts/8aPyKyRrMAQatFSnG/untitled-draft-x7cc, see the section 'Model Vision of Pokémon Red is Bad. Really Bad.'):

"It takes only the briefest amount of watching to convince yourself that the models can barely see what's going on, if at all."

tomdhunt:

This "straight lines on (log-scale) graphs" forecasting method is fundamentally unsound.

Moore's Law is a strange outlier in the history of technological development. The normal course for a single technology is an initial, extremely steep climb (far faster than Moore's Law) driven by early development, domain exploration, and low-hanging fruit, followed by a plateau. See e.g. https://www.construction-physics.com/i/145994276/how-to-think-about-fusion-progress for some examples here.

Extrapolating that initial steep climb out to some arbitrary level of capability is simply fallacious. By this logic, extrapolating the land-speed gains of the 1890s-1900s, cars would have been traveling at Mach 12 by 1930.
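The car analogy can be checked with rough numbers: fit an exponential (a straight line on a log plot) through two early land-speed records and extrapolate forward. The data points below are approximate historical figures and the exercise is purely illustrative; the exact multiple depends on which records you pick, but any choice from that era yields an absurd 1930 prediction:

```python
import math

# Rough land-speed-record data points (approximate historical figures):
# 1898: ~63 km/h; 1906: ~205 km/h.
t0, v0 = 1898, 63.0
t1, v1 = 1906, 205.0

# Fit an exponential through the two points: straight line on a log-scale plot.
growth = math.log(v1 / v0) / (t1 - t0)  # per-year log growth rate

def extrapolate(year):
    """Continue the fitted exponential to a later year."""
    return v1 * math.exp(growth * (year - t1))

v_1930 = extrapolate(1930)
mach = v_1930 / 1235.0  # Mach 1 is roughly 1235 km/h at sea level
print(f"Extrapolated 1930 record: {v_1930:,.0f} km/h (~Mach {mach:.1f})")
# Actual records around 1930 were roughly 370-400 km/h, about Mach 0.3.
```

The fitted curve overshoots reality by more than an order of magnitude within 24 years, which is the commenter's point about extrapolating early steep climbs.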

tomdhunt:

(for the record, log-plot of land speed records over time: https://files.catbox.moe/v1yglu.png)

Ariel:

It’s weird that you draw a line starting from GPT-4o. As far as I can tell, it was not actually designed to push the coding frontier, but rather to be a cheaper GPT-4.

So the only real datapoint is Claude 3.7, and AFAICT much of its high score comes from good “tree of thought” engineering by Anthropic that won’t obviously scale.

Alvin Ånestrand:

Thank you for a great post!

Also, small typo:

> these AI-human discrepancy

Noah:

> a minutes

typo

Ash Jafari:

Thank you for sharing those valuable insights. I have a question regarding Sakana's AI Scientist v2, which recently became the first AI to generate a paper that passed peer review at a workshop level. Given this achievement, couldn't we argue that this specialized type of AI can already operate effectively over extended periods—weeks or even one month—rather than being limited to the typical 1-hour timeframe? Looking forward to your thoughts.
