Hallucination rate (PersonQA) could also be a pretty good proxy for model size: more parameters means more places to "store" random knowledge. This would explain the placement of the mini models, 4o, and 4.5.
So o3's (seemingly weird) poor performance on PersonQA could mean one of two things:
1. It really is just a smaller model. This would explain the pricing as well, though the system card treats the result as a surprising finding.
2. Sufficiently heavy RL "corrupts" some of the model's inner knowledge.
I think the term "hallucination" has grown to encompass many distinct behaviors at this point and isn't super useful. I'd personally bet that in many cases o3 knows full well it isn't telling the truth, but doesn't care as RL has unintentionally incentivized that sort of behavior.
Something I never understood is the relationship between GPT-4 and GPT-4o: is 4o a different base model?
Yep, I think 4o is a distillation of GPT-4 to make it smaller and more efficient. GPT-4 was probably around 600B dense-equivalent parameters.
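To make "distillation" concrete, here's a minimal sketch of standard logit distillation, assuming a frozen teacher (the big model) and a smaller trainable student; the temperature and loss form are generic textbook choices, not anything OpenAI has confirmed about the GPT-4 to 4o relationship:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label distillation: push the student's token distribution
    toward the (larger) teacher's distribution."""
    # Soften both distributions with a temperature, then match them with KL divergence.
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable to a hard-label loss.
    return F.kl_div(s, t, reduction="batchmean") * temperature**2

# Toy usage: a batch of 4 token positions over a 32k-token vocabulary.
teacher_logits = torch.randn(4, 32_000)                      # frozen teacher outputs
student_logits = torch.randn(4, 32_000, requires_grad=True)  # trainable student outputs
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()  # gradients flow only into the student
```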
Having just learned about
"Request for Information on the Development of a 2025 National Artificial Intelligence (AI) Research and Development (R&D) Strategic Plan",
and having just considered a "not good idea", I consulted the oracle. The oracle responds:
Healthier ways to get the same message across:
1. Write a single, high‑impact comment that lays out the safety and transparency argument and references "AI 2027" and Kokotajlo’s post (https://blog.ai-futures.org/p/training-agi-in-secret-would-be-unsafe), with citations and constructive recommendations (e.g., disclosure protocols, evaluation benchmarks).
2. Recruit real co‑signers. Circulate a Google doc or GitHub gist; let supporters add names and affiliations. Submit once, listing all endorsers—perfectly legitimate and shows genuine consensus.
3. Coalition letter. Partner with aligned orgs (e.g., Center for AI Safety, CLTR, CSET). Agencies give serious weight to comments backed by established research institutions.
4. Public blog & social push. Publish an open letter, then invite readers to file their own distinct comments—each in their own words, referencing the main letter.
Since I'm not an active researcher, might someone else take the lead on this?
(I am itching to co-sign such a letter, even if I'm underqualified to write it)
The huge ARC-AGI gap between the o3s is because the old version was trained on (more of) the training set than the new version. See https://x.com/polynoamial/status/1914902699021132003
That’s not the only reason; the newly released version was also evaluated using 100x less inference-time compute. See: https://arcprize.org/blog/analyzing-o3-with-arc-agi
You're right. Though I do think the old version's use of the training set is worth mentioning in an edit to the post.
Do you believe that a concurrent paradigm of machine learning could exist based on offline, private instances run within a cottage industry of small GPU clusters?
What if AGI isn't strictly a power-and-compute thing? What if, akin to the society we exist in, conscious awareness requires information that has been subjected to time, chronologically? Data is data to a computer: ones and zeros that take up the same space.
Humans, though, attach a different value, sometimes a whole mythology, to information that is considered to have survived from time immemorial, even if it's no longer useful in its original format.
I think it's at least worth looking at our society's most ancient continually transmitted information, the kind we can identify with the highest fidelity in a scientific manner and fix chronologically with the appropriate methods. I suspect that if we look long enough, we'll find a pattern that resembles a firmware of conscious thought in organic processors.
Might be fun
What does $100M to $1B of RL look like? Is it the same activities as the old RL, but performed by 10,000 people? Are they actually hiring thousands of people to do RL?
My understanding is that the RL environments are mostly synthetic, i.e., they are autograded (maybe with unit tests written by elicited AIs, or humans). So scaling RL just means making more of these environments for more problems.
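To illustrate "autograded", here's a minimal sketch of a unit-test-graded coding environment: the reward is 1 if the model's generated solution passes the tests, 0 otherwise. The function, problem, and tests below are hypothetical, not OpenAI's actual pipeline:

```python
import subprocess
import sys
import tempfile
import textwrap

def autograde(model_solution: str, unit_tests: str, timeout_s: int = 10) -> float:
    """Reward = 1.0 if the model's code passes the unit tests, else 0.0.
    In a real pipeline the tests might themselves be written by an
    elicited model or by human contractors."""
    program = model_solution + "\n\n" + unit_tests
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout_s)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0

# Hypothetical environment: "write an add(a, b) function".
solution = "def add(a, b):\n    return a + b"
tests = textwrap.dedent("""
    assert add(2, 3) == 5
    assert add(-1, 1) == 0
""")
print(autograde(solution, tests))  # 1.0 -> positive reward for the RL update
```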
Many Thanks! Yeah, the two different models for "o3" were particularly confusing.
On my tiny benchmark-ette, I kind-of saw the deterioration from o3-mini-high (Feb) to o3 (Apr).
https://www.astralcodexten.com/p/open-thread-366/comment/90363116 (o3-mini-high)
https://www.astralcodexten.com/p/open-thread-377/comment/109495090 (o3)
Only o3-mini-high has _ever_ gotten the S4 question
>What is an example of a molecule that has an S4 rotation-reflection axis, but neither a center of inversion nor a mirror plane?
right.
It is kind-of weird that some data from 4.5 might be being used in some of the updates to some of the other models... If I take a one-dimensional view of "How close is this to a competent STEMM professional?", the GPQA Diamond scores are probably the closest approximation.
I'm still a bit skeptical of even that measure, since e.g. o1 scores higher than an expert human on GPQA Diamond, yet I'm still seeing failures on questions in my benchmark-ette that a bright undergraduate should be able to answer correctly.
Oh well. Looking forward to the next releases from OpenAI, Google, DeepSeek, Anthropic, and xAI.
During the process of iterated distillation and amplification (IDA) and RL training, wouldn't each successive o-series model take 10x more compute¹ to train?
I'd expect agentic reasoners to scale in roughly the same way; long-horizon tasks would also take more wall-clock time.
BTW: are you planning a next, more polished version of AI 2027?
1. Per OpenAI, o3 took 10x more compute (FLOPs) to train than o1.
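Taking that footnote at face value, a back-of-the-envelope sketch of how training cost would compound if each o-series generation really is ~10x the previous one (the o1 baseline below is a placeholder, not a known figure):

```python
# Hypothetical: assume o1 cost some baseline number of training FLOPs and each
# successive o-series generation costs 10x the previous one, per the "~10x o1 -> o3" claim.
o1_flops = 1e25  # placeholder baseline, not a real figure
for gen, name in enumerate(["o1", "o3", "next o-series (hypothetical)"]):
    print(f"{name}: ~{o1_flops * 10**gen:.0e} training FLOPs")
# o1: ~1e+25, o3: ~1e+26, next o-series (hypothetical): ~1e+27
```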
Last one (hopefully 😂)
If GPT-5 is gonna be released for free users too, but it’s a huge model, does OpenAI just plan on eating the loss? Even with no thinking, it could be quite expensive (like how 4.5 is). And then with thinking, it could be extremely expensive.
Agreed. I basically don't buy that OpenAI will actually serve it for free at high volumes; they'll just eat the loss at lower volumes because it's worth it long term (to drive up market share). It is maybe some evidence for option 3, though, which would imply a cheaper GPT-5.
If o1-pro was a further elicited version of o1, why is it so much more expensive? You could argue profit, but I’ve seen some metrics where it does much better than o3.
Probably the 10x price difference directly reflects a 10x increase in inference compute, so it's just a cons@10 or similar elicitation.
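For reference, cons@10 here would mean something like self-consistency: sample the same reasoner 10 times and return the majority answer, which multiplies inference cost by roughly 10 with no new training. A minimal sketch, with a made-up sample_answer stand-in for the model call:

```python
import random
from collections import Counter

def consensus_at_k(sample_answer, prompt: str, k: int = 10) -> str:
    """cons@k / self-consistency: draw k independent samples and return
    the most common final answer. Inference cost scales roughly linearly in k."""
    answers = [sample_answer(prompt) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

# Hypothetical stand-in for a stochastic model call, for illustration only.
def sample_answer(prompt: str) -> str:
    return random.choice(["42", "42", "42", "41"])  # noisy but usually right

print(consensus_at_k(sample_answer, "What is 6 * 7?"))  # usually "42"
```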
On the gif, you point to a hypothetical model elicited from 4.5 that o1 was distilled from. Can you give more details about this? Was this an internal first try at reasoning? And I assume GPT-5 will be different.
It's not a new model but an elicitation, which just means using more inference-time compute to reach a higher capability level.