20 Comments
Generative Gallery:

Ok actually one more.

Hallucination rate (PersonQA) could be a pretty good way to measure model size as well: more parameters means more spots to “store” random knowledge. This would explain the placement of the mini models, 4o, and 4.5.

So the (seemingly weirdly) poor o3 performance on PersonQA could mean one of two things:

1. It really is just a smaller model. This would explain the pricing as well; however, the system card treats it as a surprising finding.

2. Sufficiently advanced levels of RL “corrupt” some of the inner knowledge.

Adam Kaufman:

I think the term "hallucination" has grown to encompass many distinct behaviors at this point and isn't super useful. I'd personally bet that in many cases o3 knows full well it isn't telling the truth but doesn't care, as RL has unintentionally incentivized that sort of behavior.

Bonifacijs:

Something that I never got was the relationship between GPT-4 and GPT-4o: is it a different base model?

Romeo Dean:

Yep, I think 4o is a distillation of GPT-4 to make it smaller and more efficient. GPT-4 was probably around 600B dense-equivalent parameters.
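
For readers unfamiliar with the term, here is a minimal sketch of what distillation means in this context (a hypothetical student/teacher setup in PyTorch, not OpenAI's actual recipe): a smaller "student" model is trained to match the output distribution of the larger, frozen "teacher".

```python
# Minimal knowledge-distillation sketch (hypothetical, not OpenAI's recipe):
# a smaller "student" is trained to match the token distribution of a frozen,
# larger "teacher" on the same inputs.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between softened teacher and student distributions."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # The t**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t**2

# In a training loop this is typically mixed with the ordinary next-token loss:
# loss = distillation_loss(student(x), teacher(x).detach()) + ce_loss(student(x), y)
```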

Jamie Fisher:

Having just learned about

"Request for Information on the Development of a 2025 National Artificial Intelligence (AI) Research and Development (R&D) Strategic Plan",

and having just considered a "not good idea", I consulted the oracle. The oracle responds:

Healthier ways to get the same message across:

1. Write a single, high‑impact comment that lays out the safety and transparency argument and references "AI 2027" and Kokotajlo’s post (https://blog.ai-futures.org/p/training-agi-in-secret-would-be-unsafe), with citations and constructive recommendations (e.g., disclosure protocols, evaluation benchmarks).

2. Recruit real co‑signers. Circulate a Google doc or GitHub gist; let supporters add names and affiliations. Submit once, listing all endorsers—perfectly legitimate and shows genuine consensus.

3. Coalition letter. Partner with aligned orgs (e.g., Center for AI Safety, CLTR, CSET). Agencies give serious weight to comments backed by established research institutions.

4. Public blog & social push. Publish an open letter, then invite readers to file their own distinct comments—each in their own words, referencing the main letter.

Since I'm not an active researcher, might someone else take the lead on this?

Jamie Fisher:

(I am itching to co-sign such a letter, even if I'm underqualified to write it)

Burrito:

The huge ARC-AGI gap between the o3s is because the old version was trained on (more of) the training set than the new version. See https://x.com/polynoamial/status/1914902699021132003

Romeo Dean:

That’s not the only reason; it was also evaluated using 100x less inference-time compute. See: https://arcprize.org/blog/analyzing-o3-with-arc-agi

Burrito:

You're right. I do think it's worth editing the post to mention that the old version used the training set, though.

Sobr Sage:

Do you believe that a concurrent paradigm of machine learning could exist based on offline, private instances run within a cottage industry of small GPU clusters?

What if AGI isn't strictly a power and compute thing? What if, akin to the society we exist in, conscious awareness requires information that has been subjected to time, chronologically? Data is data to a computer. Ones and zeros. It takes up the same space.

Humans, though, attach a different value, sometimes a whole mythology, on top of information that is considered to have survived time immemorial, even if it's no longer useful in its original format, as it were.

I think it's at least worth looking at our society's most ancient continually transmitted information with the highest fidelity to something we can identify in a scientific manner, and chronologically fix it with the appropriate methods. I suspect that if we look long enough, we'll find a pattern that resembles a firmware of conscious thought in organic processors.

Might be fun

Tom P:

What does $100M to $1B of RL look like? Is it the same activities as the old RL, but performed by 10,000 people? Are they actually hiring thousands of people to do RL?

Romeo Dean:

My understanding is that the RL environments are mostly synthetic, i.e., they are autograded (maybe with unit tests written by elicited AIs, or humans). So scaling RL just means making more of these environments for more problems.
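
To make "autograded" concrete, here is a minimal hypothetical sketch of a coding environment where the reward is just the fraction of unit tests a submission passes. Names like CodeTask and grade are illustrative, not any real framework's API.

```python
# Hypothetical sketch of an "autograded" RL coding environment: the reward is
# simply the fraction of unit tests the model's submission passes.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class CodeTask:
    prompt: str                          # problem statement shown to the model
    tests: List[Callable[[Dict], bool]]  # unit tests over the submitted namespace

def grade(task: CodeTask, submission: str) -> float:
    """Run the submission and return the fraction of tests passed (the RL reward)."""
    namespace: Dict = {}
    try:
        exec(submission, namespace)      # a real system would sandbox this
    except Exception:
        return 0.0
    passed = 0
    for test in task.tests:
        try:
            passed += bool(test(namespace))
        except Exception:
            pass
    return passed / len(task.tests)

# Example: a task asking for add(a, b), graded against two tests.
task = CodeTask(
    prompt="Write a function add(a, b) that returns their sum.",
    tests=[lambda ns: ns["add"](2, 3) == 5, lambda ns: ns["add"](-1, 1) == 0],
)
print(grade(task, "def add(a, b):\n    return a + b"))  # -> 1.0
```

Scaling RL then amounts to generating many more such tasks and graders rather than hiring many more human labelers.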

Jeffrey Soreff:

Many Thanks! Yeah, the two different models for "o3" were particularly confusing.

On my tiny benchmark-ette, I kind-of saw the deterioration from o3-mini-high (Feb) to o3 (Apr).

https://www.astralcodexten.com/p/open-thread-366/comment/90363116 (o3-mini-high)

https://www.astralcodexten.com/p/open-thread-377/comment/109495090 (o3)

Only o3-mini-high has _ever_ gotten the S4 question

>What is an example of a molecule that has an S4 rotation-reflection axis, but neither a center of inversion nor a mirror plane?

right.

It is kind-of weird that some data from 4.5 might be being used in some of the updates to some of the other models... If I take a one-dimensional view of "How close is this to a competent STEMM professional?", the GPQA Diamond scores are probably the closest approximation.

I'm still a bit skeptical of even that measure, since e.g. o1 scores higher than an expert human on GPQA Diamond, yet I'm still seeing failures on my benchmark-ette which a bright undergraduate should be able to fully answer correctly.

Oh well. Looking forward to the next releases from OpenAI, Google, Deepseek, Anthropic, and xAI.

Renxayzer:

During the process of iterated distillation and amplification (IDA) and RL training, wouldn't each successive o-series model take 10x more compute¹ to train?

I see agentic reasoners scaling roughly the same way. Long-horizon tasks would take more with respect to time.

BTW: Are you planning a next, more polished version of AI 2027?

1. As per OpenAI: o3 took 10x more FLOPs of compute compared to o1
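
A back-of-the-envelope sketch of what that footnote implies, assuming the 10x-per-generation pattern continues; the o1 baseline figure below is a made-up placeholder, not a published number.

```python
# Geometric growth sketch: each new o-series model takes ~10x the RL training
# FLOPs of its predecessor. Baseline is a hypothetical placeholder.
o1_rl_flops = 1e24   # made-up baseline for o1's RL compute, not a real figure
scale = 10           # footnote: o3 used ~10x the FLOPs of o1

for step, name in enumerate(["o1", "o3", "next o-series model (hypothetical)"]):
    print(f"{name}: ~{o1_rl_flops * scale**step:.1e} RL FLOPs")
# o1: ~1.0e+24, o3: ~1.0e+25, next: ~1.0e+26
```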

Generative Gallery:

Last one (hopefully 😂)

If GPT-5 is gonna be released for free users too, but it’s a huge model, does OpenAI just plan on eating the loss? Even with no thinking, it could be quite expensive (like how 4.5 is). And then with thinking, it could be extremely expensive.

Romeo Dean:

Agree. I basically don't buy that OpenAI will actually serve it for free at high volumes; they'll just eat the loss at lower volumes because it's worth it long term (to drive up market share). It is maybe some evidence, though, for option 3, which would imply a cheaper GPT-5.

Generative Gallery:

If o1-pro was a further elicited version of o1, why is it so much more expensive? You could argue profit, but I’ve seen some metrics where it does much better than o3.

Romeo Dean:

Probably the 10x price difference directly reveals that it's a 10x increase in inference compute, so it's just cons@10 or a similar elicitation.
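
For reference, cons@k ("consensus at k") can be sketched as: sample the same model k times and majority-vote the final answers, which costs roughly k times the inference compute of a single sample. A hypothetical sketch; sample_model is a stand-in for whatever call produces one answer, not a real API.

```python
# Hypothetical cons@k sketch: k independent samples, majority vote on the answer.
from collections import Counter
from typing import Callable, List

def consensus_at_k(sample_model: Callable[[str], str], prompt: str, k: int = 10) -> str:
    """Majority vote over k independent samples (~k times the inference compute)."""
    answers: List[str] = [sample_model(prompt) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

# Usage with a stubbed sampler (illustrative only):
# answer = consensus_at_k(lambda p: call_model(p, temperature=1.0), "What is 17 * 23?")
```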

Generative Gallery:

On the GIF, you point to a hypothetical model elicited from 4.5 that o1 was distilled from. Can you give more details about this? Was this an internal first try at reasoning? And I assume GPT-5 will be different.

Romeo Dean:

It's not a new model but an elicitation, which just means using more inference-time compute to reach a higher capability level.
