Hallucination rate (PersonQA) could also be a pretty good proxy for model size: more parameters means more places to "store" random knowledge. This would explain the placement of the mini models, 4o, and 4.5.
So o3's (seemingly weird) poor performance on PersonQA could mean one of two things:
1. It really is just a smaller model. This would explain the pricing as well, though the system card treats the result as a surprising finding.
2. Sufficiently heavy RL "corrupts" some of the model's inner knowledge.
I think the term "hallucination" has grown to encompass many distinct behaviors at this point and isn't super useful. I'd personally bet that in many cases o3 knows full well it isn't telling the truth, but doesn't care as RL has unintentionally incentivized that sort of behavior.
Something I never understood is the relationship between GPT-4 and GPT-4o: is 4o a different base model?
Yep, I think 4o is a distillation of GPT-4 to make it smaller and more efficient. GPT-4 was probably around 600B dense-equivalent parameters.
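To make "distillation" concrete, here's a minimal sketch of standard logit distillation, assuming a frozen teacher (the big model) and a smaller trainable student; the temperature and loss form are generic textbook choices, not anything OpenAI has confirmed about the GPT-4 to 4o relationship:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label distillation: push the student's token distribution
    toward the (larger) teacher's distribution."""
    # Soften both distributions with a temperature, then match them with KL divergence.
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable to a hard-label loss.
    return F.kl_div(s, t, reduction="batchmean") * temperature**2

# Toy usage: a batch of 4 token positions over a 32k-token vocabulary.
teacher_logits = torch.randn(4, 32_000)                      # frozen teacher outputs
student_logits = torch.randn(4, 32_000, requires_grad=True)  # trainable student outputs
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()  # gradients flow only into the student
```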
Having just learned about
"Request for Information on the Development of a 2025 National Artificial Intelligence (AI) Research and Development (R&D) Strategic Plan",
and having just considered a "not good idea", I consulted the oracle. The oracle responds:
Healthier ways to get the same message across:
1. Write a single, high‑impact comment that lays out the safety and transparency argument and references "AI 2027" and Kokotajlo’s post (https://blog.ai-futures.org/p/training-agi-in-secret-would-be-unsafe), with citations and constructive recommendations (e.g., disclosure protocols, evaluation benchmarks).
2. Recruit real co‑signers. Circulate a Google doc or GitHub gist; let supporters add names and affiliations. Submit once, listing all endorsers—perfectly legitimate and shows genuine consensus.
3. Coalition letter. Partner with aligned orgs (e.g., Center for AI Safety, CLTR, CSET). Agencies give serious weight to comments backed by established research institutions.
4. Public blog & social push. Publish an open letter, then invite readers to file their own distinct comments—each in their own words, referencing the main letter.
Since I'm not an active researcher, might someone else take the lead on this?
(I am itching to co-sign such a letter, even if I'm underqualified to write it)
The huge ARC-AGI gap between the o3s is because the old version was trained on (more of) the training set than the new version. See https://x.com/polynoamial/status/1914902699021132003
That’s not the only reason; the newly released version was also evaluated using 100x less inference-time compute. See: https://arcprize.org/blog/analyzing-o3-with-arc-agi
You're right. Though I do think the old version's use of the training set is worth mentioning in an edit to the post.
Do you believe that a concurrent paradigm of machine learning could exist based on offline, private instances run within a cottage industry of small GPU clusters?
What if AGI isn't strictly a power-and-compute thing? What if, akin to the society we exist in, conscious awareness requires information that has been subjected to time, chronologically? Data is data to a computer: ones and zeros that take up the same space.
Humans, though, attach a different value, sometimes a whole mythology, to information that is considered to have survived from time immemorial, even if it's no longer useful in its original format.
I think it's at least worth looking at our society's most ancient continually transmitted information, the kind we can identify with the highest fidelity in a scientific manner and fix chronologically with the appropriate methods. I suspect that if we look long enough, we'll find a pattern that resembles a firmware of conscious thought in organic processors.
Might be fun
What does $100M to $1B of RL look like? Is it the same activities as the old RL, but performed by 10,000 people? Are they actually hiring thousands of people to do RL?
My understanding is that the RL environments are mostly synthetic, i.e., they are autograded (maybe with unit tests written by elicited AIs, or humans). So scaling RL just means making more of these environments for more problems.
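To illustrate "autograded", here's a minimal sketch of a unit-test-graded coding environment: the reward is 1 if the model's generated solution passes the tests, 0 otherwise. The function, problem, and tests below are hypothetical, not OpenAI's actual pipeline:

```python
import subprocess
import sys
import tempfile
import textwrap

def autograde(model_solution: str, unit_tests: str, timeout_s: int = 10) -> float:
    """Reward = 1.0 if the model's code passes the unit tests, else 0.0.
    In a real pipeline the tests might themselves be written by an
    elicited model or by human contractors."""
    program = model_solution + "\n\n" + unit_tests
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout_s)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0

# Hypothetical environment: "write an add(a, b) function".
solution = "def add(a, b):\n    return a + b"
tests = textwrap.dedent("""
    assert add(2, 3) == 5
    assert add(-1, 1) == 0
""")
print(autograde(solution, tests))  # 1.0 -> positive reward for the RL update
```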
Many Thanks! Yeah, the two different models for "o3" were particularly confusing.
On my tiny benchmark-ette, I kind-of saw the deterioration from o3-mini-high (Feb) to o3 (Apr).
https://www.astralcodexten.com/p/open-thread-366/comment/90363116 (o3-mini-high)
https://www.astralcodexten.com/p/open-thread-377/comment/109495090 (o3)
Only o3-mini-high has _ever_ gotten the S4 question
>What is an example of a molecule that has an S4 rotation-reflection axis, but neither a center of inversion nor a mirror plane?
right.
It is kind-of weird that some data from 4.5 might be being used in some of the updates to some of the other models... If I take a one-dimensional view of "How close is this to a competent STEMM professional?", the GPQA Diamond scores are probably the closest approximation.
I'm still a bit skeptical of even that measure, since e.g. o1 scores higher than an expert human on GPQA Diamond, yet I'm still seeing failures on questions in my benchmark-ette that a bright undergraduate should be able to answer correctly.
Oh well. Looking forward to the next releases from OpenAI, Google, DeepSeek, Anthropic, and xAI.
During the process of iterated distillation and amplification (IDA) and RL training, wouldn't each successive o-series model take 10x more compute¹ to train?
I'd expect agentic reasoners to scale in roughly the same way; long-horizon tasks would also take more wall-clock time.
BTW: are you planning a next, more polished version of AI 2027?
1. Per OpenAI, o3 took 10x more compute (FLOPs) to train than o1.
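Taking that footnote at face value, a back-of-the-envelope sketch of how training cost would compound if each o-series generation really is ~10x the previous one (the o1 baseline below is a placeholder, not a known figure):

```python
# Hypothetical: assume o1 cost some baseline number of training FLOPs and each
# successive o-series generation costs 10x the previous one, per the "~10x o1 -> o3" claim.
o1_flops = 1e25  # placeholder baseline, not a real figure
for gen, name in enumerate(["o1", "o3", "next o-series (hypothetical)"]):
    print(f"{name}: ~{o1_flops * 10**gen:.0e} training FLOPs")
# o1: ~1e+25, o3: ~1e+26, next o-series (hypothetical): ~1e+27
```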
Last one (hopefully 😂)
If GPT-5 is gonna be released for free users too, but it’s a huge model, does OpenAI just plan on eating the loss? Even with no thinking, it could be quite expensive (like how 4.5 is). And then with thinking, it could be extremely expensive.
Agreed. I basically don't buy that OpenAI will actually serve it for free at high volumes; they'll just eat the loss at lower volumes because it's worth it long term (to drive up market share). It is maybe some evidence for option 3, though, which would imply a cheaper GPT-5.
If o1-pro was a further elicited version of o1, why is it so much more expensive? You could argue profit, but I’ve seen some metrics where it does much better than o3.
Probably the 10x price difference directly reflects a 10x increase in inference compute, so it's just a cons@10 or similar elicitation.
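For reference, cons@10 here would mean something like self-consistency: sample the same reasoner 10 times and return the majority answer, which multiplies inference cost by roughly 10 with no new training. A minimal sketch, with a made-up sample_answer stand-in for the model call:

```python
import random
from collections import Counter

def consensus_at_k(sample_answer, prompt: str, k: int = 10) -> str:
    """cons@k / self-consistency: draw k independent samples and return
    the most common final answer. Inference cost scales roughly linearly in k."""
    answers = [sample_answer(prompt) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

# Hypothetical stand-in for a stochastic model call, for illustration only.
def sample_answer(prompt: str) -> str:
    return random.choice(["42", "42", "42", "41"])  # noisy but usually right

print(consensus_at_k(sample_answer, "What is 6 * 7?"))  # usually "42"
```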
On the gif, you point to a hypothetical model elicited from 4.5 that o1 was distilled from. Can you give more details about this? Was this an internal first try at reasoning? And I assume GPT-5 will be different.
It's not a new model but an elicitation, which just means using more inference-time compute to reach a higher capability level.