OpenAI’s o3 model beats the average human score on a test designed to measure artificial general intelligence (AGI)—the point where machines match human-level thinking.
Yet, as its capabilities shine, a growing question looms over its practicality: Can such models be scaled affordably?
Revolutionary performance at a high cost
Sam Altman, CEO of OpenAI, announced o3 during the company’s “12 Days of OpenAI” event in December 2024. The AI model scored 87.5% on ARC-AGI, a test designed to measure how well computers can think like humans. This score beats the average human score of 80%, marking the first time an AI system has outperformed people on this benchmark.
François Chollet, the creator of the test, described it as a “step-function increase in AI capabilities.” However, he noted that the high performance came at a steep price – each puzzle solution cost more than $1,000 in computing power.
The o3 model works differently from previous AI systems. Instead of giving quick answers, it takes time to “think” through problems, sometimes spending 10 to 15 minutes analyzing a single question. OpenAI relied on test-time scaling, where additional compute power is applied during inference. This approach helps solve complex problems but is too expensive for routine tasks.
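The idea behind test-time scaling can be illustrated with a simple self-consistency scheme: sample many independent answers from the same fixed model and take a majority vote, so accuracy improves as more inference compute is spent. This is a minimal sketch of that general approach, not OpenAI's actual (unpublished) method; the `solve_once` solver and the puzzle data are purely hypothetical stand-ins.

```python
import random
from collections import Counter

def solve_once(puzzle, rng):
    # Stand-in for one sampled reasoning chain: a noisy solver that is
    # correct only 40% of the time. (Hypothetical; o3's internals are not public.)
    if rng.random() < 0.4:
        return puzzle["answer"]
    return rng.choice(puzzle["distractors"])

def solve_with_test_time_compute(puzzle, samples, seed=0):
    # Test-time scaling: spend more inference compute (more sampled
    # attempts) on the same fixed model, then aggregate by majority vote.
    rng = random.Random(seed)
    attempts = [solve_once(puzzle, rng) for _ in range(samples)]
    return Counter(attempts).most_common(1)[0][0]

puzzle = {"answer": "grid_A", "distractors": ["grid_B", "grid_C", "grid_D"]}

# With one sample the answer is unreliable; with many samples the
# majority vote almost always lands on the correct answer.
for samples in (1, 10, 1000):
    print(samples, solve_with_test_time_compute(puzzle, samples))
```

The trade-off the article describes falls out directly: each extra sample costs real compute at inference time, so reliability is bought with money on every single query rather than once at training time.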
Economic challenges of scalability
OpenAI has yet to release detailed pricing, but the high costs associated with o3 could limit its adoption. For context, human labor for similar tasks would cost just $5 per task. This stark comparison has led critics such as Gary Marcus to question the model's practicality, arguing that its use may remain limited to well-funded institutions or niche applications.
Jack Clark, co-founder of AI company Anthropic, explained that unlike traditional software, which becomes cheaper as it improves, the o3 model’s operating costs remain high because it requires significant computing power for each task it performs.
The model showed impressive abilities in areas like mathematics and scientific reasoning. However, it still struggled with basic tasks that humans find easy.
“While the new model is very impressive and represents a big milestone on the way towards AGI, I don’t believe this is AGI – there’s still a fair number of very easy [ARC Challenge] tasks that o3 can’t solve,” Chollet said.
Altman sets his sights on superintelligence
Altman remains optimistic. “We are confident we know how to build AGI,” he wrote in a recent blog post. “We believe that, in 2025, we may see the first AI agents ‘join the workforce’ and materially change the output of companies.”
He also wrote that the company is aiming for superintelligence beyond AGI. “With superintelligence, we can do anything else,” noted Altman. “Superintelligent tools could massively accelerate scientific discovery and innovation well beyond what we are capable of doing on our own.”
For now, o3 remains in testing, with access limited to safety researchers and AI institutions. OpenAI hasn’t announced when the public might be able to use it, but the company plans to release a smaller, more efficient version, o3-mini, first.
The breakthrough shows both the progress and challenges in AI development. While o3 can match human performance in some areas, its high operating costs mean it might only be practical for specialized tasks where the benefits justify the expense.