OpenAI’s o3 achieves human-level problem solving at $1,000 per puzzle


Published 8 Jan 2025


OpenAI’s o3 model beats the average human score on a test designed to measure artificial general intelligence (AGI)—the point where machines match human-level thinking.

Yet, as its capabilities shine, a growing question looms over its practicality: Can such models be scaled affordably?

Revolutionary performance at a high cost

Sam Altman, CEO of OpenAI, announced o3 during the company’s “12 Days of OpenAI” event in December 2024. The AI model scored 87.5% on ARC-AGI, a test designed to measure how well computers can think like humans. This score beats the average human score of 80%, marking the first time an AI system has outperformed people on this benchmark.

Chart: OpenAI’s o-series performance on the ARC-AGI benchmark (Source: ARC Prize)

François Chollet, the creator of the test, described the result as a “step-function increase in AI capabilities.” However, he noted that the high performance came at a steep price: each puzzle solution cost more than $1,000 in computing power.

The o3 model works differently from previous AI systems. Instead of giving quick answers, it takes time to “think” through problems, sometimes spending 10 to 15 minutes analyzing a single question. OpenAI relied on test-time scaling, in which additional computing power is applied during inference rather than during training. This approach helps solve complex problems but is too expensive for routine tasks.
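The idea behind test-time scaling can be illustrated with a toy sketch. Everything here is hypothetical (the `noisy_solver` stand-in, the probabilities, the majority-vote strategy) and is not OpenAI’s actual method; it only shows the general trade: spend more compute per query by sampling many candidate solutions, and accuracy goes up.

```python
import random
from collections import Counter

def noisy_solver(problem, rng):
    # Hypothetical stand-in for a reasoning model: returns the correct
    # answer 40% of the time, otherwise a random wrong answer.
    if rng.random() < 0.4:
        return problem["answer"]
    return rng.choice(problem["distractors"])

def solve_with_test_time_compute(problem, n_samples, seed=0):
    # Test-time scaling (sketch): sample many candidate solutions for the
    # same query and keep the most common one (majority vote). More samples
    # means more inference-time compute, and a more reliable answer.
    rng = random.Random(seed)
    votes = Counter(noisy_solver(problem, rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

problem = {"answer": "42", "distractors": ["7", "13", "99"]}
# A single sample is unreliable; with many samples the correct answer
# dominates the vote, because wrong answers are spread across distractors.
print(solve_with_test_time_compute(problem, n_samples=1))
print(solve_with_test_time_compute(problem, n_samples=101))
```

The cost implication is direct: 101 samples cost roughly 101 times the compute of one, which is why this approach pays off only on problems hard enough to justify the bill.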

Economic challenges of scalability

OpenAI has yet to release detailed pricing, but the high costs associated with o3 could limit its adoption. For context, human labor for similar tasks would cost just $5 per assignment. This stark comparison has led critics like Gary Marcus to question the model’s practicality. Marcus argues that its application might remain limited to well-funded institutions or niche use cases.
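The gap can be made concrete with a back-of-the-envelope calculation using the two figures cited in this article (both are reported numbers, not official pricing):

```python
# Per-task cost comparison using the figures cited in the article.
o3_cost_per_puzzle = 1_000.0   # USD, "more than $1,000 in computing power"
human_cost_per_task = 5.0      # USD, cited cost of human labor for similar tasks

ratio = o3_cost_per_puzzle / human_cost_per_task
print(f"o3 costs roughly {ratio:.0f}x human labor per task")  # roughly 200x
```

At a 200-fold premium, the model only makes economic sense where speed, scale, or capability beyond a human worker justifies the expense.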

Jack Clark, co-founder of AI company Anthropic, explained that unlike traditional software, which becomes cheaper as it improves, the o3 model’s operating costs remain high because it requires significant computing power for each task it performs.

The model showed impressive abilities in areas like mathematics and scientific reasoning. However, it still struggled with basic tasks that humans find easy.

“While the new model is very impressive and represents a big milestone on the way towards AGI, I don’t believe this is AGI – there’s still a fair number of very easy [ARC Challenge] tasks that o3 can’t solve,” Chollet said.

Altman gears up for superintelligence

Altman remains optimistic. “We are confident we know how to build AGI,” he wrote in a recent blog post. “We believe that, in 2025, we may see the first AI agents ‘join the workforce’ and materially change the output of companies.”

He also mentioned that the company is setting its sights on superintelligence beyond AGI. “With superintelligence, we can do anything else,” Altman noted. “Superintelligent tools could massively accelerate scientific discovery and innovation well beyond what we are capable of doing on our own.”

For now, o3 remains in testing, with access limited to safety researchers and AI institutions. OpenAI hasn’t announced when the public will be able to use it, but it is releasing a smaller, more efficient version called o3-mini first.

The breakthrough shows both the progress and the challenges in AI development. While o3 can match human performance in some areas, its high operating costs mean it might only be practical for specialized tasks where the benefits justify the expense.