Recently, a new achievement in the AI field, the o3 model, has garnered significant attention due to its exceptional performance. However, its high computational costs have also become a focal point of discussion.
It is understood that the o3 model achieved remarkable results on the ARC-AGI benchmark. However, the high-scoring configuration of o3 consumed over $1,000 in compute per task, far more than the efficient configuration. By comparison, the o1 model requires only about $5 per task, and o1-mini just a few cents.
According to François Chollet, the creator of the ARC-AGI benchmark, the high-scoring configuration used approximately 170 times the compute of the efficient version of o3 to reach its 88% score, while the efficient version scored only about 12 points lower. By that estimate, the high-scoring version of o3 consumed well over $10,000 in compute to complete the test, making it too expensive to qualify for the ARC Prize competition, which challenges AI models to beat the ARC benchmark.
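The cost-performance trade-off above can be sketched with the figures quoted in the article (the 170x multiplier and the roughly 12-point score gap are Chollet's estimates). The article does not state the per-task cost of the efficient o3 configuration, so the `efficient_cost` value below is a hypothetical placeholder used only to illustrate the arithmetic:

```python
# Back-of-the-envelope comparison of compute cost vs. ARC-AGI score,
# using figures quoted in the article.

COMPUTE_MULTIPLIER = 170   # high-compute o3 vs. efficient o3 (Chollet's estimate)
HIGH_SCORE = 88            # % score of the high-compute configuration
SCORE_GAP = 12             # points by which the efficient version trails

efficient_score = HIGH_SCORE - SCORE_GAP

def cost_per_point(cost_per_task: float, score: float) -> float:
    """Dollars of per-task compute spent per percentage point of score."""
    return cost_per_task / score

# Hypothetical per-task cost for the efficient configuration (an assumption,
# NOT a figure from the article).
efficient_cost = 20.0
high_cost = efficient_cost * COMPUTE_MULTIPLIER

print(f"Efficient o3:    ~${efficient_cost:.0f}/task for {efficient_score}%")
print(f"High-compute o3: ~${high_cost:.0f}/task for {HIGH_SCORE}%")
print(f"Cost per point (efficient): ${cost_per_point(efficient_cost, efficient_score):.2f}")
print(f"Cost per point (high):      ${cost_per_point(high_cost, HIGH_SCORE):.2f}")
```

The point of the sketch is that the cost per percentage point of score grows far faster than the score itself: a 170x compute increase buys only about 12 additional points under these figures.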
Despite this, Chollet still considers o3 a groundbreaking model in the AI field. He notes that o3 can adapt to tasks it has never encountered before, achieving near-human performance on ARC-AGI. However, this versatility comes at a cost that is not yet economically viable: by comparison, paying a human to solve an ARC-AGI task costs around $5 per task, while the energy cost is just a few cents.
There are differing opinions in the industry regarding the computational costs of o3 and its subsequent versions. Some argue that, given the significant drop in AI model prices over the past year and the fact that OpenAI has not yet disclosed the actual costs of o3, it is too early to discuss specific prices. Nevertheless, these costs do reflect the substantial computational resources required to break through the current performance barriers of AI models.
The practical applications of o3 and its successors have also sparked discussion. Because of their high computational costs, o3 and its subsequent versions are unlikely to become everyday tools like GPT-4 or Google Search. Instead, these models may be better suited to high-value tasks that justify substantial compute, such as strategic planning.
Additionally, some experts point out that o3 is not artificial general intelligence (AGI) and can still fail at simple tasks. Moreover, large language models continue to suffer from severe hallucination issues, which neither o3 nor test-time compute has resolved.
To reduce the cost and improve the efficiency of test-time compute, some startups are developing better AI inference chips. These chips are expected to play a more significant role in test-time compute in the future.