OpenAI Event 12: Preview of New Inference Models o3 and o3-mini

2024-12-23

On the final day of the "12 Days of OpenAI" event, OpenAI provided a preview of its latest advanced reasoning models, o3 and o3-mini. Media outlets had previously reported that new reasoning models would be unveiled during this event.

Although these models have not yet been officially released (the company acknowledges that the final versions may undergo further post-training adjustments), OpenAI has begun accepting applications from the research community to test them ahead of their public release, whose exact date is still to be determined. Notably, OpenAI launched o1 (codenamed Strawberry) in September and skipped the name o2 to avoid confusion or trademark conflicts with the UK telecom company O2.

In the AI industry, the term "reasoning" has become common, referring to the process by which machines break down instructions into smaller tasks to produce stronger results. These models typically demonstrate the steps they take to reach an answer, rather than simply providing the final result without explanation.

According to OpenAI, o3 surpasses its predecessors across a range of benchmarks. On the SWE-bench Verified software-engineering benchmark, o3 outscored its predecessor by 22.8 percentage points, and in competitive programming it even outperformed OpenAI's chief scientist. In the highly challenging AIME 2024 math competition, o3 missed only one question, achieving a near-perfect score. On GPQA Diamond, a benchmark of expert-level science questions, o3 scored 87.7%. And on EpochAI's frontier-level mathematics benchmark, among the most difficult mathematical and reasoning challenges available, o3 solved 25.2% of the problems, while other models had a success rate of less than 2%.

Additionally, OpenAI announced new research on deliberative alignment, a technique that has AI models reason explicitly through safety decisions. Instead of being governed by simple yes-or-no rules, the model is required to actively reason about whether a user's request complies with OpenAI's safety policies. The company reports that when the technique was tested on o1, the model's ability to follow safety guidelines surpassed that of previous models, including GPT-4.
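The contrast between rule-based filtering and deliberative alignment can be pictured in a few lines. The sketch below is a hypothetical illustration only: the policy text, prompt wording, and function names are assumptions for exposition, not OpenAI's actual implementation.

```python
# Hypothetical sketch contrasting a naive yes-or-no filter with the
# deliberative-alignment idea: prompting the model to reason over a written
# safety policy before answering. All wording here is illustrative.

SAFETY_POLICY = "Refuse requests that facilitate harm; answer benign requests helpfully."

def simple_filter(request: str) -> bool:
    """A naive yes-or-no rule: block anything containing a flagged keyword."""
    return "weapon" not in request.lower()

def deliberation_prompt(request: str) -> str:
    """Build a prompt asking the model to reason step by step about policy compliance."""
    return (
        f"Safety policy: {SAFETY_POLICY}\n"
        f"User request: {request}\n"
        "First, reason step by step about whether this request complies with "
        "the policy above. Then either answer it or refuse, citing the "
        "relevant policy clause."
    )
```

The keyword filter over-rejects legitimate requests (e.g. a history question mentioning weapons) and misses rephrased unsafe ones; the deliberative approach instead delegates the judgment to the model's own reasoning over the policy text.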

The o3 and o3-mini models previewed by OpenAI set new marks in both technical capability and safety. The o3 series excels in coding, mathematics, and scientific reasoning while incorporating advanced safety features. Specifically, o3 outperformed previous models in programming (with a Codeforces rating of 2727), mathematics (achieving 96.7% accuracy in the AIME 2024 competition), and science (scoring 87.7% on the GPQA Diamond benchmark).

In EpochAI's advanced mathematics benchmark, o3 solved 25.2% of the problems, compared to a maximum accuracy of 2% for previous models. In the ARC-AGI benchmark, o3 scored 87.5%, surpassing human performance and marking a significant milestone in conceptual reasoning.

Meanwhile, o3-mini, a streamlined version of o3, is optimized for coding tasks to improve efficiency. While maintaining strong performance, o3-mini reduces computational cost and supports three adjustable reasoning-effort settings (low, medium, and high), allowing it to be applied flexibly across different tasks.
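The effort setting can be pictured as a per-request parameter. The sketch below is a hypothetical illustration assuming an OpenAI-style chat-completions payload; the `reasoning_effort` field name and the `o3-mini` model identifier are assumptions, since no public API for these models existed at the time of the preview.

```python
# Hypothetical sketch of selecting o3-mini's reasoning-effort level on a
# per-request basis. The payload shape, field names, and model identifier
# are assumptions based on the preview, not a confirmed API.

def build_request(prompt: str, effort: str = "medium") -> dict:
    """Build a chat-completions-style payload with an adjustable effort level."""
    if effort not in ("low", "medium", "high"):
        raise ValueError(f"unknown reasoning effort: {effort!r}")
    return {
        "model": "o3-mini",  # assumed model identifier
        "reasoning_effort": effort,  # low = cheaper/faster, high = deeper reasoning
        "messages": [{"role": "user", "content": prompt}],
    }

# Example: a routine coding question at low effort to save compute.
request = build_request("Write a function that reverses a string.", effort="low")
```

The design idea is that callers trade latency and cost against reasoning depth per task, rather than the provider fixing one setting for every request.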

OpenAI stated that it will take a cautious approach to releasing o3. The company plans to first open both models to safety researchers for testing, with applications due by January 10, 2025. o3-mini is expected to launch officially around the end of January, with o3 to follow.

The deliberative alignment technique, for its part, marks a notable advance in AI safety: by leveraging the model's reasoning capabilities, it better identifies and handles potentially unsafe prompts, more accurately rejecting inappropriate requests while avoiding over-refusal of legitimate ones.