A research team has developed a large language model (LLM) that outperforms OpenAI’s o1-preview in certain tasks at a fraction of the cost.
Researchers from Stanford University and the University of Washington detailed their findings in a paper published last Friday. The new model, named s1-32B, is available on GitHub.
In September of last year, OpenAI introduced o1-preview, a reasoning-focused LLM. A key innovation of this model is a technique known as test-time computation, which the creators of the new open-source s1-32B model refer to as test-time scaling. The technique improves the quality of an LLM’s output by allocating more time and hardware resources to generating a response.
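In pseudocode terms, the idea amounts to sweeping a reasoning budget and measuring how answer quality responds. The sketch below illustrates this; the `generate` and `grade` helpers are hypothetical stubs, not code from OpenAI or the s1-32B researchers:

```python
# Toy sketch of test-time scaling: run the same model with progressively
# larger reasoning budgets and compare accuracy. `generate` and `grade`
# are hypothetical stubs, not code from either research team.

def generate(question: str, max_thinking_tokens: int) -> str:
    """Stub: query a reasoning LLM with a cap on its chain of thought."""
    raise NotImplementedError("wire up a model API here")

def grade(answer: str, reference: str) -> bool:
    """Stub: check the model's answer against a reference solution."""
    raise NotImplementedError

def accuracy_at_budget(dataset: list[tuple[str, str]], budget: int) -> float:
    correct = sum(grade(generate(q, budget), ref) for q, ref in dataset)
    return correct / len(dataset)

# Under test-time scaling, accuracy should climb as the budget grows:
# for budget in (512, 1024, 2048, 4096):
#     print(budget, accuracy_at_budget(eval_set, budget))
```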
Following the release of o1-preview, several research groups attempted to replicate test-time scaling. In their paper, the creators of s1-32B claim their LLM represents the first publicly disclosed successful replication of "explicit test-time scaling behavior."
"Our model, s1-32B, exhibits test-time scaling," the researchers wrote in their paper. "Additionally, s1-32B is the most sample-efficient inference model, outperforming closed-source models like OpenAI's o1-preview."
The project began with Qwen2.5-32B-Instruct, an open-source LLM released last year by Alibaba Group. The researchers fine-tuned Qwen2.5-32B-Instruct on a dataset of 1,000 prompts and AI-generated answers sourced from Google LLC's Gemini Flash Thinking Experimental model.
Rather than simply providing direct answers, Gemini Flash Thinking Experimental surfaces the reasoning behind its responses, producing natural-language summaries of each step in its reasoning process. Those summaries were incorporated into the training dataset for s1-32B alongside the 1,000 example prompts and the corresponding AI-generated answers.
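To illustrate, each training example in this kind of distillation pairs a question with the teacher model's reasoning trace and final answer. A minimal sketch of one such record follows; the field names and the `<think>` delimiter are chosen for illustration rather than taken from the paper:

```python
# Sketch of a distillation record: a question plus the teacher model's
# reasoning trace and final answer. Field names and the <think> delimiter
# are illustrative assumptions, not the s1 paper's actual schema.
example = {
    "question": "How many positive divisors does 360 have?",
    "reasoning": (
        "Factor 360 = 2^3 * 3^2 * 5. "
        "The divisor count is (3+1)(2+1)(1+1) = 24."
    ),
    "answer": "24",
}

# Fine-tuning trains the student to emit the trace before the answer,
# e.g. by concatenating the fields into a single supervised target:
target = f"<think>{example['reasoning']}</think>\n{example['answer']}"
print(target)
```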
The dataset was created through a multistep process. First, the researchers collected 59,029 questions covering topics such as mathematics, physics, and chemistry from public sources. They then removed questions containing errors and filtered the remainder down to the 1,000 most challenging questions.
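A schematic of that winnowing might look like the sketch below; both filter predicates are placeholders, since the paper defines its own quality and difficulty criteria:

```python
# Schematic of the dataset winnowing: ~59,000 raw questions are filtered
# for quality, then ranked by difficulty to keep the 1,000 hardest.
# Both predicates below are placeholders, not the paper's actual criteria.

def is_well_formed(question: dict) -> bool:
    """Stub quality filter: drop questions flagged as containing errors."""
    return "error" not in question.get("flags", [])

def difficulty(question: dict) -> float:
    """Stub difficulty score, e.g. how often baseline models miss it."""
    return question.get("baseline_failure_rate", 0.0)

def winnow(raw_questions: list[dict], keep: int = 1000) -> list[dict]:
    clean = [q for q in raw_questions if is_well_formed(q)]
    clean.sort(key=difficulty, reverse=True)  # hardest first
    return clean[:keep]
```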
After training s1-32B on the dataset, the researchers applied a novel machine learning technique they call budget forcing. It involves steering the LLM to either think longer about a question than it otherwise would or to cut its reasoning process short. According to the researchers, the method addresses the two primary challenges of implementing test-time scaling in LLMs.
The first challenge arises when an LLM spends too little time thinking about a task, which leads to errors. Budget forcing solves this by appending the word "Wait" to the model's output when s1-32B hasn’t spent enough time processing a query. According to the creators of s1-32B, this intervention prompts the LLM to extend and double-check its reasoning.
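A rough sketch of that intervention, assuming a model API with an end-of-thinking delimiter; the delimiter string and the `continue_generation` helper are illustrative assumptions, not the researchers' code:

```python
# Sketch of the "think longer" side of budget forcing: if the model tries
# to close its reasoning before a minimum budget is spent, hold back the
# end-of-thinking delimiter and append "Wait" so it keeps reasoning.

END_OF_THINKING = "</think>"  # illustrative delimiter, an assumption

def continue_generation(context: str, stop: str) -> str:
    """Stub: have the LLM extend `context` until it emits `stop`."""
    raise NotImplementedError("wire up a model API here")

def think_at_least(prompt: str, min_thinking_tokens: int) -> str:
    trace = ""
    while True:
        trace += continue_generation(prompt + trace, stop=END_OF_THINKING)
        if len(trace.split()) >= min_thinking_tokens:
            return trace  # budget satisfied; let the model wrap up
        trace += " Wait"  # nudge the model to re-examine its reasoning
```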
In one test, s1-32B initially arrived at an incorrect answer to a user prompt. After "Wait" was appended, the model re-examined its work, recognized the error and generated the correct response.
The second issue budget forcing addresses is when an LLM spends too much time thinking about a prompt, which can reduce output quality. For instance, an LLM might find the correct answer to a prompt but then alter it during subsequent reasoning steps. Budget forcing avoids this problem by cutting the reasoning process short, prompting the LLM to skip those subsequent steps and deliver its final answer.
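The complementary half of the sketch, reusing the stubs above: when the budget runs out, the trace is truncated and the delimiter plus an answer cue are forced in, so the model commits to an answer rather than revising it further. The "Final Answer:" cue is likewise an illustrative assumption:

```python
# Sketch of the "stop early" side of budget forcing, reusing the stubs
# above: once the thinking budget is exhausted, truncate the trace and
# force the end-of-thinking delimiter plus an answer cue, so the model
# answers instead of second-guessing itself in further reasoning steps.

def cap_thinking(prompt: str, max_thinking_tokens: int) -> str:
    trace = continue_generation(prompt, stop=END_OF_THINKING)
    tokens = trace.split()
    if len(tokens) > max_thinking_tokens:
        trace = " ".join(tokens[:max_thinking_tokens])
    return prompt + trace + END_OF_THINKING + "\nFinal Answer:"
```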
The researchers compared s1-32B with o1-preview on the MATH and AIME24 math benchmarks, where s1-32B scored up to 27% higher than OpenAI’s LLM. In another test, s1-32B improved its score on AIME24 math problems from 50% to 57% by scaling up its test-time computation.
Budget forcing not only enables s1-32B to outperform o1-preview in certain tasks, but also does so at far lower cost. Niklas Muennighoff, one of the researchers involved in the project, said the compute used to train the model cost approximately $20. The researchers explained in their paper that s1-32B was trained for 26 minutes on 16 Nvidia H100 graphics cards.
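The arithmetic behind that estimate is easy to check, assuming a typical cloud rental rate of roughly $2 to $3 per H100 GPU-hour (our assumption, not a figure from the paper):

```python
# Back-of-the-envelope check on the training cost: 16 H100 GPUs running
# for 26 minutes, priced at an assumed $2-3 per GPU-hour (rates vary).
gpu_hours = 16 * 26 / 60  # about 6.9 GPU-hours in total
for rate in (2.0, 3.0):
    print(f"${gpu_hours * rate:.2f} at ${rate:.2f}/GPU-hour")
# Prints roughly $13.87 and $20.80, consistent with the ~$20 estimate.
```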