NovaSky, a research team at the Sky Computing Lab at the University of California, Berkeley, recently unveiled a new reasoning model called Sky-T1-32B-Preview. The model is competitive with an early version of OpenAI's o1 on several key benchmarks. Notably, Sky-T1 is the first truly open-source reasoning model that can be replicated from scratch: the NovaSky team has publicly released both the training dataset and the training code.
Training Sky-T1-32B-Preview cost less than $450, which shows that advanced reasoning capabilities can be replicated affordably and efficiently. While $450 is not trivial, it is a dramatic reduction from the millions of dollars previously required to train models with comparable performance. The reduction is partly attributable to synthetic training data, that is, training data generated by other models. For instance, Palmyra X 004, a model recently released by the AI company Writer and trained almost entirely on synthetic data, reportedly cost only $700,000 to develop.
Reasoning models differ from most other AI models in that they effectively fact-check themselves, which helps them avoid common errors. They typically take longer than ordinary models to reach a solution, from a few extra seconds to several minutes, but they tend to be more reliable in fields such as physics, science, and mathematics.
To build Sky-T1, the NovaSky team first used Alibaba's QwQ-32B-Preview reasoning model to generate the initial training data, then filtered and refined it. They then used OpenAI's GPT-4o-mini to reformat the data so that it was easier to process. Training the 32-billion-parameter Sky-T1 model took approximately 19 hours on a rack of 8 Nvidia H100 GPUs.
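As a rough sanity check, the reported figures (8 GPUs, about 19 hours, under $450) can be combined into a total GPU-hour count and an implied price ceiling per GPU-hour. The sketch below assumes the $450 figure covers only GPU rental for the final training run, which the article does not state. The function names are illustrative and not part of any NovaSky code.

```python
# Figures as reported in the article.
NUM_GPUS = 8             # Nvidia H100 GPUs
TRAINING_HOURS = 19      # approximate wall-clock training time
REPORTED_COST_USD = 450  # upper bound: training cost "less than $450"


def gpu_hours(num_gpus: int, hours: float) -> float:
    """Total GPU-hours consumed by a run that uses every GPU for the full duration."""
    return num_gpus * hours


def implied_hourly_rate(total_cost: float, num_gpus: int, hours: float) -> float:
    """Cost per GPU-hour, ASSUMING the total cost is GPU rental only."""
    return total_cost / gpu_hours(num_gpus, hours)


if __name__ == "__main__":
    total = gpu_hours(NUM_GPUS, TRAINING_HOURS)
    rate = implied_hourly_rate(REPORTED_COST_USD, NUM_GPUS, TRAINING_HOURS)
    print(f"{total} GPU-hours, at most ${rate:.2f} per GPU-hour")
```

Under that assumption, the run amounts to 152 GPU-hours at no more than about $2.96 per GPU-hour.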
In terms of performance, Sky-T1 outperformed an early preview version of o1 on MATH500, a set of math challenges, and also beat the o1 preview on a set of difficult problems from the LiveCodeBench coding evaluation. On GPQA-Diamond, however, which includes questions on physics, biology, and chemistry, Sky-T1 scored below the o1 preview. In addition, OpenAI's officially released o1 model outperforms its preview version, and OpenAI is expected to release an even more powerful reasoning model, o3, in the coming weeks.
Even so, the NovaSky team emphasized that Sky-T1 is only the start of its effort to develop advanced open-source models with strong reasoning capabilities. Going forward, the team plans to focus on building more efficient models that maintain strong reasoning performance, and on exploring advanced techniques that further improve efficiency and accuracy at test time.