A team of researchers at OpenAI has recently published a paper introducing a novel model known as the Continuous-Time Consistency Model (sCM), which generates multimedia content such as images, video, and audio roughly 50 times faster than traditional diffusion models. Specifically, sCM can produce an image in approximately 0.1 seconds, whereas conventional diffusion models require over 5 seconds.
With sCM, OpenAI reports sample quality comparable to traditional diffusion models using only two sampling steps, accelerating generation without compromising quality.
This innovation is detailed by Cheng Lu and Yang Song in a preprint on arXiv.org and in a blog post released today. Their approach allows the model to generate high-quality samples in just two steps, far fewer than the hundreds of steps diffusion models have typically required.
Yang Song is also the lead author of a 2023 paper by OpenAI researchers, including former Chief Scientist Ilya Sutskever, which introduced the concept of "consistency models." The core idea is that all points along the same sampling trajectory map back to that trajectory's initial point.
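In the notation of that 2023 paper, this means a consistency function maps every noisy point on a probability-flow ODE trajectory back to the trajectory's origin, so any two points on the same trajectory produce the same output. A minimal statement of this self-consistency property (symbols follow the consistency models paper, with noise levels between a small $\epsilon$ and a maximum $T$) is:

```latex
f_\theta(\mathbf{x}_t, t) = \mathbf{x}_\epsilon
  \quad \text{for all } t \in [\epsilon, T],
\qquad\text{hence}\qquad
f_\theta(\mathbf{x}_t, t) = f_\theta(\mathbf{x}_{t'}, t')
  \quad \text{for all } t, t' \in [\epsilon, T].
```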
Although diffusion models have achieved remarkable success in generating realistic images, 3D models, audio, and video, their sampling is slow, often requiring dozens to hundreds of sequential steps, which makes them poorly suited to real-time applications.
Theoretically, this technology lays the groundwork for OpenAI to develop near real-time AI image generation models.
Traditional diffusion models need many denoising steps to create a sample, which makes generation slow. In contrast, sCM transforms noise directly into a high-quality sample in one or two steps, cutting both computational cost and latency.
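To make the contrast concrete, here is a schematic Python sketch (not OpenAI's code) of the two sampling styles; the `model` interface, the `sigmas` noise schedule, and the Euler-style update are illustrative assumptions.

```python
import torch

def diffusion_sample(model, sigmas, shape):
    """Schematic iterative sampler: one network call per denoising step."""
    x = torch.randn(shape) * sigmas[0]               # start from pure noise
    for i in range(len(sigmas) - 1):                 # often dozens to hundreds of steps
        denoised = model(x, sigmas[i])               # network predicts the clean image
        d = (x - denoised) / sigmas[i]               # Euler direction toward the data
        x = x + d * (sigmas[i + 1] - sigmas[i])      # move to the next noise level
    return x

def consistency_sample(model, sigma_max, shape, sigma_mid=None):
    """Schematic one- or two-step sampler: the network maps noise straight to data."""
    x = torch.randn(shape) * sigma_max
    sample = model(x, sigma_max)                     # step 1: noise -> sample
    if sigma_mid is not None:                        # optional second step
        x = sample + torch.randn(shape) * sigma_mid  # re-noise to an intermediate level
        sample = model(x, sigma_mid)                 # step 2: refine
    return sample
```

The key difference is simply the number of network evaluations: the loop in the first function runs once per noise level, while the second function calls the network at most twice.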
OpenAI's largest sCM model comprises 1.5 billion parameters and can generate a sample in just 0.11 seconds on a single A100 GPU, a roughly 50-fold wall-clock speedup over comparable diffusion models, making real-time AI generation applications more feasible.
The sCM team trained a continuous-time consistency model on ImageNet 512×512 and scaled it up to 1.5 billion parameters. Even at this scale, its sample quality rivals that of the best diffusion models, reaching a Fréchet Inception Distance (FID) of 1.88 on ImageNet 512×512.
That score puts sCM within 10% of the best diffusion models on quality, even though those models need substantially more sampling compute to reach similar results.
OpenAI's new approach has been extensively benchmarked against other state-of-the-art generative models. By comparing FID scores against effective sampling compute, the researchers show that sCM delivers top-tier sample quality while significantly reducing computational overhead.
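FID itself is computed from the mean and covariance of Inception-network features extracted from real and generated images. A minimal NumPy/SciPy version of the standard formula, not tied to OpenAI's evaluation code and assuming the feature matrices have already been extracted, looks like this:

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake):
    """FID between two feature matrices of shape (N, D) each."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)  # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real                   # discard tiny imaginary parts
    diff = mu_r - mu_f
    return diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean)
```

Lower is better: a score of 1.88 means the feature statistics of generated images sit very close to those of the real ImageNet data.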
While previous fast sampling methods struggled with reduced sample quality or complex training setups, sCM overcomes these challenges, delivering both speed and high fidelity.
The success of sCM is also attributed to its ability to scale in proportion to the teacher diffusion model from which it distills knowledge. As both the student and the teacher grow, the sample-quality gap shrinks, and adding sampling steps to sCM narrows it further.
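Very roughly, consistency distillation trains the student so that its outputs at two adjacent points of the teacher's denoising trajectory agree. The sketch below shows a simplified discrete-time version of such a loss; it is an illustrative assumption and does not reproduce OpenAI's continuous-time formulation, and all names (`student`, `ema_student`, `teacher`, the noise levels) are hypothetical.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student, ema_student, teacher, x0, sigma_hi, sigma_lo):
    """Simplified discrete-time consistency-distillation loss for one batch.

    x0: clean training images; sigma_hi > sigma_lo are adjacent noise levels.
    """
    noise = torch.randn_like(x0)
    x_hi = x0 + sigma_hi * noise                 # noisy point at the higher level
    denoised = teacher(x_hi, sigma_hi)           # teacher predicts the clean image
    d = (x_hi - denoised) / sigma_hi             # one Euler step along the teacher ODE
    x_lo = x_hi + d * (sigma_lo - sigma_hi)      # adjacent point at the lower level
    out_hi = student(x_hi, sigma_hi)             # student output at the higher level
    with torch.no_grad():
        target = ema_student(x_lo, sigma_lo)     # EMA copy of the student gives the target
    return F.mse_loss(out_hi, target)            # push the two outputs to agree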
The rapid sampling and scalability of sCM models open up new possibilities for real-time generative AI across various domains. From image generation to audio and video synthesis, sCM provides practical solutions for applications requiring fast, high-quality output.
Furthermore, OpenAI's research suggests the potential for further system optimizations, which could enhance performance even more, enabling these models to meet the specific needs of various industries.