Stability AI Unveils Stable Cascade: A Novel, Efficient Architecture for AI Image Generation

2024-02-20

Stability AI has announced Stable Cascade, a new text-to-image architecture that prioritizes image quality, flexibility, and hardware efficiency. It is built on a three-stage pipeline of separate neural networks and, according to the company, achieves state-of-the-art results while working in a highly compressed latent space, which makes training and fine-tuning feasible on consumer-grade GPUs. Stability AI expects this to open AI image generation, enhancement, and experimentation to more users than before.
The key to these capabilities is latent-space compression, where the latent space is the model's compact internal representation of an image. The pipeline has three stages: the latent generator (Stage C) converts the user's input into a compact 24x24 latent, and the latent decoders (Stages A and B) handle image compression and turn that highly compressed latent into the final high-quality image. Because the design is modular, each stage can be fine-tuned individually.
By decoupling text-conditioned generation from high-resolution decoding, Stability AI reports a 16-fold reduction in training cost compared with similarly sized models, which makes the technology both more affordable and more adaptable to a wider range of applications. For most purposes, users are encouraged to focus their efforts on Stage C, using the provided training scripts and the ControlNet and LoRA training capabilities.
Stable Cascade ships two Stage C models (1B and 3.6B parameters) and two Stage B models (700M and 1.5B parameters); the 3.6B Stage C model is recommended for users who want the highest output quality. Even with three stages, the VRAM requirement for inference remains relatively low, at approximately 20GB.
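To put the 24x24 latent and the model sizes in perspective, the following is a minimal back-of-the-envelope sketch, not part of the release. It assumes a 1024x1024 input image and 16-bit (2-byte) weights, neither of which is stated in this article, and the helper names are hypothetical. It covers only the Stage C and Stage B weights and ignores Stage A, the text encoder, and activations, so it is a lower bound rather than a VRAM prediction.

```python
# Rough arithmetic only: assumptions are marked in the comments below.

def spatial_compression(image_side: int, latent_side: int) -> float:
    """Ratio between image side length and latent side length."""
    return image_side / latent_side


def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


IMAGE_SIDE = 1024   # ASSUMPTION: input resolution, not given in the article
LATENT_SIDE = 24    # from the article: Stage C works on a 24x24 latent

factor = spatial_compression(IMAGE_SIDE, LATENT_SIDE)
positions = LATENT_SIDE ** 2

# Parameter counts quoted in the article.
STAGE_C = {"1B": 1.0e9, "3.6B": 3.6e9}
STAGE_B = {"700M": 0.7e9, "1.5B": 1.5e9}

print(f"spatial compression per side: {factor:.1f}x")
print(f"latent positions: {positions}")

# ASSUMPTION: 2 bytes per parameter (for example fp16 or bf16 weights).
for stage, variants in (("Stage C", STAGE_C), ("Stage B", STAGE_B)):
    for name, params in variants.items():
        print(f"{stage} {name}: about {weight_memory_gb(params):.1f} GB of weights")

largest_pair_gb = weight_memory_gb(STAGE_C["3.6B"]) + weight_memory_gb(STAGE_B["1.5B"])
smallest_pair_gb = weight_memory_gb(STAGE_C["1B"]) + weight_memory_gb(STAGE_B["700M"])
print(f"largest C + B pair: about {largest_pair_gb:.1f} GB of weights")
print(f"smallest C + B pair: about {smallest_pair_gb:.1f} GB of weights")
```

Under these assumptions, the weights of the largest Stage C and Stage B pair take up roughly half of the quoted 20GB figure, and choosing the smaller variants reduces the weight footprint to about a third of that.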
In addition to standard text-to-image generation, Stable Cascade can generate image variations and perform image-to-image transformations. Users can produce multiple interpretations of a single image or transform an existing image according to a new prompt. To support customization, the company has released the training, fine-tuning, ControlNet, and LoRA code on GitHub, along with scripts for specialized applications such as image restoration/inpainting, generation guided by Canny edges, and 2x super-resolution. The model is currently available for non-commercial use only and is subject to strict guidelines, with further policies still being developed.