The Doubao large model team has officially released the technical report for Seedream 2.0, its image generation model. The report fully discloses technical details of data construction, the pre-training framework, and post-training RLHF.
Launched on the Doubao app and the Jimeng platform in early December 2024, Seedream 2.0 now serves hundreds of millions of end users and has become a top choice among professional designers. Compared with leading models such as Ideogram 2.0, Midjourney V6.1, and Flux 1.1 Pro, Seedream 2.0 demonstrates stronger text rendering, understanding of Chinese culture, aesthetics, and instruction adherence, with native bilingual support for Chinese and English.
For data preprocessing, the team developed a framework centered on "knowledge integration," using a four-dimensional data architecture to dynamically balance quality and knowledge. An intelligent annotation engine drives three levels of cognitive evolution, and re-engineered processing workflows enable pipeline-parallel handling of billions of samples, significantly improving both efficiency and quality.
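The idea of pipeline-parallel data processing can be illustrated with a minimal sketch: each stage streams records to the next, so intermediate results for billions of samples never need to be materialized at once. The stage names and filtering rule below are illustrative assumptions, not the team's actual pipeline.

```python
# Minimal sketch of a staged, streaming data pipeline (hypothetical
# stages; not Seedream 2.0's actual implementation).
from typing import Iterable, Iterator


def decode(records: Iterable[bytes]) -> Iterator[str]:
    """Stage 1: decode raw bytes into text."""
    for r in records:
        yield r.decode("utf-8")


def quality_filter(texts: Iterable[str], min_len: int = 5) -> Iterator[str]:
    """Stage 2: drop samples below a (hypothetical) quality threshold."""
    for t in texts:
        if len(t) >= min_len:
            yield t


def annotate(texts: Iterable[str]) -> Iterator[dict]:
    """Stage 3: attach annotation metadata to each surviving sample."""
    for t in texts:
        yield {"caption": t, "length": len(t)}


raw = [b"a cat", b"hi", b"a red bicycle on a bridge"]
# Stages are chained generators: a record flows through all three stages
# before the next record is read, keeping memory use constant.
pipeline = annotate(quality_filter(decode(raw)))
print(list(pipeline))  # "hi" is filtered out; two annotated samples remain
```

Because every stage is a generator, stages can also be distributed across workers without changing the stage code itself.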
During pre-training, Seedream 2.0 adopted a new architecture focused on multilingual semantic understanding, bilingual text rendering, and multi-resolution scene adaptation. The team introduced an LLM-based bilingual alignment approach to strengthen the model's grasp of Chinese semantics and cultural nuance, and built a dual-modal encoding fusion system so the model learns text-rendering properties directly from textual features. They also made three key upgrades to the DiT architecture for scalable image generation across resolutions.
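One way to read "dual-modal encoding fusion" is that semantic embeddings (capturing meaning) and character-level glyph embeddings (capturing exact spelling for rendered text) are combined into a single conditioning sequence for the diffusion transformer. The sketch below shows that fusion step in the simplest form, concatenation along the token axis; the shapes, names, and fusion method are illustrative assumptions, not the report's actual design.

```python
import numpy as np

# Hypothetical sketch: fuse semantic and glyph-level text embeddings into
# one conditioning sequence for a diffusion transformer.
def fuse_text_conditions(semantic: np.ndarray, glyph: np.ndarray) -> np.ndarray:
    """Concatenate two embedding sequences along the token axis.

    semantic: (num_semantic_tokens, dim) -- e.g. LLM token embeddings
    glyph:    (num_glyph_tokens, dim)    -- e.g. character-level embeddings
    returns:  (num_semantic_tokens + num_glyph_tokens, dim)
    """
    assert semantic.shape[1] == glyph.shape[1], "embedding dims must match"
    return np.concatenate([semantic, glyph], axis=0)


semantic = np.random.randn(77, 64)  # illustrative token count and dim
glyph = np.random.randn(32, 64)     # illustrative glyph token count
cond = fuse_text_conditions(semantic, glyph)
print(cond.shape)  # (109, 64)
```

With this layout, the DiT's cross-attention can attend to both streams at once, so spelling information is available alongside semantics when rendering text into the image.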
In the post-training stage, Seedream 2.0 improved overall performance through RLHF. The team curated versatile prompt sets for reward-model training and created multi-dimensional fused annotations to expand preference boundaries. Three specialized reward models, covering image-text alignment, aesthetics, and text rendering, were carefully trained to enable stable feedback learning and further boost model performance.
To evaluate the model comprehensively, the team established the rigorous Bench-240 benchmark, testing fundamentals such as image-text matching, structural accuracy, and aesthetics. Results showed that Seedream 2.0 outperformed mainstream models in structural soundness and text comprehension on English prompts. On Chinese-language tasks, it likewise excelled at generating usable, well-aligned rendered text.