The VideoWorld experimental video generation model, jointly developed by the Doubao Large Model team, Beijing Jiaotong University, and the University of Science and Technology of China, has officially been open-sourced. The model marks a notable break with current practice: it learns to understand the world without relying on language models.
Mainstream multimodal models such as Sora, DALL-E, and Midjourney currently depend largely on language or labeled data to acquire knowledge. Language, however, cannot capture everything about the real world: complex skills such as folding origami or tying a tie are hard to describe accurately in words alone. VideoWorld drops this reliance on language and performs unified understanding and reasoning tasks from visual information alone.
VideoWorld is built on a latent dynamics model that compactly encodes the change information between video frames, improving both the efficiency and the effectiveness of knowledge learning. Notably, without relying on any reinforcement learning, search, or reward-function mechanisms, VideoWorld has reached a professional 5-dan level in 9x9 Go and can carry out robotic control tasks across a variety of environments.
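To make the core idea concrete, the following is a minimal sketch of a latent dynamics model in the spirit described above: the change between consecutive frames is compressed into a small latent code, and the next frame is reconstructed from the previous frame plus that code, trained with a plain reconstruction loss rather than any reward signal. All class names, layer sizes, and the fully connected architecture are illustrative assumptions, not the actual VideoWorld implementation.

```python
import torch
import torch.nn as nn

class LatentDynamicsSketch(nn.Module):
    """Toy latent dynamics model (hypothetical, for illustration only):
    compress the change between two frames into a small latent code,
    then reconstruct the next frame from the previous frame + code."""

    def __init__(self, frame_dim=64 * 64, latent_dim=16):
        super().__init__()
        # Encoder sees both frames and summarizes their difference
        # into a low-dimensional latent vector.
        self.encoder = nn.Sequential(
            nn.Linear(frame_dim * 2, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder predicts the next frame from the previous frame
        # together with the compressed change code.
        self.decoder = nn.Sequential(
            nn.Linear(frame_dim + latent_dim, 256), nn.ReLU(),
            nn.Linear(256, frame_dim),
        )

    def forward(self, prev_frame, next_frame):
        z = self.encoder(torch.cat([prev_frame, next_frame], dim=-1))
        recon = self.decoder(torch.cat([prev_frame, z], dim=-1))
        return recon, z


# Plain supervised reconstruction loss: no reinforcement learning,
# no search, and no reward function is involved.
model = LatentDynamicsSketch()
prev_frame = torch.rand(8, 64 * 64)   # batch of flattened frames
next_frame = torch.rand(8, 64 * 64)
recon, z = model(prev_frame, next_frame)
loss = nn.functional.mse_loss(recon, next_frame)
loss.backward()
```

The design point this sketch tries to convey is that knowledge about how the world changes is forced through the small latent code, so the model must learn a compact representation of frame-to-frame dynamics rather than memorizing raw pixels.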
The open-source release of the VideoWorld model brings new research directions and technical support to the fields of video generation and cognition, potentially driving further development of related technologies.