Video diffusion models have emerged as powerful tools for video generation and physical simulation, demonstrating significant potential for building game engines. These generative game engines can produce videos whose actions respond to user inputs such as keyboard and mouse interactions, offering users immersive gaming experiences. However, scene generalization, the ability to create new game environments beyond existing ones, remains a critical challenge in this domain.
Although collecting large-scale action-annotated video datasets is the most straightforward approach to achieving scene generalization, the associated annotation costs are prohibitively high, especially for open-domain scenarios. This limitation hinders the development of versatile game engines capable of generating diverse and novel game environments.
To address this challenge, recent research in video generation and game physics has explored various methods, with video diffusion models emerging as a significant advancement. Evolving from U-Net to Transformer-based architectures, these models can produce more realistic and longer videos. Techniques like Direct-a-Video provide basic camera controls, while MotionCtrl and CameraCtrl offer more sophisticated camera pose manipulation. Nevertheless, these approaches are often constrained to specific games and datasets, limiting their scene generalization capabilities.
Recently, researchers from the University of Hong Kong and Kuaishou Technology introduced GameFactory, an innovative framework designed to tackle the issue of scene generalization in game video generation. By leveraging pre-trained video diffusion models trained on open-domain video data and employing a multi-stage training strategy, GameFactory successfully generates diverse new game environments.
GameFactory's multi-stage training strategy enables effective scene generalization and action control. The process starts from a pre-trained video diffusion model and proceeds through three stages. In Stage One, the model is adapted to the target game domain with LoRA fine-tuning while the original pre-trained parameters stay frozen. In Stage Two, both the pre-trained parameters and the LoRA weights are frozen, and only the action control module is trained, preventing entanglement between game style and action control. In Stage Three, the LoRA weights are removed, allowing the system to generate action-controlled game videos across diverse open-domain scenes.
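To make the parameter schedule concrete, below is a minimal PyTorch sketch of how such a three-stage setup could be wired, toggling which parameters are trainable at each stage. The class and module names (VideoBackbone, ActionControlModule, LoRALinear) are illustrative assumptions for this sketch, not GameFactory's released code.

```python
# Minimal sketch of the three-stage parameter schedule described above.
# All names here are illustrative stand-ins, not the paper's implementation.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen base linear layer plus a small trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # LoRA starts as a no-op on the base layer
        self.scale = alpha / rank
        self.enabled = True  # Stage Three: switch off to drop the game-style LoRA

    def forward(self, x):
        out = self.base(x)
        if self.enabled:
            out = out + self.scale * self.lora_b(self.lora_a(x))
        return out


class VideoBackbone(nn.Module):
    """Toy stand-in for a pre-trained video diffusion backbone."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.proj = LoRALinear(nn.Linear(dim, dim))

    def forward(self, x):
        return self.proj(x)


class ActionControlModule(nn.Module):
    """Toy stand-in for the action-conditioning module trained in Stage Two."""

    def __init__(self, dim: int = 64, action_dim: int = 8):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, dim)

    def forward(self, h, action):
        return h + self.action_proj(action)


def configure_stage(backbone, action_module, stage: int):
    """Freeze everything, then unfreeze only what the given stage trains."""
    for p in backbone.parameters():
        p.requires_grad_(False)
    for p in action_module.parameters():
        p.requires_grad_(False)

    if stage == 1:
        # Stage One: adapt to the game domain by training only the LoRA factors.
        for m in backbone.modules():
            if isinstance(m, LoRALinear):
                m.lora_a.weight.requires_grad_(True)
                m.lora_b.weight.requires_grad_(True)
    elif stage == 2:
        # Stage Two: backbone and LoRA stay frozen; train the action module only.
        for p in action_module.parameters():
            p.requires_grad_(True)
    elif stage == 3:
        # Stage Three (inference): drop the game-style LoRA, keep action control.
        for m in backbone.modules():
            if isinstance(m, LoRALinear):
                m.enabled = False


backbone, action_module = VideoBackbone(), ActionControlModule()
for stage in (1, 2, 3):
    configure_stage(backbone, action_module, stage)
    trainable = sum(p.numel() for p in list(backbone.parameters()) +
                    list(action_module.parameters()) if p.requires_grad)
    print(f"stage {stage}: {trainable} trainable parameters")
```

The key design point this sketch captures is the separation of concerns: game-specific style lives entirely in the LoRA factors, while action control lives in its own module, so removing the LoRA at inference time keeps the learned control but restores the open-domain appearance of the base model.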
Evaluations of GameFactory compare different action-control mechanisms. For discrete control signals such as keyboard inputs, cross-attention outperforms concatenation on the Flow-MSE metric, whereas for continuous mouse-movement signals, concatenation proves more effective. In terms of style consistency, the different methods perform comparably. The system handles both fundamental atomic actions and complex composite actions across varied game scenes, demonstrating strong scene generalization and action control.
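As a rough illustration of the two conditioning schemes being compared, the following PyTorch sketch pairs discrete keyboard actions with cross-attention and continuous mouse deltas with channel-wise concatenation. The layer names, dimensions, and key vocabulary are placeholder assumptions for this sketch, not the authors' implementation.

```python
# Sketch of the two conditioning schemes: cross-attention for discrete keyboard
# actions, concatenation for continuous mouse movement. Names are illustrative.
import torch
import torch.nn as nn


class KeyboardCrossAttention(nn.Module):
    """Discrete key presses become embeddings that video tokens attend to."""

    def __init__(self, dim: int = 64, num_keys: int = 6, heads: int = 4):
        super().__init__()
        self.key_embed = nn.Embedding(num_keys, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens, key_ids):
        # video_tokens: (B, T, dim); key_ids: (B, T) integer key index per frame
        actions = self.key_embed(key_ids)
        attended, _ = self.attn(video_tokens, actions, actions)
        return video_tokens + attended


class MouseConcatConditioning(nn.Module):
    """Continuous mouse deltas are concatenated onto the token features."""

    def __init__(self, dim: int = 64, mouse_dim: int = 2):
        super().__init__()
        self.fuse = nn.Linear(dim + mouse_dim, dim)

    def forward(self, video_tokens, mouse_deltas):
        # video_tokens: (B, T, dim); mouse_deltas: (B, T, 2) continuous (dx, dy)
        fused = torch.cat([video_tokens, mouse_deltas], dim=-1)
        return self.fuse(fused)


B, T, dim = 2, 16, 64
tokens = torch.randn(B, T, dim)
keys = torch.randint(0, 6, (B, T))
mouse = torch.randn(B, T, 2)

tokens = KeyboardCrossAttention(dim)(tokens, keys)
tokens = MouseConcatConditioning(dim)(tokens, mouse)
print(tokens.shape)  # torch.Size([2, 16, 64])
```

Intuitively, discrete key presses map naturally to a small embedding table that tokens can attend to, while continuous mouse deltas carry fine-grained magnitude information that is easier to preserve by concatenating it directly onto the features, which is consistent with the reported results.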
GameFactory represents a major step forward for generative game engines, addressing the crucial challenge of scene generalization in game video generation. By leveraging open-domain video data and a multi-stage training strategy, it demonstrates the feasibility of creating new games through generative interactive videos. While this marks a significant milestone, many challenges remain before fully fledged generative game engines become practical. GameFactory lays a solid foundation for this evolving field and points to promising directions for future research.