TikTok Unveils Boximator: A Novel Fine-grained Motion Control Technology in Video Generation
ByteDance, the parent company of TikTok, has published a research paper on Boximator, a new technology that allows for precise control of object motion in generated videos.
Boximator, a combination of "box" and "animator," introduces a simple yet powerful motion specification approach. Users first select objects in reference images and draw boxes around them. They can then define the end positions of objects or the entire motion path across frames using additional boxes and lines. This visual-based technique eliminates the need for verbal descriptions of desired motion.
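To make the box-based specification concrete, here is a minimal sketch of how a start box and an end box could be turned into one box per frame. The `Box` class, the `interpolate_boxes` helper, and the straight-line interpolation are all assumptions made for illustration; the article does not describe how Boximator represents or fills in user constraints internally.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates (hypothetical representation)."""
    x1: float
    y1: float
    x2: float
    y2: float


def interpolate_boxes(start: Box, end: Box, num_frames: int) -> List[Box]:
    """Return one box per frame, moving linearly from start to end.

    Assumption: straight-line motion. A user-drawn path would replace
    this with positions sampled along the drawn line.
    """
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    boxes = []
    for i in range(num_frames):
        t = i / (num_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        boxes.append(Box(
            start.x1 + t * (end.x1 - start.x1),
            start.y1 + t * (end.y1 - start.y1),
            start.x2 + t * (end.x2 - start.x2),
            start.y2 + t * (end.y2 - start.y2),
        ))
    return boxes


# Example: an object slides 100 pixels to the right over 5 frames.
start = Box(10, 20, 50, 60)
end = Box(110, 20, 150, 60)
path = interpolate_boxes(start, end, 5)
```

In this sketch the user supplies only two boxes, and the per-frame targets are derived from them. That is the kind of constraint a control module could consume in place of a text description of the motion.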
Boximator works as a plugin that integrates these user constraints into existing video synthesis models. The base model's weights are frozen and only the additional control modules are trained, which allows direct integration with state-of-the-art systems.
Empirically, Boximator maintains the original video quality, as measured by the Fréchet Video Distance (FVD) score (lower is better), while adding precise motion control. On the MSR-VTT dataset, the module improves the FVD of two baseline models and demonstrates strong motion alignment, quantified by an Average Precision metric that compares the bounding boxes of generated objects with ground truth bounding boxes.
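Box-alignment metrics of this kind are usually built on intersection over union (IoU) between a generated box and its ground truth box. The sketch below shows that building block, plus a simple per-frame match rate at one IoU threshold. The function names and the 0.5 threshold are assumptions for illustration, and this is not the exact evaluation protocol used in the paper.

```python
from typing import Sequence, Tuple

BoxT = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: BoxT, b: BoxT) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height are clamped to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fraction_matched(generated: Sequence[BoxT],
                     ground_truth: Sequence[BoxT],
                     threshold: float = 0.5) -> float:
    """Share of frames whose generated box reaches the IoU threshold.

    Assumption: one box per frame, already paired by frame index.
    """
    if not ground_truth or len(generated) != len(ground_truth):
        raise ValueError("expected two non-empty sequences of equal length")
    hits = sum(iou(g, t) >= threshold for g, t in zip(generated, ground_truth))
    return hits / len(ground_truth)


# Example: three frames, with the last generated box far off target.
generated = [(0, 0, 10, 10), (5, 0, 15, 10), (30, 0, 40, 10)]
ground_truth = [(0, 0, 10, 10), (6, 0, 16, 10), (10, 0, 20, 10)]
score = fraction_matched(generated, ground_truth)
```

A full Average Precision computation typically averages results over several IoU thresholds. The single-threshold match rate here only illustrates the comparison between generated and ground truth boxes.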
Qualitative results further highlight the realism of the technology, with objects faithfully following complex user-defined paths, interactions, and scene entrances/exits. Boximator can handle composite elements such as people riding horses and control object quantities, sizes, distances, and more.
This marks an important step towards a versatile video generation platform that balances quality, diversity, and user control. By externalizing motion specification, Boximator has the potential to save the significant computational resources a model would otherwise need to learn such fine-grained motion behavior internally.