Genmo Releases Preview Version of Open-Source Video Generation Model Mochi 1

2024-10-23

Genmo, a leading AI content generation platform, has released a preview version of Mochi 1, its new open-source video generation model.

According to Genmo, Mochi 1 makes significant advances in motion quality and shows improved adherence to written prompts. AI video models often produce unpredictable results even when given specific text instructions, but Genmo says its model has been trained to follow prompts closely.

Alongside the new model, Genmo has launched a hosted playground where users can try Mochi 1 for free. The model's weights are also available for download from the model hosting site Hugging Face.

Furthermore, Genmo revealed that it has secured $28.4 million in Series A funding, led by NEA, with participation from The House Fund, Gold House Ventures, WndrCo, Eastlink Capital Partners, and Essence VC. These funds will be utilized to advance what Genmo refers to as the 'right brain' of artificial general intelligence.

Mochi 1 is Genmo's first step toward building that 'right brain,' the side typically associated with creativity, as opposed to the analytical, logical 'left brain.' Since the emergence of high-performance AI video generators such as Runway AI Inc.'s models and OpenAI's Sora, the video generation sector has attracted substantial investment and talent.

Genmo says the new model sets a high bar for realistic motion by modeling physics such as fluid dynamics, hair movement, and human motion. It can generate videos up to 5.4 seconds long at 30 frames per second, in line with current industry standards for most models on the market.
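
As a quick check on those figures (taking the stated 5.4-second duration and 30 fps at face value), a maximum-length clip works out to roughly 162 frames:

```python
# Frame count for a maximum-length Mochi 1 clip, assuming the stated
# 5.4 s duration and 30 fps frame rate from the announcement.
duration_s = 5.4
fps = 30
total_frames = round(duration_s * fps)
print(total_frames)  # 162
```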

When given clear, specific prompts, the model adheres to them closely, producing videos that accurately reflect user requirements and offering detailed control over characters, settings, and other elements of a scene.

To build Mochi 1, Genmo used a diffusion model with 10 billion parameters, the learned values that are adjusted during training to improve model accuracy. At its core, the model uses the company's proprietary AsymmDiT (Asymmetric Diffusion Transformer) architecture, which is said to efficiently process user prompts alongside compressed video tokens by streamlining text processing and focusing the network's capacity on visual content.
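
As a rough illustration of that pipeline (a toy sketch only; the shapes, the stand-in network, and the update rule below are assumptions, not Genmo's released code), a text-conditioned latent video diffusion model denoises compressed video tokens step by step, conditioned on the encoded prompt:

```python
import torch
from torch import nn

# Purely illustrative: a toy denoising loop for a text-conditioned video
# diffusion model. Shapes, the stand-in network, and the update rule are
# assumptions for illustration, not Genmo's implementation.

class ToyDenoiser(nn.Module):
    """Stand-in for the transformer: predicts the noise present in the latents."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Linear(dim, dim)

    def forward(self, latents, text_emb, t):
        # A real model would jointly attend over text and video tokens.
        return self.net(latents + text_emb.mean(dim=1, keepdim=True))

model = ToyDenoiser()
text_emb = torch.randn(1, 77, 64)   # encoded prompt (batch, tokens, dim)
latents = torch.randn(1, 162, 64)   # compressed video tokens (illustrative shape)

# Iteratively remove predicted noise; a video VAE (not shown) would then
# decode the final latents back into frames.
for t in reversed(range(50)):
    noise_pred = model(latents, text_emb, t)
    latents = latents - 0.02 * noise_pred  # simplified Euler-style update
```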

AsymmDiT constructs videos by jointly attending to text and visual tokens, similar to Stable Diffusion 3. However, Genmo indicates that the architecture's visual stream has nearly four times as many parameters as its text stream, achieved through a larger hidden dimension. This asymmetric design reduces memory usage during inference.
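
A minimal sketch of what such an asymmetric two-stream block might look like (the hidden sizes, shared attention dimension, and module layout below are illustrative assumptions, not Genmo's released code):

```python
import torch
from torch import nn

class AsymmetricBlock(nn.Module):
    """Toy dual-stream transformer block: text and visual tokens keep separate
    (differently sized) hidden states but attend to each other jointly."""
    def __init__(self, text_dim=384, visual_dim=1536, attn_dim=1536, heads=8):
        super().__init__()
        # Non-square projections map both modalities into a shared attention space.
        self.text_qkv = nn.Linear(text_dim, 3 * attn_dim)
        self.visual_qkv = nn.Linear(visual_dim, 3 * attn_dim)
        self.attn = nn.MultiheadAttention(attn_dim, heads, batch_first=True)
        self.text_out = nn.Linear(attn_dim, text_dim)
        self.visual_out = nn.Linear(attn_dim, visual_dim)
        # Separate feed-forward layers per modality, as in SD3-style blocks.
        self.text_mlp = nn.Sequential(nn.Linear(text_dim, 4 * text_dim), nn.GELU(),
                                      nn.Linear(4 * text_dim, text_dim))
        self.visual_mlp = nn.Sequential(nn.Linear(visual_dim, 4 * visual_dim), nn.GELU(),
                                        nn.Linear(4 * visual_dim, visual_dim))

    def forward(self, text, visual):
        n_text = text.shape[1]
        tq, tk, tv = self.text_qkv(text).chunk(3, dim=-1)
        vq, vk, vv = self.visual_qkv(visual).chunk(3, dim=-1)
        # Joint self-attention over the concatenated token sequence.
        q = torch.cat([tq, vq], dim=1)
        k = torch.cat([tk, vk], dim=1)
        v = torch.cat([tv, vv], dim=1)
        out, _ = self.attn(q, k, v)
        text = text + self.text_out(out[:, :n_text])
        visual = visual + self.visual_out(out[:, n_text:])
        return text + self.text_mlp(text), visual + self.visual_mlp(visual)

block = AsymmetricBlock()
text_tokens = torch.randn(1, 77, 384)      # smaller text stream
visual_tokens = torch.randn(1, 162, 1536)  # larger visual stream (~4x hidden dim)
text_tokens, visual_tokens = block(text_tokens, visual_tokens)
```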

The Mochi 1 Preview includes a base model that generates 480p video. The company said the full release, planned for the end of the year, will add Mochi 1 HD, which supports 720p generation with smoother, higher-fidelity motion.

Genmo said Mochi 1 was trained entirely from scratch and, at 10 billion parameters, is currently the largest openly released video generation model. The company's existing proprietary image and video generation models already serve more than 2 million users. Mochi 1's model weights and source code are available to developers and researchers under the Apache 2.0 open-source license on GitHub and Hugging Face.
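
For readers who want to fetch the open weights locally, one common approach is the huggingface_hub client (the repository identifier below is an assumption; confirm it on Genmo's Hugging Face page before running):

```python
# Download the Mochi 1 preview weights from Hugging Face.
# Requires `pip install huggingface_hub`. The repo id is an assumption;
# verify it against Genmo's Hugging Face organization.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="genmo/mochi-1-preview")
print(f"Weights downloaded to {local_dir}")
```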