One month ago, Genmo released Mochi 1, an open state-of-the-art video generation model. Given a text prompt, Mochi 1 generates matching short video clips — like Stable Diffusion or Flux do for images and ChatGPT does for conversations. The results are stunning, especially for those of us who remember when GANs struggled to make 32x32 pixel frog images from CIFAR-10.
As an open (Apache 2.0-licensed!) foundation model, Mochi 1 is designed so that others can build on top of it. And today, Genmo just made that much easier by releasing a collection of scripts and sample code for fine-tuning Mochi 1 with LoRA.
Here are some of our results from fine-tuning Mochi on short clips of objects dissolving:
Check out our fine-tuning recipe to create your own custom video generator on Modal.
What is fine-tuning? What is LoRA?
Mochi 1 has been trained on data that is broadly representative of the kinds of videos people want to watch. You can use it off the shelf via text prompting (see our docs for how to run baseline Mochi inference on Modal).
But as with image generation models, fine-tuning the model by continuing its training process on new data allows for superior control of style and increased character consistency across generations.
- Foundation models aren’t always masters of your task: While Mochi excels at video generation in general, your specific use case might require specialization in particular styles, subjects, or motions that aren’t well-represented in the training data.
- Proprietary data needs proprietary weights: If you have unique video content that defines your brand or style, fine-tuning helps Mochi learn these distinctive characteristics.
- Improve consistency: Fine-tuned models typically require less prompt engineering to produce consistent results for your specific use case.
Training neural networks like Mochi is more resource-intensive than running their inference, since training involves a forward pass plus a backward pass that updates the model. In particular, vanilla fine-tuning requires much more memory, because it must hold not just the model parameters but also their gradients and the optimizer state.
LoRA (Low-Rank Adaptation) dramatically reduces the memory required for fine-tuning by training a separate, much smaller set of parameters (a model adapter) that runs alongside the frozen base model's weights. With LoRA, the Mochi 1 model can be fine-tuned on a single H100 GPU.
These smaller parameter sets are also easier to train and require less data — as few as fifteen video clips of only a few seconds each.
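To make the idea concrete, here is a minimal sketch of a LoRA-adapted linear layer in PyTorch. It is illustrative only, not the code in Genmo's fine-tuning scripts: the base weights are frozen, and a low-rank update `B @ A` is the only part that receives gradients.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (illustrative sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the foundation model's weights
        # Low-rank factors: effective weight is W + (alpha / rank) * B @ A,
        # with far fewer trainable parameters than W itself
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base output plus the adapter's low-rank contribution
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(3072, 3072), rank=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable adapter parameters: {trainable:,}")  # ~98K vs ~9.4M in the base layer
```

Because only the adapter parameters are updated, the gradients and optimizer state scale with the adapter rather than the full model, which is what makes single-GPU fine-tuning feasible.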
What is Modal?
Modal is a high-performance AI infrastructure platform perfect for demanding generative modeling tasks like Mochi fine-tuning and inference.
Our platform offers:
- Access to powerful H100 and A100 GPUs without server management or up-front reservations
- Resources that automatically scale up and down based on your needs
- Usage-based pricing
- Enterprise-grade security and reliability
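As a rough sketch (not the exact entrypoint in our fine-tuning recipe), requesting an H100 for a training job on Modal looks something like this; the app, image contents, volume name, and function body are placeholders:

```python
import modal

app = modal.App("mochi-lora-finetune")

# Container image with the training dependencies (package list is illustrative)
image = modal.Image.debian_slim().pip_install("torch", "transformers")

# Persisted storage for training clips and LoRA checkpoints
volume = modal.Volume.from_name("mochi-finetune-data", create_if_missing=True)

@app.function(gpu="H100", image=image, volumes={"/data": volume}, timeout=4 * 60 * 60)
def finetune():
    # Placeholder for the actual LoRA training loop
    print("training on an H100...")

@app.local_entrypoint()
def main():
    finetune.remote()  # spins up a GPU container, runs the job, then scales back to zero
```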
Tips for fine-tuning Mochi
If the videos you want to train on don't come with captions, try generating them with a multimodal model like ChatGPT or Gemini (via Google AI Studio).
Captions should be fairly detailed (>50 words). Here’s an example:
High-resolution footage depicting the controlled dissolution of a metallic robot. The robot, characterized by a polished chrome finish and retro design, holds a bouquet of various colored daisies (yellow, orange, pink, white, blue). The dissolution process initiates subtly, progressing from localized melting at the joints and edges. The metal transitions from a solid state to a liquid state exhibiting properties consistent with a high-viscosity, reflective substance. The liquid metal flows downward under the influence of gravity, creating visible pooling and surface tension effects. Observe the displacement of the flowers as they are immersed in the molten metal. The liquid metal maintains a high degree of reflectivity, showcasing the distorted image of the background environment.
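If you want to script this captioning step, here is a rough sketch using the google-generativeai SDK; the model name, prompt, and file paths are our own illustrative choices, not a specific recommendation:

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

def caption_clip(path: str) -> str:
    """Ask Gemini for a detailed (>50 word) caption of a short video clip."""
    video = genai.upload_file(path=path)
    # Video uploads are processed asynchronously; wait until the file is ready
    while video.state.name == "PROCESSING":
        time.sleep(2)
        video = genai.get_file(video.name)

    model = genai.GenerativeModel("gemini-1.5-pro")
    response = model.generate_content(
        [video, "Describe this video in at least 50 words, covering subjects, materials, motion, and lighting."]
    )
    return response.text

print(caption_clip("clips/robot_dissolve.mp4"))  # hypothetical clip path
```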
You can get good results from LoRA fine-tuning with as few as ~15 training clips.