Top open-source text-to-video AI models
Updated: 2025-10-29
Open-source text-to-video AI models are rapidly approaching the quality of leading closed-source models like Kling or OpenAI’s Sora.
In this article we’ll compare the following popular open-source text-to-video models:
| Model | Parameters | Created by | Released |
|---|---|---|---|
| HunyuanVideo | 13B+ | Tencent | Dec 2024 |
| Mochi (deploy on Modal) | 10B | Genmo | Oct 2024 |
| Wan2.2 | 5B and 14B | Alibaba | Jul 2025 |
Text-to-video AI models build on text-to-image foundations but add a more difficult dimension: time. Every frame must not only look convincing on its own but also stay coherent across seconds of motion. This shift introduces new failure modes:
- Small artifacts that flicker or accumulate across frames
- Motion that feels jittery or unnatural
- Styles that fade or drift as a clip unfolds
Beyond model design, running these systems is considerably more demanding. Video synthesis consumes far more GPU memory than still-image generation and requires careful batching just to generate a few seconds of footage.
Closed-source systems like OpenAI’s Sora and Kuaishou’s Kling have proved that high quality, long-form video generation is possible. But they remain inaccessible for most developers. That leaves open-source alternatives as the most practical path to experiment, fine-tune, and deploy video generation pipelines without depending on proprietary APIs.
This article focuses on some of the most widely adopted open-weight models and examines their capabilities through prompt-level examples, deployment considerations, and trade-offs. The goal is not to declare a “best” model, but to map out what each system does well and its limits.
To ground the comparison, let’s start with a side-by-side example prompt: “A white dove is flapping its wings, flying freely in the sky, in anime style.” (prompt taken from Penguin Video Benchmark)
Now, let’s dive deeper into each of these models.
HunyuanVideo
- Released: Dec 3, 2024
- Creator: Tencent
Hunyuan (roughly pronounced “hwen-yoo-en” in English) was one of the first large-scale systems to demonstrate that open-source approaches could begin matching the temporal consistency of closed platforms. It is consistently at or near the top of Hugging Face’s trending models and by far the most discussed model in our community Slack.
Key Features
- Over 13 billion parameters
- Diffusers integration for plug-and-play workflows
- FP8 model weights to reduce GPU memory usage
- Official ComfyUI nodes for quick prototyping
- Prompt rewriting utility to improve alignment with user instructions
- Several popular fine-tunes, e.g. SkyReels V1, which is fine-tuned on tens of millions of human-centric film and television clips
Example Videos
These videos demonstrate Hunyuan’s high quality and realistic generation capabilities, though the astronaut video does not really adhere to the style prompt.
Operational Footprint
Running Hunyuan requires substantial resources. Even relatively short clips at moderate resolution push beyond the capacity of most consumer GPUs, placing the model firmly in the datacenter-class hardware category.
Tencent does provide options for multi-GPU sequence parallelism (xDiT), which helps distribute workloads, and FP8 quantization, which reduces memory pressure. Even with these optimizations, however, Hunyuan is still impractical for most consumer setups.
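As a rough sketch of what the Diffusers integration looks like in practice, the snippet below loads HunyuanVideo with reduced-precision weights, CPU offloading, and VAE tiling to relieve memory pressure. It assumes the community `hunyuanvideo-community/HunyuanVideo` checkpoint and the `HunyuanVideoPipeline` class from Diffusers; check the model card for current checkpoint ids and recommended settings.

```python
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video

# Assumed community Diffusers checkpoint; see the model card for the exact id.
model_id = "hunyuanvideo-community/HunyuanVideo"

# Load the transformer in bfloat16 and the rest of the pipeline in float16
# to trim memory without changing the sampling logic.
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.float16
)

# Memory-pressure relief: stream weights onto the GPU on demand and decode
# the video latents in tiles rather than all at once.
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

frames = pipe(
    prompt="A white dove is flapping its wings, flying freely in the sky, in anime style.",
    height=320,
    width=512,
    num_frames=61,       # roughly 2.5 seconds at 24 fps
    num_inference_steps=30,
).frames[0]

export_to_video(frames, "dove.mp4", fps=24)
```

Even with offloading and tiling, settings beyond this modest resolution and clip length will push you toward datacenter-class hardware.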
Strengths
Hunyuan is strong in motion consistency and texture realism. As shown in the example videos, backgrounds remain coherent across frames, and fine details (e.g., the snowflake pattern) hold together better than in many smaller models. Its ecosystem maturity (through Diffusers and ComfyUI support) also makes it easier to integrate into existing workflows.
Weaknesses
For prompts requesting artistic rendering (e.g., the ukiyo-e astronaut), Hunyuan often defaults toward photorealism, which undercuts stylistic control. Its high VRAM demand also places it out of reach for consumer-grade GPUs, limiting accessibility for most developers.
Mochi
- Released: Oct 22, 2024
- Creator: Genmo
Mochi was one of the first Apache-2.0-licensed video models released with training code, making it an attractive option for both experimentation and downstream fine-tuning. It ranks similarly to Hunyuan on crowd-sourced leaderboards.
Key Features
- 10 billion parameters
- Apache-2.0 license for open research and commercial use
- LoRA trainer support for lightweight fine-tuning
- Native ComfyUI integration
- AsymmDiT backbone optimized for video synthesis
- Easy Deployment on Modal
Example Videos
In these examples, Mochi’s quality is generally a little worse than Hunyuan’s, though the first example is arguably my favorite of all the videos in this article.
Operational Footprint
Running Mochi is resource-intensive given its size. At default settings, it requires more GPU memory than most consumer cards can provide, putting it in the same class of hardware demand as larger open-weight models like Hunyuan. ComfyUI optimizations can lower the memory footprint, but they come with trade-offs in generation speed and clip length.
Modal’s deployment estimates the cost at around $0.33 per short clip on H100-class hardware, which positions Mochi as relatively efficient for cloud runs, but still impractical for most local GPUs.
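For a sense of what a cloud deployment involves, here is a minimal sketch that wraps Mochi in a Modal function on an H100 using the Diffusers `MochiPipeline` and the `genmo/mochi-1-preview` checkpoint. The package pins, generation settings, and file handling are illustrative assumptions; our end-to-end Mochi example covers a production-ready setup.

```python
import modal

# Assumed dependencies; adjust versions to whatever the Mochi/Diffusers docs recommend.
image = modal.Image.debian_slim().pip_install(
    "torch", "diffusers", "transformers", "accelerate", "sentencepiece", "imageio[ffmpeg]"
)
app = modal.App("mochi-t2v", image=image)


@app.function(gpu="H100", timeout=20 * 60)
def generate(prompt: str) -> bytes:
    import torch
    from diffusers import MochiPipeline
    from diffusers.utils import export_to_video

    # bfloat16 weights plus offloading and VAE tiling keep the 10B model within a single H100.
    pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview", torch_dtype=torch.bfloat16)
    pipe.enable_model_cpu_offload()
    pipe.enable_vae_tiling()

    frames = pipe(prompt, num_frames=85, num_inference_steps=50).frames[0]
    export_to_video(frames, "/tmp/mochi.mp4", fps=30)
    return open("/tmp/mochi.mp4", "rb").read()


@app.local_entrypoint()
def main():
    video = generate.remote("A white dove is flapping its wings, flying freely in the sky.")
    open("mochi.mp4", "wb").write(video)
```

Loading the pipeline inside the function keeps the sketch simple; for sustained workloads you would typically load weights once per container rather than once per call.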
Strengths
Mochi delivers strong photorealistic rendering and flexible fine-tuning. Its support for LoRA adapters lets teams specialize the model quickly on custom data, and the Apache-2.0 license makes it one of the most permissively usable models.
Weaknesses
Stylized outputs, especially animated sequences, are weaker. The authors note that Mochi is primarily optimized for photorealism. Also, its relatively high memory demand for its parameter count raises operational cost.
Wan2.2
- Released: July 28, 2025
- Creator: Alibaba
Wan2.2 is the latest open-weight model from Alibaba’s Wan series. It builds on Wan 2.1 but introduces substantial architectural upgrades. The release emphasizes stylization control, motion fidelity, and computational efficiency, while preserving open tooling support (Diffusers, ComfyUI).
Key Features
- Hybrid TI2V-5B model combining both text-to-video and image-to-video capabilities
- Apache-2.0 license
- Mixture-of-Experts (MoE) backbone, with two specialized experts (high-noise/low-noise), allowing for efficient capacity scaling
- Cinematic aesthetic controls: lighting, composition, contrast, color tone labels, etc.
Example Videos
The overall quality of Wan2.2 is perhaps slightly below Hunyuan’s, but it does the best job of adhering to the style instructions in the astronaut prompt.
Operational Footprint
The TI2V-5B variant is optimized for 720p/24 fps and is reported to run on high-end consumer GPUs. The A14B MoE variants (T2V-A14B, I2V-A14B) require more resources and target higher fidelity use cases.
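As a rough sketch of what running the TI2V-5B variant might look like through Diffusers, the snippet below targets 720p at 24 fps and folds the cinematic aesthetic labels into the prompt. The `WanPipeline` and `AutoencoderKLWan` classes follow the Wan 2.x Diffusers integration; the `Wan-AI/Wan2.2-TI2V-5B-Diffusers` checkpoint id, resolution, and frame count are assumptions to verify against the model card.

```python
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

# Assumed Diffusers-format checkpoint id; confirm against the official model card.
model_id = "Wan-AI/Wan2.2-TI2V-5B-Diffusers"

# Keep the Wan VAE in float32 for decode quality; run the rest in bfloat16.
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # helps fit on high-end consumer GPUs

# Cinematic aesthetic labels ride along in the prompt itself.
prompt = (
    "A white dove is flapping its wings, flying freely in the sky, in anime style. "
    "Soft backlighting, warm color tone, wide composition."
)

frames = pipe(
    prompt=prompt,
    height=704,
    width=1280,
    num_frames=121,      # roughly 5 seconds at 24 fps
    guidance_scale=5.0,
).frames[0]

export_to_video(frames, "wan_dove.mp4", fps=24)
```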
Strengths
Wan2.2 stands out for its balance between stylization control and accessibility. The model maintains readable on-screen text and performs well across Chinese and English prompts. Also, its resource efficiency makes it one of the most approachable models for developers testing video workflows on local GPUs.
Weaknesses
While Wan2.2 narrows the realism gap, its rendering of fine detail such as texture and lighting still trails larger models like Hunyuan in complex scenes. Some users also report longer inference times per frame under high-motion prompts due to the MoE routing overhead. Performance on multi-GPU scale-outs also remains less documented than for other models.
Note: Wan2.2 replaces the concept of a “lightweight variant” from 2.1 (like the 1.3B model). In 2.2, the TI2V-5B model serves as the efficiency tier, and the A14B MoE models handle specialization.
These comparisons highlight model capabilities, but developers must also weigh the practicalities of running them, which the next section addresses.
Running Text-to-Video Models
Text-to-video models introduce their own set of operational challenges, so choosing a model isn’t as simple as picking the one that looks best. We have to find a model that fits into our existing workflow and hardware budget.
Things to Think About When Selecting a Model
- Choose one with a Diffusers or ComfyUI integration. This saves setup time and gives you standardized preprocessing, inference, and visualization pipelines out of the box. Models without official integrations usually require more custom code or community wrappers.
- Match model size to your GPU. Larger architectures often require datacenter GPUs to run at full resolution. Smaller or quantized variants can run on consumer hardware but typically produce shorter or lower-quality clips.
- Prototype small, scale later. Start with low-resolution, short clips to confirm your pipeline works (see the sketch after this list). Once the workflow is stable, you can increase resolution and clip length without wasting GPU hours on debugging.
- Consider latency and cost as part of quality. Generating a few seconds of video can take several minutes on high-end hardware. Faster models or shorter clips might deliver better iteration speed, even if the visual quality is slightly lower.
- Use optimization tools deliberately. Quantization, offloading, and multi-GPU parallelism can stretch hardware capacity, but each comes with trade-offs in speed, fidelity, or complexity. Apply them only where they make sense.
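To make the “prototype small, scale later” pattern concrete, here is a hedged sketch built on Diffusers’ generic pipeline loader. The profile values, the `generate` helper, and the model id placeholder are all illustrative, not recommendations for any specific model.

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Illustrative settings only: tune these for your chosen model and hardware budget.
PROFILES = {
    "preview": dict(height=256, width=384, num_frames=25, num_inference_steps=20),
    "final":   dict(height=480, width=832, num_frames=81, num_inference_steps=50),
}

def generate(model_id: str, prompt: str, profile: str = "preview") -> str:
    # DiffusionPipeline resolves the correct pipeline class from the checkpoint.
    pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    pipe.enable_model_cpu_offload()  # trade some speed for memory headroom

    frames = pipe(prompt=prompt, **PROFILES[profile]).frames[0]
    path = f"clip_{profile}.mp4"
    export_to_video(frames, path, fps=16)
    return path

# Iterate cheaply on prompts first, then re-render the keepers at full quality:
# generate("your-chosen-model-id", "A white dove flying in the sky, anime style")
# generate("your-chosen-model-id", "A white dove flying in the sky, anime style", "final")
```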
Closing Thoughts
The text-to-video space is moving at a fast clip, with new models claiming state-of-the-art results released every few weeks. The common thread across these models is that there is no single “best” option, only a growing set of trade-offs.
Some models prioritize realism while others are better at stylization or text rendering. Larger architectures allow for longer, higher-resolution clips, but require datacenter-class GPUs. Smaller variants reduce hardware requirements, making them more accessible, at the cost of fidelity. Ultimately, gains in one area lead to trade-offs in another.
As GPUs become easier and cheaper to access, deploying open-source models like Hunyuan, Mochi, and Wan2.2 becomes an even more attractive option. At Modal, this is as simple as running our end-to-end Mochi example, but you can run any code on Modal in a cost-effective and developer-friendly way.