What is LongVie 2?
LongVie 2 is an open-source, multimodally controllable ultra-long video world model that generates coherent videos of up to 5 minutes, using depth and pointmap controls for precise guidance.
When was LongVie 2 released?
The paper was published on arXiv on December 15, 2025, and the model weights and code were released around the same time.
Is LongVie 2 free to use?
Yes. It is fully open-source, with model weights, code, and inference scripts available on Hugging Face and GitHub under a permissive license and with no usage fees.
How long of videos can LongVie 2 generate?
It supports continuous autoregressive generation of videos up to 5 minutes long (3-5 minutes in typical demonstrations) while maintaining quality and consistency.
What controls does LongVie 2 use?
It integrates dense depth maps and sparse pointmaps/keypoints for multimodal guidance, enabling fine-grained semantic and motion control over long sequences.
Where can I download LongVie 2?
Model weights and code are hosted on Hugging Face at Vchitect/LongVie2, with GitHub repo at Vchitect/LongVie and project page at vchitect.github.io/LongVie2-project/.
What benchmark does LongVie 2 use?
It introduces and tops LongVGenBench, a new evaluation set with 100 high-resolution one-minute videos across diverse environments for long-video assessment.
Is LongVie 2 suitable for beginners?
No. It is research-oriented and requires technical setup, GPU hardware, and control-signal preparation; it is best suited to developers and researchers rather than casual users.
LongVie 2

About This AI
LongVie 2 is an advanced open-source end-to-end autoregressive video generation framework designed as a controllable ultra-long video world model.
It builds upon pretrained short-clip diffusion backbones to enable continuous generation of high-quality videos up to 5 minutes long (typically 3-5 minutes in practice) while maintaining strong temporal consistency, visual fidelity, and fine-grained controllability.
The model is trained in three progressive stages: multi-modal guidance, which integrates dense (depth map) and sparse (keypoint/pointmap) control signals for world-level supervision and enhanced controllability; degradation-aware training, which bridges the training-inference gap to preserve long-term visual quality; and history-context guidance, which aligns information across adjacent clips for seamless temporal coherence.
It supports multimodal inputs including text prompts combined with depth and pointmap controls, allowing precise semantic direction over long sequences without degradation or inconsistency.
Evaluated on the new LongVGenBench (100 high-res one-minute videos across real/synthetic environments), it achieves state-of-the-art results in long-range controllability, temporal coherence, and visual fidelity.
It was released on December 15, 2025, with code, weights, and demos openly available on GitHub and Hugging Face under the Vchitect organization.
As a research-oriented model, it requires significant compute for inference (GPU recommended) and is ideal for researchers, developers, and creators pushing boundaries in long-form video synthesis, world modeling, and controllable generation.
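At a conceptual level, the generation loop chains short clips together: each clip is conditioned on the text prompt, its own depth and pointmap controls, and history context extracted from the previous clip. The sketch below is purely illustrative of that idea; the callables `generate_clip` and `extract_history` are hypothetical placeholders, not the actual LongVie 2 API, and the real pipeline lives in the project's released scripts.

```python
from typing import Callable, List, Optional, Sequence

def generate_long_video(
    generate_clip: Callable[..., List],        # hypothetical: renders one short clip from controls + context
    extract_history: Callable[[List], object], # hypothetical: summarizes a clip into history context
    prompt: str,
    depth_clips: Sequence,                     # dense control: per-clip depth-map sequences
    pointmap_clips: Sequence,                  # sparse control: per-clip pointmaps / keypoints
) -> List:
    """Chain short clips into one long video, carrying history context forward."""
    frames: List = []
    history: Optional[object] = None           # the first clip has no preceding context
    for depth, points in zip(depth_clips, pointmap_clips):
        clip = generate_clip(
            prompt=prompt,
            depth=depth,                       # dense guidance for scene structure and layout
            pointmaps=points,                  # sparse guidance for motion and keypoints
            history_context=history,           # aligns this clip with the previous one
        )
        frames.extend(clip)
        history = extract_history(clip)        # carry the clip's tail forward for coherence
    return frames
```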
Key Features
- Ultra-long video generation: Autoregressively generates coherent videos up to 5 minutes by chaining clips seamlessly
- Multimodal controllability: Integrates dense depth maps and sparse pointmaps/keypoints for fine-grained semantic and structural control
- Temporal consistency: History-context guidance aligns adjacent clips to prevent drift or inconsistency over long durations
- Visual quality preservation: Degradation-aware training mitigates quality drop during extended inference
- End-to-end autoregressive framework: Built on pretrained diffusion backbones (e.g., Wan2.1 I2V) with a DiT transformer core
- Long-range controllability: World-level supervision via multi-modal signals for precise action, scene, and style direction
- State-of-the-art benchmarks: Tops LongVGenBench in controllability, coherence, and fidelity metrics
- Open-source availability: Full weights, inference code, and project demos on Hugging Face and GitHub
- Image-to-video extension: Supports starting from images with control signals for extended sequences
Price Plans
- Free ($0): Fully open-source model weights, code, and inference scripts available on Hugging Face and GitHub with no usage fees
- Cloud/Compute (Variable): Heavy inference incurs compute costs, whether on paid GPU platforms like RunPod and Vast.ai or on your own local hardware
Pros
- Breaks long-video barriers: Achieves 3-5 minute coherent generation, far beyond typical short-clip models
- Strong controllability: Depth + pointmap fusion enables precise, semantically meaningful guidance over minutes
- Excellent temporal coherence: History-context modeling minimizes drift in ultra-long outputs
- High visual fidelity: Degradation-aware approach keeps quality stable throughout extended sequences
- Fully open-source: Weights and code freely available for research, fine-tuning, and community use
- Significant research impact: Introduces new benchmark (LongVGenBench) and advances video world modeling
- Builds on proven backbones: Leverages strong pretrained models for reliable base generation
Cons
- High compute requirements: Inference for long videos needs powerful GPUs and is not easily feasible on consumer hardware
- Research-focused: Requires technical setup (e.g., custom scripts and environments) rather than offering a user-friendly app
- Limited accessibility: No hosted demo or easy web interface; primarily for developers/researchers
- Generation time: Autoregressive chaining for minutes-long videos can be slow even on high-end hardware
- Control signal dependency: Best results need accurate depth/pointmap inputs; poor controls degrade output
- Early-stage model: Released December 2025; community integrations and optimizations are still emerging
- No native audio: Focuses on visual generation; no built-in sound or multimodal audio support
Use Cases
- Long-form video synthesis: Create extended cinematic sequences or simulations up to 5 minutes
- Research in video world models: Study controllability, temporal consistency, and degradation in autoregressive generation
- Controlled animation prototyping: Use depth/pointmap guidance for precise scene evolution over time
- AI filmmaking experiments: Generate storyboards or short films with consistent style and motion
- Benchmarking long-video AI: Evaluate on LongVGenBench or extend for new long-sequence tasks
- Creative visual storytelling: Build narrative-driven videos with semantic control from text + maps
- Extension of short clips: Take existing short videos/images and autoregressively extend them coherently
Target Audience
- AI researchers in video generation: Studying world models, controllability, and long-sequence synthesis
- Computer vision developers: Implementing or fine-tuning autoregressive video frameworks
- Generative AI enthusiasts: Experimenting with open-source long-video models on powerful hardware
- Filmmakers and animators: Exploring AI for extended sequence prototyping with controls
- Academic groups: Reproducing or building on LongVGenBench and world modeling advances
- Open-source contributors: Integrating or optimizing for community tools like ComfyUI wrappers
How To Use
- Visit repo: Go to https://huggingface.co/Vchitect/LongVie2 or https://github.com/Vchitect/LongVie
- Install dependencies: Set up environment with required libraries (PyTorch, diffusers, etc.) per README
- Download weights: Pull model checkpoints from Hugging Face using provided scripts (a minimal download sketch follows this list)
- Prepare controls: Generate or provide depth maps and pointmaps/keypoints for guidance
- Run inference: Use sample scripts for autoregressive generation with multi-modal inputs
- Chain clips: Extend sequences by feeding history context to maintain consistency
- Experiment: Adjust parameters for length, control strength, and quality settings
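As referenced in the download step above, one lightweight way to pull the checkpoints locally is the huggingface_hub library. This is a minimal sketch assuming the repository id listed on this page (Vchitect/LongVie2); the actual file layout, recommended download method, and any access requirements are defined by the model card and README.

```python
# Minimal sketch: download the LongVie 2 checkpoints from Hugging Face.
# Assumes the repo id listed above (Vchitect/LongVie2); check the model card
# for the actual layout, file names, and any access requirements.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="Vchitect/LongVie2",          # repo id as given on this page
    local_dir="./checkpoints/longvie2",   # example local target directory
)
print(f"Checkpoints downloaded to: {local_dir}")
```

From there, the repository's own inference scripts handle loading the weights, attaching the control signals, and chaining clips.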
How we rated LongVie 2
- Performance: 4.7/5
- Accuracy: 4.6/5
- Features: 4.8/5
- Cost-Efficiency: 5.0/5
- Ease of Use: 3.8/5
- Customization: 4.9/5
- Data Privacy: 5.0/5
- Support: 4.2/5
- Integration: 4.5/5
- Overall Score: 4.6/5
LongVie 2 integration with other tools
- Hugging Face Hub: Direct model loading and inference via transformers/diffusers libraries
- ComfyUI Wrappers: Community nodes/extensions for graphical workflow integration (e.g., kijai/ComfyUI-WanVideoWrapper)
- GitHub Codebase: Custom scripts and training/inference pipelines for research modification
- Depth/Pointmap Generators: Compatible with external tools like MiDaS (depth) or OpenPose (keypoints) for control signals (see the depth-map sketch below)
- Video Editing Suites: Export frames/sequences for post-processing in DaVinci Resolve, Premiere, or After Effects
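For the depth side of the control signals, MiDaS (mentioned above) is a common off-the-shelf choice. The sketch below uses the public intel-isl/MiDaS torch.hub entry points to estimate a depth map for a single extracted frame; how such maps are then packaged into LongVie 2's expected control format is handled by the project's own preprocessing scripts, which this sketch does not cover.

```python
import cv2
import torch

# Load a MiDaS depth model and its matching input transforms from torch.hub.
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large")
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.dpt_transform           # transform matching DPT models

device = "cuda" if torch.cuda.is_available() else "cpu"
midas.to(device).eval()

# Read one extracted video frame (example path) and convert BGR -> RGB.
img = cv2.cvtColor(cv2.imread("frame_0001.png"), cv2.COLOR_BGR2RGB)

with torch.no_grad():
    batch = transform(img).to(device)                 # preprocess and batch the frame
    prediction = midas(batch)                         # inverse-depth prediction
    # Resize the prediction back to the original frame resolution.
    depth = torch.nn.functional.interpolate(
        prediction.unsqueeze(1),
        size=img.shape[:2],
        mode="bicubic",
        align_corners=False,
    ).squeeze().cpu().numpy()

# `depth` is now an HxW array that can be normalized and saved as a control map.
```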
Best prompts optimised for LongVie 2
- A serene mountain landscape at dawn, mist rolling over peaks, gentle camera pan right to reveal a flowing river, depth map guided for realistic foreground/background separation, pointmap tracking smooth motion, ultra-high detail, cinematic lighting
- Futuristic city street at night with flying cars, neon reflections on wet pavement, slow dolly zoom in on a cyberpunk character walking, sparse keypoint controls for character pose consistency, dense depth for layered buildings, high fidelity 8k
- Underwater coral reef exploration, colorful fish swimming around diver, camera following diver forward then orbiting coral, pointmap for fish trajectories, depth guidance for water layers and bubbles, vibrant marine colors, realistic physics
- Epic fantasy battle scene in ancient ruins, warriors clashing with dragons overhead, dynamic tracking shot following main hero, multi-element controls for multiple characters and dragon flight paths, temporal consistency across extended fight sequence
- Cozy autumn forest walk, leaves falling gently, first-person view walking along path with occasional look around, depth maps for tree layers, pointmaps for falling leaves motion, warm golden hour lighting, peaceful atmosphere