What is LongVie 2?
LongVie 2 is an open-source, multimodally controllable ultra-long video world model that generates coherent videos of up to 5 minutes, using depth and pointmap controls for precise guidance.
When was LongVie 2 released?
The paper was published on arXiv on December 15, 2025, and the model weights and code were released around the same time.
Is LongVie 2 free to use?
Yes. It is fully open-source, with model weights, code, and inference scripts available on Hugging Face and GitHub under a permissive license and with no usage fees.
How long of videos can LongVie 2 generate?
It supports continuous autoregressive generation of videos up to 5 minutes long (3-5 minutes in typical demonstrations) while maintaining quality and consistency.
What controls does LongVie 2 use?
It integrates dense depth maps and sparse pointmaps/keypoints for multimodal guidance, enabling fine-grained semantic and motion control over long sequences.
Where can I download LongVie 2?
Model weights and code are hosted on Hugging Face at Vchitect/LongVie2, with GitHub repo at Vchitect/LongVie and project page at vchitect.github.io/LongVie2-project/.
What benchmark does LongVie 2 use?
It introduces and tops LongVGenBench, a new evaluation set with 100 high-resolution one-minute videos across diverse environments for long-video assessment.
Is LongVie 2 suitable for beginners?
No. It is research-oriented and requires technical setup, GPU hardware, and control-signal preparation; it is best suited to developers and researchers rather than casual users.
LongVie 2

About This AI
LongVie 2 is an advanced open-source end-to-end autoregressive video generation framework designed as a controllable ultra-long video world model.
It builds upon pretrained short-clip diffusion backbones to enable continuous generation of high-quality videos up to 5 minutes long (typically 3-5 minutes in practice) while maintaining strong temporal consistency, visual fidelity, and fine-grained controllability.
The model is trained in three progressive stages: multi-modal guidance, which integrates dense (depth map) and sparse (keypoint/pointmap) control signals for world-level supervision and enhanced controllability; degradation-aware training, which bridges the training-inference gap to preserve long-term visual quality; and history-context guidance, which aligns information across adjacent clips for seamless temporal coherence.
It supports multimodal inputs including text prompts combined with depth and pointmap controls, allowing precise semantic direction over long sequences without degradation or inconsistency.
Evaluated on the new LongVGenBench (100 high-res one-minute videos across real/synthetic environments), it achieves state-of-the-art results in long-range controllability, temporal coherence, and visual fidelity.
It was released on December 15, 2025, with code, weights, and demos openly available on GitHub and Hugging Face under the Vchitect organization.
As a research-oriented model, it requires significant compute for inference (GPU recommended) and is ideal for researchers, developers, and creators pushing boundaries in long-form video synthesis, world modeling, and controllable generation.
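At a conceptual level, the generation loop chains short clips together: each clip is conditioned on the text prompt, its own depth and pointmap controls, and history context extracted from the previous clip. The sketch below is purely illustrative of that idea; the callables `generate_clip` and `extract_history` are hypothetical placeholders, not the actual LongVie 2 API, and the real pipeline lives in the project's released scripts.

```python
from typing import Callable, List, Optional, Sequence

def generate_long_video(
    generate_clip: Callable[..., List],        # hypothetical: renders one short clip from controls + context
    extract_history: Callable[[List], object], # hypothetical: summarizes a clip into history context
    prompt: str,
    depth_clips: Sequence,                     # dense control: per-clip depth-map sequences
    pointmap_clips: Sequence,                  # sparse control: per-clip pointmaps / keypoints
) -> List:
    """Chain short clips into one long video, carrying history context forward."""
    frames: List = []
    history: Optional[object] = None           # the first clip has no preceding context
    for depth, points in zip(depth_clips, pointmap_clips):
        clip = generate_clip(
            prompt=prompt,
            depth=depth,                       # dense guidance for scene structure and layout
            pointmaps=points,                  # sparse guidance for motion and keypoints
            history_context=history,           # aligns this clip with the previous one
        )
        frames.extend(clip)
        history = extract_history(clip)        # carry the clip's tail forward for coherence
    return frames
```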
Key Features
- Ultra-long video generation: Autoregressively generates coherent videos up to 5 minutes by chaining clips seamlessly
- Multimodal controllability: Integrates dense depth maps and sparse pointmaps/keypoints for fine-grained semantic and structural control
- Temporal consistency: History-context guidance aligns adjacent clips to prevent drift or inconsistency over long durations
- Visual quality preservation: Degradation-aware training mitigates quality drop during extended inference
- End-to-end autoregressive framework: Built on pretrained diffusion backbones (e.g., Wan2.1 I2V) with a DiT transformer core
- Long-range controllability: World-level supervision via multi-modal signals for precise action, scene, and style direction
- State-of-the-art benchmarks: Tops LongVGenBench in controllability, coherence, and fidelity metrics
- Open-source availability: Full weights, inference code, and project demos on Hugging Face and GitHub
- Image-to-video extension: Supports starting from images with control signals for extended sequences
Price Plans
- Free ($0): Fully open-source model weights, code, and inference scripts available on Hugging Face and GitHub with no usage fees
- Cloud/Compute (Variable): Heavy inference incurs compute costs, whether on paid GPU platforms like RunPod and Vast.ai or on your own local hardware
Pros
- Breaks long-video barriers: Achieves 3-5 minute coherent generation, far beyond typical short-clip models
- Strong controllability: Depth + pointmap fusion enables precise, semantically meaningful guidance over minutes
- Excellent temporal coherence: History-context modeling minimizes drift in ultra-long outputs
- High visual fidelity: Degradation-aware approach keeps quality stable throughout extended sequences
- Fully open-source: Weights and code freely available for research, fine-tuning, and community use
- Significant research impact: Introduces new benchmark (LongVGenBench) and advances video world modeling
- Builds on proven backbones: Leverages strong pretrained models for reliable base generation
Cons
- High compute requirements: Inference for long videos needs powerful GPUs and is not easily feasible on consumer hardware
- Research-focused: Requires technical setup (e.g., custom scripts and environments) rather than offering a user-friendly app
- Limited accessibility: No hosted demo or easy web interface; primarily for developers/researchers
- Generation time: Autoregressive chaining for minutes-long videos can be slow even on high-end hardware
- Control signal dependency: Best results need accurate depth/pointmap inputs; poor controls degrade output
- Early-stage model: Released December 2025; community integrations and optimizations are still emerging
- No native audio: Focuses on visual generation; no built-in sound or multimodal audio support
Use Cases
- Long-form video synthesis: Create extended cinematic sequences or simulations up to 5 minutes
- Research in video world models: Study controllability, temporal consistency, and degradation in autoregressive generation
- Controlled animation prototyping: Use depth/pointmap guidance for precise scene evolution over time
- AI filmmaking experiments: Generate storyboards or short films with consistent style and motion
- Benchmarking long-video AI: Evaluate on LongVGenBench or extend for new long-sequence tasks
- Creative visual storytelling: Build narrative-driven videos with semantic control from text + maps
- Extension of short clips: Take existing short videos/images and autoregressively extend them coherently
Target Audience
- AI researchers in video generation: Studying world models, controllability, and long-sequence synthesis
- Computer vision developers: Implementing or fine-tuning autoregressive video frameworks
- Generative AI enthusiasts: Experimenting with open-source long-video models on powerful hardware
- Filmmakers and animators: Exploring AI for extended sequence prototyping with controls
- Academic groups: Reproducing or building on LongVGenBench and world modeling advances
- Open-source contributors: Integrating or optimizing for community tools like ComfyUI wrappers
How To Use
- Visit repo: Go to https://huggingface.co/Vchitect/LongVie2 or https://github.com/Vchitect/LongVie
- Install dependencies: Set up environment with required libraries (PyTorch, diffusers, etc.) per README
- Download weights: Pull model checkpoints from Hugging Face using provided scripts (a minimal download sketch follows this list)
- Prepare controls: Generate or provide depth maps and pointmaps/keypoints for guidance
- Run inference: Use sample scripts for autoregressive generation with multi-modal inputs
- Chain clips: Extend sequences by feeding history context to maintain consistency
- Experiment: Adjust parameters for length, control strength, and quality settings
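As referenced in the download step above, one lightweight way to pull the checkpoints locally is the huggingface_hub library. This is a minimal sketch assuming the repository id listed on this page (Vchitect/LongVie2); the actual file layout, recommended download method, and any access requirements are defined by the model card and README.

```python
# Minimal sketch: download the LongVie 2 checkpoints from Hugging Face.
# Assumes the repo id listed above (Vchitect/LongVie2); check the model card
# for the actual layout, file names, and any access requirements.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="Vchitect/LongVie2",          # repo id as given on this page
    local_dir="./checkpoints/longvie2",   # example local target directory
)
print(f"Checkpoints downloaded to: {local_dir}")
```

From there, the repository's own inference scripts handle loading the weights, attaching the control signals, and chaining clips.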
How we rated LongVie 2
- Performance: 4.7/5
- Accuracy: 4.6/5
- Features: 4.8/5
- Cost-Efficiency: 5.0/5
- Ease of Use: 3.8/5
- Customization: 4.9/5
- Data Privacy: 5.0/5
- Support: 4.2/5
- Integration: 4.5/5
- Overall Score: 4.6/5
LongVie 2 integration with other tools
- Hugging Face Hub: Direct model loading and inference via transformers/diffusers libraries
- ComfyUI Wrappers: Community nodes/extensions for graphical workflow integration (e.g., kijai/ComfyUI-WanVideoWrapper)
- GitHub Codebase: Custom scripts and training/inference pipelines for research modification
- Depth/Pointmap Generators: Compatible with external tools like MiDaS (depth) or OpenPose (keypoints) for control signals (see the depth-map sketch below)
- Video Editing Suites: Export frames/sequences for post-processing in DaVinci Resolve, Premiere, or After Effects
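For the depth side of the control signals, MiDaS (mentioned above) is a common off-the-shelf choice. The sketch below uses the public intel-isl/MiDaS torch.hub entry points to estimate a depth map for a single extracted frame; how such maps are then packaged into LongVie 2's expected control format is handled by the project's own preprocessing scripts, which this sketch does not cover.

```python
import cv2
import torch

# Load a MiDaS depth model and its matching input transforms from torch.hub.
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large")
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.dpt_transform           # transform matching DPT models

device = "cuda" if torch.cuda.is_available() else "cpu"
midas.to(device).eval()

# Read one extracted video frame (example path) and convert BGR -> RGB.
img = cv2.cvtColor(cv2.imread("frame_0001.png"), cv2.COLOR_BGR2RGB)

with torch.no_grad():
    batch = transform(img).to(device)                 # preprocess and batch the frame
    prediction = midas(batch)                         # inverse-depth prediction
    # Resize the prediction back to the original frame resolution.
    depth = torch.nn.functional.interpolate(
        prediction.unsqueeze(1),
        size=img.shape[:2],
        mode="bicubic",
        align_corners=False,
    ).squeeze().cpu().numpy()

# `depth` is now an HxW array that can be normalized and saved as a control map.
```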
Best prompts optimised for LongVie 2
- A serene mountain landscape at dawn, mist rolling over peaks, gentle camera pan right to reveal a flowing river, depth map guided for realistic foreground/background separation, pointmap tracking smooth motion, ultra-high detail, cinematic lighting
- Futuristic city street at night with flying cars, neon reflections on wet pavement, slow dolly zoom in on a cyberpunk character walking, sparse keypoint controls for character pose consistency, dense depth for layered buildings, high fidelity 8k
- Underwater coral reef exploration, colorful fish swimming around diver, camera following diver forward then orbiting coral, pointmap for fish trajectories, depth guidance for water layers and bubbles, vibrant marine colors, realistic physics
- Epic fantasy battle scene in ancient ruins, warriors clashing with dragons overhead, dynamic tracking shot following main hero, multi-element controls for multiple characters and dragon flight paths, temporal consistency across extended fight sequence
- Cozy autumn forest walk, leaves falling gently, first-person view walking along path with occasional look around, depth maps for tree layers, pointmaps for falling leaves motion, warm golden hour lighting, peaceful atmosphere