Yume 1.5

Text-Controlled Interactive World Generation Model – Realistic Continuous Virtual Worlds from Image or Text with Keyboard Exploration
Last Updated: January 5, 2026
By Zelili AI

About This AI

Yume 1.5 is a novel open-source framework for generating realistic, interactive, and continuous virtual worlds from a single image or text prompt using autoregressive video diffusion.

It addresses key limitations in prior world models: exploding memory/compute for long contexts, slow multi-step inference blocking real-time exploration, and limited text-controlled event generation.

The model supports three modes: text-to-world, image-to-world, and text-based event editing, allowing users to explore generated environments via keyboard controls (WASD movement, arrow keys for camera) in a continuous video stream.

Core innovations include Joint Temporal-Spatial-Channel Modeling (TSCM), which compresses long contexts efficiently while preserving long-range detail; real-time streaming acceleration via bidirectional attention distillation and enhanced text embeddings; and discrete action tokens for intuitive control.
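
The compression idea behind TSCM can be pictured as staged pooling of the frame history: older context is merged across time and thinned across space, while the most recent frames stay at full detail. The function names and fixed 2x ratios below are illustrative assumptions for a pure-Python toy, not the paper's actual operators:

```python
# Toy sketch of TSCM-style context compression (illustrative only).

def temporal_pool(frames):
    """Merge adjacent frame pairs by averaging (2x temporal compression)."""
    return [[(a + b) / 2 for a, b in zip(f1, f2)]
            for f1, f2 in zip(frames[::2], frames[1::2])]

def spatial_pool(frame):
    """Keep every other feature (2x spatial compression)."""
    return frame[::2]

def compress_context(frames, keep_recent=4):
    """Pool older frames across time and space; keep recent frames intact."""
    old, recent = frames[:-keep_recent], frames[-keep_recent:]
    compressed = [spatial_pool(f) for f in temporal_pool(old)]
    return compressed + recent

history = [[float(i)] * 8 for i in range(12)]   # 12 frames, 8 features each
ctx = compress_context(history)
print(len(ctx))                                 # 8 frames of context, not 12
```

In this toy run, the 8 older frames shrink from 64 stored values to 16, while the 4 most recent frames keep full detail, mirroring the memory/compute savings the model targets for long contexts.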

It achieves stable 12 FPS on a single A100 GPU (70x faster than baselines), supports long-horizon coherence, and generates high-quality, dynamic worlds with emergent behaviors.
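
Put concretely, the stated throughput leaves only a small per-frame compute budget, and the claimed speedup implies prior baselines were far from interactive. A quick check of the arithmetic from the numbers above:

```python
# Derived purely from the stated figures: 12 FPS stable, 70x speedup.
fps = 12
frame_budget_ms = 1000 / fps        # time available per generated frame
baseline_fps = fps / 70             # implied throughput of prior methods
print(round(frame_budget_ms, 1), round(baseline_fps, 2))  # 83.3 0.17
```

At roughly 0.17 FPS, a baseline would need about six seconds per frame, which explains why earlier interactive world models could not support real-time keyboard exploration.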

Released December 26, 2025 (arXiv 2512.22096) by researchers from Shanghai AI Laboratory and Fudan University, with code on GitHub and preview weights on Hugging Face (Yume-5B-720P).

As a fully open-source model (Apache 2.0 planned), it serves as a strong alternative to closed systems for interactive simulation, game prototyping, embodied AI, and research in world models.

Key Features

  1. Text-to-World Generation: Create interactive worlds directly from descriptive text prompts
  2. Image-to-World Generation: Turn a single static image into an explorable continuous video world
  3. Text-Controlled Event Editing: Modify ongoing worlds via natural language (e.g., add events, change environment)
  4. Keyboard-Based Exploration: WASD movement and arrow key camera control for real-time navigation
  5. Joint Temporal-Spatial-Channel Modeling (TSCM): Efficient long-context compression across dimensions
  6. Real-Time Streaming Acceleration: Bidirectional attention distillation and self-forcing for low-latency inference
  7. Long-Horizon Coherence: Maintains consistency over extended interactions without collapse
  8. High-Performance Inference: Stable 12 FPS on single A100 GPU, supporting 720P output
  9. Open-Source Availability: Code on GitHub, preview 5B model weights on Hugging Face
  10. Autoregressive Video Diffusion: Generates continuous video streams with emergent dynamics
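
The discrete action tokens listed above can be pictured as a small fixed vocabulary mapping key presses to the ids the model conditions on. The specific mapping below is a hypothetical sketch; the real vocabulary is defined by the released model:

```python
# Hypothetical action-token vocabulary (ids are illustrative assumptions).
ACTION_TOKENS = {
    "w": 0, "a": 1, "s": 2, "d": 3,             # movement
    "up": 4, "down": 5, "left": 6, "right": 7,  # camera
    "noop": 8,                                   # fallback / idle
}

def encode_actions(keys):
    """Map a sequence of key presses to action-token ids for conditioning."""
    return [ACTION_TOKENS.get(k, ACTION_TOKENS["noop"]) for k in keys]

print(encode_actions(["w", "w", "left", "x"]))   # [0, 0, 6, 8]
```

In this sketch, unknown keys fall back to a no-op token, so malformed input degrades gracefully instead of interrupting the video stream.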

Price Plans

  1. Free ($0): Fully open-source under planned Apache 2.0 license with code on GitHub and preview 5B model weights on Hugging Face; no costs for download or local use
  2. Cloud/Enterprise (Custom): Potential future hosted inference or premium services (not available yet)

Pros

  1. Breakthrough speed: 12 FPS real-time generation on a single A100 GPU, roughly 70x faster than prior methods
  2. Strong interactivity: True keyboard control enables natural exploration and event editing
  3. Long-context stability: Handles extended sessions with preserved coherence and details
  4. High-quality output: Realistic visuals and dynamics from image or text inputs
  5. Fully open-source: Code, weights, and paper publicly available for research and extension
  6. Efficient compression: TSCM reduces memory/compute demands for long worlds
  7. Versatile applications: Suitable for gaming, simulation, robotics, and VFX prototyping

Cons

  1. Recent release: Limited community adoption and real-world testing as of early 2026
  2. Hardware demands: Optimal performance requires strong GPU (A100 or equivalent for 12 FPS)
  3. Preview stage: Weights and full features still evolving; actions/fast variants upcoming
  4. Setup required: Local deployment involves GitHub repo, dependencies, and model loading
  5. No hosted demo: No simple web interface; primarily for developers/researchers
  6. Potential inconsistencies: Long or complex interactions may show artifacts in edge cases
  7. No official user numbers: Very new, no reported widespread usage yet

Use Cases

  1. Game prototyping: Quickly generate and explore procedural levels or worlds without manual assets
  2. Embodied AI research: Simulate environments for agent training and navigation experiments
  3. Robotics simulation: Create dynamic scenes for robot learning and testing
  4. Autonomous driving: Generate traffic scenarios for safe virtual validation
  5. VFX and film pre-vis: Build explorable digital sets with camera control
  6. Interactive storytelling: Develop dynamic narratives controlled by text events
  7. Scientific visualization: Model complex systems or phenomena in explorable formats

Target Audience

  1. AI researchers: Studying world models, video diffusion, and interactive generation
  2. Game developers: Prototyping environments and testing mechanics rapidly
  3. Robotics/embodied AI teams: Needing realistic simulation sandboxes
  4. Autonomous systems engineers: Generating diverse driving or navigation scenarios
  5. VFX artists and filmmakers: Creating pre-visualization with interactive control
  6. Open-source enthusiasts: Extending or fine-tuning the model locally

How To Use

  1. Visit GitHub: Go to github.com/stdstu12/YUME for code, docs, and instructions
  2. Download weights: Get preview 5B model from huggingface.co/stdstu123/Yume-5B-720P
  3. Setup environment: Install dependencies (PyTorch, etc.) following repo guide
  4. Run inference: Use provided scripts for text-to-world or image-to-world generation
  5. Initialize world: Provide text prompt or upload starting image
  6. Explore interactively: Use WASD keys for movement and arrows for camera control
  7. Edit events: Input text commands like 'add rain' or 'spawn character' to modify
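
The steps above amount to a generate-act-edit loop. The sketch below simulates that control flow with a placeholder in place of the model call; `generate_frame` and the list-based state are assumptions for illustration, and the real entry points are the repo's inference scripts:

```python
# Hypothetical driver loop for interactive exploration (illustrative only).

def generate_frame(state, action, event=None):
    """Placeholder for the model call: advance the world by one step."""
    if event:
        state = state + [f"event:{event}"]    # text-controlled event editing
    return state + [f"act:{action}"]          # action-conditioned next frame

state = ["init:forest"]                       # from a text prompt or image
for step, action in enumerate(["w", "w", "d"]):
    event = "add rain" if step == 1 else None # mid-stream text event
    state = generate_frame(state, action, event)

print(state[-1])   # act:d
```

Each iteration consumes one action and optionally a text event, matching the streaming, autoregressive interaction model described above.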

How we rated Yume 1.5

  • Performance: 4.6/5
  • Accuracy: 4.5/5
  • Features: 4.8/5
  • Cost-Efficiency: 5.0/5
  • Ease of Use: 4.1/5
  • Customization: 4.7/5
  • Data Privacy: 5.0/5
  • Support: 4.2/5
  • Integration: 4.4/5
  • Overall Score: 4.6/5

Yume 1.5 integration with other tools

  1. Hugging Face: Preview model weights and inference pipelines available for easy download
  2. GitHub Repository: Full code, training/inference scripts, and community extensions
  3. Game Engines (Potential): Compatible with Unity or Unreal for procedural world integration via custom wrappers
  4. Simulation Frameworks: Works with robotics sims like MuJoCo or Isaac Sim for embodied training
  5. Local GPU Setup: Runs directly on hardware with CUDA support; no external cloud required

Best prompts optimized for Yume 1.5

  1. A vibrant cyberpunk city street at night with neon lights, rain, and flying vehicles, start from this image [upload city photo], enable keyboard navigation for exploration
  2. Fantasy enchanted forest with glowing mushrooms and ancient ruins, anime style, maintain long-term consistency and allow text events like 'summon dragon'
  3. Busy modern highway during golden hour sunset, realistic traffic and pedestrians, support collision-aware autonomous driving simulation
  4. Sci-fi spaceship corridor with holographic interfaces and crew members, zero-gravity effects, interactive agent navigation
  5. Serene mountain lake at dawn with mist and wildlife, photorealistic, enable dynamic weather changes via text like 'make it snow'

Yume 1.5 advances interactive world generation with real-time keyboard exploration, long-horizon coherence, and efficient context compression, making it a promising open-source rival to closed systems. Its 12 FPS on an A100 and text-event control suit research and prototyping well. As a recent release it demands strong hardware and some setup, but it holds substantial potential for gaming, simulation, and embodied AI tasks.

FAQs

  • What is Yume 1.5?

    Yume 1.5 is an open-source framework for generating realistic, interactive, continuous virtual worlds from text prompts or single images, supporting keyboard exploration and text-controlled events.

  • When was Yume 1.5 released?

    The paper was published on arXiv on December 26, 2025, with preview 5B model weights released around the same time.

  • Is Yume 1.5 free to use?

    Yes, it is open-source with code on GitHub and preview weights on Hugging Face; no usage fees for local deployment.

  • What hardware does Yume 1.5 require?

    It achieves 12 FPS on a single A100 GPU; high-end consumer GPUs may run it at reduced performance.

  • How fast is Yume 1.5 compared to similar models?

    It runs at 12 FPS, approximately 70 times faster than previous interactive world models due to advanced compression and acceleration.

  • What control methods does Yume 1.5 support?

    Users explore with WASD keys for movement and arrow keys for camera, plus text prompts for events like weather changes or object addition.

  • Where can I find Yume 1.5 code and weights?

    Code is on github.com/stdstu12/YUME, preview 5B-720P weights on huggingface.co/stdstu123/Yume-5B-720P.

  • What is Yume 1.5 best suited for?

    Ideal for game prototyping, embodied AI research, robotics simulation, autonomous driving scenarios, and VFX pre-visualization.


About Author

Hi Guys! We are a group of ML engineers by profession with years of experience exploring and building AI tools, LLMs, and generative technologies. We analyze new tools not just as users, but as people who understand their technical depth and real-world value. We know how overwhelming these tools can be for most people; that's why we break down complex AI concepts into simple, practical insights. Our goal is to help you discover AI tools that actually save you time and make everyday work smarter, not harder. "We don't just write about AI: we build, test, and simplify it for you."