What is D4RT?
D4RT (Dynamic 4D Reconstruction and Tracking) is a unified AI model from Google DeepMind that reconstructs dynamic 4D scenes (3D space plus time) from monocular video, disentangling camera and object motion efficiently.
When was D4RT announced?
D4RT was introduced by Google DeepMind on January 22, 2026, via their official blog post.
Is D4RT open-source or publicly available?
No, D4RT is currently a research model with no code, weights, or public demo released; only the technical report and project page are available.
How fast is D4RT compared to previous methods?
It processes one-minute videos in about 5 seconds on a single TPU, up to 300x faster than prior state-of-the-art approaches.
What tasks does D4RT support?
It enables all-pixels 3D tracking, point cloud reconstruction, camera pose estimation, and long-term prediction through a single query interface.
What are D4RT’s main applications?
Primarily robotics (dynamic navigation/manipulation), augmented reality (low-latency scene understanding), and advancing AI world models for physical perception.
What benchmarks does D4RT excel on?
It achieves SOTA on MPI Sintel (complex motion), Aria Digital Twin (ego-motion/occlusions), and RE10k (diverse scenes) for 4D reconstruction and tracking.
Who developed D4RT?
D4RT was developed by Google DeepMind researchers Guillaume Le Moing and Mehdi S. M. Sajjadi.

D4RT

About This AI
D4RT (Dynamic 4D Reconstruction and Tracking) is a groundbreaking unified AI model from Google DeepMind that enables machines to understand dynamic scenes captured in 2D videos by reconstructing a coherent 4D representation (3D space plus time).
It disentangles camera motion, object motion, and static geometry in a single feedforward process, providing a flexible query-based interface to answer questions like ‘Where is a given pixel located in 3D space at any time from any camera viewpoint?’.
Built on a Transformer encoder-decoder architecture, D4RT compresses input videos into a compact latent representation and uses lightweight querying for parallel, efficient inference across multiple 4D tasks.
Capabilities include all-pixels 3D point tracking (even through occlusions), point cloud reconstruction at arbitrary time steps, long-term prediction, and camera pose estimation, all from monocular video without heavy optimization.
It processes a one-minute video in roughly 5 seconds on a single TPU (up to 300x faster than prior SOTA methods) and achieves state-of-the-art results on benchmarks like MPI Sintel (complex motion), Aria Digital Twin (household ego-motion), and RE10k (diverse scenes). It is also robust to fast motion, motion blur, non-rigid deformation, occlusions, and dynamic objects.
Announced January 22, 2026, D4RT advances toward robust world models for AI, with strong potential in robotics (spatial awareness for navigation/manipulation), augmented reality (low-latency scene understanding), and broader perception for physical intelligence.
While D4RT itself is not open-source or publicly available yet, the technical report is on arXiv and the project page offers visuals and comparisons.
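For readers who want a concrete picture of the query-based interface described above, the minimal PyTorch sketch below shows the general pattern: encode a video once into a compact latent, then decode many lightweight (pixel, time, camera) queries in parallel into 3D points. All class names, dimensions, and layer choices here are illustrative assumptions; D4RT's actual architecture and code have not been released.

```python
# Hypothetical sketch of a D4RT-style query interface (not DeepMind's implementation).
# A video is encoded once into a compact latent; many lightweight queries of the form
# (pixel u, v, source time, target time, camera id) are then decoded in parallel.
import torch
import torch.nn as nn

class QueryableSceneModel(nn.Module):
    def __init__(self, latent_dim=256, num_latents=64):
        super().__init__()
        # Encoder: compresses flattened video patches into a small set of latent vectors.
        self.patch_embed = nn.Linear(3 * 16 * 16, latent_dim)  # 16x16 RGB patches (assumed)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=latent_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.num_latents = num_latents
        # Decoder: each query cross-attends to the latent and predicts a 3D point.
        self.query_embed = nn.Linear(5, latent_dim)  # (u, v, t_src, t_tgt, cam_id)
        self.cross_attn = nn.MultiheadAttention(latent_dim, num_heads=8, batch_first=True)
        self.head = nn.Linear(latent_dim, 3)  # (x, y, z)

    def encode(self, patches):
        # patches: (B, N, 3*16*16) flattened patches from all frames of the video
        tokens = self.patch_embed(patches)
        latent = self.encoder(tokens)
        return latent[:, : self.num_latents]  # keep a compact latent representation

    def query(self, latent, queries):
        # queries: (B, Q, 5) -> predicted 3D positions (B, Q, 3), all queries in parallel
        q = self.query_embed(queries)
        attended, _ = self.cross_attn(q, latent, latent)
        return self.head(attended)

# Usage with random data, just to show the call pattern.
model = QueryableSceneModel()
video_patches = torch.randn(1, 512, 3 * 16 * 16)   # patches from a short clip
latent = model.encode(video_patches)                # encode once
queries = torch.rand(1, 1000, 5)                    # 1000 pixel/time/camera queries
points_3d = model.query(latent, queries)            # (1, 1000, 3)
```

The key design point this sketch tries to convey is that the expensive work (encoding the video) happens once, while each downstream task is just another batch of cheap queries against the same latent.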
Key Features
- Unified query interface: Single encoder-decoder handles multiple 4D tasks via flexible pixel queries
- All-pixels 3D tracking: Predicts 3D trajectories for every pixel across time, even through occlusions
- Point cloud reconstruction: Generates accurate 3D structure at any frozen time step and from any viewpoint
- Camera pose estimation: Recovers the full camera trajectory by aligning 3D snapshots (see the alignment sketch after this list)
- Long-term prediction: Maintains coherent future scene understanding beyond input frames
- High efficiency: Processes 1-minute video in 5 seconds on single TPU (18x to 300x faster than SOTA)
- Robust to dynamics: Handles fast motion, motion blur, non-rigid deformation, occlusions, and object motion
- Feedforward architecture: No iterative optimization needed for inference
- Disentangled representation: Separates camera, object motion, and static geometry
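The "aligning 3D snapshots" idea behind camera pose estimation can be illustrated with a standard rigid-alignment routine: given the same scene points expressed in two reconstructed snapshots, the Kabsch algorithm recovers the relative rotation and translation between them. This is a generic textbook procedure offered purely as an illustration, not D4RT's published method.

```python
# Generic illustration of pose recovery by aligning two 3D "snapshots" of the same
# scene points (Kabsch algorithm). Not D4RT's actual pose-estimation code.
import numpy as np

def kabsch_alignment(src, dst):
    """Find rotation R and translation t such that R @ src_i + t ~= dst_i."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)           # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

# Toy check: recover a known relative pose from two snapshots of the same points.
rng = np.random.default_rng(0)
points_t0 = rng.normal(size=(100, 3))              # scene points in frame 0
angle = np.deg2rad(10.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.2, -0.1, 0.05])
points_t1 = points_t0 @ R_true.T + t_true           # same points in frame 1
R_est, t_est = kabsch_alignment(points_t0, points_t1)
assert np.allclose(R_est, R_true, atol=1e-6) and np.allclose(t_est, t_true, atol=1e-6)
```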
Price Plans
- Research/Non-Commercial ($0): Announced as a research model with no public access or pricing; technical report free on arXiv
- Potential Future Enterprise (Custom): DeepMind may offer access via API or partnerships (not available yet)
Pros
- Extreme speed gains: Up to 300x faster inference than previous dynamic 4D methods
- Superior benchmark performance: SOTA on MPI Sintel, Aria Digital Twin, RE10k for tracking and reconstruction
- Unified flexible interface: One model for tracking, reconstruction, pose estimation without task-specific heads
- Handles complex dynamics: Robust to occlusions, fast motion, non-rigid objects, and ego-motion
- Advances world models: Step toward AI with true 4D physical understanding for robotics and AR
- Research impact potential: Could enable safer robotics, better AR overlays, and progress in physical intelligence
Cons
- Not publicly available: No code, weights, or demo released as of announcement
- Research-stage only: Focused on academic benchmarks; real-world deployment not yet demonstrated
- Compute-intensive training: Likely requires massive resources (though inference is efficient)
- Limited to monocular video: Relies on single-view input without depth sensors
- No open-source access: Unlike many DeepMind releases, no GitHub or Hugging Face repo mentioned
- Early announcement: Full capabilities and limitations still under exploration
Use Cases
- Robotics navigation: Enable robots to perceive and predict dynamic environments with moving objects
- Augmented reality overlays: Provide low-latency 4D scene understanding for accurate digital object placement
- Autonomous systems simulation: Reconstruct 4D scenes for testing and training in varied conditions
- Video analysis and editing: Track objects in motion, estimate camera paths, or predict future frames
- Physical world modeling: Build toward AI agents with true spatiotemporal awareness
- Research in perception: Advance dynamic scene understanding and world models for AGI
Target Audience
- Robotics researchers and engineers: Needing fast, accurate 4D perception for real-world interaction
- AR/VR developers: Requiring low-latency dynamic scene reconstruction for immersive experiences
- Computer vision scientists: Exploring unified models for tracking, reconstruction, and pose estimation
- Autonomous vehicle teams: Simulating complex dynamic environments from video
- AI research community: Studying advances in world models and spatiotemporal understanding
- DeepMind collaborators: Potential access through partnerships or future releases
How To Use
- Read the blog: Visit deepmind.google/blog/d4rt-teaching-ai-to-see-the-world-in-four-dimensions for overview and visuals
- Review paper: Access technical report at arXiv.org/abs/2512.08924 for architecture and results
- Explore project page: Check d4rt-paper.github.io for demos, videos, and comparisons
- Wait for potential release: Monitor DeepMind announcements for code, weights, or API availability
- Reproduce results: Use the described querying mechanism if/when an implementation is released
- Apply in research: Reference D4RT baselines for new 4D reconstruction or tracking work
How we rated D4RT
- Performance: 4.9/5
- Accuracy: 4.8/5
- Features: 4.7/5
- Cost-Efficiency: 4.5/5
- Ease of Use: 3.5/5
- Customization: 4.2/5
- Data Privacy: 4.0/5
- Support: 4.0/5
- Integration: 4.3/5
- Overall Score: 4.4/5
D4RT integration with other tools
- Research Frameworks: Potential compatibility with computer vision libraries like PyTorch or JAX for reproduction/experiments
- Simulation Environments: Designed for integration with robotics sims (e.g., MuJoCo, Isaac Sim) for dynamic perception testing
- AR/VR Platforms: Future low-latency 4D understanding suitable for Unity/Unreal Engine plugins
- Video Processing Pipelines: Could feed into tools like OpenCV or FFmpeg for preprocessing input videos (a minimal example follows this list)
- DeepMind Ecosystem: Likely ties into broader Google AI research tools and datasets
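As a concrete example of the video-preprocessing point above, the short OpenCV sketch below reads a monocular video, resizes its frames, and stacks them into a normalized array. The resolution, frame cap, and function name are arbitrary assumptions for illustration, not documented D4RT input requirements.

```python
# Hypothetical preprocessing sketch: read a monocular video with OpenCV and stack its
# frames into an array that a D4RT-style model could consume downstream.
import cv2
import numpy as np

def load_video(path, size=(256, 256), max_frames=300):
    cap = cv2.VideoCapture(path)
    frames = []
    while len(frames) < max_frames:
        ok, frame_bgr = cap.read()
        if not ok:
            break
        frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
        frames.append(cv2.resize(frame_rgb, size))
    cap.release()
    # (T, H, W, 3) float array in [0, 1], ready to be patched/tokenized downstream
    return np.stack(frames).astype(np.float32) / 255.0

# Example usage: video = load_video("input.mp4")
```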
Best prompts optimised for D4RT
- N/A - D4RT is not a prompt-based generative tool. It takes an existing monocular video as input and answers structured queries (pixel, time, camera viewpoint) about 3D positions over time, so there are no text prompts to optimise.
