What is EgoEdit?
EgoEdit is a research framework from Snap Research for real-time, instruction-guided editing of egocentric (first-person) videos, enabling interactive AR applications with object manipulation and style transfer.
When was EgoEdit announced?
The EgoEdit paper (arXiv 2512.06065) was published on December 5, 2025, with dataset and benchmark release planned soon after.
Is EgoEdit open-source or free?
It’s a research project; dataset (EgoEditData) and benchmark (EgoEditBench) are planned for public release, but model code/demo availability is not yet confirmed as of early 2026.
What makes EgoEdit unique?
It specializes in egocentric videos, handling rapid motion, hand occlusions, and interactions for real-time AR editing on a single GPU with low latency.
What hardware does EgoEdit require?
It runs in real time on a single H100 GPU with 855ms first-frame latency and 38.1 FPS streaming performance.
What are EgoEdit’s main capabilities?
Object morphing/substitution, addition/removal, scene replacement, style transfer (e.g., ukiyo-e), depth maps, and complex instruction following in first-person views.
Who developed EgoEdit?
Led by Snap Research with collaborators from Rice University and University of Oxford, including authors like Runjia Li and Sergey Tulyakov.
What is EgoEditBench?
A comprehensive benchmark for evaluating egocentric video editing systems, used to compare EgoEdit against baselines like Senorita and InsV2V.

EgoEdit

About This AI
EgoEdit is a research framework from Snap Research for real-time instruction-guided editing of egocentric (first-person) videos, targeting interactive augmented reality (AR) applications.
It addresses challenges unique to egocentric footage, such as rapid egomotion, frequent hand occlusions, and hand-object interactions, which create a domain gap for video editors built for third-person footage.
The system includes three components: EgoEditData (a manually curated dataset of 100k video editing pairs focused on egocentric cases with object substitution/removal under tough conditions), EgoEdit (the core real-time autoregressive model for streaming inference), and EgoEditBench (a comprehensive benchmark for evaluating egocentric video editing).
EgoEdit enables live AR interactions by processing video frames sequentially with low latency (855ms first-frame latency on a single H100 GPU, 38.1 FPS streaming).
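Because the model code has not been released, the snippet below is only a conceptual sketch of what frame-by-frame, instruction-conditioned streaming could look like in Python; the EgoEditModel class, its edit_frame method, and the timing harness are hypothetical placeholders, not the published API.

    # Hypothetical sketch of streaming, instruction-guided frame editing.
    # EgoEditModel and edit_frame() are placeholders, not the released API.
    import time
    import numpy as np

    class EgoEditModel:
        """Stand-in for an autoregressive streaming editor (hypothetical)."""

        def edit_frame(self, frame: np.ndarray, instruction: str) -> np.ndarray:
            # A real model would condition on the instruction and on previously
            # generated frames; this stub simply returns the input frame.
            return frame

    def stream_edit(frames, instruction, model):
        """Edit frames one at a time, reporting first-frame latency and throughput."""
        start = time.perf_counter()
        first_frame_ms = None
        edited = []
        for i, frame in enumerate(frames):
            edited.append(model.edit_frame(frame, instruction))
            if i == 0:
                first_frame_ms = (time.perf_counter() - start) * 1000
        elapsed = time.perf_counter() - start
        fps = len(edited) / elapsed if elapsed > 0 else float("inf")
        print(f"first frame: {first_frame_ms:.1f} ms, throughput: {fps:.1f} FPS")
        return edited

    if __name__ == "__main__":
        clip = [np.zeros((480, 640, 3), dtype=np.uint8) for _ in range(32)]
        stream_edit(clip, "turn the bottle into an ornate silver goblet", EgoEditModel())

The loop mirrors the streaming setup described above: each frame is edited as it arrives rather than after the whole clip is buffered, which is what keeps first-frame latency low.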
Capabilities include object morphing/substitution (e.g., turn bottle into goblet), removal/addition (e.g., add spoon in hand), scene replacement (e.g., kitchen to office), style transfer (e.g., ukiyo-e art, psychedelic poster), depth map generation, and handling complex instructions involving textures, lighting, and materials.
It produces temporally stable, instruction-faithful results that remain robust to motion and occlusions.
Announced in December 2025 (arXiv paper 2512.06065), with dataset and benchmark planned for release to support research; the model runs in real time on a single GPU.
Primarily a research contribution from Snap Research (with collaborators from Rice University and University of Oxford), aimed at advancing egocentric video editing for AR/VR, robotics, and interactive media.
No public user numbers or widespread adoption have been reported, as it is a recent research release aimed at academic and development use.
Key Features
- Real-time streaming inference: Processes egocentric video frames sequentially with 855ms first-frame latency and 38.1 FPS on a single H100 GPU
- Instruction-guided editing: Follows natural language prompts for object morphing, substitution, addition/removal, scene changes, and style transfers
- Robust to egocentric challenges: Handles rapid egomotion, hand occlusions, interactions, and large motion without domain gap issues
- Temporal stability: Produces consistent, coherent edits across frames for live AR interactions
- Complex instruction support: Manages detailed attributes like textures, lighting, materials, and artistic styles (e.g., ukiyo-e, psychedelic)
- Object manipulation: Precise substitution (e.g., bottle to goblet), addition (e.g., spoon in hand), and removal in occluded scenes
- Scene and style transformation: Replace backgrounds (kitchen to office), apply art styles, or generate depth maps
- Dataset and benchmark integration: Trained/evaluated on EgoEditData (100k pairs) and EgoEditBench for standardized testing
Price Plans
- Free ($0): Research project with planned public release of dataset and benchmark; no commercial pricing or subscriptions mentioned
Pros
- Real-time performance: Enables live AR editing on a single GPU with low latency and high FPS
- Strong egocentric specialization: Outperforms general video editors in handling first-person challenges like hand occlusions and motion
- High instruction fidelity: Accurately follows complex prompts for object/scene/style changes
- Research-grade quality: Superior temporal consistency and robustness demonstrated on benchmarks
- Comprehensive ecosystem: Includes dataset, model, and benchmark to advance the field
- Potential for AR/VR: Opens doors for interactive augmented reality applications
Cons
- Research-oriented: Not yet a consumer tool; requires technical setup for local inference
- Hardware demands: Needs a high-end GPU (e.g., an H100) for real-time performance
- No public code/demo yet: Dataset and benchmark planned for release; model availability unclear
- Limited scope: Focused on egocentric videos; may not generalize as well to third-person footage
- Early-stage release: Announced December 2025; no widespread user adoption or stats
- Potential artifacts: Complex long videos or extreme occlusions may still show inconsistencies
Use Cases
- Augmented reality prototyping: Live editing of first-person videos for AR experiences
- Object manipulation research: Testing substitution/removal in occluded, high-motion scenes
- Style transfer in egocentric views: Applying artistic filters or transformations to wearable camera footage
- Scene editing for VR/AR: Replacing environments while maintaining user interactions
- Benchmarking video editors: Using EgoEditBench to evaluate other egocentric editing systems
- Robotics and embodied AI: Simulating edited first-person views for training agents
Target Audience
- AI and computer vision researchers: Studying egocentric video editing and AR
- AR/VR developers: Prototyping real-time interactive editing features
- Robotics teams: Using first-person simulations for agent training
- Academic institutions: Leveraging dataset and benchmark for experiments
- Snap Research collaborators: Building on the framework for future work
How To Use
- Visit project page: Go to snap-research.github.io/EgoEdit for details, videos, and updates
- Wait for release: Dataset and benchmark planned for public release post-announcement
- Access code/model: Check GitHub (github.com/snap-research/EgoEdit) once artifacts are shared
- Run inference: Use the provided scripts on a compatible GPU for real-time editing (a hypothetical driver sketch follows this list)
- Input video/instructions: Feed egocentric footage and text prompts for editing
- Evaluate results: Compare against EgoEditBench metrics for research
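For readers who want to wire things up before the official scripts land, here is a minimal, hypothetical driver assuming OpenCV for video I/O: it reads an egocentric clip frame by frame, applies a stubbed edit_frame call in place of the real model, and writes the edited result. The file names and the edit_frame entry point are illustrative assumptions, not part of the released tooling.

    # Hypothetical end-to-end driver; edit_frame() stands in for the real model call.
    import cv2

    def edit_frame(frame, instruction):
        # Placeholder: the released model would edit the frame per the instruction.
        return frame

    def edit_video(src_path, dst_path, instruction):
        cap = cv2.VideoCapture(src_path)
        fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
        width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        writer = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height))
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            writer.write(edit_frame(frame, instruction))
        cap.release()
        writer.release()

    if __name__ == "__main__":
        edit_video("kitchen_pov.mp4", "edited.mp4", "replace the kitchen with a small home office")

Once real inference code is available, only the edit_frame stub would need to be swapped for the actual model call; the surrounding I/O loop stays the same.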
How we rated EgoEdit
- Performance: 4.8/5
- Accuracy: 4.7/5
- Features: 4.6/5
- Cost-Efficiency: 4.9/5
- Ease of Use: 4.0/5
- Customization: 4.5/5
- Data Privacy: 5.0/5
- Support: 4.1/5
- Integration: 4.3/5
- Overall Score: 4.5/5
EgoEdit integration with other tools
- Research Frameworks: Compatible with video processing pipelines like PyTorch for local inference and experimentation (see the loading sketch after this list)
- Benchmark Tools: Designed to work with EgoEditBench for standardized evaluation of editing models
- Potential AR Platforms: Outputs suitable for integration with AR/VR headsets or frameworks like Unity/ARCore
- Dataset Usage: EgoEditData supports training/fine-tuning in custom video editing research setups
- High-End GPUs: Optimized for single H100 or similar hardware for real-time streaming
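As a concrete example of the PyTorch compatibility noted above, here is a small, generic preprocessing sketch rather than EgoEdit's own pipeline: it loads a clip with torchvision, converts it to the (T, C, H, W) float layout common in video editing pipelines, and resizes it. The file path and target resolution are assumptions for illustration.

    # Generic PyTorch/torchvision preprocessing sketch; not EgoEdit's own pipeline.
    import torch
    import torch.nn.functional as F
    from torchvision.io import read_video

    def load_clip(path: str, size: int = 256) -> torch.Tensor:
        frames, _, _ = read_video(path, pts_unit="sec")      # (T, H, W, C), uint8
        frames = frames.permute(0, 3, 1, 2).float() / 255.0  # (T, C, H, W) in [0, 1]
        return F.interpolate(frames, size=(size, size), mode="bilinear", align_corners=False)

    clip = load_clip("egocentric_clip.mp4")
    print(clip.shape)  # e.g. torch.Size([T, 3, 256, 256])

A tensor in this layout can then be fed to whatever editing or evaluation code a research setup uses, including any EgoEditBench tooling released later.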
Best prompts optimised for EgoEdit
- Morph the white shaker bottle with blue cap into an ornate silver goblet while keeping hand interactions natural
- Replace the kitchen background with a small home office desk, preserving lighting and subject pose
- Apply ukiyo-e woodblock print art style to the entire egocentric video scene
- Add a spoon in the person's hand during the stirring motion, matching grip and lighting
- Turn the video into a realistic depth map visualization with accurate foreground-background separation
