ShowUI

Lightweight Open-Source Vision-Language-Action Model for GUI Agents – End-to-End Screen Understanding and Action Execution
Last Updated: January 20, 2026
By Zelili AI

About This AI

ShowUI is an open-source vision-language-action model specifically designed for GUI visual agents, enabling end-to-end perception and interaction with user interfaces through screenshots alone.

Unlike traditional agents relying on closed-source APIs and text metadata (HTML/accessibility trees), ShowUI processes visual screenshots directly like humans, supporting tasks such as navigation, grounding, and action prediction in web, mobile, and desktop environments.

The 2B-parameter lightweight model (trained on 256K high-quality GUI instruction-following samples) features UI-Guided Visual Token Selection, which reduces redundant tokens by 33% and speeds up training/inference by 1.4x; interleaved Vision-Language-Action streaming for multi-turn query-action handling; and curated datasets that address data imbalances.

It achieves strong performance with 75.1% zero-shot screenshot grounding accuracy and competitive navigation results on benchmarks like Mind2Web (web), AITW (mobile), and MiniWob (online).

Released November 26, 2024 (arXiv submission), with code on GitHub (showlab/ShowUI), model weights on Hugging Face (showlab/ShowUI-2B), and a demo Space, it is licensed under MIT/Apache-2.0 and supports local deployment.

ShowUI advances GUI automation by making agents more visual, efficient, and accessible for developers, researchers, and applications in browser automation, desktop control, and mobile testing without external APIs.

Key Features

  1. UI-Guided Visual Token Selection: Reduces computational cost by identifying and pruning redundant visual tokens in screenshots via UI graph structure
  2. Interleaved Vision-Language-Action Streaming: Unifies multi-turn interactions, managing visual-action history for efficient navigation and query handling
  3. Zero-Shot Screenshot Grounding: Achieves 75.1% accuracy in locating UI elements from natural language instructions without prior examples
  4. End-to-End GUI Agent Capabilities: Processes screenshots to predict actions (click, tap, input, scroll, etc.) across web, mobile, and desktop
  5. Lightweight 2B Model: Efficient design trained on only 256K curated high-quality GUI data for fast local inference
  6. Benchmark Navigation Performance: Competitive results on Mind2Web (web), AITW (mobile), and MiniWob (online) environments
  7. Open-Source Full Stack: Complete code, weights, datasets, and demo available on GitHub and Hugging Face
  8. Multi-Platform Support: Works for browser, desktop apps, and mobile UI automation without API dependencies
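As a heavily simplified illustration of UI-Guided Visual Token Selection (feature 1), the intuition is that uniform screen regions (blank backgrounds, repeated fills) can be represented by far fewer tokens than their patch count suggests. The real method builds a connected-components graph over UI patches; the sketch below is only a toy stand-in that collapses consecutive identical patch signatures.

```python
# Toy token pruning: keep one representative per run of identical patches.
# ShowUI's actual method uses a UI connected-components graph over screenshot
# patches; this merely illustrates why uniform regions need few tokens.
from itertools import groupby

def prune_tokens(patches):
    """patches: sequence of hashable patch signatures (e.g. mean colors)."""
    return [next(run) for _, run in groupby(patches)]

row = ["white"] * 6 + ["button"] + ["white"] * 5
print(prune_tokens(row))       # ['white', 'button', 'white']
print(len(prune_tokens(row)))  # 3 tokens kept instead of 12
```

On real screenshots the savings come from large flat areas, which is consistent with the reported 33% token reduction, though the exact mechanism differs from this toy version.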

Price Plans

  1. Free ($0): Fully open-source under MIT/Apache-2.0 with model weights, code, datasets, and demo on Hugging Face/GitHub; no fees for use or deployment

Pros

  1. Lightweight and efficient: 2B parameters enable local deployment on consumer hardware with fast inference
  2. Strong zero-shot performance: 75.1% grounding accuracy and solid navigation on standard GUI benchmarks
  3. Fully open-source: MIT/Apache license with code, models, datasets, and demo Space for easy use/modification
  4. Visual-first approach: Avoids reliance on brittle text metadata, more robust to dynamic UIs
  5. Training optimizations: UI-guided token pruning speeds up by 1.4x and reduces resource needs
  6. Versatile GUI tasks: Supports web, mobile, desktop automation in one unified model
  7. Community resources: Hugging Face integration and active GitHub repo for extensions

Cons

  1. Requires technical setup: Local inference needs GPU, dependencies, and code execution knowledge
  2. Limited action space: Focused on standard GUI actions; may need extension for complex interactions
  3. Recent release: As a 2024 paper/model, adoption and community fine-tunes still emerging
  4. No hosted service: No cloud API or easy web UI; primarily for developers/researchers
  5. Potential visual artifacts: Complex or dense screens may challenge grounding accuracy
  6. Dataset scale: Trained on 256K samples—smaller than some proprietary models
  7. Hardware demands: 2B model still requires decent GPU for real-time use

Use Cases

  1. Web automation and testing: Navigate browsers, fill forms, and interact with dynamic sites visually
  2. Desktop GUI control: Automate applications, click buttons, input text in software without APIs
  3. Mobile app testing: Simulate user interactions on Android/iOS interfaces from screenshots
  4. GUI agent research: Benchmark, extend, or fine-tune vision-language-action models
  5. Accessibility tools: Assist users by grounding instructions to visual UI elements
  6. Robotic process automation: Visual-based RPA for legacy or non-API software
  7. Screen understanding prototypes: Build agents that reason and act purely from pixels

Target Audience

  1. AI researchers in GUI agents: Studying vision-language-action models and screen understanding
  2. Developers building automation tools: Creating visual agents for web/desktop/mobile without APIs
  3. QA and testing engineers: Automating UI tests across platforms
  4. Open-source AI enthusiasts: Experimenting with or extending the 2B model locally
  5. Robotics/embodied AI teams: Applying GUI perception to real-world interfaces
  6. Accessibility developers: Building visual assistants for disabled users

How To Use

  1. Visit GitHub repo: Go to github.com/showlab/ShowUI for code, installation guide, and examples
  2. Install dependencies: Set up a Python environment with the required libraries (transformers, torch, etc.)
  3. Download model: Pull ShowUI-2B from Hugging Face (showlab/ShowUI-2B)
  4. Load model: Use provided code snippet with Qwen2VLForConditionalGeneration and AutoProcessor
  5. Run inference: Input screenshot + task query to get action predictions (coordinates, action type)
  6. Test demo: Try the Hugging Face Space (showlab/ShowUI) for no-setup online interaction
  7. Integrate agent: Combine with environment simulators (e.g., MiniWob, AndroidEnv) for full loops
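Steps 4–5 above can be sketched in a few lines. The model ID comes from the Hugging Face repo named in step 3; the normalized [0, 1] coordinate output format is an assumption based on the model card, so treat this as a sketch rather than the canonical usage.

```python
# Sketch: load ShowUI-2B and turn a predicted normalized coordinate into a
# pixel click. Assumes transformers with Qwen2-VL support (>= 4.45) and that
# the model returns coordinates normalized to [0, 1], per the model card.
import ast

def load_showui(model_id="showlab/ShowUI-2B"):
    # Heavy import kept local so the helpers below run without transformers.
    from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto")
    processor = AutoProcessor.from_pretrained(model_id)
    return model, processor

def to_pixels(norm_xy, width, height):
    """Convert a normalized (x, y) prediction to pixel coordinates."""
    x, y = norm_xy
    return round(x * width), round(y * height)

# Example: parse a grounding answer such as "[0.49, 0.42]" on a 1920x1080 screen.
pred = ast.literal_eval("[0.49, 0.42]")
print(to_pixels(pred, 1920, 1080))  # (941, 454)
```

A real pipeline would pass the screenshot and instruction through the processor's chat template before generation; see the repo's examples for the exact prompt format.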

How we rated ShowUI

  • Performance: 4.5/5
  • Accuracy: 4.6/5
  • Features: 4.7/5
  • Cost-Efficiency: 5.0/5
  • Ease of Use: 4.2/5
  • Customization: 4.8/5
  • Data Privacy: 5.0/5
  • Support: 4.3/5
  • Integration: 4.5/5
  • Overall Score: 4.6/5

ShowUI integration with other tools

  1. Hugging Face: Model weights, processor, and demo Space for easy testing and inference
  2. GitHub Repository: Full open-source code, training scripts, and evaluation pipelines
  3. GUI Environments: Compatible with benchmarks like Mind2Web, AITW, MiniWob for testing
  4. Local Deployment: Runs via PyTorch/transformers on GPU; no cloud required
  5. Agent Frameworks: Can integrate with LangChain, AutoGen, or custom loops for full GUI agents
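The "full GUI agent" loop from item 5 can be sketched with stubs. All three callables (`capture_screen`, `predict_action`, `execute`) are hypothetical placeholders; a real loop would wire in ShowUI inference and an environment such as MiniWob or AndroidEnv.

```python
# Toy agent loop: screenshot -> model -> action, repeated until "DONE".
# The interleaved history mirrors ShowUI's multi-turn streaming design.

def run_agent(capture_screen, predict_action, execute, max_steps=10):
    history = []  # accumulated action history fed back to the model
    for _ in range(max_steps):
        screenshot = capture_screen()
        action = predict_action(screenshot, history)
        history.append(action)
        if action["action"] == "DONE":
            break
        execute(action)
    return history

# Scripted stubs for demonstration only.
script = iter([
    {"action": "CLICK", "position": [0.5, 0.2]},
    {"action": "INPUT", "value": "hello"},
    {"action": "DONE"},
])
trace = run_agent(
    capture_screen=lambda: b"fake-png-bytes",
    predict_action=lambda s, h: next(script),
    execute=lambda a: None,
)
print([a["action"] for a in trace])  # ['CLICK', 'INPUT', 'DONE']
```

The `max_steps` cap is a common safeguard so a confused model cannot loop forever.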

Best prompts optimised for ShowUI

  1. N/A - ShowUI is a vision-language-action model for GUI agents that processes screenshots and task queries directly; it does not use text prompts for generation like text-to-image/video tools. Input is screenshot + instruction (e.g., 'Click the login button'), output is action (coordinates + type).
  2. N/A - Core usage is providing UI screenshots and natural language instructions for grounding/actions; no manual creative prompting needed.
  3. N/A - For best results, use clear task queries like 'Find and click the search bar' with a desktop/web screenshot as visual input.

Final Verdict

ShowUI is an impressive lightweight 2B open-source vision-language-action model that enables GUI agents to understand and act on screenshots directly, achieving strong zero-shot grounding (75.1%) and navigation performance. Fully free with code and weights, it’s ideal for researchers and developers building visual automation without APIs. Setup is technical, but its efficiency and innovations make it a top choice for advancing GUI agents.

FAQs

  • What is ShowUI?

    ShowUI is a lightweight 2B open-source vision-language-action model for GUI visual agents, enabling direct screenshot perception and action prediction without text metadata APIs.

  • When was ShowUI released?

    The paper was submitted and model released on November 26, 2024, with code and weights available on GitHub and Hugging Face.

  • Is ShowUI free to use?

    Yes, it is fully open-source under MIT/Apache-2.0 with model weights, code, and demo freely available; no usage fees.

  • What are ShowUI’s main capabilities?

    It supports zero-shot screenshot grounding (75.1% accuracy), navigation on web/mobile/online benchmarks, and end-to-end GUI action execution from visuals.

  • Where can I download or try ShowUI?

    Model weights at huggingface.co/showlab/ShowUI-2B, code at github.com/showlab/ShowUI, and interactive demo Space at huggingface.co/spaces/showlab/ShowUI.

  • What benchmarks does ShowUI perform well on?

    It shows strong results on Mind2Web (web), AITW (mobile), MiniWob (online), and 75.1% zero-shot grounding accuracy.

  • How does ShowUI differ from other GUI agents?

    Unlike text/API-based agents, ShowUI processes screenshots visually like humans, using innovations like UI-guided token selection for efficiency.

  • What hardware is needed for ShowUI?

    The 2B model runs locally on a GPU via transformers/PyTorch; consumer hardware is sufficient for inference, though high-end cards are faster.
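A back-of-the-envelope estimate (parameter count x bytes per parameter, ignoring activations, KV cache, and framework overhead) shows why a 2B model fits on consumer GPUs:

```python
# Rough VRAM needed just for the weights of a 2B-parameter model.
# Activations, KV cache, and framework overhead add to this.
def weight_gb(params_billion, bytes_per_param):
    return params_billion * 1e9 * bytes_per_param / 1e9

for name, bpp in [("fp32", 4), ("bf16", 2), ("int8", 1)]:
    print(f"{name}: ~{weight_gb(2, bpp):.0f} GB")
# fp32: ~8 GB, bf16: ~4 GB, int8: ~2 GB
```

In bf16 (the usual inference precision), the weights alone need roughly 4 GB, which fits comfortably on an 8 GB consumer card.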


About Author

Hi Guys! We are a group of ML Engineers by profession with years of experience exploring and building AI tools, LLMs, and generative technologies. We analyze new tools not just as users, but as people who understand their technical depth and real-world value. We know how overwhelming these tools can be for most people; that’s why we break down complex AI concepts into simple, practical insights. Our goal is to help you discover the AI tools that actually save you time and make everyday work smarter, not harder. “We don’t just write about AI: we build, test, and simplify it for you.”