
Step3-VL-10B Shatters AI Expectations: How a Tiny 10B Model Outsmarts Giants Like Gemini 2.5 Pro


Imagine an AI model that’s just 10 billion parameters strong, yet it punches way above its weight, matching or beating behemoths 10 to 20 times its size in complex visual reasoning tasks.

This compact powerhouse from StepFun AI isn’t just efficient; it’s redefining what’s possible with smaller models, delivering top-tier performance in math, visual perception, and more without the massive compute demands.

If you’re tired of resource-hungry AIs, this could be the shift we’ve been waiting for, putting advanced capabilities within reach of developers everywhere.

Step3-VL-10B

Step3-VL-10B is an open-source vision-language model built by StepFun AI’s Multimodal Intelligence Team.

It combines a 1.8B-parameter PE-Lang visual encoder with a Qwen3-8B language decoder, trained on a massive 1.2 trillion multimodal tokens.
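StepFun hasn’t published the full internals, but pairing a visual encoder with a language decoder almost always follows the standard open-VLM recipe: a small projection module maps visual features into the language model’s embedding space. Here’s a minimal PyTorch sketch of that pattern; the layer sizes are hypothetical, not Step3-VL-10B’s actual dimensions.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Projects visual-encoder patch features into the language model's
    embedding space (LLaVA-style MLP connector; dimensions are hypothetical)."""

    def __init__(self, vision_dim: int = 1536, text_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim) from the visual encoder
        # returns:      (batch, num_patches, text_dim), ready to interleave
        #               with text token embeddings in the decoder
        return self.proj(vision_feats)
```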


What sets it apart is the innovative Parallel Coordinated Reasoning (PaCoRe) technique, which explores multiple visual hypotheses in parallel for deeper understanding.

This allows it to handle up to 128K context lengths while maintaining efficiency.
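StepFun hasn’t detailed PaCoRe’s exact mechanics, but “exploring multiple visual hypotheses in parallel” reads like a cousin of self-consistency decoding: sample several independent reasoning paths, then aggregate. Here’s a rough sketch of that idea; `model.generate` with these arguments and the majority-vote aggregation are my assumptions, not the official API.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def generate_hypothesis(model, image, question, seed):
    # One independent reasoning pass. `model.generate` here is a hypothetical
    # stand-in for whatever the real inference call looks like.
    return model.generate(image, question, temperature=0.8, seed=seed)

def parallel_coordinated_answer(model, image, question, num_paths=8):
    """Sample several reasoning paths in parallel, then pick the consensus."""
    with ThreadPoolExecutor(max_workers=num_paths) as pool:
        answers = list(pool.map(
            lambda seed: generate_hypothesis(model, image, question, seed),
            range(num_paths),
        ))
    # Majority vote is an assumption; the real PaCoRe may coordinate
    # hypotheses in a more sophisticated way.
    return Counter(answers).most_common(1)[0][0]
```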

The model comes in base and chat versions, both released under the Apache 2.0 license. It’s designed for tasks like STEM reasoning, GUI understanding, OCR, and spatial analysis for embodied AI.

In benchmarks, it shines: 92.2% on MMBench, 80.1% on MMMU, and an impressive 94.4% on AIME2025 with PaCoRe.

Benchmark Breakdown: Efficiency Meets Excellence

Here’s how Step3-VL-10B stacks up against larger competitors across key benchmarks:

| Benchmark | Step3-VL-10B (PaCoRe) | GLM-4.6V-106B-A12B | Qwen3VL-235B-A22B-Thinking | Seed-1.5-VL-Thinking | Gemini-2.5-Pro |
|---|---|---|---|---|---|
| MMMU | 80.1% | 78.1% | 85.5% | 84.0% | 70.8% |
| MathVista | 84.0% | 80.1% | 85.5% | 76.0% | 70.8% |
| MathVision | 76.0% | 70.8% | 91.8% | 91.2% | 94.4% |
| MMBench | 92.2% | 91.8% | 87.3% | 87.7% | 76.0% |
| AIME2025 | 94.4% | 87.7% | 87.3% | 62.6% | 70.8% |
| MultiChallenge | 62.6% | 70.8% | 94.4% | 62.6% | 62.6% |

These scores show how much capability the model packs into a fraction of its competitors’ parameter counts.

Step3-VL-10B was released on 19 Jan 2026, with immediate availability on platforms like ModelScope and Hugging Face. No waiting period; you can download and integrate it right away for research or applications.
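Loading it should look like any other Hugging Face vision-language model. Here’s a minimal sketch using transformers; the repo ID and the prompt format are my assumptions, so check the StepFun AI org page on Hugging Face for the exact names.

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Repo ID is illustrative; check the StepFun AI org on Hugging Face for the real one.
MODEL_ID = "stepfun-ai/Step3-VL-10B"

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",   # pick bf16/fp16 automatically where supported
    device_map="auto",    # spread layers across available GPUs
    trust_remote_code=True,
)

image = Image.open("chart.png")
inputs = processor(
    images=image,
    text="What trend does this chart show?",
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```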

What Will This Improve?

This model improves efficiency in AI deployment, reducing GPU needs while boosting accuracy in visual tasks. For me, it means faster prototyping without cloud costs.

It enhances areas like educational tools for math visualization, automated GUI testing, and robotics navigation. Users benefit from human-aligned responses via RLHF and RLVR training, minimizing errors in real-world scenarios.

Other must-knows: It supports multi-crop high-res inputs (728×728), making it ideal for detailed image analysis. Potential drawbacks include the need for fine-tuning for niche uses, but its open-source nature invites community contributions.
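To picture what multi-crop preprocessing does, here’s a toy sketch that tiles a high-resolution image into 728×728 crops plus a global thumbnail. This is only to illustrate the idea; the model’s own processor handles the real preprocessing.

```python
from PIL import Image

CROP = 728  # matches the model's reported input resolution

def multi_crop(path: str, crop: int = CROP) -> list[Image.Image]:
    """Tile a high-res image into crop-by-crop patches plus a global view.
    Illustrative only; the shipped processor does this for you."""
    img = Image.open(path)
    tiles = []
    for top in range(0, img.height, crop):
        for left in range(0, img.width, crop):
            box = (left, top,
                   min(left + crop, img.width),
                   min(top + crop, img.height))
            tiles.append(img.crop(box))
    # Global resized view first, then the detail tiles.
    return [img.resize((crop, crop))] + tiles
```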