Alibaba Unveils Qwen3-Max-Thinking: New Leader in AI Reasoning Benchmarks

By Zelili AI
January 27, 2026

Alibaba’s Qwen team released Qwen3-Max-Thinking on January 26, 2026, positioning it as their most advanced reasoning model to date.

Trained at massive scale with reinforcement learning, the model excels at complex problem-solving, tool use, and agentic tasks.

It introduces two standout innovations: adaptive tool-use (automatically selecting and applying tools like search, memory, and code interpreter) and test-time scaling (TTS, a multi-round self-reflection process that refines answers during inference for superior reasoning depth).


🚀 Introducing Qwen3-Max-Thinking, our most capable reasoning model yet. Trained with massive scale and advanced RL, it delivers strong performance across reasoning, knowledge, tool use, and agent capabilities.
✨ Key innovations:
✅ Adaptive tool-use: intelligently leverages… pic.twitter.com/6sZiKWQAq3
— Qwen (@Alibaba_Qwen) January 26, 2026

With these features, Qwen3-Max-Thinking matches or outperforms leading competitors, including GPT-5.2, Claude Opus-4.5, Gemini 3 Pro, and DeepSeek-V3.2, on several rigorous benchmarks.

The release demonstrates a clear focus on practical intelligence, where the model “thinks deeper” to tackle graduate-level science, competition math, coding challenges, software engineering tasks, function calling, and broad expert exams.

Key Innovations Explained

Adaptive Tool-Use: The model intelligently decides when and how to use external tools without user prompts, improving reliability in real-world applications.
Test-Time Scaling (TTS): During inference, it performs iterative self-reflection and refinement, raising accuracy over single-pass generation by roughly 5 to 8 points on the benchmarks below where both variants are reported. In the reported results, TTS versions consistently outperform the non-TTS variant and often edge out rivals.
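
The announcement does not describe how the self-reflection rounds are implemented. As a rough illustration only, the following is a minimal sketch of a generic draft, critique, and revise loop. The function `refine_with_reflection` and the `toy_*` helpers are hypothetical names invented for this example; they are not part of any Qwen API, and a real system would call a model where the stubs return hard-coded values.

```python
from typing import Callable, Optional, Tuple


def refine_with_reflection(
    question: str,
    draft: Callable[[str], str],
    critique: Callable[[str, str], Optional[str]],
    revise: Callable[[str, str, str], str],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Draft an answer, then critique and revise it for up to max_rounds.

    Returns the final answer and the number of revisions performed.
    """
    answer = draft(question)
    for revisions in range(max_rounds):
        feedback = critique(question, answer)
        if feedback is None:  # the critic found nothing to fix
            return answer, revisions
        answer = revise(question, answer, feedback)
    return answer, max_rounds


# Toy stand-ins (hypothetical): a real system would call a model here.
def toy_draft(question: str) -> str:
    return "381"  # deliberately wrong first pass for "17 * 23"


def toy_critique(question: str, answer: str) -> Optional[str]:
    left, right = (int(part) for part in question.split("*"))
    expected = str(left * right)
    return None if answer == expected else f"expected {expected}"


def toy_revise(question: str, answer: str, feedback: str) -> str:
    return feedback.split()[-1]  # adopt the value suggested by the critic


answer, revisions = refine_with_reflection(
    "17 * 23", toy_draft, toy_critique, toy_revise
)
print(answer, revisions)  # prints: 391 1
```

The `max_rounds` cap reflects the basic trade-off of any test-time approach of this kind: each extra round costs additional inference compute, so the loop stops as soon as the critic has no further objections.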

Benchmark Performance Highlights

With TTS enabled, Qwen3-Max-Thinking leads on four of the seven benchmarks listed below and stays within the competitors' range on the other three. The table summarizes select benchmarks (scores are percentage accuracy):

Benchmark	Description	Qwen3-Max-Thinking (with TTS)	Qwen3-Max-Thinking (without TTS)	Competitor Range (e.g., GPT-5.2 / Claude / Gemini)
GPQA Diamond	PhD-level science questions	92.8	87.4	91.9 – 92.4
IMO-AnswerBench	IMO-level math problems	91.5	83.9	83.3 – 86.3
LiveCodeBench	Competition coding problems (2025)	91.4	85.9	80.8 – 90.7
SWE-bench Verified	Real software engineering tasks	75.3	–	73.1 – 80.9
τ²-Bench	Function calling / tool use	82.1	–	80.3 – 85.7
Humanity’s Last Exam	Expert-level questions across subjects	36.5	–	25.1 – 37.5
Humanity’s Last Exam (with Search)	Same, but using search tools	49.8	–	40.8 – 45.8

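These results point to particular strength in reasoning-heavy tasks such as science, competition math, and coding. The model also scores 98.0 on HMMT, an advanced math competition benchmark, and gains markedly when tools are allowed: its Humanity’s Last Exam score rises from 36.5 to 49.8 with search.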
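
The figures in the table above can be checked with a few lines of arithmetic. The sketch below copies the table values as given, computes the TTS gain for the three benchmarks that report both variants, and lists the benchmarks where the with-TTS score exceeds the top of the competitor range. The names `ROWS`, `tts_gain`, and `leads` are chosen for this example only.

```python
# Each row: (benchmark, with_tts, without_tts or None, competitor_low, competitor_high)
# Values are copied from the table above; None marks a score that is not reported.
ROWS = [
    ("GPQA Diamond", 92.8, 87.4, 91.9, 92.4),
    ("IMO-AnswerBench", 91.5, 83.9, 83.3, 86.3),
    ("LiveCodeBench", 91.4, 85.9, 80.8, 90.7),
    ("SWE-bench Verified", 75.3, None, 73.1, 80.9),
    ("τ²-Bench", 82.1, None, 80.3, 85.7),
    ("Humanity’s Last Exam", 36.5, None, 25.1, 37.5),
    ("Humanity’s Last Exam (with Search)", 49.8, None, 40.8, 45.8),
]

# Percentage-point gain from TTS, where both variants are reported.
tts_gain = {
    name: round(with_tts - without_tts, 1)
    for name, with_tts, without_tts, _, _ in ROWS
    if without_tts is not None
}

# Benchmarks where the with-TTS score beats the best listed competitor score.
leads = [name for name, with_tts, _, _, high in ROWS if with_tts > high]

print(tts_gain)
print(leads)
```

On these numbers the TTS gain is 5.4, 7.6, and 5.5 points respectively, and the model leads on GPQA Diamond, IMO-AnswerBench, LiveCodeBench, and Humanity’s Last Exam with search.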