What is Step 3.5 Flash?

Step 3.5 Flash is StepFun's open-source sparse MoE foundation model (196B total, 11B active) optimized for fast, frontier reasoning and agentic tasks with 256K context and high throughput.

When was Step 3.5 Flash released?

It was released in early February 2026 (around February 2-3), with weights on Hugging Face and API access shortly after.

Is Step 3.5 Flash free to use?

Yes, the model is completely open-source under Apache 2.0 for local inference; API/hosted access via providers like OpenRouter uses token-based pricing.

What are the key specs of Step 3.5 Flash?

196B parameters (11B active), 256K context, 100-300 tok/s (up to 350 tok/s coding), MTP-3 acceleration, strong on math/coding/agent benchmarks.

How fast is Step 3.5 Flash?

It achieves 100-300 tokens per second in typical use, peaking at 350 tok/s for single-stream coding on high-end hardware.

Can Step 3.5 Flash run locally?

Yes, with GGUF INT4 quantization via llama.cpp or vLLM, it runs on consumer GPUs/Macs with full 256K context support.

What benchmarks does Step 3.5 Flash excel at?

Leads open models in SWE-bench (74.4%), Terminal-Bench (51.0%), AIME/HMMT math, agentic tasks, and overall reasoning averages.

Who should use Step 3.5 Flash?

Developers building agents/coding tools, researchers in reasoning AI, enterprises needing efficient local/cloud LLMs for production workflows.

Step 3.5 Flash

Name: Step 3.5 Flash
Author: Zelili AI

From StepFun

Fast, Efficient Frontier Reasoning Model – Sparse MoE with Agentic Power at 100-350 tok/s Speeds

Text Generator

Pricing Model

Free

Starting Price

$0/Month

Last Updated: February 3, 2026

By Zelili AI

About This AI

Step 3.5 Flash is StepFun’s most capable open-source foundation model, released in early February 2026 as a sparse Mixture-of-Experts (MoE) LLM optimized for frontier reasoning and agentic capabilities with exceptional efficiency.

It has 196B total parameters but activates only about 11B per token, enabling high intelligence density while maintaining real-time responsiveness and low compute costs.

Powered by 3-way Multi-Token Prediction (MTP-3), it delivers 100-300 tokens per second throughput (peaking at 350 tok/s for coding tasks) and supports a 256K context window via hybrid sliding-window/full attention.

Excels in deep reasoning, math competitions (AIME 2025: 97.3, HMMT 2025: 96.2 avg), coding (SWE-bench Verified: 74.4%, Terminal-Bench 2.0: 51.0%), agentic tasks (τ²-Bench: 88.2), tool-use orchestration, multi-agent frameworks, and professional data analysis.

Features include scalable RL (MIS-PO) for self-improvement, bilingual reliability (English/Chinese), edge-cloud synergy (e.g., Step-GUI), and proactive intent clarification.

Available via API platforms, web chat, mobile apps, local deployment (GGUF INT4 quantized for consumer hardware), and Hugging Face weights under Apache 2.0.

It rivals or exceeds larger closed/open models in many benchmarks while being far more efficient, making it ideal for production agent workflows, coding assistants, research, and real-time applications.

Variants include PaCoRe enhanced version for boosted reasoning scores.

Key Features

Sparse MoE Architecture: 196B total parameters with only 11B active per token for efficient high-performance inference
3-way Multi-Token Prediction (MTP-3): Enables 100-300 tok/s throughput, peaking at 350 tok/s for coding tasks
256K Context Window: Long-context support via hybrid sliding-window/full attention with compensation
Frontier Reasoning: Strong on math (AIME, HMMT, IMO), coding (SWE-bench, Terminal-Bench), and agentic benchmarks
Agentic Capabilities: Tool-use orchestration, multi-agent frameworks, proactive clarification, and scalable RL (MIS-PO)
Bilingual Reliability: High performance in English and Chinese for global users
Local Deployment: GGUF INT4 quantized weights for consumer hardware via llama.cpp
Edge-Cloud Synergy: Supports hybrid workflows like Step-GUI for real-world applications
Multi-Agent Research: DeepSearch, ReAct-style agents for complex tasks
Open-Source Accessibility: Apache 2.0 license with Hugging Face weights, vLLM support, and easy integration

Price Plans

Free ($0): Open-source model weights and local inference under Apache 2.0; no cost for download/use on your hardware
API/Hosted (Token-based ~$0.10/M input, $0.40/M output): Pricing via providers like OpenRouter or StepFun platform for cloud access
Enterprise (Custom): Potential premium support, higher limits, or dedicated deployment through StepFun

Pros

Exceptional efficiency: High performance with low active parameters and fast inference speeds
Strong benchmark results: Outperforms or matches larger models in reasoning, coding, and agentic tasks
Fully open-source: Weights, code, and deployment options freely available under Apache 2.0
Real-time agent suitability: Low-latency design ideal for production autonomous workflows
Long-context strength: Reliable 256K handling with reduced compute overhead
Local runnable: Quantized versions work on consumer GPUs/Macs for private use
Continuous improvement: RL framework enables self-enhancement and stability

Cons

Requires strong hardware: Full real-time performance needs high-end GPUs; quantized versions trade some quality
Longer trajectories: May generate more tokens than some peers for complex reasoning
Limited free hosted access: Primarily API or local; free tiers (if any) have limits
Recent release: Community integrations and fine-tunes still emerging
Potential instability: Edge cases in long-horizon or mixed-language may occur
No official user count: Adoption metrics not publicly detailed yet
API pricing varies: Token-based costs on platforms like OpenRouter/StepFun

Use Cases

Agentic coding and development: Handle complex software engineering tasks, debugging, and repo analysis
Advanced reasoning and math: Solve competition-level problems, financial modeling, scientific queries
Autonomous agents: Multi-step tool-use, research orchestration, proactive workflows
Data analysis and research: DeepSearch, BI engine, stock investment automation
Simulation and visualization: 3D dashboards, procedural content, scientific viz
Production chatbots: Fast, reliable responses with long context and tool integration
Local/private deployments: Run on-prem for sensitive data or edge devices

Target Audience

Developers and AI engineers: Building agents, coding tools, or local LLMs
Researchers in reasoning/agentic AI: Experimenting with frontier open models
Business and finance professionals: Using for analysis, modeling, and automation
Game/simulation creators: Generating dynamic environments or agents
Enterprises: Needing efficient, private, high-performance LLMs via API/local
Open-source enthusiasts: Deploying and fine-tuning frontier models

How To Use

Download from Hugging Face: Visit huggingface.co/stepfun-ai/Step-3.5-Flash for weights
Local setup: Use vLLM, llama.cpp, or provided scripts; install dependencies per repo
Run inference: Load model with INT4/FP8 quantization for efficiency
Prompt for tasks: Use detailed instructions for reasoning, coding, or agent workflows
Enable agent mode: Integrate tools (Python exec, search) for multi-step actions
Cloud API: Use platforms like OpenRouter or StepFun API with model endpoint
Optimize speed: Leverage MTP-3 and low active params for fast responses

How we rated Step 3.5 Flash

Performance: 4.8/5
Accuracy: 4.7/5
Features: 4.9/5
Cost-Efficiency: 5.0/5
Ease of Use: 4.4/5
Customization: 4.8/5
Data Privacy: 4.9/5
Support: 4.5/5
Integration: 4.7/5
Overall Score: 4.8/5

Step 3.5 Flash integration with other tools

Hugging Face: Model weights, inference examples, and community pipelines
vLLM: High-throughput serving with day-0 support for Step 3.5 Flash
llama.cpp: Local quantized inference (INT4 GGUF) on consumer hardware
NVIDIA NIM: Optimized deployment on NVIDIA infrastructure
Tool-Use Frameworks: Compatible with LangChain, LlamaIndex for agent orchestration

Best prompts optimised for Step 3.5 Flash

Solve this AIME 2025 problem step-by-step with detailed reasoning: [insert math problem]
As a senior software engineer, analyze this codebase and suggest refactors to improve performance: [paste code]
Act as a research agent: DeepSearch and summarize the latest developments in quantum computing from reliable sources
Build a multi-agent system to plan a stock investment strategy using market data tools: start with current AAPL analysis
Generate a 3D visualization dashboard for weather patterns using procedural generation and real data inputs

Step 3.5 Flash is a highly efficient open-source frontier model with sparse MoE delivering strong reasoning, coding, and agentic performance at impressive speeds. Fully free for local use and competitively priced via APIs, it rivals larger proprietary systems while being accessible and deployable. Excellent for developers and agent builders seeking cost-effective intelligence.

FAQs

What is Step 3.5 Flash?
Step 3.5 Flash is StepFun’s open-source sparse MoE foundation model (196B total, 11B active) optimized for fast, frontier reasoning and agentic tasks with 256K context and high throughput.
When was Step 3.5 Flash released?
It was released in early February 2026 (around February 2-3), with weights on Hugging Face and API access shortly after.
Is Step 3.5 Flash free to use?
Yes, the model is completely open-source under Apache 2.0 for local inference; API/hosted access via providers like OpenRouter uses token-based pricing.
What are the key specs of Step 3.5 Flash?
196B parameters (11B active), 256K context, 100-300 tok/s (up to 350 tok/s coding), MTP-3 acceleration, strong on math/coding/agent benchmarks.
How fast is Step 3.5 Flash?
It achieves 100-300 tokens per second in typical use, peaking at 350 tok/s for single-stream coding on high-end hardware.
Can Step 3.5 Flash run locally?
Yes, with GGUF INT4 quantization via llama.cpp or vLLM, it runs on consumer GPUs/Macs with full 256K context support.
What benchmarks does Step 3.5 Flash excel at?
Leads open models in SWE-bench (74.4%), Terminal-Bench (51.0%), AIME/HMMT math, agentic tasks, and overall reasoning averages.
Who should use Step 3.5 Flash?
Developers building agents/coding tools, researchers in reasoning AI, enterprises needing efficient local/cloud LLMs for production workflows.

Newly Added Tools

Qwen-Image-2.0

Image & Design

$0/Month

Qodo AI

Code & Development

$0/Month

Codiga

Code & Development

$10/Month

Tabnine

Code & Development

$59/Month

Step 3.5 Flash Alternatives

Cognosys AI

Text Generator

$0/Month

AI Perfect Assistant

Text Generator

$17/Month

Intern-S1-Pro

Text Generator

$0/Month

About Author

Hi Guys! We are a group of ML Engineers by profession with years of experience exploring and building AI tools, LLMs, and generative technologies. We analyze new tools not just as a user, but as someone who understands their technical depth and real-world value.We know how overwhelming these tools can be for most people, that’s why we break down complex AI concepts into simple, practical insights. Our goal is to help you discover these magical AI tools that actually save your time and make everyday work smarter, not harder.“We don’t just write about AI: We build, test and simplify it for you.”

Step 3.5 Flash

From StepFun

About This AI

Key Features

Price Plans

Pros

Cons

Use Cases

Target Audience

How To Use

How we rated Step 3.5 Flash

Step 3.5 Flash integration with other tools

Best prompts optimised for Step 3.5 Flash

FAQs

What is Step 3.5 Flash?

When was Step 3.5 Flash released?

Is Step 3.5 Flash free to use?

What are the key specs of Step 3.5 Flash?

How fast is Step 3.5 Flash?

Can Step 3.5 Flash run locally?

What benchmarks does Step 3.5 Flash excel at?

Who should use Step 3.5 Flash?

Newly Added Tools​

Qwen-Image-2.0

Qodo AI

Codiga

Tabnine

Cognosys AI

AI Perfect Assistant

Intern-S1-Pro

Newly Added Tools