
Summary Box [In a hurry? Just read this⚡]
- Z.ai released GLM-OCR, a lightweight open-source multimodal OCR model with only 0.9B parameters that dominates document parsing benchmarks.
- It achieves the highest score on OmniDocBench v1.5 at 94.6%, leading in formulas (96.5%) and tables (86.0%).
- Processes 1.86 PDF pages per second on a single GPU — significantly faster than PaddleOCR and other competitors.
- Fully open-source under MIT license on Hugging Face; runs locally with Ollama, vLLM, or Transformers — no cloud required.
- Excellent at complex layouts, multi-language text, formulas, tables, and stamps, but still has some limitations with heavy handwriting recognition.
Z.ai has launched GLM-OCR, a cutting-edge open-source multimodal model engineered for superior document understanding and parsing.
Released in early February 2026, this 0.9B parameter powerhouse sets new standards in optical character recognition by delivering exceptional accuracy across challenging layouts, all while maintaining remarkable efficiency.
Topics
ToggleAvailable on Hugging Face under the permissive MIT license, it empowers developers, businesses, and researchers to integrate high-quality OCR without hefty computational demands.
At its core, GLM-OCR leverages a sophisticated architecture combining the CogViT visual encoder pretrained on vast image-text datasets, a streamlined cross-modal connector for token downsampling, and a GLM-0.5B language decoder.
Introducing GLM-OCR: SOTA performance, optimized for complex document understanding.
— Z.ai (@Zai_org) February 3, 2026
With only 0.9B parameters, GLM-OCR delivers state-of-the-art results across major document understanding benchmarks, including formula recognition, table recognition, and information extraction.… pic.twitter.com/2c6iSsaXYs
This setup enables a two-stage process: initial layout analysis via PP-DocLayout-V3, followed by parallel recognition, ensuring robust handling of diverse document types.
Benchmark Dominance
GLM-OCR excels on key evaluations, topping the OmniDocBench v1.5 with a score of 94.6%. It leads in specialized sub-tasks, achieving 96.5% in formula recognition and 86.0% in table parsing. Below is a comparison of its performance against leading models:
| Benchmark | GLM-OCR | PaddleOCR-v1.5 | Deepseek-OCR2 | MinerU2.5 | dots.ocr | Gemini-3-Pro | GPT-5.2-2025-12-11 |
|---|---|---|---|---|---|---|---|
| OmniDocBench v1.5 (Document Parsing) | 94.6 | 94.5 | 91.1 | 90.7 | 88.4 | 90.3 | 85.4 |
| OCRBench (Text Recognition) | 94.0 | 75.3 | 34.7 | 75.3 | 92.1 | 91.9 | 83.7 |
| UniMERNet (Formula Recognition) | 96.5 | 96.1 | 85.8 | 96.4 | 90.0 | 96.4 | 90.5 |
| PubTabNet (Table Recognition) | 85.2 | 84.6 | – | 88.4 | 71.0 | 91.4 | 84.4 |
| TEDS_TEST (Table Recognition) | 86.0 | 83.3 | – | 85.4 | 62.4 | 81.8 | 67.6 |
| Nanonets-KIE (Information Extraction) | 93.7 | – | – | – | – | 95.2 | 87.5 |
| Handwritten-Forms (Information Extraction) | 86.1 | – | – | – | – | 94.5 | 78.2 |
These scores highlight GLM-OCR‘s edge over both specialized and general vision-language models, particularly in real-world complexities like code documents, intricate tables, and official stamps.
Speed and Efficiency
Performance extends to inference speed, where GLM-OCR processes 1.86 PDF pages per second on a single GPU, surpassing competitors like PaddleOCR.
For images, it handles 0.67 per second, making it ideal for high-volume workflows. Its lightweight design ensures seamless local deployment without sacrificing quality.
Key efficiency features include:
- Multi-language support for global applications.
- Multi-Token Prediction for faster, more accurate outputs.
- Robustness to messy layouts, including seals and non-standard formats.
- Low resource footprint, running on standard hardware.
Deployment and Practical Use

Getting started is straightforward. Developers can deploy GLM-OCR via popular frameworks:
- Transformers for flexible integration.
- vLLM or SGLang for high-throughput serving.
- Ollama for easy local testing.
A comprehensive SDK simplifies installation and usage, with one-line calls for parsing. For businesses in finance, legal, or archiving, this means faster digitization of scanned documents, automated data extraction, and reduced manual errors.
While praised for its speed and accuracy on printed text, users should note limitations in handling extensive handwriting, where specialized models may still hold an advantage.
Overall, GLM-OCR democratizes advanced OCR, offering a balance of performance and accessibility that could transform document-heavy industries.



