HunyuanOCR Technical Report

Tencent Hunyuan Vision Team · arXiv: 2511.19575 · GitHub: Tencent-Hunyuan/HunyuanOCR · Hugging Face: tencent/HunyuanOCR

Motivation

  • Traditional OCR systems rely on modularized pipeline architectures, typically comprising text detection, text recognition, document layout analysis, named entity recognition, and optional text translation modules; this inevitably results in cumulative error propagation and elevates deployment and maintenance overhead. -> End-to-End
  • While leading general VLMs (e.g., Gemini, Qwen-VL) deliver superior OCR performance, they often entail excessive computational overhead and high latency due to their massive parameter scales. -> OCR-specific, lightweight (1B)
  • Unified multi-task modeling: text spotting, document parsing, information extraction, visual question answering, and text image translation.

Method

Model Architecture

/posts/vlms/hunyuanocr/images/1765868319417.webp

Large Language Model (0.5B): the HunyuanOCR LLM is initialized with pre-trained weights from Hunyuan-0.5B and uses xD-RoPE.

Native Resolution Vision Encoder (0.4B): initialized from the SigLIP-v2-400M pre-trained model.

Vision-Language Adapter:

(perceive): HunYuanVisionPatchMerger(
  (proj): Sequential(
    (0): Conv2d(1152, 2304, kernel_size=(2, 2), stride=(2, 2))
    (1): GELU(approximate='none')
    (2): Conv2d(2304, 4608, kernel_size=(1, 1), stride=(1, 1))
  )
  (mlp): Linear(in_features=4608, out_features=1024, bias=True)
  (before_rms): HunYuanVLRMSNorm((1152,), eps=1e-05)
  (after_rms): HunYuanVLRMSNorm((1024,), eps=1e-05)
)
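
From the module print, the adapter merges 2x2 neighboring patches (the stride-2 conv), compressing the token grid 4x so high-resolution inputs produce fewer LLM tokens, then projects into the 1024-dim LLM hidden size. Below is a minimal PyTorch sketch of a forward pass consistent with those shapes; the channels-first input layout and the use of torch.nn.RMSNorm (as a stand-in for HunYuanVLRMSNorm) are assumptions, not the official implementation.

```python
import torch
import torch.nn as nn

class PatchMergerSketch(nn.Module):
    """2x2 patch merging: compresses the ViT token grid 4x before the LLM."""
    def __init__(self, vit_dim: int = 1152, llm_dim: int = 1024):
        super().__init__()
        self.before_rms = nn.RMSNorm(vit_dim, eps=1e-5)  # stand-in for HunYuanVLRMSNorm
        self.proj = nn.Sequential(
            nn.Conv2d(vit_dim, 2 * vit_dim, kernel_size=2, stride=2),  # merge 2x2 patches
            nn.GELU(),
            nn.Conv2d(2 * vit_dim, 4 * vit_dim, kernel_size=1, stride=1),
        )
        self.mlp = nn.Linear(4 * vit_dim, llm_dim)       # project into the LLM hidden size
        self.after_rms = nn.RMSNorm(llm_dim, eps=1e-5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 1152, H, W) channels-first ViT feature map (layout assumed)
        x = self.before_rms(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        x = self.proj(x)                                 # (B, 4608, H/2, W/2)
        x = x.flatten(2).transpose(1, 2)                 # (B, H*W/4, 4608)
        return self.after_rms(self.mlp(x))               # (B, H*W/4, 1024) LLM tokens

tokens = PatchMergerSketch()(torch.randn(1, 1152, 32, 32))
print(tokens.shape)  # torch.Size([1, 256, 1024]) — 1024 patches reduced to 256 tokens
```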

End-to-End Optimization

Data Recipe

Data Collection

  • public benchmarks
  • web crawling
  • synthetic data

200 million image-text pairs spanning nine major real-world scenarios—street views, documents, advertisements, handwritten text, screenshots, cards/certificates/invoices, game interfaces, video frames, and artistic typography—and covering more than 130 languages.

/posts/vlms/hunyuanocr/images/1765962148831.webp

Data Synthesis

Data Augmentation

  • geometric deformation via control-point manipulation to emulate folds, curves, and perspective distortions.
  • imaging degradation with motion blur, Gaussian noise, and compression artifacts.
  • illumination perturbations that model global/local lighting variations, shadows, and reflections (a minimal sketch of all three augmentation families follows this list).
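
Below is a hedged sketch of these three augmentation families using standard OpenCV operations. The specific operations and parameter ranges are illustrative assumptions; the report does not publish its augmentation code.

```python
import cv2
import numpy as np

def augment(img: np.ndarray) -> np.ndarray:
    h, w = img.shape[:2]

    # 1) Geometric deformation: jitter the four corner control points
    #    to emulate perspective distortion / page warp.
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    dst = (src + np.random.uniform(-0.05, 0.05, src.shape) * [w, h]).astype(np.float32)
    img = cv2.warpPerspective(img, cv2.getPerspectiveTransform(src, dst), (w, h))

    # 2) Imaging degradation: motion blur, Gaussian noise, JPEG artifacts.
    k = np.zeros((7, 7), np.float32); k[3, :] = 1.0 / 7.0  # horizontal motion-blur kernel
    img = cv2.filter2D(img, -1, k)
    img = np.clip(img + np.random.normal(0, 8, img.shape), 0, 255).astype(np.uint8)
    _, buf = cv2.imencode(".jpg", img, [cv2.IMWRITE_JPEG_QUALITY, 40])
    img = cv2.imdecode(buf, cv2.IMREAD_COLOR)

    # 3) Illumination perturbation: global gain plus a local radial shadow.
    gain = np.random.uniform(0.6, 1.4)
    yy, xx = np.mgrid[0:h, 0:w]
    cx, cy = np.random.randint(0, w), np.random.randint(0, h)
    shadow = 1 - 0.4 * np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / (0.1 * h * w))
    return np.clip(img * gain * shadow[..., None], 0, 255).astype(np.uint8)
```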

Training Recipe

Stage-1 through Stage-4 constitute pre-training; reinforcement learning follows as a separate phase.

| Stages | Stage-1 | Stage-2 | Stage-3 | Stage-4 | Reinforcement Learning |
| --- | --- | --- | --- | --- | --- |
| Purpose | Vision-Language Alignment | Multimodal Pre-training | Long-context Pre-training | Application-oriented SFT | - |
| Trainable Params | ViT & Adapter | All | All | All | - |
| Learning Rate | 3e-4 → 3e-5 | 2e-4 → 5e-5 | 8e-5 → 5e-6 | 2e-5 → 1e-6 | - |
| Training Tokens | 50B | 300B | 80B | 24B | - |
| Sequence Length | 8k | 8k | 32k | 32k | - |
| Data Composition | Synthetic parsing and recognition data<br>General image caption data<br>Pure text (≤10%) | Increased proportion of synthetic spotting, parsing, translation, and VQA data<br>Pure text (≤10%) | Long pure text<br>Real-world auto-annotated data<br>Long document parsing data | Information extraction data<br>Human-annotated data<br>Hard-negative data<br>Standardized instruction data | - |

A small proportion of plain text is included throughout to preserve the core linguistic capabilities of the language model.
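
The table gives only the start and end learning rates per stage (e.g., Stage-1: 3e-4 → 3e-5). A minimal sketch of such a per-stage decay is below; the cosine shape is an assumption for illustration, as the report does not state the scheduler.

```python
import math

# (lr_start, lr_end) pairs taken from the training-recipe table above.
STAGES = {"stage1": (3e-4, 3e-5), "stage2": (2e-4, 5e-5),
          "stage3": (8e-5, 5e-6), "stage4": (2e-5, 1e-6)}

def stage_lr(step: int, total_steps: int, lr_start: float, lr_end: float) -> float:
    # Cosine decay from lr_start at step 0 to lr_end at total_steps (assumed shape).
    cos = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_end + (lr_start - lr_end) * cos

print(stage_lr(0, 1000, *STAGES["stage1"]))     # 3e-4 at the start of Stage-1
print(stage_lr(1000, 1000, *STAGES["stage1"]))  # 3e-5 at the end
```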

Supplementary Material

Evolution of Optical Character Recognition (OCR)

  • 1950s-1980s: OCR systems were based on template matching and feature engineering, focusing on basic text recognition in scanned documents.
  • 1990s: machine learning
  • mid-2010s: deep learning
  • 2020s: vision-language models
    • General VLMs
    • Specialized VLMs (Modular): still depend on a preliminary layout analysis module to detect document elements, with the VLM subsequently parsing content within localized regions.
    • Specialized OCR Models (End2End)

Performance Comparison of OCR Systems

/posts/vlms/hunyuanocr/images/1765959375668.webp

OCR Tasks

Text Spotting

detect and recognize text within an image, and output line-level text content and coordinates in the format <ref>text</ref><quad>(x1,y1),(x2,y2)</quad> (a parsing sketch follows the list below).

  • <ref>text</ref>: text content
  • <quad>(x1,y1),(x2,y2)</quad>: the text's bounding box, given by its top-left and bottom-right vertices, normalized to the range [0, 1000] to maintain consistency across input images of varying resolutions.
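
A minimal parser for this output format is sketched below. Only the tag format and the [0, 1000] normalization come from the report; the regex and field handling are assumptions.

```python
import re

# Matches <ref>text</ref><quad>(x1,y1),(x2,y2)</quad> spans in model output.
PATTERN = re.compile(r"<ref>(.*?)</ref><quad>\((\d+),(\d+)\),\((\d+),(\d+)\)</quad>")

def parse_spotting(output: str, img_w: int, img_h: int) -> list[dict]:
    lines = []
    for text, x1, y1, x2, y2 in PATTERN.findall(output):
        # De-normalize from the [0, 1000] grid back to pixel coordinates.
        box = tuple(int(v) * s / 1000 for v, s in
                    zip((x1, y1, x2, y2), (img_w, img_h, img_w, img_h)))
        lines.append({"text": text, "box": box})
    return lines

print(parse_spotting("<ref>Hello</ref><quad>(12,34),(456,78)</quad>", 1920, 1080))
# [{'text': 'Hello', 'box': (23.04, 36.72, 875.52, 84.24)}]
```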

/posts/vlms/hunyuanocr/images/1765956339215.webp

/posts/vlms/hunyuanocr/images/1765956318754.webp

Document Parsing

parse the full content of a document image into a structured format.

  • Fine-Grained Element Parsing: support independent identification and extraction of specialized document elements, including mathematical formulas, chemical formulas, tables, and charts.
  • End-to-End Document Parsing

/posts/vlms/hunyuanocr/images/1765970597086.webp

Information Extraction (IE)

extract key information fields from the text in an image.

/posts/vlms/hunyuanocr/images/1765970899026.webp

/posts/vlms/hunyuanocr/images/1765970920454.webp

Visual Question Answering (VQA)

answer questions based on the visual content of a given image.

Text Image Translation

translate the text in an image into either Chinese or English.

  • support over 14 languages
  • support both document-oriented images and general-purpose images

/posts/vlms/hunyuanocr/images/1765957270262.webp

/posts/vlms/hunyuanocr/images/1765956219203.webp

Common Supported IE Categories

/posts/vlms/hunyuanocr/images/1765956260743.webp

Reinforcement Learning Details

Code

HunYuanVLForConditionalGeneration(
  (model): HunYuanVLModel(
    (embed_tokens): Embedding(120818, 1024, padding_idx=120817)
    (layers): ModuleList(
      (0-23): 24 x HunYuanVLDecoderLayer(
        (self_attn): HunYuanVLAttention(
          (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
          (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
          (query_layernorm): HunYuanVLRMSNorm((128,), eps=1e-05)
          (key_layernorm): HunYuanVLRMSNorm((128,), eps=1e-05)
          (rotary_emb): HunYuanVLRotaryEmbedding()
        )
        (mlp): HunYuanVLMLP(
          (gate_proj): Linear(in_features=1024, out_features=3584, bias=False)
          (up_proj): Linear(in_features=1024, out_features=3584, bias=False)
          (down_proj): Linear(in_features=3584, out_features=1024, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): HunYuanVLRMSNorm((1024,), eps=1e-05)
        (post_attention_layernorm): HunYuanVLRMSNorm((1024,), eps=1e-05)
      )
    )
    (norm): HunYuanVLRMSNorm((1024,), eps=1e-05)
  )
  (lm_head): Linear(in_features=1024, out_features=120818, bias=False)
  (vit): HunYuanVisionTransformer(
    (embeddings): HunYuanVisionPatchEmbed(
      (patch_embedding): Conv2d(3, 1152, kernel_size=(16, 16), stride=(16, 16))
      (position_embedding): Embedding(16385, 1152)
    )
    (layers): ModuleList(
      (0-26): 27 x HunYuanVisionBlock(
        (self_attn): HunYuanVisionAttention(
          (q_proj): Linear(in_features=1152, out_features=1152, bias=True)
          (k_proj): Linear(in_features=1152, out_features=1152, bias=True)
          (v_proj): Linear(in_features=1152, out_features=1152, bias=True)
          (o_proj): Linear(in_features=1152, out_features=1152, bias=True)
        )
        (mlp): HunYuanVisionMLP(
          (act_fn): GELUActivation()
          (dense_h_to_4h): Linear(in_features=1152, out_features=4304, bias=True)
          (dense_4h_to_h): Linear(in_features=4304, out_features=1152, bias=True)
        )
        (input_layernorm): LayerNorm((1152,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((1152,), eps=1e-05, elementwise_affine=True)
      )
    )
    (perceive): HunYuanVisionPatchMerger(
      (proj): Sequential(
        (0): Conv2d(1152, 2304, kernel_size=(2, 2), stride=(2, 2))
        (1): GELU(approximate='none')
        (2): Conv2d(2304, 4608, kernel_size=(1, 1), stride=(1, 1))
      )
      (mlp): Linear(in_features=4608, out_features=1024, bias=True)
      (before_rms): HunYuanVLRMSNorm((1152,), eps=1e-05)
      (after_rms): HunYuanVLRMSNorm((1024,), eps=1e-05)
    )
  )
)
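
A quick readout of the printed shapes (derived only from the module dump above): the query/key RMSNorm shapes give head_dim = 128, so q_proj (1024 → 2048) yields 16 query heads while k_proj/v_proj (1024 → 1024) yield 8 KV heads, i.e., grouped-query attention with 2 query heads per KV head.

```python
# Head counts inferred purely from the module print above.
hidden, q_out, kv_out, head_dim = 1024, 2048, 1024, 128
n_q, n_kv = q_out // head_dim, kv_out // head_dim
print(n_q, n_kv, n_q // n_kv)  # -> 16 8 2 (GQA: 2 query heads per KV head)
```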