HunyuanOCR Technical Report

Naifan Li — Sun, 16 Nov 2025 14:27:22 +0800

Motivation

Traditional OCR systems rely on a modular pipeline architecture whose stages include, but are not limited to, text detection, text recognition, document layout analysis, named entity recognition, and an optional text translation module. Chaining these stages inevitably causes cumulative error propagation and raises deployment and maintenance overhead. -> End-to-End
While leading general-purpose VLMs (e.g., Gemini, Qwen-VL) deliver superior OCR performance, they often incur excessive computational overhead and high latency because of their massive parameter scales. -> OCR-specific, lightweight (1B)
Unified multi-task modeling, covering text spotting, document parsing, information extraction, visual question answering, and text image translation.