<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title>OCR - Tag - Naifan Li's Blog</title><link>https://blog.omagiclee.com/tags/ocr/</link><description>OCR - Tag - Naifan Li's Blog</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Sun, 16 Nov 2025 14:27:22 +0800</lastBuildDate><atom:link href="https://blog.omagiclee.com/tags/ocr/" rel="self" type="application/rss+xml"/><item><title>HunyuanOCR Technical Report</title><link>https://blog.omagiclee.com/posts/vlms/hunyuan-ocr/</link><pubDate>Sun, 16 Nov 2025 14:27:22 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/vlms/hunyuan-ocr/</guid><description><![CDATA[<p><i class="fas fa-building fa-fw" aria-hidden="true"></i><span style="color:gray">Tencent Hunyuan Vision Team</span>
<i class="fas fa-file-pdf fa-fw" aria-hidden="true"></i><a href="https://arxiv.org/abs/2511.19575" target="_blank" rel="noopener noreffer ">arXiv 2511.19575</a>
<i class="fab fa-github fa-fw" aria-hidden="true"></i><a href="https://github.com/Tencent-Hunyuan/HunyuanOCR" target="_blank" rel="noopener noreffer ">Tencent-Hunyuan/HunyuanOCR</a>
<a href="https://huggingface.co/tencent/HunyuanOCR" target="_blank" rel="noopener noreffer ">tencent/HunyuanOCR</a></p>
<h2 id="motivation">Motivation</h2>
<ul>
<li>Traditional OCR systems rely on a modular pipeline architecture that typically includes, but is not limited to, text detection, text recognition, document layout analysis, named entity recognition, and an optional text translation module. Such pipelines inevitably suffer from cumulative error propagation and raise deployment and maintenance overhead. -&gt; <strong>End-to-End</strong></li>
<li>While leading general VLMs (e.g., Gemini, Qwen-VL) deliver superior OCR performance, they often incur excessive computational overhead and high latency due to their massive parameter scales. -&gt; <strong>OCR-specific, lightweight (1B)</strong></li>
<li><strong>Unified multi-task modeling, covering text spotting, document parsing, information extraction, visual question answering, and text image translation.</strong></li>
</ul>
<h2 id="method">Method</h2>
<h3 id="model-architecture">Model Architecture</h3>
<p></p>]]></description></item></channel></rss>