Summary: VLMs

VLM Tasks

  • Image Captioning: generate a description for a given image.
  • General Visual Question Answering (VQA): answer questions based on the visual content of a given image.
  • Text-oriented Visual Question Answering (Text-VQA): a specialized sub-task of VQA in which answering the question depends on reading and understanding text in the given image.
    • Multilingual Text Recognition and Understanding
  • Referring Expression Comprehension
  • Visual Grounding
  • Mathematical Reasoning
  • Video Understanding
  • Visual Agent
    • Function Calling
    • UI Operations/Games/Robotics/Navigation

VLMs Summary

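| Model      | Year    | Architecture: Vision Encoder | Architecture: Adapter | Architecture: LLM | Training Recipe              | Data Recipe           |
|------------|---------|------------------------------|-----------------------|-------------------|------------------------------|-----------------------|
| BLIP       | 2022.01 | -                            | -                     | -                 | -                            | -                     |
| BLIP-2     | 2023.01 | -                            | Q-Former              | -                 | -                            | -                     |
| LLaVA      | 2023.04 | CLIP ViT-L/14                | Linear                | Vicuna            | Pre-training + Fine-tuning   | Image-text pairs      |
| Qwen-VL    | 2023.08 | ViT-bigG                     | Cross-attention       | Qwen              | Pre-training + SFT           | Image-text pairs      |
| Qwen2-VL   | 2024.09 | ViT                          | MLP                   | Qwen2             | Pre-training + Post-training | 1.2T tokens           |
| Qwen2.5-VL | 2025.02 | ViT                          | MLP                   | Qwen2.5           | Pre-training + Post-training | 4T tokens             |
| Qwen3-VL   | 2025.02 | SigLIP-2                     | MLP                   | Qwen3             | Pre-training + Post-training | -                     |
| HunyuanOCR | 2025.11 | SigLIP-v2                    | Conv2d + MLP          | Hunyuan           | Multi-stage + RL             | 200M image-text pairs |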
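
The three architecture columns (vision encoder, adapter, LLM) describe a common pipeline: the vision encoder turns an image into patch features, the adapter maps those features into the LLM's embedding space, and the LLM consumes the resulting visual tokens alongside text tokens. The toy sketch below illustrates only the MLP-adapter variant at the level of tensor shapes. Everything in it is an assumption made for illustration: the dimensions, the random weights, the ReLU activation, and the choice to place visual tokens before the text tokens. It does not reproduce the implementation of any model listed above, and adapters such as Q-Former or cross-attention work differently.

```python
import random

random.seed(0)


def init_linear(d_in, d_out):
    """Return (weight, bias) for a toy d_in -> d_out linear layer."""
    weight = [[random.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]
    bias = [0.0] * d_out
    return weight, bias


def linear(x, layer):
    """Apply y = W x + b to a single feature vector x."""
    weight, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    """Element-wise ReLU (a stand-in; real adapters may use another activation)."""
    return [max(0.0, v) for v in x]


def mlp_adapter(patch_features, fc1, fc2):
    """Project each patch feature from vision space into the LLM embedding space."""
    return [linear(relu(linear(p, fc1)), fc2) for p in patch_features]


# Hypothetical toy sizes, far smaller than any real model.
D_VISION, D_HIDDEN, D_LLM = 8, 16, 12
NUM_PATCHES, NUM_TEXT_TOKENS = 4, 3

# Stand-ins for the vision encoder output and the LLM's text token embeddings.
patch_features = [[random.gauss(0, 1) for _ in range(D_VISION)] for _ in range(NUM_PATCHES)]
text_embeddings = [[random.gauss(0, 1) for _ in range(D_LLM)] for _ in range(NUM_TEXT_TOKENS)]

fc1 = init_linear(D_VISION, D_HIDDEN)
fc2 = init_linear(D_HIDDEN, D_LLM)

# Adapter output has one visual token per patch, each of width D_LLM.
visual_tokens = mlp_adapter(patch_features, fc1, fc2)

# Assumed ordering: visual tokens first, then text tokens.
llm_input = visual_tokens + text_embeddings

print(len(llm_input), len(llm_input[0]))  # 7 12
```

The point of the sketch is that the adapter only has to reconcile widths: after projection, visual tokens and text embeddings share the same dimension, so the LLM can process them as one sequence.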