Summary: VLMs
Contents
VLM Tasks
- Image Captioning: generate a natural-language description for a given image.
- General Visual Question Answering: answer questions based on the visual content of a given image.
- Text-oriented Visual Question Answering: Text-VQA is a specialized sub-task of VQA where answering a question critically depends on reading and comprehending text that appears in the image.
- Multilingual Text Recognition and Understanding
- Referring Expression Comprehension
- Visual Grounding
- Mathematical Reasoning
- Video Understanding
- Visual Agent
- Function Calling
- UI Operations/Games/Robotics/Navigation
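Grounding-style tasks (referring expression comprehension, visual grounding) require the model to emit box coordinates as plain text. Qwen-VL, for example, encodes a box as `<box>(x1,y1),(x2,y2)</box>` with coordinates normalized to a 0–999 grid. A minimal parser sketch for that output format (the function name is our own):

```python
import re

# Matches Qwen-VL-style box tokens: <box>(x1,y1),(x2,y2)</box>
_BOX = re.compile(r"<box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>")

def parse_boxes(text, img_w, img_h):
    """Extract boxes and rescale from the 0-999 normalized grid to pixels."""
    boxes = []
    for x1, y1, x2, y2 in _BOX.findall(text):
        boxes.append((
            int(x1) * img_w / 1000, int(y1) * img_h / 1000,
            int(x2) * img_w / 1000, int(y2) * img_h / 1000,
        ))
    return boxes

print(parse_boxes("<ref>the dog</ref><box>(100,200),(500,800)</box>", 1000, 500))
# → [(100.0, 100.0, 500.0, 400.0)]
```

The same parsing idea applies to UI-agent tasks, where click targets are emitted as coordinates in the generated text.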
VLM Summary
| Model | Year | Vision Encoder | Adapter | LLM | Training Recipe | Data Recipe |
|---|---|---|---|---|---|---|
| BLIP | 2022.01 | - | - | - | - | - |
| BLIP-2 | 2023.01 | - | Q-Former | - | - | - |
| LLaVA | 2023.04 | CLIP ViT-L/14 | Linear | Vicuna | Pre-training + Fine-tuning | Image-text pairs |
| Qwen-VL | 2023.08 | ViT-bigG | Cross-attention | Qwen | Pre-training + SFT | Image-text pairs |
| Qwen2-VL | 2024.09 | ViT | MLP | Qwen2 | Pre-training + Post-training | 1.2T tokens |
| Qwen2.5-VL | 2025.02 | ViT | MLP | Qwen2.5 | Pre-training + Post-training | 4T tokens |
| Qwen3-VL | 2025.10 | SigLIP 2 | MLP | Qwen3 | Pre-training + Post-training | - |
| HunyuanOCR | 2025.11 | SigLIP 2 | Conv2d + MLP | Hunyuan | Multi-stage + RL | 200M image-text pairs |
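Most rows in the table share the same LLaVA-style layout: a vision encoder turns the image into patch embeddings, a small adapter (linear or MLP) projects them into the LLM's embedding space, and the projected visual tokens are prepended to the text tokens. A shape-only NumPy sketch of that data flow (all dimensions are illustrative, not taken from any specific model):

```python
import numpy as np

rng = np.random.default_rng(0)

def vision_encoder(image, n_patches=256, d_vis=1024):
    """Stand-in for a ViT: one embedding vector per image patch."""
    return rng.normal(size=(n_patches, d_vis))

def mlp_adapter(patches, d_llm=4096):
    """Two-layer MLP projecting vision features to the LLM hidden size."""
    d_vis = patches.shape[1]
    w1 = rng.normal(size=(d_vis, d_llm))
    w2 = rng.normal(size=(d_llm, d_llm))
    h = np.maximum(patches @ w1, 0.0)  # ReLU stand-in for the usual GELU
    return h @ w2

image = None                                # placeholder for pixel values
text_embeds = rng.normal(size=(12, 4096))   # 12 embedded prompt tokens
vis_tokens = mlp_adapter(vision_encoder(image))
llm_input = np.concatenate([vis_tokens, text_embeds], axis=0)
print(llm_input.shape)  # → (268, 4096)
```

The adapter column in the table is exactly this projection step; Q-Former (BLIP-2) and cross-attention (Qwen-VL) replace the MLP with modules that also compress the number of visual tokens.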