Summary: VLMs
Contents
VLM Tasks
- Image Captioning: generate a natural-language description for a given image.
- General Visual Question Answering: answer questions based on the visual content of a given image.
- Text-oriented Visual Question Answering: Text-VQA is a specialized sub-task of VQA where answering a question critically depends on reading and comprehending text that appears in the image.
- Multilingual Text Recognition and Understanding
- Referring Expression Comprehension
- Visual Grounding
- Mathematical Reasoning
- Video Understanding
- Visual Agent
- Function Calling
- UI Operations/Games/Robotics/Navigation
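Grounding-style tasks (referring expression comprehension, visual grounding) require the model to emit box coordinates as plain text. Qwen-VL, for example, encodes a box as `<box>(x1,y1),(x2,y2)</box>` with coordinates normalized to a 0–999 grid. A minimal parser sketch for that output format (the function name is our own):

```python
import re

# Matches Qwen-VL-style box tokens: <box>(x1,y1),(x2,y2)</box>
_BOX = re.compile(r"<box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>")

def parse_boxes(text, img_w, img_h):
    """Extract boxes and rescale from the 0-999 normalized grid to pixels."""
    boxes = []
    for x1, y1, x2, y2 in _BOX.findall(text):
        boxes.append((
            int(x1) * img_w / 1000, int(y1) * img_h / 1000,
            int(x2) * img_w / 1000, int(y2) * img_h / 1000,
        ))
    return boxes

print(parse_boxes("<ref>the dog</ref><box>(100,200),(500,800)</box>", 1000, 500))
# → [(100.0, 100.0, 500.0, 400.0)]
```

The same parsing idea applies to UI-agent tasks, where click targets are emitted as coordinates in the generated text.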
VLM Summary
| Model | Year | Vision Encoder | Adapter | LLM | Training Recipe | Data Recipe |
|---|---|---|---|---|---|---|
| BLIP | 2022.01 | - | - | - | - | - |
| BLIP-2 | 2023.01 | - | Q-Former | - | - | - |
| LLaVA | 2023.04 | CLIP ViT-L/14 | Linear | Vicuna | Pre-training + Fine-tuning | Image-text pairs |
| Qwen-VL | 2023.08 | ViT-bigG | Cross-attention | Qwen | Pre-training + SFT | Image-text pairs |
| Qwen2-VL | 2024.09 | ViT | MLP | Qwen2 | Pre-training + Post-training | 1.2T tokens |
| Qwen2.5-VL | 2025.02 | ViT | MLP | Qwen2.5 | Pre-training + Post-training | 4T tokens |
| Qwen3-VL | 2025.10 | SigLIP 2 | MLP | Qwen3 | Pre-training + Post-training | - |
| HunyuanOCR | 2025.11 | SigLIP 2 | Conv2d + MLP | Hunyuan | Multi-stage + RL | 200M image-text pairs |
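Most rows in the table share the same LLaVA-style layout: a vision encoder turns the image into patch embeddings, a small adapter (linear or MLP) projects them into the LLM's embedding space, and the projected visual tokens are prepended to the text tokens. A shape-only NumPy sketch of that data flow (all dimensions are illustrative, not taken from any specific model):

```python
import numpy as np

rng = np.random.default_rng(0)

def vision_encoder(image, n_patches=256, d_vis=1024):
    """Stand-in for a ViT: one embedding vector per image patch."""
    return rng.normal(size=(n_patches, d_vis))

def mlp_adapter(patches, d_llm=4096):
    """Two-layer MLP projecting vision features to the LLM hidden size."""
    d_vis = patches.shape[1]
    w1 = rng.normal(size=(d_vis, d_llm))
    w2 = rng.normal(size=(d_llm, d_llm))
    h = np.maximum(patches @ w1, 0.0)  # ReLU stand-in for the usual GELU
    return h @ w2

image = None                                # placeholder for pixel values
text_embeds = rng.normal(size=(12, 4096))   # 12 embedded prompt tokens
vis_tokens = mlp_adapter(vision_encoder(image))
llm_input = np.concatenate([vis_tokens, text_embeds], axis=0)
print(llm_input.shape)  # → (268, 4096)
```

The adapter column in the table is exactly this projection step; Q-Former (BLIP-2) and cross-attention (Qwen-VL) replace the MLP with modules that also compress the number of visual tokens.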