Contents

LLaVA: Visual Instruction Tuning

NeurIPS 2023 (Oral) · Microsoft Research · arXiv 2304.08485 · haotian-liu/LLaVA · llava-vl.github.io

Motivation

Contribution

Method

Architecture

Figure: LLaVA network architecture (/posts/vlms/llava/images/1764216846718.webp)

Large Language Model: Vicuna

Vision Encoder: the pre-trained CLIP visual encoder ViT-L/14

Adapter: while a simple linear layer is employed here, more sophisticated alternatives, such as gated cross-attention in Flamingo or the Q-Former in BLIP-2, could be substituted. The adapter plays two roles (a minimal sketch of the full pipeline follows this list):

  • cross-modal alignment: map visual features into the LLM's word embedding space
  • visual feature compression: optionally reduce the number of visual tokens (the linear layer used here does not compress; the Q-Former does)
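
To make the wiring concrete, here is a minimal PyTorch sketch of the pipeline. It is not the official implementation: class, argument, and method names are illustrative, the LLM is assumed to expose a HuggingFace-style interface, and the hidden sizes assume CLIP ViT-L/14 (1024) and Vicuna-7B (4096).

```python
import torch
import torch.nn as nn

class LlavaSketch(nn.Module):
    """Hypothetical sketch: CLIP patch features -> linear projection -> LLM."""

    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder      # frozen CLIP ViT-L/14 (assumed to return patch features)
        self.llm = llm                            # Vicuna-style decoder (assumed HF-style interface)
        self.projector = nn.Linear(vision_dim, llm_dim)  # the linear adapter W

    def forward(self, pixel_values, input_ids):
        # Vision encoder is kept frozen throughout training.
        with torch.no_grad():
            patch_feats = self.vision_encoder(pixel_values)   # (B, N_patches, vision_dim)
        visual_tokens = self.projector(patch_feats)           # (B, N_patches, llm_dim)
        # Prepend projected visual tokens to the text token embeddings.
        text_embeds = self.llm.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```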

Training Recipe

Pre-training for Feature Alignment: only the linear projection layer is trained, with both the CLIP encoder and the LLM frozen, on image-caption pairs (a filtered ~595K subset of CC3M) converted into a simple instruction-following format.

Fine-tuning End-to-End: the vision encoder stays frozen while the projection layer and the LLM are both updated on the 158K GPT-generated visual instruction data (and on ScienceQA for the science-QA setting).
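
The freezing pattern of the two stages can be summarized with a short sketch, continuing the hypothetical LlavaSketch module above. This is an assumption about structure, not the repository's actual training code.

```python
def set_stage(model, stage):
    """Configure which parameters are trainable for stage 1 or stage 2."""
    for p in model.vision_encoder.parameters():
        p.requires_grad = False            # CLIP encoder frozen in both stages
    for p in model.projector.parameters():
        p.requires_grad = True             # projection layer trained in both stages
    for p in model.llm.parameters():
        p.requires_grad = (stage == 2)     # LLM updated only during end-to-end fine-tuning

# Usage (assuming `model` is an instance like LlavaSketch above):
#   set_stage(model, stage=1)  # feature-alignment pre-training on image-caption pairs
#   set_stage(model, stage=2)  # end-to-end fine-tuning on the 158K instruction data
```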

Data Recipe

GPT-assisted Visual Instruction Data Generation: text-only GPT-4 is prompted with a symbolic representation of each COCO image (its captions and object bounding boxes) to generate three types of instruction-following samples: multi-turn conversations, detailed descriptions, and complex reasoning, for a total of 158K samples.
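
A hypothetical sketch of this step is shown below: the image itself is never shown to the model; only its captions and bounding boxes are placed in the prompt of a text-only GPT-4 call. Function names and prompt wording are illustrative, not taken from the official pipeline.

```python
def build_symbolic_context(captions, boxes):
    """Render captions and (label, coords) boxes as a text-only image description."""
    caption_block = "\n".join(captions)
    box_block = "\n".join(f"{label}: {coords}" for label, coords in boxes)
    return f"Captions:\n{caption_block}\n\nObjects (normalized boxes):\n{box_block}"

def make_generation_prompt(captions, boxes, response_type):
    # response_type is one of: "conversation", "detailed_description", "complex_reasoning"
    context = build_symbolic_context(captions, boxes)
    instruction = {
        "conversation": "Design a multi-turn Q&A between a user and an assistant about this image.",
        "detailed_description": "Describe the image in detail.",
        "complex_reasoning": "Ask and answer a question that requires reasoning about the scene.",
    }[response_type]
    return f"{context}\n\n{instruction}"
```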

Experiment

Reference

Question

Reading notes from ReadPaper (NeurIPS 2023, Oral)

Motivations

Contributions

Method

Task Data: GPT-assisted Visual Instruction Data Generation
  • Multi-modal data
    1. Image-text pairs: CC, LAION

Input/Output
  • Multi-Modal Tokenizer
    • Image Tokenizer
      • Encoder: the pre-trained CLIP visual encoder ViT-L/14
      • Projector: connects image features into the word embedding space.
        • (this paper) a simple linear layer
        • gated cross-attention in Flamingo
        • Q-Former in BLIP-2
    • Text Tokenizer
  • LLM Decoder (Vicuna)

Training
  • Pre-training for Feature Alignment
  • Fine-tuning End-to-End