Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking 01-08
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models 11-27
TransDiffuser: End-to-end Trajectory Generation with Decorrelated Multi-modal Representation for Autonomous Driving 05-20
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond 08-25