| Foundation VLM | CLIP | ICML 2021 | learns transferable visual representations from large-scale image-text alignment |
| Foundation VLM | Flamingo | NeurIPS 2022 | few-shot VLM over interleaved image/video/text inputs |
| Foundation VLM | BLIP-2 | ICML 2023 | Q-Former bridges a frozen image encoder and a frozen LLM |
| Foundation VLM | LLaVA | NeurIPS 2023 | uses GPT-4 to generate multimodal instruction data |
| MSR foundation | KOSMOS-1 | NeurIPS 2023 | perception is not an add-on to language; includes Raven IQ nonverbal reasoning |
| MSR foundation | KOSMOS-2.5 | arXiv 2023 | generates spatially-aware text blocks and markdown structure |
| MSR agent | TaskMatrix.AI | 2024 | connects foundation models to APIs/tools to complete digital and physical tasks |
| MSR | Visualization-of-Thought (VoT) | NeurIPS 2024 | visualizes intermediate spatial states to improve grid navigation / tiling |
| MSR | MVoT | ICML 2025 | MLLM generates text-image interleaved reasoning traces |
| MSR | 11Plus-Bench | arXiv 2025 | evaluates MLLMs on cognitively inspired spatial-ability questions |
| MSR | Latent Sketchpad | arXiv 2025 | visual latents serve as an internal sketchpad and can be decoded into sketches |
| Think w/ image | Thinking with Images (Survey) | 2025 | external tools → programmatic manipulation → intrinsic imagination |
| Think w/ image | Image-of-Thought | arXiv 2024 | extracts visual rationales step by step |
| Think w/ image | GRIT | arXiv 2025 | reasoning chains interleave text and bboxes; grounded reasoning is trained with RL |
| Think w/ image | Visual Planning | arXiv 2025 | plans with image sequences, reducing reliance on text as an intermediary |
| Spatial | Visual Spatial Reasoning (VSR) | TACL 2023 | 10k+ pairs covering 66 types of spatial relations |
| Spatial | SpatialEval | NeurIPS 2024 | "adding an image" is not the same as "using vision"; VLMs do not necessarily outperform LLMs |
| Spatial | SpatialVLM | CVPR 2024 | large-scale 3D spatial VQA; improves quantitative / qualitative spatial reasoning |
| Spatial | SpatialRGPT | NeurIPS 2024 | region-level 3D spatial reasoning; depth plugin |
| Spatial | Scaling and Beyond (Position) | 2025 | calls for multimodal reasoning traces and spatial agentic workflows |
| Document | LayoutLM | KDD 2020 | joint text + layout + image pretraining |
| Document | DocVQA | WACV 2021 | OCR text alone is insufficient to answer document questions |
| Document | Donut | ECCV 2022 | OCR-free document understanding |
| Document | Nougat | arXiv 2023 | scientific PDF image → markup; formula / table loss |
| Chart | PlotQA | WACV 2020 | real-world plots with many OOV / real-valued answers |
| Chart | ChartQA | ACL Findings 2022 | chart visual + logical reasoning |
| Chart | ChartBench | arXiv 2023 | reliability of deriving data from legend / color / coordinates |
| Chart/code | ChartMimic | ICLR 2025 | chart image + instruction → rendering code |
| Multi-discipline | MathVista | ICLR 2024 | mathematical reasoning in visual contexts |
| Multi-discipline | MMMU | CVPR 2024 | college-level multimodal questions spanning 30 image types |
| Multi-discipline | MMMU-Pro | ACL 2025 | filters out questions answerable from text alone; adds a vision-only setting |
| Math vision | MATH-Vision | NeurIPS 2024 | real competition math problems with visual contexts; a clear human-model gap remains |
| GUI agent | VisualWebArena | ACL 2024 | 910 realistic, visually grounded web tasks |
| GUI agent | SeeAct | ICML 2024 | GPT-4V web agent; grounding is the bottleneck |
| GUI grounding | ScreenSpot-Pro | arXiv 2025 | high-resolution professional GUI grounding; best model reaches only 18.9% |
| Computer use | OSWorld | NeurIPS 2024 | real OS/app tasks; human 72.36% vs. best model 12.24% |
| Robotics | SayCan | CoRL 2022 | LLM semantic plan + affordance value |
| Robotics | PaLM-E | ICML 2023 | embodied multimodal LM that takes in vision and continuous state |
| Robotics | VIMA | ICML 2023 | robot manipulation expressed as multimodal prompts |
| Robotics | RT-2 | CoRL 2023 | vision-language-action; web knowledge → robot control |
| Artifact/code | Design2Code | NAACL 2025 | webpage screenshot → code; layout / visual recall are weak points |
| Artifact/RL | RRVF | arXiv 2025 | reasoning-rendering-visual-feedback; only raw images |
| Artifact/RL | ReLook | arXiv 2025 | web code generate-diagnose-refine; MLLM critic |
| Artifact/reward | Visual-ERM | arXiv 2026 | visual equivalence reward model with fine-grained discrepancy feedback |