Paper Survey · The Irreplaceability of Visual Information

Source: 3-Paper_Survey/pre-survey/2026-06-11-vision-irreplaceability-thread/

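Reference table of ~44 papers supporting the paper's thesis ("Beyond Textualization"): from CLIP → Flamingo → … → RRVF / Visual-ERM, arguing which kinds of visual information cannot be replaced by text. The papers are regrouped into 7 categories of "irreplaceable information", which is the fundamental reason a verifier must rely on multimodal evidence.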

Grouped by "irreplaceable visual information"

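Irreplaceable information	Representative papers	Notes
Spatial / geometric / 3D	VSR · SpatialEval · SpatialVLM · SpatialRGPT · VoT · MVoT	Text can describe local relations, but multi-object, multi-step, and continuous geometric state is very easily lost
Document layout	LayoutLM · DocVQA · Donut · Nougat · KOSMOS-2.5	OCR text loses coordinates, hierarchy, reading order, and styling
Chart visual encoding	PlotQA · ChartQA · ChartBench · ChartMimic	Values, trends, and legend / color / axis bindings come from the visual encoding
GUI / action state	VisualWebArena · SeeAct · ScreenSpot-Pro · OSWorld	The current screen, element visibility, click targets, and interaction state depend on vision
Embodied affordance	SayCan · PaLM-E · VIMA · RT-2	"What should be done" must be constrained by "what the current world allows"
Rendered output	Design2Code · RRVF · ReLook · Visual-ERM	Code / document source text is not the final visual deliverable; errors often appear only after rendering
Intermediate visual thinking	VoT · MVoT · Latent Sketchpad · GRIT · Visual Planning	Visuals can serve as a reasoning trace / scratchpad / latent state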
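
The "Document layout" row above says that OCR text loses coordinates and reading order. A minimal Python sketch of that claim follows. Everything in it (the `Word` type, `flatten`, `header_above`, and the toy coordinates) is a hypothetical illustration, not taken from any of the listed papers. It builds two page layouts that differ only in which column a value sits under, and shows that a plain-text dump cannot tell them apart.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    x: int  # left edge, arbitrary page units (assumed for illustration)
    y: int  # top edge, arbitrary page units (assumed for illustration)


def flatten(words):
    """Mimic a plain-text OCR dump: read top-to-bottom, left-to-right,
    then discard all coordinates."""
    ordered = sorted(words, key=lambda w: (w.y, w.x))
    return " ".join(w.text for w in ordered)


def header_above(cell, headers):
    """Recover the column binding from geometry: the header whose x
    position is closest to the cell's x position."""
    return min(headers, key=lambda h: abs(h.x - cell.x)).text


HEADERS = [Word("Q1", 100, 0), Word("Q2", 200, 0)]
LABEL = Word("Revenue", 0, 20)
CELL_A = Word("5", 100, 20)  # layout A: the value sits under Q1
CELL_B = Word("5", 200, 20)  # layout B: the value sits under Q2

layout_a = HEADERS + [LABEL, CELL_A]
layout_b = HEADERS + [LABEL, CELL_B]

if __name__ == "__main__":
    print(flatten(layout_a))  # same string for both layouts
    print(flatten(layout_b))
    print(header_above(CELL_A, HEADERS), header_above(CELL_B, HEADERS))
```

Both layouts flatten to the same string, while the geometric lookup still separates them. A text-only verifier that sees just the flattened string has no way to decide whether the 5 belongs to Q1 or Q2.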

Full paper table (~44 papers, by category)

Category	Paper	Venue / year	Key point
Foundation VLM	CLIP	ICML 2021	Large-scale image-text alignment learns transferable visual representations
Foundation VLM	Flamingo	NeurIPS 2022	Few-shot VLM over interleaved image / video / text
Foundation VLM	BLIP-2	ICML 2023	Q-Former bridges a frozen image encoder and a frozen LLM
Foundation VLM	LLaVA	NeurIPS 2023	Uses GPT-4 to generate multimodal instruction data
MSR foundation	KOSMOS-1	NeurIPS 2023	Perception is not an appendage of language; includes Raven IQ nonverbal reasoning
MSR foundation	KOSMOS-2.5	arXiv 2023	Generates spatially-aware text blocks and markdown structure
MSR agent	TaskMatrix.AI	2024	Foundation model connects to APIs / tools to complete digital and physical tasks
MSR	Visualization-of-Thought (VoT)	NeurIPS 2024	Visualizes intermediate spatial states, improving grid navigation / tiling
MSR	MVoT	ICML 2025	MLLM generates text-image interleaved reasoning traces
MSR	11Plus-Bench	arXiv 2025	Cognitively inspired spatial-ability questions for evaluating MLLMs
MSR	Latent Sketchpad	arXiv 2025	Visual latents act as an internal sketchpad that can be decoded into sketches
Think w/ image	Thinking with Images (Survey)	2025	External tools → programmatic manipulation → intrinsic imagination
Think w/ image	Image-of-Thought	arXiv 2024	Step-by-step extraction of visual rationales
Think w/ image	GRIT	arXiv 2025	Reasoning chains interleave text and bboxes; grounded reasoning trained with RL
Think w/ image	Visual Planning	arXiv 2025	Plans with image sequences, reducing text mediation
Spatial	Visual Spatial Reasoning (VSR)	TACL 2023	10k+ pairs, 66 types of spatial relations
Spatial	SpatialEval	NeurIPS 2024	"Adding an image" does not equal "using vision"; VLMs do not necessarily outperform LLMs
Spatial	SpatialVLM	CVPR 2024	Large-scale 3D spatial VQA improves quantitative / qualitative spatial reasoning
Spatial	SpatialRGPT	NeurIPS 2024	Region-level 3D spatial reasoning; depth plugin
Spatial	Scaling and Beyond (Position)	2025	Calls for multimodal reasoning traces and spatial agentic workflows
Document	LayoutLM	KDD 2020	Joint text + layout + image pretraining
Document	DocVQA	WACV 2021	OCR text alone is insufficient to answer document questions
Document	Donut	ECCV 2022	OCR-free document understanding
Document	Nougat	arXiv 2023	Scientific PDF image → markup; formula / table loss
Chart	PlotQA	WACV 2020	Real-world plots with many OOV / real-valued answers
Chart	ChartQA	ACL Findings 2022	Chart visual + logical reasoning
Chart	ChartBench	arXiv 2023	Reliability of deriving data from legend / color / coordinates
Chart/code	ChartMimic	ICLR 2025	Chart image + instruction → rendering code
Multi-discipline	MathVista	ICLR 2024	Mathematical reasoning in visual contexts
Multi-discipline	MMMU	CVPR 2024	College-level multimodal questions across 30 image types
Multi-discipline	MMMU-Pro	ACL 2025	Filters out questions answerable from text alone and adds a vision-only setting
Math vision	MATH-Vision	NeurIPS 2024	Real competition math problems with visual content; a clear human-model gap remains
GUI agent	VisualWebArena	ACL 2024	910 realistic visually grounded web tasks
GUI agent	SeeAct	ICML 2024	GPT-4V web agent; grounding is the bottleneck
GUI grounding	ScreenSpot-Pro	arXiv 2025	High-resolution professional GUI grounding; best model reaches only 18.9%
Computer use	OSWorld	NeurIPS 2024	Real OS / app tasks; human 72.36% vs best model 12.24%
Robotics	SayCan	CoRL 2022	LLM semantic plan + affordance value
Robotics	PaLM-E	ICML 2023	Embodied multimodal LM that ingests vision and continuous state
Robotics	VIMA	ICML 2023	Robot manipulation expressed as multimodal prompts
Robotics	RT-2	CoRL 2023	Vision-language-action; web knowledge → robot control
Artifact/code	Design2Code	NAACL 2025	Webpage screenshot → code; layout / visual recall is the weak point
Artifact/RL	RRVF	arXiv 2025	Reasoning-rendering-visual-feedback; only raw images
Artifact/RL	ReLook	arXiv 2025	Web code generate-diagnose-refine; MLLM critic
Artifact/reward	Visual-ERM	arXiv 2026	Visual equivalence reward model; fine-grained discrepancy feedback
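
The Artifact rows rest on the observation that source text is not the final visual deliverable and that errors often appear only after rendering. The toy Python sketch below illustrates that gap under loud assumptions: the character-grid `render`, the `source_check` and `visual_check` helpers, and the box tuples are all hypothetical stand-ins invented for illustration, and they are not the method of RRVF, ReLook, or Visual-ERM. Two layout specs both pass a text-level sanity check, but only rendering reveals that one of them has overlapping elements.

```python
def render(boxes, width, height):
    """Paint each box onto a character grid using the first letter of its
    name. A cell painted more than once is marked '#'.
    Assumes non-negative integer coordinates."""
    grid = [["." for _ in range(width)] for _ in range(height)]
    for name, x, y, w, h in boxes:
        for row in range(y, min(y + h, height)):
            for col in range(x, min(x + w, width)):
                grid[row][col] = name[0] if grid[row][col] == "." else "#"
    return ["".join(row) for row in grid]


def source_check(boxes):
    """Text-only check: every box has a name and a positive size."""
    return all(name and w > 0 and h > 0 for name, _, _, w, h in boxes)


def visual_check(boxes, width, height):
    """Render-based check: no cell may be painted by two boxes."""
    return not any("#" in row for row in render(boxes, width, height))


# Each box is (name, x, y, width, height) on a hypothetical 8x3 canvas.
good = [("title", 0, 0, 8, 1), ("button", 0, 2, 4, 1)]
bad = [("title", 0, 0, 8, 1), ("button", 2, 0, 4, 1)]  # button lands on the title

if __name__ == "__main__":
    for spec in (good, bad):
        print(source_check(spec), visual_check(spec, 8, 3))
        print("\n".join(render(spec, 8, 3)))
```

Both specs pass `source_check`, and only `visual_check` rejects the second one. A real system would replace the character grid with an actual renderer and the overlap test with a learned or pixel-level comparison; the sketch only shows why the check has to run on the rendered result.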

Source: 3-Paper_Survey/pre-survey/2026-06-11-vision-irreplaceability-thread/papers.md. Related: Visual feedback + RL (7 papers) · Evidence and document index