Core Claim and Loop

Source: 1-Docs/20260622-multimodal-verification-scope.md

Core Claim

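For many coding-agent tasks, the success criterion lives not only in the code text, the compilation result, or the unit tests, but in the final visible multimodal artifact: whether the page matches the target screenshot, whether UI components sit in the right positions, whether a chart expresses the correct data relationships, whether text is occluded or truncated, whether the visual requirements in the prompt actually appear in the rendered result, and whether interaction, responsive, and animation states behave as expected. A complete agentic coding loop should therefore not stop at instruction → code → execution → error/debug → revision; it should explicitly include a verification step.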

The loop with multimodal verification made explicit

Instruction (instruction / visual target) → Generate (write code) → Render (render / execute / observe) → Verify (multimodal verification) → Feedback (actionable feedback) → Revise (fix code) ↻

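The verification step does not replace code execution. It covers what execution feedback cannot reach: visual and spatial problems, semantic-visual consistency, and whether the user's intent is actually satisfied.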
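The loop above can be sketched as a small driver function. This is a minimal illustration, not an implementation from the source document: `Verdict`, `run_loop`, and the `toy_*` stand-ins are hypothetical names, and a real `render` would return a screenshot rather than encoded HTML.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    passed: bool
    feedback: str  # actionable hint handed to the next Generate call


def run_loop(
    instruction: str,
    generate: Callable[[str, Optional[str]], str],  # (instruction, feedback) -> code
    render: Callable[[str], bytes],                 # code -> rendered observation
    verify: Callable[[str, bytes], Verdict],        # (instruction, observation) -> verdict
    max_rounds: int = 3,
) -> Tuple[str, List[Verdict]]:
    """Instruction -> Generate -> Render -> Verify -> Feedback -> Revise, repeated."""
    feedback: Optional[str] = None
    history: List[Verdict] = []
    code = ""
    for _ in range(max_rounds):
        code = generate(instruction, feedback)      # Generate (or Revise on later rounds)
        observation = render(code)                  # Render
        verdict = verify(instruction, observation)  # Verify
        history.append(verdict)
        if verdict.passed:
            break
        feedback = verdict.feedback                 # Feedback for the next round
    return code, history


# Toy stand-ins (assumptions) so the sketch runs without a browser or a model.
def toy_generate(instruction: str, feedback: Optional[str]) -> str:
    return "<h1>Sales</h1>" if feedback is None else "<h1>Sales 2024</h1>"


def toy_render(code: str) -> bytes:
    return code.encode("utf-8")  # a real renderer would produce a screenshot


def toy_verify(instruction: str, observation: bytes) -> Verdict:
    ok = b"2024" in observation
    return Verdict(ok, "" if ok else "Title is missing the year 2024.")


final_code, history = run_loop(
    "Title must read 'Sales 2024'", toy_generate, toy_render, toy_verify
)
```

The verifier sees only the instruction and the rendered observation, not the generator's reasoning, which is one simple way to keep the two roles separate.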

Paper Position & Contributions

The most fitting positioning is: a position / vision paper arguing that multimodal verification is a missing loop-level capability for coding agents, supported by a task taxonomy and small-scale prototype experiments.

Contribution Form

Problem framing: define verification as a loop-level capability rather than an image-to-code trick
Scope & taxonomy: which tasks genuinely need multimodal verification
Verifier abstraction: unbiased, implementation agnostic, replaceable
Prototype evidence + research agenda

Verifier Design Principles

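Unbiased (a goal, not an established fact): avoid reusing the generator's judgment path
Implementation agnostic: look at the outcome and the visible artifact; do not mandate a particular implementation
Multimodal & evidence-aware: combine instruction, reference, render, DOM, logs, and OCR
Actionable feedback: pass/fail, failed checks, evidence, location, fix hints, uncertainty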
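The fields listed under actionable feedback map naturally onto a small structured record. The sketch below is one possible shape, not a schema from the source: the class names, the pixel-box convention for `location`, and the 0-to-1 `uncertainty` scale are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class FailedCheck:
    name: str                                             # which check failed
    evidence: str                                         # what the verifier observed
    location: Optional[Tuple[int, int, int, int]] = None  # x, y, width, height in pixels
    fix_hint: str = ""                                    # suggested change


@dataclass
class VerificationFeedback:
    passed: bool
    failed_checks: List[FailedCheck] = field(default_factory=list)
    uncertainty: float = 0.0  # hypothetical scale: 0.0 = confident, 1.0 = unsure

    def to_prompt(self) -> str:
        """Flatten the feedback into text that a generator can consume."""
        if self.passed:
            return "PASS"
        lines = [f"FAIL (uncertainty={self.uncertainty:.2f})"]
        for check in self.failed_checks:
            where = f" at {check.location}" if check.location else ""
            lines.append(
                f"- {check.name}{where}: {check.evidence}. Hint: {check.fix_hint}"
            )
        return "\n".join(lines)


example = VerificationFeedback(
    passed=False,
    failed_checks=[
        FailedCheck(
            name="text-truncation",
            evidence="button label is cut off",
            location=(40, 120, 96, 32),
            fix_hint="widen the button",
        )
    ],
    uncertainty=0.2,
)
print(example.to_prompt())
```

Keeping evidence and location separate from the fix hint lets the generator (or a human) check the verifier's claim before acting on its suggestion.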

Differences from RRVF

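RRVF: uses visual feedback + RL to improve image-to-code (method / training framework)
This paper: verification as a general loop-level capability (position / vision paper)
image-to-code is a representative instance and the prototype starting point, not the whole scope
Output focus: problem definition, taxonomy, prototype observations, research agenda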

Implementation Options (replaceable backend)

Screenshot discussion: multimodal unit tests / sequential tool use / frontier MLLM / subagent

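Option	Approach	Strengths	Risks / Costs
A. Multimodal unit tests	Break the requirement into independently verifiable multimodal tests (OCR/text fidelity, layout/spatial, visual reference match, component completeness, chart semantics, readability, interaction state)	Parallelizable, structured, easy to ablate, easy to swap backends	Requirements cannot always be decomposed into tests; passing individual tests ≠ a good overall result; risk of over-discretization
B. Sequential tool use / thinking with images	The verifier acts as a visual inspection agent: view the full image → crop suspicious regions → compare against the reference → local VQA → consult DOM/code when needed → summarize	Suits complex pages, local errors, and multi-region comparison; yields a richer reasoning trace	Slower, more expensive, harder to reproduce reliably; sequential tool calls introduce intermediate errors
C. Frontier MLLM as general verifier	Most doable for a first version: use a frontier MLLM as the most general verifier tool, handling NL + screenshot + reference + code + local crops	Widest coverage, no dedicated verifier to train, quick way to observe before/after changes	Hallucinates, over-trusts the prompt, misses defects, gives unstable judgments
D. Subagent verification with image	A verifier subagent receives instruction + images + optional code/context and returns structured feedback; several can run in parallel	Natural to engineer, no need to restructure the coding agent, separates generator and verifier roles	A subagent is not automatically unbiased (depends on the model, prompt, context isolation, and evidence inputs)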

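Strategy: for the first version, prefer C or D and keep the backend replaceable; later, gradually add or swap in V-DIFF / learned visual diff, OCR, DOM/CSS structure checks, a layout analyzer, a chart evaluator, deterministic screenshot diff, or smaller local multimodal models.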
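One way to keep the backend replaceable is to have every check satisfy the same small interface, so an MLLM call, an OCR check, or a DOM check can be swapped without touching the loop. The sketch below is an assumption-laden illustration: `Evidence`, `VerifierCheck`, and both check classes are hypothetical, and the OCR text and DOM ids are supplied by hand instead of coming from a real renderer.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


@dataclass
class Evidence:
    instruction: str
    ocr_text: str       # text extracted from the rendered screenshot
    dom_ids: List[str]  # element ids found in the rendered DOM


class VerifierCheck(Protocol):
    """Anything with a name and a check() method can act as a backend."""

    name: str

    def check(self, evidence: Evidence) -> bool: ...


class RequiredTextCheck:
    """Text-fidelity test: the required string must appear in the OCR output."""

    def __init__(self, required: str) -> None:
        self.name = f"text:{required}"
        self.required = required

    def check(self, evidence: Evidence) -> bool:
        return self.required.lower() in evidence.ocr_text.lower()


class ComponentPresentCheck:
    """Component-completeness test: the element id must exist in the DOM."""

    def __init__(self, element_id: str) -> None:
        self.name = f"component:{element_id}"
        self.element_id = element_id

    def check(self, evidence: Evidence) -> bool:
        return self.element_id in evidence.dom_ids


def run_checks(evidence: Evidence, checks: List[VerifierCheck]) -> Dict[str, bool]:
    """Run each check independently; the checks share no state, so they could run in parallel."""
    return {c.name: c.check(evidence) for c in checks}


evidence = Evidence(
    instruction="Dashboard with a revenue chart and a sidebar",
    ocr_text="Total Revenue 2024",
    dom_ids=["chart", "legend"],
)
results = run_checks(
    evidence, [RequiredTextCheck("revenue"), ComponentPresentCheck("sidebar")]
)
print(results)
```

An MLLM-backed check (option C) or a subagent (option D) would implement the same two members, which is what makes later substitution with OCR, DOM, or diff-based checks cheap.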

Scope Boundaries

In Scope

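Verification of artifacts after a coding agent generates or modifies code
Renderable / observable code artifacts (HTML/CSS, React UI, chart code, SVG, LaTeX/Typst, dashboard, diagram-as-code, slides/report)
Requirements that include a visual target, screenshot, design mockup, spatial relationship, or visual constraint
Verification happens inside the loop and can guide the next revision
The verifier may use images, screenshots, OCR, DOM, code, logs, visual diffs, NL task descriptions, and other kinds of evidence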

Out of Scope

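Studying only how a model generates images directly
Doing only general VQA with no connection to a code artifact or coding loop
Covering only traditional unit tests and compiler/runtime errors
Doing only pixel-level screenshot diff without considering instruction following, spatial relationships, semantics, and actionable feedback
Only reproducing a single image-to-code method without discussing the general coding loop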

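Important constraint: VQA can be part of this paper, but it is better defined as one form of verification (VQA-style multimodal unit tests), always bound to the rendered observation of a generated code artifact, so that the paper does not drift away from its code-related scope.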
