Multimodal Verification Benchmark Map

initialized workstream 2

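Goal

0-Project/20260622-multimodal-verification-benchmark-map/

Map the tasks and benchmarks that multimodal verification for the agentic coding loop could cover, and answer two questions. If verification is a larger problem than image-to-code, which code-related tasks does it actually cover? Can existing benchmarks support our study of this problem? This subproject is neither just a paper survey nor just a dataset list; it should produce a task / benchmark map. So far it has converged on 7 scenario families with P0/P1/P2 priorities.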

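Two boundaries that must hold

Every task must relate to code generation, code revision, or a rendered artifact.
VQA, OCR, pixel diff, and spatial reasoning are better treated as verifier capabilities; the main line should not drift into general visual understanding.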

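Converged conclusions

7 scenario families (see Verification Scenario Families)
P0: families 1–4; P1/P2: families 5–7
Two kinds of data source: "existing data that can be adapted" vs. "must be built from scratch"
The first version produces 8–12 case cards rather than a large benchmark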

Candidate task families (initial draft)

Family	Example tasks	Verification target	Initial role
Reference-based replication	screenshot-to-HTML, chart image-to-code, formula image-to-LaTeX	Whether the rendered output is close to the reference	core
Prompt-to-visual-code artifact	prompt-to-UI, text-to-dashboard, data-to-chart, prompt-to-SVG	Instruction following, layout, readability, semantic correctness	core / near-core
Visual repair / debugging	HTML/CSS repair, chart refinement, layout overflow fixing	Locate and fix the visual problem based on a screenshot	core
VQA-style verifier checks	screenshot + question about generated artifact	Decompose a requirement into locally decidable questions	verifier capability
OCR / text fidelity	UI labels, chart labels, table fields, report text	Whether the text is correct, complete, and visible	verifier capability
Spatial / layout	left/right, alignment, grouping, overlap, responsive	Whether spatial relations and region structure meet the requirement	verifier capability
Interaction / state	hover, modal, responsive breakpoint, animation frame	Whether multiple states, frames, and screenshots are correct	extension

Benchmark fields (to be recorded for each task)

Task / benchmark · Source · Input modality · Output artifact · Code-relatedness · Reference availability · Current evaluation · Verification target · Multimodal gap · Renderer feasibility · Prototype suitability · Notes / risks.
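The field list above could be kept as one typed record per task, so that incomplete rows are easy to spot. The following is a minimal sketch, not the project's actual schema: the class name `BenchmarkRecord`, the snake_case field names, and the helper `missing_fields` are all hypothetical, and every field is assumed to be free text.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass
class BenchmarkRecord:
    """One row of the task / benchmark map (hypothetical schema)."""

    task: str
    source: str = ""
    input_modality: str = ""
    output_artifact: str = ""
    code_relatedness: str = ""
    reference_availability: str = ""
    current_evaluation: str = ""
    verification_target: str = ""
    multimodal_gap: str = ""
    renderer_feasibility: str = ""
    prototype_suitability: str = ""
    notes_risks: str = ""  # optional free text, never reported as missing


def missing_fields(record: BenchmarkRecord) -> list[str]:
    """Return the names of fields that are still empty, ignoring notes."""
    return [
        f.name
        for f in fields(record)
        if f.name != "notes_risks" and not getattr(record, f.name).strip()
    ]


# A partially filled row. The values come from the family table;
# the remaining fields are left blank on purpose.
row = BenchmarkRecord(
    task="screenshot-to-HTML",
    input_modality="screenshot",
    output_artifact="HTML",
    verification_target="rendered output is close to the reference",
)
print(missing_fields(row))
```

Defaulting every field except `task` to an empty string lets a row be created as soon as a task is named and filled in gradually, which fits a map that is still being assembled.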

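Subproject documents

File	Purpose
`Docs/20260623-task-family-map-synthesis.md`	Consolidates the current discussion: the candidate-conditioned definition, scenario families, P0/P1/P2, and existing vs. to-be-built data
`Docs/20260622-agent-loop-baseline-handoff.md`	Summarizes earlier discussion and hands off how to build the initial agent loop baseline / pipeline, including the verifier YAML schema
`Docs/discussions/responses/`	Raw answers from external AI systems or collaborators on task families / benchmarks / case cards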

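Next steps

Produce the first draft of the case cards (Docs/20260623-case-card-drafts-v0.md): draw 8–12 cards from the 4 P0 scenario families and specify, for each card, its objective oracle / acceptance relation, the evidence channels it requires, and its hard negatives.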
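As a rough illustration of what a case-card draft needs to contain, the sketch below checks a deck against the constraints stated in this note: 8–12 cards, P0 families only (families 1–4), and each card carrying an oracle, at least one evidence channel, and at least one hard negative. The names `CaseCard` and `check_deck` and the field layout are assumptions for illustration; the actual card format is whatever the draft document settles on.

```python
from __future__ import annotations

from dataclasses import dataclass, field

P0_FAMILIES = {1, 2, 3, 4}  # from this note: P0 = families 1-4
MIN_CARDS, MAX_CARDS = 8, 12  # from this note: 8-12 cards in the first version


@dataclass
class CaseCard:
    """Hypothetical case-card shape; not the project's actual schema."""

    card_id: str
    family: int  # scenario family number, 1-7
    oracle: str  # objective oracle / acceptance relation
    evidence_channels: list[str] = field(default_factory=list)
    hard_negatives: list[str] = field(default_factory=list)


def check_deck(cards: list[CaseCard]) -> list[str]:
    """Return human-readable problems; an empty list means the deck passes."""
    problems = []
    if not MIN_CARDS <= len(cards) <= MAX_CARDS:
        problems.append(
            f"expected {MIN_CARDS}-{MAX_CARDS} cards, got {len(cards)}"
        )
    for card in cards:
        if card.family not in P0_FAMILIES:
            problems.append(f"{card.card_id}: family {card.family} is not P0")
        if not card.oracle.strip():
            problems.append(f"{card.card_id}: missing oracle")
        if not card.evidence_channels:
            problems.append(f"{card.card_id}: no evidence channel")
        if not card.hard_negatives:
            problems.append(f"{card.card_id}: no hard negative")
    return problems
```

Returning a list of problems rather than raising on the first failure makes it possible to review a whole draft deck in one pass.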
