RRVF Reproduction

archived · history / method reference

Goal and Current Status

0-Project/.cache/20260621-rrvf-reproduction/

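Reproduce the visual-feedback RL image-to-code workflow of RRVF (Learning Only with Images). The official code is unavailable, so the protocol has to be implemented from scratch. A source audit and a staged reproduction roadmap are complete: L0–L2 are done, L3 is planned, and L4 is blocked by compute limits. The project currently lives under .cache/ and is kept as a history and method reference; its method notes remain a key source for this paper's "differences from RRVF" argument.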

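RRVF Method Highlights

Closed Loop

Reasoning (reason about the image and generate code) → Rendering (render the code to an image) → Visual Feedback (compare the rendering against the source image to obtain a reward). The loop is optimized end to end with GRPO and supports multi-turn self-correction.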
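The loop above can be sketched in a few lines. This is a toy illustration only, not the RRVF implementation: the "image" is a list of integers, the "code" is a comma-separated string, and `toy_render`, `toy_visual_reward`, `toy_policy`, and `run_episode` are hypothetical names introduced here. The GRPO update that would consume these trajectories is omitted.

```python
from typing import List, Optional, Tuple

Image = List[int]  # toy stand-in: an "image" is just a list of pixel values


def toy_render(code: str) -> Optional[Image]:
    """Rendering step: turn 'code' (comma-separated ints) into a toy image."""
    try:
        return [int(tok) for tok in code.split(",")]
    except ValueError:
        return None  # a render failure yields no image


def toy_visual_reward(source: Image, rendered: Optional[Image]) -> float:
    """Visual feedback step: fraction of pixels that match the source."""
    if rendered is None or len(rendered) != len(source):
        return 0.0
    return sum(a == b for a, b in zip(source, rendered)) / len(source)


def toy_policy(source: Image, last_reward: Optional[float]) -> str:
    """Reasoning step (hypothetical): use the previous reward as feedback
    and get one more pixel right than on the previous turn."""
    already_correct = 0 if last_reward is None else round(last_reward * len(source))
    n_correct = min(already_correct + 1, len(source))
    guess = source[:n_correct] + [0] * (len(source) - n_correct)
    return ",".join(str(v) for v in guess)


def run_episode(source: Image, max_turns: int = 4,
                target: float = 1.0) -> List[Tuple[str, float]]:
    """Run reasoning -> rendering -> visual feedback for up to max_turns."""
    history: List[Tuple[str, float]] = []
    last_reward: Optional[float] = None
    for _ in range(max_turns):
        code = toy_policy(source, last_reward)
        rendered = toy_render(code)
        last_reward = toy_visual_reward(source, rendered)
        history.append((code, last_reward))
        if last_reward >= target:
            break  # the rendering matches the source; stop revising
    return history


history = run_episode([3, 1, 4, 1])
for code, reward in history:
    print(f"{code!r} -> reward {reward:.2f}")
```

In a real reproduction the policy would be the vision-language model, the renderer would execute chart or web code, and the reward would come from the judge model; only the control flow carries over from this sketch.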

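Verification Asymmetry

Verifying whether a rendered result matches the source image is far easier than generating code from an image. This asymmetry naturally supplies an RL reward signal, which lets the model learn from raw images alone.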

Key Parameters

Policy: Qwen2.5-VL-7B; judge: 72B; GRPO group size: 8; reward weights: vision 0.2 / format 0.8 / tool 1.0.
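As a rough illustration of how these numbers could fit together, the sketch below combines the three reward components with the listed weights and then computes group-relative advantages in the usual GRPO style (reward minus group mean, divided by group standard deviation). Several points are assumptions rather than facts from these notes: that the components are combined by a weighted sum, that each component lies in [0, 1], and that the population standard deviation is used. The names `total_reward` and `group_advantages` are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List

# Weights and group size as listed in the notes above.
REWARD_WEIGHTS: Dict[str, float] = {"vision": 0.2, "format": 0.8, "tool": 1.0}
GROUP_SIZE = 8


def total_reward(components: Dict[str, float]) -> float:
    """Weighted sum of reward components (assumed combination rule)."""
    return sum(w * components[name] for name, w in REWARD_WEIGHTS.items())


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantage: (r - group mean) / (group std + eps).

    Uses the population standard deviation; eps avoids division by zero
    when every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One hypothetical group of 8 sampled responses for the same source image.
group = [
    {"vision": 0.9, "format": 1.0, "tool": 1.0},
    {"vision": 0.4, "format": 1.0, "tool": 1.0},
    {"vision": 0.0, "format": 0.0, "tool": 0.0},
    {"vision": 0.7, "format": 1.0, "tool": 0.5},
    {"vision": 0.2, "format": 1.0, "tool": 0.0},
    {"vision": 0.8, "format": 1.0, "tool": 1.0},
    {"vision": 0.5, "format": 0.0, "tool": 1.0},
    {"vision": 0.6, "format": 1.0, "tool": 1.0},
]
rewards = [total_reward(c) for c in group]
advantages = group_advantages(rewards)
```

Responses that score above the group mean get a positive advantage and those below get a negative one, so no separate value model is needed to rank samples within a group.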

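Significance for This Paper

RRVF is very closely related work, so the difference has to be stated clearly. RRVF asks "how can visual feedback and RL improve image-to-code?" (a method / training framework whose central tasks are chart-to-code and web screenshot-to-code). This paper asks "why does the agentic coding loop need multimodal verification?" (a position / vision paper in which verification is a general loop-level component and image-to-code is only a representative instance).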

RRVF shows that visual feedback can improve image-to-code training. Our focus is broader: multimodal verification should be treated as a general loop-level capability for coding agents, where image-to-code is one important instance rather than the full problem.
