CVPR 2026

MM-ReCoder: Advancing Chart-to-Code Generation with Reinforcement Learning and Self-Correction

Amazon AGI      Brown University
arXiv (Coming Soon)

Abstract

Multimodal Large Language Models (MLLMs) have recently demonstrated promising capabilities in multimodal coding tasks such as chart-to-code generation. However, existing methods primarily rely on supervised finetuning (SFT), which requires the model to learn code patterns from chart-code pairs but never exposes the model to a code execution environment. Moreover, while self-correction through execution feedback offers a potential route to improve coding quality, even state-of-the-art MLLMs have been shown to struggle with effective self-correction. In this work, we introduce MM-ReCoder, a chart-to-code generation model trained with reinforcement learning (RL) and equipped with self-correction ability. We propose a two-stage multi-turn self-correction RL strategy based on Group Relative Policy Optimization (GRPO). The first stage enhances the model's self-correction ability by rolling out multiple second-turn candidates from a shared first turn, while the second stage improves coding capability with full-trajectory optimization. MM-ReCoder learns to produce more accurate and executable code through interaction with the execution environment and by iteratively correcting its own outputs. Our results on three chart-to-code benchmarks demonstrate the state-of-the-art performance of MM-ReCoder.

Motivation


Scientific charts play a crucial role in helping humans interpret complex information by highlighting trends, relationships, and comparisons. Being able to automatically generate the source code of a chart from its image makes it easy to edit, reproduce, and reuse visualizations. However, existing chart-to-code approaches treat the problem as a one-shot generation task — generating code in a single pass without executing or refining based on feedback from rendered results. While humans naturally operate iteratively (implement → execute → visualize → refine), current models do not replicate this self-correcting process.

We find that existing open-source MLLMs struggle to self-correct on multimodal coding tasks. Although scores on evaluation benchmarks appear to improve between turns, the gains come mainly from increased code executability, not from refining already-executable code. When we filter for charts that successfully render in both turns, existing models show a net score decrease, whereas our model, MM-ReCoder, achieves a net gain.


Motivation: self-correction analysis

MM-ReCoder


We propose MM-ReCoder, trained with a cold start phase followed by a two-stage multi-turn self-correction RL strategy.


MM-ReCoder training pipeline

Cold Start


We first perform SFT on 160k chart-code pairs (Chart2Code-160k) to build basic coding capability. We then construct 7k two-turn self-correction conversations using Qwen3-VL-235B-A22B-Instruct and perform a second round of SFT to initialize self-correction behavior.
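One plausible shape for such a two-turn self-correction conversation, with placeholders where the actual chart images and generated code would go (the field names and message contents below are illustrative, not the paper's data format):

```python
# Hypothetical schema for one two-turn self-correction training conversation.
# Turn 1: generate code from the target chart image.
# Turn 2: revise the code given the rendered result of turn 1.
conversation = [
    {"role": "user",
     "content": "<chart image> Redraw this chart in matplotlib."},
    {"role": "assistant",
     "content": "<think>reasoning</think> <code: first attempt>"},
    {"role": "user",
     "content": "<rendered result> The output differs from the target; revise the code."},
    {"role": "assistant",
     "content": "<think>reasoning</think> <code: corrected attempt>"},
]
```

The second round of SFT would then supervise both assistant turns, so the model learns to condition its revision on the rendered feedback image.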


Two-Stage Multi-Turn Self-Correction RL


We use Group Relative Policy Optimization (GRPO) in two stages:

Stage 1 — Shared First Turn: Freeze an online-sampled first-turn output and roll out multiple second-turn candidates from it. This lets the model explore diverse refinement strategies and directly trains self-correction capability.

Stage 2 — Full Trajectory: Jointly optimize both turns end-to-end, with rewards computed at the final turn.
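The group-relative update at the core of GRPO can be sketched in a few lines: each rollout's reward is normalized against the mean and standard deviation of its sampled group, so no learned value function is needed. This is an illustrative sketch, not the paper's training code:

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each rollout's reward by the
    mean and standard deviation of its sampled group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: rewards for four second-turn rollouts that share one first turn
# (Stage 1); above-average corrections get positive advantage.
advs = group_relative_advantages([0.2, 0.5, 0.8, 0.5])
```

In Stage 1 the group is a set of second-turn candidates conditioned on one shared first turn; in Stage 2 the same normalization would apply to full two-turn trajectories, with the reward computed at the final turn.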


Reward Design


We combine three reward signals:

Rule-based Reward: Hooks into Matplotlib to extract chart elements (type, text, colors, layout) and computes an F1-based similarity score against the reference chart.

Model-based Reward: Uses Qwen2.5-VL-72B to score multiple aspects of the rendered chart, including chart type, layout, text, data, and style.

Format Reward: Checks that the output follows the required structure: a `<think>...</think>` reasoning block followed by a fenced Python code block.
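The rule-based reward hinges on matching elements extracted from the generated and reference charts. A minimal sketch of the F1 similarity step, assuming the elements (text strings, colors, chart types) have already been extracted via the Matplotlib hooks; `f1_similarity` is a hypothetical helper, not the paper's code:

```python
from collections import Counter

def f1_similarity(pred_elems, ref_elems):
    """F1 overlap between element multisets (texts, colors, types)
    extracted from the generated chart and the reference chart."""
    pred, ref = Counter(pred_elems), Counter(ref_elems)
    tp = sum((pred & ref).values())  # multiset intersection = true positives
    if tp == 0:
        return 0.0
    precision = tp / sum(pred.values())
    recall = tp / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Example: two of three predicted text elements match the reference
score = f1_similarity(["Title", "x-label", "y-lbl"],
                      ["Title", "x-label", "y-label"])
```

A per-category score of this form (one F1 each for type, text, colors, layout) could then be averaged with the model-based and format rewards to produce the final scalar reward.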


Experiments


Main Results


We evaluate MM-ReCoder on three benchmarks: ChartMimic, Plot2Code, and ChartX. Compared to its base model (Qwen2.5-VL-7B), MM-ReCoder improves execution rate by +22%, low-level score by +27%, and high-level score by +24%. Moreover, MM-ReCoder not only significantly outperforms chart-domain specialist models and models of comparable size, but also achieves the best ChartMimic low-level score and Plot2Code text-match score among all models, surpassing GPT-4o and Qwen3-VL-235B-A22B.


Main quantitative results

Self-Correction Analysis


When evaluating only on charts that successfully render in both turns, MM-ReCoder achieves a +0.30% low-level score improvement and a +0.89% high-level score improvement. In contrast, existing models, especially those of comparable size, show negative improvement in this setting.


Self-correction analysis

Human Evaluation


In A/B testing, MM-ReCoder wins against ChartCoder (37% Win / 43% Tie / 20% Loss) and Qwen2.5-VL-72B (40% Win / 37% Tie / 23% Loss). Among samples with score improvements, 76.5% show visually discernible improvements as judged by humans.

Human evaluation results

Qualitative Results


Examples of Self-correction


MM-ReCoder successfully corrects a wide range of chart rendering issues, including label placement, axis ranges, color mismatches, and style inconsistencies.




Comparison with Other Models


MM-ReCoder achieves superior color, text, and style accuracy compared with ChartCoder, Qwen3-VL-235B, and GPT-4o across diverse chart types.


Comparison with other models

BibTeX

@inproceedings{tang2026mmrecoder,
    title={MM-ReCoder: Advancing Chart-to-Code Generation with Reinforcement Learning and Self-Correction},
    author={Zitian Tang and Xu Zhang and Jianbo Yuan and Yang Zou and Varad Gunjal and Songyao Jiang and Davide Modolo},
    booktitle={CVPR},
    year={2026}
}