CVPR 2026

MM-ReCoder: Advancing Chart-to-Code Generation with Reinforcement Learning and Self-Correction

Amazon AGI      Brown University
arXiv (Coming Soon)

Abstract

Multimodal Large Language Models (MLLMs) have recently demonstrated promising capabilities in multimodal coding tasks such as chart-to-code generation. However, existing methods primarily rely on supervised finetuning (SFT), which requires the model to learn code patterns from chart-code pairs but never exposes the model to a code execution environment. Moreover, while self-correction through execution feedback offers a potential route to improve coding quality, even state-of-the-art MLLMs have been shown to struggle with effective self-correction. In this work, we introduce MM-ReCoder, a chart-to-code generation model trained with reinforcement learning (RL) and equipped with self-correction ability. We propose a two-stage multi-turn self-correction RL strategy based on Group Relative Policy Optimization (GRPO). The first stage enhances the model's self-correction ability by rolling out multiple second-turn candidates from a shared first turn, while the second stage improves coding capability with full-trajectory optimization. MM-ReCoder learns to produce more accurate and executable code through interaction with the execution environment and by iteratively correcting its own outputs. Our results on three chart-to-code benchmarks demonstrate the state-of-the-art performance of MM-ReCoder.

Motivation


Scientific charts play a crucial role in helping humans interpret complex information by highlighting trends, relationships, and comparisons. Being able to automatically generate the source code of a chart from its image makes it easy to edit, reproduce, and reuse visualizations. However, existing chart-to-code approaches treat the problem as a one-shot generation task — generating code in a single pass without executing or refining based on feedback from rendered results. While humans naturally operate iteratively (implement → execute → visualize → refine), current models do not replicate this self-correcting process.

We find that existing open-source MLLMs struggle to self-correct on multimodal coding tasks. Although scores on evaluation benchmarks appear to improve between turns, the gains come mainly from increased code executability, not from refining already-executable code. When we filter for charts that successfully render in both turns, existing models show a net score decrease, whereas our model, MM-ReCoder, achieves a net gain.


Motivation: self-correction analysis

MM-ReCoder


We propose MM-ReCoder, trained with a cold start phase followed by a two-stage multi-turn self-correction RL strategy.


MM-ReCoder training pipeline

Cold Start


We first perform SFT on 160k chart-code pairs (Chart2Code-160k) to build basic coding capability. We then construct 7k two-turn self-correction conversations using Qwen3-VL-235B-A22B-Instruct and perform a second round of SFT to initialize self-correction behavior.
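One plausible shape for such a two-turn self-correction conversation, with placeholders where the actual chart images and generated code would go (the field names and message contents below are illustrative, not the paper's data format):

```python
# Hypothetical schema for one two-turn self-correction training conversation.
# Turn 1: generate code from the target chart image.
# Turn 2: revise the code given the rendered result of turn 1.
conversation = [
    {"role": "user",
     "content": "<chart image> Redraw this chart in matplotlib."},
    {"role": "assistant",
     "content": "<think>reasoning</think> <code: first attempt>"},
    {"role": "user",
     "content": "<rendered result> The output differs from the target; revise the code."},
    {"role": "assistant",
     "content": "<think>reasoning</think> <code: corrected attempt>"},
]
```

The second round of SFT would then supervise both assistant turns, so the model learns to condition its revision on the rendered feedback image.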


Two-Stage Multi-Turn Self-Correction RL


We use Group Relative Policy Optimization (GRPO) in two stages:

Stage 1 — Shared First Turn: Freeze an online-sampled first-turn output and roll out multiple second-turn candidates from it. This lets the model explore diverse refinement strategies and directly trains self-correction capability.

Stage 2 — Full Trajectory: Jointly optimize both turns end-to-end, with rewards computed at the final turn.
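The group-relative update at the core of GRPO can be sketched in a few lines: each rollout's reward is normalized against the mean and standard deviation of its sampled group, so no learned value function is needed. This is an illustrative sketch, not the paper's training code:

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each rollout's reward by the
    mean and standard deviation of its sampled group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: rewards for four second-turn rollouts that share one first turn
# (Stage 1); above-average corrections get positive advantage.
advs = group_relative_advantages([0.2, 0.5, 0.8, 0.5])
```

In Stage 1 the group is a set of second-turn candidates conditioned on one shared first turn; in Stage 2 the same normalization would apply to full two-turn trajectories, with the reward computed at the final turn.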


Reward Design


We combine three reward signals:

Rule-based Reward: Hooks into Matplotlib to extract chart elements (type, text, colors, layout) and computes an F1-based similarity score against the reference chart.

Model-based Reward: Uses Qwen2.5-VL-72B to score multiple aspects of the rendered chart, including chart type, layout, text, data, and style.

Format Reward: Checks that the output follows the required structure: a `<think>...</think>` reasoning block followed by a fenced Python code block.
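The rule-based reward hinges on matching elements extracted from the generated and reference charts. A minimal sketch of the F1 similarity step, assuming the elements (text strings, colors, chart types) have already been extracted via the Matplotlib hooks; `f1_similarity` is a hypothetical helper, not the paper's code:

```python
from collections import Counter

def f1_similarity(pred_elems, ref_elems):
    """F1 overlap between element multisets (texts, colors, types)
    extracted from the generated chart and the reference chart."""
    pred, ref = Counter(pred_elems), Counter(ref_elems)
    tp = sum((pred & ref).values())  # multiset intersection = true positives
    if tp == 0:
        return 0.0
    precision = tp / sum(pred.values())
    recall = tp / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Example: two of three predicted text elements match the reference
score = f1_similarity(["Title", "x-label", "y-lbl"],
                      ["Title", "x-label", "y-label"])
```

A per-category score of this form (one F1 each for type, text, colors, layout) could then be averaged with the model-based and format rewards to produce the final scalar reward.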


Experiments


Main Results


We evaluate MM-ReCoder on three benchmarks: ChartMimic, Plot2Code, and ChartX. Compared to its base model (Qwen2.5-VL-7B), MM-ReCoder improves execution rate by +22%, low-level score by +27%, and high-level score by +24%. Moreover, MM-ReCoder not only significantly outperforms chart-domain specialist models and models of comparable size, but also achieves the best ChartMimic low-level score and Plot2Code text-match score among all models, surpassing GPT-4o and Qwen3-VL-235B-A22B.


Main quantitative results

Self-Correction Analysis


When evaluating only on charts that successfully render in both turns, MM-ReCoder achieves a +0.30% low-level score improvement and a +0.89% high-level score improvement. In contrast, existing models, especially those of comparable size, show negative improvement in this setting.


Self-correction analysis

Human Evaluation


In A/B testing, MM-ReCoder wins against ChartCoder (37% Win / 43% Tie / 20% Loss) and Qwen2.5-VL-72B (40% Win / 37% Tie / 23% Loss). Among samples with score improvements, 76.5% show visually discernible improvements as judged by humans.

Human evaluation results

Qualitative Results


Examples of Self-correction


MM-ReCoder successfully corrects a wide range of chart rendering issues, including label placement, axis ranges, color mismatches, and style inconsistencies.




Comparison with Other Models


MM-ReCoder achieves superior color, text, and style accuracy compared with ChartCoder, Qwen3-VL-235B, and GPT-4o across diverse chart types.


Comparison with other models

BibTeX

@inproceedings{tang2026mmrecoder,
    title={MM-ReCoder: Advancing Chart-to-Code Generation with Reinforcement Learning and Self-Correction},
    author={Zitian Tang and Xu Zhang and Jianbo Yuan and Yang Zou and Varad Gunjal and Songyao Jiang and Davide Modolo},
    booktitle={CVPR},
    year={2026}
}