GARDO: Reinforcing Diffusion Models without Reward Hacking
Abstract
Online reinforcement learning for diffusion model fine-tuning suffers from reward hacking caused by proxy-reward mismatch; GARDO addresses this through selective regularization, adaptive reference updates, and diversity-aware reward amplification.
Fine-tuning diffusion models via online reinforcement learning (RL) has shown great potential for enhancing text-to-image alignment. However, since precisely specifying a ground-truth objective for visual tasks remains challenging, models are typically optimized with a proxy reward that only partially captures the true goal. This mismatch often leads to reward hacking, where proxy scores increase while real image quality deteriorates and generation diversity collapses. Common solutions add regularization against the reference policy to prevent reward hacking, but they compromise sample efficiency and impede the exploration of novel, high-reward regions, since the reference policy is usually sub-optimal. To address the competing demands of sample efficiency, effective exploration, and mitigation of reward hacking, we propose Gated and Adaptive Regularization with Diversity-aware Optimization (GARDO), a versatile framework compatible with various RL algorithms. Our key insight is that regularization need not be applied universally; instead, it is highly effective to selectively penalize the subset of samples that exhibit high uncertainty. To address the exploration challenge, GARDO introduces an adaptive regularization mechanism wherein the reference model is periodically updated to match the capabilities of the online policy, ensuring a relevant regularization target. To address the mode collapse issue in RL, GARDO amplifies the rewards for high-quality samples that also exhibit high diversity, encouraging mode coverage without destabilizing the optimization process. Extensive experiments across diverse proxy rewards and held-out, unseen metrics consistently show that GARDO mitigates reward hacking and enhances generation diversity without sacrificing sample efficiency or exploration, highlighting its effectiveness and robustness.
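The abstract describes three mechanisms: gated regularization applied only to high-uncertainty samples, periodic updates of the reference model, and diversity-aware reward amplification. The sketch below is a rough illustration of how such an objective could be assembled on top of a generic policy-gradient loop; the function names, gating quantiles, coefficients, update interval, and the REINFORCE-style surrogate are all illustrative assumptions, not the paper's actual formulation or hyperparameters.

```python
import torch

def gardo_style_objective(
    log_probs,          # (B,) log-prob of each sample under the online policy
    ref_log_probs,      # (B,) log-prob of each sample under the reference policy
    rewards,            # (B,) proxy rewards
    diversity_scores,   # (B,) per-sample diversity estimates (e.g., mean pairwise feature distance)
    uncertainty,        # (B,) per-sample uncertainty estimates
    uncertainty_quantile=0.8,   # assumed: regularize only the most uncertain ~20% of samples
    kl_coef=0.1,                # assumed regularization strength
    diversity_bonus=0.5,        # assumed amplification factor
    reward_quantile=0.7,        # assumed: amplify only high-reward samples
):
    """A minimal sketch of a GARDO-style objective; all thresholds and
    coefficients here are placeholders, not the paper's values."""
    # (1) Gated regularization: penalize divergence from the reference
    #     only for the subset of samples with high uncertainty.
    gate = (uncertainty >= torch.quantile(uncertainty, uncertainty_quantile)).float()
    kl_penalty = kl_coef * gate * (log_probs - ref_log_probs)

    # (2) Diversity-aware reward amplification: boost rewards for samples
    #     that are both high-reward and high-diversity, encouraging mode coverage.
    high_reward = (rewards >= torch.quantile(rewards, reward_quantile)).float()
    shaped_rewards = rewards * (1.0 + diversity_bonus * high_reward * diversity_scores)

    # REINFORCE-style surrogate: maximize shaped reward minus the gated penalty.
    advantages = shaped_rewards - shaped_rewards.mean()
    return -(advantages.detach() * log_probs - kl_penalty).mean()


def maybe_update_reference(policy, ref_policy, step, update_every=1000):
    """Adaptive reference update (sketch): periodically sync the reference model
    to the current online policy so the regularization target tracks its progress.
    The fixed interval is an assumed placeholder for the paper's update rule."""
    if step > 0 and step % update_every == 0:
        ref_policy.load_state_dict(policy.state_dict())
```

In this sketch the gating term confines the regularization cost to uncertain samples, leaving confident, high-reward samples free to move away from the reference, while the periodic reference sync keeps the penalty anchored to a policy of comparable capability.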
Community
Introducing GARDO: Reinforcing Diffusion Models without Reward Hacking
paper: https://arxiv.org/abs/2512.24138
code: https://github.com/tinnerhrhe/gardo
project: https://tinnerhrhe.github.io/gardo_project/
arXiv lens breakdown of this paper: https://arxivlens.com/PaperView/Details/gardo-reinforcing-diffusion-models-without-reward-hacking-8457-5d8540b5
- Executive Summary
- Detailed Breakdown
- Practical Applications
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Data-regularized Reinforcement Learning for Diffusion Models at Scale (2025)
- The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation (2025)
- DiverseGRPO: Mitigating Mode Collapse in Image Generation via Diversity-Aware GRPO (2025)
- Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning (2025)
- Diffusion Fine-Tuning via Reparameterized Policy Gradient of the Soft Q-Function (2025)
- Anchoring Values in Temporal and Group Dimensions for Flow Matching Model Alignment (2025)
- Beyond Reward Margin: Rethinking and Resolving Likelihood Displacement in Diffusion Models via Video Generation (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot recommend