Evaluating Robustness of Reward Models for Mathematical Reasoning


Note that this project page is fully anonymized. Some links might not be available due to anonymization.


Abstract

Reward models are key components of reinforcement learning from human feedback (RLHF) systems, aligning model behavior with human preferences. In the math domain in particular, many studies have used reward models to align policies and improve reasoning capabilities. Recently, as the importance of reward models has been emphasized, RewardBench was proposed to understand their behavior. However, we find that the math subset of RewardBench has different representations between chosen and rejected completions and relies on a single comparison, which may lead to unreliable results because it only sees an isolated case. It therefore fails to accurately represent the robustness of reward models, leading to a misunderstanding of their performance and potentially resulting in reward hacking.

In this work, we introduce a new design for the reliable evaluation of reward models, and to validate it, we construct RewardMATH, a benchmark that effectively represents the robustness of reward models in mathematical reasoning tasks. We demonstrate that scores on RewardMATH strongly correlate with the results of the optimized policy and effectively estimate reward overoptimization, whereas the existing benchmark shows almost no correlation. These results underscore the potential of our design to enhance the reliability of evaluation and to represent the robustness of reward models.

Preliminaries

Robustness of reward model

Reward hacking represents a significant challenge in the development and deployment of reward models for RLHF. This phenomenon occurs when policies exploit loopholes in reward models to achieve higher scores, stemming from discrepancies between human preferences (the true reward function) and proxy reward models. Such issues underscore the importance of evaluating reward models themselves, not just policy models (post-RLHF models). Reward hacking can lead to reward overoptimization, where optimizing against a proxy reward model may initially improve the true reward but gradually degrades it, ultimately resulting in optimization failure.

In this work, we argue that the robustness of a reward model should be evaluated based on how effectively it provides signals from which a policy can learn.

Designing a Reliable Benchmark

On the Road to Evaluating the Robustness of Reward Models

Motivation

A motivating example from the math subset of RewardBench and drawbacks of the existing evaluation method.

RewardBench, a widely used benchmark for reward models, does not fully address the robustness of models in the math domain, with recent findings showing that about 20% of the annotations in its underlying PRM800K dataset are incorrect. The evaluation process in RewardBench, which compares rewards between a chosen solution and a rejected solution annotated by unaligned GPT-4, is flawed: human-written solutions often skip steps, leading to representational discrepancies with machine-generated solutions. These discrepancies undermine the evaluation's reliability, as a comparison against a single incorrect solution does not sufficiently assess the robustness of reward models.


RewardMATH

Statistics of RewardMATH

A histogram showing the distribution of samples by the number of steps on RewardBench and RewardMATH, and the contribution of each model to the rejected solutions.

The design philosophy of RewardMATH is to guard against hasty generalization, which occurs when conclusions are drawn from a sample that is too small or consists of too few cases. To design a reliable benchmark, we aim to mitigate the risk of reward hacking and employ comparisons with a variety of incorrect (i.e., rejected) solutions. To this end, we introduce RewardMATH, a reliable benchmark crafted for evaluating the robustness of reward models in mathematical reasoning.


Evaluation metric

For each problem, we infer 10 solutions in total (1 correct solution and 9 incorrect solutions) and then assign a true classification label when the reward of the chosen solution is higher than all rewards of the rejected solutions. Furthermore, since considering only whether the reward of the chosen solution is the highest can be fairly strict, we also utilize Mean Reciprocal Rank (MRR), where a higher rank for the chosen solution leads to a higher score.
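The two metrics described above can be sketched in a few lines. This is an illustrative implementation, not the paper's code; in particular, the choice to count ties against the chosen solution is our assumption.

```python
# Sketch of the two per-problem metrics: accuracy (the chosen solution
# must outscore all 9 rejected ones) and Mean Reciprocal Rank (MRR).
# Tie-breaking (ties count against the chosen solution) is an assumption.

def accuracy(chosen_reward: float, rejected_rewards: list[float]) -> int:
    """1 if the chosen reward is strictly higher than every rejected reward."""
    return int(all(chosen_reward > r for r in rejected_rewards))

def mrr(chosen_reward: float, rejected_rewards: list[float]) -> float:
    """Reciprocal rank of the chosen solution among all 10 solutions."""
    rank = 1 + sum(r >= chosen_reward for r in rejected_rewards)
    return 1.0 / rank

chosen = 0.92
rejected = [0.95, 0.80, 0.75, 0.60, 0.55, 0.50, 0.40, 0.30, 0.20]
print(accuracy(chosen, rejected))  # 0: one rejected solution scores higher
print(mrr(chosen, rejected))       # 0.5: the chosen solution ranks second
```

Averaging these per-problem values over the benchmark then gives the reported scores.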

Evaluating Reward Models


Results of generative RMs

The results of generative reward models on RewardBench and RewardMATH.

The results from RewardBench suggest that LLMs such as GPT-4 or Prometheus-2-7B could potentially serve as effective reward models. However, the more thorough evaluation on RewardMATH indicates that LLMs generally do not perform well as reward models, with most achieving scores close to zero, except for those in the GPT-4 family. Through a direct assessment that accounts for ties, we find that most LLMs fail to distinguish between correct and incorrect solutions, simply assigning the same score to all of them.


Results of classifier-based RMs and PRMs

The results of classifier-based RMs and PRMs on RewardBench and RewardMATH.

Rankings on RewardBench do not consistently predict performance on RewardMATH. Specifically, Oasst-rm-2.1-pythia-1.4b, one of the top-ranked models on RewardBench, struggles on RewardMATH, scoring lower than Beaver-7b-v2.0-reward, the lowest-ranked model on RewardBench. In contrast, Internlm2-7b-reward achieves the highest performance on RewardMATH, suggesting that it is a genuinely robust reward model for mathematical reasoning.

Validating Our Design for a Reliable Benchmark

Reliability of Benchmark

Correlation with optimized policies

The relationship between the difference in accuracy on math test sets and the performance based on the benchmark design.


RewardMATH shows a strong positive correlation between benchmark scores and the results of the optimized policy, indicating its reliability, whereas RewardBench shows only a weak correlation. Additionally, our analysis explores evaluation-set designs that prevent reward hacking by comparing the chosen and rejected solutions of the two benchmarks. The heatmap results highlight the importance of minimizing representation differences between chosen and rejected solutions to mitigate vulnerability to reward hacking, as well as of employing one-to-many comparisons for more reliable evaluations.
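The reliability claim above reduces to a simple computation: correlate each reward model's benchmark score with the downstream accuracy of the policy it optimizes. A minimal sketch with made-up numbers (the scores below are hypothetical, not the paper's results):

```python
# Correlating benchmark scores with optimized-policy accuracy.
# All numbers here are hypothetical placeholders for illustration.

from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

benchmark_scores = [0.10, 0.35, 0.50, 0.72, 0.90]  # hypothetical RewardMATH scores
policy_accuracy  = [0.22, 0.30, 0.41, 0.55, 0.63]  # hypothetical policy accuracies

print(round(pearson(benchmark_scores, policy_accuracy), 3))  # ≈ 0.992
```

A high coefficient across many reward models is what certifies the benchmark as a reliable proxy for downstream performance.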

Through the Lens of Reward Overoptimization

Reward overoptimization

Gold rewards and oracle rewards (pass@1) in BoN and PPO experiments with proxy reward models across different amounts of data in a synthetic setup.


Typically, a robust proxy reward model trained to capture human preferences should exhibit increasing gold rewards as KL divergence increases. Conversely, a collapse in gold rewards at a certain point as KL divergence increases indicates a lack of robustness in the proxy reward model. The figure illustrates how dataset size impacts the behavior of the reward model within a synthetic setup. We find that proxy reward models trained on smaller datasets reach peak rewards at lower KL divergences, indicating faster overoptimization. This finding suggests that larger datasets can help mitigate reward overoptimization. Furthermore, we confirm that reward overoptimization can also be observed through oracle rewards (i.e., pass@1) in tasks with well-defined human preferences, such as mathematics.
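For the best-of-n (BoN) setting, the KL divergence on the x-axis of such plots is commonly computed with the standard closed-form estimate KL = log(n) − (n − 1)/n rather than measured empirically (the PPO case measures it during training). A small sketch, assuming this standard formula is the one used:

```python
# Analytic KL divergence between the best-of-n policy and the base
# policy: KL = log(n) - (n - 1)/n. This is the standard closed-form
# estimate for BoN sampling, assumed here as the x-axis quantity.

import math

def bon_kl(n: int) -> float:
    """KL divergence of best-of-n sampling from the base policy, in nats."""
    return math.log(n) - (n - 1) / n

for n in (1, 4, 16, 64, 256):
    print(n, round(bon_kl(n), 3))
```

Because KL grows only logarithmically in n, sweeping n over powers of two traces out the x-axis of the overoptimization curves cheaply.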


Results of classifier-based RMs and PRMs

Gold and oracle rewards (pass@1) for BoN experiments with MetaMATH-Mistral-7B.


The figure shows how gold and oracle rewards change with increasing KL divergence and reveals varying degrees of overoptimization across models. Notably, models that perform well on RewardBench, like Oasst-rm-2.1-pythia-1.4b, often exhibit rapid overoptimization, with no consistent correlation between benchmark performance and the extent of overoptimization. In contrast, RewardMATH shows a clear trend in which higher performance correlates with less reward collapse, highlighting its reliability in providing accurate rewards and effectively mitigating overoptimization.
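The BoN experimental loop behind these curves can be sketched as follows. The helper names (`policy_sample`, `proxy_reward`, `is_correct`) are hypothetical stand-ins, not the paper's implementation:

```python
# Sketch of a BoN overoptimization experiment: sample n candidates,
# pick the one the proxy reward model scores highest, then evaluate
# that pick with the gold/oracle reward (here, pass@1 correctness).

import random

def best_of_n(problem, n, policy_sample, proxy_reward):
    """Return the candidate with the highest proxy reward among n samples."""
    candidates = [policy_sample(problem) for _ in range(n)]
    return max(candidates, key=proxy_reward)

# Toy instantiation: solutions are integers, the gold answer is 42,
# and a deliberately miscalibrated proxy prefers larger numbers.
random.seed(0)
policy_sample = lambda p: random.randint(0, 100)
proxy_reward = lambda s: s        # flawed proxy: bigger is better
is_correct = lambda s: s == 42    # gold reward (pass@1)

pick = best_of_n("toy problem", 16, policy_sample, proxy_reward)
print(pick, is_correct(pick))
```

As n grows, a miscalibrated proxy like this one drifts toward its own preferences rather than correctness, which is exactly the collapse in oracle reward that the figure tracks.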

Discussion

Developing effective RLHF systems

Benchmarks serve as critical milestones in advancing artificial intelligence. In this work, we argue that a benchmark for reward models should reliably assess their robustness, where a robust RM is one that provides useful signals to enable effective policy learning. Through extensive experiments, we confirm that our reliable benchmark design, which mitigates the risk of reward hacking and employs one-to-many comparisons, accurately reflects the robustness of reward models. While this work marks a significant step forward, there is still room for improvement. We validate our design on mathematical reasoning tasks, where human preferences can be clearly defined by correctness, making it easier to gather multiple rejected completions. Since reward models can be applied to a wide range of tasks, a crucial next step is to extend our design to cover them. We hope that advancing this line of research will provide a promising path toward developing more trustworthy and effective RLHF systems.

Conclusion

In this work, we suggest a new design for the reliable evaluation of reward models: (1) mitigating the risk of reward hacking and (2) employing one-to-many comparisons. To validate our design, we propose RewardMATH, a benchmark that effectively represents the robustness of reward models in mathematical reasoning tasks. Our extensive experiments demonstrate that performance on RewardMATH correlates strongly with the performance of the optimized policy, whereas the existing benchmark shows no correlation. Furthermore, we confirm that RewardMATH can effectively estimate reward overoptimization, a critical concern in RLHF systems.

BibTeX


@article{Anonymized,
  title={Evaluating Robustness of Reward Models for Mathematical Reasoning},
  author={Anonymized},
  journal={Anonymized},
  year={2024}
}