# PaTaRM: Bridging Pairwise and Pointwise Signals via Preference-Aware Task-Adaptive Reward Modeling

## Overview
PaTaRM is a Generative Reward Model (GRM) designed for RLHF alignment. Existing GRMs face a fundamental trade-off: pairwise methods suffer from training-inference mismatch, while pointwise methods require expensive absolute rating annotations. PaTaRM resolves this by introducing two key components:
- Preference-Aware Reward (PAR): enables robust pointwise training directly from readily available pairwise preference data, eliminating the need for explicit rating labels.
- Task-Adaptive Rubric: dynamically generates instance-specific evaluation criteria for precise, context-aware scoring.
PaTaRM achieves an 8.7% average improvement on RewardBench and RMBench across Qwen3-8B/14B models, and improves downstream RLHF performance on IFEval and InFoBench by an average relative gain of 13.6%.
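To make the PAR idea concrete, here is an illustrative sketch of converting pairwise preference data into pointwise training items. This is not the paper's exact PAR formulation; the record field names (`prompt`, `chosen`, `rejected`, `task`) and the simple preferred/dispreferred labeling are assumptions for illustration only.

```python
# Illustrative sketch (NOT the paper's exact PAR algorithm): split one pairwise
# preference record into two pointwise items, so a pointwise reward model can be
# trained without absolute rating annotations. Field names are hypothetical.
def pairwise_to_pointwise(record):
    """record: {"prompt": str, "chosen": str, "rejected": str, "task": str}"""
    base = {"prompt": record["prompt"], "task": record["task"]}
    return [
        {**base, "response": record["chosen"], "preferred": True},
        {**base, "response": record["rejected"], "preferred": False},
    ]

items = pairwise_to_pointwise({
    "prompt": "What is the capital of France?",
    "chosen": "The capital of France is Paris.",
    "rejected": "The capital of France is Lyon.",
    "task": "chat",
})
print(len(items))  # 2
```

In the actual method, the preference signal is combined with task-adaptive rubrics to derive pointwise scores; see the paper for details.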
## Models
| Model | Base | Link |
|---|---|---|
| PaTaRM-8B | Qwen3-8B | AIJian/PaTaRM-8B |
| PaTaRM-14B | Qwen3-14B | AIJian/PaTaRM-14B |
## Training Data
| Dataset | Type | Size | Description |
|---|---|---|---|
| sft_train_35.6k.jsonl | SFT | 35.6k | Supervised fine-tuning data |
| rl_train_mix_41.7k.jsonl | RL | 41.7k | Pointwise RL training data (pairwise data converted via PAR) |
## Usage
PaTaRM supports two evaluation modes: pointwise (scoring a single response) and pairwise (comparing two responses). Both modes use task-adaptive rubrics for math, code, chat, and safety tasks.
### Pointwise Evaluation
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "AIJian/PaTaRM-8B"  # or AIJian/PaTaRM-14B
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Construct the prompt using the PaTaRM pointwise template
prompt = """<task>chat</task>
<rubrics>
Usefulness:
- 8-10: Fully addresses the question with accurate, comprehensive information.
- 6-7: Addresses the question clearly but may lack some detail.
- 3-5: Relevant but missing key details or context.
- 0-2: Off-topic, incomplete, or poorly structured.
</rubrics>
<prompt>
What is the capital of France?
</prompt>
<response>
The capital of France is Paris.
</response>"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
result = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# Parse <answer>score</answer> from the result
print(result)
```
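The model's output embeds its verdict in an `<answer>...</answer>` tag, as noted in the comment above. The repository may ship its own parser; as a minimal sketch, a regular expression suffices to extract the numeric score:

```python
import re

def parse_score(text):
    """Extract the numeric score from an <answer>...</answer> tag, or None."""
    m = re.search(r"<answer>\s*([0-9]+(?:\.[0-9]+)?)\s*</answer>", text)
    return float(m.group(1)) if m else None

print(parse_score("The response is accurate and concise. <answer>9</answer>"))  # 9.0
```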
### Pairwise Evaluation
```python
# Construct the prompt using the PaTaRM pairwise template
# (reuses the tokenizer and model loaded above)
prompt = """<task>chat</task>
<rubrics>
...
</rubrics>
<prompt>
What is the capital of France?
</prompt>
<responseA>
The capital of France is Paris.
</responseA>
<responseB>
France's capital city is Paris, which is also its largest city.
</responseB>"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
result = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# Parse <answer>A</answer> or <answer>B</answer> from the result
print(result)
```
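In pairwise mode the verdict is a single letter inside the same `<answer>` tag, per the comment above. A small helper (a sketch; the repo may provide its own) can extract it:

```python
import re

def parse_preference(text):
    """Extract the A/B verdict from an <answer>...</answer> tag, or None."""
    m = re.search(r"<answer>\s*([AB])\s*</answer>", text)
    return m.group(1) if m else None

print(parse_preference("Response B adds useful context. <answer>B</answer>"))  # B
```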
## Citation
If you find PaTaRM useful in your research, please cite our paper:
```bibtex
@misc{jian2026patarmbridgingpairwisepointwise,
  title={PaTaRM: Bridging Pairwise and Pointwise Signals via Preference-Aware Task-Adaptive Reward Modeling},
  author={Ai Jian and Jingqing Ruan and Xing Ma and Dailin Li and Weipeng Zhang and Ke Zeng and Xunliang Cai},
  year={2026},
  eprint={2510.24235},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2510.24235},
}
```
## License
This project is licensed under the Apache 2.0 License.
## Contact
For questions or feedback, feel free to reach out:
- 📧 jianai@bupt.edu.cn
- 📧 jianai0530@gmail.com
