# PaTaRM: Bridging Pairwise and Pointwise Signals via Preference-Aware Task-Adaptive Reward Modeling

## Overview
PaTaRM is a Generative Reward Model (GRM) designed for RLHF alignment. Existing GRMs face a fundamental trade-off: pairwise methods suffer from training-inference mismatch, while pointwise methods require expensive absolute rating annotations. PaTaRM resolves this by introducing two key components:
- Preference-Aware Reward (PAR): enables robust pointwise training directly from readily available pairwise preference data, eliminating the need for explicit rating labels.
- Task-Adaptive Rubric: dynamically generates instance-specific evaluation criteria for precise, context-aware scoring.
PaTaRM achieves an 8.7% average improvement on RewardBench and RMBench across Qwen3-8B/14B models, and improves downstream RLHF performance on IFEval and InFoBench by an average relative gain of 13.6%.
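To make the PAR idea concrete, here is an illustrative sketch of converting pairwise preference data into pointwise training items. This is not the paper's exact PAR formulation; the record field names (`prompt`, `chosen`, `rejected`, `task`) and the simple preferred/dispreferred labeling are assumptions for illustration only.

```python
# Illustrative sketch (NOT the paper's exact PAR algorithm): split one pairwise
# preference record into two pointwise items, so a pointwise reward model can be
# trained without absolute rating annotations. Field names are hypothetical.
def pairwise_to_pointwise(record):
    """record: {"prompt": str, "chosen": str, "rejected": str, "task": str}"""
    base = {"prompt": record["prompt"], "task": record["task"]}
    return [
        {**base, "response": record["chosen"], "preferred": True},
        {**base, "response": record["rejected"], "preferred": False},
    ]

items = pairwise_to_pointwise({
    "prompt": "What is the capital of France?",
    "chosen": "The capital of France is Paris.",
    "rejected": "The capital of France is Lyon.",
    "task": "chat",
})
print(len(items))  # 2
```

In the actual method, the preference signal is combined with task-adaptive rubrics to derive pointwise scores; see the paper for details.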
## Models
| Model | Base | Link |
|---|---|---|
| PaTaRM-8B | Qwen3-8B | AIJian/PaTaRM-8B |
| PaTaRM-14B | Qwen3-14B | AIJian/PaTaRM-14B |
## Training Data
| Dataset | Type | Size | Description |
|---|---|---|---|
| sft_train_35.6k.jsonl | SFT | 35.6k | Supervised fine-tuning data |
| rl_train_mix_41.7k.jsonl | RL | 41.7k | Pointwise RL training data (pairwise data converted via PAR) |
## Usage
PaTaRM supports two evaluation modes: pointwise (scoring a single response) and pairwise (comparing two responses). Both modes use task-adaptive rubrics for math, code, chat, and safety tasks.
### Pointwise Evaluation
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "AIJian/PaTaRM-8B"  # or AIJian/PaTaRM-14B
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Construct the prompt using the PaTaRM pointwise template
prompt = """<task>chat</task>
<rubrics>
Usefulness:
- 8-10: Fully addresses the question with accurate, comprehensive information.
- 6-7: Addresses the question clearly but may lack some detail.
- 3-5: Relevant but missing key details or context.
- 0-2: Off-topic, incomplete, or poorly structured.
</rubrics>
<prompt>
What is the capital of France?
</prompt>
<response>
The capital of France is Paris.
</response>"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
result = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# Parse <answer>score</answer> from the result
print(result)
```
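The model's output embeds its verdict in an `<answer>...</answer>` tag, as noted in the comment above. The repository may ship its own parser; as a minimal sketch, a regular expression suffices to extract the numeric score:

```python
import re

def parse_score(text):
    """Extract the numeric score from an <answer>...</answer> tag, or None."""
    m = re.search(r"<answer>\s*([0-9]+(?:\.[0-9]+)?)\s*</answer>", text)
    return float(m.group(1)) if m else None

print(parse_score("The response is accurate and concise. <answer>9</answer>"))  # 9.0
```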
### Pairwise Evaluation
```python
# Construct the prompt using the PaTaRM pairwise template
# (reuses the tokenizer and model loaded above)
prompt = """<task>chat</task>
<rubrics>
...
</rubrics>
<prompt>
What is the capital of France?
</prompt>
<responseA>
The capital of France is Paris.
</responseA>
<responseB>
France's capital city is Paris, which is also its largest city.
</responseB>"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
result = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# Parse <answer>A</answer> or <answer>B</answer> from the result
print(result)
```
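In pairwise mode the verdict is a single letter inside the same `<answer>` tag, per the comment above. A small helper (a sketch; the repo may provide its own) can extract it:

```python
import re

def parse_preference(text):
    """Extract the A/B verdict from an <answer>...</answer> tag, or None."""
    m = re.search(r"<answer>\s*([AB])\s*</answer>", text)
    return m.group(1) if m else None

print(parse_preference("Response B adds useful context. <answer>B</answer>"))  # B
```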
## Citation
If you find PaTaRM useful in your research, please cite our paper:
```bibtex
@misc{jian2026patarmbridgingpairwisepointwise,
  title={PaTaRM: Bridging Pairwise and Pointwise Signals via Preference-Aware Task-Adaptive Reward Modeling},
  author={Ai Jian and Jingqing Ruan and Xing Ma and Dailin Li and Weipeng Zhang and Ke Zeng and Xunliang Cai},
  year={2026},
  eprint={2510.24235},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2510.24235},
}
```
## License
This project is licensed under the Apache 2.0 License.
## Contact
For questions or feedback, feel free to reach out:
- 📧 jianai@bupt.edu.cn
- 📧 jianai0530@gmail.com
