qwen3.5-hard-only-r4
Summary
- Base model: Qwen/Qwen3.5-4B
- Dataset: tytodd/qwen3.5-4b-v1
- Checkpoint: tytodd/qwen3.5-hard-only-r4
OOD Evaluation
| benchmark | n | auroc | accuracy |
|---|---|---|---|
| arc_challenge | 1000 | 0.8875 | 0.8890 |
| judge_bench | 278 | 0.7065 | 0.6583 |
| mmlu | 1000 | 0.7550 | 0.7680 |
| mmlu_pro | 1000 | 0.6889 | 0.7070 |
| rod101_essay_scoring | 81 | 0.7115 | 0.7407 |
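
For readers unfamiliar with the two metric columns, the following is a minimal sketch of how AUROC and accuracy are commonly computed. It does not reproduce this card's evaluation code, which is not shown here. It assumes binary labels (0 or 1) and one scalar score per example; the function names and the 0.5 threshold are hypothetical choices for illustration.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label.

    The 0.5 threshold is an assumption, not a value taken from this card.
    """
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is O(n_pos * n_neg),
    which is fine for benchmark sizes of about a thousand examples.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data for illustration only; not drawn from any benchmark above.
    toy_labels = [1, 0, 1, 0]
    toy_scores = [0.9, 0.4, 0.35, 0.2]
    print("auroc:", auroc(toy_labels, toy_scores))
    print("accuracy:", accuracy(toy_labels, toy_scores))
```

AUROC is threshold-free and depends only on how the scores rank positives against negatives, while accuracy depends on the chosen threshold, which is why the two columns can differ noticeably on the same benchmark.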



