RewardHacking

tongliuphysics

authored 2 papers 3 months ago

Temperature-scaling surprisal estimates improve fit to human reading times -- but does it do so for the "right reasons"?

Paper • 2311.09325 • Published Nov 15, 2023

FocalPO: Enhancing Preference Optimizing by Focusing on Correct Preference Rankings

Paper • 2501.06645 • Published Jan 11, 2025

tongliuphysics

authored a paper about 1 year ago

Multimodal Pragmatic Jailbreak on Text-to-image Models

Paper • 2409.19149 • Published Sep 27, 2024