Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization Paper • 2605.26457 • Published May 26 • 7
K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts Paper • 2606.02404 • Published Jun 1 • 59
A 2-step Framework for Automated Literary Translation Evaluation: Its Promises and Pitfalls Paper • 2412.01340 • Published Dec 2, 2024
On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists Paper • 2605.20668 • Published May 20 • 13
CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation Paper • 2505.24456 • Published May 30, 2025
PRISM: Fine-Grained Paper-to-Paper Retrieval with Multi-Aspect-Aware Query Optimization Paper • 2507.10057 • Published Jul 14, 2025
Rethinking Reward Models for Multi-Domain Test-Time Scaling Paper • 2510.00492 • Published Oct 1, 2025 • 28
Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs Paper • 2510.09201 • Published Oct 10, 2025 • 50
Efficient Long Context Language Model Retrieval with Compression Paper • 2412.18232 • Published Dec 24, 2024 • 1
It Takes Two: Complementary Self-Distillation for Contextual Integrity in LLMs Paper • 2605.20258 • Published May 18 • 30
OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources Paper • 2605.29250 • Published May 28 • 79
Reasoning over mathematical objects: on-policy reward modeling and test time aggregation Paper • 2603.18886 • Published Mar 19 • 6