Efficient Agent Evaluation via Diversity-Guided User Simulation Paper • 2604.21480 • Published 9 days ago • 14
Alignment Makes Language Models Normative, Not Descriptive Paper • 2603.17218 • Published Mar 17 • 46
ST-WebAgentBench Leaderboard Space • Agents • Runtime error • Safety & Trustworthiness Leaderboard for Web Agents
Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs Paper • 2603.09906 • Published Mar 10 • 75
STATe-of-Thoughts: Structured Action Templates for Tree-of-Thoughts Paper • 2602.14265 • Published Feb 15 • 21
Enterprise Agents and Benchmarks Collection Enterprise agent ecosystem featuring AssetOpsBench (industrial) and ITBench (SRE, FinOps, CISO), CUGA to accelerate AI Automation • 16 items • Updated 23 days ago • 15
Effective Red-Teaming of Policy-Adherent Agents Paper • 2506.09600 • Published Jun 11, 2025 • 39
TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations Paper • 2505.18125 • Published May 23, 2025 • 112
ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents Paper • 2410.06703 • Published Oct 9, 2024 • 3
AdaptiVocab: Enhancing LLM Efficiency in Focused Domains through Lightweight Vocabulary Adaptation Paper • 2503.19693 • Published Mar 25, 2025 • 76
GLEE: A Unified Framework and Benchmark for Language-based Economic Environments Paper • 2410.05254 • Published Oct 7, 2024 • 85