arxiv:2512.05033

Arbitrage: Efficient Reasoning via Advantage-Aware Speculation

Published on Dec 4
· Submitted by Yuezhou Hu on Dec 10
Abstract

Arbitrage is a dynamic routing framework for step-level speculative decoding that improves the efficiency of large language model inference by predicting when the target model will produce a meaningfully better step, outperforming traditional fixed-threshold step-level methods.

AI-generated summary

Modern large language models achieve impressive reasoning capabilities with long chains of thought, but they incur substantial computational cost during inference, which motivates techniques that improve the performance-cost ratio. Among these techniques, speculative decoding accelerates inference by employing a fast but less accurate draft model to autoregressively propose tokens, which are then verified in parallel by a more capable target model. However, traditional token-level speculative decoding struggles on reasoning tasks because token mismatches in semantically equivalent steps cause unnecessary rejections. Recent work has shifted to step-level semantic verification, which improves efficiency by accepting or rejecting entire reasoning steps, but existing step-level methods still regenerate many rejected steps with little quality improvement, wasting valuable target compute. To address this challenge, we propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between the draft and target models. Instead of applying a fixed acceptance threshold, Arbitrage uses a lightweight router trained to predict when the target model is likely to produce a meaningfully better step. This routing approximates an ideal Arbitrage Oracle that always chooses the higher-quality step, achieving near-optimal efficiency-accuracy trade-offs. Across multiple mathematical reasoning benchmarks, Arbitrage consistently surpasses prior step-level speculative decoding baselines, reducing inference latency by up to approximately 2× at matched accuracy.
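To make the routing idea concrete, here is a minimal sketch of the generation loop described above. This is not the authors' code: `draft_model`, `target_model`, and `router` are hypothetical interfaces (each model generates one reasoning step from the current context, and the router scores how likely the target model is to produce a meaningfully better step), and the threshold and stop marker are assumptions for illustration.

```python
# Minimal sketch (not the authors' implementation) of advantage-aware routing.
# All interfaces below are hypothetical placeholders.

def arbitrage_generate(prompt, draft_model, target_model, router,
                       route_threshold=0.5, max_steps=64):
    """Generate a chain of thought step by step, calling the target model
    only when the router predicts a meaningful quality advantage."""
    context = prompt
    for _ in range(max_steps):
        draft_step = draft_model.generate_step(context)    # cheap proposal
        # Router predicts the probability that the target model's step
        # would be meaningfully better than the draft's.
        p_target_better = router.score(context, draft_step)
        if p_target_better > route_threshold:
            step = target_model.generate_step(context)     # spend target compute
        else:
            step = draft_step                               # keep the cheap step
        context += step
        if step.strip().endswith("</answer>"):             # hypothetical stop marker
            break
    return context
```

In contrast to fixed-threshold acceptance, the decision here depends on the predicted relative advantage, so target compute is only spent where it is expected to change the step's quality.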

Community

Paper author and submitter

Modern large language models achieve impressive reasoning capabilities with long chains of thought, but they incur substantial computational cost at inference time. Speculative decoding improves efficiency by using a fast, less accurate draft model to propose tokens that are then verified in parallel by a stronger target model. However, on reasoning tasks, traditional token-level speculative decoding often rejects many semantically valid steps due to superficial token mismatches. Recent step-level semantic verification methods mitigate this by accepting or rejecting entire reasoning steps, but they still waste target compute by regenerating many rejected steps that yield little quality gain.
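For comparison with the routing approach below, here is a minimal sketch of the fixed-threshold step-level verification this paragraph refers to. `score_step` is a hypothetical scorer that uses the target model to judge how acceptable the draft step is; the threshold value is an assumption, not a number from the paper.

```python
# Minimal sketch of fixed-threshold step-level speculative decoding
# (the baseline behavior described above, not the authors' code).

def step_level_speculative_decode(prompt, draft_model, target_model,
                                  score_step, accept_threshold=0.7,
                                  max_steps=64):
    context = prompt
    for _ in range(max_steps):
        draft_step = draft_model.generate_step(context)
        if score_step(target_model, context, draft_step) >= accept_threshold:
            step = draft_step                            # accept the whole step
        else:
            # Regenerate with the target model, even when the regenerated
            # step turns out to be barely better than the rejected draft.
            step = target_model.generate_step(context)
        context += step
        if step.strip().endswith("</answer>"):           # hypothetical stop marker
            break
    return context
```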

We propose ARBITRAGE, a step-level speculative generation framework that dynamically routes generation based on the relative advantage of the target model over the draft model. Instead of relying on a fixed acceptance threshold, ARBITRAGE uses a lightweight router trained to predict when the target model is likely to produce a meaningfully better step. This routing approximates an ideal “arbitrage oracle” that always selects the higher-quality step, achieving near-optimal efficiency–accuracy trade-offs. Across multiple mathematical reasoning benchmarks, ARBITRAGE consistently outperforms prior step-level speculative decoding baselines, reducing inference latency by up to approximately 2× at matched accuracy.
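One way to read the "arbitrage oracle" is as a labeling rule for training the router. The sketch below illustrates that reading under stated assumptions: `quality` is a hypothetical step scorer (for example, a reward model or a downstream correctness check), and the margin is an assumed hyperparameter rather than a value from the paper.

```python
# Hypothetical sketch of building router training data from oracle labels.

def oracle_label(context, draft_step, target_step, quality, margin=0.1):
    """Label 1 if the target model's step is meaningfully better,
    i.e. routing to the target model is worth its extra cost."""
    advantage = quality(context, target_step) - quality(context, draft_step)
    return int(advantage > margin)

def build_router_dataset(prompts, draft_model, target_model, quality):
    dataset = []
    for context in prompts:
        draft_step = draft_model.generate_step(context)
        target_step = target_model.generate_step(context)
        label = oracle_label(context, draft_step, target_step, quality)
        # The lightweight router is then trained to predict `label`
        # from (context, draft_step) alone, without calling the target model.
        dataset.append({"context": context,
                        "draft_step": draft_step,
                        "label": label})
    return dataset
```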

