arxiv:2512.05033

Arbitrage: Efficient Reasoning via Advantage-Aware Speculation

Published on Dec 4
· Submitted by Yuezhou Hu on Dec 10
Abstract

Arbitrage is a dynamic routing framework for step-level speculative decoding that improves the efficiency of large language model inference by predicting when the target model will produce a meaningfully better step, outperforming traditional fixed-threshold step-level methods.

AI-generated summary

Modern large language models achieve impressive reasoning capabilities with long chains of thought, but they incur substantial computational cost during inference, which motivates techniques that improve the performance-cost ratio. Among these techniques, speculative decoding accelerates inference by employing a fast but less accurate draft model to autoregressively propose tokens, which are then verified in parallel by a more capable target model. However, traditional token-level speculative decoding struggles on reasoning tasks because token mismatches in semantically equivalent steps cause unnecessary rejections. Recent work has shifted to step-level semantic verification, which improves efficiency by accepting or rejecting entire reasoning steps, but existing step-level methods still regenerate many rejected steps with little quality improvement, wasting valuable target compute. To address this challenge, we propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between the draft and target models. Instead of applying a fixed acceptance threshold, Arbitrage uses a lightweight router trained to predict when the target model is likely to produce a meaningfully better step. This routing approximates an ideal Arbitrage Oracle that always chooses the higher-quality step, achieving near-optimal efficiency-accuracy trade-offs. Across multiple mathematical reasoning benchmarks, Arbitrage consistently surpasses prior step-level speculative decoding baselines, reducing inference latency by up to approximately 2× at matched accuracy.
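To make the routing idea concrete, here is a minimal sketch of the generation loop described above. This is not the authors' code: `draft_model`, `target_model`, and `router` are hypothetical interfaces (each model generates one reasoning step from the current context, and the router scores how likely the target model is to produce a meaningfully better step), and the threshold and stop marker are assumptions for illustration.

```python
# Minimal sketch (not the authors' implementation) of advantage-aware routing.
# All interfaces below are hypothetical placeholders.

def arbitrage_generate(prompt, draft_model, target_model, router,
                       route_threshold=0.5, max_steps=64):
    """Generate a chain of thought step by step, calling the target model
    only when the router predicts a meaningful quality advantage."""
    context = prompt
    for _ in range(max_steps):
        draft_step = draft_model.generate_step(context)    # cheap proposal
        # Router predicts the probability that the target model's step
        # would be meaningfully better than the draft's.
        p_target_better = router.score(context, draft_step)
        if p_target_better > route_threshold:
            step = target_model.generate_step(context)     # spend target compute
        else:
            step = draft_step                               # keep the cheap step
        context += step
        if step.strip().endswith("</answer>"):             # hypothetical stop marker
            break
    return context
```

In contrast to fixed-threshold acceptance, the decision here depends on the predicted relative advantage, so target compute is only spent where it is expected to change the step's quality.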

Community

Paper author and submitter

Modern large language models achieve impressive reasoning capabilities with long chains of thought, but they incur substantial computational cost at inference time. Speculative decoding improves efficiency by using a fast, less accurate draft model to propose tokens that are then verified in parallel by a stronger target model. However, on reasoning tasks, traditional token-level speculative decoding often rejects many semantically valid steps due to superficial token mismatches. Recent step-level semantic verification methods mitigate this by accepting or rejecting entire reasoning steps, but they still waste target compute by regenerating many rejected steps that yield little quality gain.
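For comparison with the routing approach below, here is a minimal sketch of the fixed-threshold step-level verification this paragraph refers to. `score_step` is a hypothetical scorer that uses the target model to judge how acceptable the draft step is; the threshold value is an assumption, not a number from the paper.

```python
# Minimal sketch of fixed-threshold step-level speculative decoding
# (the baseline behavior described above, not the authors' code).

def step_level_speculative_decode(prompt, draft_model, target_model,
                                  score_step, accept_threshold=0.7,
                                  max_steps=64):
    context = prompt
    for _ in range(max_steps):
        draft_step = draft_model.generate_step(context)
        if score_step(target_model, context, draft_step) >= accept_threshold:
            step = draft_step                            # accept the whole step
        else:
            # Regenerate with the target model, even when the regenerated
            # step turns out to be barely better than the rejected draft.
            step = target_model.generate_step(context)
        context += step
        if step.strip().endswith("</answer>"):           # hypothetical stop marker
            break
    return context
```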

We propose ARBITRAGE, a step-level speculative generation framework that dynamically routes generation based on the relative advantage of the target model over the draft model. Instead of relying on a fixed acceptance threshold, ARBITRAGE uses a lightweight router trained to predict when the target model is likely to produce a meaningfully better step. This routing approximates an ideal “arbitrage oracle” that always selects the higher-quality step, achieving near-optimal efficiency–accuracy trade-offs. Across multiple mathematical reasoning benchmarks, ARBITRAGE consistently outperforms prior step-level speculative decoding baselines, reducing inference latency by up to approximately 2× at matched accuracy.
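One way to read the "arbitrage oracle" is as a labeling rule for training the router. The sketch below illustrates that reading under stated assumptions: `quality` is a hypothetical step scorer (for example, a reward model or a downstream correctness check), and the margin is an assumed hyperparameter rather than a value from the paper.

```python
# Hypothetical sketch of building router training data from oracle labels.

def oracle_label(context, draft_step, target_step, quality, margin=0.1):
    """Label 1 if the target model's step is meaningfully better,
    i.e. routing to the target model is worth its extra cost."""
    advantage = quality(context, target_step) - quality(context, draft_step)
    return int(advantage > margin)

def build_router_dataset(prompts, draft_model, target_model, quality):
    dataset = []
    for context in prompts:
        draft_step = draft_model.generate_step(context)
        target_step = target_model.generate_step(context)
        label = oracle_label(context, draft_step, target_step, quality)
        # The lightweight router is then trained to predict `label`
        # from (context, draft_step) alone, without calling the target model.
        dataset.append({"context": context,
                        "draft_step": draft_step,
                        "label": label})
    return dataset
```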

