Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space
Abstract
Large Language Models (LLMs) apply uniform computation to all tokens, despite language exhibiting highly non-uniform information density. This token-uniform regime wastes capacity on locally predictable spans while under-allocating computation to semantically critical transitions. We propose Dynamic Large Concept Models (DLCM), a hierarchical language modeling framework that learns semantic boundaries from latent representations and shifts computation from tokens to a compressed concept space where reasoning is more efficient. DLCM discovers variable-length concepts end-to-end without relying on predefined linguistic units. Hierarchical compression fundamentally changes scaling behavior. We introduce the first compression-aware scaling law, which disentangles token-level capacity, concept-level reasoning capacity, and compression ratio, enabling principled compute allocation under fixed FLOPs. To stably train this heterogeneous architecture, we further develop a decoupled μP parametrization that supports zero-shot hyperparameter transfer across widths and compression regimes. At a practical setting (R=4, corresponding to an average of four tokens per concept), DLCM reallocates roughly one-third of inference compute into a higher-capacity reasoning backbone, achieving a +2.69% average improvement across 12 zero-shot benchmarks under matched inference FLOPs.
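The exact functional form of the compression-aware scaling law is not reproduced on this page. Purely as an illustration of what disentangling token-level capacity, concept-level reasoning capacity, and compression ratio can look like, a Chinchilla-style additive fit might be extended as below; every symbol, exponent, and the coupling g(R) is an expository placeholder, not the authors' fitted law.

```latex
% Illustrative placeholder only -- not the paper's fitted scaling law.
% N_tok: token-level capacity, N_con: concept-level reasoning capacity,
% D: training tokens, R: compression ratio, g(R): assumed coupling between
% compression and effective concept-level capacity.
L(N_{\mathrm{tok}}, N_{\mathrm{con}}, D, R) \;\approx\;
    E \;+\; \frac{A}{N_{\mathrm{tok}}^{\alpha}}
      \;+\; \frac{B}{\left(N_{\mathrm{con}}\, g(R)\right)^{\beta}}
      \;+\; \frac{C}{D^{\gamma}}
```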
Community
Dynamic Large Concept Models (DLCM) introduce an end-to-end trained concept-level language modeling architecture that breaks the token-uniform computation paradigm in modern LLMs. Inspired by hierarchical models such as H-Net, DLCM learns semantic boundaries directly from latent representations, dynamically compresses token sequences into variable-length concepts, performs deep reasoning in the concept space, and projects the results back to tokens via causal cross-attention.
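To make the pipeline concrete, here is a minimal PyTorch sketch of the three stages described above: scoring semantic boundaries on latent states, pooling each variable-length segment into one concept vector, and building the causal token-to-concept visibility rule used when projecting back to tokens. The module names, sigmoid thresholding, and mean pooling are illustrative assumptions rather than the authors' implementation, and batching is omitted.

```python
# Minimal sketch of the DLCM-style pipeline described above. Thresholded
# boundary scores and mean pooling are assumptions made for clarity.
import torch
import torch.nn as nn

class ConceptCompressor(nn.Module):
    def __init__(self, d_model: int, threshold: float = 0.5):
        super().__init__()
        self.boundary_scorer = nn.Linear(d_model, 1)
        self.threshold = threshold

    def forward(self, h: torch.Tensor):
        # h: (seq_len, d_model) hidden states of one sequence
        p = torch.sigmoid(self.boundary_scorer(h)).squeeze(-1)      # boundary probabilities
        is_boundary = p > self.threshold
        is_boundary[-1] = True                                      # always close the final segment
        seg_id = torch.cumsum(is_boundary.long(), dim=0) - is_boundary.long()
        n_concepts = int(seg_id.max()) + 1
        concepts = torch.zeros(n_concepts, h.size(-1), dtype=h.dtype)
        concepts.index_add_(0, seg_id, h)                           # sum tokens per segment
        counts = torch.bincount(seg_id, minlength=n_concepts).clamp(min=1)
        concepts = concepts / counts.unsqueeze(-1)                  # mean-pool into concept vectors
        concept_end = is_boundary.nonzero().squeeze(-1)             # last token index of each concept
        return concepts, concept_end

def causal_token_to_concept_mask(seq_len: int, concept_end: torch.Tensor) -> torch.Tensor:
    """Token i may attend only to concepts whose last token sits at position <= i."""
    return torch.arange(seq_len).unsqueeze(1) >= concept_end.unsqueeze(0)

h = torch.randn(16, 64)
concepts, concept_end = ConceptCompressor(64)(h)
mask = causal_token_to_concept_mask(16, concept_end)                # (16, n_concepts) bool
print(concepts.shape, mask.shape)
```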
Compared to standard dense Transformers trained with next-token prediction, DLCM achieves ~34% inference FLOPs reduction in an apples-to-apples setting, while consistently improving performance on reasoning-dominant benchmarks. Notably, the relative FLOPs savings increase with model scale, indicating favorable scaling behavior beyond parameter efficiency alone. At similar loss levels, DLCM reallocates computation toward boundary and planning tokens, yielding stronger downstream accuracy even while spending less compute on redundant tokens.
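As a back-of-the-envelope check on the ~34% figure (the layer split used below is a hypothetical number chosen only to expose the arithmetic, not a value reported by the paper): if a fraction f of the dense baseline's per-token compute is moved onto a sequence compressed by a factor R, the relative savings are f·(1 − 1/R).

```python
# Back-of-the-envelope arithmetic only; 'concept_fraction' is a hypothetical
# split of baseline compute, not a number reported by the paper.
def relative_flops_savings(concept_fraction: float, compression_ratio: float) -> float:
    """Fraction of the dense baseline's per-token FLOPs saved when
    'concept_fraction' of that compute runs on a sequence shortened by
    'compression_ratio' (savings = f * (1 - 1/R))."""
    return concept_fraction * (1.0 - 1.0 / compression_ratio)

# With ~45% of baseline compute assumed to move to the concept level and R = 4,
# the savings land near the ~34% quoted above; under matched inference FLOPs
# that budget can instead buy a higher-capacity reasoning backbone.
print(relative_flops_savings(concept_fraction=0.45, compression_ratio=4))  # 0.3375
```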
Technically, the paper contributes:
(1) a FlashAttention-VarLen–based implementation for efficient concept-token cross-attention (a plain-PyTorch reference of the variable-length layout and masking pattern is sketched right after this list);
(2) a decoupled μP formulation tailored to heterogeneous token- and concept-width modules, enabling zero-shot hyperparameter transfer across scales (a toy illustration of the per-family learning-rate scaling appears at the end of this summary);
(3) a Global Parser that enforces stable, content-adaptive compression at the batch level and delivers solid empirical gains.
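To make contribution (1) concrete, the sketch below spells out, in plain PyTorch, the packed variable-length layout (cumulative sequence lengths) and the causal token-to-concept visibility rule that a FlashAttention-VarLen kernel such as flash_attn_varlen_func would realize without ever materializing the mask. The packing helper and the example boundaries are illustrative assumptions.

```python
# Plain-PyTorch reference for the packed variable-length concept-token
# cross-attention layout. A FlashAttention-VarLen kernel (e.g.
# flash_attn_varlen_func with cu_seqlens_q / cu_seqlens_k) computes the same
# attention without materializing this mask; the packing and the example
# boundaries below are illustrative assumptions.
import torch

def build_cu_seqlens(lengths):
    """Cumulative lengths [0, l1, l1+l2, ...] in the layout varlen kernels expect."""
    cu = [0]
    for n in lengths:
        cu.append(cu[-1] + n)
    return torch.tensor(cu, dtype=torch.int32)

def reference_cross_attention_mask(token_lens, concept_ends_per_seq):
    """Token i of a sequence may attend only to concepts of the *same* sequence
    whose last token sits at or before position i."""
    total_q, total_k = sum(token_lens), sum(len(e) for e in concept_ends_per_seq)
    mask = torch.zeros(total_q, total_k, dtype=torch.bool)
    q_off, k_off = 0, 0
    for T, ends in zip(token_lens, concept_ends_per_seq):
        pos = torch.arange(T).unsqueeze(1)                       # (T, 1) token positions
        mask[q_off:q_off + T, k_off:k_off + len(ends)] = pos >= torch.tensor(ends).unsqueeze(0)
        q_off, k_off = q_off + T, k_off + len(ends)
    return mask

# Two packed sequences of 6 and 4 tokens, compressed into 3 and 2 concepts whose
# last tokens sit at the listed (hypothetical) positions.
cu_seqlens_q = build_cu_seqlens([6, 4])                          # tensor([0, 6, 10])
cu_seqlens_k = build_cu_seqlens([3, 2])                          # tensor([0, 3, 5])
mask = reference_cross_attention_mask([6, 4], [[1, 3, 5], [0, 3]])
print(cu_seqlens_q, cu_seqlens_k, mask.shape)                    # ..., torch.Size([10, 5])
```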
Overall, DLCM can be viewed as a principled special case of layer-wise local compression combined with sparse attention, offering a scalable path toward more compute-efficient and reasoning-centric language models.
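Returning to contribution (2): one way to picture a decoupled μP parametrization is as standard μP learning-rate bookkeeping carried out per module family, with the token-width and concept-width modules scaled against independent proxy widths. The grouping rule, the base widths, and the omission of the usual special-casing for embeddings, output layers, and biases are simplifications for illustration, not the paper's exact recipe.

```python
# Illustrative "decoupled μP"-style bookkeeping: hidden-weight learning rates
# are scaled by base_width / width, with independent base widths for the
# token-level and concept-level module families. Grouping parameters by name
# prefix is an assumption; full μP also special-cases embeddings, output
# layers, and biases, which is omitted here.
import torch

def mup_param_groups(model, base_lr, base_widths, widths):
    """base_widths / widths are dicts such as {'token': 256, 'concept': 256}
    and {'token': 512, 'concept': 1024}: proxy and target widths per family."""
    groups = []
    for family, width in widths.items():
        scale = base_widths[family] / width                      # μP: lr ∝ 1/width for hidden weights
        params = [p for name, p in model.named_parameters() if name.startswith(family)]
        if params:
            groups.append({"params": params, "lr": base_lr * scale})
    return groups

# Toy model: a narrow token-level block and a wide concept-level block.
model = torch.nn.ModuleDict({
    "token": torch.nn.Linear(512, 512),
    "concept": torch.nn.Linear(1024, 1024),
})
optimizer = torch.optim.AdamW(mup_param_groups(
    model, base_lr=1e-2,
    base_widths={"token": 256, "concept": 256},
    widths={"token": 512, "concept": 1024},
))
print([group["lr"] for group in optimizer.param_groups])         # [0.005, 0.0025]
```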