Elastic Attention: Test-time Adaptive Sparsity Ratios for Efficient Transformers • 2601.17367 • Published Jan 24
Small-scale proxies for large-scale Transformer training instabilities • 2309.14322 • Published Sep 25, 2023
Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training • 2602.00747 • Published Jan 31
HySparse: A Hybrid Sparse Attention Architecture with Oracle Token Selection and KV Cache Sharing • 2602.03560 • Published Feb 3
FineRMoE: Dimension Expansion for Finer-Grained Expert with Its Upcycling Approach • 2603.13364 • Published Mar 9
When Does Sparsity Mitigate the Curse of Depth in LLMs • 2603.15389 • Published 25 days ago
Attention Sinks Are Provably Necessary in Softmax Transformers: Evidence from Trigger-Conditional Tasks • 2603.11487 • Published 29 days ago
Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings • 2512.12167 • Published Dec 13, 2025
CoPE: Clipped RoPE as A Scalable Free Lunch for Long Context LLMs • 2602.05258 • Published Feb 5
Beyond Length: Quantifying Long-Range Information for Long-Context LLM Pretraining Data • 2510.25804 • Published Oct 29, 2025
Value Residual Learning For Alleviating Attention Concentration In Transformers • 2410.17897 • Published Oct 23, 2024