There Will Be a Scientific Theory of Deep Learning
Abstract
The paper argues for the emergence of a scientific theory of deep learning, termed "learning mechanics," which focuses on characterizing training dynamics, hidden representations, and performance through five key research areas including solvable idealized settings and universal behaviors.
In this paper, we make the case that a scientific theory of deep learning is emerging. By this we mean a theory which characterizes important properties and statistics of the training process, hidden representations, final weights, and performance of neural networks. We pull together major strands of ongoing research in deep learning theory and identify five growing bodies of work that point toward such a theory: (a) solvable idealized settings that provide intuition for learning dynamics in realistic systems; (b) tractable limits that reveal insights into fundamental learning phenomena; (c) simple mathematical laws that capture important macroscopic observables; (d) theories of hyperparameters that disentangle them from the rest of the training process, leaving simpler systems behind; and (e) universal behaviors shared across systems and settings which clarify which phenomena call for explanation. Taken together, these bodies of work share certain broad traits: they are concerned with the dynamics of the training process; they primarily seek to describe coarse aggregate statistics; and they emphasize falsifiable quantitative predictions. We argue that the emerging theory is best thought of as a mechanics of the learning process, and suggest the name learning mechanics. We discuss the relationship between this mechanics perspective and other approaches for building a theory of deep learning, including the statistical and information-theoretic perspectives. In particular, we anticipate a symbiotic relationship between learning mechanics and mechanistic interpretability. We also review and address common arguments that fundamental theory will not be possible or is not important. We conclude with a portrait of important open directions in learning mechanics and advice for beginners. We host further introductory materials, perspectives, and open questions at learningmechanics.pub.
Community
The paper "There Will Be a Scientific Theory of Deep Learning" (arXiv:2604.21691) argues that a fundamental scientific theory of deep learning—termed **"learning mechanics"**—is emerging. This theory aims to be the physics of learning: a mathematical framework that predicts coarse, aggregate statistics of training dynamics, representations, and model performance from first principles.
The authors present five lines of evidence supporting this claim, each drawing analogies to physics (mechanics, statistical physics, and thermodynamics):
1. Analytically Solvable Settings Exist
Just as physics uses the harmonic oscillator or hydrogen atom as solvable cornerstones, deep learning admits tractable models that provide intuition for broader phenomena.
- Linearization in data: Deep linear networks exhibit exactly solvable gradient-flow dynamics (decoupled Bernoulli ODEs), showing "greedy low-rank learning": modes with larger singular values are learned first (sketched in code after the figure below).
- Linearization in parameters: The Neural Tangent Kernel (NTK) limit describes wide networks as kernel ridge regression, enabling exact predictions of test performance from data statistics.
- Beyond linearization: Recent progress solves nonlinear settings like multi-index models, quadratic activations, and shallow networks in high dimensions.
Figure 1: (Left) Exact solutions for deep linear networks show sequential learning of singular modes. (Right) NTK-based predictions match test performance of wide neural networks.
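A minimal numerical sketch of this picture (my construction, not code from the paper): gradient descent on a two-layer linear network fitting a full-rank target map. Under small initialization, the singular values of the end-to-end map $W_2 W_1$ should rise sigmoidally, largest first.

```python
# Sketch (assumptions mine): greedy low-rank learning in a two-layer
# linear network. The target's singular values are 3 > 2 > 1, and the
# corresponding modes of W2 @ W1 should saturate in that order.
import numpy as np

rng = np.random.default_rng(0)
d, hidden, steps, lr = 3, 16, 4000, 0.002

# Target map with known singular values.
s_true = np.array([3.0, 2.0, 1.0])
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
V, _ = np.linalg.qr(rng.normal(size=(d, d)))
target = U @ np.diag(s_true) @ V.T

# Small initialization keeps the dynamics near the solvable regime.
W1 = 1e-3 * rng.normal(size=(hidden, d))
W2 = 1e-3 * rng.normal(size=(d, hidden))

for t in range(steps + 1):
    if t % 400 == 0:
        # Larger modes saturate earlier: "greedy low-rank learning".
        s_learned = np.linalg.svd(W2 @ W1, compute_uv=False)
        print(t, np.round(s_learned, 3))
    err = W2 @ W1 - target            # residual of the end-to-end map
    W2 -= lr * err @ W1.T             # gradients of 0.5 * ||err||_F^2
    W1 -= lr * W2.T @ err
```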
2. Insightful Limits Reveal Fundamental Behavior
Complex systems simplify in infinite-size limits. Neural networks exhibit distinct "phases" depending on scaling:
- The Lazy/Rich Dichotomy (illustrated in the sketch after the figure below):
  - Lazy regime: with standard initialization ($\sim \text{width}^{-1/2}$), networks behave like frozen kernels (the NTK regime).
  - Rich regime: with smaller initialization ($\sim \text{width}^{-1}$), networks exhibit feature learning: hidden representations adapt to the structure of the data.
- Infinite Depth: Residual networks behave like Neural ODEs or stochastic differential equations, depending on how depth is scaled.
- The Discretization Hypothesis: Finite networks may be understood as discretized approximations to infinite continuous systems (analogous to PDE discretization).
Figure 2: Same network architecture exhibits lazy (weights frozen) vs. rich (weights cluster toward teacher features) dynamics depending on initialization scale.
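A toy demonstration of the dichotomy (my construction, not the paper's experiment): one common way to toggle the two regimes is to multiply the network output by a scale $\alpha$ and divide the learning rate by $\alpha^2$, as in the lazy-training literature. Large $\alpha$ then plays the role of large initialization (near-frozen weights), small $\alpha$ the role of small initialization (substantial weight movement). All sizes and rates below are illustrative.

```python
# Sketch (assumptions mine): relative weight movement of a two-layer
# ReLU network under different output scales alpha. Larger alpha
# should give lazier dynamics (weights move less from initialization).
import numpy as np

def weight_movement(alpha, width=512, steps=1000, lr0=0.005, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(64, 8))
    y = np.sign(X[:, 0])                        # simple single-index target
    W = rng.normal(size=(width, 8)) / np.sqrt(8)
    a = np.zeros(width)                         # zero readout: f(init) = 0
    W0 = W.copy()
    lr = lr0 / alpha**2                         # lazy-training step scaling
    for _ in range(steps):
        h = np.maximum(W @ X.T, 0.0)            # ReLU features, (width, n)
        g = (alpha * (a @ h) - y) / len(y)      # residual of scaled output
        grad_a = alpha * (h @ g)
        grad_W = alpha * ((a[:, None] * (h > 0.0)) * g) @ X
        a -= lr * grad_a
        W -= lr * grad_W
    return np.linalg.norm(W - W0) / np.linalg.norm(W0)

# Movement should shrink as alpha grows (lazy) and grow as it shrinks (rich).
for alpha in (0.25, 1.0, 4.0, 16.0):
    print(f"alpha={alpha:5.2f}  relative weight movement={weight_movement(alpha):.5f}")
```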
3. Simple Empirical Laws Capture Macroscopic Statistics
Aggregate observables follow predictable mathematical laws:
- Neural Scaling Laws: Test loss follows power laws in compute, dataset size, and parameter count ($L \propto C^{-\alpha}$). The laws themselves are predictive, but their exponents remain theoretically unexplained (a fitting sketch follows the figures below).
- Edge of Stability: Under gradient descent, the Hessian sharpness (its largest eigenvalue) progressively increases and then stabilizes near the stability threshold $2/\eta$, explaining why training oscillates yet converges.
- Neural Collapse: At convergence, last-layer class representations collapse to a regular simplex: class means cluster tightly while maintaining maximal angular separation from one another.
- Conservation Laws: Gradient flow exhibits Noetherian conservation laws (e.g., layerwise covariance differences are conserved).
Figure 3: Power-law scaling of loss with compute, data, and parameters.
Figure 4: Sharpness stabilizes at the edge of stability ($2/\eta$) across architectures.
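A minimal sketch of how a scaling-law exponent is typically estimated: a straight-line fit in log-log space for $L(C) \approx a\,C^{-\alpha}$. The numbers below are synthetic placeholders, not measurements from the paper.

```python
# Sketch: recovering a power-law exponent from noisy loss-vs-compute
# data by linear regression in log-log coordinates.
import numpy as np

rng = np.random.default_rng(0)
compute = np.logspace(17, 21, num=9)                       # FLOPs (illustrative)
loss = 4.2 * compute ** -0.05 * np.exp(0.01 * rng.normal(size=9))

slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
print(f"estimated exponent alpha = {-slope:.3f}")          # ~0.05 by construction
```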
4. Hyperparameters Can Be Disentangled and Understood
Optimization and architecture hyperparameters can be mathematically decoupled:
- μP (Maximal Update Parameterization): By scaling learning rates and initializations appropriately with width ($\eta \sim \text{width}^{-1}$), training dynamics remain stable across model sizes, enabling zero-shot hyperparameter transfer from small proxy models to large models (a toy version of the transfer rule follows the figure below).
- Implicit Regularization: SGD implicitly penalizes loss curvature (Hessian sharpness), explaining why large learning rates and small batch sizes often generalize better.
Figure 5: Under μP, optimal learning rates remain constant across widths, enabling transfer from small proxy models to large production models.
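Taken at face value, the $\eta \sim \text{width}^{-1}$ rule quoted above yields a simple transfer recipe. This is a toy sketch only: real μP prescribes layerwise scaling rules, and the widths and rates here are illustrative, not from the paper.

```python
# Toy sketch of width-based learning-rate transfer, assuming the
# eta ~ 1/width scaling quoted in the text for width-dependent layers.
def transfer_lr(lr_small: float, width_small: int, width_large: int) -> float:
    """Rescale a learning rate tuned on a small proxy model to a larger width."""
    return lr_small * width_small / width_large

# Tune once at width 256, reuse at width 4096.
print(transfer_lr(lr_small=3e-3, width_small=256, width_large=4096))  # 1.875e-04
```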
5. Universal Phenomena Appear Across Settings and Tasks
Similar to critical phenomena in physics, deep learning exhibits universality:
- Universal Representations: Different architectures (CNNs, Transformers) and modalities (vision, language) converge to similar internal representations as scale increases (the "Platonic Representation Hypothesis"; a standard similarity metric is sketched after the figure below).
- Universal Inductive Biases: Architectures with different designs (e.g., U-Net vs. Transformer diffusion models) produce nearly identical outputs given the same random seeds, suggesting shared fundamental biases.
- Universal Data Structure: Natural data (images, text) shares power-law spectra and hierarchical structure, explaining why the same algorithms work across domains.
Figure 6: (Left) Different diffusion architectures produce identical images from the same noise seeds. (Right) Representation similarity between language and vision models increases with scale.
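A minimal sketch (my addition, not the paper's code) of linear CKA, a standard metric for the kind of cross-model representation similarity described above. Inputs are (n_samples, n_features) activation matrices; all data below is synthetic.

```python
# Sketch: linear centered kernel alignment (CKA) between two
# representation matrices. Values near 1 indicate shared structure.
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    X = X - X.mean(axis=0)                     # center each feature
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
reps_a = rng.normal(size=(100, 64))
# A linearly related representation plus noise: high similarity expected.
reps_b = reps_a @ rng.normal(size=(64, 32)) + 0.1 * rng.normal(size=(100, 32))
print(f"CKA(related)   = {linear_cka(reps_a, reps_b):.3f}")
print(f"CKA(unrelated) = {linear_cka(reps_a, rng.normal(size=(100, 32))):.3f}")
```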
Synthesis: Learning Mechanics ↔ Mechanistic Interpretability
The authors propose a symbiotic relationship between learning mechanics (the "physics" of aggregate dynamics) and mechanistic interpretability (the "biology" of circuits and features):
- Mechanics → Interpretability: Provides rigorous foundations for assumptions (linearity, locality, sparsity) and explains how mechanisms form during training.
- Interpretability → Mechanics: Empirical discoveries (induction heads, grokking, Fourier features) provide concrete phenomena for theory to explain.
Open Directions
The paper identifies 10 critical open problems, including:
- Solvable models of genuinely deep, nonlinear learning
- Theoretical capture of natural data structure
- Predicting scaling law exponents a priori
- Formal definitions of "features" and "circuits"
- Understanding whether finite networks are approximations to infinite limits
The authors conclude that deep learning theory has transitioned from pure mathematics to an empirical science, with the transparency and measurability of neural networks making a comprehensive "learning mechanics" achievable.