---
language: en
license: apache-2.0
library_name: transformers
tags:
- bert
- fast
- monarch-matrices
- mnli
- efficiency
- triton
- hardware-efficient
- sub-quadratic
- fast-inference
- h100-optimized
datasets:
- glue
- wikipedia
metrics:
- accuracy
- throughput
- latency
pipeline_tag: text-classification
model-index:
- name: Monarch-BERT-Base-MNLI-Full
  results:
  - task:
      type: text-classification
      name: Natural Language Inference
    dataset:
      name: GLUE MNLI
      type: glue
      config: mnli
      split: validation_matched
    metrics:
    - name: Accuracy
      type: accuracy
      value: 78.34
      description: "Approx. 5% accuracy trade-off for maximum speed compared to dense BERT."
    - name: Throughput (TPS on H100)
      type: throughput
      value: 9029.4
      description: "Measured with torch.compile(mode='max-autotune') on NVIDIA H100. Represents a 24.3% increase over the optimized Triton baseline."
    - name: Latency (ms)
      type: latency
      value: 3.54
      description: "Batch size 32, sequence length 128. Achieves a ~24.6% faster inference time."
---

# Monarch-BERT-MNLI (Full)

**Breaking the Efficiency Barrier: -66.2% Parameters, +24% Speed.**

> **tl;dr:** Achieving **extreme resource efficiency** on **MNLI**. We replaced *every* dense FFN layer in BERT-Base with structured Monarch Matrices. Distilled in **just 3 hours on one H100** using only **500k Wiki tokens** and **MNLI** data, this model slashes parameters by 66.2% and boosts throughput by **+24%** (vs optimized Baseline).

## High Performance, Low Cost

Training models from scratch typically requires billions of tokens. We took a different path to bend the efficiency curve:

* **Training Time:** A few hours on **1x NVIDIA H100**.
* **Data:** Only **MNLI** + **500k Wikipedia Samples**.
* **Trade-off:** This extreme compression comes with a moderate accuracy drop (~5%). *Need higher accuracy? Check out our [Hybrid Version](https://huggingface.co/ykae/monarch-bert-base-mnli-hybrid) (<1% loss).*

## Key Benchmarks

Measured on a single NVIDIA H100 using `torch.compile(mode="max-autotune")`.

| Metric | BERT-Base (Baseline) | **Monarch-Full (This)** | Delta |
| :--- | :--- | :--- | :--- |
| **Parameters** | 85.65M | **28.98M** | πŸ“‰ **-66.2%** |
| **Compute (GFLOPs)** | 696.5 | **232.6** | πŸ“‰ **-66.6%** |
| **Throughput (TPS)** | 7261 | **9029** | πŸš€ **+24.3%** |
| **Latency (Batch 32)** | 4.41 ms | **3.54 ms** | ⚑ **+24.6% Faster** |
| **Accuracy (MNLI)** | 83.62% | **78.34%** | πŸ“‰ **-5.28%** |
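Throughput figures like those above can be reproduced with a simple timing harness. The sketch below is a minimal, illustrative version (the `measure_tps` helper and the toy model are ours, not part of the released code); it synchronizes CUDA when available so wall-clock time reflects kernel execution rather than launch overhead. The reported numbers additionally require the actual checkpoint, `torch.compile(mode="max-autotune")`, and an H100.

```python
import time
import torch
import torch.nn as nn

def measure_tps(model, batch, warmup=10, iters=50):
    """Measure inference throughput in samples/second (TPS)."""
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):  # warmup: triggers compilation / autotuning caches
            model(batch)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(batch)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    return iters * batch.shape[0] / elapsed

# Demo on a stand-in model (swap in the real model + torch.compile to benchmark):
toy = nn.Sequential(nn.Linear(128, 256), nn.GELU(), nn.Linear(256, 3))
tps = measure_tps(toy, torch.randn(32, 128))
print(f"{tps:.0f} samples/s")
```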

## Usage

This model uses a **custom architecture**. You must enable `trust_remote_code=True` to load the Monarch layers (`MonarchUp`, `MonarchDown`, `MonarchFFN`).

To see the real speedup, **compilation is mandatory** (otherwise PyTorch Python overhead masks the hardware gains).

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm import tqdm

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "ykae/monarch-bert-base-mnli"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, 
    trust_remote_code=True
).to(device)

# Compilation is what unlocks the reported speedups; uncomment to benchmark:
# torch.set_float32_matmul_precision('high')
# model = torch.compile(model, mode="max-autotune")
model.eval()

print("πŸ“Š Loading MNLI Validation set...")
dataset = load_dataset("glue", "mnli", split="validation_matched")

def tokenize_fn(ex):
    return tokenizer(ex['premise'], ex['hypothesis'], 
                     padding="max_length", truncation=True, max_length=128)

tokenized_ds = dataset.map(tokenize_fn, batched=True)
tokenized_ds.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
loader = DataLoader(tokenized_ds, batch_size=32)

correct = 0
total = 0

print(f"πŸš€ Starting evaluation on {len(tokenized_ds)} samples...")
with torch.no_grad():
    for batch in tqdm(loader):
        ids = batch['input_ids'].to(device)
        mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)
        
        outputs = model(ids, attention_mask=mask)
        preds = torch.argmax(outputs.logits, dim=1)
        
        correct += (preds == labels).sum().item()
        total += labels.size(0)

print(f"\nβœ… Evaluation Finished!")
print(f"πŸ“ˆ Accuracy: {100 * correct / total:.2f}%")
```

## The "Memory Paradox" (Read this!)

You might notice that while the parameter count is lower, the peak VRAM usage during inference can be slightly higher than the baseline.

**Why?**
This is a **software artifact**, not a hardware limitation: the Monarch FFN runs as a chain of block-diagonal matmuls and permutations, and in eager mode each intermediate activation between those steps is materialized in GPU global memory (HBM).
* **Solution:** A custom **fused Triton kernel** (planned) would fuse the Monarch steps, keeping intermediate activations in the GPU's SRAM. This would drop dynamic VRAM usage significantly below the baseline, matching the FLOPs reduction.
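To make the paradox concrete, here is a minimal sketch of a Monarch-style product (the `monarch_matmul` helper and its shapes are illustrative assumptions, not the released `MonarchFFN` implementation). In eager mode, two intermediate tensors are written out between the block-diagonal steps; a fused kernel would keep them on-chip.

```python
import torch

def monarch_matmul(x, blkdiag1, blkdiag2):
    """Monarch product: block-diagonal matmul, permute, block-diagonal matmul.

    x:        (batch, nblocks * blksize)
    blkdiag1: (nblocks, blksize, blksize)  -- first factor's dense blocks
    blkdiag2: (nblocks, blksize, blksize)  -- second factor's dense blocks
    """
    batch = x.shape[0]
    nblk, _, in1 = blkdiag1.shape
    x = x.view(batch, nblk, in1)
    # Step 1: first block-diagonal multiply.
    # In eager mode this intermediate is written out to HBM.
    x = torch.einsum('bni,noi->bno', x, blkdiag1)
    # Step 2: permutation = swap block and within-block axes
    # (another intermediate materialized without fusion).
    x = x.transpose(1, 2).contiguous()
    # Step 3: second block-diagonal multiply.
    x = torch.einsum('bni,noi->bno', x, blkdiag2)
    return x.reshape(batch, -1)

# 16-dim example: 4 blocks of size 4 per factor.
x = torch.randn(2, 16)
w1 = torch.randn(4, 4, 4)
w2 = torch.randn(4, 4, 4)
y = monarch_matmul(x, w1, w2)
print(y.shape)  # torch.Size([2, 16])
# Parameter count: 2 * 4 * 4*4 = 128 blocks-entries vs. 16 * 16 = 256 dense.
```

The parameter comparison in the last comment is the source of the compression; the two materialized intermediates are the source of the temporary VRAM bump.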

## Citation
```bibtex
@misc{ykae-monarch-bert-mnli-2026,
  author = {Yusuf Kalyoncuoglu and {YKAE-Vision}},
  title = {Monarch-BERT-MNLI: Extreme Compression via Monarch FFNs},
  year = {2026},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/ykae/monarch-bert-base-mnli}}
}
```