---
language: en
license: apache-2.0
library_name: transformers
tags:
- bert
- fast
- monarch-matrices
- mnli
- efficiency
- triton
- hardware-efficient
- sub-quadratic
- fast-inference
- h100-optimized
datasets:
- glue
- wikipedia
metrics:
- accuracy
- throughput
- latency
pipeline_tag: text-classification
model-index:
- name: Monarch-BERT-Base-MNLI-Full
  results:
  - task:
      type: text-classification
      name: Natural Language Inference
    dataset:
      name: GLUE MNLI
      type: glue
      config: mnli
      split: validation_matched
    metrics:
    - name: Accuracy
      type: accuracy
      value: 78.34
      description: "Approx. 5% accuracy trade-off for maximum speed compared to dense BERT."
    - name: Throughput (TPS on H100)
      type: throughput
      value: 9029.4
      description: "Measured with torch.compile(mode='max-autotune') on NVIDIA H100. Represents a 24.3% increase over the optimized Triton baseline."
    - name: Latency (ms)
      type: latency
      value: 3.54
      description: "Batch size 32, sequence length 128. Achieves a ~24.6% faster inference time."
---
# Monarch-BERT-MNLI (Full)
**Breaking the Efficiency Barrier: -66.2% Parameters, +24% Speed.**
> **tl;dr:** Achieving **extreme resource efficiency** on **MNLI**. We replaced *every* dense FFN layer in BERT-Base with structured Monarch Matrices. Distilled in **just 3 hours on one H100** using only **500k Wiki tokens** and **MNLI** data, this model slashes parameters by 66.2% and boosts throughput by **+24%** (vs optimized Baseline).
## High Performance, Low Cost
Training models from scratch typically requires billions of tokens. We took a different path to shock the efficiency curve:
* **Training Time:** ~3 hours on **1x NVIDIA H100**.
* **Data:** Only **MNLI** + **500k Wikipedia Samples**.
* **Trade-off:** This extreme compression comes with a moderate accuracy drop (~5%). *Need higher accuracy? Check out our [Hybrid Version](https://huggingface.co/ykae/monarch-bert-base-mnli-hybrid) (<1% loss).*
## Key Benchmarks
Measured on a single NVIDIA H100 using `torch.compile(mode="max-autotune")`.
| Metric | BERT-Base (Baseline) | **Monarch-Full (This)** | Delta |
| :--- | :--- | :--- | :--- |
| **Parameters** | 85.65M | **28.98M** | 📉 **-66.2%** |
| **Compute (GFLOPs)** | 696.5 | **232.6** | 📉 **-66.6%** |
| **Throughput (TPS)** | 7261 | **9029** | 🚀 **+24.3%** |
| **Latency (Batch 32)** | 4.41 ms | **3.54 ms** | ⚡ **+24.6% Faster** |
| **Accuracy (MNLI)** | 83.62% | **78.34%** | 📉 **-5.28%** |
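The delta column follows directly from the raw numbers; a quick sanity check (note the latency row is quoted as "faster", i.e. relative to the *new* latency):

```python
# Recompute the table's delta column from the raw benchmark numbers.
baseline = {"params": 85.65e6, "gflops": 696.5, "tps": 7261, "latency_ms": 4.41}
monarch  = {"params": 28.98e6, "gflops": 232.6, "tps": 9029, "latency_ms": 3.54}

def pct_change(new, old):
    """Signed percentage change relative to the baseline value."""
    return (new - old) / old * 100

print(f"Parameters: {pct_change(monarch['params'], baseline['params']):+.1f}%")  # -66.2%
print(f"GFLOPs:     {pct_change(monarch['gflops'], baseline['gflops']):+.1f}%")  # -66.6%
print(f"Throughput: {pct_change(monarch['tps'], baseline['tps']):+.1f}%")        # +24.3%

# "X% faster" compares the saved time against the new, shorter latency:
speedup = (baseline["latency_ms"] - monarch["latency_ms"]) / monarch["latency_ms"] * 100
print(f"Latency:    {speedup:.1f}% faster")                                      # 24.6%
```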
## Usage
This model uses a **custom architecture**. You must enable `trust_remote_code=True` to load the Monarch layers (`MonarchUp`, `MonarchDown`, `MonarchFFN`).
To see the real speedup, **compilation is mandatory** (otherwise PyTorch Python overhead masks the hardware gains).
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm import tqdm
device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "ykae/monarch-bert-base-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
model_id,
trust_remote_code=True
).to(device)
torch.set_float32_matmul_precision('high')
model = torch.compile(model, mode="max-autotune")  # required to see the hardware speedup
model.eval()
print("Loading MNLI validation set...")
dataset = load_dataset("glue", "mnli", split="validation_matched")
def tokenize_fn(ex):
return tokenizer(ex['premise'], ex['hypothesis'],
padding="max_length", truncation=True, max_length=128)
tokenized_ds = dataset.map(tokenize_fn, batched=True)
tokenized_ds.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
loader = DataLoader(tokenized_ds, batch_size=32)
correct = 0
total = 0
print(f"Starting evaluation on {len(tokenized_ds)} samples...")
with torch.no_grad():
for batch in tqdm(loader):
ids = batch['input_ids'].to(device)
mask = batch['attention_mask'].to(device)
labels = batch['label'].to(device)
outputs = model(ids, attention_mask=mask)
preds = torch.argmax(outputs.logits, dim=1)
correct += (preds == labels).sum().item()
total += labels.size(0)
print("\nEvaluation finished!")
print(f"Accuracy: {100 * correct / total:.2f}%")
```
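The throughput and latency figures above can be reproduced with a simple wall-clock harness. A minimal sketch follows; the `warmup`/`iters` counts and the use of `time.perf_counter` are our choices here, not the official benchmark script:

```python
import time
import torch

@torch.no_grad()
def benchmark_latency(model, input_ids, attention_mask, warmup=10, iters=50):
    """Median forward-pass latency in milliseconds for one fixed batch."""
    device = input_ids.device
    for _ in range(warmup):                      # warm up kernels / compile cache
        model(input_ids, attention_mask=attention_mask)
    times = []
    for _ in range(iters):
        if device.type == "cuda":
            torch.cuda.synchronize()             # don't time queued async work
        t0 = time.perf_counter()
        model(input_ids, attention_mask=attention_mask)
        if device.type == "cuda":
            torch.cuda.synchronize()
        times.append((time.perf_counter() - t0) * 1e3)
    times.sort()
    return times[len(times) // 2]

# Usage with the model/tokenizer loaded above (batch 32, seq len 128, as in the table):
# enc = tokenizer(["premise"] * 32, ["hypothesis"] * 32, padding="max_length",
#                 truncation=True, max_length=128, return_tensors="pt").to(device)
# print(f"{benchmark_latency(model, enc['input_ids'], enc['attention_mask']):.2f} ms")
```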
## The "Memory Paradox" (Read this!)
You might notice that while the parameter count is lower, the peak VRAM usage during inference can be slightly higher than the baseline.
**Why?**
This is a **software artifact**, not a hardware limitation.
* **Cause:** In eager PyTorch, each Monarch FFN executes its block-diagonal matmuls and permutations as separate ops, so every full-size intermediate activation is materialized in HBM.
* **Solution:** A custom **Fused Triton Kernel** (planned) would fuse these steps, keeping intermediate activations in the GPU's SRAM. This would drop dynamic VRAM usage significantly below the baseline, matching the FLOPs reduction.
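To make the paradox concrete, here is a minimal, simplified sketch of a Monarch-style linear layer in eager PyTorch. This is **not** the `MonarchUp`/`MonarchDown` implementation shipped with this repo (layer names, block counts, and the permutation are illustrative assumptions); it just shows how each `view`/`einsum`/`transpose` step writes a full-size intermediate tensor back to global memory, which is exactly what a fused kernel would avoid:

```python
import torch
from torch import nn

def blockdiag_matmul(x, w):
    """Multiply by a block-diagonal matrix.
    x: (batch, nblocks * d_in), w: (nblocks, d_in, d_out)."""
    b = x.shape[0]
    n, d_in, d_out = w.shape
    x = x.view(b, n, d_in)                   # intermediate written to HBM
    y = torch.einsum("bni,nio->bno", x, w)   # another HBM round-trip
    return y.reshape(b, n * d_out)

class MonarchLinearSketch(nn.Module):
    """Two block-diagonal matmuls with a fixed permutation between them,
    in the spirit of a Monarch factorization. Parameters: 2 * dim^2 / nblocks
    instead of dim^2 for a dense nn.Linear."""
    def __init__(self, dim, nblocks):
        super().__init__()
        assert dim % nblocks == 0
        blk = dim // nblocks
        self.nblocks = nblocks
        self.w1 = nn.Parameter(torch.randn(nblocks, blk, blk) / blk**0.5)
        self.w2 = nn.Parameter(torch.randn(nblocks, blk, blk) / blk**0.5)

    def permute(self, x):
        b, n = x.shape[0], self.nblocks
        # Interleave blocks so information mixes across them; in eager mode
        # this is yet another full-size intermediate materialized in HBM.
        return x.view(b, n, -1).transpose(1, 2).reshape(b, -1)

    def forward(self, x):
        x = blockdiag_matmul(x, self.w1)
        x = self.permute(x)
        return blockdiag_matmul(x, self.w2)
```

With `dim=768` and 16 blocks, this structure stores 2 * 16 * 48² = 73,728 weights against 589,824 for a dense 768x768 layer, which is where the parameter savings in the table come from, while the extra reshapes explain the temporary VRAM overhead.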
## Citation
```bibtex
@misc{ykae-monarch-bert-mnli-2026,
author = {Yusuf Kalyoncuoglu and YKAE-Vision},
title = {Monarch-BERT-MNLI: Extreme Compression via Monarch FFNs},
year = {2026},
publisher = {Hugging Face},
journal = {Hugging Face Model Hub},
howpublished = {\url{https://huggingface.co/ykae/monarch-bert-base-mnli}}
}
``` |