nb-bert-norwegian-ner
Base model paper: Operationalizing a National Digital Library: The Case for a Norwegian Transformer Model (arXiv:2104.09617).
This model is NbAiLab/nb-bert-base fine-tuned on the thivy/norwegian-ner-combined dataset for Named Entity Recognition in Norwegian (Bokmål and Nynorsk).
| Metric | Score |
|---|---|
| F1 | 0.9329 |
| Precision | 0.9300 |
| Recall | 0.9358 |
Best Epoch: 5 out of 20 (early stopped at epoch 11)
| Label | Description | Examples |
|---|---|---|
| PER | Person names | Erna Solberg, Ibsen |
| ORG | Organizations | Stortinget, NATO |
| LOC | Locations | Oslo, Norge, Europa |
| MISC | Miscellaneous | Nobels fredspris |
Dataset: thivy/norwegian-ner-combined
The dataset combines the NorNE and WikiANN Norwegian corpora, with additional quality filtering applied (see the dataset card for details and citations).
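For reference, here is a minimal sketch (not from the original card) of loading the dataset with the datasets library; the "tokens" and "ner_tags" column names are an assumption based on the usual token-classification layout:

from datasets import load_dataset

# Load the combined Norwegian NER dataset (column names assumed, not verified here)
ds = load_dataset("thivy/norwegian-ner-combined")
print(ds)                        # splits and sizes
example = ds["train"][0]
print(example["tokens"])         # list of words in one sentence
print(example["ner_tags"])       # integer labels; see the id2label mapping below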
{
  "learning_rate": 3.5e-5,
  "num_train_epochs": 20,
  "per_device_train_batch_size": 8,
  "per_device_eval_batch_size": 8,
  "weight_decay": 0.15,
  "warmup_ratio": 0.05,
  "lr_scheduler_type": "cosine_with_restarts",
  "num_cycles": 4,
  "early_stopping_patience": 6,
  "metric_for_best_model": "f1"
}
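These settings map onto the Hugging Face Trainer roughly as follows. This is a reconstruction under stated assumptions (a recent transformers release that supports lr_scheduler_kwargs, a standard token-classification setup), not the exact training script:

from transformers import (
    AutoModelForTokenClassification,
    Trainer,
    TrainingArguments,
    EarlyStoppingCallback,
)

# Reconstruction of the configuration above, not the author's original script
model = AutoModelForTokenClassification.from_pretrained("NbAiLab/nb-bert-base", num_labels=9)

args = TrainingArguments(
    output_dir="nb-bert-norwegian-ner",
    learning_rate=3.5e-5,
    num_train_epochs=20,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.15,
    warmup_ratio=0.05,
    lr_scheduler_type="cosine_with_restarts",
    lr_scheduler_kwargs={"num_cycles": 4},     # requires a recent transformers release
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

trainer = Trainer(
    model=model,
    args=args,
    # train_dataset=..., eval_dataset=..., compute_metrics=... omitted here
    callbacks=[EarlyStoppingCallback(early_stopping_patience=6)],
)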
Phase 5: Gentle LR Restarts
The model was trained with a cosine learning rate schedule with gentle restarts (cosine_with_restarts, 4 cycles, peak 3.5e-5, 5% warmup). Why this worked: the restarts let training keep improving past the epoch-6 plateau seen with OneCycleLR in Phase 3, while the low restart peak avoided the catastrophic forgetting caused by the aggressive 1e-4 restarts of Phase 4 (see the phase comparison below).
| Phase | F1 | Strategy | Result |
|---|---|---|---|
| Phase 1 | - | Initial baseline | Established pipeline |
| Phase 2 | - | Data filtering | Improved quality |
| Phase 3 | 0.9298 | OneCycleLR | Good, but plateaued at epoch 6 |
| Phase 4 | 0.9142 | Aggressive restarts (1e-4) | ❌ Catastrophic forgetting |
| Phase 5 | 0.9329 | Gentle restarts (3.5e-5) | ✅ Best model |
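To make "gentle" concrete, the following sketch (my own illustration, not the training code) steps through the Phase 5 schedule using transformers' cosine-with-hard-restarts helper; each of the 4 restarts peaks back at only 3.5e-5 rather than the 1e-4 used in Phase 4:

import torch
from transformers import get_cosine_with_hard_restarts_schedule_with_warmup

# Dummy optimizer just to drive the scheduler; 1000 steps stands in for the real step count
optimizer = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=3.5e-5)
total_steps = 1000
scheduler = get_cosine_with_hard_restarts_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.05 * total_steps),   # warmup_ratio = 0.05
    num_training_steps=total_steps,
    num_cycles=4,                               # four gentle restarts
)

lrs = []
for _ in range(total_steps):
    lrs.append(scheduler.get_last_lr()[0])
    optimizer.step()
    scheduler.step()

print(f"peak LR: {max(lrs):.2e}")   # ~3.5e-5, well below the 1e-4 peaks of Phase 4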
from transformers import pipeline
# Load pipeline
ner = pipeline(
    "ner",
    model="thivy/nb-bert-norwegian-ner",
    aggregation_strategy="simple"
)
# Predict
text = "Erna Solberg er statsminister i Norge."
entities = ner(text)
print(entities)
Output:
[
    {'entity_group': 'PER', 'score': 0.99, 'word': 'Erna Solberg', 'start': 0, 'end': 12},
    {'entity_group': 'LOC', 'score': 0.99, 'word': 'Norge', 'start': 32, 'end': 37}
]
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
# Load model and tokenizer
model_name = "thivy/nb-bert-norwegian-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
# Tokenize input
text = "Oslo er hovedstaden i Norge."
inputs = tokenizer(text, return_tensors="pt")
# Predict
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)
# Decode predictions
labels = model.config.id2label
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, predictions[0]):
    if token not in ["[CLS]", "[SEP]", "[PAD]"]:
        print(f"{token}: {labels[pred.item()]}")
id2label = {
    0: "O",
    1: "B-LOC",
    2: "B-MISC",
    3: "B-ORG",
    4: "B-PER",
    5: "I-LOC",
    6: "I-MISC",
    7: "I-ORG",
    8: "I-PER",
}
F1: 0.9329
Precision: 0.9300
Recall: 0.9358
| Entity | Precision | Recall | F1 |
|---|---|---|---|
| PER | 0.95 | 0.96 | 0.95 |
| ORG | 0.91 | 0.90 | 0.90 |
| LOC | 0.94 | 0.95 | 0.94 |
| MISC | 0.88 | 0.86 | 0.87 |
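The overall and per-entity scores above are entity-level metrics in the usual CoNLL style. A minimal sketch of computing them with seqeval (my own example; the card does not specify the exact evaluation script):

from seqeval.metrics import classification_report, f1_score

# Toy gold/predicted tag sequences, one list of BIO tags per sentence
y_true = [["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O"]]

print(f1_score(y_true, y_pred))             # 1.0 on this toy example
print(classification_report(y_true, y_pred))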
@misc{norwegian-ner-2024,
author = {Thivyesh Ahilathasan},
title = {Norwegian NER Model (nb-bert-base fine-tuned)},
year = {2024},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/thivy/nb-bert-norwegian-ner}},
}
@misc{kummervold2021operationalizing,
title={Operationalizing a National Digital Library: The Case for a Norwegian Transformer Model},
author={Per E Kummervold and Javier de la Rosa and Freddy Wetjen and Svein Arne Brygfjeld},
year={2021},
eprint={2104.09617},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
See thivy/norwegian-ner-combined for NorNE and WikiANN citations.
License: CC-BY 4.0 (same as the base model and dataset)
Base model: NbAiLab/nb-bert-base