Paper: LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders (arXiv:2404.05961)
Experimental German text embedding model based on SmolLM3-3B, trained using the LLM2Vec approach to transform a decoder-only LLM into a powerful text encoder.
This model represents German text as dense vectors suitable for semantic search, clustering, and similarity tasks. It was created by adapting SmolLM3-3B through a two-stage training process that enables bidirectional attention and teaches the model to generate meaningful text representations.
Model Transformation: The SmolLM3-3B architecture is modified so that every token can attend to the full sequence (bidirectional attention) rather than only to preceding tokens (causal attention), as illustrated in the conceptual sketch below.
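A minimal, conceptual sketch of the difference between the causal mask used for generation and the bidirectional mask used for encoding; this is only an illustration, not the actual patching performed by the llm2vec library:

import torch

seq_len = 5

# Causal mask: position i may attend only to positions <= i (decoder-only generation).
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Bidirectional mask: every position may attend to every other position (encoder-style).
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)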
MNTP Training: The modified model is then trained with masked next token prediction (MNTP), followed by a supervised stage (see the MNTP Stage and Supervised Stage sections below). The snippets that follow show how to load and use the resulting embedding model.
from llm2vec import LLM2Vec
import torch
# Load model
model = LLM2Vec.from_pretrained(
    "mayflowergmbh/smollm3-3b-embed-de",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
# Encode German texts
texts = [
    "Berlin ist die Hauptstadt von Deutschland.",
    "Die deutsche Hauptstadt ist Berlin.",
    "München ist eine Stadt in Bayern."
]
embeddings = model.encode(texts)
# Calculate similarity
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(embeddings)
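The same embeddings can also be clustered. A minimal sketch using scikit-learn's KMeans; the number of clusters is an arbitrary illustrative choice, and the conversion step assumes encode() returned a torch tensor:

import torch
from sklearn.cluster import KMeans

# encode() may return a (possibly bfloat16) torch tensor; convert to float32 NumPy for scikit-learn.
matrix = torch.as_tensor(embeddings).float().cpu().numpy()

# Group the encoded texts into 2 clusters (illustrative value).
kmeans = KMeans(n_clusters=2, random_state=0)
labels = kmeans.fit_predict(matrix)

for text, label in zip(texts, labels):
    print(f"Cluster {label}: {text}")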
Usage with sentence-transformers (note: this requires an adapter for sentence-transformers compatibility; 'path/to/smollm3-3b-embed-de' is a placeholder path):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('path/to/smollm3-3b-embed-de')
embeddings = model.encode(texts)
# Create document embeddings
documents = [
    "Die Katze sitzt auf dem Sofa.",
    "Der Hund spielt im Garten.",
    "Python ist eine Programmiersprache.",
    "Machine Learning revolutioniert die Technologie."
]
doc_embeddings = model.encode(documents)
# Search with a query
query = "Haustiere und ihre Aktivitäten"
query_embedding = model.encode([query])
# Find most similar documents
similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
top_indices = similarities.argsort()[-3:][::-1]
for idx in top_indices:
    print(f"Score: {similarities[idx]:.3f} - {documents[idx]}")
Base Model: SmolLM3-3B
- Hidden Size: 2048
- Intermediate Size: 11008
- Number of Layers: 36
- Number of Attention Heads: 16
- Vocabulary Size: 128256
- Max Position Embeddings: 65536 (RoPE)
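These values can be checked against the model's configuration, assuming the repository ships a standard transformers config.json:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("mayflowergmbh/smollm3-3b-embed-de")
print(config.hidden_size)              # 2048
print(config.intermediate_size)        # 11008
print(config.num_hidden_layers)        # 36
print(config.num_attention_heads)      # 16
print(config.vocab_size)               # 128256
print(config.max_position_embeddings)  # 65536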
MNTP Stage: Masked next token prediction (MNTP) adapts the model to its new bidirectional attention: a fraction of the input tokens is masked, and each masked token is predicted from the representation at the position immediately before it (see the sketch below).
Supervised Stage: A supervised contrastive stage, following the LLM2Vec recipe, trains the model to map semantically related texts to nearby vectors.
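A minimal, conceptual sketch of the MNTP objective; this illustrates only the loss, not the actual training code, data, or hyperparameters used for this model:

import torch
import torch.nn.functional as F

def mntp_loss(logits: torch.Tensor, labels: torch.Tensor, masked: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab) from the causal-LM head on inputs whose
    #         masked positions were replaced by a mask token.
    # labels: (batch, seq_len) original (unmasked) token ids.
    # masked: (batch, seq_len) boolean tensor marking the masked positions.
    #
    # The masked token at position i is predicted from the logits at position
    # i - 1, i.e. with the model's usual next-token prediction head.
    shifted_logits = logits[:, :-1, :]   # predictions for targets 1..seq_len-1
    shifted_labels = labels[:, 1:]       # targets 1..seq_len-1
    shifted_mask = masked[:, 1:]         # which of those targets were masked
    return F.cross_entropy(shifted_logits[shifted_mask], shifted_labels[shifted_mask])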
If you use this model, please cite:
@misc{smollm3-embed-de,
  title={SmolLM3-3B German Embeddings},
  author={Johann-Peter Hartmann},
  year={2025},
  publisher={Mayflower GmbH},
  url={https://huggingface.co/mayflowergmbh/smollm3-3b-embed-de}
}

@article{llm2vec,
  title={LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders},
  author={Behnamghader, Parishad and others},
  journal={arXiv preprint arXiv:2404.05961},
  year={2024}
}
For questions or issues, please open an issue on the GitHub repository.