# WeDLM-8B
WeDLM-8B is a diffusion language model, initialized from Qwen3-8B, that performs parallel decoding under standard causal attention.
This is the base (pretrained) version. For the instruction-tuned version, see WeDLM-8B-Instruct.
📄 Paper (Coming Soon) | 🌐 Project Page | 💻 GitHub
## Model Details

| Attribute | Value |
|---|---|
| Initialized From | Qwen3-8B |
| Parameters | 8B |
| Context Length | 32,768 |
## Quick Start

For fast inference, use the `wedlm` engine:

```bash
pip install git+https://github.com/tencent/WeDLM.git
```

```python
from wedlm import LLM, SamplingParams

llm = LLM(model="tencent/WeDLM-8B")

prompt = "The theory of relativity states that"
outputs = llm.generate([prompt], SamplingParams(max_tokens=256))
print(outputs[0]["text"])
```
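Since `generate` takes a list of prompts, several completions can be decoded in one call. A minimal sketch reusing only the interface shown above (the second prompt is illustrative):

```python
from wedlm import LLM, SamplingParams

llm = LLM(model="tencent/WeDLM-8B")

# Batch several prompts in a single generate call; the output format is
# assumed to match the quick-start example above (a list of {"text": ...}).
prompts = [
    "The theory of relativity states that",
    "A diffusion language model generates text by",
]
outputs = llm.generate(prompts, SamplingParams(max_tokens=128))
for prompt, result in zip(prompts, outputs):
    print(prompt, "->", result["text"])
```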
For training or simple forward passes, you can load the model via Transformers:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "tencent/WeDLM-8B",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)

inputs = tokenizer("The theory of relativity", return_tensors="pt").to(model.device)
outputs = model(**inputs)
```
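To sanity-check the forward pass, you can inspect the next-token prediction. This sketch assumes the remote-code model returns standard causal-LM-style `logits` of shape `[batch, seq, vocab]`:

```python
import torch

# Greedy next-token from the forward pass above. Assumes a standard
# CausalLMOutput-style `logits` tensor; the parallel diffusion decoding
# itself is only available through the wedlm engine.
next_token_logits = outputs.logits[0, -1]
next_token_id = int(torch.argmax(next_token_logits))
print(tokenizer.decode([next_token_id]))
```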
⚠️ Note: The Hugging Face interface is for training and forward-pass convenience. For optimized inference throughput, use the `wedlm` engine above.
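If you fine-tune through the Transformers path, a conventional causal-LM loss call is the simplest starting point. This is only a sketch: it assumes the remote code follows the usual Hugging Face `labels` convention, whereas WeDLM's actual diffusion training objective may require the official training code instead.

```python
# Hypothetical fine-tuning-style step. Assumes the remote-code model accepts
# `labels` and returns a loss, as standard Hugging Face causal LMs do; the
# real WeDLM diffusion objective may differ.
batch = tokenizer("The theory of relativity states that", return_tensors="pt").to(model.device)
out = model(**batch, labels=batch["input_ids"])
print(out.loss)
```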
## Benchmarks

Results on standard zero- and few-shot benchmarks, compared with the Qwen3-8B base model:

| Benchmark | Qwen3-8B | WeDLM-8B |
|---|---|---|
| ARC-C (0-shot) | 92.66 | 92.92 |
| GSM8K (3-shot) | 85.97 | 90.20 |
| MATH (4-shot) | 50.80 | 53.60 |
| HumanEval (4-shot) | 68.90 | 75.00 |
| MMLU (5-shot) | 74.03 | 75.46 |
| Average | 72.61 | 74.72 |
## License

Apache 2.0