Qwen3-8B-LaCo-Pruned

This model is a layer-pruned version of Qwen3-8B-Base, produced with the LaCo (Layer Collapse) structured pruning method.

Model Summary

| Attribute | Value |
|---|---|
| Base Model | Qwen/Qwen3-8B-Base |
| Pruning Method | LaCo (Layer Collapse) |
| Original Layers | 36 |
| Pruned Layers | 26 |
| Layers Removed | 10 |
| Compression | 27.8% |
| Parameters | ~5.8B (reduced from ~8B) |

Intended Use

  • Research on model compression and efficiency
  • Fine-tuning base for domain-specific applications
  • Inference optimization where speed/memory matters more than factual accuracy
  • Edge deployment scenarios with limited computational resources

⚠️ Important Limitations

This pruned model has significantly reduced factual knowledge capabilities. It performs at near-random levels on knowledge-intensive benchmarks like MMLU.

| Use Case | Status |
|---|---|
| Physical reasoning tasks | ✅ Good (82.6% retained) |
| Reading comprehension | ⚠️ Acceptable (74.3% retained) |
| Common sense reasoning | ⚠️ Degraded (61.8% retained) |
| Factual question answering | ❌ Not recommended |
| Knowledge-intensive tasks | ❌ Not recommended |

Recommendation: Fine-tune this model on your target domain before deployment.


Pruning Details

LaCo Hyperparameters

| Parameter | Value | Description |
|---|---|---|
| MERGE_LAYERS (C) | 3 | Layers merged per operation |
| LOWEST_LAY (L) | 4 | Minimum layer index for merging |
| HIGHEST_LAY (H) | 28 | Maximum layer index for merging |
| INTERVAL (I) | 2 | Minimum gap between merge points |
| THRESHOLD (T) | 0.85 | Cosine similarity threshold |
| MAX_COMPRESSION | 30% | Maximum allowed compression |
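
For illustration, here is a minimal sketch of LaCo's layer-merge step (the Reserving-Differences-while-Seeking-Common merge) under the hyperparameters above. The function name and the list-of-layers representation are assumptions for readability, not code from this repository:

import copy
import torch

def laco_merge(layers, m, C=3):
    """Sketch of the LaCo merge: collapse layers m+1 .. m+C-1 into layer m
    by adding their parameter differences to layer m, then drop them.
    `layers` is a plain list of structurally identical decoder layers."""
    merged = copy.deepcopy(layers[m])
    with torch.no_grad():
        for name, p_merged in merged.named_parameters():
            p_base = dict(layers[m].named_parameters())[name]
            for k in range(1, C):
                p_k = dict(layers[m + k].named_parameters())[name]
                p_merged.add_(p_k - p_base)  # theta_m + sum_k (theta_{m+k} - theta_m)
    # keep the layers before m, substitute the merged layer, drop the collapsed ones
    return layers[:m] + [merged] + layers[m + C:]

After each candidate merge, hidden states of the merged model are compared against the original on calibration prompts; the merge is kept only if the cosine similarity stays above THRESHOLD (0.85 here), otherwise it is rolled back and the next candidate position (at least INTERVAL layers away, within [LOWEST_LAY, HIGHEST_LAY]) is tried.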

Pruning Statistics

| Metric | Value |
|---|---|
| Successful Merges | 5 |
| Rejected Merges | 0 |
| Total Iterations | 6 |
| Final Compression | 27.8% |

Hidden State Similarity (Calibration Set)

| Metric | Value |
|---|---|
| Average | 0.9680 |
| Min | 0.9492 |
| Max | 0.9766 |

Individual similarities: [0.9492, 0.9727, 0.9609, 0.9766, 0.9688, 0.9648, 0.9648, 0.9766, 0.9727, 0.9727]
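
These values are cosine similarities between hidden states of the original and pruned models on calibration prompts. A minimal sketch of how such a score can be computed for a single prompt (the mean pooling over tokens and the use of the last hidden layer are assumptions, not necessarily the exact procedure used):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def hidden_state_similarity(model_a, model_b, tokenizer, prompt):
    # Cosine similarity between mean-pooled final-layer hidden states;
    # assumes both models live on the same device.
    inputs = tokenizer(prompt, return_tensors="pt").to(model_a.device)
    with torch.no_grad():
        h_a = model_a(**inputs, output_hidden_states=True).hidden_states[-1].mean(dim=1)
        h_b = model_b(**inputs, output_hidden_states=True).hidden_states[-1].mean(dim=1)
    return torch.nn.functional.cosine_similarity(h_a, h_b).item()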

Perplexity Results

| Model | Perplexity | Ratio |
|---|---|---|
| Original (Qwen3-8B-Base) | 26.19 | 1.00× |
| Pruned (this model) | 71.48 | 2.73× |
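
Perplexity is the exponential of the average per-token cross-entropy on a held-out corpus; the exact evaluation text is not specified here, so the snippet below is only a sketch of how such numbers are typically computed over non-overlapping chunks:

import math
import torch

def perplexity(model, tokenizer, text, max_length=2048):
    # exp(mean negative log-likelihood) over non-overlapping chunks of `text`
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    total_nll, total_tokens = 0.0, 0
    for start in range(0, ids.size(1), max_length):
        chunk = ids[:, start:start + max_length]
        if chunk.size(1) < 2:
            break
        with torch.no_grad():
            # passing labels=input_ids makes the model return the mean cross-entropy
            loss = model(chunk, labels=chunk).loss
        total_nll += loss.item() * (chunk.size(1) - 1)
        total_tokens += chunk.size(1) - 1
    return math.exp(total_nll / total_tokens)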

Benchmark Results

Comparison with Original Qwen3-8B-Base

| Benchmark | Original | Pruned | Retention | Status |
|---|---|---|---|---|
| PIQA | 79.54% | 65.67% | 82.6% | ✅ Good |
| BoolQ | 83.09% | 61.77% | 74.3% | ⚠️ Acceptable |
| HellaSwag | 78.55% | 48.52% | 61.8% | ⚠️ Degraded |
| MMLU (5-shot) | 76.89% | 25.12% | 32.7% | ❌ Near random |

Original scores are taken from the Qwen3 Technical Report.

Key Findings

  1. Physical reasoning preserved: PIQA retained 82.6% of original performance
  2. Factual knowledge destroyed: MMLU collapsed to chance level (25.12%, roughly the 25% expected for 4-way multiple choice)
  3. Perplexity underestimates damage: the 2.73× perplexity increase does not predict the benchmark collapse
  4. Layer-specific knowledge: factual knowledge appears to be concentrated in the specific layers that were removed

Usage

Basic Inference

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Mercity/Qwen3-8B-LaCo-Pruned"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)

# Text generation
prompt = "The process of photosynthesis"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

With 4-bit Quantization (Further Compression)

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    "Mercity/Qwen3-8B-LaCo-Pruned",
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("Mercity/Qwen3-8B-LaCo-Pruned", trust_remote_code=True)

Recovery Recommendations

To restore performance after pruning:

Option 1: LoRA Fine-tuning (Recommended)

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Fine-tune on OpenOrca, Alpaca, or domain-specific data

Option 2: Knowledge Distillation

Use the original Qwen3-8B-Base as a teacher and distill its output distribution back into the pruned student.
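
A minimal sketch of the corresponding training loss, assuming teacher logits come from Qwen3-8B-Base on the same batch (the temperature and mixing weight are illustrative choices, not values used here):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-softened teacher and student distributions
    vocab = student_logits.size(-1)
    kd = F.kl_div(
        F.log_softmax(student_logits.view(-1, vocab) / T, dim=-1),
        F.softmax(teacher_logits.view(-1, vocab) / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: the usual next-token cross-entropy on the ground-truth labels
    ce = F.cross_entropy(student_logits.view(-1, vocab), labels.view(-1), ignore_index=-100)
    return alpha * kd + (1 - alpha) * ce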

Expected Recovery

  • With fine-tuning: +15-25% on MMLU
  • With knowledge distillation: +25-35% on MMLU

Technical Specifications

| Attribute | Value |
|---|---|
| Architecture | Transformer decoder-only |
| Parameters | ~5.8B |
| Layers | 26 |
| Hidden Size | 4096 |
| Attention Heads (Q) | 32 |
| Attention Heads (KV) | 8 (GQA) |
| Intermediate Size | 12288 |
| Vocabulary Size | 151,669 |
| Max Context Length | 32,768 tokens |
| Precision | bfloat16 |
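
These values can be sanity-checked from the shipped configuration (attribute names follow the standard transformers Qwen3 config):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Mercity/Qwen3-8B-LaCo-Pruned", trust_remote_code=True)
print(config.num_hidden_layers)    # 26 after pruning
print(config.hidden_size)          # 4096
print(config.num_attention_heads)  # 32
print(config.num_key_value_heads)  # 8 (GQA)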

Citation

If you use this model, please cite the original LaCo paper and Qwen3:

@article{yang2024laco,
  title={LaCo: Large Language Model Pruning via Layer Collapse},
  author={Yang, Yifei and Cao, Zouying and Zhao, Hai},
  journal={arXiv preprint arXiv:2402.11187},
  year={2024}
}

@misc{qwen3technicalreport,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025},
  eprint={2505.09388},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.09388}
}

License

Apache 2.0 (same as the base Qwen3 model)

Acknowledgments

  • Qwen Team for the excellent Qwen3-8B-Base model
  • LaCo authors for the pruning methodology
  • Hugging Face for model hosting