Qwen3-8B-LaCo-Pruned

This model is a layer-pruned version of Qwen3-8B-Base, produced with the LaCo (Layer Collapse) structured pruning method.

Model Summary

| Attribute | Value |
|---|---|
| Base Model | Qwen/Qwen3-8B-Base |
| Pruning Method | LaCo (Layer Collapse) |
| Original Layers | 36 |
| Pruned Layers | 26 |
| Layers Removed | 10 |
| Compression | 27.8% |
| Parameters | ~5.8B (reduced from ~8B) |

Intended Use

  • Research on model compression and efficiency
  • Fine-tuning base for domain-specific applications
  • Inference optimization where speed/memory matters more than factual accuracy
  • Edge deployment scenarios with limited computational resources

⚠️ Important Limitations

This pruned model has significantly reduced factual knowledge capabilities. It performs at near-random levels on knowledge-intensive benchmarks like MMLU.

| Use Case | Status |
|---|---|
| Physical reasoning tasks | ✅ Good (82.6% retained) |
| Reading comprehension | ⚠️ Acceptable (74.3% retained) |
| Common sense reasoning | ⚠️ Degraded (61.8% retained) |
| Factual question answering | ❌ Not recommended |
| Knowledge-intensive tasks | ❌ Not recommended |

Recommendation: Fine-tune this model on your target domain before deployment.


Pruning Details

LaCo Hyperparameters

| Parameter | Value | Description |
|---|---|---|
| MERGE_LAYERS (C) | 3 | Layers merged per operation |
| LOWEST_LAY (L) | 4 | Minimum layer index for merging |
| HIGHEST_LAY (H) | 28 | Maximum layer index for merging |
| INTERVAL (I) | 2 | Minimum gap between merge points |
| THRESHOLD (T) | 0.85 | Cosine similarity threshold |
| MAX_COMPRESSION | 30% | Maximum allowed compression |
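
For illustration, here is a minimal sketch of LaCo's layer-merge step (the Reserving-Differences-while-Seeking-Common merge) under the hyperparameters above. The function name and the list-of-layers representation are assumptions for readability, not code from this repository:

import copy
import torch

def laco_merge(layers, m, C=3):
    """Sketch of the LaCo merge: collapse layers m+1 .. m+C-1 into layer m
    by adding their parameter differences to layer m, then drop them.
    `layers` is a plain list of structurally identical decoder layers."""
    merged = copy.deepcopy(layers[m])
    with torch.no_grad():
        for name, p_merged in merged.named_parameters():
            p_base = dict(layers[m].named_parameters())[name]
            for k in range(1, C):
                p_k = dict(layers[m + k].named_parameters())[name]
                p_merged.add_(p_k - p_base)  # theta_m + sum_k (theta_{m+k} - theta_m)
    # keep the layers before m, substitute the merged layer, drop the collapsed ones
    return layers[:m] + [merged] + layers[m + C:]

After each candidate merge, hidden states of the merged model are compared against the original on calibration prompts; the merge is kept only if the cosine similarity stays above THRESHOLD (0.85 here), otherwise it is rolled back and the next candidate position (at least INTERVAL layers away, within [LOWEST_LAY, HIGHEST_LAY]) is tried.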

Pruning Statistics

| Metric | Value |
|---|---|
| Successful Merges | 5 |
| Rejected Merges | 0 |
| Total Iterations | 6 |
| Final Compression | 27.8% |

Hidden State Similarity (Calibration Set)

| Metric | Value |
|---|---|
| Average | 0.9680 |
| Min | 0.9492 |
| Max | 0.9766 |

Individual similarities: [0.9492, 0.9727, 0.9609, 0.9766, 0.9688, 0.9648, 0.9648, 0.9766, 0.9727, 0.9727]
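
These values are cosine similarities between hidden states of the original and pruned models on calibration prompts. A minimal sketch of how such a score can be computed for a single prompt (the mean pooling over tokens and the use of the last hidden layer are assumptions, not necessarily the exact procedure used):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def hidden_state_similarity(model_a, model_b, tokenizer, prompt):
    # Cosine similarity between mean-pooled final-layer hidden states;
    # assumes both models live on the same device.
    inputs = tokenizer(prompt, return_tensors="pt").to(model_a.device)
    with torch.no_grad():
        h_a = model_a(**inputs, output_hidden_states=True).hidden_states[-1].mean(dim=1)
        h_b = model_b(**inputs, output_hidden_states=True).hidden_states[-1].mean(dim=1)
    return torch.nn.functional.cosine_similarity(h_a, h_b).item()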

Perplexity Results

| Model | Perplexity | Ratio |
|---|---|---|
| Original (Qwen3-8B-Base) | 26.19 | 1.00× |
| Pruned (this model) | 71.48 | 2.73× |
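
Perplexity is the exponential of the average per-token cross-entropy on a held-out corpus; the exact evaluation text is not specified here, so the snippet below is only a sketch of how such numbers are typically computed over non-overlapping chunks:

import math
import torch

def perplexity(model, tokenizer, text, max_length=2048):
    # exp(mean negative log-likelihood) over non-overlapping chunks of `text`
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    total_nll, total_tokens = 0.0, 0
    for start in range(0, ids.size(1), max_length):
        chunk = ids[:, start:start + max_length]
        if chunk.size(1) < 2:
            break
        with torch.no_grad():
            # passing labels=input_ids makes the model return the mean cross-entropy
            loss = model(chunk, labels=chunk).loss
        total_nll += loss.item() * (chunk.size(1) - 1)
        total_tokens += chunk.size(1) - 1
    return math.exp(total_nll / total_tokens)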

Benchmark Results

Comparison with Original Qwen3-8B-Base

| Benchmark | Original | Pruned | Retention | Status |
|---|---|---|---|---|
| PIQA | 79.54% | 65.67% | 82.6% | ✅ Good |
| BoolQ | 83.09% | 61.77% | 74.3% | ⚠️ Acceptable |
| HellaSwag | 78.55% | 48.52% | 61.8% | ⚠️ Degraded |
| MMLU (5-shot) | 76.89% | 25.12% | 32.7% | ❌ Near random |

Original scores are taken from the Qwen3 Technical Report.

Key Findings

  1. Physical reasoning preserved: PIQA retained 82.6% of original performance
  2. Factual knowledge destroyed: MMLU collapsed to chance level (25.12%, roughly the 25% expected for 4-way multiple choice)
  3. Perplexity underestimates damage: the 2.73× perplexity increase does not predict the benchmark collapse
  4. Layer-specific knowledge: factual knowledge appears to be concentrated in the specific layers that were removed

Usage

Basic Inference

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Mercity/Qwen3-8B-LaCo-Pruned"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)

# Text generation
prompt = "The process of photosynthesis"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

With 4-bit Quantization (Further Compression)

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    "Mercity/Qwen3-8B-LaCo-Pruned",
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("Mercity/Qwen3-8B-LaCo-Pruned", trust_remote_code=True)

Recovery Recommendations

To restore performance after pruning:

Option 1: LoRA Fine-tuning (Recommended)

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Fine-tune on OpenOrca, Alpaca, or domain-specific data

Option 2: Knowledge Distillation

Use the original Qwen3-8B-Base as a teacher and distill its output distribution back into the pruned student.
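
A minimal sketch of the corresponding training loss, assuming teacher logits come from Qwen3-8B-Base on the same batch (the temperature and mixing weight are illustrative choices, not values used here):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-softened teacher and student distributions
    vocab = student_logits.size(-1)
    kd = F.kl_div(
        F.log_softmax(student_logits.view(-1, vocab) / T, dim=-1),
        F.softmax(teacher_logits.view(-1, vocab) / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: the usual next-token cross-entropy on the ground-truth labels
    ce = F.cross_entropy(student_logits.view(-1, vocab), labels.view(-1), ignore_index=-100)
    return alpha * kd + (1 - alpha) * ce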

Expected Recovery

  • With fine-tuning: +15-25% on MMLU
  • With knowledge distillation: +25-35% on MMLU

Technical Specifications

| Attribute | Value |
|---|---|
| Architecture | Transformer decoder-only |
| Parameters | ~5.8B |
| Layers | 26 |
| Hidden Size | 4096 |
| Attention Heads (Q) | 32 |
| Attention Heads (KV) | 8 (GQA) |
| Intermediate Size | 12288 |
| Vocabulary Size | 151,669 |
| Max Context Length | 32,768 tokens |
| Precision | bfloat16 |
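
These values can be sanity-checked from the shipped configuration (attribute names follow the standard transformers Qwen3 config):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Mercity/Qwen3-8B-LaCo-Pruned", trust_remote_code=True)
print(config.num_hidden_layers)    # 26 after pruning
print(config.hidden_size)          # 4096
print(config.num_attention_heads)  # 32
print(config.num_key_value_heads)  # 8 (GQA)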

Citation

If you use this model, please cite the original LaCo paper and Qwen3:

@article{yang2024laco,
  title={LaCo: Large Language Model Pruning via Layer Collapse},
  author={Yang, Yifei and Cao, Zouying and Zhao, Hai},
  journal={arXiv preprint arXiv:2402.11187},
  year={2024}
}

@misc{qwen3technicalreport,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025},
  eprint={2505.09388},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.09388}
}

License

Apache 2.0 (same as the base Qwen3 model)

Acknowledgments

  • Qwen Team for the excellent Qwen3-8B-Base model
  • LaCo authors for the pruning methodology
  • Hugging Face for model hosting