Qwen2.5-1.5B-Instruct: Quantized GGUF Models

This repository provides GGUF-quantized variants of Qwen2.5-1.5B-Instruct, optimized for efficient inference across a wide range of hardware, from modern GPUs to low-memory CPUs and edge devices.

The goal of these quantizations is to significantly reduce memory and compute requirements while retaining the strong instruction-following and reasoning behavior of the base model.


Model Summary

  • Base Model: Qwen2.5-1.5B-Instruct
  • Architecture: Decoder-only Transformer
  • Parameter Count: ~1.5B
  • Modalities: Text
  • Context Length: Up to 32K tokens (backend dependent)
  • Developer: Qwen Team (Alibaba Cloud)
  • License: Apache-2.0
  • Languages: Multilingual (English, Chinese, others)

Available Quantizations

Multiple GGUF quantization levels are provided to support different performance, memory, and accuracy requirements.

Q2_K (2-bit)

  • Extremely small memory footprint
  • Enables inference on very constrained devices
  • Suitable for experimentation or ultra-low-resource environments
  • Significant quality degradation compared to higher-bit variants

Q3_K_M (3-bit)

  • Slightly improved quality over Q2_K
  • Still very lightweight and fast
  • Reasoning and instruction accuracy noticeably reduced
  • Best for basic conversational or lightweight tasks

Q4_K_M (4-bit)

  • Strong efficiency-to-quality ratio
  • Works well on CPUs and low-VRAM GPUs
  • Suitable for general chat and instruction tasks
  • Moderate quality loss in complex reasoning

Q5_K_M (5-bit)

  • Good balance between size and output quality
  • Retains most instruction-following capabilities
  • Recommended default for local usage

Q6_K (6-bit)

  • Higher fidelity responses
  • Increased memory usage compared to 5-bit
  • Better suited for reasoning-heavy prompts

Q8_0 (8-bit)

  • Near FP16-level quality
  • Largest quantized variant
  • Best choice when memory allows and accuracy is critical

Actual performance depends on inference backend, context length, sampling parameters, and prompt complexity.
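As a rough rule of thumb, the weight-storage footprint of each variant can be estimated from the parameter count and the nominal bits per weight. The sketch below uses nominal bit widths only; actual GGUF files are somewhat larger because K-quants store per-block scales, keep some tensors at higher precision, and include metadata.

```python
def estimate_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes).

    Real GGUF files run larger than this nominal estimate: K-quants
    add per-block scales/mins and files carry metadata overhead.
    """
    return n_params * bits_per_weight / 8 / 1e9

# Nominal per-weight widths for the variants in this repo (approximate).
for name, bits in [("Q2_K", 2), ("Q4_K_M", 4), ("Q5_K_M", 5), ("Q8_0", 8)]:
    print(f"{name}: ~{estimate_weight_gb(1.5e9, bits):.2f} GB")
```

For a ~1.5B-parameter model this puts Q4_K_M near three quarters of a gigabyte of weights, which is why it fits comfortably on low-VRAM GPUs and ordinary CPUs.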


Why Use Quantized Qwen2.5?

  • Efficient instruction-following with low latency
  • Capable reasoning even at reduced precision
  • Runs entirely offline
  • Scales from laptops to edge devices
  • Flexible deployment via GGUF-compatible runtimes

These models are ideal for local assistants, offline chat applications, research, and resource-constrained environments.


Usage Example

llama.cpp (GGUF)

./llama-cli \
  -m qwen2.5-1.5b-instruct-q5_k_m.gguf \
  -p "Explain the difference between supervised and unsupervised learning." \
  -n 256 \
  -c 8192
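When passing a raw prompt with -p rather than using llama.cpp's interactive chat mode, results are usually better if the prompt follows the ChatML template that Qwen2.5-Instruct was fine-tuned on. A minimal sketch (the system message here is just an illustrative default, not the model's official one):

```python
def chatml_prompt(user_msg: str,
                  system_msg: str = "You are a helpful assistant.") -> str:
    """Wrap a single-turn request in the ChatML format used by Qwen2.5.

    The returned string ends with the assistant header so the model
    continues by generating the assistant's reply.
    """
    return (
        f"<|im_start|>system\n{system_msg}<|im_end|>\n"
        f"<|im_start|>user\n{user_msg}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = chatml_prompt(
    "Explain the difference between supervised and unsupervised learning."
)
print(prompt)
```

The formatted string can then be passed verbatim to -p in the command above.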

Recommended Settings

  • Prefer Q5_K_M or higher for reasoning tasks
  • Use lower-bit variants (Q2_K, Q3_K_M) only when memory is extremely limited
  • Temperature range: 0.6-0.8 for balanced outputs
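The guidance above can be reduced to a simple selection rule: choose the highest-fidelity quantization whose estimated footprint fits the memory you can spare. The sizes below are nominal weight-only estimates for a ~1.5B-parameter model, not the exact file sizes in this repo:

```python
from typing import Optional

# (variant name, approximate weight footprint in GB for ~1.5B params),
# ordered from highest fidelity to smallest. Nominal estimates only.
QUANTS = [
    ("Q8_0", 1.50),
    ("Q6_K", 1.13),
    ("Q5_K_M", 0.94),
    ("Q4_K_M", 0.75),
    ("Q3_K_M", 0.56),
    ("Q2_K", 0.38),
]

def pick_quant(available_gb: float) -> Optional[str]:
    """Return the highest-quality variant that fits, or None if none do."""
    for name, size_gb in QUANTS:
        if size_gb <= available_gb:
            return name
    return None
```

For example, with roughly 1 GB free for weights this rule lands on Q5_K_M, the recommended default above; remember to leave additional headroom for the KV cache, which grows with context length.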

Training Data (Base Model)

The original Qwen2.5-1.5B-Instruct model was trained and fine-tuned on a diverse mixture of:

  • Instruction-following datasets
  • Multilingual general-knowledge corpora
  • Reasoning-focused synthetic data
  • Conversational and task-oriented examples

Quantization compresses the model weights numerically; it does not change the training data, and any behavioral differences from the base model are an unintended side effect of reduced precision.


Recommended Applications

  • Offline AI assistants
  • Local chat and analysis tools
  • Educational experimentation
  • CPU-only or low-VRAM environments
  • Embedded and edge deployments

Known Limitations

  • Lower-bit variants may hallucinate more frequently
  • Q2_K and Q3_K_M are not suitable for complex reasoning
  • Not intended for safety-critical or high-risk decision making

Always validate performance on your specific workload.


Acknowledgements

  • Qwen Team for releasing the Qwen2.5 model family
  • The llama.cpp community for GGUF tooling and inference support
  • Open-source contributors enabling efficient local LLM deployment

Contact

For issues related to quantization files or deployment guidance, please open an issue in this repository.
