# Pre-Quantized Model Implementation

This project uses a pre-quantized int4 model for efficient deployment. The model is already quantized and stored in the `int4` subfolder, so we don't need to apply additional quantization during loading.

## Key Changes Made

### 1. Loading Pre-Quantized Model
The app now correctly loads the pre-quantized model without trying to re-quantize it:
```python
import logging

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

logger = logging.getLogger(__name__)

# Module-level setup assumed from the surrounding app code; the model ID matches the repo shown below
MAIN_MODEL_ID = "Tonic/petite-elle-L-aime-3-sft"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

model = None
tokenizer = None

def load_model():
    """Load the pre-quantized model and tokenizer"""
    global model, tokenizer
    try:
        # Load tokenizer from int4 subfolder
        tokenizer = AutoTokenizer.from_pretrained(MAIN_MODEL_ID, subfolder="int4")

        # Load pre-quantized model without additional quantization config
        model_kwargs = {
            "device_map": "auto" if DEVICE == "cuda" else "cpu",
            "torch_dtype": torch.float32,  # Use float32 for compatibility
            "trust_remote_code": True,
            "low_cpu_mem_usage": True,
        }
        model = AutoModelForCausalLM.from_pretrained(MAIN_MODEL_ID, subfolder="int4", **model_kwargs)
        return True
    except Exception as e:
        logger.error(f"Error loading model: {e}")
        return False
```
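For reference, a typical call site at app startup might look like this (illustrative only; the actual Gradio app may handle the failure differently):

```python
# Fail fast at startup if the pre-quantized model cannot be loaded
if not load_model():
    raise RuntimeError("Failed to load the pre-quantized int4 model")
```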
### 2. Proper Inference with Cache Implementation

The most important fix is using `cache_implementation="static"` for generation:
```python
# `inputs` is the tokenized prompt, e.g. tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(
    inputs['input_ids'],
    max_new_tokens=max_tokens,
    temperature=temperature,
    top_p=top_p,
    do_sample=do_sample,
    attention_mask=inputs['attention_mask'],
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    cache_implementation="static"  # CRITICAL for quantized models
)
```
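To turn `output_ids` back into text, the usual pattern is to strip the prompt tokens and decode only the newly generated ones (a short sketch using the variables above):

```python
# Decode only the tokens generated after the prompt
generated = output_ids[0][inputs['input_ids'].shape[1]:]
response = tokenizer.decode(generated, skip_special_tokens=True)
```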
## Why This Approach Works

### Avoiding Quantization Conflicts

The warning you saw:

```
You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.
```
This happens because:

- Your model in the `int4` subfolder is already quantized
- When you try to apply TorchAO quantization to an already quantized model, it conflicts
- The solution is to load the pre-quantized model directly without additional quantization (the anti-pattern is sketched below)
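For contrast, a call like the following is what triggers the warning: it points at the already-quantized `int4` subfolder while also passing a TorchAO config (illustrative only, not code from this repo):

```python
from transformers import AutoModelForCausalLM, TorchAoConfig
from torchao.quantization import Int4WeightOnlyConfig

MAIN_MODEL_ID = "Tonic/petite-elle-L-aime-3-sft"

# Anti-pattern: re-quantizing an already quantized checkpoint
quantization_config = TorchAoConfig(quant_type=Int4WeightOnlyConfig(group_size=128))
model = AutoModelForCausalLM.from_pretrained(
    MAIN_MODEL_ID,
    subfolder="int4",                         # this checkpoint already carries a quantization_config
    quantization_config=quantization_config,  # triggers the warning; the checkpoint's config wins
)
```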
### Benefits of Pre-Quantized Models

- **No Quantization Overhead**: The model is already optimized
- **Consistent Performance**: No runtime quantization variations
- **Memory Efficient**: Already compressed for deployment
- **Faster Loading**: No quantization step during loading
## Testing the Implementation

Run the test script to verify the pre-quantized model works:

```bash
python test_pre_quantized_model.py
```
This will test:
- Loading the pre-quantized model without conflicts
- Text generation with proper cache implementation
- Verification of quantization status
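The test script itself is not reproduced here; a minimal, purely illustrative check covering the same three points could look like this:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Tonic/petite-elle-L-aime-3-sft"

# 1. Load the pre-quantized checkpoint without passing any quantization config
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, subfolder="int4")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, subfolder="int4", torch_dtype=torch.float32, trust_remote_code=True
)

# 2. Generate with the static cache implementation
inputs = tokenizer("Bonjour, comment vas-tu ?", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32, cache_implementation="static")
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

# 3. Inspect the quantization status recorded in the model config
print(getattr(model.config, "quantization_config", None))
```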
## Performance Benefits

- **Memory Reduction**: Pre-quantized models use ~50% less memory
- **Faster Loading**: No quantization step during model loading
- **Consistent Performance**: No quantization variations between runs
- **Optimized Kernels**: Pre-quantized models use optimized inference kernels
## Common Issues and Solutions

**Issue**: Quantization config warning
**Solution**: Don't apply additional quantization to pre-quantized models

**Issue**: Model outputs incorrect or garbled text
**Solution**: Ensure `cache_implementation="static"` is used in generation

**Issue**: Memory errors during loading
**Solution**: Use `low_cpu_mem_usage=True` and appropriate device mapping
**Issue**: Slow inference
**Solution**:
- Use `cache_implementation="static"`
- Consider using `torch.compile` for additional speedup (see the sketch below)
- Monitor memory usage
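As a hedged sketch, `torch.compile` can be applied to the loaded model's forward pass; the actual speedup depends on the PyTorch version and hardware, and the first generation is slower while compilation runs:

```python
import torch

# Optional: compile the forward pass; pairs well with cache_implementation="static".
# `model` is the globally loaded model from load_model() above.
model.forward = torch.compile(model.forward, mode="reduce-overhead")
```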
## Model Structure

Your model repository should have this structure:

```
Tonic/petite-elle-L-aime-3-sft/
├── int4/
│   ├── config.json
│   ├── pytorch_model.bin
│   ├── tokenizer.json
│   └── ...
├── README.md
└── ...
```
## Deployment Notes

- **No Additional Quantization**: The model is already quantized
- **Cache Implementation**: Always use `cache_implementation="static"`
- **Memory Monitoring**: Pre-quantized models use less memory
- **Performance**: Optimized for deployment without quantization overhead
## Troubleshooting

### Check Model Quantization
```python
# Check if the model is quantized: list modules whose weights are not float32
for name, module in model.named_modules():
    if getattr(module, 'weight', None) is not None and module.weight.dtype != torch.float32:
        print(f"{name}: {module.weight.dtype}")
```
### Memory Usage

```python
import torch

if torch.cuda.is_available():
    print(f"GPU Memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
```
### Verify Model Loading

```python
# Check model config
print(f"Model dtype: {model.dtype}")
print(f"Model device: {model.device}")
```
## Alternative: TorchAO Quantization

If you want to use TorchAO quantization instead of pre-quantized models:

- Load the base model (not from the `int4` subfolder)
- Apply TorchAO quantization during loading
- Use appropriate quantization configs for your device
```python
import torch
from transformers import AutoModelForCausalLM, TorchAoConfig
from torchao.quantization import Int4WeightOnlyConfig

model_id = "Tonic/petite-elle-L-aime-3-sft"  # base model repo, not the int4 subfolder

quant_config = Int4WeightOnlyConfig(group_size=128)
quantization_config = TorchAoConfig(quant_type=quant_config)

model = AutoModelForCausalLM.from_pretrained(
    model_id,  # Not subfolder="int4"
    quantization_config=quantization_config,
    device_map="auto",
    torch_dtype=torch.float32,  # note: some torchao int4 kernels expect bfloat16; adjust if quantization fails
)
```
This implementation ensures proper handling of pre-quantized models without quantization conflicts, with the critical `cache_implementation="static"` parameter for correct generation.