Pre-Quantized Model Implementation

This project deploys a pre-quantized int4 model for efficiency. The quantized weights live in the int4 subfolder of the model repository, so no additional quantization is applied at load time.
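
Before loading any weights, it is easy to confirm that the checkpoint in the int4 subfolder already carries its own quantization settings. A minimal sketch, assuming the repository layout shown in the "Model Structure" section below:

from transformers import AutoConfig

MAIN_MODEL_ID = "Tonic/petite-elle-L-aime-3-sft"

# A pre-quantized checkpoint declares its settings in the config
config = AutoConfig.from_pretrained(MAIN_MODEL_ID, subfolder="int4", trust_remote_code=True)
print(getattr(config, "quantization_config", None))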

Key Changes Made

1. Loading Pre-Quantized Model

The app now correctly loads the pre-quantized model without trying to re-quantize it:

import logging

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

logger = logging.getLogger(__name__)

MAIN_MODEL_ID = "Tonic/petite-elle-L-aime-3-sft"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

model = None
tokenizer = None

def load_model():
    """Load the pre-quantized model and tokenizer"""
    global model, tokenizer

    try:
        # Load tokenizer from int4 subfolder
        tokenizer = AutoTokenizer.from_pretrained(MAIN_MODEL_ID, subfolder="int4")

        # Load pre-quantized model without additional quantization config
        model_kwargs = {
            "device_map": "auto" if DEVICE == "cuda" else "cpu",
            "torch_dtype": torch.float32,  # Use float32 for compatibility
            "trust_remote_code": True,
            "low_cpu_mem_usage": True,
        }

        model = AutoModelForCausalLM.from_pretrained(MAIN_MODEL_ID, subfolder="int4", **model_kwargs)

        return True
    except Exception as e:
        logger.error(f"Error loading model: {e}")
        return False

2. Proper Inference with Cache Implementation

The most important fix is using cache_implementation="static" for generation:

output_ids = model.generate(
    inputs['input_ids'],
    max_new_tokens=max_tokens,
    temperature=temperature,
    top_p=top_p,
    do_sample=do_sample,
    attention_mask=inputs['attention_mask'],
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    cache_implementation="static"  # CRITICAL for quantized models
)
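
For context, here is a hedged sketch of how the inputs above might be built and the output decoded; the prompt is illustrative and not necessarily how the app constructs its chat prompt:

prompt = "Bonjour, comment vas-tu ?"  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(
    inputs['input_ids'],
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    attention_mask=inputs['attention_mask'],
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    cache_implementation="static",
)

# Decode only the newly generated tokens, skipping the prompt
new_tokens = output_ids[0][inputs['input_ids'].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))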

Why This Approach Works

Avoiding Quantization Conflicts

The warning you saw:

You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.

This happens because:

  1. Your model in the int4 subfolder is already quantized
  2. When you try to apply TorchAO quantization to an already quantized model, it conflicts
  3. The solution is to load the pre-quantized model directly without additional quantization
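
In code form, one defensive pattern is to pass a quantization_config only when the checkpoint does not already declare one. This is a hedged sketch with a hypothetical helper (load_without_conflict), not the app's actual loading path:

from transformers import AutoConfig, AutoModelForCausalLM

def load_without_conflict(model_id, subfolder="", quantization_config=None):
    """Hypothetical helper: skip the explicit quantization_config if the checkpoint is already quantized."""
    config = AutoConfig.from_pretrained(model_id, subfolder=subfolder, trust_remote_code=True)
    kwargs = {
        "subfolder": subfolder,
        "device_map": "auto",
        "trust_remote_code": True,
        "low_cpu_mem_usage": True,
    }
    if quantization_config is not None and getattr(config, "quantization_config", None) is None:
        # Only quantize at load time when the checkpoint is not pre-quantized
        kwargs["quantization_config"] = quantization_config
    return AutoModelForCausalLM.from_pretrained(model_id, **kwargs)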

Benefits of Pre-Quantized Models

  1. No Quantization Overhead: The model is already optimized
  2. Consistent Performance: No runtime quantization variations
  3. Memory Efficient: Already compressed for deployment
  4. Faster Loading: No quantization step during loading

Testing the Implementation

Run the test script to verify the pre-quantized model works:

python test_pre_quantized_model.py

This will test:

  • Loading the pre-quantized model without conflicts
  • Text generation with proper cache implementation
  • Verification of quantization status
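
The exact contents of test_pre_quantized_model.py are not reproduced here; a minimal sketch of the kind of checks it describes could look like this (the model ID and prompt are assumptions):

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Tonic/petite-elle-L-aime-3-sft"  # assumed repo, see "Model Structure" below

def run_checks():
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, subfolder="int4")
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, subfolder="int4", device_map="auto", trust_remote_code=True
    )

    # 1. Quantization status: a pre-quantized checkpoint carries a quantization_config
    print("quantization_config:", getattr(model.config, "quantization_config", None))

    # 2. Generation with the static cache implementation
    inputs = tokenizer("Bonjour !", return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=32, cache_implementation="static")
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

if __name__ == "__main__":
    run_checks()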

Performance Benefits

  1. Memory Reduction: Pre-quantized models use ~50% less memory
  2. Faster Loading: No quantization step during model loading
  3. Consistent Performance: No quantization variations between runs
  4. Optimized Kernels: Pre-quantized models use optimized inference kernels

Common Issues and Solutions

Issue: Quantization config warning

Solution: Don't apply additional quantization to pre-quantized models

Issue: Model outputs incorrect or garbled text

Solution: Ensure cache_implementation="static" is used in generation

Issue: Memory errors during loading

Solution: Use low_cpu_mem_usage=True and appropriate device mapping

Issue: Slow inference

Solution:

  1. Use cache_implementation="static"
  2. Consider using torch.compile for additional speedup (see the sketch below)
  3. Monitor memory usage
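
A hedged sketch of combining the static cache with torch.compile; whether compilation actually helps depends on the hardware and the PyTorch/torchao versions in use:

import torch

# Assumes `model` and `tokenizer` were loaded as shown earlier
model.forward = torch.compile(model.forward, mode="max-autotune")

inputs = tokenizer("Bonjour !", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64, cache_implementation="static")
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))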

Model Structure

Your model repository should have this structure:

Tonic/petite-elle-L-aime-3-sft/
├── int4/
│   ├── config.json
│   ├── pytorch_model.bin
│   ├── tokenizer.json
│   └── ...
├── README.md
└── ...
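
To check this layout without cloning the repository, the file list can be inspected with huggingface_hub (an optional sketch):

from huggingface_hub import list_repo_files

files = list_repo_files("Tonic/petite-elle-L-aime-3-sft")
print([f for f in files if f.startswith("int4/")])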

Deployment Notes

  1. No Additional Quantization: The model is already quantized
  2. Cache Implementation: Always use cache_implementation="static"
  3. Memory Monitoring: Pre-quantized models use less memory
  4. Performance: Optimized for deployment without quantization overhead

Troubleshooting

Check Model Quantization

import torch

# Check whether the model is quantized: non-float32 (or wrapped) weights are a sign of quantization
for name, module in model.named_modules():
    weight = getattr(module, 'weight', None)
    if weight is not None and weight.dtype != torch.float32:
        print(f"{name}: {weight.dtype}")
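
TorchAO may wrap quantized weights in tensor subclasses, so dtype inspection alone is not always conclusive; the config attached to the loaded model is a simpler signal:

# A pre-quantized checkpoint keeps its quantization settings on the config
print(getattr(model.config, "quantization_config", None))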

Memory Usage

import torch

if torch.cuda.is_available():  # only meaningful when running on CUDA
    print(f"GPU Memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")

Verify Model Loading

# Check model config
print(f"Model dtype: {model.dtype}")
print(f"Model device: {model.device}")

Alternative: TorchAO Quantization

If you want to use TorchAO quantization instead of pre-quantized models:

  1. Load the base model (not from int4 subfolder)
  2. Apply TorchAO quantization during loading
  3. Use appropriate quantization configs for your device, for example:

import torch
from transformers import AutoModelForCausalLM, TorchAoConfig
from torchao.quantization import Int4WeightOnlyConfig

model_id = "Tonic/petite-elle-L-aime-3-sft"  # base (unquantized) weights, assumed to be at the repo root

quant_config = Int4WeightOnlyConfig(group_size=128)
quantization_config = TorchAoConfig(quant_type=quant_config)

model = AutoModelForCausalLM.from_pretrained(
    model_id,  # not subfolder="int4"
    quantization_config=quantization_config,
    device_map="auto",
    torch_dtype=torch.float32,
)
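
Once quantized this way, the model could be saved back into an int4/ subfolder to reproduce the pre-quantized layout used above. A hedged sketch; depending on the torchao and transformers versions, safetensors serialization of torchao tensors may not be supported, hence safe_serialization=False:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Write the freshly quantized weights and tokenizer to a local int4/ directory
model.save_pretrained("petite-elle-L-aime-3-sft/int4", safe_serialization=False)
tokenizer.save_pretrained("petite-elle-L-aime-3-sft/int4")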

In short, this implementation loads the pre-quantized model without quantization conflicts and relies on the critical cache_implementation="static" setting for correct generation.