Pre-Quantized Model Implementation

This project deploys a pre-quantized int4 model for efficiency. The quantized weights live in the int4 subfolder of the model repository, so no additional quantization is applied at load time.
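
Before loading any weights, it is easy to confirm that the checkpoint in the int4 subfolder already carries its own quantization settings. A minimal sketch, assuming the repository layout shown in the "Model Structure" section below:

from transformers import AutoConfig

MAIN_MODEL_ID = "Tonic/petite-elle-L-aime-3-sft"

# A pre-quantized checkpoint declares its settings in the config
config = AutoConfig.from_pretrained(MAIN_MODEL_ID, subfolder="int4", trust_remote_code=True)
print(getattr(config, "quantization_config", None))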

Key Changes Made

1. Loading Pre-Quantized Model

The app now correctly loads the pre-quantized model without trying to re-quantize it:

import logging

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

logger = logging.getLogger(__name__)

MAIN_MODEL_ID = "Tonic/petite-elle-L-aime-3-sft"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

model = None
tokenizer = None

def load_model():
    """Load the pre-quantized model and tokenizer"""
    global model, tokenizer

    try:
        # Load tokenizer from int4 subfolder
        tokenizer = AutoTokenizer.from_pretrained(MAIN_MODEL_ID, subfolder="int4")

        # Load pre-quantized model without additional quantization config
        model_kwargs = {
            "device_map": "auto" if DEVICE == "cuda" else "cpu",
            "torch_dtype": torch.float32,  # Use float32 for compatibility
            "trust_remote_code": True,
            "low_cpu_mem_usage": True,
        }

        model = AutoModelForCausalLM.from_pretrained(MAIN_MODEL_ID, subfolder="int4", **model_kwargs)

        return True
    except Exception as e:
        logger.error(f"Error loading model: {e}")
        return False

2. Proper Inference with Cache Implementation

The most important fix is using cache_implementation="static" for generation:

output_ids = model.generate(
    inputs['input_ids'],
    max_new_tokens=max_tokens,
    temperature=temperature,
    top_p=top_p,
    do_sample=do_sample,
    attention_mask=inputs['attention_mask'],
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    cache_implementation="static"  # CRITICAL for quantized models
)
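
For context, here is a hedged sketch of how the inputs above might be built and the output decoded; the prompt is illustrative and not necessarily how the app constructs its chat prompt:

prompt = "Bonjour, comment vas-tu ?"  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(
    inputs['input_ids'],
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    attention_mask=inputs['attention_mask'],
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    cache_implementation="static",
)

# Decode only the newly generated tokens, skipping the prompt
new_tokens = output_ids[0][inputs['input_ids'].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))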

Why This Approach Works

Avoiding Quantization Conflicts

The warning you saw:

You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.

This happens because:

  1. Your model in the int4 subfolder is already quantized
  2. When you try to apply TorchAO quantization to an already quantized model, it conflicts
  3. The solution is to load the pre-quantized model directly without additional quantization
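
In code form, one defensive pattern is to pass a quantization_config only when the checkpoint does not already declare one. This is a hedged sketch with a hypothetical helper (load_without_conflict), not the app's actual loading path:

from transformers import AutoConfig, AutoModelForCausalLM

def load_without_conflict(model_id, subfolder="", quantization_config=None):
    """Hypothetical helper: skip the explicit quantization_config if the checkpoint is already quantized."""
    config = AutoConfig.from_pretrained(model_id, subfolder=subfolder, trust_remote_code=True)
    kwargs = {
        "subfolder": subfolder,
        "device_map": "auto",
        "trust_remote_code": True,
        "low_cpu_mem_usage": True,
    }
    if quantization_config is not None and getattr(config, "quantization_config", None) is None:
        # Only quantize at load time when the checkpoint is not pre-quantized
        kwargs["quantization_config"] = quantization_config
    return AutoModelForCausalLM.from_pretrained(model_id, **kwargs)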

Benefits of Pre-Quantized Models

  1. No Quantization Overhead: The model is already optimized
  2. Consistent Performance: No runtime quantization variations
  3. Memory Efficient: Already compressed for deployment
  4. Faster Loading: No quantization step during loading

Testing the Implementation

Run the test script to verify the pre-quantized model works:

python test_pre_quantized_model.py

This will test:

  • Loading the pre-quantized model without conflicts
  • Text generation with proper cache implementation
  • Verification of quantization status
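
The exact contents of test_pre_quantized_model.py are not reproduced here; a minimal sketch of the kind of checks it describes could look like this (the model ID and prompt are assumptions):

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Tonic/petite-elle-L-aime-3-sft"  # assumed repo, see "Model Structure" below

def run_checks():
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, subfolder="int4")
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, subfolder="int4", device_map="auto", trust_remote_code=True
    )

    # 1. Quantization status: a pre-quantized checkpoint carries a quantization_config
    print("quantization_config:", getattr(model.config, "quantization_config", None))

    # 2. Generation with the static cache implementation
    inputs = tokenizer("Bonjour !", return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=32, cache_implementation="static")
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

if __name__ == "__main__":
    run_checks()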

Performance Benefits

  1. Memory Reduction: Pre-quantized models use ~50% less memory
  2. Faster Loading: No quantization step during model loading
  3. Consistent Performance: No quantization variations between runs
  4. Optimized Kernels: Pre-quantized models use optimized inference kernels

Common Issues and Solutions

Issue: Quantization config warning

Solution: Don't apply additional quantization to pre-quantized models

Issue: Model outputs incorrect or garbled text

Solution: Ensure cache_implementation="static" is used in generation

Issue: Memory errors during loading

Solution: Use low_cpu_mem_usage=True and appropriate device mapping

Issue: Slow inference

Solution:

  1. Use cache_implementation="static"
  2. Consider using torch.compile for additional speedup (see the sketch below)
  3. Monitor memory usage
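
A hedged sketch of combining the static cache with torch.compile; whether compilation actually helps depends on the hardware and the PyTorch/torchao versions in use:

import torch

# Assumes `model` and `tokenizer` were loaded as shown earlier
model.forward = torch.compile(model.forward, mode="max-autotune")

inputs = tokenizer("Bonjour !", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64, cache_implementation="static")
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))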

Model Structure

Your model repository should have this structure:

Tonic/petite-elle-L-aime-3-sft/
├── int4/
│   ├── config.json
│   ├── pytorch_model.bin
│   ├── tokenizer.json
│   └── ...
├── README.md
└── ...
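
To check this layout without cloning the repository, the file list can be inspected with huggingface_hub (an optional sketch):

from huggingface_hub import list_repo_files

files = list_repo_files("Tonic/petite-elle-L-aime-3-sft")
print([f for f in files if f.startswith("int4/")])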

Deployment Notes

  1. No Additional Quantization: The model is already quantized
  2. Cache Implementation: Always use cache_implementation="static"
  3. Memory Monitoring: Pre-quantized models use less memory
  4. Performance: Optimized for deployment without quantization overhead

Troubleshooting

Check Model Quantization

import torch

# Check whether the model is quantized: non-float32 (or wrapped) weights are a sign of quantization
for name, module in model.named_modules():
    weight = getattr(module, 'weight', None)
    if weight is not None and weight.dtype != torch.float32:
        print(f"{name}: {weight.dtype}")
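
TorchAO may wrap quantized weights in tensor subclasses, so dtype inspection alone is not always conclusive; the config attached to the loaded model is a simpler signal:

# A pre-quantized checkpoint keeps its quantization settings on the config
print(getattr(model.config, "quantization_config", None))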

Memory Usage

import torch

if torch.cuda.is_available():  # only meaningful when running on CUDA
    print(f"GPU Memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")

Verify Model Loading

# Check model config
print(f"Model dtype: {model.dtype}")
print(f"Model device: {model.device}")

Alternative: TorchAO Quantization

If you want to use TorchAO quantization instead of pre-quantized models:

  1. Load the base model (not from int4 subfolder)
  2. Apply TorchAO quantization during loading
  3. Use appropriate quantization configs for your device, for example:

import torch
from transformers import AutoModelForCausalLM, TorchAoConfig
from torchao.quantization import Int4WeightOnlyConfig

model_id = "Tonic/petite-elle-L-aime-3-sft"  # base (unquantized) weights, assumed to be at the repo root

quant_config = Int4WeightOnlyConfig(group_size=128)
quantization_config = TorchAoConfig(quant_type=quant_config)

model = AutoModelForCausalLM.from_pretrained(
    model_id,  # not subfolder="int4"
    quantization_config=quantization_config,
    device_map="auto",
    torch_dtype=torch.float32,
)
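
Once quantized this way, the model could be saved back into an int4/ subfolder to reproduce the pre-quantized layout used above. A hedged sketch; depending on the torchao and transformers versions, safetensors serialization of torchao tensors may not be supported, hence safe_serialization=False:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Write the freshly quantized weights and tokenizer to a local int4/ directory
model.save_pretrained("petite-elle-L-aime-3-sft/int4", safe_serialization=False)
tokenizer.save_pretrained("petite-elle-L-aime-3-sft/int4")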

In short, this implementation loads the pre-quantized model without quantization conflicts and relies on the critical cache_implementation="static" setting for correct generation.