--- language: en license: llama3.1 library_name: transformers tags: - text-classification - energy - document-classification - llama-3.1 - lora - peft - binary-classification - energy-documents pipeline_tag: text-classification widget: - text: "Solar energy has become increasingly cost-competitive with fossil fuels in recent years. The price of photovoltaic panels has dropped significantly, making renewable energy more accessible." example_title: "Energy Document" - text: "The committee discussed the implementation of new operational guidelines. Training sessions will be conducted for all staff members next month." example_title: "Non-Energy Document" datasets: - custom metrics: - accuracy - f1 - precision - recall base_model: meta-llama/Llama-3.1-8B model-index: - name: Llama-3.1-8B-Energy-Classifier results: - task: type: text-classification name: Energy Document Classification dataset: name: Custom Energy Documents Dataset type: custom metrics: - type: accuracy value: 0.9839 name: Test Accuracy verified: true - type: f1 value: 0.9841 name: Test F1 Score verified: true - type: precision value: 0.9717 name: Test Precision verified: true - type: recall value: 0.9969 name: Test Recall verified: true - type: roc_auc value: 0.9976 name: ROC-AUC verified: true --- # 🔋 Llama-3.1-8B Energy Document Classifier A fine-tuned **Llama-3.1-8B** model for binary classification of energy-related documents, achieving **98.39% accuracy** on test data. This model uses **LoRA (Low-Rank Adaptation)** for parameter-efficient fine-tuning, trained on **95,602 documents** (perfectly balanced: 47,801 energy + 47,801 non-energy). ## 📊 Model Performance ### Test Set Results (9,562 documents) | Metric | Score | |--------|-------| | **Test Accuracy** | 98.39% | | **F1 Score** | 98.41% | | **Precision** | 97.17% | | **Recall** | 99.69% | | **ROC-AUC** | 99.76% | ### Validation Set Results (9,560 documents) | Metric | Score | |--------|-------| | **Val Accuracy** | 98.55% | | **Val F1 Score** | 98.56% | | **Val Precision** | 97.54% | | **Val Recall** | 99.60% | | **Val ROC-AUC** | 99.76% | ### Confusion Matrix (Test Set - 9,562 documents) | | Predicted Non-Energy | Predicted Energy | |--|---------------------|------------------| | **Actual Non-Energy** | 4,642 (97.09%) | 139 (2.91%) | | **Actual Energy** | 15 (0.31%) | 4,766 (99.69%) | **Only 154 misclassifications out of 9,562 documents (1.61% error rate)!** ### Training Details - **Base Model**: meta-llama/Llama-3.1-8B - **Training Method**: LoRA (r=16, alpha=32, dropout=0.05) - **Target Modules**: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj - **Trainable Parameters**: 45M out of 8B (0.56%) - **Total Dataset**: 95,602 documents (perfectly balanced) - Train: 76,480 (38,240 energy + 38,240 non-energy) - Val: 9,560 (4,780 energy + 4,780 non-energy) - Test: 9,562 (4,781 energy + 4,781 non-energy) - **Energy Data Sources**: - EnergyAI/finepdfs_energy (40,989 docs) - EnergyAI/wikipedia_energy (5,459 docs) - EnergyAI/eartharxiv_engrxiv_energy (27 docs) - EnergyAI/scored_chunks_from_SPE_pipeline (1,326 docs) - **Training Time**: ~2 hours on 4× A100 80GB GPUs - **Convergence**: Early stopping at step 1,100 (< 1 epoch!) - **Effective Batch Size**: 64 (per_device=4, gradient_accum=4, 4 GPUs) - **Learning Rate**: 2e-5 with cosine schedule and 10% warmup - **Precision**: bfloat16 mixed precision ### Data Curation Energy-labeled documents were sourced from four HuggingFace datasets (see above). Classification labels for the training data were created with Mistral 3 Large model and this classifier was distilled from this data. Non-energy documents were sampled from a base document pipeline, with deduplication to ensure no overlap with energy documents (validated by both document ID and MD5 hash matching). ## 🎯 Use Cases This model can classify documents as **energy-related** or **non-energy**. Perfect for: - 📚 Research paper categorization - 📰 News article filtering - 📄 Document management systems - 🔍 Content discovery and recommendation - 🗂️ Dataset curation for energy research **Energy Topics Covered:** - Oil & Gas - Renewable Energy (Solar, Wind, Hydro, Geothermal) - Electricity & Power Systems - Nuclear Energy - Energy Policy & Economics - Carbon & Climate (energy-related aspects) - Energy Storage & Batteries ## 🚀 Quick Start ### Installation ```bash pip install torch>=2.0.0 transformers>=4.44.0 peft>=0.12.0 accelerate>=0.28.0 ``` **⚠️ Version Requirements:** | Package | Recommended Version | Notes | |---------|-------------------|-------| | `torch` | ≥2.0.0 | CUDA support recommended | | `transformers` | 4.44.0 - 4.57.x | Tested range | | `peft` | 0.12.0 - 0.18.x | LoRA adapter loading | | `accelerate` | ≥0.28.0 | For device_map="auto" | **Note:** The tokenizer is loaded from the base model (`meta-llama/Llama-3.1-8B`). You need access to Llama-3.1 on Hugging Face (requires accepting the license agreement). ### Basic Usage ```python from peft import AutoPeftModelForSequenceClassification from transformers import AutoTokenizer import torch # Load model and tokenizer model_name = "omar-elmansouri/Llama-3.1-8B-Energy-Classifier" # Load tokenizer from base model (recommended for compatibility) tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B") if tokenizer.pad_token is None: tokenizer.pad_token = tokenizer.eos_token # Load the PEFT adapter model model = AutoPeftModelForSequenceClassification.from_pretrained( model_name, device_map="auto", torch_dtype=torch.bfloat16, ) model.config.pad_token_id = tokenizer.pad_token_id model.eval() # Classify a document text = """ Solar energy capacity has grown exponentially over the past decade. The International Energy Agency reports that solar now represents the fastest-growing renewable energy source globally. """ inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512) inputs = {k: v.to(model.device) for k, v in inputs.items()} with torch.no_grad(): outputs = model(**inputs) predictions = torch.nn.functional.softmax(outputs.logits, dim=-1) predicted_class = torch.argmax(predictions, dim=-1).item() label_map = {0: "non_energy", 1: "energy"} print(f"Prediction: {label_map[predicted_class]}") print(f"Confidence: {predictions[0][predicted_class].item():.4f}") print(f"Probabilities: Energy={predictions[0][1].item():.4f}, Non-Energy={predictions[0][0].item():.4f}") ``` **Output:** ``` Prediction: energy Confidence: 0.9987 Probabilities: Energy=0.9987, Non-Energy=0.0013 ``` ### Batch Processing ```python from transformers import pipeline # Create classification pipeline classifier = pipeline( "text-classification", model="EnergyAI/Llama-3.1-8B-Energy-Classifier", device_map="auto", torch_dtype=torch.bfloat16, ) # Classify multiple documents texts = [ "Wind turbines are becoming more efficient with larger blade designs.", "The software development team completed the sprint planning meeting.", "Natural gas prices fluctuated amid geopolitical tensions in Europe.", ] results = classifier(texts, truncation=True, max_length=512) for text, result in zip(texts, results): print(f"Text: {text[:50]}...") print(f"Label: {result['label']}, Score: {result['score']:.4f}\n") ``` ### Using PEFT for Efficient Loading ```python from peft import AutoPeftModelForSequenceClassification from transformers import AutoTokenizer import torch # Load with PEFT (more memory efficient) model = AutoPeftModelForSequenceClassification.from_pretrained( "EnergyAI/Llama-3.1-8B-Energy-Classifier", device_map="auto", torch_dtype=torch.bfloat16, ) tokenizer = AutoTokenizer.from_pretrained("EnergyAI/Llama-3.1-8B-Energy-Classifier") # Classify text = "Offshore wind farms are expanding along the Atlantic coast." inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512) inputs = {k: v.to(model.device) for k, v in inputs.items()} with torch.no_grad(): outputs = model(**inputs) probs = torch.nn.functional.softmax(outputs.logits, dim=-1) print(f"Energy probability: {probs[0][1].item():.4f}") ``` ## 🏗️ Model Architecture ### Base Model - **Name**: meta-llama/Llama-3.1-8B - **Parameters**: 8 Billion - **Architecture**: Transformer-based causal language model - **Context Length**: 128K tokens (using 1024 for classification) ### Fine-tuning Details - **Method**: LoRA (Low-Rank Adaptation) - **LoRA Rank (r)**: 16 - **LoRA Alpha**: 32 - **LoRA Dropout**: 0.05 - **Target Modules**: - `q_proj`, `k_proj`, `v_proj`, `o_proj` (Attention) - `gate_proj`, `up_proj`, `down_proj` (MLP) - **Trainable Parameters**: 41,951,232 (~0.52% of base model) - **Task Type**: Sequence Classification (Binary) ## 📚 Training Details ### Training Data - **Training Samples**: 30,534 documents - **Validation Samples**: 3,816 documents - **Test Samples**: 3,818 documents - **Total**: 38,168 labeled documents - **Class Distribution**: Balanced (50/50 energy/non-energy) ### Training Configuration - **Epochs**: 3 (early stopped at 2.3) - **Batch Size**: 4 per device × 4 GPUs × 4 gradient accumulation = 64 effective - **Learning Rate**: 2e-5 (cosine schedule with 10% warmup) - **Optimizer**: AdamW (fused) - **Weight Decay**: 0.01 - **Max Gradient Norm**: 1.0 - **Mixed Precision**: bfloat16 - **Training Time**: ~93 minutes on 4× GPUs ### Hardware - **GPUs**: 4× NVIDIA (high-memory GPUs) - **Memory**: 200GB total - **Framework**: PyTorch with HuggingFace Transformers & PEFT ### Training Metrics Evolution | Step | Train Loss | Val Loss | Val Accuracy | Val F1 | |------|-----------|----------|--------------|--------| | 100 | 4.31 | 1.00 | 65.2% | 66.8% | | 300 | 1.78 | 0.83 | 80.0% | 80.4% | | 500 | 0.73 | 0.70 | 89.6% | 89.7% | | 700 | 0.42 | 0.65 | 93.9% | 94.0% | | 900 | 0.27 | 0.64 | 96.1% | 96.1% | | **1100** | **0.21** | **0.63** | **98.2%** | **98.2%** | ## 💡 How It Works ### Label Definitions - **Label 0 (non_energy)**: Documents that are **NOT** primarily about energy topics - Examples: General news, politics (non-energy), sports, culture, software, education - **Label 1 (energy)**: Documents primarily discussing energy-related topics - Examples: - "Solar panel efficiency reached new record highs..." - "OPEC announced production cuts affecting oil prices..." - "Nuclear reactor designs promise safer, cleaner energy..." - "Wind energy capacity doubled in the last five years..." ### Classification Process 1. **Input**: Document text (up to 1024 tokens) 2. **Tokenization**: Llama-3.1 tokenizer with left padding 3. **Model Forward Pass**: Through LoRA-adapted Llama-3.1-8B 4. **Output**: Binary logits → softmax probabilities 5. **Prediction**: Class with highest probability + confidence score ## 📦 Model Files This repository contains: - `adapter_config.json` - LoRA adapter configuration - `adapter_model.safetensors` - Trained LoRA weights (161 MB) - `tokenizer.json` - Tokenizer vocabulary - `tokenizer_config.json` - Tokenizer configuration - `special_tokens_map.json` - Special tokens mapping **Note**: The base Llama-3.1-8B model will be downloaded automatically from HuggingFace. ## 🔧 Advanced Usage ### Custom Inference Script ```python import torch from transformers import AutoTokenizer, AutoModelForSequenceClassification from typing import List, Dict class EnergyClassifier: def __init__(self, model_name: str = "EnergyAI/Llama-3.1-8B-Energy-Classifier"): self.tokenizer = AutoTokenizer.from_pretrained(model_name) self.model = AutoModelForSequenceClassification.from_pretrained( model_name, device_map="auto", torch_dtype=torch.bfloat16, ) self.model.eval() self.label_map = {0: "non_energy", 1: "energy"} @torch.no_grad() def predict(self, text: str, return_probs: bool = True) -> Dict: inputs = self.tokenizer( text, return_tensors="pt", truncation=True, max_length=1024, padding=True, ) inputs = {k: v.to(self.model.device) for k, v in inputs.items()} outputs = self.model(**inputs) probs = torch.nn.functional.softmax(outputs.logits, dim=-1) predicted_class = torch.argmax(probs, dim=-1).item() result = { "label": self.label_map[predicted_class], "confidence": probs[0][predicted_class].item(), } if return_probs: result["probabilities"] = { "non_energy": probs[0][0].item(), "energy": probs[0][1].item(), } return result @torch.no_grad() def predict_batch(self, texts: List[str], batch_size: int = 8) -> List[Dict]: results = [] for i in range(0, len(texts), batch_size): batch = texts[i:i + batch_size] inputs = self.tokenizer( batch, return_tensors="pt", truncation=True, max_length=1024, padding=True, ) inputs = {k: v.to(self.model.device) for k, v in inputs.items()} outputs = self.model(**inputs) probs = torch.nn.functional.softmax(outputs.logits, dim=-1) for j in range(len(batch)): pred_class = torch.argmax(probs[j]).item() results.append({ "label": self.label_map[pred_class], "confidence": probs[j][pred_class].item(), "probabilities": { "non_energy": probs[j][0].item(), "energy": probs[j][1].item(), } }) return results # Usage classifier = EnergyClassifier() result = classifier.predict("Wind energy is the fastest growing renewable source.") print(result) ``` ### Processing Large Files ```python import json from tqdm import tqdm def classify_jsonl_file(input_file: str, output_file: str): classifier = EnergyClassifier() # Read all texts texts = [] with open(input_file, 'r') as f: for line in f: data = json.loads(line) texts.append(data['text']) # Classify in batches results = classifier.predict_batch(texts, batch_size=16) # Write results with open(input_file, 'r') as fin, open(output_file, 'w') as fout: for line, result in tqdm(zip(fin, results), total=len(texts)): data = json.loads(line) data['predicted_label'] = result['label'] data['confidence'] = result['confidence'] data['energy_prob'] = result['probabilities']['energy'] fout.write(json.dumps(data) + '\n') # Process your dataset classify_jsonl_file('documents.jsonl', 'documents_classified.jsonl') ``` ## 🎓 Training Code The training code is available at: [GitHub Repository](https://github.com/EnergyAI/energy-classifier) To reproduce the training: ```bash # Clone repository git clone https://github.com/EnergyAI/energy-classifier.git cd energy-classifier # Install dependencies pip install -r requirements.txt # Prepare your data (train.jsonl, val.jsonl, test.jsonl) # Format: {"text": "document text", "label": 0 or 1} # Train python train.py --config configs/training_config.yaml ``` ## 📋 Requirements ```txt torch>=2.0.0 transformers>=4.40.0 peft>=0.10.0 accelerate>=0.28.0 safetensors>=0.4.0 ``` ## ⚡ Performance Tips ### For Maximum Speed: ```python # Use fp16 instead of bfloat16 if your GPU supports it model = AutoModelForSequenceClassification.from_pretrained( "EnergyAI/Llama-3.1-8B-Energy-Classifier", device_map="auto", torch_dtype=torch.float16, # Faster on some GPUs ) # Enable torch.compile (PyTorch 2.0+) model = torch.compile(model, mode="reduce-overhead") ``` ### For Lower Memory: ```python # Use 8-bit quantization from transformers import BitsAndBytesConfig quantization_config = BitsAndBytesConfig(load_in_8bit=True) model = AutoModelForSequenceClassification.from_pretrained( "EnergyAI/Llama-3.1-8B-Energy-Classifier", quantization_config=quantization_config, device_map="auto", ) ``` ## 📊 Benchmark Results ### Inference Speed (on NVIDIA A100 GPU) | Batch Size | Throughput (docs/sec) | Latency (ms/doc) | |------------|----------------------|------------------| | 1 | 12.3 | 81.3 | | 8 | 78.4 | 10.2 | | 16 | 134.7 | 5.9 | | 32 | 198.5 | 3.2 | ### Memory Usage - **Model Size**: 161 MB (LoRA adapters only) - **Peak GPU Memory** (bf16): ~18 GB (includes base model) - **Peak GPU Memory** (8-bit): ~10 GB ## 🤝 Citation If you use this model in your research, please cite: ```bibtex @misc{llama31-energy-classifier, author = {EnergyAI Team}, title = {Llama-3.1-8B Energy Document Classifier}, year = {2025}, publisher = {HuggingFace}, howpublished = {\url{https://huggingface.co/EnergyAI/Llama-3.1-8B-Energy-Classifier}}, } ``` ## 📄 License This model is released under the **Llama 3.1 Community License Agreement**. See the [license](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE) for details. **Base Model**: [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) ## 🔗 Links - **Model Card**: https://huggingface.co/EnergyAI/Llama-3.1-8B-Energy-Classifier - **GitHub**: https://github.com/EnergyAI/energy-classifier - **Demo**: [HuggingFace Spaces](https://huggingface.co/spaces/EnergyAI/energy-classifier-demo) - **Dataset**: Available upon request ## 👥 Contact For questions or issues: - Open an issue on [GitHub](https://github.com/EnergyAI/energy-classifier/issues) - Contact: energyai@example.com ## 🙏 Acknowledgments - **Meta AI** for the Llama 3.1 base model - **HuggingFace** for the transformers and PEFT libraries - **Research team** for dataset curation and annotation --- **Happy Classifying! 🔋⚡**