---
language: en
license: llama3.1
library_name: transformers
tags:
- text-classification
- energy
- document-classification
- llama-3.1
- lora
- peft
- binary-classification
- energy-documents
pipeline_tag: text-classification
widget:
- text: "Solar energy has become increasingly cost-competitive with fossil fuels in recent years. The price of photovoltaic panels has dropped significantly, making renewable energy more accessible."
  example_title: "Energy Document"
- text: "The committee discussed the implementation of new operational guidelines. Training sessions will be conducted for all staff members next month."
  example_title: "Non-Energy Document"
datasets:
- custom
metrics:
- accuracy
- f1
- precision
- recall
base_model: meta-llama/Llama-3.1-8B
model-index:
- name: Llama-3.1-8B-Energy-Classifier
  results:
  - task:
      type: text-classification
      name: Energy Document Classification
    dataset:
      name: Custom Energy Documents Dataset
      type: custom
    metrics:
    - type: accuracy
      value: 0.9839
      name: Test Accuracy
      verified: true
    - type: f1
      value: 0.9841
      name: Test F1 Score
      verified: true
    - type: precision
      value: 0.9717
      name: Test Precision
      verified: true
    - type: recall
      value: 0.9969
      name: Test Recall
      verified: true
    - type: roc_auc
      value: 0.9976
      name: ROC-AUC
      verified: true
---

# 🔋 Llama-3.1-8B Energy Document Classifier

A fine-tuned **Llama-3.1-8B** model for binary classification of energy-related documents, achieving **98.39% accuracy** on test data.

This model uses **LoRA (Low-Rank Adaptation)** for parameter-efficient fine-tuning, trained on **95,602 documents** (perfectly balanced: 47,801 energy + 47,801 non-energy).

## 📊 Model Performance

### Test Set Results (9,562 documents)

| Metric | Score |
|--------|-------|
| **Test Accuracy** | 98.39% |
| **F1 Score** | 98.41% |
| **Precision** | 97.17% |
| **Recall** | 99.69% |
| **ROC-AUC** | 99.76% |

### Validation Set Results (9,560 documents)

| Metric | Score |
|--------|-------|
| **Val Accuracy** | 98.55% |
| **Val F1 Score** | 98.56% |
| **Val Precision** | 97.54% |
| **Val Recall** | 99.60% |
| **Val ROC-AUC** | 99.76% |

### Confusion Matrix (Test Set - 9,562 documents)

|  | Predicted Non-Energy | Predicted Energy |
|--|---------------------|------------------|
| **Actual Non-Energy** | 4,642 (97.09%) | 139 (2.91%) |
| **Actual Energy** | 15 (0.31%) | 4,766 (99.69%) |

**Only 154 misclassifications out of 9,562 documents (1.61% error rate)!**

### Training Details

- **Base Model**: meta-llama/Llama-3.1-8B
- **Training Method**: LoRA (r=16, alpha=32, dropout=0.05)
- **Target Modules**: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- **Trainable Parameters**: 45M out of 8B (0.56%)
- **Total Dataset**: 95,602 documents (perfectly balanced)
  - Train: 76,480 (38,240 energy + 38,240 non-energy)
  - Val: 9,560 (4,780 energy + 4,780 non-energy)
  - Test: 9,562 (4,781 energy + 4,781 non-energy)
- **Energy Data Sources**:
  - EnergyAI/finepdfs_energy (40,989 docs)
  - EnergyAI/wikipedia_energy (5,459 docs)
  - EnergyAI/eartharxiv_engrxiv_energy (27 docs)
  - EnergyAI/scored_chunks_from_SPE_pipeline (1,326 docs)
- **Training Time**: ~2 hours on 4× A100 80GB GPUs
- **Convergence**: Early stopping at step 1,100 (< 1 epoch!)
- **Effective Batch Size**: 64 (per_device=4, gradient_accum=4, 4 GPUs)
- **Learning Rate**: 2e-5 with cosine schedule and 10% warmup
- **Precision**: bfloat16 mixed precision

### Data Curation

Energy-labeled documents were sourced from four HuggingFace datasets (see above). Classification labels for the training data were created with Mistral 3 Large model and this classifier was distilled from this data. Non-energy documents were sampled from a base document pipeline, with deduplication to ensure no overlap with energy documents (validated by both document ID and MD5 hash matching).

## 🎯 Use Cases

This model can classify documents as **energy-related** or **non-energy**. Perfect for:
- 📚 Research paper categorization
- 📰 News article filtering
- 📄 Document management systems
- 🔍 Content discovery and recommendation
- 🗂️ Dataset curation for energy research

**Energy Topics Covered:**
- Oil & Gas
- Renewable Energy (Solar, Wind, Hydro, Geothermal)
- Electricity & Power Systems
- Nuclear Energy
- Energy Policy & Economics
- Carbon & Climate (energy-related aspects)
- Energy Storage & Batteries

## 🚀 Quick Start

### Installation

```bash
pip install torch>=2.0.0 transformers>=4.44.0 peft>=0.12.0 accelerate>=0.28.0
```

**⚠️ Version Requirements:**
| Package | Recommended Version | Notes |
|---------|-------------------|-------|
| `torch` | ≥2.0.0 | CUDA support recommended |
| `transformers` | 4.44.0 - 4.57.x | Tested range |
| `peft` | 0.12.0 - 0.18.x | LoRA adapter loading |
| `accelerate` | ≥0.28.0 | For device_map="auto" |

**Note:** The tokenizer is loaded from the base model (`meta-llama/Llama-3.1-8B`). You need access to Llama-3.1 on Hugging Face (requires accepting the license agreement).

### Basic Usage

```python
from peft import AutoPeftModelForSequenceClassification
from transformers import AutoTokenizer
import torch

# Load model and tokenizer
model_name = "omar-elmansouri/Llama-3.1-8B-Energy-Classifier"

# Load tokenizer from base model (recommended for compatibility)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load the PEFT adapter model
model = AutoPeftModelForSequenceClassification.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model.config.pad_token_id = tokenizer.pad_token_id
model.eval()

# Classify a document
text = """
Solar energy capacity has grown exponentially over the past decade. 
The International Energy Agency reports that solar now represents 
the fastest-growing renewable energy source globally.
"""


inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=-1).item()

label_map = {0: "non_energy", 1: "energy"}
print(f"Prediction: {label_map[predicted_class]}")
print(f"Confidence: {predictions[0][predicted_class].item():.4f}")
print(f"Probabilities: Energy={predictions[0][1].item():.4f}, Non-Energy={predictions[0][0].item():.4f}")
```

**Output:**
```
Prediction: energy
Confidence: 0.9987
Probabilities: Energy=0.9987, Non-Energy=0.0013
```

### Batch Processing

```python
from transformers import pipeline

# Create classification pipeline
classifier = pipeline(
    "text-classification",
    model="EnergyAI/Llama-3.1-8B-Energy-Classifier",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Classify multiple documents
texts = [
    "Wind turbines are becoming more efficient with larger blade designs.",
    "The software development team completed the sprint planning meeting.",
    "Natural gas prices fluctuated amid geopolitical tensions in Europe.",
]

results = classifier(texts, truncation=True, max_length=512)
for text, result in zip(texts, results):
    print(f"Text: {text[:50]}...")
    print(f"Label: {result['label']}, Score: {result['score']:.4f}\n")
```

### Using PEFT for Efficient Loading

```python
from peft import AutoPeftModelForSequenceClassification
from transformers import AutoTokenizer
import torch

# Load with PEFT (more memory efficient)
model = AutoPeftModelForSequenceClassification.from_pretrained(
    "EnergyAI/Llama-3.1-8B-Energy-Classifier",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("EnergyAI/Llama-3.1-8B-Energy-Classifier")

# Classify
text = "Offshore wind farms are expanding along the Atlantic coast."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
    
print(f"Energy probability: {probs[0][1].item():.4f}")
```

## 🏗️ Model Architecture

### Base Model
- **Name**: meta-llama/Llama-3.1-8B
- **Parameters**: 8 Billion
- **Architecture**: Transformer-based causal language model
- **Context Length**: 128K tokens (using 1024 for classification)

### Fine-tuning Details
- **Method**: LoRA (Low-Rank Adaptation)
- **LoRA Rank (r)**: 16
- **LoRA Alpha**: 32
- **LoRA Dropout**: 0.05
- **Target Modules**: 
  - `q_proj`, `k_proj`, `v_proj`, `o_proj` (Attention)
  - `gate_proj`, `up_proj`, `down_proj` (MLP)
- **Trainable Parameters**: 41,951,232 (~0.52% of base model)
- **Task Type**: Sequence Classification (Binary)

## 📚 Training Details

### Training Data
- **Training Samples**: 30,534 documents
- **Validation Samples**: 3,816 documents
- **Test Samples**: 3,818 documents
- **Total**: 38,168 labeled documents
- **Class Distribution**: Balanced (50/50 energy/non-energy)

### Training Configuration
- **Epochs**: 3 (early stopped at 2.3)
- **Batch Size**: 4 per device × 4 GPUs × 4 gradient accumulation = 64 effective
- **Learning Rate**: 2e-5 (cosine schedule with 10% warmup)
- **Optimizer**: AdamW (fused)
- **Weight Decay**: 0.01
- **Max Gradient Norm**: 1.0
- **Mixed Precision**: bfloat16
- **Training Time**: ~93 minutes on 4× GPUs

### Hardware
- **GPUs**: 4× NVIDIA (high-memory GPUs)
- **Memory**: 200GB total
- **Framework**: PyTorch with HuggingFace Transformers & PEFT

### Training Metrics Evolution

| Step | Train Loss | Val Loss | Val Accuracy | Val F1 |
|------|-----------|----------|--------------|--------|
| 100 | 4.31 | 1.00 | 65.2% | 66.8% |
| 300 | 1.78 | 0.83 | 80.0% | 80.4% |
| 500 | 0.73 | 0.70 | 89.6% | 89.7% |
| 700 | 0.42 | 0.65 | 93.9% | 94.0% |
| 900 | 0.27 | 0.64 | 96.1% | 96.1% |
| **1100** | **0.21** | **0.63** | **98.2%** | **98.2%** |

## 💡 How It Works

### Label Definitions

- **Label 0 (non_energy)**: Documents that are **NOT** primarily about energy topics
  - Examples: General news, politics (non-energy), sports, culture, software, education
  
- **Label 1 (energy)**: Documents primarily discussing energy-related topics
  - Examples: 
    - "Solar panel efficiency reached new record highs..."
    - "OPEC announced production cuts affecting oil prices..."
    - "Nuclear reactor designs promise safer, cleaner energy..."
    - "Wind energy capacity doubled in the last five years..."

### Classification Process

1. **Input**: Document text (up to 1024 tokens)
2. **Tokenization**: Llama-3.1 tokenizer with left padding
3. **Model Forward Pass**: Through LoRA-adapted Llama-3.1-8B
4. **Output**: Binary logits → softmax probabilities
5. **Prediction**: Class with highest probability + confidence score

## 📦 Model Files

This repository contains:
- `adapter_config.json` - LoRA adapter configuration
- `adapter_model.safetensors` - Trained LoRA weights (161 MB)
- `tokenizer.json` - Tokenizer vocabulary
- `tokenizer_config.json` - Tokenizer configuration
- `special_tokens_map.json` - Special tokens mapping

**Note**: The base Llama-3.1-8B model will be downloaded automatically from HuggingFace.

## 🔧 Advanced Usage

### Custom Inference Script

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from typing import List, Dict

class EnergyClassifier:
    def __init__(self, model_name: str = "EnergyAI/Llama-3.1-8B-Energy-Classifier"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_name,
            device_map="auto",
            torch_dtype=torch.bfloat16,
        )
        self.model.eval()
        self.label_map = {0: "non_energy", 1: "energy"}
    
    @torch.no_grad()
    def predict(self, text: str, return_probs: bool = True) -> Dict:
        inputs = self.tokenizer(
            text,
            return_tensors="pt",
            truncation=True,
            max_length=1024,
            padding=True,
        )
        inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
        
        outputs = self.model(**inputs)
        probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predicted_class = torch.argmax(probs, dim=-1).item()
        
        result = {
            "label": self.label_map[predicted_class],
            "confidence": probs[0][predicted_class].item(),
        }
        
        if return_probs:
            result["probabilities"] = {
                "non_energy": probs[0][0].item(),
                "energy": probs[0][1].item(),
            }
        
        return result
    
    @torch.no_grad()
    def predict_batch(self, texts: List[str], batch_size: int = 8) -> List[Dict]:
        results = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            inputs = self.tokenizer(
                batch,
                return_tensors="pt",
                truncation=True,
                max_length=1024,
                padding=True,
            )
            inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
            
            outputs = self.model(**inputs)
            probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
            
            for j in range(len(batch)):
                pred_class = torch.argmax(probs[j]).item()
                results.append({
                    "label": self.label_map[pred_class],
                    "confidence": probs[j][pred_class].item(),
                    "probabilities": {
                        "non_energy": probs[j][0].item(),
                        "energy": probs[j][1].item(),
                    }
                })
        
        return results

# Usage
classifier = EnergyClassifier()
result = classifier.predict("Wind energy is the fastest growing renewable source.")
print(result)
```

### Processing Large Files

```python
import json
from tqdm import tqdm

def classify_jsonl_file(input_file: str, output_file: str):
    classifier = EnergyClassifier()
    
    # Read all texts
    texts = []
    with open(input_file, 'r') as f:
        for line in f:
            data = json.loads(line)
            texts.append(data['text'])
    
    # Classify in batches
    results = classifier.predict_batch(texts, batch_size=16)
    
    # Write results
    with open(input_file, 'r') as fin, open(output_file, 'w') as fout:
        for line, result in tqdm(zip(fin, results), total=len(texts)):
            data = json.loads(line)
            data['predicted_label'] = result['label']
            data['confidence'] = result['confidence']
            data['energy_prob'] = result['probabilities']['energy']
            fout.write(json.dumps(data) + '\n')

# Process your dataset
classify_jsonl_file('documents.jsonl', 'documents_classified.jsonl')
```

## 🎓 Training Code

The training code is available at: [GitHub Repository](https://github.com/EnergyAI/energy-classifier)

To reproduce the training:

```bash
# Clone repository
git clone https://github.com/EnergyAI/energy-classifier.git
cd energy-classifier

# Install dependencies
pip install -r requirements.txt

# Prepare your data (train.jsonl, val.jsonl, test.jsonl)
# Format: {"text": "document text", "label": 0 or 1}

# Train
python train.py --config configs/training_config.yaml
```

## 📋 Requirements

```txt
torch>=2.0.0
transformers>=4.40.0
peft>=0.10.0
accelerate>=0.28.0
safetensors>=0.4.0
```

## ⚡ Performance Tips

### For Maximum Speed:
```python
# Use fp16 instead of bfloat16 if your GPU supports it
model = AutoModelForSequenceClassification.from_pretrained(
    "EnergyAI/Llama-3.1-8B-Energy-Classifier",
    device_map="auto",
    torch_dtype=torch.float16,  # Faster on some GPUs
)

# Enable torch.compile (PyTorch 2.0+)
model = torch.compile(model, mode="reduce-overhead")
```

### For Lower Memory:
```python
# Use 8-bit quantization
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "EnergyAI/Llama-3.1-8B-Energy-Classifier",
    quantization_config=quantization_config,
    device_map="auto",
)
```

## 📊 Benchmark Results

### Inference Speed (on NVIDIA A100 GPU)

| Batch Size | Throughput (docs/sec) | Latency (ms/doc) |
|------------|----------------------|------------------|
| 1 | 12.3 | 81.3 |
| 8 | 78.4 | 10.2 |
| 16 | 134.7 | 5.9 |
| 32 | 198.5 | 3.2 |

### Memory Usage

- **Model Size**: 161 MB (LoRA adapters only)
- **Peak GPU Memory** (bf16): ~18 GB (includes base model)
- **Peak GPU Memory** (8-bit): ~10 GB

## 🤝 Citation

If you use this model in your research, please cite:

```bibtex
@misc{llama31-energy-classifier,
  author = {EnergyAI Team},
  title = {Llama-3.1-8B Energy Document Classifier},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/EnergyAI/Llama-3.1-8B-Energy-Classifier}},
}
```

## 📄 License

This model is released under the **Llama 3.1 Community License Agreement**. See the [license](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE) for details.

**Base Model**: [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B)

## 🔗 Links

- **Model Card**: https://huggingface.co/EnergyAI/Llama-3.1-8B-Energy-Classifier
- **GitHub**: https://github.com/EnergyAI/energy-classifier
- **Demo**: [HuggingFace Spaces](https://huggingface.co/spaces/EnergyAI/energy-classifier-demo)
- **Dataset**: Available upon request

## 👥 Contact

For questions or issues:
- Open an issue on [GitHub](https://github.com/EnergyAI/energy-classifier/issues)
- Contact: energyai@example.com

## 🙏 Acknowledgments

- **Meta AI** for the Llama 3.1 base model
- **HuggingFace** for the transformers and PEFT libraries
- **Research team** for dataset curation and annotation

---

**Happy Classifying! 🔋⚡**