# 🔧 Troubleshooting Guide

Common issues and solutions for ULTRATHINK training.

## Table of Contents

- [Installation Issues](#installation-issues)
- [Training Errors](#training-errors)
- [Memory Issues](#memory-issues)
- [Performance Problems](#performance-problems)
- [Data Loading Issues](#data-loading-issues)
- [Distributed Training](#distributed-training)
- [Monitoring & Logging](#monitoring--logging)
- [Docker Issues](#docker-issues)

---

## Installation Issues

### ❌ `ImportError: No module named 'flash_attn'`

**Cause**: Flash Attention 2 not installed or incompatible with your CUDA version.

**Solution**:
```bash
# Check CUDA version
nvidia-smi

# Install Flash Attention 2 (requires CUDA 11.6+)
pip install flash-attn --no-build-isolation

# If the build fails, disable Flash Attention
python train_ultrathink.py --no_flash_attention
```

**Alternative**: Use PyTorch's native SDPA:
```yaml
# In your config
model:
  use_flash_attention: false
  use_sdpa: true  # PyTorch 2.0+ scaled dot product attention
```

---

### ❌ `CUDA out of memory` during installation

**Cause**: Trying to build packages that require GPU memory.

**Solution**:
```bash
# Set environment variable to reduce memory usage
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

# Install with no cache
pip install --no-cache-dir -r requirements.txt

# Or install CPU-only first, then GPU packages
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install -r requirements.txt
```

---

### ❌ `ModuleNotFoundError: No module named 'src'`

**Cause**: Python can't find the `src` module.

**Solution**:
```bash
# Make sure you're in the correct directory
cd UltraThinking-LLM-Training/deep

# Install in development mode
pip install -e .

# Or add to PYTHONPATH
export PYTHONPATH="${PYTHONPATH}:$(pwd)"
```

---

## Training Errors

### ❌ `RuntimeError: Expected all tensors to be on the same device`

**Cause**: Model and data are on different devices (CPU vs GPU).

**Solution**:
```python
# Ensure device consistency
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
```

**In training script**:
```bash
# Force CPU training
python train_ultrathink.py --device cpu

# Force GPU training
python train_ultrathink.py --device cuda
```

---

### ❌ `NaN loss` or `loss becomes inf`

**Cause**: Gradient explosion, learning rate too high, or numerical instability.

**Solutions** (a combined sketch follows this list):

1. **Reduce learning rate**:
   ```bash
   python train_ultrathink.py --learning_rate 1e-4  # Instead of 3e-4
   ```

2. **Enable gradient clipping**:
   ```bash
   python train_ultrathink.py --gradient_clip_norm 1.0
   ```

3. **Use mixed precision carefully**:
   ```bash
   # Try FP32 first
   python train_ultrathink.py --no_amp

   # Or use BF16 if supported (A100, H100)
   python train_ultrathink.py --use_bf16
   ```

4. **Check for bad data**:
   ```python
   # Add validation in data loading
   def validate_batch(batch):
       for k, v in batch.items():
           if torch.isnan(v).any() or torch.isinf(v).any():
               raise ValueError(f"Invalid values in {k}")
       return batch
   ```

5. **Reduce batch size**:
   ```bash
   python train_ultrathink.py --batch_size 1 --gradient_accumulation_steps 8
   ```
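Several of these fixes can also be combined directly in the training step. Below is a minimal sketch, assuming a standard PyTorch loop; `model`, `batch`, and `optimizer` are placeholders for your own objects, not ULTRATHINK APIs:

```python
import torch

def training_step(model, batch, optimizer, max_norm=1.0):
    """One guarded optimization step: skips non-finite losses and clips gradients."""
    outputs = model(**batch)
    # Handle both HF-style outputs (with .loss) and plain tuples
    loss = outputs.loss if hasattr(outputs, "loss") else outputs[0]

    # Skip the step entirely if the loss is already NaN/inf
    if not torch.isfinite(loss):
        optimizer.zero_grad(set_to_none=True)
        return None

    loss.backward()
    # Clip the gradient norm to limit explosion (mirrors --gradient_clip_norm 1.0)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss.detach().item()
```

If the non-finite guard fires repeatedly, lower the learning rate or inspect the offending batch (for example with the `validate_batch` helper from step 4).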
---

### ❌ `ValueError: Tokenizer not found`

**Cause**: Tokenizer not downloaded or path incorrect.

**Solution**:
```bash
# Download tokenizer manually
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('gpt2')"

# Or specify the tokenizer name
python train_ultrathink.py --tokenizer_name gpt2

# Use a local tokenizer
python train_ultrathink.py --tokenizer_path ./my_tokenizer
```

---

### ❌ `AssertionError: hidden_size must be divisible by num_heads`

**Cause**: Invalid model configuration.

**Solution**:
```bash
# Ensure hidden_size is divisible by num_heads
# Example: hidden_size=768, num_heads=12 ✅
# Example: hidden_size=768, num_heads=10 ❌
python train_ultrathink.py --hidden_size 768 --num_heads 12

# Common valid combinations (head dimension = 64):
#  256 / 4  = 64
#  512 / 8  = 64
#  768 / 12 = 64
# 1024 / 16 = 64
```

---

## Memory Issues

### ❌ `CUDA out of memory` during training

**Immediate fixes** (try in order):

1. **Reduce batch size**:
   ```bash
   python train_ultrathink.py --batch_size 1 --gradient_accumulation_steps 16
   ```

2. **Enable gradient checkpointing**:
   ```bash
   python train_ultrathink.py --gradient_checkpointing
   ```

3. **Use mixed precision**:
   ```bash
   python train_ultrathink.py --use_amp
   ```

4. **Reduce sequence length**:
   ```bash
   python train_ultrathink.py --max_seq_length 512  # Instead of 2048
   ```

5. **Use DeepSpeed ZeRO**:
   ```bash
   python train_ultrathink.py --use_deepspeed --deepspeed_config deepspeed_config_zero2.json
   ```

**Memory optimization checklist**:
```python
# In your training script
import torch

# Clear cache before training
torch.cuda.empty_cache()

# Enable memory-efficient attention
model.config.use_flash_attention = True

# Reduce optimizer memory (use 8-bit Adam)
from bitsandbytes.optim import AdamW8bit
optimizer = AdamW8bit(model.parameters())

# Enable CPU offloading (slower but uses less GPU memory)
model.enable_cpu_offload()
```

---

### ❌ Memory leak (memory keeps increasing)

**Cause**: Accumulating gradients, keeping references to tensors, or logging too much.

**Solutions**:

1. **Clear gradients properly**:
   ```python
   optimizer.zero_grad(set_to_none=True)  # Frees gradient memory instead of zeroing it in place
   ```

2. **Detach metrics**:
   ```python
   loss_value = loss.detach().item()  # Don't keep the computation graph
   ```

3. **Limit logging**:
   ```python
   if step % 100 == 0:  # Log every 100 steps, not every step
       logger.log_metrics({"loss": loss.item()})
   ```

4. **Clear cache periodically**:
   ```python
   if step % 1000 == 0:
       torch.cuda.empty_cache()
   ```

---

### ❌ `RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED`

**Cause**: CUDA/cuDNN version mismatch or insufficient GPU memory.

**Solution**:
```bash
# Check CUDA version
nvidia-smi

# Reinstall PyTorch with the correct CUDA version
pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Or pin to a single GPU and disable cuDNN from Python
# (set torch.backends.cudnn.enabled = False before training)
export CUDA_VISIBLE_DEVICES=0
python train_ultrathink.py
```

---

## Performance Problems

### 🐌 Training is very slow

**Diagnosis**:
```bash
# Add profiling to find the bottleneck
python scripts/profile_model.py --size small

# Check GPU utilization
nvidia-smi -l 1  # Update every second
```

**Common causes and fixes** (a generic profiler sketch follows this list):

1. **CPU bottleneck (data loading)**:
   ```bash
   # Increase data loading workers
   python train_ultrathink.py --num_workers 4 --prefetch_factor 2
   ```

2. **Small batch size**:
   ```bash
   # Increase effective batch size
   python train_ultrathink.py --batch_size 8 --gradient_accumulation_steps 4
   ```

3. **Not using Flash Attention**:
   ```bash
   pip install flash-attn --no-build-isolation
   python train_ultrathink.py --use_flash_attention
   ```

4. **Logging too frequently**:
   ```bash
   python train_ultrathink.py --log_interval 100  # Instead of 10
   ```

5. **Slow storage**:
   ```bash
   # Use local SSD instead of network storage
   # Copy dataset to local disk first
   cp -r /network/dataset /local/ssd/dataset
   python train_ultrathink.py --dataset_path /local/ssd/dataset
   ```
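If the fixes above don't reveal where the time is going, a short `torch.profiler` run usually does. A minimal sketch, assuming `dataloader` and `train_step` already exist in your script (they are placeholders, not ULTRATHINK functions):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Profile a handful of steps and print the most expensive operations
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    for step, batch in enumerate(dataloader):
        train_step(batch)
        if step >= 10:
            break

# Sort by CUDA time; compare against CPU time to spot data-loading bottlenecks
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```

If CPU time dwarfs CUDA time in the table, the bottleneck is data loading or preprocessing; otherwise focus on the most expensive CUDA kernels.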
---

### 🐌 Low GPU utilization (<50%)

**Causes**:
- Data loading bottleneck
- Small batch size
- CPU preprocessing too slow

**Solutions**:
```bash
# Increase workers and prefetch
python train_ultrathink.py --num_workers 8 --prefetch_factor 4

# Use streaming datasets (no preprocessing)
python train_ultrathink.py --dataset c4 --streaming

# Increase batch size
python train_ultrathink.py --batch_size 16

# Pin memory for faster transfer
python train_ultrathink.py --pin_memory
```

---

## Data Loading Issues

### ❌ `ConnectionError: Couldn't reach the Hugging Face Hub`

**Cause**: Network issues or HF Hub down.

**Solution**:
```bash
# Use cached dataset
export HF_DATASETS_OFFLINE=1
python train_ultrathink.py --dataset wikitext

# Or download dataset manually
python -c "from datasets import load_dataset; load_dataset('wikitext', 'wikitext-103-v1')"

# Use local dataset
python train_ultrathink.py --dataset_path ./my_local_dataset
```

---

### ❌ Dataset is too large for disk

**Solution**: Use streaming mode
```bash
python train_ultrathink.py --dataset c4 --streaming
```

**Or**: Use a smaller subset
```bash
python train_ultrathink.py --dataset c4 --max_samples 100000
```

---

### ❌ `KeyError: 'text'` or wrong dataset format

**Cause**: Dataset doesn't have expected column names.

**Solution**:
```python
# Check dataset structure
from datasets import load_dataset
dataset = load_dataset("your_dataset")
print(dataset.column_names)

# Map to correct format
def preprocess(examples):
    return {"text": examples["content"]}  # Rename column

dataset = dataset.map(preprocess)
```

**In training script**:
```bash
python train_ultrathink.py --text_column content  # Instead of default "text"
```

---

## Distributed Training

### ❌ `RuntimeError: Address already in use`

**Cause**: Previous training process still running or port conflict.

**Solution**:
```bash
# Kill previous processes
pkill -f train_ultrathink.py

# Use different port
python train_ultrathink.py --master_port 29501

# Or let system choose port
python train_ultrathink.py --master_port 0
```

---

### ❌ Multi-GPU training hangs or crashes

**Diagnosis**:
```bash
# Test NCCL communication
python -m torch.distributed.run --nproc_per_node=2 scripts/test_distributed.py

# Check NCCL debug info
export NCCL_DEBUG=INFO
python train_ultrathink.py
```

**Common fixes**:

1. **Set correct backend**:
   ```bash
   export NCCL_SOCKET_IFNAME=eth0  # Or your network interface
   export NCCL_IB_DISABLE=1        # Disable InfiniBand if not available
   ```

2. **Use correct launcher**:
   ```bash
   # Use torchrun (recommended)
   torchrun --nproc_per_node=2 train_ultrathink.py

   # Or accelerate
   accelerate launch --num_processes=2 train_ultrathink.py
   ```

3. **Increase timeout**:
   ```bash
   export NCCL_TIMEOUT=1800  # 30 minutes
   ```
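If `scripts/test_distributed.py` is not present in your checkout, the following minimal NCCL smoke test can stand in for it. This is a sketch of a generic all-reduce check, not the project's own script:

```python
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes its rank id; after all_reduce every rank
    # should hold the sum 0 + 1 + ... + (world_size - 1)
    t = torch.tensor([float(dist.get_rank())], device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)

    expected = sum(range(dist.get_world_size()))
    print(f"rank {dist.get_rank()}: all_reduce result {t.item()} (expected {expected})")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Save it as, say, `test_dist.py` and run `torchrun --nproc_per_node=2 test_dist.py`. If it hangs or crashes, the problem is in the NCCL/network setup rather than in the training code.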
---

### ❌ `RuntimeError: NCCL error: unhandled system error`

**Solution**:
```bash
# Disable peer-to-peer access
export NCCL_P2P_DISABLE=1

# Use a different network interface
export NCCL_SOCKET_IFNAME=lo  # Loopback for single node

# Check GPU topology
nvidia-smi topo -m
```

---

## Monitoring & Logging

### ❌ MLflow UI not starting

**Solution**:
```bash
# Check if port is in use
lsof -i :5000

# Use different port
mlflow ui --port 5001

# Or specify host
mlflow ui --host 0.0.0.0 --port 5000
```

---

### ❌ Weights & Biases not logging

**Solution**:
```bash
# Login to W&B
wandb login

# Check API key
echo $WANDB_API_KEY

# Disable W&B if not needed
export WANDB_MODE=disabled
python train_ultrathink.py
```

---

### ❌ TensorBoard shows no data

**Solution**:
```bash
# Check log directory
ls -la ./runs

# Start TensorBoard with correct path
tensorboard --logdir ./runs --port 6006

# Force reload
tensorboard --logdir ./runs --reload_interval 5
```

---

## Docker Issues

### ❌ `docker: permission denied`

**Solution**:
```bash
# Add user to docker group
sudo usermod -aG docker $USER
newgrp docker

# Or use sudo
sudo docker compose up
```

---

### ❌ Container runs out of memory

**Solution**:
```bash
# Increase Docker memory limit
# Docker Desktop: Settings → Resources → Memory

# Or use docker run with a memory limit
docker run --memory=16g --gpus all ultrathink:latest
```

---

### ❌ GPU not available in Docker

**Solution**:
```bash
# Install nvidia-docker2
sudo apt-get install nvidia-docker2
sudo systemctl restart docker

# Test GPU access
docker run --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

# Run with GPU
docker run --gpus all ultrathink:latest
```

---

## Getting Help

If you can't find a solution here:

1. **Check existing issues**: [GitHub Issues](https://github.com/vediyappanm/UltraThinking-LLM-Training/issues)
2. **Search discussions**: [GitHub Discussions](https://github.com/vediyappanm/UltraThinking-LLM-Training/discussions)
3. **Enable debug logging**:
   ```bash
   export LOG_LEVEL=DEBUG
   python train_ultrathink.py --verbose
   ```
4. **Create a minimal reproduction**:
   ```bash
   python train_ultrathink.py \
     --hidden_size 256 --num_layers 2 \
     --batch_size 1 --max_steps 10 \
     --dataset wikitext --max_samples 100
   ```
5. **Open an issue** with:
   - Error message and full traceback
   - Your configuration (model size, hardware, etc.)
   - Steps to reproduce
   - Output of `python --version`, `torch.__version__`, and `nvidia-smi`

---

## Debugging Checklist

Before opening an issue, try:

- [ ] Update to the latest version: `git pull && pip install -r requirements.txt`
- [ ] Clear cache: `rm -rf ~/.cache/huggingface`
- [ ] Test with a minimal config: `--hidden_size 256 --num_layers 2 --batch_size 1`
- [ ] Check GPU: `nvidia-smi`
- [ ] Check disk space: `df -h`
- [ ] Check CUDA version: `nvcc --version`
- [ ] Run tests: `pytest tests/`
- [ ] Enable verbose logging: `--verbose`

---

**Last Updated**: January 2025
**Version**: 1.0.0

Found a solution not listed here? [Contribute to this guide!](CONTRIBUTING.md)