🔧 Troubleshooting Guide
Common issues and solutions for ULTRATHINK training.
Table of Contents
- Installation Issues
- Training Errors
- Memory Issues
- Performance Problems
- Data Loading Issues
- Distributed Training
- Monitoring & Logging
- Docker Issues
Installation Issues
❌ ImportError: No module named 'flash_attn'
Cause: Flash Attention 2 not installed or incompatible with your CUDA version.
Solution:
# Check CUDA version
nvidia-smi
# Install Flash Attention 2 (requires CUDA 11.6+)
pip install flash-attn --no-build-isolation
# If build fails, disable Flash Attention
python train_ultrathink.py --no_flash_attention
Alternative: Use PyTorch's native SDPA:
# In your config
model:
  use_flash_attention: false
  use_sdpa: true  # PyTorch 2.0+ scaled dot product attention
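To confirm SDPA is available in your environment before switching the config, a quick standalone check (plain PyTorch, not part of the ULTRATHINK codebase) is:
# Quick check that PyTorch's SDPA is available (requires PyTorch 2.0+)
import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 12, 128, 64)  # (batch, heads, seq_len, head_dim)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 12, 128, 64])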
❌ CUDA out of memory during installation
Cause: Trying to build packages that require GPU memory.
Solution:
# Set environment variable to reduce memory usage
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
# Install with no cache
pip install --no-cache-dir -r requirements.txt
# Or install CPU-only first, then GPU packages
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install -r requirements.txt
❌ ModuleNotFoundError: No module named 'src'
Cause: Python can't find the src module.
Solution:
# Make sure you're in the correct directory
cd UltraThinking-LLM-Training/deep
# Install in development mode
pip install -e .
# Or add to PYTHONPATH
export PYTHONPATH="${PYTHONPATH}:$(pwd)"
Training Errors
❌ RuntimeError: Expected all tensors to be on the same device
Cause: Model and data are on different devices (CPU vs GPU).
Solution:
# Ensure device consistency
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
In training script:
# Force CPU training
python train_ultrathink.py --device cpu
# Force GPU training
python train_ultrathink.py --device cuda
❌ NaN loss or loss becomes inf
Cause: Gradient explosion, learning rate too high, or numerical instability.
Solutions:
- Reduce learning rate:
python train_ultrathink.py --learning_rate 1e-4 # Instead of 3e-4
- Enable gradient clipping:
python train_ultrathink.py --gradient_clip_norm 1.0
- Use mixed precision carefully:
# Try FP32 first
python train_ultrathink.py --no_amp
# Or use BF16 if supported (A100, H100)
python train_ultrathink.py --use_bf16
- Check for bad data:
# Add validation in data loading
import torch

def validate_batch(batch):
    for k, v in batch.items():
        if torch.isnan(v).any() or torch.isinf(v).any():
            raise ValueError(f"Invalid values in {k}")
    return batch
- Reduce batch size:
python train_ultrathink.py --batch_size 1 --gradient_accumulation_steps 8
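If you are running a custom loop rather than train_ultrathink.py, the clipping and bad-data checks above can be combined into a guard that skips non-finite losses. A minimal self-contained sketch (toy model and data, illustrative only):
# Sketch: skip the optimizer step when the loss is non-finite (toy model, illustrative)
import torch

model = torch.nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(100):
    x, y = torch.randn(8, 16), torch.randn(8, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)

    if not torch.isfinite(loss):
        print(f"step {step}: non-finite loss, skipping update")
        optimizer.zero_grad(set_to_none=True)
        continue

    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)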
❌ ValueError: Tokenizer not found
Cause: Tokenizer not downloaded or path incorrect.
Solution:
# Download tokenizer manually
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('gpt2')"
# Or specify tokenizer path
python train_ultrathink.py --tokenizer_name gpt2
# Use local tokenizer
python train_ultrathink.py --tokenizer_path ./my_tokenizer
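To pre-download a tokenizer once and keep a local copy for offline runs (standard transformers API; the output path simply matches the flag above):
# Download a tokenizer and save it locally for offline use
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.save_pretrained("./my_tokenizer")
print(tokenizer("hello world"))  # sanity check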
❌ AssertionError: hidden_size must be divisible by num_heads
Cause: Invalid model configuration.
Solution:
# Ensure hidden_size is divisible by num_heads
# Example: hidden_size=768, num_heads=12 ✅
# Example: hidden_size=768, num_heads=10 ❌
python train_ultrathink.py --hidden_size 768 --num_heads 12
# Common valid combinations:
# 256 / 4 = 64
# 512 / 8 = 64
# 768 / 12 = 64
# 1024 / 16 = 64
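A configuration can also be checked before launching with a one-line divisibility test (illustrative, not a project utility):
# Illustrative check that the attention head configuration is valid
hidden_size, num_heads = 768, 12
assert hidden_size % num_heads == 0, "hidden_size must be divisible by num_heads"
print(f"head_dim = {hidden_size // num_heads}")  # 64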
Memory Issues
❌ CUDA out of memory during training
Immediate fixes (try in order):
- Reduce batch size:
python train_ultrathink.py --batch_size 1 --gradient_accumulation_steps 16
- Enable gradient checkpointing:
python train_ultrathink.py --gradient_checkpointing
- Use mixed precision:
python train_ultrathink.py --use_amp
- Reduce sequence length:
python train_ultrathink.py --max_seq_length 512 # Instead of 2048
- Use DeepSpeed ZeRO:
python train_ultrathink.py --use_deepspeed --deepspeed_config deepspeed_config_zero2.json
Memory optimization checklist:
# In your training script
import torch
# Clear cache before training
torch.cuda.empty_cache()
# Enable memory efficient attention
model.config.use_flash_attention = True
# Reduce optimizer memory (use 8-bit Adam)
from bitsandbytes.optim import AdamW8bit
optimizer = AdamW8bit(model.parameters())
# Enable CPU offloading (slower but uses less GPU memory)
model.enable_cpu_offload()
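To see where GPU memory is actually going, PyTorch's built-in counters can be printed between steps; this is plain PyTorch, independent of the ULTRATHINK trainer:
# Print GPU memory usage between training steps (plain PyTorch)
import torch

if torch.cuda.is_available():
    print(f"allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
    print(f"reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")
    print(torch.cuda.memory_summary(abbreviated=True))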
❌ Memory leak (memory keeps increasing)
Cause: Accumulating gradients, keeping references to tensors, or logging too much.
Solutions:
- Clear gradients properly:
optimizer.zero_grad(set_to_none=True) # More memory efficient than zero_grad()
- Detach metrics:
loss_value = loss.detach().item() # Don't keep computation graph
- Limit logging:
if step % 100 == 0:  # Log every 100 steps, not every step
    logger.log_metrics({"loss": loss.item()})
- Clear cache periodically:
if step % 1000 == 0:
    torch.cuda.empty_cache()
❌ RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
Cause: CUDA/cuDNN version mismatch or insufficient GPU memory.
Solution:
# Check CUDA version
nvidia-smi
# Reinstall PyTorch with correct CUDA version
pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Or disable cuDNN
export CUDA_VISIBLE_DEVICES=0
export CUDNN_ENABLED=0
python train_ultrathink.py
Performance Problems
🐌 Training is very slow
Diagnosis:
# Add profiling to find bottleneck
python scripts/profile_model.py --size small
# Check GPU utilization
nvidia-smi -l 1 # Update every second
Common causes and fixes:
- CPU bottleneck (data loading):
# Increase data loading workers
python train_ultrathink.py --num_workers 4 --prefetch_factor 2
- Small batch size:
# Increase effective batch size
python train_ultrathink.py --batch_size 8 --gradient_accumulation_steps 4
- Not using Flash Attention:
pip install flash-attn --no-build-isolation
python train_ultrathink.py --use_flash_attention
- Logging too frequently:
python train_ultrathink.py --log_interval 100 # Instead of 10
- Slow storage:
# Use local SSD instead of network storage
# Copy dataset to local disk first
cp -r /network/dataset /local/ssd/dataset
python train_ultrathink.py --dataset_path /local/ssd/dataset
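If scripts/profile_model.py does not pinpoint the bottleneck, torch.profiler can be wrapped around a few steps; the model and data below are placeholders, not the ULTRATHINK model:
# Generic torch.profiler sketch (placeholder model and data)
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 512))
x = torch.randn(32, 512)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True) as prof:
    for _ in range(10):
        model(x).sum().backward()

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))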
📉 Low GPU utilization (<50%)
Causes:
- Data loading bottleneck
- Small batch size
- CPU preprocessing too slow
Solutions:
# Increase workers and prefetch
python train_ultrathink.py --num_workers 8 --prefetch_factor 4
# Use streaming datasets (no preprocessing)
python train_ultrathink.py --dataset c4 --streaming
# Increase batch size
python train_ultrathink.py --batch_size 16
# Pin memory for faster transfer
python train_ultrathink.py --pin_memory
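These flags correspond to standard PyTorch DataLoader arguments; for reference, the plain-PyTorch equivalent looks roughly like this (the dataset is a stand-in):
# Rough DataLoader equivalent of the flags above (stand-in dataset)
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 512))
loader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=True,
    num_workers=8,      # parallel data-loading workers
    prefetch_factor=4,  # batches prefetched per worker (requires num_workers > 0)
    pin_memory=True,    # faster host-to-GPU transfers
)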
Data Loading Issues
❌ ConnectionError: Couldn't reach the Hugging Face Hub
Cause: Network issues or HF Hub down.
Solution:
# Use cached dataset
export HF_DATASETS_OFFLINE=1
python train_ultrathink.py --dataset wikitext
# Or download dataset manually
python -c "from datasets import load_dataset; load_dataset('wikitext', 'wikitext-103-v1')"
# Use local dataset
python train_ultrathink.py --dataset_path ./my_local_dataset
❌ Dataset is too large for disk
Solution: Use streaming mode
python train_ultrathink.py --dataset c4 --streaming
Or: Use a smaller subset
python train_ultrathink.py --dataset c4 --max_samples 100000
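Streaming mode in the datasets library iterates over examples without materializing the full dataset on disk; a standalone illustration (on the Hub, C4 is hosted under the allenai namespace):
# Stream a large dataset without downloading it in full
from datasets import load_dataset

stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
for i, example in enumerate(stream):
    print(example["text"][:80])
    if i >= 2:
        break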
❌ KeyError: 'text' or wrong dataset format
Cause: Dataset doesn't have expected column names.
Solution:
# Check dataset structure
from datasets import load_dataset
dataset = load_dataset("your_dataset")
print(dataset.column_names)
# Map to correct format
def preprocess(examples):
    return {"text": examples["content"]}  # Rename column
dataset = dataset.map(preprocess)
In training script:
python train_ultrathink.py --text_column content # Instead of default "text"
Distributed Training
❌ RuntimeError: Address already in use
Cause: Previous training process still running or port conflict.
Solution:
# Kill previous processes
pkill -f train_ultrathink.py
# Use different port
python train_ultrathink.py --master_port 29501
# Or let system choose port
python train_ultrathink.py --master_port 0
❌ Multi-GPU training hangs or crashes
Diagnosis:
# Test NCCL communication
python -m torch.distributed.run --nproc_per_node=2 scripts/test_distributed.py
# Check NCCL debug info
export NCCL_DEBUG=INFO
python train_ultrathink.py
Common fixes:
- Set correct backend:
export NCCL_SOCKET_IFNAME=eth0 # Or your network interface
export NCCL_IB_DISABLE=1 # Disable InfiniBand if not available
- Use correct launcher:
# Use torchrun (recommended)
torchrun --nproc_per_node=2 train_ultrathink.py
# Or accelerate
accelerate launch --num_processes=2 train_ultrathink.py
- Increase timeout:
export NCCL_TIMEOUT=1800 # 30 minutes
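If scripts/test_distributed.py is not available in your checkout, a minimal all-reduce sanity check (a generic sketch, not the project's script) can be launched with torchrun:
# Minimal NCCL/Gloo sanity check; run with: torchrun --nproc_per_node=2 ddp_check.py
import torch
import torch.distributed as dist

def main():
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)
    rank = dist.get_rank()
    device = "cpu"
    if backend == "nccl":
        torch.cuda.set_device(rank % torch.cuda.device_count())
        device = "cuda"
    t = torch.ones(1, device=device) * rank
    dist.all_reduce(t)  # default op is SUM across all ranks
    print(f"rank {rank}: all_reduce result = {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()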
❌ RuntimeError: NCCL error: unhandled system error
Solution:
# Disable peer-to-peer access
export NCCL_P2P_DISABLE=1
# Use different NCCL backend
export NCCL_SOCKET_IFNAME=lo # Loopback for single node
# Check GPU topology
nvidia-smi topo -m
Monitoring & Logging
❌ MLflow UI not starting
Solution:
# Check if port is in use
lsof -i :5000
# Use different port
mlflow ui --port 5001
# Or specify host
mlflow ui --host 0.0.0.0 --port 5000
❌ Weights & Biases not logging
Solution:
# Login to W&B
wandb login
# Check API key
echo $WANDB_API_KEY
# Disable W&B if not needed
export WANDB_MODE=disabled
python train_ultrathink.py
❌ TensorBoard shows no data
Solution:
# Check log directory
ls -la ./runs
# Start TensorBoard with correct path
tensorboard --logdir ./runs --port 6006
# Force reload
tensorboard --logdir ./runs --reload_interval 5
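To confirm that TensorBoard logging itself works, a tiny standalone writer test (independent of the trainer; the log directory matches the default ./runs above) can be run first:
# Standalone TensorBoard smoke test
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="./runs/smoke_test")
for step in range(10):
    writer.add_scalar("debug/constant", 1.0, step)
writer.close()
# Then open it with: tensorboard --logdir ./runs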
Docker Issues
❌ docker: permission denied
Solution:
# Add user to docker group
sudo usermod -aG docker $USER
newgrp docker
# Or use sudo
sudo docker compose up
❌ Container runs out of memory
Solution:
# Increase Docker memory limit
# Docker Desktop: Settings → Resources → Memory
# Or use docker run with memory limit
docker run --memory=16g --gpus all ultrathink:latest
❌ GPU not available in Docker
Solution:
# Install nvidia-docker2
sudo apt-get install nvidia-docker2
sudo systemctl restart docker
# Test GPU access
docker run --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
# Run with GPU
docker run --gpus all ultrathink:latest
Getting Help
If you can't find a solution here:
- Check existing issues: GitHub Issues
- Search discussions: GitHub Discussions
- Enable debug logging:
export LOG_LEVEL=DEBUG
python train_ultrathink.py --verbose
- Create a minimal reproduction:
python train_ultrathink.py \
--hidden_size 256 --num_layers 2 \
--batch_size 1 --max_steps 10 \
--dataset wikitext --max_samples 100
- Open an issue with:
- Error message and full traceback
- Your configuration (model size, hardware, etc.)
- Steps to reproduce
- Output of python --version, torch.__version__, and nvidia-smi
Debugging Checklist
Before opening an issue, try:
- Update to latest version: git pull && pip install -r requirements.txt
- Clear cache: rm -rf ~/.cache/huggingface
- Test with minimal config: --hidden_size 256 --num_layers 2 --batch_size 1
- Check GPU: nvidia-smi
- Check disk space: df -h
- Check CUDA version: nvcc --version
- Run tests: pytest tests/
- Enable verbose logging: --verbose
Last Updated: January 2025
Version: 1.0.0
Found a solution not listed here? Contribute to this guide!