🔧 Troubleshooting Guide

Common issues and solutions for ULTRATHINK training.

Table of Contents

  • Installation Issues
  • Training Errors
  • Memory Issues
  • Performance Problems
  • Data Loading Issues
  • Distributed Training
  • Monitoring & Logging
  • Docker Issues
  • Getting Help
  • Debugging Checklist

Installation Issues

❌ ImportError: No module named 'flash_attn'

Cause: Flash Attention 2 not installed or incompatible with your CUDA version.

Solution:

# Check CUDA version
nvidia-smi

# Install Flash Attention 2 (requires CUDA 11.6+)
pip install flash-attn --no-build-isolation

# If build fails, disable Flash Attention
python train_ultrathink.py --no_flash_attention

Alternative: Use PyTorch's native SDPA:

# In your config
model:
  use_flash_attention: false
  use_sdpa: true  # PyTorch 2.0+ scaled dot product attention
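
For reference, this is roughly what the SDPA path calls under the hood. A minimal sketch, assuming PyTorch 2.0+ and a CUDA GPU; shapes and tensors are illustrative, not the project's internals:

import torch
import torch.nn.functional as F

# (batch, num_heads, seq_len, head_dim) -- illustrative shapes
q = torch.randn(2, 12, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 12, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 12, 128, 64, device="cuda", dtype=torch.float16)

# PyTorch dispatches to a fused kernel (flash / memory-efficient) when one is available
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)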

❌ CUDA out of memory during installation

Cause: Trying to build packages that require GPU memory.

Solution:

# Set environment variable to reduce memory usage
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

# Install with no cache
pip install --no-cache-dir -r requirements.txt

# Or install CPU-only first, then GPU packages
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install -r requirements.txt

❌ ModuleNotFoundError: No module named 'src'

Cause: Python can't find the src module.

Solution:

# Make sure you're in the correct directory
cd UltraThinking-LLM-Training/deep

# Install in development mode
pip install -e .

# Or add to PYTHONPATH
export PYTHONPATH="${PYTHONPATH}:$(pwd)"

Training Errors

❌ RuntimeError: Expected all tensors to be on the same device

Cause: Model and data are on different devices (CPU vs GPU).

Solution:

# Ensure device consistency
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}

In training script:

# Force CPU training
python train_ultrathink.py --device cpu

# Force GPU training
python train_ultrathink.py --device cuda

❌ NaN loss or loss becomes inf

Cause: Gradient explosion, learning rate too high, or numerical instability.

Solutions:

  1. Reduce learning rate:
python train_ultrathink.py --learning_rate 1e-4  # Instead of 3e-4
  2. Enable gradient clipping (see the sketch after this list):
python train_ultrathink.py --gradient_clip_norm 1.0
  3. Use mixed precision carefully:
# Try FP32 first
python train_ultrathink.py --no_amp

# Or use BF16 if supported (A100, H100)
python train_ultrathink.py --use_bf16
  4. Check for bad data:
# Add validation in data loading
def validate_batch(batch):
    for k, v in batch.items():
        if torch.isnan(v).any() or torch.isinf(v).any():
            raise ValueError(f"Invalid values in {k}")
    return batch
  5. Reduce batch size:
python train_ultrathink.py --batch_size 1 --gradient_accumulation_steps 8
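
For reference, this is roughly what --gradient_clip_norm does on every optimizer step. A minimal sketch; the tiny linear model is a stand-in, not the ULTRATHINK model:

import torch
import torch.nn as nn

model = nn.Linear(16, 16)                        # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

loss = model(torch.randn(4, 16)).pow(2).mean()   # dummy loss
loss.backward()

# Rescale gradients so their global norm is at most 1.0, then step
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad(set_to_none=True)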

❌ ValueError: Tokenizer not found

Cause: Tokenizer not downloaded or path incorrect.

Solution:

# Download tokenizer manually
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('gpt2')"

# Or specify tokenizer path
python train_ultrathink.py --tokenizer_name gpt2

# Use local tokenizer
python train_ultrathink.py --tokenizer_path ./my_tokenizer

❌ AssertionError: hidden_size must be divisible by num_heads

Cause: Invalid model configuration.

Solution:

# Ensure hidden_size is divisible by num_heads
# Example: hidden_size=768, num_heads=12 ✅
# Example: hidden_size=768, num_heads=10 ❌

python train_ultrathink.py --hidden_size 768 --num_heads 12

# Common valid combinations:
# 256 / 4 = 64
# 512 / 8 = 64
# 768 / 12 = 64
# 1024 / 16 = 64
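
A quick sanity check you can run before launching (the values below are examples, not a required configuration):

hidden_size, num_heads = 768, 12
assert hidden_size % num_heads == 0, "hidden_size must be divisible by num_heads"
head_dim = hidden_size // num_heads  # 64
print(f"head_dim = {head_dim}")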

Memory Issues

❌ CUDA out of memory during training

Immediate fixes (try in order):

  1. Reduce batch size (see the gradient accumulation sketch after this list):
python train_ultrathink.py --batch_size 1 --gradient_accumulation_steps 16
  2. Enable gradient checkpointing:
python train_ultrathink.py --gradient_checkpointing
  3. Use mixed precision:
python train_ultrathink.py --use_amp
  4. Reduce sequence length:
python train_ultrathink.py --max_seq_length 512  # Instead of 2048
  5. Use DeepSpeed ZeRO:
python train_ultrathink.py --use_deepspeed --deepspeed_config deepspeed_config_zero2.json
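
For reference, a minimal sketch of what gradient accumulation does: micro-batches of size 1 with 16 accumulation steps give an effective batch of 16. The linear model is a stand-in, not the ULTRATHINK model:

import torch
import torch.nn as nn

model = nn.Linear(16, 16)                   # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
accumulation_steps = 16

for step in range(32):
    x = torch.randn(1, 16)                               # micro-batch of size 1
    loss = model(x).pow(2).mean() / accumulation_steps   # scale loss per micro-batch
    loss.backward()                                      # gradients accumulate in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)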

Memory optimization checklist:

# In your training script
import torch

# Clear cache before training
torch.cuda.empty_cache()

# Enable memory efficient attention
model.config.use_flash_attention = True

# Reduce optimizer memory (use 8-bit Adam)
from bitsandbytes.optim import AdamW8bit
optimizer = AdamW8bit(model.parameters())

# Enable CPU offloading (slower but uses less GPU memory)
model.enable_cpu_offload()

❌ Memory leak (memory keeps increasing)

Cause: Accumulating gradients, keeping references to tensors, or logging too much.

Solutions:

  1. Clear gradients properly (a combined loop sketch follows this list):
optimizer.zero_grad(set_to_none=True)  # More memory efficient than zero_grad()
  2. Detach metrics:
loss_value = loss.detach().item()  # Don't keep computation graph
  3. Limit logging:
if step % 100 == 0:  # Log every 100 steps, not every step
    logger.log_metrics({"loss": loss.item()})
  4. Clear cache periodically:
if step % 1000 == 0:
    torch.cuda.empty_cache()
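
Putting the four fixes together, a minimal loop looks like this (stand-in model; assumes a CUDA device):

import torch
import torch.nn as nn

model = nn.Linear(16, 16).cuda()            # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
running_loss = 0.0

for step in range(2000):
    loss = model(torch.randn(8, 16, device="cuda")).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)   # frees gradient tensors
    running_loss += loss.detach().item()    # no computation graph kept alive
    if step % 1000 == 0:
        torch.cuda.empty_cache()            # periodically release cached blocks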

❌ RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

Cause: CUDA/cuDNN version mismatch or insufficient GPU memory.

Solution:

# Check CUDA version
nvidia-smi

# Reinstall PyTorch with correct CUDA version
pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Or restrict to one GPU and disable cuDNN inside the training script
export CUDA_VISIBLE_DEVICES=0
# In Python: torch.backends.cudnn.enabled = False
python train_ultrathink.py

Performance Problems

🐌 Training is very slow

Diagnosis:

# Add profiling to find bottleneck
python scripts/profile_model.py --size small

# Check GPU utilization
nvidia-smi -l 1  # Update every second
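
If scripts/profile_model.py is not available in your checkout, a minimal torch.profiler sketch (stand-in model, assumes a CUDA GPU) can show whether CPU or GPU time dominates:

import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(1024, 1024).cuda()   # stand-in model
x = torch.randn(8, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        model(x).sum().backward()

# Large CPU totals relative to CUDA totals point to a data/preprocessing bottleneck
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))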

Common causes and fixes:

  1. CPU bottleneck (data loading):
# Increase data loading workers
python train_ultrathink.py --num_workers 4 --prefetch_factor 2
  2. Small batch size:
# Increase effective batch size
python train_ultrathink.py --batch_size 8 --gradient_accumulation_steps 4
  3. Not using Flash Attention:
pip install flash-attn --no-build-isolation
python train_ultrathink.py --use_flash_attention
  4. Logging too frequently:
python train_ultrathink.py --log_interval 100  # Instead of 10
  5. Slow storage:
# Use local SSD instead of network storage
# Copy dataset to local disk first
cp -r /network/dataset /local/ssd/dataset
python train_ultrathink.py --dataset_path /local/ssd/dataset

🐌 Low GPU utilization (<50%)

Causes:

  • Data loading bottleneck
  • Small batch size
  • CPU preprocessing too slow

Solutions:

# Increase workers and prefetch
python train_ultrathink.py --num_workers 8 --prefetch_factor 4

# Use streaming datasets (no preprocessing)
python train_ultrathink.py --dataset c4 --streaming

# Increase batch size
python train_ultrathink.py --batch_size 16

# Pin memory for faster transfer
python train_ultrathink.py --pin_memory
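
These flags map roughly onto a standard PyTorch DataLoader. A minimal sketch with a stand-in dataset (names and sizes are illustrative):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 128))   # stand-in dataset

loader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=True,
    num_workers=8,        # parallel CPU workers for loading/preprocessing
    prefetch_factor=4,    # batches prefetched per worker (requires num_workers > 0)
    pin_memory=True,      # page-locked host memory for faster GPU transfer
)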

Data Loading Issues

❌ ConnectionError: Couldn't reach the Hugging Face Hub

Cause: Network issues or HF Hub down.

Solution:

# Use cached dataset
export HF_DATASETS_OFFLINE=1
python train_ultrathink.py --dataset wikitext

# Or download dataset manually
python -c "from datasets import load_dataset; load_dataset('wikitext', 'wikitext-103-v1')"

# Use local dataset
python train_ultrathink.py --dataset_path ./my_local_dataset

❌ Dataset is too large for disk

Solution: Use streaming mode

python train_ultrathink.py --dataset c4 --streaming

Or: Use a smaller subset

python train_ultrathink.py --dataset c4 --max_samples 100000

❌ KeyError: 'text' or wrong dataset format

Cause: Dataset doesn't have expected column names.

Solution:

# Check dataset structure
from datasets import load_dataset
dataset = load_dataset("your_dataset")
print(dataset.column_names)

# Map to correct format
def preprocess(examples):
    return {"text": examples["content"]}  # Rename column

dataset = dataset.map(preprocess)

In training script:

python train_ultrathink.py --text_column content  # Instead of default "text"

Distributed Training

❌ RuntimeError: Address already in use

Cause: Previous training process still running or port conflict.

Solution:

# Kill previous processes
pkill -f train_ultrathink.py

# Use different port
python train_ultrathink.py --master_port 29501

# Or let system choose port
python train_ultrathink.py --master_port 0

❌ Multi-GPU training hangs or crashes

Diagnosis:

# Test NCCL communication
python -m torch.distributed.run --nproc_per_node=2 scripts/test_distributed.py

# Check NCCL debug info
export NCCL_DEBUG=INFO
python train_ultrathink.py
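
If scripts/test_distributed.py is missing from your checkout, a minimal all-reduce check looks roughly like this (the file name nccl_check.py is just an example); launch it with torchrun --nproc_per_node=2 nccl_check.py:

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")   # torchrun provides rank/world size env vars
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

t = torch.ones(1, device="cuda") * rank
dist.all_reduce(t)                        # default op: sum across ranks
print(f"rank {rank}: all_reduce result = {t.item()}")
dist.destroy_process_group()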

Common fixes:

  1. Set correct backend:
export NCCL_SOCKET_IFNAME=eth0  # Or your network interface
export NCCL_IB_DISABLE=1  # Disable InfiniBand if not available
  2. Use correct launcher:
# Use torchrun (recommended)
torchrun --nproc_per_node=2 train_ultrathink.py

# Or accelerate
accelerate launch --num_processes=2 train_ultrathink.py
  3. Increase the collective timeout (see the sketch after this list):
# Pass timeout=timedelta(minutes=30) to torch.distributed.init_process_group
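
A minimal sketch of raising the collective timeout, assuming the training script calls init_process_group itself:

from datetime import timedelta
import torch.distributed as dist

# Give slow collectives (e.g. first-step warmup) more headroom before timing out
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=30))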

❌ RuntimeError: NCCL error: unhandled system error

Solution:

# Disable peer-to-peer access
export NCCL_P2P_DISABLE=1

# Use different NCCL backend
export NCCL_SOCKET_IFNAME=lo  # Loopback for single node

# Check GPU topology
nvidia-smi topo -m

Monitoring & Logging

❌ MLflow UI not starting

Solution:

# Check if port is in use
lsof -i :5000

# Use different port
mlflow ui --port 5001

# Or specify host
mlflow ui --host 0.0.0.0 --port 5000

❌ Weights & Biases not logging

Solution:

# Login to W&B
wandb login

# Check API key
echo $WANDB_API_KEY

# Disable W&B if not needed
export WANDB_MODE=disabled
python train_ultrathink.py

❌ TensorBoard shows no data

Solution:

# Check log directory
ls -la ./runs

# Start TensorBoard with correct path
tensorboard --logdir ./runs --port 6006

# Force reload
tensorboard --logdir ./runs --reload_interval 5
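
To confirm events are actually being written (the ./runs path and "debug" run name are assumptions for this check):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="./runs/debug")
writer.add_scalar("loss", 1.23, global_step=0)
writer.flush()
writer.close()
# Then run: tensorboard --logdir ./runs and look for the "debug" run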

Docker Issues

❌ docker: permission denied

Solution:

# Add user to docker group
sudo usermod -aG docker $USER
newgrp docker

# Or use sudo
sudo docker compose up

❌ Container runs out of memory

Solution:

# Increase Docker memory limit
# Docker Desktop: Settings → Resources → Memory

# Or use docker run with memory limit
docker run --memory=16g --gpus all ultrathink:latest

❌ GPU not available in Docker

Solution:

# Install nvidia-docker2
sudo apt-get install nvidia-docker2
sudo systemctl restart docker

# Test GPU access
docker run --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

# Run with GPU
docker run --gpus all ultrathink:latest

Getting Help

If you can't find a solution here:

  1. Check existing issues: GitHub Issues
  2. Search discussions: GitHub Discussions
  3. Enable debug logging:
export LOG_LEVEL=DEBUG
python train_ultrathink.py --verbose
  4. Create a minimal reproduction:
python train_ultrathink.py \
  --hidden_size 256 --num_layers 2 \
  --batch_size 1 --max_steps 10 \
  --dataset wikitext --max_samples 100
  5. Open an issue with:
    • Error message and full traceback
    • Your configuration (model size, hardware, etc.)
    • Steps to reproduce
    • Output of python --version, torch.__version__, nvidia-smi

Debugging Checklist

Before opening an issue, try:

  • Update to latest version: git pull && pip install -r requirements.txt
  • Clear cache: rm -rf ~/.cache/huggingface
  • Test with minimal config: --hidden_size 256 --num_layers 2 --batch_size 1
  • Check GPU: nvidia-smi
  • Check disk space: df -h
  • Check CUDA version: nvcc --version
  • Run tests: pytest tests/
  • Enable verbose logging: --verbose

Last Updated: January 2025
Version: 1.0.0

Found a solution not listed here? Contribute to this guide!