🔧 Troubleshooting Guide
Common issues and solutions for ULTRATHINK training.
Table of Contents
- Installation Issues
- Training Errors
- Memory Issues
- Performance Problems
- Data Loading Issues
- Distributed Training
- Monitoring & Logging
- Docker Issues
Installation Issues
❌ ImportError: No module named 'flash_attn'
Cause: Flash Attention 2 not installed or incompatible with your CUDA version.
Solution:
# Check CUDA version
nvidia-smi
# Install Flash Attention 2 (requires CUDA 11.6+)
pip install flash-attn --no-build-isolation
# If build fails, disable Flash Attention
python train_ultrathink.py --no_flash_attention
Alternative: Use PyTorch's native SDPA:
# In your config
model:
  use_flash_attention: false
  use_sdpa: true  # PyTorch 2.0+ scaled dot product attention
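To confirm SDPA is available in your environment before switching the config, a quick standalone check (plain PyTorch, not part of the ULTRATHINK codebase) is:
# Quick check that PyTorch's SDPA is available (requires PyTorch 2.0+)
import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 12, 128, 64)  # (batch, heads, seq_len, head_dim)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 12, 128, 64])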
❌ CUDA out of memory during installation
Cause: Trying to build packages that require GPU memory.
Solution:
# Set environment variable to reduce memory usage
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
# Install with no cache
pip install --no-cache-dir -r requirements.txt
# Or install CPU-only first, then GPU packages
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install -r requirements.txt
❌ ModuleNotFoundError: No module named 'src'
Cause: Python can't find the src module.
Solution:
# Make sure you're in the correct directory
cd UltraThinking-LLM-Training/deep
# Install in development mode
pip install -e .
# Or add to PYTHONPATH
export PYTHONPATH="${PYTHONPATH}:$(pwd)"
Training Errors
❌ RuntimeError: Expected all tensors to be on the same device
Cause: Model and data are on different devices (CPU vs GPU).
Solution:
# Ensure device consistency
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
In training script:
# Force CPU training
python train_ultrathink.py --device cpu
# Force GPU training
python train_ultrathink.py --device cuda
❌ NaN loss or loss becomes inf
Cause: Gradient explosion, learning rate too high, or numerical instability.
Solutions:
- Reduce learning rate:
python train_ultrathink.py --learning_rate 1e-4 # Instead of 3e-4
- Enable gradient clipping:
python train_ultrathink.py --gradient_clip_norm 1.0
- Use mixed precision carefully:
# Try FP32 first
python train_ultrathink.py --no_amp
# Or use BF16 if supported (A100, H100)
python train_ultrathink.py --use_bf16
- Check for bad data:
# Add validation in data loading
import torch

def validate_batch(batch):
    for k, v in batch.items():
        if torch.isnan(v).any() or torch.isinf(v).any():
            raise ValueError(f"Invalid values in {k}")
    return batch
- Reduce batch size:
python train_ultrathink.py --batch_size 1 --gradient_accumulation_steps 8
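If you are running a custom loop rather than train_ultrathink.py, the clipping and bad-data checks above can be combined into a guard that skips non-finite losses. A minimal self-contained sketch (toy model and data, illustrative only):
# Sketch: skip the optimizer step when the loss is non-finite (toy model, illustrative)
import torch

model = torch.nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(100):
    x, y = torch.randn(8, 16), torch.randn(8, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)

    if not torch.isfinite(loss):
        print(f"step {step}: non-finite loss, skipping update")
        optimizer.zero_grad(set_to_none=True)
        continue

    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)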
❌ ValueError: Tokenizer not found
Cause: Tokenizer not downloaded or path incorrect.
Solution:
# Download tokenizer manually
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('gpt2')"
# Or specify tokenizer path
python train_ultrathink.py --tokenizer_name gpt2
# Use local tokenizer
python train_ultrathink.py --tokenizer_path ./my_tokenizer
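To pre-download a tokenizer once and keep a local copy for offline runs (standard transformers API; the output path simply matches the flag above):
# Download a tokenizer and save it locally for offline use
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.save_pretrained("./my_tokenizer")
print(tokenizer("hello world"))  # sanity check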
❌ AssertionError: hidden_size must be divisible by num_heads
Cause: Invalid model configuration.
Solution:
# Ensure hidden_size is divisible by num_heads
# Example: hidden_size=768, num_heads=12 ✅
# Example: hidden_size=768, num_heads=10 ❌
python train_ultrathink.py --hidden_size 768 --num_heads 12
# Common valid combinations:
# 256 / 4 = 64
# 512 / 8 = 64
# 768 / 12 = 64
# 1024 / 16 = 64
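A configuration can also be checked before launching with a one-line divisibility test (illustrative, not a project utility):
# Illustrative check that the attention head configuration is valid
hidden_size, num_heads = 768, 12
assert hidden_size % num_heads == 0, "hidden_size must be divisible by num_heads"
print(f"head_dim = {hidden_size // num_heads}")  # 64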
Memory Issues
❌ CUDA out of memory during training
Immediate fixes (try in order):
- Reduce batch size:
python train_ultrathink.py --batch_size 1 --gradient_accumulation_steps 16
- Enable gradient checkpointing:
python train_ultrathink.py --gradient_checkpointing
- Use mixed precision:
python train_ultrathink.py --use_amp
- Reduce sequence length:
python train_ultrathink.py --max_seq_length 512 # Instead of 2048
- Use DeepSpeed ZeRO:
python train_ultrathink.py --use_deepspeed --deepspeed_config deepspeed_config_zero2.json
Memory optimization checklist:
# In your training script
import torch
# Clear cache before training
torch.cuda.empty_cache()
# Enable memory efficient attention
model.config.use_flash_attention = True
# Reduce optimizer memory (use 8-bit Adam)
from bitsandbytes.optim import AdamW8bit
optimizer = AdamW8bit(model.parameters())
# Enable CPU offloading (slower but uses less GPU memory)
model.enable_cpu_offload()
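To see where GPU memory is actually going, PyTorch's built-in counters can be printed between steps; this is plain PyTorch, independent of the ULTRATHINK trainer:
# Print GPU memory usage between training steps (plain PyTorch)
import torch

if torch.cuda.is_available():
    print(f"allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
    print(f"reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")
    print(torch.cuda.memory_summary(abbreviated=True))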
❌ Memory leak (memory keeps increasing)
Cause: Accumulating gradients, keeping references to tensors, or logging too much.
Solutions:
- Clear gradients properly:
optimizer.zero_grad(set_to_none=True) # More memory efficient than zero_grad()
- Detach metrics:
loss_value = loss.detach().item() # Don't keep computation graph
- Limit logging:
if step % 100 == 0:  # Log every 100 steps, not every step
    logger.log_metrics({"loss": loss.item()})
- Clear cache periodically:
if step % 1000 == 0:
    torch.cuda.empty_cache()
❌ RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
Cause: CUDA/cuDNN version mismatch or insufficient GPU memory.
Solution:
# Check CUDA version
nvidia-smi
# Reinstall PyTorch with correct CUDA version
pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Or disable cuDNN
export CUDA_VISIBLE_DEVICES=0
export CUDNN_ENABLED=0
python train_ultrathink.py
Performance Problems
🐌 Training is very slow
Diagnosis:
# Add profiling to find bottleneck
python scripts/profile_model.py --size small
# Check GPU utilization
nvidia-smi -l 1 # Update every second
Common causes and fixes:
- CPU bottleneck (data loading):
# Increase data loading workers
python train_ultrathink.py --num_workers 4 --prefetch_factor 2
- Small batch size:
# Increase effective batch size
python train_ultrathink.py --batch_size 8 --gradient_accumulation_steps 4
- Not using Flash Attention:
pip install flash-attn --no-build-isolation
python train_ultrathink.py --use_flash_attention
- Logging too frequently:
python train_ultrathink.py --log_interval 100 # Instead of 10
- Slow storage:
# Use local SSD instead of network storage
# Copy dataset to local disk first
cp -r /network/dataset /local/ssd/dataset
python train_ultrathink.py --dataset_path /local/ssd/dataset
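If scripts/profile_model.py does not pinpoint the bottleneck, torch.profiler can be wrapped around a few steps; the model and data below are placeholders, not the ULTRATHINK model:
# Generic torch.profiler sketch (placeholder model and data)
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 512))
x = torch.randn(32, 512)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True) as prof:
    for _ in range(10):
        model(x).sum().backward()

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))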
📉 Low GPU utilization (<50%)
Causes:
- Data loading bottleneck
- Small batch size
- CPU preprocessing too slow
Solutions:
# Increase workers and prefetch
python train_ultrathink.py --num_workers 8 --prefetch_factor 4
# Use streaming datasets (no preprocessing)
python train_ultrathink.py --dataset c4 --streaming
# Increase batch size
python train_ultrathink.py --batch_size 16
# Pin memory for faster transfer
python train_ultrathink.py --pin_memory
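These flags correspond to standard PyTorch DataLoader arguments; for reference, the plain-PyTorch equivalent looks roughly like this (the dataset is a stand-in):
# Rough DataLoader equivalent of the flags above (stand-in dataset)
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 512))
loader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=True,
    num_workers=8,      # parallel data-loading workers
    prefetch_factor=4,  # batches prefetched per worker (requires num_workers > 0)
    pin_memory=True,    # faster host-to-GPU transfers
)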
Data Loading Issues
❌ ConnectionError: Couldn't reach the Hugging Face Hub
Cause: Network issues or HF Hub down.
Solution:
# Use cached dataset
export HF_DATASETS_OFFLINE=1
python train_ultrathink.py --dataset wikitext
# Or download dataset manually
python -c "from datasets import load_dataset; load_dataset('wikitext', 'wikitext-103-v1')"
# Use local dataset
python train_ultrathink.py --dataset_path ./my_local_dataset
❌ Dataset is too large for disk
Solution: Use streaming mode
python train_ultrathink.py --dataset c4 --streaming
Or: Use a smaller subset
python train_ultrathink.py --dataset c4 --max_samples 100000
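Streaming mode in the datasets library iterates over examples without materializing the full dataset on disk; a standalone illustration (on the Hub, C4 is hosted under the allenai namespace):
# Stream a large dataset without downloading it in full
from datasets import load_dataset

stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
for i, example in enumerate(stream):
    print(example["text"][:80])
    if i >= 2:
        break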
❌ KeyError: 'text' or wrong dataset format
Cause: Dataset doesn't have expected column names.
Solution:
# Check dataset structure
from datasets import load_dataset
dataset = load_dataset("your_dataset")
print(dataset.column_names)
# Map to correct format
def preprocess(examples):
    return {"text": examples["content"]}  # Rename column
dataset = dataset.map(preprocess)
In training script:
python train_ultrathink.py --text_column content # Instead of default "text"
Distributed Training
❌ RuntimeError: Address already in use
Cause: Previous training process still running or port conflict.
Solution:
# Kill previous processes
pkill -f train_ultrathink.py
# Use different port
python train_ultrathink.py --master_port 29501
# Or let system choose port
python train_ultrathink.py --master_port 0
❌ Multi-GPU training hangs or crashes
Diagnosis:
# Test NCCL communication
python -m torch.distributed.run --nproc_per_node=2 scripts/test_distributed.py
# Check NCCL debug info
export NCCL_DEBUG=INFO
python train_ultrathink.py
Common fixes:
- Set correct backend:
export NCCL_SOCKET_IFNAME=eth0 # Or your network interface
export NCCL_IB_DISABLE=1 # Disable InfiniBand if not available
- Use correct launcher:
# Use torchrun (recommended)
torchrun --nproc_per_node=2 train_ultrathink.py
# Or accelerate
accelerate launch --num_processes=2 train_ultrathink.py
- Increase timeout:
export NCCL_TIMEOUT=1800 # 30 minutes
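If scripts/test_distributed.py is not available in your checkout, a minimal all-reduce sanity check (a generic sketch, not the project's script) can be launched with torchrun:
# Minimal NCCL/Gloo sanity check; run with: torchrun --nproc_per_node=2 ddp_check.py
import torch
import torch.distributed as dist

def main():
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)
    rank = dist.get_rank()
    device = "cpu"
    if backend == "nccl":
        torch.cuda.set_device(rank % torch.cuda.device_count())
        device = "cuda"
    t = torch.ones(1, device=device) * rank
    dist.all_reduce(t)  # default op is SUM across all ranks
    print(f"rank {rank}: all_reduce result = {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()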
❌ RuntimeError: NCCL error: unhandled system error
Solution:
# Disable peer-to-peer access
export NCCL_P2P_DISABLE=1
# Use different NCCL backend
export NCCL_SOCKET_IFNAME=lo # Loopback for single node
# Check GPU topology
nvidia-smi topo -m
Monitoring & Logging
❌ MLflow UI not starting
Solution:
# Check if port is in use
lsof -i :5000
# Use different port
mlflow ui --port 5001
# Or specify host
mlflow ui --host 0.0.0.0 --port 5000
❌ Weights & Biases not logging
Solution:
# Login to W&B
wandb login
# Check API key
echo $WANDB_API_KEY
# Disable W&B if not needed
export WANDB_MODE=disabled
python train_ultrathink.py
❌ TensorBoard shows no data
Solution:
# Check log directory
ls -la ./runs
# Start TensorBoard with correct path
tensorboard --logdir ./runs --port 6006
# Force reload
tensorboard --logdir ./runs --reload_interval 5
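To confirm that TensorBoard logging itself works, a tiny standalone writer test (independent of the trainer; the log directory matches the default ./runs above) can be run first:
# Standalone TensorBoard smoke test
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="./runs/smoke_test")
for step in range(10):
    writer.add_scalar("debug/constant", 1.0, step)
writer.close()
# Then open it with: tensorboard --logdir ./runs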
Docker Issues
❌ docker: permission denied
Solution:
# Add user to docker group
sudo usermod -aG docker $USER
newgrp docker
# Or use sudo
sudo docker compose up
❌ Container runs out of memory
Solution:
# Increase Docker memory limit
# Docker Desktop: Settings → Resources → Memory
# Or use docker run with memory limit
docker run --memory=16g --gpus all ultrathink:latest
❌ GPU not available in Docker
Solution:
# Install nvidia-docker2
sudo apt-get install nvidia-docker2
sudo systemctl restart docker
# Test GPU access
docker run --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
# Run with GPU
docker run --gpus all ultrathink:latest
Getting Help
If you can't find a solution here:
- Check existing issues: GitHub Issues
- Search discussions: GitHub Discussions
- Enable debug logging:
export LOG_LEVEL=DEBUG
python train_ultrathink.py --verbose
- Create a minimal reproduction:
python train_ultrathink.py \
--hidden_size 256 --num_layers 2 \
--batch_size 1 --max_steps 10 \
--dataset wikitext --max_samples 100
- Open an issue with:
- Error message and full traceback
- Your configuration (model size, hardware, etc.)
- Steps to reproduce
- Output of python --version, torch.__version__, and nvidia-smi
Debugging Checklist
Before opening an issue, try:
- Update to latest version: git pull && pip install -r requirements.txt
- Clear cache: rm -rf ~/.cache/huggingface
- Test with minimal config: --hidden_size 256 --num_layers 2 --batch_size 1
- Check GPU: nvidia-smi
- Check disk space: df -h
- Check CUDA version: nvcc --version
- Run tests: pytest tests/
- Enable verbose logging: --verbose
Last Updated: January 2025
Version: 1.0.0
Found a solution not listed here? Contribute to this guide!