# Multi-GPU Training with 2× RTX 5090s

## 🚀 **Quick Start**

### **Multi-GPU Training (Recommended)**

```bash
torchrun --nproc_per_node=2 train.py --config runs/humigence/config.snapshot.json
```

### **Single GPU Training (Fallback)**

```bash
python train.py --config runs/humigence/config.snapshot.json --fallback_single_gpu
```

## 🔧 **Features**

### **Multi-GPU Support**

- ✅ **NCCL Backend**: Stable distributed training
- ✅ **2× RTX 5090s**: Full utilization of both GPUs
- ✅ **Automatic Detection**: Detects available GPUs
- ✅ **Process Synchronization**: Proper rank management

### **Environment Hardening**

- ✅ **NCCL Debug**: `NCCL_DEBUG=INFO` for troubleshooting
- ✅ **IB Disabled**: `NCCL_IB_DISABLE=1` prevents InfiniBand issues
- ✅ **P2P Disabled**: `NCCL_P2P_DISABLE=1` prevents peer-to-peer issues
- ✅ **Async Error Handling**: `NCCL_ASYNC_ERROR_HANDLING=1` surfaces collective failures instead of hanging
- ✅ **Tokenizer Safety**: `TOKENIZERS_PARALLELISM=false` prevents fork warnings

### **Graceful Fallback**

- ✅ **Automatic Fallback**: Falls back to single GPU if multi-GPU fails
- ✅ **Clear Warnings**: Shows when fallback is triggered
- ✅ **No Data Loss**: Training continues seamlessly
- ✅ **Error Recovery**: Handles NCCL errors gracefully

### **Device Consistency**

- ✅ **Training**: Each process uses its local rank device (`cuda:local_rank`)
- ✅ **Evaluation**: Always uses `cuda:0` for a fresh model reload
- ✅ **No Mixing**: No tensors mixed between `cuda:0` and `cuda:1`
- ✅ **Synchronization**: Proper process synchronization

## 📊 **Training Modes**

### **Multi-GPU Mode (Default)**

```
🚀 Starting multi-GPU training on 2 GPUs
✅ Distributed training initialized
Rank: 0/1
Local Rank: 0
World Size: 2
Device: cuda:0
🚀 Starting multi-GPU training on 2 GPUs
✅ Distributed training initialized
Rank: 1/1
Local Rank: 1
World Size: 2
Device: cuda:1
```

### **Single GPU Fallback**

```
⚠️ NCCL multi-GPU failed, falling back to single GPU training on cuda:0.
```
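The per-process device assignment used in both modes can be sketched in plain Python. `torchrun` exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` to each worker process; a helper along these lines (illustrative, not the actual `train.py` code) derives its device string from them and defaults to a single-GPU layout when they are absent:

```python
import os

def resolve_device(env=os.environ):
    """Derive rank/device info as described above: torchrun exports
    RANK, LOCAL_RANK, and WORLD_SIZE; without them, assume a plain
    single-process run on cuda:0. (Hypothetical helper for
    illustration, not the launcher's real internals.)"""
    rank = int(env.get("RANK", 0))
    local_rank = int(env.get("LOCAL_RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    return {
        "distributed": world_size > 1,
        "rank": rank,
        "local_rank": local_rank,
        "world_size": world_size,
        "device": f"cuda:{local_rank}",  # each process pins to its own GPU
    }
```

Under `torchrun --nproc_per_node=2`, the rank-1 worker resolves to `cuda:1`; launched as bare `python train.py`, the same helper yields the single-GPU `cuda:0` layout.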
```
🚀 Starting single GPU training on cuda:0
✅ Single GPU Training: cuda:0
Device: cuda:0
```

## 🛠️ **Configuration**

### **Multi-GPU Configuration**

The launcher automatically sets:

```python
config = {
    "distributed": True,
    "rank": 0,  # or 1
    "world_size": 2,
    "local_rank": 0,  # or 1
    "device": "cuda:0",  # or "cuda:1"
    "per_device_train_batch_size": 4,  # per GPU
    "per_device_eval_batch_size": 8,  # per GPU
}
```

### **Single GPU Configuration**

```python
config = {
    "distributed": False,
    "rank": 0,
    "world_size": 1,
    "local_rank": 0,
    "device": "cuda:0",
    "per_device_train_batch_size": 8,  # doubled for single GPU
    "per_device_eval_batch_size": 16,  # doubled for single GPU
}
```

## 🔍 **Troubleshooting**

### **Common Issues**

1. **NCCL Initialization Failed**

   ```
   ❌ Distributed training initialization failed: NCCL Error
   ⚠️ NCCL multi-GPU failed, falling back to single GPU training on cuda:0.
   ```

   **Solution**: This is expected behavior; the launcher automatically falls back to single GPU.

2. **CUDA Out of Memory**

   ```
   ❌ CUDA out of memory
   ```

   **Solution**: Reduce `per_device_train_batch_size` in your config.

3. **Device Mismatch**

   ```
   ❌ Expected all tensors to be on the same device
   ```

   **Solution**: This should not happen with the new launcher. If it does, check that evaluation reloads the model fresh on `cuda:0`.

### **Debug Mode**

Set environment variables for debugging:

```bash
export NCCL_DEBUG=INFO
export CUDA_LAUNCH_BLOCKING=1
torchrun --nproc_per_node=2 train.py --config runs/humigence/config.snapshot.json
```

## 📈 **Performance**

### **Multi-GPU Benefits**

- **~2× Training Speed**: Near-linear speedup for data-parallel training on two GPUs
- **Larger Batch Sizes**: Larger effective batch sizes across both devices
- **Better Convergence**: Often better model performance
- **Memory Efficiency**: Distributes memory across GPUs

### **Single GPU Fallback**

- **Reliable**: Always works if multi-GPU fails
- **Simpler**: Easier to debug issues
- **Compatible**: Works with any setup

## 🎯 **Best Practices**
1. **Always use the launcher**: Don't run training directly
2. **Check GPU availability**: Ensure both GPUs are visible
3. **Monitor memory usage**: Watch for OOM errors
4. **Use appropriate batch sizes**: Start small and increase
5. **Check logs**: Look for NCCL warnings or errors

## 🚨 **Important Notes**

- **Evaluation always uses cuda:0**: A fresh model reload ensures device consistency
- **Training uses local rank devices**: Each process uses its assigned GPU
- **No tensor mixing**: Tensors never cross between `cuda:0` and `cuda:1`
- **Automatic fallback**: If multi-GPU fails, single GPU training continues
- **Process synchronization**: All processes are properly synchronized

## 🎉 **Summary**

The new training launcher provides:

- **Robust multi-GPU training** with NCCL
- **Graceful fallback** to single GPU
- **Device consistency** throughout training and evaluation
- **Professional logging** and error handling
- **Fool-proof operation** with automatic error recovery

No more `cuda:0` vs `cuda:1` mismatches, no deadlocks, no NCCL crashes without fallback! 🚀
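The fallback behavior summarized above can be sketched as follows. This is a minimal illustration with assumed names (`build_config`, `init_distributed`), not the launcher's actual implementation: attempt distributed initialization first, and on any failure drop to a single-GPU config with the per-device batch sizes doubled, matching the Configuration section:

```python
def build_config(init_distributed, train_bs=4, eval_bs=8):
    """Try multi-GPU first; on any init failure, fall back to one GPU
    and double the per-device batch sizes so the effective batch size
    stays constant. `init_distributed` stands in for the real NCCL
    setup call (e.g. torch.distributed.init_process_group) and is
    expected to return (local_rank, world_size) or raise."""
    try:
        local_rank, world_size = init_distributed()
        return {
            "distributed": True,
            "world_size": world_size,
            "device": f"cuda:{local_rank}",
            "per_device_train_batch_size": train_bs,
            "per_device_eval_batch_size": eval_bs,
        }
    except RuntimeError as err:
        # Mirrors the documented warning: training continues on cuda:0.
        print(f"⚠️ NCCL multi-GPU failed ({err}), "
              f"falling back to single GPU training on cuda:0.")
        return {
            "distributed": False,
            "world_size": 1,
            "device": "cuda:0",
            "per_device_train_batch_size": train_bs * 2,
            "per_device_eval_batch_size": eval_bs * 2,
        }
```

With a failing initializer this yields the single-GPU config with batch sizes 8/16 from the Configuration section; with a successful one, each rank keeps its own `cuda:local_rank` device and the per-GPU sizes 4/8.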