# Multi-GPU Training with 2× RTX 5090s
## Quick Start

### Multi-GPU Training (Recommended)

```bash
torchrun --nproc_per_node=2 train.py --config runs/humigence/config.snapshot.json
```

### Single GPU Training (Fallback)

```bash
python train.py --config runs/humigence/config.snapshot.json --fallback_single_gpu
```
## Features

### Multi-GPU Support

- ✅ **NCCL Backend**: Stable distributed training
- ✅ **2× RTX 5090s**: Full utilization of both GPUs
- ✅ **Automatic Detection**: Detects available GPUs
- ✅ **Process Synchronization**: Proper rank management
### Environment Hardening

- ✅ **NCCL Debug**: `NCCL_DEBUG=INFO` for troubleshooting
- ✅ **IB Disabled**: `NCCL_IB_DISABLE=1` prevents InfiniBand issues
- ✅ **P2P Disabled**: `NCCL_P2P_DISABLE=1` prevents peer-to-peer issues
- ✅ **Async Error Handling**: `NCCL_ASYNC_ERROR_HANDLING=1` for better error handling
- ✅ **Tokenizer Safety**: `TOKENIZERS_PARALLELISM=false` prevents fork warnings
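The hardening variables above can be applied in one place before any NCCL or tokenizer code runs. A minimal sketch (the `harden_environment` helper is illustrative, not part of the launcher's documented API; it deliberately avoids clobbering values the user has already set):

```python
import os

# Hypothetical helper: applies the hardening variables listed above
# without overriding values the user has already exported.
HARDENING = {
    "NCCL_DEBUG": "INFO",               # verbose NCCL logs for troubleshooting
    "NCCL_IB_DISABLE": "1",             # skip the InfiniBand transport
    "NCCL_P2P_DISABLE": "1",            # skip GPU peer-to-peer transfers
    "NCCL_ASYNC_ERROR_HANDLING": "1",   # surface asynchronous NCCL errors
    "TOKENIZERS_PARALLELISM": "false",  # avoid tokenizer fork warnings
}

def harden_environment(env=os.environ):
    """Set each hardening variable only if it is not already set."""
    for key, value in HARDENING.items():
        env.setdefault(key, value)
```

Using `setdefault` means an explicit `export NCCL_DEBUG=WARN` in the shell still wins over the launcher's default.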
### Graceful Fallback

- ✅ **Automatic Fallback**: Falls back to single GPU if multi-GPU fails
- ✅ **Clear Warnings**: Shows when the fallback is triggered
- ✅ **No Data Loss**: Training continues seamlessly
- ✅ **Error Recovery**: Handles NCCL errors gracefully
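The fallback policy can be sketched as a try/except around distributed initialization. This is a hedged sketch, not the launcher's actual code: `select_device` and `init_fn` are illustrative names, where `init_fn` stands in for the real init call (e.g. `torch.distributed.init_process_group` with the NCCL backend):

```python
# Illustrative fallback policy: try distributed init, and on failure
# warn and continue on a single GPU instead of crashing.
def select_device(local_rank, init_fn):
    """Return (distributed?, device string) after attempting init."""
    try:
        init_fn()  # would initialize the NCCL process group
        return True, f"cuda:{local_rank}"
    except RuntimeError as err:
        print(f"NCCL multi-GPU failed ({err}), "
              "falling back to single GPU training on cuda:0.")
        return False, "cuda:0"
```

Training then proceeds on whichever device comes back, so a failed NCCL initialization degrades to single-GPU training rather than aborting the run.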
### Device Consistency

- ✅ **Training**: Each process uses its local-rank device (`cuda:local_rank`)
- ✅ **Evaluation**: Always uses `cuda:0` for the fresh model reload
- ✅ **No Mixing**: No tensors mixed between `cuda:0` and `cuda:1`
- ✅ **Synchronization**: Proper process synchronization
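The device rules above reduce to two small mappings. A sketch, with hypothetical helper names (the launcher's real internals are not shown in this document):

```python
# Illustrative device-selection rules from the list above.
def training_device(local_rank: int) -> str:
    """Each training process pins its tensors to its own GPU."""
    return f"cuda:{local_rank}"

def evaluation_device() -> str:
    """Evaluation always reloads the model onto cuda:0."""
    return "cuda:0"
```

Because evaluation reloads the model fresh onto `cuda:0`, no tensor created on `cuda:1` during training ever reaches the evaluation path.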
## Training Modes

### Multi-GPU Mode (Default)

```text
Starting multi-GPU training on 2 GPUs
✅ Distributed training initialized
   Rank: 0/1
   Local Rank: 0
   World Size: 2
   Device: cuda:0

Starting multi-GPU training on 2 GPUs
✅ Distributed training initialized
   Rank: 1/1
   Local Rank: 1
   World Size: 2
   Device: cuda:1
```

### Single GPU Fallback

```text
⚠️ NCCL multi-GPU failed, falling back to single GPU training on cuda:0.
Starting single GPU training on cuda:0
✅ Single GPU Training: cuda:0
   Device: cuda:0
```
## Configuration

### Multi-GPU Configuration

The launcher automatically sets:

```python
config = {
    "distributed": True,
    "rank": 0,                          # or 1
    "world_size": 2,
    "local_rank": 0,                    # or 1
    "device": "cuda:0",                 # or "cuda:1"
    "per_device_train_batch_size": 4,   # per GPU
    "per_device_eval_batch_size": 8,    # per GPU
}
```
### Single GPU Configuration

```python
config = {
    "distributed": False,
    "rank": 0,
    "world_size": 1,
    "local_rank": 0,
    "device": "cuda:0",
    "per_device_train_batch_size": 8,   # doubled for single GPU
    "per_device_eval_batch_size": 16,   # doubled for single GPU
}
```
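The two configurations keep the global (effective) batch size constant: per-device size times world size gives 4 × 2 = 8 for training and 8 × 2 = 16 for evaluation, matching the doubled single-GPU values. A quick arithmetic check (the helper name is illustrative, not part of the launcher):

```python
def per_device_batch(global_batch: int, world_size: int) -> int:
    """Split a global batch size evenly across the participating GPUs."""
    if global_batch % world_size != 0:
        raise ValueError("global batch size must divide evenly across GPUs")
    return global_batch // world_size

# Training: global batch 8 -> 4 per GPU on 2 GPUs, 8 on a single GPU.
# Evaluation: global batch 16 -> 8 per GPU on 2 GPUs, 16 on a single GPU.
```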
## Troubleshooting

### Common Issues

**NCCL Initialization Failed**

```text
❌ Distributed training initialization failed: NCCL Error
⚠️ NCCL multi-GPU failed, falling back to single GPU training on cuda:0.
```

**Solution**: This is expected behavior. The launcher will automatically fall back to a single GPU.

**CUDA Out of Memory**

```text
❌ CUDA out of memory
```

**Solution**: Reduce `per_device_train_batch_size` in your config.

**Device Mismatch**

```text
❌ Expected all tensors to be on the same device
```

**Solution**: This should not happen with the new launcher. If it does, check that evaluation is using the fresh model reload.
### Debug Mode

Set environment variables for debugging:

```bash
export NCCL_DEBUG=INFO
export CUDA_LAUNCH_BLOCKING=1
torchrun --nproc_per_node=2 train.py --config runs/humigence/config.snapshot.json
```
## Performance

### Multi-GPU Benefits

- **2× Training Speed**: Approximately 2× faster training when the workload is compute-bound
- **Larger Batch Sizes**: Can use larger effective batch sizes
- **Better Convergence**: Larger effective batches can improve model quality
- **Memory Efficiency**: Distributes memory across GPUs

### Single GPU Fallback

- **Reliable**: Always works if multi-GPU fails
- **Simpler**: Easier to debug issues
- **Compatible**: Works with any setup
## Best Practices

- **Always use the launcher**: Don't run training scripts directly
- **Check GPU availability**: Ensure both GPUs are visible to CUDA
- **Monitor memory usage**: Watch for OOM errors
- **Use appropriate batch sizes**: Start small and increase
- **Check logs**: Look for NCCL warnings or errors
## Important Notes

- **Evaluation always uses `cuda:0`**: The fresh model reload ensures device consistency
- **Training uses local-rank devices**: Each process uses its assigned GPU
- **No tensor mixing**: Tensors never cross between `cuda:0` and `cuda:1`
- **Automatic fallback**: If multi-GPU fails, single GPU training continues
- **Process synchronization**: All processes are properly synchronized
## Summary

The new training launcher provides:

- **Robust multi-GPU training** with NCCL
- **Graceful fallback** to a single GPU
- **Device consistency** throughout training and evaluation
- **Clear logging** and error handling
- **Automatic error recovery** for fail-safe operation

No more `cuda:0` vs `cuda:1` mismatches, no deadlocks, and no NCCL crashes without a fallback.