
Multi-GPU Training with 2× RTX 5090s

🚀 Quick Start

Multi-GPU Training (Recommended)

torchrun --nproc_per_node=2 train.py --config runs/humigence/config.snapshot.json

Single GPU Training (Fallback)

python train.py --config runs/humigence/config.snapshot.json --fallback_single_gpu

🔧 Features

Multi-GPU Support

  • ✅ NCCL Backend: Stable distributed training
  • ✅ 2× RTX 5090s: Full utilization of both GPUs
  • ✅ Automatic Detection: Detects available GPUs
  • ✅ Process Synchronization: Proper rank management
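
Under torchrun, each worker discovers its identity through environment variables. A minimal sketch of how a launcher can read them (the function name and defaults here are illustrative, not Humigence's actual API):

```python
import os

def read_distributed_env() -> dict:
    """Read the rank/world-size variables that torchrun exports.

    Falls back to a single-process layout when the variables are
    absent (e.g. a plain `python train.py` invocation).
    """
    return {
        "rank": int(os.environ.get("RANK", 0)),
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),
    }
```

With `torchrun --nproc_per_node=2`, each of the two processes sees `WORLD_SIZE=2` and its own `RANK`/`LOCAL_RANK`.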

Environment Hardening

  • ✅ NCCL Debug: NCCL_DEBUG=INFO for troubleshooting
  • ✅ IB Disabled: NCCL_IB_DISABLE=1 prevents InfiniBand issues
  • ✅ P2P Disabled: NCCL_P2P_DISABLE=1 prevents peer-to-peer transport issues
  • ✅ Async Error Handling: NCCL_ASYNC_ERROR_HANDLING=1 surfaces asynchronous NCCL failures as errors instead of hangs
  • ✅ Tokenizer Safety: TOKENIZERS_PARALLELISM=false prevents fork warnings
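
These variables can also be applied programmatically before the first NCCL call. A sketch (using `setdefault` so values you exported yourself win is a design choice here, not necessarily what the launcher does):

```python
import os

# The hardening variables listed above.
HARDENING_ENV = {
    "NCCL_DEBUG": "INFO",
    "NCCL_IB_DISABLE": "1",
    "NCCL_P2P_DISABLE": "1",
    "NCCL_ASYNC_ERROR_HANDLING": "1",
    "TOKENIZERS_PARALLELISM": "false",
}

def harden_environment() -> None:
    """Set the NCCL/tokenizer variables, keeping any values the
    user has already exported."""
    for key, value in HARDENING_ENV.items():
        os.environ.setdefault(key, value)
```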

Graceful Fallback

  • ✅ Automatic Fallback: Falls back to single GPU if multi-GPU fails
  • ✅ Clear Warnings: Shows when fallback is triggered
  • ✅ No Data Loss: Training continues seamlessly
  • ✅ Error Recovery: Handles NCCL errors gracefully
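
The fallback boils down to a small wrapper around distributed init. In this sketch the init step is injected as a callable so the shape of the logic is visible without importing torch; the names are illustrative:

```python
from typing import Callable

def init_with_fallback(init_distributed: Callable[[], None]) -> str:
    """Try multi-GPU init; on failure, warn and fall back to a
    single GPU instead of crashing.

    Returns "distributed" on success, or the single-GPU device string.
    """
    try:
        # e.g. torch.distributed.init_process_group(backend="nccl")
        init_distributed()
        return "distributed"
    except RuntimeError as err:  # NCCL failures surface as RuntimeError
        print(f"⚠️ NCCL multi-GPU failed ({err}), "
              "falling back to single GPU training on cuda:0.")
        return "cuda:0"
```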

Device Consistency

  • ✅ Training: Each process uses its local rank device (cuda:local_rank)
  • ✅ Evaluation: Always uses cuda:0 for fresh model reload
  • ✅ No Mixing: No tensors mixed between cuda:0 and cuda:1
  • ✅ Synchronization: Proper process synchronization
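
The device rules above reduce to two small helpers; a sketch, assuming the eval-on-cuda:0 convention described here:

```python
def training_device(local_rank: int) -> str:
    """Each training process pins its tensors to its own GPU."""
    return f"cuda:{local_rank}"

def evaluation_device() -> str:
    """Evaluation always reloads the model fresh on cuda:0, so no
    tensors are ever mixed across devices."""
    return "cuda:0"
```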

📊 Training Modes

Multi-GPU Mode (Default)

🚀 Starting multi-GPU training on 2 GPUs
✅ Distributed training initialized
   Rank: 0/1
   Local Rank: 0
   World Size: 2
   Device: cuda:0

🚀 Starting multi-GPU training on 2 GPUs
✅ Distributed training initialized
   Rank: 1/1
   Local Rank: 1
   World Size: 2
   Device: cuda:1

Single GPU Fallback

⚠️ NCCL multi-GPU failed, falling back to single GPU training on cuda:0.
🚀 Starting single GPU training on cuda:0
✅ Single GPU Training: cuda:0
   Device: cuda:0

πŸ› οΈ Configuration

Multi-GPU Configuration

The launcher automatically sets:

config = {
    "distributed": True,
    "rank": 0,  # or 1
    "world_size": 2,
    "local_rank": 0,  # or 1
    "device": "cuda:0",  # or "cuda:1"
    "per_device_train_batch_size": 4,  # Per GPU
    "per_device_eval_batch_size": 8,   # Per GPU
}

Single GPU Configuration

config = {
    "distributed": False,
    "rank": 0,
    "world_size": 1,
    "local_rank": 0,
    "device": "cuda:0",
    "per_device_train_batch_size": 8,  # Doubled for single GPU
    "per_device_eval_batch_size": 16,  # Doubled for single GPU
}
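
The per-device sizes are chosen so that the effective (global) batch size is identical in both modes; a quick check of the arithmetic:

```python
def effective_batch_size(per_device: int, world_size: int, grad_accum: int = 1) -> int:
    """Global batch size = per-GPU batch × number of GPUs × accumulation steps."""
    return per_device * world_size * grad_accum

# Multi-GPU:  4 per device × 2 GPUs = 8
# Single GPU: 8 per device × 1 GPU  = 8  (hence "doubled")
```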

πŸ” Troubleshooting

Common Issues

  1. NCCL Initialization Failed

    ❌ Distributed training initialization failed: NCCL Error
    ⚠️ NCCL multi-GPU failed, falling back to single GPU training on cuda:0.
    

    Solution: This is expected behavior. The launcher will automatically fall back to single GPU.

  2. CUDA Out of Memory

    ❌ CUDA out of memory
    

    Solution: Reduce per_device_train_batch_size in your config.

  3. Device Mismatch

    ❌ Expected all tensors to be on the same device
    

    Solution: This should not happen with the new launcher. If it does, check that evaluation is using fresh model reload.
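
For the out-of-memory case above, an alternative to simply shrinking the batch is to halve `per_device_train_batch_size` while doubling gradient accumulation, which preserves the effective batch size. A sketch (the helper name is illustrative):

```python
def rescale_for_oom(per_device: int, grad_accum: int) -> tuple[int, int]:
    """Halve the per-device batch and double accumulation so the
    effective batch size (per_device * grad_accum) is unchanged."""
    if per_device < 2:
        raise ValueError("per-device batch size is already at the minimum")
    return per_device // 2, grad_accum * 2
```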

Debug Mode

Set environment variables for debugging:

export NCCL_DEBUG=INFO
export CUDA_LAUNCH_BLOCKING=1
torchrun --nproc_per_node=2 train.py --config runs/humigence/config.snapshot.json

📈 Performance

Multi-GPU Benefits

  • Near-2× Training Speed: throughput scales close to linearly across the two GPUs
  • Larger Batch Sizes: the effective batch size is the per-device size times the number of GPUs
  • Smoother Convergence: larger effective batches can yield more stable gradient estimates
  • Memory Headroom: each GPU holds only its own shard of the batch

Single GPU Fallback

  • Reliable: Always works if multi-GPU fails
  • Simpler: Easier to debug issues
  • Compatible: Works with any setup

🎯 Best Practices

  1. Always use the launcher: Don't run training directly
  2. Check GPU availability: Ensure both GPUs are visible
  3. Monitor memory usage: Watch for OOM errors
  4. Use appropriate batch sizes: Start small and increase
  5. Check logs: Look for NCCL warnings or errors

🚨 Important Notes

  • Evaluation always uses cuda:0: Fresh model reload ensures device consistency
  • Training uses local rank devices: Each process uses its assigned GPU
  • No tensor mixing: Tensors never cross between cuda:0 and cuda:1
  • Automatic fallback: If multi-GPU fails, single GPU training continues
  • Process synchronization: All processes are properly synchronized

🎉 Summary

The new training launcher provides:

  • Robust multi-GPU training with NCCL
  • Graceful fallback to single GPU
  • Device consistency throughout training and evaluation
  • Professional logging and error handling
  • Foolproof operation with automatic error recovery

No more cuda:0 vs cuda:1 mismatches, no deadlocks, no NCCL crashes without fallback! 🚀