# ULTRATHINK Advanced Training Guide

Complete guide for training the ULTRATHINK model with all advanced features.

## Table of Contents

- [Quick Start](#quick-start)
- [Training Profiles](#training-profiles)
- [Advanced Features](#advanced-features)
- [Configuration System](#configuration-system)
- [Training Environments](#training-environments)
- [Monitoring & Debugging](#monitoring--debugging)
- [Best Practices](#best-practices)
- [Troubleshooting](#troubleshooting)
- [Example Training Workflows](#example-training-workflows)
- [Performance Benchmarks](#performance-benchmarks)

---

## Quick Start

### Local Training

#### Option 1: Using Training Profiles

```bash
# Windows batch wrapper
# Small model (for testing)
scripts\run_training.bat --profile small

# Medium model (production-ready)
scripts\run_training.bat --profile medium

# Large model (full-scale)
scripts\run_training.bat --profile large
```

#### Option 2: Using Configuration Files

```bash
python train_advanced.py --config configs/train_small.yaml
```

#### Option 3: Direct Command Line

```bash
python train_ultrathink.py \
  --use_mlflow \
  --dataset wikitext \
  --hidden_size 512 \
  --num_layers 6 \
  --num_heads 8 \
  --batch_size 4 \
  --enable_moe \
  --enable_dre
```

### Google Colab Training

1. Open `colab_training.ipynb` in Google Colab
2. Choose your GPU runtime (T4/V100/A100)
3. Select a configuration cell and run it
4. Monitor progress in the MLflow UI or logs

---

## Training Profiles

### Small Profile

**Hardware**: Local machines, 4-8GB VRAM
**Use Case**: Testing, development, prototyping

```yaml
Model Size: 512 hidden, 6 layers
Features: Basic transformer
Training Time: ~2-4 hours (10K samples)
Memory: ~6GB VRAM
```

**When to use**:
- Testing new features
- Debugging the training pipeline
- Quick experiments
- Limited hardware resources

### Medium Profile

**Hardware**: Single GPU (16-32GB), cloud instances
**Use Case**: Production models, research

```yaml
Model Size: 2048 hidden, 24 layers
Features: MoE, DRE, Constitutional AI
Training Time: ~1-2 days (100K samples)
Memory: ~20GB VRAM
```

**When to use**:
- Production deployments
- Research experiments
- Competitive benchmarks
- Full feature validation

### Large Profile

**Hardware**: Multi-GPU, A100/H100 clusters
**Use Case**: State-of-the-art models

```yaml
Model Size: 4096 hidden, 32 layers
Features: All advanced features + Multimodal
Training Time: ~1-2 weeks (1M+ samples)
Memory: ~40-80GB VRAM
```

**When to use**:
- Frontier model development
- Large-scale pretraining
- Multimodal applications
- Maximum performance

---

## Advanced Features

### 1. Mixture of Experts (MoE)

**What it does**: Routes inputs to specialized expert networks for efficient scaling (a routing sketch follows at the end of this subsection).

**Enable**:
```yaml
advanced:
  enable_moe: true
  moe:
    num_knowledge_experts: 32
    num_skill_experts: 16
    num_meta_experts: 8
    num_safety_experts: 4
    moe_top_k: 2
```

**Benefits**:
- ✅ 5-10x parameter scaling with minimal compute increase
- ✅ Specialized knowledge domains
- ✅ Better performance on diverse tasks

**Considerations**:
- Requires more memory for expert parameters
- Best with batch_size >= 4
- Works with expert parallelism at large scale
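To make `moe_top_k` concrete, the sketch below shows generic top-k token routing of the kind most MoE layers use: a small gate scores every expert for each token, and only the k highest-scoring experts run. The `TopKRouter` class and the uniform expert MLPs are illustrative assumptions, not the actual ULTRATHINK modules (which, per the config above, also distinguish knowledge/skill/meta/safety experts).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Illustrative top-k MoE routing: each token is sent to its k best experts."""

    def __init__(self, hidden_size: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts)   # router logits per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_size, 4 * hidden_size),
                          nn.GELU(),
                          nn.Linear(4 * hidden_size, hidden_size))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden) -> flatten to a list of tokens
        tokens = x.reshape(-1, x.size(-1))
        logits = self.gate(tokens)                         # (tokens, experts)
        weights, indices = logits.topk(self.top_k, dim=-1) # keep only the k best experts
        weights = F.softmax(weights, dim=-1)               # normalize over the selected experts

        out = torch.zeros_like(tokens)
        for expert_id, expert in enumerate(self.experts):
            for slot in range(self.top_k):
                mask = indices[:, slot] == expert_id       # tokens routed to this expert
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(tokens[mask])
        return out.reshape_as(x)
```

With `moe_top_k: 2`, each token activates only two expert MLPs, which is why the parameter count can grow 5-10x while per-token compute stays close to that of a dense layer of the same hidden size.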
### 2. Dynamic Reasoning Engine (DRE)

**What it does**: Chooses adaptive computational paths based on input complexity.

**Enable**:
```yaml
advanced:
  enable_dre: true
  dre_warmup_steps: 5000
```

**Benefits**:
- ✅ Adaptive reasoning depth
- ✅ Better on complex problems
- ✅ Improved efficiency

**Considerations**:
- Use warmup to stabilize training
- May increase training time initially
- Best for reasoning-heavy tasks

### 3. Constitutional AI

**What it does**: Built-in safety and alignment mechanisms.

**Enable**:
```yaml
advanced:
  enable_constitutional: true
```

**Benefits**:
- ✅ Safer outputs
- ✅ Better alignment
- ✅ Reduced harmful content

**Considerations**:
- Adds overhead (~5-10%)
- Best combined with RLHF
- Requires safety datasets

### 4. RLHF (Reinforcement Learning from Human Feedback)

**What it does**: Fine-tunes the model based on human preferences.

**Enable**:
```yaml
advanced:
  enable_rlhf: true
  rlhf:
    rlhf_frequency: 5
    rlhf_iterations: 100
    ppo_epochs: 4
```

**When to use**:
- After pretraining
- For instruction following
- For alignment fine-tuning

**Process**:
1. Pretrain the model without RLHF
2. Save a checkpoint
3. Resume with RLHF enabled
4. Fine-tune with preference data

### 5. Multimodal Capabilities

**What it does**: Processes images, audio, and text together.

**Enable**:
```yaml
advanced:
  enable_multimodal: true
  multimodal:
    image_size: 224
    patch_size: 14
    audio_sample_rate: 16000
```

**Requirements**:
- Multimodal datasets
- More memory (for images/audio)
- Vision/audio encoders

---

## Configuration System

### YAML Configuration Structure

```yaml
# Model architecture
model:
  vocab_size: 100352
  hidden_size: 2048
  num_layers: 24
  ...

# Advanced features
advanced:
  enable_moe: true
  enable_dre: true
  ...

# Training hyperparameters
training:
  batch_size: 8
  learning_rate: 1e-4
  ...

# Data configuration
data:
  dataset: wikitext
  ...

# Logging and monitoring
logging:
  use_mlflow: true
  ...
```

### Override Configuration Values

```bash
python train_advanced.py \
  --config configs/train_medium.yaml \
  --override \
    training.batch_size=4 \
    model.hidden_size=1024 \
    advanced.enable_moe=false
```
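How `--override` is applied depends on `train_advanced.py`; the sketch below shows one reasonable way to map dotted `key=value` pairs onto a nested config dict. The `apply_overrides` helper name and the YAML-based value coercion are illustrative assumptions, not the project's actual implementation.

```python
from typing import Any
import yaml  # PyYAML: used here to coerce override values ('4' -> 4, 'false' -> False)

def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Apply 'section.key=value' style overrides to a nested config dict (illustrative)."""
    for item in overrides:
        dotted_key, raw_value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node: dict[str, Any] = config
        for key in parents:
            node = node.setdefault(key, {})     # walk (or create) intermediate sections
        node[leaf] = yaml.safe_load(raw_value)  # reuse YAML parsing for type coercion
    return config

# Example mirroring `--override training.batch_size=4 advanced.enable_moe=false`
if __name__ == "__main__":
    cfg = {"training": {"batch_size": 8}, "advanced": {"enable_moe": True}}
    cfg = apply_overrides(cfg, ["training.batch_size=4", "advanced.enable_moe=false"])
    print(cfg)  # {'training': {'batch_size': 4}, 'advanced': {'enable_moe': False}}
```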
### Create Custom Configurations

1. Copy an existing config:
   ```bash
   cp configs/train_medium.yaml configs/my_config.yaml
   ```
2. Edit the values in `my_config.yaml`
3. Run training:
   ```bash
   python train_advanced.py --config configs/my_config.yaml
   ```

---

## Training Environments

### Local Training

#### Setup

```bash
# Install dependencies
pip install -r requirements.txt

# Start the MLflow UI
mlflow ui

# Run training
python train_advanced.py --config configs/train_small.yaml
```

#### Monitor

- MLflow UI: http://localhost:5000
- Logs: `./outputs/[model_name]/training.log`
- Checkpoints: `./outputs/[model_name]/checkpoint_*.pt`

### Google Colab Training

#### Setup

1. Open `colab_training.ipynb`
2. Select a GPU runtime
3. Mount Google Drive
4. Install dependencies

#### Benefits

- Free GPU access (T4)
- Paid options (V100/A100)
- Persistent storage via Drive
- Easy sharing

#### Limitations

- Session timeouts (~12 hours)
- GPU availability varies
- Slower than dedicated hardware

### Cloud Training (AWS/GCP/Azure)

#### AWS SageMaker

```python
# Use train_advanced.py with a SageMaker estimator
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point='train_advanced.py',
    instance_type='ml.p3.8xlarge',
    instance_count=1,
    hyperparameters={
        'config': 'configs/train_large.yaml'
    }
)
estimator.fit()
```

#### GCP AI Platform

```bash
gcloud ai-platform jobs submit training ultrathink_job \
  --region=us-central1 \
  --master-machine-type=n1-highmem-16 \
  --master-accelerator=type=nvidia-tesla-v100,count=4 \
  --package-path=./src \
  --module-name=train_advanced \
  -- \
  --config=configs/train_large.yaml
```

### Distributed Training

#### Multi-GPU (Single Node)

```bash
# Using torchrun
torchrun --nproc_per_node=4 train_advanced.py \
  --config configs/train_large.yaml \
  --override distributed.enabled=true
```

#### Multi-Node with DeepSpeed

```bash
# Create a hostfile
echo "node1 slots=8" > hostfile
echo "node2 slots=8" >> hostfile

# Run with DeepSpeed
deepspeed --hostfile=hostfile train_advanced.py \
  --config configs/train_large.yaml \
  --override \
    distributed.enabled=true \
    distributed.launcher=deepspeed \
    distributed.deepspeed_config=config/deepspeed_z3.json
```

---

## Monitoring & Debugging

### MLflow Tracking

**View Experiments**:
```bash
mlflow ui
# Open http://localhost:5000
```

**Key Metrics**:
- `train/loss`: Training loss
- `val/loss`: Validation loss
- `train/learning_rate`: Current LR
- `train/grad_norm`: Gradient norms
- `eval/*`: Evaluation metrics

**Artifacts**:
- Config files
- Checkpoints
- Evaluation results
- Model exports

### Logging

**Log Levels**:
```python
# In config or code
logging.basicConfig(level=logging.INFO)  # INFO, DEBUG, WARNING, ERROR
```

**Log Files**:
- Training log: `./outputs/[model_name]/training.log`
- MLflow logs: `./mlruns/`

### Debugging Tips

**Memory Issues**:
```bash
# Check GPU memory
nvidia-smi

# Enable AMP and gradient checkpointing to reduce memory pressure
python train_advanced.py \
  --config configs/train_small.yaml \
  --override training.use_amp=true model.gradient_checkpointing=true
```

**Slow Training**:
```bash
# Profile the training script
python -m torch.utils.bottleneck train_advanced.py --config configs/train_small.yaml
```

**NaN/Inf Loss**:
```yaml
# Add gradient clipping
training:
  gradient_clipping: 1.0

# Reduce the learning rate
training:
  learning_rate: 1e-5

# Enable AMP carefully
training:
  use_amp: true
```

---

## Best Practices

### 1. Start Small, Scale Up

- ✅ Test on the small config first
- ✅ Verify all features work
- ✅ Then scale to production

### 2. Use Mixed Precision (AMP)

```yaml
training:
  use_amp: true
```

- Up to 2x faster training
- ~50% less memory
- Minimal accuracy loss

### 3. Gradient Checkpointing for Large Models

```yaml
model:
  gradient_checkpointing: true
```

- Trades compute for memory
- Enables larger models
- ~20% slower, ~40% less memory

### 4. Optimize Data Loading

```yaml
data:
  num_workers: 8     # Set to the number of CPU cores
  streaming: true    # For large datasets
```

### 5. Save Checkpoints Regularly

```yaml
evaluation:
  eval_frequency: 1  # Evaluate and checkpoint every epoch
```

### 6. Monitor Gradient Norms

- Healthy range: 0.1 - 10
- Too high (>100): reduce the LR or tighten gradient clipping
- Too low (<0.01): increase the LR or check the optimizer

### 7. Use Learning Rate Warmup

```yaml
training:
  warmup_steps: 2000  # Gradual LR increase
```
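To see what `warmup_steps` does numerically, here is a minimal sketch using `torch.optim.lr_scheduler.LambdaLR`: the multiplier ramps linearly from 0 to 1 over the warmup window and scales the base `learning_rate` from the config. The cosine decay afterwards and the 100K-step horizon are assumptions for the example, not necessarily the schedule `train_advanced.py` uses.

```python
import math
import torch

def warmup_cosine(step: int, warmup_steps: int = 2000, total_steps: int = 100_000) -> float:
    """LR multiplier: linear ramp 0 -> 1 over warmup, then cosine decay toward 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

model = torch.nn.Linear(512, 512)                      # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)

for step in range(10):                                 # in practice: the training loop
    optimizer.step()                                   # (after loss.backward())
    scheduler.step()                                   # advance the LR schedule each step
```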
### 8. Enable Advanced Features Gradually

1. Train a baseline model
2. Add MoE
3. Add DRE
4. Add Constitutional AI
5. Fine-tune with RLHF

---

## Troubleshooting

### Out of Memory (OOM)

**Symptoms**: CUDA out of memory error

**Solutions**:
```yaml
# 1. Reduce the batch size (see the gradient accumulation sketch after this section)
training:
  batch_size: 2
  gradient_accumulation_steps: 16  # Maintain the effective batch size

# 2. Enable gradient checkpointing
model:
  gradient_checkpointing: true

# 3. Reduce the sequence length
model:
  max_seq_length: 512

# 4. Use a smaller model
model:
  hidden_size: 512
  num_layers: 6
```

### Slow Training

**Symptoms**: Low tokens/second

**Solutions**:
```yaml
# 1. Enable flash attention
model:
  use_flash_attention: true

# 2. Use AMP
training:
  use_amp: true

# 3. Increase the batch size
training:
  batch_size: 16

# 4. Optimize data loading
data:
  num_workers: 8
  streaming: true
```

### Training Instability

**Symptoms**: NaN loss, exploding gradients

**Solutions**:
```yaml
# 1. Enable gradient clipping
training:
  gradient_clipping: 1.0

# 2. Reduce the learning rate
training:
  learning_rate: 1e-5

# 3. Increase warmup
training:
  warmup_steps: 10000

# 4. Use DRE warmup
advanced:
  dre_warmup_steps: 5000
```

### Poor Performance

**Symptoms**: High validation loss, poor benchmark scores

**Solutions**:
1. Train longer (more epochs/steps)
2. Increase the model size
3. Use better datasets
4. Enable advanced features (MoE, DRE)
5. Fine-tune with RLHF
6. Check data quality
7. Verify tokenization
8. Monitor for overfitting

### Dataset Issues

**Symptoms**: Dataset loading errors

**Solutions**:
```bash
# 1. Use the dummy dataset for testing
python train_advanced.py \
  --config configs/train_small.yaml \
  --override data.dataset=dummy

# 2. Check dataset availability
python -c "from datasets import load_dataset; load_dataset('wikitext', 'wikitext-103-v1')"

# 3. Use streaming for large datasets
--override data.streaming=true
```
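The first OOM fix works because gradients can be accumulated over several micro-batches before each optimizer step, so `batch_size: 2` with `gradient_accumulation_steps: 16` behaves like an effective batch of 32 at a fraction of the memory. Below is a minimal sketch of that loop; the model, data, and loss are placeholders, not the actual `train_ultrathink.py` training loop.

```python
import torch

accumulation_steps = 16                        # micro-batches per optimizer step
model = torch.nn.Linear(512, 512)              # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Placeholder data: micro-batches of size 2 -> effective batch size 2 * 16 = 32
batches = [torch.randn(2, 512) for _ in range(32)]

optimizer.zero_grad()
for step, batch in enumerate(batches, start=1):
    loss = model(batch).pow(2).mean()          # placeholder loss
    (loss / accumulation_steps).backward()     # scale so accumulated grads match one big batch
    if step % accumulation_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # matches gradient_clipping: 1.0
        optimizer.step()
        optimizer.zero_grad()
```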
---

## Example Training Workflows

### Workflow 1: Quick Prototype

```bash
# 1. Test with dummy data
python train_advanced.py --config configs/train_small.yaml \
  --override data.dataset=dummy data.train_samples=1000

# 2. Run on real data
python train_advanced.py --config configs/train_small.yaml \
  --override data.dataset=wikitext training.num_epochs=1
```

### Workflow 2: Production Training

```bash
# 1. Pretrain the baseline
python train_advanced.py --config configs/train_medium.yaml \
  --run-name "baseline_v1"

# 2. Resume with advanced features
python train_advanced.py --config configs/train_medium.yaml \
  --init-from ./outputs/medium_model/final_model \
  --override advanced.enable_moe=true advanced.enable_dre=true \
  --run-name "advanced_v1"

# 3. Fine-tune with RLHF
python train_advanced.py --config configs/train_medium.yaml \
  --init-from ./outputs/medium_model/final_model \
  --override advanced.enable_rlhf=true training.learning_rate=1e-5 \
  --run-name "rlhf_v1"
```

### Workflow 3: Multi-GPU Training

```bash
# Use torchrun for data parallelism
torchrun --nproc_per_node=4 train_advanced.py \
  --config configs/train_large.yaml \
  --override distributed.enabled=true
```

---

## Performance Benchmarks

### Expected Throughput

| Profile | Hardware    | Tokens/sec | Time (100K samples) |
|---------|-------------|------------|---------------------|
| Small   | GTX 1080 Ti | ~5000      | 4 hours             |
| Medium  | RTX 3090    | ~2000      | 1 day               |
| Large   | A100 40GB   | ~800       | 1 week              |
| Large   | 4x A100     | ~3000      | 2 days              |

### Memory Requirements

| Profile | Parameters | Memory (FP32) | Memory (FP16) |
|---------|------------|---------------|---------------|
| Small   | 100M       | 6 GB          | 4 GB          |
| Medium  | 2B         | 24 GB         | 16 GB         |
| Large   | 10B        | 80 GB         | 48 GB         |

---

## Additional Resources

- **Documentation**: [README.md](README.md)
- **Model Card**: [MODEL_CARD.md](MODEL_CARD.md)
- **API Reference**: [docs/README.md](docs/README.md)
- **GitHub Issues**: Report bugs and request features
- **Discord/Slack**: Community support (if available)

---

## Getting Help

1. **Check logs**: `./outputs/[model_name]/training.log`
2. **Search issues**: GitHub issues page
3. **Community**: Discord/Slack channels
4. **Documentation**: This guide and other docs

Happy training! 🚀