ULTRATHINK Advanced Training Guide
Complete guide for training the ULTRATHINK model with all advanced features.
Table of Contents
- Quick Start
- Training Profiles
- Advanced Features
- Configuration System
- Training Environments
- Monitoring & Debugging
- Best Practices
- Troubleshooting
Quick Start
Local Training
Option 1: Using Training Profiles
# Small model (for testing)
scripts/run_training.bat --profile small
# Medium model (production-ready)
scripts/run_training.bat --profile medium
# Large model (full-scale)
scripts/run_training.bat --profile large
Option 2: Using Configuration Files
python train_advanced.py --config configs/train_small.yaml
Option 3: Direct Command Line
python train_ultrathink.py \
--use_mlflow \
--dataset wikitext \
--hidden_size 512 \
--num_layers 6 \
--num_heads 8 \
--batch_size 4 \
--enable_moe \
--enable_dre
Google Colab Training
- Open `colab_training.ipynb` in Google Colab
- Choose your GPU runtime (T4/V100/A100)
- Select a configuration cell and run it
- Monitor progress in MLflow UI or logs
Training Profiles
Small Profile
Hardware: Local machines, 4-8GB VRAM
Use Case: Testing, development, prototyping
Model Size: 512 hidden, 6 layers
Features: Basic transformer
Training Time: ~2-4 hours (10K samples)
Memory: ~6GB VRAM
When to use:
- Testing new features
- Debugging training pipeline
- Quick experiments
- Limited hardware resources
Medium Profile
Hardware: Single GPU (16-32GB), cloud instances
Use Case: Production models, research
Model Size: 2048 hidden, 24 layers
Features: MoE, DRE, Constitutional AI
Training Time: ~1-2 days (100K samples)
Memory: ~20GB VRAM
When to use:
- Production deployments
- Research experiments
- Competitive benchmarks
- Full feature validation
Large Profile
Hardware: Multi-GPU, A100/H100 clusters
Use Case: State-of-the-art models
Model Size: 4096 hidden, 32 layers
Features: All advanced features + Multimodal
Training Time: ~1-2 weeks (1M+ samples)
Memory: ~40-80GB VRAM
When to use:
- Frontier model development
- Large-scale pretraining
- Multimodal applications
- Maximum performance
Advanced Features
1. Mixture of Experts (MoE)
What it does: Routes inputs to specialized expert networks for efficient scaling.
Enable:
advanced:
enable_moe: true
moe:
num_knowledge_experts: 32
num_skill_experts: 16
num_meta_experts: 8
num_safety_experts: 4
moe_top_k: 2
Benefits:
- 5-10x parameter scaling with minimal compute increase
- Specialized knowledge domains
- Better performance on diverse tasks
Considerations:
- Requires more memory for expert parameters
- Best with batch_size >= 4
- Works with expert parallelism for large scale
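For intuition, here is a minimal, self-contained sketch of top-k routing in PyTorch. It only illustrates the general MoE pattern (router logits, softmax over the selected experts, weighted combination of expert outputs); the flat expert list and all names are assumptions for illustration and do not reflect ULTRATHINK's actual knowledge/skill/meta/safety expert grouping.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k mixture-of-experts layer (illustrative only)."""

    def __init__(self, hidden_size: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, 4 * hidden_size),
                nn.GELU(),
                nn.Linear(4 * hidden_size, hidden_size),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden) -> flatten to individual tokens
        tokens = x.reshape(-1, x.size(-1))
        gate_logits = self.router(tokens)                       # (tokens, experts)
        weights, indices = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                    # renormalize over the top-k
        out = torch.zeros_like(tokens)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                    # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape_as(x)

# Example: route a dummy batch through 8 experts, top-2 per token
layer = TopKMoE(hidden_size=512, num_experts=8, top_k=2)
y = layer(torch.randn(4, 16, 512))
```

With `moe_top_k: 2`, each token activates only two experts, which is why total parameter count can grow much faster than per-token compute.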
2. Dynamic Reasoning Engine (DRE)
What it does: Adaptive computational paths based on input complexity.
Enable:
advanced:
enable_dre: true
dre_warmup_steps: 5000
Benefits:
- Adaptive reasoning depth
- Better on complex problems
- Improved efficiency
Considerations:
- Use warmup to stabilize training
- May increase training time initially
- Best for reasoning-heavy tasks
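DRE internals are not documented in this guide, so the sketch below is only a rough mental model: a learned complexity score chooses how many refinement steps an input receives, and a warmup fraction keeps the path shallow early in training. Every name and the gating scheme here are assumptions for illustration, not ULTRATHINK's implementation.

```python
import torch
import torch.nn as nn

class AdaptiveDepthBlock(nn.Module):
    """Toy 'dynamic reasoning' block: harder inputs get more refinement steps."""

    def __init__(self, hidden_size: int, max_steps: int = 4, warmup_steps: int = 5000):
        super().__init__()
        self.max_steps = max_steps
        self.warmup_steps = warmup_steps
        self.complexity_head = nn.Linear(hidden_size, 1)
        self.reason_step = nn.Linear(hidden_size, hidden_size)

    def forward(self, x: torch.Tensor, global_step: int) -> torch.Tensor:
        # Complexity score in [0, 1] from the mean token representation
        complexity = torch.sigmoid(self.complexity_head(x.mean(dim=1))).mean()
        # During warmup, gradually allow deeper reasoning paths
        warmup_frac = min(1.0, global_step / self.warmup_steps)
        steps = 1 + int(complexity.item() * warmup_frac * (self.max_steps - 1))
        for _ in range(steps):
            x = x + torch.tanh(self.reason_step(x))   # residual refinement step
        return x

block = AdaptiveDepthBlock(hidden_size=512)
out = block(torch.randn(2, 16, 512), global_step=1000)
```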
3. Constitutional AI
What it does: Built-in safety and alignment mechanisms.
Enable:
advanced:
enable_constitutional: true
Benefits:
- Safer outputs
- Better alignment
- Reduced harmful content
Considerations:
- Adds overhead (~5-10%)
- Best combined with RLHF
- Requires safety datasets
4. RLHF (Reinforcement Learning from Human Feedback)
What it does: Fine-tunes model based on human preferences.
Enable:
advanced:
enable_rlhf: true
rlhf:
rlhf_frequency: 5
rlhf_iterations: 100
ppo_epochs: 4
When to use:
- After pretraining
- For instruction following
- For alignment fine-tuning
Process:
- Pretrain model without RLHF
- Save checkpoint
- Resume with RLHF enabled
- Fine-tune with preference data
5. Multimodal Capabilities
What it does: Process images, audio, and text together.
Enable:
advanced:
enable_multimodal: true
multimodal:
image_size: 224
patch_size: 14
audio_sample_rate: 16000
Requirements:
- Multimodal datasets
- Larger memory (images/audio)
- Vision/audio encoders
Configuration System
YAML Configuration Structure
# Model architecture
model:
vocab_size: 100352
hidden_size: 2048
num_layers: 24
...
# Advanced features
advanced:
enable_moe: true
enable_dre: true
...
# Training hyperparameters
training:
batch_size: 8
learning_rate: 1e-4
...
# Data configuration
data:
dataset: wikitext
...
# Logging and monitoring
logging:
use_mlflow: true
...
Override Configuration Values
python train_advanced.py \
--config configs/train_medium.yaml \
--override \
training.batch_size=4 \
model.hidden_size=1024 \
advanced.enable_moe=false
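Each `--override` entry is a dotted path into the nested YAML structure. The helper below is a minimal sketch of how such `key.path=value` strings can be applied to a loaded config dict; it illustrates the idea and is not the parser actually used by `train_advanced.py`.

```python
import yaml

def apply_override(config: dict, assignment: str) -> None:
    """Apply a single 'a.b.c=value' override to a nested dict in place."""
    dotted_key, raw_value = assignment.split("=", 1)
    value = yaml.safe_load(raw_value)            # "4" -> 4, "false" -> False, etc.
    node = config
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value

with open("configs/train_medium.yaml") as f:
    config = yaml.safe_load(f)

for item in ["training.batch_size=4", "model.hidden_size=1024", "advanced.enable_moe=false"]:
    apply_override(config, item)

print(config["training"]["batch_size"])   # 4
```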
Create Custom Configurations
- Copy existing config:
cp configs/train_medium.yaml configs/my_config.yaml
- Edit values in `my_config.yaml`
- Run training:
python train_advanced.py --config configs/my_config.yaml
Training Environments
Local Training
Setup
# Install dependencies
pip install -r requirements.txt
# Start MLflow UI
mlflow ui
# Run training
python train_advanced.py --config configs/train_small.yaml
Monitor
- MLflow UI: http://localhost:5000
- Logs: `./outputs/[model_name]/training.log`
- Checkpoints: `./outputs/[model_name]/checkpoint_*.pt`
Google Colab Training
Setup
- Open `colab_training.ipynb`
- Select GPU runtime
- Mount Google Drive
- Install dependencies
Benefits
- Free GPU access (T4)
- Paid options (V100/A100)
- Persistent storage via Drive
- Easy sharing
Limitations
- Session timeouts (~12 hours)
- GPU availability varies
- Slower than dedicated hardware
Cloud Training (AWS/GCP/Azure)
AWS SageMaker
# Use train_advanced.py with SageMaker estimator
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point='train_advanced.py',
    source_dir='.',                          # upload the project code
    role='<sagemaker-execution-role-arn>',   # required by SageMaker
    instance_type='ml.p3.8xlarge',
    instance_count=1,
    framework_version='2.1',                 # pick versions available in your region
    py_version='py310',
    hyperparameters={
        'config': 'configs/train_large.yaml'
    }
)
estimator.fit()
GCP AI Platform
gcloud ai-platform jobs submit training ultrathink_job \
--region=us-central1 \
--master-machine-type=n1-highmem-16 \
--master-accelerator=type=nvidia-tesla-v100,count=4 \
--package-path=./src \
--module-name=train_advanced \
-- \
--config=configs/train_large.yaml
Distributed Training
Multi-GPU (Single Node)
# Using torchrun
torchrun --nproc_per_node=4 train_advanced.py \
--config configs/train_large.yaml \
--override distributed.enabled=true
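`torchrun` starts one process per GPU and sets `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` in the environment. The snippet below is a generic sketch of the setup a training script performs under this launcher; `train_advanced.py` presumably does something equivalent when `distributed.enabled=true`, but the function name here is illustrative.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed(model: torch.nn.Module) -> torch.nn.Module:
    """Initialize the NCCL process group and wrap the model for data parallelism."""
    dist.init_process_group(backend="nccl")          # reads RANK/WORLD_SIZE from env
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)
    return DDP(model, device_ids=[local_rank])

# Each of the 4 processes started by torchrun calls this once before training.
```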
Multi-Node with DeepSpeed
# Create hostfile
echo "node1 slots=8" > hostfile
echo "node2 slots=8" >> hostfile
# Run with DeepSpeed
deepspeed --hostfile=hostfile train_advanced.py \
--config configs/train_large.yaml \
--override \
distributed.enabled=true \
distributed.launcher=deepspeed \
distributed.deepspeed_config=config/deepspeed_z3.json
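Inside the script, DeepSpeed takes over optimizer construction, ZeRO partitioning, and mixed precision via `deepspeed.initialize`. A hedged sketch of that call is shown below; the exact wiring in `train_advanced.py` may differ, and the stand-in model is only a placeholder.

```python
import deepspeed
import torch

model = torch.nn.Linear(1024, 1024)   # placeholder for the real ULTRATHINK model

# DeepSpeed builds the optimizer, ZeRO-3 partitioning, and AMP from the JSON config.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="config/deepspeed_z3.json",
)

# Training step: the engine handles loss scaling, accumulation, and the optimizer step.
# loss = model_engine(batch)
# model_engine.backward(loss)
# model_engine.step()
```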
Monitoring & Debugging
MLflow Tracking
View Experiments:
mlflow ui
# Open http://localhost:5000
Key Metrics:
- `train/loss`: Training loss
- `val/loss`: Validation loss
- `train/learning_rate`: Current learning rate
- `train/grad_norm`: Gradient norms
- `eval/*`: Evaluation metrics
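These are ordinary MLflow metrics, so additional values can be logged from anywhere in the training code with the standard API. A small illustrative example (the experiment and run names are arbitrary):

```python
import mlflow

mlflow.set_experiment("ultrathink")   # experiment name is just an example

with mlflow.start_run(run_name="debug_run"):
    for step in range(100):
        loss = 1.0 / (step + 1)                       # placeholder value
        mlflow.log_metric("train/loss", loss, step=step)
    mlflow.log_artifact("configs/train_small.yaml")   # attach the config used
```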
Artifacts:
- Config files
- Checkpoints
- Evaluation results
- Model exports
Logging
Log Levels:
# In config or code
logging.basicConfig(level=logging.INFO) # INFO, DEBUG, WARNING, ERROR
Log Files:
- Training log: `./outputs/[model_name]/training.log`
- MLflow logs: `./mlruns/`
Debugging Tips
Memory Issues:
# Check GPU memory
nvidia-smi
# Enable memory profiling
python train_advanced.py \
--config configs/train_small.yaml \
--override training.use_amp=true model.gradient_checkpointing=true
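PyTorch can also report peak allocation from inside the process, which is often the fastest way to confirm whether AMP or gradient checkpointing actually lowered memory use; a short generic snippet:

```python
import torch

torch.cuda.reset_peak_memory_stats()

# ... run a few training steps here ...

peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU memory: {peak_gb:.2f} GB")
print(torch.cuda.memory_summary(abbreviated=True))   # detailed allocator breakdown
```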
Slow Training:
# Enable profiling
python -m torch.utils.bottleneck train_advanced.py --config configs/train_small.yaml
NaN/Inf Loss:
# Add gradient clipping
training:
gradient_clipping: 1.0
# Reduce learning rate
training:
learning_rate: 1e-5
# Enable AMP carefully
training:
use_amp: true
Best Practices
1. Start Small, Scale Up
- Test on small config first
- Verify all features work
- Then scale to production
2. Use Mixed Precision (AMP)
training:
use_amp: true
- 2x faster training
- 50% less memory
- Minimal accuracy loss
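With `use_amp: true` the training step typically follows the standard PyTorch autocast/GradScaler pattern. The sketch below shows that generic pattern, not necessarily the exact code path in `train_advanced.py` (and it assumes a model whose output exposes a `.loss` field):

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def training_step(model, batch, optimizer):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():           # forward pass in reduced precision
        loss = model(**batch).loss            # assumes an HF-style output with .loss
    scaler.scale(loss).backward()             # scale loss to avoid FP16 underflow
    scaler.unscale_(optimizer)                # unscale before gradient clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)                    # skips the step if gradients overflowed
    scaler.update()
    return loss.detach()
```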
3. Gradient Checkpointing for Large Models
model:
gradient_checkpointing: true
- Trades compute for memory
- Enables larger models
- ~20% slower, 40% less memory
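This setting relies on `torch.utils.checkpoint`: activations inside the wrapped block are discarded during the forward pass and recomputed during backward, trading compute for memory. A minimal generic illustration:

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(2048, 8192),
    torch.nn.GELU(),
    torch.nn.Linear(8192, 2048),
)

x = torch.randn(4, 128, 2048, requires_grad=True)
# Activations inside `block` are recomputed on backward instead of being stored.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```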
4. Optimize Data Loading
data:
num_workers: 8 # Set to CPU cores
streaming: true # For large datasets
5. Save Checkpoints Regularly
evaluation:
eval_frequency: 1 # Save every epoch
6. Monitor Gradient Norms
- Healthy range: 0.1 - 10
- Too high (>100): Reduce LR or increase clipping
- Too low (<0.01): Increase LR or check optimizer
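A convenient way to track this is that `clip_grad_norm_` returns the total norm it measured before clipping, so clipping and logging can share one call; a small generic sketch (the helper name is made up):

```python
import torch

def clip_and_log_grad_norm(model, max_norm: float = 1.0, step: int = 0) -> float:
    # clip_grad_norm_ returns the total norm measured before clipping
    total_norm = float(torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm))
    # mlflow.log_metric("train/grad_norm", total_norm, step=step)   # optional
    if total_norm > 100:
        print(f"step {step}: grad norm {total_norm:.1f} is high; reduce LR or clip harder")
    return total_norm
```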
7. Use Learning Rate Warmup
training:
warmup_steps: 2000 # Gradual LR increase
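A linear warmup is just a learning-rate multiplier that ramps from 0 to 1 over `warmup_steps`. The sketch below expresses that with a plain `LambdaLR`; the actual schedule in the trainer may also include decay after warmup.

```python
import torch

def linear_warmup(step: int, warmup_steps: int = 2000) -> float:
    # LR multiplier: ramps 0 -> 1 over warmup_steps, then stays at 1
    return min(1.0, (step + 1) / warmup_steps)

optimizer = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=linear_warmup)

# Call scheduler.step() once per optimizer step during training.
```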
8. Enable Advanced Features Gradually
- Train baseline model
- Add MoE
- Add DRE
- Add Constitutional AI
- Fine-tune with RLHF
Troubleshooting
Out of Memory (OOM)
Symptoms: CUDA out of memory error
Solutions:
# 1. Reduce batch size
training:
batch_size: 2
gradient_accumulation_steps: 16 # Maintain effective batch size
# 2. Enable gradient checkpointing
model:
gradient_checkpointing: true
# 3. Reduce sequence length
model:
max_seq_length: 512
# 4. Use smaller model
model:
hidden_size: 512
num_layers: 6
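For solution 1 above, the effective batch size is the product of the per-device batch size, the accumulation steps, and the number of GPUs; with the values shown, on a single GPU:

```python
per_device_batch = 2
grad_accum_steps = 16
num_gpus = 1
effective_batch = per_device_batch * grad_accum_steps * num_gpus   # = 32
```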
Slow Training
Symptoms: Low tokens/second
Solutions:
# 1. Enable flash attention
model:
use_flash_attention: true
# 2. Use AMP
training:
use_amp: true
# 3. Increase batch size
training:
batch_size: 16
# 4. Optimize data loading
data:
num_workers: 8
streaming: true
Training Instability
Symptoms: NaN loss, exploding gradients
Solutions:
# 1. Enable gradient clipping
training:
gradient_clipping: 1.0
# 2. Reduce learning rate
training:
learning_rate: 1e-5
# 3. Increase warmup
training:
warmup_steps: 10000
# 4. Use DRE warmup
advanced:
dre_warmup_steps: 5000
Poor Performance
Symptoms: High validation loss, poor benchmarks
Solutions:
- Train longer (more epochs/steps)
- Increase model size
- Use better datasets
- Enable advanced features (MoE, DRE)
- Fine-tune with RLHF
- Check data quality
- Verify tokenization
- Monitor overfitting
Dataset Issues
Symptoms: Dataset loading errors
Solutions:
# 1. Use dummy dataset for testing
python train_advanced.py \
--config configs/train_small.yaml \
--override data.dataset=dummy
# 2. Check dataset availability
python -c "from datasets import load_dataset; load_dataset('wikitext', 'wikitext-103-v1')"
# 3. Use streaming for large datasets
--override data.streaming=true
Example Training Workflows
Workflow 1: Quick Prototype
# 1. Test with dummy data
python train_advanced.py --config configs/train_small.yaml \
--override data.dataset=dummy data.train_samples=1000
# 2. Run on real data
python train_advanced.py --config configs/train_small.yaml \
--override data.dataset=wikitext training.num_epochs=1
Workflow 2: Production Training
# 1. Pretrain baseline
python train_advanced.py --config configs/train_medium.yaml \
--run-name "baseline_v1"
# 2. Resume with advanced features
python train_advanced.py --config configs/train_medium.yaml \
--init-from ./outputs/medium_model/final_model \
--override advanced.enable_moe=true advanced.enable_dre=true \
--run-name "advanced_v1"
# 3. Fine-tune with RLHF
python train_advanced.py --config configs/train_medium.yaml \
--init-from ./outputs/medium_model/final_model \
--override advanced.enable_rlhf=true training.learning_rate=1e-5 \
--run-name "rlhf_v1"
Workflow 3: Multi-GPU Training
# Use torchrun for data parallelism
torchrun --nproc_per_node=4 train_advanced.py \
--config configs/train_large.yaml \
--override distributed.enabled=true
Performance Benchmarks
Expected Throughput
| Profile | Hardware | Tokens/sec | Time (100K samples) |
|---|---|---|---|
| Small | GTX 1080 Ti | ~5000 | 4 hours |
| Medium | RTX 3090 | ~2000 | 1 day |
| Large | A100 40GB | ~800 | 1 week |
| Large | 4x A100 | ~3000 | 2 days |
Memory Requirements
| Profile | Parameters | Memory (FP32) | Memory (FP16) |
|---|---|---|---|
| Small | 100M | 6 GB | 4 GB |
| Medium | 2B | 24 GB | 16 GB |
| Large | 10B | 80 GB | 48 GB |
Additional Resources
- Documentation: README.md
- Model Card: MODEL_CARD.md
- API Reference: docs/README.md
- GitHub Issues: Report bugs and request features
- Discord/Slack: Community support (if available)
Getting Help
- Check logs: `./outputs/[model_name]/training.log`
- Search issues: GitHub issues page
- Community: Discord/Slack channels
- Documentation: This guide and other docs
Happy training!