
ULTRATHINK Advanced Training Guide

Complete guide for training the ULTRATHINK model with all advanced features.

Quick Start

Local Training

Option 1: Using Training Profiles

# Small model (for testing)
scripts\run_training.bat --profile small

# Medium model (production-ready)
scripts\run_training.bat --profile medium

# Large model (full-scale)
scripts\run_training.bat --profile large

Option 2: Using Configuration Files

python train_advanced.py --config configs/train_small.yaml

Option 3: Direct Command Line

python train_ultrathink.py \
  --use_mlflow \
  --dataset wikitext \
  --hidden_size 512 \
  --num_layers 6 \
  --num_heads 8 \
  --batch_size 4 \
  --enable_moe \
  --enable_dre

Google Colab Training

  1. Open colab_training.ipynb in Google Colab
  2. Choose your GPU runtime (T4/V100/A100)
  3. Select a configuration cell and run it
  4. Monitor progress in MLflow UI or logs

Training Profiles

Small Profile

Hardware: Local machines, 4-8GB VRAM
Use Case: Testing, development, prototyping

Model Size: 512 hidden, 6 layers
Features: Basic transformer
Training Time: ~2-4 hours (10K samples)
Memory: ~6GB VRAM

When to use:

  • Testing new features
  • Debugging training pipeline
  • Quick experiments
  • Limited hardware resources

Medium Profile

Hardware: Single GPU (16-32GB), cloud instances
Use Case: Production models, research

Model Size: 2048 hidden, 24 layers
Features: MoE, DRE, Constitutional AI
Training Time: ~1-2 days (100K samples)
Memory: ~20GB VRAM

When to use:

  • Production deployments
  • Research experiments
  • Competitive benchmarks
  • Full feature validation

Large Profile

Hardware: Multi-GPU, A100/H100 clusters
Use Case: State-of-the-art models

Model Size: 4096 hidden, 32 layers
Features: All advanced features + Multimodal
Training Time: ~1-2 weeks (1M+ samples)
Memory: ~40-80GB VRAM

When to use:

  • Frontier model development
  • Large-scale pretraining
  • Multimodal applications
  • Maximum performance

Advanced Features

1. Mixture of Experts (MoE)

What it does: Routes inputs to specialized expert networks for efficient scaling.

Enable:

advanced:
  enable_moe: true

moe:
  num_knowledge_experts: 32
  num_skill_experts: 16
  num_meta_experts: 8
  num_safety_experts: 4
  moe_top_k: 2

Benefits:

  • ✅ 5-10x parameter scaling with minimal compute increase
  • ✅ Specialized knowledge domains
  • ✅ Better performance on diverse tasks

Considerations:

  • Requires more memory for expert parameters
  • Best with batch_size >= 4
  • Works with expert parallelism for large scale
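
The routing idea can be sketched in a few lines of PyTorch. This is a minimal, illustrative top-k router under our own assumptions (class name, expert MLP shape), not the ULTRATHINK implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, hidden_size: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts)   # router scores per expert
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, 4 * hidden_size),
                nn.GELU(),
                nn.Linear(4 * hidden_size, hidden_size),
            )
            for _ in range(num_experts)
        )

    def forward(self, x):                         # x: (tokens, hidden)
        weights, idx = self.gate(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)      # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):               # dense loop for clarity;
            for e, expert in enumerate(self.experts):  # real MoE uses batched dispatch
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

With moe_top_k=2, each token activates only 2 of the experts per layer, which is why parameter count can grow much faster than per-token compute.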

2. Dynamic Reasoning Engine (DRE)

What it does: Adaptive computational paths based on input complexity.

Enable:

advanced:
  enable_dre: true
  dre_warmup_steps: 5000

Benefits:

  • ✅ Adaptive reasoning depth
  • ✅ Better on complex problems
  • ✅ Improved efficiency

Considerations:

  • Use warmup to stabilize training
  • May increase training time initially
  • Best for reasoning-heavy tasks
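
The guide does not specify DRE's internals. As a rough illustration of adaptive computation depth in general, here is an ACT-style sketch in which a learned halting signal decides how many times a shared layer is applied per token; the layer choice and threshold are assumptions for the example, and the actual DRE may differ substantially:

import torch
import torch.nn as nn

class AdaptiveDepth(nn.Module):
    """Applies a shared layer up to max_steps times, letting tokens halt early."""
    def __init__(self, hidden_size: int, max_steps: int = 8, threshold: float = 0.99):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(hidden_size, nhead=8, batch_first=True)
        self.halt = nn.Linear(hidden_size, 1)     # learned per-token halting signal
        self.max_steps = max_steps
        self.threshold = threshold

    def forward(self, x):                         # x: (batch, seq, hidden)
        halted = torch.zeros(x.shape[:2], dtype=torch.bool, device=x.device)
        cum_halt = torch.zeros(x.shape[:2], device=x.device)
        for _ in range(self.max_steps):
            x = torch.where(halted[..., None], x, self.layer(x))  # halted tokens keep their state
            cum_halt = cum_halt + torch.sigmoid(self.halt(x)).squeeze(-1)
            halted = halted | (cum_halt >= self.threshold)
            if bool(halted.all()):                # every token has had enough compute
                break
        return x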

3. Constitutional AI

What it does: Built-in safety and alignment mechanisms.

Enable:

advanced:
  enable_constitutional: true

Benefits:

  • ✅ Safer outputs
  • ✅ Better alignment
  • ✅ Reduced harmful content

Considerations:

  • Adds overhead (~5-10%)
  • Best combined with RLHF
  • Requires safety datasets

4. RLHF (Reinforcement Learning from Human Feedback)

What it does: Fine-tunes model based on human preferences.

Enable:

advanced:
  enable_rlhf: true

rlhf:
  rlhf_frequency: 5
  rlhf_iterations: 100
  ppo_epochs: 4

When to use:

  • After pretraining
  • For instruction following
  • For alignment fine-tuning

Process:

  1. Pretrain model without RLHF
  2. Save checkpoint
  3. Resume with RLHF enabled
  4. Fine-tune with preference data

5. Multimodal Capabilities

What it does: Process images, audio, and text together.

Enable:

advanced:
  enable_multimodal: true

multimodal:
  image_size: 224
  patch_size: 14
  audio_sample_rate: 16000

Requirements:

  • Multimodal datasets
  • Larger memory (images/audio)
  • Vision/audio encoders
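
With image_size 224 and patch_size 14, each image becomes a (224 / 14)² = 256-token sequence. A minimal ViT-style patch embedding showing that arithmetic (the Conv2d projection and the 2048 hidden size are assumptions for the example, not the project's actual encoder):

import torch
import torch.nn as nn

# 224x224 image, 14x14 patches -> (224 / 14)^2 = 256 tokens per image
patch_embed = nn.Conv2d(in_channels=3, out_channels=2048, kernel_size=14, stride=14)
images = torch.randn(4, 3, 224, 224)                       # dummy batch of 4 RGB images
tokens = patch_embed(images).flatten(2).transpose(1, 2)    # shape: (4, 256, 2048)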

Configuration System

YAML Configuration Structure

# Model architecture
model:
  vocab_size: 100352
  hidden_size: 2048
  num_layers: 24
  ...

# Advanced features
advanced:
  enable_moe: true
  enable_dre: true
  ...

# Training hyperparameters
training:
  batch_size: 8
  learning_rate: 1e-4
  ...

# Data configuration
data:
  dataset: wikitext
  ...

# Logging and monitoring
logging:
  use_mlflow: true
  ...

Override Configuration Values

python train_advanced.py \
  --config configs/train_medium.yaml \
  --override \
    training.batch_size=4 \
    model.hidden_size=1024 \
    advanced.enable_moe=false
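
Dotted overrides of this kind are usually applied by walking the nested config dict. A sketch of the idea, assuming YAML-style value parsing; the actual parser in train_advanced.py may behave differently:

import yaml

def apply_override(config: dict, assignment: str) -> None:
    """Apply one 'a.b.c=value' override in place."""
    dotted_key, raw_value = assignment.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = yaml.safe_load(raw_value)   # "4" -> 4, "false" -> False

with open("configs/train_medium.yaml") as f:
    config = yaml.safe_load(f)
for ov in ["training.batch_size=4", "model.hidden_size=1024", "advanced.enable_moe=false"]:
    apply_override(config, ov)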

Create Custom Configurations

  1. Copy an existing config:

cp configs/train_medium.yaml configs/my_config.yaml

  2. Edit values in my_config.yaml

  3. Run training:

python train_advanced.py --config configs/my_config.yaml

Training Environments

Local Training

Setup

# Install dependencies
pip install -r requirements.txt

# Start MLflow UI
mlflow ui

# Run training
python train_advanced.py --config configs/train_small.yaml

Monitor

  • MLflow UI: http://localhost:5000
  • Logs: ./outputs/[model_name]/training.log
  • Checkpoints: ./outputs/[model_name]/checkpoint_*.pt

Google Colab Training

Setup

  1. Open colab_training.ipynb
  2. Select GPU runtime
  3. Mount Google Drive
  4. Install dependencies

Benefits

  • Free GPU access (T4)
  • Paid options (V100/A100)
  • Persistent storage via Drive
  • Easy sharing

Limitations

  • Session timeouts (~12 hours)
  • GPU availability varies
  • Slower than dedicated hardware

Cloud Training (AWS/GCP/Azure)

AWS SageMaker

# Use train_advanced.py with SageMaker estimator
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point='train_advanced.py',
    role='<your-sagemaker-execution-role-arn>',  # required by SageMaker
    framework_version='2.1',                     # pick versions matching your image
    py_version='py310',
    instance_type='ml.p3.8xlarge',
    instance_count=1,
    hyperparameters={
        'config': 'configs/train_large.yaml'
    }
)
estimator.fit()

GCP AI Platform

gcloud ai-platform jobs submit training ultrathink_job \
  --region=us-central1 \
  --master-machine-type=n1-highmem-16 \
  --master-accelerator=type=nvidia-tesla-v100,count=4 \
  --package-path=./src \
  --module-name=train_advanced \
  -- \
  --config=configs/train_large.yaml

Distributed Training

Multi-GPU (Single Node)

# Using torchrun
torchrun --nproc_per_node=4 train_advanced.py \
  --config configs/train_large.yaml \
  --override distributed.enabled=true
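
Under torchrun, each process reads its rank from environment variables that the launcher sets. The training script needs roughly this wiring; a sketch, where `model` stands in for the already-constructed ULTRATHINK model:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                  # torchrun sets RANK/WORLD_SIZE
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
model = DDP(model.to(local_rank), device_ids=[local_rank])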

Multi-Node with DeepSpeed

# Create hostfile
echo "node1 slots=8" > hostfile
echo "node2 slots=8" >> hostfile

# Run with DeepSpeed
deepspeed --hostfile=hostfile train_advanced.py \
  --config configs/train_large.yaml \
  --override \
    distributed.enabled=true \
    distributed.launcher=deepspeed \
    distributed.deepspeed_config=configs/deepspeed_z3.json

Monitoring & Debugging

MLflow Tracking

View Experiments:

mlflow ui
# Open http://localhost:5000

Key Metrics:

  • train/loss: Training loss
  • val/loss: Validation loss
  • train/learning_rate: Current LR
  • train/grad_norm: Gradient norms
  • eval/*: Evaluation metrics

Artifacts:

  • Config files
  • Checkpoints
  • Evaluation results
  • Model exports
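
Logging follows the standard mlflow API. A minimal sketch of how such metrics and artifacts end up in the UI; `train_step_losses` is a placeholder for the real training loop:

import mlflow

mlflow.set_experiment("ultrathink")
with mlflow.start_run(run_name="baseline_v1"):
    mlflow.log_params({"batch_size": 8, "learning_rate": 1e-4})
    for step, loss in enumerate(train_step_losses):   # placeholder iterable
        mlflow.log_metric("train/loss", loss, step=step)
    mlflow.log_artifact("configs/train_medium.yaml")  # stored alongside the run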

Logging

Log Levels:

# In config or code
import logging
logging.basicConfig(level=logging.INFO)  # INFO, DEBUG, WARNING, ERROR

Log Files:

  • Training log: ./outputs/[model_name]/training.log
  • MLflow logs: ./mlruns/

Debugging Tips

Memory Issues:

# Check GPU memory
nvidia-smi

# Reduce memory pressure (AMP + gradient checkpointing)
python train_advanced.py \
  --config configs/train_small.yaml \
  --override training.use_amp=true model.gradient_checkpointing=true

Slow Training:

# Enable profiling
python -m torch.utils.bottleneck train_advanced.py --config configs/train_small.yaml

NaN/Inf Loss:

# Add gradient clipping
training:
  gradient_clipping: 1.0
  
# Reduce learning rate
training:
  learning_rate: 1e-5
  
# Enable AMP carefully
training:
  use_amp: true
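
Alongside the config changes, it can help to guard the training loop so a single bad batch cannot poison the weights. A sketch; the helper name and arguments are ours, not part of the codebase:

import torch

def safe_step(loss, model, optimizer, max_norm=1.0):
    """Skip the update when the loss is non-finite; otherwise clip and step."""
    if not torch.isfinite(loss):
        optimizer.zero_grad(set_to_none=True)   # drop this batch
        return False
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return True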

Best Practices

1. Start Small, Scale Up

  • ✅ Test on small config first
  • ✅ Verify all features work
  • ✅ Then scale to production

2. Use Mixed Precision (AMP)

training:
  use_amp: true

  • Up to ~2x faster training
  • Roughly half the activation memory
  • Minimal accuracy loss
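
In PyTorch this corresponds to the usual autocast-plus-GradScaler pattern. A sketch, with `model`, `loader`, `optimizer`, and `compute_loss` assumed from the surrounding training setup:

import torch

scaler = torch.cuda.amp.GradScaler()
for batch in loader:
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = compute_loss(model, batch)            # forward runs in FP16 where safe
    scaler.scale(loss).backward()                    # scaling avoids FP16 underflow
    scaler.unscale_(optimizer)                       # so clipping sees true gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)
    scaler.update()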

3. Gradient Checkpointing for Large Models

model:
  gradient_checkpointing: true

  • Trades compute for memory
  • Enables larger models
  • Typically ~20% slower for ~40% less activation memory
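
Mechanically, gradient checkpointing recomputes activations during the backward pass instead of storing them. With torch.utils.checkpoint it looks roughly like this; `blocks` stands in for the model's transformer layers:

from torch.utils.checkpoint import checkpoint

def forward_with_checkpointing(blocks, x):
    # Activations inside each block are recomputed in backward, not stored.
    for block in blocks:
        x = checkpoint(block, x, use_reentrant=False)
    return x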

4. Optimize Data Loading

data:
  num_workers: 8  # Set to CPU cores
  streaming: true  # For large datasets
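
On the PyTorch side these options map to the usual DataLoader knobs. A sketch, assuming `dataset` is a torch Dataset built from the config:

from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    batch_size=8,
    num_workers=8,            # roughly one per CPU core
    pin_memory=True,          # faster host-to-GPU copies
    persistent_workers=True,  # keep workers alive between epochs
)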

5. Save Checkpoints Regularly

evaluation:
  eval_frequency: 1  # Save every epoch
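
A plain save/restore helper shows what a checkpoint should carry; ULTRATHINK's actual checkpoint format may include more (scheduler state, RNG state):

import torch

def save_checkpoint(path, model, optimizer, epoch, step):
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "epoch": epoch,
        "step": step,
    }, path)

def load_checkpoint(path, model, optimizer):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["epoch"], ckpt["step"]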

6. Monitor Gradient Norms

  • Healthy range: 0.1 - 10
  • Too high (>100): Reduce LR or increase clipping
  • Too low (<0.01): Increase LR or check optimizer
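
clip_grad_norm_ conveniently returns the pre-clipping total norm, so monitoring and clipping can share one call. A sketch, with `model` assumed from the training loop:

import torch

grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
if grad_norm > 100:   # outside the healthy range above
    print(f"warning: grad norm {grad_norm:.1f}; lower the LR or tighten clipping")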

7. Use Learning Rate Warmup

training:
  warmup_steps: 2000  # Gradual LR increase
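
A linear warmup is one line with LambdaLR. A sketch, with `optimizer` assumed; call scheduler.step() once per optimizer step, after which the multiplier stays at 1.0:

from torch.optim.lr_scheduler import LambdaLR

warmup_steps = 2000
scheduler = LambdaLR(optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))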

8. Enable Advanced Features Gradually

  1. Train baseline model
  2. Add MoE
  3. Add DRE
  4. Add Constitutional AI
  5. Fine-tune with RLHF

Troubleshooting

Out of Memory (OOM)

Symptoms: CUDA out of memory error

Solutions:

# 1. Reduce batch size
training:
  batch_size: 2
  gradient_accumulation_steps: 16  # Maintain effective batch size

# 2. Enable gradient checkpointing
model:
  gradient_checkpointing: true

# 3. Reduce sequence length
model:
  max_seq_length: 512

# 4. Use smaller model
model:
  hidden_size: 512
  num_layers: 6
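
With batch_size 2 and 16 accumulation steps, the effective batch size stays at 2 × 16 = 32. The loop pattern is a sketch, with `compute_loss`, `loader`, and `optimizer` assumed from the training setup:

accum_steps = 16
optimizer.zero_grad(set_to_none=True)
for i, batch in enumerate(loader):
    loss = compute_loss(model, batch) / accum_steps  # scale so gradients average
    loss.backward()
    if (i + 1) % accum_steps == 0:                   # update once per 16 micro-batches
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)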

Slow Training

Symptoms: Low tokens/second

Solutions:

# 1. Enable flash attention
model:
  use_flash_attention: true

# 2. Use AMP
training:
  use_amp: true

# 3. Increase batch size
training:
  batch_size: 16

# 4. Optimize data loading
data:
  num_workers: 8
  streaming: true

Training Instability

Symptoms: NaN loss, exploding gradients

Solutions:

# 1. Enable gradient clipping
training:
  gradient_clipping: 1.0

# 2. Reduce learning rate
training:
  learning_rate: 1e-5

# 3. Increase warmup
training:
  warmup_steps: 10000

# 4. Use DRE warmup
advanced:
  dre_warmup_steps: 5000

Poor Performance

Symptoms: High validation loss, poor benchmarks

Solutions:

  1. Train longer (more epochs/steps)
  2. Increase model size
  3. Use better datasets
  4. Enable advanced features (MoE, DRE)
  5. Fine-tune with RLHF
  6. Check data quality
  7. Verify tokenization
  8. Monitor overfitting

Dataset Issues

Symptoms: Dataset loading errors

Solutions:

# 1. Use dummy dataset for testing
python train_advanced.py \
  --config configs/train_small.yaml \
  --override data.dataset=dummy

# 2. Check dataset availability
python -c "from datasets import load_dataset; load_dataset('wikitext', 'wikitext-103-v1')"

# 3. Use streaming for large datasets
--override data.streaming=true

Example Training Workflows

Workflow 1: Quick Prototype

# 1. Test with dummy data
python train_advanced.py --config configs/train_small.yaml \
  --override data.dataset=dummy data.train_samples=1000

# 2. Run on real data
python train_advanced.py --config configs/train_small.yaml \
  --override data.dataset=wikitext training.num_epochs=1

Workflow 2: Production Training

# 1. Pretrain baseline
python train_advanced.py --config configs/train_medium.yaml \
  --run-name "baseline_v1"

# 2. Resume with advanced features
python train_advanced.py --config configs/train_medium.yaml \
  --init-from ./outputs/medium_model/final_model \
  --override advanced.enable_moe=true advanced.enable_dre=true \
  --run-name "advanced_v1"

# 3. Fine-tune with RLHF
python train_advanced.py --config configs/train_medium.yaml \
  --init-from ./outputs/medium_model/final_model \
  --override advanced.enable_rlhf=true training.learning_rate=1e-5 \
  --run-name "rlhf_v1"

Workflow 3: Multi-GPU Training

# Use torchrun for data parallelism
torchrun --nproc_per_node=4 train_advanced.py \
  --config configs/train_large.yaml \
  --override distributed.enabled=true

Performance Benchmarks

Expected Throughput

Profile   Hardware       Tokens/sec   Time (100K samples)
Small     GTX 1080 Ti    ~5000        ~4 hours
Medium    RTX 3090       ~2000        ~1 day
Large     A100 40GB      ~800         ~1 week
Large     4x A100        ~3000        ~2 days

Memory Requirements

Profile   Parameters   Memory (FP32)   Memory (FP16)
Small     100M         6 GB            4 GB
Medium    2B           24 GB           16 GB
Large     10B          80 GB           48 GB

Getting Help

  1. Check logs: ./outputs/[model_name]/training.log
  2. Search issues: GitHub issues page
  3. Community: Discord/Slack channels
  4. Documentation: This guide and other docs

Happy training! 🚀