ULTRATHINK Advanced Training Guide
Complete guide for training the ULTRATHINK model with all advanced features.
Table of Contents
- Quick Start
- Training Profiles
- Advanced Features
- Configuration System
- Training Environments
- Monitoring & Debugging
- Best Practices
- Troubleshooting
Quick Start
Local Training
Option 1: Using Training Profiles
# Small model (for testing)
scripts/run_training.bat --profile small
# Medium model (production-ready)
scripts/run_training.bat --profile medium
# Large model (full-scale)
scripts/run_training.bat --profile large
Option 2: Using Configuration Files
python train_advanced.py --config configs/train_small.yaml
Option 3: Direct Command Line
python train_ultrathink.py \
--use_mlflow \
--dataset wikitext \
--hidden_size 512 \
--num_layers 6 \
--num_heads 8 \
--batch_size 4 \
--enable_moe \
--enable_dre
Google Colab Training
- Open `colab_training.ipynb` in Google Colab
- Choose your GPU runtime (T4/V100/A100)
- Select a configuration cell and run it
- Monitor progress in MLflow UI or logs
Training Profiles
Small Profile
Hardware: Local machines, 4-8GB VRAM
Use Case: Testing, development, prototyping
Model Size: 512 hidden, 6 layers
Features: Basic transformer
Training Time: ~2-4 hours (10K samples)
Memory: ~6GB VRAM
When to use:
- Testing new features
- Debugging training pipeline
- Quick experiments
- Limited hardware resources
Medium Profile
Hardware: Single GPU (16-32GB), cloud instances
Use Case: Production models, research
Model Size: 2048 hidden, 24 layers
Features: MoE, DRE, Constitutional AI
Training Time: ~1-2 days (100K samples)
Memory: ~20GB VRAM
When to use:
- Production deployments
- Research experiments
- Competitive benchmarks
- Full feature validation
Large Profile
Hardware: Multi-GPU, A100/H100 clusters
Use Case: State-of-the-art models
Model Size: 4096 hidden, 32 layers
Features: All advanced features + Multimodal
Training Time: ~1-2 weeks (1M+ samples)
Memory: ~40-80GB VRAM
When to use:
- Frontier model development
- Large-scale pretraining
- Multimodal applications
- Maximum performance
Advanced Features
1. Mixture of Experts (MoE)
What it does: Routes inputs to specialized expert networks for efficient scaling.
Enable:
advanced:
enable_moe: true
moe:
num_knowledge_experts: 32
num_skill_experts: 16
num_meta_experts: 8
num_safety_experts: 4
moe_top_k: 2
Benefits:
- 5-10x parameter scaling with minimal compute increase
- Specialized knowledge domains
- Better performance on diverse tasks
Considerations:
- Requires more memory for expert parameters
- Best with batch_size >= 4
- Works with expert parallelism for large scale
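For intuition, here is a minimal, self-contained sketch of top-k routing in PyTorch. It only illustrates the general MoE pattern (router logits, softmax over the selected experts, weighted combination of expert outputs); the flat expert list and all names are assumptions for illustration and do not reflect ULTRATHINK's actual knowledge/skill/meta/safety expert grouping.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k mixture-of-experts layer (illustrative only)."""

    def __init__(self, hidden_size: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, 4 * hidden_size),
                nn.GELU(),
                nn.Linear(4 * hidden_size, hidden_size),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden) -> flatten to individual tokens
        tokens = x.reshape(-1, x.size(-1))
        gate_logits = self.router(tokens)                       # (tokens, experts)
        weights, indices = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                    # renormalize over the top-k
        out = torch.zeros_like(tokens)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                    # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape_as(x)

# Example: route a dummy batch through 8 experts, top-2 per token
layer = TopKMoE(hidden_size=512, num_experts=8, top_k=2)
y = layer(torch.randn(4, 16, 512))
```

With `moe_top_k: 2`, each token activates only two experts, which is why total parameter count can grow much faster than per-token compute.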
2. Dynamic Reasoning Engine (DRE)
What it does: Adaptive computational paths based on input complexity.
Enable:
advanced:
enable_dre: true
dre_warmup_steps: 5000
Benefits:
- Adaptive reasoning depth
- Better on complex problems
- Improved efficiency
Considerations:
- Use warmup to stabilize training
- May increase training time initially
- Best for reasoning-heavy tasks
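DRE internals are not documented in this guide, so the sketch below is only a rough mental model: a learned complexity score chooses how many refinement steps an input receives, and a warmup fraction keeps the path shallow early in training. Every name and the gating scheme here are assumptions for illustration, not ULTRATHINK's implementation.

```python
import torch
import torch.nn as nn

class AdaptiveDepthBlock(nn.Module):
    """Toy 'dynamic reasoning' block: harder inputs get more refinement steps."""

    def __init__(self, hidden_size: int, max_steps: int = 4, warmup_steps: int = 5000):
        super().__init__()
        self.max_steps = max_steps
        self.warmup_steps = warmup_steps
        self.complexity_head = nn.Linear(hidden_size, 1)
        self.reason_step = nn.Linear(hidden_size, hidden_size)

    def forward(self, x: torch.Tensor, global_step: int) -> torch.Tensor:
        # Complexity score in [0, 1] from the mean token representation
        complexity = torch.sigmoid(self.complexity_head(x.mean(dim=1))).mean()
        # During warmup, gradually allow deeper reasoning paths
        warmup_frac = min(1.0, global_step / self.warmup_steps)
        steps = 1 + int(complexity.item() * warmup_frac * (self.max_steps - 1))
        for _ in range(steps):
            x = x + torch.tanh(self.reason_step(x))   # residual refinement step
        return x

block = AdaptiveDepthBlock(hidden_size=512)
out = block(torch.randn(2, 16, 512), global_step=1000)
```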
3. Constitutional AI
What it does: Built-in safety and alignment mechanisms.
Enable:
advanced:
enable_constitutional: true
Benefits:
- Safer outputs
- Better alignment
- Reduced harmful content
Considerations:
- Adds overhead (~5-10%)
- Best combined with RLHF
- Requires safety datasets
4. RLHF (Reinforcement Learning from Human Feedback)
What it does: Fine-tunes model based on human preferences.
Enable:
advanced:
enable_rlhf: true
rlhf:
rlhf_frequency: 5
rlhf_iterations: 100
ppo_epochs: 4
When to use:
- After pretraining
- For instruction following
- For alignment fine-tuning
Process:
- Pretrain model without RLHF
- Save checkpoint
- Resume with RLHF enabled
- Fine-tune with preference data
5. Multimodal Capabilities
What it does: Process images, audio, and text together.
Enable:
advanced:
enable_multimodal: true
multimodal:
image_size: 224
patch_size: 14
audio_sample_rate: 16000
Requirements:
- Multimodal datasets
- Larger memory (images/audio)
- Vision/audio encoders
Configuration System
YAML Configuration Structure
# Model architecture
model:
vocab_size: 100352
hidden_size: 2048
num_layers: 24
...
# Advanced features
advanced:
enable_moe: true
enable_dre: true
...
# Training hyperparameters
training:
batch_size: 8
learning_rate: 1e-4
...
# Data configuration
data:
dataset: wikitext
...
# Logging and monitoring
logging:
use_mlflow: true
...
Override Configuration Values
python train_advanced.py \
--config configs/train_medium.yaml \
--override \
training.batch_size=4 \
model.hidden_size=1024 \
advanced.enable_moe=false
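Each `--override` entry is a dotted path into the nested YAML structure. The helper below is a minimal sketch of how such `key.path=value` strings can be applied to a loaded config dict; it illustrates the idea and is not the parser actually used by `train_advanced.py`.

```python
import yaml

def apply_override(config: dict, assignment: str) -> None:
    """Apply a single 'a.b.c=value' override to a nested dict in place."""
    dotted_key, raw_value = assignment.split("=", 1)
    value = yaml.safe_load(raw_value)            # "4" -> 4, "false" -> False, etc.
    node = config
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value

with open("configs/train_medium.yaml") as f:
    config = yaml.safe_load(f)

for item in ["training.batch_size=4", "model.hidden_size=1024", "advanced.enable_moe=false"]:
    apply_override(config, item)

print(config["training"]["batch_size"])   # 4
```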
Create Custom Configurations
- Copy existing config:
cp configs/train_medium.yaml configs/my_config.yaml
- Edit values in `my_config.yaml`
- Run training:
python train_advanced.py --config configs/my_config.yaml
Training Environments
Local Training
Setup
# Install dependencies
pip install -r requirements.txt
# Start MLflow UI
mlflow ui
# Run training
python train_advanced.py --config configs/train_small.yaml
Monitor
- MLflow UI: http://localhost:5000
- Logs: `./outputs/[model_name]/training.log`
- Checkpoints: `./outputs/[model_name]/checkpoint_*.pt`
Google Colab Training
Setup
- Open `colab_training.ipynb`
- Select GPU runtime
- Mount Google Drive
- Install dependencies
Benefits
- Free GPU access (T4)
- Paid options (V100/A100)
- Persistent storage via Drive
- Easy sharing
Limitations
- Session timeouts (~12 hours)
- GPU availability varies
- Slower than dedicated hardware
Cloud Training (AWS/GCP/Azure)
AWS SageMaker
# Use train_advanced.py with SageMaker estimator
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point='train_advanced.py',
    source_dir='.',                          # upload the project code
    role='<sagemaker-execution-role-arn>',   # required by SageMaker
    instance_type='ml.p3.8xlarge',
    instance_count=1,
    framework_version='2.1',                 # pick versions available in your region
    py_version='py310',
    hyperparameters={
        'config': 'configs/train_large.yaml'
    }
)
estimator.fit()
GCP AI Platform
gcloud ai-platform jobs submit training ultrathink_job \
--region=us-central1 \
--master-machine-type=n1-highmem-16 \
--master-accelerator=type=nvidia-tesla-v100,count=4 \
--package-path=./src \
--module-name=train_advanced \
-- \
--config=configs/train_large.yaml
Distributed Training
Multi-GPU (Single Node)
# Using torchrun
torchrun --nproc_per_node=4 train_advanced.py \
--config configs/train_large.yaml \
--override distributed.enabled=true
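`torchrun` starts one process per GPU and sets `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` in the environment. The snippet below is a generic sketch of the setup a training script performs under this launcher; `train_advanced.py` presumably does something equivalent when `distributed.enabled=true`, but the function name here is illustrative.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed(model: torch.nn.Module) -> torch.nn.Module:
    """Initialize the NCCL process group and wrap the model for data parallelism."""
    dist.init_process_group(backend="nccl")          # reads RANK/WORLD_SIZE from env
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)
    return DDP(model, device_ids=[local_rank])

# Each of the 4 processes started by torchrun calls this once before training.
```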
Multi-Node with DeepSpeed
# Create hostfile
echo "node1 slots=8" > hostfile
echo "node2 slots=8" >> hostfile
# Run with DeepSpeed
deepspeed --hostfile=hostfile train_advanced.py \
--config configs/train_large.yaml \
--override \
distributed.enabled=true \
distributed.launcher=deepspeed \
distributed.deepspeed_config=config/deepspeed_z3.json
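Inside the script, DeepSpeed takes over optimizer construction, ZeRO partitioning, and mixed precision via `deepspeed.initialize`. A hedged sketch of that call is shown below; the exact wiring in `train_advanced.py` may differ, and the stand-in model is only a placeholder.

```python
import deepspeed
import torch

model = torch.nn.Linear(1024, 1024)   # placeholder for the real ULTRATHINK model

# DeepSpeed builds the optimizer, ZeRO-3 partitioning, and AMP from the JSON config.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="config/deepspeed_z3.json",
)

# Training step: the engine handles loss scaling, accumulation, and the optimizer step.
# loss = model_engine(batch)
# model_engine.backward(loss)
# model_engine.step()
```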
Monitoring & Debugging
MLflow Tracking
View Experiments:
mlflow ui
# Open http://localhost:5000
Key Metrics:
- `train/loss`: Training loss
- `val/loss`: Validation loss
- `train/learning_rate`: Current learning rate
- `train/grad_norm`: Gradient norms
- `eval/*`: Evaluation metrics
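These are ordinary MLflow metrics, so additional values can be logged from anywhere in the training code with the standard API. A small illustrative example (the experiment and run names are arbitrary):

```python
import mlflow

mlflow.set_experiment("ultrathink")   # experiment name is just an example

with mlflow.start_run(run_name="debug_run"):
    for step in range(100):
        loss = 1.0 / (step + 1)                       # placeholder value
        mlflow.log_metric("train/loss", loss, step=step)
    mlflow.log_artifact("configs/train_small.yaml")   # attach the config used
```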
Artifacts:
- Config files
- Checkpoints
- Evaluation results
- Model exports
Logging
Log Levels:
# In config or code
logging.basicConfig(level=logging.INFO) # INFO, DEBUG, WARNING, ERROR
Log Files:
- Training log: `./outputs/[model_name]/training.log`
- MLflow logs: `./mlruns/`
Debugging Tips
Memory Issues:
# Check GPU memory
nvidia-smi
# Enable memory profiling
python train_advanced.py \
--config configs/train_small.yaml \
--override training.use_amp=true model.gradient_checkpointing=true
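PyTorch can also report peak allocation from inside the process, which is often the fastest way to confirm whether AMP or gradient checkpointing actually lowered memory use; a short generic snippet:

```python
import torch

torch.cuda.reset_peak_memory_stats()

# ... run a few training steps here ...

peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU memory: {peak_gb:.2f} GB")
print(torch.cuda.memory_summary(abbreviated=True))   # detailed allocator breakdown
```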
Slow Training:
# Enable profiling
python -m torch.utils.bottleneck train_advanced.py --config configs/train_small.yaml
NaN/Inf Loss:
# Add gradient clipping
training:
gradient_clipping: 1.0
# Reduce learning rate
training:
learning_rate: 1e-5
# Enable AMP carefully
training:
use_amp: true
Best Practices
1. Start Small, Scale Up
- Test on small config first
- Verify all features work
- Then scale to production
2. Use Mixed Precision (AMP)
training:
use_amp: true
- 2x faster training
- 50% less memory
- Minimal accuracy loss
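With `use_amp: true` the training step typically follows the standard PyTorch autocast/GradScaler pattern. The sketch below shows that generic pattern, not necessarily the exact code path in `train_advanced.py` (and it assumes a model whose output exposes a `.loss` field):

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def training_step(model, batch, optimizer):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():           # forward pass in reduced precision
        loss = model(**batch).loss            # assumes an HF-style output with .loss
    scaler.scale(loss).backward()             # scale loss to avoid FP16 underflow
    scaler.unscale_(optimizer)                # unscale before gradient clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)                    # skips the step if gradients overflowed
    scaler.update()
    return loss.detach()
```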
3. Gradient Checkpointing for Large Models
model:
gradient_checkpointing: true
- Trades compute for memory
- Enables larger models
- ~20% slower, 40% less memory
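This setting relies on `torch.utils.checkpoint`: activations inside the wrapped block are discarded during the forward pass and recomputed during backward, trading compute for memory. A minimal generic illustration:

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(2048, 8192),
    torch.nn.GELU(),
    torch.nn.Linear(8192, 2048),
)

x = torch.randn(4, 128, 2048, requires_grad=True)
# Activations inside `block` are recomputed on backward instead of being stored.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```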
4. Optimize Data Loading
data:
num_workers: 8 # Set to CPU cores
streaming: true # For large datasets
5. Save Checkpoints Regularly
evaluation:
eval_frequency: 1 # Save every epoch
6. Monitor Gradient Norms
- Healthy range: 0.1 - 10
- Too high (>100): Reduce LR or increase clipping
- Too low (<0.01): Increase LR or check optimizer
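A convenient way to track this is that `clip_grad_norm_` returns the total norm it measured before clipping, so clipping and logging can share one call; a small generic sketch (the helper name is made up):

```python
import torch

def clip_and_log_grad_norm(model, max_norm: float = 1.0, step: int = 0) -> float:
    # clip_grad_norm_ returns the total norm measured before clipping
    total_norm = float(torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm))
    # mlflow.log_metric("train/grad_norm", total_norm, step=step)   # optional
    if total_norm > 100:
        print(f"step {step}: grad norm {total_norm:.1f} is high; reduce LR or clip harder")
    return total_norm
```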
7. Use Learning Rate Warmup
training:
warmup_steps: 2000 # Gradual LR increase
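A linear warmup is just a learning-rate multiplier that ramps from 0 to 1 over `warmup_steps`. The sketch below expresses that with a plain `LambdaLR`; the actual schedule in the trainer may also include decay after warmup.

```python
import torch

def linear_warmup(step: int, warmup_steps: int = 2000) -> float:
    # LR multiplier: ramps 0 -> 1 over warmup_steps, then stays at 1
    return min(1.0, (step + 1) / warmup_steps)

optimizer = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=linear_warmup)

# Call scheduler.step() once per optimizer step during training.
```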
8. Enable Advanced Features Gradually
- Train baseline model
- Add MoE
- Add DRE
- Add Constitutional AI
- Fine-tune with RLHF
Troubleshooting
Out of Memory (OOM)
Symptoms: CUDA out of memory error
Solutions:
# 1. Reduce batch size
training:
batch_size: 2
gradient_accumulation_steps: 16 # Maintain effective batch size
# 2. Enable gradient checkpointing
model:
gradient_checkpointing: true
# 3. Reduce sequence length
model:
max_seq_length: 512
# 4. Use smaller model
model:
hidden_size: 512
num_layers: 6
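For solution 1 above, the effective batch size is the product of the per-device batch size, the accumulation steps, and the number of GPUs; with the values shown, on a single GPU:

```python
per_device_batch = 2
grad_accum_steps = 16
num_gpus = 1
effective_batch = per_device_batch * grad_accum_steps * num_gpus   # = 32
```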
Slow Training
Symptoms: Low tokens/second
Solutions:
# 1. Enable flash attention
model:
use_flash_attention: true
# 2. Use AMP
training:
use_amp: true
# 3. Increase batch size
training:
batch_size: 16
# 4. Optimize data loading
data:
num_workers: 8
streaming: true
Training Instability
Symptoms: NaN loss, exploding gradients
Solutions:
# 1. Enable gradient clipping
training:
gradient_clipping: 1.0
# 2. Reduce learning rate
training:
learning_rate: 1e-5
# 3. Increase warmup
training:
warmup_steps: 10000
# 4. Use DRE warmup
advanced:
dre_warmup_steps: 5000
Poor Performance
Symptoms: High validation loss, poor benchmarks
Solutions:
- Train longer (more epochs/steps)
- Increase model size
- Use better datasets
- Enable advanced features (MoE, DRE)
- Fine-tune with RLHF
- Check data quality
- Verify tokenization
- Monitor overfitting
Dataset Issues
Symptoms: Dataset loading errors
Solutions:
# 1. Use dummy dataset for testing
python train_advanced.py \
--config configs/train_small.yaml \
--override data.dataset=dummy
# 2. Check dataset availability
python -c "from datasets import load_dataset; load_dataset('wikitext', 'wikitext-103-v1')"
# 3. Use streaming for large datasets
--override data.streaming=true
Example Training Workflows
Workflow 1: Quick Prototype
# 1. Test with dummy data
python train_advanced.py --config configs/train_small.yaml \
--override data.dataset=dummy data.train_samples=1000
# 2. Run on real data
python train_advanced.py --config configs/train_small.yaml \
--override data.dataset=wikitext training.num_epochs=1
Workflow 2: Production Training
# 1. Pretrain baseline
python train_advanced.py --config configs/train_medium.yaml \
--run-name "baseline_v1"
# 2. Resume with advanced features
python train_advanced.py --config configs/train_medium.yaml \
--init-from ./outputs/medium_model/final_model \
--override advanced.enable_moe=true advanced.enable_dre=true \
--run-name "advanced_v1"
# 3. Fine-tune with RLHF
python train_advanced.py --config configs/train_medium.yaml \
--init-from ./outputs/medium_model/final_model \
--override advanced.enable_rlhf=true training.learning_rate=1e-5 \
--run-name "rlhf_v1"
Workflow 3: Multi-GPU Training
# Use torchrun for data parallelism
torchrun --nproc_per_node=4 train_advanced.py \
--config configs/train_large.yaml \
--override distributed.enabled=true
Performance Benchmarks
Expected Throughput
| Profile | Hardware | Tokens/sec | Time (100K samples) |
|---|---|---|---|
| Small | GTX 1080 Ti | ~5000 | 4 hours |
| Medium | RTX 3090 | ~2000 | 1 day |
| Large | A100 40GB | ~800 | 1 week |
| Large | 4x A100 | ~3000 | 2 days |
Memory Requirements
| Profile | Parameters | Memory (FP32) | Memory (FP16) |
|---|---|---|---|
| Small | 100M | 6 GB | 4 GB |
| Medium | 2B | 24 GB | 16 GB |
| Large | 10B | 80 GB | 48 GB |
Additional Resources
- Documentation: README.md
- Model Card: MODEL_CARD.md
- API Reference: docs/README.md
- GitHub Issues: Report bugs and request features
- Discord/Slack: Community support (if available)
Getting Help
- Check logs: `./outputs/[model_name]/training.log`
- Search issues: GitHub issues page
- Community: Discord/Slack channels
- Documentation: This guide and other docs
Happy training!