---
license: apache-2.0
language:
- en
tags:
- moe
- olmo
- olmoe
co2_eq_emissions: 1
datasets:
- allenai/OLMoE-mix-0924
library_name: transformers
---
# OLMoE with Adapters
## Model Summary
This repository contains an extension of the OLMoE model with adapter layers for parameter-efficient fine-tuning. By adding small adapter modules, the model can be fine-tuned on downstream tasks while most of the original parameters stay frozen, which makes training substantially more memory- and compute-efficient.
## Model Architecture
The `OlmoEWithAdaptersForCausalLM` model extends the original OLMo architecture by:
1. Adding small adapter layers (bottleneck layers) to each MLP block
2. Allowing selective freezing of the base model's parameters
3. Training only the adapter parameters (~0.1-1% of total parameters)
Key components:
- `OlmoEWithAdaptersMLP`: MLP layer with additional adapter modules
- `OlmoEWithAdaptersDecoderLayer`: Decoder layer incorporating adapter MLPs
- `OlmoEWithAdaptersModel`: Full model with adapter-based decoder layers
- `OlmoEWithAdaptersForCausalLM`: Causal language model with adapters
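The adapter inside `OlmoEWithAdaptersMLP` is a small bottleneck network whose output is added back to the hidden states. The sketch below illustrates the general pattern; the class name `AdapterLayer` and the near-zero initialization are illustrative choices, not necessarily the exact code in `modeling_olmoe.py`.

```python
import torch
import torch.nn as nn

class AdapterLayer(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual add."""

    def __init__(self, hidden_size: int, adapter_size: int = 64):
        super().__init__()
        self.down_proj = nn.Linear(hidden_size, adapter_size)
        self.up_proj = nn.Linear(adapter_size, hidden_size)
        self.act = nn.GELU()
        # Initialize the up-projection near zero so the adapter starts as an
        # (approximate) identity mapping and does not disturb the frozen base model.
        nn.init.zeros_(self.up_proj.weight)
        nn.init.zeros_(self.up_proj.bias)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up_proj(self.act(self.down_proj(hidden_states)))
```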
## Training Script
The `train_olmoe_adapters.py` script provides a complete workflow for fine-tuning the model:
### Features:
- Parameter-efficient fine-tuning using adapters
- Support for various datasets through Hugging Face datasets library
- Customizable adapter size
- Option to freeze/unfreeze different components (see the sketch after this list)
- Training with AdamW optimizer and learning rate scheduling
- Evaluation with perplexity metrics
- Model checkpointing and saving
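Freezing the base model comes down to disabling gradients for every parameter that does not belong to an adapter. A minimal sketch, assuming adapter parameters can be recognized by an `"adapter"` substring in their names; the helper name `freeze_base_model` and that naming convention are assumptions, not necessarily what the script does.

```python
def freeze_base_model(model, adapter_keyword: str = "adapter"):
    """Disable gradients for all parameters except the adapter modules."""
    trainable, total = 0, 0
    for name, param in model.named_parameters():
        param.requires_grad = adapter_keyword in name
        total += param.numel()
        if param.requires_grad:
            trainable += param.numel()
    print(f"Trainable parameters: {trainable:,} / {total:,} "
          f"({100 * trainable / total:.2f}%)")
```

Call this before building the optimizer so that only adapter parameters receive updates, e.g. `torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=5e-5)`.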
### Usage:
```bash
python train_olmoe_adapters.py \
--model_name_or_path allenai/OLMo-7B \
--adapter_size 64 \
--freeze_base_model True \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--output_dir ./olmoe-adapter-finetuned \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--learning_rate 5e-5 \
--warmup_steps 100 \
--logging_steps 100 \
--save_steps 1000 \
--seed 42
```
## Benefits of Adapter-Based Fine-Tuning
1. **Efficiency**: Train only ~0.1-1% of the parameters, dramatically reducing GPU memory requirements
2. **Storage**: Store only adapter weights rather than full fine-tuned models (see the sketch after this list)
3. **Composability**: Multiple adapters can be trained for different tasks and swapped at inference time
4. **Reduced Overfitting**: Lower parameter count helps prevent overfitting on small datasets
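One way to realize the storage and composability benefits is to serialize only the adapter tensors and load them into a base model per task. This is a sketch under the same assumed naming convention as above, not the serialization format actually used by the training script.

```python
import torch

def save_adapter_weights(model, path: str, adapter_keyword: str = "adapter"):
    """Save only the adapter tensors (typically well under 1% of the model)."""
    adapter_state = {k: v for k, v in model.state_dict().items() if adapter_keyword in k}
    torch.save(adapter_state, path)

def load_adapter_weights(model, path: str):
    """Load adapter tensors into a model that already has adapter modules."""
    model.load_state_dict(torch.load(path, map_location="cpu"), strict=False)
    return model
```

Because the adapter files are small, several task-specific adapters can be kept on disk and swapped by calling `load_adapter_weights` with a different path.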
## How to Use the Fine-Tuned Model
```python
from transformers import AutoTokenizer
from modeling_olmoe import OlmoEWithAdaptersForCausalLM
# Load the fine-tuned model
model = OlmoEWithAdaptersForCausalLM.from_pretrained("./olmoe-adapter-finetuned")
tokenizer = AutoTokenizer.from_pretrained("./olmoe-adapter-finetuned")
# Generate text
inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Adapter Size Recommendations
The adapter size determines the parameter efficiency vs. performance trade-off:
- **Small datasets**: 16-32 dimensions
- **Medium datasets**: 64-128 dimensions
- **Large datasets**: 128-256 dimensions
For most fine-tuning scenarios, an adapter size of 64 provides a good balance between efficiency and performance.
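As a rough rule, each adapter adds about `2 * hidden_size * adapter_size` parameters (two projection matrices, plus biases). A back-of-envelope estimate, using an illustrative hidden size and layer count rather than the exact OLMoE configuration:

```python
def adapter_param_count(hidden_size: int, adapter_size: int, num_layers: int) -> int:
    """Approximate adapter parameters: down- and up-projection (with biases) per layer."""
    per_layer = 2 * hidden_size * adapter_size + hidden_size + adapter_size
    return per_layer * num_layers

# Illustrative numbers only -- substitute the real hidden size / layer count of your model.
for size in (16, 64, 256):
    print(size, f"{adapter_param_count(hidden_size=2048, adapter_size=size, num_layers=16):,}")
```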