---
library_name: transformers
license: cc-by-4.0
language:
- en
datasets:
- babylm-anon/stratified_10m_curriculum
---

# Model Card for TICL

A RoBERTa model pre-trained on a dataset of 10M words using (**T**raining Data) **I**nfluence-driven **C**urriculum **L**earning.

## Model Details

See our paper at REDACTED for details on our method.

### Model Description

This is a model submitted to the strict-small track of the 2025 BabyLM challenge.

- **Developed by:** REDACTED
- **Funded by:** REDACTED
- **Model type:** Masked language model
- **Language(s) (NLP):** eng
- **License:** CC-BY-4.0

### Model Sources

- **Repository:** [https://anonymous.4open.science/r/cl-4B5C](https://anonymous.4open.science/r/cl-4B5C)

## Uses

This model was trained to demonstrate the effectiveness of a novel curriculum learning method over training in random order.

## Training Details

### Training Data

We use [this](https://huggingface.co/datasets/babylm-anon/stratified_10m_curriculum) dataset, built from the following existing corpora:

- C1: Child-Directed Speech
  - CHILDES [(MacWhinney 2000)](https://doi.org/10.4324/9781315805641)
- C2: Children's Books
  - [Children Stories Text Corpus](https://www.kaggle.com/datasets/edenbd/children-stories-text-corpus) [(Bensaid et al. 2021)](https://doi.org/10.48550/arXiv.2108.04324)
  - Children's Book Test [(Hill et al. 2016)](https://doi.org/10.48550/arXiv.1511.02301)
- C3: Dialogue
  - OpenSubtitles [(Lison and Tiedemann 2016)](https://aclanthology.org/L16-1147)
  - Switchboard Dialog Act Corpus [(Stolcke et al. 2000)](https://aclanthology.org/J00-3003)
  - [British National Corpus (BNC), dialogue portion](http://hdl.handle.net/20.500.14106/2554)
- C4: Educational
  - Simple Wiki [(Warstadt et al. 2023)](https://doi.org/10.48550/arXiv.2301.11796)
  - [QED](https://opus.nlpl.eu/download.php?f=QED/v2.0a/xml/en.zip) [(Abdelali et al. 2014)](https://aclanthology.org/L14-1675/)
- C5: Written English
  - Standardized Project Gutenberg Corpus [(Gerlach and Font-Clos 2018)](https://arxiv.org/abs/1812.08092)
  - Wikipedia [(Warstadt et al. 2023)](https://arxiv.org/abs/2301.11796)

### Data mix

| Domain                    |     Words | % Words | Documents | % Documents |
|:--------------------------|----------:|--------:|----------:|------------:|
| C1: Child-Directed Speech | 1,999,999 |  20.00% |   360,533 |      33.68% |
| C2: Children's Books      | 1,999,995 |  20.00% |    77,384 |       7.23% |
| C3: Dialogue              | 1,999,987 |  20.00% |   349,650 |      32.67% |
| C4: Educational           | 1,999,999 |  20.00% |   161,554 |      15.09% |
| C5: Written English       | 1,999,945 |  20.00% |   121,200 |      11.32% |

### Training Procedure

We extract training data influence estimates from models trained in random order and sort the training data according to those estimates, using the strategies detailed in the paper. This is the overall best-performing model in our experiments: it is trained in order of increasing influence, with examples re-weighted by a lognormal filter (see the paper for details).
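As a rough illustration of this ordering step, the sketch below sorts examples by increasing influence and resamples them with a lognormal weight over the normalized ranks. The function name, the use of `scipy.stats.lognorm`, and the `sigma`/`mu` defaults are assumptions made for the sketch only; the exact filter and its parameters are defined in the paper.

```python
import numpy as np
from scipy.stats import lognorm

def lognormal_curriculum(influence, sigma=1.0, mu=0.0, seed=0):
    """Sketch: order examples by increasing influence, resampled with a
    lognormal weight over the normalized ranks (assumed parameters)."""
    influence = np.asarray(influence)
    order = np.argsort(influence)                      # increasing influence
    ranks = (np.arange(len(order)) + 1) / len(order)   # normalized ranks in (0, 1]
    weights = lognorm.pdf(ranks, s=sigma, scale=np.exp(mu))
    weights /= weights.sum()
    rng = np.random.default_rng(seed)
    picked = rng.choice(order, size=len(order), replace=True, p=weights)
    return picked[np.argsort(influence[picked])]       # keep the increasing order

# e.g. reorder a Hugging Face dataset before training:
# ordered_train = train_dataset.select(lognormal_curriculum(influence_scores))
```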
#### Training Hyperparameters

We employ a novel curriculum learning strategy in which the model is trained in a non-random order, for a total of 100M words. A configuration sketch mirroring these settings is included at the end of this card.

| Parameter                    | Value      |
|:-----------------------------|:-----------|
| **Shared Hyperparameters**   |            |
| Vocabulary size              | 52k        |
| Hidden size                  | 768        |
| Number of layers             | 12         |
| Number of attention heads    | 12         |
| Initializer range            | 0.02       |
| Tie word embeddings          | True       |
| **Model-Specific Settings**  |            |
| Max position embeddings      | 514        |
| Intermediate (FFN) size      | 3072       |
| Norm epsilon                 | 1e-5       |
| Attention dropout            | 0.1        |
| Activation function          | gelu       |
| Hidden dropout               | 0.1        |
| **Training Setup**           |            |
| FP16                         | False      |
| Per-device batch size        | 32         |
| Gradient accumulation steps  | 16         |
| GPUs                         | 4          |
| Adam β₁                      | 0.9        |
| Adam β₂                      | 0.98       |
| Adam ε                       | 1e-6       |
| Weight decay                 | 0.01       |
| Learning rate                | 5e-4       |
| Scheduler                    | polynomial |

## Evaluation

We use the [evaluation pipeline](https://github.com/babylm/evaluation-pipeline-2025) of the 2025 BabyLM challenge.

### Results

| Task                    | Score     |
|:------------------------|----------:|
| (Super)GLUE             | 0.579     |
| blimp_filtered          | 0.688     |
| supplement_filtered     | 0.559     |
| entity_tracking         | 0.302     |
| ewok_filtered           | 0.509     |
| wug_adj_nominalization  | 0.570     |
| **Macro accuracy**      | **0.584** |

## Model Card Contact

REDACTED
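## Configuration Sketch

As referenced in the Training Hyperparameters section, the table above corresponds approximately to the `transformers` setup below. This is a minimal sketch rather than the exact training script: the values are taken from the table, but the `output_dir`, the precise vocabulary size behind "52k", and anything not listed in the table (left at library defaults) are assumptions, and the data collation, tokenizer training, and curriculum ordering are omitted.

```python
from transformers import RobertaConfig, RobertaForMaskedLM, TrainingArguments

# Model configuration mirroring the "Shared" and "Model-Specific" rows of the table
config = RobertaConfig(
    vocab_size=52_000,                  # "52k" in the table; exact value assumed
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=514,
    initializer_range=0.02,
    layer_norm_eps=1e-5,
    tie_word_embeddings=True,
)
model = RobertaForMaskedLM(config)

# Optimization settings mirroring the "Training Setup" rows of the table
training_args = TrainingArguments(
    output_dir="ticl-roberta",          # hypothetical output path
    per_device_train_batch_size=32,
    gradient_accumulation_steps=16,
    learning_rate=5e-4,
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    lr_scheduler_type="polynomial",
    fp16=False,
)
```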