UniRec-0.1B: Unified Text and Formula Recognition with 0.1B Parameters
[Paper] [Code] [ModelScope Demo] [Hugging Face Demo] [Local Demo] [UniRec40M Dataset]
Introduction
UniRec-0.1B is a unified recognition model with only 0.1B parameters, designed for accurate and efficient recognition of plain text (words, lines, paragraphs), mathematical formulas (single-line and multi-line), and mixed content in both Chinese and English.
It addresses structural variability and semantic entanglement by using a hierarchical supervision training strategy and a semantic-decoupled tokenizer. Despite its small size, it achieves performance comparable to or better than much larger vision-language models.
Get Started with UniRec
Dependencies:
- PyTorch version >= 1.13.0
- Python version >= 3.7
conda create -n openocr python==3.10
conda activate openocr
# install the GPU version of torch (>= 1.13.0)
conda install pytorch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 pytorch-cuda=11.8 -c pytorch -c nvidia
# or cpu version
conda install pytorch torchvision torchaudio cpuonly -c pytorch
git clone https://github.com/Topdu/OpenOCR.git
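Optionally, you can verify that the intended PyTorch build is active before continuing; this is a minimal sanity check, not part of the official setup:
# minimal sanity check (optional): confirm the installed PyTorch build
import torch

print("torch version:", torch.__version__)           # should be >= 1.13.0
print("CUDA available:", torch.cuda.is_available())  # True for the GPU build, False for the CPU build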
Downloading the UniRec Model from ModelScope or Hugging Face
cd OpenOCR
pip install -r requirements.txt
# download the model from ModelScope
modelscope download topdu/unirec-0.1b --local_dir ./unirec-0.1b
# or download the model from Hugging Face
huggingface-cli download topdu/unirec-0.1b --local-dir ./unirec-0.1b
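If you prefer scripting the download, the huggingface_hub Python API offers an equivalent of the CLI command above; a minimal sketch:
# illustrative alternative to the huggingface-cli command above
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="topdu/unirec-0.1b",  # same repository as the CLI command
    local_dir="./unirec-0.1b",    # same target directory
)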
Inference
# Global.infer_img accepts either an image folder or a single image file
python tools/infer_rec.py --c ./configs/rec/unirec/focalsvtr_ardecoder_unirec.yml --o Global.infer_img=/path/to/img_folder_or_img_file
Local Demo
pip install gradio==4.20.0
python demo_unirec.py
Training
Additional dependencies:
pip install PyMuPDF
pip install pdf2image
pip install numpy==1.26.4
pip install albumentations==1.4.24
pip install transformers==4.49.0
pip install -U flash-attn --no-build-isolation
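A quick, optional way to confirm that these extra training dependencies import with the pinned versions (a sanity-check sketch, not part of the official instructions):
# optional sanity check: the additional training dependencies should import cleanly
import fitz  # PyMuPDF
import pdf2image
import flash_attn
import numpy, albumentations, transformers

print("numpy:", numpy.__version__)                    # pinned to 1.26.4 above
print("albumentations:", albumentations.__version__)  # pinned to 1.4.24 above
print("transformers:", transformers.__version__)      # pinned to 4.49.0 above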
It is recommended to organize your working directory as follows:
|-UniRec40M # Main directory for UniRec40M dataset
|-OpenOCR # Directory for OpenOCR-related files
|-evaluation # Directory for evaluation dataset
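Before training, you can check that your layout matches the tree above with a short script (the directory names are taken directly from the tree; adjust if yours differ):
# optional: verify the recommended working-directory layout
from pathlib import Path

root = Path(".")  # run from the directory that contains UniRec40M, OpenOCR, and evaluation
for name in ("UniRec40M", "OpenOCR", "evaluation"):
    print(name, "->", "found" if (root / name).is_dir() else "MISSING")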
Download the UniRec40M dataset from Hugging Face
# download a small subset of the data for quick training
huggingface-cli download topdu/UniRec40M --include "hiertext_lmdb/**" --repo-type dataset --local-dir ./UniRec40M/
huggingface-cli download topdu/OpenOCR-Data --include "evaluation/**" --repo-type dataset --local-dir ./
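To confirm that the subset downloaded correctly, you can open one of the LMDB folders with the lmdb package. The num-samples key below follows the common OCR LMDB convention; treat both the key and the exact folder path as assumptions, and point the path at wherever the data.mdb file actually lives:
# optional check of a downloaded LMDB folder; the path and the b'num-samples'
# key are assumptions based on the usual OCR LMDB layout
import lmdb

env = lmdb.open("./UniRec40M/hiertext_lmdb", readonly=True, lock=False)
with env.begin() as txn:
    print("num-samples:", txn.get(b"num-samples"))  # None means the key convention differs
env.close()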
Run the following command to train the model quickly:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --master_port=23333 --nproc_per_node=8 tools/train_rec.py --c configs/rec/unirec/focalsvtr_ardecoder_unirec.yml
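If you have fewer GPUs, adjust CUDA_VISIBLE_DEVICES and --nproc_per_node accordingly; the sketch below simply rebuilds the same command from the detected GPU count (illustrative only, not an official launcher):
# illustrative helper: build the launch command from the detected GPU count
import torch

n_gpus = torch.cuda.device_count()
print(
    f"python -m torch.distributed.launch --master_port=23333 "
    f"--nproc_per_node={n_gpus} tools/train_rec.py "
    f"--c configs/rec/unirec/focalsvtr_ardecoder_unirec.yml"
)  # run the printed command from the OpenOCR directory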
Downloading the full dataset requires 3.5 TB of available storage space. Then, you need to merge the split files named data.mdb.part_* (located in HWDB2Train, ch_pdf_lmdb, and en_pdf_lmdb) into a single data.mdb file. Execute the commands below step by step:
# download the full dataset
huggingface-cli download topdu/UniRec40M --repo-type dataset --local-dir ./UniRec40M/
(cd UniRec40M/HWDB2Train/image_lmdb && cat data.mdb.part_* > data.mdb)
(cd UniRec40M/ch_pdf_lmdb && cat data.mdb.part_* > data.mdb)
(cd UniRec40M/en_pdf_lmdb && cat data.mdb.part_* > data.mdb)
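If shell concatenation is inconvenient (for example on Windows), the merge can also be done in Python; sorting the part files reproduces the order of the shell glob:
# illustrative Python equivalent of `cat data.mdb.part_* > data.mdb`
import glob
import shutil

def merge_parts(folder):
    parts = sorted(glob.glob(f"{folder}/data.mdb.part_*"))  # lexicographic, like the shell glob
    with open(f"{folder}/data.mdb", "wb") as out:
        for part in parts:
            with open(part, "rb") as f:
                shutil.copyfileobj(f, out)

for folder in ("UniRec40M/HWDB2Train/image_lmdb",
               "UniRec40M/ch_pdf_lmdb",
               "UniRec40M/en_pdf_lmdb"):
    merge_parts(folder)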
Then modify the configs/rec/unirec/focalsvtr_ardecoder_unirec.yml file as follows:
...
Train:
  dataset:
    name: NaSizeDataSet
    divided_factor: &d_factor [64, 64] # w, h
    max_side: &max_side [960, 1408] # w, h
    root_path: path/to/UniRec40M
    add_return: True
    zoom_min_factor: 4
    use_zoom: True
    all_data: True
    test_data: False
    use_aug: True
    use_linedata: True
    transforms:
      - UniRecLabelEncode: # class handling the label
          max_text_length: *max_text_length
          vlmocr: True
          tokenizer_path: *vlm_ocr_config # path to the tokenizer, e.g. 'vocab.json', 'merges.txt'
      - KeepKeys:
          keep_keys: ['image', 'label', 'length'] # dataloader will return a list in this order
  sampler:
    name: NaSizeSampler
    # divided_factor: ensures the width and height dimensions can be divided by the downsampling multiple
    min_bs: 1
    max_bs: 24
  loader:
    shuffle: True
    batch_size_per_card: 64
    drop_last: True
    num_workers: 8
...
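After editing, a quick load of the config confirms the YAML still parses and that root_path points at the dataset; this sketch assumes the file defines the anchors it references elsewhere (e.g. max_text_length, vlm_ocr_config):
# optional: confirm the edited config still parses and root_path exists
import os
import yaml

with open("configs/rec/unirec/focalsvtr_ardecoder_unirec.yml", "r", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

root_path = cfg["Train"]["dataset"]["root_path"]
print("root_path:", root_path, "exists:", os.path.isdir(root_path))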
Citation
If you find our method useful for your research, please cite:
@article{du2025unirec,
  title={UniRec-0.1B: Unified Text and Formula Recognition with 0.1B Parameters},
  author={Yongkun Du and Zhineng Chen and Yazhen Xie and Weikang Bai and Hao Feng and Wei Shi and Yuchen Su and Can Huang and Yu-Gang Jiang},
  journal={arXiv preprint arXiv:2512.21095},
  year={2025}
}