The corpipe25-corefud1.3-base-251101 Model
corpipe25-corefud1.3-base-251101 is a multilingual coreference resolution model based on umT5-base,
usable in CorPipe 25 (https://github.com/ufal/crac2025-corpipe).
It is released on LINDAT/CLARIAH-CZ and on
HuggingFace under the CC BY-NC-SA 4.0 license.
The model is downloaded automatically from HuggingFace when running prediction with
the --load ufal/corpipe25-corefud1.3-base-251101 argument.
The model is language agnostic, so in theory it can be used to predict
coreference in any language covered by umT5; for zero-shot cross-lingual evaluation,
please refer to the CRAC 2025 paper.
The model expects empty nodes to be already present in the input, as predicted by https://github.com/ufal/crac2025_empty_nodes_baseline.
The model was trained using the following command (see the CorPipe 25 repository for more information):
tbs="ca_ancora cs_pcedt cs_pdt cu_proiel de_potsdamcc en_gum en_litbank es_ancora fr_ancor fr_democrat grc_proiel hbo_ptnk hi_hdtb hu_korkor hu_szegedkoref ko_ecmt lt_lcc no_bokmaalnarc no_nynorsknarc pl_pcc ru_rucor tr_itcc"
python3 corpipe25.py --train --dev --treebanks $(for c in $tbs; do echo data/$c/$c-corefud-train.conllu; done) --batch_size=8 --learning_rate=6e-4 --learning_rate_decay --adafactor --encoder=google/umt5-base --exp=corpipe25-corefud1.3-base --compile
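The shell loop in the command above expands each treebank name into a path of the form data/NAME/NAME-corefud-train.conllu. For readers who script the run from Python, a minimal sketch of the same expansion follows; the helper name `train_paths` and its `data_dir` parameter are illustrative assumptions and are not part of CorPipe.

```python
def train_paths(treebanks, data_dir="data"):
    """Expand a space-separated treebank list into training file paths.

    Mirrors the shell loop `for c in $tbs; do echo data/$c/$c-corefud-train.conllu; done`.
    """
    return [f"{data_dir}/{tb}/{tb}-corefud-train.conllu" for tb in treebanks.split()]


if __name__ == "__main__":
    # A shortened treebank list, used here only for illustration.
    for path in train_paths("ca_ancora cs_pcedt cu_proiel"):
        print(path)
```

The resulting list can be passed as the values of the --treebanks argument, for example through `subprocess.run`.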
CorefUD 1.3 Test Sets Results
The model achieves the following CorefUD 1.3 test set results (as reported in
the paper); a segment size of 2560 was used, except for cu_proiel and
grc_proiel, where it was 512:
| avg | ca | cs_pce | cs_pdt | cu | de_pot | en_gum | en_lit | es | fr_anc | fr_dem | grc | hbo_pt | hi | hu_kor | hu_sze | ko_ecm | lt | no_bok | no_nyn | pl | ru | tr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 69.27 | 77.4 | 73.5 | 75.1 | 53.5 | 62.0 | 71.0 | 72.8 | 78.6 | 71.2 | 66.7 | 64.9 | 59.0 | 72.7 | 61.5 | 63.7 | 67.8 | 72.9 | 73.2 | 70.4 | 74.5 | 77.8 | 63.9 |
Running the Model on Plain Text
To run the model on plain text, the text must first be tokenized and converted to CoNLL-U (and optionally parsed, if you also want mention heads), for example with UDPipe 2:
curl -F data="Eve came home and Peter greeted her there. Then Peter and Paul set out to a trip and Eve waved them off." \
-F model=english -F tokenizer= -F tagger= -F parser= https://lindat.mff.cuni.cz/services/udpipe/api/process \
| python -X utf8 -c "import sys,json; sys.stdout.write(json.load(sys.stdin)['result'])" >input.conllu
The CoNLL-U file can then be processed by CorPipe 25, for example with
python3 corpipe25.py --load ufal/corpipe25-corefud1.3-base-251101 --exp . --epoch 0 --test input.conllu
which would generate the following predictions in input.00.conllu:
# generator = UDPipe 2, https://lindat.mff.cuni.cz/services/udpipe
# udpipe_model = english-ewt-ud-2.17-251125
# udpipe_model_licence = CC BY-NC-SA
# newdoc
# global.Entity = eid-etype-head-other
# newpar
# sent_id = 1
# text = Eve came home and Peter greeted her there.
1 Eve Eve PROPN NNP Number=Sing 2 nsubj _ Entity=(c1--1)
2 came come VERB VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 0 root _ _
3 home home ADV RB _ 2 advmod _ Entity=(c2--1)
4 and and CCONJ CC _ 6 cc _ _
5 Peter Peter PROPN NNP Number=Sing 6 nsubj _ Entity=(c3--1)
6 greeted greet VERB VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 2 conj _ _
7 her she PRON PRP Case=Acc|Gender=Fem|Number=Sing|Person=3|PronType=Prs 6 obj _ Entity=(c1--1)
8 there there ADV RB PronType=Dem 6 advmod _ Entity=(c2--1)|SpaceAfter=No
9 . . PUNCT . _ 2 punct _ _
# sent_id = 2
# text = Then Peter and Paul set out to a trip and Eve waved them off.
1 Then then ADV RB PronType=Dem 5 advmod _ _
2 Peter Peter PROPN NNP Number=Sing 5 nsubj _ Entity=(c4--1(c3--1)
3 and and CCONJ CC _ 4 cc _ _
4 Paul Paul PROPN NNP Number=Sing 2 conj _ Entity=(c5--1)c4)
5 set set VERB VBD Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin 0 root _ _
6 out out ADP RP _ 5 compound:prt _ _
7 to to ADP IN _ 9 case _ _
8 a a DET DT Definite=Ind|PronType=Art 9 det _ Entity=(c6--2
9 trip trip NOUN NN Number=Sing 5 obl _ Entity=c6)
10 and and CCONJ CC _ 12 cc _ _
11 Eve Eve PROPN NNP Number=Sing 12 nsubj _ Entity=(c1--1)
12 waved wave VERB VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 5 conj _ _
13 them they PRON PRP Case=Acc|Number=Plur|Person=3|PronType=Prs 12 obj _ Entity=(c4--1)
14 off off ADP RP _ 12 compound:prt _ SpaceAfter=No
15 . . PUNCT . _ 5 punct _ SpaceAfter=No
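The predictions are stored in the Entity attribute of the last (MISC) column: `(c6--2` opens a mention of entity c6, `c6)` closes it, and `(c1--1)` is a single-word mention. The following minimal sketch groups the mention strings by entity. It assumes tab-separated CoNLL-U (as in the actual output file), entity ids without hyphens, and it does not merge the parts of discontinuous mentions; the function name `read_clusters` is illustrative and is not part of CorPipe.

```python
import re
from collections import defaultdict

# One piece of an Entity value: "(eid-...", "(eid-...)" or "eid)".
ENTITY_PART = re.compile(
    r"\((?P<open>[^-()]+)(?:-[^()]*)?(?P<single>\))?|(?P<close>[^-()]+)\)"
)


def read_clusters(conllu_text):
    """Map each entity id to the list of its mention strings, in document order."""
    clusters = defaultdict(list)
    open_mentions = defaultdict(list)  # eid -> stack of word lists of unclosed mentions
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Skip comments, blank lines, and multi-word token ranges such as "1-2".
        if line.startswith("#") or len(cols) != 10 or "-" in cols[0]:
            continue
        form, misc = cols[1], cols[9]
        to_close = []
        for attr in misc.split("|"):
            if not attr.startswith("Entity="):
                continue
            for part in ENTITY_PART.finditer(attr[len("Entity="):]):
                if part["open"]:
                    open_mentions[part["open"]].append([])
                    if part["single"]:
                        to_close.append(part["open"])
                else:
                    to_close.append(part["close"])
        # The current word belongs to every mention that is open at this point.
        for stack in open_mentions.values():
            for words in stack:
                words.append(form)
        for eid in to_close:
            clusters[eid].append(" ".join(open_mentions[eid].pop()))
    return dict(clusters)


# Four tokens from the second sentence above, re-joined with tabs.
rows = [
    "2 Peter Peter PROPN NNP Number=Sing 5 nsubj _ Entity=(c4--1(c3--1)",
    "3 and and CCONJ CC _ 4 cc _ _",
    "4 Paul Paul PROPN NNP Number=Sing 2 conj _ Entity=(c5--1)c4)",
    "13 them they PRON PRP Case=Acc|Number=Plur|Person=3|PronType=Prs 12 obj _ Entity=(c4--1)",
]
sample = "\n".join("\t".join(row.split(" ")) for row in rows)
clusters = read_clusters(sample)
print(clusters)
```

For these four tokens, entity c4 receives the mentions "Peter and Paul" and "them", which corresponds to the coreference link between the coordination and the pronoun in the example output.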
How to Cite
@inproceedings{straka-2025-corpipe,
title = "{C}or{P}ipe at {CRAC} 2025: Evaluating Multilingual Encoders for Multilingual Coreference Resolution",
author = "Straka, Milan",
editor = "Ogrodniczuk, Maciej and Novak, Michal and Poesio, Massimo and Pradhan, Sameer and Ng, Vincent",
booktitle = "Proceedings of the Eighth Workshop on Computational Models of Reference, Anaphora and Coreference",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.crac-1.11/",
doi = "10.18653/v1/2025.crac-1.11",
pages = "130--139",
}