Why is merges.txt empty in DeepChem/ChemBERTa-77M-MTR?

by Mafuton - opened Feb 3, 2025


Hi,

I downloaded the tokenizer for DeepChem/ChemBERTa-77M-MTR (and also checked ChemBERTa-77M-MLM) and found that the merges.txt file is empty. Since this tokenizer is supposed to use Byte Pair Encoding (BPE), I expected merges.txt to contain the merge rules. With no merge rules, tokenization does not work as I expected: "Cl" is split into "C" and "l" instead of being kept as a single token.
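To illustrate what I mean, here is a minimal sketch of how BPE applies merge rules. It is a toy illustration, not the actual Hugging Face implementation: the function name `apply_bpe` is hypothetical, and it works on plain characters rather than on the byte-level symbols a real tokenizer would use. It only shows why an empty merge list leaves every character as its own symbol.

```python
def apply_bpe(word, merges):
    """Toy BPE: repeatedly apply the highest-priority merge found in the word.

    `merges` is an ordered list of symbol pairs, earliest = highest priority,
    which is the role the lines of merges.txt play.
    """
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)  # start from single characters
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair with the lowest rank (highest priority).
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable merge rule left
        merged = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


print(apply_bpe("CCl", []))            # ['C', 'C', 'l']
print(apply_bpe("CCl", [("C", "l")]))  # ['C', 'Cl']
```

With an empty merge list the loop exits immediately, so "Cl" can never be joined at the merge stage, which matches the split I am seeing.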

Could you clarify why merges.txt is empty? Should there be a proper merges.txt, or is this the intended behavior?

Thanks!
