---
license: mit
---

# MedTok: Multimodal Medical Code Tokenizer

## Overview of MedTok

MedTok is a multimodal tokenizer for medical codes that combines text descriptions of codes with graph-based representations of the dependencies between codes, derived from clinical ontologies and standard medical terminologies. MedTok is a general-purpose tokenizer that can be integrated into any transformer-based model or system that requires tokenization of medical codes.

## How to use MedTok?

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mims-harvard/MedTok", trust_remote_code=True)

tokens = tokenizer("E11.9")
embed = tokenizer.embed("E11.9")
```

- `embed` is the quantized embedding for the input medical code. If you prefer to use the pre-tokenized embedding for each medical code, download it from [mims-harvard/MedTok](https://huggingface.co/mims-harvard/MedTok) or [code2embeddings.json.zip](https://doi.org/10.7910/DVN/7XNT3M) directly. Place the downloaded embedding file at `MedTok/embedding.npy` to run the EHR or QA tasks based on MedTok; a loading sketch is shown below.
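If you work from the precomputed embeddings rather than calling the tokenizer, the sketch below shows one way to load them. It assumes the unzipped `code2embeddings.json` maps each medical code string to its embedding vector and that `MedTok/embedding.npy` is a plain array of embeddings; the actual layout of the released files may differ, so adjust the loading code accordingly.

```python
# A minimal loading sketch. Assumptions (not guaranteed by the release):
# - code2embeddings.json maps each medical code string to a list of floats.
# - MedTok/embedding.npy is a 2D array of embeddings, one row per code.
import json

import numpy as np

# Option 1: look up a code directly in the JSON mapping (assumed structure).
with open("code2embeddings.json") as f:
    code2emb = json.load(f)
e119_embedding = np.asarray(code2emb["E11.9"])
print(e119_embedding.shape)

# Option 2: load the array placed at MedTok/embedding.npy for the EHR/QA tasks.
all_embeddings = np.load("MedTok/embedding.npy")
print(all_embeddings.shape)  # (num_codes, embedding_dim) under the assumption above
```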
### 🏥 MedTok for EHR & MedicalQA

Please see our GitHub repository: [MedTok](https://github.com/mims-harvard/MedTok).

### Note

MedTok tokenizer v1.0 currently supports only the medical codes used in our paper. For unseen codes, the output is the '' (empty) token. We will continue to update MedTok so that it covers more coding systems and tokenizes medical codes dynamically.

## Citation

```bibtex
@article{su2025multimodal,
  title={Multimodal Medical Code Tokenizer},
  author={Su, Xiaorui and Messica, Shvat and Huang, Yepeng and Johnson, Ruth and Fesser, Lukas and Gao, Shanghua and Sahneh, Faryad and Zitnik, Marinka},
  journal={International Conference on Machine Learning, ICML},
  year={2025}
}
```

## Contact

Thank you for your support! If you have any questions or suggestions, please email [Xiaorui Su](mailto:xiaorui_su@hms.harvard.edu) and [Marinka Zitnik](mailto:marinka@hms.harvard.edu).