DPLM-2 Bit 650M
DPLM-2 Bit is a 650M-parameter multimodal diffusion protein language model for joint protein sequence and structure modeling. It is the bitwise structure-token variant of DPLM-2 introduced in DPLM-2.1, which replaces index-based discrete structure-token prediction with bit-level prediction to improve structure modeling.
For the official implementation, installation instructions, generation scripts, training configuration, and evaluation utilities, see the bytedance/dplm repository.
Model Details
- Model type: Multimodal discrete diffusion protein language model with bitwise structure-token prediction
- Checkpoint: airkingbd/dplm2_bit_650m
- Architecture: ESM-style transformer for DPLM-2 Bit (EsmForDPLM2Bit)
- Scale: 650M parameters, 33 transformer layers, hidden size 1280, 20 attention heads
- Amino-acid vocabulary size: 33
- Structure codebook: 8,192 structure codes represented by 13-bit latent structure features
- Base initialization: DPLM-2 Bit training is initialized from the pretrained DPLM sequence model airkingbd/dplm_650m
- Structure tokenizer: Uses airkingbd/struct_tokenizer
- License: Apache-2.0
- Papers: DPLM-2 and DPLM-2.1
Bitwise Modeling
The original DPLM-2 models protein structures with discrete structure token indices produced by a structure tokenizer. In the DPLM-2.1 analysis, the authors identify index-based structure token prediction as a bottleneck: small changes in the underlying quantized bits can produce a very different token index, making the index classification target hard for the language model to learn.
DPLM-2 Bit uses the LFQ structure tokenizer's bit-level representation directly. Instead of predicting one 8,192-way structure-token index per residue, it predicts each of the 13 bits of the quantized structure feature as a binary target. This turns structure prediction into 13 binary classifications per residue, provides finer-grained supervision, and reduces the difficulty of learning structural patterns from tokenized 3D structures.
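As a concrete illustration of how the two prediction targets relate, the sketch below converts a structure-token index to and from its 13 bits (2^13 = 8,192), assuming a plain little-endian binary encoding; the exact bit ordering used by the LFQ tokenizer is an implementation detail and may differ.

```python
import torch

NUM_BITS = 13  # 2**13 = 8192 structure codes

def index_to_bits(index: torch.Tensor) -> torch.Tensor:
    """Expand token indices (...,) into 13 binary targets (..., 13)."""
    shifts = torch.arange(NUM_BITS, device=index.device)
    return (index.unsqueeze(-1) >> shifts) & 1

def bits_to_index(bits: torch.Tensor) -> torch.Tensor:
    """Collapse per-residue bit predictions (..., 13) back into token indices."""
    shifts = torch.arange(NUM_BITS, device=bits.device)
    return (bits.long() << shifts).sum(dim=-1)

# A single flipped bit changes the index by a power of two, but the
# index-classification view treats the result as an entirely different class.
idx = torch.tensor([4097])
print(index_to_bits(idx).tolist())        # [[1, 0, ..., 0, 1]]
print(bits_to_index(index_to_bits(idx)))  # tensor([4097])
```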
Quick Start
Install the official DPLM codebase and dependencies:
git clone --recursive https://github.com/bytedance/dplm.git
cd dplm
conda create -n dplm python=3.9 pip
conda activate dplm
bash scripts/install.sh
Load the pretrained DPLM-2 Bit checkpoint:
from byprot.models.dplm2 import DPLM2Bit
dplm2_bit = DPLM2Bit.from_pretrained("airkingbd/dplm2_bit_650m").cuda()
dplm2_bit = dplm2_bit.eval()
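As a quick sanity check after loading, the snippet below counts parameters; it assumes DPLM2Bit behaves like a standard torch.nn.Module, which the byprot models build on.

```python
# Rough parameter count; should be on the order of 650M for this checkpoint.
num_params = sum(p.numel() for p in dplm2_bit.parameters())
print(f"{num_params / 1e6:.0f}M parameters")
```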
Sequence-Structure Co-Generation
Use generate_dplm2.py with --bit_model. The official repository uses annealing@1.1:0.1 as the sampling strategy for the released DPLM-2 Bit co-generation example (a sketch of one reading of this setting follows the script):
model_name=dplm2_bit_650m
sampling_strategy=annealing@1.1:0.1
output_dir=generation-results/${model_name}
python generate_dplm2.py \
--model_name airkingbd/${model_name} \
--task co_generation \
--bit_model \
--sampling_strategy ${sampling_strategy} \
--num_seqs 50 \
--max_iter 500 \
--seq_lens 100 200 300 400 500 \
--saveto ${output_dir}
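The annealing@1.1:0.1 string is parsed by the generation script. One plausible reading, stated here as an assumption rather than a description of the official implementation, is a sampling temperature that decays linearly from 1.1 to 0.1 over the unmasking iterations:

```python
def annealed_temperature(step: int, max_step: int, start: float = 1.1, end: float = 0.1) -> float:
    """Linearly decay the sampling temperature from `start` to `end`.

    Assumed interpretation of annealing@1.1:0.1: early iterations sample
    more diversely, later iterations become close to greedy.
    """
    frac = step / max(max_step - 1, 1)
    return start + (end - start) * frac

# Example: temperature at the first, middle, and last of 500 iterations.
print([round(annealed_temperature(s, 500), 3) for s in (0, 249, 499)])
```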
Forward Folding
DPLM-2 Bit can generate structures conditioned on amino-acid sequences:
model_name=dplm2_bit_650m
output_dir=generation-results/${model_name}
python generate_dplm2.py \
--model_name airkingbd/${model_name} \
--task folding \
--bit_model \
--input_fasta_path data-bin/cameo2022/aatype.fasta \
--max_iter 100 \
--unmasking_strategy deterministic \
--sampling_strategy argmax \
--saveto ${output_dir}
Inverse Folding
DPLM-2 Bit can predict amino-acid sequences conditioned on tokenized protein structures:
model_name=dplm2_bit_650m
output_dir=generation-results/${model_name}
python generate_dplm2.py \
--model_name airkingbd/${model_name} \
--task inverse_folding \
--bit_model \
--input_fasta_path data-bin/cameo2022/struct.fasta \
--max_iter 100 \
--unmasking_strategy deterministic \
--sampling_strategy argmax \
--saveto ${output_dir}
For custom structures, first tokenize PDB files with the released structure tokenizer:
python src/byprot/utils/protein/tokenize_pdb.py \
--input_pdb_folder /path/to/input/pdbs \
--output_dir /path/to/output/tokenized_protein
Then pass the generated structure-token FASTA file to generate_dplm2.py.
Training Data and Training Procedure
DPLM-2 Bit uses the same PDB and SwissProt-derived structure data as DPLM-2. The authors provide the preprocessed training dataset on Hugging Face as airkingbd/pdb_swissprot.
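One way to fetch the preprocessed data locally is the huggingface_hub client, as in the sketch below; the repository's own data-preparation scripts may handle this download differently.

```python
from huggingface_hub import snapshot_download

# Download the preprocessed PDB + SwissProt structure data used for DPLM-2 training.
data_dir = snapshot_download(repo_id="airkingbd/pdb_swissprot", repo_type="dataset")
print(data_dir)
```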
The official DPLM repository provides the DPLM-2 Bit experiment configuration at configs/experiment/dplm2/dplm2_bit_650m.yaml. The configuration initializes from airkingbd/dplm_650m, uses airkingbd/dplm2_650m as the tokenizer vocabulary source, and uses airkingbd/struct_tokenizer for structure tokenization.
Experimental Results
The tables below summarize selected results reported in the DPLM-2.1 paper. Lower RMSD is better; higher TM-score, AAR, accuracy, and diversity are better.
Forward Folding
| Model | CAMEO 2022 RMSD | CAMEO 2022 TM-score | PDB Date RMSD | PDB Date TM-score |
|---|---|---|---|---|
| DPLM-2 650M | 7.7025 | 0.7936 | 5.3071 | 0.8306 |
| DPLM-2 Bit 650M | 6.4028 | 0.8380 | 3.2213 | 0.9043 |
Structure-Token Prediction Accuracy
| Model | Test Set | Index Acc. | Bit Acc. | RMSD | TM-score |
|---|---|---|---|---|---|
| DPLM-2 650M | CAMEO 2022 | 0.0864 | 0.7720 | 7.7025 | 0.7936 |
| DPLM-2 650M | PDB Date | 0.1188 | 0.7932 | 5.3071 | 0.8306 |
| DPLM-2 Bit 650M | CAMEO 2022 | 0.1258 | 0.7958 | 6.4028 | 0.8380 |
| DPLM-2 Bit 650M | PDB Date | 0.2641 | 0.8648 | 3.2213 | 0.9043 |
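The gap between index accuracy and bit accuracy can be reproduced on toy data: an index prediction counts as correct only when all 13 bits match, while bit accuracy credits every correctly predicted bit. The sketch below is illustrative only and is not the paper's evaluation code.

```python
import torch

def index_accuracy(pred_idx: torch.Tensor, ref_idx: torch.Tensor) -> float:
    """Fraction of residues whose full 13-bit token index is exactly correct."""
    return (pred_idx == ref_idx).float().mean().item()

def bit_accuracy(pred_idx: torch.Tensor, ref_idx: torch.Tensor, num_bits: int = 13) -> float:
    """Fraction of individual bits predicted correctly, averaged over residues."""
    shifts = torch.arange(num_bits)
    pred_bits = (pred_idx.unsqueeze(-1) >> shifts) & 1
    ref_bits = (ref_idx.unsqueeze(-1) >> shifts) & 1
    return (pred_bits == ref_bits).float().mean().item()

# A prediction that flips a single bit is wrong under index accuracy
# but 12/13 correct under bit accuracy.
ref = torch.tensor([4097, 4097])
pred = torch.tensor([4097, 4096])  # second residue: one bit flipped
print(index_accuracy(pred, ref), bit_accuracy(pred, ref))  # 0.5, ~0.96
```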
Inverse Folding
| Model | CAMEO 2022 AAR | CAMEO 2022 TM-score |
|---|---|---|
| DPLM-2 650M | 0.4962 | 0.8816 |
| DPLM-2 3B | 0.5236 | 0.8900 |
| DPLM-2 Bit 650M | 0.5586 | 0.8907 |
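Amino-acid recovery (AAR) is the fraction of positions at which the designed sequence matches the native one. A minimal illustrative implementation (the official evaluation lives in the repository's analysis utilities):

```python
def amino_acid_recovery(pred_seq: str, native_seq: str) -> float:
    """Per-position sequence identity between a designed and a native sequence."""
    assert len(pred_seq) == len(native_seq), "sequences must be aligned / same length"
    matches = sum(p == n for p, n in zip(pred_seq, native_seq))
    return matches / len(native_seq)

print(amino_acid_recovery("MKTAYIAK", "MKTAYLAR"))  # 0.75
```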
Representation Learning
| Model | Human PPI Accuracy (%) | DeepLoc Subcellular Accuracy (%) |
|---|---|---|
| SaProt | 86.41 | 85.57 |
| DPLM-2 650M | 84.44 | 82.98 |
| DPLM-2 Bit 650M | 88.89 | 83.39 |
Unconditional Generation Diversity
| Model | Diversity |
|---|---|
| DPLM-2 650M | 0.700 |
| DPLM-2 Bit 650M | 0.825 |
For full experimental settings, additional variants such as FM, ResDiff, Geo, REPA, and SFT, and complete ablations, see the DPLM-2.1 paper.
Citation
If you use this checkpoint, please cite the DPLM, DPLM-2, and DPLM-2.1 papers:
@inproceedings{wang2024dplm,
title={Diffusion Language Models Are Versatile Protein Learners},
author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
booktitle={International Conference on Machine Learning},
year={2024}
}
@inproceedings{wang2025dplm2,
title={DPLM-2: A Multimodal Diffusion Protein Language Model},
author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
booktitle={International Conference on Learning Representations},
year={2025}
}
@inproceedings{hsieh2025dplm2_1,
title={Elucidating the Design Space of Multimodal Protein Language Models},
author={Hsieh, Cheng-Yen and Wang, Xinyou and Zhang, Daiheng and Xue, Dongyu and Ye, Fei and Huang, Shujian and Zheng, Zaixiang and Gu, Quanquan},
booktitle={International Conference on Machine Learning},
year={2025}
}
Acknowledgements
DPLM builds on and acknowledges prior work and resources including ByProt, EvoDiff, SaProt, ESM, LM-Design, EigenFold, MultiFlow, FrameFlow, and OpenFold-related structure modeling utilities. See the official repository for the complete acknowledgements and implementation details.