DPLM-2 Bit 650M

DPLM-2 Bit is a 650M-parameter multimodal diffusion protein language model for joint protein sequence and structure modeling. It is the bitwise structure-token modeling variant of DPLM-2 introduced in DPLM-2.1, which improves structure modeling over index-based discrete structure-token prediction.

For the official implementation, installation instructions, generation scripts, training configuration, and evaluation utilities, see the bytedance/dplm repository.

Model Details

  • Model type: Multimodal discrete diffusion protein language model with bitwise structure-token prediction
  • Checkpoint: airkingbd/dplm2_bit_650m
  • Architecture: ESM-style transformer for DPLM-2 Bit (EsmForDPLM2Bit)
  • Scale: 650M parameters, 33 transformer layers, hidden size 1280, 20 attention heads
  • Amino-acid vocabulary size: 33
  • Structure codebook: 8,192 structure codes represented by 13-bit latent structure features
  • Base initialization: DPLM-2 Bit training is initialized from the pretrained DPLM sequence model airkingbd/dplm_650m
  • Structure tokenizer: Uses airkingbd/struct_tokenizer
  • License: Apache-2.0
  • Papers: DPLM-2 and DPLM-2.1

Bitwise Modeling

The original DPLM-2 models protein structures with discrete structure token indices produced by a structure tokenizer. In the DPLM-2.1 analysis, the authors identify index-based structure token prediction as a bottleneck: small changes in the underlying quantized bits can produce a very different token index, making the index classification target hard for the language model to learn.

DPLM-2 Bit uses the LFQ structure tokenizer's bit-level representation directly. Instead of predicting one 8,192-way structure-token index per residue, it predicts each of the 13 bits of the quantized structure feature as a binary target. This turns structure prediction into 13 binary classifications per residue, provides finer-grained supervision, and reduces the difficulty of learning structural patterns from tokenized 3D structures.
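
Concretely, an 8,192-way structure code corresponds to 13 binary digits (2^13 = 8,192). The sketch below illustrates the conversion between token indices and per-bit targets and why index classification is brittle; it is not the repository's implementation, and the exact bit ordering used by the LFQ tokenizer is an assumption here.

import torch

NUM_BITS = 13  # 2**13 = 8,192 structure codes

def index_to_bits(index: torch.Tensor) -> torch.Tensor:
    # Decompose structure-token indices into 13 binary targets per residue.
    shifts = torch.arange(NUM_BITS, device=index.device)
    return (index.unsqueeze(-1) >> shifts) & 1  # (..., 13), values in {0, 1}

def bits_to_index(bits: torch.Tensor) -> torch.Tensor:
    # Recompose per-residue bits into a structure-token index.
    shifts = torch.arange(NUM_BITS, device=bits.device)
    return (bits.long() << shifts).sum(dim=-1)

# Flipping a single bit can move the index far away, which is what makes
# 8,192-way index classification a harder target than 13 binary ones.
bits = index_to_bits(torch.tensor([4096]))  # only bit 12 is set
bits[..., 12] = 0                           # flip that one bit
print(bits_to_index(bits))                  # tensor([0])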

Quick Start

Install the official DPLM codebase and dependencies:

git clone --recursive https://github.com/bytedance/dplm.git
cd dplm

conda create -n dplm python=3.9 pip
conda activate dplm
bash scripts/install.sh

Load the pretrained DPLM-2 Bit checkpoint:

from byprot.models.dplm2 import DPLM2Bit

dplm2_bit = DPLM2Bit.from_pretrained("airkingbd/dplm2_bit_650m").cuda()
dplm2_bit = dplm2_bit.eval()
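
As a quick sanity check on the loaded checkpoint (assuming the returned object behaves like a standard PyTorch module, as the .cuda()/.eval() calls above suggest):

num_params = sum(p.numel() for p in dplm2_bit.parameters())
print(f"{num_params / 1e6:.0f}M parameters")  # expected to be roughly 650M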

Sequence-Structure Co-Generation

Use generate_dplm2.py with the --bit_model flag. The official repository's co-generation example for DPLM-2 Bit uses the annealing@1.1:0.1 sampling strategy:

model_name=dplm2_bit_650m
sampling_strategy=annealing@1.1:0.1
output_dir=generation-results/${model_name}

python generate_dplm2.py \
    --model_name airkingbd/${model_name} \
    --task co_generation \
    --bit_model \
    --sampling_strategy ${sampling_strategy} \
    --num_seqs 50 \
    --max_iter 500 \
    --seq_lens 100 200 300 400 500 \
    --saveto ${output_dir}
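
The annealing@1.1:0.1 strategy anneals the sampling temperature from 1.1 down to 0.1 over the iterative decoding steps. A minimal sketch of such a schedule, assuming linear interpolation (the repository's exact schedule may differ):

def annealed_temperature(step: int, max_iter: int,
                         t_start: float = 1.1, t_end: float = 0.1) -> float:
    # Temperature for decoding step `step`, annealed from t_start to t_end.
    frac = step / max(max_iter - 1, 1)
    return t_start + (t_end - t_start) * frac

# With --max_iter 500: step 0 -> 1.10, step 499 -> 0.10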

Forward Folding

DPLM-2 Bit can generate structures conditioned on amino-acid sequences:

model_name=dplm2_bit_650m
output_dir=generation-results/${model_name}

python generate_dplm2.py \
    --model_name airkingbd/${model_name} \
    --task folding \
    --bit_model \
    --input_fasta_path data-bin/cameo2022/aatype.fasta \
    --max_iter 100 \
    --unmasking_strategy deterministic \
    --sampling_strategy argmax \
    --saveto ${output_dir}
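
The folding task reads amino-acid sequences from a FASTA file. For custom sequences, a minimal sketch of writing such an input (the path and sequence below are placeholders):

# Hypothetical example: write a FASTA of amino-acid sequences for --input_fasta_path.
records = {"my_protein": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"}
with open("custom_aatype.fasta", "w") as f:
    for name, seq in records.items():
        f.write(f">{name}\n{seq}\n")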

Inverse Folding

DPLM-2 Bit can predict amino-acid sequences conditioned on tokenized protein structures:

model_name=dplm2_bit_650m
output_dir=generation-results/${model_name}

python generate_dplm2.py \
    --model_name airkingbd/${model_name} \
    --task inverse_folding \
    --bit_model \
    --input_fasta_path data-bin/cameo2022/struct.fasta \
    --max_iter 100 \
    --unmasking_strategy deterministic \
    --sampling_strategy argmax \
    --saveto ${output_dir}

For custom structures, first tokenize PDB files with the released structure tokenizer:

python src/byprot/utils/protein/tokenize_pdb.py \
    --input_pdb_folder /path/to/input/pdbs \
    --output_dir /path/to/output/tokenized_protein

Then pass the generated structure-token FASTA file to generate_dplm2.py.
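
For many structures, the two steps can be chained from Python; a rough sketch using subprocess, with placeholder paths (the name of the structure-token FASTA written by the tokenizer depends on its output layout):

import subprocess

# Hypothetical pipeline: tokenize custom PDBs, then run inverse folding.
pdb_dir = "/path/to/input/pdbs"
tok_dir = "/path/to/output/tokenized_protein"

subprocess.run([
    "python", "src/byprot/utils/protein/tokenize_pdb.py",
    "--input_pdb_folder", pdb_dir,
    "--output_dir", tok_dir,
], check=True)

# Pass the structure-token FASTA written to tok_dir to generate_dplm2.py
# with --task inverse_folding --bit_model, as in the command above.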

Training Data and Training Procedure

DPLM-2 Bit uses the same PDB and SwissProt-derived structure data as DPLM-2. The authors provide the preprocessed training dataset on Hugging Face as airkingbd/pdb_swissprot.

The official DPLM repository provides the DPLM-2 Bit experiment configuration at configs/experiment/dplm2/dplm2_bit_650m.yaml. The configuration initializes from airkingbd/dplm_650m, uses airkingbd/dplm2_650m as the tokenizer vocabulary source, and uses airkingbd/struct_tokenizer for structure tokenization.
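
The preprocessed data can be fetched with huggingface_hub; a minimal sketch (the local cache location is determined by huggingface_hub unless overridden):

from huggingface_hub import snapshot_download

# Download the preprocessed PDB + SwissProt training data used by DPLM-2 / DPLM-2 Bit.
local_dir = snapshot_download(repo_id="airkingbd/pdb_swissprot", repo_type="dataset")
print(local_dir)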

Experimental Results

The tables below summarize selected results reported in the DPLM-2.1 paper. Lower RMSD is better; higher TM-score, AAR, accuracy, and diversity are better.

Forward Folding

| Model | CAMEO 2022 RMSD | CAMEO 2022 TM-score | PDB Date RMSD | PDB Date TM-score |
|---|---|---|---|---|
| DPLM-2 650M | 7.7025 | 0.7936 | 5.3071 | 0.8306 |
| DPLM-2 Bit 650M | 6.4028 | 0.8380 | 3.2213 | 0.9043 |

Structure-Token Prediction Accuracy

| Model | Test Set | Index Acc. | Bit Acc. | RMSD | TM-score |
|---|---|---|---|---|---|
| DPLM-2 650M | CAMEO 2022 | 0.0864 | 0.7720 | 7.7025 | 0.7936 |
| DPLM-2 650M | PDB Date | 0.1188 | 0.7932 | 5.3071 | 0.8306 |
| DPLM-2 Bit 650M | CAMEO 2022 | 0.1258 | 0.7958 | 6.4028 | 0.8380 |
| DPLM-2 Bit 650M | PDB Date | 0.2641 | 0.8648 | 3.2213 | 0.9043 |
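
Reading the two accuracy columns: index accuracy counts a residue as correct only when the full structure-token index is recovered (all 13 bits agree), while bit accuracy averages agreement over individual bits. A minimal sketch under that reading:

import torch

def index_and_bit_accuracy(pred_bits: torch.Tensor, true_bits: torch.Tensor):
    # pred_bits, true_bits: (num_residues, 13) tensors with values in {0, 1}.
    match = pred_bits == true_bits
    index_acc = match.all(dim=-1).float().mean().item()  # whole token correct
    bit_acc = match.float().mean().item()                 # per-bit agreement
    return index_acc, bit_acc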

Inverse Folding

| Model | CAMEO 2022 AAR | CAMEO 2022 TM-score |
|---|---|---|
| DPLM-2 650M | 0.4962 | 0.8816 |
| DPLM-2 3B | 0.5236 | 0.8900 |
| DPLM-2 Bit 650M | 0.5586 | 0.8907 |
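
AAR denotes amino-acid recovery: the fraction of residues at which the predicted sequence matches the native sequence. A minimal sketch of the metric under that definition:

def amino_acid_recovery(pred: str, native: str) -> float:
    # Fraction of positions where the predicted and native amino acids agree.
    assert len(pred) == len(native)
    return sum(p == n for p, n in zip(pred, native)) / len(native)

print(amino_acid_recovery("MKTAYIA", "MKTGYIA"))  # 6/7 ≈ 0.857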

Representation Learning

| Model | Human PPI Accuracy (%) | DeepLoc Subcellular Accuracy (%) |
|---|---|---|
| SaProt | 86.41 | 85.57 |
| DPLM-2 650M | 84.44 | 82.98 |
| DPLM-2 Bit 650M | 88.89 | 83.39 |

Unconditional Generation Diversity

| Model | Diversity |
|---|---|
| DPLM-2 650M | 0.700 |
| DPLM-2 Bit 650M | 0.825 |

For full experimental settings, additional variants such as FM, ResDiff, Geo, REPA, and SFT, and complete ablations, see the DPLM-2.1 paper.

Citation

If you use this checkpoint, please cite the DPLM, DPLM-2, and DPLM-2.1 papers:

@inproceedings{wang2024dplm,
  title={Diffusion Language Models Are Versatile Protein Learners},
  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle={International Conference on Machine Learning},
  year={2024}
}

@inproceedings{wang2025dplm2,
  title={DPLM-2: A Multimodal Diffusion Protein Language Model},
  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle={International Conference on Learning Representations},
  year={2025}
}

@inproceedings{hsieh2025dplm2_1,
  title={Elucidating the Design Space of Multimodal Protein Language Models},
  author={Hsieh, Cheng-Yen and Wang, Xinyou and Zhang, Daiheng and Xue, Dongyu and Ye, Fei and Huang, Shujian and Zheng, Zaixiang and Gu, Quanquan},
  booktitle={International Conference on Machine Learning},
  year={2025}
}

Acknowledgements

DPLM builds on and acknowledges prior work and resources including ByProt, EvoDiff, SaProt, ESM, LM-Design, EigenFold, MultiFlow, FrameFlow, and OpenFold-related structure modeling utilities. See the official repository for the complete acknowledgements and implementation details.
