DPLM-2 Bit 650M

DPLM-2 Bit is a 650M-parameter multimodal diffusion protein language model for joint protein sequence and structure modeling. It is the bitwise structure-token modeling variant of DPLM-2 introduced in DPLM-2.1, which improves structure modeling over index-based discrete structure-token prediction.

For the official implementation, installation instructions, generation scripts, training configuration, and evaluation utilities, see the bytedance/dplm repository.

Model Details

  • Model type: Multimodal discrete diffusion protein language model with bitwise structure-token prediction
  • Checkpoint: airkingbd/dplm2_bit_650m
  • Architecture: ESM-style transformer for DPLM-2 Bit (EsmForDPLM2Bit)
  • Scale: 650M parameters, 33 transformer layers, hidden size 1280, 20 attention heads
  • Amino-acid vocabulary size: 33
  • Structure codebook: 8,192 structure codes represented by 13-bit latent structure features
  • Base initialization: DPLM-2 Bit training is initialized from the pretrained DPLM sequence model airkingbd/dplm_650m
  • Structure tokenizer: Uses airkingbd/struct_tokenizer
  • License: Apache-2.0
  • Papers: DPLM-2 and DPLM-2.1

Bitwise Modeling

The original DPLM-2 models protein structures with discrete structure token indices produced by a structure tokenizer. In the DPLM-2.1 analysis, the authors identify index-based structure token prediction as a bottleneck: small changes in the underlying quantized bits can produce a very different token index, making the index classification target hard for the language model to learn.

DPLM-2 Bit uses the LFQ structure tokenizer's bit-level representation directly. Instead of predicting one 8,192-way structure-token index per residue, it predicts each of the 13 bits of the quantized structure feature as a binary target. This turns structure prediction into 13 binary classifications per residue, provides finer-grained supervision, and reduces the difficulty of learning structural patterns from tokenized 3D structures.
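
Concretely, an 8,192-way structure code corresponds to 13 binary digits (2^13 = 8,192). The sketch below illustrates the conversion between token indices and per-bit targets and why index classification is brittle; it is not the repository's implementation, and the exact bit ordering used by the LFQ tokenizer is an assumption here.

import torch

NUM_BITS = 13  # 2**13 = 8,192 structure codes

def index_to_bits(index: torch.Tensor) -> torch.Tensor:
    # Decompose structure-token indices into 13 binary targets per residue.
    shifts = torch.arange(NUM_BITS, device=index.device)
    return (index.unsqueeze(-1) >> shifts) & 1  # (..., 13), values in {0, 1}

def bits_to_index(bits: torch.Tensor) -> torch.Tensor:
    # Recompose per-residue bits into a structure-token index.
    shifts = torch.arange(NUM_BITS, device=bits.device)
    return (bits.long() << shifts).sum(dim=-1)

# Flipping a single bit can move the index far away, which is what makes
# 8,192-way index classification a harder target than 13 binary ones.
bits = index_to_bits(torch.tensor([4096]))  # only bit 12 is set
bits[..., 12] = 0                           # flip that one bit
print(bits_to_index(bits))                  # tensor([0])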

Quick Start

Install the official DPLM codebase and dependencies:

git clone --recursive https://github.com/bytedance/dplm.git
cd dplm

conda create -n dplm python=3.9 pip
conda activate dplm
bash scripts/install.sh

Load the pretrained DPLM-2 Bit checkpoint:

from byprot.models.dplm2 import DPLM2Bit

dplm2_bit = DPLM2Bit.from_pretrained("airkingbd/dplm2_bit_650m").cuda()
dplm2_bit = dplm2_bit.eval()
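
As a quick sanity check on the loaded checkpoint (assuming the returned object behaves like a standard PyTorch module, as the .cuda()/.eval() calls above suggest):

num_params = sum(p.numel() for p in dplm2_bit.parameters())
print(f"{num_params / 1e6:.0f}M parameters")  # expected to be roughly 650M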

Sequence-Structure Co-Generation

Use generate_dplm2.py with the --bit_model flag. The official repository's co-generation example for DPLM-2 Bit uses the annealing@1.1:0.1 sampling strategy:

model_name=dplm2_bit_650m
sampling_strategy=annealing@1.1:0.1
output_dir=generation-results/${model_name}

python generate_dplm2.py \
    --model_name airkingbd/${model_name} \
    --task co_generation \
    --bit_model \
    --sampling_strategy ${sampling_strategy} \
    --num_seqs 50 \
    --max_iter 500 \
    --seq_lens 100 200 300 400 500 \
    --saveto ${output_dir}
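
The annealing@1.1:0.1 strategy anneals the sampling temperature from 1.1 down to 0.1 over the iterative decoding steps. A minimal sketch of such a schedule, assuming linear interpolation (the repository's exact schedule may differ):

def annealed_temperature(step: int, max_iter: int,
                         t_start: float = 1.1, t_end: float = 0.1) -> float:
    # Temperature for decoding step `step`, annealed from t_start to t_end.
    frac = step / max(max_iter - 1, 1)
    return t_start + (t_end - t_start) * frac

# With --max_iter 500: step 0 -> 1.10, step 499 -> 0.10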

Forward Folding

DPLM-2 Bit can generate structures conditioned on amino-acid sequences:

model_name=dplm2_bit_650m
output_dir=generation-results/${model_name}

python generate_dplm2.py \
    --model_name airkingbd/${model_name} \
    --task folding \
    --bit_model \
    --input_fasta_path data-bin/cameo2022/aatype.fasta \
    --max_iter 100 \
    --unmasking_strategy deterministic \
    --sampling_strategy argmax \
    --saveto ${output_dir}
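
The folding task reads amino-acid sequences from a FASTA file. For custom sequences, a minimal sketch of writing such an input (the path and sequence below are placeholders):

# Hypothetical example: write a FASTA of amino-acid sequences for --input_fasta_path.
records = {"my_protein": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"}
with open("custom_aatype.fasta", "w") as f:
    for name, seq in records.items():
        f.write(f">{name}\n{seq}\n")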

Inverse Folding

DPLM-2 Bit can predict amino-acid sequences conditioned on tokenized protein structures:

model_name=dplm2_bit_650m
output_dir=generation-results/${model_name}

python generate_dplm2.py \
    --model_name airkingbd/${model_name} \
    --task inverse_folding \
    --bit_model \
    --input_fasta_path data-bin/cameo2022/struct.fasta \
    --max_iter 100 \
    --unmasking_strategy deterministic \
    --sampling_strategy argmax \
    --saveto ${output_dir}

For custom structures, first tokenize PDB files with the released structure tokenizer:

python src/byprot/utils/protein/tokenize_pdb.py \
    --input_pdb_folder /path/to/input/pdbs \
    --output_dir /path/to/output/tokenized_protein

Then pass the generated structure-token FASTA file to generate_dplm2.py.
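
For many structures, the two steps can be chained from Python; a rough sketch using subprocess, with placeholder paths (the name of the structure-token FASTA written by the tokenizer depends on its output layout):

import subprocess

# Hypothetical pipeline: tokenize custom PDBs, then run inverse folding.
pdb_dir = "/path/to/input/pdbs"
tok_dir = "/path/to/output/tokenized_protein"

subprocess.run([
    "python", "src/byprot/utils/protein/tokenize_pdb.py",
    "--input_pdb_folder", pdb_dir,
    "--output_dir", tok_dir,
], check=True)

# Pass the structure-token FASTA written to tok_dir to generate_dplm2.py
# with --task inverse_folding --bit_model, as in the command above.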

Training Data and Training Procedure

DPLM-2 Bit uses the same PDB and SwissProt-derived structure data as DPLM-2. The authors provide the preprocessed training dataset on Hugging Face as airkingbd/pdb_swissprot.

The official DPLM repository provides the DPLM-2 Bit experiment configuration at configs/experiment/dplm2/dplm2_bit_650m.yaml. The configuration initializes from airkingbd/dplm_650m, uses airkingbd/dplm2_650m as the tokenizer vocabulary source, and uses airkingbd/struct_tokenizer for structure tokenization.
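
The preprocessed data can be fetched with huggingface_hub; a minimal sketch (the local cache location is determined by huggingface_hub unless overridden):

from huggingface_hub import snapshot_download

# Download the preprocessed PDB + SwissProt training data used by DPLM-2 / DPLM-2 Bit.
local_dir = snapshot_download(repo_id="airkingbd/pdb_swissprot", repo_type="dataset")
print(local_dir)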

Experimental Results

The tables below summarize selected results reported in the DPLM-2.1 paper. Lower RMSD is better; higher TM-score, AAR, accuracy, and diversity are better.

Forward Folding

| Model | CAMEO 2022 RMSD | CAMEO 2022 TM-score | PDB Date RMSD | PDB Date TM-score |
|---|---|---|---|---|
| DPLM-2 650M | 7.7025 | 0.7936 | 5.3071 | 0.8306 |
| DPLM-2 Bit 650M | 6.4028 | 0.8380 | 3.2213 | 0.9043 |

Structure-Token Prediction Accuracy

| Model | Test Set | Index Acc. | Bit Acc. | RMSD | TM-score |
|---|---|---|---|---|---|
| DPLM-2 650M | CAMEO 2022 | 0.0864 | 0.7720 | 7.7025 | 0.7936 |
| DPLM-2 650M | PDB Date | 0.1188 | 0.7932 | 5.3071 | 0.8306 |
| DPLM-2 Bit 650M | CAMEO 2022 | 0.1258 | 0.7958 | 6.4028 | 0.8380 |
| DPLM-2 Bit 650M | PDB Date | 0.2641 | 0.8648 | 3.2213 | 0.9043 |
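
Reading the two accuracy columns: index accuracy counts a residue as correct only when the full structure-token index is recovered (all 13 bits agree), while bit accuracy averages agreement over individual bits. A minimal sketch under that reading:

import torch

def index_and_bit_accuracy(pred_bits: torch.Tensor, true_bits: torch.Tensor):
    # pred_bits, true_bits: (num_residues, 13) tensors with values in {0, 1}.
    match = pred_bits == true_bits
    index_acc = match.all(dim=-1).float().mean().item()  # whole token correct
    bit_acc = match.float().mean().item()                 # per-bit agreement
    return index_acc, bit_acc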

Inverse Folding

| Model | CAMEO 2022 AAR | CAMEO 2022 TM-score |
|---|---|---|
| DPLM-2 650M | 0.4962 | 0.8816 |
| DPLM-2 3B | 0.5236 | 0.8900 |
| DPLM-2 Bit 650M | 0.5586 | 0.8907 |
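
AAR denotes amino-acid recovery: the fraction of residues at which the predicted sequence matches the native sequence. A minimal sketch of the metric under that definition:

def amino_acid_recovery(pred: str, native: str) -> float:
    # Fraction of positions where the predicted and native amino acids agree.
    assert len(pred) == len(native)
    return sum(p == n for p, n in zip(pred, native)) / len(native)

print(amino_acid_recovery("MKTAYIA", "MKTGYIA"))  # 6/7 ≈ 0.857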

Representation Learning

| Model | Human PPI Accuracy (%) | DeepLoc Subcellular Accuracy (%) |
|---|---|---|
| SaProt | 86.41 | 85.57 |
| DPLM-2 650M | 84.44 | 82.98 |
| DPLM-2 Bit 650M | 88.89 | 83.39 |

Unconditional Generation Diversity

| Model | Diversity |
|---|---|
| DPLM-2 650M | 0.700 |
| DPLM-2 Bit 650M | 0.825 |

For full experimental settings, additional variants such as FM, ResDiff, Geo, REPA, and SFT, and complete ablations, see the DPLM-2.1 paper.

Citation

If you use this checkpoint, please cite the DPLM, DPLM-2, and DPLM-2.1 papers:

@inproceedings{wang2024dplm,
  title={Diffusion Language Models Are Versatile Protein Learners},
  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle={International Conference on Machine Learning},
  year={2024}
}

@inproceedings{wang2025dplm2,
  title={DPLM-2: A Multimodal Diffusion Protein Language Model},
  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle={International Conference on Learning Representations},
  year={2025}
}

@inproceedings{hsieh2025dplm2_1,
  title={Elucidating the Design Space of Multimodal Protein Language Models},
  author={Hsieh, Cheng-Yen and Wang, Xinyou and Zhang, Daiheng and Xue, Dongyu and Ye, Fei and Huang, Shujian and Zheng, Zaixiang and Gu, Quanquan},
  booktitle={International Conference on Machine Learning},
  year={2025}
}

Acknowledgements

DPLM builds on and acknowledges prior work and resources including ByProt, EvoDiff, SaProt, ESM, LM-Design, EigenFold, MultiFlow, FrameFlow, and OpenFold-related structure modeling utilities. See the official repository for the complete acknowledgements and implementation details.
