DPLM-2 3B

DPLM-2 is a multimodal diffusion protein language model for jointly modeling, understanding, and generating protein sequences and structures. It extends the discrete diffusion protein language model (DPLM) family from sequence-only modeling to joint sequence-structure modeling, enabling protein sequence-structure co-generation as well as conditional generation tasks such as folding, inverse folding, and motif scaffolding.

This repository contains the 3B-parameter DPLM-2 checkpoint. For the official implementation, installation instructions, generation scripts, training configuration, and evaluation utilities, see the bytedance/dplm repository.

Model Details

  • Model type: Multimodal discrete diffusion protein language model
  • Checkpoint: airkingbd/dplm2_3b
  • Architecture: ESM-style transformer for DPLM-2 (EsmForDPLM2)
  • Scale: 3B parameters, 36 transformer layers, hidden size 2560, 40 attention heads
  • Vocabulary: 8,229 tokens, covering amino-acid tokens, structure tokens, and special tokens
  • Base initialization: DPLM-2 training is initialized from the pretrained DPLM sequence model airkingbd/dplm_3b
  • Structure tokenizer: Uses the DPLM structure tokenizer (airkingbd/struct_tokenizer) for structure-token-based modeling and PDB reconstruction
  • License: Apache-2.0
  • Paper: DPLM-2: A Multimodal Diffusion Protein Language Model

Quick Start

Install the official DPLM codebase and dependencies:

git clone --recursive https://github.com/bytedance/dplm.git
cd dplm

conda create -n dplm python=3.9 pip
conda activate dplm
bash scripts/install.sh

Load the pretrained DPLM-2 checkpoint:

from byprot.models.dplm2 import MultimodalDiffusionProteinLanguageModel as DPLM2

dplm2 = DPLM2.from_pretrained("airkingbd/dplm2_3b").cuda()
dplm2 = dplm2.eval()
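
As a quick sanity check, the parameter count can be confirmed directly. This assumes the returned object is a standard torch.nn.Module, which the .cuda() and .eval() calls above suggest:

import torch  # already a dependency of the DPLM codebase

# Total parameter count; should be on the order of 3 billion
# for the dplm2_3b checkpoint.
n_params = sum(p.numel() for p in dplm2.parameters())
print(f"{n_params / 1e9:.2f}B parameters")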

Sequence-Structure Co-Generation

The official repository provides generate_dplm2.py for co-generation. The default DPLM-2 sampling strategy is annealing@2.0:0.1, which starts at a high sampling temperature to encourage diversity and anneals to a lower temperature to improve designability.
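
The strategy string has the form annealing@<start>:<end>. A minimal sketch of how such a schedule could map an iteration index to a temperature (linear interpolation is assumed here purely for illustration; the schedule in generate_dplm2.py may differ):

def annealed_temperature(step: int, max_iter: int,
                         start: float = 2.0, end: float = 0.1) -> float:
    # Illustrative linear schedule from `start` down to `end`;
    # the official implementation may use a different curve.
    frac = step / max(max_iter - 1, 1)
    return start + (end - start) * frac

# Temperature at the first, middle, and last of 500 iterations.
for step in (0, 250, 499):
    print(step, round(annealed_temperature(step, 500), 3))

The co-generation command: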

model_name=dplm2_3b
sampling_strategy=annealing@2.0:0.1
output_dir=generation-results/${model_name}

python generate_dplm2.py \
    --model_name airkingbd/${model_name} \
    --task co_generation \
    --sampling_strategy ${sampling_strategy} \
    --num_seqs 50 \
    --max_iter 500 \
    --seq_lens 100 200 300 400 500 \
    --saveto ${output_dir}

Generated sequences and structures are saved under generation-results/dplm2_3b/co_generation. The official repository also includes evaluation utilities for TM-score, RMSD, diversity, and related structure metrics.
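
A minimal sketch for collecting the generated sequences from Python, assuming the outputs include standard FASTA files in that directory (the exact file names and layout are determined by generate_dplm2.py):

from pathlib import Path

from Bio import SeqIO  # pip install biopython

out_dir = Path("generation-results/dplm2_3b/co_generation")

# Gather every record from any FASTA file in the output folder.
records = [
    record
    for fasta in out_dir.glob("*.fasta")
    for record in SeqIO.parse(fasta, "fasta")
]
print(f"collected {len(records)} generated sequences")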

Forward Folding

DPLM-2 can generate structures conditioned on input amino-acid sequences. The official scripts use deterministic argmax decoding for 100 diffusion iterations:

model_name=dplm2_3b
output_dir=generation-results/${model_name}

python generate_dplm2.py \
    --model_name airkingbd/${model_name} \
    --task folding \
    --input_fasta_path data-bin/cameo2022/aatype.fasta \
    --max_iter 100 \
    --unmasking_strategy deterministic \
    --sampling_strategy argmax \
    --saveto ${output_dir}

For custom sequences, provide a FASTA file via --input_fasta_path.
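
A plain FASTA file with one amino-acid sequence per record is sufficient; the file name and sequence below are placeholders:

# Write custom sequences in standard FASTA format.
sequences = {
    "my_protein_1": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
}
with open("my_sequences.fasta", "w") as f:
    for name, seq in sequences.items():
        f.write(f">{name}\n{seq}\n")

Then pass --input_fasta_path my_sequences.fasta to generate_dplm2.py.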

Inverse Folding

DPLM-2 can predict amino-acid sequences conditioned on tokenized protein structures:

model_name=dplm2_3b
output_dir=generation-results/${model_name}

python generate_dplm2.py \
    --model_name airkingbd/${model_name} \
    --task inverse_folding \
    --input_fasta_path data-bin/cameo2022/struct.fasta \
    --max_iter 100 \
    --unmasking_strategy deterministic \
    --sampling_strategy argmax \
    --saveto ${output_dir}

To use a custom structure, first tokenize PDB files with the structure tokenizer:

python src/byprot/utils/protein/tokenize_pdb.py \
    --input_pdb_folder /path/to/your/input/structure \
    --output_dir /path/to/your/input/structure/tokenized_protein

Then pass the generated struct.fasta to generate_dplm2.py.
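
The two steps can also be chained from Python. A sketch using subprocess, assuming the tokenizer writes a struct.fasta into its output directory as described above (all paths are placeholders):

import subprocess

pdb_dir = "/path/to/your/input/structure"
tokenized_dir = f"{pdb_dir}/tokenized_protein"

# Step 1: tokenize the PDB files into structure tokens.
subprocess.run(
    ["python", "src/byprot/utils/protein/tokenize_pdb.py",
     "--input_pdb_folder", pdb_dir,
     "--output_dir", tokenized_dir],
    check=True,
)

# Step 2: run inverse folding on the tokenized structures.
subprocess.run(
    ["python", "generate_dplm2.py",
     "--model_name", "airkingbd/dplm2_3b",
     "--task", "inverse_folding",
     "--input_fasta_path", f"{tokenized_dir}/struct.fasta",
     "--max_iter", "100",
     "--unmasking_strategy", "deterministic",
     "--sampling_strategy", "argmax",
     "--saveto", "generation-results/dplm2_3b"],
    check=True,
)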

Motif Scaffolding

DPLM-2 supports multimodal motif scaffolding by conditioning on both the sequence and structure tokens of the motif and co-generating the scaffold sequence and structure:

model_name=dplm2_3b
output_dir=./generation-results/${model_name}/motif_scaffold

python run/scaffold_generate_dplm2.py \
    --model_name airkingbd/${model_name} \
    --num_seqs 100 \
    --saveto ${output_dir}

See the official repository for required motif data preparation and evaluation steps.

Training Data and Training Procedure

DPLM-2 is trained on experimental structures from the PDB and AlphaFold2-predicted structures from SwissProt. The authors provide the preprocessed training dataset on Hugging Face as airkingbd/pdb_swissprot.
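
If the dataset can be loaded through the Hugging Face datasets library (an assumption; it may instead be distributed as raw data files, in which case see the dataset page for download instructions), inspecting it would look like:

from datasets import load_dataset

# Assumption: airkingbd/pdb_swissprot is loadable via load_dataset;
# consult the dataset card if this raises an error.
ds = load_dataset("airkingbd/pdb_swissprot")
print(ds)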

The official DPLM repository describes the following training setup for dplm2_3b:

  • Initialize from the pretrained DPLM checkpoint airkingbd/dplm_3b
  • Use a warm-up training strategy to mitigate the scarcity of structure data
  • Use LoRA to limit large parameter shifts during multimodal training (see the conceptual sketch below)
  • Use airkingbd/struct_tokenizer for structure tokenization

The experiment configuration is available in the official repository at configs/experiment/dplm2/dplm2_3b.yaml.
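
As a conceptual illustration of the LoRA idea mentioned above (not the repository's actual implementation), a low-rank adapter wraps a frozen pretrained linear layer so that only the small factors A and B are trained:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update.

    The effective weight is W + (alpha / r) * B @ A, where W is the
    frozen pretrained weight. Training only A and B limits how far
    the parameters can drift during multimodal fine-tuning.
    """

    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # initial update is zero
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))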

Evaluation Summary

The DPLM repository reports DPLM-2 results on multiple protein generation and understanding tasks, including sequence-structure co-generation, forward folding, inverse folding, motif scaffolding, and representation learning. For full tables, baselines, metrics, and evaluation details, refer to the DPLM-2 paper, the DPLM-2.1 paper, and the official bytedance/dplm repository.

Citation

If you use this checkpoint, please cite the DPLM, DPLM-2, and DPLM-2.1 papers:

@inproceedings{wang2024dplm,
  title={Diffusion Language Models Are Versatile Protein Learners},
  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle={International Conference on Machine Learning},
  year={2024}
}

@inproceedings{wang2025dplm2,
  title={DPLM-2: A Multimodal Diffusion Protein Language Model},
  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle={International Conference on Learning Representations},
  year={2025}
}

@inproceedings{hsieh2025dplm2_1,
  title={Elucidating the Design Space of Multimodal Protein Language Models},
  author={Hsieh, Cheng-Yen and Wang, Xinyou and Zhang, Daiheng and Xue, Dongyu and Ye, Fei and Huang, Shujian and Zheng, Zaixiang and Gu, Quanquan},
  booktitle={International Conference on Machine Learning},
  year={2025}
}

Acknowledgements

DPLM builds on and acknowledges prior work and resources including ByProt, EvoDiff, SaProt, ESM, LM-Design, EigenFold, MultiFlow, FrameFlow, and OpenFold-related structure modeling utilities. See the official repository for the complete acknowledgements and implementation details.
