Ying Liu
- Jan 15
- 4 min read

A summary of the research paper on protein generation using EvoDiff

The original paper can be found here. All figures are from this paper as well.

In this paper, the authors introduce EvoDiff, a new protein generation method based on a discrete diffusion model. EvoDiff generates proteins that are inaccessible to structure-based models, such as those with disordered regions, while maintaining the ability to design scaffolds for functional structural motifs, which demonstrates the universality of the sequence-based formulation.

For unconditional generation, EvoDiff starts from a fully corrupted (masked or random) sequence and generates a protein sequence from it. The generated sequence is then folded with OmegaFold. The resulting structures show good foldability and reasonable biological properties. EvoDiff also enables different forms of conditioning, including evolution-guided generation as well as inpainting and scaffolding.

Previous protein generative models trained on global sequence space have been either left-to-right autoregressive (LRAR) models or masked language models (MLMs). This paper explores several methods that build on LRAR models and MLMs.

Method

Several modelling choices were explored in this paper. First, the input protein data are either:

Single protein sequence;
Multiple Sequence Alignments (MSA).

Second, the way amino acids are corrupted at each step of the forward process:

Order-agnostic autoregressive diffusion model (OADM);
Discrete denoising diffusion probabilistic model (D3PM).

Third, the transition matrix in D3PM was tested in two forms:

Uniform transition matrix;
Biologically-informed transition matrix.

Order-agnostic autoregressive diffusion model (OADM)

In the forward (encoding) process of the diffusion model, one more position of the sequence is masked at each step until the full sequence is masked. The reverse process then decodes in an order-agnostic way, unlike LRAR models, which learn to generate sequences in a pre-specified left-to-right decoding order.

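At each step t, the loss is the log-likelihood averaged over the positions that are still masked. Following the order-agnostic autoregressive formulation, it can be written in the form

ℒ_t = 1/(L − t + 1) · 𝔼_{σ∼𝒰(S_L)} Σ_{k ∈ σ(≥t)} log p_θ(x_k | x_{σ(<t)})

where L is the sequence length, 𝒰(S_L) is the uniform distribution over decoding orders σ, and the positions σ(≥t), which have not yet been decoded at step t, are masked.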
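The masking schedule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the mask symbol `#`, the example sequence and the helper names are hypothetical and are not taken from the paper or its code.

```python
import random

MASK = "#"  # hypothetical mask symbol, not taken from the paper


def sample_decoding_order(length, rng):
    """Draw a decoding order sigma uniformly from all permutations of the positions."""
    order = list(range(length))
    rng.shuffle(order)
    return order


def mask_at_step(sequence, order, t):
    """Return the sequence as seen at step t (1-indexed).

    Positions order[:t-1] are already decoded and stay visible;
    the remaining len(sequence) - t + 1 positions are masked.
    """
    visible = set(order[: t - 1])
    return "".join(aa if i in visible else MASK for i, aa in enumerate(sequence))


def step_weight(length, t):
    """Weight 1 / (L - t + 1) that averages the loss over the masked positions."""
    return 1.0 / (length - t + 1)


rng = random.Random(0)
seq = "MKTAYIAK"  # toy sequence for illustration
sigma = sample_decoding_order(len(seq), rng)
views = [mask_at_step(seq, sigma, t) for t in range(1, len(seq) + 1)]
```

At step 1 every position is masked, and each later step reveals one more position in the sampled order, so no left-to-right order is imposed.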

Discrete denoising diffusion probabilistic model (D3PM)

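This method mutates amino acids at each step of the forward (encoding) process by applying a transition matrix Q to the sequence. Discrete inputs are iteratively corrupted via a controlled Markov process until they constitute samples from a uniform stationary distribution. In the standard D3PM formulation, the forward corruption process is described by:

q(x_t | x_{t-1}) = Cat(x_t; p = x_{t-1} Q_t)

where x_{t-1} is the one-hot row vector of a residue at step t-1 and Q_t is the transition matrix at step t.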

If the transition matrix is derived from BLOSUM matrices of amino acid substitution frequencies, the model is called EvoDiff-D3PM-BLOSUM. If the transition matrix is uniform, it is called EvoDiff-D3PM-Uniform.
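To make the uniform case concrete, here is a minimal sketch of one forward corruption step. It assumes the common uniform D3PM matrix Q = (1 - beta) I + (beta / K) 11ᵀ over the 20 standard amino acids; the value of `beta`, the function names and the toy sequence are illustrative assumptions, and the BLOSUM-derived matrix is not reproduced here.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def uniform_transition_matrix(beta, k=len(AMINO_ACIDS)):
    """Build Q = (1 - beta) * I + (beta / k) * ones.

    Row i is the categorical distribution over the next state
    given that the current residue is AMINO_ACIDS[i].
    """
    return [[(1.0 - beta) * (i == j) + beta / k for j in range(k)] for i in range(k)]


def corrupt_step(sequence, q, rng):
    """Apply one forward step: resample every residue from its row of Q."""
    out = []
    for aa in sequence:
        row = q[AMINO_ACIDS.index(aa)]
        out.append(rng.choices(AMINO_ACIDS, weights=row, k=1)[0])
    return "".join(out)


q = uniform_transition_matrix(beta=0.1)  # beta is an arbitrary illustrative value
noisy = corrupt_step("MKTAYIAK", q, random.Random(0))
```

Repeating `corrupt_step` many times drives each position towards a uniform distribution over amino acids, which is the stationary distribution mentioned above.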

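EvoDiff-D3PM models are trained via a hybrid loss function. In the D3PM formulation, a hybrid loss combines the variational lower bound with a weighted cross-entropy term on the predicted uncorrupted sequence.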

Baseline and Metric

The EvoDiff models are compared with the test sets (UniRef50, Protein Data Bank, OpenFold), an LRAR model from UCSF, the Transformer protein language model ESM from Meta/NYU/Stanford/MIT, RFdiffusion from the University of Washington, and Potts models.

Among the baseline models, RFdiffusion generates a structure that fulfils the desired constraints and then designs a sequence that folds into that structure. EvoDiff, however, designs the sequence first, which is then folded using OmegaFold.

The generated proteins are evaluated with the following metrics:

Structural plausibility

foldability - average predicted local distance difference test (pLDDT, higher is better)
self-consistency - self-consistency perplexity (scPerplexity, lower is better)
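As a rough illustration of how these two numbers are aggregated, the sketch below averages per-residue pLDDT values and turns per-residue log-probabilities into a perplexity. The inputs are hypothetical placeholders: the structure predictor and the model that scores the sequence against the predicted structure are not included, and the function names are my own.

```python
import math


def mean_plddt(per_residue_plddt):
    """Average per-residue pLDDT over the sequence (higher is better)."""
    return sum(per_residue_plddt) / len(per_residue_plddt)


def perplexity(per_residue_log_probs):
    """Perplexity = exp(mean negative log-likelihood), natural log (lower is better)."""
    nll = -sum(per_residue_log_probs) / len(per_residue_log_probs)
    return math.exp(nll)


# Toy inputs: a model that gives every residue probability 0.25.
toy_log_probs = [math.log(0.25)] * 4
toy_ppl = perplexity(toy_log_probs)
toy_plddt = mean_plddt([80.0, 90.0])
```

A perplexity of K can be read as the model being as uncertain as a uniform choice among K residues at each position.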

Biological properties

Coverage of the distribution of sequence and functional properties
Distribution of structural properties
Memorisation of data - Hamming distance
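The memorisation check can be sketched as a nearest-neighbour search under Hamming distance. This is a simplified illustration: it assumes that only training sequences of the same length are compared, which may differ from the paper's exact procedure, and the function names are hypothetical.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def min_hamming_to_training(generated, training_set):
    """Smallest Hamming distance to any same-length training sequence.

    Returns None when no training sequence has the same length
    (an assumption made for this sketch).
    """
    same_length = [s for s in training_set if len(s) == len(generated)]
    if not same_length:
        return None
    return min(hamming_distance(generated, s) for s in same_length)


toy_training = ["MKTAYIAK", "MKTAYIAR", "MKT"]
closest = min_hamming_to_training("MKTAYLAR", toy_training)
```

A distance of 0 would mean the generated sequence is an exact copy of a training sequence; larger minimum distances indicate less memorisation.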

Sample diversity

aligned residue-wise sequence similarity between the generated query sequence and the most similar sequence in the original MSA Cation Diffusion Facilitator (CDF) family.
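One plausible way to compute such a similarity is the fraction of identical residues over aligned columns. The sketch below assumes rows of equal alignment length with `-` as the gap character and skips any column where either row has a gap; the paper may treat gaps differently, and the function names are hypothetical.

```python
GAP = "-"  # assumed gap character in the alignment


def aligned_identity(query, other):
    """Fraction of aligned columns (no gap in either row) with identical residues."""
    if len(query) != len(other):
        raise ValueError("rows of an MSA must have the same length")
    pairs = [(q, o) for q, o in zip(query, other) if q != GAP and o != GAP]
    if not pairs:
        return 0.0
    return sum(q == o for q, o in pairs) / len(pairs)


def max_similarity_to_msa(query, msa_rows):
    """Similarity between the query and its most similar row in the MSA."""
    return max(aligned_identity(query, row) for row in msa_rows)


toy_msa = ["ACDF", "AC-E", "WWWW"]
best = max_similarity_to_msa("ACDE", toy_msa)
```

A low value of `max_similarity_to_msa` means the generated query is far from every sequence in the original alignment.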

Scaffold success

Motif coordinates have less than 1 Å root-mean-square deviation (RMSD) from the desired motif coordinates.
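The success criterion can be written as a short check. This sketch assumes the predicted and desired motif coordinates are already superimposed and listed in the same atom order; it does not perform the structural alignment step, and the helper names are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def motif_recovered(predicted, desired, threshold=1.0):
    """True when the motif RMSD is below the threshold (in Å)."""
    return rmsd(predicted, desired) < threshold


desired = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
close = [(x + 0.5, y, z) for x, y, z in desired]  # shifted by 0.5 Å
far = [(x + 1.0, y, z) for x, y, z in desired]    # shifted by exactly 1 Å
```

Because the criterion is a strict inequality, a motif that deviates by exactly 1 Å does not count as a success in this sketch.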

Disorder score

DR-BERT (Disorder-Region Bidirectional Encoder Representations from Transformers) predicts disorder scores for each residue in the generated and natural sequences.

Structural similarity

The TM score (template modelling score) measures structural similarity.

Dataset

Sequence-only EvoDiff models were trained on UniRef50, which contains approximately 45 million protein sequences. The UniRef50 release and the train/validation/test splits from CARP were used to facilitate comparisons between models. Sequences longer than 1024 residues were randomly subsampled to 1024 residues. Multiple sequence alignment (MSA) EvoDiff models were trained on OpenFold, which contains 401,381 MSAs for 140,000 unique Protein Data Bank (PDB) chains and 16,000,000 UniClust30 clusters.
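The length cap can be illustrated with a small helper. It assumes that "randomly subsampled" means taking a random contiguous window of 1024 residues; the text does not spell out the exact scheme, so treat this as one reasonable reading rather than the authors' procedure.

```python
import random

MAX_LEN = 1024  # length cap stated in the text


def subsample(sequence, rng, max_len=MAX_LEN):
    """Return the sequence unchanged if short enough, else a random contiguous window."""
    if len(sequence) <= max_len:
        return sequence
    start = rng.randint(0, len(sequence) - max_len)  # inclusive on both ends
    return sequence[start : start + max_len]


rng = random.Random(0)
long_seq = "ACDEFGHIKLMNPQRSTVWY" * 100  # toy sequence of 2000 residues
window = subsample(long_seq, rng)
short = subsample("MKTAYIAK", rng)
```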

Result

First, all sequence-only EvoDiff models are compared with LRAR and ESM-2 on unconditional protein generation to evaluate their structural plausibility.

They are then compared with the test set, RFdiffusion and ESM-2 to evaluate biological properties.

EvoDiff-OADM outperforms EvoDiff-D3PM, so the OADM branch was extended to take MSAs as input, giving EvoDiff-MSA.

EvoDiff-MSA generates more foldable and self-consistent sequences than sampling from ESM-MSA or using Potts models trained on individual MSAs. It also yields sequences that exhibit strikingly low similarity to those in the original MSA while still retaining structural integrity relative to the original query sequences.

EvoDiff-MSA enables conditional protein sequence generation, including the generation of intrinsically disordered regions (IDRs) and of scaffolds.

IDRs inpainted by EvoDiff-Seq and EvoDiff-MSA result in distributions of disorder scores similar to those for natural sequences, across both the IDR and the surrounding structured regions. The putative IDRs generated by EvoDiff-Seq are less similar to their original IDRs than those from EvoDiff-MSA.

Both models generate disordered regions that preserve disorder scores over the entire protein sequence and still exhibit low sequence similarity to the original IDR.
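The inpainting setup can be sketched as masking a region and refilling it in a random order while the flanking residues stay fixed. The `random_predictor` below is a stand-in that picks residues uniformly at random; it is not the trained EvoDiff model, and all names here are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def random_predictor(partial, position, rng):
    """Stand-in for a trained denoiser: picks a residue uniformly at random."""
    return rng.choice(AMINO_ACIDS)


def inpaint(sequence, start, end, predictor, rng):
    """Regenerate sequence[start:end] while keeping the flanking context fixed."""
    partial = list(sequence)
    for i in range(start, end):
        partial[i] = None  # None marks a masked position
    order = list(range(start, end))
    rng.shuffle(order)  # order-agnostic decoding of the masked region
    for pos in order:
        partial[pos] = predictor(partial, pos, rng)
    return "".join(partial)


original = "MKTAYIAKQR"
redesigned = inpaint(original, 3, 7, random_predictor, random.Random(0))
```

With a real model, `predictor` would condition on the visible residues in `partial`, which is what lets the new region fit its structured surroundings.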

Regarding scaffold generation, there is no correlation between the problem-specific success rates of EvoDiff and RFdiffusion, which suggests that EvoDiff may have strengths orthogonal to those of RFdiffusion.

EvoDiff-MSA generates scaffolds that are more structurally similar to the native scaffold than EvoDiff-Seq.

Summary

EvoDiff is the first demonstration of programmable generation capabilities from deep generative models of protein sequence alone. EvoDiff-Seq covers protein functional and structural space better than ESM-2 (an MLM) and RFdiffusion (a structure-based model). LRAR fits the evolutionary sequence distribution better, but it cannot perform conditional generation, whereas EvoDiff allows evolution-guided generation, inpainting and scaffolding.

Future work

1. Enable conditioning via guidance.

2. Fit guidance into the EvoDiff-D3PM framework, rather than OADM, through editing at every decoding step.

3. Condition generation on text or chemical information.