microsoft / foldingdiff


Diffusion models of protein structure; trigonometry and attention are all you need!

Home Page: https://arxiv.org/abs/2209.15611

License: MIT License

Python 2.68% Shell 0.01% Jupyter Notebook 97.31%

diffusion diffusion-models protein proteins transformer protein-structure-generation

foldingdiff's Introduction

foldingdiff - Diffusion model for protein backbone generation

We present a diffusion model for generating novel protein backbone structures. For more details, see our preprint on arXiv. We also host a trained version of our model on HuggingFace spaces and at SuperBio so you can get started with generating protein structures with just your browser!

Installation

This software is written in Python, notably using PyTorch, PyTorch Lightning, and the HuggingFace transformers library. The required conda environment is defined within the environment.yml file. To set this up, make sure you have conda (or mamba) installed, clone this repository using git clone, and run:

conda env create -f environment.yml
conda activate foldingdiff
pip install -e ./  # make sure ./ is the dir including setup.py

Downloading data

We require some data files that are not packaged on Git due to their large size. These are not required for sampling (as long as you are not using the --testcomparison option, see below), but they are required for training your own model. We provide a script in the data dir to download the requisite CATH data.

# Download the CATH dataset
cd data  # Ensure that you are in the data subdirectory within the codebase
chmod +x download_cath.sh
./download_cath.sh

If the download link in the .sh file is not working, the tarball is also mirrored at the following Dropbox link.

Training models

To train your own model on the CATH dataset, use the script at bin/train.py in combination with one of the json config files under config_jsons (or write your own). An example usage of this is as follows:

python bin/train.py config_jsons/cath_full_angles_cosine.json --dryrun

By default, the training script will calculate the KL divergence at each timestep before starting training, which can be quite computationally expensive with more timesteps. To skip this, append the --dryrun flag. The output of the model will be in the results folder with the following major files present:

results/
    - config.json           # Contains the config file for the huggingface BERT model itself
    - logs/                 # Contains the logs from training
    - models/               # Contains model checkpoints. By default we store the best 5 models by validation loss and the best 5 by training loss
    - training_args.json    # Full set of arguments, can be used to reproduce run

Pre-trained models

We provide weights for a model trained on the CATH dataset. These weights are stored on HuggingFace model hub at wukevin/foldingdiff_cath. The following code snippet shows how to load this model, load data (assuming it's been downloaded), and perform a forward pass:

from huggingface_hub import snapshot_download
from torch.utils.data.dataloader import DataLoader
from foldingdiff import modelling
from foldingdiff import datasets as dsets

# Load the model (files will be cached for future calls)
m = modelling.BertForDiffusion.from_dir(snapshot_download("wukevin/foldingdiff_cath"))

# Load dataset
# As part of loading, we try to compute internal angles in parallel. This may
# throw warnings like the following; this is normal.
# WARNING:root:Illegal values for omega in /home/*/projects/foldingdiff-main/data/cath/dompdb/2ebqA00 -- skipping
# After computing these once, the results are saved in a .pkl file under the
# foldingdiff source directory for faster loading in future calls.
clean_dset = dsets.CathCanonicalAnglesOnlyDataset(pad=128, trim_strategy='randomcrop')
noised_dset = dsets.NoisedAnglesDataset(clean_dset, timesteps=1000, beta_schedule='cosine')
dl = DataLoader(noised_dset, batch_size=32, shuffle=False)
x = next(iter(dl))  # Python 3 iterators have no .next() method; use the next() builtin

# Forward pass
predicted_noise = m(x['corrupted'], x['t'], x['attn_mask'])
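
The NoisedAnglesDataset call above takes timesteps=1000 and beta_schedule='cosine'. As a rough illustration of what such a schedule and the resulting noising of angles can look like, here is a minimal standard-library sketch. It assumes the 'cosine' option follows the commonly used cosine schedule for diffusion models and that noised angles are wrapped back into [-pi, pi); the function names, the offset s=0.008, and the wrapping convention are our assumptions for illustration, not a description of the foldingdiff source.

```python
import math
import random


def cosine_alpha_bar(timesteps, s=0.008):
    """Cumulative signal level alpha_bar[t] for t = 0..timesteps (assumed cosine form)."""
    f = [
        math.cos(((t / timesteps) + s) / (1 + s) * math.pi / 2) ** 2
        for t in range(timesteps + 1)
    ]
    return [v / f[0] for v in f]


def wrap_angle(x):
    """Wrap an angle in radians into the interval [-pi, pi)."""
    return (x + math.pi) % (2 * math.pi) - math.pi


def noise_angles(angles, t, alpha_bar, rng):
    """Sample noised angles at step t: scale the signal, add Gaussian noise, then wrap."""
    scale = math.sqrt(alpha_bar[t])
    sigma = math.sqrt(max(0.0, 1.0 - alpha_bar[t]))
    return [wrap_angle(scale * a + sigma * rng.gauss(0.0, 1.0)) for a in angles]


alpha_bar = cosine_alpha_bar(1000)
rng = random.Random(0)
noised = noise_angles([1.0, -2.5, 3.0], t=500, alpha_bar=alpha_bar, rng=rng)
```

At t=0 the sketch returns the input angles unchanged (up to floating point), and as t approaches the final timestep the output approaches pure wrapped noise.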

Sampling protein backbones

To sample protein backbones, use the script bin/sample.py. Example commands to do this using the pretrained weights described above are as follows.

# To sample 10 backbones per length ranging from [50, 128) with a batch size of 512 - reproduces results in our manuscript
python ~/projects/foldingdiff/bin/sample.py -l 50 128 -n 10 -b 512 --device cuda:0

This will run the trained model hosted at wukevin/foldingdiff_cath and generate backbones of varying lengths. If you wish to load the test dataset and include test chains in the generated plots, use the option --testcomparison; note that this requires downloading the CATH dataset, see above. Running sample.py will create the following directory structure in the directory where it is run:

some_dir/
    - plots/            # Contains plots comparing the distribution of training/generated angles
    - sampled_angles/   # Contains .csv.gz files with the sampled angles
    - sampled_pdb/      # Contains .pdb files from converting the sampled angles to cartesian coordinates
    - model_snapshot/   # Contains a copy of the model used to produce results
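
The sampled_pdb files come from converting sampled angles to Cartesian coordinates. One standard way to do this kind of conversion is the natural extension reference frame (NeRF) construction, which places each new atom from the three preceding atoms plus a bond length, a bond angle, and a torsion angle. The sketch below shows a single placement step using only the standard library; place_atom and its helpers are hypothetical names, and this is an illustration of the general technique rather than a copy of the conversion code used by sample.py.

```python
import math


def _sub(u, v):
    return [a - b for a, b in zip(u, v)]


def _cross(u, v):
    return [
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    ]


def _unit(u):
    norm = math.sqrt(sum(a * a for a in u))
    return [a / norm for a in u]


def place_atom(a, b, c, bond_length, bond_angle, torsion):
    """Place atom d given previous atoms a, b, c (angles in radians).

    bond_length is |c-d|, bond_angle is the angle b-c-d, and torsion is the
    dihedral a-b-c-d (0 = cis, pi = trans).
    """
    bc = _unit(_sub(c, b))              # unit vector along the b -> c bond
    n = _unit(_cross(_sub(b, a), bc))   # normal of the plane through a, b, c
    m = _cross(n, bc)                   # completes the local orthonormal frame
    local = (
        -bond_length * math.cos(bond_angle),
        bond_length * math.sin(bond_angle) * math.cos(torsion),
        bond_length * math.sin(bond_angle) * math.sin(torsion),
    )
    return [c[i] + local[0] * bc[i] + local[1] * m[i] + local[2] * n[i] for i in range(3)]


# Three starting atoms, then one new atom in the trans position
a, b, c = [0.0, 1.0, 0.0], [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]
d = place_atom(a, b, c, bond_length=1.5, bond_angle=math.pi / 2, torsion=math.pi)
```

Building a full backbone amounts to repeating this step along the chain, each time using the three most recently placed atoms.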

Not specifying a --device will default to the first device cuda:0; use --device cpu to run on CPU (though this will be very slow). See the following table for runtimes from our machines.

Device	Runtime estimates sampling 512 structures
Nvidia RTX 2080Ti	7 minutes
i9-9960X (16 physical cores)	2 hours

Maximum training similarity TM scores

After generating sequences, we can calculate TM-scores to evaluate the similarity between the generated sequences and the original sequences. This is done using the script under bin/tmscore_training.py and requires data to have been downloaded prior (see above).

Visualizing diffusion "folding" process

The above sampling code can also be run with the --fullhistory flag to write an additional subdirectory sample_history under each of the sampled_angles and sampled_pdb folders that contains pdb/csv files corresponding to each timestep in the sampling process. The pdb files, for example, can then be passed into the script under foldingdiff/pymol_vis.py to generate a gif of the folding process (as shown above). An example command to do this is:

python ~/projects/foldingdiff/foldingdiff/pymol_vis.py pdb2gif -i sampled_pdb/sample_history/generated_0/*.pdb -o generated_0.gif

Note this script lives separately from other plotting code because it depends on PyMOL; feel free to install/activate your own installation of PyMOL for this, or set up an environment using PyMOL open source.

Evaluating designability of generated backbones

One way to evaluate the quality of generated backbones is via their "designability". This refers to whether or not we can design an amino acid chain that will fold into the designed backbone. To evaluate this, we use an inverse folding model to generate amino acid sequences that are predicted to fold into our generated backbone, and check whether those generated sequences actually fold into a structure comparable to our backbone.

Inverse folding

Inverse folding is the task of predicting a sequence of amino acids that will produce a given protein backbone structure. We evaluated two different methods for this step, ProteinMPNN and ESM-IF1; we find ProteinMPNN to be significantly more performant. In our analyses, we generate 8 different amino acid sequences for each of FoldingDiff's generated structures.

ESM-IF1

We use a different conda environment for ESM-IF1; see this Jupyter notebook for setup details. We found that the following series of commands works on our machines:

mamba create -n inverse python=3.9 pytorch cudatoolkit pyg -c pytorch -c conda-forge -c pyg
conda activate inverse
mamba install -c conda-forge biotite
pip install git+https://github.com/facebookresearch/esm.git

After this, we cd into the folder that contains the sampled_pdb directory created by the prior step, and run:

python ~/projects/foldingdiff/bin/pdb_to_residues_esm.py sampled_pdb -o esm_residues

This creates a new folder, esm_residues, that contains 10 potential residues for each of the pdb files contained in sampled_pdb.

ProteinMPNN

To set up ProteinMPNN, see the authors' guide on their GitHub.

After this, we follow a similar procedure as for ESM-IF1 (above) where we cd into the directory containing the sampled_pdb folder and run:

python ~/projects/foldingdiff/bin/pdb_to_residue_proteinmpnn.py sampled_pdb

This will create a new directory called proteinmpnn_residues containing 8 amino acid chains per sampled PDB structure.

Structural prediction

After generating amino acid sequences, we check that these recapitulate our original sampled structures by passing them through either OmegaFold or AlphaFold. After running one of these structure predictors, we use the following command to assess self-consistency TM scores:

python ~/projects/foldingdiff/bin/sctm.py -f alphafold_predictions_proteinmpnn

Here, alphafold_predictions_proteinmpnn is a folder containing the folded structures corresponding to inverse folded amino acid sequences. This produces a json file of all scTM scores, as well as various pdf files containing plots and correlations of the scTM score distribution.
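
To make the self-consistency idea concrete, the sketch below summarizes per-backbone TM scores into a single scTM value and a designable fraction. It assumes that the scTM of a backbone is the best TM score among its inverse-folded sequences, and it uses a cutoff of 0.5, a value often used with TM-scores to indicate a shared fold. The input layout, the function name summarize_sctm, and the cutoff are assumptions for illustration and do not describe the output format of sctm.py.

```python
def summarize_sctm(tm_scores_per_backbone, threshold=0.5):
    """Return (per-backbone scTM, fraction of backbones at or above threshold).

    tm_scores_per_backbone maps a backbone name to the TM scores of the
    structures predicted for its candidate sequences (hypothetical layout).
    """
    sctm = {
        name: max(scores)
        for name, scores in tm_scores_per_backbone.items()
        if scores  # skip backbones with no folded sequences
    }
    if not sctm:
        return sctm, 0.0
    n_designable = sum(1 for score in sctm.values() if score >= threshold)
    return sctm, n_designable / len(sctm)


example = {
    "generated_0": [0.31, 0.62, 0.48],
    "generated_1": [0.22, 0.41],
}
sctm, designable_fraction = summarize_sctm(example)
```

Taking the maximum rewards a backbone if any one of its candidate sequences folds back to it, which matches the question being asked: does a suitable sequence exist?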

OmegaFold

We primarily use OmegaFold to fold the amino acid sequences produced by either ESM-IF1 or ProteinMPNN. This is due to OmegaFold's relatively fast runtime compared to AlphaFold2, and due to the fact that OmegaFold is natively designed to be run without MSA information - making it more suitable for our protein design task.

After creating and activating a separate conda environment and following the authors' instructions for installing OmegaFold, we use the following script to split our input amino acid fasta files across GPUs for inference, and subsequently calculate the self-consistency TM (scTM) scores.

# Fold each fasta, spreading the work over GPUs 0 and 1, outputs to omegafold_predictions folder
python ~/projects/foldingdiff/bin/omegafold_across_gpus.py esm_residues/*.fasta -g 0 1
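
Spreading fasta files over several GPUs can be as simple as a round-robin assignment, with one worker process per GPU. The sketch below shows only the assignment step; split_round_robin is a hypothetical helper written for illustration, and we do not claim that omegafold_across_gpus.py partitions its inputs this way.

```python
def split_round_robin(items, gpu_ids):
    """Assign items to GPU ids in round-robin order, preserving input order."""
    if not gpu_ids:
        raise ValueError("At least one GPU id is required")
    buckets = {gpu: [] for gpu in gpu_ids}
    for i, item in enumerate(items):
        buckets[gpu_ids[i % len(gpu_ids)]].append(item)
    return buckets


fastas = ["a.fasta", "b.fasta", "c.fasta", "d.fasta", "e.fasta"]
assignment = split_round_robin(fastas, gpu_ids=[0, 1])
```

Each worker can then be restricted to its own device, for example by setting the CUDA_VISIBLE_DEVICES environment variable before launching it.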

AlphaFold2

We run AlphaFold2 via the localcolabfold installation method (see GitHub). Due to AlphaFold's runtime requirements, we provide scripts to split the set of fasta files into subdirectories that can then be separately folded; see SLURM script under scripts/slurm/alphafold.sbatch for an example.

Tests

Tests are implemented through a mixture of doctests and unittests. To run unittests, run:

python -m unittest -v

You may see warnings like the following; these are expected.

WARNING:root:Illegal values for omega in protdiff-main/data/cath/dompdb/5a2qw00 -- skipping
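
For readers unfamiliar with doctests, the sketch below shows the general pattern: example calls and their expected output live in the docstring and are checked by Python's doctest module. The helper pad_to_length is hypothetical and is not part of foldingdiff; a file like this can be checked with python -m doctest -v followed by the file name.

```python
def pad_to_length(values, length, pad_value=0.0):
    """Pad or truncate a list so that it has exactly `length` entries.

    >>> pad_to_length([1.0, 2.0], 4)
    [1.0, 2.0, 0.0, 0.0]
    >>> pad_to_length([1.0, 2.0, 3.0], 2)
    [1.0, 2.0]
    """
    padded = list(values) + [pad_value] * max(0, length - len(values))
    return padded[:length]


if __name__ == "__main__":
    import doctest

    doctest.testmod()
```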

foldingdiff's Issues

How to do inpainting?

I'm wondering how inpainting can be done with this pretrained model. By in-painting I mean feeding a structure to the model and letting it only change a few angles in the middle of the structure.

What is the task formulation?

What is the task formulation? I didn't find it in the paper. Thank you very much.

Error when I run Sampling protein backbones

I have an AssertionError: Expected /home/myname/projects to be empty!
My file structure is: /home/myname/projects/foldingdiff-main or /home/myname/projects/folddiff
Sadly, neither of them works. The error message is always AssertionError.
Can you show me a solution?

Error when using script partial_noise_reconstruct.py

Hello,

It is a good idea to incorporate a script to produce variants from starting structures (aka inpainting). I was trying to run the script partial_noise_reconstruct.py but I got the following error:

$ python ../bin/partial_noise_reconstruct.py protein1.pdb protein2.pdb output.json

INFO:root:Detected huggingface repo ID wukevin/foldingdiff_cath
Fetching 6 files: 100%|█████████████████████| 6/6 [00:00<00:00, 23410.07it/s]
INFO:root:Using downloaded model at /home/tn/.cache/huggingface/hub/models--wukevin--foldingdiff_cath/snapshots/98d77b1e68468db5ca03cdba1c0a90f2a2a33edc
INFO:root:Loading dataset from 2 pdb files
INFO:root:Clean dataset class: <class 'foldingdiff.datasets.CathCanonicalAnglesOnlyDataset'>
Traceback (most recent call last):
  File "../bin/partial_noise_reconstruct.py", line 132, in <module>
    main()
  File "../bin/partial_noise_reconstruct.py", line 116, in main
    scores = get_reconstruction_error(
  File "../bin/partial_noise_reconstruct.py", line 94, in get_reconstruction_error
    dset = load_dataset(pdb_files, model)
  File "../bin/partial_noise_reconstruct.py", line 34, in load_dataset
    dset = clean_dset_class(
  File "/home/tn/foldingdiff/foldingdiff/datasets.py", line 499, in __init__
    super().__init__(*args, **kwargs)
  File "/home/tn/foldingdiff/foldingdiff/datasets.py", line 123, in __init__
    fnames = self.__get_pdb_fnames(pdbs)
  File "/home/tn/foldingdiff/foldingdiff/datasets.py", line 238, in __get_pdb_fnames
    if Path(pdbs).is_dir():
  File "/home/tn/.conda/envs/foldingdiff/lib/python3.8/pathlib.py", line 1042, in __new__
    self = cls._from_parts(args, init=False)
  File "/home/tn/.conda/envs/foldingdiff/lib/python3.8/pathlib.py", line 683, in _from_parts
    drv, root, parts = self._parse_args(args)
  File "/home/tn/.conda/envs/foldingdiff/lib/python3.8/pathlib.py", line 667, in _parse_args
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not list

Could you please help on this issue?

Where is wrapped gaussian sampling done during noising?

The only code where I see a loop over k is in ScoreMatchingNoisedAnglesDataset class.

I am not sure that I was able to find the use of the wrapped gaussian during generation of the noised samples. Can you point me to the relevant code where this is being done?

AttributeError in CathCanonicalAnglesOnlyDataset

When I try to run the train script from the readme, I get this error:

Traceback (most recent call last):
  File "bin/train.py", line 576, in <module>
    main()
  File "bin/train.py", line 563, in main
    train(**config_args)
  File "bin/train.py", line 342, in train
    dsets = get_train_valid_test_sets(
  File "bin/train.py", line 145, in get_train_valid_test_sets
    clean_dsets = [
  File "bin/train.py", line 146, in <listcomp>
    clean_dset_class(
  File "/home/kevyan/src/foldingdiff/foldingdiff/datasets.py", line 482, in __init__
    super().__init__(*args, **kwargs)
  File "/home/kevyan/src/foldingdiff/foldingdiff/datasets.py", line 142, in __init__
    elif use_cache and os.path.exists(self.cache_fname):
  File "/home/kevyan/src/foldingdiff/foldingdiff/datasets.py", line 264, in cache_fname
    for fname in self.fnames:
AttributeError: 'CathCanonicalAnglesOnlyDataset' object has no attribute 'fnames'

download data problem

Hello, when I download data with the script download_alphafold.sh, the files are 404 NOT FOUND,
and with the script download_cath.sh I get: md5sum: cath/cath-dataset-nonredundant-S40.pdb.tgz: no properly formatted MD5 checksum lines found
Could you please fix it? Thanks.

Submit training code error

I'm sorry to bother you, but I ran python bin/train.py config_jsons/cath_full_angles_cosine.json --dryrun in foldingdiff_main. The error is as follows:
from foldingdiff import datasets
ModuleNotFoundError: No module named 'foldingdiff'
Looking forward to your reply

Does the input of the model not include amino acid residue sequence information?

So what can the model be used for?

ERROR 404: Not Found when I download the AlphaFold files

Great work. I ran into a small problem: when I use your script to download the AlphaFold data, it tells me ERROR 404: Not Found. Maybe there is something wrong with the AlphaFold database path.

Minimum number of structures to train model

Hello,

I was performing some tests and it seems that there is a minimum number of protein structures to train a model. I have tested datasets with 2 through 10 structures (similar domains) and the pipeline runs starting at 10 structures.

Is this correct, or is there something I am not considering?

Thanks

BERT or Transformer?

Hi Kevin,

Thank you for the impressive work!

In section 3.3, it says "a vanilla bidirectional transformer architecture" is adopted with a citation to the original transformer paper. Also in Appendix C2, "an auto-regressive transformer" is used as a baseline.

I am quite confused since it looks like the implementation uses a BERT architecture (for both the main model and the autoregressive baseline). I am wondering whether the implementation or the preprint has been updated.

Best,

Missing CATH dataset

Hi,
I tried to run the data downloading script download_cath.sh as suggested in README, but the error below showed up, saying "cath-dataset-nonredundant-S40.pdb.tgz" file is missing.

I also checked the remote file directory /cath/releases/latest-release/non-redundant-data-sets via FTP. There seem to be no files there. Wondering if you could help to fix it. Thanks a lot:)

Conditional generation

Thanks to the author for sharing; this is wonderful work.
I'd like to ask you a question. I want to achieve conditional generation of the three-dimensional structure of a protein. For example, given the amino acid sequence of a target protein as input, generate the corresponding three-dimensional structure. Can conditional generation be realized by modifying FoldingDiff? Can you briefly explain how to modify it? I would appreciate it if you could give me some help.

CUDA requirement

Hello,

I am trying to train the model and my system does not have cuda capabilities. If I run the bin/train.py I got the message "assert torch.cuda.is_available(), "Requires CUDA to train" "
Is there a way to bypass this issue?

EDIT: I have commented the line throwing the message and realized that an option for CPU-use only is available.

Thanks

Training my own dataset

Hi, thank you for sharing your very interesting folding model!
I want to train a model using my own dataset. Do I just need to put the dataset in the data\cath directory? Also, I would appreciate it if you could explain how to create a compatible Dataset. It seems like PDB files are not compatible.

backbone with clashes

Hi,

Very interesting work.

I tried the sampling command 'python ~/projects/foldingdiff/bin/sample.py -l 50 128 -n 10 -b 512 --device cuda:0' and I observed a lot of steric clashes within the generated PDB structures. Any insights about where these clashes might come from? Based on the paper, it seems rebuilding the backbone from angles does not really cause problems. So it could be the training/mode/data?

Best,
Xiaotong
