
RFdiffusion


Image: Ian C. Haydon / UW Institute for Protein Design

Description

RFdiffusion is an open-source method for structure generation, with or without conditional information (a motif, target, etc.). It can perform a whole range of protein design challenges, as outlined in the RFdiffusion paper.

Things Diffusion can do

  • Motif Scaffolding
  • Unconditional protein generation
  • Symmetric unconditional generation (cyclic, dihedral and tetrahedral symmetries currently implemented, more coming!)
  • Symmetric motif scaffolding
  • Binder design
  • Design diversification ("partial diffusion", sampling around a design)


Getting started / installation

Thanks to Sergey Ovchinnikov, RFdiffusion is available as a Google Colab Notebook if you would like to run it there!

We strongly recommend reading this README carefully before getting started with RFdiffusion, and working through some of the examples in the Colab Notebook.

If you want to set up RFdiffusion locally, follow the steps below:

To get started using RFdiffusion, clone the repo:

git clone https://github.com/RosettaCommons/RFdiffusion.git

You'll then need to download the model weights into the RFdiffusion directory.

cd RFdiffusion
mkdir models && cd models
wget http://files.ipd.uw.edu/pub/RFdiffusion/6f5902ac237024bdd0c176cb93063dc4/Base_ckpt.pt
wget http://files.ipd.uw.edu/pub/RFdiffusion/e29311f6f1bf1af907f9ef9f44b8328b/Complex_base_ckpt.pt
wget http://files.ipd.uw.edu/pub/RFdiffusion/60f09a193fb5e5ccdc4980417708dbab/Complex_Fold_base_ckpt.pt
wget http://files.ipd.uw.edu/pub/RFdiffusion/74f51cfb8b440f50d70878e05361d8f0/InpaintSeq_ckpt.pt
wget http://files.ipd.uw.edu/pub/RFdiffusion/76d00716416567174cdb7ca96e208296/InpaintSeq_Fold_ckpt.pt
wget http://files.ipd.uw.edu/pub/RFdiffusion/5532d2e1f3a4738decd58b19d633b3c3/ActiveSite_ckpt.pt
wget http://files.ipd.uw.edu/pub/RFdiffusion/12fc204edeae5b57713c5ad7dcb97d39/Base_epoch8_ckpt.pt

Optional:
wget http://files.ipd.uw.edu/pub/RFdiffusion/f572d396fae9206628714fb2ce00f72e/Complex_beta_ckpt.pt

# original structure prediction weights
wget http://files.ipd.uw.edu/pub/RFdiffusion/1befcb9b28e2f778f53d47f18b7597fa/RF_structure_prediction_weights.pt

Conda Install SE3-Transformer

Ensure that you have either Anaconda or Miniconda installed.

You also need to install NVIDIA's implementation of SE(3)-Transformers. Here is how to install the NVIDIA SE(3)-Transformer code:

conda env create -f env/SE3nv.yml

conda activate SE3nv
cd env/SE3Transformer
pip install --no-cache-dir -r requirements.txt
python setup.py install
cd ../.. # change into the root directory of the repository
pip install -e . # install the rfdiffusion module from the root of the repository

Anytime you run diffusion you should be sure to activate this conda environment by running the following command:

conda activate SE3nv

Total setup should take less than 30 minutes on a standard desktop computer. Note: Due to the variation in GPU types and drivers that users have access to, we are not able to make one environment that will run on all setups. As such, we are only providing a yml file with support for CUDA 11.1 and leaving it to each user to customize it to work on their setups. This customization will involve changing the cudatoolkit and (possibly) the PyTorch version specified in the yml file.


Get PPI Scaffold Examples

To run the scaffolded protein binder design (PPI) examples, we have provided some example scaffold files (examples/ppi_scaffolds_subset.tar.gz). You'll need to untar this:

tar -xvf examples/ppi_scaffolds_subset.tar.gz -C examples/

We will explain what these files are and how to use them in the Fold Conditioning section.


Usage

In this section we will demonstrate how to run diffusion.


Running the diffusion script

The actual script you will execute is called scripts/run_inference.py. There are many ways to run it, governed by hydra configs. Hydra configs are a nice way of specifying many different options, with sensible defaults drawn directly from the model checkpoint, so inference should always, by default, match training. What this means is that the default values in config/inference/base.yml might not match the actual values used during inference with a specific checkpoint. This is all handled under the hood.


Basic execution - an unconditional monomer


Let's first look at how you would do unconditional design of a protein of length 150aa. For this, we just need to specify three things:

  1. The length of the protein
  2. The location where we want to write files to
  3. The number of designs we want

./scripts/run_inference.py 'contigmap.contigs=[150-150]' inference.output_prefix=test_outputs/test inference.num_designs=10

Let's look at this in detail. Firstly, what is contigmap.contigs? Hydra configs tell the inference script how it should be run. To keep things organised, the config has different sub-configs, one of them being contigmap, which pertains to everything related to the contig string (which defines the protein being built). Take a look at the config file if this isn't clear: configs/inference/base.yml. Anything in the config can be overwritten manually from the command line. You could, for example, change how the diffuser works:

diffuser.crd_scale=0.5

... but don't do this unless you really know what you're doing!!

Now, what does 'contigmap.contigs=[150-150]' mean? To those who have used RFjoint inpainting, this might look familiar, but a little bit different. Diffusion, in fact, uses the identical 'contig mapper' as inpainting, except that, because we're using hydra, we have to give this to the model in a different way. For hydra reasons, the contig string has to be passed as a single item in a list, rather than as a string, and the entire argument MUST be enclosed in '' so that the command line does not attempt to parse any of the special characters.

The contig string allows you to specify a length range, but here, we just want a protein of 150aa in length, so we specify [150-150]. This will then run 10 diffusion trajectories, saving the outputs to your specified output folder.

NB: the first time you run RFdiffusion, it will take a while at the 'Calculating IGSO3' step. Once it has done this, the result is cached for future runs. For an additional example of unconditional monomer generation, take a look at ./examples/design_unconditional.sh in the repo!


Motif Scaffolding

RFdiffusion can be used to scaffold motifs, in a manner akin to Constrained Hallucination and RFjoint Inpainting. In general, RFdiffusion significantly outperforms both Constrained Hallucination and RFjoint Inpainting.


When scaffolding protein motifs, we need a way of specifying that we want to scaffold some particular protein input (one or more segments from a .pdb file), and to be able to specify how we want these connected, and by how many residues, in the new scaffolded protein. What's more, we want to be able to sample different lengths of connecting protein, as we generally don't know a priori precisely how many residues we'll need to best scaffold a motif. This job of specifying inputs is handled by contigs, governed by the contigmap config in the hydra config. For those familiar with Constrained Hallucination or RFjoint Inpainting, the logic is very similar. Briefly:

  • Anything prefixed by a letter indicates that this is a motif, with the letter corresponding to the chain letter in the input pdb files. E.g. A10-25 pertains to residues ('A',10),('A',11)...('A',25) in the corresponding input pdb
  • Anything not prefixed by a letter indicates protein to be built. This can be input as a length range. These length ranges are randomly sampled each iteration of RFdiffusion inference.
  • To specify chain breaks, we use /0 .

In more detail, if we want to scaffold a motif, the input is just like RFjoint Inpainting, except for having to navigate the hydra config input. If we want to scaffold residues 10-25 on chain A of a pdb, this would be done with 'contigmap.contigs=[5-15/A10-25/30-40]'. This asks RFdiffusion to build 5-15 residues (randomly sampled at each inference cycle) N-terminally of A10-25 from the input pdb, followed by 30-40 residues (again, randomly sampled) at its C-terminus. If we wanted to ensure the total length was always e.g. 55 residues, this could be specified with contigmap.length=55-55. Obviously, you also need to provide a path to your pdb file: inference.input_pdb=path/to/file.pdb. It doesn't matter if your input pdb has residues you don't want to scaffold - the contig map defines which residues in the pdb are actually used as the "motif". In other words, even if your pdb file has a B chain, and other residues on the A chain, only A10-25 will be provided to RFdiffusion.
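To make the contig grammar concrete, here is a toy sketch of how a contig string expands into segment lengths, with build ranges resampled on each call. This is a hypothetical helper for illustration, not the repository's actual contig mapper:

```python
import random

def sample_contig(contig, length=None):
    """Toy contig expansion: letter-prefixed segments are motifs of fixed
    size; bare ranges are regions to build, sampled uniformly."""
    sizes = []
    for seg in contig.split("/"):
        if seg[0].isalpha():  # motif segment, e.g. "A10-25"
            start, end = seg[1:].split("-")
            sizes.append(int(end) - int(start) + 1)  # inclusive range
        else:  # region to build, e.g. "5-15"
            lo, hi = (int(x) for x in seg.split("-"))
            sizes.append(random.randint(lo, hi))
    total = sum(sizes)
    if length is not None and total != length:
        return None  # caller would resample until the length constraint holds
    return sizes
```

Motif segments keep their fixed size (A10-25 is always 16 residues), so repeated runs vary only in the sampled connecting regions.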

To specify that we want to inpaint in the presence of a separate chain, this can be done as follows:

'contigmap.contigs=[5-15/A10-25/30-40/0 B1-100]'

Look at this carefully. /0 is the indicator that we want a chain break. NOTE, the space is important here. This tells the diffusion model to add a big residue jump (200aa) to the input, so that the model sees the first chain as being on a separate chain to the second.
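As a toy illustration of that index jump (the helper below is hypothetical, not the repository's internals), the residue indices the model sees for two chains might be computed like this:

```python
def model_residue_indices(chain_lengths, jump=200):
    """Assign sequence indices to each chain, inserting a big (200-residue)
    index jump at every chain break so chains look far apart in sequence."""
    idx, pos = [], 1
    for i, n in enumerate(chain_lengths):
        if i > 0:
            pos += jump  # the '/0 ' chain break
        idx.append(list(range(pos, pos + n)))
        pos += n
    return idx
```

For example, with a sampled 60-residue chain A and a 100-residue chain B, chain B's first residue would sit at index 261 rather than 61.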

An example of motif scaffolding can be found in ./examples/design_motifscaffolding.sh.

The "active site" model holds very small motifs in place

In the RFdiffusion preprint we noted that for very small motifs, RFdiffusion has the tendency to not keep them perfectly fixed in the output. Therefore, for scaffolding minimalist sites such as enzyme active sites, we fine-tuned RFdiffusion on examples similar to these tasks, allowing it to hold smaller motifs in place better and to generate in silico successes at a higher rate. If your input functional motif is very small, we recommend using this model, which can easily be specified using the following syntax: inference.ckpt_override_path=models/ActiveSite_ckpt.pt

The inpaint_seq flag

For those familiar with RFjoint Inpainting, the contigmap.inpaint_seq input is equivalent. The idea is that often, when, for example, fusing two proteins, residues that were on the surface of a protein (and are therefore likely polar) now need to be packed into the 'core' of the protein. We therefore want them to become hydrophobic residues. What we can do, rather than directly mutating them to hydrophobics, is mask their sequence identity and allow RFdiffusion to implicitly reason over their sequence, and better pack against them. This requires a different model than the 'base' diffusion model, one that has been trained to understand this paradigm, but this is automatically handled by the inference script (you don't need to do anything).

To specify amino acids whose sequence should be hidden, use the following syntax:

'contigmap.inpaint_seq=[A1/A30-40]'

Here, we're masking the residue identity of residue A1, and all residues between A30 and A40 (inclusive).
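As a sketch of this syntax (a hypothetical helper, not the actual RFdiffusion parser), the spec expands into (chain, residue) pairs, with ranges inclusive at both ends:

```python
def parse_residue_ranges(spec):
    """Expand 'A1/A30-40'-style specs into (chain, resnum) pairs.
    Ranges like A30-40 are inclusive at both ends."""
    out = []
    for seg in spec.split("/"):
        chain, rest = seg[0], seg[1:]
        if "-" in rest:
            lo, hi = (int(x) for x in rest.split("-"))
        else:
            lo = hi = int(rest)
        out.extend((chain, i) for i in range(lo, hi + 1))
    return out
```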

An example of executing motif scaffolding with the contigmap.inpaint_seq flag is located in ./examples/design_motifscaffolding_inpaintseq.sh

A note on diffuser.T

RFdiffusion was originally trained with 200 discrete timesteps. However, recent improvements have allowed us to reduce the number of timesteps we need to use at inference time. In many cases, running with as few as approximately 20 steps provides outputs of equivalent in silico quality to running with 200 steps (providing a 10X speedup). The default is now set to 50 steps. Note that this is important for understanding partial diffusion, described below.


Partial diffusion

Something we can do with diffusion is to partially noise and de-noise a structure, to get some diversity around a general fold. This can work really nicely (see Vazquez-Torres et al., BioRxiv 2022). This is specified using the diffuser.partial_T input, which sets the timestep to noise to.


More noise == more diversity. In Vazquez-Torres et al., 2022, we typically used `diffuser.partial_T` of approximately 80, but this was with respect to the 200 timesteps we were using. Now that the default `diffuser.T` is 50, you will need to adjust diffuser.partial_T accordingly. E.g. now that `diffuser.T=50`, the equivalent of 80 noising steps is `diffuser.partial_T=20`. We strongly recommend sampling different values for `partial_T` however, to find the best parameters for your specific problem.
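The conversion just keeps the noised fraction of the trajectory constant; a one-line sketch (hypothetical helper, not part of RFdiffusion):

```python
def convert_partial_T(partial_T_old, T_old, T_new):
    """Rescale partial_T when diffuser.T changes, so that the same
    fraction of the trajectory is noised."""
    return round(partial_T_old * T_new / T_old)

# 80 of 200 noising steps is the same fraction as 20 of 50
equivalent = convert_partial_T(80, T_old=200, T_new=50)  # -> 20
```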

When doing partial diffusion, because we are now diffusing from a known structure, this creates certain constraints. You can still use the contig input, but this has to yield a contig string exactly the same length as the input protein. E.g. if you have a binder:target complex, and you want to diversify the binder (length 100, chain A), you would need to input something like this:

'contigmap.contigs=[100-100/0 B1-150]' diffuser.partial_T=20

The reason for this is that, if your input protein was only 80 amino acids, but you've specified a desired length of 100, we don't know where to diffuse those extra 20 amino acids from, and hence, they will not lie in the distribution that RFdiffusion has learned to denoise from.

An example of partial diffusion can be found in ./examples/design_partialdiffusion.sh!

You can also keep parts of the sequence of the diffused chain fixed, if you want. An example of why you might want to do this is in the context of helical peptide binding. If you've threaded a helical peptide sequence onto an ideal helix, and now want to diversify the complex, allowing the helix to be predicted now not as an ideal helix, you might do something like:

'contigmap.contigs=[100-100/0 20-20]' 'contigmap.provide_seq=[100-119]' diffuser.partial_T=10

In this case, the 20aa chain is the helical peptide. The contigmap.provide_seq input is zero-indexed, and you can provide a range (so 100-119 is an inclusive range, unmasking the whole sequence of the peptide). Multiple sequence ranges can be provided separated by a comma, e.g. 'contigmap.provide_seq=[172-177,200-205]'.
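A minimal sketch of the zero-indexed, inclusive range convention (hypothetical helper, not RFdiffusion's parser):

```python
def provide_seq_indices(ranges):
    """Expand zero-indexed, inclusive '100-119'-style ranges into a
    sorted list of positions whose sequence is unmasked."""
    idx = set()
    for r in ranges:
        lo, hi = (int(x) for x in r.split("-"))
        idx.update(range(lo, hi + 1))
    return sorted(idx)
```

For the example above (a 100aa binder at indices 0-99 followed by a 20aa peptide), 100-119 covers exactly the 20 peptide positions.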

Note that the provide_seq option requires using a different model checkpoint, but this is automatically handled by the inference script.

An example of partial diffusion with providing sequence in diffused regions can be found in ./examples/design_partialdiffusion_withseq.sh. The same example specifying multiple sequence ranges can be found in ./examples/design_partialdiffusion_multipleseq.sh.


Binder Design

Hopefully, it's now obvious how you might make a binder with diffusion! Indeed, RFdiffusion shows excellent in silico and experimental ability to design de novo binders.


If chain B is your target, then you could do it like this:

./scripts/run_inference.py 'contigmap.contigs=[B1-100/0 100-100]' inference.output_prefix=test_outputs/binder_test inference.num_designs=10

This will generate 100 residue long binders to residues 1-100 of chain B.

However, this probably isn't the best way of making binders. Because diffusion is somewhat computationally intensive, we need to try to make it as fast as possible. Providing the whole of your target, uncropped, is going to make diffusion very slow if your target is big (and most targets of interest, such as cell-surface receptors, tend to be very big). One tried-and-true method to speed up binder design is to crop the target protein around the desired interface location. BUT! This creates a problem: if you crop your target and potentially expose hydrophobic core residues which were buried before the crop, how can you guarantee the binder will go to the intended interface site on the surface of the target, and not to the tantalizing hydrophobic patch you have just artificially created?

We solve this issue by providing the model with what we call "hotspot residues". The complex models referred to earlier in this README have all been trained with hotspot residues: in this training regime, for each example, the model is told (some of) the residues on the target protein which contact the binder (i.e., residues that are part of the interface). The model readily learns that it should be making an interface which involves these hotspot residues. At inference time, then, we can provide our own hotspot residues to define a region which the binder must contact. These are specified like this: 'ppi.hotspot_res=[A30,A33,A34]', where A is the chain ID in the input pdb file of the hotspot residue and the number is the residue index in the input pdb file of the hotspot residue.
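The hotspot syntax is simple to parse; a toy sketch (hypothetical helper, not the actual inference code):

```python
def parse_hotspots(spec):
    """'[A30,A33,A34]' -> [('A', 30), ('A', 33), ('A', 34)]
    Chain ID first, then the residue index from the input pdb."""
    items = spec.strip("[]").split(",")
    return [(s[0], int(s[1:])) for s in items]
```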

Finally, it has been observed that the default RFdiffusion model often generates mostly helical binders. These have high computational and experimental success rates. However, there may be cases where other kinds of topologies may be desired. For this, we include a "beta" model, which generates a greater diversity of topologies, but has not been extensively experimentally validated. Try this at your own risk:

inference.ckpt_override_path=models/Complex_beta_ckpt.pt

An example of binder design with RFdiffusion can be found in ./examples/design_ppi.sh.


Practical Considerations for Binder Design

RFdiffusion is an extremely powerful binder design tool but it is not magic. In this section we will walk through some common pitfalls in RFdiffusion binder design and offer advice on how to get the most out of this method.

Selecting a Target Site

Not every site on a target protein is a good candidate for binder design. For a site to be an attractive candidate for binding it should have >~3 hydrophobic residues for the binder to interact with. Binding to charged polar sites is still quite hard. Binding to sites with glycans close to them is also hard since they often become ordered upon binding and you will take an energetic hit for that. Historically, binder design has also avoided unstructured loops, it is not clear if this is still a requirement as RFdiffusion has been used to bind unstructured peptides which share a lot in common with unstructured loops.

Truncating your Target Protein

RFdiffusion scales in runtime as O(N^2) where N is the number of residues in your system. As such, it is a very good idea to truncate large targets so that your computations are not unnecessarily expensive. RFdiffusion and all downstream steps (including AF2) are designed to allow for a truncated target. Truncating a target is an art. For some targets, such as multidomain extracellular membranes, a natural truncation point is where two domains are joined by a flexible linker. For other proteins, such as virus spike proteins, this truncation point is less obvious. Generally you want to preserve secondary structure and introduce as few chain breaks as possible. You should also try to leave ~10A of target protein on each side of your intended target site. We recommend using PyMol to truncate your target protein.
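The O(N^2) scaling means truncation pays off quadratically; a back-of-the-envelope sketch (hypothetical helper):

```python
def relative_cost(n_residues, n_baseline):
    """Rough relative runtime under O(N^2) scaling in system size."""
    return (n_residues / n_baseline) ** 2

# e.g. truncating a 1000-residue target to 400 residues (each plus a
# 100aa binder) drops per-design cost to (500/1100)^2, roughly 5x faster
speedup = 1 / relative_cost(500, 1100)
```

E.g. halving the system size cuts the per-design cost to roughly a quarter.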

Picking Hotspots

Hotspots are a feature that we integrated into the model to allow control over the site on the target which the binder will interact with. In the paper we define a hotspot as a residue on the target protein which is within 10A Cbeta distance of the binder. Of all of the hotspots which are identified on the target, only 0-20% are actually provided to the model during training; the rest are masked. This is important for understanding how you should pick hotspots at inference time: the model is expecting to have to make more contacts than you specify. We normally recommend between 3-6 hotspots; you should run a few pilot runs before generating thousands of designs, to make sure the number of hotspots you are providing gives results you like.
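A toy mimic of that training-time masking (hypothetical helper; the real sampling lives in the training code, not in this inference pipeline):

```python
import random

def subsample_hotspots(all_hotspots, frac_range=(0.0, 0.2)):
    """Show the model only 0-20% of the true interface residues,
    masking the rest - the regime described above."""
    frac = random.uniform(*frac_range)
    k = round(frac * len(all_hotspots))
    return random.sample(all_hotspots, k)
```

This is why, at inference, the model treats your 3-6 hotspots as a subset of a larger interface rather than the complete contact list.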

If you have run the previous PatchDock/RifDock binder design pipeline, note that for the RFdiffusion paper we chose our hotspots to be the PatchDock residues of the target.

Binder Design Scale

In the paper, we generated ~10,000 RFdiffusion binder backbones for each target. From this set of backbones we then generated two sequences per backbone using ProteinMPNN-FastRelax (described below). We screened these ~20,000 designs using AF2 with initial guess and target templating (also described below).

Given the high success rates we observed in the paper, for some targets it may be sufficient to only generate ~1,000 RFdiffusion backbones in a campaign. What you want is to get enough designs that pass pAE_interaction < 10 (described more in Binder Design Filtering section) such that you are able to fill a DNA order with these successful designs. We have found that designs that do not pass pAE_interaction < 10 are not worth ordering since they will likely not work experimentally.

Sequence Design for Binders

You may have noticed that the binders designed by RFdiffusion come out with a poly-Glycine sequence. This is not a bug. RFdiffusion is a backbone-generation model and does not generate sequence for the designed region, therefore, another method must be used to assign a sequence to the binders. In the paper we use the ProteinMPNN-FastRelax protocol to do sequence design. We recommend that you do this as well. The code for this protocol can be found in this GitHub repo. While we did not find the FastRelax part of the protocol to yield the large in silico success rate improvements that it yielded with the RifDock-generated docks, it is still a good way to increase your number of shots-on-goal for each (computationally expensive) RFdiffusion backbone. If you would prefer to simply run ProteinMPNN on your binders without the FastRelax step, that will work fine but will be more computationally expensive.

Binder Design Filtering

One of the most important parts of the binder design pipeline is a filtering step to evaluate if your binders are actually predicted to work. In the paper we filtered using AF2 with an initial guess and target templating, scripts for this protocol are available here. We have found that filtering at pae_interaction < 10 is a good predictor of a binder working experimentally.
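A minimal sketch of that filtering step (hypothetical helper and scores; in practice pae_interaction comes from the AF2 initial-guess scripts linked above):

```python
def passes_filter(scores, pae_cutoff=10.0):
    """Keep only designs predicted to bind: pae_interaction < 10."""
    return [name for name, pae in scores.items() if pae < pae_cutoff]

# hypothetical per-design scores
scores = {"design_0": 7.2, "design_1": 24.9, "design_2": 9.8}
keep = passes_filter(scores)  # -> ["design_0", "design_2"]
```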


Fold Conditioning

Something that works really well is conditioning binder design (or monomer generation) on particular topologies. This is achieved by providing (partial) secondary structure and block adjacency information (to a model that has been trained to condition on this).


We are still working out the best way to generate this input at inference time, but for now, we have settled on generating inputs directly from pdb structures. This permits 'low resolution' specification of output topology (i.e., "I want a TIM barrel, but I don't care precisely where the residues are"). In `helper_scripts/`, there's a script called `make_secstruc_adj.py`, which can be used as follows:

e.g. 1:

./make_secstruc_adj.py --input_pdb ./2KL8.pdb --out_dir /my/dir/for/adj_secstruct

or e.g. 2:

./make_secstruc_adj.py --pdb_dir ./pdbs/ --out_dir /my/dir/for/adj_secstruct

This will process either a single pdb, or a folder of pdbs, and output a secondary structure and adjacency pytorch file, ready to go into the model. For now (although this might not be necessary), you should also generate these files for the target protein (if you're doing PPI), and provide this to the model. You can then use these at inference as follows:

./scripts/run_inference.py inference.output_prefix=./scaffold_conditioned_test/test scaffoldguided.scaffoldguided=True scaffoldguided.target_pdb=False scaffoldguided.scaffold_dir=./examples/ppi_scaffolds_subset

A few extra things:

  1. As mentioned above, for PPI, you will want to provide a target protein, along with its secondary structure and block adjacency. This can be done by adding:
scaffoldguided.target_pdb=True scaffoldguided.target_path=input_pdbs/insulin_target.pdb inference.output_prefix=insulin_binder/jordi_ss_insulin_noise0_job0 'ppi.hotspot_res=[A59,A83,A91]' scaffoldguided.target_ss=target_folds/insulin_target_ss.pt scaffoldguided.target_adj=target_folds/insulin_target_adj.pt

To generate these block adjacency and secondary structure inputs, you can use the helper script.

This will now generate 3-helix bundles to the insulin target.

For ppi, it's probably also worth adding this flag:

scaffoldguided.mask_loops=False

This is quite important to understand. During training, we mask some of the secondary structure and block adjacency. This is convenient, because it allows us, at inference, to easily add extra residues without having to specify precise secondary structure for every residue. E.g. if you want to make a long 3-helix bundle, you could mask the loops, and add e.g. 20 more 'mask' tokens to that loop. The model will then (presumably) choose to make e.g. 15 of these residues into helices (to extend the 3HB), and then make a 5aa loop. But, you didn't have to specify that, which is nice. The way this would be done is like this:

scaffoldguided.mask_loops=True scaffoldguided.sampled_insertion=15 scaffoldguided.sampled_N=5 scaffoldguided.sampled_C=5

This will, at each run of inference, sample up to 15 residues to insert into loops in your 3HB input, and up to 5 additional residues at N and C terminus. This strategy is very useful if you don't have a large set of pdbs to make block adjacencies for. For example, we showed that we could generate loads of lengthened TIM barrels from a single starting pdb with this strategy. However, for PPI, if you're using the provided scaffold sets, it shouldn't be necessary (because there are so many scaffolds to start from, generating extra diversity isn't especially necessary).

Finally, if you have a big directory of block adjacency/secondary structure files, but don't want to use all of them, you can make a .txt file of the ones you want to use, and pass:

scaffoldguided.scaffold_list=path/to/list

For PPI, we've consistently seen that reducing the noise added at inference improves designs. This comes at the expense of diversity, but, given that the scaffold sets are huge, this probably doesn't matter too much. We therefore recommend lowering the noise. 0.5 is probably a good compromise:

denoiser.noise_scale_ca=0.5 denoiser.noise_scale_frame=0.5

This just scales the amount of noise we add to the translations (noise_scale_ca) and rotations (noise_scale_frame) by, in this case, 0.5.

An additional example of PPI with fold conditioning is available here: ./examples/design_ppi_scaffolded.sh


Generation of Symmetric Oligomers

We're going to switch gears from discussing PPI and look at another task at which RFdiffusion performs well: symmetric oligomer design. This is done by symmetrising the noise we sample at t=T, and symmetrising the input at every timestep. We have currently implemented the following for use (with others coming soon!):

  • Cyclic symmetry
  • Dihedral symmetry
  • Tetrahedral symmetry
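To illustrate the idea of symmetrising coordinates, here is a 2D toy for cyclic symmetry (not the repository's implementation, which operates on full 3D residue frames):

```python
import math

def cyclic_copies(asym_xy, n):
    """Replicate 2D coordinates of an asymmetric unit about the origin,
    producing n-fold cyclic (Cn) symmetry - the same idea used to
    symmetrise the initial noise and the input at every timestep."""
    copies = []
    for k in range(n):
        a = 2 * math.pi * k / n
        c, s = math.cos(a), math.sin(a)
        copies.append([(c * x - s * y, s * x + c * y) for x, y in asym_xy])
    return copies
```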


Here's an example:

./scripts/run_inference.py --config-name symmetry  inference.symmetry=tetrahedral 'contigmap.contigs=[360]' inference.output_prefix=test_sample/tetrahedral inference.num_designs=1

Here, we've specified a different config file (with --config-name symmetry). Because symmetric diffusion is quite different from the diffusion described above, we packaged a whole load of symmetry-related configs into a new file (see configs/inference/symmetry.yml). Using this config file now puts diffusion in symmetry-mode.

The symmetry type is then specified with inference.symmetry=. Here, we're specifying tetrahedral symmetry, but you could also choose cyclic (e.g. c4) or dihedral (e.g. d2).

The contigmap.contigs length refers to the total length of your oligomer. Therefore, it must be divisible by the number of chains.
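A quick sanity check of that divisibility constraint (hypothetical helper; the chain counts mentioned in the comment, e.g. 12 chains for tetrahedral symmetry, are assumptions for illustration):

```python
def chain_length(total_length, n_chains):
    """The contig length is the total oligomer length, so it must
    divide evenly across the symmetry-related chains."""
    if total_length % n_chains:
        raise ValueError("total length must be divisible by the number of chains")
    return total_length // n_chains

# e.g. assuming tetrahedral symmetry uses 12 chains, a 360aa contig
# gives 30aa per chain
per_chain = chain_length(360, 12)  # -> 30
```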

More examples of designing oligomers can be found here: ./examples/design_cyclic_oligos.sh, ./examples/design_dihedral_oligos.sh, ./examples/design_tetrahedral_oligos.sh.


Using Auxiliary Potentials

Performing diffusion with symmetrized noise may give you the idea that we could use other external interventions during the denoising process to guide diffusion. One such intervention that we have implemented is auxiliary potentials. Auxiliary potentials can be very useful for guiding the inference process. E.g. whereas in RFjoint inpainting, we have little/no control over the final shape of an output, in diffusion we can readily force the network to make, for example, a well-packed protein. This is achieved in the updates we make at each step.

Let's go a little deeper into how the diffusion process works. At timestep T (the first step of the reverse-diffusion inference process), we sample noise from a known prior distribution. The model then makes a prediction of what the final structure should be, and we use these two states (noise at time T, prediction of the structure at time 0) to back-calculate where t=T-1 would have been. We therefore have a vector pointing from each coordinate at time T to its corresponding, back-calculated position at time T-1. But, we want to be able to bias this update, to push the trajectory towards some desired state. This can be done by biasing that vector with another vector, which points towards a position where that residue would reduce the 'loss' as defined by your potential. E.g. take the monomer_ROG potential, which seeks to minimise the radius of gyration of the final protein: if the model's prediction at t=0 is very elongated, each of those distant residues will have a larger gradient when we differentiate the monomer_ROG potential w.r.t. their positions. These gradients, along with the corresponding scale, can be combined into a vector, which is then combined with the original update vector to make a "biased update" at that timestep.
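The radius-of-gyration example has a clean closed form: for the squared radius of gyration, the gradient w.r.t. each coordinate is proportional to its distance from the centre of mass, so elongated predictions get the biggest corrective pull at their extremities. A 1D sketch (hypothetical helper, not the repository's monomer_ROG potential):

```python
def rog_sq_grad(coords):
    """Gradient of the squared radius of gyration w.r.t. each 1D
    coordinate: d(Rg^2)/dx_i = (2/N) * (x_i - mean). Residues far from
    the centre get proportionally larger gradients, which is what pulls
    an elongated prediction towards a compact one."""
    n = len(coords)
    mean = sum(coords) / n
    return [2.0 / n * (x - mean) for x in coords]
```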

The exact parameters used when applying these potentials matter. If you weight them too strongly, you're not going to end up with a good protein. Too weak, and they'll have little effect. We've explored these potentials in a few different scenarios, and have set sensible defaults, if you want to use them. But, if you feel like they're too weak/strong, or you just fancy exploring, do play with the parameters (in the potentials part of the config file).

Potentials are specified as a list of strings, with each string corresponding to a potential. The argument for potentials is potentials.guiding_potentials. Within each string, per-potential arguments may be specified with the following syntax: arg_name1:arg_value1,arg_name2:arg_value2,...,arg_nameN:arg_valueN. The only argument that is required for each potential is the name of the potential that you wish to apply; the name of this argument is type, as in the type of potential you wish to use. Some potentials, such as olig_contacts and substrate_contacts, take global options such as potentials.substrate; see config/inference/base.yml for all the global arguments associated with potentials. Additionally, it is useful to have the effect of the potential "decay" throughout the trajectory, such that at the beginning the potential is at 1x strength, and by the end it is much weaker. These decays (constant, linear, quadratic, cubic) can be set with the potentials.guide_decay argument.
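A sketch of what such a decay multiplier might look like (hypothetical helper; the exact schedules in the code may differ):

```python
def guide_decay(t, T, decay="constant"):
    """Multiplier on the potential's gradient at timestep t, where t
    counts down from T to 0: 1x strength at the start of the trajectory,
    decaying towards the end (except for 'constant')."""
    frac = t / T  # 1.0 at the first step, -> 0 as t -> 0
    powers = {"constant": 0, "linear": 1, "quadratic": 2, "cubic": 3}
    return frac ** powers[decay]
```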

Here's an example of how to specify a potential:

potentials.guiding_potentials=[\"type:olig_contacts,weight_intra:1,weight_inter:0.1\"] potentials.olig_intra_all=True potentials.olig_inter_all=True potentials.guide_scale=2 potentials.guide_decay='quadratic'
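The per-potential argument syntax above is easy to parse; a toy sketch (hypothetical helper, not the actual config machinery):

```python
def parse_potential(spec):
    """'type:olig_contacts,weight_intra:1,weight_inter:0.1' -> dict,
    converting numeric values to floats and leaving names as strings."""
    out = {}
    for pair in spec.split(","):
        k, v = pair.split(":")
        try:
            out[k] = float(v)
        except ValueError:
            out[k] = v
    return out
```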

We are still fully characterising how/when to use potentials, and we strongly recommend exploring different parameters yourself, as they are clearly somewhat case-dependent. So far, it is clear that they can be helpful for motif scaffolding and symmetric oligomer generation. However, they seem to interact weirdly with hotspot residues in PPI. We think we know why this is, and will work in the coming months to write better potentials for PPI. And please note, it is often good practice to start with no potentials as a baseline, then slowly increase their strength. For the oligomer contacts potentials, start with the ones provided in the examples, and note that the intra chain potential often should be higher than the inter chain potential.

We have already implemented several potentials, but it is relatively straightforward to add more if you want to push your designs towards some specified goal. The only requirement is that whatever potential you write must be differentiable. Take a look at potentials/potentials.py for examples of the potentials we have implemented so far.
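As a toy illustration of what "differentiable" means here (the real potentials in the repo operate on torch tensors so that autograd supplies the gradient), consider a compactness potential: the negative squared radius of gyration of the CA coordinates, which is a smooth function of the coordinates everywhere:

```python
def neg_rog_squared(coords):
    """Toy guiding potential: negative squared radius of gyration.

    coords is a list of (x, y, z) CA positions. Maximizing this value
    pushes the chain to be compact; because it is a smooth function of
    the coordinates, a gradient exists at every point. Illustrative
    sketch only, not one of RFdiffusion's actual potentials.
    """
    n = len(coords)
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    return -sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 + (p[2] - cz) ** 2
                for p in coords) / n
```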


Symmetric Motif Scaffolding

We can also combine symmetric diffusion with motif scaffolding to scaffold motifs symmetrically. Currently, we support one way of performing symmetric motif scaffolding: specifying the position of the motif with respect to the symmetry axes.

alt text

Special input .pdb and contigs requirements

For now, we require that a user have a symmetrized version of their motif in their input pdb for symmetric motif scaffolding. There are two main reasons for this. First, the model is trained by centering any motif at the origin, and the code therefore also centers motifs at the origin automatically. If your motif is not symmetrized, this centering will produce an asymmetric unit with the origin and axes of symmetry running right through it (bad). Second, the diffusion code uses a canonical set of symmetry axes (rotation matrices) to propagate the asymmetric unit of a motif. To prevent accidentally running diffusion trajectories that propagate your motif in ways you don't intend, we require that a user symmetrize the input using the RFdiffusion canonical symmetry axes.

RFdiffusion canonical symmetry axes

Group                        Axis
Cyclic                       Z
Dihedral (cyclic)            Z
Dihedral (flip/reflection)   X
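As a sketch of what symmetrizing about the canonical cyclic (Z) axis means, the following illustrative code (not the repo's implementation) propagates an asymmetric unit into n copies by rotating about Z:

```python
import math

def rotate_about_z(point, k, n):
    """Apply the k-th of n cyclic rotations about the Z axis to (x, y, z)."""
    theta = 2.0 * math.pi * k / n
    x, y, z = point
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta),
            z)

def symmetrize_cyclic(asym_unit, n):
    """Propagate an asymmetric unit (list of coords) into n copies about Z."""
    return [[rotate_about_z(p, k, n) for p in asym_unit] for k in range(n)]
```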

Example: Inputs for symmetric motif scaffolding with motif position specified w.r.t the symmetry axes.

This example script examples/design_nickel.sh can be used to scaffold the C4 symmetric Nickel binding domains shown in the RFdiffusion paper. It combines many concepts discussed earlier, including symmetric oligomer generation, motif scaffolding, and use of guiding potentials.

Note that the contigs should specify something that is precisely symmetric. Things will break if this is not the case.


A Note on Model Weights

Because of everything we want diffusion to be able to do, there is no One Model To Rule Them All. E.g., running with secondary structure conditioning requires a different model than running without it. Under the hood, we take care of most of this by default - we parse your input and work out the most appropriate checkpoint. This is where the config setup is really useful. The exact model checkpoint used at inference contains all of the parameters it was trained with, so we can populate the config file with those values such that inference runs as designed. If you do want to specify a different checkpoint (if, for example, we train a new model and you want to test it), you just have to make sure it's compatible with what you're doing. E.g. if you try to give secondary structure features to a model that wasn't trained with them, it'll crash.

Things you might want to play with at inference time

Occasionally, it might be good to try an alternative model (for example the active site model, or the beta binder model). These can be specified with inference.ckpt_override_path. We do not recommend using these outside of their described use cases, however, as there is no guarantee they will understand other kinds of inputs.

For a full list of things that are implemented at inference, see the config file (configs/inference/base.yml or configs/inference/symmetry.yml). Although you can modify everything, this is not recommended unless you know what you're doing. Generally, don't change the model, preprocess or diffuser configs. These pertain to how the model was trained, so it's unwise to change how you use the model at inference time. However, the parameters below are definitely worth exploring:

  • inference.final_step: This is when we stop the trajectory. We have seen that you can stop early, and the model is already making a good prediction of the final structure. This speeds up inference.
  • denoiser.noise_scale_ca and denoiser.noise_scale_frame: These can be used to reduce the noise used during sampling (as discussed for PPI above). The default is 1 (the same noise added at training), but this can be reduced to e.g. 0.5, or even 0. This actually improves the quality of models coming out of diffusion, but at the expense of diversity. If you're not getting any good outputs, or if your problem is very constrained, you could try reducing the noise. While these parameters can be changed independently (for translations and rotations), we recommend keeping them tied.
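Putting those knobs together, an illustrative invocation might look like the following (paths, contigs, and design counts are placeholders for your own run):

```shell
# Illustrative settings only: stop the trajectory early and halve the
# sampling noise, trading some diversity for model quality.
./scripts/run_inference.py \
    'contigmap.contigs=[100-100]' \
    inference.output_prefix=test_outputs/low_noise_test \
    inference.num_designs=10 \
    inference.final_step=30 \
    denoiser.noise_scale_ca=0.5 \
    denoiser.noise_scale_frame=0.5
```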

Understanding the output files

We output several different files.

  1. The .pdb file. This is the final prediction out of the model. Note that every designed residue is output as a glycine (as we only designed the backbone), and no sidechains are output. This is because, even though RFdiffusion conditions on sidechains in an input motif, there is no loss applied to these predictions, so they can't strictly be trusted.
  2. The .trb file. This contains useful metadata associated with that specific run, including the specific contig used (if length ranges were sampled), as well as the full config used by RFdiffusion. There are also a few other convenient items in this file:
    • details about mapping (i.e. how residues in the input map to residues in the output)
      • con_ref_pdb_idx/con_hal_pdb_idx - These are two arrays including the input pdb indices (in con_ref_pdb_idx), and where they are in the output pdb (in con_hal_pdb_idx). This only contains the chains where inpainting took place (i.e. not any fixed receptor/target chains)
      • con_ref_idx0/con_hal_idx0 - These are the same as above, but 0 indexed, and without chain information. This is useful for splicing coordinates out (to assess alignment etc).
      • inpaint_seq - This details any residues that were masked during inference.
  3. Trajectory files. By default, we output the full trajectories into the /traj/ folder. These files can be opened in pymol, as multi-step pdbs. Note that these are ordered in reverse, so the first pdb is technically the last (t=1) prediction made by RFdiffusion during inference. We include both the pX0 predictions (what the model predicted at each timestep) and the Xt-1 trajectories (what went into the model at each timestep).
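For instance, on the (hedged) assumption that the .trb file is a pickled Python dict with the keys described above, the motif mapping can be pulled out like this; key names may differ between RFdiffusion versions:

```python
import pickle

def motif_mapping(trb_path):
    """Return (input_pdb_positions, output_pdb_positions) for the motif.

    Assumes the .trb file is a pickled dict containing con_ref_pdb_idx
    and con_hal_pdb_idx (lists of (chain, residue_number) tuples), as
    described in the README; this is an assumption, not a guarantee.
    """
    with open(trb_path, "rb") as fh:
        trb = pickle.load(fh)
    return trb["con_ref_pdb_idx"], trb["con_hal_pdb_idx"]
```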

Docker

We have provided a Dockerfile at docker/Dockerfile to help run RFDiffusion on HPC and other container orchestration systems. Follow these steps to build and run the container on your system:

  1. Clone this repository with git clone https://github.com/RosettaCommons/RFdiffusion.git and then cd RFdiffusion
  2. Verify that the Docker daemon is running on your system with docker info. You can find Docker installation instructions for Mac, Windows, and Linux in the official Docker docs. You may also consider Finch, the open source client for container development.
  3. Build the container image on your system with docker build -f docker/Dockerfile -t rfdiffusion .
  4. Create some folders on your file system with mkdir $HOME/inputs $HOME/outputs $HOME/models
  5. Download the RFDiffusion models with bash scripts/download_models.sh $HOME/models
  6. Download a test file (or another of your choice) with wget -P $HOME/inputs https://files.rcsb.org/view/5TPN.pdb
  7. Run the container with the following command:
docker run -it --rm --gpus all \
  -v $HOME/models:$HOME/models \
  -v $HOME/inputs:$HOME/inputs \
  -v $HOME/outputs:$HOME/outputs \
  rfdiffusion \
  inference.output_prefix=$HOME/outputs/motifscaffolding \
  inference.model_directory_path=$HOME/models \
  inference.input_pdb=$HOME/inputs/5TPN.pdb \
  inference.num_designs=3 \
  'contigmap.contigs=[10-40/A163-181/10-40]'

This starts the rfdiffusion container, mounts the models, inputs, and outputs folders, passes all available GPUs, and then calls the run_inference.py script with the parameters specified.

Conclusion

We are extremely excited to share RFdiffusion with the wider scientific community. We expect to push updates as and when we make sizeable improvements in the coming months, so do stay tuned. We realize it may take some time to get used to executing RFdiffusion with perfect syntax (sometimes Hydra is hard), so please don't hesitate to create GitHub issues if you need help; we will respond as often as we can.

Now, let's go make some proteins. Have fun!

- Joe, David, Nate, Brian, Jason, and the RFdiffusion team.


RFdiffusion builds directly on the architecture and trained parameters of RoseTTAFold. We therefore thank Frank DiMaio and Minkyung Baek, who developed RoseTTAFold. RFdiffusion is released under an open source BSD License (see LICENSE file). It is free for both non-profit and for-profit use.

rfdiffusion's People

Contributors

brandonfrenz, brianloyal, davidcjuergens, decarboxy, isdementyev, jasonkyuyim, javierbq, joewatchwell, jsrimr, lyskov, nrbennet, nzrandol, rhsimplex, roccomoretti, tmsincomb


rfdiffusion's Issues

Repo examples only sampling glycine

Hey! Thanks for sharing this :)
I just had the following problem - maybe I'm doing something wrong on my end - but when trying both (i) RFdiffusion/examples/design_motifscaffolding.sh and (ii) RFdiffusion/examples/design_unconditional.sh, the randomly sampled ranges only contain Gs (e.g., for the motif example: "GGGGGGGGGGGGGGGGGGEVNKIKSALLSTNKAVVSLGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG")

Loading IGSO3 from a cache directory

Hello! I'm having a lot of fun playing around with this!

One feature that would be convenient to have would be a constant directory where all of the diffuser's pre-computed schedules live. Right now the inference code recomputes the schedule pkl if it doesn't exist in ./. To avoid recomputing whenever you're running in a different directory, a constant cache directory would ensure that the schedule is only computed once.

I've implemented a simple version of this by replacing line 116 in inference/model_runners.py with
self.diffuser = Diffuser(**self._conf.diffuser, cache_dir=f'{SCRIPT_DIR}/../schedules')
instead of
self.diffuser = Diffuser(**self._conf.diffuser).
This assumes that there is a schedules directory in the RFdiffusion folder where all pre-computed schedules will live. Only downside to this simple approach is that the cache directory couldn't be overwritten (but this is technically consistent with the model checkpoints).

Compatibility issues when creating conda env

This is really amazing!

Not sure what I am doing wrong, but conda told me that it found some conflicts when creating the environment. So I removed the version constraints and changed the channel from defaults to conda-forge, which solved my issue:

name: SE3nv
channels:
  - conda-forge
  - pytorch
  - dglteam
dependencies:
  - python
  - pytorch
  - torchaudio
  - torchvision
  - cudatoolkit
  - dgl-cuda11.1
  - pip
  - pip:
    - hydra-core
    - pyrsistent
    - icecream

I'm not sure either whether this is the way to go, but it seems to have worked for me! And all the examples I have tried so far are running smoothly.

Computer config:

  • Ubuntu 22.04.2 LTS, 32.0 GiB
  • AMD® Ryzen 9 5900x 12-core processor × 24
  • NVIDIA Corporation GA102 [GeForce RTX 3080]

... again amazing!

Colabfold implementation

I've been using RFdiffusion on colab pro for a few weeks now, using Google's GPUs. Realistically, we want to generate hundreds to thousands of designs and then filter through these. How is this supposed to be implemented when the max I can do is 32 designs in colab pro? I'm not sure how running RFdiffusion locally on my own GPU (RTX3060) would work out, any thoughts? I have never done any serious computing since I'm mostly at the bench, so technically I am very naive.

Thanks for RFdiffusion!

Why is there no example of designing a binder against a protein complex?

I got an error with parameters like this; I wonder if it's a compatibility problem because I'm trying to design a binder against more than one molecule? ['contigmap.contigs=[C/0 F/0 100-120]', 'ppi.hotspot_res=[C80,C82,C86,C87,C90,C92,C93,E128,C138,C185,C187]', 'denoiser.noise_scale_ca=0', 'denoiser.noise_scale_frame=0'
Traceback (most recent call last):
File "/Share/app/RFdiffusion/scripts/run_inference.py", line 194, in
main()
File "/Share/app/miniconda3.9/envs/SE3nv/lib/python3.9/site-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/Share/app/miniconda3.9/envs/SE3nv/lib/python3.9/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/Share/app/miniconda3.9/envs/SE3nv/lib/python3.9/site-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/Share/app/miniconda3.9/envs/SE3nv/lib/python3.9/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/Share/app/miniconda3.9/envs/SE3nv/lib/python3.9/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
File "/Share/app/miniconda3.9/envs/SE3nv/lib/python3.9/site-packages/hydra/_internal/utils.py", line 458, in
lambda: hydra.run(
File "/Share/app/miniconda3.9/envs/SE3nv/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/Share/app/miniconda3.9/envs/SE3nv/lib/python3.9/site-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/Share/app/miniconda3.9/envs/SE3nv/lib/python3.9/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "/Share/app/RFdiffusion/scripts/run_inference.py", line 84, in main
x_init, seq_init = sampler.sample_init()
File "/Share/app/RFdiffusion/rfdiffusion/inference/model_runners.py", line 278, in sample_init
self.contig_map = self.construct_contig(self.target_feats)
File "/Share/app/RFdiffusion/rfdiffusion/inference/model_runners.py", line 240, in construct_contig
return ContigMap(target_feats, **self.contig_conf)
File "/Share/app/RFdiffusion/rfdiffusion/contigs.py", line 78, in __init__
) = self.expand_sampled_mask()
File "/Share/app/RFdiffusion/rfdiffusion/contigs.py", line 225, in expand_sampled_mask
int(subcon.split("-")[0][1:]), int(subcon.split("-")[1]) + 1
ValueError: invalid literal for int() with base 10: ''

partial diffusion with a few residues fixed

How can I perform partial diffusion for all residues except 3 (residues of an active site)? I tried to fix them in 'contigs', but they were moving anyway. I also tried to use models/ActiveSite_ckpt.pt, but it ruined the fold.

The provided environment yaml for SE3nv installs the CPU version of pytorch when the installed CUDA version is not 11.1

A different driver version (in my case, Driver Version: 510.108.03, CUDA Version: 11.6) fails to install the CUDA build of pytorch, resulting in a runtime error when trying to run any of the examples.
This is probably because there is no cudatoolkit version 11.1 (which is required by the original SE3nv yaml) for this driver.

To solve it, I installed cudatoolkit 11.6 and pytorch 1.12.1. I have attached an export of my environment in case someone else encounters this problem:

environment.yaml.gz

Design C3-symmetric oligomers to bind the SARS-CoV-2 Spike protein.

Dear developers, I am repeating the analysis of "Design of C3-symmetric oligomers to scaffold the binding interface of the designed ACE2 mimic", but I found that RFdiffusion may change the orientation of the motif protomer.

Data link: https://drive.google.com/drive/folders/19BZTqTx-uKEjVqGp06Ez2zufb7hgva-q?usp=share_link

The file 7uhc.pdb, which I have centered, can be accessed from this link. I used Chimera to confirm that it is C3-symmetric along the z axis:

open 7uhc.pdb
delete #0:.B #0:.C #0:.E #0:.F
sym #0 group C3 axis z
open 7uhc.pdb

image

Then I used RFDiffusion to design the C3-symmetric oligomers with following command:

run_inference.py \
    inference.symmetry=C3 \
    inference.num_designs=1 \
    inference.output_prefix=Spike_Symmetric_PPI/1_structure_design/Spike_Symmetric_PPI_1.0_0.1_Base \
    'potentials.guiding_potentials=["type:olig_contacts,weight_intra:1.0,weight_inter:0.1"]' \
    potentials.olig_intra_all=True \
    potentials.olig_inter_all=True \
    potentials.guide_scale=2 \
    diffuser.T=50 \
    potentials.guide_decay=quadratic \
    inference.input_pdb=7uhc.pdb \
    'contigmap.contigs=[D1-55/120/0 E1-55/120/0 F1-55/120/0]' \
    inference.ckpt_override_path=models/Base_ckpt.pt

The output file can be accessed from this link. And I open 7uhc.pdb and Spike_Symmetric_PPI_1.0_0.1_Base_0.pdb with Chimera:

open 7uhc.pdb
open Spike_Symmetric_PPI_1.0_0.1_Base_0.pdb
delete #0:.A #0:.B #0:.C
display @CA
~ribbon

image

And I align model #1 to model #0:

 mm #0 #1

image

Only one motif can be perfectly matched.

ModuleNotFoundError: No module named 'rfdiffusion'

Thanks so much to the RosettaCommons team for open-sourcing RFdiffusion! I'm looking forward to getting started.

Unfortunately, something appears to have gone wrong with the folder organization of my installation. I'm trying to execute the first example in the README, generating unconstrained backbones with 150 residues. Here is my attempt to execute, and my output:

(SE3nv) john@john-Desktop:~/RFdiffusion$ ./scripts/run_inference.py 'contigmap.contigs=[150-150]' inference.output_prefix=test_outputs/test inference.num_designs=10
Traceback (most recent call last):
  File "/home/john/RFdiffusion/./scripts/run_inference.py", line 24, in <module>
    from rfdiffusion.util import writepdb_multi, writepdb
ModuleNotFoundError: No module named 'rfdiffusion'

So run_inference.py is being found -- but it's looking for the util.py file somewhere other than where it is, which is /home/john/RFdiffusion/rfdiffusion.

I know how to reorganize Python folders and/or to provide a folder link, to hack my way around this problem, but I'm concerned that I would cause other problems by doing so.

Please advise, thanks.

How to fix part of sequence in binder design or provide hotspot in partial denoising?

Hello authors,

Thank you for providing this fantastic code for protein design! I was wondering if there is a smart way to incorporate some sequence information in the binder design process, or to add a hotspot potential in the partial-denoising process. I am trying to optimize a binder. If I use the partial-denoising pipeline, I cannot define hotspots; if I use the binder design pipeline, I cannot fix the part of the sequence of interest. I assume there should be a way to combine both features, since they are simply two different potential functions. Could you let me know if that's possible, or give me some clue for implementing this feature? Thank you very much!

Best,
Shuhao

Binder Design for Small Ligands

Is there any prediction for when RF diffusion will be extended to allow for design of ligand-protein interactions?

Really awesome program so far!

Thanks!

NVTX missing using SE3nv.yml -- Pytorch 2.0 solution

Device

OS: CentOS Linux 7
GPU: gtx 1080

Issue

Hi! I get the following error running any of the examples scripts

RuntimeError: NVTX functions not installed. Are you sure you have a CUDA build?

When using the current SE3nv.yml I get the following versions

pytorch                   1.9.1           cpu_py39hc5866cc_3    conda-forge
torchaudio                0.9.1           py39                  pytorch
torchvision               0.14.1          cpu_py39h39206e8_1    conda-forge

Solution

I did a clean install running pip3 install --force-reinstall torch torchvision torchaudio

torch                     2.0.0                    pypi_0    pypi
torchaudio                2.0.1                    pypi_0    pypi
torchvision               0.15.1                   pypi_0    pypi

That seems to run every example without an issue. I've run into issues before with conda installs of pytorch when not using the most recent version. Is there a known issue keeping RFdiffusion from moving to pytorch 2.0?

OSError: [Errno 30] Read-only file system: 'outputs'

After having built RFdiffusion's docker container and pulled it to our HPC, I have run the following test by using singularity:

singularity run --env TF_FORCE_UNIFIED_MEMORY=1,XLA_PYTHON_CLIENT_MEM_FRACTION=4.0,OPENMM_CPU_THREADS=10,HYDRA_FULL_ERROR=1 \
        -B $HOME/outputs,$HOME/models,$HOME/inputs \
        --pwd /app/RFdiffusion \
        --nv $HOME/rfdiffusion/rfdiffusion_v1.1.0.sif \
        inference.output_prefix=$HOME/outputs/motifscaffolding \
        inference.model_directory_path=$HOME/models \
        inference.input_pdb=$HOME/inputs/5TPN.pdb \
        inference.num_designs=3 \
        'contigmap.contigs=[10-40/A163-181/10-40]'

However, I got errors as follows:

Traceback (most recent call last):
  File "/usr/lib/python3.9/pathlib.py", line 1313, in mkdir
    self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'outputs/2023-06-24/15-54-50'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.9/pathlib.py", line 1313, in mkdir
    self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'outputs/2023-06-24'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/RFdiffusion/scripts/run_inference.py", line 194, in <module>
    main()
  File "/usr/local/lib/python3.9/dist-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/usr/local/lib/python3.9/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/usr/local/lib/python3.9/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/usr/local/lib/python3.9/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/usr/local/lib/python3.9/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/usr/local/lib/python3.9/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.9/dist-packages/hydra/_internal/hydra.py", line 119, in run
    ret = run_job(
  File "/usr/local/lib/python3.9/dist-packages/hydra/core/utils.py", line 146, in run_job
    Path(str(output_dir)).mkdir(parents=True, exist_ok=True)
  File "/usr/lib/python3.9/pathlib.py", line 1317, in mkdir
    self.parent.mkdir(parents=True, exist_ok=True)
  File "/usr/lib/python3.9/pathlib.py", line 1317, in mkdir
    self.parent.mkdir(parents=True, exist_ok=True)
  File "/usr/lib/python3.9/pathlib.py", line 1313, in mkdir
    self._accessor.mkdir(self, mode)
OSError: [Errno 30] Read-only file system: 'outputs'

Can you please let me know what I did incorrectly?
Thanks a lot in advance.

Question about the modified RosettaFold and training

Dear Authors,

Thank you for your great work! I am writing to inquire if there are any plans to release the pretraining code for both the modified RosettaFold and RF diffusion on GitHub. As someone with a keen interest in this field, I am particularly curious about this aspect and would appreciate any information or updates you could provide.

Long inference time, GPU available but not being used

Thanks for sharing RFDiffusion.

I'm using it on a T4 GPU. It takes 30-60 seconds to run this:

RFdiffusion-main/scripts/run_inference.py 'contigmap.contigs=[10-20]' inference.output_prefix=test_outputs/test inference.num_designs=1

CUDA is also available.
How can I check whether it is using the GPU?

This is my dockerfile:

FROM nvidia/cuda:11.1.1-cudnn8-runtime-ubuntu20.04
ENV PATH="/root/miniconda3/bin:${PATH}"
ARG PATH="/root/miniconda3/bin:${PATH}"
RUN apt-get update

RUN apt-get install -y wget git && rm -rf /var/lib/apt/lists/*

RUN wget \
    https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh \
    && mkdir /root/.conda \
    && bash Miniconda3-latest-Linux-x86_64.sh -b \
    && rm -f Miniconda3-latest-Linux-x86_64.sh
RUN conda --version

COPY RFdiffusion-main RFdiffusion-main
RUN conda env create -f RFdiffusion-main/env/SE3nv.yml

RUN echo "conda activate SE3nv" >> ~/.bashrc
SHELL ["/bin/bash", "--login", "-c"]
SHELL ["conda", "run", "--no-capture-output", "-n", "SE3nv", "/bin/bash", "-c"]

RUN pip install --no-cache-dir -r RFdiffusion-main/env/SE3Transformer/requirements.txt
RUN python RFdiffusion-main/env/SE3Transformer/setup.py install

RUN pip install -e RFdiffusion-main 
COPY requirements.txt requirements.txt
RUN pip install --no-cache-dir -r requirements.txt
COPY entrypoint.sh entrypoint.sh

RUN chmod +x entrypoint.sh
RUN pip install se3-transformer-pytorch
RUN pip install -e RFdiffusion-main/env/SE3Transformer

RUN pip install --force-reinstall torch torchvision torchaudio

Any plan to release training code?

I'd like to finetune RFdiffusion to fit my data.
However, there seems no training code provided for now.

Is there any plan to release training code?

Thanks!

Running RFdiffusion on Intel macbook pro without Nvidia GPU

Hi all,

May I check if it is possible to run the code in this repo on a Intel Macbook without Nvidia GPU? I installed Pytorch but this error keeps coming up:
image

I installed Pytorch like this:
image

I searched Nvidia's website for CUDA toolkit 11.1, but it seems there isn't an option for Mac.

If it is possible, may I know how I can install the missing packages?

Greatly appreciate any help! Thank you!

RuntimeError: Index put requires the source and destination dtypes match, got Long for the destination and Int for the source.

When I am trying to run inference to yield an unconditional monomer as described in the README, I get the following error:

_[2023-04-01 20:22:08,689][main][INFO] - Making design test_outputs/test_0
[2023-04-01 20:22:08,692][inference.model_runners][INFO] - Using contig: ['150-150']
Error executing job with overrides: ['contigmap.contigs=[150-150]', 'inference.output_prefix=test_outputs/test', 'inference.num_designs=10']
Traceback (most recent call last):
File "C:\Users\Norb\RFdiffusion\run_inference.py", line 76, in main
x_init, seq_init = sampler.sample_init()
File "C:\Users\Norb\RFdiffusion\inference\model_runners.py", line 341, in sample_init
seq_t[contig_map.hal_idx0] = seq_orig[contig_map.ref_idx0]
RuntimeError: Index put requires the source and destination dtypes match, got Long for the destination and Int for the source.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
(SE3nv)_

Sorry in case I am missing something basic, I am an absolute beginner. Thank you so much in advance.

Problem Hydra arguments on windows conda

After several failed attempts to run the inference script on Windows, I found that using
'contigmap.contigs=[B1-100/0 100-100]'
or 'ppi.hotspot_res=[A30,A33,A34]' will not work.
One needs to use " and not ' around the arguments; otherwise they are not parsed correctly and an error is raised.
E.g. 'contigmap.contigs=[B1-100/0 100-100]' becomes "contigmap.contigs=[B1-100/0 100-100]"

Not sure if this is because the default language of my windows is not english

Multiple sequence ranges to `provide_seq`

Hi RFdiffusion team, thank you for this great project and taking the time to make it public and write comprehensive documentation!

I was wondering if it is possible to provide multiple sequence ranges to provide_seq when doing partial diffusion. For instance, the following does not throw an error:

'contigmap.provide_seq=[0-383,498-580,692-821]'

However only AAs 0-383 appear to be unmasked. Am I missing something?

Thank you!

Ability to redesign existing scaffolds to create binding interfaces in comparison to de novo binder design

Hello!

Thank you for publishing your work! I'm currently experimenting with your software, and I have some ideas in mind, but I am not sure to what extent they are practical and possible with RFdiffusion. As I am new to the protein design field, feel free to correct any assumptions I may have made; I would greatly appreciate any guidance.

I have a target protein and a highly stable scaffold protein; however, the scaffold protein was not designed to bind to this specific target, so no actual binding motif is present. Is it possible (and effective) to use the RFdiffusion protocol, such as motif scaffolding, to redesign some parts of the existing scaffold that face the target hotspot to create a binding interface? Or would it be more effective to use partial diffusion/fold conditioned binder design to create new, but structurally similar scaffolds? If I understand correctly, the latter approach will cause loss of the binder sequence, which might lead to a possible loss of stability compared to the original scaffold.
I apologize if I am missing something about presented pipelines.

Thank you!

Handling of non natural amino acids (Colab notebook version)

On the Colab notebook, I am using the mirror image of a naturally occurring protein (uploaded pdb file) as a template for binder generation. The generation of the poly-glycine backbone works well, but when this input is used for ProteinMPNN and AlphaFold evaluation, the outputs revert the stereochemistry of the input protein to the natural enantiomer. It's not clear whether this happens at the ProteinMPNN step or is a step taken by AlphaFold. Is there a way to discern whether ProteinMPNN is assigning side chains based on the mirror-image protein input (or whether ProteinMPNN reverts the stereochemistry)?

Length ranges in symmetry modes

Hi,

I've noticed that when using a symmetric mode (cyclic, dihedral, etc), it's not possible to supply a length range for the new diffused regions. A single value must be used, otherwise 'ValueError: Sequence length must be divisble by n' is returned.

I'm guessing it's because the lengths of the diffused regions in each chain/monomer aren't tied, so usually the total sequence length ends up indivisible by the oligomeric state?

Is this something that might be possible in future updates?

Thanks!

Ali

Symmetry mode NOT usable for e.g. inpainting a loop?

I've tried to use the cyclic symmetric mode to generate a set of symmetric loops between symmetric structural units (e.g. loops to join together helices in a helical bundle with cyclic symmetry).

However, while this kind of operation works to generate non-symmetric loops, when switching to use of symmetry it acts to always assume chain-breaks after the newly generated loops, such that they are never positioned so as to actually join the e.g. helices of the symmetric helical bundle.

Based on the text in the nickel design example, which says that chain breaks don't strictly need to be given in the contigs when using symmetry, it seems this might be a known limitation of the symmetry mode. Is this true? Is it possible to circumvent this limitation?

Help with error: no such file or directory /models/Base_ckpt.pt

I've just set this up in WSL2 with no issues during installation. I copied the unconditional design example to quickly test that it works, but I'm getting the error below. I'm not sure how to interpret it, so any help would be great!

~/RFdiffusion$ scripts/run_inference.py inference.output_prefix=example_outputs/design_unconditional 'contigmap.contigs=[100-200]' inference.num_designs=10
[2023-06-26 22:44:09,438][__main__][INFO] - Found GPU with device_name NVIDIA GeForce RTX 3060. Will run RFdiffusion on NVIDIA GeForce RTX 3060
Reading models from /home/usr/RFdiffusion/rfdiffusion/inference/../../models
[2023-06-26 22:44:09,439][rfdiffusion.inference.model_runners][INFO] - Reading checkpoint from /home/usr/RFdiffusion/rfdiffusion/inference/../../models/Base_ckpt.pt
This is inf_conf.ckpt_path
/home/usr/RFdiffusion/rfdiffusion/inference/../../models/Base_ckpt.pt
Error executing job with overrides: ['inference.output_prefix=example_outputs/design_unconditional', 'contigmap.contigs=[100-200]', 'inference.num_designs=10']
Traceback (most recent call last):
File "/home/usr/RFdiffusion/scripts/run_inference.py", line 54, in main
sampler = iu.sampler_selector(conf)
File "/home/usr/RFdiffusion/rfdiffusion/inference/utils.py", line 511, in sampler_selector
sampler = model_runners.SelfConditioning(conf)
File "/home/usr/RFdiffusion/rfdiffusion/inference/model_runners.py", line 37, in __init__
self.initialize(conf)
File "/home/usr/RFdiffusion/rfdiffusion/inference/model_runners.py", line 103, in initialize
self.load_checkpoint()
File "/home/usr/RFdiffusion/rfdiffusion/inference/model_runners.py", line 181, in load_checkpoint
self.ckpt = torch.load(
File "/home/usr/anaconda3/envs/SE3nv/lib/python3.9/site-packages/torch/serialization.py", line 594, in load
with _open_file_like(f, 'rb') as opened_file:
File "/home/usr/anaconda3/envs/SE3nv/lib/python3.9/site-packages/torch/serialization.py", line 230, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/home/usr/anaconda3/envs/SE3nv/lib/python3.9/site-packages/torch/serialization.py", line 211, in __init__
super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/home/usr/RFdiffusion/rfdiffusion/inference/../../models/Base_ckpt.pt'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
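
The path in the traceback is where RFdiffusion expects the downloaded weights (RFdiffusion/models/), so this error usually means the wget step from the installation instructions was skipped or saved the files elsewhere. A quick sanity check, using the checkpoint filenames from the README's download list:

```python
import os

# Checkpoint filenames from the README's wget commands.
EXPECTED = [
    "Base_ckpt.pt", "Complex_base_ckpt.pt", "Complex_Fold_base_ckpt.pt",
    "InpaintSeq_ckpt.pt", "InpaintSeq_Fold_ckpt.pt",
    "ActiveSite_ckpt.pt", "Base_epoch8_ckpt.pt",
]

def missing_checkpoints(models_dir):
    """Return the expected checkpoint files absent from models_dir."""
    return [f for f in EXPECTED
            if not os.path.isfile(os.path.join(models_dir, f))]

# e.g.: missing_checkpoints(os.path.expanduser("~/RFdiffusion/models"))
```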

Help: RFdiffusion EOFError: Ran out of input

Dear all, please help me with this error. Thank you very much.

./scripts/run_inference.py 'contigmap.contigs=[150-150]' inference.output_prefix=test_outputs/test inference.num_designs=10

[2023-06-04 19:49:19,789][__main__][INFO] - Found GPU with device_name NVIDIA GeForce RTX 3090. Will run RFdiffusion on NVIDIA GeForce RTX 3090
Reading models from /home/hesong/local/RFdiffusion/rfdiffusion/inference/../../models
[2023-06-04 19:49:19,790][rfdiffusion.inference.model_runners][INFO] - Reading checkpoint from /home/hesong/local/RFdiffusion/rfdiffusion/inference/../../models/Base_ckpt.pt
This is inf_conf.ckpt_path
/home/hesong/local/RFdiffusion/rfdiffusion/inference/../../models/Base_ckpt.pt
Assembling -model, -diffuser and -preprocess configs from checkpoint
USING MODEL CONFIG: self._conf[model][n_extra_block] = 4
USING MODEL CONFIG: self._conf[model][n_main_block] = 32
USING MODEL CONFIG: self._conf[model][n_ref_block] = 4
USING MODEL CONFIG: self._conf[model][d_msa] = 256
USING MODEL CONFIG: self._conf[model][d_msa_full] = 64
USING MODEL CONFIG: self._conf[model][d_pair] = 128
USING MODEL CONFIG: self._conf[model][d_templ] = 64
USING MODEL CONFIG: self._conf[model][n_head_msa] = 8
USING MODEL CONFIG: self._conf[model][n_head_pair] = 4
USING MODEL CONFIG: self._conf[model][n_head_templ] = 4
USING MODEL CONFIG: self._conf[model][d_hidden] = 32
USING MODEL CONFIG: self._conf[model][d_hidden_templ] = 32
USING MODEL CONFIG: self._conf[model][p_drop] = 0.15
USING MODEL CONFIG: self._conf[model][SE3_param_full] = {'num_layers': 1, 'num_channels': 32, 'num_degrees': 2, 'n_heads': 4, 'div': 4, 'l0_in_features': 8, 'l0_out_features': 8, 'l1_in_features': 3, 'l1_out_features': 2, 'num_edge_features': 32}
USING MODEL CONFIG: self._conf[model][SE3_param_topk] = {'num_layers': 1, 'num_channels': 32, 'num_degrees': 2, 'n_heads': 4, 'div': 4, 'l0_in_features': 64, 'l0_out_features': 64, 'l1_in_features': 3, 'l1_out_features': 2, 'num_edge_features': 64}
USING MODEL CONFIG: self._conf[model][freeze_track_motif] = False
USING MODEL CONFIG: self._conf[model][use_motif_timestep] = True
USING MODEL CONFIG: self._conf[diffuser][T] = 50
USING MODEL CONFIG: self._conf[diffuser][b_0] = 0.01
USING MODEL CONFIG: self._conf[diffuser][b_T] = 0.07
USING MODEL CONFIG: self._conf[diffuser][schedule_type] = linear
USING MODEL CONFIG: self._conf[diffuser][so3_type] = igso3
USING MODEL CONFIG: self._conf[diffuser][crd_scale] = 0.25
USING MODEL CONFIG: self._conf[diffuser][so3_schedule_type] = linear
USING MODEL CONFIG: self._conf[diffuser][min_b] = 1.5
USING MODEL CONFIG: self._conf[diffuser][max_b] = 2.5
USING MODEL CONFIG: self._conf[diffuser][min_sigma] = 0.02
USING MODEL CONFIG: self._conf[diffuser][max_sigma] = 1.5
USING MODEL CONFIG: self._conf[preprocess][sidechain_input] = False
USING MODEL CONFIG: self._conf[preprocess][motif_sidechain_input] = True
USING MODEL CONFIG: self._conf[preprocess][d_t1d] = 22
USING MODEL CONFIG: self._conf[preprocess][d_t2d] = 44
USING MODEL CONFIG: self._conf[preprocess][prob_self_cond] = 0.5
USING MODEL CONFIG: self._conf[preprocess][str_self_cond] = True
USING MODEL CONFIG: self._conf[preprocess][predict_previous] = False
[2023-06-04 19:49:22,270][rfdiffusion.inference.model_runners][INFO] - Loading checkpoint.
[2023-06-04 19:49:24,666][rfdiffusion.diffusion][INFO] - Using cached IGSO3.
Error executing job with overrides: ['contigmap.contigs=[150-150]', 'inference.output_prefix=test_outputs/test', 'inference.num_designs=10']
Traceback (most recent call last):
File "/home/hesong/local/RFdiffusion/./scripts/run_inference.py", line 54, in main
sampler = iu.sampler_selector(conf)
File "/home/hesong/local/RFdiffusion/rfdiffusion/inference/utils.py", line 511, in sampler_selector
sampler = model_runners.SelfConditioning(conf)
File "/home/hesong/local/RFdiffusion/rfdiffusion/inference/model_runners.py", line 37, in __init__
self.initialize(conf)
File "/home/hesong/local/RFdiffusion/rfdiffusion/inference/model_runners.py", line 130, in initialize
self.diffuser = Diffuser(**self._conf.diffuser, cache_dir=schedule_directory)
File "/home/hesong/local/RFdiffusion/rfdiffusion/diffusion.py", line 582, in __init__
self.so3_diffuser = IGSO3(
File "/home/hesong/local/RFdiffusion/rfdiffusion/diffusion.py", line 198, in __init__
self.igso3_vals = self._calc_igso3_vals(L=L)
File "/home/hesong/local/RFdiffusion/rfdiffusion/diffusion.py", line 233, in _calc_igso3_vals
igso3_vals = read_pkl(cache_fname)
File "/home/hesong/local/RFdiffusion/rfdiffusion/diffusion.py", line 144, in read_pkl
raise (e)
File "/home/hesong/local/RFdiffusion/rfdiffusion/diffusion.py", line 140, in read_pkl
return pickle.load(handle)
EOFError: Ran out of input

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
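
pickle.load raising 'EOFError: Ran out of input' almost always means the file is empty, here the cached IGSO3 schedule (note the 'Using cached IGSO3' log line just before the failure), e.g. because an earlier run was interrupted while writing it. Deleting the empty cache file lets RFdiffusion regenerate it; a hedged sketch (the cache directory location is an assumption, check the schedule directory used by your inference config):

```python
import os

def clear_empty_caches(cache_dir):
    """Delete zero-byte .pkl files so the IGSO3 schedule is recomputed.

    An empty pickle file is exactly what makes pickle.load raise
    'EOFError: Ran out of input'.
    """
    removed = []
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        if name.endswith(".pkl") and os.path.getsize(path) == 0:
            os.remove(path)
            removed.append(name)
    return removed
```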

[WSL2] nvrtc compilation failed

Trying to use the tool in WSL2 with my RTX 4090. The Windows version doesn't work (see issue #13).

Everything loads fine, but then I see an error:

(SE3nv) pavel@Gigabyte-PC:~/RFdiffusion$ python scripts/run_inference.py inference.model_directory_path=/mnt/d/Models/RFdiffusion 'contigmap.contigs=[150-150]' inference.output_prefix=test_outputs/test inference.num_designs=10
Reading models from /mnt/d/Models/RFdiffusion
[2023-04-06 15:59:06,421][rfdiffusion.inference.model_runners][INFO] - Reading checkpoint from /mnt/d/Models/RFdiffusion/Base_ckpt.pt
This is inf_conf.ckpt_path
/mnt/d/Models/RFdiffusion/Base_ckpt.pt
Assembling -model, -diffuser and -preprocess configs from checkpoint
USING MODEL CONFIG: self._conf[model][n_extra_block] = 4
USING MODEL CONFIG: self._conf[model][n_main_block] = 32
USING MODEL CONFIG: self._conf[model][n_ref_block] = 4
USING MODEL CONFIG: self._conf[model][d_msa] = 256
USING MODEL CONFIG: self._conf[model][d_msa_full] = 64
USING MODEL CONFIG: self._conf[model][d_pair] = 128
USING MODEL CONFIG: self._conf[model][d_templ] = 64
USING MODEL CONFIG: self._conf[model][n_head_msa] = 8
USING MODEL CONFIG: self._conf[model][n_head_pair] = 4
USING MODEL CONFIG: self._conf[model][n_head_templ] = 4
USING MODEL CONFIG: self._conf[model][d_hidden] = 32
USING MODEL CONFIG: self._conf[model][d_hidden_templ] = 32
USING MODEL CONFIG: self._conf[model][p_drop] = 0.15
USING MODEL CONFIG: self._conf[model][SE3_param_full] = {'num_layers': 1, 'num_channels': 32, 'num_degrees': 2, 'n_heads': 4, 'div': 4, 'l0_in_features': 8, 'l0_out_features': 8, 'l1_in_features': 3, 'l1_out_features': 2, 'num_edge_features': 32}
USING MODEL CONFIG: self._conf[model][SE3_param_topk] = {'num_layers': 1, 'num_channels': 32, 'num_degrees': 2, 'n_heads': 4, 'div': 4, 'l0_in_features': 64, 'l0_out_features': 64, 'l1_in_features': 3, 'l1_out_features': 2, 'num_edge_features': 64}
USING MODEL CONFIG: self._conf[model][d_time_emb] = 0
USING MODEL CONFIG: self._conf[model][d_time_emb_proj] = 10
USING MODEL CONFIG: self._conf[model][freeze_track_motif] = False
USING MODEL CONFIG: self._conf[model][use_motif_timestep] = True
USING MODEL CONFIG: self._conf[diffuser][T] = 50
USING MODEL CONFIG: self._conf[diffuser][b_0] = 0.01
USING MODEL CONFIG: self._conf[diffuser][b_T] = 0.07
USING MODEL CONFIG: self._conf[diffuser][schedule_type] = linear
USING MODEL CONFIG: self._conf[diffuser][so3_type] = igso3
USING MODEL CONFIG: self._conf[diffuser][crd_scale] = 0.25
USING MODEL CONFIG: self._conf[diffuser][so3_schedule_type] = linear
USING MODEL CONFIG: self._conf[diffuser][min_b] = 1.5
USING MODEL CONFIG: self._conf[diffuser][max_b] = 2.5
USING MODEL CONFIG: self._conf[diffuser][min_sigma] = 0.02
USING MODEL CONFIG: self._conf[diffuser][max_sigma] = 1.5
USING MODEL CONFIG: self._conf[preprocess][sidechain_input] = False
USING MODEL CONFIG: self._conf[preprocess][motif_sidechain_input] = True
USING MODEL CONFIG: self._conf[preprocess][d_t1d] = 22
USING MODEL CONFIG: self._conf[preprocess][d_t2d] = 44
USING MODEL CONFIG: self._conf[preprocess][prob_self_cond] = 0.5
USING MODEL CONFIG: self._conf[preprocess][str_self_cond] = True
USING MODEL CONFIG: self._conf[preprocess][predict_previous] = False
[2023-04-06 15:59:10,778][rfdiffusion.inference.model_runners][INFO] - Loading checkpoint.
[2023-04-06 15:59:13,459][rfdiffusion.diffusion][INFO] - Calculating IGSO3.
Successful diffuser __init__
[2023-04-06 15:59:17,256][__main__][INFO] - Making design test_outputs/test_0
[2023-04-06 15:59:17,260][rfdiffusion.inference.model_runners][INFO] - Using contig: ['150-150']
With this beta schedule (linear schedule, beta_0 = 0.04, beta_T = 0.28), alpha_bar_T = 0.00013696048699785024
[2023-04-06 15:59:17,271][rfdiffusion.inference.model_runners][INFO] - Sequence init: ------------------------------------------------------------------------------------------------------------------------------------------------------
Error executing job with overrides: ['inference.model_directory_path=/mnt/d/Models/RFdiffusion', 'contigmap.contigs=[150-150]', 'inference.output_prefix=test_outputs/test', 'inference.num_designs=10']
Traceback (most recent call last):
  File "/home/pavel/RFdiffusion/scripts/run_inference.py", line 85, in main
    px0, x_t, seq_t, plddt = sampler.sample_step(
  File "/home/pavel/RFdiffusion/rfdiffusion/inference/model_runners.py", line 665, in sample_step
    msa_prev, pair_prev, px0, state_prev, alpha, logits, plddt = self.model(msa_masked,
  File "/home/pavel/.local/share/miniconda3/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pavel/RFdiffusion/rfdiffusion/RoseTTAFoldModel.py", line 114, in forward
    msa, pair, R, T, alpha_s, state = self.simulator(seq, msa_latent, msa_full, pair, xyz[:,:,:3],
  File "/home/pavel/.local/share/miniconda3/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pavel/RFdiffusion/rfdiffusion/Track_module.py", line 420, in forward
    msa_full, pair, R_in, T_in, state, alpha = self.extra_block[i_m](msa_full,
  File "/home/pavel/.local/share/miniconda3/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pavel/RFdiffusion/rfdiffusion/Track_module.py", line 332, in forward
    R, T, state, alpha = self.str2str(msa, pair, R_in, T_in, xyz, state, idx, motif_mask=motif_mask, top_k=0)
  File "/home/pavel/.local/share/miniconda3/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pavel/.local/share/miniconda3/envs/SE3nv/lib/python3.9/site-packages/torch/cuda/amp/autocast_mode.py", line 141, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/pavel/RFdiffusion/rfdiffusion/Track_module.py", line 266, in forward
    shift = self.se3(G, node.reshape(B*L, -1, 1), l1_feats, edge_feats)
  File "/home/pavel/.local/share/miniconda3/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pavel/RFdiffusion/rfdiffusion/SE3_network.py", line 83, in forward
    return self.se3(G, node_features, edge_features)
  File "/home/pavel/.local/share/miniconda3/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pavel/.local/share/miniconda3/envs/SE3nv/lib/python3.9/site-packages/se3_transformer-1.0.0-py3.9.egg/se3_transformer/model/transformer.py", line 140, in forward
    basis = basis or get_basis(graph.edata['rel_pos'], max_degree=self.max_degree, compute_gradients=False,
  File "/home/pavel/.local/share/miniconda3/envs/SE3nv/lib/python3.9/site-packages/se3_transformer-1.0.0-py3.9.egg/se3_transformer/model/basis.py", line 167, in get_basis
    spherical_harmonics = get_spherical_harmonics(relative_pos, max_degree)
  File "/home/pavel/.local/share/miniconda3/envs/SE3nv/lib/python3.9/site-packages/se3_transformer-1.0.0-py3.9.egg/se3_transformer/model/basis.py", line 58, in get_spherical_harmonics
    sh = o3.spherical_harmonics(all_degrees, relative_pos, normalize=True)
  File "/home/pavel/.local/share/miniconda3/envs/SE3nv/lib/python3.9/site-packages/e3nn/o3/_spherical_harmonics.py", line 180, in spherical_harmonics
    return sh(x)
  File "/home/pavel/.local/share/miniconda3/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pavel/.local/share/miniconda3/envs/SE3nv/lib/python3.9/site-packages/e3nn/o3/_spherical_harmonics.py", line 82, in forward
    sh = _spherical_harmonics(self._lmax, x[..., 0], x[..., 1], x[..., 2])
RuntimeError: nvrtc: error: invalid value for --gpu-architecture (-arch)

nvrtc compilation failed:

#define NAN __int_as_float(0x7fffffff)
#define POS_INFINITY __int_as_float(0x7f800000)
#define NEG_INFINITY __int_as_float(0xff800000)


template<typename T>
__device__ T maximum(T a, T b) {
  return isnan(a) ? a : (a > b ? a : b);
}

template<typename T>
__device__ T minimum(T a, T b) {
  return isnan(a) ? a : (a < b ? a : b);
}

extern "C" __global__
void fused_pow_pow_pow_su_9196483836509741110(float* tz_1, float* ty_1, float* tx_1, float* aten_mul, float* aten_mul_1, float* aten_mul_2, float* aten_sub, float* aten_add, float* aten_mul_3, float* aten_pow) {
{
  if (512 * blockIdx.x + threadIdx.x<22350 ? 1 : 0) {
    float ty_1_1 = __ldg(ty_1 + 3 * (512 * blockIdx.x + threadIdx.x));
    aten_pow[512 * blockIdx.x + threadIdx.x] = ty_1_1 * ty_1_1;
    float tz_1_1 = __ldg(tz_1 + 3 * (512 * blockIdx.x + threadIdx.x));
    float tx_1_1 = __ldg(tx_1 + 3 * (512 * blockIdx.x + threadIdx.x));
    aten_mul_3[512 * blockIdx.x + threadIdx.x] = (float)((double)(tz_1_1 * tz_1_1 - tx_1_1 * tx_1_1) * 0.8660254037844386);
    aten_add[512 * blockIdx.x + threadIdx.x] = tx_1_1 * tx_1_1 + tz_1_1 * tz_1_1;
    aten_sub[512 * blockIdx.x + threadIdx.x] = ty_1_1 * ty_1_1 - (float)((double)(tx_1_1 * tx_1_1 + tz_1_1 * tz_1_1) * 0.5);
    aten_mul_2[512 * blockIdx.x + threadIdx.x] = (float)((double)(ty_1_1) * 1.732050807568877) * tz_1_1;
    aten_mul_1[512 * blockIdx.x + threadIdx.x] = (float)((double)(tx_1_1) * 1.732050807568877) * ty_1_1;
    aten_mul[512 * blockIdx.x + threadIdx.x] = (float)((double)(tx_1_1) * 1.732050807568877) * tz_1_1;
  }
}
}


Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
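
The nvrtc "invalid value for --gpu-architecture" error typically means this torch/CUDA build does not know the GPU's compute capability: an RTX 4090 is sm_89 (Ada), while the cudatoolkit 11.1 pinned by SE3nv.yml only compiles kernels up to sm_86, so e3nn's JIT-compiled spherical-harmonics kernel fails. A small diagnostic helper (usage with torch is shown in comments, assuming a working install):

```python
def arch_supported(capability, arch_list):
    """True if a compute capability tuple like (8, 9) appears in a
    torch build's compiled architecture list, e.g. ['sm_80', 'sm_86']."""
    return f"sm_{capability[0]}{capability[1]}" in arch_list

# With torch installed you would check:
#   import torch
#   arch_supported(torch.cuda.get_device_capability(0),
#                  torch.cuda.get_arch_list())
# An RTX 4090 is sm_89; a CUDA 11.1 build stops at sm_86:
print(arch_supported((8, 9), ["sm_37", "sm_50", "sm_60",
                              "sm_70", "sm_75", "sm_80", "sm_86"]))  # False
```

If the capability is missing from the list, the fix is a torch build with a newer CUDA toolkit rather than any RFdiffusion-side change.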

Unable to install environment from SE3nv.yml

First of all, thanks to the RFDiffusion team for this tool and making it open source! I'm excited to use it!

I've begun the installation process and tried to create the conda environment from the SE3nv.yml file with conda env create -f env/SE3nv.yml, but I get the following error:

ResolvePackageNotFound: 
  - icecream
  - cudatoolkit=11.1

I've resolved this issue by adding - nvidia to the channels and moving icecream to the pip installs. The final file looks like this:

name: SE3nv
channels:
  - defaults
  - pytorch
  - dglteam
  - nvidia
dependencies:
  - python=3.9
  - pytorch=1.9
  - torchaudio
  - torchvision
  - cudatoolkit=11.1
  - dgl-cuda11.1
  - pip
  - pip:
    - icecream
    - hydra-core
    - pyrsistent

Not sure if this is the best way to handle creating the environment, but it seems to have worked for me!

Run on multiple GPUs?

Is there any way to get this to run on multiple GPUs simultaneously?
Right now it only runs on a single GPU even when multiple are present. Any flags I might try?
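
RFdiffusion runs each design trajectory on a single GPU, but independent jobs can be parallelized by pinning one process per GPU with CUDA_VISIBLE_DEVICES. A sketch (the output prefix and overrides are illustrative, not a supported flag):

```python
import os

def per_gpu_jobs(n_gpus, designs_per_gpu=5):
    """Build one (command, env) pair per GPU; each process sees only
    its own device via CUDA_VISIBLE_DEVICES."""
    jobs = []
    for gpu in range(n_gpus):
        cmd = [
            "./scripts/run_inference.py",
            f"inference.output_prefix=out/gpu{gpu}/design",
            "contigmap.contigs=[100-200]",
            f"inference.num_designs={designs_per_gpu}",
        ]
        env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)}
        jobs.append((cmd, env))
    return jobs

# Launch with, e.g.:
#   import subprocess
#   procs = [subprocess.Popen(cmd, env=env) for cmd, env in per_gpu_jobs(4)]
#   for p in procs:
#       p.wait()
```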

Always get mistaken chain connections in binder design

[image]
Here is a partial comparison between the original structure and the structure after diffusion. Confusingly, I always get wrong connections within the parts that I did not declare to be designed.
[image]
I used /0 as the chain-break sign, but it seems to have failed, since chains E and F were connected incorrectly.

I haven't figured out the exact rule behind this; any advice will be appreciated!

Problems with using Complex_beta_ckpt.pt

Hi, RFdiffusion team,

This is great work.

I am trying to make a binder to a beta sheet, and I tried using Complex_beta_ckpt.pt.

The results show that the binder sequences are all GGGGGGGG....

What should I do to solve this problem?

My command is: python ${script_path}/run_inference.py inference.output_prefix=out/design_ppi inference.input_pdb=input/target.pdb 'contigmap.contigs=[C1-62/0 50-79]' 'ppi.hotspot_res=[C2,C3,C14,C15,C16,C17,C18,C20]' inference.ckpt_override_path=${script_path}/models/Complex_beta_ckpt.pt inference.num_designs=10 denoiser.noise_scale_ca=0 denoiser.noise_scale_frame=0

Thank you.

Extend N-terminus with a helical bundle structure

Hi,
thank you for providing this code and the examples! Suppose I have a PDB file of a protein with a long alpha helix at the N-terminus. Is it possible to extend this helix with a helical bundle, i.e. to generate an N-terminal fusion of a three-helix bundle to my target protein? The helical bundles generated by the design_ppi script would be perfect.

Output circular proteins?

Is there any way to get RFdiffusion to connect the N- and C-termini to form a circularized protein?
If possible, this would be phenomenally useful functionality.

Many thanks.

Create a jupyter notebook

The Colab notebook is great, but Google Colab has been particularly unreliable recently on free accounts.
Would it be possible for someone to translate the Colab notebook into a Jupyter notebook?

Help With Binder Design Denoiser

Hi there, I've been using the Google Colab version of the program and wanted to know where to enter the denoiser.noise_scale and denoiser.noise_scale_frame commands in the code. I'd also like to know how to filter i_pae results down to < 10. Thank you! Love the program, keep up the great work! :)

About ‘Generation of Symmetric Oligomers’

Hi, in the 'Generation of Symmetric Oligomers' section,
I saw the command:
"./scripts/run_inference.py --config-name symmetry inference.symmetry=tetrahedral 'contigmap.contigs=[360]' inference.output_prefix=test_sample/tetrahedral inference.num_designs=1"
but running it reports an error:
File ".../RFdiffusion/rfdiffusion/contigs.py", line 137, in get_sampled_mask
contig_list = self.contigs[0].strip().split()
AttributeError: 'int' object has no attribute 'strip'

The symmetry.yaml in config/inference is set as follows:

contigmap:
  # Specify a single integer value to sample unconditionally.
  # Must be evenly divisible by the number of chains in the symmetry.
  contigs: ['100']

So, is there a problem in this part?

Finally, please keep the content of the README.md consistent with the code, because there seem to be some differences between them. Thanks!

About downstream sequence assignment

Hi, RFdiffusion team,

The results show that every designed residue is output as glycine, and we should use ProteinMPNN to assign sequences to these residues.

My question is: if we want to fix some residues from the input structure (e.g., for enzyme design or scaffolding a functional motif, I want to fix the active site or functional motif), how do we specify this in ProteinMPNN?

Thank you.
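
ProteinMPNN supports this via a fixed-positions dictionary passed with its --fixed_positions_jsonl option (its helper_scripts directory includes a generator for this file). A minimal sketch of writing the file by hand; the schema shown is my reading of ProteinMPNN's helper scripts, so verify it against your ProteinMPNN version:

```python
import json

def write_fixed_positions(jsonl_path, pdb_name, fixed):
    """Write a ProteinMPNN fixed-positions .jsonl file.

    `fixed` maps chain letter -> 1-indexed residue positions to keep
    unchanged (e.g. active-site or motif residues), so ProteinMPNN
    only redesigns the remaining positions. Schema is an assumption;
    check ProteinMPNN's helper_scripts for the authoritative format.
    """
    with open(jsonl_path, "w") as fh:
        fh.write(json.dumps({pdb_name: fixed}) + "\n")

# e.g. keep motif residues 10-25 on chain A fixed:
# write_fixed_positions("fixed.jsonl", "design_0", {"A": list(range(10, 26))})
```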

Question about RFDiffusion -> ProteinMPNN handoff for binder design

First, thank you to the authors for releasing this code and model!

When I run RFdiffusion for binder design, the output PDBs show the binder as polyglycine (expected) and the target protein with the original sequence (also expected). However, when you look at the structure, the target protein no longer has side chains; only the backbone atoms are preserved (not what I expected). Is this a problem if I intend to use these PDBs as inputs to ProteinMPNN? Or should I take the RFdiffusion backbone and make a new PDB with the original target structure, side chains and all?

How to set the seed value

Is it possible to set deterministic in conf/inference/base.yaml to True and set its seed value from an argument?

if conf.inference.deterministic:
    make_deterministic()

Or is it preferable to simply set deterministic to True with a large value of inference.num_designs?
In that case it takes longer to get results, because the designs are generated sequentially instead of in parallel.
Is there a big difference in output between running in parallel with many seeds and inference.num_designs=1, versus running sequentially with a large inference.num_designs?
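
For context, the idea behind make_deterministic is just RNG seeding; a stand-alone sketch using only the standard library (the real function in RFdiffusion also seeds torch and numpy; this is an assumption, check scripts/run_inference.py for the exact implementation):

```python
import random

def make_deterministic_sketch(seed=0):
    """Seed the RNG so repeated runs reproduce the same draws.
    RFdiffusion's make_deterministic additionally seeds torch/numpy
    (assumption: see scripts/run_inference.py)."""
    random.seed(seed)

make_deterministic_sketch(42)
a = [random.random() for _ in range(3)]
make_deterministic_sketch(42)
b = [random.random() for _ in range(3)]
print(a == b)  # True: same seed, identical draws
```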

Integration of ProteinMPNN & AF2 filtering

Thanks a lot for making the RFDiffusion project available! I am trying to wrap my head around what is needed to get the whole design workflow set up locally.

RFdiffusion as described here only seems to output poly-glycine PDBs, so we still need to run ProteinMPNN and AF2 filtering on all candidate solutions. The Colab version of RFdiffusion seems to perform these steps through a call to colabdesign/rf/designability_test.py. However, that script doesn't seem to exist in either this repo or the colabdesign/rf one.

Could you please add this script to this repo so that one can fully reproduce the workflow described in your paper?

Code comments

Hi team, congratulations on the great paper and milestone results.

I've collected several remarks on the code while studying it; hopefully this will help improve the codebase.

terminology

  • templates dimension: sometimes called T, sometimes N, sometimes s.
  • t1d and t2d are very non-descriptive names.

dependencies

  • opt_einsum is used but should be replaced with torch.einsum: (1) PyTorch relies on opt_einsum anyway; (2) opt_einsum optimizes contraction order for 3 or more tensors, which never occurs in this code; (3) it precludes scripting/compiling the code.
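
The contraction-order point can be illustrated with numpy.einsum, which accepts the same equation strings as torch.einsum; with exactly two operands there is only one possible contraction, so opt_einsum's planner adds nothing:

```python
import numpy as np

# A typical pairwise contraction of the kind found in attention code:
# (batch, heads, i, d) x (batch, heads, j, d) -> (batch, heads, i, j)
q = np.random.rand(2, 4, 5, 8)
k = np.random.rand(2, 4, 6, 8)
attn = np.einsum("bhid,bhjd->bhij", q, k)
print(attn.shape)  # (2, 4, 5, 6)
```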

ignored parameters/code

  • igso3.py: calculate_igso3 ignores L argument
  • init_lecun_normal and init_lecun_normal_param ignore passed initialization scale
  • Attention and MaskedTokenNetwork ignore p_drop, and dropout is never applied (while signature suggests the opposite)
  • class Denoise: multiple parameters are not used
  • looks like * self.guide_scale is missing here
    'cubic' : lambda t: t**3/self.T**3
  • diff_util.py: unused file
  • olig_intra_contacts: misses self.d0 and self.r0

distributing weights

  • I'd recommend replacing unsafe pickles with something simple and safe (npz/safetensors)

computations

  • manual normalizations (lots of them) can be replaced with F.normalize; cosines (in many places) can be computed with F.cosine_similarity
  • Sergey's one-hot trick (used in multiple places): strangely, the created embedding layer is never used. Better to just create a linear module, or only a trainable parameter
  • looks like you try to implement torch.nanmean here
    mask = torch.isnan(xyz_t[:,:,:,:3]).any(dim=-1).any(dim=-1) # (B, T, L)
    #
    center_CA = ((~mask[:,:,:,None]) * torch.nan_to_num(xyz_t[:,:,:,1,:])).sum(dim=2) / ((~mask[:,:,:,None]).sum(dim=2)+1e-4) # (B, T, 3)
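
For illustration, the mask-and-renormalize pattern above computes the same thing as a NaN-aware mean; shown here with numpy (recent PyTorch versions provide torch.nanmean with the same semantics):

```python
import numpy as np

# CA coordinates across three templates, one of them missing (NaN).
xyz_ca = np.array([[1.0, 2.0, 3.0],
                   [np.nan, np.nan, np.nan],
                   [3.0, 4.0, 5.0]])

# Manual version: zero out NaNs, sum, divide by the count of valid entries.
manual = np.nansum(xyz_ca, axis=0) / np.sum(~np.isnan(xyz_ca), axis=0)
print(np.allclose(manual, np.nanmean(xyz_ca, axis=0)))  # True
```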

checkpointing

  • Checkpointing on iterblock: wouldn't it be more efficient to checkpoint the whole block instead of checkpointing every step? This would simplify the logic as well
  • create_custom_forward: unclear why it's needed. The function is only used once with an actual kwarg (topk), and it can be replaced with a lambda anyway: lambda *args: module(*args, topk=...)

duplicated code

  • computation of Cbeta: three torch implementations in the repo, with discrepancies in the weights used
  • make_contact_matrix: implemented twice
  • dihedral computation: implemented twice in numpy and once in pytorch, with the implementations scattered

Minor:

  • PositionalEncoding2D: I think the binning is asymmetric (it enumerates shifts from -32 to 31 plus the outside regions, instead of -32 to 32)
  • icecream is imported but never used (and required in dependencies)
