
dmasif's People

Contributors: freyrs


dmasif's Issues

No saved model checkpoints for site prediction

Hi

Thanks for making your code public.

I want to run main_inference.py for site prediction. The only pretrained model you provide is this one for search prediction
dMaSIF_search_3layer_12A_16dim

When I try to run it for site prediction, it fails with a state_dict loading error:

python -W ignore -u main_inference.py --experiment_name dMaSIF_search_3layer_12A_16dim --batch_size 64 --embedding_layer dMaSIF --site True --emb_dims 16 --device cuda:0 --radius 12.0 --n_layers 3
RuntimeError: Error(s) in loading state_dict for dMaSIF:
        Missing key(s) in state_dict: "net_out.0.weight", "net_out.0.bias", "net_out.2.weight", "net_out.2.bias", "net_out.4.weight", "net_out.4.bias". 
        Unexpected key(s) in state_dict: "orientation_scores2.0.weight", "orientation_scores2.0.bias", "orientation_scores2.2.weight", "orientation_scores2.2.bias", "conv2.layers.0.net_in.0.weight", "conv2.layers.0.net_in.0.bias", "conv2.layers.0.net_in.2.weight", "conv2.layers.0.net_in.2.bias", "conv2.layers.0.norm_in.weight", "conv2.layers.0.norm_in.bias", "conv2.layers.0.conv.0.weight", "conv2.layers.0.conv.0.bias", "conv2.layers.0.conv.2.weight", "conv2.layers.0.conv.2.bias", "conv2.layers.0.net_out.0.weight", "conv2.layers.0.net_out.0.bias", "conv2.layers.0.net_out.2.weight", "conv2.layers.0.net_out.2.bias", "conv2.layers.0.norm_out.weight", "conv2.layers.0.norm_out.bias", "conv2.layers.1.net_in.0.weight", "conv2.layers.1.net_in.0.bias", "conv2.layers.1.net_in.2.weight", "conv2.layers.1.net_in.2.bias", "conv2.layers.1.norm_in.weight", "conv2.layers.1.norm_in.bias", "conv2.layers.1.conv.0.weight", "conv2.layers.1.conv.0.bias", "conv2.layers.1.conv.2.weight", "conv2.layers.1.conv.2.bias", "conv2.layers.1.net_out.0.weight", "conv2.layers.1.net_out.0.bias", "conv2.layers.1.net_out.2.weight", "conv2.layers.1.net_out.2.bias", "conv2.layers.1.norm_out.weight", "conv2.layers.1.norm_out.bias", "conv2.layers.2.net_in.0.weight", "conv2.layers.2.net_in.0.bias", "conv2.layers.2.net_in.2.weight", "conv2.layers.2.net_in.2.bias", "conv2.layers.2.norm_in.weight", "conv2.layers.2.norm_in.bias", "conv2.layers.2.conv.0.weight", "conv2.layers.2.conv.0.bias", "conv2.layers.2.conv.2.weight", "conv2.layers.2.conv.2.bias", "conv2.layers.2.net_out.0.weight", "conv2.layers.2.net_out.0.bias", "conv2.layers.2.net_out.2.weight", "conv2.layers.2.net_out.2.bias", "conv2.layers.2.norm_out.weight", "conv2.layers.2.norm_out.bias", "conv2.linear_layers.0.0.weight", "conv2.linear_layers.0.0.bias", "conv2.linear_layers.0.2.weight", "conv2.linear_layers.0.2.bias", "conv2.linear_layers.1.0.weight", "conv2.linear_layers.1.0.bias", "conv2.linear_layers.1.2.weight", "conv2.linear_layers.1.2.bias", "conv2.linear_layers.2.0.weight", "conv2.linear_layers.2.0.bias", "conv2.linear_layers.2.2.weight", "conv2.linear_layers.2.2.bias", "conv2.linear_transform.0.weight", "conv2.linear_transform.0.bias", "conv2.linear_transform.1.weight", "conv2.linear_transform.1.bias", "conv2.linear_transform.2.weight", "conv2.linear_transform.2.bias"

Am I using the command wrong, or is the model just not available?
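For anyone hitting the same mismatch: a quick way to check what the provided checkpoint actually contains is to list its parameter names (a hedged sketch; it assumes the file holds a plain state_dict rather than a wrapper dictionary):

import torch

# List the parameter names stored in the checkpoint. Judging from the error above,
# a site model should contain "net_out.*" keys, while the search checkpoint carries
# the extra "conv2.*" and "orientation_scores2.*" branches.
state = torch.load("models/dMaSIF_search_3layer_12A_16dim", map_location="cpu")
print(sorted(k for k in state.keys() if k.startswith(("net_out", "conv2", "orientation_scores2"))))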

Confused about some properties of the data

Hi, thanks for your great work!
However, I am quite confused about some of the properties. For example, we can extract atom_coords and atom_types from the PDB files. Then what is the difference between xyz in PLY files and atom_coords in PDB files?
As for the faces/triangles, am I right that every triple of vertex indices forms one triangle of the surface?
I am also curious about the normals in the PLY files. In the paper, it seems they should be computed by the sampling algorithm, so how can they be read straight from the PLY files?

Sorry, I'm not very familiar with protein surface representations; I hope you can answer my questions.
Thanks!

mesh ply files from masif preprocessing scripts were not fed into atoms_to_points_normals

The atoms_to_points_normals() function is described clearly in your paper, but the training process that uses the mesh PLY files from the MaSIF preprocessing scripts skips it.
My question is: at inference time, the model takes a PDB directly as input and converts it to points with atoms_to_points_normals(). Does it make sense for the input files to differ between training and inference?

How to replace with my own dataset

Is the downloaded dataset masif_site_masif_search_pdbs_and_ply_files.tar.gz the result of preprocessing the dataset with the previous work, MaSIF?
What should I do if I want to replace it with my own dataset?

RuntimeError: legacy constructor expects device type: cpubut device type: cuda was passed

python main_inference.py --experiment_name dMaSIF_site_3layer_16dims_9A_0.7res_150sup_epoch85 --embedding_layer dMaSIF --emb_dims 16 --n_layers 3 --resolution 0.7 --radius 9 --single_pdb 1A0G_B_B --site True --device cuda:0
0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
File "main_inference.py", line 86, in
pdb_ids=test_pdb_ids,
File "/home/v/proj/dmasif/content/MaSIF_colab/data_iteration.py", line 286, in iterate
P1_batch, P2_batch = process(args, protein_pair, net)
File "/home/v/proj/dmasif/content/MaSIF_colab/data_iteration.py", line 126, in process
net.preprocess_surface(P1)
File "/home/v/proj/dmasif/content/MaSIF_colab/model.py", line 454, in preprocess_surface
distance=self.args.distance,
File "/home/v/proj/dmasif/content/MaSIF_colab/geometry_processing.py", line 266, in atoms_to_points_normals
atomtypes=atomtypes,
File "/home/v/proj/dmasif/content/MaSIF_colab/geometry_processing.py", line 164, in soft_distances
[170, 110, 152, 155, 180, 190], device=x.device
RuntimeError: legacy constructor expects device type: cpubut device type: cuda was passed
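A possible fix (my own assumption, not an official patch) is to replace the legacy torch.cuda.FloatTensor constructor in soft_distances (geometry_processing.py) with the modern torch.tensor factory, which accepts a device argument directly:

# Replace
#     atomic_radii = torch.cuda.FloatTensor([170, 110, 152, 155, 180, 190], device=x.device)
# with a device-aware factory call:
atomic_radii = torch.tensor(
    [170.0, 110.0, 152.0, 155.0, 180.0, 190.0], dtype=torch.float32, device=x.device
)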

about ply files

How can I get the .ply file corresponding to a .pdb, as provided in the downloaded dataset?
In other words, how does this work convert a PDB to a PLY?

Some problem while executing code

Hi, I tried to execute some commands from benchmark_scripts.

However, I met some problems:

  1. There is a dimension problem when the program loads the PyG dataset:
    collate TypeError: cat_dim() takes 3 positional arguments but 4 were given
    So I modified the code in data.py from
    def cat_dim(self, key, value):
    to
    def cat_dim(self, key, value, *args, **kwargs):

  2. atoms_to_points_normals
    Here I get an error like
    File exists: '/home/user/.cache/pykeops-1.5-cpython-38//build-pybind11_template-libKeOps_template_40f62f56de'

/home/user/.cache/pykeops-1.5-cpython-38:
formula: Sum_Reduction(Exp(Minus(Sqrt(Sum(Square((Var(0,3,0) - Var(1,3,1))))))),1)
aliases: Var(0,3,0); Var(1,3,1);
dtype : float32
...
make: *** 「KeOps_formula」。 。

      --------------------- MAKE DEBUG -----------------
      Command '['cmake', '--build', '.', '--target', 'KeOps_formula', '--', 'VERBOSE=1']' returned non-zero exit status 2.

Pre-trained model in Hugging face

Hi,

I found a link to Hugging Face with pre-trained weights, but it seems to be down now. Is there a pre-trained model for this paper on Hugging Face? Thanks in advance.

dMaSIF for interaction prediction, how to find complementary regions?

Hi

I'm trying to use dMaSIF for interaction prediction between proteins (taking a target and finding the best binder in a large collection of potential binders)

At the moment, I process both binder and target molecule identically with dMaSIF up to the convolutional step and export the outputs "xxxx_predfeatures_emb1.npy" and "predcoords.npy" for both proteins.
According to the paper, these features of both binding partners should be passed through a separate convolutional network, allowing the network to find complementary (instead of similar) regions. Unfortunately I was not able to find the code doing that. Could you point me to the right section in the dMaSIF code?

Thanks so much to all contributors
DavidGraber

Input seems to require a y value

Hi

I'm trying to run your model on a single protein pair (target + binder). It seems you must supply a y/label column vector with each protein in the pair.

If you set the --single_pdb parameter to a pdb file on the command line, the first function that is called is load_protein_pair() which tries to access a 'y' attribute on the Data object returned by load_protein_npy()

def load_protein_pair(pdb_id, data_dir,single_pdb=False):
    """Loads a protein surface mesh and its features"""
    pspl = pdb_id.split("_")
    p1_id = pspl[0] + "_" + pspl[1]
    p2_id = pspl[0] + "_" + pspl[2]

    p1 = load_protein_npy(p1_id, data_dir, center=False,single_pdb=single_pdb)
    p2 = load_protein_npy(p2_id, data_dir, center=False,single_pdb=single_pdb)
    # pdist = ((p1['xyz'][:,None,:]-p2['xyz'][None,:,:])**2).sum(-1).sqrt()
    # pdist = pdist<2.0
    # y_p1 = (pdist.sum(1)>0).to(torch.float).reshape(-1,1)
    # y_p2 = (pdist.sum(0)>0).to(torch.float).reshape(-1,1)
    y_p1 = p1["y"] # <- tries to access a y value that will not be there at inference time
    y_p2 = p2["y"]
...

It looks like load_protein_npy() tries to set the 'y' attribute on the Data object to None if the --single_pdb parameter is set to a file, but irritatingly, the Data constructor does not set attributes if they are None.

So I just get a KeyError.

test_dataset = [load_protein_pair(args.single_pdb, NPY_DIR, single_pdb=True)]
test_pdb_ids = [args.single_pdb]
KeyError                                  Traceback (most recent call last)
Cell In [5], line 1
----> 1 test_dataset = [load_protein_pair(args.single_pdb, NPY_DIR, single_pdb=True)]
      2 test_pdb_ids = [args.single_pdb]

File ~/git/dMaSIF/data.py:247, in load_protein_pair(pdb_id, data_dir, single_pdb)
    242 p2 = load_protein_npy(p2_id, data_dir, center=False,single_pdb=single_pdb)
    243 # pdist = ((p1['xyz'][:,None,:]-p2['xyz'][None,:,:])**2).sum(-1).sqrt()
    244 # pdist = pdist<2.0
    245 # y_p1 = (pdist.sum(1)>0).to(torch.float).reshape(-1,1)
    246 # y_p2 = (pdist.sum(0)>0).to(torch.float).reshape(-1,1)
--> 247 y_p1 = p1["y"]
    248 y_p2 = p2["y"]
    250 protein_pair_data = PairData(
    251     xyz_p1=p1["xyz"],
    252     xyz_p2=p2["xyz"],
   (...)
    266     atom_types_p2=p2["atom_types"],
    267 )

File ~/mambaforge/envs/PyG-env6/lib/python3.10/site-packages/torch_geometric/data/data.py:444, in Data.__getitem__(self, key)
    443 def __getitem__(self, key: str) -> Any:
--> 444     return self._store[key]
...
File ~/mambaforge/envs/PyG-env6/lib/python3.10/site-packages/torch_geometric/data/storage.py:81, in BaseStorage.__getitem__(self, key)
     80 def __getitem__(self, key: str) -> Any:
---> 81     return self._mapping[key]

KeyError: 'y'
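A possible workaround sketch (my own assumption, not the authors' fix) is to fall back to a dummy label tensor when the Data object carries no "y" attribute:

import torch

def get_labels_or_dummy(p):
    # Hypothetical helper: PyG Data drops attributes that are None, so p["y"]
    # raises a KeyError at inference time; return an all-zero label vector instead.
    y = getattr(p, "y", None)
    if y is None:
        y = torch.zeros((p.xyz.shape[0], 1), dtype=torch.float32)
    return y

# In load_protein_pair, replace
#     y_p1 = p1["y"]
#     y_p2 = p2["y"]
# with
#     y_p1 = get_labels_or_dummy(p1)
#     y_p2 = get_labels_or_dummy(p2)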

Site and Search do not output the same number of atoms

Hi,
I have successfully managed to predict on new pdbs using the provided weights. Thanks for providing them.

I am trying to perform a PPI search, so I need to run dMaSIF Site, and then Search. I have used the commands below:

python -W ignore -u main_inference.py --experiment_name dMaSIF_search_3layer_12A_16dim --batch_size 64 --embedding_layer dMaSIF --emb_dims 16 --device cuda:0 --radius 12.0 --n_layers 3 --npy_dir NPY_DIR --search True

python -W ignore -u main_inference.py --experiment_name dMaSIF_site_3layer_16dims_12A_100sup_epoch71 --batch_size 64 --embedding_layer dMaSIF --emb_dims 16 --device cuda:0 --radius 12.0 --n_layers 3 --npy_dir NPY_DIR --site True

I added --npy_dir NPY_DIR for my use case. Basically, it calls load_protein_pair on proteins from a folder, but it only fills the first protein of the pair.

The problem is that for a single protein, the Site and Search predictions have a different number of rows!

Here are my questions:

  • Why do the dMaSIF site and search models produce outputs with different numbers of rows (atoms)?
  • Why does dMaSIF site output embeddings? How good are they for PPI search? If I could use them, it would solve my problem, since I need an embedding and an interface prediction for each atom.

Thanks!

CUDA memory leak

Hello authors and anyone else who has tried dMaSIF,

Do you ever face a CUDA memory leak issue?

I'm training it on a dataset of protein surface patches on an RTX 2080Ti and the GPU VRAM usage creeps up over time. Even using a small batch size to start with, the VRAM usage starts at ~3 GB, but slowly creeps up to 8 GB after several epochs (maybe 10 epochs). This is even with me calling torch.cuda.empty_cache() 10 times during the epoch.

This is the exact error message:

python3: /opt/conda/conda-bld/magma-cuda113_1619629459349/work/interface_cuda/interface.cpp:901: void magma_queue_create_from_cuda_internal(magma_device_t, cudaStream_t, cublasHandle_t, cusparseHandle_t, magma_queue**, const char*, const char*, int): Assertion `queue->dCarray__ != __null' failed.
3

I have not done any profiling checks as this is not particularly trivial... but I may not have much of a choice. I suspect it is from pykeops? Maybe it is bad at releasing VRAM that it was holding onto, even with tensor.detach(). Honestly, I don't know how well written pykeops is from an engineering standpoint. It was quite difficult to even install it properly and get it to see CUDA, which doesn't inspire confidence. C++ libs can have issues with efficient memory allocation.

I also notice that changing the batch size doesn't change the training speed (in terms of items per second), which is quite suspicious, as usually increasing the batch size increases the throughput.

It would be supremely helpful if anyone has found workarounds around this. Currently, I'm using a bash script that runs the experiment for N times (say 10 times), so even if one crashes halfway, it can resume from the last checkpoint. It works and I can train for many epochs (50+), but it is inconvenient and very hacky, to say the least.
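Roughly, the wrapper amounts to something like this (a sketch in Python rather than bash; the experiment name is a placeholder, and it assumes main_training.py resumes from the last checkpoint on restart, as described above):

import subprocess
import sys

N_RESTARTS = 10
cmd = [sys.executable, "-W", "ignore", "-u", "main_training.py",
       "--experiment_name", "my_experiment", "--embedding_layer", "dMaSIF", "--site", "True"]
for attempt in range(N_RESTARTS):
    # Re-launch training; a crash (e.g. the magma assertion above) simply triggers
    # another attempt that picks up the last saved checkpoint.
    if subprocess.run(cmd).returncode == 0:
        break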

I truly believe works like this are headed in the right direction for modelling proteins, their properties & interactions. As someone with a chemistry & biology background, this just makes much more sense than learning naively from 1D string sequences and hoping the model somehow generalises. I hope we can work together to improve this work.

Running your example does not give the same ROC-AUC as claimed in the paper

Hi

I ran the command below (I had to fetch the pretrained model from the dMaSIF_colab repo) to test the site prediction accuracy, and the ROC-AUC was on average 0.62, not the 0.87 you state in the paper. How did you get this figure?

python --experiment_name dMaSIF_site_3layer_16dims_9A_100sup_epoch64 --batch_size 64 --embedding_layer dMaSIF --site True --emb_dims 16 --device cuda:0 --radius 9.0 --n_layers 3

I also ran the same command but for search prediction and I got an average ROC-AUC of ~0.5 (so essentially random).

It would be really helpful if you could explain how to use this tool to get the results you say you obtained in the paper.

Thanks

faces of the protein surface

Hi, thanks for sharing this code repository and this impressive work!

I understand that dMaSIF can calculate the coordinates and normal vectors of protein surface vertices as below.

def preprocess_surface(self, P):
    P["xyz"], P["normals"], P["batch"] = atoms_to_points_normals(
        P["atoms"],
        P["batch_atoms"],
        atomtypes=P["atomtypes"],
        resolution=self.args.resolution,
        sup_sampling=self.args.sup_sampling,
        distance=self.args.distance,
    )
    if P['mesh_labels'] is not None:
        project_iface_labels(P)

I would like to know whether dMaSIF can provide information about the faces of the protein surface. If dMaSIF does not have a direct interface to compute the faces, do you know of any methods (e.g., Python packages) that can compute the protein's faces based on P["xyz"]? Thanks in advance!
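For what it is worth, one generic route (not part of dMaSIF) is to reconstruct a triangle mesh from the point cloud and normals, e.g. with Open3D's Poisson reconstruction. A minimal sketch, assuming P["xyz"] and P["normals"] are (N, 3) tensors as produced by preprocess_surface:

import numpy as np
import open3d as o3d

# Build an Open3D point cloud from the dMaSIF surface points and normals.
pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(P["xyz"].detach().cpu().numpy())
pcd.normals = o3d.utility.Vector3dVector(P["normals"].detach().cpu().numpy())

# Poisson surface reconstruction returns a triangle mesh; each row of
# mesh.triangles holds the three vertex indices of one face.
mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=8)
faces = np.asarray(mesh.triangles)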

RuntimeError: operation does not have an identity.

Hi dMaSIF team and developers, I have this problem when reproducing the code: while pre-processing the training dataset, it appears that all points get deleted in the step that removes points that may be trapped inside. Looking forward to your replies!

Preprocessing training dataset
  0%|                                                                                             | 0/2958 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "main_training.py", line 55, in <module>
    train_dataset = iterate_surface_precompute(train_loader, net, args)
  File "/hy-tmp/dMaSIF-master/data_iteration.py", line 427, in iterate_surface_precompute
    P1, P2 = process(args, protein_pair, net)
  File "/hy-tmp/dMaSIF-master/data_iteration.py", line 126, in process
    net.preprocess_surface(P1)
  File "/hy-tmp/dMaSIF-master/model.py", line 457, in preprocess_surface
    P["xyz"], P["normals"], P["batch"] = atoms_to_points_normals(
  File "/hy-tmp/dMaSIF-master/geometry_processing.py", line 305, in atoms_to_points_normals
    points, batch_points = subsample(z, batch_z, scale=resolution)
  File "/hy-tmp/dMaSIF-master/geometry_processing.py", line 126, in subsample
    batch_size = torch.max(batch).item() + 1  # Typically, =32
RuntimeError: operation does not have an identity.
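For debugging, a hypothetical sanity check (not in the repository) placed just before the subsample() call in atoms_to_points_normals makes the failure explicit when the cleaning step has discarded every candidate point:

# z / batch_z are the candidate surface points left after cleaning (names taken
# from the traceback above); torch.max() over an empty batch tensor is what
# raises "operation does not have an identity".
if z.shape[0] == 0:
    raise ValueError(
        "All sampled surface points were removed for this structure; "
        "check the input PDB and the resolution / distance settings."
    )
points, batch_points = subsample(z, batch_z, scale=resolution)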

benchmark_scripts/

The readme mentions that the command line options needed to reproduce the benchmarks can be found in benchmark_scripts/. Could it be that this folder is missing?

Choice of the atom RADII

Where do the following RADII values for the atoms come from?

atomic_radii = torch.cuda.FloatTensor([170, 110, 152, 155, 180, 190], device=x.device)

In the previous mesh based MASIF version, the following values are used for the MSMS program.

radii = {"N": "1.540000", "O": "1.400000", "C": "1.740000", "H": "1.200000", "S": "1.800000", "P": "1.800000",
         "Z": "1.39", "X": "0.770000"}

Availability of '{pdb_id}_meshpoints.npy' data?

Hi there!

Cool code! I trained a model following your scripts and I wanted to benchmark its performance with the provided notebooks. However, some data used by data_analysis/analyse_output.ipynb is not available.

meshpoints = np.load(datafolder/(pdb_id+'_meshpoints.npy'))

How should I get the data mentioned here?

Best,
Eric Alcaide

tutorial and use case

Hello,

I was wondering if it would be possible to provide documentation on how to use dMaSIF.

For example, what command needs to be run to perform main_inference.py site prediction for PLD1 (4ZQK.pdb)?

Also, more details on how to predict the interactome using main_inference.py search for PLD1 would be greatly appreciated.

A script for each of these use cases would greatly help with understanding how to operate the software.

This software would be very useful for our group. We would like to find the interactome of many new proteins we have detected by MS and whose structures we have predicted using AlphaFold.

Thank you very much in advance for your help.
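For reference, the single-protein invocation quoted elsewhere in these issues has the following shape; a hypothetical adaptation for 4ZQK (the chain identifiers here are placeholders) would be:

python main_inference.py --experiment_name dMaSIF_site_3layer_16dims_9A_0.7res_150sup_epoch85 --embedding_layer dMaSIF --emb_dims 16 --n_layers 3 --resolution 0.7 --radius 9 --single_pdb 4ZQK_A_B --site True --device cuda:0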

Why is the interface computed like this?

This function in data_iteration.py is shown below:

def generate_matchinglabels(args, P1, P2):
    if args.random_rotation:
        P1["xyz"] = torch.matmul(P1["rand_rot"].T, P1["xyz"].T).T + P1["atom_center"]
        P2["xyz"] = torch.matmul(P2["rand_rot"].T, P2["xyz"].T).T + P2["atom_center"]
    xyz1_i = LazyTensor(P1["xyz"][:, None, :].contiguous())
    xyz2_j = LazyTensor(P2["xyz"][None, :, :].contiguous())

    xyz_dists = ((xyz1_i - xyz2_j) ** 2).sum(-1).sqrt()
    xyz_dists = (1.0 - xyz_dists).step()

    p1_iface_labels = (xyz_dists.sum(1) > 1.0).float().view(-1)
    p2_iface_labels = (xyz_dists.sum(0) > 1.0).float().view(-1)

    P1["labels"] = p1_iface_labels
    P2["labels"] = p2_iface_labels

My question is: what is the meaning of xyz_dists.sum(1) > 1.0?
I think the values in xyz_dists.sum(1) are less than 1.0, and in most cases should even be negative, so p1_iface_labels may have no non-zero values, which would mean there is no interface.

Why do we manually loop through each batch?

Hello authors,

I am in the process of modifying dMaSIF for the downstream task of protein-ligand binding affinity prediction. While reading & modifying your code, I noticed that in data_iteration.iterate, https://github.com/FreyrS/dMaSIF/blob/master/data_iteration.py#L290
we actually extract the individual proteins/protein pairs within a batch and then do a forward pass on each of them separately.

Effectively, doesn't this equate to a batch_size of 1? Even though the --batch_size argument is set to 64 in the benchmark_scripts, it is not actually used, and the batch size is hard-coded to 1. https://github.com/FreyrS/dMaSIF/blob/master/main_training.py#L51

Is there a reason for doing this, rather than just doing a forward pass on the entire batch?

As a side note, this line (https://github.com/FreyrS/dMaSIF/blob/master/data_iteration.py#L299) also indicates that the code is hardcoded to a batch_size of 1. My understanding was that it should be
P1["rand_rot"] = protein_pair.rand_rot1.view(-1, 3, 3)[protein_it] instead of P1["rand_rot"] = protein_pair.rand_rot1.view(-1, 3, 3)[0]

Thank you and appreciate your help.

Interface Labels

The code uses interface labels from the PLY file. How is this "iface" item labeled?
Also, what is the iface_valid_filter function in data.py for?
Thanks for your great work and neat code!
Best,
Chang

Models available for benchmarking

Will other models for the benchmark scripts be made available? Currently only dMaSIF_search_3layer_12A_16dim is in the models directory.

How to predict the interface characteristics of a single protein?

Dear Authors,
This is a really great work! Thank you!
I wonder whether this script can be used on a single protein, and how?
By the way, I'm working on protein-DNA interactions; should I re-train the model? Can this model be adapted to DNA surfaces?

Runtime and memory much higher than described

Hello! I'm trying to run the search model on ~40k protein pairs, and on a single A6000, it takes approximately 50s per pair, and even with a batch size of 1, I often run out of memory (47.5 G total capacity).

Is there any preprocessing you would recommend to speed this up or reduce the memory cost?

Return value in diagonal_ranges function

Hello, I would like to ask about the return value of the diagonal_ranges function, which does not seem consistent with what I would expect:

return ranges_x, slices_x, ranges_y, ranges_y, slices_y, ranges_x

Should it be modified to the following code?

return ranges_x, slices_x, ranges_y, slices_y

tarfile.ReadError: file could not be opened successfully

Thanks for your great work! But I run into trouble when I run a script from benchmark_scripts.
I did not change anything and simply ran python -W ignore -u main_training.py --experiment_name dMaSIF_search_1layer_12A --batch_size 64 --embedding_layer dMaSIF --search True --device cuda:0 --random_rotation True --radius 12.0 --n_layers 1. The terminal shows the error tarfile.ReadError: file could not be opened successfully. The offending line is tar = tarfile.open(self.raw_paths[0]) in data.py.
How can I solve this problem?

In generate_matchinglabels, is it right that distances between xyz coordinates of p1,p2 were used to define labels?

In MaSIF, the labels were defined based on descriptor vectors, but in dMaSIF, the labels are defined based on xyz coordinates. Is that right?

def generate_matchinglabels(args, P1, P2):
    if args.random_rotation:
        P1["xyz"] = torch.matmul(P1["rand_rot"].T, P1["xyz"].T).T + P1["atom_center"]
        P2["xyz"] = torch.matmul(P2["rand_rot"].T, P2["xyz"].T).T + P2["atom_center"]
    xyz1_i = LazyTensor(P1["xyz"][:, None, :].contiguous())
    xyz2_j = LazyTensor(P2["xyz"][None, :, :].contiguous())

    xyz_dists = ((xyz1_i - xyz2_j) ** 2).sum(-1).sqrt()
    xyz_dists = (1.0 - xyz_dists).step()

    p1_iface_labels = (xyz_dists.sum(1) > 1.0).float().view(-1)
    p2_iface_labels = (xyz_dists.sum(0) > 1.0).float().view(-1)

    P1["labels"] = p1_iface_labels
    P2["labels"] = p2_iface_labels

[Question] Are ligands or PTMs considered

Hi, I wanted to ask whether modeled ligands or PTMs (e.g. phosphorylated residues) that are in the same chain as the polypeptide are included in the descriptor calculation?

Cannot replicate results using benchmark scripts and provided model

Hi,
The model provided in models/ does not match any that are used in the benchmarking scripts.
Also, the structure of the state_dict of the provided checkpoint does not match the structure of the model. It does not seem like this checkpoint was produced with this repo.

Consequently, the provided model does not help reproduce the results from the paper. Worse, we only noticed this issue after spending a lot of time trying to get the code to work. At first glance, it looks like there is a pretrained model to work with; sadly, this is not the case.

Problems regarding loss function and auroc

Dear Authors,

Thank you for sharing the work and code! I have some questions regarding the AUC-ROC and loss function part, in

dMaSIF/data_iteration.py, lines 202 to 215 (commit 0dcc26c):

pos_indices = torch.randperm(len(pos_labels))[:n_points_sample]
neg_indices = torch.randperm(len(neg_labels))[:n_points_sample]
pos_preds = pos_preds[pos_indices]
pos_labels = pos_labels[pos_indices]
neg_preds = neg_preds[neg_indices]
neg_labels = neg_labels[neg_indices]
preds_concat = torch.cat([pos_preds, neg_preds])
labels_concat = torch.cat([pos_labels, neg_labels])
loss = F.binary_cross_entropy_with_logits(preds_concat, labels_concat)
return loss, preds_concat, labels_concat

You seem to sample a random subset of positive and negative points to compute the loss and AUC-ROC. For the loss this could be fine, but for the AUC-ROC, did you use the same pipeline on the test dataset when measuring model performance? If so, what is the performance over the entire surface?

Best,
Wenkai

save dmasif_search result

It would be better to add some code to the function save_protein_batch_single(protein_pair_id, P, save_path, pdb_idx) to save the dMaSIF search predictions, such as the sampled_preds variable.

Why "iface_preds " contain NAN when training dmasif_

[Screenshot: 2023-10-23 13:39:04]
It seems that the training is not stable.
I followed benchmark_scripts to retrain dMaSIF_site_3layer_9A, but when calculating the ROC-AUC it raised "Problem with computing roc-auc", and I found that "iface_preds" contains NaN values.
Does anyone have a similar problem?

Conda environment

Would it be possible to provide a conda environment file for this project? It would make trying the package out significantly easier. Thanks for the great project.

Kind regards,
Jonas

Using main_training.py

Hi there, I'd like to know whether it is feasible to train the dMaSIF model without downloading the masif_site_masif_search_pdbs_and_ply_files.tar.gz file, using only the dMaSIF data_preprocessing code.

Can you provide more information on the features in the predfeatures numpy file?

Dear dMaSIF team and users,

I am using the Google Colab version of dMaSIF to get the surface predictions from the model protein pdb files.
Among the outputs of dMaSIF I found predfeatures_emb1.npy file with 34 columns and the corresponding .npy file containing the coordinates.
If I understand correctly, this is an array of surface patches with biochemical features and coordinates of each patch.

Maybe I have overlooked it, but I couldn't find any hints on how to decipher the columns in the file.
Right now they are just numbered from 0 to 33. Could you provide a list of the columns/features, so people can check, e.g., which feature is predicted in column 10 and how it changes between different patches?
It would dramatically increase the usability of these predictions.

Thanks!

How to perform MaSIF-ligand "type" tasks with this framework

Dear Authors,

This is really an amazing body of work, thank you.

Would you have any suggestions as to how to adapt dMaSIF for tasks where we want to predict a feature vector of fixed size (unlike the produced embedding, which, as I understand it, varies with the size of the input molecule)?

I was trying to add an application-specific head to the network, but since the embedding produced always depends on the molecule size, I am having difficulty feeding it into a fixed-size dense network of linear layers.

I would really appreciate any ideas. Maybe it would be very simple and I am missing something conceptually. I hope the question is clear.

All the best,
Oz
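One common approach (a suggestion on my side, not something from the dMaSIF code base) is to collapse the per-point embedding into a fixed-size vector by global pooling before the task-specific head. A minimal sketch:

import torch

def global_pool(point_emb: torch.Tensor) -> torch.Tensor:
    # point_emb: (n_points, emb_dims) per-surface-point embedding; the pooled
    # output size no longer depends on the number of points.
    mean_pool = point_emb.mean(dim=0)
    max_pool = point_emb.max(dim=0).values
    return torch.cat([mean_pool, max_pool], dim=-1)  # shape: (2 * emb_dims,)

# Example head, assuming emb_dims = 16 as in the pretrained models:
head = torch.nn.Sequential(
    torch.nn.Linear(2 * 16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
)
prediction = head(global_pool(torch.randn(1234, 16)))  # works for any n_points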

main_inference result analysis

Hi there, could you tell me how to visualize the .vtk file after finishing main_inference.py on a single PDB, and how to use the other .npy files?

Python version requirements on ubuntu

Hi,

Trying to get an install working on ubuntu. When I

conda create -n dmasif python=3.6.9
conda activate dmasif
pip install -r requirements.txt

I get a complaint:

ERROR: Could not find a version that satisfies the requirement ipython==7.19.0 (from versions: 0.10, 0.10.1, 0.10.2, 0.11, 0.12, 0.12.1, 0.13, 0.13.1, 0.13.2, 1.0.0, 1.1.0, 1.2.0, 1.2.1, 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.3.1, 2.4.0, 2.4.1, 3.0.0, 3.1.0, 3.2.0, 3.2.1, 3.2.2, 3.2.3, 4.0.0b1, 4.0.0, 4.0.1, 4.0.2, 4.0.3, 4.1.0rc1, 4.1.0rc2, 4.1.0, 4.1.1, 4.1.2, 4.2.0, 4.2.1, 5.0.0b1, 5.0.0b2, 5.0.0b3, 5.0.0b4, 5.0.0rc1, 5.0.0, 5.1.0, 5.2.0, 5.2.1, 5.2.2, 5.3.0, 5.4.0, 5.4.1, 5.5.0, 5.6.0, 5.7.0, 5.8.0, 5.9.0, 5.10.0, 6.0.0rc1, 6.0.0, 6.1.0, 6.2.0, 6.2.1, 6.3.0, 6.3.1, 6.4.0, 6.5.0, 7.0.0b1, 7.0.0rc1, 7.0.0, 7.0.1, 7.1.0, 7.1.1, 7.2.0, 7.3.0, 7.4.0, 7.5.0, 7.6.0, 7.6.1, 7.7.0, 7.8.0, 7.9.0, 7.10.0, 7.10.1, 7.10.2, 7.11.0, 7.11.1, 7.12.0, 7.13.0, 7.14.0, 7.15.0, 7.16.0, 7.16.1, 7.16.2, 7.16.3)
ERROR: No matching distribution found for ipython==7.19.0

Fair enough. With

conda create -n dmasif python=3.7.7
conda activate dmasif
pip install -r requirements.txt

I get a new issue -- well, initially pykeops complains that I don't have a numpy version, so I install one, but this is after that has been resolved:

Collecting torch-cluster==1.5.8
  Using cached torch_cluster-1.5.8.tar.gz (38 kB)
    ERROR: Command errored out with exit status 1:
     command: /home/andy/miniconda3/envs/dmasif/bin/python -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-xzwwngxk/torch-cluster_103368352f6d483aaa0cfdcb9667be1e/setup.py'"'"'; __file__='"'"'/tmp/pip-install-xzwwngxk/torch-cluster_103368352f6d483aaa0cfdcb9667be1e/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-f_nokz7k
         cwd: /tmp/pip-install-xzwwngxk/torch-cluster_103368352f6d483aaa0cfdcb9667be1e/
    Complete output (5 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-xzwwngxk/torch-cluster_103368352f6d483aaa0cfdcb9667be1e/setup.py", line 7, in <module>
        import torch
    ModuleNotFoundError: No module named 'torch'
    ----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/ff/39/6524367aed276171ba66675b25b368c1522e1d35e17b96ee4ff4025e1d98/torch_cluster-1.5.8.tar.gz#sha256=a0a32f63faac40a026ab1e9da31f6babdb4d937e53be40bd1c91d9b5a286eee6 (from https://pypi.org/simple/torch-cluster/) (requires-python:>=3.6). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement torch-cluster==1.5.8 (from versions: 0.1.1, 0.2.3, 0.2.4, 1.0.1, 1.0.3, 1.1.1, 1.1.2, 1.1.3, 1.1.4, 1.1.5, 1.2.1, 1.2.2, 1.2.3, 1.2.4, 1.3.0, 1.4.0, 1.4.1, 1.4.2, 1.4.3a1, 1.4.3, 1.4.4, 1.4.5, 1.5.2, 1.5.3, 1.5.4, 1.5.5, 1.5.6, 1.5.7, 1.5.8, 1.5.9, 1.6.0)
ERROR: No matching distribution found for torch-cluster==1.5.8

This all is with a totally fresh environment generated from miniconda, so it's not totally clear to me what the issue might be. Do you happen to have an environment.yml or a docker image that could do the trick?

Longer term, I'm also curious about how to reproduce a couple of specific studies in the paper -- for example, for the model dMaSIF_search_3layer_12A_16dim in the repository, what flags were used to train it? What does it predict? How can I pipe its output layer into an interpretable format? It looks like the scripts for making figures are purely for reproducing the figures given the data, not for repeating the same analysis using dMaSIF and a collection of PDBs. I'd love to (for example) have a script + model to produce the electrostatic surface of a PDB, but it's not clear whether that's available!

embedding example

Hi @FreyrS ! Love this robustified version of masif!
Would you mind providing an example of how to embed a local pdb with dmasif?
Thanks!

Benchmark experiments for site and single protein cannot run.

I tried to run the benchmark experiments. They fail with a dict-style access on a NoneType.

While reviewing the code, I found the following function:

https://github.com/FreyrS/dMaSIF/blob/master/data_iteration.py#:~:text=def%20iterate_surface_precompute(dataset,return%20processed_dataset

where, if single protein is passed as True, P2 is still accessed like a regular dict even though it is of type None, coming from:

https://github.com/FreyrS/dMaSIF/blob/0dcc26c3c218a39d5fe26beb2e788b95fb028896/data_iteration.py#:~:text=def%20process(args,return%20P1%2C%20P2
