freyrs / dmasif
License: Other
Hi
Thanks for making your code public.
I want to run main_inference.py for site prediction. The only pretrained model you provide is this one for search prediction
dMaSIF_search_3layer_12A_16dim
When I try to run it for site prediction, it fails with a state_dict loading error:
python -W ignore -u main_inference.py --experiment_name dMaSIF_search_3layer_12A_16dim --batch_size 64 --embedding_layer dMaSIF --site True --emb_dims 16 --device cuda:0 --radius 12.0 --n_layers 3
RuntimeError: Error(s) in loading state_dict for dMaSIF:
Missing key(s) in state_dict: "net_out.0.weight", "net_out.0.bias", "net_out.2.weight", "net_out.2.bias", "net_out.4.weight", "net_out.4.bias".
Unexpected key(s) in state_dict: "orientation_scores2.0.weight", "orientation_scores2.0.bias", "orientation_scores2.2.weight", "orientation_scores2.2.bias", "conv2.layers.0.net_in.0.weight", "conv2.layers.0.net_in.0.bias", "conv2.layers.0.net_in.2.weight", "conv2.layers.0.net_in.2.bias", "conv2.layers.0.norm_in.weight", "conv2.layers.0.norm_in.bias", "conv2.layers.0.conv.0.weight", "conv2.layers.0.conv.0.bias", "conv2.layers.0.conv.2.weight", "conv2.layers.0.conv.2.bias", "conv2.layers.0.net_out.0.weight", "conv2.layers.0.net_out.0.bias", "conv2.layers.0.net_out.2.weight", "conv2.layers.0.net_out.2.bias", "conv2.layers.0.norm_out.weight", "conv2.layers.0.norm_out.bias", "conv2.layers.1.net_in.0.weight", "conv2.layers.1.net_in.0.bias", "conv2.layers.1.net_in.2.weight", "conv2.layers.1.net_in.2.bias", "conv2.layers.1.norm_in.weight", "conv2.layers.1.norm_in.bias", "conv2.layers.1.conv.0.weight", "conv2.layers.1.conv.0.bias", "conv2.layers.1.conv.2.weight", "conv2.layers.1.conv.2.bias", "conv2.layers.1.net_out.0.weight", "conv2.layers.1.net_out.0.bias", "conv2.layers.1.net_out.2.weight", "conv2.layers.1.net_out.2.bias", "conv2.layers.1.norm_out.weight", "conv2.layers.1.norm_out.bias", "conv2.layers.2.net_in.0.weight", "conv2.layers.2.net_in.0.bias", "conv2.layers.2.net_in.2.weight", "conv2.layers.2.net_in.2.bias", "conv2.layers.2.norm_in.weight", "conv2.layers.2.norm_in.bias", "conv2.layers.2.conv.0.weight", "conv2.layers.2.conv.0.bias", "conv2.layers.2.conv.2.weight", "conv2.layers.2.conv.2.bias", "conv2.layers.2.net_out.0.weight", "conv2.layers.2.net_out.0.bias", "conv2.layers.2.net_out.2.weight", "conv2.layers.2.net_out.2.bias", "conv2.layers.2.norm_out.weight", "conv2.layers.2.norm_out.bias", "conv2.linear_layers.0.0.weight", "conv2.linear_layers.0.0.bias", "conv2.linear_layers.0.2.weight", "conv2.linear_layers.0.2.bias", "conv2.linear_layers.1.0.weight", "conv2.linear_layers.1.0.bias", "conv2.linear_layers.1.2.weight", "conv2.linear_layers.1.2.bias", 
"conv2.linear_layers.2.0.weight", "conv2.linear_layers.2.0.bias", "conv2.linear_layers.2.2.weight", "conv2.linear_layers.2.2.bias", "conv2.linear_transform.0.weight", "conv2.linear_transform.0.bias", "conv2.linear_transform.1.weight", "conv2.linear_transform.1.bias", "conv2.linear_transform.2.weight", "conv2.linear_transform.2.bias"
Am I using the command incorrectly, or is the model just not available?
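One way to see what a checkpoint actually contains before loading it is to compare its parameter names with the ones the model expects. The sketch below does this with plain Python sets. `diff_state_dict_keys` is a hypothetical helper, and `shared.weight` is a made-up placeholder key; the other keys are copied from the error message above. In PyTorch, the two key lists would come from `model.state_dict().keys()` and from the loaded checkpoint.

```python
def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) parameter names, each sorted."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)      # expected by the model, absent from the file
    unexpected = sorted(checkpoint_keys - model_keys)   # present in the file, unknown to the model
    return missing, unexpected


# "shared.weight" is a hypothetical placeholder; the rest comes from the error message.
site_model_keys = ["shared.weight", "net_out.0.weight", "net_out.0.bias"]
search_checkpoint_keys = [
    "shared.weight",
    "orientation_scores2.0.weight",
    "conv2.layers.0.net_in.0.weight",
]

missing, unexpected = diff_state_dict_keys(site_model_keys, search_checkpoint_keys)
print("missing:", missing)
print("unexpected:", unexpected)
```

If `net_out.*` entries show up as missing and `conv2.*` entries as unexpected, as in the error above, that would suggest the checkpoint was saved from a model built for search rather than for site prediction.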
Hi, thanks for your great work!
However, I am quite confused about some of the properties. For example, we can extract atom_coords and atom_types from the PDB files. What, then, is the difference between xyz in the PLY files and atom_coords in the PDB files?
As for the faces/triangles, I assume every triple of vertex indices forms one triangle of the surface, right?
I am also curious about the normals in the PLY files. According to the paper, they should be computed by the sampling algorithm, so how can we get them directly from the PLY files?
Sorry for not being familiar with protein surface representations; I hope you can answer my questions.
Thanks!
The atoms_to_points_normals() function is described clearly in your paper, but training on the mesh PLY files from the MaSIF preprocessing scripts skips it.
My question: at inference time, if the model takes a PDB file directly and converts it to points with atoms_to_points_normals(), does it make sense for the input representation to differ between training and inference?
Is the downloaded dataset "masif_site_masif_search_pdbs_and_ply_files.tar.gz" the result of preprocessing the dataset with the earlier MaSIF pipeline?
What should I do if I want to replace it with my own dataset?
python main_inference.py --experiment_name dMaSIF_site_3layer_16dims_9A_0.7res_150sup_epoch85 --embedding_layer dMaSIF --emb_dims 16 --n_layers 3 --resolution 0.7 --radius 9 --single_pdb 1A0G_B_B --site True --device cuda:0
0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
File "main_inference.py", line 86, in
pdb_ids=test_pdb_ids,
File "/home/v/proj/dmasif/content/MaSIF_colab/data_iteration.py", line 286, in iterate
P1_batch, P2_batch = process(args, protein_pair, net)
File "/home/v/proj/dmasif/content/MaSIF_colab/data_iteration.py", line 126, in process
net.preprocess_surface(P1)
File "/home/v/proj/dmasif/content/MaSIF_colab/model.py", line 454, in preprocess_surface
distance=self.args.distance,
File "/home/v/proj/dmasif/content/MaSIF_colab/geometry_processing.py", line 266, in atoms_to_points_normals
atomtypes=atomtypes,
File "/home/v/proj/dmasif/content/MaSIF_colab/geometry_processing.py", line 164, in soft_distances
[170, 110, 152, 155, 180, 190], device=x.device
RuntimeError: legacy constructor expects device type: cpubut device type: cuda was passed
How can I generate the .ply file corresponding to a .pdb file, like the ones provided in the downloaded dataset?
In other words, how does this work convert PDB to PLY?
Hi, I tried to run some of the commands from benchmark_scripts.
However, I ran into some problems:
There is a dimension-related error when the program loads the PyG dataset:
collate TypeError: cat_dim() takes 3 positional arguments but 4 were given
So I modified the code in data.py from
def cat_dim(self, key, value):
to
def cat_dim(self, key, value, *args, **kwargs):
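The `TypeError` quoted above is what Python raises when a caller passes one more positional argument than a method accepts. A minimal sketch, using hypothetical stand-in classes rather than the real PyG data classes, shows why adding `*args, **kwargs` makes the override tolerant of both calling conventions. The returned value `0` is illustrative only.

```python
class OldStylePair:
    # Signature that only accepts (key, value).
    def cat_dim(self, key, value):
        return 0


class TolerantPair:
    # Extra positional/keyword arguments from newer callers are accepted and ignored.
    def cat_dim(self, key, value, *args, **kwargs):
        return 0


def collate_like_caller(obj):
    # Mimics a caller that passes one additional positional argument.
    return obj.cat_dim("xyz_p1", [1.0, 2.0, 3.0], None)


try:
    collate_like_caller(OldStylePair())
except TypeError as err:
    print("old signature fails:", err)

print("tolerant signature returns:", collate_like_caller(TolerantPair()))
```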
atoms_to_points_normals
This then produces errors like:
File exists: '/home/user/.cache/pykeops-1.5-cpython-38//build-pybind11_template-libKeOps_template_40f62f56de'
/home/user/.cache/pykeops-1.5-cpython-38:
formula: Sum_Reduction(Exp(Minus(Sqrt(Sum(Square((Var(0,3,0) - Var(1,3,1))))))),1)
aliases: Var(0,3,0); Var(1,3,1);
dtype : float32
...
make: *** 「KeOps_formula」。 。
--------------------- MAKE DEBUG -----------------
Command '['cmake', '--build', '.', '--target', 'KeOps_formula', '--', 'VERBOSE=1']' returned non-zero exit status 2.
Hi,
I found a link to Hugging Face with pre-trained weights, but it does not seem to be up now. Is there a pre-trained model for this paper on Hugging Face? Thanks in advance.
Hi
I'm trying to use dMaSIF for interaction prediction between proteins (taking a target and finding the best binder in a large collection of potential binders)
At the moment, I process both binder and target molecule identically with dMaSIF up to the convolutional step and export the outputs "xxxx_predfeatures_emb1.npy" and "predcoords.npy" for both proteins.
According to the paper, these features of both binding partners should be passed through a separate convolutional network, allowing the network to find complementary (instead of similar) regions. Unfortunately I was not able to find the code doing that. Could you point me to the right section in the dMaSIF code?
Thanks so much to all contributors
DavidGraber
I noticed that the function diagonal_ranges() in helper.py returns ranges_y and ranges_x twice. Is this necessary or just a mistake?
Hi
I'm trying to run your model on a single protein pair (target + binder). It seems you must supply a y/label column vector with each protein in the pair.
If you set the --single_pdb parameter to a pdb file on the command line, the first function that is called is load_protein_pair(), which tries to access a 'y' attribute on the Data object returned by load_protein_npy():
def load_protein_pair(pdb_id, data_dir,single_pdb=False):
    """Loads a protein surface mesh and its features"""
    pspl = pdb_id.split("_")
    p1_id = pspl[0] + "_" + pspl[1]
    p2_id = pspl[0] + "_" + pspl[2]
    p1 = load_protein_npy(p1_id, data_dir, center=False,single_pdb=single_pdb)
    p2 = load_protein_npy(p2_id, data_dir, center=False,single_pdb=single_pdb)
    # pdist = ((p1['xyz'][:,None,:]-p2['xyz'][None,:,:])**2).sum(-1).sqrt()
    # pdist = pdist<2.0
    # y_p1 = (pdist.sum(1)>0).to(torch.float).reshape(-1,1)
    # y_p2 = (pdist.sum(0)>0).to(torch.float).reshape(-1,1)
    y_p1 = p1["y"] # <- tries to access a y value that will not be there at inference time
    y_p2 = p2["y"]
    ...
It looks like load_protein_npy() tries to set the 'y' attribute on the Data object to None if the --single_pdb parameter is set to a file, but irritatingly, the Data constructor does not set attributes if they are None.
So I just get a key error.
test_dataset = [load_protein_pair(args.single_pdb, NPY_DIR, single_pdb=True)]
test_pdb_ids = [args.single_pdb]
KeyError Traceback (most recent call last)
Cell In [5], line 1
----> 1 test_dataset = [load_protein_pair(args.single_pdb, NPY_DIR, single_pdb=True)]
2 test_pdb_ids = [args.single_pdb]
File ~/git/dMaSIF/data.py:247, in load_protein_pair(pdb_id, data_dir, single_pdb)
242 p2 = load_protein_npy(p2_id, data_dir, center=False,single_pdb=single_pdb)
243 # pdist = ((p1['xyz'][:,None,:]-p2['xyz'][None,:,:])**2).sum(-1).sqrt()
244 # pdist = pdist<2.0
245 # y_p1 = (pdist.sum(1)>0).to(torch.float).reshape(-1,1)
246 # y_p2 = (pdist.sum(0)>0).to(torch.float).reshape(-1,1)
--> 247 y_p1 = p1["y"]
248 y_p2 = p2["y"]
250 protein_pair_data = PairData(
251 xyz_p1=p1["xyz"],
252 xyz_p2=p2["xyz"],
(...)
266 atom_types_p2=p2["atom_types"],
267 )
File ~/mambaforge/envs/PyG-env6/lib/python3.10/site-packages/torch_geometric/data/data.py:444, in Data.__getitem__(self, key)
443 def __getitem__(self, key: str) -> Any:
--> 444 return self._store[key]
...
File ~/mambaforge/envs/PyG-env6/lib/python3.10/site-packages/torch_geometric/data/storage.py:81, in BaseStorage.__getitem__(self, key)
80 def __getitem__(self, key: str) -> Any:
---> 81 return self._mapping[key]
KeyError: 'y'
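A defensive lookup is one possible workaround for the missing `y` attribute. The sketch below uses a hypothetical `FakeData` stand-in that, like the behaviour described above, silently drops attributes whose value is `None`. `get_or_none` is likewise a hypothetical helper and not part of dMaSIF or PyG.

```python
class FakeData:
    """Stand-in for a data container that skips None-valued attributes."""

    def __init__(self, **kwargs):
        self._store = {k: v for k, v in kwargs.items() if v is not None}

    def __getitem__(self, key):
        return self._store[key]  # raises KeyError for dropped attributes


def get_or_none(data, key):
    """Return data[key], or None when the attribute was never stored."""
    try:
        return data[key]
    except KeyError:
        return None


p1 = FakeData(xyz=[[0.0, 0.0, 0.0]], y=None)  # inference case: no labels available
y_p1 = get_or_none(p1, "y")
print(y_p1)
```

With a helper like this, `y_p1 = p1["y"]` could become `y_p1 = get_or_none(p1, "y")`. Downstream code would still need to handle a `None` label.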
Hi,
I have successfully managed to predict on new pdbs using the provided weights. Thanks for providing them.
I am trying to perform a PPI search, so I need to run dMaSIF Site, and then Search. I have used the commands below:
python -W ignore -u main_inference.py --experiment_name dMaSIF_search_3layer_12A_16dim --batch_size 64 --embedding_layer dMaSIF --emb_dims 16 --device cuda:0 --radius 12.0 --n_layers 3 --npy_dir NPY_DIR --search True
python -W ignore -u main_inference.py --experiment_name dMaSIF_site_3layer_16dims_12A_100sup_epoch71 --batch_size 64 --embedding_layer dMaSIF --emb_dims 16 --device cuda:0 --radius 12.0 --n_layers 3 --npy_dir NPY_DIR --site True
I added --npy_dir NPY_DIR for my use case. Basically, it calls load_protein_pair on proteins from a folder, but it only fills the first protein of the pair.
The problem is that for a single protein, the Site and Search predictions have a different number of rows!
Here are my questions:
Thanks!
Hello authors and anyone else who has tried dMaSIF,
do you ever face a CUDA memory leak issue?
I'm training it on a dataset of protein surface patches on an RTX 2080Ti and the GPU VRAM usage creeps up over time. Even using a small batch size to start with, the VRAM usage starts at ~3 GB, but slowly creeps up to 8 GB after several epochs (maybe 10 epochs). This is even with me calling torch.cuda.empty_cache() 10 times during the epoch.
This is the exact error message:
python3: /opt/conda/conda-bld/magma-cuda113_1619629459349/work/interface_cuda/interface.cpp:901: void magma_queue_create_from_cuda_internal(magma_device_t, cudaStream_t, cublasHandle_t, cusparseHandle_t, magma_queue**, const char*, const char*, int): Assertion `queue->dCarray__ != __null' failed.
I have not done any profiling checks as this is not particularly trivial... but I may not have much of a choice. I suspect it is from pykeops? Maybe it is bad at releasing VRAM that it was holding onto, even with tensor.detach(). Honestly, I don't know how well written pykeops is from an engineering standpoint. It was quite difficult to even install it properly and get it to see CUDA, which doesn't inspire confidence. C++ libs can have issues with efficient memory allocation.
I also notice that changing the batch size doesn't change the training speed (in terms of items per second), which is quite suspicious, as usually increasing the batch size increases the throughput.
It would be supremely helpful if anyone has found workarounds around this. Currently, I'm using a bash script that runs the experiment for N times (say 10 times), so even if one crashes halfway, it can resume from the last checkpoint. It works and I can train for many epochs (50+), but it is inconvenient and very hacky, to say the least.
I truly believe works like this are in the right direction for modelling proteins, their properties & interactions. As someone with a chemistry & biology background, this just makes much more sense than learning naively from 1D string sequences and hoping the model somehow generalises. I hope we can work together to improve this work.
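The restart loop described above can also be written in Python. This is a minimal sketch, assuming the training command picks up its last checkpoint on its own when relaunched. `run_with_restarts` is a hypothetical helper, and the command in the example is a placeholder rather than a real dMaSIF invocation.

```python
import subprocess
import sys


def run_with_restarts(cmd, max_runs=10):
    """Run cmd until it exits with code 0 or max_runs is reached.

    Returns the number of runs that were started.
    """
    for attempt in range(1, max_runs + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        print(f"run {attempt} exited with code {result.returncode}, restarting")
    return max_runs


# Placeholder command; replace with the actual training invocation.
runs = run_with_restarts([sys.executable, "-c", "print('training step')"], max_runs=3)
```

This only hides the crash; it does not address the underlying memory growth.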
Hi
I ran the command below (I had to fetch the pretrained model from the dMaSIF_colab repo) to test the site prediction accuracy, and the ROC-AUC averaged 0.62, not the 0.87 you state in the paper. How did you get this figure?
python --experiment_name dMaSIF_site_3layer_16dims_9A_100sup_epoch64 --batch_size 64 --embedding_layer dMaSIF --site True --emb_dims 16 --device cuda:0 --radius 9.0 --n_layers 3
I also ran the same command but for search prediction and I got an average ROC-AUC of ~0.5 (so essentially random).
It would be really helpful if you could explain how to use this tool to get the results you say you obtained in the paper.
Thanks
Hi, thanks for sharing this code repository and this impressive work!
I understand that dMASIF can calculate the coordinates and normal vectors of protein surface vertices as below.
def preprocess_surface(self, P):
    P["xyz"], P["normals"], P["batch"] = atoms_to_points_normals(
        P["atoms"],
        P["batch_atoms"],
        atomtypes=P["atomtypes"],
        resolution=self.args.resolution,
        sup_sampling=self.args.sup_sampling,
        distance=self.args.distance,
    )
    if P['mesh_labels'] is not None:
        project_iface_labels(P)
I would like to know whether dMaSIF can provide information about the faces of the protein surface. If dMaSIF does not have a direct interface to calculate the faces of proteins, do you know of any methods (e.g., Python packages) that can be used to compute the protein's faces based on P["xyz"]? Thanks in advance!
Hi dMaSIF team and all the developers. I ran into a problem when reproducing the code: while pre-processing the training dataset, it appears that all the points are deleted by the step that removes points that may be trapped inside. Looking forward to your replies!
Preprocessing training dataset
0%| | 0/2958 [00:00<?, ?it/s]
Traceback (most recent call last):
File "main_training.py", line 55, in <module>
train_dataset = iterate_surface_precompute(train_loader, net, args)
File "/hy-tmp/dMaSIF-master/data_iteration.py", line 427, in iterate_surface_precompute
P1, P2 = process(args, protein_pair, net)
File "/hy-tmp/dMaSIF-master/data_iteration.py", line 126, in process
net.preprocess_surface(P1)
File "/hy-tmp/dMaSIF-master/model.py", line 457, in preprocess_surface
P["xyz"], P["normals"], P["batch"] = atoms_to_points_normals(
File "/hy-tmp/dMaSIF-master/geometry_processing.py", line 305, in atoms_to_points_normals
points, batch_points = subsample(z, batch_z, scale=resolution)
File "/hy-tmp/dMaSIF-master/geometry_processing.py", line 126, in subsample
batch_size = torch.max(batch).item() + 1 # Typically, =32
RuntimeError: operation does not have an identity.
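The final error comes from taking a maximum over an empty tensor, which is consistent with every surface point having been removed. A minimal sketch of an explicit guard gives a clearer message at the point of failure. It uses plain Python lists as a stand-in for the `batch` tensor, and the helper name `infer_batch_size` is hypothetical.

```python
def infer_batch_size(batch):
    """Return max(batch) + 1, failing loudly if no points are left."""
    if len(batch) == 0:
        raise ValueError(
            "no surface points left after filtering; "
            "check the input structure and the sampling parameters"
        )
    return max(batch) + 1


print(infer_batch_size([0, 0, 1, 1, 2]))

try:
    infer_batch_size([])
except ValueError as err:
    print("guard triggered:", err)
```

A guard like this does not explain why the points vanish, but it identifies the offending protein earlier than the `torch.max` call does.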
Line 55 in 0dcc26c
The readme mentions that the command line options needed to reproduce the benchmarks can be found in benchmark_scripts/. Could it be that this folder is missing?
Where do the following RADII values for the atoms come from?
atomic_radii = torch.cuda.FloatTensor([170, 110, 152, 155, 180, 190], device=x.device)
In the previous mesh based MASIF version, the following values are used for the MSMS program.
radii = {"N": "1.540000", "O": "1.400000", "C": "1.740000", "H": "1.200000", "S": "1.800000", "P": "1.800000",
"Z": "1.39", "X": "0.770000"}
Hi there!
Cool code! I trained a model following your scripts and I wanted to benchmark its performance with the provided notebooks. However, some of the data used in data_analysis/analyse_output.ipynb is not available.
meshpoints = np.load(datafolder/(pdb_id+'_meshpoints.npy'))
How should I get the data mentioned here?
Best,
Eric Alcaide
Hello,
I was wondering if it would be possible to provide documentation on how to use dMaSIF.
For example, what command is needed to run main_inference.py site for PLD1 (4ZQK.pdb)?
More details on how to predict the interactome using main_inference.py search for PLD1 would also be greatly appreciated.
A script for each of these use cases would greatly help in understanding how to operate the software.
This software would be very useful for our group. We would like to find the interactome of many new proteins that we have detected by MS and whose structures we predicted using AlphaFold.
Thank you very much in advance for your help.
My question concerns this function in data_iteration.py:
def generate_matchinglabels(args, P1, P2):
    if args.random_rotation:
        P1["xyz"] = torch.matmul(P1["rand_rot"].T, P1["xyz"].T).T + P1["atom_center"]
        P2["xyz"] = torch.matmul(P2["rand_rot"].T, P2["xyz"].T).T + P2["atom_center"]
    xyz1_i = LazyTensor(P1["xyz"][:, None, :].contiguous())
    xyz2_j = LazyTensor(P2["xyz"][None, :, :].contiguous())
    xyz_dists = ((xyz1_i - xyz2_j) ** 2).sum(-1).sqrt()
    xyz_dists = (1.0 - xyz_dists).step()
    p1_iface_labels = (xyz_dists.sum(1) > 1.0).float().view(-1)
    p2_iface_labels = (xyz_dists.sum(0) > 1.0).float().view(-1)
    P1["labels"] = p1_iface_labels
    P2["labels"] = p2_iface_labels
What is the meaning of xyz_dists.sum(1) > 1.0? I think the values in xyz_dists.sum(1) are less than 1.0, and in most cases probably negative, so p1_iface_labels may have no non-zero values, which would mean there is no interface.
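To make the question concrete, the labelling logic can be replayed on a few points in plain Python. This is a sketch under one stated assumption: that `.step()` is a Heaviside step returning 1 for non-negative inputs and 0 otherwise. Under that reading, `(1.0 - d).step()` is a 0/1 indicator for pairs within a distance of 1.0 (so never negative), the sum counts such neighbours on the other protein, and `> 1.0` requires at least two of them. The function names are hypothetical.

```python
import math


def step(x):
    # Assumed Heaviside convention: 1 for x >= 0, else 0.
    return 1.0 if x >= 0 else 0.0


def interface_labels(points_a, points_b, cutoff=1.0):
    """Label each point of points_a by how many points of points_b lie within cutoff."""
    labels = []
    for p in points_a:
        close = sum(step(cutoff - math.dist(p, q)) for q in points_b)
        labels.append(1.0 if close > 1.0 else 0.0)  # needs at least two close neighbours
    return labels


p1 = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
p2 = [(0.5, 0.0, 0.0), (0.0, 0.5, 0.0), (20.0, 0.0, 0.0)]
print(interface_labels(p1, p2))
print(interface_labels(p2, p1))
```

In this toy example the first point of `p1` has two neighbours within the cutoff and is labelled as interface, while each point of `p2` has at most one and is not.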
Hello authors,
I am in the process of modifying dMaSIF for the downstream task of protein-ligand binding affinity prediction. While reading and modifying your code, I noticed that in data_iteration.iterate (https://github.com/FreyrS/dMaSIF/blob/master/data_iteration.py#L290) we actually extract individual proteins/protein pairs from a batch, and then do a forward pass on each of those.
Effectively, doesn't this equate to a batch_size of 1? Even though the --batch_size argument is set to 64 in the benchmark_scripts, it is not actually used, and the batch_size is hardcoded to 1: https://github.com/FreyrS/dMaSIF/blob/master/main_training.py#L51
Is there a reason for doing this, rather than just doing a forward pass on the entire batch?
As a side note, this line (https://github.com/FreyrS/dMaSIF/blob/master/data_iteration.py#L299) also indicates that the code is hardcoded to a batch_size of 1. My understanding was that it should be
P1["rand_rot"] = protein_pair.rand_rot1.view(-1, 3, 3)[protein_it]
instead of P1["rand_rot"] = protein_pair.rand_rot1.view(-1, 3, 3)[0]
Thank you and appreciate your help.
The code uses interface labels from the PLY file. How is this "iface" item labeled?
Also, what is the iface_valid_filter function in data.py for?
Thanks for your great work and neat code!
Best,
Chang
Will other models for the benchmark scripts be made available? Currently only dMaSIF_search_3layer_12A_16dim is in the models directory.
Dear Authors,
This is a really great work! Thank you!
I wonder whether this script can be used on a single protein, and if so, how?
BTW, I'm working on protein-DNA interactions. Should I re-train the model? Can this model be adapted to DNA surfaces?
Hello! I'm trying to run the search model on ~40k protein pairs, and on a single A6000, it takes approximately 50s per pair, and even with a batch size of 1, I often run out of memory (47.5 G total capacity).
Is there any preprocessing you would recommend to speed this up or reduce the memory cost?
Hello, I would like to ask about the return value in the diagonal_ranges function that does not seem to be consistent with common sense, as follows.
return ranges_x, slices_x, ranges_y, ranges_y, slices_y, ranges_x
Should it be modified to the following code?
return ranges_x, slices_x, ranges_y, slices_y
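For context, KeOps block-sparse reductions take their `ranges` argument as a 6-tuple `(ranges_i, slices_i, redranges_j, ranges_j, slices_j, redranges_i)`, so for a block-diagonal pattern the repeated `ranges_y` and `ranges_x` may be intentional rather than a typo. The sketch below rebuilds such a tuple from batch vectors with plain Python lists. The helper names are hypothetical, and it assumes batch indices are sorted and start at 0.

```python
from collections import Counter


def ranges_slices(batch):
    """Per-batch [start, end) ranges and 1-based slice indices."""
    counts = Counter(batch)
    ranges, start = [], 0
    for b in range(max(batch) + 1):
        end = start + counts[b]
        ranges.append((start, end))
        start = end
    slices = list(range(1, len(ranges) + 1))
    return ranges, slices


def diagonal_ranges_sketch(batch_x, batch_y):
    ranges_x, slices_x = ranges_slices(batch_x)
    ranges_y, slices_y = ranges_slices(batch_y)
    # Block-diagonal: block k of x is reduced against block k of y only,
    # so each reduction range coincides with the other side's range.
    return ranges_x, slices_x, ranges_y, ranges_y, slices_y, ranges_x


print(diagonal_ranges_sketch([0, 0, 1], [0, 1, 1, 1]))
```

Under this reading, shortening the return value to four entries would no longer match the 6-tuple layout.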
Thanks for your great work! However, I ran into trouble when running a script in benchmark_scripts.
I did not change anything, but simply ran python -W ignore -u main_training.py --experiment_name dMaSIF_search_1layer_12A --batch_size 64 --embedding_layer dMaSIF --search True --device cuda:0 --random_rotation True --radius 12.0 --n_layers 1.
The terminal shows the error tarfile.ReadError: file could not be opened successfully. The offending code is tar = tarfile.open(self.raw_paths[0]) in data.py.
How can I solve this problem?
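A `tarfile.ReadError` at this point usually means that the file at `self.raw_paths[0]` is not a readable archive, for example because a download was interrupted or failed. A quick check with the standard library, shown here as a sketch with a hypothetical helper name, can confirm this before training starts. The file name in the example is the dataset archive mentioned elsewhere on this page, and its location is an assumption.

```python
import os
import tarfile


def describe_archive(path):
    """Return a short diagnosis for a downloaded archive."""
    if not os.path.exists(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty"
    if not tarfile.is_tarfile(path):
        return "not a tar archive (possibly a failed download)"
    return "ok"


# Replace with the path that the dataset class reports as raw_paths[0].
print(describe_archive("masif_site_masif_search_pdbs_and_ply_files.tar.gz"))
```

If the result is anything other than "ok", deleting the file and downloading it again is a reasonable first step.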
Hi,
I am quite new to this repo, and I could not work out how to find the labels. Could you help?
Best regards
In masif, the label was defined based on descriptor vectors. But in dmasif, labels are defined based on xyz coordinates. Is it right?
`def generate_matchinglabels(args, P1, P2):
    if args.random_rotation:
        P1["xyz"] = torch.matmul(P1["rand_rot"].T, P1["xyz"].T).T + P1["atom_center"]
        P2["xyz"] = torch.matmul(P2["rand_rot"].T, P2["xyz"].T).T + P2["atom_center"]
    xyz1_i = LazyTensor(P1["xyz"][:, None, :].contiguous())
    xyz2_j = LazyTensor(P2["xyz"][None, :, :].contiguous())
    xyz_dists = ((xyz1_i - xyz2_j) ** 2).sum(-1).sqrt()
    xyz_dists = (1.0 - xyz_dists).step()
    p1_iface_labels = (xyz_dists.sum(1) > 1.0).float().view(-1)
    p2_iface_labels = (xyz_dists.sum(0) > 1.0).float().view(-1)
    P1["labels"] = p1_iface_labels
    P2["labels"] = p2_iface_labels`
labels = P["labels"].view(-1, 1) if P1["labels"] is not None else 0.0 * predictions
Shouldn't P1 be P in the condition above?
Hi, I wanted to ask whether modeled ligands or PTMs (e.g. phosphorylated residues) that are in the same chain as the polypeptide are included in the descriptor calculation?
Hi,
The model provided in models/ does not match any that are used in the benchmarking scripts.
Also, the structure of the state_dict of the provided checkpoint does not match the structure of the model. It does not seem like this checkpoint was produced with this repo.
Consequently, the provided model does not help reproducing the results from the paper. Worse, it's only after having spent lots of time trying to get the code to work that we noticed this issue. At first glance, it looks like we have a pretrained model to work with; but sadly this is not the case.
Dear Authors,
Thank you for sharing the work and code! I have some questions regarding the AUC-ROC and loss function part. In
Lines 202 to 215 in 0dcc26c
you seem to sample a random number of positive and negative samples to compute the loss and AUC-ROC. For the loss, this could be fine, but regarding the AUC-ROC, did you use the same pipeline for the test dataset when measuring the model performance? If so, what is the performance on the entire surface?
Best,
Wenkai
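To make the question concrete: ROC-AUC can be computed on every surface point or on a balanced random subsample of positives and negatives. The sketch below, with hypothetical helper names and synthetic scores, computes both from the pairwise-ranking definition of AUC so the two numbers can be compared directly. It does not reproduce the repository's sampling code.

```python
import random


def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def balanced_subsample_auc(labels, scores, rng):
    """AUC on an equal number of randomly drawn positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    n = min(len(pos), len(neg))
    pos, neg = rng.sample(pos, n), rng.sample(neg, n)
    return roc_auc([1] * n + [0] * n, pos + neg)


# Synthetic, imbalanced example: 20 interface points, 200 non-interface points.
rng = random.Random(0)
labels = [1] * 20 + [0] * 200
scores = [rng.gauss(1.0, 1.0) for _ in range(20)] + [rng.gauss(0.0, 1.0) for _ in range(200)]
print("full surface:", roc_auc(labels, scores))
print("balanced sample:", balanced_subsample_auc(labels, scores, rng))
```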
It would be helpful to add some code to the function save_protein_batch_single(protein_pair_id, P, save_path, pdb_idx) to save dMaSIF search predictions, such as the 'sampled_preds' variable.
Would it be possible to provide a conda environment file for this project? It would make trying out the package significantly easier. Thanks for the great project.
Kind regards,
Jonas
There are no clear instructions for downstream analysis with dMaSIF, such as docking.
Hi there, I'd like to know whether it is feasible to train a dMaSIF model without downloading the masif_site_masif_search_pdbs_and_ply_files.tar.gz file, using only the dMaSIF data_preprocessing code.
Dear dMaSIF team and users,
I am using the Google Colab version of dMaSIF to get the surface predictions from the model protein pdb files.
Among the outputs of dMaSIF I found predfeatures_emb1.npy file with 34 columns and the corresponding .npy file containing the coordinates.
If I understand correctly, this is an array of surface patches with biochemical features and coordinates of each patch.
Maybe I have overlooked it, but couldn't find anywhere hints on how to decipher the columns in the file.
Right now they are numbered from 0 to 33, but can you provide a list of columns/features, so people can check e.g. what feature is predicted in the column 10 and how this changes between different patches.
It would dramatically increase the usability of these predictions.
Thanks!
Dear Authors,
This is really an amazing body of work, thank you.
Would you have any suggestions as to how to adapt dMaSIF for tasks where we want to predict a feature vector of a fixed size (unlike the embedding it creates, which, from my understanding, varies with the size of the input molecule)?
I was trying to add an application-specific head to the network, but the embedding produced always depends on the molecule size, so I am having difficulty feeding it into a fixed-size network of linear layers.
I would really appreciate any ideas. Maybe it would be very simple and I am missing something conceptually. I hope the question is clear.
All the best,
Oz
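One common way to obtain a fixed-size vector from a variable number of per-point embeddings is to pool over the point dimension. A minimal sketch in plain Python, assuming each surface point carries an embedding of the same width. The helper name and the choice of mean plus max pooling are illustrative, not something dMaSIF provides.

```python
def pool_embeddings(point_embeddings):
    """Concatenate mean- and max-pooled features over all points.

    point_embeddings: non-empty list of equal-length lists (n_points x dim).
    Returns a list of length 2 * dim, independent of n_points.
    """
    n_points = len(point_embeddings)
    columns = list(zip(*point_embeddings))  # dim tuples, each with n_points values
    mean_part = [sum(col) / n_points for col in columns]
    max_part = [max(col) for col in columns]
    return mean_part + max_part


small = [[1.0, 2.0], [3.0, 4.0]]                          # 2 points, dim 2
large = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]  # 4 points, dim 2
print(len(pool_embeddings(small)), len(pool_embeddings(large)))
```

The pooled vector has the same length for both inputs, so it can feed a stack of linear layers. In PyTorch the same idea would be a mean or max over the point axis.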
Hi there, could you tell me how to visualize the .vtk file produced by running main_inference.py on a single PDB, and how to use the other .npy files?
Hi,
Trying to get an install working on ubuntu. When I
conda create -n dmasif python=3.6.9
conda activate dmasif
pip install -r requirements.txt
I get a complaint:
ERROR: Could not find a version that satisfies the requirement ipython==7.19.0 (from versions: 0.10, 0.10.1, 0.10.2, 0.11, 0.12, 0.12.1, 0.13, 0.13.1, 0.13.2, 1.0.0, 1.1.0, 1.2.0, 1.2.1, 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.3.1, 2.4.0, 2.4.1, 3.0.0, 3.1.0, 3.2.0, 3.2.1, 3.2.2, 3.2.3, 4.0.0b1, 4.0.0, 4.0.1, 4.0.2, 4.0.3, 4.1.0rc1, 4.1.0rc2, 4.1.0, 4.1.1, 4.1.2, 4.2.0, 4.2.1, 5.0.0b1, 5.0.0b2, 5.0.0b3, 5.0.0b4, 5.0.0rc1, 5.0.0, 5.1.0, 5.2.0, 5.2.1, 5.2.2, 5.3.0, 5.4.0, 5.4.1, 5.5.0, 5.6.0, 5.7.0, 5.8.0, 5.9.0, 5.10.0, 6.0.0rc1, 6.0.0, 6.1.0, 6.2.0, 6.2.1, 6.3.0, 6.3.1, 6.4.0, 6.5.0, 7.0.0b1, 7.0.0rc1, 7.0.0, 7.0.1, 7.1.0, 7.1.1, 7.2.0, 7.3.0, 7.4.0, 7.5.0, 7.6.0, 7.6.1, 7.7.0, 7.8.0, 7.9.0, 7.10.0, 7.10.1, 7.10.2, 7.11.0, 7.11.1, 7.12.0, 7.13.0, 7.14.0, 7.15.0, 7.16.0, 7.16.1, 7.16.2, 7.16.3)
ERROR: No matching distribution found for ipython==7.19.0
Fair enough. With
conda create -n dmasif python=3.7.7
conda activate dmasif
pip install -r requirements.txt
I get a new issue -- well, initially pykeops complains that I don't have a numpy version, so I install one, but this is after that has been resolved:
Collecting torch-cluster==1.5.8
Using cached torch_cluster-1.5.8.tar.gz (38 kB)
ERROR: Command errored out with exit status 1:
command: /home/andy/miniconda3/envs/dmasif/bin/python -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-xzwwngxk/torch-cluster_103368352f6d483aaa0cfdcb9667be1e/setup.py'"'"'; __file__='"'"'/tmp/pip-install-xzwwngxk/torch-cluster_103368352f6d483aaa0cfdcb9667be1e/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-f_nokz7k
cwd: /tmp/pip-install-xzwwngxk/torch-cluster_103368352f6d483aaa0cfdcb9667be1e/
Complete output (5 lines):
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-install-xzwwngxk/torch-cluster_103368352f6d483aaa0cfdcb9667be1e/setup.py", line 7, in <module>
import torch
ModuleNotFoundError: No module named 'torch'
----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/ff/39/6524367aed276171ba66675b25b368c1522e1d35e17b96ee4ff4025e1d98/torch_cluster-1.5.8.tar.gz#sha256=a0a32f63faac40a026ab1e9da31f6babdb4d937e53be40bd1c91d9b5a286eee6 (from https://pypi.org/simple/torch-cluster/) (requires-python:>=3.6). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement torch-cluster==1.5.8 (from versions: 0.1.1, 0.2.3, 0.2.4, 1.0.1, 1.0.3, 1.1.1, 1.1.2, 1.1.3, 1.1.4, 1.1.5, 1.2.1, 1.2.2, 1.2.3, 1.2.4, 1.3.0, 1.4.0, 1.4.1, 1.4.2, 1.4.3a1, 1.4.3, 1.4.4, 1.4.5, 1.5.2, 1.5.3, 1.5.4, 1.5.5, 1.5.6, 1.5.7, 1.5.8, 1.5.9, 1.6.0)
ERROR: No matching distribution found for torch-cluster==1.5.8
This is all with a totally fresh environment generated from miniconda, so it's not totally clear to me what the issue might be. Do you happen to have an environment.yml or a docker image that could do the trick?
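The second failure is consistent with the `torch-cluster` `setup.py` importing `torch` at build time (see the `ModuleNotFoundError` in the log above), so `torch` would need to be installed before the rest of `requirements.txt`. A small pre-flight check, sketched here with a hypothetical helper name, can report which prerequisite modules are not yet importable:

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# numpy and torch are the prerequisites mentioned in the report above.
print("install these first:", missing_modules(["numpy", "torch"]))
```

An empty list means both are importable and the remaining requirements can be attempted.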
Longer term, I'm also curious about how to reproduce a couple of specific studies in the paper. For example, for the model dMaSIF_search_3layer_12A_16dim in the repository: what flags were used to train it? What does it predict? How can I pipe its output layer into an interpretable format? It looks like the scripts for making figures are purely for reproducing the figures given data, not for repeating the same analysis using dMaSIF and a collection of PDBs. I'd love to (for example) have a script + model to produce the electrostatic surface of a PDB, but it's not clear if that's available!
Hi @FreyrS ! Love this robustified version of masif!
Would you mind providing an example of how to embed a local pdb with dmasif?
Thanks!
I tried to run the benchmark experiments, which resulted in a dict-style access on a NoneType object.
While reviewing the code, I saw the following function:
where, if single protein is passed as True, P2 is still indexed like a regular dict even though it is None, from: