salilab / imp-sampcon
Scripts to assess sampling convergence and exhaustiveness
Home Page: https://www.ncbi.nlm.nih.gov/pubmed/29211988
License: GNU General Public License v2.0
Bug report from @jkosinski:
pyRMSD.RMSDCalculator.RMSDCalculator.pairwiseRMSDMatrix, and thus get_rmsds_matrix, does not really take ambiguity into account, while pyRMSD.RMSDCalculator.RMSDCalculator.oneVsFollowing does.
As a test, I prepared two RMF files of a complex of three proteins with two copies each. The two files are almost identical, except that the chains of one protein are swapped. Yet from imp-sampcon and pairwiseRMSDMatrix I get an RMSD of 114 Å, while from oneVsFollowing I get an RMSD of 2 Å.
See test case and script here:
https://oc.embl.de/index.php/s/NhFYx12ebbKe0v1
@ichem001 I think you are the expert here. Can you please help with this?
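For illustration, a minimal numpy sketch (not the pyRMSD code itself) of why ignoring ambiguity inflates the RMSD: two models that differ only by a swap of two identical copies have a huge naive RMSD, but an RMSD of zero once the minimum over copy permutations is taken. The function names here are illustrative.

```python
import itertools
import numpy as np

def rmsd(a, b):
    """Plain coordinate RMSD between two (N, 3) arrays."""
    return np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1)))

def ambiguous_rmsd(a, b, copy_slices):
    """Minimum RMSD over all permutations of equivalent copies.

    copy_slices lists the index ranges of interchangeable copies;
    the blocks of b are reordered before comparing against a.
    """
    best = np.inf
    for perm in itertools.permutations(copy_slices):
        b_perm = np.concatenate([b[s] for s in perm])
        best = min(best, rmsd(a, b_perm))
    return best

# Two "models": identical except that the two copies are swapped in b.
rng = np.random.default_rng(0)
copy1 = rng.random((5, 3))
copy2 = copy1 + 10.0                 # second copy sits far from the first
a = np.concatenate([copy1, copy2])
b = np.concatenate([copy2, copy1])   # chains swapped

slices = [slice(0, 5), slice(5, 10)]
print(rmsd(a, b))                    # large: naive RMSD sees a 10 A shift
print(ambiguous_rmsd(a, b, slices))  # 0: the swap is recognised as equivalent
```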
I can direct it to a null file to get rid of it and replace it with a percentage update.
While setting the number of cores, there are three steps where they might be utilized (RMSD calculation, clustering, I/O). A large number of cores significantly speeds up the RMSD calculation. However, for large systems the clustering step occupies a lot of memory, and using the same number of cores there can exhaust memory, so a smaller number is needed for that step. Separating the number-of-cores argument between its different usages should help with this issue.
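A minimal sketch of what per-stage core arguments could look like; the flag names below are hypothetical, not the actual imp-sampcon interface.

```python
import argparse

# Hypothetical flags splitting the single cores option per stage.
parser = argparse.ArgumentParser()
parser.add_argument("--cores-rmsd", type=int, default=8,
                    help="workers for the RMSD matrix (CPU-bound, scales well)")
parser.add_argument("--cores-cluster", type=int, default=2,
                    help="workers for clustering (memory-bound, keep small)")
parser.add_argument("--cores-io", type=int, default=4,
                    help="workers for reading/writing RMF files")

args = parser.parse_args(["--cores-rmsd", "16", "--cores-cluster", "1"])
print(args.cores_rmsd, args.cores_cluster, args.cores_io)  # 16 1 4
```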
@shruthivis Since we've now published some studies that use this protocol, we should merge this module into IMP proper so it's available to all. Any objections, or anything you need to clean up before I do that?
Platform: Linux, Windows
Version: Latest
Currently, during sampling exhaustiveness checking, the clustering of the good-scoring models at the various cutoff thresholds happens sequentially in a for loop, which takes quite a bit of time given a large number of models and a large number of thresholds to check. To improve the time-efficiency of the sampcon pipeline, can this step be parallelized, since each clustering is independent?
PS: In case a simple multiprocessing-based parallelization is enough, I am opening a PR to achieve that. If this is sufficient, perhaps the initial steps of parsing and reading the RMF files into memory can be parallelized in a similar way (I can open a PR for that too).
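As a sketch, since each per-cutoff clustering is independent, a multiprocessing.Pool can map over the thresholds; cluster_at_threshold here is a placeholder for the real clustering pass, not the imp-sampcon function.

```python
import multiprocessing

def cluster_at_threshold(cutoff):
    """Stand-in for one clustering pass at a given RMSD cutoff;
    the real version would cluster the good-scoring models."""
    return cutoff, int(cutoff)  # placeholder (cutoff, result)

def cluster_all(cutoffs, n_workers=4):
    """Run the independent per-cutoff clusterings concurrently."""
    with multiprocessing.Pool(processes=n_workers) as pool:
        return pool.map(cluster_at_threshold, cutoffs)

if __name__ == "__main__":
    print(cluster_all([5.0, 10.0, 15.0, 20.0]))
```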
In large assemblies, with a lot of ambiguity, the permutation space to explore while accounting for ambiguity can be large. Currently, the code only allows setting each molecule as an individual member of a symmetry group. However, allowing for a multi-molecule complex to be an individual member (such that the whole complex is swapped together) would greatly reduce the permutation space to explore.
Example:
For a system with A.0, A.1, A.2, B.0, B.1, B.2, the ambiguity file currently looks as follows:
A.0 A.1 A.2
B.0 B.1 B.2
There are a total of 64 permutations/swaps to explore. If A.x and B.x form a complex, specifying this should make the ambiguity file look as follows:
A.0/B.0 A.1/B.1 A.2/B.2
This has only 8 permutations to explore.
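A small sketch of how coupled groups could be built from the per-molecule groups: each member of the coupled symmetry group becomes a tuple of molecules that must be swapped together. The helper name is hypothetical.

```python
def couple_groups(*groups):
    """Zip per-molecule symmetry groups into one coupled group.

    Each group lists the interchangeable copies of one molecule; the
    result pairs up the i-th copies so they are swapped as a unit.
    """
    # All molecules being coupled must have the same number of copies.
    assert len({len(g) for g in groups}) == 1, "copy counts must match"
    return list(zip(*groups))

coupled = couple_groups(["A.0", "A.1", "A.2"], ["B.0", "B.1", "B.2"])
print(coupled)
# [('A.0', 'B.0'), ('A.1', 'B.1'), ('A.2', 'B.2')]
```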
Hello,
It would be nice if the --subunit option of MasterSamplingExhaustiveness.py supported multiple subunits or residue ranges, perhaps in a syntax similar to density_ranges.txt. This would allow analysis of cases where movers are disabled in certain parts of the system.
Not sure the code can handle multi-state models currently
Currently densities can only be calculated at the subunit/protein level. Need to make the code more general and be able to display domain-level densities as well.
When one tries to compute sampling exhaustiveness with both the --align and --ambiguity flags, the RMSD calculator does not align and only deals with ambiguity.
On a small test system with two rigid bodies and seven flexible beads, the sampling precision is computed as follows: with the --align flag only, the RMSD is 22 Å, while with --ambiguity (with or without the --align flag) the RMSD is ~12,000 Å.
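For reference, the expected combined behavior is: for each copy permutation, superpose first, then take the minimum RMSD. A self-contained numpy sketch using a Kabsch superposition (illustrative code, not the imp-sampcon implementation):

```python
import itertools
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD of Q onto P after optimal superposition (Kabsch)."""
    P0 = P - P.mean(axis=0)
    Q0 = Q - Q.mean(axis=0)
    H = Q0.T @ P0                          # cross-covariance matrix
    U, S, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # avoid improper rotation
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return np.sqrt(np.mean(np.sum((Q0 @ R.T - P0) ** 2, axis=1)))

def aligned_ambiguous_rmsd(P, Q, copy_slices):
    """Minimum over copy permutations of the aligned RMSD."""
    best = np.inf
    for perm in itertools.permutations(copy_slices):
        Qp = np.concatenate([Q[s] for s in perm])
        best = min(best, kabsch_rmsd(P, Qp))
    return best

# Two copies, swapped, then rotated and shifted as a whole.
rng = np.random.default_rng(0)
copy1 = rng.random((5, 3))
copy2 = copy1 + 10.0
P = np.concatenate([copy1, copy2])
theta = 0.7
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
Q = np.concatenate([copy2, copy1]) @ Rz.T + 3.0
slices = [slice(0, 5), slice(5, 10)]
print(aligned_ambiguous_rmsd(P, Q, slices))  # ~0: align + ambiguity together
```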
Code needs to be cleaned of debug prints.
Rather than reading stat files with our own code, we should use the IMP.pmi.output.ProcessOutput class. This handles both v1 and v2 stat files, and also RMF files (stat file information can be written into the RMF file itself rather than into a separate text file).
Localization probability densities created with both the noalign and align options need to be rotated and translated to fit the experimental EM map. @saijananiganesan also observed that the density maps created with the align option are distorted/incorrect.
Not sure the code can handle multiple copies of a protein correctly
Currently, the RMF-reading part of the code (in rmsd_calculation.py) reads each frame sequentially and loads all the particle coordinates into the conform variable for downstream processing. This can be quite slow for a large number of selected models or a large number of particles per model. The coordinate extraction process can be parallelized using multiprocessing and the results finally pieced together into the conform array.
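A sketch of the chunked approach: split the frame range across workers, read each chunk independently, and concatenate into the conform array. read_frames is a dummy stand-in for the per-chunk RMF reader, which is an assumption here.

```python
import multiprocessing
import numpy as np

N_PARTICLES = 4

def read_frames(frame_range):
    """Stand-in for a per-chunk RMF reader: returns an
    (n_frames, n_particles, 3) coordinate array for its chunk."""
    lo, hi = frame_range
    return np.array([np.full((N_PARTICLES, 3), float(i))
                     for i in range(lo, hi)])

def load_coordinates(n_frames, n_workers=4):
    """Split the frame range into chunks, read them in parallel,
    and piece the results together into one conform array."""
    step = -(-n_frames // n_workers)  # ceiling division
    chunks = [(i, min(i + step, n_frames))
              for i in range(0, n_frames, step)]
    with multiprocessing.Pool(processes=n_workers) as pool:
        parts = pool.map(read_frames, chunks)
    return np.concatenate(parts)     # shape (n_frames, n_particles, 3)

if __name__ == "__main__":
    conform = load_coordinates(10)
    print(conform.shape)  # (10, 4, 3)
```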
Ideally, we would feed the output cluster centroid RMF(s) directly to our mmCIF generation pipeline. We do add an RMF.ClusterProvenance node to the RMF file with basic information (number of members in the cluster, precision), but in order to fully populate the mmCIF we would also need further information (such as that in the Identities_A.txt and Identities_B.txt files). We could add a new RMF node or add extra attributes to the existing ClusterProvenance node.
Running select_good_scoring_models.py creates all input required for Master_Sampling_Exhaustiveness_Analysis.py except scoresA.txt and scoresB.txt.
Currently, the pyRMSD code expects each symm group to have two elements that need to be swapped (or a set of 2-element pairs). For some systems, a multi-element pair would be required as the symm group, for which some modification of the symm group object fed to the pyrmsd code is needed.
Currently supported symm group file example:
mol.0 mol.1
mol2.0 mol2.1
Support needed:
mol.0 mol.1 mol.2
mol2.0 mol2.1 mol2.2
The part of pyRMSD/symmtools.py that disallows multi-element groups:
def symm_groups_validation(symm_groups):
    """
    Checks that symmetry groups are well defined
    (each n-tuple has a corresponding symmetric n-tuple)
    """
    try:
        for sg in symm_groups:
            for symm_pair in sg:
                if len(symm_pair) != 2:
                    raise Exception
    except Exception:
        raise ValueError('Symmetry groups are not well defined')
Currently, however, there is no test file which checks for multi-element symm groups.
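A sketch of a relaxed validator that would accept n-element symmetry groups: instead of requiring exactly two elements per tuple, it requires only that every tuple within a group has the same length, at least two. This is illustrative, not a drop-in patch for pyRMSD.

```python
def symm_groups_validation_multi(symm_groups):
    """Check that each group consists of uniform n-tuples, n >= 2."""
    for sg in symm_groups:
        lengths = {len(t) for t in sg}
        # Mixed tuple sizes within one group, or singletons, are invalid.
        if len(lengths) != 1 or lengths.pop() < 2:
            raise ValueError('Symmetry groups are not well defined')

# Pairwise groups still validate...
symm_groups_validation_multi([[("mol.0", "mol.1"), ("mol2.0", "mol2.1")]])
# ...and so do three-element groups like "mol.0 mol.1 mol.2".
symm_groups_validation_multi([[("mol.0", "mol.1", "mol.2")]])
print("ok")
```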
E.g., get_particles_from_superposed_amb is largely cut-and-paste from get_particles_from_superposed; the two functions should be consolidated.
Hello,
I know this is not designed to do this, but I was wondering if it is possible to generate densities from PDB files by including chains in the selection:
density_custom_ranges = {"interesting" : [(1,23, "A")]}
where A is the chain ID of those residues. Is it possible to pass this selection syntax to pyRMSD? Thanks!