salilab / imp-sampcon

Scripts to assess sampling convergence and exhaustiveness

Home Page: https://www.ncbi.nlm.nih.gov/pubmed/29211988

License: GNU General Public License v2.0

Python 92.13% Gnuplot 4.69% Shell 1.12% CMake 2.06%

imp-sampcon's People

Contributors

benmwebb, grandrea, ichem001, rakeshr10, shruthivis, stochastic13, tanmoy7989, varunullanat

imp-sampcon's Issues

pyRMSD's pairwiseRMSDMatrix not taking ambiguity into account

Bug report from @jkosinski :

pyRMSD.RMSDCalculator.RMSDCalculator.pairwiseRMSDMatrix and thus get_rmsds_matrix does not really take ambiguity into account, while pyRMSD.RMSDCalculator.RMSDCalculator.oneVsFollowing does.

As a test, I prepared two RMF files of a complex of three proteins, two copies each. The two files are almost identical, except that the chains of one protein are swapped. From imp-sampcon and pairwiseRMSDMatrix I get an RMSD of 114 Å, whereas from oneVsFollowing I get an RMSD of 2 Å.

See test case and script here:
https://oc.embl.de/index.php/s/NhFYx12ebbKe0v1
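
To illustrate the kind of discrepancy described above, here is a minimal sketch of an ambiguity-aware RMSD. It is not the pyRMSD or imp-sampcon implementation: the function names, the dict-of-coordinates model layout, and the toy coordinates are all assumptions, and no superposition is performed. It only shows why swapped copies give a large RMSD unless every relabeling of equivalent copies is tried.

```python
import math
from itertools import permutations, product


def rmsd(coords_a, coords_b):
    """Plain RMSD between two equally long lists of (x, y, z) tuples."""
    squared = sum(
        (a - b) ** 2 for p, q in zip(coords_a, coords_b) for a, b in zip(p, q)
    )
    return math.sqrt(squared / len(coords_a))


def flatten(model, order):
    """Concatenate the coordinates of the named copies in the given order."""
    return [xyz for name in order for xyz in model[name]]


def ambiguity_aware_rmsd(model_a, model_b, symm_groups):
    """Minimum RMSD over all relabelings of equivalent copies in model_b.

    Hypothetical helper: each symm group is a list of interchangeable copy
    names. No alignment is done here; a real pipeline may superpose first.
    """
    grouped = {name for group in symm_groups for name in group}
    fixed = [name for name in model_a if name not in grouped]
    order_a = fixed + [name for group in symm_groups for name in group]
    reference = flatten(model_a, order_a)
    best = math.inf
    for choice in product(*(permutations(group) for group in symm_groups)):
        order_b = fixed + [name for perm in choice for name in perm]
        best = min(best, rmsd(reference, flatten(model_b, order_b)))
    return best


# Toy models: the two copies of A are swapped in model_b.
model_a = {"A.0": [(0, 0, 0)], "A.1": [(100, 0, 0)], "B.0": [(0, 50, 0)]}
model_b = {"A.0": [(100, 0, 1)], "A.1": [(0, 0, 1)], "B.0": [(0, 50, 0)]}

naive = ambiguity_aware_rmsd(model_a, model_b, [])
aware = ambiguity_aware_rmsd(model_a, model_b, [["A.0", "A.1"]])
```

With these toy coordinates the naive value is large while the ambiguity-aware value is below 1, which mirrors the qualitative behavior reported for pairwiseRMSDMatrix versus oneVsFollowing.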

@ichem001 I think you are the expert here. Can you please help with this?

Large memory usage with many cores in clustering

The number-of-cores setting may be used in three steps (RMSD calculation, clustering, I/O). A large number of cores significantly speeds up the RMSD calculation. However, for large systems the clustering step occupies a lot of memory, and using a similar number of cores there causes memory to run out, so a smaller number is required for that step. Separating the number-of-cores argument by usage should help with this issue.

Merge into IMP proper

@shruthivis Since we've now published some studies that use this protocol, we should merge this module into IMP proper so it's available to all. Any objections, or anything you need to clean up before I do that?

Parallelizing the clustering step to improve efficiency

Platform: Linux, Windows
Version: Latest

Currently, during sampling exhaustiveness checking, the good-scoring models are clustered at each cutoff threshold sequentially in a for loop. This takes quite a bit of time when there are many models and many thresholds to check. Since each clustering is independent, can this step be parallelized to improve the time-efficiency of the sampcon pipeline?

PS: In case a simple multiprocessing based parallelization is enough, I am opening a PR to achieve that. If this is sufficient, perhaps the initial steps of parsing and reading the RMF files into memory can be parallelized in a similar way (can open a PR for that too).
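
A rough sketch of what such a multiprocessing-based parallelization could look like is given here. It is not the code from the PR: cluster_at_threshold is a toy greedy clustering used as a stand-in, and both function names and the processes parameter are hypothetical. The only point is that each threshold is an independent task that can be handed to a worker pool.

```python
from functools import partial
from multiprocessing import Pool


def cluster_at_threshold(threshold, distances):
    """Toy stand-in clustering: greedily pick the densest remaining model."""
    unassigned = set(range(len(distances)))
    clusters = []
    while unassigned:
        # Center = unassigned model with the most unassigned neighbors
        # within the threshold (lowest index wins ties).
        center = max(
            sorted(unassigned),
            key=lambda i: sum(distances[i][j] <= threshold for j in unassigned),
        )
        members = sorted(j for j in unassigned if distances[center][j] <= threshold)
        clusters.append(members)
        unassigned -= set(members)
    return clusters


def cluster_all_thresholds(distances, thresholds, processes=1):
    """Cluster at every threshold, optionally spreading thresholds over workers."""
    worker = partial(cluster_at_threshold, distances=distances)
    if processes == 1:
        results = list(map(worker, thresholds))
    else:
        # Each threshold is independent, so Pool.map can distribute them.
        with Pool(processes) as pool:
            results = pool.map(worker, thresholds)
    return dict(zip(thresholds, results))


distances = [[0, 1, 10], [1, 0, 10], [10, 10, 0]]
serial = cluster_all_thresholds(distances, [0.5, 2.0, 20.0])
```

When calling this with processes greater than 1 from a script, the call should sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Note that every worker receives its own copy of the distance matrix, which is one reason memory use grows with the number of cores.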

Allow for complexes while setting permutations for ambiguity

In large assemblies, with a lot of ambiguity, the permutation space to explore while accounting for ambiguity can be large. Currently, the code only allows setting each molecule as an individual member of a symmetry group. However, allowing for a multi-molecule complex to be an individual member (such that the whole complex is swapped together) would greatly reduce the permutation space to explore.

Example:
For a system with A.0, A.1, A.2, B.0, B.1, B.2, the ambiguity file currently looks as follows:

A.0 A.1 A.2
B.0 B.1 B.2

There are a total of 64 permutations/swaps to explore. If A.x and B.x are complexed, specifying this should make the ambiguity file look as follows:

A.0/B.0 A.1/B.1 A.2/B.2

This has only 8 permutations to explore.
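
A small sketch of how the proposed "/" syntax might be parsed is shown here. Both helper names are hypothetical, and the count assumes that every ordering of the members on a line is enumerated (n! per line). The tool may enumerate swaps differently, so the absolute totals can differ from the figures quoted above; the reduction from grouping molecules into complexes is the point.

```python
from math import factorial


def parse_ambiguity_line(line):
    """Split one line into members; each member is a tuple of molecule copies.

    "A.0/B.0 A.1/B.1" -> [("A.0", "B.0"), ("A.1", "B.1")]
    """
    return [tuple(member.split("/")) for member in line.split()]


def count_permutations(lines):
    """Number of relabelings if every ordering within each line is tried."""
    total = 1
    for line in lines:
        if line.strip():
            total *= factorial(len(parse_ambiguity_line(line)))
    return total


per_molecule = count_permutations(["A.0 A.1 A.2", "B.0 B.1 B.2"])
per_complex = count_permutations(["A.0/B.0 A.1/B.1 A.2/B.2"])
```

Under this counting assumption, treating each A.x/B.x complex as a single member leaves one group of three members instead of two independent groups of three.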

feature request - more options for subunit selection in RMSD

Hello,

It would be nice if the --subunit option of MasterSamplingExhaustiveness.py supported multiple subunits or residue ranges, perhaps in a syntax similar to density_ranges.txt. This would allow analysis of cases where movers are disabled in certain parts of the system.

Domain densities unavailable

Currently densities can only be calculated at the subunit/protein level. Need to make the code more general and be able to display domain-level densities as well.

Ambiguity and alignment does not work.

When one tries to compute the exhaustiveness of sampling with both the --align and --ambiguity flags, the RMSD calculator does not align and only deals with ambiguity.

On a small test system with two (2) rigid bodies and seven (7) flexible beads, the sampling precision is computed to be:

  • with the --align flag only: 22 Å
  • with --ambiguity, with or without the --align flag: ~12,000 Å

Use PMI stat file handling functions

Rather than reading stat files with our own code, we should use the IMP.pmi.output.ProcessOutput class. This handles both v1 and v2 statfiles, and also RMF files (stat file information can be written into the RMF file itself rather than a separate text file).

Handle ambiguity

Not sure the code can handle multiple copies of a protein correctly

Parallelize the I/O step before clustering

Currently, the RMF-reading part of the code (in rmsd_calculation.py) reads each frame sequentially and loads all the particle-coordinates into the conform variable for downstream processing. This can be quite slow for a large number of selected models or large number of particles per model. The coordinate extraction process can be parallelized using multiprocessing and finally pieced together in the conform array.
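
The idea can be sketched as follows. This does not touch the real RMF API: read_frame_coordinates is a hypothetical stand-in that fabricates placeholder coordinates, and load_conformations and its parameters are likewise assumptions. It only shows that frames can be extracted by independent workers and then pieced together in the original order.

```python
from functools import partial
from multiprocessing import Pool


def read_frame_coordinates(frame_index, n_particles):
    """Stand-in for reading one frame; returns placeholder (x, y, z) tuples."""
    return [(float(frame_index), float(p), 0.0) for p in range(n_particles)]


def load_conformations(frame_indices, n_particles, processes=1):
    """Load every selected frame, preserving the order of frame_indices."""
    reader = partial(read_frame_coordinates, n_particles=n_particles)
    if processes == 1:
        return [reader(i) for i in frame_indices]
    # Pool.map returns results in input order, so the assembled list lines
    # up with frame_indices regardless of which worker finishes first.
    with Pool(processes) as pool:
        return pool.map(reader, frame_indices)


conform = load_conformations([2, 0, 1], n_particles=2)
```

In a real reader each worker would probably need to open the RMF file itself, since open file handles generally cannot be shared across processes; that detail is omitted here.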

Add extra metadata to cluster centroid RMF to aid in mmCIF generation

Ideally, we would feed the output cluster centroid RMF(s) directly to our mmCIF generation pipeline. We do add an RMF.ClusterProvenance node to the RMF file with basic information (number of members in the cluster, precision) but in order to fully populate the mmCIF we would also need

  • Path to each localization density and the residue range(s) it corresponds to
  • Paths to RMF files, frame numbers, and whether they are in sample A or B, for all other models in the cluster (this could simply be a list of indexes plus a pointer or copy of the Identities_A.txt and Identities_B.txt files)

We could add a new RMF node or add extra attributes to the existing ClusterProvenance.

Multi-element symm groups

Currently, the pyRMSD code expects each symm group to have two elements that need to be swapped (or a set of 2-element pairs). Some systems require a symm group with more than two elements, which needs some modification of the symm group object fed to the pyRMSD code.

Currently supported symm group file example

mol.0 mol.1
mol2.0 mol2.1

Support needed

mol.0 mol.1 mol.2
mol2.0 mol2.1 mol2.2

Part in pyRMSD/symmtools.py that disallows the multi-element groups:

def symm_groups_validation( symm_groups):
    """
    Checks that symmetry groups are well defined (each n-tuple has a correspondent symmetric n-tuple)
    """
    try:
        for sg in symm_groups:
            for symm_pair in sg:
                if len(symm_pair) != 2:
                    raise Exception
    except Exception:
        raise ValueError('Symmetry groups are not well defined')

Currently, however, there is no test file which checks for multi-element symm groups.
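
A minimal check along these lines could look like the sketch below. The validation function is copied from the snippet above so that the example is self-contained; the is_valid helper and the exact nesting of the symm group objects are assumptions about how the groups would be passed in.

```python
def symm_groups_validation(symm_groups):
    """Copy of the check quoted above: every tuple must have exactly 2 elements."""
    try:
        for sg in symm_groups:
            for symm_pair in sg:
                if len(symm_pair) != 2:
                    raise Exception
    except Exception:
        raise ValueError('Symmetry groups are not well defined')


def is_valid(symm_groups):
    """Hypothetical helper: True if the validation accepts the groups."""
    try:
        symm_groups_validation(symm_groups)
    except ValueError:
        return False
    return True


# Assumed layout: a list of groups, each holding tuples of equivalent copies.
two_element = [[("mol.0", "mol.1")], [("mol2.0", "mol2.1")]]
three_element = [[("mol.0", "mol.1", "mol.2")], [("mol2.0", "mol2.1", "mol2.2")]]

accepted = is_valid(two_element)
rejected = not is_valid(three_element)
```

As written, the two-element groups pass and the three-element groups raise ValueError, so a test for multi-element support would need the second case to be accepted once the restriction is lifted.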

Consolidate duplicated code

e.g. get_particles_from_superposed_amb is largely cut-and-paste from get_particles_from_superposed; the two functions should be consolidated.

selection of ranges in pdb files?

Hello,

I know this is not designed to do this, but I was wondering if it is possible to generate densities from pdb files by including chains in the selection:

density_custom_ranges = {"interesting" : [(1,23, "A")]}

where A is the chain ID of those residues. Is it possible to pass this selection syntax to pyRMSD? Thanks!
