salilab / imp-sampcon

Scripts to assess sampling convergence and exhaustiveness

Home Page: https://www.ncbi.nlm.nih.gov/pubmed/29211988

License: GNU General Public License v2.0


imp-sampcon's Introduction

Sampling exhaustiveness protocol


This module implements the sampling exhaustiveness test described in Viswanath et al., 2017. The protocol is primarily designed to work with models generated by the Integrative Modeling Platform (IMP) (and more specifically the IMP::pmi module), but could probably be adapted for other systems.

Dependencies:

pyRMSD is needed. (This is a fork of the original pyRMSD, which is no longer maintained; the fork fixes bugs and adds Python 3 support.)

In the Sali lab, pyRMSD is already built, so it can be used with module load python3/pyrmsd.

imp_sampcon: sampling exhaustiveness test

The protocol is typically run using the imp_sampcon command line tool:

  • imp_sampcon show_stat to show the available fields (e.g. scoring function terms, restraint satisfaction) in an IMP::pmi stat file.
  • imp_sampcon select_good to select a subset of good-scoring models from a set of IMP::pmi trajectories.
  • imp_sampcon plot_score to plot the score distributions of the selected models.
  • imp_sampcon exhaust to analyze the selected models and determine whether sampling was exhaustive.

For a full demonstration of the protocol, see its usage in IMP's actin modeling tutorial.
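In outline, the four subcommands run in sequence. The sketch below uses hypothetical file and directory names; see the tutorial and each subcommand's -h output for real argument lists.

imp_sampcon show_stat run1/output/stat.0.out   # inspect available score fields
imp_sampcon select_good ...                    # select good-scoring models (options below)
imp_sampcon plot_score ...                     # plot distributions of the selected scores
imp_sampcon exhaust ...                        # run the exhaustiveness test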

Selecting models for exhaustiveness test

PMI analysis is the current way to select models for the exhaustiveness test. It selects models based on MCMC equilibration and HDBSCAN density-based clustering of model scores.

In this module, we also have scripts for an earlier method to select models based on thresholds on score terms.

select_good selects a set of good-scoring models based on user-specified score thresholds. The options are described in detail in the actin modeling tutorial.

A few things to note (an example invocation is sketched after this list):

  1. How does one set score thresholds? select_good has two modes to help with this. First, guess lower and upper thresholds for each score term based on individual stat files from sampling, and run select_good in FILTER mode (i.e. without the -e option) to see how many models pass the current thresholds. Plotting these scores with plot_score can also help refine the thresholds. Once the final thresholds are obtained, run select_good in EXTRACT mode (i.e. with the -e option) to extract the corresponding RMFs.

  2. -alt and -aut (aggregate lower and upper thresholds) set the lower and upper thresholds for most score terms, such as Total_Score. -mlt and -mut (member lower and upper thresholds) only need to be set for crosslinks, to specify distance thresholds for each type of crosslink.

  3. -sl (selection list) specifies which scores are used for selection based on the specified thresholds, and -pl (print list) is the extra set of scores printed for each selected model. A score term already in the selection list need not be repeated in the print list. Currently, empty print lists are not allowed.

  4. The script assumes the output stat files from sampling are in $run_directory/$run_prefix*/output, where $run_directory and $run_prefix are specified by -rd and -rp.
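Putting these options together, a hypothetical FILTER-mode run followed by an EXTRACT-mode run might look like the following; the directory layout, the print-list score name, and the threshold values are made up for illustration.

# FILTER mode: report how many models pass the thresholds
imp_sampcon select_good -rd . -rp run -sl Total_Score -pl ConnectivityRestraint_Score -alt 0.0 -aut 300.0

# EXTRACT mode: same thresholds, now extracting the corresponding RMFs
imp_sampcon select_good -rd . -rp run -sl Total_Score -pl ConnectivityRestraint_Score -alt 0.0 -aut 300.0 -e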

Running the exhaustiveness test

See the usage of exhaust in IMP's actin modeling tutorial.

Outputs

The output includes:

  • Convergence of the top score: *.Top_Score_Conv.txt contains the average and standard deviation of the top score (second and third columns) obtained from random subsets of 10%, 20%, ..., 100% of the total number of models (first column). A plotting sketch is shown after this list.
  • Difference in score distributions: *.Score_Hist_*.txt stores the histogram of the score distributions, and *.KS_test.txt contains the effect size D and p-value from the Kolmogorov-Smirnov (KS) test.
  • Chi-square test: actin.ChiSquare_Grid_Stats.txt contains, for each clustering threshold (first column), the p-value, Cramer's V, and the population of models in the contingency table (second, third, and last columns). actin.Sampling_Precision_Stats.txt contains the value of the sampling precision, along with the p-value, Cramer's V, and population of models at the sampling precision.

PDF files with plots of the above results are generated if the gnuplot option (-gp) is specified.

  • Clusters : A directory is created for each cluster (cluster.0, cluster.1 and so on). The indices of models belonging to each cluster x are listed in cluster.x.all.txt, and listed by sample in cluster.x.sample_A.txt and cluster.x.sample_B.txt. Cluster populations are in actin.Cluster_Population.txt, showing the number of models in samples A (second column) and B (third column) in each cluster (first column). Cluster precisions are in actin.Cluster_Precision.txt, with the precision defined by the average RMSD to the cluster centroid. The individual cluster directories contain representative bead models and localization densities.
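Since these outputs are plain text, they can also be inspected or plotted outside of gnuplot. Below is a minimal Python sketch for the top-score convergence file, assuming a system named actin and the three-column layout described above.

import numpy as np
import matplotlib.pyplot as plt

# Columns: subset size, average top score, standard deviation of top score
n, mean, std = np.loadtxt("actin.Top_Score_Conv.txt", unpack=True)
plt.errorbar(n, mean, yerr=std)
plt.xlabel("Number of models in subset")
plt.ylabel("Top score")
plt.savefig("top_score_convergence.png")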

Minor updates to the protocol from the BJ 2017 paper

Note that for both the score distribution test and the chi-square test, the distributions of models in samples A and B are deemed different only if the p-value is significant and the effect size (D or Cramer's V) is also large. See also the Similarity of Scores section in Viswanath et al., 2017.

The model precision is now calculated as the average RMSD between a model in the cluster and the cluster centroid model, rather than using the weighted RMSF as in the original paper.

Special input options

A few special cases of exhaust are described below.

Number of cores

There are two separate number-of-cores arguments in exhaust. One (-c) sets the number of parallel threads used while computing the RMSD matrix; the other (-cc) sets the number of cores used for clustering and for reading the RMF files in parallel. For the latter, especially during clustering, large systems may occupy a significant portion of memory, so it is advisable to keep -cc low in that case. The former is more robust with respect to memory.
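For example, a hypothetical invocation that gives the RMSD matrix calculation many threads while keeping the memory-heavy clustering/reading step conservative (other required arguments elided):

imp_sampcon exhaust ... -c 16 -cc 4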

Clustering without sampling precision calculation

Sometimes one would like to simply obtain clusters at a given threshold, without going through the sampling precision calculation. For example, the user may have already run exhaust once on the same input, found too many clusters at the sampling precision, and now wishes to visualize a smaller number of clusters by clustering at a threshold worse than the sampling precision. In this case, the -s -ct <threshold> options skip the expensive distance matrix generation and sampling precision calculation and directly cluster models at the given threshold.
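For example (the threshold value here is hypothetical, in the same units as the clustering thresholds reported by the first run):

imp_sampcon exhaust ... -s -ct 30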

Alignment

This is usually not necessary if the sampling used an EM map to place the complex. One must not align when specifying a subunit or a selected set of particles (-su or -sn option), since the reference frame is defined by the fixed particles.

Doing the clustering, RMSD, and precision calculation on selected subunits

By default, exhaust considers all subunits for RMSD and clustering. A couple of options allow specifying particular subunits instead. To select a single subunit, use the -su option. To select multiple subunits, or domains of subunits, use the -sn option. Protein domains are specified with start and end residue numbers. Each selection is listed as an entry in a dictionary called density_custom_ranges in a text file, as in the following example.

density_custom_ranges = {"Rpb4":[(1,200,"Rpb4")],"Rpb7":[("Rpb7")],"Rpb11-Rpb14-subcomplex":[("Rpb11"),("Rpb14")]}

If there are multiple protein copies, the copy number is specified in the protein name using the format prot.copy_number. For example:

density_custom_ranges = {"Rpb4":[(1,200,"Rpb4.0")],"Rpb7.1":[(1,710,"Rpb7.1")]}

Note that one usually does not align (-a) when specifying selected subunits, since the frame of reference is determined by the fixed subunits.

Getting localization densities

A density file must be specified. Its syntax is identical to that of the selection text file above (-sn). The output will contain MRC files, one per element of the dictionary in the density file.

The voxel size (-v) and density threshold (-dt) input options control density generation.
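A hypothetical invocation follows; note that the flag for the density file is assumed here to be -d (check imp_sampcon exhaust -h for the exact name), and the voxel size and density threshold values are made up.

imp_sampcon exhaust ... -d density.txt -v 5.0 -dt 0.1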

Symmetry and ambiguity

The protocol can also handle systems with ambiguity and symmetry (e.g. multiple protein copies whose positions can be interchanged), where this information needs to be taken into account when calculating the RMSD between models. The RMSD between two models is then the minimum RMSD over permutations of equivalent proteins.

Note: this option is only needed if the protein copies are symmetric, i.e. interchangeable in the model. If there are multiple copies but they occupy distinct positions in the model, this option will not help and will give the same result as regular exhaust without the ambiguity option.

Example

Suppose a system has 2 copies of protein A and 1 copy of protein B, i.e. the proteins are A.0, A.1, and B.0. The RMSD between any pair of models m0 and m1 is then the minimum of RMSD[m0(A.0,A.1,B.0), m1(A.0,A.1,B.0)] and RMSD[m0(A.0,A.1,B.0), m1(A.1,A.0,B.0)]. Note that the copies of A in m1 were interchanged when calculating the second RMSD.

To implement this, pyRMSD takes an additional argument, symm_groups, which is a list of groups of equivalent particle indices. For the above case, symm_groups has one symmetric group pairing the particle indices of A.0 and A.1: symm_groups = [[[A.0.b0, A.1.b0], [A.0.b1, A.1.b1], ..., [A.0.bn, A.1.bn]]]. Here A.X.bi is the index of the i'th bead in protein A.X; the i'th beads of the two protein copies are considered equivalent particles.
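For concreteness, the following minimal Python sketch (not part of imp-sampcon; the function name and toy indices are hypothetical) builds the nested symm_groups structure described above for two interchangeable copies.

def build_symm_group(bead_indices, copies=("A.0", "A.1")):
    """Pair the i'th beads of each copy as equivalent particles."""
    per_copy = [bead_indices[c] for c in copies]
    n = len(per_copy[0])
    assert all(len(b) == n for b in per_copy), "copies must have equal bead counts"
    # One symmetric group: a list of [A.0.bi, A.1.bi] pairs
    return [[beads[i] for beads in per_copy] for i in range(n)]

bead_indices = {"A.0": [0, 1, 2], "A.1": [3, 4, 5]}  # toy particle indices
symm_groups = [build_symm_group(bead_indices)]
# -> [[[0, 3], [1, 4], [2, 5]]]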

To generate this list of symmetric groups, one passes an additional file via the ambiguity option to the master exhaust script. The file contains one line per symmetric group, with the members of a group separated by white space. See also the example in symminput.

Additionally, an individual element of a symm group can be a complex of multiple molecules, which is then treated as a single unit when trying out permutations to find the minimum RMSD. As an illustration, for a system with A.0, A.1, A.2, B.0, B.1, B.2, the ambiguity file without complexes would look as follows:

A.0 A.1 A.2
B.0 B.1 B.2

There are a total of 64 permutations/swaps to explore, i.e. the final RMSD is the minimum RMSD over these permutations. If A.x and B.x form a complex, specifying this makes the ambiguity file look as follows:

A.0/B.0 A.1/B.1 A.2/B.2

This leaves only 8 permutations to explore.

Note: when using complexes (i.e. A.0/B.0), the script assumes that the selected particles are returned by the Selection on the model Hierarchy in a regular order, i.e. either A.0 A.1 B.0 B.1 or A.0 B.0 A.1 B.1; any other arbitrary order may make the symm groups incorrect.

imp-sampcon's People

Contributors: benmwebb, grandrea, ichem001, rakeshr10, shruthivis, stochastic13, tanmoy7989, varunullanat


imp-sampcon's Issues

Parallelize the I/O step before clustering

Currently, the RMF-reading part of the code (in rmsd_calculation.py) reads each frame sequentially and loads all the particle coordinates into the conform variable for downstream processing. This can be quite slow for a large number of selected models or a large number of particles per model. The coordinate extraction could be parallelized using multiprocessing and the results finally pieced together in the conform array.

Parallelizing the clustering step to improve efficiency

Platform: Linux, Windows
Version: Latest

Currently, during the sampling exhaustiveness check, the clustering of the good-scoring models at the various cutoff thresholds happens sequentially in a for loop, which takes quite a bit of time given a large number of models and a large number of thresholds to check. To improve the time efficiency of the sampcon pipeline, could this step be parallelized, since each clustering is independent?

PS: In case a simple multiprocessing-based parallelization is enough, I am opening a PR to achieve that. If this is sufficient, perhaps the initial steps of parsing and reading the RMF files into memory could be parallelized in a similar way (I can open a PR for that too).
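A minimal sketch of that multiprocessing idea follows; the function names and the per-cutoff clustering body are hypothetical stand-ins for the existing sequential code. The same Pool.map pattern would apply to the RMF frame-reading step.

from functools import partial
from multiprocessing import Pool

def cluster_at_threshold(cutoff, distance_matrix):
    # Placeholder: run the existing per-cutoff clustering here
    return cutoff, []

def cluster_all(cutoffs, distance_matrix, ncores=4):
    # Each cutoff is clustered independently, so they can run in parallel
    work = partial(cluster_at_threshold, distance_matrix=distance_matrix)
    with Pool(ncores) as pool:  # call under `if __name__ == "__main__":` on Windows
        return dict(pool.map(work, cutoffs))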

Use PMI stat file handling functions

Rather than reading stat files with our own code, we should use the IMP.pmi.output.ProcessOutput class. This handles both v1 and v2 stat files, and also RMF files (stat file information can be written into the RMF file itself rather than into a separate text file).
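For reference, a minimal sketch of typical ProcessOutput usage (the stat file name is hypothetical; check the IMP.pmi documentation for exact signatures):

import IMP.pmi.output

po = IMP.pmi.output.ProcessOutput("stat.0.out")
print(po.get_keys())                     # list all available field names
data = po.get_fields(["Total_Score"])    # dict mapping field name to a list of values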

Large memory usage with many cores in clustering

When setting the number of cores, there are three steps where they might be used (RMSD calculation, clustering, I/O). A large number of cores significantly speeds up the RMSD calculation. However, for large systems the clustering step occupies a lot of memory, and using a similar number of cores there can cause memory to overflow, requiring a smaller number. Separating the number-of-cores argument between these different uses should help with this issue.

Add extra metadata to cluster centroid RMF to aid in mmCIF generation

Ideally, we would feed the output cluster centroid RMF(s) directly to our mmCIF generation pipeline. We do add an RMF.ClusterProvenance node to the RMF file with basic information (number of members in the cluster, precision), but in order to fully populate the mmCIF we would also need:

  • Path to each localization density and the residue range(s) it corresponds to
  • Paths to RMF files, frame numbers, and whether they are in sample A or B, for all other models in the cluster (this could simply be a list of indexes plus a pointer or copy of the Identities_A.txt and Identities_B.txt files)

We could add a new RMF node or add extra attributes to the existing ClusterProvenance.

Suspected bug (or historical reasons?) in choosing either p-value or Cramer's V in clustering_rmsd.py at line 194

Is there a historical reason for choosing satisfaction of either p-value or Cramer's v? Ideally, both should be satisfied.

The OR condition at line 194 in clustering_rmsd.py requires satisfaction of only one criterion, not both:

if pvals[i] > 0.05 or cvs[i] < 0.10:

Attachments: prot1.ChiSquare.pdf, prot1.ChiSquare_Grid_Stats.txt

In the attached figure, the p-value was ignored, and the selection was made based on Cramer's V and clustered population.

Context:
File: clustering_rmsd.py
Line number: 194

If this is a bug, the suggested fix is to change OR to AND at line 194 in clustering_rmsd.py:

if pvals[i] > 0.05 and cvs[i] < 0.10:

Multi-element symm groups

Currently, the pyRMSD code expects each symm group to consist of two elements that need to be swapped (or a set of 2-element pairs). For some systems, a multi-element tuple is required as the symm group, which needs some modification of the symm group object fed to the pyRMSD code.

Currently supported symm group file example

mol.0 mol.1
mol2.0 mol2.1

Support needed

mol.0 mol.1 mol.2
mol2.0 mol2.1 mol2.2

The part of pyRMSD/symmtools.py that disallows multi-element groups:

def symm_groups_validation(symm_groups):
    """
    Checks that symmetry groups are well defined (each n-tuple has a
    corresponding symmetric n-tuple)
    """
    try:
        for sg in symm_groups:
            for symm_pair in sg:
                # Only 2-element pairs are accepted here
                if len(symm_pair) != 2:
                    raise Exception
    except Exception:
        raise ValueError('Symmetry groups are not well defined')

Currently, however, there is no test file that checks for multi-element symm groups.
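One possible relaxation is sketched below: require only that all tuples within a group have the same length (at least 2), rather than exactly 2. The downstream pyRMSD calculation code would, of course, also need changes to explore the extra permutations.

def symm_groups_validation(symm_groups):
    """Relaxed check: every tuple in a group must have the same
    length (>= 2), not necessarily exactly 2."""
    for sg in symm_groups:
        sizes = {len(t) for t in sg}
        if len(sizes) != 1 or sizes.pop() < 2:
            raise ValueError('Symmetry groups are not well defined')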

pyRMSD's pairwiseRMSDMatrix not taking ambiguity into account

Bug report from @jkosinski :

pyRMSD.RMSDCalculator.RMSDCalculator.pairwiseRMSDMatrix, and thus get_rmsds_matrix, does not really take ambiguity into account, while pyRMSD.RMSDCalculator.RMSDCalculator.oneVsFollowing does.

As a test, I prepared two RMF files of a complex with three proteins, two copies each. The two files are almost identical, but the chains of one protein are swapped. From imp-sampcon and pairwiseRMSDMatrix I get an RMSD of 114 Å; from oneVsFollowing I get an RMSD of 2 Å.

See test case and script here:
https://oc.embl.de/index.php/s/NhFYx12ebbKe0v1

@ichem001 I think you are the expert here. Can you please help with this?

Feature request: more options for subunit selection in RMSD

Hello,

It would be nice if the --subunit option of MasterSamplingExhaustiveness.py supported multiple subunits or residue ranges, perhaps with a syntax similar to density_ranges.txt. This would allow analysis of cases where movers are disabled in certain parts of the system.

Merge into IMP proper

@shruthivis Since we've now published some studies that use this protocol, we should merge this module into IMP proper so it's available to all. Any objections, or anything you need to clean up before I do that?

Domain densities unavailable

Currently, densities can only be calculated at the subunit/protein level. The code needs to be made more general so that domain-level densities can be displayed as well.

Allow for complexes while setting permutations for ambiguity

In large assemblies with a lot of ambiguity, the permutation space to explore when accounting for ambiguity can be large. Currently, the code only allows setting each molecule as an individual member of a symmetry group. However, allowing a multi-molecule complex to be an individual member (such that the whole complex is swapped together) would greatly reduce the permutation space to explore.

Example: for a system with A.0, A.1, A.2, B.0, B.1, B.2, the ambiguity file would currently look as follows:

A.0 A.1 A.2
B.0 B.1 B.2

There are a total of 64 permutations/swaps to explore. If A.x and B.x are complexed, specifying this should make the ambiguity file look as follows:

A.0/B.0 A.1/B.1 A.2/B.2

This has only 8 permutations to explore.

Consolidate duplicated code

e.g. get_particles_from_superposed_amb is largely cut-and-paste from get_particles_from_superposed; the two functions should be consolidated.

Handle ambiguity

Not sure the code can handle multiple copies of a protein correctly.

Selection of ranges in PDB files?

Hello,

I know this is not designed to do this, but I was wondering if it is possible to generate densities from PDB files by putting chains in the selection:

density_custom_ranges = {"interesting" : [(1,23, "A")]}

where A is the chain ID of those residues. Is it possible to pass this selection syntax to pyRMSD? Thanks!

Ambiguity and alignment do not work

When one tries to compute sampling exhaustiveness with both the --align and --ambiguity flags, the RMSD calculator does not align and only deals with ambiguity.

On a small test system with two (2) rigid bodies and seven (7) flexible beads, the sampling precision is computed to be:

  • 22 Å with the --align flag only, but
  • ~12,000 Å with --ambiguity, with or without the --align flag.
