debbiemarkslab / evcouplings


Evolutionary couplings from protein and RNA sequence alignments

Home Page: http://evcouplings.org

License: Other

Python 45.38% Jupyter Notebook 54.62%

3d-structure protein protein-complexes structure-prediction

evcouplings's Issues

Update output tutorial to cover folding stage

Currently, the settings and output products of the "fold" stage are not explained in notebooks/output_files_tutorial.ipynb and notebooks/running_jobs.ipynb.

compare stage: search using HMM from alignment stage

Current implementation uses jackhmmer only against the sequences of PDB structures. For greater sensitivity, should search with the broader HMM profile generated by searching against Uniref/Uniprot.

To implement,

generate HMM file from the .a2m alignment using hmmbuild (better than using the checkpoints_hmm parameter of the jackhmmer search)
scan HMM against PDB Uniprot sequence database (from SIFTS object) using hmmsearch
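
The two steps above could be sketched as follows. This is only an illustration that assembles the command lines; it assumes the alignment is in A2M format and that hmmbuild and hmmsearch are on the PATH. The function name, the file prefix and the output file names are hypothetical and not part of EVcouplings.

```python
def hmm_search_commands(alignment_file, pdb_seq_db, prefix,
                        hmmbuild="hmmbuild", hmmsearch="hmmsearch"):
    """Return the argument lists for building an HMM and scanning a database.

    alignment_file: alignment from the align stage (assumed A2M format)
    pdb_seq_db:     FASTA file with the Uniprot sequences of PDB chains
    prefix:         hypothetical prefix for the generated files
    """
    hmm_file = prefix + ".hmm"
    table_file = prefix + "_hits.tbl"

    # hmmbuild [options] <hmmfile_out> <msafile>
    build_cmd = [hmmbuild, "--informat", "a2m", hmm_file, alignment_file]

    # hmmsearch [options] <hmmfile> <seqdb>; --tblout writes a per-target table
    search_cmd = [hmmsearch, "--tblout", table_file, hmm_file, pdb_seq_db]

    return build_cmd, search_cmd
```

Each list can then be passed to subprocess.run(cmd, check=True), so that a failing HMMER call raises instead of being silently ignored.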

multimer_dists() fails for pdb 5o2u

Multimer distance calculation fails for hits to the protein P12497 (code and error message below). We believe this is due to pdb id 5o2u, as this is the only structure hit with multiple chains. Structure hits and mapping files are attached.

We have tried the following:

verified that sifts and pdb files exist and do not appear to have anything strange about them:
http://files.rcsb.org/view/5o2u.pdb
http://mmtf.rcsb.org/v1.0/full/5o2u
Tried calculating the distance map using the by_pdb_id method. This does not return any error.

Potentially of note: this error was originally returned on orchestra but I was able to reproduce it locally. Benni was not able to reproduce the error locally but got a different error that also seems to be related to missing coordinates.

Code:

uid = 'P12497'
s = SIFTS("/Users/AG/databases/pdb_chain_uniprot_plus_2017_08_03.csv", 
    "/Users/AG/databases/pdb_chain_uniprot_plus_2017_08_03.fa")
selected_structures = s.by_alignment(
    reduce_chains=False, sequence_id=uid, region=(378, 432),
    jackhmmer="/Applications/hmmer-3.1b2-macosx-intel/binaries/jackhmmer",
    
)
c = multimer_dists(sifts_result=selected_structures,intersect=False)

Error:

IndexError                                Traceback (most recent call last)
<ipython-input-14-2902a00c3a21> in <module>()
----> 1 c = multimer_dists(sifts_result=selected_structures,intersect=False)
      2 c.contacts()

/Users/AG/marks_lab_scripts/EVcouplings/evcouplings/compare/distances.py in multimer_dists(sifts_result, structures, atom_filter, intersect, output_prefix, model, raise_missing)
    876             # is close in some combination)
    877             distmap_sym = DistanceMap.aggregate(
--> 878                 distmap, distmap.transpose(), intersect=intersect
    879             )
    880             distmap_sym.symmetric = True

/Users/AG/marks_lab_scripts/EVcouplings/evcouplings/compare/distances.py in aggregate(cls, intersect, agg_func, *matrices)
    597             )
    598 
--> 599             new_mat[k][i_agg, j_agg] = m.dist_matrix[i_src, j_src]
    600 
    601         # aggregate

IndexError: arrays used as indices must be of integer (or boolean) type

nucleocapsid_b0.8_mapping1737.txt
nucleocapsid_b0.8_mapping1738.txt
nucleocapsid_b0.8_structure_hits.txt

Dependencies licences

Check licenses for tools and report here

As from #74

For this purpose, I propose listing all external dependencies here and slowly commenting with their licenses.

Dependencies:

CNS
psipred

Add a license

contenders: MIT, BSD 3-clause

Weird errors in fold

This one:

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/multiprocessing/pool.py", line 47, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/cns.py", line 573, in cns_dgsa_fold
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/cns.py", line 274, in cns_generate_easy_inp
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/cns.py", line 101, in _cns_render_template
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/utils/system.py", line 129, in verify_resources
    "{}:\n{}".format(message, ", ".join(invalid))
evcouplings.utils.system.ResourceError: CNS template does not exist: /groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/cns_templates/generate_easy.inp:
/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/cns_templates/generate_easy.inp
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/utils/pipeline.py", line 453, in execute_wrapped
    outcfg = execute(**config)
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/utils/pipeline.py", line 173, in execute
    outcfg = runner(**incfg)
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/protocol.py", line 606, in run
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/protocol.py", line 486, in standard
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/multiprocessing/pool.py", line 268, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/multiprocessing/pool.py", line 608, in get
    raise self._value
evcouplings.utils.system.ResourceError: CNS template does not exist: /groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/cns_templates/generate_easy.inp:
/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/cns_templates/generate_easy.inp

This one:

Traceback (most recent call last):
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/utils/pipeline.py", line 453, in execute_wrapped
    outcfg = execute(**config)
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/utils/pipeline.py", line 173, in execute
    outcfg = runner(**incfg)
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/protocol.py", line 606, in run
    return PROTOCOLS[kwargs["protocol"]](**kwargs)
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/protocol.py", line 504, in standard
    shutil.copy(file_path, fold_dir)
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/shutil.py", line 235, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/shutil.py", line 114, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: 'output/capsid_config_mod_b0.2/fold/aux/capsid_config_mod_b0.2_significant_ECs_0.9_1_hMIN.pdb'

Pfam family size table creation function

Implement evcouplings.align.pfam.create_family_size_table(), parse the numbers from full Pfam-A flat file (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.full.gz).

Before implementing, check that this information is not available in a more straightforward way.
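
A minimal sketch of the parsing step, assuming the flat file is in Stockholm format with one "#=GF AC" (accession) and one "#=GF SQ" (number of sequences) line per family. The function name below is hypothetical and does not reflect the final signature of create_family_size_table().

```python
def parse_family_sizes(lines):
    """Collect (accession, number of sequences) pairs from Stockholm lines."""
    sizes = []
    accession = None
    for line in lines:
        if line.startswith("#=GF AC"):
            # e.g. "#=GF AC   PF00001.21"
            accession = line.split()[2]
        elif line.startswith("#=GF SQ"):
            # e.g. "#=GF SQ   1234"
            sizes.append((accession, int(line.split()[2])))
        elif line.startswith("//"):
            # end of one family record
            accession = None
    return sizes
```

Since the function only iterates over lines, the compressed file can be streamed with gzip.open(path, "rt") without unpacking it first; the text encoding to pass there is something to verify against the actual file.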

Incorporate new coloring of EC enrichment

New function for coloring EC enrichment should be incorporated into pipeline. The new function has more bins and a nicer color scheme.

Clean up couplings/protocol.py

Currently the protocols have quite some code duplication. I already refactored some of it in the complexes_dev branch (standard and complexes protocols). After these changes are merged into master, we should continue to refactor such that mean-field protocol also gets de-duplicated.

statistics summary files not always made

Benni and I have noticed that the "..statistics_summary.pdf" and "..statistics_summary.csv" files are not always created, and that this seems to happen at random. The summary step appears to time out, but we don't know why. Even when runs take just a few hours, the statistics files can take a day to appear.

Restraint weights in folding module (dg_sa)

The original pipeline had a small modification to the scalehot and scalecool modules of CNS to allow different restraint weights for secondary structure distance and EC-based distance restraints:

Old settings in dg_sa.inp:
md.cool.noe=5 (used to scale EC distance restraints)
md.cool.noe2=10 (used to scale secondary structure distance restraints)
These weights were used to rescale the restraints in the first (ECs) and second (sec. struct.) tbl files.

New settings
The reimplementation does not have this modification (only md.cool.noe = 5), since individual restraint weights can be specified in the tbl files. At the moment, however, these weights are not adjusted, which might lead to a lower overall contribution of secondary structure distance restraints (with unclear consequences).

To adjust, in restraints.yml, under secstruct_distance_restraints we should (probably) change weight: 5 to weight: 10. However, it is unclear if this change

is really equivalent to the overall scale factor (md.cool.noe2), and
would have any beneficial effect on folding.

Before making this change, its effect should be tested empirically.

List of files to be stored in database for webserver (front-/backend) consumption

Master config file (job parameters)
subfoldbitscore/align/*_alignment_statistics.csv
*_job_statistics_summary.pdf --> Might be missing

Align

Master alignment statistics file (align\*_alignment_statistics.csv)
Sequences file (align\*.fa)
Alignment file (align\*.a2m)
Frequencies file (align\*_frequencies.csv)

Couplings

Couplings model file (couplings\*.model)
Enrichment file (couplings\*_enrichment_sausage)
Enrichment file (couplings\*_enrichment_sphere)
EV_Zoom file (couplings\*_evzoom.json)

if no compare:

Coupling scores file (couplings\*_CouplingScores.csv)

else:

Coupling scores file (compare\*_CouplingScoresCompared_all.csv)

Compare

Coupling scores file (compare\*_CouplingScoresCompared_longrange.csv)
Structure hits file (compare\*_structure_hits.csv)

Mutate

Mutation effect file (mutate\*_mutate_matrix.csv)

Fold

Sec structure file (fold\*_secondary_structure.csv)
Model ranking file (fold\*_ranking.csv)

if structures available:

Model ranking file (fold\*_comparison.csv) (has more cols)

Feature request: visualize ECs on a one-dimensional string

It would be nice if we can make figures that visualize ECs as arcs between points on a string, especially for proteins where we get some high-confidence ECs but not enough to fold the full protein.

Add ability to input own PDB for comparison

Great feature for unpublished PDBs or comparing own models to others

Add secondary structure prediction from Perry's code to output

Statistics summary csv doesn't always contain all the runs

see marks/projects/murj_ecoli/MURJ_ECOLI for example

Correct PDB file column widths for large structures

When loading a large structure (from mmCIF or MMTF) with so many residues / atoms that their ids exceed the maximum column width in PDB files, and then writing chains from it to a PDB file using evcouplings.compare.pdb.Chain.to_file(), the output file will have shifted columns (i.e. a broken PDB file).

Fix:

Minimally, to_file() should raise an exception when trying to write an invalid file with shifted columns (this is easy to detect based on atom and residue ids).
To still be able to write these chains for use with the pipeline, the function should allow renumbering the atoms (e.g. shifting numbers to start at 1).
If we number atoms in a chain from 1 by default, this might however lead to issues with files that contain multiple chains (e.g. complexes pipeline) because of duplicate atom ids (maxcluster is particularly sensitive to this).
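
A possible shape for the check and the renumbering, as a sketch only. It relies on the fixed-width PDB format reserving 5 columns for the atom serial number and 4 columns for the residue sequence number; the function names are hypothetical and are not part of the Chain class.

```python
ATOM_SERIAL_WIDTH = 5  # columns 7-11 of ATOM/HETATM records
RES_SEQ_WIDTH = 4      # columns 23-26 of ATOM/HETATM records


def validate_pdb_widths(atom_ids, residue_ids):
    """Raise ValueError if any id would overflow its fixed-width column."""
    for atom_id in atom_ids:
        if len(str(atom_id)) > ATOM_SERIAL_WIDTH:
            raise ValueError("Atom id too wide for PDB format: {}".format(atom_id))
    for res_id in residue_ids:
        if len(str(res_id)) > RES_SEQ_WIDTH:
            raise ValueError("Residue id too wide for PDB format: {}".format(res_id))


def renumber_atoms(atom_ids, start=1):
    """Map original atom ids to consecutive numbers beginning at start."""
    return {old: new for new, old in enumerate(atom_ids, start=start)}
```

To avoid duplicate atom ids across chains, a caller writing several chains into one file could pass a different start value for each chain.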

Continuous Integration via Travis?

Perhaps it is a good idea to start thinking about continuous integration and testing.
One very easy way to do this is TravisCI (https://travis-ci.org). I can help set up the CI and write some unit tests.

parse error in uniref100_current.fasta

Hi guys,

the pipeline dies almost immediately when I set off a run. I get the following parse error:

Error: Parse failed (sequence file
/groups/marks/databases/jackhmmer/uniref100/uniref100_current.fasta):
Line 1: unexpected char ; expected FASTA to start with >

I looked at this briefly with Christian, and he suggested that this is caused by some kind of encoding error (I see weird characters when I run 'head uniref100_current.fasta' on the command-line).

Thanks!
Rohan

Create SIFTS mapping file with segments with structural coverage

We had a response from the SIFTS team that they will provide a segment mapping file that maps consecutive Uniprot sequence segments to segments with structural coverage later this year or early next year. In the meantime, we should try to create such a file ourselves based on the residue-level xml files. We could then distribute these precomputed files using the evcouplings_dbupdate tool and as downloads on the lab website.

Requirements

The file should have the format of ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/flatfiles/csv/pdb_chain_uniprot.csv.gz
One PDB chain can have multiple rows in the table
Each row should be one consecutive segment of Uniprot residue numbers that maps to a consecutive segment with actual structural coverage in the ATOM records, and a consecutive SEQRES segment (i.e. no deletions or insertions in any of the numbering spaces inside a segment)

Implementation

Fetch xml files for entire PDB from ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/, best do this using rsync and write all parsing code in a way it only needs to look at updated entries between releases
Write XML Parser that extracts segments and stores them in table; in the process note the following caveats (there will be more...):
- PDB atom index can contain insertion codes (i.e. are strings)
- One chain can contain different Uniprot entries (i.e. one chain can lead to multiple rows in the output table, and each one can have a different Uniprot identifier)
- there is some initial parsing code in the old folding_dev pipeline, but the used XML parser uses insane amounts of RAM very quickly, so choice of parser should be done carefully.
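
The segment extraction itself could look roughly like the sketch below. It assumes the XML parser has already produced one tuple per residue in chain order; that tuple layout and the function name are hypothetical. PDB residue ids are kept as strings because of insertion codes, so consecutiveness is only checked in the Uniprot and SEQRES numbering.

```python
def build_segments(residues):
    """Group residues into consecutive segments with structural coverage.

    residues: iterable of (uniprot_ac, uniprot_pos, seqres_pos, pdb_id, observed)
    where pdb_id is a string (may carry an insertion code) and observed is
    True if the residue has coordinates in the ATOM records.
    """
    segments = []
    current = None
    for uniprot_ac, uniprot_pos, seqres_pos, pdb_id, observed in residues:
        if not observed or uniprot_ac is None:
            # gap in structural coverage or unmapped residue ends the segment
            current = None
            continue

        extends = (
            current is not None
            and current["uniprot_ac"] == uniprot_ac
            and current["uniprot_end"] + 1 == uniprot_pos
            and current["seqres_end"] + 1 == seqres_pos
        )
        if extends:
            current["uniprot_end"] = uniprot_pos
            current["seqres_end"] = seqres_pos
            current["pdb_end"] = pdb_id
        else:
            current = {
                "uniprot_ac": uniprot_ac,
                "uniprot_start": uniprot_pos, "uniprot_end": uniprot_pos,
                "seqres_start": seqres_pos, "seqres_end": seqres_pos,
                "pdb_start": pdb_id, "pdb_end": pdb_id,
            }
            segments.append(current)
    return segments
```

Regarding the RAM caveat, one option worth evaluating is xml.etree.ElementTree.iterparse, calling clear() on elements once they have been processed so that the full tree is never held in memory.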

Add predicted secondary structure to predicted contact maps

can be psipred and perry's from ECs
should have both options

Consistent behaviour for output file checking

Pipeline outputs with the suffix "_file" are implemented in some of the stages to be None if the product could not be created, whereas other stages do not return the key.

Need to make this behaviour consistent, and make sure that the pipeline (re-)runner does not choke when files are set to be None.
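
One way to make this consistent, sketched under the assumption that each stage can declare the output keys it is expected to produce. Both helper names are hypothetical.

```python
def normalize_file_outputs(outcfg, expected_keys):
    """Return a copy of outcfg where every expected "_file" key is present.

    Keys that a stage did not return are set to None, so downstream code
    only has to handle one representation of "product not created".
    """
    normalized = dict(outcfg)
    for key in expected_keys:
        if key.endswith("_file"):
            normalized.setdefault(key, None)
    return normalized


def created_files(outcfg):
    """Select only the "_file" entries that were actually created."""
    return {
        key: value for key, value in outcfg.items()
        if key.endswith("_file") and value is not None
    }
```

The pipeline (re-)runner could then verify only the entries returned by created_files() instead of every "_file" key.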

0.0.2 Release

Since EVcouplings has changed quite a bit over the last months, we should probably release a new official version.

What do you think?

Bug in align.protocol line 378?

I got an invalid syntax error when importing something from evcouplings.align.
The error comes from align.protocol line 378

    info = pd.DataFrame(
        {
            "pos": range(first_index, first_index + alignment.L),
            "target_seq": target_seq,
            "conservation": conservation,
            **fi_cols
        }
    )
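
One possible explanation, assuming the import was run with an interpreter older than Python 3.5: unpacking with ** inside a dict literal was only added in Python 3.5 (PEP 448) and is a syntax error before that. A version-independent way to build the same columns is sketched below; the helper name is hypothetical.

```python
def build_info_columns(first_index, length, target_seq, conservation, fi_cols):
    """Assemble the column dictionary without dict-literal unpacking."""
    columns = {
        "pos": list(range(first_index, first_index + length)),
        "target_seq": list(target_seq),
        "conservation": list(conservation),
    }
    # equivalent to **fi_cols inside the literal, but valid on older interpreters
    columns.update(fi_cols)
    return columns
```

The resulting dictionary can be passed to pd.DataFrame() exactly like the literal in the snippet above.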

Better score column handling

Currently the "cn" score column is the default score column in the entire pipeline. To make this more flexible for use with other scores (e.g. DI score from mean-field protocol), there should be a generic "score" column which the rest of the pipeline after the couplings stage uses. Which score gets assigned to that column is a parameter of the couplings stage (entry like "use_score: cn"). Needs consideration of special score properties (e.g. mixture model does not make sense with DI scores).

lsf submitter seems to block

compute_num_effective_seqs appears to be ignored

I have compute_num_effective_seqs: True in my config file, but Neff does not appear to be calculated at my align stage.

I am not 100% sure that this is not operator error. Here is the call I am using:

evcouplings --yolo -P output/capsid_config_mod -p P12497 -b "0.2, 0.5, 0.8" -t 0.99 -r 133-363 sample_config_align.txt

sample_config_align.txt

Estimate amount of memory needed for compute job

Goal: reduce amount of memory that needs to be requested by estimating upper bound based on length of target sequence.

This feature will be enabled by putting "auto" as value of environment->memory in configuration file.
To estimate, we need to know how long the target sequence is in evcouplings app, which does not touch the sequence (this only first happens in align stage of pipeline). Possible solutions:
- if region is given, can compute length from that
- otherwise, we have to open sequence_file or fetch sequence_id and compute from that
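
A rough sketch of how such an upper bound could be computed from the region. The formula is an assumption: it counts the parameters of a pairwise model with 20 states per position and multiplies by a safety factor. The defaults for bytes per value and overhead are placeholders that would need to be calibrated against real jobs.

```python
import math


def region_length(region):
    """Length of an inclusive (start, end) region, e.g. (133, 363)."""
    start, end = region
    return end - start + 1


def estimate_memory_mb(seq_len, num_states=20, bytes_per_value=8, overhead=2.0):
    """Estimate an upper bound on memory (in MB) for a pairwise model.

    All default values are assumptions for illustration only.
    """
    fields = seq_len * num_states
    couplings = seq_len * (seq_len - 1) // 2 * num_states ** 2
    total_bytes = (fields + couplings) * bytes_per_value * overhead
    return math.ceil(total_bytes / 1024 ** 2)
```

If no region is given, the same function can be applied to the length of the sequence read from sequence_file or fetched via sequence_id.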

ECs represented as arcs on a line

Put UniProt numbering on beginning and end of line
Interactive arcs that display residue numbers and name
Range of colors for arcs related to strength of EC
Optional - predicted secondary structure along line
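
As a starting point for the arc geometry, here is a small sketch that computes the points of a semicircle connecting two residue positions on the line. The function name is hypothetical; plotting, coloring by EC strength and interactivity would be handled by whichever plotting library is chosen.

```python
import math


def arc_points(i, j, num_points=50):
    """Return (x, y) points of a semicircle from position i to position j.

    The arc starts at the smaller position, ends at the larger one, and
    its height equals half the sequence distance between the two.
    """
    center = (i + j) / 2
    radius = abs(j - i) / 2
    points = []
    for k in range(num_points):
        # sweep the angle from pi (left end) to 0 (right end)
        angle = math.pi * (1 - k / (num_points - 1))
        points.append((center + radius * math.cos(angle), radius * math.sin(angle)))
    return points
```

Because the arc height grows with sequence distance, long-range ECs stand out above local ones without extra styling.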

Secondary structure and numbering on contact maps

Put numbering and secondary structure logo on same axis in contact maps for better visualization.

Stage Subfolders

Make subfolders for all output stages

Finding structural "homologs"

We should search with HHPRED for all queries, independent of the alignment depth of the search.
This is meant only as an alert, not as input for further use, and should not be put on the web server.
(Current status of the alignment search is under review by Thomas.)

EVcomplex database file downloading / generation

Necessary for EVcomplex part of pipeline to be usable.

Two escalation levels:

Distribute pre-generated downloadable file using DB update tool (download from lab website)
Include script that generates databases from source files in fully automated way.

For next release, I think having 1) would be sufficient.

Index mapping in non-focus mode

Segments and index remapping in the coupling stage do not accommodate the following case:

Position present in the sequence alignment
No corresponding position present in the target sequence

Since all of the pipeline currently operates in focus mode (and this is the only sensible mode for downstream stages like compare, mutate and fold), this is not an issue currently. If non-focus mode applications are ever integrated into the pipeline that require mapping to a target sequence, this issue has to be addressed.

Make optimal rule for secondary structure vis in contact map

Decide on the best rule for how secondary structure is shown in the compare contact map.

Increase test coverage

Highest priority would be tests that check the sanity of the individual pipeline stage outputs, and that the pipeline runs through from start to end for the major workflows (monomers / complexes).

Create documentation pages

Use https://readthedocs.org/, couple to repository

"theta" not working in command line

Changing theta in the config works fine; changing it on the command line doesn't.

Create Docker container for easier distribution

including all software dependencies, as far as possible
think about how to best handle databases
pay attention to licenses, some of the used tools are under GPL

Coloring contact maps

Color range for EC strengths (discuss what it should be)

Add flag to change proline secondary structure

For prolines that appear in long helices, could be good to have a flag that would change them from H to C in secondary structure prediction file from PSIPRED to allow the helix to bend in fold stage
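
A sketch of what such a flag could trigger, operating on the per-residue state string (H/E/C) rather than on the prediction file itself. The function name and the threshold for what counts as a "long" helix are assumptions.

```python
import re


def break_helices_at_prolines(sequence, sec_struct, min_helix_length=10):
    """Change H to C at prolines inside helices of at least min_helix_length."""
    if len(sequence) != len(sec_struct):
        raise ValueError("Sequence and secondary structure differ in length")

    states = list(sec_struct)
    for helix in re.finditer(r"H+", sec_struct):
        if helix.end() - helix.start() < min_helix_length:
            continue  # short helix, leave untouched
        for pos in range(helix.start(), helix.end()):
            if sequence[pos] == "P":
                states[pos] = "C"
    return "".join(states)
```

Whether a single C state is enough for the helix to bend during folding would need to be checked empirically.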

[ERROR] Seaborn is missing in setup.py as dependency

Improve hmmbuild_and_search with stockholm-formatted alignments

This comprises the following three steps:

Write a function that writes a stockholm-formatted sequence alignment. Function should be modeled after write_fasta in align/alignment.

Additionally, it should provide the option to input a reference coordinate annotation (see pg. 37 of http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf for example) and write that annotation to the output file.

In align/protocol/hmmbuild_and_search, after calling run_hmmsearch, the .sto formatted alignment output by run_hmmsearch isn't guaranteed to have the query sequence contained in it, so we need to patch in the query sequence and then write a new .sto formatted sequence alignment. This will contain the output of hmmsearch plus the query sequence AND RF annotation columns corresponding to the columns of the query sequence.
In compare/sifts/find_homologs, the function _make_hmmsearch_raw_fasta can be removed and the alignment from hmmbuild_and_search can be read in directly.
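
A minimal sketch of the writer from the first step, assuming sequences are given as (identifier, aligned sequence) pairs. The signature is hypothetical and only loosely follows the idea of write_fasta; the reference annotation is written as a "#=GC RF" per-column line.

```python
def write_stockholm(sequences, fileobj, rf_annotation=None):
    """Write an alignment in Stockholm format to an open file handle.

    sequences:     list of (identifier, aligned_sequence) tuples
    rf_annotation: optional per-column reference annotation string
    """
    sequences = list(sequences)
    rf_label = "#=GC RF"
    width = max([len(name) for name, _ in sequences] + [len(rf_label)])

    if rf_annotation is not None:
        for name, seq in sequences:
            if len(seq) != len(rf_annotation):
                raise ValueError("RF annotation length does not match " + name)

    fileobj.write("# STOCKHOLM 1.0\n\n")
    for name, seq in sequences:
        fileobj.write("{} {}\n".format(name.ljust(width), seq))
    if rf_annotation is not None:
        fileobj.write("{} {}\n".format(rf_label.ljust(width), rf_annotation))
    fileobj.write("//\n")
```

After patching the query sequence into the hmmsearch output, the same function could write the new alignment with "x" in the RF line for every column occupied by the query.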

Add EVcomplex

Additional last step in pipeline: store list of files to database

After the pipeline has finished generating all files, there should be a final optional step where files are copied from the FS to a database. This should be as transparent as possible (i.e.: database can be mongo/postgres/...) and the files to be stored are read from some config file.

ValueError: No chains with given name found

I encountered the following error:

Traceback (most recent call last):
  File "/Users/bs224/Dropbox/PostDoc/projects/EVcouplings/test/TestPDB.py", line 33, in test_pbd
    distmap_intra = intra_dists(selected_structures)
  File "/Users/bs224/Dropbox/PostDoc/projects/EVcouplings/evcouplings/compare/distances.py", line 759, in intra_dists
    model
  File "/Users/bs224/Dropbox/PostDoc/projects/EVcouplings/evcouplings/compare/distances.py", line 667, in _prepare_chain
    chain = structures[pdb_id].get_chain(pdb_chain, model)
  File "/Users/bs224/Dropbox/PostDoc/projects/EVcouplings/evcouplings/compare/pdb.py", line 538, in get_chain
    "No chains with given name found"
ValueError: No chains with given name found

which basically breaks the complete pipeline. The culprit was 1OAY...

I propose to ignore the particular PDB structure and continue with the processing via a try and except block. Agreed?
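
The proposal could be sketched like this. The loader is passed in as a callable so the example stays self-contained; in the pipeline it would be the chain lookup from the traceback above. The function name is hypothetical.

```python
import logging


def collect_chains(hits, get_chain):
    """Load chains for all (pdb_id, pdb_chain) hits, skipping failures.

    get_chain: callable(pdb_id, pdb_chain) that raises ValueError when the
    chain cannot be found in the structure.
    """
    chains, skipped = [], []
    for pdb_id, pdb_chain in hits:
        try:
            chains.append(get_chain(pdb_id, pdb_chain))
        except ValueError as exc:
            # keep going, but leave a trace of what was dropped and why
            logging.warning("Skipping %s chain %s: %s", pdb_id, pdb_chain, exc)
            skipped.append((pdb_id, pdb_chain))
    return chains, skipped
```

Returning the skipped hits alongside the chains makes it possible to report them in the stage output instead of dropping them silently.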

Possible inconsistency in CSV export of secondary struct

In this function ( https://github.com/debbiemarkslab/EVcouplings/blob/master/evcouplings/visualize/pairs.py#L752 ), a column with id needs to be defined, but in sec_struct.csv the column's header is just i. Maybe inconsistent?

Add configuration parameter to delete intermediate alignment files

To save space, delete intermediate products of align stage (which are biggest space takers next to .model file):

.sto file
_faw.fasta
.output

Add a parameter, similar to "save_model" in the couplings stage, to enable/disable this.
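
A sketch of the cleanup step, assuming all intermediate files share the job prefix and differ only by the suffixes listed above. The function name and the enabled flag are hypothetical stand-ins for the proposed configuration parameter.

```python
import os


def delete_intermediate_files(prefix, suffixes=(".sto", "_faw.fasta", ".output"),
                              enabled=True):
    """Remove intermediate align-stage files and return the deleted paths."""
    removed = []
    if not enabled:
        return removed
    for suffix in suffixes:
        path = prefix + suffix
        if os.path.isfile(path):
            os.remove(path)
            removed.append(path)
    return removed
```

Files that do not exist are ignored, so the step is safe to repeat when a stage is rerun.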

Interactive contact maps

Add interactive contact maps plotting function - HTML - think about consistency with server whilst doing

Predicted Contact Maps (no structure)

Add to couplings stage