
debbiemarkslab / evcouplings


Evolutionary couplings from protein and RNA sequence alignments

Home Page: http://evcouplings.org

License: Other

Python 45.38% Jupyter Notebook 54.62%
3d-structure protein protein-complexes structure-prediction

evcouplings's People

Contributors

aaronkollasch, aggreen, b-schubert, channyclaus, joemin, kpgbrock, sacdallago, sophiamersmann, thomashopf, zachcp


evcouplings's Issues

compare stage: search using HMM from alignment stage

The current implementation uses jackhmmer only against the sequences of PDB structures. For greater sensitivity, we should search with the broader HMM profile generated by searching against Uniref/Uniprot.

To implement:

  1. generate an HMM file from the .a2m alignment using hmmbuild (better than using the checkpoints_hmm parameter of the jackhmmer search)
  2. search with the HMM against the PDB Uniprot sequence database (from the SIFTS object) using hmmsearch
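
The two steps above could be sketched roughly as follows. This is a minimal illustration, not the pipeline's implementation: the function names and all file names are hypothetical, the binaries are assumed to be on the PATH, and passing --informat a2m assumes the installed HMMER version accepts that format name.

```python
import subprocess


def build_hmm_commands(alignment_file, hmm_file, seq_db, tblout_file,
                       hmmbuild="hmmbuild", hmmsearch="hmmsearch"):
    """Return the command lines (as argument lists) for steps 1 and 2."""
    # Step 1: build a profile HMM from the alignment
    build_cmd = [hmmbuild, "--informat", "a2m", hmm_file, alignment_file]
    # Step 2: search the sequence database with that profile,
    # writing a per-target hit table
    search_cmd = [hmmsearch, "--tblout", tblout_file, hmm_file, seq_db]
    return build_cmd, search_cmd


def run_commands(commands):
    """Run each command in order; raises CalledProcessError on failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Building the command lists separately from running them keeps the sketch easy to inspect without HMMER installed.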

multimer_dists() fails for pdb 5o2u

Multimer distance calculation fails for hits to the protein P12497 (code and error message below). We believe this is due to pdb id 5o2u, as this is the only structure hit with multiple chains. Structure hits and mapping files are attached.

We have tried the following:

  1. Verified that the SIFTS and PDB files exist and do not appear to have anything unusual about them:
    http://files.rcsb.org/view/5o2u.pdb
    http://mmtf.rcsb.org/v1.0/full/5o2u

  2. Tried calculating the distance map using the by_pdb_id method. This does not return any error.

Potentially of note: this error was originally returned on orchestra but I was able to reproduce it locally. Benni was not able to reproduce the error locally but got a different error that also seems to be related to missing coordinates.

Code:

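from evcouplings.compare.sifts import SIFTS
from evcouplings.compare.distances import multimer_dists

uid = 'P12497'
# load SIFTS mapping table and the matching sequence file
s = SIFTS("/Users/AG/databases/pdb_chain_uniprot_plus_2017_08_03.csv",
    "/Users/AG/databases/pdb_chain_uniprot_plus_2017_08_03.fa")
# find structures by sequence search for the given region
selected_structures = s.by_alignment(
    reduce_chains=False, sequence_id=uid, region=(378, 432),
    jackhmmer="/Applications/hmmer-3.1b2-macosx-intel/binaries/jackhmmer",
)
# this call raises the IndexError shown below
c = multimer_dists(sifts_result=selected_structures, intersect=False)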

Error:

IndexError                                Traceback (most recent call last)
<ipython-input-14-2902a00c3a21> in <module>()
----> 1 c = multimer_dists(sifts_result=selected_structures,intersect=False)
      2 c.contacts()

/Users/AG/marks_lab_scripts/EVcouplings/evcouplings/compare/distances.py in multimer_dists(sifts_result, structures, atom_filter, intersect, output_prefix, model, raise_missing)
    876             # is close in some combination)
    877             distmap_sym = DistanceMap.aggregate(
--> 878                 distmap, distmap.transpose(), intersect=intersect
    879             )
    880             distmap_sym.symmetric = True

/Users/AG/marks_lab_scripts/EVcouplings/evcouplings/compare/distances.py in aggregate(cls, intersect, agg_func, *matrices)
    597             )
    598 
--> 599             new_mat[k][i_agg, j_agg] = m.dist_matrix[i_src, j_src]
    600 
    601         # aggregate

IndexError: arrays used as indices must be of integer (or boolean) type

nucleocapsid_b0.8_mapping1737.txt
nucleocapsid_b0.8_mapping1738.txt
nucleocapsid_b0.8_structure_hits.txt
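
For reference, the same NumPy error message can be reproduced in isolation. The sketch below shows one possible cause only (an assumption, not confirmed for this issue): an empty index array created without an explicit dtype defaults to float64, and NumPy rejects it as an index. The variable names are made up for illustration.

```python
import numpy as np

dist_matrix = np.arange(9, dtype=float).reshape(3, 3)

# An empty array built without a dtype is float64, which is not a valid index
empty_float_idx = np.array([])
try:
    dist_matrix[empty_float_idx, empty_float_idx]
    raised = False
except IndexError:
    raised = True

# Giving the index arrays an integer dtype makes the empty selection valid
empty_int_idx = np.array([], dtype=int)
selected = dist_matrix[empty_int_idx, empty_int_idx]
```

If this is what happens here, it would point to a chain pair with no residues in common after mapping, which would fit the observation about missing coordinates.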

Dependencies licences

Check licenses for tools and report here

As from #74

For this purpose, I propose listing all external dependencies here and slowly commenting with their licenses.

Dependencies:

  • CNS
  • psipred

Weird errors in fold

  1. This one:
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/multiprocessing/pool.py", line 47, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/cns.py", line 573, in cns_dgsa_fold
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/cns.py", line 274, in cns_generate_easy_inp
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/cns.py", line 101, in _cns_render_template
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/utils/system.py", line 129, in verify_resources
    "{}:\n{}".format(message, ", ".join(invalid))
evcouplings.utils.system.ResourceError: CNS template does not exist: /groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/cns_templates/generate_easy.inp:
/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/cns_templates/generate_easy.inp
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/utils/pipeline.py", line 453, in execute_wrapped
    outcfg = execute(**config)
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/utils/pipeline.py", line 173, in execute
    outcfg = runner(**incfg)
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/protocol.py", line 606, in run
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/protocol.py", line 486, in standard
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/multiprocessing/pool.py", line 268, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/multiprocessing/pool.py", line 608, in get
    raise self._value
evcouplings.utils.system.ResourceError: CNS template does not exist: /groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/cns_templates/generate_easy.inp:
/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/cns_templates/generate_easy.inp
  2. This one:
Traceback (most recent call last):
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/utils/pipeline.py", line 453, in execute_wrapped
    outcfg = execute(**config)
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/utils/pipeline.py", line 173, in execute
    outcfg = runner(**incfg)
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/protocol.py", line 606, in run
    return PROTOCOLS[kwargs["protocol"]](**kwargs)
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/protocol.py", line 504, in standard
    shutil.copy(file_path, fold_dir)
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/shutil.py", line 235, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/shutil.py", line 114, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: 'output/capsid_config_mod_b0.2/fold/aux/capsid_config_mod_b0.2_significant_ECs_0.9_1_hMIN.pdb'

Pfam family size table creation function

Implement evcouplings.align.pfam.create_family_size_table(), parse the numbers from full Pfam-A flat file (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.full.gz).

Before implementing, check that this information is not available in a more straightforward way.
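
If parsing the flat file turns out to be necessary, a streaming parser could look roughly like the sketch below. It assumes each family block in the Stockholm file carries a "#=GF AC" line (accession) and a "#=GF SQ" line (number of sequences); the function names are hypothetical and are not the proposed create_family_size_table() itself.

```python
import gzip


def parse_family_sizes(lines):
    """Yield (accession, number_of_sequences) for each family block."""
    accession = None
    for line in lines:
        if line.startswith("#=GF AC"):
            # e.g. "#=GF AC   PF00001.21": keep the accession without version
            accession = line.split()[2].split(".")[0]
        elif line.startswith("#=GF SQ"):
            yield accession, int(line.split()[2])
        elif line.startswith("//"):
            # end of the current family block
            accession = None


def read_family_sizes(path):
    """Stream a gzipped Pfam-A.full file without loading it into memory."""
    # latin-1 decodes any byte, so stray non-ASCII characters do not abort parsing
    with gzip.open(path, "rt", encoding="latin-1") as f:
        return dict(parse_family_sizes(f))
```

Reading line by line matters here because the full file is far too large to hold in memory.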

Clean up couplings/protocol.py

Currently the protocols have quite some code duplication. I already refactored some of it in the complexes_dev branch (standard and complexes protocols). After these changes are merged into master, we should continue to refactor such that mean-field protocol also gets de-duplicated.

statistics summary files not always made

Benni and I have noticed that the "..statistics_summary.pdf" and "..statistics_summary.csv" files are not always created, and it seems to be random when they are missing. They appear to be 'timed out', but we don't know why. Even when runs take just a few hours, the stats files can take a day to appear.

Restraint weights in folding module (dg_sa)

The original pipeline had a small modification to the scalehot and scalecool modules of CNS to allow different restraint weights for secondary structure distance and EC-based distance restraints:

Old settings in dg_sa.inp:
md.cool.noe=5 (used to scale EC distance restraints)
md.cool.noe2=10 (used to scale secondary structure distance restraints)
These weights were used to rescale the restraints in the first (ECs) and second (sec. struct.) tbl files.

New settings
The reimplementation does not have this modification (only md.cool.noe = 5), since individual restraint weights can be specified in the tbl files. At the moment, this adjustment is however not done, which might lead to a lower overall contribution of secondary structure distance restraints (with unclear consequences).

To adjust, in restraints.yml, under secstruct_distance_restraints we should (probably) change weight: 5 to weight: 10. However it is unclear if this change

  • is really equivalent to the overall scale factor (md.cool.noe2), and
  • would have any beneficial effect on folding

Before making this change, its effect should best be tested empirically.

List of files to be stored in database for webserver (front-/backend) consumption

  • Master config file (job parameters)
  • subfoldbitscore/align/*_alignment_statistics.csv
  • *_job_statistics_summary.pdf --> Might be missing

Align

  • Master alignment statistics file (align\*_alignment_statistics.csv)
  • Sequences file (align\*.fa)
  • Alignment file (align\*.a2m)
  • Frequencies file (align\*_frequencies.csv)

Couplings

  • Couplings model file (couplings\*.model)
  • Enrichment file (couplings\*_enrichment_sausage)
  • Enrichment file (couplings\*_enrichment_sphere)
  • EV_Zoom file (couplings\*_evzoom.json)

if no compare:

  • Coupling scores file (couplings\*_CouplingScores.csv)

else:

  • Coupling scores file (compare\*_CouplingScoresCompared_all.csv)

Compare

  • Coupling scores file (compare\*_CouplingScoresCompared_longrange.csv)
  • Structure hits file (compare\*_structure_hits.csv)

Mutate

  • Mutation effect file (mutate\*_mutate_matrix.csv)

Fold

  • Sec structure file (fold\*_secondary_structure.csv)
  • Model ranking file (fold\*_ranking.csv)

if structures available:

  • Model ranking file (fold\*_comparison.csv) (has more cols)

Correct PDB file column widths for large structures

When loading a large structure (from mmCIF or MMTF) whose residue or atom numbers exceed the maximum column width in PDB files, and then writing its chains as a PDB file using evcouplings.compare.pdb.Chain.to_file(), the output file will have shifted columns (i.e. a broken PDB file).

Fix:

  • Minimally, to_file() should raise an exception when trying to write an invalid file with shifted columns (this is easy to detect based on atom and residue ids).
  • To still be able to write these chains for use with the pipeline, the function should allow renumbering the atoms (e.g. shift numbers to start at 1).
  • If we number atoms in a chain from 1 by default, this might however lead to issues with files that contain multiple chains (e.g. complexes pipeline) because of duplicate atoms (maxcluster is particularly sensitive to this).
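
A minimal sketch of the check and the renumbering, assuming integer atom and residue numbers. In the fixed-width PDB format the atom serial field is 5 characters wide and the residue sequence number field is 4 characters wide; the function names below are hypothetical and not part of Chain.to_file().

```python
ATOM_SERIAL_WIDTH = 5  # PDB columns 7-11
RES_SEQ_WIDTH = 4      # PDB columns 23-26


def fits_pdb_columns(atom_ids, residue_ids):
    """True if every number fits its fixed-width PDB field."""
    return (
        all(len(str(a)) <= ATOM_SERIAL_WIDTH for a in atom_ids)
        and all(len(str(r)) <= RES_SEQ_WIDTH for r in residue_ids)
    )


def renumber_atoms(atom_ids, start=1):
    """Assign consecutive atom numbers beginning at `start`."""
    return list(range(start, start + len(atom_ids)))
```

For files with several chains, passing a different `start` per chain (for example, the previous chain's last number plus one) would avoid the duplicate atom numbers mentioned above.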

parse error in uniref100_current.fasta

Hi guys,

the pipeline dies almost immediately when I set off a run. I get the following parse error:

Error: Parse failed (sequence file
/groups/marks/databases/jackhmmer/uniref100/uniref100_current.fasta):
Line 1: unexpected char ; expected FASTA to start with >

I looked at this briefly with Christian, and he suggested that this is caused by some kind of encoding error (I see weird characters when I run 'head uniref100_current.fasta' on the command-line).

Thanks!
Rohan

Create SIFTS mapping file with segments with structural coverage

We had a response from the SIFTS team that they will provide a segment mapping file that maps consecutive Uniprot sequence segments to segments with structural coverage later this year or early next year. In the meantime, we should try to create such a file ourselves based on the residue-level XML files. We could then distribute these precomputed files using the evcouplings_dbupdate tool and as downloads on the lab website.

Requirements

  • The file should have the format of ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/flatfiles/csv/pdb_chain_uniprot.csv.gz
  • One PDB chain can have multiple rows in the table
  • Each row should be one consecutive segment of Uniprot residue numbers that maps to a consecutive segment with actual structural coverage in the ATOM records, and a consecutive SEQRES segment (i.e. no deletions or insertions in any of the numbering spaces inside a segment)

Implementation

  1. Fetch xml files for entire PDB from ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/, best do this using rsync and write all parsing code in a way it only needs to look at updated entries between releases
  2. Write XML Parser that extracts segments and stores them in table; in the process note the following caveats (there will be more...):
    • PDB atom index can contain insertion codes (i.e. are strings)
    • One chain can contain different Uniprot entries (i.e. one chain can lead to multiple rows in the output table, and each one can have a different Uniprot identifier)
    • There is some initial parsing code in the old folding_dev pipeline, but the XML parser used there consumes insane amounts of RAM very quickly, so the choice of parser should be made carefully.
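
Independent of the XML parser chosen, the segment-building step could be sketched as below. The input record layout (uniprot_ac, uniprot_pos, seqres_pos, pdb_pos, observed) is a hypothetical intermediate format, not the SIFTS schema. PDB positions are kept as strings because of insertion codes, so this sketch only checks contiguity in the Uniprot and SEQRES numbering and uses the observed flag for structural coverage.

```python
def find_segments(residues):
    """Group per-residue records into consecutive, structurally covered segments.

    Returns rows of (uniprot_ac, uniprot_start, uniprot_end,
    seqres_start, seqres_end, pdb_start, pdb_end).
    """
    segments = []
    current = []

    def close():
        # turn the open run of residues into one output row
        if current:
            first, last = current[0], current[-1]
            segments.append(
                (first[0], first[1], last[1], first[2], last[2], first[3], last[3])
            )
            current.clear()

    for res in residues:
        uniprot_ac, uni_pos, seqres_pos, pdb_pos, observed = res
        if not observed or uniprot_ac is None:
            # no coordinates or no Uniprot mapping: segment ends here
            close()
            continue
        if current:
            prev = current[-1]
            contiguous = (
                prev[0] == uniprot_ac
                and uni_pos == prev[1] + 1
                and seqres_pos == prev[2] + 1
            )
            if not contiguous:
                close()
        current.append(res)
    close()
    return segments
```

Because the function consumes an iterable one record at a time, it can be fed from an incremental XML parser without holding a whole entry in memory.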

Consistent behaviour for output file checking

Pipeline outputs with the suffix "_file" are implemented in some of the stages to be None if the product could not be created, whereas other stages do not return the key.

Need to make this behaviour consistent, and make sure that the pipeline (re-)runner does not choke when files are set to be None.
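
One way to make the two conventions equivalent is to normalize the output configuration before it is consumed. The helper names below are hypothetical and only illustrate the idea; they are not existing pipeline functions.

```python
def normalize_file_outputs(outcfg, expected_keys):
    """Return a copy in which every expected key exists (None if not created)."""
    normalized = dict(outcfg)
    for key in expected_keys:
        normalized.setdefault(key, None)
    return normalized


def available_files(outcfg):
    """Keep only the "_file" entries that actually point to a product."""
    return {
        key: value
        for key, value in outcfg.items()
        if key.endswith("_file") and value is not None
    }
```

With this, a missing key and a key set to None are treated the same way downstream.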

0.0.2 Release

Since EVcouplings has changed quite a bit over the last months, we should probably release a new official version.

What do you think?

Bug in align.protocol line 378?

I got an invalid syntax error when importing something from evcouplings.align.
The error comes from align.protocol line 378

    info = pd.DataFrame(
        {
            "pos": range(first_index, first_index + alignment.L),
            "target_seq": target_seq,
            "conservation": conservation,
            **fi_cols
        }
    )
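
One likely cause (an assumption, since the issue does not state the Python version) is an interpreter older than 3.5: unpacking with ** inside a dict literal was only added in Python 3.5. A sketch of an equivalent construction that avoids the syntax, using made-up example values:

```python
# Hypothetical stand-ins for the variables in the snippet above
first_index = 1
target_seq = list("MKV")
conservation = [0.9, 0.5, 0.7]
fi_cols = {"A": [0.1, 0.2, 0.3]}

# Build the fixed columns first, then merge the extra columns with update()
columns = {
    "pos": list(range(first_index, first_index + len(target_seq))),
    "target_seq": target_seq,
    "conservation": conservation,
}
columns.update(fi_cols)
```

The resulting dictionary can be passed to pd.DataFrame() exactly as before.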

Better score column handling

Currently the "cn" score column is the default score column in the entire pipeline. To make this more flexible for use with other scores (e.g. DI score from mean-field protocol), there should be a generic "score" column which the rest of the pipeline after the couplings stage uses. Which score gets assigned to that column is a parameter of the couplings stage (entry like "use_score: cn"). Needs consideration of special score properties (e.g. mixture model does not make sense with DI scores).
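
A rough sketch of the proposed generic column, assuming the EC table is a pandas DataFrame. The function name is hypothetical; only the "use_score" parameter and the "cn" column name come from the proposal above.

```python
import pandas as pd


def assign_score_column(ecs, use_score="cn"):
    """Copy the chosen score into a generic "score" column and sort by it."""
    if use_score not in ecs.columns:
        raise ValueError("Score column not found: {}".format(use_score))
    return ecs.assign(score=ecs[use_score]).sort_values(
        "score", ascending=False
    )
```

Downstream stages would then read only "score". Score-specific steps such as the mixture model would still need to check which score was selected.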

compute_num_effective_seqs appears to be ignored

I have compute_num_effective_seqs: True in my config file, but Neff does not appear to be calculated at my align stage.

I am not 100% sure that this is not operator error. Here is the call I am using:

evcouplings --yolo -P output/capsid_config_mod -p P12497 -b "0.2, 0.5, 0.8" -t 0.99 -r 133-363 sample_config_align.txt

sample_config_align.txt

Estimate amount of memory needed for compute job

Goal: reduce amount of memory that needs to be requested by estimating upper bound based on length of target sequence.

  • This feature will be enabled by putting "auto" as value of environment->memory in configuration file.
  • To estimate, we need to know how long the target sequence is in evcouplings app, which does not touch the sequence (this only first happens in align stage of pipeline). Possible solutions:
    • if region is given, can compute length from that
    • otherwise, we have to open sequence_file or fetch sequence_id and compute from that
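
A sketch of the length lookup and a possible upper bound. The memory formula is an assumption for illustration only (one parameter per pair of positions and pair of alphabet symbols, plus the single-site terms and a fixed overhead); the real requirement would have to be measured.

```python
import math


def target_length(region=None, sequence=None):
    """Length of the target, from the region if given, else from the sequence."""
    if region is not None:
        start, end = region
        return end - start + 1
    if sequence is not None:
        return len(sequence)
    raise ValueError("Need either a region or a sequence")


def estimate_memory_mb(length, alphabet_size=20, bytes_per_param=4,
                       overhead_mb=500):
    """Rough upper bound in MB; all default values are assumptions."""
    n_pairs = length * (length - 1) // 2
    n_params = n_pairs * alphabet_size ** 2 + length * alphabet_size
    return overhead_mb + math.ceil(n_params * bytes_per_param / 1e6)
```

The estimate grows quadratically with the target length, which is why short targets could request far less memory than a fixed default.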

ECs represented as arcs on a line

Put UniProt numbering on beginning and end of line
Interactive arcs that display residue numbers and name
Range of colors for arcs related to strength of EC
Optional - predicted secondary structure along line

Finding structural "homologs"

We should search with HHPRED for all queries, independent of the alignment depth of the search.
This is meant just as an alert rather than for use, and should not be put in the web server.
(Current status with alignment search is under review by Thomas.)

EVcomplex database file downloading / generation

Necessary for EVcomplex part of pipeline to be usable.

Two escalation levels:

  1. Distribute pre-generated downloadable file using DB update tool (download from lab website)

  2. Include script that generates databases from source files in fully automated way.

For next release, I think having 1) would be sufficient.

Index mapping in non-focus mode

Segments and index remapping in the coupling stage do not accommodate the following case:

  • Position present in the sequence alignment
  • No corresponding position present in the target sequence

Since all of the pipeline currently operates in focus mode (and this is the only sensible mode for downstream stages like compare, mutate and fold), this is not an issue currently. If non-focus mode applications are ever integrated into the pipeline that require mapping to a target sequence, this issue has to be addressed.

Increase test coverage

Highest priority would be tests that check the sanity of the individual pipeline stage outputs, and that the pipeline runs through from start to end for the major workflows (monomers / complexes).

Add flag to change proline secondary structure

For prolines that appear in long helices, could be good to have a flag that would change them from H to C in secondary structure prediction file from PSIPRED to allow the helix to bend in fold stage
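
A minimal sketch of such a flag's effect, operating on plain strings rather than on the PSIPRED output file. The function name and the min_helix_length threshold for what counts as a "long" helix are hypothetical.

```python
import re


def break_helix_at_prolines(sequence, sec_struct, min_helix_length=10):
    """Change H to C at prolines that lie inside long helices."""
    if len(sequence) != len(sec_struct):
        raise ValueError("Sequence and secondary structure differ in length")
    out = list(sec_struct)
    # each match is one uninterrupted run of helix states
    for match in re.finditer(r"H+", sec_struct):
        if match.end() - match.start() >= min_helix_length:
            for i in range(match.start(), match.end()):
                if sequence[i] == "P":
                    out[i] = "C"
    return "".join(out)
```

Short helices are left untouched, so only prolines in helices at or above the threshold are affected.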

Improve hmmbuild_and_search with stockholm-formatted alignments

This comprises the following three steps:

  1. Write a function that writes a Stockholm-formatted sequence alignment. The function should be modeled after write_fasta in align/alignment. Additionally, it should provide the option to input a reference coordinate annotation (see pg. 37 of http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf for an example) and write that annotation to the output file.

  2. In align/protocol/hmmbuild_and_search, after calling run_hmmsearch, the .sto formatted alignment output by run_hmmsearch isn't guaranteed to have the query sequence contained in it, so we need to patch in the query sequence and then write a new .sto formatted sequence alignment. This will contain the output of hmmsearch plus the query sequence AND RF annotation columns corresponding to the columns of the query sequence.

  3. In compare/sifts/find_homologs, the function _make_hmmsearch_raw_fasta can be removed and the alignment from hmmbuild_and_search can be read in directly.
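
The writer from the first step could be sketched as follows. The signature is hypothetical (it does not claim to mirror write_fasta), and the example sequences and RF string are made up; the file layout follows the Stockholm convention of a "# STOCKHOLM 1.0" header, one "name sequence" line per entry, an optional "#=GC RF" column annotation, and a closing "//".

```python
import io


def write_stockholm(sequences, fileobj, rf_annotation=None):
    """Write (name, aligned_sequence) pairs as a single Stockholm block."""
    sequences = list(sequences)
    rf_label = "#=GC RF"
    # pad names so that all sequences start in the same text column
    width = max(len(name) for name in [rf_label] + [n for n, _ in sequences])

    fileobj.write("# STOCKHOLM 1.0\n")
    for name, seq in sequences:
        fileobj.write("{} {}\n".format(name.ljust(width), seq))
    if rf_annotation is not None:
        fileobj.write("{} {}\n".format(rf_label.ljust(width), rf_annotation))
    fileobj.write("//\n")


# Example with made-up data: the query is listed first, and the RF line
# marks which alignment columns correspond to query positions
buffer = io.StringIO()
write_stockholm(
    [("query", "AC-D"), ("hit1", "ACKD")], buffer, rf_annotation="xx.x"
)
```

Writing to a file object rather than a path keeps the function usable for both files and in-memory buffers.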

Additional last step in pipeline: store list of files to database

After the pipeline has finished generating all files, there should be a final optional step where files are copied from the FS to a database. This should be as transparent as possible (i.e.: database can be mongo/postgres/...) and the files to be stored are read from some config file.

ValueError: No chains with given name found

I encountered the following error:

Traceback (most recent call last):
  File "/Users/bs224/Dropbox/PostDoc/projects/EVcouplings/test/TestPDB.py", line 33, in test_pbd
    distmap_intra = intra_dists(selected_structures)
  File "/Users/bs224/Dropbox/PostDoc/projects/EVcouplings/evcouplings/compare/distances.py", line 759, in intra_dists
    model
  File "/Users/bs224/Dropbox/PostDoc/projects/EVcouplings/evcouplings/compare/distances.py", line 667, in _prepare_chain
    chain = structures[pdb_id].get_chain(pdb_chain, model)
  File "/Users/bs224/Dropbox/PostDoc/projects/EVcouplings/evcouplings/compare/pdb.py", line 538, in get_chain
    "No chains with given name found"
ValueError: No chains with given name found

which basically breaks the complete pipeline. The culprit was 1OAY...

I propose to ignore the particular PDB structure and continue with the processing via a try and except block. Agreed?
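
The proposed try/except could look roughly like this. The get_chain argument stands in for whatever loads a chain; fake_get_chain and the identifier "1abc" are made up for the example.

```python
def collect_chains(hits, get_chain):
    """Load all chains, skipping those that cannot be found."""
    chains, skipped = [], []
    for pdb_id, chain_name in hits:
        try:
            chains.append(get_chain(pdb_id, chain_name))
        except ValueError as e:
            # remember what was skipped so the problem is not silently hidden
            skipped.append((pdb_id, chain_name, str(e)))
    return chains, skipped


def fake_get_chain(pdb_id, chain_name):
    """Stand-in loader that fails for one structure, as in the report."""
    if pdb_id == "1oay":
        raise ValueError("No chains with given name found")
    return (pdb_id, chain_name)


chains, skipped = collect_chains([("1oay", "A"), ("1abc", "B")], fake_get_chain)
```

Returning the skipped entries alongside the results makes it possible to log them, so a broken structure does not go unnoticed.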

Interactive contact maps

Add interactive contact maps plotting function - HTML - think about consistency with server whilst doing
