debbiemarkslab / evcouplings
Evolutionary couplings from protein and RNA sequence alignments
Home Page: http://evcouplings.org
License: Other
Currently, the settings and output products of the "fold" stage are not explained in notebooks/output_files_tutorial.ipynb and notebooks/running_jobs.ipynb.
The current implementation runs jackhmmer only against the sequences of PDB structures. For greater sensitivity, we should search with the broader HMM profile generated by searching against UniRef/UniProt.
To implement,
Multimer distance calculation fails for hits to the protein P12497 (code and error message below). We believe this is due to pdb id 5o2u, as this is the only structure hit with multiple chains. Structure hits and mapping files are attached.
We have tried the following:
Verified that the SIFTS and PDB files exist and do not appear to have anything strange about them:
http://files.rcsb.org/view/5o2u.pdb
http://mmtf.rcsb.org/v1.0/full/5o2u
Tried calculating the distance map using the by_pdb_id method. This does not return any error.
Potentially of note: this error was originally returned on orchestra but I was able to reproduce it locally. Benni was not able to reproduce the error locally but got a different error that also seems to be related to missing coordinates.
from evcouplings.compare.sifts import SIFTS
from evcouplings.compare.distances import multimer_dists

uid = 'P12497'
s = SIFTS(
    "/Users/AG/databases/pdb_chain_uniprot_plus_2017_08_03.csv",
    "/Users/AG/databases/pdb_chain_uniprot_plus_2017_08_03.fa",
)
selected_structures = s.by_alignment(
    reduce_chains=False, sequence_id=uid, region=(378, 432),
    jackhmmer="/Applications/hmmer-3.1b2-macosx-intel/binaries/jackhmmer",
)
c = multimer_dists(sifts_result=selected_structures, intersect=False)
IndexError Traceback (most recent call last)
<ipython-input-14-2902a00c3a21> in <module>()
----> 1 c = multimer_dists(sifts_result=selected_structures,intersect=False)
2 c.contacts()
/Users/AG/marks_lab_scripts/EVcouplings/evcouplings/compare/distances.py in multimer_dists(sifts_result, structures, atom_filter, intersect, output_prefix, model, raise_missing)
876 # is close in some combination)
877 distmap_sym = DistanceMap.aggregate(
--> 878 distmap, distmap.transpose(), intersect=intersect
879 )
880 distmap_sym.symmetric = True
/Users/AG/marks_lab_scripts/EVcouplings/evcouplings/compare/distances.py in aggregate(cls, intersect, agg_func, *matrices)
597 )
598
--> 599 new_mat[k][i_agg, j_agg] = m.dist_matrix[i_src, j_src]
600
601 # aggregate
IndexError: arrays used as indices must be of integer (or boolean) type
nucleocapsid_b0.8_mapping1737.txt
nucleocapsid_b0.8_mapping1738.txt
nucleocapsid_b0.8_structure_hits.txt
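For reference, NumPy raises this IndexError whenever an index array has a non-integer dtype, for instance a float array (which is what you get when positions contain NaN) or an empty array created without an explicit dtype. Whether that is what happens inside DistanceMap.aggregate for this structure is an assumption, not something confirmed here. The sketch below reproduces the message on toy data and shows one defensive conversion; the helper name to_int_index is hypothetical and not part of evcouplings.

```python
import numpy as np


def to_int_index(idx):
    """Convert an index array to integer dtype, refusing NaN or fractional values.

    Hypothetical helper for illustration only; not part of evcouplings.
    """
    idx = np.asarray(idx)
    if np.issubdtype(idx.dtype, np.integer):
        return idx
    as_float = idx.astype(float)
    if np.isnan(as_float).any():
        raise ValueError("index array contains NaN (missing positions?)")
    if not np.all(as_float == np.floor(as_float)):
        raise ValueError("index array contains non-integral values")
    return as_float.astype(int)


# Toy 3x3 "distance matrix" and float-typed index arrays (made-up values)
dist = np.arange(9.0).reshape(3, 3)
i_src = np.array([0.0, 2.0])
j_src = np.array([1.0, 1.0])

try:
    dist[i_src, j_src]  # float index arrays are rejected by NumPy
    float_index_failed = False
except IndexError:
    float_index_failed = True

# After conversion the same lookup works
values = dist[to_int_index(i_src), to_int_index(j_src)]
```

If the conversion raises ValueError on real data, that would point to missing coordinates in the mapping rather than a pure dtype problem.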
As from #74
For this purpose, I propose listing all external dependencies here and gradually annotating them with their licenses.
Dependencies:
contenders: MIT, BSD 3-clause
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/multiprocessing/pool.py", line 47, in starmapstar
return list(itertools.starmap(args[0], args[1]))
File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/cns.py", line 573, in cns_dgsa_fold
File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/cns.py", line 274, in cns_generate_easy_inp
File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/cns.py", line 101, in _cns_render_template
File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/utils/system.py", line 129, in verify_resources
"{}:\n{}".format(message, ", ".join(invalid))
evcouplings.utils.system.ResourceError: CNS template does not exist: /groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/cns_templates/generate_easy.inp:
/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/cns_templates/generate_easy.inp
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/utils/pipeline.py", line 453, in execute_wrapped
outcfg = execute(**config)
File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/utils/pipeline.py", line 173, in execute
outcfg = runner(**incfg)
File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/protocol.py", line 606, in run
File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/protocol.py", line 486, in standard
File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/multiprocessing/pool.py", line 268, in starmap
return self._map_async(func, iterable, starmapstar, chunksize).get()
File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/multiprocessing/pool.py", line 608, in get
raise self._value
evcouplings.utils.system.ResourceError: CNS template does not exist: /groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/cns_templates/generate_easy.inp:
/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/cns_templates/generate_easy.inp
Traceback (most recent call last):
File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/utils/pipeline.py", line 453, in execute_wrapped
outcfg = execute(**config)
File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/utils/pipeline.py", line 173, in execute
outcfg = runner(**incfg)
File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/protocol.py", line 606, in run
return PROTOCOLS[kwargs["protocol"]](**kwargs)
File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/site-packages/evcouplings/fold/protocol.py", line 504, in standard
shutil.copy(file_path, fold_dir)
File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/shutil.py", line 235, in copy
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "/groups/marks/software/anaconda/envs/evcouplings_env/lib/python3.6/shutil.py", line 114, in copyfile
with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: 'output/capsid_config_mod_b0.2/fold/aux/capsid_config_mod_b0.2_significant_ECs_0.9_1_hMIN.pdb'
Implement evcouplings.align.pfam.create_family_size_table() by parsing the family sizes from the full Pfam-A flat file (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.full.gz).
Before implementing, check that this information is not available in a more straightforward way.
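A minimal sketch of how such a table could be parsed, assuming the flat file is in Stockholm format with one "#=GF AC" (accession) and one "#=GF SQ" (number of sequences) annotation line per family, and "//" closing each family. The function name family_sizes and the toy records (accessions and counts) are made up for illustration; this is not the eventual create_family_size_table() implementation.

```python
import io


def family_sizes(lines):
    """Map Pfam accession (without version suffix) to its '#=GF SQ' count."""
    sizes = {}
    accession = None
    for line in lines:
        if line.startswith("#=GF AC"):
            # e.g. "#=GF AC   PF99998.1" -> "PF99998"
            accession = line.split()[2].split(".")[0]
        elif line.startswith("#=GF SQ") and accession is not None:
            sizes[accession] = int(line.split()[2])
        elif line.startswith("//"):
            # end of the current family record
            accession = None
    return sizes


# Toy input with invented accessions and sizes
example = io.StringIO(
    "# STOCKHOLM 1.0\n"
    "#=GF ID   FamA\n"
    "#=GF AC   PF99998.1\n"
    "#=GF SQ   3\n"
    "seq1/1-5  ACDEF\n"
    "//\n"
    "# STOCKHOLM 1.0\n"
    "#=GF ID   FamB\n"
    "#=GF AC   PF99999.7\n"
    "#=GF SQ   12\n"
    "//\n"
)
table = family_sizes(example)
```

For the real file, the same function could be fed a handle from gzip.open(path, "rt"); the text encoding may need to be set explicitly.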
New function for coloring EC enrichment should be incorporated into pipeline. The new function has more bins and a nicer color scheme.
Currently the protocols contain a fair amount of code duplication. I already refactored some of it in the complexes_dev branch (standard and complexes protocols). After these changes are merged into master, we should continue to refactor so that the mean-field protocol is also de-duplicated.
Benni and I have noticed that the "..statistics_summary.pdf" and "..statistics_summary.csv" files are not always created, and it appears to be random when they are missing. The step seems to time out, but we don't know why. Even when runs take just a few hours, the statistics file can take a day.
The original pipeline had a small modification to the scalehot and scalecool modules of CNS to allow different restraint weights for secondary structure distance and EC-based distance restraints:
Old settings in dg_sa.inp:
md.cool.noe=5 (used to scale EC distance restraints)
md.cool.noe2=10 (used to scale secondary structure distance restraints)
These weights were used to rescale the restraints in the first (ECs) and second (sec. struct.) tbl files.
New settings
The reimplementation does not have this modification (only md.cool.noe = 5), since individual restraint weights can be specified in the tbl files. At the moment, however, this adjustment is not made, which might lead to a lower overall contribution of secondary structure distance restraints (with unclear consequences).
To adjust, in restraints.yml, under secstruct_distance_restraints we should (probably) change weight: 5 to weight: 10. However it is unclear if this change
Before making this change, its effect should be tested empirically.
subfoldbitscore/align/*_alignment_statistics.csv
*_job_statistics_summary.pdf --> Might be missing
align/*_alignment_statistics.csv
align/*.fa
align/*.a2m
align/*_frequencies.csv
couplings/*.model
couplings/*_enrichment_sausage
couplings/*_enrichment_sphere
couplings/*_evzoom.json
if no compare:
    couplings/*_CouplingScores.csv
else:
    compare/*_CouplingScoresCompared_all.csv
    compare/*_CouplingScoresCompared_longrange.csv
    compare/*_structure_hits.csv
mutate/*_mutate_matrix.csv
fold/*_secondary_structure.csv
fold/*_ranking.csv
if structures available:
    fold/*_comparison.csv (has more cols)
It would be nice if we could make figures that visualize ECs as arcs between points on a string, especially for proteins where we get some high-confidence ECs but not enough to fold the full protein.
Great feature for unpublished PDBs or comparing own models to others
see marks/projects/murj_ecoli/MURJ_ECOLI for example
When loading a large structure (from mmCIF or MMTF) whose residue or atom numbers exceed the maximum column width in PDB files, and then writing its chains as PDB files using evcouplings.compare.pdb.Chain.to_file(), the output file will have shifted columns (i.e. a broken PDB file).
Fix:
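One possible guard (a sketch, not necessarily the fix intended above) is to check before writing whether the values fit the fixed-width PDB columns: the atom serial number has 5 characters, the residue sequence number 4, and the chain identifier 1. The function name fits_pdb_columns is hypothetical and not part of evcouplings.

```python
def fits_pdb_columns(atom_serial, res_seq, chain_id):
    """Return True if the values fit the fixed-width fields of a PDB ATOM record.

    Field widths: atom serial = 5 characters, residue number = 4 characters
    (a minus sign counts as a character), chain identifier = 1 character.
    """
    return (
        len(str(atom_serial)) <= 5
        and len(str(res_seq)) <= 4
        and len(chain_id) <= 1
    )


# Toy records: (atom serial, residue number, chain id)
records = [
    (99999, 9999, "A"),   # largest values that still fit
    (100000, 1, "A"),     # atom serial too wide
    (1, 10000, "A"),      # residue number too wide
    (1, 1, "AA"),         # multi-character chain id (possible in mmCIF)
]
too_wide = [r for r in records if not fits_pdb_columns(*r)]
```

A writer could raise an error, renumber, or switch to another output format when too_wide is non-empty; which of these is preferable is left open here.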
Perhaps it is a good idea to start thinking about continuous integration and testing.
One very easy way to do this is TravisCI (https://travis-ci.org). I can help set up the CI and write some unit tests.
Hi guys,
the pipeline dies almost immediately when I set off a run. I get the following parse error:
Error: Parse failed (sequence file
/groups/marks/databases/jackhmmer/uniref100/uniref100_current.fasta):
Line 1: unexpected char ; expected FASTA to start with >
I looked at this briefly with Christian, and he suggested that this is caused by some kind of encoding error (I see weird characters when I run 'head uniref100_current.fasta' on the command-line).
Thanks!
Rohan
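The parse error means that the first character of the database file is not ">". A quick way to narrow this down is to look at the first bytes of the file: gzip data starts with the bytes 0x1f 0x8b, and a UTF-8 byte order mark is 0xef 0xbb 0xbf. The sketch below does this on throwaway files; the function name sniff_sequence_file is hypothetical, and which of these cases (if any) applies to uniref100_current.fasta is not known.

```python
import os
import tempfile


def sniff_sequence_file(path):
    """Classify a file by its first bytes: 'fasta', 'gzip', 'utf8-bom', 'empty' or 'unknown'."""
    with open(path, "rb") as handle:
        head = handle.read(4)
    if head == b"":
        return "empty"
    if head.startswith(b">"):
        return "fasta"
    if head.startswith(b"\x1f\x8b"):
        return "gzip"
    if head.startswith(b"\xef\xbb\xbf"):
        return "utf8-bom"
    return "unknown"


# Demo on temporary files with made-up contents
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.fasta")
    with open(good, "wb") as handle:
        handle.write(b">seq1\nACDEFG\n")

    zipped = os.path.join(tmp, "zipped.fasta")
    with open(zipped, "wb") as handle:
        handle.write(b"\x1f\x8b\x08\x00")  # gzip magic bytes only

    results = (sniff_sequence_file(good), sniff_sequence_file(zipped))
```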
We had a response from the SIFTS team that they will provide a segment mapping file that maps consecutive UniProt sequence segments to segments with structural coverage later this year or early next year. In the meantime, we should try to create such a file ourselves based on the residue-level XML files. We could then distribute these precomputed files using the evcouplings_dbupdate tool and as downloads on the lab website.
can be psipred and perry's from ECs
should have both options
Pipeline outputs with the suffix "_file" are implemented in some of the stages to be None if the product could not be created, whereas other stages do not return the key.
Need to make this behaviour consistent, and make sure that the pipeline (re-)runner does not choke when files are set to be None.
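One way to make this consistent, sketched under the assumption that each stage knows which "_file" keys it may produce, is to fill in missing keys with None after the stage has run and let downstream code filter on None. Both function names and the example keys are hypothetical, not existing pipeline code.

```python
def normalize_file_outputs(outcfg, expected_keys):
    """Return a copy of outcfg in which every expected '*_file' key exists (None if absent)."""
    normalized = dict(outcfg)
    for key in expected_keys:
        if key.endswith("_file"):
            normalized.setdefault(key, None)
    return normalized


def available_files(outcfg):
    """Select only the '*_file' entries that were actually created."""
    return {
        key: value
        for key, value in outcfg.items()
        if key.endswith("_file") and value is not None
    }


# Toy stage output: one product created, one key missing entirely
stage_output = {"alignment_file": "align/test.a2m", "num_sequences": 100}
expected = ["alignment_file", "statistics_file"]

normalized = normalize_file_outputs(stage_output, expected)
created = available_files(normalized)
```

With this convention, a (re-)runner can test for `value is None` instead of having to handle both missing keys and None values.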
Since EVcouplings has changed quite a bit over the last months, we should probably release a new official version.
What do you think?
I got an invalid syntax error when importing something from evcouplings.align.
The error comes from align.protocol line 378
info = pd.DataFrame(
    {
        "pos": range(first_index, first_index + alignment.L),
        "target_seq": target_seq,
        "conservation": conservation,
        **fi_cols
    }
)
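Unpacking with ** inside a dict literal is only valid syntax from Python 3.5 onwards (PEP 448), so one likely explanation, assuming an older interpreter was used for the import, is the Python version rather than the code itself. The sketch below builds the same mapping in a way that also parses on older versions; the variable values are made-up stand-ins, and in the real code the resulting dict would be passed to pd.DataFrame.

```python
# Toy stand-ins for the variables used in align.protocol (values are invented)
first_index = 10
target_seq = list("ACD")
conservation = [0.1, 0.5, 0.9]
fi_cols = {"f_A": [1.0, 0.0, 0.0]}
alignment_length = len(target_seq)

# Build the fixed columns first, then merge the dynamic ones with update()
# instead of using {**fi_cols} inside the literal
columns = {
    "pos": list(range(first_index, first_index + alignment_length)),
    "target_seq": target_seq,
    "conservation": conservation,
}
columns.update(fi_cols)
```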
Currently the "cn" score column is the default score column in the entire pipeline. To make this more flexible for use with other scores (e.g. DI score from mean-field protocol), there should be a generic "score" column which the rest of the pipeline after the couplings stage uses. Which score gets assigned to that column is a parameter of the couplings stage (entry like "use_score: cn"). Needs consideration of special score properties (e.g. mixture model does not make sense with DI scores).
I have compute_num_effective_seqs: True in my config file, but Neff does not appear to be calculated at my align stage.
I am not 100% sure that this is not operator error. Here is the call I am using:
evcouplings --yolo -P output/capsid_config_mod -p P12497 -b "0.2, 0.5, 0.8" -t 0.99 -r 133-363 sample_config_align.txt
Goal: reduce amount of memory that needs to be requested by estimating upper bound based on length of target sequence.
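A rough sketch of such an estimate, assuming that memory is dominated by the pairwise model parameters: for a sequence of length L and an alphabet of q symbols there are L*q single-site terms and L*(L-1)/2 * q^2 pair terms. The bytes per parameter and the safety factor are placeholder assumptions that would need to be calibrated against real runs; the function name is hypothetical.

```python
def estimate_model_bytes(seq_len, num_symbols=20, bytes_per_param=4, safety_factor=3.0):
    """Upper-bound style estimate of memory (in bytes) from the target sequence length.

    Counts single-site and pairwise parameters and multiplies by a safety factor
    to leave room for working copies (e.g. gradients). All defaults are assumptions.
    """
    fields = seq_len * num_symbols
    couplings = seq_len * (seq_len - 1) // 2 * num_symbols ** 2
    return int((fields + couplings) * bytes_per_param * safety_factor)


# Example: request size in gigabytes for a 500-residue target
gigabytes = estimate_model_bytes(500) / 1e9
```

Because the pair term grows quadratically with length, doubling the sequence length roughly quadruples the estimate.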
Put UniProt numbering on beginning and end of line
Interactive arcs that display residue numbers and name
Range of colors for arcs related to strength of EC
Optional - predicted secondary structure along line
Put numbering and secondary structure logo on same axis in contact maps for better visualization.
Make subfolders for all output stages
We should search with HHPRED for all queries, independent of the alignment depth of the search.
This would serve just as an alert rather than being used directly, and should not be put in the web server.
(Current status of the alignment search is under review by Thomas.)
Necessary for EVcomplex part of pipeline to be usable.
Two escalation levels:
Distribute pre-generated downloadable file using DB update tool (download from lab website)
Include script that generates databases from source files in fully automated way.
For next release, I think having 1) would be sufficient.
Segments and index remapping in the coupling stage do not accommodate the following case:
Since all of the pipeline currently operates in focus mode (and this is the only sensible mode for downstream stages like compare, mutate and fold), this is not an issue currently. If non-focus mode applications are ever integrated into the pipeline that require mapping to a target sequence, this issue has to be addressed.
Decide on the best rule for which secondary structure is shown in the compare contact map
Highest priority would be tests that check the sanity of the individual pipeline stage outputs, and that the pipeline runs through from start to end for the major workflows (monomers / complexes).
Use https://readthedocs.org/, couple to repository
Changing theta in the config file works fine, but changing it on the command line does not.
Color range for EC strengths (discuss what it should be)
For prolines that appear in long helices, could be good to have a flag that would change them from H to C in secondary structure prediction file from PSIPRED to allow the helix to bend in fold stage
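A minimal sketch of such a flag's effect, assuming the secondary structure is available as a string of H/E/C states aligned to the sequence. The function name and the min_helix_length threshold (what counts as a "long" helix) are hypothetical choices, not existing pipeline parameters.

```python
def break_helices_at_prolines(sequence, sec_struct, min_helix_length=10):
    """Change H to C at proline positions inside helices of at least min_helix_length."""
    if len(sequence) != len(sec_struct):
        raise ValueError("sequence and secondary structure must have equal length")

    result = list(sec_struct)
    n = len(sec_struct)
    i = 0
    while i < n:
        if sec_struct[i] != "H":
            i += 1
            continue
        # find the end of the current helix segment [i, j)
        j = i
        while j < n and sec_struct[j] == "H":
            j += 1
        if j - i >= min_helix_length:
            for k in range(i, j):
                if sequence[k] == "P":
                    result[k] = "C"
        i = j
    return "".join(result)


# Toy example: one 12-residue helix with a proline, one short helix with a proline
seq = "AAAAPAAAAAAAGGAPAA"
ss = "HHHHHHHHHHHHCCHHHH"
modified = break_helices_at_prolines(seq, ss)
```

Only the proline position itself is changed here; whether flanking residues should also be relaxed is a separate design question.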
This comprises the following three steps:
Additionally, it should provide the option to input a reference coordinate annotation (see pg. 37 of http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf for example) and write that annotation to the output file.
In align/protocol/hmmbuild_and_search, after calling run_hmmsearch, the .sto formatted alignment output by run_hmmsearch isn't guaranteed to have the query sequence contained in it, so we need to patch in the query sequence and then write a new .sto formatted sequence alignment. This will contain the output of hmmsearch plus the query sequence AND RF annotation columns corresponding to the columns of the query sequence.
In compare/sifts/find_homologs, the function _make_hmmsearch_raw_fasta can be removed and the alignment from hmmbuild_and_search can be read in directly.
After the pipeline has finished generating all files, there should be a final optional step where files are copied from the FS to a database. This should be as transparent as possible (i.e.: database can be mongo/postgres/...) and the files to be stored are read from some config file.
I encountered the following error:
Traceback (most recent call last):
File "/Users/bs224/Dropbox/PostDoc/projects/EVcouplings/test/TestPDB.py", line 33, in test_pbd
distmap_intra = intra_dists(selected_structures)
File "/Users/bs224/Dropbox/PostDoc/projects/EVcouplings/evcouplings/compare/distances.py", line 759, in intra_dists
model
File "/Users/bs224/Dropbox/PostDoc/projects/EVcouplings/evcouplings/compare/distances.py", line 667, in _prepare_chain
chain = structures[pdb_id].get_chain(pdb_chain, model)
File "/Users/bs224/Dropbox/PostDoc/projects/EVcouplings/evcouplings/compare/pdb.py", line 538, in get_chain
"No chains with given name found"
ValueError: No chains with given name found
which basically breaks the complete pipeline. The culprit was 1OAY...
I propose to ignore the particular PDB structure and continue with the processing via a try and except block. Agreed?
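The proposal could look roughly like the sketch below: catch the ValueError per structure, record what was skipped, and continue with the remaining hits. The loader is passed in as a function so the example is self-contained; load_chains and fake_get_chain are hypothetical stand-ins, not the actual _prepare_chain / get_chain code, and the second PDB identifier is invented.

```python
def load_chains(hits, get_chain):
    """Load chains for all (pdb_id, chain_id) hits, skipping those that raise ValueError."""
    chains = {}
    skipped = []
    for pdb_id, chain_id in hits:
        try:
            chains[(pdb_id, chain_id)] = get_chain(pdb_id, chain_id)
        except ValueError as error:
            # keep a record so the problem stays visible instead of being silently dropped
            skipped.append((pdb_id, chain_id, str(error)))
    return chains, skipped


def fake_get_chain(pdb_id, chain_id):
    """Stand-in loader that fails for one structure, mimicking the error above."""
    if pdb_id == "1oay":
        raise ValueError("No chains with given name found")
    return "chain {} of {}".format(chain_id, pdb_id)


hits = [("1oay", "H"), ("0abc", "A")]
chains, skipped = load_chains(hits, fake_get_chain)
```

Logging the skipped entries (rather than discarding them) makes it possible to notice when a whole class of structures is being dropped.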
In this function ( https://github.com/debbiemarkslab/EVcouplings/blob/master/evcouplings/visualize/pairs.py#L752 ), a column with id needs to be defined, but in sec_struct.csv the column's header is just i. Maybe inconsistent?
To save space, delete intermediate products of align stage (which are biggest space takers next to .model file):
Add a parameter like "save_model" to the couplings stage to enable/disable this.
Add interactive contact maps plotting function - HTML - think about consistency with server whilst doing
Add to couplings stage