robaina / metatag
metaTag: functional and taxonomical annotation of metagenomes through phylogenetic tree placement
Home Page: https://robaina.github.io/MetaTag/
License: Apache License 2.0
This is a self-reminder to replace filterFASTAbyIDs with its new, faster implementation using seqkit.
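A minimal sketch of what that replacement could look like (the function and argument names are assumptions; the flags follow seqkit grep's documented interface):

import subprocess
import tempfile

def filter_fasta_by_ids(input_fasta: str, record_ids: list, output_fasta: str) -> None:
    """Extract the records whose ids are listed, delegating the work to seqkit."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("\n".join(record_ids))  # seqkit grep -f expects one id per line
        ids_file = tmp.name
    subprocess.run(
        ["seqkit", "grep", "-f", ids_file, "-o", output_fasta, input_fasta],
        check=True,
    )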
I am trying to make a database from a mix of HMMs: one Pfam model that does not require additional arguments and two KOfams that do. This is part of a test for something else, so I have changed the hmmsearch call in wrappers.py, but only the type of output file, from tblout to domtblout. I do not think this is causing the error.
I got a ValueError: at some point the pipeline tries to convert the Pfam name to a float. Probably I am making a mistake in the module call. This is the call and the error (ignore the trailing \ in the call):
(traits) nfernandez@elbrus:/data/mcm/nfernandez/TRAITS$ python3 ./code/makedatabase.py \
> --in data/databases/final_ref_database.faa \
> --outdir genes/prueba/results \
> --hmms genes/prueba/hmms/PF18582.hmm genes/prueba/hmms/K00367.hmm genes/prueba/hmms/K00370.hmm \
> --hmmsearch_args "None","-T 1033.77","-T 1075.37"\
>
* Making peptide-specific reference database...
* Processing hmm PF18582.hmm with additional arguments: --cut_nc
Running Hmmer...
Parsing Hmmer output file...
Traceback (most recent call last):
File "/data/mcm/nfernandez/opt/envs/traits/lib/python3.9/site-packages/Bio/File.py", line 72, in as_handle
with open(handleish, mode, **kwargs) as fp:
TypeError: expected str, bytes or os.PathLike object, not TextIOWrapper
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/data/mcm/nfernandez/TRAITS/./code/makedatabase.py", line 194, in <module>
main()
File "/data/mcm/nfernandez/TRAITS/./code/makedatabase.py", line 132, in main
filterFASTAByHMM(
File "/data/mcm/nfernandez/TRAITS/code/phyloplacement/database/manipulation.py", line 114, in filterFASTAByHMM
hmmer_hits = parseHMMsearchOutput(hmmer_output)
File "/data/mcm/nfernandez/TRAITS/code/phyloplacement/database/manipulation.py", line 62, in parseHMMsearchOutput
for queryresult in SearchIO.parse(handle, 'hmmer3-tab'):
File "/data/mcm/nfernandez/opt/envs/traits/lib/python3.9/site-packages/Bio/SearchIO/__init__.py", line 306, in parse
yield from generator
File "/data/mcm/nfernandez/opt/envs/traits/lib/python3.9/site-packages/Bio/SearchIO/HmmerIO/hmmer3_tab.py", line 33, in __iter__
yield from self._parse_qresult()
File "/data/mcm/nfernandez/opt/envs/traits/lib/python3.9/site-packages/Bio/SearchIO/HmmerIO/hmmer3_tab.py", line 98, in _parse_qresult
cur = self._parse_row()
File "/data/mcm/nfernandez/opt/envs/traits/lib/python3.9/site-packages/Bio/SearchIO/HmmerIO/hmmer3_tab.py", line 51, in _parse_row
hit["evalue"] = float(cols[4]) # evalue (full sequence)
ValueError: could not convert string to float: 'PF18582.4'
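For the record, one plausible cause: switching hmmsearch from --tblout to --domtblout changes the column layout, so a parser that still reads the file as Biopython's 'hmmer3-tab' format finds the accession (PF18582.4) where it expects the e-value. A minimal sketch of a parser matched to the domain table (the function name mirrors parseHMMsearchOutput but is illustrative):

from Bio import SearchIO

def parse_hmmsearch_domtblout(hmmer_output: str) -> list:
    """Collect hit ids from a hmmsearch --domtblout file."""
    hit_ids = []
    with open(hmmer_output) as handle:
        # 'hmmsearch3-domtab' is Biopython's format key for hmmsearch domain tables
        for queryresult in SearchIO.parse(handle, "hmmsearch3-domtab"):
            for hit in queryresult.hits:
                hit_ids.append(hit.id)
    return hit_ids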
I have been trying to introduce Nrt Arctic sequences into the Nrt reference tree. The procedure works well, but there are some outliers. When I try to remove them using removetreeoutliers.py, nothing changes. The command line is:
python3 /data/mcm/rlaso/Programs/TRAITS/code/removetreeoutliers.py \
--tree /data/mcm/rlaso/Traits/Nrt/results/Arctic/epa_result.newick \
--outdir /data/mcm/rlaso/Traits/Nrt/results/Arctic/ \
--aln /data/mcm/rlaso/Traits/Nrt/results/Arctic/TIGR01184_COG1116_arctic_90filtered_processed.faln
The trees and alignment are attached; the extension is changed to txt to allow uploading to GitHub.
epa_result_relabel.newick.txt
epa_result_shrink_relabel.newick.txt
TIGR01184_COG1116_arctic_90filtered_processed.faln.txt
I am not sure whether this is a bug or whether the stringency of the removal can simply be adjusted.
Currently, a bash pipeline has to be prepared for each gene. It would be useful to set general options in a config file (e.g. config.json) to automate the bash script; a sketch follows.
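A minimal sketch of the idea (all keys and paths are assumptions): read per-gene settings from config.json and build the makedatabase.py call from them.

import json
import subprocess

with open("config.json") as fh:
    cfg = json.load(fh)
# e.g. {"input_fasta": "data/final_ref_database.faa",
#       "outdir": "genes/nosZ/results", "hmm": "hmms/NosZ.hmm"}

subprocess.run(
    ["python3", "code/makedatabase.py",
     "--in", cfg["input_fasta"],
     "--outdir", cfg["outdir"],
     "--hmm", cfg["hmm"]],
    check=True,
)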
Short-read placement may fail to place sequences accurately enough to resolve taxonomy. To test the accuracy of this method, we could place test sequences that have already been taxonomically classified, artificially fragmenting them to simulate short reads (see the sketch below).
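A minimal sketch of the fragmentation step (fragment length and step size are arbitrary placeholders):

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

def fragment_records(fasta_file: str, fragment_length: int = 150, step: int = 75):
    """Yield overlapping fragments of each record to simulate short reads."""
    for rec in SeqIO.parse(fasta_file, "fasta"):
        for start in range(0, max(len(rec.seq) - fragment_length, 0) + 1, step):
            frag = rec.seq[start:start + fragment_length]
            yield SeqRecord(frag, id=f"{rec.id}_frag{start}", description="")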
So, pretty much what the title says: this feature will avoid calling countplacements.py multiple times.
This is a convenient enhancement to allow multiple HMMs (e.g. TIGRFAMs) in code/makedatabase.py. This way, makedatabase.py can be called a single time instead of once per HMM.
We should allow selection of the total database size as well as individual sizes for each HMM.
Export a text file with the list of query sequences that were placed in a cluster with the "right" functionality. This is useful for assigning taxonomic info by an external method.
Created this new issue to expand on the idea of selecting tree clusters by representative sequences with experimental evidence via EFE3, introduced in #32
I am trying to redo the old trees with the new databases using an updated version of the pipeline. I also updated the environment using conda update and the yaml file. I am getting an error from makedatabase_struct.py. Here is the call:
python3 /data/mcm/rlaso/Programs/TRAITS/code/makedatabase_struct.py --in /data/mcm/databases/TRAITs_reference_database/final_ref_database.fasta --outdir /data/mcm/rlaso/Traits/Phylogenetic_trees/Nitrogen_cycle/Nrt/data/reference_data_NrtA_K15576/ --hmms /data/mcm/databases/kofam/profiles/K15576.hmm /data/mcm/databases/kofam/profiles/K15577.hmm --hmm_struc "K15576 4 K15577" --hmmsearch_args "-T 501.70,-T 278.07" --target_hmm "K15576"
* Making peptide-specific reference database...
Running Hmmer...
Filtering results by HMM structure...
Filtering Fasta...
RuntimeError: get seq count and length error
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/data/mcm/rlaso/Programs/TRAITS/code/makedatabase_struct.py", line 157, in <module>
main()
File "/data/mcm/rlaso/Programs/TRAITS/code/makedatabase_struct.py", line 115, in main
filterFastaByHMMstructure(
File "/data/mcm/rlaso/Programs/TRAITS/code/phyloplacement/database/manipulation.py", line 459, in filterFastaByHMMstructure
filterFASTAbyIDs(input_fasta, record_ids=record_ids,
File "/data/mcm/rlaso/Programs/TRAITS/code/phyloplacement/database/manipulation.py", line 76, in filterFASTAbyIDs
fa = pyfastx.Fasta(input_fasta)
SystemError: <class 'Fasta'> returned a result with an error set
Previously makedatabase_struct.py worked for me, so I am unsure whether the error is in the new code or in the new conda environment.
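One thing that may be worth trying (a guess, not a confirmed diagnosis): pyfastx keeps an index file (.fxi) next to the fasta, and a stale or corrupt index can make pyfastx.Fasta fail after the underlying file or environment changes. A sketch of a defensive loader:

import os
import pyfastx

def load_fasta(input_fasta: str) -> pyfastx.Fasta:
    """Drop any pre-existing pyfastx index so it is rebuilt from the current
    file; a stale .fxi index is a possible cause of load errors like this."""
    index_file = input_fasta + ".fxi"
    if os.path.exists(index_file):
        os.remove(index_file)  # force pyfastx to re-index the fasta
    return pyfastx.Fasta(input_fasta)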
Once the reference database was extended, I updated the database, the repo, and the conda environment to include the new changes in relabeltree.py --taxonomy for assigning GTDB taxonomy to the DB sequences included in a tree. I ran the run_test.sh script and everything worked fine, but I got some errors when using it with one of my trees.
First, the script did not work:
$ python3 ./code/relabeltree.py \
> --tree genes/nosZ/results/tree_before_placement/ref_database_shrink.newick \
> --aln genes/nosZ/results/tree_before_placement/ref_database_shrink.faln \
> --labels \
> genes/nosZ/results/nosZ_ref_database_id_dict.pickle \
> genes/nosZ/results/nosz_hallin18_short_ids_id_dict.pickle \
> --label_prefixes "refdb_" "hal_" \
> --taxonomy
* Relabelling tree...
Traceback (most recent call last):
File "/media/disk5/nfernandez/TRAITS/./code/relabeltree.py", line 135, in <module>
main()
File "/media/disk5/nfernandez/TRAITS/./code/relabeltree.py", line 116, in main
tree_label_dict, export_label_dict = assignTaxonomyToLabels(label_dict)
File "/media/disk5/nfernandez/TRAITS/./code/relabeltree.py", line 89, in assignTaxonomyToLabels
taxopath = taxonomy.assignTaxonomyToLabel(label)
File "/media/disk5/nfernandez/TRAITS/code/phyloplacement/taxonomy.py", line 112, in assignTaxonomyToLabel
return self._taxodata.loc[genome_id].item()
File "/media/disk5/nfernandez/opt/miniconda3/envs/traits/lib/python3.9/site-packages/pandas/core/generic.py", line 5487, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'item'
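For context: pandas' .loc returns a DataFrame rather than a Series when the index contains duplicated genome ids, and DataFrame has no .item(). A minimal sketch of a defensive fix (illustrative only, not necessarily the fix that was applied upstream):

import pandas as pd

def taxopath_for(taxodata: pd.DataFrame, genome_id: str) -> str:
    """Return the taxopath for a genome id, tolerating duplicated index entries."""
    selection = taxodata.loc[genome_id]
    if isinstance(selection, pd.DataFrame):  # duplicated genome id in the index
        selection = selection.iloc[0]        # take the first entry
    return selection.item()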
After Robaina fixed the first problem, I encountered a second one: once the tree was relabeled, several sequences were missing the taxonomy. I checked again with the following:
python3 ./code/relabeltree.py \
--tree genes/nosZ/results/tree_before_placement/ref_database_shrink.newick \
--aln genes/nosZ/results/tree_before_placement/ref_database_shrink.faln \
--labels \
genes/nosZ/results/nosZ_ref_database_id_dict.pickle \
genes/nosZ/results/nosz_hallin18_short_ids_id_dict.pickle \
--label_prefixes "refdb_" "hal_" \
--taxonomy
But I reproduced the problem, and many reference sequences were missing the taxonomy, as you can see in the tsv produced by relabeltree.py --taxonomy.
I checked whether the sequence IDs of those missing the taxonomy were included in the merged_taxonomy.tsv file, and that was not the problem:
$ grep "OceanDNA_b4030" merged_taxonomy.tsv
OceanDNA_b4030 d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Cytophagales;f__Cyclobacteriaceae;g__ELB16-189;s__
Hi,
As I was working with the Jupyter notebook to define the clusters, I realized a small piece of functionality was missing, so I wrote some lines to help, which you can add to the Jupyter notebook in the main branch.
Once the user has manually defined the clusters based on internal nodes (in iTOL, for instance), there was no quick way (or at least I have not seen one) to build a dictionary with the user-defined cluster names and the leaves corresponding to them. So here is my solution.
First, create a tsv file with two columns: the first with the cluster name, for instance NasA_Gammaproteobacteria or AioA_paralog, and the second with the internal node name, for instance IN_2_100 or IN_65_87.
You should create this file while checking the tree to define the clusters.
Then, using pandas, import that tsv file into a dictionary and, from it, create another one populating the items with the getAllDescendantsOfTargetNode function. I am using a dictionary to import the information in case we might want to join several branches into the same kind of cluster. Imagine that you have a group of paralogs inserted in the middle of a larger branch of orthologs, resulting in two groups of orthologs; in that case you might want to consider these two groups as a single one.
Here is the code:
'''
Create a dictionary with cluster names as keys and all leaves
belonging to each cluster, starting from a tsv file that contains
the cluster names in the first column and the internal node names
in the second column.
'''
import pandas as pd

# Import the tsv file into a dictionary: {cluster_name: internal_node_name}
# (.squeeze('columns') replaces read_csv's squeeze=True argument, which is
# deprecated in recent pandas releases)
dict_from_tsv = pd.read_csv(
    'test.tsv', sep='\t', header=None, index_col=0
).squeeze('columns').to_dict()

# phylotree is the tree object already loaded earlier in the notebook
tree_clusters = {}
for cluster_name, node_name in dict_from_tsv.items():
    tree_clusters[cluster_name] = phylotree.getAllDescendantsOfTargetNode(node_name)
Currently, taxonomy and function of placed sequences are assigned according to their landing cluster. However, clusters may sometimes be too broad to provide a meaningful, non-trivial taxonomy to queries. In such cases, it may be better to assign taxonomy directly from the reference sequence closest to the placement of the query. To this end, we need to traverse the tree from the point of placement to find the closest reference sequences. Bio.Phylo provides some limited tools; perhaps other, general tools for parsing DAGs (networkx?) would be more helpful. A sketch follows.
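A minimal sketch of the nearest-reference lookup using Bio.Phylo (all names are illustrative; distances are sums of branch lengths):

from Bio import Phylo

def closest_reference_leaf(tree, placement_clade, reference_names: set):
    """Find the reference leaf nearest to the clade where a query was placed."""
    best_name, best_dist = None, float("inf")
    for leaf in tree.get_terminals():
        if leaf.name in reference_names:
            dist = tree.distance(placement_clade, leaf)  # branch-length distance
            if dist < best_dist:
                best_name, best_dist = leaf.name, dist
    return best_name, best_dist

# Usage: tree = Phylo.read("epa_result.newick", "newick")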
I am trying to build the tree for NrtC. NrtC contains TIGR01184, which is also present in the neighboring protein NrtD, but NrtC is longer and has a domain that hits COG1116.
I ran TIGR01184 against the MAR database smoothly, but now I wanted to run an HMM for COG1116 to obtain the sequences and compare them with the TIGR results, and it does not work.
First, I used the HMM model obtained from this website. The command was:
(traits) rlaso@elbrus:/data/mcm/rlaso/Traits/Nrt$ python3 /data/mcm/rlaso/Programs/TRAITS/code/makedatabase.py \
> --in /data/mcm/rlaso/Traits/Databases/MAR_faa/fullMARHQ_cleaned.faa \
> --outdir /data/mcm/rlaso/Traits/Nrt/results/COG1116/ \
> --hmm /data/mcm/rlaso/Traits/Nrt/hmms/COG1116.hmm \
> --prefix "ref_nrtC" --relabel
* Making peptide-specific reference database...
Running Hmmer...
Error: NC bit thresholds unavailable on model NOG.COG1116.clustalo_raw
Parsing Hmmer output file...
Traceback (most recent call last):
File "/data/mcm/rlaso/Programs/TRAITS/code/makedatabase.py", line 118, in <module>
main()
File "/data/mcm/rlaso/Programs/TRAITS/code/makedatabase.py", line 81, in main
filterFASTAByHMM(
File "/data/mcm/rlaso/Programs/TRAITS/code/phyloplacement/database/manipulation.py", line 110, in filterFASTAByHMM
if not hmmer_hits.id.values.tolist():
File "/data/mcm/rlaso/Programs/Miniconda/envs/traits/lib/python3.9/site-packages/pandas/core/generic.py", line 5487, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'id'
I understood the problem to be that the COG model (see COG1116_hmm.txt) does not have the NC value present in TIGR HMMs.
COG1116_hmm.txt
I tried to fix it by adding the corresponding line, with the value inferred from the ThresholdBitScore of the COG model. I know this value would not be "right", since the COG model is based on BLAST and TIGR on HMMs, but I wanted to give it a try:
NC 207.409
Still, an error occurred, but it was not informative enough (bad file format in HMM file):
(traits) rlaso@elbrus:/data/mcm/rlaso/Traits/Nrt/hmms$ python3 /data/mcm/rlaso/Programs/TRAITS/code/makedatabase.py --in /data/mcm/rlaso/Traits/Databases/MAR_faa/fullMARHQ_cleaned.faa --outdir /data/mcm/rlaso/Traits/Nrt/results/COG1116/ --hmm /data/mcm/rlaso/Traits/Nrt/hmms/COG1116.hmm --prefix "ref_nrtC" --relabel
* Making peptide-specific reference database...
Running Hmmer...
Error: bad file format in HMM file /data/mcm/rlaso/Traits/Nrt/hmms/COG1116.hmm
Parsing Hmmer output file...
Traceback (most recent call last):
File "/data/mcm/rlaso/Programs/TRAITS/code/makedatabase.py", line 118, in <module>
main()
File "/data/mcm/rlaso/Programs/TRAITS/code/makedatabase.py", line 81, in main
filterFASTAByHMM(
File "/data/mcm/rlaso/Programs/TRAITS/code/phyloplacement/database/manipulation.py", line 110, in filterFASTAByHMM
if not hmmer_hits.id.values.tolist():
File "/data/mcm/rlaso/Programs/Miniconda/envs/traits/lib/python3.9/site-packages/pandas/core/generic.py", line 5487, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'id'
I then tried to build the HMM myself. I used the whole raw alignment of the COG from the eggNOG website and ran hmmbuild:
hmmbuild -n COG1116 COG1116_manual.hmm COG1116_alignment_eggnote.txt
# hmmbuild :: profile HMM construction from multiple sequence alignments
# HMMER 3.3.2 (Nov 2020); http://hmmer.org/
# Copyright (C) 2020 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# input alignment file: COG1116_alignment_eggnote.txt
# output HMM file: COG1116_manual.hmm
# name (the single) HMM: COG1116
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# idx name nseq alen mlen eff_nseq re/pos description
#---- -------------------- ----- ----- ----- -------- ------ -----------
1 COG1116 3486 2241 949 51.01 0.590
# CPU time: 1.34u 0.02s 00:00:01.36 Elapsed: 00:00:01.37
However, the NC error happened again, and adding the NC value gave the same error as before.
I am unsure how to calculate this NC (noise cutoff), since it is not produced by hmmbuild, and I am not an expert in this regard. Does anyone know how to do it? I just found this [paper](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6288931).
If not, would it be possible to extract the sequences from the HMM without the NC?
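One detail worth checking (an observation about the HMMER3 file format, not a confirmed diagnosis of this error): the GA/TC/NC cutoff lines each carry two bit scores, the full-sequence and the single-domain threshold, and in Pfam/TIGRFAM files they end with a semicolon. A single-value line may be what triggers the "bad file format" message. The added line would then look like:

NC    207.409 207.409;

Alternatively, running hmmsearch without the --cut_nc flag and passing a manual bit-score threshold instead (as in the -T values shown with --hmmsearch_args above) avoids needing an NC line at all.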
Search by synteny structure has now been tested for a while; hence, it makes sense to integrate this functionality into makedatabase.py.
Either:
otherwise gappa examine assign cannot properly assign taxonomy since some references and internal nodes may have the same numerical label
Due to a problem with SqueezeMeta results, I was inadvertently using a fasta file containing both protein and DNA query sequences while working on the placement of environmental sequences in a tree. Because there is no warning either in SqueezeMeta or in the TRAITS pipeline, I did not realize the mistake until I was reviewing the results in the tree, and only because one of the DNA sequences was placed in a paralogous cluster. To my surprise, the DNA sequences are aligned in the query alignment (using papara).
I think it would be useful to add a checkpoint in the preprocess.py script to halt the run and give a warning if there are mixed types of sequences in the fasta files; see the sketch below.
I would like to take care of that myself; it would just take me more time (way more) than you to fix it.
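A minimal sketch of such a checkpoint (the heuristic and threshold are assumptions): flag records that look like nucleotides in a file expected to hold peptides.

from Bio import SeqIO

NUCLEOTIDES = set("ACGTUN")

def looks_like_dna(seq: str, threshold: float = 0.95) -> bool:
    """Peptide sequences rarely consist almost entirely of ACGTUN characters."""
    seq = seq.upper().replace("-", "").replace("*", "")
    return bool(seq) and sum(c in NUCLEOTIDES for c in seq) / len(seq) >= threshold

def assert_peptides_only(fasta_file: str) -> None:
    offenders = [rec.id for rec in SeqIO.parse(fasta_file, "fasta")
                 if looks_like_dna(str(rec.seq))]
    if offenders:
        raise ValueError(
            f"{len(offenders)} record(s) look like DNA (e.g. {offenders[:3]}); "
            "mixed sequence types detected in input fasta."
        )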
Also, make it compatible across alternative methods, such as fasttree and iqtree (currently iqtree produces a .contree file) 👍🏼
Hi @Robaina
I am trying to label some Arctic sequences introduced in the reference tree from NrtA. I have already defined clusters and scores. My command and the subsequent error are:
python /data/mcm/rlaso/Programs/TRAITS/code/labelplacements.py \
--jplace epa_result.jplace \
--labels /data/mcm/rlaso/Traits/Nitrate_transport/NrtA_structure_and_non_synthenic/data/K15576/ref_database_nrtA_processed_id_dict.pickle /data/mcm/rlaso/Traits/Nitrate_transport/NrtA_structure_and_non_synthenic/data/Arctic/K15576_arctic_processed_id_dict.pickle /data/mcm/rlaso/Traits/Nitrate_transport/NrtA_structure_and_non_synthenic/data/Swapni/NrtA_sequence_Swapni_processed_id_dict.pickle \
--ref_clusters ../cluster_id.tsv --ref_cluster_scores ../cluster_score.tsv --prefix arct_
Traceback (most recent call last):
File "/data/mcm/rlaso/Programs/TRAITS/code/labelplacements.py", line 103, in <module>
main()
File "/data/mcm/rlaso/Programs/TRAITS/code/labelplacements.py", line 91, in main
assignLabelsToPlacements(
File "/data/mcm/rlaso/Programs/TRAITS/code/phyloplacement/placement.py", line 402, in assignLabelsToPlacements
ref_clusters = parseTreeClusters(ref_clusters_file, cluster_as_key=False, sep='\t')
File "/data/mcm/rlaso/Programs/TRAITS/code/phyloplacement/placement.py", line 311, in parseTreeClusters
return dict(df.values)
ValueError: dictionary update sequence element #0 has length 10; 2 is required
I am not sure where the problem is. In principle, I use the same dictionaries as for relabeling the tree, so they should be fine.
Hi, I am trying to apply labelplacements.py to my Amt tree, but I keep getting an error related to gappa:
(traits) rlaso@elbrus:/data/mcm/rlaso/Traits/Phylogenetic_trees/Nitrogen_cycle/Amt/results/Arctic$ python /data/mcm/rlaso/Programs/TRAITS/code/labelplacements.py --jplace epa_result_modified.jplace --labels /data/mcm/rlaso/Traits/Phylogenetic_trees/Nitrogen_cycle/Amt/data/reference_data_Amt_TIGR00836/ref_amtref_database_id_dict.pickle /data/mcm/rlaso/Traits/Phylogenetic_trees/Nitrogen_cycle/Amt/data/McDonald_2016/mcdonald2016_prokaryotes_processed_id_dict.pickle /data/mcm/rlaso/Traits/Phylogenetic_trees/Nitrogen_cycle/Amt/data/McDonald_2016/mcdonald2016_plant_processed_id_dict.pickle /data/mcm/rlaso/Traits/Phylogenetic_trees/Nitrogen_cycle/Amt/data/McDonald_2016/mcdonald2016_rh_processed_id_dict.pickle /data/mcm/rlaso/Traits/Phylogenetic_trees/Nitrogen_cycle/Amt/data/reviewed_sequences/SwissProt_Amt_prokaryotes_processed_id_dict.pickle --ref_clusters ../cluster_id.tsv --ref_cluster_scores ../cluster_score.tsv --prefix arctic
[gappa banner]
gappa v0.7.1 (c) 2017-2021
by Lucas Czech and Pierre Barbera
Invocation: gappa examine assign --jplace-path epa_result_modified.jplace --taxon-file temp_oxulnwpxnw --out-dir
/data/mcm/rlaso/Traits/Phylogenetic_trees/Nitrogen_cycle/Amt/results/Arctic --file-prefix arctic --allow-file-overwriting
--per-query-results --best-hit
Command: gappa examine assign
Input:
--jplace-path epa_result_modified.jplace
--taxon-file temp_oxulnwpxnw
--root-outgroup
--taxonomy
--ranks-string superkingdom|phylum|class|order|family|genus|species
Settings:
--sub-taxopath
--max-level 0
--distribution-ratio -1
--consensus-thresh 1
--resolve-missing-paths false
Output:
--out-dir /data/mcm/rlaso/Traits/Phylogenetic_trees/Nitrogen_cycle/Amt/results/Arctic
--file-prefix arctic
--file-suffix
--cami false
--sample-id
--krona false
--sativa false
--per-query-results true
--best-hit true
Global Options:
--allow-file-overwriting true
--verbose false
--threads 52
--log-file
Run the following command to get the references that need to be cited:
`gappa tools citation Czech2020-genesis-and-gappa`
Started 2022-03-29 21:09:28
Found 1 jplace file
Running the assignment
Not all leafs in the reference tree were taxonomically labelled!(1000 / 1164)
Please check tree leaf label and taxon file taxa name congruency!
Segmentation fault (core dumped)
Traceback (most recent call last):
File "/data/mcm/rlaso/Programs/TRAITS/code/labelplacements.py", line 115, in <module>
main()
File "/data/mcm/rlaso/Programs/TRAITS/code/labelplacements.py", line 100, in main
assignLabelsToPlacements(
File "/data/mcm/rlaso/Programs/TRAITS/code/phyloplacement/placement.py", line 413, in assignLabelsToPlacements
parseGappaAssignTable(
File "/data/mcm/rlaso/Programs/TRAITS/code/phyloplacement/placement.py", line 298, in parseGappaAssignTable
table = pd.read_csv(input_table, sep='\t')
File "/data/mcm/rlaso/Programs/Miniconda/envs/traits/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/data/mcm/rlaso/Programs/Miniconda/envs/traits/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 586, in read_csv
return _read(filepath_or_buffer, kwds)
File "/data/mcm/rlaso/Programs/Miniconda/envs/traits/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 482, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/data/mcm/rlaso/Programs/Miniconda/envs/traits/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 811, in __init__
self._engine = self._make_engine(self.engine)
File "/data/mcm/rlaso/Programs/Miniconda/envs/traits/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1040, in _make_engine
return mapping[engine](self.f, **self.options) # type: ignore[call-arg]
File "/data/mcm/rlaso/Programs/Miniconda/envs/traits/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 69, in __init__
self._reader = parsers.TextReader(self.handles.handle, **kwds)
File "pandas/_libs/parsers.pyx", line 549, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
The corresponding cluster files are here:
cluster_id.txt
cluster_score.txt
I thought the mistake could be related to the fact that I didn't provide a number in cluster_score, but I changed the strings to numbers and the same error appeared again. Then I read the error more carefully, and I think it is related to pandas, but I am not sure what the issue is exactly.
When hmmsearch doesn't retrieve any result, makedatabase.py outputs the error:
$ python3 ./code/makedatabase.py \
> --in /data/mcm/nfernandez/nosZ/test_cytC/data/cytc533_RefSeq_seqs_clean.fasta \
> --hmm /data/mcm/nfernandez/nosZ/test_cytC/hmms/NosZ.hmm \
> --outdir /data/mcm/nfernandez/nosZ/test_cytC/results
* Making peptide-specific reference database...
Running Hmmer...
Parsing Hmmer output file...
Filtering Fasta...
Traceback (most recent call last):
File "/data/mcm/nfernandez/TRAITS/./code/makedatabase.py", line 118, in <module>
main()
File "/data/mcm/nfernandez/TRAITS/./code/makedatabase.py", line 81, in main
filterFASTAByHMM(
File "/data/mcm/nfernandez/TRAITS/code/phyloplacement/database/manipulation.py", line 111, in filterFASTAByHMM
filterFASTAbyIDs(input_fasta, record_ids=hmmer_hits.id.values,
File "/data/mcm/nfernandez/opt/envs/traits/lib/python3.9/site-packages/pandas/core/generic.py", line 5487, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'id'
A more informative error could be displayed instead; see the sketch below.
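A minimal sketch of the friendlier behavior (names echo the traceback but are illustrative):

import sys
import pandas as pd

def exit_if_no_hits(hmmer_hits: pd.DataFrame, hmm_file: str) -> None:
    """Fail early with a readable message when hmmsearch found no hits."""
    if hmmer_hits.empty or "id" not in hmmer_hits.columns:
        sys.exit(
            f"Error: hmmsearch retrieved no hits for {hmm_file}. "
            "Check the HMM model and the input sequences."
        )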
When building a tree with buildtree.py, if you specify the model with the --tree_model parameter, the script does not recognize the argument. For example:
$ python3 ./code/buildtree.py \
--in genes/nas/results/ref_seqs_final.faa
--outdir genes/nas/results/ref_tree_nas_final
--msa_method "muscle"
--tree_method "iqtree"
--tree_model "LG+F+I+G4"
usage: buildtree.py [-h] --in DATA [--outdir OUTDIR] [--msa_method {muscle,mafft}] [--tree_method {iqtree,fasttree}]
[--tree_model {iqtest,modeltest,a model name}]
buildtree.py: error: argument --tree_model: invalid choice: 'LG+F+I+G4' (choose from 'iqtest', 'modeltest', 'a model name')
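The usage string suggests the placeholder text 'a model name' is declared as a literal argparse choice. A minimal sketch of the likely fix (an assumption about buildtree.py's internals): drop the choices restriction and accept any model string.

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--tree_model",
    type=str,
    default="iqtest",
    help="substitution model: 'iqtest', 'modeltest', or any model name, e.g. 'LG+F+I+G4'",
)
args = parser.parse_args(["--tree_model", "LG+F+I+G4"])  # now accepted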
When running preprocess.py on a folder containing several files, two files are created: merged_data.fasta and the output that the user specifies, e.g. cleaned.faa. The file sizes differ: what are the differences, and which one should be used in the rest of the analysis?
Currently, stdout is filled with output from the multiple tools employed along the pipeline. It would be helpful to control the verbosity level of the pipeline, e.g. output only progress indicators to stdout and the full output to a log file.
I have encountered two problems when using the relabeltree.py --taxonomy option to assign taxonomy to MARdb sequences:
Some genomes from the initial MARdb files that Jose sent are missing the MARdb ID, and therefore those sequences cannot be classified by their ID. To solve this, we need to:
a) First, figure out whether all teams are working with the same files (Complete and Partial genomes, QC by Jose).
b) Check with José whether the MAR IDs were removed during his QC - I will do that once we figure out the first part.
For those sequences with MAR IDs, the script is assigning the NCBI taxonomy and not the GTDB one. For example:
one of the genera that appears in my tree is classified as in NCBI (Beta), but in GTDB it is Gamma.
This happens with several Betas that have been reclassified as Gammas, for example in the tree.
The label color is the GTDB web classification and the square on the right is the taxonomy applied by the script. Green: Zetaproteobacteria, Pink: Gamma, Orange: Beta.
I have another example with a Streptomyces that has the NCBI taxonomy in the tree label and not the GTDB one.
Are we using an older GTDB release?
Popular request:
Create an optional argument in makedatabase.py to pass arguments to HMMER so the file doesn't have to be edited.
We need a way to deal with queries that are placed in different clusters. Currently, we remove those sequences; however, in some cases we are losing valuable information, particularly when a query is placed in different clusters that share the same "good" function.
In these cases, we would like to preserve the placements. We could take the placement with the highest LWR and, as before, remove queries placed in clusters with different functions; see the sketch below.
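A minimal sketch of the highest-LWR selection, reading a .jplace file directly (field handling follows the jplace standard; the function name is illustrative):

import json

def best_placement_per_query(jplace_path: str) -> dict:
    """Map each query name to the edge of its highest-LWR placement."""
    with open(jplace_path) as fh:
        jplace = json.load(fh)
    lwr = jplace["fields"].index("like_weight_ratio")
    edge = jplace["fields"].index("edge_num")
    best = {}
    for pquery in jplace["placements"]:
        # queries are named under "n" or, with multiplicities, under "nm"
        name = pquery["n"][0] if "n" in pquery else pquery["nm"][0][0]
        top = max(pquery["p"], key=lambda placement: placement[lwr])
        best[name] = top[edge]
    return best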
We are using different databases and label formats. It would make sense to build a general label parser to extract the genome ID from any label that we use. The parser would guess the database from the label format and return the genome ID, or an exception/None if the ID cannot be recovered or if there is a format conflict among databases; a sketch follows.
This issue would fix #61
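A minimal sketch of such a parser (the regular expressions are assumptions based on identifier styles mentioned across these issues, e.g. MMP and OceanDNA ids):

import re

PATTERNS = {
    "mardb": re.compile(r"(MMP\d+)"),            # MAR database (MMP) ids
    "oceandna": re.compile(r"(OceanDNA_b\d+)"),  # Ocean Microbiomics ids
}

def extract_genome_id(label: str):
    """Return (database, genome_id), or None when no known format matches."""
    matches = [(db, pattern.search(label)) for db, pattern in PATTERNS.items()]
    matches = [(db, m.group(1)) for db, m in matches if m is not None]
    if len(matches) > 1:
        raise ValueError(f"Label matches several database formats: {label}")
    return matches[0] if matches else None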
This new function would retrieve possible query duplicates from the environmental database and assign them the same taxonomy and cluster as the (duplicated) placed query.
We have a new script:
The script runs a double HMM search to retrieve regions of a contig containing both HMM hits, as long as they are separated by no more than 10 genes.
It would be nice to integrate this script into the pipeline and refactor the code 😃
This functionality would avoid calling preprocess.py to merge additional sequences, e.g. classified references from publications, into the reference database created via HMMs.
So far, labelplacements.py only works for MARdb sequences with valid MMP identifiers (so that a taxonomic path can be retrieved from the GTDB database). However, this method prevents cluster assignment to query sequences when the reference sequences don't come from MARdb.
This issue will modify:
to allow taxonomy-free cluster assignment.
To this end, we can create a fake taxopath containing only the cluster name (perhaps additional fields if required; it doesn't matter). gappa examine assign should then assign clusters to placed queries as if they had a valid taxopath; see the sketch below.
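A minimal sketch of writing such a taxon file for gappa examine assign (a tab-separated file mapping each reference label to its "taxopath", here just the cluster name; names are illustrative):

import csv

def write_cluster_taxon_file(ref_clusters: dict, out_path: str) -> None:
    """ref_clusters maps reference sequence label -> cluster name."""
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t")
        for label, cluster in ref_clusters.items():
            # the cluster name stands in for a full semicolon-separated taxopath
            writer.writerow([label, cluster])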
Include the quality score when filtering placements to count the taxonomic representation of the assigned function.
This issue proposes to change the format of the output table generated by:
iTOL seems to require a specific format to properly visualize labels (for instance, no spaces in sequence labels). A function to check label compatibility would be useful; a sketch follows.
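A minimal sketch of such a check (the space rule comes from the observation above; the reserved-character rule is an assumption based on the newick format):

import re

def label_problems(label: str) -> list:
    """Return a list of reasons a leaf label may not render properly."""
    problems = []
    if " " in label:
        problems.append("contains spaces")
    if re.search(r"[(),:;'\"\[\]]", label):
        problems.append("contains characters reserved by the newick format")
    return problems

# Usage: [l for l in tree_labels if label_problems(l)] lists offending labels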
Export the merged ID-label dictionary (created when inputting several reference sequence files) into a pickle file for further use. Required, for instance, by #6 and also by relabeltree.py.
Implement a filter to remove placements located too far away from the insertion point / closest reference sequences in the tree.
Many of the third-party tools employed by the code lack proper documentation (such as the wrappers in wrappers.py).
We need to add minimal documentation in these cases.
Dear all,
I have worked on assigning GTDB taxonomy to UniProt reference sequences that we might add when building a tree. I have a script, create_table.py, that modifies the merged_taxonomy.tsv file to include the NCBI taxIDs and the corresponding GTDB taxonomy, and a few lines to capture the NCBI taxID from the UniProt fasta headers, in ncbitogtdb.py.
I am not able to integrate those into the quite complex, but wonderful, scripts that we already have in the project. So, please, integrate these files/code into them. I have attached both scripts:
ncbitogtdb.py.zip
create_table.py.zip
To build the new merged_taxonomy.tsv, you will need the metadata.tsv files from GTDB. They are quite big, so it is better to download them directly from the GTDB repository of the latest release:
https://data.gtdb.ecogenomic.org/releases/latest/
For Archaea:
ar122_metadata.tar.gz
For Bacteria:
bac120_metadata.tar.gz
Notice that there are other files with a similar name but called taxonomy. Do not use those.
Port into Python the following shell one-liner, required by gappa examine assign to root the tree:
grep "out_" tests/test_results/ref_database.faa | cut -c 2- > tests/test_results/outgroup_ids.txt
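A direct port (a sketch, assuming "out_" appears only in the FASTA headers of outgroup records):

def write_outgroup_ids(fasta_file: str, out_file: str) -> None:
    """Write ids of records whose header starts with '>out_', minus the '>'."""
    with open(fasta_file) as fin, open(out_file, "w") as fout:
        for line in fin:
            if line.startswith(">out_"):
                fout.write(line[1:])

write_outgroup_ids("tests/test_results/ref_database.faa",
                   "tests/test_results/outgroup_ids.txt")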
So far, we have relied on the taxonomy info that the MAR database assigns to its genomes. This issue proposes to expand the code to include the taxonomy info from the Ocean Microbiomics Database, which ultimately comes from GTDB.
Specifically, we have to adapt
to include the new taxonomy info.
Also, consider relocating it to a separate module dedicated to taxonomy assignment.
This issue partially solves #40.
Consider integrating modeltest-ng to select the best substitution model for both Fasttree and iqtree (instead of "TEST").
Add an optional argument to write a tsv file with the assigned taxonomy of reference sequences, as well as a pie chart depicting the frequencies at the chosen taxonomic level.
This feature would be useful for getting a sense of the diversity captured in the reference database.
Writing a web app to manually select (i.e., with the mouse) clusters in a phylogenetic tree would be useful. We could employ cytoscape.js to this end. Here are some resources:
A few sequences get classified outside of cluster IV in some samples; there may be a problem with either the CART model or its implementation.
Enable adding GTDB taxonomy to MARdb sequences in the reference tree in relabeltree.py.