muellan / metacache
memory efficient, fast & precise taxonomic classification system for metagenomic read mapping
License: GNU General Public License v3.0
Hi,
I wonder if MetaCache can also work with Ion Torrent fastq files (single-end)? Do these types of files need any pre-processing?
Many thanks,
Jaime
Hi. I'm running MetaCache query with the -abundances profile.tsv and -abundance-per species flags. It appears this writes two profiling results to profile.tsv: the full taxon profile and a species profile. However, these profiles do not appear to agree. For example, the taxon profile reports Streptococcus dysgalactiae subsp. equisimilis at 1.15% and no other results for S. dysgalactiae. The species profile reports S. dysgalactiae at 1.93%. Why is there a discrepancy?
Relevant lines from profile.tsv:
# query summary: number of queries mapped per taxon
# rank:name | taxid | number of reads | abundance
...
subspecies:Streptococcus dysgalactiae subsp. equisimilis 119602 80740 1.15343%
...
# estimated abundance (number of queries) per species
# rank:name | taxid | number of reads | abundance
...
species:Streptococcus dysgalactiae 1334 135610 1.93728%
...
Is the best prediction of the abundance of S. dysgalactiae by MetaCache 1.15% or 1.93%?
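One way to check whether the subspecies rows roll up to the species figure is to sum the read counts of all S. dysgalactiae subspecies rows in the first profile and set that against the species row of the second. A sketch, assuming the columns of profile.tsv are tab-separated (the excerpt above has the tabs flattened to spaces) with the read count in column 3:

```shell
# Sum the per-subspecies read counts for S. dysgalactiae and print the total,
# to compare with the 135610 reads reported in the species section.
awk -F'\t' '/^subspecies:Streptococcus dysgalactiae/ { sum += $3 }
            END { print sum }' profile.tsv
```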
Thanks,
Donovan
The rank/tax. name combination is not necessarily unique; would it be possible to
include NCBI taxids as an additional output column?
When I try to build the RefSeq database, the process gets killed at 46%. This happens every time at exactly the same percentage. There is enough storage available. Hardware specs: 64 GB RAM, 16 cores.
Building new database 'refseq' from reference sequences.
Max locations per feature set to 254
Reading taxon names ... done.
Reading taxonomic node mergers ... done.
Reading taxonomic tree ... 2429955 taxa read.
Taxonomy applied to database.
------------------------------------------------
MetaCache version 2.0.1 (20210305)
database version 20200820
------------------------------------------------
sequence type mc::char_sequence
target id type unsigned short int 16 bits
target limit 65535
------------------------------------------------
window id type unsigned int 32 bits
window limit 4294967295
window length 127
window stride 112
------------------------------------------------
sketcher type mc::single_function_unique_min_hasher<unsigned int, mc::same_size_hash<unsigned int> >
feature type unsigned int 32 bits
feature hash mc::same_size_hash<unsigned int>
kmer size 16
kmer limit 16
sketch size 16
------------------------------------------------
bucket size type unsigned char 8 bits
max. locations 254
location limit 254
------------------------------------------------
Reading sequence to taxon mappings from genomes/refseq/archaea/assembly_summary.txt
Reading sequence to taxon mappings from genomes/refseq/bacteria/assembly_summary.txt
Reading sequence to taxon mappings from genomes/refseq/viral/assembly_summary.txt
Reading sequence to taxon mappings from genomes/taxonomy/assembly_summary_refseq.txt
Reading sequence to taxon mappings from genomes/taxonomy/assembly_summary_refseq_historical.txt
Processing reference sequences.
[=================> ] 24%
[==================================> ] 46%Killed
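A process that is "Killed" at the same point on every run is typically the kernel's out-of-memory killer, which would fit a 64 GB machine building the full RefSeq database. The kernel log can confirm this (sketch; reading the kernel log may require root on some systems):

```shell
# Look for OOM-killer entries naming the metacache process; if present,
# the build needs more RAM or a partitioned database.
dmesg -T | grep -i -E 'out of memory|oom|killed process' | tail
```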
Hi,
I am trying to build a database from RefSeq and GenBank genomes.
The total size of the ~1.9 million compressed genomes is ~8.5 TB. Since the data set contains many genomes, some of which have extremely long chromosomes, I built MetaCache with:
make MACROS="-DMC_TARGET_ID_TYPE=uint64_t -DMC_WINDOW_ID_TYPE=uint64_t -DMC_KMER_TYPE=uint64_t"
What will the peak memory consumption during the build be when partitioning into partitions of size x M?
At the moment I am running with ${p2mc}/metacache-partition-genomes ${p2g} 1600000, since I have at most ~1.9 TB of memory available.
Is the partition size reasonable?
Does it matter for the partition size calculation whether the genomes are compressed or not?
Would it be beneficial for build time and memory consumption to create more, smaller partitions instead of fewer large ones? There was a similar question in #33 (comment), with advice to build fewer partitions to keep the merging time in check.
Should I then try to find the maximum partition size that will fit into memory during building?
Since I am partitioning anyway, do I actually need to compile with uint64_t, or can I check the largest partition, sequence count-wise, and see if I can get away with uint32_t?
Would you expect performance differences between querying a single db, few partitions, and many partitions, using the merge functionality with the latter two?
I chose -kmerlen 20 based on Figure SF1 from the publication. Would you advise against this in favor of the default value of 16, maybe to keep the computational resource demand and query speed at a reasonable level?
Should other sketching parameters be adjusted as well for a value of 20, or are the defaults fine?
Since the reference data set is large, should some of the advanced options be set/adjusted, e.g. -remove-overpopulated-features? If so, what values should be chosen, based on the reference data?
I would be very grateful for any guidance you have.
Best
Oskar
Just tried to build the database for MetaCache, but neither of these commands is actually downloading the genomes. I tried metacache-build-refseq first; when it didn't download anything, I tried download-ncbi-genomes, thinking I might need to download the genomes 'manually', as it were. Neither command gives an error per se; both download the assembly_summary.txt file, then when that's finished, print the following line and return to the prompt:
cat: ftpdirpaths: input file is output file
Any ideas what I'm doing wrong? I was really intrigued by the paper and wanted to see how the tool performed on my data.
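For what it's worth, cat prints "input file is output file" when the file it is redirected to also appears in its own input list, e.g. a leftover ftpdirpaths from an earlier run being swept up again. A sketch of the failure and a workaround (the exact line in the download script may differ, and the ftpdirpaths_part_* names here are hypothetical):

```shell
# Failing pattern (hypothetical reconstruction): the glob matches the
# redirection target itself, so cat bails out and ftpdirpaths stays empty.
#   cat ftpdirpaths* > ftpdirpaths
# Workaround: remove leftovers and write to a temporary name first.
rm -f ftpdirpaths
cat ftpdirpaths_part_* > ftpdirpaths.tmp && mv ftpdirpaths.tmp ftpdirpaths
```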
Hi
I ran into this error trying to compile metacache:
mkdir build_release
g++ -std=c++11 -O3 -Wall -Wextra -Wpedantic -c src/args_handling.cpp -o build_release/args_handling.o
g++ -std=c++11 -O3 -Wall -Wextra -Wpedantic -c src/cmdline_utility.cpp -o build_release/cmdline_utility.o
g++ -std=c++11 -O3 -Wall -Wextra -Wpedantic -c src/filesys_utility.cpp -o build_release/filesys_utility.o
g++ -std=c++11 -O3 -Wall -Wextra -Wpedantic -c src/main.cpp -o build_release/main.o
src/main.cpp: In function ‘int main(int, char**)’:
src/main.cpp:72:16: error: ‘runtime_error’ in namespace ‘std’ does not name a type
catch(std::runtime_error& e) {
^
src/main.cpp:73:37: error: ‘e’ was not declared in this scope
std::cerr << "\nABORT: " << e.what() << "!" << std::endl;
^
src/main.cpp:75:16: error: ‘invalid_argument’ in namespace ‘std’ does not name a type
catch(std::invalid_argument& e) {
^
src/main.cpp:76:37: error: ‘e’ was not declared in this scope
std::cerr << "\nABORT: " << e.what() << "!" << std::endl;
^
make: *** [build_release/main.o] Error 1
Any advice on how to resolve it would be great! Thank you.
Argument handling could need some improvements:
$ ./metacache query
terminate called after throwing an instance of 'std::length_error'
what(): vector::reserve
Aborted
Hi,
Not urgent, but it would be useful to have a CLI option that indicated the software version. Or alternatively if the version number was just indicate somewhere in the help so it could be parsed out.
Thanks,
Donovan
Hi,
I'm a little confused about how to interpret the -abundances file generated by MetaCache. In particular, I'm confused about how to interpret the results with -lowest set to subspecies. In a simple mock I have, if I run MetaCache with -lowest at the default I get Salmonella enterica subsp. enterica reported at 8.9%. However, if I run MetaCache with -lowest set to subspecies, I get Salmonella enterica subsp. enterica reported at 1.9%.
I'm unclear why this discrepancy occurs. Which result is generally considered more accurate?
Thanks,
Donovan
Hello,
I'm having trouble interpreting the per-read mapping report, specifically the taxid. I noticed there are some negative ones and when plotting with KronaTools most are not identified but some are classified as Eukaryota. Could you please clarify what they are?
Is there a recommended tool for plotting the results?
Thank you.
Hi,
I'm experiencing some issues building a RefSeq reference database for MetaCache v2.1.1.
The first issue is that the NCBI assembly_summary.txt files now have a few entries where the ftp_path field is set to na. This breaks the metacache-build-refseq script. It is easy enough to work around once you know the problem, but it takes a bit of exploring to figure out.
The more major issue is that the metacache build command has been stuck at 98% for several hours now and all the metacache processes are in a sleep state. I am running this on a 16 CPU machine with 126 GB of RAM. Only 82 GB of RAM is used at the moment, so this doesn't appear to be a memory issue. I can also verify I have plenty of disk space.
Have you experienced this problem? Can anyone verify that they have recently been able to build a MetaCache DB from complete RefSeq genomes?
Thanks,
Donovan
Hi. Is the target limit set by the MC_TARGET_ID_TYPE data type the maximum number of genome assemblies or the maximum number of contigs/scaffolds?
Hi,
I built a custom database and have used the following command to query my paired end fastq files:
./metacache query Metacache_DB ${INPUT1} ${INPUT2} -pairfiles -abundances test_abundance.txt -lowest species -highest domain -out test_results.txt -threads ${THREAD_NUM} -omit-ranks -taxids-only -separate-cols -lineages
This produces an abundance text file and a results file. The results file looks something like below:
# TABLE_LAYOUT: query_header | taxid | taxid | taxid | taxid | taxid | taxid | taxid | taxid | taxid | taxid | taxid | taxid | taxid | taxid | taxid | taxid
READ:INFO:12701/1 | 10115 | 0 | 8909 | 0 | 0 | 0 | 8103 | 0 | 6795 | 5817 | 0 | 3719 | 0 | 0 | 2
I know that the first taxid column will contain the lowest possible rank for that read, so I am thinking of going through each read in the results file, extracting the first taxid, obtaining its corresponding higher ranks (genus, family, order, etc.) using my custom taxonomic hierarchy file, and adding a count to each of those ranks. In this particular example, the taxid 10115 is the species Acetobacter persici. The genus would be Acetobacter, family Acetobacteraceae, order Acetobacterales, class Alphaproteobacteria, phylum Proteobacteria, and domain Bacteria. I would add one count to each of these names, so by the end of my script run, I should have the read counts for each of the taxonomic ranks.
Is this the right way of going about this if I want to visualize my samples across various taxonomic ranks using an R package like phyloseq?
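That per-rank counting could be sketched in awk, assuming a hypothetical two-column lineage file (taxid, then a semicolon-separated lineage) and that the second |-separated column of the results file holds the lowest-rank taxid, as in the example row above:

```shell
# lineage.tsv: <taxid>\t<genus;family;order;...>   (hypothetical format)
# results.txt: MetaCache per-read output as shown above
awk 'NR == FNR { split($0, a, "\t"); lin[a[1]] = a[2]; next }  # taxid -> lineage
     /^READ/   { split($0, f, " \\| ")                         # result columns
                 n = split(lin[f[2]], r, ";")
                 for (i = 1; i <= n; i++) count[r[i]]++ }
     END       { for (t in count) print count[t], t }' lineage.tsv results.txt
```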
Thank you.
Dear all,
is there a GPU version of metacache on bioconda?
Can we also define MC_KMER_TYPE for the GPU version? Because from the code it doesn't look like it:
Lines 44 to 48 in 4c6db91
Several classification tools provide "memory-mapping" option.
When running a huge number of samples, instead of loading db into memory everytime, "memory-mapping" option will allow to preload the db into ram once and run classification across all the samples which improves run time by a huge margin.
I am getting this error every time I run a query to the database.
Reading database metadata ...
Reading 1 database part(s) ...
Completed database reading.
custom query sketching settings: -sketchlen 32 -winlen 127 -winstride 112
Classifying query sequences.
Per-Read mappings will be written to file: /local/workdir/metacache/results_DBsimulated/reduced_32/sample_kmer32.txt_sample_10_1m_R1.fq_sample_10_1m_R2.fq.txt
Per-Taxon mappings will be written to file: /local/workdir/metacache/results_DBsimulated/reduced_32/abund_extraction.txt_sample_10_1m_R1.fq_sample_10_1m_R2.fq.txt
[> ] 0%Segmentation fault (core dumped)
Can the use of long genomic reads improve the accuracy?
Thanks!
Hi,
I had built a database for the GTDB, but when I check the database, all I see are the following files in my main non-database directory: GTDB_DB.cache0 & GTDB_DB.meta
I am wondering if these are the files I will need to use to classify my fastq files?
This is the command I used: ./metacache build GTDB_DB /path/to/genomes/ -taxonomy /path/to/custom/taxonomy/
Thank you.
Hi. We've run into a small issue that we are hoping can be fixed in the next release. The abundance profile produced with the -abundances flag reports pair counts in scientific notation when numbers get large, e.g.:
# query summary: number of queries mapped per taxon
# rank name taxid number of reads abundance
domain Archaea 439684927 461 0.0130496%
domain Bacteria 609216830 1.05068e+06 29.7417%
This can result in small errors due to rounding. For example, in this case there are really 1050675 bacterial read pairs, but the count gets rounded up to 1050680. While having 5 extra read pairs is minor in terms of the resulting abundance estimates, it makes it challenging to track the fate of all reads. In our code, we have a check that the number of input reads is equal to the number of reads in the MetaCache abundance profile (including unclassified). This is just a unit test to ensure our parsing is correct and that no reads are lost during any manipulation of data, but, more generally, not being able to account for all reads is a bit scary.
I imagine the intent is for this profile to produce integers, so I am hoping this can be fixed in the next release. Thanks.
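The accounting check described above can be sketched outside MetaCache as well, assuming a FASTQ input named sample_R1.fq (hypothetical) and a tab-separated profile whose 4th field is the read count (the excerpt above has the tabs flattened):

```shell
# Number of reads in the input FASTQ (4 lines per record).
reads_in=$(( $(wc -l < sample_R1.fq) / 4 ))
# Sum the per-taxon read counts, forcing scientific notation back to an
# integer with printf (this is where 1.05068e+06 loses its last digits).
reads_profiled=$(awk -F'\t' '!/^#/ { sum += $4 } END { printf "%.0f", sum }' profile.tsv)
echo "$reads_in vs $reads_profiled"
```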
Hi,
I am trying to reproduce some of the experimental results on my own machine, but I got stuck in the database construction.
It seems that the default database script only builds the NCBI RefSeq (Release 90). Correct? Assuming that is true, I decided to build AFS20 using the commands for custom database construction since I want to see those results. So, I downloaded the first 20 reference genomes of Table 1 from the NCBI database, decompressed them, and put all the .fna files in one folder. Then I simply ran 1) make MACROS="-DMC_TARGET_ID_TYPE=uint32_t" 2) ./metacache build AFS20 Data/AFS20 -taxonomy genomes/taxonomy.
However, the program has been stuck in the "Processing reference sequences." phase for more than 10 hours. Can you tell me what I am doing wrong here?
I really appreciate any help you can provide.
Hi,
I've run into an issue where MetaCache runs as expected using the following parameters, but crashes with "Command terminated by signal 11" when the -lowest species flag is added:
-pairfiles -no-map -taxids -lineage -separate-cols -threads 32 -abundances profile.tsv -abundance-per species -out classification.log
Is there a set of incompatible flags I'm using, or is it possible that using the -lowest flag has uncovered a bug?
Thanks,
Donovan
Hi,
I tried to run the database build, but it seems that my machine is only running MetaCache on a single thread. Is there a way to increase this? I do not see an argument for specifying threads in MetaCache.
Thanks.
Downloading from NCBI using ascp is much, much faster than wget. A change like this to your download script will speed it up by a lot:
FTPURL="ftp://ftp.ncbi.nih.gov/genomes"
wget $FTPURL/refseq/bacteria/assembly_summary.txt
# complete genomes
awk -F "\t" '$12=="Complete Genome" && $11=="latest"{print $20}' assembly_summary.txt > ftpdirpaths
awk 'BEGIN{FS=OFS="/";filesuffix="genomic.fna.gz"}{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print ftpdir,file}' ftpdirpaths | sed 's@ftp://ftp.ncbi.nlm.nih.gov@@' > ftpfilepaths
# ascp instead of wget
ascp -T -k2 -l 1000M -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh --user=anonftp --file-list=ftpfilepaths --mode=recv --host=ftp.ncbi.nlm.nih.gov .
Hi. Quick question. What do the form and variety attributes for -abundance-per mean? I've always taken variety to be a synonym of subspecies and am not clear what form means in terms of taxonomy. Thanks.
Hi,
I'm aiming to build a custom MetaCache DB by specifying the NCBI taxon IDs for my genomes through the assembly_summary.txt file. However, it appears MetaCache might make some assumptions about either the genome filenames or the contig IDs. I have the following test setup with 2 genomes:
Genome files:
A.fna
B.fna
assembly_summary.txt:
# Created by build_custom_metacache_db.py
# assembly_accession taxid
A.fna 1423
B.fna 1423
MetaCache results:
> metacache build test_db genomes -taxonomy /data/ncbi_taxonomy
Building new database 'copper_custom_db' from reference sequences.
Max locations per feature set to 254
Reading taxon names ... done.
Reading taxonomic node mergers ... done.
Reading taxonomic tree ... 2450976 taxa read.
Taxonomy applied to database.
------------------------------------------------
MetaCache version 2.2.1 (20220112)
database version 20200820
------------------------------------------------
sequence type mc::char_sequence
target id type unsigned int 32 bits
target limit 4294967295
------------------------------------------------
window id type unsigned int 32 bits
window limit 4294967295
window length 127
window stride 112
------------------------------------------------
sketcher type mc::single_function_unique_min_hasher<unsigned int, mc::same_size_hash<unsigned int> >
feature type unsigned int 32 bits
feature hash mc::same_size_hash<unsigned int>
kmer size 16
kmer limit 16
sketch size 16
------------------------------------------------
bucket size type unsigned char 8 bits
max. locations 254
location limit 254
------------------------------------------------
Reading sequence to taxon mappings from genomes/assembly_summary.txt
Reading sequence to taxon mappings from /data/wgs_pipeline/metacache_db/prok_viral/20211207/db_build_data/ncbi_taxonomy/assembly_summary_refseq.txt
Reading sequence to taxon mappings from /data/wgs_pipeline/metacache_db/prok_viral/20211207/db_build_data/ncbi_taxonomy/assembly_summary_refseq_historical.txt
Processing reference sequences.
Added 45 reference sequences in 3.623 s
targets 45
ranked targets 0
taxa in tree 2450976
------------------------------------------------
buckets 1500007
bucket size max: 1 mean: 1 +/- 0 <> -nan
features 5472
dead features 0
locations 5472
------------------------------------------------
1 targets are unranked.
1 targets remain unranked.
Writing database to file ... Writing database metadata to file 'test_db.meta' ... done.
Writing database part to file 'test_db.cache0' ... done.
done.
Total build time: 4.477 s
The output suggests it is unable to identify my genomes, since ranked targets is 0 and it indicates there is 1 target that remains unranked. Do I need to give the input genome files names with a certain format? Or do the contig IDs need to have a specific format? I've also tried changing the assembly_summary.txt file to just use the assembly_accession values A and B without the .fna extension.
The two genome files and the assembly_summary.txt file are in the genomes directory.
Any insight into what may be happening would be much appreciated.
Thanks,
Donovan
Hello,
I noticed that MetaCache can classify reads at the sequence level. If I had my own strain level database, would I be able to use MetaCache to classify reads at strain level?
Thank you.
Hi there!
When I try building the metacache package (v2.2.3) with multiple macros, make throws an invalid option error.
It seems like $(MACROS) should be quoted here:
Line 136 in 532a737
For now, I build metacache by running this command directly:
make release_dummy DIR=build_release ARTIFACT=metacache MACROS="-DMC_TARGET_ID_TYPE=uint32_t -DMC_WINDOW_ID_TYPE=uint32_t"
BR,
Zulfat
Dear Andre,
why does MetaCache classify almost all reads at the species level when querying an extensive database (1000 genomes partitioned into 28 DBs)?
Is there any inherent error bias introduced by partitioning databases and posterior merging?
Can you also explain the difference between -hitmin and -hitdiff? (On the GitHub page the description for both is the same.)
Additionally, is there a way to set a threshold to classify at the species level based on the number of reads (Is this what you refer to as a feature)? For example, a species will be considered "present" if at least five reads are assigned.
We are contemplating the number of possible k-mers one can make with k = 16: 4^16 ≈ 4.3 billion available/possible k-mers shared by all species in a 1000 genome database (correct?) if they all were in the same DB.
For a 1 Gb genome, there are about 8.9 million 127 bp windows at the default stride of 112, and if 16 k-mers are selected from each, roughly 143 million k-mers from each genome without any threshold and just one genome, i.e. only a few percent of the 4.3 billion possible k-mers. I'm not sure of these numbers. Could you let me know if they sound right?
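The arithmetic can be sanity-checked with 64-bit shell integer math, assuming the default window length of 127 and stride of 112 as printed in the build logs above:

```shell
echo $((4 ** 16))             # distinct 16-mers: 4294967296, i.e. ~4.3 billion
echo $((10**9 / 112))         # 127 bp windows in a 1 Gb genome at stride 112: ~8.9 million
echo $((10**9 / 112 * 16))    # sketched k-mers per 1 Gb genome: ~143 million
```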
We were trying to extract info from the databases using the info command and found that featureCounts produces:
63168104 -> 69
805452997 -> 2
303063736 -> 10
But we are not able to relate this number to species. Is this a feature number? Is the feature different from k-mer?
Are Targets synonymous with sequences?
We want stats on how many k-mers and unique k-mers represent each species within the database. Is it possible to retrieve this using the current commands?
Dear Andre,
I am working with a partitioned database (44 total chunks, each 16GB). The querying of the database works well (I believe) and reasonably fast. However, once I have the 44 txt files, and I am merging these to get abundance-per-species, the computation is taking a lot of time (now running for three days). I am unsure if I am doing something wrong or if this is normal behavior (each txt file is around 100-300GB, and the fastq pair-end file contains 100 species with 10,000 reads each, a total of 1 million reads). Below is an example of the line I used to merge the text files:
metacache merge results_mix05 -taxonomy ncbi_taxonomy -lowest species -abundances 05_abund_extraction.txt -abundance-per species
Also, I am using the 64-bit version of metacache to generate k-mers of size 32.
Do you think that partitioning the database into even smaller chunks would make merging the final results faster?
I will appreciate any guidance you could provide.
Best wishes,
Jaime
Hi.
Are there plans to support reads in gzip-compressed format (i.e. my_reads.fq.gz)? This would be a major help for incorporating MetaCache into workflows.
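Until native .gz support exists, a possible stopgap in bash is to decompress on the fly with process substitution, assuming MetaCache simply reads from whatever path it is handed (a /dev/fd path is not seekable, so this may not work in every mode):

```shell
# Hypothetical invocation: "refseq" is a placeholder database name; zcat
# streams the decompressed reads without writing an uncompressed copy to disk.
./metacache query refseq <(zcat my_reads.fq.gz) -out results.txt
```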
Cheers,
Donovan
Hello,
if I use the query option with a fastq file containing more than 15 Ts in a row, I get a segmentation fault. If I replace the 16 Ts with 16 As (or more) there is no segmentation fault.
This fastq works properly:
@QHOT2:10403:11886
CTGACAGCATGTTGACTTTTGCTTCCTTTGTCAAGCACAGAAAAACAGGAAGCACAAAATCATTTTTTTTTTTTTTT
+
But if I add an extra T at the end, I get a segmentation fault.
@QHOT2:10403:11886
CTGACAGCATGTTGACTTTTGCTTCCTTTGTCAAGCACAGAAAAACAGGAAGCACAAAATCATTTTTTTTTTTTTTTT
+
thank you
I tried to use a tab as the separator by setting the '-separator "\t"' parameter for query. However, the escape sequence '\t' is printed literally in the output instead of being converted into a tab.
e.g., ERR024951.132\tspecies\tKlebsiella pneumoniae\t573
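A shell-side workaround that may help until escape sequences are interpreted: in bash, ANSI-C quoting expands \t into a real tab before metacache ever sees the argument, so the flag receives an actual tab character rather than the two characters backslash-t:

```shell
# Hypothetical invocation ("refseq" and "reads.fq" are placeholder names);
# $'\t' is a single tab character under bash's ANSI-C quoting.
./metacache query refseq reads.fq -separator $'\t' -out results.tsv
```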
Hi, I am trying to compile MetaCache for use with a large reference data set.
I am running it like this:
git clone https://github.com/muellan/metacache.git
cd metacache
mamba activate compile
LD_LIBRARY_PATH=${CONDA_PREFIX}/lib:${LD_LIBRARY_PATH}
export LD_LIBRARY_PATH
CPLUS_INCLUDE_PATH=${CONDA_PREFIX}/include:${CPLUS_INCLUDE_PATH}
export CPLUS_INCLUDE_PATH
make MACROS="-DMC_TARGET_ID_TYPE=uint64_t -DMC_WINDOW_ID_TYPE=uint64_t -DMC_KMER_TYPE=uint64_t"
I get the following error:
make release_dummy DIR=build_release ARTIFACT=metacache MACROS="-DMC_TARGET_ID_TYPE=uint64_t -DMC_WINDOW_ID_TYPE=uint64_t -DMC_KMER_TYPE=uint64_t"
make[1]: Entering directory '/mnt/data/local_tools/metacache'
.../miniconda3/envs/compile/bin/x86_64-conda-linux-gnu-c++ -DMC_TARGET_ID_TYPE=uint64_t -DMC_WINDOW_ID_TYPE=uint64_t -DMC_KMER_TYPE=uint64_t -std=c++14 -Wall -Wextra -Wpedantic -I/include -O3 -c src/building.cpp -o build_release/building.o
In file included from src/options.h:31,
from src/candidate_structs.h:28,
from src/candidate_generation.h:27,
from src/database.h:27,
from src/building.h:27,
from src/building.cpp:24:
src/taxonomy.h: In member function 'void mc::ranked_lineages_of_targets::update(mc::target_id)':
src/taxonomy.h:980:45: error: no matching function for call to 'min(unsigned int, long unsigned int)'
980 | const unsigned numThreads = std::min(4U, numNewTargets / (1U << 10));
| ~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from .../miniconda3/envs/compile/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/char_traits.h:39,
from .../miniconda3/envs/compile/x86_64-conda-linux-gnu/include/c++/11.3.0/string:40,
from src/bitmanip.h:29,
from src/dna_encoding.h:27,
from src/hash_dna.h:27,
from src/config.h:34,
from src/candidate_structs.h:27,
from src/candidate_generation.h:27,
from src/database.h:27,
from src/building.h:27,
from src/building.cpp:24:
.../miniconda3/envs/compile/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/stl_algobase.h:230:5: note: candidate: 'template<class _Tp> constexpr const _Tp& std::min(const _Tp&, const _Tp&)'
230 | min(const _Tp& __a, const _Tp& __b)
| ^~~
.../miniconda3/envs/compile/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/stl_algobase.h:230:5: note: template argument deduction/substitution failed:
In file included from src/options.h:31,
from src/candidate_structs.h:28,
from src/candidate_generation.h:27,
from src/database.h:27,
from src/building.h:27,
from src/building.cpp:24:
src/taxonomy.h:980:45: note: deduced conflicting types for parameter 'const _Tp' ('unsigned int' and 'long unsigned int')
980 | const unsigned numThreads = std::min(4U, numNewTargets / (1U << 10));
| ~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from .../miniconda3/envs/compile/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/char_traits.h:39,
from .../miniconda3/envs/compile/x86_64-conda-linux-gnu/include/c++/11.3.0/string:40,
from src/bitmanip.h:29,
from src/dna_encoding.h:27,
from src/hash_dna.h:27,
from src/config.h:34,
from src/candidate_structs.h:27,
from src/candidate_generation.h:27,
from src/database.h:27,
from src/building.h:27,
from src/building.cpp:24:
.../miniconda3/envs/compile/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/stl_algobase.h:278:5: note: candidate: 'template<class _Tp, class _Compare> constexpr const _Tp& std::min(const _Tp&, const _Tp&, _Compare)'
278 | min(const _Tp& __a, const _Tp& __b, _Compare __comp)
| ^~~
.../miniconda3/envs/compile/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/stl_algobase.h:278:5: note: template argument deduction/substitution failed:
In file included from src/options.h:31,
from src/candidate_structs.h:28,
from src/candidate_generation.h:27,
from src/database.h:27,
from src/building.h:27,
from src/building.cpp:24:
src/taxonomy.h:980:45: note: deduced conflicting types for parameter 'const _Tp' ('unsigned int' and 'long unsigned int')
980 | const unsigned numThreads = std::min(4U, numNewTargets / (1U << 10));
| ~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from .../miniconda3/envs/compile/x86_64-conda-linux-gnu/include/c++/11.3.0/algorithm:62,
from src/../dep/hpc_helpers/include/cuda_helpers.cuh:7,
from src/dna_encoding.h:30,
from src/hash_dna.h:27,
from src/config.h:34,
from src/candidate_structs.h:27,
from src/candidate_generation.h:27,
from src/database.h:27,
from src/building.h:27,
from src/building.cpp:24:
.../miniconda3/envs/compile/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/stl_algo.h:3449:5: note: candidate: 'template<class _Tp> constexpr _Tp std::min(std::initializer_list<_Tp>)'
3449 | min(initializer_list<_Tp> __l)
| ^~~
.../miniconda3/envs/compile/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/stl_algo.h:3449:5: note: template argument deduction/substitution failed:
In file included from src/options.h:31,
from src/candidate_structs.h:28,
from src/candidate_generation.h:27,
from src/database.h:27,
from src/building.h:27,
from src/building.cpp:24:
src/taxonomy.h:980:45: note: mismatched types 'std::initializer_list<_Tp>' and 'unsigned int'
980 | const unsigned numThreads = std::min(4U, numNewTargets / (1U << 10));
| ~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from .../miniconda3/envs/compile/x86_64-conda-linux-gnu/include/c++/11.3.0/algorithm:62,
from src/../dep/hpc_helpers/include/cuda_helpers.cuh:7,
from src/dna_encoding.h:30,
from src/hash_dna.h:27,
from src/config.h:34,
from src/candidate_structs.h:27,
from src/candidate_generation.h:27,
from src/database.h:27,
from src/building.h:27,
from src/building.cpp:24:
.../miniconda3/envs/compile/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/stl_algo.h:3455:5: note: candidate: 'template<class _Tp, class _Compare> constexpr _Tp std::min(std::initializer_list<_Tp>, _Compare)'
3455 | min(initializer_list<_Tp> __l, _Compare __comp)
| ^~~
.../miniconda3/envs/compile/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/stl_algo.h:3455:5: note: template argument deduction/substitution failed:
In file included from src/options.h:31,
from src/candidate_structs.h:28,
from src/candidate_generation.h:27,
from src/database.h:27,
from src/building.h:27,
from src/building.cpp:24:
src/taxonomy.h:980:45: note: mismatched types 'std::initializer_list<_Tp>' and 'unsigned int'
980 | const unsigned numThreads = std::min(4U, numNewTargets / (1U << 10));
| ~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
make[1]: *** [Makefile:215: build_release/building.o] Error 1
make[1]: Leaving directory '/mnt/data/local_tools/metacache'
make: *** [Makefile:137: release] Error 2
Just make or e.g. make MACROS="-DMC_TARGET_ID_TYPE=uint32_t -DMC_WINDOW_ID_TYPE=uint32_t" works.
Am I calling it wrong somehow? make MACROS="-DMC_KMER_TYPE=uint64_t" from the example also fails in the same manner.
Best
Oskar
Hi again,
I just ran MetaCache on a sample dataset. I used the following command:
./metacache query mydb.db test_genomes/ -out test_results.txt -lowest subspecies -abundances test_abundances.txt
When I took a look at the test_abundances.txt file, I only see rows for domain, family, and subspecies abundances, like below:
domain:Bacteria | 3 | 56893 | 0.123854%
family:Actinomycetaceae | 101 | 319 | 0.00069445%
subspecies:strain_158623 | 8058 | 1 | 2.17696e-06%
subspecies:strain_287365 | 8174 | 4422 | 0.00962652%
subspecies:strain_627946 | 8239 | 167 | 0.000363552%
subspecies:strain_623479 | 10643 | 246807 | 0.537289%
I do not see species, genus, order, class, or phylum abundances. However, when I ran the above command with -abundance-per phylum, I was able to get the abundances for all the phylum ranks. Then I decided to use the -allranks option that was mentioned in the documentation to get all ranks, but this always brings me to the help documentation on my command line. I am not sure how to get -allranks working. My main goal is to classify all my reads at all the different ranks. Please let me know how I can do this or if I am making any mistakes here.
Edit: I suppose -allranks only works with -precision, and for that I would need the header information set correctly.
Thank you.
Hi, I am testing the validity of using smaller databases and then merging the results. However, when I am testing this, the results do not match those of querying one DB. For example, I have a DB with 40 species and created two DBs with 20 species each. When I use the MERGE function, the results of merging the two 20-species DB do not match the abundance results from the full 40-species DB. Here are the commands I am using:
metacache build 20sp_DB1 /test_merge_DB/DB1 -taxonomy ncbi_taxonomy -remove-overpopulated-features
metacache build 20sp_DB2 /test_merge_DB/DB2 -taxonomy ncbi_taxonomy -remove-overpopulated-features
metacache query 20sp_DB1 MixA_1.fastq.gz MixA_2.fastq.gz -pairfiles -tophits -queryids -lowest species -out res1.txt
metacache query 20sp_DB2 MixA_1.fastq.gz MixA_2.fastq.gz -pairfiles -tophits -queryids -lowest species -out res2.txt
metacache merge res1.txt res2.txt -lowest species -taxonomy ncbi_taxonomy -max-cand 4 -hitmin 2 -hitdiff 2 -mapped-only -abundances test_abundance.txt -abundance-per species > out_metacache_merge.txt
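To quantify how far apart the two runs actually are, one can diff the abundance tables keyed on taxid. This is a toy sketch (names and values are mine, not MetaCache output); the {taxid: abundance} dicts would come from parsing the respective abundance files.

```python
# Hypothetical comparison of two abundance tables (e.g. full-DB vs. merged),
# keyed on taxid. Abundances are in percent; `tol` is the difference in
# percentage points above which a taxon is flagged.

def compare_abundances(full, merged, tol=0.1):
    """Return {taxid: (full_abundance, merged_abundance)} for taxa that differ."""
    diffs = {}
    for taxid in set(full) | set(merged):
        a = full.get(taxid, 0.0)    # missing taxa count as 0%
        b = merged.get(taxid, 0.0)
        if abs(a - b) > tol:
            diffs[taxid] = (a, b)
    return diffs
```

Listing which taxa drive the disagreement (and by how much) usually makes it easier to tell whether the gap comes from reads mapped in only one sub-database or from genuinely different candidate scoring after the merge.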
Hi,
It appears a number of users are experiencing segmentation faults under different conditions (issues 27, 28). I am experiencing this with a custom DB I just built. The DB builds without issue; I've posted the build output below. The segmentation fault occurs even when running MetaCache v2.2.3 with default settings, e.g.:
> metacache query my_db my_seqs.fq.gz -out results.txt
The segmentation fault occurs immediately after the DB is loaded:
> metacache query my_db my_seqs.fq.gz -out results.txt
Reading database metadata ...
Reading 1 database part(s) ...
Completed database reading.
Classifying query sequences.
Per-Read mappings will be written to file: results.txt
[> ] 0%Segmentation fault (core dumped)
This seems to be the same issue as previously reported. I'm running this on a machine with lots of memory (512 GB; reference DB is ~223 GB on disk and looks to require about the same amount of RAM) and disk space (>100 GB free). Notably, the following command also causes a segmentation fault:
> metacache info my_db lin > lineages.tsv
Reading database from file 'gtdb_r214_db_ext' ... Reading database metadata ...
Completed database reading.
done.
Segmentation fault (core dumped)
Any insights into what might be causing this segmentation fault?
Thanks,
Donovan
Output from building custom DB:
Max locations per feature set to 254
Reading taxon names ... done.
Reading taxonomic node mergers ... done.
Reading taxonomic tree ... 184936 taxa read.
Taxonomy applied to database.
------------------------------------------------
MetaCache version 2.3.2
database version 20200820
------------------------------------------------
sequence type mc::char_sequence
target id type unsigned int 32 bits
target limit 4294967295
------------------------------------------------
window id type unsigned int 32 bits
window limit 4294967295
window length 127
window stride 112
------------------------------------------------
sketcher type mc::single_function_unique_min_hasher<unsigned int, mc::same_size_hash<unsigned int> >
feature type unsigned int 32 bits
feature hash mc::same_size_hash<unsigned int>
kmer size 16
kmer limit 16
sketch size 16
------------------------------------------------
bucket size type unsigned char 8 bits
max. locations 254
location limit 254
------------------------------------------------
Reading sequence to taxon mappings from ./genomes/assembly_summary.txt
Processing reference sequences.
Added 10919768 reference sequences in 7244.87 s
targets 10919768
ranked targets 10411403
taxa in tree 184936
------------------------------------------------
buckets 964032481
bucket size max: 254 mean: 55.5022 +/- 68.7783 <> 1.65723
features 531291666
dead features 0
locations 29487841493
------------------------------------------------
All targets are ranked.
Writing database to file ... Writing database metadata to file 'gtdb_r214_db_ext.meta' ... done.
Writing database part to file 'gtdb_r214_db_ext.cache0' ... done.
done.
Total build time: 9440.64 s
Hi,
Thanks again for your work on MetaCache. I have noticed something a bit strange with the new version (MetaCache 2.0): database loading takes much longer than I would expect, without any signs of I/O contention in iotop. The problem seems especially bad once loading exceeds 90%: going from 0 to 90% accounts for only ~20% of the load time, while the remaining 10% of the database takes the rest. I am basing these percentages on the stderr log; see attached for a representative log.
The actual analysis seems fast once it gets going. As it stands, I think I am spending about 75% of the time per sample just reading the database, but I am not sure why. Do you have any suggestions on how to improve sample throughput? I know completion percentages are usually off, but I'm not sure how to debug further.
Hi,
I'm building a custom DB from a large set of genome files. I'm indicating the TaxonId of each sequence using NCBI-style accession2taxid tab-separated files. When I build the DB, it appears some sequences are not being given a rank. Specifically, the output of metacache build reports "262383 targets remain unranked". Does this mean that MetaCache identified and placed sequences in the DB that do not have a taxonomic assignment (i.e. an associated TaxonId)? Is there any way to determine which sequences remain unranked?
Thanks,
Donovan
>metacache build gtdb_r207_db ./gtdb_reps/genomes/ -taxonomy ./gtdb_taxonomy/R207 -reset-taxa -taxpostmap accn_to_taxid.tsv
Building new database 'gtdb_r207_db' from reference sequences.
Max locations per feature set to 254
Reading taxon names ... done.
Reading taxonomic node mergers ... done.
Reading taxonomic tree ... 401816 taxa read.
Taxonomy applied to database.
------------------------------------------------
MetaCache version 2.2.3 (20220708)
database version 20200820
------------------------------------------------
sequence type mc::char_sequence
target id type unsigned int 32 bits
target limit 4294967295
------------------------------------------------
window id type unsigned int 32 bits
window limit 4294967295
window length 127
window stride 112
------------------------------------------------
sketcher type mc::single_function_unique_min_hasher<unsigned int, mc::same_size_hash<unsigned int> >
feature type unsigned int 32 bits
feature hash mc::same_size_hash<unsigned int>
kmer size 16
kmer limit 16
sketch size 16
------------------------------------------------
bucket size type unsigned char 8 bits
max. locations 254
location limit 254
------------------------------------------------
Processing reference sequences.
Added 8302685 reference sequences in 14348.7 s
targets 8302685
ranked targets 0
taxa in tree 401816
------------------------------------------------
buckets 964032481
bucket size max: 254 mean: 46.7784 +/- 61.2667 <> 2.0026
features 518555067
dead features 0
locations 24257165919
------------------------------------------------
8302685 targets are unranked.
Try to map sequences to taxa using 'accn_to_taxid.tsv' (381 MB)
262383 targets remain unranked.
Writing database to file ... Writing database metadata to file 'gtdb_r207_db.meta' ... done.
Writing database part to file 'gtdb_r207_db.cache0' ... done.
done.
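Regarding the last question above, one way to see which sequences remained unranked is to compare the accessions from the FASTA headers against those present in the accession2taxid file. A rough sketch (function names are mine; it assumes the NCBI-style four-column layout accession, accession.version, taxid, gi with a header line, so adapt the column index to your file):

```python
# Hypothetical check: which FASTA accessions have no taxid mapping?
# Assumes NCBI-style accession2taxid columns:
#   accession <TAB> accession.version <TAB> taxid <TAB> gi
# with a header line; adjust the column index if your file differs.

def mapped_accessions(taxid_lines):
    mapped = set()
    for i, line in enumerate(taxid_lines):
        if i == 0 and line.startswith("accession"):
            continue  # skip header line
        mapped.add(line.rstrip("\n").split("\t")[1])  # accession.version column
    return mapped

def unranked(fasta_accessions, taxid_lines):
    # accessions seen in the FASTA headers but absent from the mapping
    return sorted(set(fasta_accessions) - mapped_accessions(taxid_lines))
```

Running this over the headers of the input genomes should narrow the 262383 unranked targets down to a concrete accession list.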
Thanks for this piece of work. I am very interested in using the hashing method in TGS read analysis and wonder how I could use this as a CMake module, etc. Are you planning to support this? Any direction on this would be highly appreciated.
Example use case: hash a million sequences (using LSH) and compare them with another set of sequences, then estimate similarity, etc.
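For the similarity-estimation part of that use case, the general idea can be shown with a minimal bottom-s MinHash sketch in Python. This is explicitly not MetaCache's internal API, just a self-contained toy illustrating the technique: sketch each sequence's k-mer set with the s smallest hash values and estimate Jaccard similarity from the overlap of the two sketches.

```python
# Toy MinHash (bottom-s) sketch for Jaccard similarity estimation.
# NOT MetaCache's API -- a standalone illustration of the technique.
import hashlib

def kmers(seq, k=16):
    """All distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def sketch(seq, k=16, s=32):
    """Keep the s smallest 64-bit k-mer hashes as the sketch."""
    hashes = sorted(
        int.from_bytes(hashlib.blake2b(km.encode(), digest_size=8).digest(), "big")
        for km in kmers(seq, k)
    )
    return set(hashes[:s])

def similarity(a, b, k=16, s=32):
    """Estimate Jaccard similarity from the bottom-s sketch of the union."""
    sa, sb = sketch(a, k, s), sketch(b, k, s)
    merged = sorted(sa | sb)[:s]
    shared = sum(1 for h in merged if h in sa and h in sb)
    return shared / len(merged)
```

With sketches this small, comparing a million sequences reduces to set operations on a few dozen integers per sequence, which is what makes the approach scale.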
Hello,
I just need a little clarification. Regarding the custom database documentation: would I be able to use my own taxonomy and IDs rather than NCBI taxonomy, provided I supply all the necessary files? I have my own custom database and would like to use my own taxonomy rather than NCBI taxonomy and taxids.
Thank you.
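If I understand the build logs above correctly ("Reading taxon names ... Reading taxonomic tree ..."), the -taxonomy option consumes NCBI-format dump files, so a custom taxonomy can in principle be expressed by writing your own names.dmp and nodes.dmp in that layout. A minimal hand-rolled example (the taxids and names below are made up, fields are separated by tab-pipe-tab, and the real NCBI files carry more columns than shown):

```
# names.dmp -- tax_id | name | unique name | name class
1	|	root	|		|	scientific name	|
2	|	MyGenusA	|		|	scientific name	|
3	|	MyGenusA speciesX	|		|	scientific name	|

# nodes.dmp -- tax_id | parent tax_id | rank | ...
1	|	1	|	no rank	|
2	|	1	|	genus	|
3	|	2	|	species	|
```

The sequence-to-taxon mapping would then reference these custom taxids instead of NCBI ones; treat this as a sketch to verify against the custom-database documentation rather than a guaranteed recipe.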