derrickwood / kraken
Kraken taxonomic sequence classification system
Home Page: http://ccb.jhu.edu/software/kraken/
License: GNU General Public License v3.0
When building a bacterial genome database, do you include the plasmids at the same place in the taxonomic tree or only include them in plasmid builds?
kraken --paired --fastq-input --gzip-compressed -db /db/kraken --threads 1 --out --classified-out kraken_out file.R1.gz file.R2.gz
kraken: --paired requires exactly two filenames
OK, now I added a comma between the two filenames as follows:
kraken --paired --fastq-input --gzip-compressed -db /db/kraken --threads 1 --out --classified-out kraken_out file.R1.gz,file.R2.gz
read_merger.pl: kraken_out does not exist
classify: malformed fasta file - expected header char > not found
0 sequences (0.00 Mbp) processed in 0.000s (0.0 Kseq/m, 0.00 Mbp/m).
0 sequences classified (-nan%)
0 sequences unclassified (-nan%)
Kraken is running very slowly after updating to macOS Sierra (from 10.9). Even with 14 threads, 'classify' uses only 2% of CPU resources when running on fastq files. It happens when using either RAM or a ramdisk. I also tried re-compiling kraken after the update; no difference. Has anyone else had this problem?
what should I do about the error?
Kraken build set to minimize disk writes.
Creating k-mer set (step 1 of 6)...
Found jellyfish v1.1.10
Hash size not specified, using '9096'
K-mer set created. [0.037s]
Skipping step 2, no database reduction requested.
Sorting k-mer set (step 3 of 6)...
K-mer set sorted. [1m46.611s]
Creating GI number to seqID map (step 4 of 6)...
GI number to seqID map created. [0.008s]
Creating seqID to taxID map (step 5 of 6)...
5 sequences mapped to taxa. [0.005s]
Setting LCAs in database (step 6 of 6)...
set_lcas: error opening taxonomy/nodes.dmp: No such file or directory
Derrick,
Firstly, I just want to say that Kraken is a really nice piece of software, with good documentation and command line interface. Thank you!
I have an enhancement request. The typical use case is:
% kraken --db DB > kraken.out
OR
% kraken --db DB --outfile kraken.out
THEN
% kraken-report --db DB kraken.out > kraken.report
I was thinking an option like this:
% kraken --db DB --outfile kraken.out --report kraken.report --mpa-report kraken.mpa
That way the user doesn't have to re-specify the --db parameter, and the end result is available in one command.
(Maybe a --filter option as well in case that makes sense wrt kraken-filter)
Torsten
The URL for NCBI's FTP site appears to have changed, causing the command kraken-build --download-taxonomy --db name to fail.
Line 27 of download_taxonomy.sh
currently reads:
NCBI_SERVER="ftp.ncbi.nih.gov"
Changing it to:
NCBI_SERVER="ftp.ncbi.nlm.nih.gov"
and rebuilding Kraken resolved the issue.
Thanks,
Adam
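For anyone applying Adam's fix by hand, the edit can be scripted. A hedged sketch, run here against a throwaway stand-in file rather than the real scripts/download_taxonomy.sh (adjust the path to your own Kraken checkout):

```shell
# Demo: patch the NCBI_SERVER line; the temp file stands in for line 27
# of download_taxonomy.sh in a real install.
tmp=$(mktemp)
printf 'NCBI_SERVER="ftp.ncbi.nih.gov"\n' > "$tmp"
sed 's/ftp\.ncbi\.nih\.gov/ftp.ncbi.nlm.nih.gov/' "$tmp"   # prints the patched line
rm -f "$tmp"
```

On the real script you would use sed's in-place mode (or edit by hand) and then re-run the failing kraken-build command.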
Dear Dr. Wood,
It's impossible to build a new bacteria database using "kraken-build" because the genome sequences for RefSeq bacteria have moved on the FTP site, as you can see in the NCBI README file: ftp://ftp.ncbi.nlm.nih.gov/genomes/README.txt
Have you planned to update Kraken with the new path of the bacteria genomes?
Best regards,
Nicolas
Hi,
I'm getting errors on version 0.10.5b when I prepare a fasta file as described in the manual and run kraken-build. First, kraken-build complains about a missing taxonomy/gi_taxid_nucl.dmp, an 8 GiB file which is not needed in this case. When I create it as an empty file, there is an mmap error.
Without gi_taxid_nucl.dmp:
kraken-build --db mydb --threads 20 --build
Kraken build set to minimize disk writes.
Creating k-mer set (step 1 of 6)...
Found jellyfish v1.1.11
Hash size not specified, using '81910966608'
K-mer set created. [2h4m52.942s]
Skipping step 2, no database reduction requested.
Sorting k-mer set (step 3 of 6)...
K-mer set sorted. [11h41m59.065s]
Creating GI number to seqID map (step 4 of 6)...
GI number to seqID map created. [4m29.637s]
Creating seqID to taxID map (step 5 of 6)...
make_seqid_to_taxid_map: unable to open taxonomy/gi_taxid_nucl.dmp: No such file or directory
With empty gi_taxid_nucl.dmp:
kraken-build --db mydb --threads 20 --build
Kraken build set to minimize disk writes.
Skipping step 1, k-mer set already exists.
Skipping step 2, no database reduction requested.
Skipping step 3, k-mer set already sorted.
Skipping step 4, GI number to seqID map already complete.
Creating seqID to taxID map (step 5 of 6)...
make_seqid_to_taxid_map: unable to mmap taxonomy/gi_taxid_nucl.dmp: Invalid argument
This could be handled more gracefully.
Dear Derrick,
Thanks for developing kraken pipeline for the scientific community.
I have couple of queries regarding the kraken database.
Cheers,
Ram
After building a custom database from a single multi-fasta file which I constructed using the described syntax with kraken:taxid header entries, kraken-build (v0.10.5b) fails in step 6:
kraken-build --db mydb --build
Kraken build set to minimize disk writes.
Skipping step 1, k-mer set already exists.
Skipping step 2, no database reduction requested.
Skipping step 3, k-mer set already sorted.
Skipping step 4, GI number to seqID map already complete.
Skipping step 5, seqID to taxID map already complete.
Setting LCAs in database (step 6 of 6)...
Processed 108029 sequences
xargs: process cat was aborted by signal 13.
/net/programs/Debian-7-x86_64/kraken-0.10.5b/bin/build_kraken_db.sh: line 197: 13939 Done find library/ '(' -name '*.fna' -o -name '*.fa' -o -name '*.ffn' ')' -print0
13940 Exit 125 | xargs -0 cat
13941 Segmentation fault | set_lcas $MEMFLAG -x -d database.kdb -i database.idx -n taxonomy/nodes.dmp -t $KRAKEN_THREAD_CT -m seqid2taxid.map -F /dev/fd/0
I isolated the respective call and ran set_lcas manually which lead to a simple segfault:
set_lcas -x -d database.kdb -i database.idx -n taxonomy/nodes.dmp -t 1 -m seqid2taxid.map -F library/added/1Y0AnEaxod.fna
Processed 108029 sequences
Segmentation fault
This refers to the same run that is described in issue #34. FASTA headers look like:
grep '^>' library/added/1Y0AnEaxod.fna | head
>NZ_AQYU01000016.1|kraken:taxid|35400
>NZ_AQYU01000015.1|kraken:taxid|35400
>NZ_AQYU01000017.1|kraken:taxid|35400
>NZ_AQYU01000014.1|kraken:taxid|35400
>NZ_AQYU01000018.1|kraken:taxid|35400
>NZ_AQYU01000011.1|kraken:taxid|35400
>NZ_AQYU01000010.1|kraken:taxid|35400
>NZ_AQYU01000009.1|kraken:taxid|35400
>NZ_AQYU01000012.1|kraken:taxid|35400
>NZ_AQYU01000008.1|kraken:taxid|35400
The taxid is mostly on the species level and I downloaded the latest gi_taxid_nucl.dmp from the NCBI ftp site to be able to run kraken-build on the custom dataset.
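As a hedged illustration of the header convention shown above, this sketch stamps a fixed taxid onto every header of a demo FASTA file (taxid 35400 and the file name are just examples taken from this issue; real data would need per-sequence taxids):

```shell
# Tag each FASTA header with a kraken:taxid field, keeping only the
# accession (first whitespace-delimited token) before the tag.
cat > demo.fna <<'EOF'
>NZ_AQYU01000016.1 some description
ACGT
EOF
awk -v taxid=35400 '/^>/ { $0 = $1 "|kraken:taxid|" taxid } { print }' demo.fna
rm -f demo.fna
```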
Hi Derrick,
./kraken-build --threads 20 --build --jellyfish-hash-size 2400M --db kraken_bv_072516/
working on a server with 128GB of RAM and 3 TB of disk space..
database.jdb.tmp file is around ~400gb and still I end up getting
Kraken build set to minimize RAM usage.
Creating k-mer set (step 1 of 6)...
Found jellyfish v1.1.11
K-mer set created. [6h45m35.859s]
Skipping step 2, no database reduction requested.
Sorting k-mer set (step 3 of 6)...
db_sort: unable to mmap database.jdb: Cannot allocate memory
I do not want to compromise the sensitivity of the results, which is why I am avoiding --max-db-size; I also tried the --work-on-disk flag but ended up with a similar result.
Thanks.
Best
Sid
Hi,
Would it be possible to use Kraken on PacBio FASTQ reads to remove contaminations?
Thank you in advance.
Michal
We are attempting to build a large kraken database out of complete and draft bacterial genomes and have run into file system issues when all the data are loaded into the kraken database as separate individual FASTA files. The current NCBI bacterial assembly folder has over 1 million .fna files; this clobbers most file systems when all the files are dumped into the same directory.
I attempted to load a concatenated multi-record file as a test. kraken-build splits the data up, then treats the concatenated record as an RNA (.ffn) record.
I can probably hack in something to make kraken-build accept multi-record FASTA, but just curious: is there a specific reason why multi-record FASTA isn't supported or is problematic? I couldn't find anything on the mailing list.
In Kraken version 0.10.5b, when the fasta sequences and the corresponding gi2seqid.map are large (in my case ~150k sequences), the binary make_seqid_to_taxid_map fails with std::size_error or similar. I suspect this is due to the integer size type used for the container?
I simply replaced this step with an awk snippet followed by GNU sort -k1,1, which seems to produce identical output.
Regards, Johannes
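The exact awk snippet isn't shown in the issue, so the following is only a hedged sketch of the idea it describes: join the GI→seqID map with NCBI's GI→taxID dump on the GI column and keep seqID and taxID (file names, layouts, and values below are fabricated for the demo):

```shell
# Fabricated two-column inputs standing in for gi2seqid.map and
# gi_taxid_nucl.dmp; real files are much larger.
printf '101 seqA\n102 seqB\n' > gi2seqid.map
printf '101 1280\n102 562\n'  > gi_taxid_nucl.dmp
sort -k1,1 gi2seqid.map      > a.sorted
sort -k1,1 gi_taxid_nucl.dmp > b.sorted
# Join on GI (column 1), then emit "seqID<TAB>taxID".
join a.sorted b.sorted | awk '{ print $2 "\t" $3 }'
rm -f gi2seqid.map gi_taxid_nucl.dmp a.sorted b.sorted
```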
NCBI is discontinuing support for GI numbers in favor of accession numbers.
There is currently a database of GI -> TaxID available from NCBI's ftp site so moving kraken from GIs to accessions should be easy.
(related to #39)
As it takes a while to download the DB files, it'd be very useful if Kraken-build could check for Jellyfish at the start of the build process.
Hi, first I'd like to say that Kraken is a really well-written program!
I found that kraken (the classification part) does not succeed on systems where the amount of main memory (+ swap) is smaller than the index (database.kdb). However, I believe this should be possible via memory mapping, in particular in this case because the data needs to only be read by the program which allows the OS to do efficient swapping. While it should be technically possible, I cannot make any comment about whether it would be efficient.
Issue: In the file quickfile.cpp, you correctly use the parameters PROT_READ and MAP_SHARED to trigger this kind of reading in read-only mode. However, it seems the database file is always opened in read-write mode, and I don't know why. IMO the correct way would be to use the read-only flags and warn the user if this results in inefficient memory access behavior, or to require a parameter like '--force-memory-overcommit' in the classification program.
Cheers, Johannes
Specifying the full --db path and --threads each time is a bit tedious. What would you think about having (optional) environment variables for these defaults:
KRAKEN_DB_DEFAULT=/usr/local/share/kraken/db/mini_kraken
KRAKEN_THREADS_DEFAULT=8
Yes, I could write a wrapper script for these for the typical user. If I get time I could generate a pull request, but until then this Issue report will remind me.
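A minimal sketch of the wrapper-script idea, assuming the hypothetical variable names proposed above; the kraken invocation is only echoed here, and the `${VAR:-default}` form lets explicit flags or exported variables override the built-in fallbacks:

```shell
# Hypothetical defaults: both names are from this feature request, not
# from Kraken itself.
DB="${KRAKEN_DB_DEFAULT:-/usr/local/share/kraken/db/mini_kraken}"
THREADS="${KRAKEN_THREADS_DEFAULT:-8}"
echo "would run: kraken --db $DB --threads $THREADS"
```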
Q1. Is it possible to get an output field in kraken-report that summarizes the number of distinct non-overlapping k-mers (or overlapping, if the former calculation is complex) that were directly assigned to a particular taxonomic node? This could give an idea of the total coverage seen for a particular taxonomic node. While we are at it, for each taxonomic node, how about also computing the ratio of observed k-mers assigned to the node to the total number of k-mers assigned to that node?
Q2. Is there a good strategy to get rid of low-complexity k-mers? They are especially troublesome when analyzing meta-transcriptomic data.
Hi,
great job with kraken!
I just want to mention that your install script returns 1 on a successful installation.
It is because of these lines:
for file in $KRAKEN_DIR/kraken*
do
[ -x "$file" ] && echo " $file"
done
One of the files you are checking is not executable.
best,
Peter
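A sketch reproducing the exit-status bug Peter describes: the `[ -x ]` test is the last command the script executes, so if the final matched file is not executable, the whole script returns 1. Ending the script with an unconditional success (`true` or `exit 0`) is one possible fix. Directory and file names below are fabricated for the demo:

```shell
# Fabricated stand-in for $KRAKEN_DIR with one non-executable kraken* file.
mkdir -p demo_dir && touch demo_dir/kraken-notexec
for file in demo_dir/kraken*; do
  [ -x "$file" ] && echo "  $file"   # prints nothing; test fails on last file
done
rm -rf demo_dir
true   # hypothetical fix: guarantee a zero exit status for the script
```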
Hello everybody,
I assembled genomes from known species with paired reads.
Then, I used kraken to confirm the species of my genomes.
Generally, using kraken on the assembled genomes gives good predictions.
However, using kraken on the reads gives less accurate predictions.
For example, from my kraken output I have this line:
C id 288681 201 86661:3 288681:31 86661:36 A:31 86661:23 0:47
I don't understand why, for this read, the assigned taxon is 288681 when the majority of k-mers are assigned to 86661. Has anyone had the same issue?
Thanks !
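A sketch of totaling the per-taxon hits in the k-mer column above ('A' marks ambiguous and '0' unclassified k-mers, skipped here). Note that Kraken classifies by the highest-weighted root-to-leaf path rather than by simple majority; so if 288681 is a descendant of 86661 (an assumption here, not verified against the taxonomy), its path also inherits the 62 hits at 86661 (93 total) and can outweigh 86661 alone:

```shell
# Sum direct k-mer hits per taxon from a kraken output k-mer column.
echo "86661:3 288681:31 86661:36 A:31 86661:23 0:47" | tr ' ' '\n' |
awk -F: '$1 != "A" && $1 != "0" { n[$1] += $2 }
         END { for (t in n) print t, n[t] }' | sort -k1,1n
```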
I recently downloaded all genomes available at NCBI in hopes of making a database as comprehensive as possible for determining taxonomy in a largely unknown eukaryote/prokaryote metagenomic dataset. However, it appears that there are currently no GI numbers assigned to the files downloaded from NCBI, and the headers look like this:
NC_008801.1 Monodelphis domestica chromosome 1, MonDom5, whole genome shotgun sequence
AtcctcccccccaccaccaccccagcATGCAGGCCGCCACCATCTTATCCACCAGGCCGCCCCGGTGCGTGGC
rather than
gi|701219395|ref|NC_025403.1| Achimota virus 1, complete genome
ACCAGAGGGAAAATATAACAATGTCGTTTTATAGCGATGTAAATAATACTTATGTAGGCCCGAAAGTGC
I noticed in a previous issue that you mentioned you are working on a better solution that will allow inclusion of sequences that lack GI numbers but have only accession numbers, so hopefully that can help fix this issue down the road. However, I was wondering whether you have any ideas for a workaround at this point in time that would be easier than individually assigning taxonomy for the 60,000+ genomes I am trying to make into a database. Any thoughts are greatly appreciated.
Best,
Eric
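One hedged workaround sketch for Eric's situation: rewrite accession-style headers into kraken:taxid form from an accession→taxid lookup table. The demo fabricates a two-column table; NCBI's real accession2taxid files have more columns and would need to be cut down first, and the taxid shown is illustrative:

```shell
# Fabricated lookup table and genome file for the demo.
printf 'NC_008801.1\t13616\n' > acc2taxid.tsv
cat > genome.fna <<'EOF'
>NC_008801.1 Monodelphis domestica chromosome 1
ACGTACGT
EOF
# First pass loads the table; second pass rewrites matching headers.
awk 'NR==FNR { tax[$1]=$2; next }
     /^>/    { acc=substr($1,2); if (acc in tax) $0 = ">" acc "|kraken:taxid|" tax[acc] }
     { print }' acc2taxid.tsv genome.fna
rm -f acc2taxid.tsv genome.fna
```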
Hello,
I'm trying to build the standard kraken database. I gave the job 140GB of memory and ran it on 16 threads.
I get this error after kraken has downloaded the gi to taxid mapping file and the taxonomy dump from NCBI.
Found jellyfish v1.1.11
--2016-06-23 16:32:01-- ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz
=> “gi_taxid_nucl.dmp.gz.2”
Resolving ftp.ncbi.nih.gov... 130.14.250.10, 2607:f220:41e:250::12
Connecting to ftp.ncbi.nih.gov|130.14.250.10|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /pub/taxonomy ... done.
==> SIZE gi_taxid_nucl.dmp.gz ... 1405001825
==> PASV ... done. ==> RETR gi_taxid_nucl.dmp.gz ... done.
Length: 1405001825 (1.3G) (unauthoritative)
100%[=================================================================>] 1,405,001,825 7.42M/s in 4m 42s
2016-06-23 16:36:45 (4.74 MB/s) - “gi_taxid_nucl.dmp.gz.2” saved [1405001825]
Downloaded GI to taxon map
--2016-06-23 16:36:45-- ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
=> “taxdump.tar.gz”
Resolving ftp.ncbi.nih.gov... 130.14.250.12, 2607:f220:41e:250::12
Connecting to ftp.ncbi.nih.gov|130.14.250.12|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /pub/taxonomy ... done.
==> SIZE taxdump.tar.gz ... 36397155
==> PASV ... done. ==> RETR taxdump.tar.gz ... done.
Length: 36397155 (35M) (unauthoritative)
100%[===================================================================>] 36,397,155 10.2M/s in 5.2s
2016-06-23 16:36:51 (6.62 MB/s) - “taxdump.tar.gz” saved [36397155]
Downloaded taxonomy tree data
gzip: gi_taxid_nucl.dmp.gz: unexpected end of file
Is there something I'm missing?
Thanks.
This could be replaced with ls + awk:
stat -c %s database.idx
8589934608
ls -n database.idx | awk '{print $5}'
8589934608
Hi Derrick,
I was testing Kraken on some sequences when I noted that Kraken completely misses some sequences which have a 100% identity match in the database.
Version: zip file master branch yesterday (18.05.2015)
Database building: standard (as shown in the manual), full bacteria, no database shrinkage
Example:
>from_BacSu_Natto_195
CTTGAATGGATGCAGCAACTTGAGTGTCAGTAGACGTAACATCAATGTCGCAAGAATCTC
TAACGATAATCATTTCTTCAGATGTTTGTTTGTTAAGGCTTAGCTGATCAAAATCTTGAA
ATGCGTTAGCATCTTCTCCTCCAGAACAAACGTCATGACAAACCGCTTCTCTTTTATAGC
CGCTACCCGGGTGTGTACATGAATTATCTAGGGCAACCCATGAGTAAGGTCTTGATTCCA
TTTGTTTTCCTCCTTTCGTTTACTTTAGTAGATGATGACAGTTCATATTTTGTATAAGCA
AATATCATGGTTTTAGATACCTATTTTAAATATTATCGTTTTTTCTTTGACGTACGCGAT
CTCTCAGTGTTGTTCGACGTCTTTTTTGCGGCGCTGCTTCTTGTTCTGCTTTTTCAGCAT
CTGCCTTTTTCTTATTGACTCTTTCAACAAATTCGTTAGCCTGTTTTTGAAGGTCATTTA
CCATTGTAATAATCGCTTGTTTTTGAGCTTCTGATAAGTTGCGTTTTTTATCTTCTGTTA
CATTGTTTTTGTTCAGTACATCTGAAACAAGAAGTTTCGTTAGGGGGTCGATATTTGCTT
CTAGTTCTTCTTTGTTTTTTTGCAGTGCATCTTTTTCTTCGGACATGTTATTCACCTCGA
CTTCTATAAAATTAGAAAGAAAGGGCTACAGAATGTCAGCTTCTGTTATAAGAGCGATTA
ATGTTTGGGTTAAGAGCTGAACGGATGCCATGATGTCTGTAGCTGATAAAGTGATTGTGA
TATTATAGCAGTTTATAATTTTGATTTTTACTTTCTTTCGGCTATGTGCTGTAGAGCGTG
CTATCAGATCACTCGCAACAATGTCTGCTATATCGCTATCGCTGATAAGAAGCTGGGTTG
TAACAGTGATTAAAGCTGAAACTGCTGATTGCAATGAAAGTGCAAATGTGGTGTCTGATT
to which kraken says
$ kraken --db /scratch1/tmp/bachtmp/krakentest/bacteria bug.fasta >kraoutbug.txt
1 sequences (0.00 Mbp) processed in 0.001s (76.1 Kseq/m, 73.10 Mbp/m).
0 sequences classified (0.00%)
1 sequences unclassified (100.00%)
whereas performing a simple fasta36 search confirms that the data should be in the database:
# fasta36 bug.fasta ../krakentest/bacteria/library/Bacteria/Bacillus_subtilis_natto_BEST195_uid183001/NC_017196.fna
FASTA searches a protein or DNA sequence data bank
version 36.3.4 May, 2011
Please cite:
W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448
Query: bug.fasta
1>>>from_BacSu_Natto_195 - 960 nt
Library: ../krakentest/bacteria/library/Bacteria/Bacillus_subtilis_natto_BEST195_uid183001/NC_017196.fna
4091591 residues in 1 sequences
The best scores are: opt bits E(28)
gi|428277412|ref|NC_017196.1| Bacillus subtili (4091591) [f] 4800 523.7 8.6e-149
...
>>gi|428277412|ref|NC_017196.1| Bacillus subtilis subsp. (4091591 nt)
initn: 4800 init1: 4800 opt: 4800 Z-score: 2729.5 bits: 523.7 E(28): 8.6e-149
banded Smith-Waterman score: 4800; 100.0% identity (100.0% similar) in 960 nt overlap (1-960:1240841-1241800)
10 20 30
from_B CTTGAATGGATGCAGCAACTTGAGTGTCAG
::::::::::::::::::::::::::::::
gi|428 CAATTGTAATGACAGCAGTTTGGAGGGCGGCTTGAATGGATGCAGCAACTTGAGTGTCAG
1240820 1240830 1240840 1240850 1240860 1240870
40 50 60 70 80 90
from_B TAGACGTAACATCAATGTCGCAAGAATCTCTAACGATAATCATTTCTTCAGATGTTTGTT
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
gi|428 TAGACGTAACATCAATGTCGCAAGAATCTCTAACGATAATCATTTCTTCAGATGTTTGTT
1240880 1240890 1240900 1240910 1240920 1240930
I first suspected that this is a bug with organisms containing multiple FASTA files (Natto has 2 .fna files), but a quick check with another bacterium having 6 .fna files found the sequence in all 6 test cases.
Any idea what's causing this bug?
Best,
Bastien
Dear Derrick,
I have installed jellyfish version1.1.11. to build the custom database with fungi, bacteria, and viruses. I followed the instructions in adding libraries to the library folder.
While at the step of building the custom database from the library, I executed the command below and got the following messages in the terminal. I left it running for more than 10 hours, and I still see only the same message: "Hash size not specified, using '12654893177'".
Is everything fine, or am I missing something in building the custom database?
wenchenaafc@wenchenaafc:~/metAMOS-1.5rc3/kraken_custom$ kraken-build --build --db customDB --threads 1 --kmer-len 31 --minimizer-len 15
Kraken build set to minimize disk writes.
Creating k-mer set (step 1 of 6)...
Found jellyfish v1.1.11
Hash size not specified, using '12654893177'
Hi,
I want to analyze a set of reads with kraken.
After the trimming step of my paired-end data, a part of the reads lost their mates (mate was too short after quality trimming). I obtain a fastq file with the right mate, a fastq file with the left mate and a fastq file with singletons.
Please, could you tell me if it is possible to include all the reads (paired and singleton) in the kraken analysis? Should I concatenate the files containing the right/left paired reads with the file containing unpaired mates, or should I run kraken separately on the paired and unpaired reads?
Thank you in advance for your reply,
In its current form, does Kraken accept BAM as input? This would enable already-mapped reads to be classified. Some places (including our center) prefer BAM for holding unmapped reads as well, due to the flexibility of the format compared to FASTQ, as described [here](http://blastedbio.blogspot.com/2011/10/fastq-must-die-long-live-sambam.html). This is obviously just an enhancement but might enable more users.
Hi Derrick,
Can you please help with the following issue:
bhaley@NextSeq-Server:/$ ./kraken-build --standard --db //kraken_DB
Found jellyfish v1.1.11
--2016-03-08 17:40:58-- ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.fna.tar.gz
=> ‘all.fna.tar.gz’
Resolving ftp.ncbi.nih.gov (ftp.ncbi.nih.gov)... 130.14.250.11, 2607:f220:41e:250::10
Connecting to ftp.ncbi.nih.gov (ftp.ncbi.nih.gov)|130.14.250.11|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /genomes/Bacteria ...
No such directory ‘genomes/Bacteria’.
Hi there. Your current master code looks fairly stable and has not seen many updates lately, whereas 0.10.5-beta (the last tagged version) has a build issue. Looking at the commit history, this seems to have been fixed very quickly. However, since there is no newer tag, I am stuck using master, which I would like to avoid in my production setup. If there are no reasons against it, could you please release and tag a new version?
Thanks a lot for a great tool!
Florian
H_sapiens library folder and size.
0B ./CHR_01
0B ./CHR_02
0B ./CHR_03
0B ./CHR_04
0B ./CHR_05
0B ./CHR_06
0B ./CHR_07
0B ./CHR_08
0B ./CHR_09
0B ./CHR_10
0B ./CHR_11
0B ./CHR_12
0B ./CHR_13
0B ./CHR_14
0B ./CHR_15
0B ./CHR_16
0B ./CHR_17
0B ./CHR_18
0B ./CHR_19
0B ./CHR_20
0B ./CHR_21
0B ./CHR_22
0B ./CHR_MT
0B ./CHR_Un
0B ./CHR_X
0B ./CHR_Y
kraken and jellyfish are installed to /kraken_install and /jellyfish-1.1.11 and all directories containing relevant executables have been added successfully to .bashrc
[ans74@hpc~]$ kraken-build
Must select a task option.
Usage: kraken-build [task option] [options]
Task options (exactly one must be selected):
--download-taxonomy Download NCBI taxonomic information
[ans74@hpc~]$ jellyfish
Too few arguments
Usage: jellyfish <cmd> [options] arg...
Where <cmd> is one of: count, stats, histo, dump, merge, query, cite, qhisto, qdump, qmerge, jf.
Options:
--version Display version
running my script results in:
[ans74@hpc ~]$ cat kraken_3*
/hpc/ans74/kraken_install/check_for_jellyfish.sh: line 27: jellyfish: command not found
however, running (not submitting) check_for_jellyfish.sh works!
[ans74@hpc ~]$ ./kraken_install/check_for_jellyfish.sh
Found jellyfish v1.1.11
Also, manually running the command I am trying to submit to the HPC works:
[ans74@hpc ~]$ kraken-build --standard --db /hpc/scratch/ans74/kraken_database/
Found jellyfish v1.1.11
--2017-03-20 15:04:16-- ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz
=> “gi_taxid_nucl.dmp.gz”
Resolving ftp.ncbi.nih.gov... 130.14.250.12, 2607:f220:41e:250::13
Connecting to ftp.ncbi.nih.gov|130.14.250.12|:21... connected.
Logging in as anonymous ... ^C
Is this somehow related to how kraken is written?
Does it make any sense that the executables are mapped in .bashrc and work from there, but the script, when submitted, isn't able to detect jellyfish from the same directory?
Hello,
I would like to know if it's possible to achieve strain-level sensitivity for Bacillus with Kraken.
Thanks!
Dear Derrick,
Query regarding Annotation:
My metagenomics forward and reverse fastq files have 20 million reads. After removing plant-like reads from my input fastq files using the fastq_screen pipeline, I had 4 million reads. I then provided this fastq file (4 million reads) as input to the metAMOS pipeline. The FCP option annotated those reads, but neither the custom kraken databases nor minikraken annotated them as expected. What could be the reason?
But for the initial fastq files (with 20 million reads), kraken custom DB based on nt database annotated correctly.
I tried four different databases with metAMOS pipeline.
Using minikraken database (DB size 4.5GB), for these 4 million reads I received an output with no hits in annotation.
Using custom kraken database (Bacterial, Viral, Archaeal, Fungal) (DB size 105GB), for these 4 million reads.
Using custom kraken database (nt database from ncbi) (DB size - 604GB), for these 4 million reads.
I was pointed to the 16s branch by someone at a conference.
Is it functional?
Any caveats?
I tried to build the database, but it does not work with the latest Jellyfish 1 release:
Found jellyfish v1.1.11
Kraken requires jellyfish version 1
Is the checking script making a mistake, or which version should I use instead?
My work uses a proxy server for Internet access. Kraken uses wget to download files from NCBI, for which we needed to set the ftp_proxy variable - should this be added to the documentation?
Derrick,
I notice that my pure bacterial samples often still get, say, 5% of reads unclassified. When I assemble these reads, they turn out to be bacterial plasmids.
The problem is that some of the Bacteria folders have chromosomes and plasmids, but there are also many separately submitted plasmids which are in a different Plasmids folder at NCBI:
ftp://ftp.ncbi.nih.gov/genomes/Plasmids/
It would be great to add support for this in the download tools, and in MiniKraken.
kraken-build --standard --db $DBNAME
Only Bacteria and Viruses are built; plasmids and human are not downloaded.
Not sure if this is a bug or the standard behavior.
kraken-build --download-library plasmids --db testdb
--2016-07-08 15:06:33-- ftp://ftp.ncbi.nlm.nih.gov/genomes/Plasmids/plasmids.all.fna.tar.gz
=> ‘plasmids.all.fna.tar.gz’
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.11, 2607:f220:41e:250::10
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.11|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /genomes/Plasmids ...
No such directory ‘genomes/Plasmids’.
Found jellyfish v1.1.11
tar: citations.dmp: Cannot change ownership to uid 9019, gid 583: Invalid argument
tar: Skipping to next header
tar: Archive contains "\tn\tuncultur" where numeric off_t value expected
tar: Archive contains "n\t11\tn\t1" where numeric mode_t value expected
tar: Archive contains "don\n350657\tn" where numeric time_t value expected
tar: Archive contains "\tn\t0\tn\t" where numeric uid_t value expected
tar: Archive contains "\tn\t1\tn\t" where numeric gid_t value expected
tar: pecies\tn\tUAon\t11\tn\t1\tn\t11\tn\t1\tn\t0\tn\t1\tn\t1\tn\t0\tn\tunculturedon\n350656\tn\t75661\tn\tspecies\tn\tUAon\t11\tn\t1\t: Unknown file type ' ', extracted as normal file
tar: pecies n UAon 11 n 1 n 11 n 1 n 0 n 1 n 1 n 0 n unculturedon
350656 n 75661 n species n UAon 11 n 1 : implausibly old time stamp 1970-01-01 07:59:59
tar: Skipping to next header
gzip: stdin: invalid compressed data--crc error
gzip: stdin: invalid compressed data--length error
tar: Child returned status 1
tar: Error is not recoverable: exiting now
WARNING: at kernel/rh_taint.c:13 mark_hardware_unsupported+0x39/0x40() (Not tainted)
:Hardware name: ThinkServer RD640
:Your hardware is unsupported. Please do not report bugs, panics, oopses, etc.,
on this hardware.
On a Mac Pro with 128GB RAM, kraken-build fails at the last step, using kraken-0.10.16 (from github) and jellyfish 1.1.11. This has happened with two different databases, one big, one small. Any suggestions? Thanks
>kraken-build --build --threads 14 --db ensemblgenomes --jellyfish-hash-size 14500M --max-db-size 100
Kraken build set to minimize disk writes.
Skipping step 1, k-mer set already exists.
Reducing database size (step 2 of 6)...
Shrinking DB to use only 8232020649 of the 15480282244 k-mers
Written 8232020649/8232020649 k-mers to new file
Database reduced. [1h21m16.000s]
Sorting k-mer set (step 3 of 6)...
K-mer set sorted. [1h32m56.000s]
Creating GI number to seqID map (step 4 of 6)...
GI number to seqID map created. [4m47.000s]
Creating seqID to taxID map (step 5 of 6)...
640200 sequences mapped to taxa. [2.000s]
Setting LCAs in database (step 6 of 6)...
set_lcas: database in improper format
Hi,
Which Jellyfish version is compatible with kraken?
> conda search jellyfish
Fetching package metadata ...............
jellyfish 0.5.1 py27_0 conda-forge
0.5.1 py34_0 conda-forge
0.5.1 py35_0 conda-forge
0.5.6 py35_0 conda-forge
0.5.6 py27_0 conda-forge
0.5.6 py34_0 conda-forge
1.1.11 0 bioconda
1.1.11 1 bioconda
2.2.3 0 bioconda
2.2.3 1 bioconda
* 2.2.6 0 bioconda
Thank you in advance.
Michal
It would be great if --add-to-library could support .gbk/.gbff (and .gz versions thereof). These have the taxid in a taxon:NNN field in the source tag.
To make it fast you can avoid parsing the Genbank file, and just read it as follows:
https://raw.githubusercontent.com/MDU-PHL/mdu-tools/master/bin/genbank-to-kraken_fasta.pl
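In the same spirit as the linked script, a hedged sketch of extracting a kraken:taxid FASTA record from a GenBank file without a full parser — the record below is fabricated, and the awk only handles this minimal single-record shape (LOCUS name, one taxon db_xref, one ORIGIN block):

```shell
# Minimal fabricated GenBank record for the demo.
cat > demo.gbk <<'EOF'
LOCUS       NZ_DEMO01  8 bp DNA linear BCT
FEATURES             Location/Qualifiers
     source          1..8
                     /db_xref="taxon:1280"
ORIGIN
        1 acgtacgt
//
EOF
awk '/^LOCUS/ { id=$2 }
     /\/db_xref="taxon:/ { sub(/.*taxon:/,""); sub(/".*/,""); taxid=$0 }
     /^ORIGIN/,/^\/\// { if ($0 !~ /^(ORIGIN|\/\/)/) { seq=$0; gsub(/[ 0-9]/,"",seq); s=s seq } }
     END { print ">" id "|kraken:taxid|" taxid; print toupper(s) }' demo.gbk
rm -f demo.gbk
```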
Hi, I built a custom database and searched with kraken, but when I try to translate the kraken output (kraken-translate --db myKrakenDB --mpa-format sequences.kraken), I get the following output:
Use of uninitialized value $_ in string eq at /Users/castrolab01/programs/kraken/kraken-translate line 116, <> line 1.
Use of uninitialized value $_ in string eq at /Users/castrolab01/programs/kraken/kraken-translate line 117, <> line 1.
Use of uninitialized value $_ in string eq at /Users/castrolab01/programs/kraken/kraken-translate line 118, <> line 1.
Use of uninitialized value $_ in string eq at /Users/castrolab01/programs/kraken/kraken-translate line 119, <> line 1.
Use of uninitialized value $_ in string eq at /Users/castrolab01/programs/kraken/kraken-translate line 120, <> line 1.
Use of uninitialized value $_ in string eq at /Users/castrolab01/programs/kraken/kraken-translate line 121, <> line 1.
Use of uninitialized value $_ in string eq at /Users/castrolab01/programs/kraken/kraken-translate line 122, <> line 1.
Use of uninitialized value $_ in string eq at /Users/castrolab01/programs/kraken/kraken-translate line 123, <> line 1.
Use of uninitialized value in transliteration (tr///) at /Users/castrolab01/programs/kraken/kraken-translate line 95, <> line 1.
Use of uninitialized value $taxid in numeric gt (>) at /Users/castrolab01/programs/kraken/kraken-translate line 101, <> line 1.
Use of uninitialized value $taxid in hash element at /Users/castrolab01/programs/kraken/kraken-translate line 104, <> line 1.
Use of uninitialized value $taxid in hash element at /Users/castrolab01/programs/kraken/kraken-translate line 110, <> line 1.
r1 root
Hi Derrick,
We talked about this topic during 2014 and I know you are working on it. I'm very excited to see you released a new kraken version! I really think kraken could be one of the best solutions for metabarcoding/metagenomics analysis.
Are you planning to release a guide or a script to create a kraken database from Silva, Greengenes or other custom databases?
Hi,
I get an error when I use paired reads in fastq files. I think something may be missing in the parsing options when --paired is set.
/home/lp113/soft/bin/kraken --paired --db /home/lp113/soft/kraken/db/minikraken_20140330 --quick --preload --min-hits 3 --threads 1 --out - --classified-out kraken_out /home/lp113/bcbio-nextgen/tests/test_automated_output/trim/Hsapiens_Mmusculus_1_trimmed.fq /home/lp113/bcbio-nextgen/tests/test_automated_output/trim/Hsapiens_Mmusculus_rep2_trimmed.fq
Loading database... read_merger.pl: mismatched mate pair names ('HWI-ST1233:94:C0YE6ACXX:2:1302:19463:27765' & 'HWI-ST1233:94:C0YE6ACXX:2:1302:19463:27765g')
complete.
classify: malformed fasta file - expected header char > not found
0 sequences (0.00 Mbp) processed in 0.000s (0.0 Kseq/m, 0.00 Mbp/m).
0 sequences classified (-nan%)
0 sequences unclassified (-nan%)
Hi - I am trying to build a custom KRAKEN database. I am following all the instructions provided in the manual, but I am not able to add the sequences to the database. Every time, I get the following error message: "Can't add "/data1/home/sandeep/HMP_Dataset/Simulations/Genomes/CC_121.fa": sequence is missing GI number"
Below are the different modifications I tried on the header of the sequence.
It would be great if you could let me know the reason I am getting this error message.
GI:686246107 Staphylococcus aureus ADL-101
gttacAAGCGCATTTTCGTTCAGTCAACTACTGCCAATATAACTTCGTAGAGCATAGAAC
ATTGATTTATGTACCAGCCTGATCAACATATAAATATAAATTTTTATGTTTCACGTAAAA

686246107 Staphylococcus aureus ADL-101
gttacAAGCGCATTTTCGTTCAGTCAACTACTGCCAATATAACTTCGTAGAGCATAGAAC
ATTGATTTATGTACCAGCCTGATCAACATATAAATATAAATTTTTATGTTTCACGTAAAA

CC_121|kraken:taxid|1308698 Staphylococcus aureus ADL-101
gttacAAGCGCATTTTCGTTCAGTCAACTACTGCCAATATAACTTCGTAGAGCATAGAAC
ATTGATTTATGTACCAGCCTGATCAACATATAAATATAAATTTTTATGTTTCACGTAAAA

kraken:taxid|1308698 Staphylococcus aureus ADL-101
gttacAAGCGCATTTTCGTTCAGTCAACTACTGCCAATATAACTTCGTAGAGCATAGAAC
ATTGATTTATGTACCAGCCTGATCAACATATAAATATAAATTTTTATGTTTCACGTAAAA
I'm building a custom db following the instructions on the kraken website, using the scripts provided by Mick Watson (http://www.opiniomics.org/building-a-kraken-database-with-new-ftp-structure-and-no-gi-numbers/). Everything works fine up to the moment when I try to build the db with the following command:
karsten$ kraken-build --build --threads 24 --work-on-disk --db kraken_20160720
Kraken build set to minimize RAM usage.
Creating k-mer set (step 1 of 6)...
Found jellyfish v1.1.11
find: -printf: unknown primary or operator
Copied from the script, the line in question looks like:
"KRAKEN_HASH_SIZE=$(find library/ '(' -name '*.fna' -o -name '*.fa' -o -name '*.ffn' ')' -printf '%s\n' | perl -nle '$sum += $_; END {print int(1.15 * $sum)}')"
I'm using Mac OS X 10.11.6 with Perl v5.18.2.
So, what is wrong with the printf command?
Cheers,
Karsten
From Kraken v0.10.0 to v0.10.2, the default build operation manipulated data that was almost entirely on disk. Some users found this to cause their builds to take extraordinarily long times, often forcing the user to give up on the build. As of v0.10.3, the default build operation is to work in RAM as much as possible, which dramatically lessens the number of random access disk I/O operations. However, some people may not have enough extra RAM to use this mode of operation, and the other mode (accessible via --work-on-disk) may not work given the I/O configuration on the user's computer. I do have some plans in mind to break the build process down into more manageable parts for such users, and I hope to have them available in future versions.
The usage information for the kraken* tools should inform the user of the values of $KRAKEN_DB_PATH, $KRAKEN_DEFAULT_DB, and $KRAKEN_NUM_THREADS. Currently:
--db NAME Name for Kraken DB
--threads NUM Number of threads
For example, if they are set:
--db NAME Name for Kraken DB (default=/var/db/kraken/minikraken)
--threads NUM Number of threads (default=8)