daehwankimlab / centrifuge

Classifier for metagenomic sequences

License: GNU General Public License v3.0

Perl 2.63% C++ 88.39% Makefile 1.00% C 4.48% Python 2.91% Shell 0.58%

centrifuge's People

Contributors: druvus, fbreitwieser, infphilo, jsh58, laserson, mourisl, sjaenick, stevenhwu

centrifuge's Issues

Some code files have executable status

-rwxrwxr-x. 1 linuxbrew linuxbrew   8506 Aug 18 04:05 tinythread.cpp
-rwxrwxr-x. 1 linuxbrew linuxbrew  21220 Aug 18 04:05 tinythread.h
-rwxrwxr-x. 1 linuxbrew linuxbrew 6940 Aug 18 04:05 fast_mutex.h
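Assuming these three are meant to be ordinary source files rather than scripts, the executable bit could simply be dropped:

$ chmod -x tinythread.cpp tinythread.h fast_mutex.h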

SIGSEGV in centrifuge-build-bin

centrifuge-build-bin crashed on me while attempting to build refseq_microbial with THREADS=150.
In addition, indices/Makefile does not check the return code of centrifuge-build, so the build
indicates success instead of failing.

Last lines of output:

bmax according to bmaxDivN setting: 3991473403
Using parameters --bmax 2993605053 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 2993605053 --dcv 1024
Constructing suffix-array element generator
Building DifferenceCoverSample
  Building sPrime
  Building sPrimeOrder
  V-Sorting samples
Core was generated by `centrifuge-build-bin --wrapper basic-0 -p 150 --ftabchars 14 --conversion-table'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000000000040c0ce in try_lock (this=0x0) at fast_mutex.h:161
161           );
(gdb) bt
#0  0x000000000040c0ce in try_lock (this=0x0) at fast_mutex.h:161
#1  lock (this=0x0) at fast_mutex.h:125
#2  ThreadSafe (locked=true, ptr_mutex=0x0, this=<synthetic pointer>) at threading.h:42
#3  VSorting_worker<SString<char> > (vp=0x1d4de30) at diff_sample.h:696
#4  0x000000000046265f in tthread::thread::wrapper_function (aArg=0x1450a20) at tinythread.cpp:169
#5  0x00002b14eadf5184 in start_thread (arg=0x2b1507e54700) at pthread_create.c:312
#6  0x00002b14eb92537d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) up 3
#3  VSorting_worker<SString<char> > (vp=0x1d4de30) at diff_sample.h:696
696                 ThreadSafe ts(param->mutex, true);
(gdb) p param
$1 = (VSortingParam<SString<char> > *) 0x1d4de30
(gdb) p param->mutex
$2 = (tthread::fast_mutex *) 0x0
(gdb) l
691         const size_t hlen = host.length();
692         uint32_t v = dcs->v();
693         while(true) {
694             size_t cur = 0;
695             {
696                 ThreadSafe ts(param->mutex, true);
697                 cur = *(param->cur);
698                 (*param->cur)++;
699             }
700             if(cur >= param->boundaries->size()) return;
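On the Makefile point: make only aborts a rule when the recipe exits nonzero, so if the recipe masks centrifuge-build's exit status the target looks built even after a crash. A minimal defensive sketch (target and variable names here are hypothetical, not taken from the actual indices/Makefile):

$(IDX_NAME).1.cf: $(REFERENCE_FILES)
	centrifuge-build $(BUILD_OPTS) $^ $(IDX_NAME) || { rm -f $(IDX_NAME).*.cf; exit 1; }

Removing the partial output on failure also prevents a later make run from treating a half-built index as up to date.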

Encountered internal Centrifuge exception (#1)

Hi, I am trying to set up Centrifuge on my system. However, I am getting the following error in the index-building step.

The command I am running:
centrifuge-build -p 4 --conversion-table seqid2taxid.map --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp input-sequences.fna abc

Output:
Settings:
  Output files: "abv.*.cf"
  Line rate: 7 (line is 128 bytes)
  Lines per side: 1 (side is 128 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Local offset rate: 3 (one in 8)
  Local fTable chars: 6
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  input-sequences.fna
Warning: Empty fasta file: 'input-sequences.fna'
Warning: All fasta inputs were empty
Total time for call to driver() for forward index: 00:00:00
Error: Encountered internal Centrifuge exception (#1)

Thanks
Gaurav
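The two warnings say Centrifuge saw input-sequences.fna as empty, so the exception is likely a symptom rather than the cause. A quick sanity check on the input before rebuilding (plain shell, nothing Centrifuge-specific):

$ ls -l input-sequences.fna            # is the file non-empty?
$ grep -c '^>' input-sequences.fna     # count FASTA records; 0 means no sequences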

refseq_microbial: archaea-chromosome_level is not a valid domain

The refseq_microbial database target attempts to obtain archaea-chromosome_level and fails:

$ make refseq_microbial 
.....
Making: refseq_microbial: refseq_microbial
make -f Makefile IDX_NAME=refseq_microbial
make[1]: Entering directory `/ceph/mgx-sw/src/centrifuge/indices'
[[ -d tmp_refseq_microbial ]] && rm -rf tmp_refseq_microbial; mkdir -p tmp_refseq_microbial
Downloading and dust-masking archaea-chromosome_level
centrifuge-download -o tmp_refseq_microbial  -m -d "archaea-chromosome_level" -P 1 refseq > \
                tmp_refseq_microbial/all-archaea-chromosome_level.map
archaea-chromosome_level is not a valid domain - use one of the following:
make[1]: *** [reference-sequences/all-archaea-chromosome_level.fna] Error 1
make[1]: Leaving directory `/ceph/mgx-sw/src/centrifuge/indices'
make: *** [refseq_microbial] Error 2
$ cat tmp_refseq_microbial/all-archaea-chromosome_level.map 
archaea
bacteria
fungi
invertebrate
plant
protozoa
vertebrate_mammalian
vertebrate_other
viral
$ ls reference-sequences/
all-archaea.fna  all-bacteria.fna  all-fungi.fna  all-protozoa.fna  all-viral.fna
all-archaea.map  all-bacteria.map  all-fungi.map  all-protozoa.map  all-viral.map
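Judging by the valid-domain list above and the -d/-a flags used in the manual examples quoted later on this page, the domain and the assembly level appear to be separate arguments, so the download step would presumably need to look something like this rather than fusing them into one token:

$ centrifuge-download -o tmp_refseq_microbial -m -d "archaea" -a "Chromosome" -P 1 refseq > tmp_refseq_microbial/all-archaea.map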

No report file and error

Hi,

some of my samples are not generating a report file, and the program exits with the following message.

report file /home/people/user/centrifuge/S1134.report
Number of iterations in EM algorithm: 1118
Probability diff. (P - P_prev) in the last iteration: 9.70335e-11
*** Error in `/home/people/user/apps/centrifuge-1.0.1-beta/centrifuge-class': free(): invalid next size (normal): 0x00000004819753a0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7d023)[0x7ffff7375023]
/home/people/user/apps/centrifuge-1.0.1-beta/centrifuge-class[0x445f7b]
/home/people/user/apps/centrifuge-1.0.1-beta/centrifuge-class[0x41cecb]
/home/people/user/apps/centrifuge-1.0.1-beta/centrifuge-class[0x41f3fb]
/home/people/user/apps/centrifuge-1.0.1-beta/centrifuge-class[0x49812b]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ffff7319b15]
/home/people/user/apps/centrifuge-1.0.1-beta/centrifuge-class[0x40f6e9]
======= Memory map: ========
00400000-00747000 r-xp 00000000 00:2c 6914555853 /home/people/user/apps/centrifuge-1.0.1-beta/centrifuge-class
00946000-0095e000 rw-p 00346000 00:2c 6914555853 /home/people/user/apps/centrifuge-1.0.1-beta/centrifuge-class
0095e000-481985000 rw-p 00000000 00:00 0 [heap]
7fed60000000-7fed62b0f000 rw-p 00000000 00:00 0
7fed62b0f000-7fed64000000 ---p 00000000 00:00 0
7fed88000000-7fed8903e000 rw-p 00000000 00:00 0
7fed8903e000-7fed8c000000 ---p 00000000 00:00 0
7fedd0000000-7fedd26b5000 rw-p 00000000 00:00 0
7fedd26b5000-7fedd4000000 ---p 00000000 00:00 0
7fedd4000000-7fedd55a4000 rw-p 00000000 00:00 0
7fedd55a4000-7fedd8000000 ---p 00000000 00:00 0
7fedd8000000-7fedd8fa1000 rw-p 00000000 00:00 0
7fedd8fa1000-7feddc000000 ---p 00000000 00:00 0
7fee40000000-7fee41254000 rw-p 00000000 00:00 0
7fee41254000-7fee44000000 ---p 00000000 00:00 0
7fee44000000-7fee45d13000 rw-p 00000000 00:00 0
7fee45d13000-7fee48000000 ---p 00000000 00:00 0
7fee48000000-7fee49d19000 rw-p 00000000 00:00 0
7fee49d19000-7fee4c000000 ---p 00000000 00:00 0
7ff5a6e85000-7ff5c7286000 rw-p 00000000 00:00 0
7ff5c8000000-7ff5c9b01000 rw-p 00000000 00:00 0
7ff5c9b01000-7ff5cc000000 ---p 00000000 00:00 0
7ff5cc000000-7ff5cd9b5000 rw-p 00000000 00:00 0
7ff5cd9b5000-7ff5d0000000 ---p 00000000 00:00 0
7ff5d0000000-7ff5d16f4000 rw-p 00000000 00:00 0
7ff5d16f4000-7ff5d4000000 ---p 00000000 00:00 0
7ff5d6c86000-7ff5d6c87000 ---p 00000000 00:00 0
7ff5d6c87000-7ff5d7487000 rw-p 00000000 00:00 0
7fff457a4000-7fffa63a6000 rw-p 00000000 00:00 0
7fffa67fd000-7fffa67fe000 ---p 00000000 00:00 0
7fffa67fe000-7fffa6ffe000 rw-p 00000000 00:00 0
7fffa6ffe000-7fffa6fff000 ---p 00000000 00:00 0
7fffa6fff000-7fffa77ff000 rw-p 00000000 00:00 0
7fffa77ff000-7fffa7800000 ---p 00000000 00:00 0
7fffa7800000-7fffa8000000 rw-p 00000000 00:00 0
7fffa8000000-7fffa9884000 rw-p 00000000 00:00 0
7fffa9884000-7fffac000000 ---p 00000000 00:00 0
7fffac000000-7fffacb5c000 rw-p 00000000 00:00 0
7fffacb5c000-7fffb0000000 ---p 00000000 00:00 0
7fffb04e2000-7fffb04f8000 r-xp 00000000 08:01 925922 /cm/local/apps/gcc/5.1.0/lib64/libgcc_s.so.1
7fffb04f8000-7fffb06f7000 ---p 00016000 08:01 925922 /cm/local/apps/gcc/5.1.0/lib64/libgcc_s.so.1
7fffb06f7000-7fffb06f8000 r--p 00015000 08:01 925922 /cm/local/apps/gcc/5.1.0/lib64/libgcc_s.so.1
7fffb06f8000-7fffb06f9000 rw-p 00016000 08:01 925922 /cm/local/apps/gcc/5.1.0/lib64/libgcc_s.so.1
7fffb06f9000-7fffb07f9000 rw-p 00000000 00:00 0
7fffb07f9000-7fffb07fa000 ---p 00000000 00:00 0
7fffb07fa000-7fffb0ffa000 rw-p 00000000 00:00 0
7fffb0ffa000-7fffb0ffb000 ---p 00000000 00:00 0
7fffb0ffb000-7fffb17fb000 rw-p 00000000 00:00 0
7fffb17fb000-7fffb17fc000 ---p 00000000 00:00 0
7fffb17fc000-7fffb1ffc000 rw-p 00000000 00:00 0
7fffb1ffc000-7fffb1ffd000 ---p 00000000 00:00 0
7fffb1ffd000-7fffb27fd000 rw-p 00000000 00:00 0
7fffb27fd000-7fffb27fe000 ---p 00000000 00:00 0
7fffb27fe000-7fffb2ffe000 rw-p 00000000 00:00 0
7fffb2ffe000-7fffb2fff000 ---p 00000000 00:00 0
7fffb2fff000-7fffb37ff000 rw-p 00000000 00:00 0
7fffb37ff000-7fffb3800000 ---p 00000000 00:00 0
7fffb3800000-7fffb4000000 rw-p 00000000 00:00 0
7fffb4000000-7fffb4e77000 rw-p 00000000 00:00 0
7fffb4e77000-7fffb8000000 ---p 00000000 00:00 0
7fffb8000000-7fffb9e85000 rw-p 00000000 00:00 0
7fffb9e85000-7fffbc000000 ---p 00000000 00:00 0
7fffbc000000-7fffbd4a1000 rw-p 00000000 00:00 0
7fffbd4a1000-7fffc0000000 ---p 00000000 00:00 0
7fffc002f000-7fffc00af000 rw-p 00000000 00:00 0
7fffc00af000-7fffc00b0000 ---p 00000000 00:00 0
7fffc00b0000-7fffe4b31000 rw-p 00000000 00:00 0
7fffe4b6d000-7fffe4b6e000 ---p 00000000 00:00 0
7fffe4b6e000-7fffe536e000 rw-p 00000000 00:00 0
7fffe536e000-7fffe536f000 ---p 00000000 00:00 0
7fffe536f000-7fffe5b6f000 rw-p 00000000 00:00 0
7fffe5b6f000-7fffe5b70000 ---p 00000000 00:00 0
7fffe5b70000-7fffe6370000 rw-p 00000000 00:00 0
7fffe6370000-7fffe6371000 ---p 00000000 00:00 0
7fffe6371000-7ffff72f8000 rw-p 00000000 00:00 0
7ffff72f8000-7ffff74ae000 r-xp 00000000 08:01 542276 /usr/lib64/libc-2.17.so
7ffff74ae000-7ffff76ae000 ---p 001b6000 08:01 542276 /usr/lib64/libc-2.17.so
7ffff76ae000-7ffff76b2000 r--p 001b6000 08:01 542276 /usr/lib64/libc-2.17.so
7ffff76b2000-7ffff76b4000 rw-p 001ba000 08:01 542276 /usr/lib64/libc-2.17.so
7ffff76b4000-7ffff76b9000 rw-p 00000000 00:00 0
7ffff76b9000-7ffff77ba000 r-xp 00000000 08:01 542597 /usr/lib64/libm-2.17.so
7ffff77ba000-7ffff79b9000 ---p 00101000 08:01 542597 /usr/lib64/libm-2.17.so
7ffff79b9000-7ffff79ba000 r--p 00100000 08:01 542597 /usr/lib64/libm-2.17.so
7ffff79ba000-7ffff79bb000 rw-p 00101000 08:01 542597 /usr/lib64/libm-2.17.so
7ffff79bb000-7ffff79be000 r-xp 00000000 08:01 556500 /usr/lib64/libdl-2.17.so
7ffff79be000-7ffff7bbd000 ---p 00003000 08:01 556500 /usr/lib64/libdl-2.17.so
7ffff7bbd000-7ffff7bbe000 r--p 00002000 08:01 556500 /usr/lib64/libdl-2.17.so
7ffff7bbe000-7ffff7bbf000 rw-p 00003000 08:01 556500 /usr/lib64/libdl-2.17.so
7ffff7bbf000-7ffff7bd5000 r-xp 00000000 08:01 556539 /usr/lib64/libpthread-2.17.so
7ffff7bd5000-7ffff7dd5000 ---p 00016000 08:01 556539 /usr/lib64/libpthread-2.17.so
7ffff7dd5000-7ffff7dd6000 r--p 00016000 08:01 556539 /usr/lib64/libpthread-2.17.so
7ffff7dd6000-7ffff7dd7000 rw-p 00017000 08:01 556539 /usr/lib64/libpthread-2.17.so
7ffff7dd7000-7ffff7ddb000 rw-p 00000000 00:00 0
7ffff7ddb000-7ffff7dfc000 r-xp 00000000 08:01 541997 /usr/lib64/ld-2.17.so
7ffff7e57000-7ffff7fe1000 rw-p 00000000 00:00 0
7ffff7ff8000-7ffff7ffa000 rw-p 00000000 00:00 0
7ffff7ffa000-7ffff7ffc000 r-xp 00000000 00:00 0 [vdso]
7ffff7ffc000-7ffff7ffd000 r--p 00021000 08:01 541997 /usr/lib64/ld-2.17.so
7ffff7ffd000-7ffff7ffe000 rw-p 00022000 08:01 541997 /usr/lib64/ld-2.17.so
7ffff7ffe000-7ffff7fff000 rw-p 00000000 00:00 0
7ffffffdd000-7ffffffff000 rw-p 00000000 00:00 0 [stack]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
(ERR): centrifuge-class died with signal 6 (ABRT)

The command to run the sample:

/home/people/user/apps/centrifuge-1.0.1-beta/centrifuge -k 1 -q -p 16 --reorder -x /home/people/user/apps/centrifuge-1.0.1-beta/indices/b+h+v/b+h+v -1 /home/people/user/data_trimmed/S1134_R1.trim.fq -2 /home/people/user/data_trimmed/S1134_R2.trim.fq --report-file /home/people/user/centrifuge/S1134.report -S /home/people/user/centrifuge/S1134.summary

Usage message shows hisat as the command

When I execute centrifuge I see this as usage:

Centrifuge version v1.0.1-beta-40-g689d12fbd0 by Daehwan Kim ([email protected], www.ccb.jhu.edu/people/infphilo)
Usage: 
  hisat [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <filename>] [--report-file <report>]
...

As you can see, it says to execute hisat.

No examples for custom database

Hello,

Not sure if this is the right place, but in your documentation, there are no examples in the 'Custom Database' section. Just a TODO list :-)

If you could add these in, that would be really useful.

Thanks!

Phil
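For what it's worth, the build commands quoted in other issues on this page suggest the general shape a Custom Database example could take; the following is only a sketch with placeholder file names, not the official recipe:

$ centrifuge-download -o taxonomy taxonomy
$ centrifuge-download -o library -m -d "bacteria" refseq > seqid2taxid.map
$ cat library/*/*.fna > input-sequences.fna
$ centrifuge-build -p 4 --conversion-table seqid2taxid.map --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp input-sequences.fna my_index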

(ERR): centrifuge-class died with signal 11 (SEGV)

Hello,
I was trying to use Centrifuge on several bacterial shotgun metagenomic datasets.
It stops with the following error message:
report file study.18512
Number of iterations in EM algorithm: 8415
Probability diff. (P - P_prev) in the last iteration: 9.99892e-11
(ERR): centrifuge-class died with signal 11 (SEGV)

The report file was empty.

example files missing

In the manual, there is an example for building a reference:

$CENTRIFUGE_HOME/centrifuge-build --conversion-table $CENTRIFUGE_HOME/example/reference/gi_to_tid.dmp --taxonomy-tree $CENTRIFUGE_HOME/example/reference/nodes.dmp --name-table $CENTRIFUGE_HOME/example/reference/names.dmp $CENTRIFUGE_HOME/example/reference/test.fa test

However, the example/reference directory only contains test.fa, so the command fails because other files are missing.

(ERR): centrifuge-class exited with value 1

Working with a database I created using bacterial, archaeal and fungal genomes, I am now getting the following when trying to run it with some reads:

Error reading _ebwt[] array: 1352, 7627660928
Error: Encountered internal Centrifuge exception (#1)
Command: /usr/local/bin/centrifuge-class --wrapper basic-0 -S /mnt/e/reads_output.txt --report-file /mnt/e/reads_report.tsv -f -p 8 -U /mnt/e/reads.fasta /mnt/e/centrifuge_database/abv
(ERR): centrifuge-class exited with value 1

Not sure what to do here. Any help, @infphilo?

centrifuge-inspect -s not working

./centrifuge-inspect -s indices/bacteria/bacteria                                                                                                                                                                   
Error: Encountered exception: 'Cannot open file indices/bacteria/bacteria.rev'
Command: centrifuge-inspect --wrapper basic-0 -s indices/bacteria/bacteria 

Zika Virus is not in your p+h+v pre-made indices? AND Centrifuge-download does not work?

Hello,

Kind of at wit's end with Centrifuge, as I've been trying to get it to work with my own database, and with NCBI bacteria & viruses, for a long time now. To paraphrase Roseanne Roseannadanna, "It's always something..."

I recently gave it another go with your pre-made indices just to see if I could get it to run at all. Before running a bunch of my samples through, I wanted to determine whether all of my target organisms were indeed in the database, using centrifuge-inspect and grep:

$ centrifuge-inspect --name-table p+h+v > nametable.txt
$ grep "Zika" nametable.txt
$

From what I can tell, Zika Virus is not in the p+h+v (the pre-made bacteria, viruses, archaea, human index listed on the right margin of your website)? All of my other target organisms (Human papillomavirus Type 132 and Variola virus, for example) are included in this index.

$ grep "Human papillomavirus type 132" nametable.txt
909331 Human papillomavirus type 132

$ grep "Variola" nametable.txt
10255 Variola virus

ALSO...

Since Zika did not seem to be included, I tried using centrifuge-download again, but I get an error. The connection to NCBI's FTP site seems to be blocked or otherwise broken. Below is the error I get:

$ centrifuge-download -o taxonomy taxonomy
Downloading NCBI taxonomy ...
rsync: failed to connect to ftp.ncbi.nih.gov (130.14.250.7): Connection refused (111)
rsync: failed to connect to ftp.ncbi.nih.gov (2607:f220:41e:250::13): Network is unreachable (101)
rsync error: error in socket IO (code 10) at clientserver.c(128) [Receiver=3.1.0]

I sent an email to NCBI describing what I was trying to do and asking whether there was an issue on their end or maybe my corporate firewall was the problem. Here is their response...

Hi,

Thanks for writing to us.

The issue is mostly the http protocol used by the tool. With the switch to HTTPS late last year, NCBI also requires that http access to our ftp site be switched to HTTPS. You will need to contact the Centrifuge code provider for them to update their code to use the HTTPS protocol instead.

A minor issue is the ftp.ncbi.nih.gov domain. Even though it may still work for historical reasons, it may not. The domain should be fully specified with .nlm included, aka ftp.ncbi.nlm.nih.gov

Regards,

Tao Tao, PhD
NCBI User Services

I dove into the centrifuge-download script to see if I could manually update the web address the script points to. There was only one place where the address was missing the '.nlm', on line 194. I added the '.nlm' to the address on that line, saved it, and re-ran, but I got the same error. I didn't see any references to http and/or https in the centrifuge-download source code.

Also, where does one manually retrieve the names.dmp and nodes.dmp files from NCBI? Weren't those files phased out when they updated to the new format without GI numbers?
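As far as I know, names.dmp and nodes.dmp still ship in NCBI's taxdump archive and were not phased out with the GI-number retirement; assuming HTTPS gets through the firewall, they can be fetched manually:

$ wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
$ tar -xzf taxdump.tar.gz names.dmp nodes.dmp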

Any help ironing out these problems would be much appreciated.

Thank you.

centrifuge_report.tsv empty

I've encountered a weird bug which happens whenever --metric-file or --report-file is given as a parameter. The output of report.tsv (or whatever I named it) is incomplete, or rather blank: it contains only one line of headers.

If I don't define --metric-file and/or --report-file, the standard centrifuge_report.tsv is written normally.

Classification results are always being written though.

Inconsistent executable bit on Perl scripts

12 -rwxrwxr-x. 1 linuxbrew linuxbrew 12122 Aug 18 04:05 centrifuge-BuildSharedSequence.pl
 4 -rw-rw-r--. 1 linuxbrew linuxbrew   403 Aug 18 04:05 centrifuge-RemoveEmptySequence.pl
 4 -rw-rw-r--. 1 linuxbrew linuxbrew  1002 Aug 18 04:05 centrifuge-RemoveN.pl
16 -rwxrwxr-x. 1 linuxbrew linuxbrew 12918 Aug 18 04:05 centrifuge-compress.pl
 4 -rwxrwxr-x. 1 linuxbrew linuxbrew  1564 Aug 18 04:05 centrifuge-sort-nt.pl
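Assuming all five scripts are meant to be runnable, the bits could be normalized in one go:

$ chmod +x centrifuge-*.pl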

(ERR): centrifuge-class died with signal 11 (SEGV) (core dumped)

Hi there,

I installed Centrifuge successfully and ran the test data provided with the tool; it all worked fine. However, when I ran my own dataset with the command:

centrifuge -f -x abv ~/Documents/HC.650.fa

It gives me this error:
(ERR): centrifuge-class died with signal 11 (SEGV) (core dumped)

At first I was working locally; then I thought this could be a memory issue, so I went to the server and increased the memory allocation too, but the problem remains the same.

I would appreciate it if you could please guide me through this.

Thanks
Gaurav

Examples in Manual missing 'refseq'

FYI, the examples below appear to be missing 'refseq' at the end, before the '>>'.

Just stepping through.

Thanks,
Bob

# download mouse and human reference genomes
centrifuge-download -o library -d "vertebrate_mammalian" -a "Chromosome" -t 9606,10090 -c 'reference genome' >> seqid2taxid.map
# only human
centrifuge-download -o library -d "vertebrate_mammalian" -a "Chromosome" -t 9606 -c 'reference genome' >> seqid2taxid.map
# only mouse
centrifuge-download -o library -d "vertebrate_mammalian" -a "Chromosome" -t 10090 -c 'reference genome' >> seqid2taxid.map
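With the missing positional argument restored, the first example would presumably read:

centrifuge-download -o library -d "vertebrate_mammalian" -a "Chromosome" -t 9606,10090 -c 'reference genome' refseq >> seqid2taxid.map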

Inconsistent Perl paths

Please make them all use #!/usr/bin/env perl.

head -n 1 *.pl

==> centrifuge-BuildSharedSequence.pl <==
#!/bin/perl

==> centrifuge-RemoveEmptySequence.pl <==
#!/bin/perl

==> centrifuge-RemoveN.pl <==
#/bin/perl

==> centrifuge-compress.pl <==
#!/usr/bin/perl

==> centrifuge-sort-nt.pl <==
#! /usr/bin/env perl
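Assuming GNU sed, one line normalizes all five shebangs; the pattern also catches centrifuge-RemoveN.pl, whose first line is missing the '!':

$ sed -i '1s|^#!\?.*perl.*$|#!/usr/bin/env perl|' centrifuge-*.pl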

taxid 0 reported

I am getting the following results for a read:
readID seqID taxID score 2ndBestScore hitLength queryLength numMatches
cc722ee0-14cf-43a0-b53c-75ba10f250e4_Basecall_Alignment_template no rank 0 256 256 31 5226 4
cc722ee0-14cf-43a0-b53c-75ba10f250e4_Basecall_Alignment_template no rank 0 256 256 31 5226 4
cc722ee0-14cf-43a0-b53c-75ba10f250e4_Basecall_Alignment_template family 10699 256 256 31 5226 4
cc722ee0-14cf-43a0-b53c-75ba10f250e4_Basecall_Alignment_template no rank 0 256 256 31 5226 4

Is it normal to get taxID 0? It is not present when using 'centrifuge-inspect --taxonomy-tree'.

best,
c

Read: ACATACTTTACGTTCAGTTACGTATTGCTCAGCACCATCTATAGGTGGCAATGGCTCATTCAATTATTCTAAAACAATTAGTTATACCCAAAGAGTTATGTCAGTGAAGTAGACAAGCAAAACTCAAAATCTACTGTTAAATGATGTTCAAAGCAAACGAATTTGTACATACGATGGAAAAATCTGCGCATGATAGTATTTATTCGTACAAAGTCAAATGGTCCAGCAGTTTCAGCAAGAATATTTTGCTCCTGATAATCAGTACCACCTTTAGTTCAAGTGGCTTTAATCCATCGTTTATCACTACACTATCACATGAAAAGGTTCAAGTGATGAGTGAATTGAAATTTCATATGGTAGAAACTTAGATATTACATATGCGACTTTATTCCTAAATTTAGTATTTGCAGAAAGAAAGCATAATGCATTTGTAAATAGAAACTTTGTAGTTAGATATGAGTTAATTGGAAAACACGGGAATTAAGAGTGAAAGGACGCAATTAATATGAAATGAAAAATTGAGTCAAATCATCAGTTGCTTCATCGTTGCACTGCTTTTGCTATCGAATACAGTTGATGCAGCTCAACATATCACACCTG
Index: refseq-viral
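As a stopgap until the cause is found, rows with taxID 0 can be filtered out of the classification output, taxID being the third column (file name is a placeholder):

$ awk -F'\t' '$3 != 0' classifications.tsv > classifications.filtered.tsv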

There is no "install" target in Makefile

% make install
make: *** No rule to make target `install'.  Stop.

This is using your latest public release and following your Install docs.

If this is fixed in HEAD can you please make a new release for packaging in Homebrew Science.
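Until an install target lands, one workaround is copying the wrappers, binaries and scripts onto the PATH by hand (destination directory is only an example):

$ cp centrifuge centrifuge-* /usr/local/bin/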

Allow tab-delimited, non-hierarchical taxonomy output from centrifuge-kreport

centrifuge-kreport currently outputs hierarchical, two-space-indented reports which do not correspond to kraken's usual report format.

Downstream tools that depend on this, such as Krona, do not accept hierarchical formats, but rather the classical P;C;O;(...) output that kraken delivers. It is also somewhat difficult to convert the current output to a Krona-friendly format.

Any chance a 'root;cellular organisms;Bacteria;Actinobacteria;Actinobacteria;Corynebacteriales;Mycobacteriaceae;Mycobacterium;'-like format can be included in this command?

Either that or does someone have a workaround for this?

Thanks!
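One workaround that bypasses kreport entirely: Krona's ktImportTaxonomy reads tab-separated (readID, taxID) pairs, which can be cut straight out of the per-read classification output (file names are placeholders; column 3 is the taxID, per the output format shown elsewhere on this page):

$ cut -f1,3 classifications.tsv > reads.krona
$ ktImportTaxonomy reads.krona -o centrifuge.krona.html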

target 'install'

Hi,

I'm looking forward to trying your tool! Thanks for making the source code and the pre-print available.

I realized the Makefile does not have an install target, and I was curious whether you are planning to change that.

Best wishes,

Centrifuge Crashes with core dumped

Hi,

I am trying this beta build; it worked well for a few files and then stopped working with the message below.

(ERR): centrifuge-class died with signal 11 (SEGV) (core dumped)

This machine has 448 GB of RAM and 32 cores.

ashish4@ashish4:/mnt/centrifuge/library$ ~/centri*/centrifuge --version
/home/ashish4/centrifuge/centrifuge-class version 1.0.0-beta
64-bit
Built on ashish
Sun Jan 31 01:03:16 UTC 2016
Compiler: gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04) 
Options: -O3 -m64 -msse2 -funroll-loops -g3 -DPOPCNT_CAPABILITY
Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}

Indices makefile failed - grep: .listing: No such file or directory

make b_compressed+h+v THREADS=72

Making: b_compressed+h+v: b_compressed+h+v
make -f Makefile IDX_NAME=b_compressed+h+v
make[1]: Entering directory `/mnt/seq/KRAKEN/centrifuge'
mkdir -p reference-sequences
[[ -d tmp_b_compressed+h+v ]] && rm -rf tmp_b_compressed+h+v; mkdir -p tmp_b_compressed+h+v
Downloading and dust-masking viral
centrifuge-download -o tmp_b_compressed+h+v  -m -d "viral" -P 72 refseq > \
        tmp_b_compressed+h+v/all-viral.map
grep: .listing: No such file or directory
viral is not a valid domain - use one of the following:
grep: .listing: No such file or directory
make[1]: *** [reference-sequences/all-viral.fna] Error 2
make[1]: Leaving directory `/mnt/seq/KRAKEN/centrifuge'
make: *** [b_compressed+h+v] Error 2

Here's the file status:

% find .
.
./Makefile
./reference-sequences
./tmp_b_compressed+h+v
./tmp_b_compressed+h+v/all-viral.map

and that last .map file is empty.

(ERR): centrifuge-class died with signal 11 (SEGV)

Hello,

Oops, I pressed the submit button too early on my previous request.
Here I will also include the log file and the error message below.
Could you please let me know what might be wrong?
I am using centrifuge-1.0.1-beta
Thanks very much,

Josef,

18512.C.pdf

report file study.18512
Number of iterations in EM algorithm: 8415
Probability diff. (P - P_prev) in the last iteration: 9.99892e-11
(ERR): centrifuge-class died with signal 11 (SEGV)

Number of unclassified reads is not correct

I ran some samples with Centrifuge, but I noticed that I'm missing some reads in the end results. As an example, here is a summary of the numbers I find:
fastq file = 27587 reads
centrifuge metrics - Read = 27587
centrifuge metrics - UnfilteredRead = 27587
cat <centrifuge output> | cut -f1 | uniq | wc -l = 22428 (including header)
centrifuge kreport - unclassified = 32
centrifuge kreport - root = 22394

So something about these numbers is not completely correct. In the Centrifuge output I'm missing 27587 - 22427 = 5160 reads. What happened to those reads? From the metrics file I could not directly see why those reads were not used; I do know they were read, because the numbers in the metrics file are correct.

When I add unclassified to root I get 22394 + 32 = 22426, which means one more read is lost between the Centrifuge output and the kreport. This looks like an off-by-one, an array starting from 1 instead of 0?

I see this pattern in all the samples I ran; it does not matter whether the data is Illumina or Nanopore.
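One caveat on the counting itself: uniq only collapses adjacent duplicates, so the distinct-read count is safer with a sort first (read IDs happen to be grouped in the output, but sorting removes that assumption):

$ cut -f1 output.tsv | sort -u | wc -l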

.gz support for kreport script

It would be nice if the kreport script supported zipped input files. The output is one line per read/pair, which can become a very large file; piping from centrifuge to gzip is easy, but kreport does not accept compressed input.
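Until native .gz support exists, shells with process substitution (bash, zsh) can decompress on the fly so nothing large ever hits the disk uncompressed; index and file names below are placeholders:

$ centrifuge-kreport -x my_index <(zcat output.tsv.gz) > output.kreport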

parameter to produce one taxonomy ID per sequence

Hi,
I am trying to use Centrifuge for a metagenomics project. I only need one taxonomy ID for each read, but there are too many of them, and I have no idea which parameter controls how many taxonomy IDs a read can output. Could you please tell me which parameter limits the number of reported taxonomy IDs?

Thanks a lot.
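The -k option, used as '-k 1' in a command quoted in another issue on this page, sets the number of distinct assignments reported per read, so the following sketch (index and file names are placeholders) should yield one taxonomy ID per read:

$ centrifuge -k 1 -x my_index -U reads.fq -S classifications.tsv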

make error on downloading databases

I've been trying to set up databases for Centrifuge for a few weeks now, and I've been casually banging my head against this problem. I get the same error on two different macOS Sierra machines. I suspect it's a permission issue: it seems that after the files are downloaded, the data can't be written to the database. I have tried modifying the makefile to use 'sudo', but none of my changes seem to work.

Here's my error message:

Progress : [######################################--] 97% 5855/5983
make[1]: *** [reference-sequences/all-viral.fna] Error 1
make: *** [b_compressed+h+v] Error 2

Any help here would be appreciated! Thanks!

Error in `centrifuge-class': munmap_chunk(): invalid pointer

After running the following command with 10 cores and 150GB of memory (of which 87.140GB were used) on CentOS Linux release 7.1.1503:

$CENTRIFUGE_HOME/centrifuge -q -x datasets/centrifuge/nt/nt -1 f1.R1.fastq -2 f1.R2.fastq

Centrifuge classifies sequences from the fastq files, but errors out when generating centrifuge_report.csv.

The error output is here:
error.txt

Am I doing something wrong?

make nt database index

I'm still banging my head against this problem. I've worked on it off and on for the last few months and can't seem to get this database to index. I'm trying to build the nt database index, and I am now getting this error:

$ centrifuge-build -p 16 --bmax 1342177280 --conversion-table gi_taxid_nucl.map --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp nt.fna nt
Settings:
  Output files: "nt.*.cf"
  Line rate: 7 (line is 128 bytes)
  Lines per side: 1 (side is 128 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Local offset rate: 3 (one in 8)
  Local fTable chars: 6
  Max bucket size: 1342177280
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: default
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  nt.fna
Reading reference sizes
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
  Time reading reference sizes: 00:29:08
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
Killed: 9

Can you shed some light on what is happening here? Do I need to specify a different value for memory? Thanks!
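For what it's worth, 'Killed: 9' at this stage usually means the operating system killed the process for exhausting memory, not that centrifuge-build aborted on its own; on Linux the kernel log typically confirms it:

$ dmesg | grep -i 'killed process'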

add unmatched reads to report table

Maybe there is already a way to extract this information, but would it be possible to add the number of unknown or unmatched reads to the report table? As far as I can tell, only matches are currently reported.

Specify taxonomic level

Hi,

Is there any option to specify the taxonomic level at which to report the classification, or even the full lineage? Currently I'm getting only species/genus classifications, which aren't really informative for assessing environmental metagenome bins.

Thanks,

Ruben

centrifuge report fail

I tried running Centrifuge on the test data and it works, but it keeps failing on my own data with the same reference.

$CENTRIFUGE_HOME/centrifuge -f -x ~/ref/Centrifuge/b+h+v/b+h+v ./testfile.fasta

I see the results, but the report fails:

report file centrifuge_report.csv
Number of iterations in EM algorithm: 4
Probability diff. (P - P_prev) in the last iteration: 3.52546e-11
*** glibc detected *** /ifs/home/id/software/centrifuge-1.0.1-beta/centrifuge-class: munmap_chunk(): invalid pointer: 0x000000000185a7f0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x75f4e)[0x2aaaab3e8f4e]
/ifs/home/id/software/centrifuge-1.0.1-beta/centrifuge-class[0x445fd2]
/ifs/home/id/software/centrifuge-1.0.1-beta/centrifuge-class[0x41cecb]
/ifs/home/id/software/centrifuge-1.0.1-beta/centrifuge-class[0x41f3fb]
/ifs/home/id/software/centrifuge-1.0.1-beta/centrifuge-class[0x49812b]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2aaaab391d5d]
/ifs/home/id/software/centrifuge-1.0.1-beta/centrifuge-class[0x40f6e9]

This happened with 1,000 50 bp sequences, but not when I used 100.

incorrect genome sizes

It looks like the summary report may be reporting wrong genome sizes.

For human (taxID 9606):
From report: 6,339,524,059 (2X bigger than expected)
From NCBI: median total length (Mb): 2996.43

For gorilla (taxID 9593):
From report: 19,140,263 (100X smaller than expected)
From NCBI: median total length (Mb): 3058.03

For Picea glauca (taxID 3330):
From report: 26,852,969 (1,000X smaller than expected)
From NCBI: median total length (Mb): 25784.7

error in parsing UniVec?

The output of centrifuge-download contaminants looks like

gnl|uv  32630
gnl|uv  32630
gnl|uv  32630
gnl|uv  32630
gnl|uv  32630
gnl|uv  32630
...

Unaligned reads output (--un)

Hi Daehwan,

Just wanted to notify you about a minor thing I noticed: while there were reads missing from the main Centrifuge output (presumably unaligned?), the file for unaligned reads specified with the --un option remained empty.

Personally this is not an issue for me, since it was easy to identify the unaligned reads by other means.

Cheers,
Moritz

Warning: taxomony id doesn't exists for NC_001224.1!

Hi, I ran into an issue when I tried to add Saccharomyces cerevisiae genomes and others to my Centrifuge database: I get this warning for 55,542 sequence IDs.
I created my own "seqid2taxid.map" file, and I can grep the sequence IDs in it; for instance, in vim the sequence ID NC_001224.1 is associated with tax ID 559292 (tab-separated, with no other characters on the line, i.e. NC_001224.1^I559292$).

Then I looked in the names.dmp and nodes.dmp files, and 559292 is associated with:
grep -w "^559292" taxonomy/names.dmp

559292 | Saccharomyces cerevisiae S288c | | scientific name |

And on nodes.dmp:

grep -w "^559292" taxonomy/nodes.dmp
559292 | 4932 | no rank | | 4 | 1 | 1 | 1 | 3 | 1 | 1 | 0 | |

Could you explain whether I am doing something wrong that I can correct? I really want to count S. cerevisiae in my data, and right now I can't associate taxonomy with its sequences.

Thanks,

Alban
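A hypothetical sanity check, in case invisible characters (Windows line endings, stray spaces) are breaking the lookup, is to dump the exact bytes of the mapping line; a clean tab-separated entry should show exactly NC_001224.1^I559292$:

$ grep NC_001224.1 seqid2taxid.map | cat -A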
