kasperskytte / autotax

Generate de novo taxonomy of full length 16S rRNA sequences directly from environmental samples

License: GNU General Public License v3.0

Shell 94.76% Dockerfile 5.24%

Contributors: kasperskytte, msdueholm

autotax's Issues

Split the pipeline

Hi Kasper,
I would like to have the pipeline split into separate scripts.

  1. Script used to generate ESVs
  2. Script used to add additional ESVs to an ESV-database (not made yet)
  3. Script used to classify ESVs based on SILVA+typestrains+denovo tax
  4. Script used to manually curate the taxonomy (replacement_list.txt)
  5. Script that runs the 4 scripts above.
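Item 5 could be sketched roughly as follows. This is only an illustration of a wrapper that runs the stages in order and stops at the first failure. The four script names are hypothetical placeholders, not files that exist in the repository, and the single input argument is an assumption.

```shell
#!/usr/bin/env bash
# Run the four pipeline stages in order, stopping at the first failure.
# NOTE: the script names below are hypothetical placeholders.
run_pipeline() {
  local script_dir=$1 input=$2 stage
  for stage in generate_esvs.sh add_esvs.sh classify_esvs.sh curate_taxonomy.sh; do
    echo "Running ${stage}..."
    bash "${script_dir}/${stage}" "$input" || {
      echo "Stage ${stage} failed; aborting." >&2
      return 1
    }
  done
}

# Example (paths are placeholders):
#   run_pipeline ./scripts file.fasta
```

Keeping each stage in its own script means a single stage can also be rerun on its own, for example only the manual curation step after editing replacement_list.txt.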

Adding sequences can change existing FLASV IDs

When adding new FLASVs, if the existing ones are not numbered consecutively from 1 to N (without gaps), the existing IDs are overwritten, and so are the new de novo clusters. The renaming currently numbers every sequence from 1:

names(FLASVs) <- paste0("FLASV", 1:length(FLASVs), ".", lengths(FLASVs))

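A solution would be to continue from the last existing ID and rename only the newly added sequences, something like this instead:

# continue from the last existing ID; only the newly added sequences get new names
lastID <- as.integer(gsub("FLASV|\\.[0-9]+$", "", names(querySeqs)[length(querySeqs)]))
newIdx <- (length(querySeqs) + 1):length(FLASVs)
names(FLASVs)[newIdx] <- paste0("FLASV", lastID + seq_along(newIdx), ".", lengths(FLASVs)[newIdx])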
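The same idea can be checked outside R with a small shell sketch. It assumes FASTA headers of the form ">FLASV<id>.<length>", as produced by the naming code above. The function names are made up for illustration. Unlike the R snippet, it takes the highest existing ID rather than the ID of the last sequence, which is a variation and not what the issue proposes.

```shell
#!/usr/bin/env bash
# Print the highest numeric FLASV ID among FASTA headers read from stdin.
# Headers are assumed to look like ">FLASV<id>.<length>".
max_flasv_id() {
  awk '/^>FLASV[0-9]+\./ {
         id = $0
         sub(/^>FLASV/, "", id)   # drop the prefix
         sub(/\..*$/, "", id)     # drop ".<length>" and anything after it
         if (id + 0 > max) max = id + 0
       }
       END { print max + 0 }'
}

# Print names "FLASV<id>.<length>" for the sequence lengths given as
# arguments, continuing with the ID that follows $1.
next_flasv_names() {
  local last_id=$1 len
  shift
  for len in "$@"; do
    last_id=$((last_id + 1))
    printf 'FLASV%d.%d\n' "$last_id" "$len"
  done
}

# Example (file name is a placeholder):
#   last=$(max_flasv_id < existing_FLASVs.fa)
#   next_flasv_names "$last" 1450 1390
```

Because new IDs always start above the highest existing one, gaps in the old numbering no longer cause existing IDs to be reused.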

How to get SILVA132-typestrains.arb?

Maybe the solution is obvious, but I cannot find the SILVA132-typestrains.arb file anywhere.
I only found SILVA_132_SSURef_NR99_13_12_17_opt.arb.

Is this file available for download, or do I need to create it myself?

Best regards

Error when input sequences only occur once

Hi, first of all, thank you for this tool. It looks very interesting and is exactly what I need for my study.

I first ran getsilvadb.sh to create my database (file.arb, file.udh, file_typestrains.udh).
Everything ran well until I ran autotax.bash with my FASTA file:

bash autotax.bash -i file.fasta
[2021-06-28 11:57:15]: Checking for required R packages and installing if missing...
[2021-06-28 11:57:17]:   - Orienting sequences...
[2021-06-28 11:57:19]:   - Dereplicating sequences...
[2021-06-28 11:57:19]:   - Denoising sequences using UNOISE3
usearch v11.0.667_i86linux32, 4.0Gb RAM (16.0Gb total), 8 cores
(C) Copyright 2013-18 Robert C. Edgar, all rights reserved.
https://drive5.com/usearch

License: personal use only

00:00 41Mb    100.0% Reading temp/uniques_wsize.fa
[2021-06-28 11:57:19]:   - Finding the longest representative sequence of identical sequences, then reorder and rename...
Error in S4Vectors:::normarg_names(value, class(x), length(x)) : 
  attempt to set too many names (2) on GroupedIRanges object of length 0
Calls: names<- -> names<- -> names<- -> names<- -> <Anonymous>
Execution halted

Can you tell me where I went wrong? I double-checked everything, but it is still not working.

Thank you for your help.
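The log above fails right after denoising, on an object of length 0. A guard like the following would stop with a readable message instead of the R error. It is only a sketch, and it assumes the denoised FASTA file ends up empty when every input sequence occurs only once, as the issue title suggests. The function name and the file path in the example are placeholders, not part of autotax.

```shell
#!/usr/bin/env bash
# Print the number of FASTA records in a file, or fail with a message
# when the file is missing or contains no records.
require_nonempty_fasta() {
  local file=$1 step=$2 n
  # grep -c prints 0 and exits non-zero when nothing matches
  n=$(grep -c '^>' "$file" 2>/dev/null) || n=${n:-0}
  if [ "$n" -eq 0 ]; then
    echo "No sequences left after ${step} (${file}); stopping." >&2
    return 1
  fi
  echo "$n"
}

# Example (path is a placeholder):
#   require_nonempty_fasta temp/denoised.fa "denoising" || exit 1
```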

Error with vsearch in MergeTaxonomy and Orient

Hello!

First of all, thank you for making AutoTax publicly available. I am trying to reproduce the output you generated with your original script, which uses USEARCH. At the moment I don't have the 64-bit version of USEARCH available. When I ran your script and reached the “Finding taxonomy of best hit in SILVA database” step, the memory limit of the 32-bit process was exceeded. So I replaced the commands with their VSEARCH counterparts:
For searchTaxDB:
$ vsearch -usearch_global $input -db $database -maxaccepts 0 -maxrejects 0 -top_hits_only -strand plus -id 0 -blast6out $output -threads $MAX_THREADS

For searchTaxDB_typestrain:
$ vsearch -usearch_global $input -db $database -maxaccepts 0 -maxrejects 0 -strand plus -id 0.987 -blast6out $output -threads $MAX_THREADS

The script ran smoothly but when it arrived at the mergeTaxonomy-step I got this output:

Matching unique query sequences: 6 of 85 (7.06%)
[2021-05-27 13:00:24]: Clustering FLASV's at Species level (98.7% identity)
[2021-05-27 13:00:24]: Clustering FLASV's at Genus level (94.5% identity)
[2021-05-27 13:00:24]: Clustering FLASV's at Family level (86.5% identity)
[2021-05-27 13:00:24]: Clustering FLASV's at Order level (82.0% identity)
[2021-05-27 13:00:25]: Clustering FLASV's at Class level (78.5% identity)
[2021-05-27 13:00:25]: Clustering FLASV's at Phylum level (75.0% identity)
[2021-05-27 13:00:25]: Merging and outputting taxonomy...
No replacement file found, skipping...
Error in S4Vectors:::normarg_names(value, class(x), length(x)) : 
  attempt to set too many names (89) on GroupedIRanges object of length
  85
Calls: names<- -> names<- -> names<- -> names<- -> <Anonymous>
Execution halted
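The thread does not establish the cause of the 89 versus 85 mismatch. One possibility worth checking, offered purely as an assumption, is that the blast6 hit table holds more than one row for some queries (for instance tied top hits), which would give more taxonomy rows than sequences. The sketch below counts distinct queries and keeps only the first reported hit per query. It relies only on the blast6 format being tab-separated with the query label in column 1. The function names and the file name in the example are made up.

```shell
#!/usr/bin/env bash
# Keep only the first reported hit for each query label (column 1)
# of a tab-separated blast6 table read from stdin.
first_hit_per_query() {
  awk -F '\t' '!seen[$1]++'
}

# Count the distinct query labels in a blast6 table read from stdin.
count_queries() {
  awk -F '\t' '!seen[$1]++ { n++ } END { print n + 0 }'
}

# Example (file name is a placeholder):
#   wc -l < hits.b6              # total hit rows
#   count_queries < hits.b6      # distinct queries
#   first_hit_per_query < hits.b6 > hits_single.b6
```

If the two counts differ, some queries have several hits, and whichever hit is kept for a tie is arbitrary from a taxonomic point of view.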

These are the files (three missing) that I've got in the /output folder:

tax_complete.csv
tax_denovo.csv
tax_SILVA.csv
tax_slv_typestr.csv
tax_typestrains.csv

I thought that there was something wrong with my own databases. So I changed the databases to the ones you’ve originally used (SILVA_138_SSURef_NR99) and shared here: https://figshare.com/articles/dataset/SILVA132_typestrains_in_ARB_UDB11_format/9994568?file=22790396

But when I tried to run your script with your original databases, I got this error at the orient step:

[2021-05-27 13:56:11]: Checking for required R packages and installing if missing...
[2021-05-27 13:56:13]:   - Orienting sequences...
usearch v11.0.667_i86linux32, 4.0Gb RAM (198Gb total), 128 cores
(C) Copyright 2013-18 Robert C. Edgar, all rights reserved.
https://drive5.com/usearch


00:03 2.8Gb   100.0% Rows
00:03 2.8Gb  Reading pointers...done.
00:03 2.9Gb  Reading db seqs...

usearch -orient test/example_data/5K_addon_FLASVs.fa -db /home/genomics/npeeters/AutoTax/refdatabases/SILVA_138_SSURef_NR99_tax_silvaK.udb -fastaout temp/fSSUs_oriented.fa -threads 1

---Fatal error---
ReadStdioFile failed, attempted 754773441 bytes, read 748130870 bytes, errno=0

Have you encountered this problem before? Is there a way to fix this?

Thank you in advance!
