kasperskytte / autotax

Generate de novo taxonomy of full length 16S rRNA sequences directly from environmental samples

License: GNU General Public License v3.0

Shell 94.76% Dockerfile 5.24%

autotax's Introduction

Hi there 👋

🇩🇰 My name is Kasper, I enjoy:

👉 the level 2 dad-life 🚸 as an HPC admin 💻
👉 data science of all sorts!
👉 bioinformatics and DNA sequencing 🧬
👉 Machine learning and (bio)statistics
👉 Linux 🐧 system administration and Ansible
👉 building R packages 📦 and scripts
👉 bundling anything into docker/singularity containers 🛳️
👉 building a Shiny app or two, tiny or huge
- for monitoring the fermentation of your homebrewed beer 🍺
  - https://github.com/KasperSkytte/KaspbeeryPi/
- or analyzing 🔬 complex microbial communities 🦠, simply:
👉 Self-hosting! I love tinkering with open source software, virtual machines, containers.
👉 Investing and finance, building trading robots! 💰

autotax's Issues

Split the pipeline

Hi Kasper,
I would like to have the pipeline split into separate scripts.

Script used to generate ESVs
Script used to add additional ESVs to an ESV-database (not made yet)
Script used to classify ESVs based on SILVA+typestrains+denovo tax
Script used to manually curate the taxonomy (replacement_list.txt)
Script that runs the 4 scripts above.
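
As a rough illustration of the last item, a wrapper could run the individual steps in order and stop at the first failure. This is only a sketch: `run_steps` and the script names in the usage comment are hypothetical and do not exist in the repository.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run each step (a script path or function name) in order
# and stop as soon as one of them fails.
run_steps() {
  local step
  for step in "$@"; do
    # progress message goes to stderr so it does not mix with step output
    echo "[$(date '+%Y-%m-%d %H:%M:%S')]: running ${step}" >&2
    "$step" || { echo "step failed: ${step}" >&2; return 1; }
  done
}

# Usage with hypothetical script names, one per step listed above:
#   run_steps ./generate_esvs.sh ./add_esvs.sh ./classify_esvs.sh ./curate_taxonomy.sh
```

Keeping the wrapper this thin means each step can still be run and debugged on its own.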

Missing content of example_data/10k_fSSUs.fa

The file example_data/10k_fSSUs.fa is empty.

It looks like its contents may have been deleted in commit 8418b85?

Adding seqs can result in old FLASV IDs being changed

When adding new FLASVs, if the existing ones do not have IDs running in an unbroken sequence from 1 to N (the number of sequences), existing IDs will be overwritten, and so will new de novo clusters.

AutoTax/autotax.bash, line 498 in 0f26527:

names(FLASVs) <- paste0("FLASV", 1:length(FLASVs), ".", lengths(FLASVs))

A solution would be to use something like this instead:

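lastID <- as.integer(gsub("FLASV|\\.[0-9]+$", "", names(querySeqs)[length(querySeqs)]))
# rename only the newly added sequences; their IDs continue after the last existing ID
newIdx <- (length(querySeqs) + 1):length(FLASVs)
names(FLASVs)[newIdx] <- paste0("FLASV", lastID + seq_along(newIdx), ".", lengths(FLASVs)[newIdx])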
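
The same idea can be sketched outside R, directly on FASTA files. The two helper functions below are hypothetical and not part of AutoTax. They assume headers of the form `>FLASV<id>.<length>`, as in the line quoted above, and number new sequences from the highest existing ID rather than from 1, so gaps in the existing IDs do not matter.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the highest numeric ID among ">FLASV<id>.<length>"
# headers in a FASTA file (prints 0 if there are none).
max_flasv_id() {
  awk '/^>FLASV[0-9]+/ {
         id = $0
         sub(/^>FLASV/, "", id)    # drop the prefix
         sub(/[^0-9].*$/, "", id)  # drop ".<length>" and anything after it
         if (id + 0 > max) max = id + 0
       }
       END { print max + 0 }' "$1"
}

# Hypothetical helper: rename the sequences in $2 so that their IDs continue
# after the highest ID found in $1. Existing headers are never touched.
rename_new_flasvs() {
  local existing=$1 new=$2 last
  last=$(max_flasv_id "$existing")
  awk -v id="$last" '
    function emit() {
      if (seq != "") printf(">FLASV%d.%d\n%s\n", id, length(seq), seq)
    }
    /^>/ { emit(); id++; seq = ""; next }  # new record: write the previous one
    { seq = seq $0 }                       # join wrapped sequence lines
    END { emit() }
  ' "$new"
}
```

For example, `rename_new_flasvs existing_FLASVs.fa new_seqs.fa > renamed_new_FLASVs.fa` (file names are placeholders) writes only the renamed new sequences, which can then be appended to the existing set.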

How to get SILVA132-typestrains.arb?

Maybe the solution is obvious, but I cannot find the SILVA132-typestrains.arb file anywhere.
I found only SILVA_132_SSURef_NR99_13_12_17_opt.arb.

Is this file available for download, or do I need to create it myself?

Best regards

Error when input sequences only occur once

Hi, first of all, thank you for this tool. It looks very interesting and is exactly what I need for my study.

I first ran getsilvadb.sh to create my database (file.arb, file.udh, file_typestrains.udh).
Everything ran well until I ran autotax.bash with my fasta file:

bash autotax.bash -i file.fasta
[2021-06-28 11:57:15]: Checking for required R packages and installing if missing...
[2021-06-28 11:57:17]:   - Orienting sequences...
[2021-06-28 11:57:19]:   - Dereplicating sequences...
[2021-06-28 11:57:19]:   - Denoising sequences using UNOISE3
usearch v11.0.667_i86linux32, 4.0Gb RAM (16.0Gb total), 8 cores
(C) Copyright 2013-18 Robert C. Edgar, all rights reserved.
https://drive5.com/usearch

License: personal use only

00:00 41Mb    100.0% Reading temp/uniques_wsize.fa
[2021-06-28 11:57:19]:   - Finding the longest representative sequence of identical sequences, then reorder and rename...
Error in S4Vectors:::normarg_names(value, class(x), length(x)) : 
  attempt to set too many names (2) on GroupedIRanges object of length 0
Calls: names<- -> names<- -> names<- -> names<- -> <Anonymous>
Execution halted

Can you tell me where I went wrong? I double-checked everything, but it is still not working.

Thank you for your help.
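
A quick way to see whether the input is affected is to count how many dereplicated sequences reach a given abundance before denoising. The sketch below is a hypothetical check, not part of AutoTax. It assumes that the dereplicated file (temp/uniques_wsize.fa in the log above) carries `;size=N` abundance annotations in its headers, and the threshold passed as the second argument is an illustrative value.

```shell
#!/usr/bin/env bash
# Hypothetical check: count sequences in a dereplicated FASTA file ($1) whose
# "size=N" header annotation is at least the given minimum abundance ($2).
count_min_size() {
  awk -v min="$2" '
    /^>/ {
      if (match($0, /size=[0-9]+/)) {
        # "size=" is 5 characters long; the digits follow it
        n = substr($0, RSTART + 5, RLENGTH - 5) + 0
        if (n >= min + 0) kept++
      }
    }
    END { print kept + 0 }' "$1"
}

# Usage: count_min_size temp/uniques_wsize.fa 2
```

If this prints 0, every sequence falls below the threshold, so a denoising step that filters on abundance would return nothing. That would be consistent with the "object of length 0" in the error above.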

Error with vsearch in MergeTaxonomy and Orient

Hello!

First of all, I would like to thank you for making AutoTax publicly available. I am attempting to recreate the output you generated using your original script, which uses USEARCH. At the moment I don’t have access to the 64-bit version of USEARCH. When I ran your script, the memory limit of the 32-bit process was exceeded at the “Finding taxonomy of best hit in SILVA database” step, so I replaced the search commands with their VSEARCH counterparts:
For searchTaxDB:
$ vsearch -usearch_global $input -db $database -maxaccepts 0 -maxrejects 0 -top_hits_only -strand plus -id 0 -blast6out $output -threads $MAX_THREADS

For searchTaxDB_typestrain:
$ vsearch -usearch_global $input -db $database -maxaccepts 0 -maxrejects 0 -strand plus -id 0.987 -blast6out $output -threads $MAX_THREADS

The script ran smoothly until the mergeTaxonomy step, where I got this output:

Matching unique query sequences: 6 of 85 (7.06%)
[2021-05-27 13:00:24]: Clustering FLASV's at Species level (98.7% identity)
[2021-05-27 13:00:24]: Clustering FLASV's at Genus level (94.5% identity)
[2021-05-27 13:00:24]: Clustering FLASV's at Family level (86.5% identity)
[2021-05-27 13:00:24]: Clustering FLASV's at Order level (82.0% identity)
[2021-05-27 13:00:25]: Clustering FLASV's at Class level (78.5% identity)
[2021-05-27 13:00:25]: Clustering FLASV's at Phylum level (75.0% identity)
[2021-05-27 13:00:25]: Merging and outputting taxonomy...
No replacement file found, skipping...
Error in S4Vectors:::normarg_names(value, class(x), length(x)) : 
  attempt to set too many names (89) on GroupedIRanges object of length
  85
Calls: names<- -> names<- -> names<- -> names<- -> <Anonymous>
Execution halted

These are the files (three missing) that I've got in the /output folder:

tax_complete.csv
tax_denovo.csv
tax_SILVA.csv
tax_slv_typestr.csv
tax_typestrains.csv

I thought that there was something wrong with my own databases. So I changed the databases to the ones you’ve originally used (SILVA_138_SSURef_NR99) and shared here: https://figshare.com/articles/dataset/SILVA132_typestrains_in_ARB_UDB11_format/9994568?file=22790396

But when I ran your script with your original databases, I got this error at the orient step:

[2021-05-27 13:56:11]: Checking for required R packages and installing if missing...
[2021-05-27 13:56:13]:   - Orienting sequences...
usearch v11.0.667_i86linux32, 4.0Gb RAM (198Gb total), 128 cores
(C) Copyright 2013-18 Robert C. Edgar, all rights reserved.
https://drive5.com/usearch


00:03 2.8Gb   100.0% Rows
00:03 2.8Gb  Reading pointers...done.
00:03 2.9Gb  Reading db seqs...

usearch -orient test/example_data/5K_addon_FLASVs.fa -db /home/genomics/npeeters/AutoTax/refdatabases/SILVA_138_SSURef_NR99_tax_silvaK.udb -fastaout temp/fSSUs_oriented.fa -threads 1

---Fatal error---
ReadStdioFile failed, attempted 754773441 bytes, read 748130870 bytes, errno=0

Have you encountered this problem before? Is there a way to fix this?

Thank you in advance!
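
One possible explanation for the mergeTaxonomy error (89 names for 85 sequences), which this thread does not confirm, is that the search output contains more than one row for some queries. With `-maxaccepts 0` and `-top_hits_only`, VSEARCH reports every hit that ties for the best identity. The sketch below uses two hypothetical helper functions, not part of AutoTax, that work on a tab-separated blast6out table whose first column is the query ID.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print query IDs that occur on more than one row
# of a blast6out table.
duplicated_queries() {
  cut -f1 "$1" | sort | uniq -d
}

# Hypothetical helper: keep only the first reported row for each query,
# preserving the original row order.
first_hit_per_query() {
  awk -F'\t' '!seen[$1]++' "$1"
}

# Usage (file names are placeholders):
#   duplicated_queries hits.b6
#   first_hit_per_query hits.b6 > hits_single.b6
```

If `duplicated_queries` prints anything, the table has more rows than there are unique queries. Whether keeping the first row is the right tie-break for a taxonomy assignment is a separate decision that this sketch does not settle.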
