A reproducible and scalable workflow for Long Amplicon Consensus Analysis (LACA)

License: GNU General Public License v3.0

Python 99.60% Dockerfile 0.22% Shell 0.18%
denoise nanopore amplicon long-read-sequencing

laca's Introduction

LACA: Long Amplicon Consensus Analysis

LACA is a reproducible and scalable workflow for Long Amplicon Consensus Analysis, e.g., of the 16S rRNA gene. It uses snakemake as the job controller and is wrapped into a python package for easier development and maintenance. LACA provides an end-to-end solution from basecalled reads to the final count matrix.

Important: LACA has only been tested on Linux systems, specifically Ubuntu.

preprint on bioRxiv

Docker image

The easiest way to use LACA is to pull the docker image from Docker Hub for cross-platform support.

docker pull yanhui09/laca

To use the docker image, mount your data directory (e.g., the current working directory given by pwd) to /home in the container.

docker run -it -v `pwd`:/home --privileged yanhui09/laca
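When the container is started from a script rather than an interactive shell, the same mount can be assembled programmatically. The following is a minimal sketch, not part of LACA: the helper name docker_run_command is hypothetical, and it only reproduces the flags shown in the command above.

```python
import os


def docker_run_command(data_dir=None, image="yanhui09/laca"):
    """Build the `docker run` argument list that mounts data_dir at /home.

    Hypothetical helper for illustration; it mirrors the README command
    and falls back to the current working directory, like `pwd`.
    """
    host_dir = os.path.abspath(data_dir or os.getcwd())
    return [
        "docker", "run", "-it",
        "-v", f"{host_dir}:/home",  # host directory -> /home in the container
        "--privileged",
        image,
    ]
```

The returned list can be passed to subprocess.run, or joined with shlex.join to print the exact command before running it.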

Installation from GitHub repository

Conda is the only required dependency prior to installation. Miniconda is enough for the whole pipeline.

  1. Clone the GitHub repository and create an isolated conda environment
git clone https://github.com/yanhui09/laca.git
cd laca
conda env create -n laca -f env.yaml 

You can speed up the whole process if mamba is installed.

mamba env create -n laca -f env.yaml 
  2. Install LACA with pip

To avoid inconsistency, we suggest installing LACA in the conda environment created above.

conda activate laca
pip install --editable .

At the moment, LACA uses a compiled, customized build of guppy for barcode demultiplexing (as set up in our lab).
Remember to prepare the barcoding files in guppy if new barcodes are introduced.

Example

laca init -b /path/to/basecalled_fastqs -d /path/to/database    # init config file and check
laca run all                                         # start analysis
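The two commands above can also be driven from Python, for example when several runs are queued by a script. This is a minimal sketch under stated assumptions: the function names laca_commands and run_pipeline are hypothetical, and only the -b and -d options and the all target shown above are used.

```python
import subprocess


def laca_commands(fastq_dir, db_dir):
    """Return the argument lists for `laca init` followed by `laca run all`."""
    return [
        ["laca", "init", "-b", str(fastq_dir), "-d", str(db_dir)],
        ["laca", "run", "all"],
    ]


def run_pipeline(fastq_dir, db_dir, workdir="."):
    """Run both steps in workdir; stop at the first failing command."""
    for cmd in laca_commands(fastq_dir, db_dir):
        # check=True raises CalledProcessError on a non-zero exit status
        subprocess.run(cmd, cwd=workdir, check=True)
```

Calling run_pipeline("/path/to/basecalled_fastqs", "/path/to/database") requires laca to be on the PATH, i.e., the conda environment must already be active.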

Usage

Usage: laca [OPTIONS] COMMAND [ARGS]...

  LACA: a reproducible and scaleable workflow for Long Amplicon Consensus
  Analysis. To follow updates and report issues, see:
  https://github.com/yanhui09/laca.

Options:
  -v, --version  Show the version and exit.
  -h, --help     Show this message and exit.

Commands:
  init  Prepare the config file.
  run   Run LACA workflow.

LACA is easy to use. You can start a new analysis in two steps using laca init and laca run.

If LACA is installed in a conda environment, remember to activate that environment first.

conda activate laca
  1. Initialize a config file with laca init

laca init generates a config file in the working directory that contains the parameters needed to run LACA.

Usage: laca init [OPTIONS]

  Prepare config file for LACA.

Options:
  -b, --bascdir PATH              Path to a directory of the basecalled fastq
                                  files. Option is mutually exclusive with
                                  'merge', 'merge_parent', 'demuxdir'.
  -x, --demuxdir PATH             Path to a directory of demultiplexed fastq
                                  files. Option is mutually exclusive with
                                  'merge', 'merge_parent', 'bascdir'.
  --merge PATH                    Path to the working directory of a completed
                                  LACA run  [Mutiple]. Runs will be combined
                                  if --merge_parent applied. Option is
                                  mutually exclusive with 'bascdir',
                                  'demuxdir'.
  --merge-parent PATH             Path to the parent of the working
                                  directories of completed LACA runs. Runs
                                  will be combined if --merge applied. Option
                                  is mutually exclusive with 'bascdir',
                                  'demuxdir'.
  -d, --dbdir PATH                Path to the taxonomy databases.  [required]
  -w, --workdir PATH              Output directory for LACA.  [default: .]
  --demuxer [guppy|minibar]       Demultiplexer.  [default: guppy]
  --fqs-min INTEGER               Minimum number of reads for the
                                  demultiplexed fastqs.  [default: 1000]
  --no-pool                       Do not pool the reads for denoising.
  --subsample                     Subsample the reads.
  --no-chimera-filt               Do not filter chimeric reads.
  --no-primer-check               Do not check primer pattern.
  --cluster [isONclust|umapclust|meshclust]
                                  Cluster approaches.  [Mutiple]  [default:
                                  isONclust, meshclust]
  --consensus [kmerCon|miniCon|isoCon|umiCon]
                                  Consensus methods.  [Mutiple]  [default:
                                  kmerCon]
  --quant [seqid|minimap2]        Create abundance matrix by sequence id or
                                  minimap2.  [Mutiple]  [default: seqid]
  --uchime                        Filter chimeras by uchime-denovo in vsearch.
  --jobs-min INTEGER              Number of jobs for common tasks.  [default:
                                  2]
  --jobs-max INTEGER              Number of jobs for threads-dependent tasks.
                                  [default: 6]
  --ont                           Use config template for ONT reads. Option is
                                  mutually exclusive with 'isoseq'.
  --isoseq                        Use config template for PacBio CCS reads.
                                  Option is mutually exclusive with 'ont'.
  --longumi                       Use primer design from longumi paper (https:
                                  //doi.org/10.1038/s41592-020-01041-y).
                                  Option is mutually exclusive with
                                  'simulate'.
  --simulate                      Use config template for in silicon test.
                                  Option is mutually exclusive with 'longumi'.
  --clean-flags                   Clean flag files.
  -h, --help                      Show this message and exit.
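Several options in the help text above are declared mutually exclusive. The sketch below shows one way to pre-check a set of options before calling laca init. It is an illustration only, not LACA's own validation code: the EXCLUSIVE table is transcribed from the help text, and find_conflicts is a hypothetical helper.

```python
# Mutually exclusive option names, transcribed from the help text above.
# Note that merge and merge_parent may be combined with each other.
EXCLUSIVE = {
    "bascdir": {"merge", "merge_parent", "demuxdir"},
    "demuxdir": {"merge", "merge_parent", "bascdir"},
    "merge": {"bascdir", "demuxdir"},
    "merge_parent": {"bascdir", "demuxdir"},
    "ont": {"isoseq"},
    "isoseq": {"ont"},
    "longumi": {"simulate"},
    "simulate": {"longumi"},
}


def find_conflicts(given):
    """Return a sorted list of (option, option) pairs that cannot be combined."""
    given = set(given)
    conflicts = set()
    for name in given:
        for other in EXCLUSIVE.get(name, ()):
            if other in given:
                # store each pair once, in alphabetical order
                conflicts.add(tuple(sorted((name, other))))
    return sorted(conflicts)
```

For example, find_conflicts({"bascdir", "demuxdir"}) reports one conflicting pair, while find_conflicts({"merge", "merge_parent"}) reports none.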
  2. Start analysis with laca run

laca run triggers the full workflow or a specific module within the defined resource limits. Get a dry-run overview with -n. Snakemake arguments can be appended to laca run as well.

Usage: laca run [OPTIONS] {demux|qc|clust|kmerCon|miniCon|isoCon|umiCon|quant|
                taxa|tree|all|merge|initDB|simulate} [SNAKE_ARGS]...

  Run LACA workflow.

Options:
  -w, --workdir PATH     Working directory for LACA.  [default: .]
  -c, --configfile FILE  Config file for LACA. Use config.yaml in working
                         directory if not specified.
  -j, --jobs INTEGER     Maximum jobs to run in parallel.  [default: 6]
  -m, --maxmem FLOAT     Specify maximum memory (GB) to use. Memory is
                         controlled by profile in cluster execution.
  --profile TEXT         Snakemake profile for cluster execution.
  -n, --dryrun           Dry run.
  -h, --help             Show this message and exit.
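As an illustration of how the options, the target, and trailing Snakemake arguments fit together, here is a minimal sketch that assembles a laca run command line. The helper laca_run_command is hypothetical; the target names and the -j, -m, and -n options are taken from the usage text above, and the argument order follows its Usage line.

```python
# Valid targets, copied from the `laca run` usage text.
TARGETS = (
    "demux", "qc", "clust", "kmerCon", "miniCon", "isoCon", "umiCon",
    "quant", "taxa", "tree", "all", "merge", "initDB", "simulate",
)


def laca_run_command(target="all", jobs=None, maxmem=None, dryrun=False, snake_args=()):
    """Build `laca run [OPTIONS] TARGET [SNAKE_ARGS]...` as an argument list."""
    if target not in TARGETS:
        raise ValueError(f"unknown target {target!r}; expected one of {TARGETS}")
    cmd = ["laca", "run"]
    if jobs is not None:
        cmd += ["-j", str(jobs)]    # maximum jobs to run in parallel
    if maxmem is not None:
        cmd += ["-m", str(maxmem)]  # maximum memory in GB
    if dryrun:
        cmd.append("-n")            # dry run only
    cmd.append(target)
    cmd += list(snake_args)         # passed through to Snakemake
    return cmd
```

A dry run of the qc module with four jobs would then be laca_run_command("qc", jobs=4, dryrun=True), which corresponds to laca run -j 4 -n qc.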

laca's People

Contributors

yanhui09

laca's Issues

Organising folders to analyse multiple samples (and amplicons) at once

Hi,

I would like to analyse multiple samples at once after demultiplexing. As input data I have one fasta file for each sample and each amplicon.

I'm not sure how to organise my data and I could not find it in the documentation. So there are several options to organise the data:

demultiplexed_data/
├─ sample1/
│  ├─ amplicon1/
│  ├─ amplicon2/
├─ sample2/
│  ├─ amplicon1/
│  ├─ amplicon2/

Or

demultiplexed_data1/
├─ sample1/
│  ├─ amplicon1/
├─ sample2/
│  ├─ amplicon1/
demultiplexed_data2/
├─ sample1/
│  ├─ amplicon2/
├─ sample2/
│  ├─ amplicon2/

The desired output files after clustering would look like this:
clustered_consensus_all_samples_amplicon1.fasta
clustered_consensus_all_samples_amplicon2.fasta

With corresponding count table for each amplicon like this:

| OTU  | sample1 | sample2 |
|------|---------|---------|
| OTU1 | 0       | 250     |
| OTU2 | 142     | 0       |
| OTU3 | 143     | 1653    |
| ...  | ...     | ...     |

Is laca currently able to produce output like this? And should I analyse one amplicon at a time or are multiple amplicons possible?

Thanks!
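To make the requested table layout concrete, the sketch below assembles an OTU-by-sample count table from per-sample counts. It is independent of LACA and does not show LACA's actual output format or say whether LACA supports this layout: build_count_table is a hypothetical helper, and the numbers are the illustrative values from the table in this issue.

```python
def build_count_table(counts_by_sample):
    """Turn {sample: {otu: count}} into a header row and one row per OTU.

    OTUs missing from a sample get a count of 0.
    """
    samples = sorted(counts_by_sample)
    otus = sorted({otu for counts in counts_by_sample.values() for otu in counts})
    header = ["OTU"] + samples
    rows = [
        [otu] + [counts_by_sample[sample].get(otu, 0) for sample in samples]
        for otu in otus
    ]
    return header, rows


# Illustrative values taken from the table in this issue.
example = {
    "sample1": {"OTU2": 142, "OTU3": 143},
    "sample2": {"OTU1": 250, "OTU3": 1653},
}
header, rows = build_count_table(example)
```

One such table would be built per amplicon, with each amplicon's counts kept in a separate input dictionary.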

Test data

Hello,

Thanks for providing this tool.

I was wondering if you could provide a test dataset with few fastq files and a database file.
I'm having some problems running the tool using the docker container, and I wanted to know if it's an issue with my input or I'm doing something wrong.

In case you don't have an FTP server, you could use https://zenodo.org/ to store the files.

More details regarding taxonomy DB

Thanks for developing this tool -- I am eager to try it out.

Now, it seems that a taxonomy database is a requirement for laca init. Would it be possible to clarify the format of this database (e.g., fasta with taxonomy strings in the sequence header ...)? I searched for some information but could not readily find what I was looking for.

Thanks!

Dieter

Run without singularity requirement

Hi, I am trying to run the laca workflow in an execution environment where I am not allowed (nor able) to run the docker container with --privileged enabled, and therefore the steps that are run in a singularity container fail. See the error message below:

FATAL: while extracting /database/singularity_envs/be79a9f6f5e87678ce46ad686c92cb19.simg: root filesystem extraction failed: extract command failed: ERROR : Failed to create user namespace: user namespace requires to set /proc/sys/kernel/unprivileged_userns_clone to 1

Is there any way to circumvent the use of singularity in the workflow?
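The error message names a specific kernel setting, so a quick diagnostic is to read that file before starting the workflow. The sketch below only reports the current value; it is not a workaround, and the helper name userns_clone_enabled is hypothetical. It assumes the file is only present on kernels that expose this setting, and returns None otherwise.

```python
from pathlib import Path

# Path quoted in the error message above.
USERNS_SYSCTL = Path("/proc/sys/kernel/unprivileged_userns_clone")


def userns_clone_enabled(sysctl_path=USERNS_SYSCTL):
    """Return True if the setting is 1, False if not, None if the file is absent."""
    path = Path(sysctl_path)
    if not path.exists():
        return None  # this kernel does not expose the setting
    return path.read_text().strip() == "1"
```

A result of False matches the situation described in the error message, where the setting would need to be 1 for the user namespace to be created.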
