michalbukowski / rnaseq-pipeline-1

Simple Snakemake RNA-Seq pipeline

License: GNU General Public License v3.0

Languages: Python 80.48%, R 19.52%
Topics: rna-seq, rnaseq-analysis, rnaseq-pipeline, dge, dge-analysis, python3, r, snakemake, snakemake-pipeline, snakemake-rna-seq, salmon, salmon-deseq2, pandas, bioconductor, miniconda, python, miniconda3

Simple RNA-Seq pipeline

Differential gene expression pipeline utilising Salmon and Bioconductor DESeq2 for Illumina RNA-Seq sequencing results from reverse-stranded libraries. For details, see the following paper:

Bukowski M, Kosecka-Strojek M, Madry A, Zagorski-Przybylo R, Zadlo T, Gawron K, Wladyka B (2022) Staphylococcal saoABC operon codes for a DNA-binding protein SaoC implicated in the response to nutrient deficit. International Journal of Molecular Sciences, 23(12), 6443.

If you find the pipeline useful, please cite the paper.

1. Environment

The pipeline was created and tested in the following set-up:

  • Miniconda3 environment:
    • Python (3.8.5)
    • Snakemake (5.32.0)
    • Salmon (1.4.0)
    • Pandas (1.1.4)
  • R (4.0.3):
    • Bioconductor DESeq2 (1.30.0)
    • Bioconductor tximport (1.18.0)
    • optparse (1.6.6)

2. Directory structure and pipeline files

Names of the FASTQ files with reads in NCBI BioProject PRJNA798259 follow the pattern required by the pipeline: {strain}_{group}_{setting}_{replica}_{reads}.fastq, e.g. rn_wt_lg_1_R1.fastq. The project data contain files for one strain only: rn (Staphylococcus aureus RN4220). There are two groups, wt (wild type, the reference group) and mt (mutant, saoC gene mutant), each sampled in two settings/conditions: lg (logarithmic growth phase) and lt (late growth phase). Each combination of group and setting has 3 replicas (1 to 3), and the reads of each replica are written to two files: R1 and R2. All in all, there are 24 files.
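The naming scheme can be checked programmatically. Below is a minimal Python sketch that is not part of the pipeline: the function names parse_fastq_name and expected_fastq_names are hypothetical, and the regular expression assumes that no name component contains an underscore and that replicas are written as plain integers.

```python
import itertools
import re

# Pattern described in the text: {strain}_{group}_{setting}_{replica}_{reads}.fastq
# Assumption: components never contain "_" and reads is either R1 or R2.
FASTQ_NAME = re.compile(
    r"^(?P<strain>[^_]+)_(?P<group>[^_]+)_(?P<setting>[^_]+)"
    r"_(?P<replica>\d+)_(?P<reads>R[12])\.fastq$"
)


def parse_fastq_name(name):
    """Return the name components as a dict, or None if the name does not match."""
    match = FASTQ_NAME.match(name)
    return match.groupdict() if match else None


def expected_fastq_names():
    """List every file name implied by the values given in the text."""
    strains = ["rn"]
    groups = ["wt", "mt"]
    settings = ["lg", "lt"]
    replicas = ["1", "2", "3"]
    reads = ["R1", "R2"]
    return [
        "_".join(parts) + ".fastq"
        for parts in itertools.product(strains, groups, settings, replicas, reads)
    ]


if __name__ == "__main__":
    names = expected_fastq_names()
    print(len(names))  # 24
    print(parse_fastq_name("rn_wt_lg_1_R1.fastq"))
```

Comparing the result of expected_fastq_names() with the contents of input/reads/ is one way to spot a missing or misnamed file before starting a run.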

The pipeline utilises the following directory structure:

your_pipeline_location/
├── Snakefile
├── scripts/
│   ├── collect.py
│   ├── dge.r
│   └── summary.py
├── input/
│   ├── reads/
│   │   └── (...)
│   └── refs/
│       └── rn.fna
├── output/
└── log/

The working directory contains the Snakefile describing the pipeline. The necessary scripts are in scripts/. The input/refs/ directory holds rn.fna, a multi-FASTA file with reference transcript sequences for Staphylococcus aureus RN4220. Place the reads from NCBI BioProject PRJNA798259 in input/reads/. The output/ and log/ directories are created automatically once the pipeline is run. All diagnostic and error messages from the tools and scripts used by the pipeline are redirected to files in the log/ directory.
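Before a first run, it may help to confirm that the expected files are in place. The following Python sketch is an illustration only and is not shipped with the pipeline; the name missing_paths is hypothetical, and the list of paths is copied from the directory tree shown earlier (output/ and log/ are left out because they are created when the pipeline runs).

```python
from pathlib import Path

# Paths taken from the directory tree in this README.
# output/ and log/ are omitted: the pipeline creates them itself.
REQUIRED_PATHS = [
    "Snakefile",
    "scripts/collect.py",
    "scripts/dge.r",
    "scripts/summary.py",
    "input/reads",
    "input/refs/rn.fna",
]


def missing_paths(root):
    """Return the required paths that do not exist under root."""
    root = Path(root)
    return [p for p in REQUIRED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_paths(".")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Layout looks complete.")
```

Run from your_pipeline_location/, the script prints the entries that still need to be added. It only checks that paths exist, not that the FASTQ or FASTA files are valid.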

3. Pipeline architecture

The pipeline described in the Snakefile encompasses the following stages:

  1. index -- preparation of an index of reference transcript sequences with Salmon.
  2. quant -- read mapping and quantification with Salmon.
  3. collect -- preparation of metadata for Salmon quant files with scripts/collect.py.
  4. dge -- differential gene expression analysis with scripts/dge.r utilising the DESeq2 library.
  5. summary -- generation of the final output in TSV format with scripts/summary.py.
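The stage order above can be pictured as a simple chain. The Python sketch below is a rough illustration, not the Snakefile itself: stage names and tools come from the list, whereas the strictly linear dependency chain and the helper name stages_up_to are assumptions made for this example. The actual rule dependencies are defined in the Snakefile.

```python
# Stage names and tools are taken from the list above.
# Assumption: each stage depends only on the one listed before it.
STAGES = [
    ("index", "Salmon"),
    ("quant", "Salmon"),
    ("collect", "scripts/collect.py"),
    ("dge", "scripts/dge.r"),
    ("summary", "scripts/summary.py"),
]


def stages_up_to(target):
    """Return the stage names that must run, in order, to reach target."""
    names = [name for name, _ in STAGES]
    if target not in names:
        raise ValueError(f"unknown stage: {target}")
    return names[: names.index(target) + 1]


if __name__ == "__main__":
    for number, (name, tool) in enumerate(STAGES, start=1):
        print(f"{number}. {name} ({tool})")
    print(stages_up_to("dge"))
```

Under that assumption, requesting the dge stage implies running index, quant and collect first, which mirrors how Snakemake works backwards from a requested target to the rules that produce its inputs.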

A more detailed description of how the pipeline works can be found in the comments in both the Snakefile and the script files.

4. Running the pipeline

Provided that you have the environment properly set up, you can run the pipeline from the directory containing the Snakefile on as many cores as you wish:

snakemake --cores number_of_cores
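
If the pipeline is launched from a script, the command above can be assembled in Python. This is a minimal sketch under stated assumptions: the helper name snakemake_command is hypothetical, and falling back to the CPU count reported by the operating system is a choice made for this example, not a pipeline default.

```python
import os


def snakemake_command(cores=None):
    """Build the command shown above as an argument list.

    Assumption for this sketch: when cores is None, use every CPU
    reported by the operating system (at least 1).
    """
    if cores is None:
        cores = os.cpu_count() or 1
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    return ["snakemake", "--cores", str(cores)]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(snakemake_command(4)))  # snakemake --cores 4
```

The returned list can be passed to subprocess.run(..., check=True) from the directory that contains the Snakefile.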

5. Running the pipeline using a Docker image

You can run the pipeline in a preconfigured environment available as a Docker image. See the DOCKER_MANUAL.md file for instructions.
