molgenis / ngs_rna

Home Page: http://molgenis.github.io/pipelines/

License: GNU Lesser General Public License v3.0

Languages: Shell 65.47%, Perl 0.36%, Python 23.13%, FreeMarker 4.47%, R 6.56%

ngs_rna's Introduction

NGS_RNA pipeline

Description of the different steps used in the RNA analysis pipeline

Gene expression quantification

The trimmed FASTQ files were aligned to a reference genome using STAR [1] with default settings. Before gene quantification, SAMtools [2] was used to sort the aligned reads. Gene-level quantification was performed by HTSeq-count [3] using --mode=union. The gene annotation database that was used is included in the results directory, in the folder expression/. DESeq2 was used for differential expression analysis on the STAR BAM files. The experimental groups were taken from the 'condition' column in the samplesheet, which defines the distinct groups within the samples.
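To illustrate how the groups follow from the samplesheet, the sketch below lists the distinct values in the 'condition' column. It assumes a comma-separated file with a header row and no quoted fields containing commas. The function name list_conditions is hypothetical and not part of the pipeline.

```shell
#!/usr/bin/env bash
# Print the distinct values of the 'condition' column of a samplesheet.
# Assumptions: comma-separated, first line is the header, no quoted commas.
list_conditions() {
    local samplesheet="$1" values
    values="$(awk -F',' '
        { sub(/\r$/, "") }      # tolerate Windows line endings
        NR == 1 {
            # Locate the column named "condition" in the header.
            for (i = 1; i <= NF; i++) if ($i == "condition") col = i
            if (!col) { print "no condition column found" > "/dev/stderr"; exit 1 }
            next
        }
        $col != "" { print $col }
    ' "$samplesheet")" || return 1
    [ -n "$values" ] || return 0
    printf '%s\n' "$values" | sort -u
}
```

For example, list_conditions TestRun.csv would print one line per experimental group.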

Calculate QC metrics on raw and aligned data

Quality control (QC) metrics are calculated for the raw sequencing data using FastQC [4]. QC metrics for the aligned reads are calculated using Picard-tools [5] (CollectRnaSeqMetrics, MarkDuplicates and CollectInsertSizeMetrics) and SAMtools flagstat.

Splicing event calling using Leafcutter

LeafCutter [7] is used to detect and quantify variation in RNA splicing.

GATK variant calling

Variant calling was done using GATK [6]. First, SplitNCigarReads, a GATK tool developed specifically for RNA-seq, splits reads into exon segments (getting rid of Ns but maintaining grouping information) and hard-clips any sequences overhanging into the intronic regions. The variant calling itself was done using HaplotypeCaller in GVCF mode. All samples were then jointly genotyped by running GenotypeGVCFs on all of the gVCFs produced earlier, creating a set of raw SNP and indel calls.
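The three steps above could look roughly like the sketch below. This is not the pipeline's own script. It assumes GATK4-style commands and a gatk launcher on the PATH, whereas the pipeline uses whatever ${gatkVersion} resolves to. The CombineGVCFs step is an addition needed by GATK4 before joint genotyping and is not mentioned in the description above. The function name call_variants and all output file names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of the GATK steps described above, using GATK4-style commands.
# Assumptions: a `gatk` launcher on PATH (override with GATK=...), one BAM
# per sample, and hypothetical output names derived from the BAM names.
GATK="${GATK:-gatk}"

call_variants() {
    local reference="$1" outdir="$2"
    shift 2
    local bam sample
    local -a gvcf_args=()

    for bam in "$@"; do
        sample="$(basename "$bam" .bam)"
        # 1) Split reads that span introns (N in the CIGAR) into exon segments.
        "$GATK" SplitNCigarReads -R "$reference" -I "$bam" \
            -O "$outdir/$sample.split.bam" || return 1
        # 2) Call variants per sample in GVCF mode.
        "$GATK" HaplotypeCaller -R "$reference" -I "$outdir/$sample.split.bam" \
            -O "$outdir/$sample.g.vcf.gz" -ERC GVCF || return 1
        gvcf_args+=(-V "$outdir/$sample.g.vcf.gz")
    done

    # 3) Combine the per-sample gVCFs (assumed GATK4 step), then genotype jointly.
    "$GATK" CombineGVCFs -R "$reference" "${gvcf_args[@]}" \
        -O "$outdir/combined.g.vcf.gz" || return 1
    "$GATK" GenotypeGVCFs -R "$reference" -V "$outdir/combined.g.vcf.gz" \
        -O "$outdir/raw.vcf.gz"
}
```

Setting GATK=echo before calling the function prints the commands instead of running them, which is a cheap way to review them first.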

Results archive

The zipped archive contains the following data and subfolders:

  • alignment: merged BAM file with index, md5sums and alignment statistics (.Log.final.out)
  • expression: text files with gene-level quantification per sample and per project.
  • fastqc: FastQC output.
  • qcmetrics: multiple QC metrics and images generated with Picard-tools or SAMtools flagstat.
  • leafcutter: LeafCutter and RegTools output files.
  • expression/Deseq2: differential expression analysis.
  • multiqc_data: combined MultiQC tables used for the MultiQC HTML report.
  • variants: variant calls made using GATK.
  • rawdata: raw sequence files in the form of gzipped FASTQ files (.fq.gz)

The root of the results directory contains the final QC report, README.txt, analysis results from each tool, and the samplesheet which formed the basis for this analysis.
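As a quick sanity check after unpacking the archive, a sketch like the one below reports which of the subfolders listed above are missing. The function name check_results_layout is hypothetical. The folder names come from the list above, and the nested expression/Deseq2 folder is not checked.

```shell
#!/usr/bin/env bash
# Report which of the documented subfolders are missing from an unpacked
# results directory. Returns 0 when all are present, 1 otherwise.
check_results_layout() {
    local results_dir="$1" missing=0 sub
    for sub in alignment expression fastqc qcmetrics leafcutter \
               multiqc_data variants rawdata; do
        if [ ! -d "$results_dir/$sub" ]; then
            echo "missing: $sub"
            missing=1
        fi
    done
    return "$missing"
}
```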

  1. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR: STAR: ultrafast universal RNA-seq aligner. Bioinformatics 2013, 29(1):15–21. doi:10.1093/bioinformatics/bts635. Epub 2012 Oct 25.
  2. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup: The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009, 25(16):2078–2079.
  3. Anders S, Pyl PT, Huber W: HTSeq – A Python framework to work with high-throughput sequencing data. 2014:0–5.
  4. Andrews S: FastQC: a Quality Control Tool for High Throughput Sequence Data [Online]. 2010. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ ${samtoolsVersion}
  5. Picard Sourceforge Web site. http://picard.sourceforge.net/ ${picardVersion}
  6. McKenna A, et al.: The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research 2010, 20:1297–1303. Version: ${gatkVersion}
  7. Li YI, Knowles DA, Humphrey J, et al.: Annotation-free quantification of RNA splicing using LeafCutter. Nat Genet 2018, 50(1):151–158. doi:10.1038/s41588-017-0004-9

Manual

1) Copy the raw data to the rawdata/ngs folder

scp -r SEQSTARTDATE_SEQ_RUNTEST_FLOWCELLXX username@yourcluster:${root}/groups/$groupname/${tmpDir}/rawdata/ngs/YOURDIR

2) Create a folder in the generatedscripts folder

mkdir ${root}/groups/$groupname/${tmpDir}/generatedscripts/TestRun

3) Copy the samplesheet to the generatedscripts folder

scp -r TestRun.csv username@yourcluster:/groups/$groupname/${tmpDir}/generatedscripts/

Note: the name of the folder should be the same as the name of the samplesheet (.csv) file.

Note 2: an example samplesheet can be found in $EBROOTNGS_RNA/templates/externalSamplesheet.csv
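The naming rule from the first note can be checked before generating scripts. The sketch below is a hypothetical helper, not part of NGS_RNA. It only compares the folder name with the samplesheet file name minus its .csv extension.

```shell
#!/usr/bin/env bash
# Succeed only if the project folder is named after its samplesheet,
# for example folder "TestRun" and samplesheet "TestRun.csv".
samplesheet_matches_folder() {
    local folder="$1" samplesheet="$2"
    local folder_name sheet_name
    folder_name="$(basename "$folder")"
    sheet_name="$(basename "$samplesheet" .csv)"
    [ "$folder_name" = "$sheet_name" ]
}
```

A call such as samplesheet_matches_folder generatedscripts/TestRun generatedscripts/TestRun.csv returns exit status 0, and a mismatch returns a non-zero status.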

4) Run the generate script

module load NGS_RNA
cd ${root}/groups/$groupname/${tmpDir}/generatedscripts/TestRun
cp $EBROOTNGS_RNA/generate_template.sh .
bash generate_template.sh
cd scripts

Note: If you want to run the pipeline locally, change the backend in the CreateInhouseProjects.sh script. Near the end of the script there is a call that looks something like sh ${EBROOTMOLGENISMINCOMPUTE}/molgenis_compute.sh; search for -b slurm and change it into -b localhost.

bash submit.sh

5) Submit jobs

Navigate to the jobs folder. Its location is printed at the end of the previous step (step 4).

bash submit.sh

ngs_rna's People

Contributors

benjaminsm, dennishendriksen, gerbenvandervries, kdelange, marieke-bijlsma, mmterpstra, pneerincx, roankanninga

ngs_rna's Issues

ProcessReadCounts.jar missing in ngs-utils/21.04.2

s10_MakeExpressionTable_0.sh needs ProcessReadCounts.jar to make the expression table.
ProcessReadCounts.jar needs to be added to ngs-utils version 21.04.2, or NGS_RNA needs to revert to ngs-utils/19.03.3-GCCcore-7.3.0.

add set -o pipefail in script

This big pipe in the script below (three commands in one) can fail, but the pipe only returns the exit code of its last command (head), so an error earlier in the pipe is never reported and the job will always say .finished

samtools \
view -h \
${sampleMergedBam}.nameSorted.bam | \
$EBROOTHTSEQ/scripts/htseq-count \
-m union \
-s ${STRANDED} \
- \
${annotationGtf} | \
head -n -5 \
> ${tmpSampleHTseqExpressionText}
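A minimal sketch of the difference, using false | cat as a stand-in for the samtools / htseq-count / head pipe (the function names are made up for this example):

```shell
#!/usr/bin/env bash
# Each function runs a pipe whose first command fails. The bodies run in a
# subshell ( ... ) so the shell option does not leak into the caller.
pipeline_without_pipefail() (
    set +o pipefail
    false | cat          # exit status comes from `cat`, so the failure is hidden
)

pipeline_with_pipefail() (
    set -o pipefail
    false | cat          # exit status is non-zero because `false` failed
)

pipeline_without_pipefail && echo "without pipefail: reported as success"
pipeline_with_pipefail    || echo "with pipefail: failure detected"
```

With set -o pipefail near the top of the generated script, a failing samtools or htseq-count would make the whole pipe return a non-zero exit code instead of being masked by head.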
