GithubHelp home page GithubHelp logo

bimsbbioinfo / pigx_chipseq Goto Github PK

View Code? Open in Web Editor NEW
10.0 11.0 10.0 34.15 MB

Pipeline for Analysis of ChIP-Seq data

Home Page: http://bioinformatics.mdc-berlin.de/pigx/

License: GNU General Public License v3.0

Python 45.92% R 35.92% Scheme 1.10% Shell 13.27% Makefile 1.51% M4 2.29%
chipseq pipeline snakemake guix reproducibility

pigx_chipseq's Introduction

PiGx Logo

Copyright 2017-2022: Vedran Franke, Alexander Blume, Ricardo Wurmus. This work is distributed under the terms of the GNU General Public License, version 3 or later. It is free to use for all purposes.


Summary

PiGx ChIPseq (pipelines in genomics for Chromatin Immunoprecipitation Sequencing) is an analysis pipeline for preprocessing, peak calling and reporting for ChIP or ATAC sequencing experiments. It is easy to use and produces high quality reports. The inputs are reads files from the sequencing experiment, and a configuration file that describes the experiment. In addition to quality control of the experiment, the pipeline enables multiple peak calling analysis and allows the generation of a UCSC track hub in an easily configurable manner. Starting from versions >= 0.0.50 the pipeline allows to perform highly customizable differential analysis using DEDeq2 to detect genome regions with differences in occupancy between defined groups.

What does it do

  • Trim reads using trim-galore
  • Quality control reads using fastQC and multiQC
  • Map reads to genome using Bowtie2
  • Call peaks for multiple combinations of samples using MACS2
  • Control reproducibility of experiments using IDR
  • Generate a UCSC track hub to view in Genome Browser
  • NEW Highly customizable Differential Analysis

What does it output

  • QC reports
  • bam files
  • bigwig files
  • narrowPeak files
  • UCSC track hub folder
  • NEW Differential Analysis Report

Installation of the PiGx-ChIPseq Pipeline

Install via guix

You can install this pipeline with all its dependencies using GNU Guix:

guix install pigx-chipseq

After exporting the guix profile to the PATH you should be able to get started.

Install from source

You can also install it from source manually. You can find the latest release here. PiGx uses the GNU build system. Please make sure that all required dependencies are installed and then follow these steps after unpacking the latest release tarball:

./configure --prefix=/some/where
make install

Dependencies

By default the configure script expects tools to be in a directory listed in the PATH environment variable. If the tools are installed in a location that is not on the PATH you can tell the configure script about them with variables. Run ./configure --help for a list of all variables and options.

You can prepare a suitable environment with Conda or with GNU Guix. If you do not use one of these package managers, you will need to ensure that the following software is installed:

Software dependencies
  • R
    • argparser
    • biocparallel
    • biostrings
    • chipseq
    • data.table
    • dyplr
    • genomation
    • genomicalignments
    • genomicranges
    • rsamtools
    • rtracklayer
    • rsubread
    • s4vectors
    • stringr
    • jsonlite
    • heatmaply
    • htmlwidgets
    • ggplot2
    • ggrepel
    • plotly
    • rmarkdown
    • ggrepel
    • deseq2
    • dt
    • pheatmap
    • corrplot
    • reshape2
    • scales
    • crosstalk
    • gprofiler2
    • summarizedexperiment
    • hexbin
  • python
    • snakemake
    • pyyaml
    • wrapper
    • pytest
    • xlrd
    • magic
  • pandoc
  • fastqc
  • multiqc
  • trim-galore
  • bowtie
  • macs2
  • idr
  • samtools
  • bedtools
  • bedToBigBed
  • bamToBed

via Guix

Assuming you have Guix installed, the following command spawns a sub-shell in which all dependencies are available:

guix shell

Getting started

To run PiGx on your experimental data, first enter the necessary parameters in the spreadsheet file (see following section), and then from the terminal type

$ pigx-chipseq [options] sample_sheet.csv

To see all available options type the --help option

$ pigx-chipseq --help

usage: pigx-chipseq [-h] [-v] -s SETTINGS [-c CONFIGFILE] [--target TARGET]
                   [-n] [--graph GRAPH] [--force] [--reason] [--unlock]
                   samplesheet

PiGx ChIPseq Pipeline.

PiGx ChIPseq is a data processing pipeline for ChIPseq read data.

positional arguments:
  samplesheet                             The sample sheet containing sample data in yaml format.

optional arguments:
  -h, --help                              show this help message and exit
  -v, --version                           show program version number and exit
  -s SETTINGS, --settings SETTINGS        A YAML file for settings that deviate from the defaults.
  -c CONFIGFILE, --configfile CONFIGFILE  The config file used for calling the underlying snakemake process.  By
                                          default the file 'config.json' is dynamically created from the sample
                                          sheet and the settings file.
  --target TARGET                         Stop when the named target is completed instead of running the whole
                                          pipeline.  The default target is "final-report".  Pass "--target=help"
                                          to describe all available targets.
  -n, --dry-run                           Only show what work would be performed.  Do not actually run the
                                          pipeline.
  --graph GRAPH                           Output a graph in Graphviz dot format showing the relations between
                                          rules of this pipeline.  You must specify a graph file name such as
                                          "graph.pdf".
  --force                                 Force the execution of rules, even though the outputs are considered
                                          fresh.
  --reason                                Print the reason why a rule is executed.
  --unlock                                Recover after a snakemake crash.

This pipeline was developed by the Akalin group at MDC in Berlin in 2017-2018.

Input Parameters

The pipeline requires two files as input to specify the samples and the design of the analysis.

Sample Sheet

The samples used for any subsequent analysis are defined in the sample sheet section.

SampleName Read Read2
  • SampleName is the name for the sample
  • Read/Read2 are the fastq file names of paired end reads
    • the location of these files is specified in settings.yaml
    • for single-end data, leave the Read2 column in place, but have it empty

Technical Replicates

The sample sheet offers support for technical replicates, by repeating the sample name (first column) for different input files (second,third column). The quality check will be performed for any input file and replicates will be merged during the mapping.

SampleName Read Read2 Group (optional)
ChIPpe ChIPpe_R1.fq.gz ChIPpe_R2.fq.gz Group1
ChIPpe ChIPpe_t2_R1.fq.gz ChIPpe_t2_R2.fq.gz Group2

Settings File

The settings file is a file in yaml format specifying general settings and the details of the analysis. It has the following required sections:

Locations

Defines paths to be used in the pipeline, some of the items are required and some optional (can stay blank):

item required description
input-dir yes directory of the input files (fastq files)
output-dir yes output directory for the pipeline
genome-file yes path to the reference genome in fasta forma
index-dir no directory containing pre-built mapping indices for the given reference genome (created with bowtie2-build)
gff-file no location of a GTF file with genome annotations for the given reference genome

General

These are settings which apply to all analysis (unless adjusted in single analysis):

item required description
assembly yes version of reference genome (e.g. hg19,mm9, ...)
spikein_name no version of spike-in genome (e.g. hg19,mm9, ...)
organism no needed for GO term analysis (e.g. hsapiens, mmusculus, dmelanogaster, celegans)
params no list of default parameters for tools and scripts (for tools check respective manual for available parameters)

Execution

The execution section in the settings file allows the user to specify whether the pipeline is to be submitted to a cluster, or run locally, and the degree of parallelism. For a full list of possible parameters, see etc/settings.yaml.

A minimal settings file could look like this, but please consider that no analysis will be performed without adding analysis information :

locations:
  input-dir: in/reads/
  output-dir: out/
  genome-file: genome/my_genome.fa
  index-dir:
  gff-file: genome/mm_chr19.gtf

general:
  assembly: mm9
  spikein_name: hg38
  # The "organism" field is needed for GO term analysis. Leave it empty
  # if not interested in GO analysis.  Otherwise provide a string with
  # the initial of genus and the species name (e.g. hsapiens, mmusculus,
  # dmelanogaster, celegans)
  organism: 'mmusculus'
  params:
    export_bigwig:
        extend: 200
        scale_bw: 'yes'
    bowtie2:
        N: 0
        k: 1
    idr:
        idr-threshold: 0.1
    macs2:
        #        g: mm
        keep-dup: auto
        q: 0.05
        nomodel: ''
    extract_signal:
        expand_peak: 200
        bin_num: 20
    feature_counting:
        minFragLength: 50
        maxFragLength: 600
        countChimericFragments: ''
        requireBothEndsMapped: ''

execution:
  submit-to-cluster: no
  rules:
    __default__:
      queue: all.q
      memory: 8G
    bowtie2:
      queue: all.q
      memory: 16G

Analysis Sections

The analysis part of the setting file describes the experiment. It has following sections:

section required description
peak_calling yes defines which samples will be used to detect regions of enriched binding ( multiple combinations and variations are possible, see here for details )
idr no specifies pairs of peak calling analysis that are compared to determine the reproducibilty of the general experiment (see here for details)
hub no describes the general layout of a UCSC hub that can be created from the processed data and allows the visual inspection of results at a UCSC genome browser (see here for details)
feature_combination no defines for a list of peak calling and/or idr analysis the combination of regions shared among this list (see here for details)
differential_analysis no specifies for a list of groups defined in the sample sheet which samples are used as Case or Control in the Differential Analysis (see here for details)

The creation of these sections is straight forward considering the following snippets as template. Comments and examples within the snippets provide guidance of what is possible and what to take care of.

Peak Calling

The previously defined samples are used for subsequent peak calling analysis to detect regions of enriched binding. In this section any number of comparisons can be defined, while multiple combinations and variations are possible. In terms of peak calling the ChIP (also called treatment) is the sample in which we want to detect enriched regions compared to the Cont(rol) (or background) sample. Each analysis can be run with a unique set of parameters and default parameters for all analysis can be defined in the settings file , check available parameters and description here. For more information have a look at the publication for the software we are using "Zhang et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol (2008) vol. 9 (9) pp. R137".

# define peak calling analysis
peak_calling:
    # analysis can have any name, but the names have to be unique 
    Peaks1: 
        # sample(s) to be used as treatment sample 
        ChIP: ChIP1
        # sample(s) to be used as control sample
        Cont: Cont1
        params:
            macs2:
                # each analysis can be adjusted independently
                # add/modify available parameters of the analysis
                nomodel: ''
                extsize: 300
    Peaks2:
        ChIP: ChIP2
        Cont: Cont2
        params:
            macs2:
                # each analysis can be adjusted independently
                nomodel: ''
                extsize: 147
    Peaks4:
        ChIP:
            # multiple samples can be used as treatment
            - ChIP1
            - ChIP2
        Cont:
            # multiple samples can be used as control
            - Cont1
            - Cont2
        params:
            macs2:
                nomodel: ''

    Peaks5:
        # the number of samples per group can differ 
        ChIP: ChIP2
        Cont:
            - Cont1
            - Cont2
        params:
            macs2:
                nomodel: ''

    Peaks6:
        # analysis can be performed without control
        ChIP: ChIP1
        Cont:
        params:
            macs2:
                nomodel: ''

(optional) IDR

Assuming that the some samples are (biological/technical) replicates, in order to measure the consistency between them use the irreproducible discovery rate (IDR) "Li, Q., Brown, J. B., Huang, H., & Bickel, P. J. (2011). Measuring reproducibility of high-throughput experiments. The annals of applied statistics, 5(3), 1752-1779.", which is in general a good (but very stringent) quality control.

idr:
    # idr analysis can have any name, but the names have to be unique 
    ChIP_IDR:
        # define the pair of samples, add more combinations for more replicates
        # ChIP1 and ChIP2 are required labels, only adjust values of those
        ChIP1: Peaks1
        ChIP2: Peaks2

(optional) Hub

In the hub section the general layout of a UCSC Track Hubs is described with some minimal items. The track hub is generated from the processed data and allows the visual inspection of results at a UCSC genome browser (for supported genomes).

The required items to define the hub are the following:

item example description
name PiGx_Hub name of the hub directory
shortLabel PiGx_Short short name of hub is displayed as name above track groups
longLabel PiGx_Hub_Long descriptive longer label for hub is displayed as hub description
email my.mail[at]domain.com whom to contact for questions about the hub or data
descriptionUrl pigx_hub.html URL to HTML page with a description of the hub's contents
super_tracks see below specification of hub layout (track groups, tracks)

This is a small example how this could look like:

hub:
    name: PiGx_Hub
    shortLabel: PiGx_Short
    longLabel: PiGx_Hub_Long
    email: [email protected]
    descriptionUrl: pigx_hub.html
    super_tracks:
        # track groups can have any name, but the names have to be unique 
        Tracks1:
            # tracks can have any name, but the names have to be unique 
            track11:
                # to add peaks as a track, define "type: macs" 
                name: Peaks1
                type: macs
            track12:
                # to add coverage signal as a track, define "type: bigwig"
                name: ChIP1
                type: bigWig
            # descriptive longer label for track group is
            # displayed as description in track settings
            long_label: Tracks1_long

(optional) Feature Combination

To find the combination of enriched binding regions, which is shared among a set of peak calling and/or idr analysis results, define a feature in the feature_combination section. Only items defined in the peak_calling and idr sections can be used here.

feature_combination:
    # features can have any name, but the names have to be unique
    Feature1:
        # define feature based on only one result
        - ChIP_IDR
    Feature2:
        # define feature based on more than one result
        - Peaks6
        - Peaks5
    Feature3:
        # define feature based on different analysis types
        - ChIP_IDR
        - Peaks5

(optional) Differential Analysis

To find genomic regions that are differentially bound by a certain chipped protein or that show differential open chomatin patterns, define a differential analysis to compare Case and Control samples, with optional covariates, using the DESeq2 framework ("Love, M.I., Huber, W. and Anders, S., 2014. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome biology, 15(12), p.550."). Optionally predefined peak calling and/or idr analysis resultscan be used to specify pre-called features that are combined to regions of interest, but by default a peak calling analysis containing all samples of the specifed groups is performed to create the consensus peaks defining the regions of interest. Each analysis can be run with a unique set of parameters for the peak calling and read summarization, however default parameters for all analysis can be defined in the general section of the settings file. Reads over features are counted using Rsubreads featureCount function, as this allows to finetune the summarization by multiple options explained in the package's manual, please see their publication for more details: "Liao, Y., Smyth, G.K. and Shi, W., 2019. The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads. Nucleic acids research, 47(8), pp.e47-e47.".

differential_analysis:
    Analysis1:
        Case:
            - Group1 
            - Group2
        Control:
            - Group3
        # comma separated list of additional co-variates to control for
        # in differential expression analysis (e.g. batch, age,
        # temperature, sequencing_technology etc.). Must correspond to a
        # column field in the sample_sheet.csv file.)    
        Covariates: ''
        Peakset: # choose peaks to define consensus peak set
            - ChIP_IDR
            - Peaks6
            - Peaks5
        params:
            feature_counting:
                minFragLength: 50
                maxFragLength: 150
    Analysis2:
        Case:
            - Group1 
            - Group2
        Control:
            - Group3
        Peakset: # if no peaks given call joint peaks for given samples
    Analysis3:
        Case: 
            - Group1 
            - Group2
        Control: Group3
        params:
            macs2:
                nomodel: ''
                extsize: 147

Output Folder Structure

|-- Analysis
|-- Annotation
|-- BigWig
|-- Bowtie2_Index
|-- FastQC
|-- Log
|-- Mapped
|-- Peaks
|-- Reports
|-- Trimmed
|-- UCSC_HUB

Analysis

Contains RDS files with intermediary analysis steps. RDS are binary files which efficiently store R objects.

Annotation

Formatted GTF annotation.

BigWig

Symbolic links to the bigWig signal files.

Bowtie2_Index

Processed genme file along with the Bowtie2_Index

FastQC

FastQC sequencing quality report

Log

Detailed output from execution of each step of the pipeline.

Mapped

Mapped reads in .bam format, and corresponding bigWig files.

Peaks

Peaks called with MACS2. Depending on the parameters, contains either narrowPeak or broadPeak format. sample_qsort.bed contains uniformly processed peaks, sorted by their corresponding p value.

Reports

Contains MultiQC and ChIP quality reports in html format. Optionally contains folders with Results of Differential Analysis wth DESeq2 and featureCounts tables in CSV and Reports in html format.

Trimmed

Trimgalore adaptor and quality trimmed files.

UCSC_Hub

Contains a completely formatted UCSC hub, with track descriptions, peaks and bigWig tracks.

Documentation

If are interested in the general workflow of the pipeline and want to learn more about PiGx make sure you check:

Citation

If you use the pipeline in published research, please cite our publication:

Wurmus, R., Uyar, B., Osberg, B., Franke, V., Gosdschan, A., Wreczycka, K., Ronen, J., Akalin, A. (2018). PiGx: Reproducible genomics analysis pipelines with GNU Guix. GigaScience. https://doi.org/10.1093/gigascience/giy123

Questions

If you have further questions please e-mail: [email protected] or use the web form to ask questions https://groups.google.com/forum/#!forum/pigx/

pigx_chipseq's People

Contributors

alexg9010 avatar blosberg avatar frenkiboy avatar rekado avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

pigx_chipseq's Issues

Make reports deterministic

Running the pipeline twice on the same data should result in the very same report HTML. Set the htmlwidgets seed and respect SOURCE_DATE_EPOCH to reset timestamps.

Add README

This pipeline has no documentation. At the very least the format of the settings.yaml and the sample sheet need to be described.

Use settings.yaml and sample sheet

  • provide etc/settings.yaml.in containing defaults; this should include all tool paths (the contents of what is now called "software.yaml")
  • update pipeline to load values from the installed settings.yaml, and then update values from a user-supplied settings.yaml

Add logos

Reports should include the logo.
The pipeline should print the pigx command line logo.

InputError is not defined

I get this error when I try to run the pipeline with the yaml sample sheet from the test data directory, and using the test settings.yaml file:

NameError in line 52 of /gnu/store/x8nzq1adcq56nry5na0gv7zq14h77500-pigx-chipseq-0.0.8/libexec/pigx_chipseq/scripts/SnakeFunctions.py:
name 'InputError' is not defined

Can't build DAG

I believe the dot package is missing from the environment here, and as such I can't build the DAG. Error output shown below. Maybe we just need to add it to the scheme file?

$ ./pigx-chipseq Tests/sample_sheet.yaml -s  Tests/settings.yaml --graph chip_DAG.pdf
Commencing snakemake run submission locally
Traceback (most recent call last):
  File "./pigx-chipseq", line 311, in <module>
    p2 = subprocess.Popen(['dot','-Tpdf'], stdin=p1.stdout, stdout=outfile)
  File "/gnu/store/jb3n0bsdpkhvyb8y70jyr8fcx8fqssr9-python-3.5.3/lib/python3.5/subprocess.py", line 676, in __init__
    restore_signals, start_new_session)
  File "/gnu/store/jb3n0bsdpkhvyb8y70jyr8fcx8fqssr9-python-3.5.3/lib/python3.5/subprocess.py", line 1282, in _execute_child
    raise child_exception_type(errno_num, err_msg)
FileNotFoundError: [Errno 2] No such file or directory: 'dot'

Must have example XLS sample sheet

We have an init feature, which copies the example data into the current working directory. It's a feature to make it easier to get started.

For this pipeline we don't have an example XLS sample sheet. In fact, there is no XLS sample sheet anywhere, so I don't know how to test this.

in etc/settings.yaml "genome-dir" is never used + triggers error AND "genome-file" is missing

I had this multiple times now:

SystemExit in line 17 of /home/agosdsc/libexec/pigx_chipseq/scripts/Check_Config.py:
ERROR: Config file is not properly formated:
        genome-diris not a valid file

  File "/home/agosdsc/libexec/pigx_chipseq/Snake_ChIPseq.py", line 25, in <module>
  File "/home/agosdsc/libexec/pigx_chipseq/scripts/Check_Config.py", line 17, in validate_config

But genome-dir is actually never refered to, so I guess we can get rid of it ...

What is even more bothering is that the variable "genome-file" is missing from default settings, but is used throughout the code.

make sure threads is defined in config file

Some rules have hard-coded values for threads that appear directly in the command line:
Make sure these are present in the config file. e.g.:

config['execution']['rules']['samtools_sort']['threads']
config['execution']['rules']['bowtie2']['threads']

Validation should confirm these values are provided

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.