encode-dcc / chip-seq-pipeline

ENCODE Uniform processing pipeline for ChIP-seq

License: MIT License

chip-seq-pipeline's Introduction

ENCODE ChIP-seq Pipeline

Current implementation is deployed to the DNAnexus platform.

Mapping

  1. Map reads with BWA, mark duplicates with Picard, and remove duplicates.
  2. Estimate library complexity and calculate NRF (non-redundant fraction), PBC1, and PBC2 (PCR bottleneck coefficients).
  3. Run cross-correlation analysis with spp/phantompeakqualtools.
  4. Generate p-value and fold-over-control signal tracks with MACS2, for each replicate and for the pooled replicates.
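
To make the cross-correlation step more concrete, here is a minimal sketch of the underlying idea: count read 5' ends per position on each strand, shift the minus-strand profile, and look for the shift that maximizes the Pearson correlation. This is a toy illustration and not the spp/phantompeakqualtools implementation; the function names (`pearson`, `strand_cross_correlation`, `estimate_fragment_length`) and the single-chromosome input format are assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def strand_cross_correlation(plus_starts, minus_starts, chrom_len, max_shift):
    """Return {shift: correlation} for shifts 0..max_shift on one chromosome.

    plus_starts / minus_starts are 0-based 5' end positions of reads on the
    plus and minus strands (hypothetical input format for this sketch).
    """
    plus = Counter(plus_starts)
    minus = Counter(minus_starts)
    profile = {}
    for shift in range(max_shift + 1):
        n = chrom_len - shift
        xs = [plus[i] for i in range(n)]           # plus-strand counts
        ys = [minus[i + shift] for i in range(n)]  # minus-strand counts, shifted
        profile[shift] = pearson(xs, ys)
    return profile


def estimate_fragment_length(profile):
    """Shift with the highest correlation, used here as a fragment length estimate."""
    return max(profile, key=profile.get)
```

In this toy model the correlation peaks at the shift equal to the typical distance between plus-strand and minus-strand read starts. Real tools work on whole genomes and also have to handle the read-length "phantom" peak, which this sketch ignores.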

Peak calling (histone marks)

  1. Call peaks with MACS2.
  2. Calculate and report overlapping peaks from both replicates.
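
As a rough illustration of what "overlapping peaks from both replicates" can mean, the sketch below keeps a replicate 1 peak when some replicate 2 peak shares enough bases with it. It is not the pipeline's `overlap_peaks.py`; the tuple format `(chrom, start, end)`, the function names, and the `min_frac=0.5` threshold (overlap covering at least half of either peak) are assumptions chosen for the example.

```python
def overlap_length(a, b):
    """Number of bases shared by two (chrom, start, end) half-open intervals."""
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def overlapping_peaks(rep1, rep2, min_frac=0.5):
    """Return the rep1 peaks that are overlapped by at least one rep2 peak.

    A pair counts as overlapping when the shared bases cover at least
    min_frac of either peak. min_frac=0.5 is an illustrative assumption.
    """
    kept = []
    for p in rep1:
        for q in rep2:
            shared = overlap_length(p, q)
            if shared == 0:
                continue
            frac_p = shared / (p[2] - p[1])
            frac_q = shared / (q[2] - q[1])
            if frac_p >= min_frac or frac_q >= min_frac:
                kept.append(p)
                break  # one supporting peak is enough
    return kept
```

The nested loop is quadratic in the number of peaks, which is fine for a demonstration; on real peak files a sorted sweep or a tool such as bedtools would be the usual choice.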

Peak calling (transcription factors)

  1. Call peaks with SPP.
  2. Threshold peaks with IDR.
  3. Report IDR-thresholded peak sets, self-consistency ratio, rescue ratio, reproducibility test.
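
The README does not define the two ratios, so the following sketch rests on assumptions: it uses the definitions commonly given in ENCODE documentation, where N1 and N2 are the IDR-thresholded peak counts from each replicate's self-pseudoreplicates, Nt is the count from the true replicates, and Np is the count from the pooled pseudoreplicates. The threshold of 2 and the pass/borderline/fail labels are likewise assumed rather than taken from this repository.

```python
def ratio(a, b):
    """Larger count divided by smaller count (infinity if the smaller is 0)."""
    lo, hi = min(a, b), max(a, b)
    return float("inf") if lo == 0 else hi / lo


def reproducibility_qc(n1, n2, nt, np_, threshold=2.0):
    """Summarize replicate agreement from IDR-thresholded peak counts.

    n1, n2: peaks from self-pseudoreplicates of replicate 1 and 2
    nt:     peaks from comparing the true replicates
    np_:    peaks from comparing pooled pseudoreplicates
    The threshold of 2.0 is an assumption for this sketch.
    """
    self_consistency = ratio(n1, n2)
    rescue = ratio(np_, nt)
    failures = sum(r > threshold for r in (self_consistency, rescue))
    verdict = ("pass", "borderline", "fail")[failures]
    return {
        "self_consistency_ratio": self_consistency,
        "rescue_ratio": rescue,
        "reproducibility": verdict,
    }
```

For example, `reproducibility_qc(10000, 8000, 12000, 15000)` gives both ratios as 1.25 and a verdict of "pass" under these assumed rules.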

chip-seq-pipeline's People

Contributors

asottile, hitz, keenangraham, ottojolanki, strattan, submarinesammitch

chip-seq-pipeline's Issues

Why remove duplicates before calculating library complexity?

Hi,
I assume you're using the following in the pipeline to calculate the library complexity metrics:

# calculate PBC metrics

bedtools bamtobed -bedpe -i tmp.bam | awk 'BEGIN{OFS="\t"}{print $1,$2,$4,$6,$9,$10}' \
  | grep -v 'chrM' | sort | uniq -c | awk 'BEGIN{mt=0;m0=0;m1=0;m2=0}($1==1){m1=m1+1}
  ($1==2){m2=m2+1} {m0=m0+1} {mt=mt+$1}
  END{printf "%d\t%d\t%d\t%d\t%f\t%f\t%f\n", mt,m0,m1,m2,m0/mt,m1/m0,m1/m2}' > ${sample}.pbc.qc
rm tmp.bam

where mt = # TotalReadPairs, m0 = # DistinctReadPairs, m1 = # OneReadPair, m2 = # TwoReadPairs, NRF = m0/mt = Distinct/Total, PBC1 = m1/m0 = OnePair/Distinct, PBC2 = m1/m2 = OnePair/TwoPair
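
The same counts can be sketched in Python, which may make the definitions easier to check by hand. This mirrors the `sort | uniq -c` logic above rather than reproducing the pipeline's code; the function name `pbc_metrics` and the idea of passing read pairs as hashable keys (for example the six columns printed by the first `awk`) are assumptions for the example. Unlike the `awk` one-liner, it guards against division by zero.

```python
from collections import Counter


def pbc_metrics(read_pair_keys):
    """Compute mt, m0, m1, m2, NRF, PBC1 and PBC2 from read-pair keys.

    Each key identifies where a read pair maps, e.g. a tuple
    (chrom1, start1, chrom2, end2, strand1, strand2).
    """
    counts = Counter(read_pair_keys)                       # like `sort | uniq -c`
    mt = sum(counts.values())                              # total read pairs
    m0 = len(counts)                                       # distinct read pairs
    m1 = sum(1 for c in counts.values() if c == 1)         # seen exactly once
    m2 = sum(1 for c in counts.values() if c == 2)         # seen exactly twice
    return {
        "mt": mt, "m0": m0, "m1": m1, "m2": m2,
        "NRF": m0 / mt if mt else float("nan"),
        "PBC1": m1 / m0 if m0 else float("nan"),
        "PBC2": m1 / m2 if m2 else float("inf"),
    }


if __name__ == "__main__":
    pairs = [
        ("chr1", 100, "chr1", 300, "+", "-"),
        ("chr1", 100, "chr1", 300, "+", "-"),   # duplicate of the first pair
        ("chr1", 500, "chr1", 720, "+", "-"),
        ("chr2", 40, "chr2", 260, "-", "+"),
    ]
    print(pbc_metrics(pairs))             # input that still contains duplicates
    print(pbc_metrics(set(pairs)))        # same input after removing duplicates
```

On input that has already been deduplicated every key occurs once, so mt equals m0 and NRF is 1, with m2 equal to 0.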

So if you remove duplicates, mt becomes equal to m0 and NRF will be 1.
As I understand it, "uniq -c" prefixes each line with its number of occurrences, so a line gets the prefix 1 if it is unique (counted in m1) and the prefix 2 if it occurs twice (counted in m2). However, identical lines are normally removed during duplicate removal. If we used the definition of a distinct genomic location, the code should not look for identical lines to classify them as m2, but for lines that map to the same location (partially overlapping fragments that originate from different DNA molecules).

I have tried running this calculation after removing duplicates and the NRF does not look right: it is always 1. Could you explain why this step always comes after duplicate removal in the pipeline, which causes NRF to always be 1?

Finding overlapping peaks for histone ChIP-seq

Hi,
I'm curious about the function for finding overlapping peaks, especially the script 'overlap_peaks.py'. However, your documentation barely explains how to run it, and I don't even know what kind of files it expects as input. Could you help me use it? I would really like to get more precise results.
Hanwen
