encode-dcc / chip-seq-pipeline

ENCODE Uniform processing pipeline for ChIP-seq

License: MIT License

Python 52.68% Shell 3.66% Perl 33.32% R 0.02% Makefile 0.06% Roff 10.12% AngelScript 0.13%

ENCODE ChIP-seq Pipeline

Current implementation is deployed to the DNAnexus platform.

Mapping

Map reads with BWA, mark duplicates with Picard, and remove duplicates.
Estimate library complexity and calculate NRF (non-redundant fraction), PBC1, and PBC2 (PCR bottleneck coefficients).
Run cross-correlation analysis with spp/phantompeakqualtools.
With MACS2, generate p-value and fold-over-control signal tracks for each replicate and for the pooled replicates.

Peak calling (histone marks)

Call peaks with MACS2.
Calculate and report overlapping peaks from both replicates.
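
As a rough illustration of what "overlapping peaks from both replicates" can mean, the sketch below keeps the replicate 1 peaks that overlap some replicate 2 peak. The tuple layout, the function names, and the 50% minimum-overlap criterion are assumptions made for this example; they are not taken from overlap_peaks.py.

```python
def overlap_bp(a, b):
    """Number of overlapping bases between two (chrom, start, end) peaks."""
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def overlapping_peaks(rep1, rep2, min_fraction=0.5):
    """Return rep1 peaks that overlap at least one rep2 peak.

    A pair counts as overlapping when the shared bases cover at least
    min_fraction of the shorter peak (an assumed criterion).
    """
    kept = []
    for p in rep1:
        for q in rep2:
            shorter = min(p[2] - p[1], q[2] - q[1])
            if shorter > 0 and overlap_bp(p, q) / shorter >= min_fraction:
                kept.append(p)
                break  # one supporting peak is enough
    return kept


rep1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
rep2 = [("chr1", 150, 260), ("chr1", 590, 700)]
print(overlapping_peaks(rep1, rep2))  # [('chr1', 100, 200)]
```

The nested loop is quadratic and only suitable for toy inputs; on real peak files an interval tool such as bedtools would be the usual choice.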

Peak calling (transcription factors)

Call peaks with SPP.
Threshold peaks with IDR.
Report IDR-thresholded peak sets, self-consistency ratio, rescue ratio, and the reproducibility test result.
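
The README names the self-consistency ratio, rescue ratio, and reproducibility test without defining them. The sketch below assumes definitions that are commonly quoted for ENCODE-style IDR analysis: each ratio is the larger peak count divided by the smaller, and a threshold of 2 separates pass, borderline, and fail. The function names, argument names, and the threshold are assumptions for illustration and should be checked against the pipeline code.

```python
def ratio(a, b):
    """Larger count divided by smaller count; infinite if the smaller is 0."""
    lo, hi = min(a, b), max(a, b)
    return float("inf") if lo == 0 else hi / lo


def reproducibility(n1, n2, n_true, n_pooled_pseudo, threshold=2.0):
    """Summarise IDR peak counts (assumed definitions, see lead-in).

    n1, n2           peaks passing IDR for each replicate's self-pseudoreplicates
    n_true           peaks passing IDR when comparing the true replicates
    n_pooled_pseudo  peaks passing IDR for the pooled pseudoreplicates
    """
    self_consistency = ratio(n1, n2)
    rescue = ratio(n_pooled_pseudo, n_true)
    # Count how many of the two ratios exceed the assumed threshold.
    failures = sum(r > threshold for r in (self_consistency, rescue))
    label = ("pass", "borderline", "fail")[failures]
    return self_consistency, rescue, label


print(reproducibility(1000, 800, 1200, 1500))  # (1.25, 1.25, 'pass')
print(reproducibility(1000, 400, 1200, 1500))  # (2.5, 1.25, 'borderline')
```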

chip-seq-pipeline's Issues

Why remove duplicates before calculating library complexity?

Hi,
I assume you're using the following in the pipeline to calculate the library complexity metrics:

# calculate PBC metrics
# columns kept from bedpe: chrom1, start1, chrom2, end2, strand1, strand2
bedtools bamtobed -bedpe -i tmp.bam \
  | awk 'BEGIN{OFS="\t"}{print $1,$2,$4,$6,$9,$10}' \
  | grep -v 'chrM' | sort | uniq -c \
  | awk 'BEGIN{mt=0;m0=0;m1=0;m2=0}
         ($1==1){m1=m1+1}
         ($1==2){m2=m2+1}
         {m0=m0+1} {mt=mt+$1}
         END{printf "%d\t%d\t%d\t%d\t%f\t%f\t%f\n", mt,m0,m1,m2,m0/mt,m1/m0,m1/m2}' > "${sample}.pbc.qc"
rm tmp.bam

where mt = # TotalReadPairs, m0 = # DistinctReadPairs, m1 = # OneReadPair, m2 = # TwoReadPairs, NRF = m0/mt = Distinct/Total, PBC1 = m1/m0 = OnePair/Distinct, PBC2 = m1/m2 = OnePair/TwoPair.
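
The same counts can be sketched in Python to make the definitions concrete. This is a minimal illustration, not the pipeline's code: `pbc_metrics` is a hypothetical helper, and each fragment is assumed to be any hashable key (for example the six-column tuple printed by the first awk command).

```python
from collections import Counter


def pbc_metrics(fragments):
    """Compute mt, m0, m1, m2, NRF, PBC1 and PBC2 from fragment keys."""
    counts = Counter(fragments)       # same role as `sort | uniq -c`
    mt = sum(counts.values())         # total read pairs
    m0 = len(counts)                  # distinct read pairs
    m1 = sum(1 for c in counts.values() if c == 1)  # seen exactly once
    m2 = sum(1 for c in counts.values() if c == 2)  # seen exactly twice
    nan = float("nan")
    return {
        "mt": mt, "m0": m0, "m1": m1, "m2": m2,
        "NRF": m0 / mt if mt else nan,
        "PBC1": m1 / m0 if m0 else nan,
        "PBC2": m1 / m2 if m2 else nan,  # undefined when nothing occurs twice
    }


# Toy input: A once, B twice, C three times, D once.
result = pbc_metrics(["A", "B", "B", "C", "C", "C", "D"])
print(result["mt"], result["m0"], result["m1"], result["m2"])  # 7 4 2 1
print(result["PBC1"], result["PBC2"])  # 0.5 2.0
```

Unlike the awk version, this sketch returns NaN instead of failing when a denominator is zero.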

If duplicates are removed first, mt becomes equal to m0 and NRF will be 1.
As I see it, "uniq -c" prefixes each line with its number of occurrences: a line that occurs once gets the prefix 1 and is counted in m1, and a line that occurs twice gets the prefix 2 and is counted in m2. However, identical lines are exactly what duplicate removal discards. If we used the definition of a distinct genomic location, the code should not look for identical lines to classify them as m2, but for lines that map to the same location (partially overlapping fragments that originate from different DNA molecules).

I have tried running this calculation after removing duplicates and the NRF does not look right: it is always 1. Could you explain why this step always comes after duplicate removal in the pipeline, which causes NRF to always be 1?
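
The arithmetic behind this question can be reproduced on toy data. The sketch below is only an illustration of the concern raised here; it says nothing about which BAM file the pipeline actually passes to this step, and `nrf` is a hypothetical helper.

```python
def nrf(fragments):
    """Non-redundant fraction: distinct fragments / total fragments."""
    fragments = list(fragments)
    return len(set(fragments)) / len(fragments)


raw = ["A", "B", "B", "C", "C", "C", "D"]

# Duplicate removal keeps one copy of each identical fragment.
deduplicated = list(dict.fromkeys(raw))

print(nrf(raw))           # 4/7, about 0.571
print(nrf(deduplicated))  # 1.0
```

Once identical fragments have been dropped, every remaining key is distinct, so the ratio cannot be anything other than 1.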

Finding overlapping peaks for histone ChIP-seq

Hi,
I'm curious about the function for finding overlapping peaks, especially the script 'overlap_peaks.py'. However, your documentation barely explains how to run it, and I don't even know what kind of files should be submitted. Could you help me use it? I would really like to get more precise results.
Hanwen
