GithubHelp home page GithubHelp logo

stamlab / footprinting2012 Goto Github PK

View Code? Open in Web Editor NEW

This project forked from sjneph/footprinting2012

4.0 9.0 1.0 216 KB

DNaseI Footprinting

License: GNU General Public License v2.0

C++ 97.88% Makefile 2.12%

footprinting2012's Introduction

Footprint Detection

by Shane Neph

Overview

The core program that implements the DNaseI footprinting description from: An expansive human regulatory lexicon encoded in transcription factor footprints, Nature 489, 83โ€“90 (2012).

Given a set of sequencing tag counts (integers) at each base of any region, this program creates an unthresholded list of candidate footprints.

Build

// requires g++ version 4.7 or newer
make -C src/
This will create subdirectory bin/ and place fp2012 in it.

Program Options:

fp2012
    [--help]
	[--flankmin <bases>  = 6]
	[--flankmax <bases>  = 12]
	[--centermin <bases> = 6]
	[--centermax <bases> = 100]
	File-Full-O-Integers

The footprint occupancy score (FOS) of a candidate footprint is defined as:
FOS = (C+1)/L + (C+1)/R, where C is the average number of tags over the central/core region of a potential footprint, and L (R) is the average tag level found in the left (right) flanking region. A lower FOS is a more significant score. Any case is ignored where L or R is less than or equal to C, and, consequently, no division by zero can occur.

--flankmin
--flankmax set the min/max number of flanking bases over which to find the max mean value for L and R

--centermin
--centermax set the min/max number of bases over which to find the min mean value for C

Input

This program accepts a file full of integers that represent the number of cleavage events assigned to each base. That is, each base receives an integer that shows the number of uniquely-mapping sequencing tags with 5' ends mapping to that position. For those bases with none, they receive a zero. Since cleavage occurs in between two nucleotides, choose the 5'-most of these (relative to the forward strand) to represent each cleavage event. It is not strictly necessary to use only uniquely-mapping tags though it is common in practice.

You may use a dash (-) to denote that input comes from stdin.

Regions of interest may be broken up into subsequences by file. We recommend that you partition your data by chromosome and run the fp2012 program on each of these in parallel. Note that the bedextract and unstarch programs are designed to stream data by chromosome very efficiently. Use the applicable tool to stream the data, and then select the column of interest using the built-in cut command.

Results

The output of this program consists of unthresholded candidate footprints. The algorithm is deterministic so if you run everything again with the same inputs, you will receive the same results. The file format has 8 columns.

  1. The leftmost position of the left-flanking region
  2. The start of the central/core potential footprint (1 bp beyond the end of left-flanking region)
  3. The start of the right-flanking region (1 bp beyond the end of the core region)
  4. The end of the right-flanking region (1 bp beyond the end of the right-flanking region)
  5. The FOS (lower is more significant)
  6. The mean tag level of the left-flanking region (L)
  7. The mean tag level of the core region (C)
  8. The mean tag level of the right-flanking region (R)

All candidate footprints' core regions are disjoint and they do not abutt. The set of chosen candidate footprints are optimized relative to each other in cases where they would overlap or abutt, and each candidate is optimized over the input parameter settings. You will next want to threshold results further, including the removal of candidate footprints with too many unmapped bases in their core regions. Another common filter is to ignore results outside of DNaseI FDR 1% hotspot regions.

Note that the program does not know about chromosomes. Further, it reads in and interprets the first integer as belonging to absolute position zero. If you feed it something that does not start at base zero, you need to adjust the output using the first input base's offset.

Even if you paste on chromosome information to the beginning of the output, be careful as the output is not in sorted BED order.

Typically, one rearranges columns 2&3 to 1&2, pastes on appropriate chromosome information as the first field, and then sorts to obtain a BED-like file. Then a thresholding procedure based upon FOS is applied to the potential footprints. Note that a lower FOS is a more significant FOS. Using this procedure, you end up with 0-based [start,end) footprint calls where columns 2 and 3 hold the core footprint start and end positions.

Portability

This program runs fine on Linux systems, Mac OS X, and other BSD systems. It is written in standard C++, so it should compile and run natively on Windows systems though a different build manager would need to replace the simple makefile. The makefile hardcodes g++ as the compiler. Change CC in the makefile (for example, to clang++) along with build flags as you see fit.

Tips

One method of sticking zeroes in for bases on chr7 that have no per-base number of cuts using bedops, where column 5 of the input holds the base counts, looks like:

  bedops -c -L --chrom chr7 <bases-with-tag-counts> \
    | awk '{ for (i=$2; i<$3; ++i) { print $1,i,i+1,".",0; } }' \
    | bedops -u - <bases-with-tag-counts> \
    | cut -f5

Note that <bases-with-tag-counts> must be a properly sorted BED file, and the output of this command sequence may be piped directly into the fp2012 program. Here, all bases beyond the last found in <bases-with-tag-counts> will have no integer representation. This could affect results in only the slightest way (the very last footprint call if several conditions are all met). You may add another file to the bedops -c call to put zeroes all of the way to the end of a chromosome for completeness if that is a concern. Ask a question on our forum if advice is needed.

Performance and scalability

We regularly run this program on deeply-sequenced data using a compute cluster. We break the genome up by chromosome and submit each to a cluster of modest machines. We typically set flanking search parameters to 3-10 and the center/core footprint search sizes at 6-40. With this method, you can expect full results in less than one hour with a genome roughly the size of that for human. One may implement tricks to restrict inputs to less than a whole chromosome (for example, restrict to 1% FDR DNaseI hotspots) in order to speed up computations considerably. The tradeoff is a significantly larger amount of bookkeeping to create inputs and to glue the final results together properly. This seems deceptively simple to do, and it's easy to overlook pitfalls. We recommend running fp2012 with data from an entire chromosome.

The program can use a bit of main memory and we recommend 2G or more RAM. Surprisingly, feeding the program all zeroes gives the worst case memory performance. That is something that we plan to address in the future. Note that we have never had any memory issues in practice when using real data sets.

footprinting2012's People

Contributors

sjneph avatar

Stargazers

 avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Forkers

jessica-2019

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.