GithubHelp home page GithubHelp logo

fw1121 / bam-readcount Goto Github PK

View Code? Open in Web Editor NEW

This project forked from genome/bam-readcount

0.0 2.0 0.0 7.09 MB

count DNA sequence reads in BAM files

CMake 21.32% Perl 2.28% Python 38.42% XSLT 5.35% C++ 32.63%

bam-readcount's Introduction

bam-readcount

The purpose of this program is to generate metrics at single nucleotide positions.
There are number of metrics generated which can be useful for filtering out false positive calls.
Help is currently available on the commandline if you do not supply the program
any arguments.

Usage: bam-readcount [OPTIONS] <bam_file> [region]
Generate metrics for bam_file at single nucleotide positions.
Example: bam-readcount -f ref.fa some.bam
Available options:
  -h [ --help ]                         produce this message
  -v [ --version ]                      output the version
  -q [ --min-mapping-quality ] arg (=0) minimum mapping quality of reads used
                                        for counting.
  -b [ --min-base-quality ] arg (=0)    minimum base quality at a position to
                                        use the read for counting.
  -d [ --max-count ] arg (=10000000)    max depth to avoid excessive memory
                                        usage.
  -l [ --site-list ] arg                file containing a list of regions to
                                        report readcounts within.
  -f [ --reference-fasta ] arg          reference sequence in the fasta format.
  -D [ --print-individual-mapq ] arg    report the mapping qualities as a comma
                                        separated list.
  -p [ --per-library ]                  report results by library.
  -w [ --max-warnings ] arg             maximum number of warnings of each type
                                        to emit. -1 gives an unlimited number.
  -i [ --insertion-centric ]            generate indel centric readcounts.
                                        Reads containing insertions will not be
                                        included in per-base counts

This program reports readcounts for each base at each position requested.
It also reports the average base quality of these bases and mapping qualities of
the reads containing each base.

The list of regions should be formatted as chromosome start and end. Each field should be tab separated and coordinates should be 1-based.
The optional region specification on the command line should follow the same format as that used for samtools (chr:start-stop)

Normal Output

The output format prints to standard out as follows:

chr	position	reference_base	depth	base:count:avg_mapping_quality:avg_basequality:avg_se_mapping_quality:num_plus_strand:num_minus_strand:avg_pos_as_fraction:avg_num_mismatches_as_fraction:avg_sum_mismatch_qualities:num_q2_containing_reads:avg_distance_to_q2_start_in_q2_reads:avg_clipped_length:avg_distance_to_effective_3p_end   ...

Each tab separated field contains the following information, each one is tab-separated:

base → the base that all reads following in this field contain at the reported position i.e. C

count → the number of reads containing the base

avg_mapping_quality → the mean mapping quality of reads containing the base

avg_basequality → the mean base quality for these reads

avg_se_mapping_quality → mean single ended mapping quality

num_plus_strand → number of reads on the plus/forward strand

num_minus_strand → number of reads on the minus/reverse strand

avg_pos_as_fraction → average position on the read as a fraction (calculated with respect to the length after clipping)

avg_num_mismatches_as_fraction → average number of mismatches on these reads per base

avg_sum_mismatch_qualities → average sum of the base qualities of mismatches in the reads

num_q2_containing_reads → number of reads with q2 runs at the 3’ end

avg_distance_to_q2_start_in_q2_reads → average distance of position (as fraction of unclipped read length) to the start of the q2 run

avg_clipped_length → average clipped read length of reads

avg_distance_to_effective_3p_end → average distance to the 3’ prime end of the read (as fraction of unclipped read length)

Per Library Output

Beginning with version 0.5.0, bam-readcount can now output counts per each library within a single BAM file. Libraries are reported with the same metrics as in
normal counting, but each library is listed by name and its metrics denoted by curly braces as follows:

chr	position	reference_base	depth	libname	{	base:count:avg_mapping_quality:avg_basequality:avg_se_mapping_quality:num_plus_strand:num_minus_strand:avg_pos_as_fraction:avg_num_mismatches_as_fraction:avg_sum_mismatch_qualities:num_q2_containing_reads:avg_distance_to_q2_start_in_q2_reads:avg_clipped_length:avg_distance_to_effective_3p_end	}   ...

Build Dependencies

Available Packages for Dependencies

  • For APT-based systems (Debian, Ubuntu), install the following packages:
sudo apt-get install build-essential git-core cmake zlib1g-dev libncurses-dev patch
  • For RPM-based systems (Fedora, CentOS, RHEL), install the following packages instead:
sudo yum groupinstall "Development tools" 
sudo yum install zlib-devel ncurses-devel cmake patch

Note that for some RPM based systems (like RHEL) and older versions of Ubuntu, you will need to install cmake 2.8.3 or greater yourself as the packaged version is older.

Build Instructions

Clone the bam-readcount repository

  • Recursively clone the git repository
        git clone https://github.com/genome/bam-readcount.git

Compile bam-readcount

  • bam-readcount does not support in source builds. Create a new build directory, enter it, and run:
        cmake /path/to/bam-readcount/repo
        make

The binary can then be found in the bin/ subdirectory of your build directory.

FAQ

I get errors from cmake about missing modules. How do I fix this?

These errors should no longer occur as of commit e48b05cbb9a9222d7585958f9704b147f0a4cbea. For earlier commits, bam-readcount contains a git submodule called build-common. It contains helper modules for cmake. If you downloaded the source as a tarball from github or forgot to do a recursive clone using git, then you will not have this submodule and will see cmake errors. If you are using git, we recommend you go back and use the —recursive option when cloning the repository. If you cannot use git, follow the instructions below to remedy the situation.

  1. Download the build-common module separately here
  2. Extract that tarball, and rename the directory it creates to ‘build-common’.
  3. Replace the empty build-common subdirectory in the sniper directory with directory you just created.
  4. Resume following the build instructions.

I get a segfault when attempting to run on a whole bam file.

This is a known bug that will be addressed in future versions. In most cases, you will only need read counts on a subset of positions and specifying the sites of interest is functional. Should you need to run the entire genome through then running by chromosome and concatenating the results should be a decent workaround (e.g. like bam-readcount -f ref.fasta some.bam chr1).

User Support

Please first search Biostar and then ask a question there if needed. We automatically monitor Biostar for questions related to our tools.

bam-readcount's People

Contributors

ernfrid avatar tabbott avatar chrisamiller avatar indraniel avatar morgantaschuk avatar nnutter avatar sakoht avatar

Watchers

James Cloos avatar Wayne Fang avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.