GithubHelp home page GithubHelp logo

farhat / methylcode Goto Github PK

View Code? Open in Web Editor NEW

This project forked from brentp/methylcode

1.0 3.0 0.0 242 KB

tools for methyation data: bowtie runner, plotter, moving average, analysis, etc.

License: Other

methylcode's Introduction

MethylCoder is a single program that takes of bisulfite-treated reads and outputs per-base methylation data. It also includes scripts for analysis and visualization. In addition to a binary output and a SAM alignment file, the direct output of methylcoder is a text file that looks like

#seqid  mt  bp  c   t
1   3   1354    0   1
1   3   1358    0   1
1   3   1393    0   1
1   3   1394    0   1
1   3   1402    0   1
1   3   1409    0   1

where columns are reference seqid methylation context (type) basepair location(bp) number of reads where a (c)ytosine was unconverted, number of reads where where a cytosine was converted to (t)hymine. Making methylation at every methylable basepair easily calculated as c / (c + t).

This software is developed in the Fischer Lab . At UC Berkeley. Please report any requests, bugs, patches, problems, docs to [email protected]

It is distributed under the New BSD License

Python 2.6 must be installed along with the following modules. all of these are available from pypi and as such are installable via

$  easy_install [module]
  • numpy to handle arrays and binary data in python
  • pyfasta to access/index fasta files

matplotlib is required to plot the per-chromosome methylation levels.

  • bowtie to align the reads to the genome.
  • (optional) gsnap alternative aligner. (part of gmap).
  • (optional) sam-tools to view the alignments and processing the reads

once the above python and c libraries are installed, download methylcoder from:

http://github.com/brentp/methylcode/tarball/master (tar ball)

and untar; or clone the repository via git:

git clone git://github.com/brentp/methylcode.git

Then, from the methylcode directory, it is still necessary to run

$ sudo python setup.py install

to install the package into your path. After that, the executable 'methylcoder' will be available on your path. Running with no arguments will print help.

The input to the pipeline is:

  • a reference fasta file with one entry per chromosome in the genome to which the reads are to be mapped.
  • a fastq or fasta reads file. all reads must be the same length and must be from Eckers/Zilberman bisulfite process (with only 2 possibilities not 4 from Cokus protocol). If 2 read files are specified, they are assumed to be pair ends and the aligner is called appropriately.
  • a textfile containing columns:
    1. seqid (chromosome)
    2. methylation type (1 to 6 see below)
    3. basepair position (0-indexed)
    4. reads with C at this position
    5. reads with T at this position

    methylation can be calculated as column 4 / (column 5 + column 4) 4 and 5 are corrected for strand (G, A respectively for - strand).

  • a set of binary files for each chromosome in the fasta file. each file

    contains a value for each basepair in the chromosome--many of which will be 0 if the position is not a C or G. these files contain no headers and can be read in any language by specifying the file-type (listed in [square brackets] below. these include:

    • methyltype.bin with values between 1 and 6 as described below (value of 0 means no methylation is possible at this basepair). [encoded as uint8]
    • cs.bin containing the number of reads with C's at each position (same as column 4 above). [encoded as uint32]
    • ts.bin containing the number of reads with T's at each position (same as column 5 above). [encoded as uint32]
  • Methylation type is a value between 1 and 6:
    1. CG on + strand
    2. CHG on + strand
    3. CHH on + strand
    4. CG on - strand
    5. CHG on - strand
    6. CHH on - strand

You must have:

  1. input reference fasta file to which to align the reads. here: thaliana_v9.fasta
  2. a reads file in fastq or fasta format. here: reads.fastq. if you have paired end reads, they must be specified in order 1, 2.
  3. a directory containing the bowtie and bowtie-build executables. (or the path to the gmap/gsnap source directory containing util/ and src/ dirs)

An example command to run the pipeline is:

$ methylcoder --bowtie /usr/local/src/bowtie/ \
              --extra-args "-m 1"
              --reference /path/to/thaliana_v9.fasta \
              /path/to/reads.fastq

or using the gsnap aligner on paired-end reads.:

$ methylcoder --gsnap /usr/local/src/gmap/gmap-2010-03-09/ \
              --reference /path/to/thaliana_v9.fasta \
              /path/to/reads_1.fastq /path/to/reads_2.fastq

Where you must adjust /path/to/reads.fastq to point to your BS-treated reads. This will create the files specified in Output above, sending the text to path/to/reads_methylcoder/methy-data-DATE.txt where DATE is the current date. The binary files will be sent to, that same directory as: thaliana_v9.fasta.[CHR].methyl.bin where [CHR] is substituted by each chromosome in the fasta file. Once bowtie is run once, its output is not deleted, and methylcoder.py will only re-run bowtie if its input has been modified since it was run last. NOTE if the methylcoder executable is called without any options, it will print help and available command-line arguments.

Additional args can be sent directly to the aligner as a string to methylcoder.py's --extra-args parameter. This would look like.

--extra-args "--solexa-quals -k 1 -m 1 --strata"

and that string will be passed directly to the bowtie invocation when it is called from methylcoder. Whenever 2 fastq files are sent, they are assumed to be paired-end reads.

  • when using bowtie, the reference size must be <= 1.8 Gigabases. This limitation can be circumvented by splitting the reference into 2 smaller reference sequences. For example with human, splitting into 2 fasta files, one with chromosomes 1-9 and the other with chromosomes 10+ works well. This limitation does not exist when GSNAP is used as the aligner.

See: http://github.com/brentp/methylcode/wikis/using-samtools-to-view-alignments

  • Eckers paper. http://www.nature.com/nature/journal/v462/n7271/extref/nature08514-s1.pdf
  • Bowtie Paper: Langmead B, Trapnell C, Pop M, Salzberg SL. 2009. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10:R25.
  • GSNAP paper: Wu TD, Nacu S. 2010 Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics. 26(7):873-81.

warning when run with bowtie, methylcoder assumes that the Bisulfite converted reads are created using the Zilberman/Ecker method in which BS conversion occurs after conversion to solexa library--giving only 2 possibibilities. This is in contrast to the Jacobsen method which gives 4 possiblities. When run with gsnap, it is assumed that the Jacobsen method was used.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.