GithubHelp home page GithubHelp logo

davetang / defining_genomic_regions Goto Github PK

View Code? Open in Web Editor NEW
23.0 3.0 12.0 5.58 MB

Define regions in the genome

Home Page: https://davetang.org/muse/2013/01/18/defining-genomic-regions/

Perl 91.51% Shell 8.30% Makefile 0.20%
bedtools genomics

defining_genomic_regions's Introduction

Defining genomic regions

A reference genome is simply a string of A's, C's, G's, and T's. However, there are many functional elements within the genome and massive efforts have been undertaken to annotate genomes. The GENCODE Project was tasked with cataloguing genes and gene variants in the human and mouse genomes. TAIR coordinates and provides genome annotations for Arabidopsis thaliana. Typically, genome annotations are provided in a GTF file.

We will define regions in a reference genome by using BEDTools and a GTF file.

Example The IGV screenshot above shows various gene models in dark blue, exonic regions in light red, intronic regions in light green, and intergenic regions in light blue.

Install BEDTools

To get started, download and compile BEDTools, if you haven't already.

git clone https://github.com/arq5x/bedtools2.git
cd bedtools2
make clean && make all

Alternatively, you can install BEDTools using Conda.

conda install -c bioconda bedtools

Download GTF file

We will use the GTF file for Arabidopsis thaliana; more information on the format is provided at the UCSC Genome Browser help page.

wget -c ftp://ftp.ensemblgenomes.org/pub/release-36/plants/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.36.gtf.gz

gunzip -c Arabidopsis_thaliana.TAIR10.36.gtf.gz | head
#!genome-build TAIR10
#!genome-version TAIR10
#!genome-date 2010-09
#!genome-build-accession GCA_000001735.1
#!genebuild-last-updated 2010-09
1       araport11       gene    3631    5899    .       +       .       gene_id "AT1G01010"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding";
1       araport11       transcript      3631    5899    .       +       .       gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
1       araport11       exon    3631    3913    .       +       .       gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon1";
1       araport11       CDS     3760    3913    .       +       0       gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1"; protein_version "1";
1       araport11       start_codon     3760    3762    .       +       0       gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";

Define exons

Exons are already defined in the GTF file, so we simply need to print lines that are marked exonic.

gunzip -c Arabidopsis_thaliana.TAIR10.36.gtf.gz |
awk 'BEGIN{OFS="\t";} $3=="exon" {print $1,$4-1,$5}' |
bedtools sort |
bedtools merge -i - | gzip > my_exon.bed.gz

Define introns

To obtain introns, we simply need the gene and exonic coordinates; by subtracting the exonic regions from the genic region, we have the intronic region.

gunzip -c Arabidopsis_thaliana.TAIR10.36.gtf.gz |
awk 'BEGIN{OFS="\t";} $3=="gene" {print $1,$4-1,$5}' |
bedtools sort |
bedtools subtract -a stdin -b my_exon.bed.gz |
gzip > my_intron.bed.gz

Define intergenic

For the intergenic region, we will require the size of the chromosomes.

gunzip -c Arabidopsis_thaliana.TAIR10.36.gtf.gz |
  awk 'BEGIN{OFS="\t";} $3=="gene" {print $1,$4-1,$5}' |
  bedtools sort -g chrom_info/araTha1.genome |
  bedtools complement -i stdin -g chrom_info/araTha1.genome |
  gzip > my_intergenic.bed.gz

Arabidopsis thaliana regions

How much of the is made up of exonic, intronic, and intergenic regions?

alias add='perl -nle '\''$i+=$_; END {print $i}'\'''

cat chrom_info/araTha1.genome
1       30427671
5       26975502
3       23459830
2       19698289
4       18585056
Mt      366924
Pt      154478

cat chrom_info/araTha1.chrom.sizes | cut -f2 | add
119667750

# exonic
gunzip -c my_exon.bed.gz | awk '{print $3-$2}' | add
47821763
bc -l<<<47821763*100/119667750
39.96211427055326100808

# intronic
gunzip -c my_intron.bed.gz | awk '{print $3-$2}' | add
18164145
bc -l<<<18164145*100/119667750
15.17881384082177529033

# intergenic
gunzip -c my_intergenic.bed.gz | awk '{print $3-$2}' | add
53769447
bc -l<<<53769447*100/119667750
44.93227874678014753348

# slightly off total
bc -l<<<15.17881384082177529033+44.93227874678014753348+39.96211427055326100808
100.07320685815518383189

hg19 reference genome

If you are working with hg19, simply run make to create all the different genomic regions.

make

# if everything ran successfully
for file in `ls *.gz`; do md5sum $file; done
bd83e28270e595d3bde6bfcb21c9748f  gencode.v19.annotation.gtf.gz
8c97ec4b54eaa176ba1e48bfeb60c08a  gencode_v19_exon_merged.bed.gz
ea03038b873ba2612383a4c0949c835d  gencode_v19_intergenic.bed.gz
4d5ff850e3115077bf50d87bc406a84f  gencode_v19_intron.bed.gz
48dbe15f4498baad1a2327c774a692c8  promoter.bed.gz
9d513cad3aafd5690bf8bbebb24b4df4  transcript_utr.bed.gz
35aed6aac655182c653cdc72060b914d  transcript_utr_number.out.gz

Annotate BAM files

samtools sort my_file.bam my_file
bedtools2/bin/bedtools bamtobed -i my_file.bam > my_file.bed
cat my_file.bed | wc -l
bedtools2/bin/bedtools intersect -a my_file.bed -b gencode_v19_exon_merged.bed.gz -u | wc -l
bedtools2/bin/bedtools intersect -a my_file.bed -b gencode_v19_intergenic.bed.gz -u | wc -l
bedtools2/bin/bedtools intersect -a my_file.bed -b gencode_v19_intron.bed.gz -u | wc -l

run.pl

I wrote run.pl to create exonic, intronic, and intergenic BED files as well as providing some simple statistics of each region.

run.pl Arabidopsis_thaliana.TAIR10.37.gtf.gz chrom_info/araTha1.genome
Creating exonic regions
Creating intronic regions
Creating intergenic regions

Coverage summary per region (percentage)

Exon: 39.99
Intron: 15.16
Intergenic: 44.84

Average length per region (bp)

Exon: 336.63
Intron: 157.37
Intergenic: 1951.60

Further reading

See assoicated blog post: https://davetang.org/muse/2013/01/18/defining-genomic-regions/

defining_genomic_regions's People

Contributors

davetang avatar sribi avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.