
expr_shape's Introduction

expr_shape


JHPCE location: /dcl01/lieber/lcolladotor/recount2Misc_LIBD001/expr_shape

The code in this repository was used as inspiration/insight for the following project:

Incomplete annotation of OMIM genes is likely to be limiting the diagnostic yield of genetic testing, particularly for neurogenetic disorders. David Zhang, Sebastian Guelfi, Sonia Garcia Ruiz, Beatrice Costa, Regina H Reynolds, Karishma D'Sa, Wenfei Liu, Thomas Courtin, Amy Peterson, Andrew E Jaffe, John Hardy, Juan Botia, Leonardo Collado-Torres, Mina Ryten. bioRxiv 499103; doi: https://doi.org/10.1101/499103

If you use the code in this project, please cite the above paper. Thank you.

expr_shape's Issues

12.05.2017

These are the detailed notes that Amy Peterson (MPH student doing her MPH practicum with Leo) took

Ryten Meeting Notes 05/DEC/17: Assessing the Completeness of Annotation for Mendelian-disease causing Genes

  • Figure: Comparison of the number of ERs overlapping OMIM genes using different maximum gaps
    • Testis has the highest number of ERs
    • 2nd and 3rd samples are Brain-Cerebellum and Brain-Cerebellar Hemisphere
    • Princy's message (Battle lab): these are replicates of the same sample, done at different institutions, and this is reflected in the data
  • Decision to go with 50 kb; tissue differences are still seen using this cutoff
  • Seb: in the split-read data, which is where most novel exons will be, most are within 50 kb
    • More specificity than sensitivity
    • Get split reads that lead to differentially defined regions 1 Mb away, but there are few data points like that
  • Figure: Distribution of ER annotation types across tissues
    • Using max gap of 50 kb and non-overlapping OMIM genes
    • RPKM cutoff of 0.1 in 80% of samples
    • For an ER to be classified as intergenic, intron, or exon, it needs to lie completely within that type of region
    • For each tissue, RPKM was calculated and the 0.1 cutoff was applied as the mean across 80% of samples
    • “Unannotated: exon, intron” is the largest group across all samples; not sure whether this is high pre-mRNA at the exon boundary or something novel
    • GTEx protocol is poly(A)-enriched
  • Figure: proportion of ERs; distribution of ER annotation types across tissues
    • Differences between tissues in the intergenic and intron proportions do exist
    • Largest percentage of intron/intergenic regions in Brain-Cerebellum and Brain-Cerebellar Hemisphere
    • Testis is again one of the tissues with the largest proportion of intron and intergenic regions
  • Considering a threshold on how many annotations an “unannotated: exon, intron” ER covers, i.e., could filter out anything above 3 annotations in the future (exon; intron; exon, intron)
    • Possibly make a similar plot of the number of ERs with 3 annotations or more, to confirm they are “unannotated: exon, intergenic, intron” or “unannotated: exon, intron”
    • Filter ER by ER: an ER can overlap three different annotation elements (exon, intron, exon); if it goes beyond that, i.e., exon, intron, exon, intron again, want to ensure what is being filtered falls into the correct category
      • Above 3 annotations: the more annotations covered, the less believable the categorization is
    • Exon, intergenic, intron category (blue) is not a large portion of the data
    • Intronic: would be easiest to define with split reads, has immediate coding changes to it, and shows tissue differences in its proportion (as does intergenic)
  • Figure: ER frequency across tissues
    • RPKM cutoff applied tissue by tissue: for every ER within the 50 kb window, count how many tissues it passes the RPKM cutoff in. Histogram: each bar is an increase of one tissue; far right is 54, the complete number of tissues in GTEx
    • ER in all tissues
    • Bimodal: one peak where ERs fall within 1 or 2 tissues, and another peak at the maximum
    • 0 bar on the graph: ER value is passed in the recount data as a mean value for particular ERs
    • Normalize to the area under the curve and for the width of that ER, giving a mean-based coverage value for that region across all samples; would expect the highest value to be the one for ERs in all tissues
    • Bimodal distribution makes sense: things are either very useful or very tissue specific; most are at the tissue-specific end
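
The counting behind the ER frequency histogram can be sketched as below. This is only an illustration, not the code used for the figure: it assumes the cutoff means "RPKM of at least 0.1 in at least 80% of a tissue's samples", which is one reading of the notes, and the function names and toy values are invented.

```python
from collections import Counter

RPKM_CUTOFF = 0.1    # cutoff quoted in the notes
MIN_FRACTION = 0.8   # "80% of samples", as interpreted in this sketch


def passes_in_tissue(sample_rpkms, cutoff=RPKM_CUTOFF, min_fraction=MIN_FRACTION):
    """Return True if at least min_fraction of samples have RPKM >= cutoff."""
    if not sample_rpkms:
        return False
    n_pass = sum(1 for value in sample_rpkms if value >= cutoff)
    return n_pass / len(sample_rpkms) >= min_fraction


def tissues_expressed(rpkm_by_tissue):
    """Count the tissues in which one ER passes the cutoff."""
    return sum(passes_in_tissue(values) for values in rpkm_by_tissue.values())


def frequency_histogram(ers):
    """Map 'number of tissues' to 'number of ERs', i.e. the histogram bars."""
    return Counter(tissues_expressed(tissues) for tissues in ers.values())


# Toy data (invented): ER id -> tissue -> per-sample RPKM values.
toy_ers = {
    "ER1": {"testis": [0.5, 0.4, 0.3, 0.2, 0.9],
            "amygdala": [0.0, 0.05, 0.2, 0.0, 0.0]},
    "ER2": {"testis": [1.2, 0.8, 0.7, 0.6, 0.05],
            "amygdala": [0.3, 0.2, 0.4, 0.15, 0.11]},
    "ER3": {"testis": [0.0, 0.0, 0.0, 0.0, 0.0],
            "amygdala": [0.01, 0.0, 0.0, 0.02, 0.0]},
}
histogram = frequency_histogram(toy_ers)
```

With the real data, the keys of `histogram` would run from 0 up to the total number of tissues, and a bimodal shape would show up as large counts at both ends.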
  • Cluster on the basis of known exonic region and cluster on basis of unannotated regions and see if you have different regions
    • Hope to see similarities
    • Do novel expressed regions separate out tissues better than annotated regions?
    • Count one of them, have more tissue specific ERs
  • Looking at tissue list, seem to be many duplications (skin samples, cortical regions)
    • Can use ER frequency across tissues for checking that the tissues are similar as they should be but also making statements about x axis
    • Uniqueness or how shared they are; might be better to combine and remove some of the duplication
  • Figure: Distribution of ER RPKM
    • Width of ER (bp) vs. RPKM: across ER widths, if there were a clear peak, or if any of the distributions resembled each other, might be able to say that ERs of a given width have high RPKM relative to others, and so prioritize some of the ERs that resemble the exons as being real
    • Outliers on the RPKM scale are not in the exon panel but in the bottom-right two groups: unannotated exon, intergenic, intron and unannotated exon, intron
    • Source of outliers? Mapping issue?
      • ERs, mappability: possibly in sections of the genome that are highly repetitive; possible need for filtering
      • Wouldn’t expect ribosomal genes to show up in the intronic group; could be a function of the annotation if the ribosomal genes are not considered exons
      • Top 2 pink points on the intron graph: maybe plot them and look in a genome browser
    • Filtering out by 3 annotation elements, if one is an exon, might lose the high RPKM values in that quadrant of the graph
  • Distribution of RPKM across ER width and distance (from nearest exon) – amygdala
    • Y axis: distance calculated for the intergenic regions to the nearest gene, and then calculated distance from there
    • At a certain distance from the gene, and at a particular width, there could have been a cluster or something with a relatively high RPKM value; that is what was being looked for, but nothing really stood out: a random distribution of RPKM across width and distance
    • Similar results across all the tissues that were looked at
  • Make the plot again but, instead of coloring by RPKM, color by the number of tissues where the ER is present (maybe ERs present in more tissues)
    • Hard to see from the current graph what percent of the points are above the y = 0 line
    • Everything that isn’t intergenic will have a distance of 0: anything that overlaps an existing annotation or gene will be classed as distance 0, and only intergenic regions that don’t overlap will have a positive distance
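
The distance rule in the last bullet can be sketched as follows. This is a minimal illustration under assumptions not stated in the notes: coordinates are 1-based and inclusive, all intervals are on one chromosome, and the distance is the plain difference between the closest ends. The function name and toy coordinates are hypothetical.

```python
def distance_to_nearest_gene(er_start, er_end, genes):
    """Distance in bp from an ER to the nearest gene on the same chromosome.

    An ER that overlaps any gene gets distance 0. Otherwise the distance is
    the difference between the closest ER and gene ends. Returns None when
    no genes are given.
    """
    best = None
    for gene_start, gene_end in genes:
        if er_end < gene_start:        # ER lies entirely before the gene
            gap = gene_start - er_end
        elif gene_end < er_start:      # ER lies entirely after the gene
            gap = er_start - gene_end
        else:                          # any overlap counts as distance 0
            gap = 0
        if best is None or gap < best:
            best = gap
    return best


# Toy gene annotation (invented coordinates).
toy_genes = [(1000, 2000), (5000, 6000)]

inside_gene = distance_to_nearest_gene(1500, 1600, toy_genes)
intergenic = distance_to_nearest_gene(2500, 2600, toy_genes)
partial_overlap = distance_to_nearest_gene(4900, 5100, toy_genes)
```

Only `intergenic` gets a positive value here, which is why most points in such a plot sit on the y = 0 line.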
  • Leo: CNVnator paper (Abyzov et al.)
    • Software for finding CNVs
    • CNVnator uses coverage, in what is called read-depth analysis: it computes different windows and summarizes the data for each window
    • Possibly could be modified so that long ERs are broken up into smaller ones; for the whole genome it would take too long, and we would need to use a smaller window size than what they used
    • Issue with the software itself: it takes BAM/SAM files as input
      • Makes it complicated to use; would need a creative way of exporting the alignments for ERs in particular
    • Code might not be that easy to adapt to R; one possibility to try to look into
    • Adapted methods from signal processing to compute read-depth analysis
      • Could use by making a modified BAM file
    • CNVnator works one file at a time
    • Mark Gerstein: part of the PsychENCODE consortium
  • Poly(A) in GTEx: Andrew has seen that using derfinder with Ribo-Zero data gives more intron ERs, a significant percent increase
    • With Ribo-Zero data, see exons highly expressed, and then a portion in the middle where it is unclear what is going on (“expression blocks and noise” slide, practicum ppt); basically 3 states
    • Another idea: try to fit a hidden Markov model with 3 states, trying to compute the fold change difference between one subsection and the previous one
      • Want to find the pieces where the breakpoints have a large change
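
The fold-change part of that last idea can be sketched without the hidden Markov model itself. The code below is not an HMM and not an existing method from this project. It only summarizes coverage in fixed windows and flags boundaries where the log2 fold change between one window and the previous one is large. The window size, threshold, pseudocount, and toy coverage are all assumptions chosen for illustration.

```python
import math


def window_means(coverage, window_size):
    """Mean coverage for consecutive, non-overlapping windows."""
    means = []
    for start in range(0, len(coverage), window_size):
        window = coverage[start:start + window_size]
        means.append(sum(window) / len(window))
    return means


def breakpoints(coverage, window_size=5, min_abs_log2fc=2.0, pseudocount=1.0):
    """Return (position, log2 fold change) pairs for large changes.

    Each window is compared with the previous one. The pseudocount avoids
    division by zero in windows with no coverage.
    """
    means = window_means(coverage, window_size)
    hits = []
    for idx in range(1, len(means)):
        ratio = (means[idx] + pseudocount) / (means[idx - 1] + pseudocount)
        log2fc = math.log2(ratio)
        if abs(log2fc) >= min_abs_log2fc:
            hits.append((idx * window_size, log2fc))
    return hits


# Toy per-base coverage with three levels: high, intermediate, none.
toy_coverage = [100] * 10 + [10] * 10 + [0] * 10
hits = breakpoints(toy_coverage)
positions = [position for position, _ in hits]
```

On the toy data the two flagged positions are the boundaries between the three coverage levels. A fitted 3-state model would replace the fixed threshold with learned state transitions.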
