expr_shape's Introduction

expr_shape

JHPCE location: /dcl01/lieber/lcolladotor/recount2Misc_LIBD001/expr_shape

The code in this repository was used as inspiration/insight for the following project:

Incomplete annotation of OMIM genes is likely to be limiting the diagnostic yield of genetic testing, particularly for neurogenetic disorders. David Zhang, Sebastian Guelfi, Sonia Garcia Ruiz, Beatrice Costa, Regina H Reynolds, Karishma D'Sa, Wenfei Liu, Thomas Courtin, Amy Peterson, Andrew E Jaffe, John Hardy, Juan Botia, Leonardo Collado-Torres, Mina Ryten. bioRxiv 499103; doi: https://doi.org/10.1101/499103

If you use the code in this project, please cite the above paper. Thank you.

expr_shape's Issues

ERs for 2 subtissues missing

Check why 2 sub-tissues failed to run: https://github.com/LieberInstitute/expr_shape/blob/master/subtissue_ers.R

12.05.2017

These are the detailed notes that Amy Peterson (MPH student doing her MPH practicum with Leo) took.

Ryten Meeting Notes 05/DEC/17: Assessing the Completeness of Annotation for Mendelian-disease causing Genes

Figure: Comparison of the number of ERs overlapping OMIM genes using different maximum gaps
- Testis has the highest number of ERs
- 2nd and 3rd highest are Brain-Cerebellum and Brain-Cerebellar Hemisphere
- Princy's message (Battle lab): replicates of the sample, done at different institutions, are reflected in the data
Decision to go with 50 kb; tissue differences are still seen using this cutoff
Seb: split read data, which is where most of the novel exons will be, falls mostly within 50 kb
- More specificity than sensitivity
- Some split reads lead to differentially defined regions 1 Mb away, but there are few data points like that
Figure: Distribution of ER annotation types across tissues
- Using max gap of 50 kb and non-overlapping OMIM genes
- RPKM cutoff of 0.1 in 80% of samples
- For an ER to be classified as intergenic, intron, or exon, it needs to lie completely within that feature
- For each tissue, RPKM was calculated and the 0.1 cutoff was applied as the mean across 80% of samples
- “Unannotated: exon, intron” is the largest group across all samples; not sure if this is high pre-mRNA at the boundary of the exon or something novel
- GTEx protocol is poly-A enriched
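The RPKM filter described in the bullets above can be sketched as follows. This is a minimal Python sketch, not the repository's R code: the function names `rpkm` and `passes_cutoff` are hypothetical, and it assumes one reading of the notes, namely that an ER is kept for a tissue when its RPKM is at least 0.1 in at least 80% of that tissue's samples.

```python
def rpkm(read_count, width_bp, library_size):
    """Reads per kilobase of region per million mapped reads."""
    return read_count * 1e9 / (width_bp * library_size)


def passes_cutoff(rpkm_values, cutoff=0.1, min_fraction=0.8):
    """Return True if enough samples of one tissue reach the RPKM cutoff.

    Assumed interpretation of the notes: RPKM >= cutoff in at least
    `min_fraction` of the samples of that tissue.
    """
    if not rpkm_values:
        return False
    n_pass = sum(1 for value in rpkm_values if value >= cutoff)
    return n_pass / len(rpkm_values) >= min_fraction


# Toy example: RPKM of one ER in five samples of one tissue.
er_rpkm_in_tissue = [0.2, 0.3, 0.15, 0.5, 0.05]
keep = passes_cutoff(er_rpkm_in_tissue)
```

If the intended rule is instead a cutoff on the mean RPKM, only `passes_cutoff` would need to change.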
Figure: Proportion of ERs; distribution of ER annotation types across tissues
- Differences in the intergenic and intron regions do exist
- Largest percentage of intron/intergenic regions is in Brain-Cerebellum and Brain-Cerebellar Hemisphere
- Testis is again one of the tissues with the largest share of intron and intergenic regions
Considering a threshold on how many annotations an ER in “unannotated: exon, intron” covers, i.e. could filter out anything above 3 annotations in the future (exon; intron; exon, intron)
- Possibly make a similar plot of the number of ERs that have 3 annotations or more, to confirm they are “unannotated: exon, intergenic, intron” or “unannotated: exon, intron”
- Filter ER by ER: an ER may overlap three different elements of annotation (exon, intron, exon); if it goes beyond that, i.e. exon, intron, exon, intron, again want to ensure that what is being filtered falls into the correct category
  - Above 3 annotations: the more annotations covered, the less believable that categorization is
- “Exon, intergenic, intron” category (blue) is not a large portion of the data
- Intronic: would be easiest to define with split reads; has immediate coding changes to it; there are tissue differences in its proportion (and in intergenic also)
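The proposed filter on the number of annotation elements can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the names `overlapped_elements` and `keep_er`, the tuple layout of the features, and the closed 1-based coordinates are all hypothetical and are not taken from the repository.

```python
def overlapped_elements(er_start, er_end, features):
    """List the annotation elements an ER overlaps, in genomic order.

    `features` is an iterable of (start, end, kind) tuples with closed
    coordinates on the same chromosome as the ER (assumed layout).
    """
    return [
        kind
        for start, end, kind in sorted(features)
        if start <= er_end and end >= er_start
    ]


def keep_er(er_start, er_end, features, max_elements=3):
    """Keep an ER only if it spans at most `max_elements` elements."""
    return len(overlapped_elements(er_start, er_end, features)) <= max_elements


# Toy gene model: exon, intron, exon, intron.
toy_features = [
    (100, 200, "exon"),
    (201, 500, "intron"),
    (501, 600, "exon"),
    (601, 900, "intron"),
]
spans_three = keep_er(150, 550, toy_features)  # exon, intron, exon
spans_four = keep_er(150, 650, toy_features)   # exon, intron, exon, intron
```

Returning the ordered list of elements, rather than only a count, makes it possible to check which category a filtered ER would have fallen into.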
Figure: ER frequency across tissues
- RPKM cutoff applied tissue by tissue: for every ER within the 50 kb window, count how many tissues it passes the RPKM cutoff in. Histogram: each bar is an increase of one tissue; the far right is 54, the complete number of tissues in GTEx
- ER in all tissues
- Bimodal: one peak where ERs fall within 1 or 2 tissues, and another peak at the maximum
- 0 bar on the graph – ER value is passed in recount data as mean value with particular ERs
- Normalize to the area under the curve and for the width of that ER, then the mean-based coverage value for that region across all samples; would expect the highest value to be the one with the ER in all tissues
- Bimodal distribution makes sense: things are either very useful or very tissue specific; most are at the tissue-specific end
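The counting behind this histogram can be sketched in Python as below. The names `tissues_per_er` and `frequency_histogram`, and the toy ER and tissue identifiers, are hypothetical; the sketch assumes that the set of ERs passing the RPKM cutoff has already been computed for each tissue.

```python
from collections import Counter


def tissues_per_er(passing_by_tissue, all_ers):
    """Count, for every ER, the number of tissues in which it passes.

    `passing_by_tissue` maps a tissue name to the set of ER ids that
    pass the cutoff in that tissue. ERs passing nowhere get a count of 0.
    """
    counts = Counter({er: 0 for er in all_ers})
    for ers in passing_by_tissue.values():
        counts.update(ers)
    return counts


def frequency_histogram(counts):
    """Map 'number of tissues' to 'number of ERs with that count'."""
    return Counter(counts.values())


passing = {
    "testis": {"er1", "er2", "er3"},
    "cerebellum": {"er1", "er2"},
    "amygdala": {"er1"},
}
per_er = tissues_per_er(passing, all_ers={"er1", "er2", "er3", "er4"})
histogram = frequency_histogram(per_er)
```

In this toy input `er4` passes in no tissue, so it lands in the 0 bar.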
Cluster on the basis of known exonic regions, cluster on the basis of unannotated regions, and see if you have different regions
- Hope to see similarities
- Do novel expressed regions separate out tissues better than annotated regions?
- Count one of them, have more tissue specific ERs
Looking at the tissue list, there seem to be many duplications (skin samples, cortical regions)
- Can use ER frequency across tissues to check that the tissues are as similar as they should be, but also to make statements about the x axis
- Uniqueness or how shared they are; might be better to combine and remove some of the duplication
Figure: Distribution of ER RPKM
- Width of ER (bp) vs. RPKM: if there was a clear peak across ER widths, or if any of the distributions resembled each other, might be able to say that an ER of a given width with high RPKM in relation to another is more likely to be real, and prioritize some of the ERs relative to the exons
- Outliers on the RPKM scale are not in the exon one; they are in the bottom right two groups: “unannotated: exon, intergenic, intron” and “unannotated: exon, intron”
- Source of outliers? Mapping issue?
  - Mappability: ERs possibly in sections of the genome that are highly repetitive; possible need for filtering
  - Wouldn’t expect ribosomals to show up in the intronic group; could be a function of the annotation if they are not considered exons
  - Top 2 pink points on the intron graph: maybe plot them and look in a genome browser
- Filtering out by 3 annotation elements: if one is an exon, might lose high RPKM values in that quadrant of the graph
Figure: Distribution of RPKM across ER width and distance (from nearest exon), amygdala
- Y axis: distance calculated for the intergenic regions to the nearest gene, and then calculated distance from there
- Was looking for a cluster, at a certain distance from the gene and of a particular width, with relatively high RPKM values, but nothing really stood out; RPKM looks randomly distributed across width and distance
- Similar results across all the tissues that were looked at
Make the plot again and, instead of coloring by RPKM, color by the number of tissues where the ER is present; maybe ERs present in more tissues
- Hard to see from the current graph what percent of the points are above the y = 0 line
- Everything that isn’t intergenic will have a distance of 0: anything that overlaps with an existing annotation or gene will be classed as distance 0; only intergenic regions that don’t overlap will have a positive distance above 0
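The distance convention in these notes (0 for anything that overlaps a gene, a positive gap otherwise) can be made concrete with a small Python sketch. The function name `distance_to_nearest_gene` and the closed-interval tuples are hypothetical; this is an illustration, not the code used to make the plot.

```python
def distance_to_nearest_gene(er_start, er_end, genes):
    """Distance in bp from an ER to the nearest gene on the same chromosome.

    `genes` is an iterable of (start, end) tuples. An ER that overlaps
    any gene gets distance 0. Returns None if no genes are given.
    """
    best = None
    for gene_start, gene_end in genes:
        if gene_start <= er_end and gene_end >= er_start:
            return 0  # overlap: classed as distance 0
        if gene_start > er_end:
            gap = gene_start - er_end  # gene lies downstream of the ER
        else:
            gap = er_start - gene_end  # gene lies upstream of the ER
        if best is None or gap < best:
            best = gap
    return best


toy_genes = [(1000, 2000), (10000, 12000)]
overlapping = distance_to_nearest_gene(1500, 2500, toy_genes)
intergenic = distance_to_nearest_gene(3000, 3500, toy_genes)
within_50kb = intergenic is not None and intergenic <= 50_000
```

The last line shows how the same distance could be compared against the 50 kb maximum gap chosen earlier in the meeting.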
Leo: CNVnator paper (Abyzov et al.)
- Software for finding CNVs
- CNVnator uses coverage (called read-depth analysis), computes different windows, and summarizes the data for each window
- Possibly could be modified so that long ERs are broken up into smaller ones; for the whole genome it would take too long, and we need to use a smaller window size than what they used
- Issue with the software itself: takes BAM/SAM files as input
  - Makes it complicated to use; would need a creative way of exporting the alignments for ERs in particular
- Code might not be that easy to adapt to R; one possibility to try to look into
- Adapted methods from signal processing to compute read-depth analysis
  - Could use by making a modified BAM file
- CNVnator works one file at a time
- Mark Gerstein: part of the PsychENCODE consortium
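The idea of breaking a long ER into windows and summarizing the coverage in each one can be sketched in Python as below. This illustrates only the windowing step mentioned in the notes; it is not CNVnator, and the name `window_means` and the toy coverage vector are hypothetical.

```python
def window_means(coverage, window):
    """Mean coverage in consecutive, non-overlapping windows.

    `coverage` is a list of per-base coverage values for one ER.
    The last window may be shorter than `window`.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    means = []
    for start in range(0, len(coverage), window):
        chunk = coverage[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means


toy_coverage = [1, 1, 1, 1, 5, 5, 5, 5, 2, 2]
summary = window_means(toy_coverage, window=4)
```

Working from a per-base coverage vector avoids the BAM/SAM input issue raised above, at the cost of not reusing CNVnator's own code.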
Poly-A in GTEx: Andrew has seen that if you use derfinder with Ribo-Zero data, you get more intron ERs, a significant percent increase
- With Ribo-Zero data you see highly expressed exons, and then a portion in the middle where it is unclear what is going on (“expression blocks and noise” slide, practicum ppt); basically 3 states
- Another idea: try to fit a hidden Markov model with 3 states, trying to compute the fold change difference between one subsection and the previous one
  - Want to find the pieces where the breakpoints have a large change
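The fold-change part of this idea can be sketched in Python as below. This is not a hidden Markov model; it only computes the log2 fold change between each subsection and the previous one and flags large changes. The names `log2_fold_changes` and `breakpoints`, the pseudocount of 1, and the threshold of 1 (a two-fold change) are assumptions made for illustration.

```python
import math


def log2_fold_changes(segment_means, pseudocount=1.0):
    """log2 fold change of each subsection relative to the previous one.

    The pseudocount (an assumed choice) avoids division by zero for
    subsections with no coverage.
    """
    return [
        math.log2((current + pseudocount) / (previous + pseudocount))
        for previous, current in zip(segment_means, segment_means[1:])
    ]


def breakpoints(segment_means, min_abs_log2fc=1.0, pseudocount=1.0):
    """Indices of subsections whose coverage changes sharply from the previous one."""
    changes = log2_fold_changes(segment_means, pseudocount)
    return [
        index + 1
        for index, change in enumerate(changes)
        if abs(change) >= min_abs_log2fc
    ]


# Toy mean coverage: a high block, a low block, then high again.
toy_means = [31, 31, 3, 3, 31]
changes = log2_fold_changes(toy_means)
flagged = breakpoints(toy_means)
```

A 3-state hidden Markov model would replace the fixed threshold with states learned from the data.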
