sheynkman-lab / biosurfer
"Surf" the biological network, from genome to transcriptome to proteome and back to gain insights into human disease biology.
License: MIT License
Features to implement
Need a class that represents disease mutations (genetic variants, Mendelian disease mutations, complex disease variants, cancer autosomal and somatic variants) that are mapped to `Position` and `Residue` objects.
Placeholder for module -> isomodules/isomut.py
Code that could be used as a start -> isomodules_in_progress/mutation*.py
Write code so that creation of each gene object can be done separately (use parallel python)
Same parallelization can be applied to the alignment of objects.
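One way the per-gene parallelization could look is sketched below with the standard-library `concurrent.futures` module. `build_gene` is a hypothetical stand-in for the real gene constructor (it is not a biosurfer function), and the sketch assumes each gene can be built from its own inputs without shared state.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def build_gene(gene_name):
    # Hypothetical stand-in for the real per-gene constructor.
    # It depends only on its own input, so genes can be built independently.
    return {"name": gene_name, "n_chars": len(gene_name)}


def build_genes(gene_names, builder=build_gene,
                executor_cls=ProcessPoolExecutor, max_workers=None):
    """Build one object per gene name, running the builder calls concurrently.

    Returns a dict keyed by gene name, in input order.
    """
    gene_names = list(gene_names)
    with executor_cls(max_workers=max_workers) as pool:
        # Executor.map preserves input order, so names and results line up.
        results = pool.map(builder, gene_names)
        return dict(zip(gene_names, results))
```

With the default `ProcessPoolExecutor`, the builder and the objects it returns must be picklable, and the call should sit under an `if __name__ == "__main__":` guard. Passing `executor_cls=ThreadPoolExecutor` avoids both constraints but only helps if building is I/O-bound. The same pattern would apply to alignments by mapping over ORF pairs instead of gene names.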
Need a word that represents new exon sequence that harbors a start codon...
pop up
spliced in
????
biosurfer/isomodules/isoimage.py
Lines 67 to 70 in aba9107
Currently - the isoimage code only takes in same-strand ORFs for plotting.
We need to think about what to do with antisense transcripts, which are prevalent in our long-read data, as well as GENCODE.
Need to discuss how to describe the cut-out splice
proposed hierarchy:
`Gene` owns `Transcript`s
`Transcript` owns `ORF`(s)
`ORF` owns `UTR`s (5' and 3')
`ORF` and `UTR` point to constituent `Exon`s and chain of `Position`s
`ORF` also points to constituent `CDS`s and chain of `Residue`s
`Exon` owns chain of `Position`s
`CDS` owns chain of `Residue`s
Jared idea - to rework the underlying code for representing features as ranges. He thinks this could allow for “lighter-weight” representation of objects. Potentially an optimization project after the “user interface” base code is working.
How was “requirements.txt” populated?
Thinking about language for the translational differences. If we say "frameshift", that is expected to mean a ribosomal "slip", stochastic or programmed, that causes a shift in the frame of translation for the same transcript.
I have not found a term that explicitly describes differences in the relative frame of translation between different isoforms. I can potentially ask the director of GENCODE about this.
We may need to say, “usage of a different translational frame”
Jared found a reference to an "exitron"
Should we name it that?
https://en.wikipedia.org/wiki/Exitron
We can consult with some experts that study the process.
Exitrons also can have frame-preserving or frame-shifted structures.
might try implementing this as sorting a list of tuples; if that doesn't work out, copy the optimized old code
biosurfer/isomodules/isoclass.py
Lines 15 to 19 in aba9107
mutations as lollipops (red=disease, green=benign, orange=variant of unknown significance)
biosurfer/isomodules/isoimage.py
Line 274 in 496be78
Essentially make Biomolecule and its subclasses "wrappers" for interval objects (see example)
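A minimal sketch of the wrapper idea follows. The `Interval` class, its half-open coordinate convention, and the attribute names are assumptions made for illustration; only the `Biomolecule` name comes from the note above, and this is not the existing biosurfer class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Half-open interval [start, end); the convention is an assumption."""
    start: int
    end: int

    def __len__(self):
        return self.end - self.start

    def overlaps(self, other):
        return self.start < other.end and other.start < self.end


class Biomolecule:
    """Thin wrapper that delegates all coordinate logic to an Interval."""

    def __init__(self, name, start, end):
        self.name = name
        self.interval = Interval(start, end)

    def __len__(self):
        return len(self.interval)

    def overlaps(self, other):
        return self.interval.overlaps(other.interval)


class Exon(Biomolecule):
    """Subclasses add biology-specific behavior; coordinates stay in Interval."""
```

Under this layout an exon stores two integers instead of a chain of per-position objects, which is where the "lighter-weight" benefit would come from.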
What is the best way to structure many different groupings between features, of different modes and sizes, and at different scales?
At the time of this post, making queries from iso-objects in biosurfer requires chained attribute accesses. For example, `align_groups[2].frmf.blocks[1].first.res.doms` might pull out the set of domains to which the first residue in a frameshifted region is mapped.
Would it be possible to let users make SQL-style queries? What are the pros and cons in terms of ease of code development, performance, user accessibility, etc?
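To make the question concrete, here is a small sketch of what an SQL-style query could look like using the standard-library `sqlite3` module. The table names, columns, the "frame 1 is the reference frame" encoding, and the toy rows are all invented for illustration and do not reflect any existing biosurfer schema.

```python
import sqlite3

# In-memory database with a hypothetical two-table schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE residue (orf TEXT, idx INTEGER, aa TEXT, frame INTEGER);
    CREATE TABLE residue_domain (orf TEXT, idx INTEGER, domain TEXT);
""")
# Toy data: residue 1 is in the reference frame (1), residues 2-3 are shifted.
conn.executemany("INSERT INTO residue VALUES (?, ?, ?, ?)", [
    ("orf_alt", 1, "M", 1),
    ("orf_alt", 2, "K", 2),
    ("orf_alt", 3, "L", 2),
])
conn.executemany("INSERT INTO residue_domain VALUES (?, ?, ?)", [
    ("orf_alt", 2, "domA"),
    ("orf_alt", 3, "domA"),
])


def domains_at_first_shifted_residue(conn, orf):
    """Domains mapped to the first residue that is out of the reference frame."""
    rows = conn.execute("""
        SELECT d.domain
        FROM residue_domain AS d
        WHERE d.orf = ?
          AND d.idx = (SELECT MIN(idx) FROM residue
                       WHERE orf = ? AND frame != 1)
        ORDER BY d.domain
    """, (orf, orf)).fetchall()
    return [domain for (domain,) in rows]
```

The query replaces the chained attribute access with a declarative statement, at the cost of keeping the tables in sync with the object model.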
Things that could make it easier for users and/or the annotation algorithm to identify splicing events that affect protein sequence:
Idea to think about - should we have an "annotation" class that serves as a flexible container for annotation information?
Annotation information examples:
Figure out how to add a table to the labels
To show data like this:
ETV2.pdf
show expression levels from short-read RNA-seq data
Possibility - class which represents the abundance of the isoform (or isoform sub-element, such as a junction or exon) in human tissues, cell lines, or disease-relevant samples.
Original code (now deleted) input the GTEx data and had transcript abundances across ~30 human tissues (see below).
This class may be helpful for comparing abundances of events (e.g., exon skipping) versus whole-isoforms. It may be helpful in comparing short-read-based versus long-read-based expression values.
It could also be used to plot expression visualizations in the isoform imager module. For example, Fairlie's Swan program does this.
```python
class Expression:  # class name assumed; the original class line was lost
    """Per-tissue expression values for one isoform."""

    def __init__(self, expr_dict):
        self.expr_dict = expr_dict  # tissue -> RPKM
        self.avg_expr = self.compute_avg_expr()

    def __getitem__(self, tiss):
        # Fetching by key (tissue) returns the expression value.
        return self.expr_dict[tiss]

    def compute_avg_expr(self):
        # Mean RPKM across tissues; values may arrive as strings.
        tot = 0
        for v in self.expr_dict.values():
            tot += float(v)
        return tot / len(self.expr_dict)
```
Think about how multiple features connected to the same ORF should be represented, so that one can readily know which features (type, and number) exist for each ORF/exon/residue
Are there naming conventions for split codons, i.e., codons whose amino acid spans two exons?
We could consult with Alain Laeder or Jason Underwood about this.
A way to visualize isoforms interactively? Like a google-map scheme? Can zoom in and out, automatically squeeze introns for whatever group of isoforms are being displayed. Jared suggests Bokeh or Plotly.
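The intron-squeezing part can be sketched independently of the plotting library. The function below is one possible approach, not existing biosurfer code: it pools exons from all displayed isoforms, caps every gap between merged exon blocks at `max_intron` display units, and returns a genomic-to-display coordinate mapper. The function name, the `(start, end)` tuple format, and the default cap are assumptions.

```python
def squeeze_introns(exons, max_intron=50):
    """Return a function mapping genomic positions to display positions.

    exons: (start, end) intervals pooled from all isoforms being displayed.
    Gaps between merged exon blocks longer than max_intron are drawn
    max_intron units wide; exonic sequence keeps its true length.
    """
    # Merge overlapping exons so only regions intronic in every isoform shrink.
    merged = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    if not merged:
        raise ValueError("at least one exon is required")

    # Record (genomic_start, genomic_end, display_start) for each block.
    blocks, cursor, prev_end = [], 0, None
    for start, end in merged:
        if prev_end is not None:
            cursor += min(start - prev_end, max_intron)
        blocks.append((start, end, cursor))
        cursor += end - start
        prev_end = end

    def to_display(pos):
        prev = None  # (genomic_end, display_end) of the previous block
        for start, end, disp in blocks:
            if pos < start:
                if prev is None:
                    return disp - (start - pos)  # upstream of first exon
                p_end, p_disp_end = prev
                # Inside a gap: interpolate linearly across the drawn width.
                return p_disp_end + (pos - p_end) * (disp - p_disp_end) / (start - p_end)
            if pos <= end:
                return disp + (pos - start)
            prev = (end, disp + (end - start))
        return prev[1] + (pos - prev[0])  # downstream of last exon

    return to_display
```

Recomputing the mapper whenever the displayed isoform group changes would give the "automatic" squeezing behavior; the resulting display coordinates could then be handed to Bokeh or Plotly.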
Isocreatefeat.create_and_map_domain -> if run more than once, redundant domains created
Alignments affected:
Occurs at exon junctions.
Implement subclasses for features - by downloading data and coding them in
types of features:
this is the name of the attr linked to the iso_obj:
(for example, orf.dom retrieves current_dom from orf.doms)
biosurfer/isomodules/isoimage.py
Line 262 in 496be78
I was looking at the code in which the Y-axis label is set. How is the physical width of the Y-axis label determined? If you submit a longer string, will the left-hand side of the canvas expand in response?
code to reproduce:
```python
gene = <gene object>
aln_grps = isocreatealign.create_and_map_splice_based_align_obj([[gene.repr_orf, orf] for orf in gene.other_orfs])
for aln_grp in aln_grps:
    isocreatefeat.create_and_map_frame_objects(aln_grp)
```
known examples:
Should we have a single module that holds utility functions or have several separate modules?
Currently - distance between tracks is 1. Need to find a way to dynamically determine the spacing of tracks (e.g., with domains and lollipop figures, one isoform may be taller than another)
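One simple way to space tracks dynamically is to stack them by their own heights instead of a fixed step. The sketch below assumes each track can report its total height (isoform bar plus any domains or lollipops) in plot units; `track_offsets` and `padding` are hypothetical names, not part of isoimage.

```python
def track_offsets(track_heights, padding=0.25):
    """Return the y-offset of the top of each track, stacked top to bottom.

    track_heights: total height of each track in plot units.
    padding: blank space inserted between consecutive tracks.
    """
    offsets = []
    y = 0.0
    for height in track_heights:
        offsets.append(y)
        y -= height + padding  # the next track starts below this one
    return offsets
```

With every height equal to 1 and `padding=0`, this reproduces the current fixed spacing of 1 between tracks.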
How long does it take on average to instantiate a full gene object? Versus load from a pickle?
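A rough way to answer this is to time both paths with `time.perf_counter`. In the sketch below, `build_gene` is whatever callable constructs a gene object (passed in by the caller; it is a placeholder, not a biosurfer function), and unpickling is measured from an in-memory byte string, so disk read time is not included.

```python
import pickle
import time


def time_call(func, *args, repeats=5):
    """Return the mean wall-clock seconds of func(*args) over `repeats` runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        func(*args)
    return (time.perf_counter() - start) / repeats


def compare_build_vs_pickle(build_gene, gene_name, repeats=5):
    """Time building a gene object versus unpickling an already-built one."""
    gene = build_gene(gene_name)
    blob = pickle.dumps(gene, protocol=pickle.HIGHEST_PROTOCOL)
    return {
        "build_s": time_call(build_gene, gene_name, repeats=repeats),
        "unpickle_s": time_call(pickle.loads, blob, repeats=repeats),
        "pickle_bytes": len(blob),
    }
```

Running this over a sample of genes of different sizes would give the average asked about above, along with the pickle size on disk.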
Will probably need to fix this while addressing #21.
Show the 5’UTR and 3’UTR as thin bars
E.g., UTR is light blue and thinner; CDS is darker blue and thicker
This can only happen after Transcript class is implemented (#21).
need to find examples
Legend for any shading/hatching/picture elements
Hatching - need a legend for the translational frame
Show a green bar for first methionine; and red bar for the stop codon
Need to indicate the APPRIS principal isoforms (e.g., asterisk next to transcript_name, and indicate what the asterisk means in the legend)
Include a statement (near bottom left or right is usually good) saying:
*GENCODE APPRIS principal isoform