sheynkman-lab / biosurfer

"Surf" the biological network, from genome to transcriptome to proteome and back to gain insights into human disease biology.

License: MIT License

Python 100.00%

biosurfer's Issues

Generate table of peptide sequence differences between isoforms along with underlying transcript changes

example:

Features to implement for isoform plotting

Features to implement

skinny bars for UTRs, thick bars for CDSs
mutations as lollipops (red=disease, green=benign, orange=variant of unknown significance)
show expression levels from short-read RNA-seq data

Implementation of disease mutation class

Need a class that represents disease mutations (genetic variants, Mendelian disease mutations, complex disease variants, cancer autosomal and somatic variants) mapped to Position and Residue objects.

Placeholder for module -> isomodules/isomut.py

Code that could be used as a start -> isomodules_in_progress/mutation*.py

IsoformPlot should try to automatically select the APPRIS isoform as the reference

Python parallel - instantiation of gene objects, and alignment

Write code so that creation of each gene object can be done separately (use parallel python)

Same parallelization can be applied to the alignment of objects.
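
A minimal sketch of the idea, assuming each gene can be built from its own records without touching any other gene. `build_gene`, `RECORDS`, and `build_genes_parallel` are hypothetical stand-ins, not biosurfer functions, and the standard-library `concurrent.futures` module is swapped in here for the Parallel Python package mentioned above.

```python
from concurrent.futures import ProcessPoolExecutor

# Hypothetical per-gene input: gene name -> list of (transcript name, sequence).
RECORDS = {
    "GENE_A": [("GENE_A-201", "ATGGCC"), ("GENE_A-202", "ATGGCCTAA")],
    "GENE_B": [("GENE_B-201", "ATGTTT")],
}


def build_gene(item):
    """Stand-in for gene-object creation; uses only this gene's own records."""
    name, transcripts = item
    return {"name": name, "orf_lengths": {t: len(seq) for t, seq in transcripts}}


def build_genes_parallel(records, executor_cls=ProcessPoolExecutor, max_workers=2):
    # One task per gene; pool.map returns results in the order of the input.
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(build_gene, records.items()))
```

When the default process pool is used, the call should sit under an `if __name__ == "__main__":` guard. The same pattern would apply to alignment if each (anchor ORF, other ORF) pair can be aligned independently.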

Need a word that represents new exon sequence that harbors a start codon...

pop up
spliced in
????

How to deal with antisense transcripts

biosurfer/isomodules/isoimage.py

Lines 67 to 70 in aba9107

    strand = {orf.strand for orf in self.orfs}
    if len(strand) > 1:
        raise ValueError("Can't plot isoforms from different strands")
    self.strand: Strand = list(strand)[0]

Currently, the isoimage code only accepts ORFs from the same strand for plotting.
We need to think about what to do with antisense transcripts, which are prevalent in our long-read data as well as in GENCODE.

Record and display whether GENCODE ORFs have "start/end not found" tags

Nomenclature - "cut out splice"

Need to discuss how to describe the cut-out splice

Add transcripts and UTRs to isoclass hierarchy

proposed hierarchy:

Gene owns Transcripts
Transcript owns ORF(s)
ORF owns UTRs (5' and 3')
ORF and UTR point to constituent Exons and chain of Positions
ORF also points to constituent CDSs and chain of Residues
Exon owns chain of Positions
CDS owns chain of Residues
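
A rough sketch of how the proposed hierarchy might look as dataclasses. The class names follow the proposal, but every field name here is an assumption made for illustration and does not reflect the current isoclass code. "Owns" is modelled as the object that creates and holds the list; "points to" is modelled as a list of references to objects owned elsewhere.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Position:
    coord: int  # genomic coordinate
    nt: str     # nucleotide at that coordinate


@dataclass
class Residue:
    aa: str
    codon: List[Position] = field(default_factory=list)  # references into Exons


@dataclass
class Exon:
    positions: List[Position] = field(default_factory=list)  # owned


@dataclass
class CDS:
    residues: List[Residue] = field(default_factory=list)  # owned


@dataclass
class UTR:
    kind: str  # "5'" or "3'"
    exons: List[Exon] = field(default_factory=list)          # references
    positions: List[Position] = field(default_factory=list)  # references


@dataclass
class ORF:
    name: str
    utrs: List[UTR] = field(default_factory=list)   # owned
    exons: List[Exon] = field(default_factory=list)  # references
    cdss: List[CDS] = field(default_factory=list)    # references

    @property
    def positions(self) -> List[Position]:
        return [p for exon in self.exons for p in exon.positions]

    @property
    def residues(self) -> List[Residue]:
        return [r for cds in self.cdss for r in cds.residues]


@dataclass
class Transcript:
    name: str
    orfs: List[ORF] = field(default_factory=list)  # owned


@dataclass
class Gene:
    name: str
    transcripts: List[Transcript] = field(default_factory=list)  # owned


# Toy example: one exon encoding Met-Ala.
exon = Exon([Position(100 + i, nt) for i, nt in enumerate("ATGGCC")])
cds = CDS([Residue("M", exon.positions[0:3]), Residue("A", exon.positions[3:6])])
orf = ORF("toy-201", exons=[exon], cdss=[cds])
gene = Gene("TOY", [Transcript("toy-201", [orf])])
```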

Idea for optimizing feature representation

Jared's idea: rework the underlying code to represent features as ranges. He thinks this could allow for a “lighter-weight” representation of objects. Potentially an optimization project after the “user interface” base code is working.
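
One way to picture the range idea, as a sketch only: a feature stores a start and stop index on its ORF instead of holding one object per residue. `FeatureRange` and its fields are hypothetical names, not part of biosurfer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureRange:
    name: str
    start: int  # 0-based, inclusive
    stop: int   # exclusive

    def __len__(self) -> int:
        return self.stop - self.start

    def overlaps(self, other: "FeatureRange") -> bool:
        # Two half-open ranges overlap when each starts before the other ends.
        return self.start < other.stop and other.start < self.stop

    def residues(self, protein_seq: str) -> str:
        # Residues are looked up on demand rather than stored on the feature.
        return protein_seq[self.start:self.stop]


dom = FeatureRange("dbd", 2, 6)
idr = FeatureRange("idr", 5, 9)
```

With this layout, questions such as "does this domain overlap that disordered region" become integer comparisons (`dom.overlaps(idr)`), at the cost of losing direct per-residue links.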

How was “requirements.txt” populated?

Nomenclature - "frameshift"

Thinking about language for the translational differences. If we say "frameshift", readers will expect it to mean a ribosomal "slip", stochastic or programmed, that shifts the frame of translation within the same transcript.

I have not found a term that describes differences in the relative frame of translation between different isoforms. I can potentially ask the director of GENCODE about this.
We may need to say “usage of a different translational frame”.

https://en.wikipedia.org/wiki/Ribosomal_frameshift

Naming of splicing that occurs internal to an exon

Jared found a reference to an "exitron"

Should we name it that?

https://en.wikipedia.org/wiki/Exitron

We can consult with some experts that study the process.

Exitrons can also be either frame-preserving or frame-shifting.

Word for alternative 5' end with/without start codon

Need a word for this situation:

Reimplement splice-aware isoform alignment

might try implementing this as sorting a list of tuples; if that doesn't work out, copy the optimized old code

Add type checking to other modules

biosurfer/isomodules/isoclass.py

Lines 15 to 19 in aba9107

    from typing import TYPE_CHECKING, List, Set, Literal, Optional
    if TYPE_CHECKING:
        from .isoalign import Alignment
        from .isogroup import Group
        from .isofeature import Feature

Display disease mutations

mutations as lollipops (red=disease, green=benign, orange=variant of unknown significance)

Evaluate the necessity of outputting alignment group object during isoform plotting?

biosurfer/isomodules/isoimage.py

Line 274 in 496be78

  Returns list of PairwiseAlignmentGroups between anchor ORF and each of the other ORFs. 

Can BioCantor or a similar package be used to implement the iso-object hierarchy?

Essentially make Biomolecule and its subclasses "wrappers" for interval objects (see example)

Design of groupings between objects?

What is the best way to structure many different groupings between features, of different modes and sizes, and at different scales?

Ideas for encoding context and state-specific information

Starting a thread here for ideas on how to encode group-specific information efficiently:

How should users interact with and query from the "universe" of iso-objects?

At the time of this post, making queries from iso-objects in biosurfer requires chained attribute accesses. For example, align_groups[2].frmf.blocks[1].first.res.doms might pull out the set of domains to which the first residue in a frameshifted region is mapped.

Would it be possible to let users make SQL-style queries? What are the pros and cons in terms of ease of code development, performance, user accessibility, etc?
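
To make the question concrete, here is a toy sketch using the standard-library `sqlite3` module. The table layout (`residue`, `domain`, `residue_domain`) and the `frame` column are assumptions invented for this example; biosurfer does not currently store its objects this way. The query asks the same thing as the chained access above: which domains are mapped to the first residue of a frameshifted region.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE residue (id INTEGER PRIMARY KEY, orf TEXT, idx INTEGER, aa TEXT, frame INTEGER);
CREATE TABLE domain (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE residue_domain (residue_id INTEGER, domain_id INTEGER);
""")
conn.executemany(
    "INSERT INTO residue VALUES (?, ?, ?, ?, ?)",
    [
        (1, "toy-202", 1, "M", 1),  # in the reference frame
        (2, "toy-202", 2, "K", 2),  # first residue in a different frame
        (3, "toy-202", 3, "T", 2),
    ],
)
conn.executemany("INSERT INTO domain VALUES (?, ?)", [(1, "DBD"), (2, "kinase")])
conn.executemany("INSERT INTO residue_domain VALUES (?, ?)", [(2, 1), (3, 1), (3, 2)])

# Domains mapped to the first residue (lowest idx) that is not in frame 1.
rows = conn.execute(
    """
    SELECT d.name
    FROM residue r
    JOIN residue_domain rd ON rd.residue_id = r.id
    JOIN domain d ON d.id = rd.domain_id
    WHERE r.orf = ?
      AND r.idx = (SELECT MIN(idx) FROM residue WHERE orf = ? AND frame != 1)
    ORDER BY d.name
    """,
    ("toy-202", "toy-202"),
).fetchall()
first_frameshifted_domains = [name for (name,) in rows]
```

The trade-off this illustrates: the query is declarative and does not depend on attribute names, but every iso-object relationship would have to be mirrored in tables and kept in sync with the in-memory objects.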

Enhance information in Alignment objects and their full string representations

Things that could make it easier for users and/or the annotation algorithm to identify splicing events that affect protein sequence:

have Residue objects that correspond to stop codons
replace EmptyResidue objects with positions (0-length ranges) within ORFs
explicitly show the presence of introns in full string representation of alignment

Annotation class to hold meta-data from databases (e.g., GO terms)

Idea to think about: should we have an "annotation" class that serves as a flexible container for annotation information?

Annotation information examples:

GO term for a gene
More information about a domain
A paper or series of studies that show functionality for a particular isoform

Table to the y-axis labels

Figure out how to add a table to the labels

To show data like this:
ETV2.pdf

Display isoform expression data

show expression levels from short-read RNA-seq data

Possibility - Expression class to describe the abundance of isoform and elements

Possibility - class which represents the abundance of the isoform (or isoform sub-element, such as a junction or exon) in human tissues, cell lines, or disease-relevant samples.

Original code (now deleted) input the GTEx data and had transcript abundances across ~30 human tissues (see below).

This class may be helpful for comparing abundances of events (e.g., exon skipping) versus whole-isoforms. It may be helpful in comparing short-read-based versus long-read-based expression values.

It could also be used to plot expression visualizations in the isoform imager module. Fairlie's Swan program has an example of this.

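class Expression:  # class name assumed; the original header line was not preserved
    def __init__(self, expr_dict):
        self.expr_dict = expr_dict  # tissue -> RPKM
        self.avg_expr = self.compute_avg_expr()

    def __getitem__(self, tiss):
        # Fetching by key (tissue) returns that tissue's expression value
        return self.expr_dict[tiss]

    def compute_avg_expr(self):
        # Mean expression across tissues; values may arrive as strings
        if not self.expr_dict:
            return 0.0  # avoid dividing by zero when no tissues are given
        return sum(float(v) for v in self.expr_dict.values()) / len(self.expr_dict)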

Representation of many, heterogeneous features linked to an ORF

Think about how multiple features connected to the same ORF should be represented, so that one can readily know which features (type, and number) exist for each ORF/exon/residue

Nomenclature: Frameshift in the middle of the protein

Come up with a “name” for this pattern, frameshift-then-shift-back-into-register

Naming of split-codons associated with AAs

Are there naming conventions for split-codons associated with AAs that span two exons?

We could consult with Alain Laeder. or Jason Underwood about this.

Future - interactive isoform visualization?

Is there a way to visualize isoforms interactively, in the style of Google Maps? Users could zoom in and out, with introns automatically squeezed for whatever group of isoforms is being displayed. Jared suggests Bokeh or Plotly.

Redundant domain creation in create_and_map_domain function

isocreatefeat.create_and_map_domain: if run more than once, redundant domains are created
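
One possible guard, sketched under the assumption that a domain can be identified by its name and residue range on the ORF. `ToyORF` and `create_and_map_domain_once` are hypothetical and do not mirror the real function's signature.

```python
class ToyORF:
    """Hypothetical stand-in for an ORF object that tracks its domains."""

    def __init__(self, name):
        self.name = name
        self.doms = {}  # (domain name, start, end) -> domain record


def create_and_map_domain_once(orf, dom_name, start, end):
    # Key on what makes a domain unique so repeated calls reuse the record.
    key = (dom_name, start, end)
    if key not in orf.doms:
        orf.doms[key] = {"name": dom_name, "start": start, "end": end, "orf": orf.name}
    return orf.doms[key]


orf = ToyORF("toy-201")
first = create_and_map_domain_once(orf, "DBD", 10, 80)
second = create_and_map_domain_once(orf, "DBD", 10, 80)
```

Here the second call returns the record created by the first instead of adding a duplicate.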

v2 isocreatealign code omits residues from certain genes

Alignments affected:

ARHGEF1-203|204
ICAM4-202|201
OSCAR-204|206

Occurs at exon junctions.

replace magic constants/ints with Enums
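
As an illustration of the pattern only (the issue does not say which constants are meant), the lollipop colour scheme from the plotting issues could be expressed as an Enum instead of bare strings. `Significance` and `lollipop_color` are hypothetical names.

```python
from enum import Enum


class Significance(Enum):
    """Clinical significance of a mutation, with its plotting colour as the value."""
    DISEASE = "red"
    BENIGN = "green"
    UNKNOWN = "orange"  # variant of unknown significance


def lollipop_color(sig: Significance) -> str:
    # Callers pass an Enum member, so a typo fails loudly instead of
    # silently producing an unexpected colour.
    return sig.value
```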

Feature types to implement in biosurfer

Implement subclasses for features - by downloading data and coding them in
types of features:
this is the name of the attr linked to the iso_obj:
(for example, orf.dom retrieves current_dom from orf.doms)

dom (cat can be dbd, reg, act, repr, other) - domain
binding residue, active site
lm - linear motif
idr - intrinsically disordered region
sstruc - secondary structure
ptm (cat can be phospho, acetyl, etc.) - post-translational modification
cons - conservation
frm - translational frame
isr (cat can be constitutive, subset, or isoform-specific) - isoform-specific region
Related to this is “Fractional splice” - Fractional splice code
For PTMs, look at PhosphoSitePlus, Cell Signaling Technology

Show UTRs and CDSs by thickness

skinny bars for UTRs, thick bars for CDSs

How is the width of the y axis labels set in Isoplot.draw?

biosurfer/isomodules/isoimage.py

Line 262 in 496be78

left_subaxes.set_yticklabels([orf.name for orf in self.orfs])

I was looking at the code in which the y-axis labels are set. How is the physical width of the y-axis labels determined? If you pass a longer string, will the left-hand side of the canvas expand in response?

Creating Frame objects for alignment groups in some genes triggers IndexError

code to reproduce:

gene = <gene object>
aln_grps = isocreatealign.create_and_map_splice_based_align_obj([[gene.repr_orf, orf] for orf in gene.other_orfs])
for aln_grp in aln_grps:
    isocreatefeat.create_and_map_frame_objects(aln_grp)

known examples:

GCDH
TIMM50
