This project forked from johli/aparent

APA Regression Net - Predict and Engineer Alternative Polyadenylation

License: MIT License

Jupyter Notebook 98.98% Python 0.39% HTML 0.63%

APARENT - APA Regression Net

This repository contains the code for training and running APARENT, a deep neural network that can predict human 3' UTR Alternative Polyadenylation (APA), annotate genetic variants based on the impact of APA regulation, and engineer new polyadenylation signals according to target isoform abundances or cleavage profiles.

APARENT was described in Bogard et al., Cell 2019 (in press).

The model was trained on >3.5 million randomized 3' UTR poly-A signals expressed on mini gene reporters in HEK293.

Forward-engineering of new poly-A signals is done using the included SeqProp (Stochastic Sequence Backpropagation) software, which implements a gradient-based input optimization algorithm and uses APARENT as the predictor.
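
To give a feel for gradient-based input optimization, here is a minimal sketch in plain Python. It is not SeqProp's implementation: the linear `toy_predictor`, the `motif_weights` helper, and the step size are all hypothetical stand-ins, and the softmax relaxation is only one simple way to make a discrete sequence differentiable. The idea it illustrates is that per-position nucleotide logits are updated by following the predictor's gradient, and the designed sequence is read off at the end.

```python
import math

ALPHABET = "ACGT"

def softmax(logits):
    """Turn one position's logits into nucleotide probabilities."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def motif_weights(motif):
    """Hypothetical objective: reward 1.0 for matching the motif at each position."""
    return [[1.0 if base == target else 0.0 for base in ALPHABET] for target in motif]

def toy_predictor(probs, weights):
    """Stand-in for a real predictor: a linear score over the relaxed sequence."""
    return sum(w * p for row_w, row_p in zip(weights, probs) for w, p in zip(row_w, row_p))

def toy_predictor_gradient(weights):
    """Gradient of the linear score with respect to the probabilities."""
    return weights

def optimize_sequence(weights, steps=50, learning_rate=1.0):
    """Gradient ascent on per-position logits, then argmax to get a sequence."""
    logits = [[0.0] * len(ALPHABET) for _ in weights]
    grad_probs = toy_predictor_gradient(weights)
    for _ in range(steps):
        for i, row in enumerate(logits):
            p = softmax(row)
            expected = sum(g * q for g, q in zip(grad_probs[i], p))
            # Chain rule through softmax: d score / d logit_k = p_k * (g_k - E_p[g])
            for k in range(len(ALPHABET)):
                row[k] += learning_rate * p[k] * (grad_probs[i][k] - expected)
    return "".join(ALPHABET[max(range(len(ALPHABET)), key=row.__getitem__)] for row in logits)

weights = motif_weights("AATAAA")
print(optimize_sequence(weights))
```

With a real model, the hand-written gradient would be replaced by backpropagation through the network, which is the role APARENT plays as the predictor.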

Further below on this page are links to IPython Notebooks containing all of the analyses performed in the paper. There is also a link to the repository containing all of the processed data used by the notebooks.

Contact jlinder2 (at) cs.washington.edu for any questions about the model or data.

Web Prediction Tool

We have hosted a publicly accessible web application where users can predict APA isoform abundance and variant effects with APARENT and visualize the results.

The web prediction tool is located at https://apa.cs.washington.edu.

Installation

APARENT can be installed by cloning the GitHub repository and running the setup script:

git clone https://github.com/johli/aparent.git
cd aparent
python setup.py install

APARENT requires the following packages to be installed:

  • TensorFlow >= 1.13.1
  • Keras >= 2.2.4
  • SciPy >= 1.2.1
  • NumPy >= 1.16.2
  • Isolearn >= 0.2.0 (github)
  • [Optional] SeqProp >= 0.1 (github)
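
A quick way to compare an environment against these minimums is sketched below. It relies on the standard library's `importlib.metadata` (Python 3.8+). The distribution names in `MINIMUM_VERSIONS` are assumptions about how each package is registered, and the version parser is deliberately simple (leading digits of each dot-separated component), so treat the report as a hint rather than a definitive check.

```python
import itertools
from importlib import metadata

# Minimum versions from the list above; the distribution names are assumed.
MINIMUM_VERSIONS = {
    "tensorflow": "1.13.1",
    "keras": "2.2.4",
    "scipy": "1.2.1",
    "numpy": "1.16.2",
    "isolearn": "0.2.0",
}

def parse_version(text):
    """Parse '1.16.2' (or '2.3.0rc1') into a tuple of leading integers."""
    parts = []
    for token in text.split("."):
        digits = "".join(itertools.takewhile(str.isdigit, token))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

def meets_minimum(installed, minimum):
    """Compare two version strings after padding them to the same length."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b

def check_requirements(minimums=MINIMUM_VERSIONS):
    """Return a status string for every required package."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if meets_minimum(installed, minimum) else "too old (%s)" % installed
    return report

for package, status in check_requirements().items():
    print(package, status)
```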

Usage

APARENT is built as a Keras Model, so it can be run with standard Keras function calls. See the example usage notebooks below for a tutorial on using the model for APA and variant effect prediction.

This simple example illustrates how to predict the isoform abundance and cleavage profile of an input APA event:

import keras
from keras.models import Sequential, Model, load_model
from aparent.predictor import *

#Load APADB-tuned APARENT model and input encoder
apadb_model = load_model('../saved_models/aparent_apadb_fitted_large_lessdropout_no_sampleweights.h5')
apadb_encoder = get_apadb_encoder()

#Example APA sites (gene = PSMC6)

#Proximal and Distal PAS Sequences
seq_prox = 'AGATAGTGGTATAAGAAAGCATTTCTTATGACTTATTTTGTATCATTTGTTTTCCTCATCTAAAAAGTTGAATAAAATCTGTTTGATTCAGTTCTCCTACATATATATTCTTGTCTTTTCTGAGTATATTTACTGTGGTCCTTTAGGTTCTTTAGCAAGTAAACTATTTGATAACCCAGATGGATTGTGGATTTTTGAATATTAT'
seq_dist = 'TGGATTGTGGATTTTTGAATATTATTTTAAAATAGTACACATACTTAATGTTCATAAGATCATCTTCTTAAATAAAACATGGATGTGTGGGTATGTCTGTACTCCTCCTTTCAGAAAGTGTTTACATATTCTTCATCTACTGTGATTAAGCTCATTGTTGGTTAATTGAAAATATACATGCACATCCATAACTTTTTAAAGAGTA'

#Site Distance
site_distance = 180

#Proximal and Distal cut intervals within each sequence defining the isoforms
prox_cut_start, prox_cut_end = 80, 105
dist_cut_start, dist_cut_end = 80, 105

#Predict with APADB-tuned APARENT model
iso_pred, cut_prox, cut_dist = apadb_model.predict(x=apadb_encoder([seq_prox], [seq_dist], [prox_cut_start], [prox_cut_end], [dist_cut_start], [dist_cut_end], [site_distance]))

print("Predicted proximal vs. distal isoform % (APADB) = " + str(iso_pred[0, 0]))
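
The cut intervals above define which cleavage positions count toward each isoform. One way to read that, sketched here on a made-up profile, is that an isoform's share is the cleavage mass falling inside its interval. The half-open `[cut_start, cut_end)` convention, the `isoform_fraction` helper, and the toy numbers are assumptions for illustration, not the model's internal definition.

```python
import math

def isoform_fraction(cleavage_profile, cut_start, cut_end):
    """Share of total cleavage mass that falls inside [cut_start, cut_end)."""
    total = sum(cleavage_profile)
    if total <= 0:
        raise ValueError("cleavage profile must have positive total mass")
    return sum(cleavage_profile[cut_start:cut_end]) / total

# Toy 10-position cleavage profile (hypothetical values, not model output).
toy_profile = [0.0, 0.1, 0.1, 0.3, 0.3, 0.1, 0.05, 0.05, 0.0, 0.0]

fraction = isoform_fraction(toy_profile, 3, 5)
print("Fraction of cleavage inside the interval:", round(fraction, 3))
```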

APARENT Example Usage Notebooks

The two notebooks listed at the end of this section illustrate how to use the APARENT Keras models to predict APA given a proximal and distal site, and to predict APA variant effects, respectively. We recommend using these two model versions:

saved_models/aparent_large_lessdropout_all_libs_no_sampleweights.h5

The base version of APARENT. Given an input sequence, it predicts the (non-normalized) isoform abundance and cleavage distribution. It is non-normalized in the sense that predictions are not scaled with respect to a particular distal site, but rather reflect the average distal bias of the training MPRA data. The main use of this model is to predict the effect of variants, by calculating the odds ratio between variant and wildtype isoform predictions.
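
The odds ratio between variant and wildtype predictions can be computed as below. This is a small sketch rather than code from the repository: the function names and the `eps` clipping value are assumptions, and the inputs are taken to be isoform proportions in (0, 1). Working on the log scale makes the effect symmetric, so positive values mean the variant increases predicted isoform usage and negative values mean it decreases it.

```python
import math

def logit(p, eps=1e-6):
    """Log odds of a proportion, clipped away from 0 and 1 to stay finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def variant_log_odds_ratio(iso_variant, iso_wildtype):
    """log( odds(variant) / odds(wildtype) ) for two predicted proportions."""
    return logit(iso_variant) - logit(iso_wildtype)

# Hypothetical predictions: wildtype 50% proximal usage, variant 80%.
effect = variant_log_odds_ratio(0.8, 0.5)
print("Log odds ratio:", round(effect, 3))
print("Odds ratio:", round(math.exp(effect), 3))
```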

saved_models/aparent_apadb_fitted_large_lessdropout_no_sampleweights.h5

A siamese APARENT network model, expecting both proximal and distal sequences as input. APARENT scores each site independently. The scores are weighted and combined with the log site distance, where the combination weights have been fitted on the Pooled-Tissue APADB data.
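
The combination step can be pictured as a small logistic model over the two site scores and the log distance. The sketch below is only an illustration of that idea: the weights, their signs, the bias, and the sigmoid output are all hypothetical, and the fitted values stored in the saved model are not reproduced here.

```python
import math

def sigmoid(x):
    """Map a real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def combine_site_scores(score_prox, score_dist, site_distance,
                        w_prox=1.0, w_dist=-1.0, w_log_dist=0.0, bias=0.0):
    """Weighted sum of per-site scores and log distance, squashed to a proportion.

    All weights are placeholder assumptions, not the fitted APADB values.
    """
    z = (w_prox * score_prox
         + w_dist * score_dist
         + w_log_dist * math.log(site_distance)
         + bias)
    return sigmoid(z)

# Equal scores and no distance term give an even split under these placeholders.
print(combine_site_scores(score_prox=1.5, score_dist=1.5, site_distance=180))
```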

Notebook 1: APA Isoform & Cleavage Prediction
Notebook 2: APA Variant Effect Prediction

Note: This model version is not the one evaluated in the paper; this version has been trained on all MPRA libraries (no libraries have been held out) in order to make the best APA predictor possible.

Legacy Model & Code Availability

The Legacy Model is the version evaluated in the paper, which we provide here for reproducibility. The model architecture itself has not changed since the Legacy version, but the newest version has been trained on all MPRA libraries. The Legacy models (base version and APADB-fitted version) are located at saved_models/legacy_models/.

The Legacy model was originally built and trained using Theano. Theano is no longer under active development, so we have ported the original model to Keras. The original Theano training code can be found in the repository below:

Legacy Code Repository

Data Availability

The raw sequencing data for the 3' UTR MPRA libraries are found at GEO accession GSE113849.

The Legacy Data is the version of the processed data analyzed in the paper, which we provide here for reproducibility. The newest version of the data has been re-processed with the following additional improvements:

  1. Exact cleavage positions have been mapped for the Alien1 Random MPRA Sublibrary.
  2. A 20 nt random barcode upstream of the USE in the Alien1 Sublibrary has been included in the sequence.

Processed Data Repository
Processed Data Repository (legacy)

Note: The "Processed Data Repository" also includes the Legacy data, but the data has been re-formatted such that it is easier to work with in Keras.

Analysis

The following collection of IPython Notebooks contains all of the analyses performed in the paper. To aid reproducibility, we have used the Legacy APARENT model and Legacy Data in all of the notebooks.

Random MPRA Linear Model Notebooks

Log Odds Ratio Analysis of hexamers in the Random MPRA libraries and Linear Logistic Hexamer Regression.
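
For readers unfamiliar with hexamer-based features, the counting step behind such analyses can be sketched as a sliding window over the sequence. This is a generic illustration, not the notebooks' code; the `count_kmers` helper and the example sequence are made up.

```python
from collections import Counter

def count_kmers(sequence, k=6):
    """Count every overlapping k-mer (hexamers by default) in a sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# Toy sequence containing the canonical AATAAA hexamer twice.
counts = count_kmers("AATAAAGAATAAA")
print(counts["AATAAA"])
print(sum(counts.values()))
```

Counts like these can serve as the input features of a logistic regression, or be compared between groups of sequences to form a log odds ratio per hexamer.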

Notebook 1a: Isoform Log Odds Ratio Analysis (Alien1 Library)
Notebook 1b: Isoform Log Odds Ratio Analysis (Alien2 Library)
Notebook 2: Cleavage Log Odds Ratio Analysis (Alien1 Library)
Notebook 3a: Hexamer Logistic Regression (Combined Library)
Notebook 3b: Hexamer Logistic Regression (TOMM5 Library only)
Notebook 3c: Hexamer Logistic Regression (Alien1 Library only)
Notebook 3d: Hexamer Logistic Regression (Alien2 Library only)

Random MPRA Neural Network Notebooks

Evaluation of APARENT on the Random MPRA libraries, and Convolutional Layer 1 & 2 visualizations.

Notebook 1: MPRA Prediction Evaluation
Notebook 2a: Conv Layer 1 and 2 Analysis (Alien1 Library)
Notebook 2b: Conv Layer 1 and 2 Analysis (Alien2 Library)
Notebook 3: CSE Hexamer Filter (Conv Layer 1)
Notebook 4: Cleavage Motifs (Conv Layer 1)

SeqProp APA Engineering Notebooks

Engineering of PAS sequences according to target isoform and cleavage objectives (and DeepDream).

Notebook 1: Target Isoform Sequence Optimization
Notebook 2: Target Cleavage Sequence Optimization
Notebook 3: Dense Layer Sequence Visualization (DeepDream-Style)

Designed MPRA Analysis Notebooks

Analysis of the Designed MPRA library, including Forward-engineering, Native PAS prediction, and Variant analysis.

Notebook 0a: Basic MPRA Library Statistics
Notebook 0b: MPRA LoFi vs. HiFi Replicates

Notebook 1a: SeqProp Target Isoforms (Summary)
Notebook 1b: SeqProp Target Isoforms (Detailed)

Notebook 2a: SeqProp Target Cut (Summary)
Notebook 2b: SeqProp Target Cut (Detailed)

Notebook 3: Human Wildtype APA Prediction

Notebook 4a: Human Variant Analysis (Summary)
Notebook 4b: Disease-Implicated Variants/UTRs (Detailed)
Notebook 4c: Cleavage-Altering Variants (Detailed)

Notebook 5a: Complex Functional Variants (Summary)
Notebook 5b: Complex Functional Variants (Canonical CSE)
Notebook 5c: Complex Functional Variants (Cryptic CSE)
Notebook 5d: Complex Functional Variants (CFIm25)
Notebook 5e: Complex Functional Variants (CstF)
Notebook 5f: Complex Functional Variants (Folding)

Notebook Bonus: TGTA Motif Saturation Mutagenesis

Native APA Analysis Notebooks

Analysis of native human APA (APADB and Leslie APA Atlas), including cell-type specific APA prediction evaluation.

Data sources: (APADB | Leslie)

Notebook 0: Basic Data Statistics
Notebook 1: Differential Usage Analysis
Notebook 2: Cleavage Site Prediction
Notebook 3: APA Isoform Prediction
Notebook 4: APA Isoform Prediction (Cross-Validation)

Contributors

jlinder2, johli
