APARENT - APA Regression Net

This repository contains the code for training and running APARENT, a deep neural network that can predict human 3' UTR Alternative Polyadenylation (APA), annotate genetic variants based on their impact on APA regulation, and engineer new polyadenylation signals according to target isoform abundances or cleavage profiles.

APARENT is described in Bogard et al., Cell 2019 (in press).

The model was trained on more than 3.5 million randomized 3' UTR poly-A signals expressed on minigene reporters in HEK293 cells.

Forward-engineering of new poly-A signals is done using the included SeqProp (Stochastic Sequence Backpropagation) software, which implements a gradient-based input optimization algorithm and uses APARENT as the predictor.
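
To make the idea of gradient-based input optimization concrete, here is a minimal NumPy sketch. It is not the SeqProp implementation: the predictor is a hypothetical linear scoring matrix (`toy_weights`) standing in for APARENT, and the sequence is relaxed to a softmax over per-position nucleotide logits instead of being sampled stochastically. All function names are illustrative.

```python
import numpy as np

def softmax(logits):
    # Row-wise softmax: each position becomes a distribution over A, C, G, T.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def toy_predictor(pwm, toy_weights):
    # Hypothetical stand-in for APARENT: a linear score of the relaxed sequence.
    return float((pwm * toy_weights).sum())

def optimize_sequence(toy_weights, n_steps=200, learning_rate=0.5, seed=0):
    rng = np.random.default_rng(seed)
    logits = rng.normal(scale=0.1, size=toy_weights.shape)
    for _ in range(n_steps):
        pwm = softmax(logits)
        # For a linear predictor, d(score)/d(pwm) is simply toy_weights.
        grad_pwm = toy_weights
        # Backpropagate through the softmax: dS/dz_j = p_j * (g_j - sum_k p_k g_k).
        grad_logits = pwm * (grad_pwm - (pwm * grad_pwm).sum(axis=1, keepdims=True))
        # Gradient ascent on the logits (the input), not on any model weights.
        logits += learning_rate * grad_logits
    return logits

def decode(logits, alphabet="ACGT"):
    # Read out the most likely nucleotide at each position.
    return "".join(alphabet[i] for i in logits.argmax(axis=1))

# Toy objective: reward the canonical AATAAA hexamer at each position.
target = "AATAAA"
toy_weights = np.array([[1.0 if base == t else 0.0 for base in "ACGT"] for t in target])
optimized = decode(optimize_sequence(toy_weights))
print(optimized)
```

In the real setting the gradient comes from backpropagating through the APARENT network rather than from a fixed weight matrix, but the update acts on the input representation in the same way.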

Further below on this page are links to IPython Notebooks containing all of the analyses performed in the paper. There is also a link to the repository containing all of the processed data used by the notebooks.

Contact jlinder2 (at) cs.washington.edu for any questions about the model or data.

Web Prediction Tool

We host a publicly accessible web application where users can predict APA isoform abundance and variant effects with APARENT and visualize the results.

The web prediction tool is located at https://apa.cs.washington.edu.

Installation

APARENT can be installed by cloning or forking the GitHub repository:

git clone https://github.com/johli/aparent.git
cd aparent
python setup.py install

APARENT requires the following packages to be installed:

TensorFlow >= 1.13.1
Keras >= 2.2.4
SciPy >= 1.2.1
NumPy >= 1.16.2
Isolearn >= 0.2.0 (github)
[Optional] SeqProp >= 0.1 (github)
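
As a convenience, the minimum versions above can be checked from Python. The sketch below assumes Python 3.8 or newer (for `importlib.metadata`) and assumes each package is registered with pip under the lowercase name shown; neither assumption comes from this repository, and the helper names are illustrative.

```python
from importlib import metadata

# Minimum versions as listed above. The distribution names are assumptions
# about how each package is registered with pip.
MINIMUM_VERSIONS = {
    "tensorflow": "1.13.1",
    "keras": "2.2.4",
    "scipy": "1.2.1",
    "numpy": "1.16.2",
    "isolearn": "0.2.0",
}

def version_tuple(version):
    # Keep only the leading numeric components, e.g. "1.16.2rc1" -> (1, 16, 2).
    parts = []
    for piece in version.split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

def meets_minimum(installed, minimum):
    # Pad with zeros so that "2.2" compares equal to "2.2.0".
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a = a + (0,) * (width - len(a))
    b = b + (0,) * (width - len(b))
    return a >= b

def check_requirements(minimums=MINIMUM_VERSIONS):
    # Returns a dict mapping each package name to "ok", "missing", or "too old (...)".
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if meets_minimum(installed, minimum):
            report[name] = "ok"
        else:
            report[name] = "too old (" + installed + ")"
    return report

for package, status in check_requirements().items():
    print(package, status)
```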

Usage

APARENT is built as a Keras Model, so it can be run with standard Keras function calls. See the example usage notebooks below for a tutorial on how to use the model for APA and variant effect prediction.

This simple example illustrates how to predict the isoform abundance and cleavage profile of an input APA event:

from keras.models import load_model
from aparent.predictor import get_apadb_encoder

#Load APADB-tuned APARENT model and input encoder
apadb_model = load_model('../saved_models/aparent_apadb_fitted_large_lessdropout_no_sampleweights.h5')
apadb_encoder = get_apadb_encoder()

#Example APA sites (gene = PSMC6)

#Proximal and Distal PAS Sequences
seq_prox = 'AGATAGTGGTATAAGAAAGCATTTCTTATGACTTATTTTGTATCATTTGTTTTCCTCATCTAAAAAGTTGAATAAAATCTGTTTGATTCAGTTCTCCTACATATATATTCTTGTCTTTTCTGAGTATATTTACTGTGGTCCTTTAGGTTCTTTAGCAAGTAAACTATTTGATAACCCAGATGGATTGTGGATTTTTGAATATTAT'
seq_dist = 'TGGATTGTGGATTTTTGAATATTATTTTAAAATAGTACACATACTTAATGTTCATAAGATCATCTTCTTAAATAAAACATGGATGTGTGGGTATGTCTGTACTCCTCCTTTCAGAAAGTGTTTACATATTCTTCATCTACTGTGATTAAGCTCATTGTTGGTTAATTGAAAATATACATGCACATCCATAACTTTTTAAAGAGTA'

#Site Distance
site_distance = 180

#Proximal and Distal cut intervals within each sequence defining the isoforms
prox_cut_start, prox_cut_end = 80, 105
dist_cut_start, dist_cut_end = 80, 105

#Predict with APADB-tuned APARENT model
iso_pred, cut_prox, cut_dist = apadb_model.predict(x=apadb_encoder([seq_prox], [seq_dist], [prox_cut_start], [prox_cut_end], [dist_cut_start], [dist_cut_end], [site_distance]))

print("Predicted proximal vs. distal isoform % (APADB) = " + str(iso_pred[0, 0]))

APARENT Example Usage Notebooks

The two notebooks listed below illustrate how to use the APARENT Keras models to predict APA given a proximal and distal site, and to predict APA variant effects, respectively. We recommend using the following two model versions:

saved_models/aparent_large_lessdropout_all_libs_no_sampleweights.h5

The base version of APARENT. Given an input sequence, predicts the (non-normalized) isoform abundance and cleavage distribution. It is non-normalized in the sense that predictions are not scaled w.r.t. a particular distal site, but rather the average distal bias of the training MPRA data. The main use of this model is to predict the effect of variants, by calculating the odds ratio between variant and wildtype isoform predictions.

saved_models/aparent_apadb_fitted_large_lessdropout_no_sampleweights.h5

A siamese APARENT network model, expecting both proximal and distal sequences as input. APARENT scores each site independently. The scores are weighted and combined with the log site distance, where the combination weights have been fitted on the Pooled-Tissue APADB data.
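
As a rough mental model of the combination step, the sketch below mixes two per-site scores with the log site distance. The functional form (a linear combination passed through a sigmoid) and every weight value are assumptions for illustration; the APADB-fitted weights live inside the saved model and are not reproduced here. The function name is hypothetical.

```python
import numpy as np

def combine_site_scores(score_prox, score_dist, site_distance,
                        w_prox=1.0, w_dist=-1.0, w_log_dist=0.0, bias=0.0):
    # All weights are hypothetical placeholders, not the APADB-fitted values.
    logit = (w_prox * score_prox
             + w_dist * score_dist
             + w_log_dist * np.log(site_distance)
             + bias)
    # The sigmoid maps the combined logit to a proximal isoform proportion in (0, 1).
    return 1.0 / (1.0 + np.exp(-logit))

# Equal site scores and no distance weight give an even split.
print(combine_site_scores(0.0, 0.0, 180))
# A positive distance weight shifts usage toward the proximal site.
print(combine_site_scores(0.0, 0.0, 180, w_log_dist=0.5))
```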

Notebook 1: APA Isoform & Cleavage Prediction
Notebook 2: APA Variant Effect Prediction

Note: The current model version is not the one evaluated in the paper. It was trained on all MPRA libraries (none were held out) to make the best possible APA predictor.
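
The odds ratio mentioned for the base model can be computed directly from two predicted isoform proportions. The following is a minimal sketch, assuming the predictions are proportions between 0 and 1; the function name and the clipping constant `epsilon` are illustrative choices, not part of the APARENT API.

```python
import numpy as np

def isoform_log_odds_ratio(iso_wildtype, iso_variant, epsilon=1e-6):
    # Clip to avoid division by zero or log(0) at proportions of exactly 0 or 1.
    wt = np.clip(np.asarray(iso_wildtype, dtype=float), epsilon, 1.0 - epsilon)
    var = np.clip(np.asarray(iso_variant, dtype=float), epsilon, 1.0 - epsilon)
    # Odds = p / (1 - p) for each prediction.
    odds_wt = wt / (1.0 - wt)
    odds_var = var / (1.0 - var)
    # Natural log of the variant-to-wildtype odds ratio.
    return np.log(odds_var / odds_wt)

# A variant that raises the predicted proportion from 0.5 to 0.8
# has an odds ratio of 4, i.e. a log odds ratio of log(4).
print(isoform_log_odds_ratio(0.5, 0.8))
```

A positive value means the variant increases predicted use of the site relative to wildtype, a negative value means it decreases it, and zero means no predicted change.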

Legacy Model & Code Availability

The Legacy Model is the version evaluated in the paper, which we provide here for reproducibility. The model architecture itself has not changed since the Legacy version, but the newest version has been trained on all MPRA libraries. The Legacy models (base version and APADB-fitted version) are located at saved_models/legacy_models/.

The Legacy model was originally built and trained using Theano. Theano is no longer under active development, so we have ported the original model to Keras. The original Theano training code can be found in the repository below:

Legacy Code Repository

Data Availability

The raw sequencing data for the 3' UTR MPRA libraries are available under GEO accession GSE113849.

The Legacy Data is the version of the processed data analyzed in the paper, which we provide here for reproducibility. The newest version of the data has been re-processed with the following additional improvements:

Exact cleavage positions have been mapped for the Alien1 Random MPRA Sublibrary.
A 20 nt random barcode upstream of the USE in the Alien1 Sublibrary has been included in the sequence.

Processed Data Repository
Processed Data Repository (legacy)

Note: The "Processed Data Repository" also includes the Legacy data, but the data has been re-formatted such that it is easier to work with in Keras.

Analysis

The following collection of IPython Notebooks contains all of the analyses performed in the paper. To aid reproducibility, we have used the Legacy APARENT model and Legacy Data in all of the notebooks.