DeepLoc, a Deep Learning approach for Protein Subcellular Localization.
This document provides step-by-step instructions on (a) establishing a connection to an AWS EC2 instance from your terminal; (b) transferring files from your local device to the VM; (c) setting up the environment to run DeepLoc. It will also cover the prerequisites to ensure that your code runs smoothly.
We assume that your operating system is macOS and that you have activated your AWS account.
- AWS Support for extending the vCPU Limit to 32
New AWS accounts are assigned zero virtual CPUs (vCPUs) by default. However, computationally intensive tasks use instances requiring more than 16 units. For DeepLoc, we would require instances with 32+ vCPUs.
Go to your AWS console and search for Support. Raise a ticket by creating a case and selecting Service limit increase. Fill in the following details to submit your request. The AWS team should approve your request and increase your vCPU limit within an hour.
You can enter a use case based on your task. Here’s an example: “We're running computationally intensive, enterprise-level DL models for a subcellular localization task (business/research use case). The startup that I'm working at is called NonExomics. We'd be grateful if you could approve the vCPU limit increase so that we can use P2/P3 instances at the earliest.”
- Turn off Sleep Mode on your Mac
Go to Energy Saver -> Power Adapter. Select the option Prevent computer from sleeping automatically when the display is off.
- Changes to SSH-related Configuration Files
We do this to ensure that the connection doesn't break while you SSH into an Amazon EC2 instance. Two of the most common errors that you might encounter while training your model are “Write failed: Broken pipe” and "packet_write_wait: Connection to xx.xx.xx.xx: Broken pipe" [Read More]. We have a workaround! Open your terminal and edit the three configuration files as follows:
Step 1: Enter $ sudo nano /etc/ssh/ssh_config on the terminal.
Add and save the following lines under Host *:
ServerAliveInterval 120
TCPKeepAlive yes
ServerAliveCountMax 5
Step 2: Enter $ nano ~/.ssh/config on the terminal (sudo isn't needed for a file in your home directory).
Add and save the following lines:
Host *
ServerAliveInterval 30
ServerAliveCountMax 5
Step 3: Enter $ sudo nano /etc/ssh/sshd_config on the terminal.
Add and save the following lines (sshd_config has no Host * section):
ClientAliveInterval 600
ClientAliveCountMax 0
Check if TCPKeepAlive yes is present (add it if it isn't). Restart your system for the changes to take effect.
- Fork the DeepLoc Repository on Github
Set up your Github account and fork the DeepLoc repo. If you don’t fork the original repository, you won't have push access to the code after cloning it later.
Contact @sukritipaul05 if you encounter any other issues.
Download the .fasta file to your local /Downloads folder from [Link]. Alternatively, you can rename your own dataset to deeploc_data.fasta and move it to the local /Downloads folder.
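Before uploading, it can help to sanity-check that the file is well-formed FASTA (every record starts with a '>' header line). A minimal sketch; the helper name count_fasta_records is ours, not part of DeepLoc:

```python
def count_fasta_records(path):
    """Count the '>' header lines in a FASTA file; zero suggests the
    file is empty or not FASTA-formatted."""
    count = 0
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                count += 1
    return count

# Usage (run from the folder holding the file):
#   count_fasta_records("deeploc_data.fasta")
```

If the count is zero, double-check that you renamed and placed the right file.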
Select the Amazon Deep Learning AMI (Ubuntu 16.04) and launch a p2.8xlarge instance. Please note that you must not select any other instance type (doing so might cause NonExomics to incur more costs than the budget permits). You can refer to the AWS Basics Demo to get your instance running.
To securely connect to the running AWS instance, we SSH into the server (using the private key and the Public IPv4 DNS from the AWS console). Open your terminal and navigate to the folder containing the private key aws_deeploc1_key.pem.
cd /Users/your_username/Downloads/
chmod 0400 aws_deeploc1_key.pem
ssh -L localhost:8888:localhost:8888 -i aws_deeploc1_key.pem ubuntu@<your instance DNS>
#Example
#ssh -L localhost:8888:localhost:8888 -i aws_deeploc1_key.pem [email protected]
- For this example, we’ve named the key aws_deeploc1_key.pem. You can give it any name while launching the instance; just use the same name in the commands.
- The SSH command requires the Public IPv4 DNS of the instance, which can be copied from the Public IPv4 DNS column on the View Instances page of the AWS Console.
Remember: from this step onwards, you’ve successfully established a connection from the local machine to the virtual server. Every command you run on this terminal (write/create/modify/delete/configure) takes effect on the server, not on your local device! Also, remember that NonExomics is charged per hour while the instance is in the running state.
- Set up Git on the Virtual Server by running these commands on the terminal:
$git config --global user.email "<Enter your Github E-mail>"
$git config --global user.name "<Enter your Github Username>"
- Check if you’ve forked the DeepLoc Github repository. On the Github webpage, it should appear in the format <github username>/DeepLoc preceded by a fork symbol.
- To make a copy of the repository on the virtual server:
$git clone <DeepLoc repository URL>
# Example: git clone https://github.com/nonexomics09/DeepLoc
- Create a /data folder in the /DeepLoc folder.
$cd DeepLoc
$mkdir data
- Transfer the deeploc_data.fasta file from the local client system to the /DeepLoc/data folder on the virtual server using SCP. Open a new local (client) terminal window and type the command below.
$scp -i <path to .pem key> <copy file from path> user@server:<copy file to path>
# Example: scp -i /Users/nonexomics/Downloads/aws_deeploc1_key.pem /Users/nonexomics/Downloads/deeploc_data.fasta [email protected]:/home/ubuntu/DeepLoc/data
The Public IPv4 DNS of the instance (corresponding to the server portion of the scp command) can be found on the View Instances page of the AWS Console. Cross-check that the deeploc_data.fasta file is present in the /DeepLoc/data folder.
Data Science best practices emphasize creating a virtual environment for each project. Your current project may have a set of requirements that conflict with those of other projects; if you install packages and libraries directly on the virtual server, you may run into incompatible package/library/framework versions. We create and activate a virtual environment so that the project requirements are installed without friction, independently of the installations on the main virtual server.
Fortunately, the p2.8xlarge instance comes with a set of pre-installed virtual environments that can be activated from the terminal. Although you can create your own virtual environment on Ubuntu, and install the packages/libraries in the requirements.txt one by one, we recommend that you use one of the preexisting environments called ‘pytorch_p36’. More often than not, you may encounter clashes between the CUDA and PyTorch versions if you choose to create a new virtual environment from scratch.
The next step describes how you can activate the pytorch_p36 environment.
$cd ~
$source activate pytorch_p36
#You can view the list of other existing virtual environments via $conda info --envs
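To confirm that the environment is really active, you can ask Python which conda environment and interpreter version it sees. This is our own quick check, not part of the DeepLoc setup; inside pytorch_p36 the version should be (3, 6):

```python
import os
import sys

# `source activate <env>` exports CONDA_DEFAULT_ENV with the env's name.
active_env = os.environ.get("CONDA_DEFAULT_ENV", "none")
print("Active conda environment:", active_env)

# Major/minor interpreter version; expect (3, 6) inside pytorch_p36.
print("Python version:", sys.version_info[:2])
```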
Model Architecture image taken from Thanh Tung Hoang's repository [1]
For this section, we follow the steps in this README.md documentation. Navigate to the /DeepLoc folder on the server, via your terminal. All the subsequent commands should be run from the /DeepLoc directory.
- Install the dependencies listed in requirements.txt.
$pip install -r requirements.txt
- Prepare the dataset and build the vocabularies/parameters for the dataset.
$python build_dataset.py
$python build_vocab.py --data_dir data/
- Phase-1 Training: a basic training run.
$python train.py --data_dir data --model_dir experiments/base_model
- Phase-2 Training: select the optimal hyperparameters (learning rate) and display them.
$python search_hyperparams.py --data_dir data --parent_dir experiments/learning_rate
$python synthesize_results.py --parent_dir experiments/learning_rate
- Test set evaluation.
#Existing DeepLoc test data
$python evaluate.py --data_dir data --model_dir experiments/base_model
#Our test data
$python evaluate_nonexomics.py --data_dir data --model_dir experiments/base_model
Note: once you’re done with training and testing, you may transfer files to your local device or back them up on Github. The server's disk is temporary (an ephemeral drive), and your instance will be wiped clean when you terminate it. Lastly, please terminate the instance on the AWS console as soon as you’ve completed the tasks to avoid unnecessary expenses!
Item | Description |
---|---|
AMI (Virtual Machine) | Amazon Deep Learning AMI (Ubuntu 16.04) |
Instance Type | p2.8xlarge: This is a cost-effective option, given that we need >16 GiB of GPU memory for this task, and an instance that enables accelerated computing (specs in the image below). |
Training Runtime | ~4.5 hours (on the entire fasta file) for 20 epochs. There are two training phases, via two training files, so the total runtime can be estimated at 9-10 hours if it is run during the day (Indian Time Zone). |
Testing Runtime | <1 minute |
Instance cost (per hour) | $7.20 per hour. |
Backup and External Memory | No backup at present. All the work gets wiped once the instance is terminated. To save our work, we would have to pay for additional EBS (Elastic Block Store) volumes. This additional storage is billable even when the instance is stopped. |
Region (instance) | US East (Ohio) |
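Because billing is per running hour, a rough budget for one full run can be worked out from the figures above (the one-hour buffer for setup and file transfer is our assumption):

```python
HOURLY_RATE_USD = 7.20   # p2.8xlarge rate from the table above
TRAINING_HOURS = 10      # upper estimate covering both training phases
BUFFER_HOURS = 1         # assumed allowance for setup and file transfer

estimated_cost = HOURLY_RATE_USD * (TRAINING_HOURS + BUFFER_HOURS)
print("Estimated cost: $%.2f" % estimated_cost)  # Estimated cost: $79.20
```

Terminating the instance promptly after testing keeps the bill close to this estimate.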
The DeepLoc task requires NVIDIA CUDA, cuDNN, TensorFlow, and a compatible version of PyTorch. The p2.8xlarge instance is a viable instance type.
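A quick way to verify this stack before training is to try importing each framework from inside the activated environment. A small sketch; check_framework is a hypothetical helper, and the version numbers you see will depend on the AMI:

```python
def check_framework(name):
    """Try to import a framework; return its version string,
    'unknown' if it defines no __version__, or None if missing."""
    try:
        module = __import__(name)
        return getattr(module, "__version__", "unknown")
    except ImportError:
        return None

for framework in ("torch", "tensorflow"):
    version = check_framework(framework)
    print("%s: %s" % (framework, version or "not installed"))
```

If torch reports "not installed", re-check that pytorch_p36 is the active environment.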
Fig. 1: Data Description (Vocabulary)
Fig. 2: Train, Test & Validation Data set Creation from the Fasta file
Fig. 3: Training acc=64%, loss=0.96 (without optimal hyperparams)
Fig. 4: Test acc=66%, loss=0.98 (trained without optimal hyperparams)
- DeepLoc Repo by Thanh Tung Hoang. [Link]
- CS230 Deep Learning
- Almagro Armenteros, José Juan, et al. "DeepLoc: prediction of protein subcellular localization using deep learning." Bioinformatics 33.21 (2017): 3387-3395. [Link]