Corpus to Graph Genomics Processing Pipeline

This repository is an example for implementing a pipeline for processing medical documents. This repository is a code sample for implementing a pipeline for processing documents of any domain running on the Azure stack.

Processing steps:

Fetch documents from remote repository - pmc and pubmed databases on NCBI (www.ncbi.nlm.nih.gov).
Split documents to sentences and extract relevant entities (miRNA and genes) using a remote entity extraction service.
Find and score relations between entities in each sentence using a remote scoring API
Store relations and scores into graph database to be exposed by Graph API service

This repository is an example of using the Corpus to Graph Pipeline node module.

Solution Architecture

Components in the solution:

Component	Description
Public Repository	External repository that supplies new documents every day
Trigger Web Job	Scheduled to run daily and trigger a flow
Query Web Job	Queries for new document IDs (latest)
Parser Web Job	Divides documents into sentences and entities
Scoring Web Job	Scores sentences and relations
External API	API (url) that enables entity extraction and scoring
Graph Data	Database to store documents, sentences and relations

Architecture Diagram

Components
- Web Jobs
- Logging
- Console
- Graph API
Testing
Deployment
Code Challenges
- Deploying Node.js as a Web Job
- Seamless Execution Across Execution Modes
License

Components

Web Jobs

There are 3 web jobs in the bundle

Web Job	Description
Trigger	A schedules web job that triggers a daily check for new document Ids
Query	Query documents according to date range provided through Trigger Queue and insert all unprocessed documents to New IDs Queue
Parser	Processes each document in New IDs Queue into sentences and entities and pushes them into Scoring Queue
Scoring	Scores each sentence in Scoring Queue via the Scoring Service

To get more information on the message api between the web jobs and the queues see Corpus to Graph Pipeline - Message API

Logging

The web jobs output their logs into two mediums:

nodejs console - Using the nodejs common console.log\console.info etc...
console web app - see Console

Console - Managing and Monitoring

The Console Application web app is deployed as part of the solution. For more information see Console - Managing and Monitoring the Pipeline

Graph API

Used to expose the output of the pipeline. Mainly designed to be used by the loom tool to get the entities and the relations.

Testing

Prerequisites

To run the tests locally, create a setenv.test.cmd file at the root of your repository. You can copy it from env.template.cmd as a template.

For local run parameters see Local Run Parameters

Running Tests Locally

Initiate tests by running:

npm install
npm test

Running Tests from Mac\Linux

To run unit testing from mac or linux, you need nothing special.

Integration tests have a setup part that runs the schema.sql on the configured sql database. To run integration tests from a mac\linux machine you either need to have sqlcmd installed locally or run :

(in that order) on your sql database manually and then run the integration tests.

Deployment

The deployment files are available under azure-deployment folder. It uses ARM template to deploy the environment and connect it to git with continuous deployment.

Running Locally

Create a setenv.private.cmd file at the root of your repository. You can copy it from env.template.cmd as a template.

For local run parameters see Local Run Parameters

Local Environment Prerequisites

Sql Server, Database, login name and password (schema is available in Schema.sql)
Azure storage account name and key for queues
Azure storage account name and key for logging (can use the same one as for queues)
Service URLs for: Document processing, Scoring
Enable Google Authentication

Azure Deployment

Create a azure-deployment/Templates/azuredeploy.parameters.private.json file with your configuration and passwords. You can use azure-deployment/Templates/azuredeploy.parameters.json file as a reference.

Prerequisites

An active azure subscription
Service URLs for: Document processing, Scoring
Enable Google Authentication on the console web app url
Set Up Git

To edit the deployment parameters see Azure Deployment Parameters TODO - There should be no prerequisites to the project 1) Remove project dependencies 2) Enable user/password authentication to prevent dependency on google authentication

Deploy with ARM

Install azure-cli and change mode to ARM

npm install -g azure-cli
azure config mode arm

List all subscriptions and see the currently set subscription.
In case you need to change the subscription, use azure account set.

azure account list
azure account show
azure account set c37fee37-d7f6-45bc-a4f2-852780bda058

To deploy the template to azure, use the following (You can also use azure-deployment\Templates\scalable\deploy.cmd):

cd azure-deployment\Templates\scalable\
azure group create -n resource-group-name -l "West US"
azure group deployment create -f azuredeploy.json -e parameters.private.json resource-group-name deployment-name

To deploy continuous integration run

azure group deployment create -f azuredeploy.sourcecontrol.json -e parameters.private.json resource-group-name deployment-sourcecontrol-name

Notice 1: The deployment templates have been divided into two since currently, deploying node continuous deployment with ARM can seem to fail*

Notice 2: Even though continuous deployment may seem to fail, this might be the result of network errors with npm and might actually work*

Deployment parameters

resource-group-name - You can use an existing resource group or run azure group create to create a new resource group.
deployment-name - The name for the deployment which you can later monitor through and azure cli or the azure portal.

Deployment Types

This repository contains two kinds of deployments

Scalable Deployment

Slim Deployment

Deployment Name	Description
Scalable	Deploys each web job to a separate web app, enabling the creation of an app service plan that scales according to CPU % or queue message count.
Slim	All web jobs will be deployed to the same web app, which will cost fewer resources.

The Slim Deployment (aka "All in one") exists under azure-deployment\Templates\all-in-one. That folder contains separate files for ARM template (base + source control) and parameters.

When deploying using this template, one web job will be created and marked (with environment variable) as all_in_one. This will cause the npm postinstall action to create 4 web jobs on the 1 web app.

Code Challenges

Deploying Node.js as a Web Job

In order to automatically deploy our application via Continuous Deployment we need to take the following into consideration:

Continuous Deployment makes a copy of your repository in D:\home\site\wwwroot
Web Jobs are automatically created by placement under D:\home\site\wwwroot\app_data\jobs\<type>\<name>
<type> is either continuous for always running jobs, or triggered for manual/scheduled Web Job
<name> is the name you will see when viewing the Web Job in the App Service
For scheduled web jobs, you simply need to add a settings.job file (CRON format) in that folder as well.

As such, we use the npm postinstall script to determine the Web Job type and the specific service to run in order to copy/override the app.js in the relevant web job folder.

Currently [Sep 18th 2016] there is an issue deploying source control (continuous deployment) with application settings to App Services. The deployment works, requires to tun the deployment command for a second time to ensure application settings. It is handled by an issue on git-hub: azure-xplat-cli issue #2618

Seamless Execution Across Execution Modes

The first step in solving the Continuous Deployment problem - All services should be connected to the same repository, but each running different code.

For that app.js is used as the entry point for all Web Jobs and environment variables are used to indicate which service should currently run. This looks something like that:

var webJobName = process.env.PIPELINE_ROLE;
...
var runner = new continuousRunner(webJobName, config);
runner.start();

Environment variables are used to provide service specific settings, such as Azure storage settings, Sql server configuration, external endpoints, etc. They are also used to define which service should run on which machine (ie. PIPELINE_ROLE).

In each of the following scenarios we set the environment variable in different methods:

Development - Process Execution

Each "web job" has a dedicated run.<service>.cmd file, which sets the relevant environment variables. The run.cmd file executes all cmd files in parallel.

Testing - Background Processes

The testing framework starts all web jobs before executing the tests. It uses log history to check whether specific conditions are met to validate tests' results.

The test runs each web job as a separate process as follows:

var worker = exec('set PIPELINE_ROLE=' + webJobName + '&& node ' + runAppJSPath);

License

Document Processing Pipeline is licensed under the MIT License.

isabella232 / corpus-to-graph-genomics Goto Github PK

corpus-to-graph-genomics's Introduction

Corpus to Graph Genomics Processing Pipeline

Solution Architecture

Architecture Diagram

Table of contents

Components

Web Jobs

Logging

Console - Managing and Monitoring

Graph API

Testing

Prerequisites

Running Tests Locally

Running Tests from Mac\Linux

Deployment

Running Locally

Local Environment Prerequisites

Azure Deployment

Prerequisites

Deploy with ARM

Deployment parameters

Deployment Types

Scalable Deployment

Slim Deployment

Code Challenges

Deploying Node.js as a Web Job

Seamless Execution Across Execution Modes

Development - Process Execution

Testing - Background Processes

License

corpus-to-graph-genomics's People

Contributors

Recommend Projects

Recommend Topics

Recommend Org

Jobs