dtlehrer / repurp

An automated drug repurposing project.

Java 1.55% Shell 0.13% Perl 4.30% Go 1.87% Perl 6 0.42% Roff 0.24% HTML 88.65% JavaScript 0.01% CSS 0.12% TeX 2.71%

repurp: Automated Drug Repurposing Exploration

Overview

This project is the result of summer research and coursework at the College of Saint Benedict/Saint John's University (CSB/SJU). Our goal is to create an automated drug repurposing program that takes disease names as input, outputs the top potential pre-approved drugs for treatment, and does the following:

  1. more quickly extracts disease-related data from growing biomedical databases
  2. uses interconnected biomedical data to suggest top drugs for repurposing
  3. produces results comparable to manual drug repurposing studies
  4. reduces drug discovery costs

Biomedical Data

Downloaded Biomedical Data

In the current prototype version, Therapeutic Target Database (TTD) and Human Symptoms-Disease Network (HSDN) datasets have been manually downloaded and pre-processed before program execution for simplicity (avoiding web scraping and reducing program runtime). However, pre-processing this data could become cumbersome with database updates, so a modified program implementing total automation may require live data extraction from these sources.

Therapeutic Target Database (TTD) Datasets

Drug project and protein target information was downloaded from the TTD data download page, as seen in the image below:

[Image: TTD Downloaded Files]

Highlighted links indicate downloaded text files. These files were opened within Microsoft Excel, preprocessed, and joined based on shared attributes to create a CSV file accessed within the repurposing prototype.

The Merged Dataset (TTDData7.csv)

The merged TTD data is stored in TTDData7.csv, one of the main locally saved datasets accessed within the repurposing prototype. This joined dataset consists of 3,389 records, each containing 10 attributes/fields, which are outlined below:

  • Uniprot ID: a universal protein identifier
  • TTDTargetID: a drug target identifier specific to the TTD
  • Target_Name: the name of the protein target (protein targeted/bound by a drug project)
  • Target Indication: the disease a protein target has been acted upon to treat
  • ICD9: International Statistical Classification of Diseases and Related Health Problems, 9th revision. This is an international disease identification code.
  • ICD10: the corresponding international disease identification codes from the 10th revision
  • Target Type: a protein target's development stage (successful, clinical trial, research, etc.)
  • TTDDRUGIDs: TTD-specific IDs for one or more drugs that act on the corresponding protein target
  • LNMs: one or more drug names (corresponding to TTDDRUGIDs order)
  • Indications: a list with the specific disease each drug project attempts to treat (corresponding to TTDDRUGIDs order)
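
Because TTDDRUGIDs, LNMs, and Indications are lists that share the same order within a record, they can be paired back into per-drug entries. The following is a minimal sketch, assuming the lists are semicolon-separated (the actual delimiter is not stated here) and using placeholder values rather than real TTD data. The function names are hypothetical and not part of the prototype.

```python
def split_field(value, sep=";"):
    """Split a multi-valued field, keeping empty items so positions stay aligned."""
    return [item.strip() for item in value.split(sep)]


def drugs_for_record(record, sep=";"):
    """Pair each drug ID with its name and indication by list position."""
    ids = split_field(record["TTDDRUGIDs"], sep)
    names = split_field(record["LNMs"], sep)
    indications = split_field(record["Indications"], sep)
    if not (len(ids) == len(names) == len(indications)):
        raise ValueError("drug ID, name, and indication lists differ in length")
    return [
        {"drug_id": i, "name": n, "indication": d}
        for i, n, d in zip(ids, names, indications)
    ]


# Placeholder record (hypothetical values, not taken from TTDData7.csv).
example = {
    "TTDDRUGIDs": "D0001; D0002",
    "LNMs": "Drug A; Drug B",
    "Indications": "Disease X; Disease Y",
}
for drug in drugs_for_record(example):
    print(drug)
```

Raising an error on mismatched lengths, rather than silently truncating with zip, makes misaligned records visible during pre-processing.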

Individual field values may repeat across records, but the combination of TTDTargetID and Target Indication should uniquely identify each record, forming a composite primary key for the dataset.
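
One way to verify the composite key is to count each (TTDTargetID, Target Indication) pair and report any that occur more than once. This is a minimal sketch, assuming the CSV header names match the field names listed above. The sample rows are hypothetical placeholders, and `duplicate_keys` is an illustrative helper, not a function from the prototype.

```python
import csv
import io
from collections import Counter


def duplicate_keys(rows, key_fields=("TTDTargetID", "Target Indication")):
    """Return every composite key that appears in more than one row."""
    counts = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# Hypothetical sample standing in for TTDData7.csv.
sample_csv = """TTDTargetID,Target Indication,Target_Name
T0001,Disease X,Protein A
T0001,Disease Y,Protein A
T0002,Disease X,Protein B
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
print(duplicate_keys(rows))  # prints [] because every pair is unique
```

To check the real file, the `io.StringIO` object could be replaced with an open file handle for the CSV.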

Human Symptoms-Disease Network (HSDN) Symptom Similarity Scores

A total of 133,106 symptom similarity scores between 1,596 distinct diseases were downloaded from a 2014 study dataset (Supplementary Data 4). This dataset was converted to a CSV file, and it is used to generate one of the drug weights that rank drug suggestions within the repurposing prototype.
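
A symmetric lookup table is one simple way to query pairwise scores like these. The sketch below assumes three CSV columns named `disease_a`, `disease_b`, and `score`; these names and the sample values are hypothetical, and how the prototype turns a score into a drug weight is not covered here.

```python
import csv
import io


def load_similarity(csv_file):
    """Map each unordered disease pair to its symptom similarity score."""
    scores = {}
    for row in csv.DictReader(csv_file):
        pair = frozenset((row["disease_a"].lower(), row["disease_b"].lower()))
        scores[pair] = float(row["score"])
    return scores


def similarity(scores, first, second, default=0.0):
    """Look up a score regardless of the order the diseases are given in."""
    return scores.get(frozenset((first.lower(), second.lower())), default)


# Hypothetical sample rows, not values from the HSDN dataset.
sample_csv = """disease_a,disease_b,score
Disease X,Disease Y,0.42
Disease X,Disease Z,0.10
"""
scores = load_similarity(io.StringIO(sample_csv))
print(similarity(scores, "disease y", "Disease X"))
```

Keying on a `frozenset` means a pair is stored once and found in either order, which halves the table compared with storing both directions.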

Other Biomedical Data Sources (National Center for Biotechnology Information (NCBI) & Entrez Databases)

NCBI's Entrez Databases (PubMed, Protein, Gene, etc.) provide a quickly growing amount of biomedical data. These databases are much larger, and change more rapidly, than sources like TTD and HSDN, so it is more important to extract live information from them and avoid saving local database copies.

Entrez Programming Utilities (E-utilities) and Entrez Direct (EDirect)

E-utilities, the NCBI Entrez system's public API, provide access to all Entrez Databases through a structured data retrieval mechanism. EDirect, an NCBI-provided downloadable package of executables, builds on this mechanism and allows the E-utilities to be called directly from a UNIX command line. We can use EDirect command line queries in our program to extract a wide range of highly customized information from databases like PubMed, Protein, and Gene. See repurp/KeyResources/Entrez_Direct/ or our paper for more information related to the E-utilities and EDirect.
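
For readers unfamiliar with the underlying API, the sketch below builds an E-utilities ESearch request directly over HTTP. It is not how this project queries Entrez (the project uses EDirect command line queries); it only illustrates the kind of request that sits beneath them. The function names, the example search term, and the default `retmax` are illustrative assumptions.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL that asks one Entrez database for matching record IDs."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def esearch_ids(db, term, retmax=20):
    """Fetch the matching record IDs (requires network access)."""
    with urlopen(esearch_url(db, term, retmax)) as response:
        data = json.load(response)
    return data["esearchresult"]["idlist"]


# Only the URL is printed here, so the example runs without network access.
print(esearch_url("gene", "Alzheimer disease"))
```

Calling `esearch_ids("gene", "Alzheimer disease")` would perform the live query. Results change as the databases grow, so no sample output is shown.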

Getting Started

This project was built and tested on a Linux machine. For optimized performance, it is currently recommended that you execute its scripts on a comparable setup. TTD and HSDN datasets are already downloaded and pre-processed, and EDirect tools have been installed to a local package, so you should be able to download and run this repository's collection of files on your own machine. The following steps provide one possible way to work towards customized repurposing on your own machine:

  1. download this repository's ZIP file (Download ZIP)
  2. extract all files to your own machine's home directory
  3. open a terminal window
  4. navigate to ~/repurp-master/ (cd ~/repurp-master)
  5. execute one of the scripts by entering its file name in the command line (ex. enter "repurp" or "./repurp" in the command line to run the generalized script, "repurpDiabetes" for the diabetes-specific program, or "repurpAD" for Alzheimer's disease)
  6. when prompted by the terminal, provide a disease name as input

After completion, program output should be visible in your repurp-master/output/ directory. Disease-related genes, proteins, and weighted drug suggestions can be obtained for most input diseases provided to the repurp shell script.
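
To confirm that a run produced results, the output directory can be listed with the newest files first. This is a small sketch that assumes the repository was extracted to the home directory as described in the steps above. The names of the output files are not specified here, so the helper (a hypothetical name) simply lists whatever it finds.

```python
from pathlib import Path

DEFAULT_OUTPUT_DIR = Path.home() / "repurp-master" / "output"


def list_output_files(output_dir=DEFAULT_OUTPUT_DIR):
    """Return the files in the output directory, most recently modified first."""
    output_dir = Path(output_dir)
    if not output_dir.is_dir():
        return []  # nothing to list if the directory does not exist yet
    files = [path for path in output_dir.iterdir() if path.is_file()]
    return sorted(files, key=lambda path: path.stat().st_mtime, reverse=True)


for path in list_output_files():
    print(path.name)
```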

Related Work

Paper

Presentation

Original Prototype Demonstration Video (slightly outdated...file structure differs from this repo)
