mentatpsi / kegg-crawler

A parallel API crawler for the retrieval of Kyoto Encyclopedia of Genes and Genomes metabolic and genomics data.

License: GNU General Public License v3.0

Python 100.00%
Topics: kegg kegg-reaction keggrest python metabolites pathway reactions genetics


KEGG Crawler

Author: Shay Maor

KEGG Crawler is a collection of Python scripts designed to download large portions of the KEGG database for local use, in formats convenient for data processing in Python. Various tools exist that download portions of the database, but this one was designed to gather substantially more data. A sample of the last crawl is included in the repository, covering primarily pathways, reactions, and metabolites. These come in the form of CSVs and pickle files (to be imported as dictionaries).

It uses KEGG's REST API to first obtain a list of pathways, along with their respective chemical reactions and metabolites. It uses Python's threading module to parallelize the crawl, minimizing bandwidth and latency issues: the work is split across 8 threads, each with its own stack of target crawls. HTTP requests are made with the urllib2 module (Python 2), and HTML is parsed with the Beautiful Soup library, which requires a pip install (after pip is installed, this can be done with the command "pip install beautifulsoup"; for Python 3 and Beautiful Soup 4 the package name is "beautifulsoup4").
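The thread-per-stack pattern described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: it assumes Python 3 (so `urllib.request` stands in for the original `urllib2`), and the function names are hypothetical.

```python
import threading
from urllib.request import urlopen

KEGG_REST = "https://rest.kegg.jp"  # KEGG's public REST endpoint


def split_into_stacks(ids, n_threads=8):
    """Divide the target IDs into one work stack per thread."""
    return [ids[i::n_threads] for i in range(n_threads)]


def worker(stack, results, lock):
    """Drain one stack of entry IDs, fetching each record over REST."""
    while stack:
        entry_id = stack.pop()
        with urlopen(f"{KEGG_REST}/get/{entry_id}") as resp:
            text = resp.read().decode()
        with lock:  # the shared results dict needs a lock
            results[entry_id] = text


def crawl(ids, n_threads=8):
    results, lock = {}, threading.Lock()
    threads = [threading.Thread(target=worker, args=(stack, results, lock))
               for stack in split_into_stacks(ids, n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Slicing with `ids[i::n_threads]` gives each thread a roughly equal share of the targets up front, which matches the "stacks on each thread" scheme rather than a shared work queue.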

The main crawler has a progress indicator and prints a message when each thread reaches 50% progress and when it completes. The script runs for approximately 20-30 minutes on a cable connection.

It is divided into three files:

crawler.py is responsible for the crawls. It utilizes the rest of the scripts to perform the crawl.

dbReader.py uses pattern recognition to create an object that inherits from the dictionary type and holds deeper information about each entry, which may prove useful for later applications. For metabolites, this includes fields such as formula, molecular weight, the reactions a compound participates in, its metabolic pathways, the enzymes involved, and identifiers in external databases (such as ChEBI and PubChem). The resulting instance can be indexed like a dictionary, e.g. dbInstance['DBLINKS'] or dbInstance['ENZYME'].
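The dict-subclass idea can be illustrated with a short sketch. This is not dbReader.py itself; the class name is hypothetical, and it only handles the basic layout of KEGG flat-file records, where the field name occupies the first 12 columns of a line and continuation lines leave those columns blank.

```python
class KeggRecord(dict):
    """Parse a KEGG flat-file record into {FIELD: [lines]} form."""

    def __init__(self, flat_text):
        super().__init__()
        key = None
        for line in flat_text.splitlines():
            if line[:12].strip():          # a new field name starts the line
                key = line[:12].strip()
                self[key] = []
            if key is not None:            # continuation lines extend the field
                self[key].append(line[12:].strip())


record = KeggRecord(
    "ENTRY       C00031  Compound\n"
    "NAME        D-Glucose\n"
    "FORMULA     C6H12O6"
)
```

After parsing, fields are reachable by name, e.g. `record['FORMULA']`, mirroring the `dbInstance['ENZYME']` access style described above.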

mapArea.py parses the HTML of the map-area section of the pathway maps. It was used to produce the secondary CSV, pathway_connection.csv, explained later. It is the only script that uses HTMLParser, since it examines the tags in the pathway HTML source to look for metabolite nodes.
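The HTMLParser approach can be sketched like this. It is an illustrative stand-in for mapArea.py, not the script itself: the class name is invented, and it simply collects `<area>` tags whose links point at compound ("cpd:") entries.

```python
from html.parser import HTMLParser


class MapAreaParser(HTMLParser):
    """Collect hrefs of <area> tags that link to KEGG compound entries."""

    def __init__(self):
        super().__init__()
        self.compound_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "area":
            href = dict(attrs).get("href", "")
            if "cpd:" in href:             # compound (metabolite) nodes
                self.compound_links.append(href)


parser = MapAreaParser()
parser.feed('<map><area shape="circle" '
            'href="/dbget-bin/www_bget?cpd:C00031"/></map>')
```

Because the pathway pages expose each node as an `<area>` element in an image map, scanning the tags this way is enough to recover which metabolites appear on a map.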

It generates five CSVs as well as Python pickle files (dictionaries that can be imported). A later release will contain Python scripts demonstrating proper usage of the pickle files. One of the pickle files (compoundsD, a more complete collection of data on each metabolite) is usually around 70 MB.

A prompt appears after the map-area threads have completed. This prevents the heavier, lengthier crawl from starting automatically in case testing functions were run beforehand.

The four primary CSVs created are as follows.

pathways.csv "Pathway ID","Pathway Name","KEGG ID"

metabolites.csv "KEGG ID","Names","chEBI","Occurence"

reactions.csv "Reaction KEGG ID","Reactants","Products","Reversability","EC list","Occurences","Names"

pathways_reactions.csv "Pathway ID","Pathway KEGG","Reaction KEGG ID"

The secondary CSV it creates: pathway_connection.csv "Pathway1 KEGG","Pathway2 KEGG","Connecting reactions","Connection metabolites"

This was an attempt to narrow down possible connectivity from one pathway to another.
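The generated CSVs can be read back with the standard csv module. The sketch below uses an in-memory two-row sample in the pathways.csv layout listed above; the row values are illustrative, not actual crawl output.

```python
import csv
import io

# Illustrative sample matching the pathways.csv column layout.
sample = (
    '"Pathway ID","Pathway Name","KEGG ID"\n'
    '"1","Glycolysis / Gluconeogenesis","path:map00010"\n'
)

# DictReader keys each row by the header line, so columns are
# accessible by name, e.g. row["Pathway Name"].
rows = list(csv.DictReader(io.StringIO(sample)))
```

For a real run, `io.StringIO(sample)` would be replaced by `open("pathways.csv", newline="")`.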


kegg-crawler's Issues

Forbidden by KEGG?

Hi there,

When using this crawler, do you experience access being forbidden by KEGG? There would be too many data-retrieval requests sent from the same IP.

Thanks,
G
