ProteoBOOSTER

ProteoBOOSTER is a tool that automatically infers protein-protein interaction (PPI) networks for an entire proteome and characterizes these networks by inferring protein complexes. If functional annotations are available for the proteome, the complexes are also analyzed for function, providing a functional profile for as many putative complexes as possible.

This repository contains the code to download the source databases needed to do this for any target proteome, as well as the scripts that automatically transfer the network and perform the subsequent analyses. A sister project, ProteoBOOSTER-web, provides a web interface that can display the information generated by ProteoBOOSTER on a website.

Usage

This repository is composed of a series of scripts that should be run in a specific sequence. Between some of the steps, you also need to run BLAST and ClusterONE. Below is an example sequence of commands that runs the pipeline on the Homo sapiens reference proteome:

Install the required packages

pip install -r requirements.txt

Download the required databases

python download_sapshot.py <project-path> -s <data-dir> 

Here <project-path> refers to the directory on your computer where you want to store all the databases; they will be stored in <project-path>/<data-dir>.

Next, we need to pre-process the interaction databases and combine them into a single database.

python create_interaction_file.py <project-path>/<data-dir> <project-path>/interactions

This creates a collection of files in the <project-path>/interactions/<data-dir> directory (although we're changing this behavior soon).

Now, let's assume you've downloaded a proteome FASTA file from UniProtKB, such as the Homo sapiens reference proteome: UP000005640_9606.fasta.

To process this, we first need to create a BLAST database from the FASTA file that contains all the interactors we identified before:

makeblastdb -in <project-path>/interactions/<data-dir>/sequences.fasta -out <project-path>/interactions/<data-dir>/sequences.fasta -dbtype prot

Then, we need to align our target fasta file against that database:

blastp -outfmt 6 -query UP000005640_9606.fasta -out UP000005640_9606.blast -db <project-path>/interactions/<data-dir>/sequences.fasta -num_threads <num-cores>

Here <num-cores> is an optional value that sets the number of threads BLAST uses, which speeds up the alignment on multi-core machines.
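For reference, `-outfmt 6` writes one tab-separated line per hit with twelve default columns (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The following is a minimal sketch of how such a file could be read in Python. The `BlastHit` and `parse_blast6` names are hypothetical and are not part of ProteoBOOSTER.

```python
import csv
from typing import Iterator, NamedTuple

# Default column order of `blastp -outfmt 6`.
BLAST6_COLUMNS = (
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
)


class BlastHit(NamedTuple):
    """Subset of the BLAST columns that a homology filter would likely need."""
    query: str
    subject: str
    percent_identity: float
    alignment_length: int
    evalue: float
    bitscore: float


def parse_blast6(handle) -> Iterator[BlastHit]:
    """Yield one BlastHit per non-empty, non-comment line of tabular output."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        fields = dict(zip(BLAST6_COLUMNS, row))
        yield BlastHit(
            query=fields["qseqid"],
            subject=fields["sseqid"],
            percent_identity=float(fields["pident"]),
            alignment_length=int(fields["length"]),
            evalue=float(fields["evalue"]),
            bitscore=float(fields["bitscore"]),
        )
```

With the file produced above, this could be used as `with open("UP000005640_9606.blast") as fh: hits = list(parse_blast6(fh))`.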

We carry on by applying our homology criterion to build a homolog database for the target proteome:

python create_homologs.py UP000005640_9606.blast UP000005640_9606.homologs
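The exact homology criterion applied by create_homologs.py is not described here. As a rough illustration only, a common approach is to keep BLAST hits below an E-value cutoff and above a percent-identity cutoff. The sketch below assumes that approach; the function name and both default thresholds are hypothetical and should not be read as the values the script uses.

```python
from collections import defaultdict


def select_homologs(hits, max_evalue=1e-5, min_identity=30.0):
    """Group accepted subjects by query protein.

    `hits` is an iterable of (query, subject, percent_identity, evalue)
    tuples. Both thresholds are illustrative assumptions, not the
    criterion implemented by create_homologs.py.
    """
    homologs = defaultdict(set)
    for query, subject, percent_identity, evalue in hits:
        if evalue <= max_evalue and percent_identity >= min_identity:
            homologs[query].add(subject)
    return dict(homologs)
```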

Using the calculated homolog database, we can now transfer interactions from the combined database for our target proteome:

python transfer_interactions.py UP000005640_9606.homologs <project-path>/interactions/<data-dir>/interaction_file.tab UP000005640_9606.interologs
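The idea behind this step can be sketched as follows: if two proteins in the combined database interact, and each has a homolog in the target proteome, the pair of target proteins is reported as an inferred interaction (an interolog). This is a minimal illustration of that idea under simplified in-memory data structures; the function name and input shapes are hypothetical, and transfer_interactions.py may work differently.

```python
from itertools import product


def transfer_interactions(homologs, interactions):
    """Infer target-proteome interactions from source interactions.

    homologs: dict mapping each target protein to the set of source
        proteins it is homologous to.
    interactions: iterable of (source_a, source_b) pairs.
    Returns a set of sorted (target_a, target_b) tuples.
    """
    # Invert the mapping so each source protein points at its targets.
    targets_by_source = {}
    for target, sources in homologs.items():
        for source in sources:
            targets_by_source.setdefault(source, set()).add(target)

    interologs = set()
    for source_a, source_b in interactions:
        candidates = product(
            targets_by_source.get(source_a, ()),
            targets_by_source.get(source_b, ()),
        )
        for target_a, target_b in candidates:
            # Store each undirected pair once, in sorted order.
            interologs.add(tuple(sorted((target_a, target_b))))
    return interologs
```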

If you know that no experimental interactions are available for your target organism, the next step is optional. For Homo sapiens (or any model organism), however, it's very likely that some of the proteins in the reference proteome have annotated interactors, so we extract them to include them in the inferred network:

python extract_experimental_interactions.py UP000005640_9606.fasta <project-path>/interactions/<data-dir>/interaction_file.tab UP000005640_9606.interactions
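Conceptually, this step keeps the interactions whose partners both belong to the target proteome. The sketch below illustrates that filter, assuming UniProt-style FASTA headers (`>db|ACCESSION|ENTRY_NAME ...`) and interactions given as accession pairs. Both function names are hypothetical, and the layout of interaction_file.tab is not assumed here.

```python
def proteome_accessions(fasta_lines):
    """Collect accessions from UniProt-style FASTA header lines."""
    accessions = set()
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split("|")
            # Header layout: db|ACCESSION|ENTRY_NAME description...
            if len(parts) >= 2:
                accessions.add(parts[1])
    return accessions


def within_proteome(pairs, accessions):
    """Keep only the pairs where both partners are in the proteome."""
    return [(a, b) for a, b in pairs if a in accessions and b in accessions]
```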

We now have enough data to create a graph and use it to infer protein complexes. Let's get the graph in a format that ClusterONE can use:

python prepare_data_for_clustering.py UP000005640_9606.interactions UP000005640_9606.interologs UP000005640_9606.graph
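ClusterONE reads an edge list in which each line holds two node identifiers and an optional weight, separated by whitespace. As an illustration of what the graph file may look like, the sketch below merges experimental and inferred pairs into a de-duplicated, unweighted edge list. The function name is hypothetical, and prepare_data_for_clustering.py may additionally assign weights or filter edges.

```python
def build_edge_list(experimental, interologs):
    """Merge two collections of (protein_a, protein_b) pairs.

    Returns sorted, tab-separated edge lines with each undirected
    edge listed once.
    """
    edges = set()
    for protein_a, protein_b in list(experimental) + list(interologs):
        # Sorting the pair makes (A, B) and (B, A) the same edge.
        edges.add(tuple(sorted((protein_a, protein_b))))
    return [f"{a}\t{b}" for a, b in sorted(edges)]
```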

And then cluster the graph:

java -jar cluster_one-1.2.jar -F csv UP000005640_9606.graph > UP000005640_9606.complexes
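To inspect the predicted complexes in Python, the CSV output can be read with the standard library. This sketch assumes the file has a header row with a `Members` column listing the proteins of each complex separated by spaces. Check this against your own .complexes file, since the layout is an assumption here and the function name is hypothetical.

```python
import csv


def read_complexes(handle, members_column="Members"):
    """Return one list of protein identifiers per predicted complex.

    Assumes a CSV header row containing `members_column`, whose values
    are space-separated protein identifiers.
    """
    reader = csv.DictReader(handle)
    return [row[members_column].split() for row in reader]
```

For example: `with open("UP000005640_9606.complexes", newline="") as fh: complexes = read_complexes(fh)`.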

Finally, to get the functional overrepresentation of all these complexes, you may run:

python overrepresentation.py UP000005640_9606.fasta UP000005640_9606.complexes UP000005640_9606.goa <project-path>/<data-dir>/go-basic.obo UP000005640_9606.overrep 
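Over-representation of a GO term in a complex is commonly assessed with a one-sided hypergeometric test: given how many proteins in the proteome carry the term, how surprising is the number seen in the complex? The sketch below shows that calculation using only the standard library. It is an assumption that overrepresentation.py uses this test; the script may use a different statistic or apply multiple-testing correction, and the function name is hypothetical.

```python
from math import comb


def hypergeom_pvalue(population, annotated, sample, hits):
    """Return P(X >= hits) for a hypergeometric variable X.

    population: number of proteins in the proteome.
    annotated: proteins in the proteome annotated with the GO term.
    sample: number of proteins in the complex.
    hits: proteins in the complex annotated with the GO term.
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probability mass from `hits` up to the largest possible count.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total
```

A small p-value suggests the term appears in the complex more often than expected by chance; when many terms and complexes are tested, some form of multiple-testing correction is usually applied on top.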

Preparing files for visualization with ProteoBOOSTER-web

The section above already provides all the files required to get the information you need, but it may be preferable to visualize this data using the web-based interactive explorer.

It works by loading this information into a relational database and creating graphical user interfaces to communicate with it. As a convenience, we created a script that helps you transform the set of files you created above into files that can be ingested by the data loader that ships with ProteoBOOSTER-web.

This does require a small amount of extra pre-processing and downloading some more files, however.

Extra downloads:

You will need to:

  • download taxonomy information from UniProt by clicking the Download button. This tutorial assumes you downloaded it and named it taxonomy-info.tsv.
  • create a file containing protein names. This can be done by running the following commands:
    • grep ">" <project-path>/<data-dir>/trembl.fasta > trembl.proteins (this will extract all the lines with protein metadata from the fasta file)
    • grep ">" <project-path>/<data-dir>/swissprot.fasta > swissprot.proteins
    • cat trembl.proteins swissprot.proteins > proteobooster.proteins
    • python collect_proteins.py UP000005640_9606.homologs UP000005640_9606.interologs UP000005640_9606.interactions UP000005640_9606.proteins (this extracts all the proteins from the files we generated above; it is not a required step, but it speeds up the generation. UP000005640_9606.interactions is the experimental interactions file created earlier).
    • python build_protein_info_dict_singlethread.py proteobooster.proteins UP000005640_9606.proteins protein-buffer.proteins
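The header lines extracted by grep follow the UniProt FASTA convention, `>db|Accession|EntryName ProteinName OS=Organism OX=TaxonId ...`. As an illustration of the metadata those lines carry, the sketch below parses one header with a regular expression. The function name is hypothetical, and the build script may extract different fields.

```python
import re

# UniProt FASTA header: >db|Accession|EntryName ProteinName OS=... OX=...
HEADER_PATTERN = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry_name>\S+)\s+"
    r"(?P<protein_name>.*?)\s+OS=(?P<organism>.*?)\s+OX=(?P<taxon_id>\d+)"
)


def parse_uniprot_header(line):
    """Return the header fields as a dict, or None if the line does not match."""
    match = HEADER_PATTERN.match(line.strip())
    return match.groupdict() if match else None
```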

Preparing the database files

Now that we have all the files we need, we simply need to create a directory where the database files will be written. Let's name that directory db-load, and then we can run:

python prepare_database_files.py UP000005640_9606 <project-path>/<data-dir>/go-basic.obo UP000005640_9606.goa UP000005640_9606.interologs UP000005640_9606.interactions UP000005640_9606.homologs protein-buffer.proteins UP000005640_9606.complexes UP000005640_9606.overrep <project-path>/<data-dir>/mi.obo taxonomy-info.tsv db-load

Contributors

torresmateo