Python pipelines to prepare PPI (Protein-Protein Interactions) data from reference databases and describe them semantically using ontologies

License: MIT License

Python 100.00%

PPIntegrator - PPI Triplification Process

Summary

The main goal of this pipeline is to provide a tool for formalizing and standardizing protein-protein interaction (PPI) prediction data using the OntoPPI ontology. The pipeline is split into two parts: (i) the first part prepares data from three main sources of PPI data (HINT, STRING and PredPrin) and creates the standard files to be processed by the next part; (ii) the second part takes the prepared data and describes it semantically using ontologies related to the concepts of this domain. It describes the provenance information of PPI prediction experiments, dataset characteristics, functional annotations of the proteins involved in the PPIs, the PPI detection methods (also called evidence) used in the experiment, and the prediction score obtained by each PPI detection method for the PPIs. The pipeline also performs data fusion to map the same protein pairs across different data sources and, finally, creates a database with all this information in the AllegroGraph triple store. The figure below illustrates the two parts of this pipeline.

[Figure: the two parts of the pipeline]
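The data fusion idea mentioned in the summary (mapping the same protein pair across sources) can be illustrated with a minimal sketch. This is not the implementation used by PPIntegrator, which works on semantic descriptions; it only assumes that each source yields (protein, protein, score) tuples. The function names `pair_key` and `fuse_sources`, the protein identifiers and the scores are all hypothetical.

```python
from collections import defaultdict


def pair_key(protein_a, protein_b):
    """Return an order-independent key so (A, B) and (B, A) match."""
    return tuple(sorted((protein_a, protein_b)))


def fuse_sources(sources):
    """Group scores by protein pair across sources.

    `sources` maps a source name to a list of (protein_a, protein_b, score).
    """
    fused = defaultdict(dict)
    for source_name, interactions in sources.items():
        for protein_a, protein_b, score in interactions:
            fused[pair_key(protein_a, protein_b)][source_name] = score
    return dict(fused)


# Hypothetical toy data: identifiers and scores are made up for illustration.
example_sources = {
    "HINT": [("P1", "P2", 1.0)],
    "STRING": [("P2", "P1", 0.8), ("P3", "P4", 0.5)],
}
fused_pairs = fuse_sources(example_sources)

# Keep only the pairs reported by more than one source.
shared = {pair: scores for pair, scores in fused_pairs.items() if len(scores) > 1}
print(shared)
```

Sorting the two identifiers before using them as a key is the design choice that lets the pair (P2, P1) from one source match (P1, P2) from another.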

Requirements:

  • Python packages needed:
    • pip3 install numpy
    • pip3 install rdflib
    • pip3 install uuid
    • pip3 install SPARQLWrapper
    • AllegroGraph tools (pip3 install agraph-python)
      Go to this site for the installation tutorial

Usage Instructions

Preparation:

  1. git clone https://github.com/YasCoMa/ppintegrator.git
  2. cd ppintegrator
  3. pip3 install -r requirements.txt
     AllegroGraph is a triple store, a database for maintaining semantic descriptions. Its server provides a web application with a user interface to run, edit and manage queries, visualize results and manipulate the data without writing any code other than SPARQL queries. Using AllegroGraph is optional, but if you want to export the data to it and use it, you have to install both the server and the client.
  4. If you want to use the AllegroGraph server option (this triple store has a free license for up to 5,000,000 triples), install the AllegroGraph server on your machine and configure a user and password: Server - https://franz.com/agraph/support/documentation/current/server-installation.html; Client - https://franz.com/agraph/support/documentation/current/python/install.html
  5. Export the following environment variables to configure the AllegroGraph server:
export AGRAPH_HOST=127.0.0.1
export AGRAPH_PORT=10035
export AGRAPH_USER=chosen_user
export AGRAPH_PASSWORD=chosen_password
  6. Start AllegroGraph: path/to/allegrograph/bin/agraph-control --config path/to/allegrograph/lib/agraph.cfg start
  7. Read the file data_requirements.txt to understand which files are needed for the process
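Before running the export step, it can help to check that the four AGRAPH_* environment variables listed above are actually set. The following is a minimal sketch of such a pre-flight check using only the standard library; the function name `read_agraph_settings` is hypothetical and is not part of PPIntegrator or of the AllegroGraph client.

```python
import os

# The variable names come from the preparation steps above.
REQUIRED_VARS = ("AGRAPH_HOST", "AGRAPH_PORT", "AGRAPH_USER", "AGRAPH_PASSWORD")


def read_agraph_settings(environ=None):
    """Collect the AllegroGraph settings, failing early if any is missing."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "host": environ["AGRAPH_HOST"],
        "port": int(environ["AGRAPH_PORT"]),
        "user": environ["AGRAPH_USER"],
        "password": environ["AGRAPH_PASSWORD"],
    }


# Example with the values from the preparation steps passed explicitly
# instead of being read from the real environment.
settings = read_agraph_settings({
    "AGRAPH_HOST": "127.0.0.1",
    "AGRAPH_PORT": "10035",
    "AGRAPH_USER": "chosen_user",
    "AGRAPH_PASSWORD": "chosen_password",
})
print(settings["host"], settings["port"])
```

Calling `read_agraph_settings()` without arguments reads the real environment, so it reports every variable that was not exported in the current shell.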

Data preparation (first part) - File prepare_data_triplification.py :

  • Pipeline parameters:

    • -rt or --running_type
      Use to indicate from which source you want to prepare PPI data, as follows:
      1 - Prepare data for PredPrin
      2 - Prepare data for STRING
      3 - Prepare data for HINT

    • -fec or --file_experiment_config
      File with the experiment configuration in json format

      Examples are in these files (all the metadata are required): params_hint.json, params_predrep_5k.json and params_string.json

    • -org or --organism
      Prepare data only for one organism of interest (example: homo_sapiens)

      This parameter is optional. If it is not specified, the pipeline automatically uses the organisms described in the experiment configuration file above

  • Running modes examples:

    1. Running for PPI data generated by PredPrin:
      python3 prepare_data_triplification.py -rt 1 -fec params_predrep_5k.json

    2. Running for HINT database:
      python3 prepare_data_triplification.py -rt 3 -fec params_hint.json

    3. Running for STRING database:
      python3 prepare_data_triplification.py -rt 2 -fec params_string.json

    The file auxiliar_data_preparation.py runs all the provided examples automatically, as follows:
    python3 auxiliar_data_preparation.py
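The three running modes above can also be driven from a small Python script. The sketch below is not the content of auxiliar_data_preparation.py; it only assembles the command lines documented above from the running types and example configuration files, and prints them in dry-run mode. The helper names `build_command` and `run_all` are hypothetical.

```python
import subprocess
import sys

# Running type and example configuration file for each source, as listed above.
RUNS = [
    ("PredPrin", "1", "params_predrep_5k.json"),
    ("STRING", "2", "params_string.json"),
    ("HINT", "3", "params_hint.json"),
]


def build_command(running_type, config_file, organism=None):
    """Assemble the argument list for prepare_data_triplification.py."""
    command = [sys.executable, "prepare_data_triplification.py",
               "-rt", running_type, "-fec", config_file]
    if organism is not None:
        # -org is optional and restricts the run to a single organism.
        command += ["-org", organism]
    return command


def run_all(dry_run=True):
    """Print (dry run) or execute the preparation command for every source."""
    commands = [build_command(rt, cfg) for _, rt, cfg in RUNS]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops at the first source that fails.
            subprocess.run(command, check=True)
    return commands


commands = run_all(dry_run=True)
```

Setting `dry_run=False` executes the commands, which assumes the script is started from the cloned ppintegrator directory.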

PPI data triplification (second part) - File triplification_ppi_data.py:

  • Pipeline parameters:

    • -rt or --running_type
      Use to indicate which execution step you want to run (following the order shown is recommended):
      0 - Generate the descriptions for all the protein interaction steps of an experiment (run steps 1, 2 and 3)
      1 - Generate triples just about data provenance
      2 - Generate triples just for protein functional annotations
      3 - Generate triples just for the score results of each evidence
      4 - Execute data fusion
      5 - Generate descriptions and execute data fusion (run steps 1, 2, 3 and 4)
      6 - Export to AllegroGraph server

    • -fec or --file_experiment_config
      File with the experiment configuration in json format

      Examples are in these files (all the metadata are required): params_hint.json, params_predrep_5k.json and params_string.json

    • -fev or --file_evidence_info
      File with the PPI detection methods information in json format

      Examples are in these files (all the metadata are required): evidences_information.json, evidences_information_hint.json and evidences_information_string.json

    • -fcv or --file_config_evidence
      File with the paths of the experiment and evidence method files, in tsv format

      Example of this file: config_evidence_file.tsv

  • Running modes examples:

    1. Running to generate all semantic descriptions for PredPrin:
      python3 triplification_ppi_data.py -rt 0 -fec params_predrep_5k.json -fev evidences_information.json

    2. Running to generate only triples of data provenance:
      python3 triplification_ppi_data.py -rt 1 -fec params_hint.json -fev evidences_information_hint.json

    3. Running to generate only triples of PPI scores for each evidence:
      python3 triplification_ppi_data.py -rt 3 -fec params_hint.json -fev evidences_information_hint.json

    4. Running to generate only triples of protein functional annotations (only PredPrin exports these annotations):
      python3 triplification_ppi_data.py -rt 2 -fec params_predrep_5k.json -fev evidences_information.json

    5. Running to generate all semantic descriptions for STRING:
      python3 triplification_ppi_data.py -rt 0 -fec params_string.json -fev evidences_information_string.json

    The next options (running types 4, 5 and 6) require that at least modes 1 and 3 have already been run for HINT, STRING and PredPrin

    1. Running to execute data fusion of different sources:
      python3 triplification_ppi_data.py -rt 4 -fcv config_evidence_file.tsv

    2. Running to generate all semantic descriptions and execute data fusion of different sources (combines mode 0 and 4):
      python3 triplification_ppi_data.py -rt 5 -fcv config_evidence_file.tsv

    3. Export semantic data to the AllegroGraph server:
      python3 triplification_ppi_data.py -rt 6 -fcv config_evidence_file.tsv
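The relationship between running types and the files they take can be summarized in a short argparse sketch. It is not the parser used by triplification_ppi_data.py: the rule that running types 0 to 3 need -fec and -fev while 4 to 6 need -fcv is an assumption inferred from the examples above, and the names `expand_steps` and `parse_args` are hypothetical.

```python
import argparse

# Component steps behind the combined running types, as described above.
EXPANSION = {0: [1, 2, 3], 5: [1, 2, 3, 4]}


def expand_steps(running_type):
    """Return the individual steps a running type stands for."""
    return EXPANSION.get(running_type, [running_type])


def parse_args(argv):
    parser = argparse.ArgumentParser(description="Sketch of the triplification options")
    parser.add_argument("-rt", "--running_type", type=int, choices=range(7), required=True)
    parser.add_argument("-fec", "--file_experiment_config")
    parser.add_argument("-fev", "--file_evidence_info")
    parser.add_argument("-fcv", "--file_config_evidence")
    args = parser.parse_args(argv)
    # Assumption drawn from the examples: 0-3 use -fec/-fev, 4-6 use -fcv.
    if args.running_type in (0, 1, 2, 3):
        if not (args.file_experiment_config and args.file_evidence_info):
            parser.error("running types 0-3 need -fec and -fev")
    elif not args.file_config_evidence:
        parser.error("running types 4-6 need -fcv")
    return args


args = parse_args(["-rt", "5", "-fcv", "config_evidence_file.tsv"])
print(expand_steps(args.running_type))  # prints [1, 2, 3, 4]
```

Validating the file options up front means a missing argument is reported before any triples are generated.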

Query Scenarios for analysis

Assuming you ran all the steps shown in the section above, you can run the following options to analyse the data stored in the AllegroGraph triple store.
File to use for this section: query_analysis_ppitriplificator.py

  • Parameter:

    • -q or --query_option
      Use to indicate which query you want to perform:
      1 - Get all the different organisms whose interactions are stored in the database
      2 - Get the interactions that have scientific papers associated and the list of these papers
      3 - Get a list of the most frequent biological processes annotated for the interactions of Escherichia coli bacteria
      4 - Get only the interactions belonging to a specific biological process (regulation of transcription, DNA-templated) in Escherichia coli bacteria
      5 - Get the scores of interactions belonging to a specific biological process (regulation of transcription, DNA-templated) in Escherichia coli bacteria
      6 - Get a list of the most frequent biological processes annotated for the interactions of human organism
      7 - Get only the interactions belonging to a specific biological process (positive regulation of transcription by RNA polymerase II) in human organism
      8 - Get the scores of interactions belonging to a specific biological process (positive regulation of transcription by RNA polymerase II) in human organism
  • Running modes examples:

    1. Running queries:
      python3 query_analysis_ppitriplificator.py -q 1
      Change number 1 to the respective number of the query you want to perform

Reference

Martins, Y. C., Ziviani, A., Cerqueira e Costa, M. D. O., Cavalcanti, M. C. R., Nicolás, M. F., & de Vasconcelos, A. T. R. (2023). PPIntegrator: semantic integrative system for protein–protein interaction and application for host–pathogen datasets. Bioinformatics Advances, 3(1), vbad067.

Bug Report

Please use the Issues tab to report any bugs.
