GithubHelp home page GithubHelp logo

rnaimehaom / queryup Goto Github PK

View Code? Open in Web Editor NEW

This project forked from voisinneg/queryup

0.0 0.0 0.0 462 KB

R client for the UniProt REST API

License: GNU General Public License v3.0

R 16.39% HTML 83.61%

queryup's Introduction

R package: queryup

Guillaume Voisinne 2022 - 09 - 09

R-CMD-check Codecov test coverage

The queryup R package aims to facilitate retrieving information from the UniProt database using R. Programmatic access to the UniProt database is performed by submitting queries to the UniProt website REST API.

Install

Install the package from github using devtools:

devtools::install_github("VoisinneG/queryup")
library(queryup)

Queries

Queries combine different fields to identify matching database entries. Here, queries are submitted using the function query_uniprot(). In the queryup R package, a query must be formatted as a list containing character vectors named after existing UniProt fields (available query fields can be found in the API documentation or in the package data query_fields$field). Different query fields must be matched simultaneously. For instance, the following query uses the fields gene_exact to return the UniProt entries of all proteins encoded by gene Pik3r1 :

query <- list("gene_exact" = "Pik3r1")
df <- query_uniprot(query, show_progress = FALSE)
head(df)
##        Entry       Entry Name Gene Names Organism (ID)   Reviewed
## 2 A0A096MNU6 A0A096MNU6_PAPAN     PIK3R1          9555 unreviewed
## 3 A0A0D9RTM6 A0A0D9RTM6_CHLSB     PIK3R1         60711 unreviewed
## 4 A0A1S3F3Z7 A0A1S3F3Z7_DIPOR     Pik3r1         10020 unreviewed
## 5 A0A1U7Q814 A0A1U7Q814_MESAU     Pik3r1         10036 unreviewed
## 6 A0A287DCB8 A0A287DCB8_ICTTR     PIK3R1         43179 unreviewed
## 7 A0A2I2ZTD7 A0A2I2ZTD7_GORGO     PIK3R1          9595 unreviewed

Available query fields can be listed using the package data query_fields:

query_fields$field
##  [1] "accession"                                                
##  [2] "active"                                                   
##  [3] "Refer to the page: Sequence Annotations"                  
##  [4] "lit_author"                                               
##  [5] "protein_name"                                             
##  [6] "chebi"                                                    
##  [7] "uniprot_id (/uniref), then uniref_cluster_90 (/uniprotkb)"
##  [8] "xrefcount_pdb (or xref_count)"                            
##  [9] "date_created"                                             
## [10] "database, xref"                                           
## [11] "ec"                                                       
## [12] "Refer to the pages: Comments or Sequence Annotations"     
## [13] "existence"                                                
## [14] "family"                                                   
## [15] "fragment"                                                 
## [16] "gene"                                                     
## [17] "gene_exact"                                               
## [18] "go"                                                       
## [19] "virus_host_name, virus_host_id"                           
## [20] "accession_id"                                             
## [21] "inchikey"                                                 
## [22] "protein_name"                                             
## [23] "interactor"                                               
## [24] "keyword"                                                  
## [25] "length"                                                   
## [26] "mass"                                                     
## [27] "cc_mass_spectrometry"                                     
## [28] "date_modified"                                            
## [29] "protein_name"                                             
## [30] "organelle"                                                
## [31] "organism_name, organism_id"                               
## [32] "plasmid"                                                  
## [33] "proteome"                                                 
## [34] "proteomecomponent"                                        
## [35] "sec_acc"                                                  
## [36] "reviewed"                                                 
## [37] "scope"                                                    
## [38] "sec_acc"                                                  
## [39] "sequence"                                                 
## [40] "date_sequence_modified"                                   
## [41] "strain"                                                   
## [42] "taxonomy_name, taxonomy_id"                               
## [43] "tissue"                                                   
## [44] "cc_webresource"

Columns

By default, query_uniprot() returns a data.frame with UniProt accession IDs, gene names, organism and Swiss-Prot review status. You can choose which data columns to retrieve using the columns parameter.

df <- query_uniprot(query, 
                    columns = c("id", "sequence", "keyword", "gene_primary"),
                    show_progress = FALSE)
## Warning in (function (..., deparse.level = 1) : number of columns of result is
## not a multiple of vector length (arg 338)

See the API documentation or the package data return_fields for all available columns. Available returned fields can be listed using the package data return_fields:

head(return_fields)
##          field                      label
## 1    accession                      Entry
## 2           id                 Entry name
## 3   gene_names                 Gene names
## 4 gene_primary       Gene names (primary)
## 5 gene_synonym       Gene names (synonym)
## 6     gene_oln Gene names (ordered locus)

Note that the parameter columns and the name of the corresponding column in the output data frame do not necessarily match (they correspond to columns “field” and “label” respectively in the package data return_fields).

names(df)
## [1] "Entry"                "Entry Name"           "Sequence"            
## [4] "Keywords"             "Gene Names (primary)"

Let’s check the sequence and the UniProt keywords corresponding to the first entry :

as.character(df$Sequence[1])
## [1] "MSAEGYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFSDGQEARPEEIGWLNGYNETTGERGDFPGTYVEYIGRKKISPPTPKPRPPRPLPVAPGSSKTEADVEQQALTLPDLAEQFAPPDVAPPLLIKLVEAIEKKGLECSTLYRTQSSGNLAELRQLLDCDTASVDLEMIDVHILADAFKRYLLDLPNPVIPAAVYSEMISLAQEVQSSEEYIQLLKKLIRSPSIPHQYWLTLQYLLKHFFKLSQTSSKNLLNARVLSEIFSPMLFRFSAASSDNTENLIKVIEILISTEWNERQPAPALPPKPPKPTTVANNGMNNNMSLQDAEWYWGDISREEVNEKLRDTADGTFLVRDASTKMHGDYTLTLRKGGNNKLIKIFHRDGKYGFSDPLTFNSVVELINHYRNESLAQYNPKLDVKLLYPVSKYQQDQVVKEDNIEAVGKKLHEYNTQFQEKSREYDRLYEEYTRTSQEIQMKRTAIEAFNETIKIFEEQCQTQERYSKEYIEKFKREGNEKEIQRIMHNYDKLKSRISEIIDSRRRLEEDLKKQAAEYREIDKRMNSIKPDLIQLRKTRDQYLMWLTQKGVRQKKLNEWLGNENTEDQYSLVEDDEDLPHHDEKTWNVGSSNRNKAENLLRGKRDGTFLVRESSKQGCYACSVVVDGEVKHCVINKTATGYGFAEPYNLYSSLKELVLHYQHTSLVQHNDSLNVTLAYPVYAQDSYFIFQGNMGRMHGNGHSM"
as.character(df$Keywords[1])
## [1] "Coiled coil;Reference proteome;Repeat;SH2 domain;SH3 domain;Stress response"

Combining query fields

Our first query returned many matches. We can build more specific queries by using more than one query field. By default, matching entries must satisfy all query fields simultaneously. Let’s retrieve the only Swiss-Prot reviewed protein entry encoded by gene Pik3r1 in Homo sapiens (taxon: 9606):

query <- list("gene_exact" = "Pik3r1", 
              "reviewed" = "true", 
              "organism_id" = "9606")
df <- query_uniprot(query, show_progress = FALSE)
print(df)
##    Entry Entry Name  Gene Names Organism (ID) Reviewed
## 2 P27986 P85A_HUMAN PIK3R1 GRB1          9606 reviewed

Multiple items per query field

It is also possible to look for entries that match different items within a single query field. Items from a given query field are looked for independently. Hence, the following query will return all Swiss-Prot reviewed proteins encoded by either Pik3r1 or Pik3r2 in either Mus musculus (taxon: 10090) or Homo sapiens (taxon: 9606):

query <- list("gene_exact" = c("Pik3r1", "Pik3r2"), 
              "reviewed" = "true", 
              "organism_id" = c("9606", "10090"))
df <- query_uniprot(query, show_progress = FALSE)
print(df)
##    Entry Entry Name  Gene Names Organism (ID) Reviewed
## 2 O00459 P85B_HUMAN      PIK3R2          9606 reviewed
## 3 O08908 P85B_MOUSE      Pik3r2         10090 reviewed
## 4 P26450 P85A_MOUSE      Pik3r1         10090 reviewed
## 5 P27986 P85A_HUMAN PIK3R1 GRB1          9606 reviewed

Queries with invalid entries

If a query containing invalid entries is sent to the UniProt REST API, an error message is returned and no information about the other potentially valid entries can be retrieved. To overcome this limitation, queryup parses the error messages and remove invalid entries from the query. Hence, query_uniprot() will return information for valid entries only :

invalid_ids <- c("P226", "CON_P22682", "REV_P47941")
valid_ids <- c("A0A0U1ZFN5", "P22682")
ids <- c(invalid_ids, valid_ids)
query <- list("accession_id" = ids)
query_uniprot(query)
## 3 invalid values were found (REV_P47941, P226, CON_P22682) and removed from the query.

##        Entry     Entry Name Gene Names Organism (ID)   Reviewed
## 2 A0A0U1ZFN5 A0A0U1ZFN5_RAT  Cbl c-Cbl         10116 unreviewed
## 3     P22682      CBL_MOUSE        Cbl         10090   reviewed

Long queries

Because UniProt REST API limits the size of queries, long queries containing more than a few hundreds entries cannot be passed in a single request. To overcome this limitation, the queryup package splits long queries into smaller ones. For instance, the dataset uniprot_entries that is bundled with the queryup package contains information for 1000 UniProt entries. We could retrieve the ENSEMBL ids corresponding to these entries using :

ids <- uniprot_entries$Entry
query <- list("accession_id" = ids)
columns <- c("gene_names", "xref_ensembl")
df <- query_uniprot(query, columns = columns, show_progress = FALSE)
head(df)
##        Entry                 Gene Names
## 2 A0A087WPF7             Auts2 Kiaa0442
## 3 A0A088MLT8 Iqcj-Schip1 Iqschfp Schip1
## 4 A0A0B4J1F4                     Arrdc4
## 5 A0A0B4J1G0               Fcgr4 Fcgr3a
## 6 A0A0G2JDV3                 Gbp6 Mpa2l
## 7 A0A0U1RPR8                     Gucy2d
##                                                                Ensembl
## 2 ENSMUST00000161226 [A0A087WPF7-1];ENSMUST00000161374 [A0A087WPF7-3];
## 3                                   ENSMUST00000182006 [A0A088MLT8-1];
## 4 ENSMUST00000048068 [A0A0B4J1F4-1];ENSMUST00000118110 [A0A0B4J1F4-2];
## 5                                                  ENSMUST00000078825;
## 6                                                           A0A0G2JDV3
## 7                                                  ENSMUST00000206435;

Protein-protein interactions

Another usage could be to retrieve protein-protein interactions among a set of UniProt entries:

ids <- sample(uniprot_entries$Entry, 400)
query <- list("accession_id" = ids, 
              "interactor" = ids)
columns <- "cc_interaction"
df <- query_uniprot(query = query, columns = columns, show_progress = FALSE)
head(df)
##      Entry                         Interacts with
## 2   E9Q401 Q6PHZ2; Q9Z2I2; Q8K4S1; E9Q401; P23327
## 3   O08808         Q8BKX1; O08808; P46940; P61586
## 4   O35681         Q9R0N4; O35681; Q9R0N8; Q9R0N9
## 21  O08547                                 O35526
## 22  A2A259                         Q2EG98; A2A259
## 211 O35526 O08547; P60879; P46097; P21707; Q62747

queryup's People

Contributors

voisinneg avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.