wikimedia / wikidataqueryservicer

An R package for the Wikidata Query Service API

Home Page: https://cran.r-project.org/package=WikidataQueryServiceR

License: Other

wikidataqueryservicer's Introduction

WikidataQueryServiceR

Project Status: Active. The project has reached a stable, usable state and is being actively developed.

This is an R wrapper for the Wikidata Query Service (WDQS) which provides a way for tools to query Wikidata via SPARQL (see the beta at https://query.wikidata.org/). It is written in and for R, and was inspired by Os Keyes’ WikipediR and WikidataR packages.

Author: Mikhail Popov (Wikimedia Foundation)
License: MIT
Status: Active

Installation

install.packages("WikidataQueryServiceR")

To install the development version:

# install.packages("remotes")
remotes::install_github("wikimedia/WikidataQueryServiceR@main")

Usage

library(WikidataQueryServiceR)
## See ?WDQS for resources on Wikidata Query Service and SPARQL

You submit SPARQL queries using the query_wikidata() function.

Example: fetching genres of a particular movie

In this example, we find an “instance of” (P31) “film” (Q11424) that has the label “The Cabin in the Woods” (Q45394), get its genres (P136), and then use WDQS label service to return the genre labels.

query_wikidata('SELECT DISTINCT
  ?genre ?genreLabel
WHERE {
  ?film wdt:P31 wd:Q11424.
  ?film rdfs:label "The Cabin in the Woods"@en.
  ?film wdt:P136 ?genre.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}')
genre                                     genreLabel
http://www.wikidata.org/entity/Q3072049   zombie film
http://www.wikidata.org/entity/Q471839    science fiction film
http://www.wikidata.org/entity/Q859369    comedy-drama
http://www.wikidata.org/entity/Q1342372   monster film
http://www.wikidata.org/entity/Q853630    slasher film
http://www.wikidata.org/entity/Q224700    comedy horror

For more example SPARQL queries, see this page on Wikidata.

query_wikidata() can accept multiple queries, returning a (potentially named) list of data frames. If the vector of SPARQL queries is named, the results will inherit those names.
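
As a small illustration of the named-vector behavior described above, the sketch below builds two queries and guards the network call. The query names (cats, dogs), the LIMIT clauses, and the interactive() guard are choices made for this example, not something taken from the package documentation.

```r
# Two small SPARQL queries in a named character vector; the names are
# arbitrary labels chosen for this sketch.
queries <- c(
  cats = 'SELECT ?item ?itemLabel WHERE {
    ?item wdt:P31 wd:Q146.
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  } LIMIT 5',
  dogs = 'SELECT ?item ?itemLabel WHERE {
    ?item wdt:P31 wd:Q144.
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  } LIMIT 5'
)

# Only contact WDQS in an interactive session with the package installed.
if (interactive() && requireNamespace("WikidataQueryServiceR", quietly = TRUE)) {
  results <- WikidataQueryServiceR::query_wikidata(queries)
  # Per the description above, the list should carry the names of `queries`.
  print(names(results))
}
```

Each element of the returned list can then be accessed by name, for example results$cats.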

Fetching queries from Wikidata’s examples page

The package provides a WikipediR-based function for getting SPARQL queries from the WDQS examples page.

sparql_query <- get_example(c("Cats", "How many states this US state borders"))
sparql_query[["How many states this US state borders"]]
 
SELECT ?state ?stateLabel ?borders
WHERE
{
  {
    SELECT ?state (COUNT(?otherState) as ?borders)
    WHERE
    {
    ?state wdt:P31 wd:Q35657 .
    ?otherState wdt:P47 ?state .
    ?otherState wdt:P31 wd:Q35657 .
    }
    GROUP BY ?state
  }
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" .
  }
}        
ORDER BY DESC(?borders) 

Now we can run all extracted SPARQL queries:

results <- query_wikidata(sparql_query)
lapply(results, dim)
## $Cats
## [1] 147   2
## 
## $`How many states this US state borders`
## [1] 48  3
head(results$`How many states this US state borders`)
state                                  stateLabel     borders
http://www.wikidata.org/entity/Q1509   Tennessee      8
http://www.wikidata.org/entity/Q1581   Missouri       8
http://www.wikidata.org/entity/Q1261   Colorado       7
http://www.wikidata.org/entity/Q1603   Kentucky       7
http://www.wikidata.org/entity/Q1400   Pennsylvania   6
http://www.wikidata.org/entity/Q1211   South Dakota   6

Additional Information

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

wikidataqueryservicer's People

Contributors

bearloga

wikidataqueryservicer's Issues

Switch to WikipediR for extracting query examples

This package does not rely on the rvest and urltools R packages for core functionality, but if the user has them installed then there is a bonus function for scraping the examples page and extracting SPARQL queries.

It would be faster, simpler, and overall better to use WikipediR in scrape_example() to fetch the original wiki markup through Wikidata's MediaWiki API.

SPARQL package?

First, your package looks brilliant.

Second, a question: I have never used the SPARQL package, but is there any reason not to depend on it in packages like yours?

Notable deaths

I would like to reproduce this example through a SPARQL query to Wikidata, instead of relying on scraping Wikipedia like the author of the example does.

May I ask if you know how to do this? I am only starting to get a grip on using Wikidata through SPARQL.

Querying Wikidata with variables instead of strings using WikidataQueryServiceR

Given a vector of movie names, I would like to find their genres by querying Wikidata.

Since I am an R user, I recently discovered WikidataQueryServiceR, which has exactly the example I was looking for:

library(WikidataQueryServiceR)

query_wikidata('SELECT DISTINCT
  ?genre ?genreLabel
 WHERE {
  ?film wdt:P31 wd:Q11424.
  ?film rdfs:label "The Cabin in the Woods"@en.
  ?film wdt:P136 ?genre.
 SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
 }')
    
## 5 rows were returned by WDQS

Unfortunately, this query uses static text, so I would like to replace The Cabin in the Woods with a vector. To do so, I tried the following code:

library(WikidataQueryServiceR)

example <- "The Cabin in the Woods" # Single string for testing purposes.

query_wikidata(paste('SELECT DISTINCT ?human ?humanLabel ?sex_or_gender  ?sex_or_genderLabel WHERE {
  ?human wdt:P31 wd:Q5.
  ?human rdfs:label', example, '@en.
  ?human wdt:P21 ?sex_or_gender.
 SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  OPTIONAL { ?human wdt:P2561 ?name. }
}', sep = ""))

But that does not work as expected, as I get the following result:

Error in FUN(X[[i]], ...) : Bad Request (HTTP 400).

Is there any way to pass variables instead of fixed strings?
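
The 400 error above is consistent with the title being pasted into the query without surrounding double quotes, which leaves the SPARQL literal malformed. One possible approach, sketched here, is to build each query with sprintf() and quote the literal explicitly. The helper name build_genre_query and the second title are hypothetical, and the query follows the README's film example rather than the human/gender query in the attempt above.

```r
# Hypothetical helper: build the README's genre query for one film title.
build_genre_query <- function(title) {
  # Escape backslashes and double quotes so the title stays a valid
  # SPARQL string literal.
  safe_title <- gsub('(["\\\\])', '\\\\\\1', title)
  sprintf('SELECT DISTINCT ?genre ?genreLabel
WHERE {
  ?film wdt:P31 wd:Q11424.
  ?film rdfs:label "%s"@en.
  ?film wdt:P136 ?genre.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}', safe_title)
}

# One query per title; vapply() names the result after the input strings.
movies <- c("The Cabin in the Woods", "Alien")
queries <- vapply(movies, build_genre_query, character(1))

# Only contact WDQS in an interactive session with the package installed.
if (interactive() && requireNamespace("WikidataQueryServiceR", quietly = TRUE)) {
  genres <- WikidataQueryServiceR::query_wikidata(queries)
}
```

Because the vector of queries is named, the resulting list of data frames should be addressable by movie title.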

"smart" formatting yields a "cannot open the connection" error

Running in format = "simple" mode, we get the following:

sparql_query <- "SELECT DISTINCT
  ?softwareVersion ?publicationDate
WHERE {
  BIND(wd:Q206904 AS ?R)
  ?R p:P348 [
  ps:P348 ?softwareVersion;
  pq:P577 ?publicationDate
  ] .
}"

str(results <- WikidataQueryServiceR::query_wikidata(sparql_query))
'data.frame':	21 obs. of  2 variables:
 $ softwareVersion: chr  "3.3.3" "3.1.0" "3.1.2" "3.2.5" ...
 $ publicationDate: chr  "2017-03-06T00:00:00Z" "2014-04-10T00:00:00Z" "2014-10-31T00:00:00Z" "2016-04-14T00:00:00Z" ...

Using the format = "smart" mode, we'd be able to get a data.frame with publicationDate formatted to POSIXlt:

results <- WikidataQueryServiceR::query_wikidata(sparql_query, format = "smart")

But we get an error instead:

Error in open.connection(con, "rb") : cannot open the connection
In addition: Warning message:
In open.connection(con, "rb") : cannot open file '{
  "head" : {
    "vars" : [ "softwareVersion", "publicationDate" ]
  },
  "results" : {
    "bindings" : [ {
      "softwareVersion" : {
        "type" : "literal",
        "value" : "3.3.3"
      },
      "publicationDate" : {
        "datatype" : "http://www.w3.org/2001/XMLSchema#dateTime",
        "type" : "literal",
        "value" : "2017-03-06T00:00:00Z"
      }
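
A possible workaround, sketched under the assumption that the "simple" result has the shape shown earlier in this issue, is to convert the date column by hand with base R. The data frame below is a stand-in built from the four values visible in the str() output, so the sketch runs without contacting WDQS.

```r
# Stand-in for the data frame returned in "simple" mode; the values are
# copied from the str() output shown in this issue.
results <- data.frame(
  softwareVersion = c("3.3.3", "3.1.0", "3.1.2", "3.2.5"),
  publicationDate = c("2017-03-06T00:00:00Z", "2014-04-10T00:00:00Z",
                      "2014-10-31T00:00:00Z", "2016-04-14T00:00:00Z"),
  stringsAsFactors = FALSE
)

# Parse the ISO 8601 timestamps; the trailing "Z" marks UTC.
results$publicationDate <- as.POSIXct(
  results$publicationDate,
  format = "%Y-%m-%dT%H:%M:%SZ",
  tz = "UTC"
)

str(results)
```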

query_wikidata may drop results

While using this package, one of my queries kept returning a varying and far too low number of results. The query in question:

SELECT ?item ?itemLabel ?bhl_id ?orcid ?viaf ?isni WHERE {
  ?item wdt:P4081 ?bhl_id. #BHL creator
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" } 
  OPTIONAL { ?item wdt:P496 ?orcid .}
  OPTIONAL { ?item wdt:P214 ?viaf .}
  OPTIONAL { ?item wdt:P213 ?isni .}
}

As of now, this should return 27,225 results, but I only got between 7,000 and 10,000 each time. On closer inspection, the results seem to be cut off every time an item with a ÿ character in its label is reached, so I suspect an encoding problem during content extraction. I am running this on a Windows 10 machine; I don't know whether that plays a role.

When using the following function:

# Send the SPARQL query straight to WDQS and let httr parse the response.
querki <- function(query, h = "text/csv") {
  response <- httr::GET(
    url = "https://query.wikidata.org/sparql",
    query = list(query = query),
    httr::add_headers(Accept = h)
  )
  httr::content(response)
}

I do get all results.

Switch from GET to POST

GET has length restrictions, POST doesn't. WDQS supports POST now so the queries should be submitted via POST to avoid potential cropping of long queries.
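
As a rough sketch of what a POST submission could look like with httr, the helper below sends the query as a form field in the request body instead of in the URL. The function name post_sparql and its parameters are hypothetical; this is not the package's implementation.

```r
# Hypothetical helper (not the package's implementation): send a SPARQL
# query in the POST body rather than in the URL.
post_sparql <- function(query,
                        endpoint = "https://query.wikidata.org/sparql",
                        accept = "text/csv") {
  response <- httr::POST(
    url = endpoint,
    body = list(query = query),
    encode = "form",  # application/x-www-form-urlencoded
    httr::add_headers(Accept = accept)
  )
  # Turn HTTP errors (such as 400 Bad Request) into R errors.
  httr::stop_for_status(response)
  # Return the raw response text, decoded as UTF-8.
  httr::content(response, as = "text", encoding = "UTF-8")
}
```

A call such as post_sparql("SELECT ?s WHERE { ?s ?p ?o } LIMIT 1") would return the CSV response as a single string, which could then be parsed with read.csv(text = ...).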

Allow wdqs_requester() and query_wikidata() to request other instances

https://query.wikidata.org/sparql is the main instance used to query Wikidata. However, in some contexts, other instances may be useful. For instance, https://qlever.cs.uni-freiburg.de/wikidata/ is sometimes used to run big queries on a dump of Wikidata. Another instance is https://wikidata.demo.openlinksw.com/sparql.

I think it would be useful to add an instance parameter to wdqs_requester. Do you think it's a good idea?
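
To make the proposal concrete, here is one way an instance argument might be resolved to an endpoint URL. Everything in this sketch is hypothetical: the names known_instances and resolve_instance, the short labels, and the idea of accepting either a label or a full URL are assumptions for illustration, not the package's API.

```r
# Hypothetical lookup table of SPARQL endpoints mentioned in this issue.
known_instances <- c(
  wikidata = "https://query.wikidata.org/sparql",
  openlink = "https://wikidata.demo.openlinksw.com/sparql"
)

# Hypothetical resolver: accept a known label or a full endpoint URL.
resolve_instance <- function(instance = "wikidata") {
  if (instance %in% names(known_instances)) {
    return(known_instances[[instance]])
  }
  if (grepl("^https?://", instance)) {
    return(instance)
  }
  stop("Unknown instance: ", instance)
}

resolve_instance()                               # main WDQS endpoint
resolve_instance("https://example.org/sparql")   # custom URL passed through
```

Defaulting to the main instance would keep existing calls working unchanged while letting users opt in to another endpoint.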

Submit new version to CRAN

In a few months from now (because v0.1.1 was published on 28 April 2017), the latest version should be submitted to CRAN.

The changes (so far) include:

  • Switch to WikipediR for extracting query examples (#4)
  • Switch from POSIXlt to POSIXct (#5)
  • Switch from GET to POST (#6)
  • Rate limiting (#11)
  • Using tidyverse family of packages (readr::read_csv is especially helpful for automatic conversion of date-time columns)
