This project forked from hay/wdreconcile

wdreconcile.py

Map strings to Wikidata QIDs using various methods

This is a work-in-progress Python command-line tool to align strings to Wikidata items (QIDs).

Install

Clone this repo and use poetry to install dependencies:

poetry install

Then start the tool with poetry run wdreconcile.

Usage

Using the search reconciler

Create a text file with the strings you want to reconcile, one per line. For example:

museums.txt

Metropolitan Museum of Art
Centraal Museum
Jewish Historical Museum

By default, wdreconcile uses the wdsearch reconciler, which returns the first result from the wbsearchentities Wikidata API. This is the same result you get from the autocomplete field on the Wikidata website. You need to specify a language as an ISO code (e.g. en):

poetry run wdreconcile -i museums.txt -o museums.csv -l en

This will give you back a file called museums.csv that looks like this:

| query | id | label | description | status |
| --- | --- | --- | --- | --- |
| Metropolitan Museum of Art | Q160236 | Metropolitan Museum of Art | major art museum in New York City, United States | ok |
| Centraal Museum | Q260913 | Centraal Museum | museum in Utrecht, Netherlands | ok |
| Jewish Historical Museum | Q702726 | Jewish Historical Museum | Jewish history, culture, and religion museum in Amsterdam, Netherlands | ok |
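
As a rough illustration of what such a lookup involves, the sketch below queries wbsearchentities directly with Python's standard library and keeps only the first hit. It is not taken from wdreconcile's source: the function names and the User-Agent string are made up for this example, and it does not try to reproduce how the tool fills the status column.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://www.wikidata.org/w/api.php"


def build_search_url(query, language, limit=1):
    """Build a wbsearchentities request URL for one input string."""
    params = {
        "action": "wbsearchentities",
        "search": query,
        "language": language,
        "limit": limit,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def first_match(response):
    """Reduce an API response dict to its first hit, or None if there is none."""
    hits = response.get("search", [])
    if not hits:
        return None
    top = hits[0]
    return {
        "id": top["id"],
        "label": top.get("label", ""),
        "description": top.get("description", ""),
    }


def search(query, language, limit=1):
    """Send the request and decode the JSON reply (needs network access)."""
    # Hypothetical User-Agent value, chosen only for this example.
    request = Request(
        build_search_url(query, language, limit),
        headers={"User-Agent": "wdreconcile-example/0.1"},
    )
    with urlopen(request) as handle:
        return json.load(handle)
```

Calling first_match(search("Centraal Museum", "en")) requires network access. Raising limit asks the API for more candidates per string.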

The output file (-o) can have any extension that dataknead supports, so to write JSON instead, run the command like this:

poetry run wdreconcile -i museums.txt -o museums.json -l en

If you want more than the first result, use the -li (limit) parameter to set the number of results:

poetry run wdreconcile -i museums.txt -o museums-3.csv -l en -li 3

You can also use the Wikidata full-text search, which gives you the same results as the Special:Search page. To do so, pass wdfullsearch to the -rt argument. The wdfullsearch reconciler is slower, running at about half the speed of the default wdsearch reconciler.

poetry run wdreconcile -i museums.txt -o museums.csv -l en -rt wdfullsearch
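
MediaWiki exposes full-text search through its Action API (action=query with list=search), so a lookup of this kind can be sketched as follows. This assumes, without confirmation from the source, that wdfullsearch issues a comparable request; the function names are hypothetical.

```python
from urllib.parse import urlencode

API_URL = "https://www.wikidata.org/w/api.php"


def build_fulltext_url(query, limit=1):
    """Build a full-text search request URL for one input string."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def qids_from_fulltext(response):
    """Collect the page titles of all hits.

    On Wikidata, item pages in the main namespace are titled by their QID,
    so the titles can be used as identifiers directly.
    """
    hits = response.get("query", {}).get("search", [])
    return [hit["title"] for hit in hits]
```

The response of this API carries page titles and text snippets rather than labels and descriptions, so a second lookup by QID would be needed to fill those columns.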

You can also use the Wikidata reconciler that OpenRefine uses by passing openrefine to the -rt (reconciler type) parameter.

poetry run wdreconcile -i museums.txt -o museum-openrefine.csv -rt openrefine -l en
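
OpenRefine talks to reconciliation services over a small HTTP protocol in which a batch of queries is sent as a JSON object in a queries form field. The sketch below builds such a request body and picks the top-scoring candidate from a response. The service URL is deliberately left out because the README does not state which endpoint wdreconcile uses, and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode


def build_reconcile_body(strings, limit=1):
    """Encode a batch of strings as the form body of a reconciliation request."""
    # Each query gets an arbitrary key (q0, q1, ...) that the service echoes back.
    queries = {
        f"q{index}": {"query": text, "limit": limit}
        for index, text in enumerate(strings)
    }
    return urlencode({"queries": json.dumps(queries)})


def best_candidates(response):
    """Map each query key to the id of its highest-scoring candidate, or None."""
    best = {}
    for key, value in response.items():
        results = value.get("result", [])
        if results:
            best[key] = max(results, key=lambda item: item["score"])["id"]
        else:
            best[key] = None
    return best
```

The body returned by build_reconcile_body would be sent in a POST request to the service, and the decoded JSON reply passed to best_candidates.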

Look up labels/descriptions by QID

Another use of wdreconcile is to map QIDs back to labels and descriptions using the wdentity reconciler. This also checks whether each item exists, which can be handy for batch-checking existing QIDs.

poetry run wdreconcile -i museum-qids.csv -o museum-matched.csv -rt wdentity -l en
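
A lookup of this kind can be done with the wbgetentities Wikidata API, which flags well-formed but nonexistent IDs with a missing key. The sketch below is an assumption about how such a check could work, not wdreconcile's actual code. The function names and the "missing" status value are hypothetical (only "ok" appears in the sample output above).

```python
from urllib.parse import urlencode

API_URL = "https://www.wikidata.org/w/api.php"


def build_entity_url(qids, language):
    """Build a wbgetentities request URL for a batch of QIDs."""
    params = {
        "action": "wbgetentities",
        "ids": "|".join(qids),
        "props": "labels|descriptions",
        "languages": language,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def rows_from_entities(response, language):
    """Turn a wbgetentities response dict into one row per requested QID."""
    rows = []
    for qid, entity in response.get("entities", {}).items():
        if "missing" in entity:
            # Hypothetical status value for items that do not exist.
            rows.append({"id": qid, "label": "", "description": "", "status": "missing"})
            continue
        label = entity.get("labels", {}).get(language, {}).get("value", "")
        description = entity.get("descriptions", {}).get(language, {}).get("value", "")
        rows.append({"id": qid, "label": label, "description": description, "status": "ok"})
    return rows
```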

Troubleshooting

If you add the -v (verbose) flag, wdreconcile prints much more debug information.

All options

usage: wdreconcile [-h] -i INPUT -o OUTPUT
                   [-rt {openrefine,wdentity,wdsearch,wdfullsearch}] -l
                   LANGUAGE [-li LIMIT] [-v]

Reconcile a list of strings to Wikidata items

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Input file (text, line based)
  -o OUTPUT, --output OUTPUT
                        Output file
  -rt {openrefine,wdentity,wdsearch,wdfullsearch}, --reconciler_type {openrefine,wdentity,wdsearch,wdfullsearch}
                        Reconciler type
  -l LANGUAGE, --language LANGUAGE
                        ISO code of the language you're using to reconcile
  -li LIMIT, --limit LIMIT
                        How many results to return
  -v, --verbose         Display debug information

License

MIT © Hay Kranen
