jvfe / reconciler

Python package to reconcile DataFrames

Home Page: https://jvfe.github.io/reconciler/

License: BSD 2-Clause "Simplified" License

Python 91.95% Makefile 8.05%
wikidata linked-data open-data reconciliation-service pandas python dataframe

reconciler's Introduction

reconciler

reconciler is a Python package for reconciling tabular data against reconciliation services such as Wikidata. It works much like OpenRefine does, but entirely within Python, using Pandas.

Quickstart

You can install the latest version of reconciler from PyPI with:

pip install reconciler

Then to use it:

from reconciler import reconcile
import pandas as pd

# A DataFrame with a column you want to reconcile.
test_df = pd.DataFrame(
    {
        "City": ["Rio de Janeiro", "São Paulo", "São Paulo", "Natal"],
        "Country": ["Q155", "Q155", "Q155", "Q155"]
    }
)

# Reconcile against type city (Q515), getting the best match for each item.
reconciled = reconcile(test_df["City"], type_id="Q515")

The resulting dataframe would look like this:

id       match  name            score  type                    type_id   input_value
Q8678    True   Rio de Janeiro  100    city                    Q515      Rio de Janeiro
Q174     True   São Paulo       100    city                    Q515      São Paulo
Q131620  True   Natal           100    municipality of Brazil  Q3184121  Natal

In case you want to ensure the results are cities from Brazil, you can specify the property_mapping argument with a specific property-value pair:

# Reconcile against type city (Q515), keeping items whose country (P17) property equals Brazil (Q155).
reconciled = reconcile(test_df["City"], type_id="Q515", property_mapping={"P17": test_df["Country"]})

Options

The reconcile() function accepts several options.

  • type_id - The type of items to reconcile against per the API specification.
  • top_res - Either the number of results to return per entry or the string 'all' to return all results.
  • property_mapping - A dictionary mapping property IDs to the values (for example, a DataFrame column) used to filter results, per the API specification.
  • reconciliation_endpoint - The reconciliation service to connect to. Defaults to https://wikidata.reconci.link/en/api.

Other very useful packages

Although my opinion may be biased, I think reconciler is a pretty nice package. But the thing is, it probably won't fulfill all your Wikidata-related needs. Here are other packages that could help with that:

  • WikidataIntegrator has a lot of very nice low-level functions for dealing with various Wikidata-related activities, such as item acquisition and programmatic editing.

  • wikidata2df is a very simple utility package for quickly and easily turning Wikidata SPARQL queries into Pandas DataFrames.

reconciler's People

Contributors

edsu, jvfe

reconciler's Issues

Add option to reconcile against specific triples

Currently, users can only reconcile against a specific type, without taking the properties of the items into account. This is especially important for items containing gene names: you could, for example, reconcile against type gene (Q7187) with found in taxon (P703) Homo sapiens (Q15978631).

The API already supports these triples, it would just be a question of implementing it in the code.
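
As a rough sketch of what such a query could look like, assuming the service accepts a properties list of pid/v pairs in the shape described by the Reconciliation Service API. The helper name build_query and the label "TP53" are illustrative assumptions, not part of reconciler:

```python
import json


def build_query(label, type_id, properties=None):
    """Build one reconciliation query object (hypothetical helper)."""
    query = {"query": label, "type": type_id}
    if properties:
        # Each property/value pair becomes one entry in "properties".
        query["properties"] = [
            {"pid": pid, "v": value} for pid, value in properties.items()
        ]
    return query


# Type gene (Q7187), found in taxon (P703) Homo sapiens (Q15978631).
# "TP53" is only an example label.
query = build_query("TP53", "Q7187", {"P703": "Q15978631"})
print(json.dumps(query, indent=2))
```

Whether a given service expects the value as a plain string or as an object with an id field may differ, so the value format here should be checked against the service being used.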

Possibility to specify multiple type_id (instance of) for an entity to reconcile

I wonder if it is possible to specify multiple type_id for an entity.

For example, a use case would be: I would like to search for entities that are either city (Q6256) or country (Q).
Then an autosuggest service could show only (or prioritize) entities of these types over other types.

If it is not possible, is there any plan to implement it?
I think this functionality would be nice.

Many thanks in advance!

Avoid name conflict with module reconcile and function reconcile

Hi, I tried monkey patching my fix in #12 into a pip-installed version of reconciler, but it was harder than I expected. I was able to work around the trouble, but it might be worth considering renaming the reconcile module.

My first issue was that I was trying to modify webutils, but because reconcile, which uses webutils, is imported from __init__.py, it's too late to change it once the module is loaded. My fix was to replace this with a blank __init__.py.

The next issue is that this line replaces the reconcile module with the reconcile function. My monkey patch, which sets reconciler.reconcile = reconciler.reconcile.reconcile, has trouble when called twice (when running a Jupyter notebook, for instance).

Might it be a good idea to name the reconcile module something like reconcileImpl so that the line doesn’t clobber the module with the function? Then this line could be called more than once. Also the module is visible for debugging.
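
Short of renaming the module, the monkey patch itself could be made safe to run twice. The following is a minimal sketch under stated assumptions: the package and submodule are in-memory stand-ins built with types.ModuleType, and expose_function is a hypothetical helper name, not something reconciler provides:

```python
import types

# In-memory stand-ins for the package layout (hypothetical): a package
# "reconciler" with a submodule "reconcile" that defines a function
# also called "reconcile".
package = types.ModuleType("reconciler")
submodule = types.ModuleType("reconciler.reconcile")


def reconcile(column):
    return f"reconciled {column}"


submodule.reconcile = reconcile
package.reconcile = submodule  # the name still points at the module


def expose_function(pkg):
    """Rebind pkg.reconcile to the function only while it is still a module."""
    if isinstance(pkg.reconcile, types.ModuleType):
        pkg.reconcile = pkg.reconcile.reconcile


expose_function(package)
expose_function(package)  # second call, e.g. a re-run notebook cell, does nothing
print(package.reconcile("City"))
```

Without the isinstance guard, the second call would look up a reconcile attribute on the function and raise AttributeError.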

Invalid query error when using simplified version of your code

Hi @jvfe. Thanks so much for posting this. I had been searching all over the web for some examples and didn't find the OpenRefine or Reconciliation Service API W3C Community Report particularly helpful for getting the wikidata.reconci.link API to work in Python. Using your code here and the Reconciliation Service example for a reconciliation query, I assembled this minimal Python script:

import requests
import json

http = requests.Session()

reconciliation_endpoint = 'https://wikidata.reconci.link/en/api'

query_string = '''{
    "queries": [
      {
        "query": "Christel Hanewinckel",
        "type": "Q5",
        "limit": 5,
        "type_strict": "should"
      }
    ]
}'''

response = http.post(reconciliation_endpoint, data=json.loads(query_string))
print(json.dumps(response.json(), indent=2))

I had been unable to get anything but a 404 from the API when I tried making the call using Postman. Using this script I did get an error message from the server:

{
  "arguments": {
    "lang": "en",
    "queries": "query"
  },
  "details": "Expecting value: line 1 column 1 (char 0)",
  "message": "invalid query",
  "status": "error"
}

This script and query are so simple and stripped down that I cannot see what the problem is. The construction of the query in your code is spread around enough that I can't really compare my query_string with the one your code generates. But based on the error message, the problem doesn't seem to be a malformed query. Rather, the API seems not to be seeing the query at all. I didn't see anything in the documentation or your code about required Content-Type headers or authorization. So I am mystified as to why it doesn't work.

I don't know how closely you monitor this issue tracker, but if you have any advice on what I'm doing wrong, I would greatly appreciate it. Thanks!

Steve Baskauf
Vanderbilt University Libraries
Nashville, Tennessee, USA
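
One likely explanation, offered as a sketch and not a confirmed diagnosis: passing the parsed dict as data makes the HTTP client form-encode a nested structure, which keeps only the inner keys. That would match the server echoing "queries": "query". The sketch assumes the endpoint follows the Reconciliation Service API convention of a form field named queries holding a JSON string of queries keyed by an arbitrary ID ("q0" is a made-up key). It reproduces the encoding with the standard library only and makes no network call:

```python
import json
from urllib.parse import urlencode

query = {
    "query": "Christel Hanewinckel",
    "type": "Q5",
    "limit": 5,
    "type_strict": "should",
}

# Form-encoding a nested dict iterates over its keys, so the query
# content is lost and only the key names are sent.
broken_body = urlencode([("queries", query)], doseq=True)
print(broken_body)

# Assumed fix: serialize the keyed batch to JSON first and send that
# string as the single "queries" form field.
payload = {"queries": json.dumps({"q0": query})}
print(payload["queries"])

# With requests, the call would then be along the lines of:
#   response = requests.post(reconciliation_endpoint, data=payload)
```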

Bug report: function crash and burn with "['features'] not found in axis"

Hey @jvfe,

I was trying to split and reconcile this SPARQL query result.

77-item dataframe with names. I run:

reconciled = reconcile(new_df["itemLabel"], type_id="Q16521")

It splits in 7 chunks and runs. I get this when one of the frames fails:
KeyError: "['features'] not found in axis"

A quick fix would be to add a try-except block in:

        input_keys, response = return_reconciled_raw(
            column,
            type_id,
            has_property,
            reconciliation_endpoint,
        )
        parsed = parse_raw_results(input_keys, response)

        dfs.append(parsed.drop(["features"], axis=1))

More details, see this notebook

Idea: Replicate error with small example and add as test
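
As an alternative to a try-except block, pandas can be told to skip missing labels when dropping. A minimal sketch, assuming the failure comes from a chunk whose parsed result has no features column; the two small frames below are made-up placeholders, not real reconciliation output:

```python
import pandas as pd

# Placeholder parsed chunks: the second one lacks the "features" column.
chunks = [
    pd.DataFrame({"id": ["example-1"], "score": [100], "features": [[]]}),
    pd.DataFrame({"id": ["example-2"], "score": [90]}),
]

# errors="ignore" makes drop() skip labels that are absent instead of
# raising KeyError.
dfs = [chunk.drop(columns=["features"], errors="ignore") for chunk in chunks]
combined = pd.concat(dfs, ignore_index=True)
print(list(combined.columns))
```

This only hides the symptom. It does not explain why a chunk came back without that column, so a small reproducing test would still be worthwhile.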

Support properties by mapping values from additional columns in dataframe

The reconciliation API supports optional properties which can be used to supply contextual information: https://reconciliation-api.github.io/specs/latest/#structure-of-a-reconciliation-query

Could the reconciler package accept a mapping between property names and additional columns in the dataframe, which would then be used to build the properties in the query object? This is particularly an issue for non-Wikidata reconciliation services (as implemented in issue #4).

Example using reconciliation service: http://data1.kew.org/reconciliation/reconcile/IpniName

Dataframe:

id  scientific_name  genus    species  author
1   Quercus robur    Quercus  robur    L

Calling reconcile using just the value in the scientific_name column would return 3 potential matches. If we were able to add contextual information about the author via properties, we would return a single match. So when reconciling the column scientific_name, we'd like to add extra data from the columns genus, species and author as properties epithet_1, epithet_2 and publishing_author:

Query object:

{
    "query": "Quercus robur",
    "properties": [
        {"p": "epithet_1", "pid": "epithet_1", "v": "Quercus"},
        {"p": "epithet_2", "pid": "epithet_2", "v": "robur"},
        {"p": "publishing_author", "pid": "publishing_author", "v": "L."}
    ]
}

Could be implemented by mapping dataframe column name (source of property value) to property id and name:

col_prop_mapper = {
    "genus": {"id": "epithet_1", "name": "epithet_1"},
    "species": {"id": "epithet_2", "name": "epithet_2"},
    "author": {"id": "publishing_author", "name": "publishing_author"},
}

... and supplying this mapping as an optional parameter to the reconcile function.
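
The mapping step could be sketched as below. This only illustrates the idea in the issue and is not how reconciler implements it. The function name row_to_query is hypothetical, and rows are plain dicts so the example needs nothing beyond the standard library:

```python
import json

# Mapping from the issue: dataframe column -> property id and name.
col_prop_mapper = {
    "genus": {"id": "epithet_1", "name": "epithet_1"},
    "species": {"id": "epithet_2", "name": "epithet_2"},
    "author": {"id": "publishing_author", "name": "publishing_author"},
}


def row_to_query(row, query_column, mapper):
    """Turn one row (a dict of column -> value) into a query object."""
    properties = [
        {"p": prop["name"], "pid": prop["id"], "v": row[column]}
        for column, prop in mapper.items()
        if row.get(column) not in (None, "")  # skip empty cells
    ]
    return {"query": row[query_column], "properties": properties}


row = {
    "id": 1,
    "scientific_name": "Quercus robur",
    "genus": "Quercus",
    "species": "robur",
    "author": "L.",
}
query = row_to_query(row, "scientific_name", col_prop_mapper)
print(json.dumps(query, indent=2))
```

With a DataFrame, the rows could be obtained with df.to_dict("records") and passed through the same function one at a time.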

Make testing suite faster and more self-contained

Currently, the tests are slow to run and could be improved by making them more self-contained and less reliant on external data services, so that they still pass when the reconciliation services are under heavy load.

For this, csv-reconcile seems like a promising solution, since it would allow us to spin up our own reconciliation service inside a pytest fixture.

Idea by @gitonthescene on #12

Add progress bar for reconciliation

Currently, the reconciliation runs without any indication of progress, unlike in OpenRefine. It would be nice to give the user some feedback on how the process is going.

This would probably require some major refactoring of the post request I use to reconcile, perhaps uploading in chunks and updating a progress bar with tqdm, something that is beyond my skills at the present moment.
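
The chunked approach could look roughly like this. It is a sketch under stated assumptions: reconcile_chunk stands in for whatever function sends one batch to the service, and both helper names are hypothetical. Progress is reported through a plain callback so the example runs without extra dependencies:

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def reconcile_in_chunks(items, size, reconcile_chunk, report=print):
    """Process items chunk by chunk, reporting progress after each chunk."""
    results = []
    total = len(items)
    for chunk in chunked(items, size):
        results.extend(reconcile_chunk(chunk))
        report(f"{len(results)}/{total} reconciled")
    return results


# Stand-in for the real request: it just upper-cases each label.
labels = ["Rio de Janeiro", "São Paulo", "Natal"]
messages = []
out = reconcile_in_chunks(
    labels, 2, lambda chunk: [label.upper() for label in chunk], messages.append
)
print(messages)
```

If tqdm is available, wrapping the chunk iterable in tqdm(...) would give a progress bar in place of the callback.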
