jvfe / reconciler

Python package to reconcile DataFrames

Home Page: https://jvfe.github.io/reconciler/

License: BSD 2-Clause "Simplified" License

Python 91.95% Makefile 8.05%

wikidata linked-data open-data reconciliation-service pandas python dataframe

reconciler's Introduction

reconciler

reconciler is a Python package to reconcile tabular data with various reconciliation services, such as Wikidata. It works similarly to OpenRefine, but entirely within Python, using pandas.

Quickstart

You can install the latest version of reconciler from PyPI with:

pip install reconciler

Then to use it:

from reconciler import reconcile
import pandas as pd

# A DataFrame with a column you want to reconcile.
test_df = pd.DataFrame(
    {
        "City": ["Rio de Janeiro", "São Paulo", "São Paulo", "Natal"],
        "Country": ["Q155", "Q155", "Q155", "Q155"]
    }
)

# Reconcile against type city (Q515), getting the best match for each item.
reconciled = reconcile(test_df["City"], type_id="Q515")

The resulting dataframe would look like this:

id	match	name	score	type	type_id	input_value
Q8678	True	Rio de Janeiro	100	city	Q515	Rio de Janeiro
Q174	True	São Paulo	100	city	Q515	São Paulo
Q131620	True	Natal	100	municipality of Brazil	Q3184121	Natal

In case you want to ensure the results are cities from Brazil, you can specify the property_mapping argument with a specific property-value pair:

# Reconcile against type city (Q515), keeping only items whose country (P17) property equals Brazil (Q155)
reconciled = reconcile(test_df["City"], type_id="Q515", property_mapping={"P17": test_df["Country"]})

Options

The reconcile() function accepts several options.

type_id - The type of items to reconcile against per the API specification.
top_res - Either the number of results to return per entry or the string 'all' to return all results.
property_mapping - A dictionary mapping property IDs to the column (Series) of values used to filter results, per the API specification.
reconciliation_endpoint - The reconciliation service to connect to. Defaults to https://wikidata.reconci.link/en/api.

Other very useful packages

Although my opinion may be biased, I think reconciler is a pretty nice package. But the thing is, it probably won't fulfill all your Wikidata-related needs. Here are other packages that could help with that:

WikidataIntegrator has a lot of very nice low-level functions for dealing with various Wikidata-related activities, such as item acquisition and programmatic editing.
wikidata2df is a very simple utility package for quickly and easily turning Wikidata SPARQL queries into pandas DataFrames.

reconciler's Issues

Add option to reconcile against specific triples

Currently, users can only reconcile against a specific type, without taking the properties of the items into account. This is especially important for items such as gene names: you could, for example, reconcile against type gene (Q7187) with found in taxon (P703) Homo sapiens (Q15978631).

The API already supports these triples, so it would just be a matter of implementing them in the code.

Build documentation using mkdocs and mkdocstrings

The tools are already here and they're really easy to use. I just have to restructure the markdown files into something more sensible.

Want to work with open data?

Hey João,

I was taking a look at your packages and they are really interesting.

I have another initiative, Base dos Dados, https://github.com/basedosdados/mais, which aims to organize public data into one large database.

We are looking for someone to intern as a data engineer, and I think it could be interesting for you:

https://www.linkedin.com/feed/update/urn:li:activity:6736712117942009856/

Expose the data extension service

Expose the data extension service through reconciler.

Spec

Idea by @gitonthescene on #11

Possibility to specify multiple type_id (instance of) for an entity to reconcile

I wonder if it is possible to specify multiple type_id values for an entity.

For example, a case study would be: I would like to search for entities which are either city (Q6256) or country (Q).
Then, an autosuggest service could show only (or prioritize) the entities of these types over other types.

If it is not possible, is there any plan to implement it?
I think this functionality would be nice.

Many thanks in advance!

Is it possible to reconcile without a type id?

I wanted to run a reconciliation regardless of type ID (i.e. any term on Wikidata).
Do you know if that is possible?

Avoid name conflict with module reconcile and function reconcile

Hi, I tried monkey patching my fix in #12 onto a pip-installed version of reconciler, but it was harder than I expected. I was able to work around the trouble, but it might be worth considering renaming the reconcile module.

My first issue was that I was trying to modify webutils but because reconcile, which uses webutils, is imported from __init__.py, it’s too late to change once the module is loaded. My fix was to replace this with a blank __init__.py.

The next issue is that this line replaces the reconcile module with the reconcile function. My monkey patch, which sets reconciler.reconcile = reconciler.reconcile.reconcile, has trouble when called twice (when running a Jupyter notebook, for instance).

Might it be a good idea to name the reconcile module something like reconcileImpl so that the line doesn’t clobber the module with the function? Then this line could be called more than once. Also the module is visible for debugging.
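The clobbering described above can be reproduced outside reconciler. The following is a minimal sketch, assuming a throwaway package named demo_pkg (a hypothetical name, created in a temporary directory) whose __init__.py does `from .reconcile import reconcile`. It also shows that the submodule stays reachable through sys.modules, which is one way to get at the module repeatedly without renaming anything.

```python
import importlib
import sys
import tempfile
import types
from pathlib import Path

# Build a throwaway package (hypothetical name: demo_pkg) that mirrors the
# layout described in the issue: a submodule and a function sharing one name.
package_root = Path(tempfile.mkdtemp())
package_dir = package_root / "demo_pkg"
package_dir.mkdir()
(package_dir / "reconcile.py").write_text(
    "def reconcile():\n    return 'reconciled'\n"
)
(package_dir / "__init__.py").write_text("from .reconcile import reconcile\n")

sys.path.insert(0, str(package_root))
importlib.invalidate_caches()
import demo_pkg

# The package attribute now points at the function, not the submodule.
print(type(demo_pkg.reconcile).__name__)  # function

# The submodule itself is still registered under its dotted name, so this
# lookup is safe to repeat (for example, when re-running a notebook cell).
submodule = sys.modules["demo_pkg.reconcile"]
print(type(submodule).__name__)  # module
```

Because the lookup goes through sys.modules rather than through the package attribute, running it twice gives the same result, which sidesteps the "called twice" problem without changing the package.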

Make reconciliation endpoint configurable

Thanks a lot for this wonderful project!

The reconciliation API is supported by many more data sources than Wikidata. Here is a list of public reconciliation endpoints:
https://reconciliation-api.github.io/testbench/

It would be great to make the endpoint configurable, so that the library can be used for other data sources.

Invalid query error when using simplified version of your code

Hi @jvfe. Thanks so much for posting this. I had been searching all over the web for some examples and didn't find the OpenRefine or Reconciliation Service API W3C Community Report particularly helpful for getting the wikidata.reconci.link API to work in Python. Using your code here and the Reconciliation Service example for a reconciliation query, I assembled this minimal Python script:

import requests
import json

http = requests.Session()

reconciliation_endpoint = 'https://wikidata.reconci.link/en/api'

query_string = '''{
    "queries": [
      {
        "query": "Christel Hanewinckel",
        "type": "Q5",
        "limit": 5,
        "type_strict": "should"
      }
    ]
}'''

response = http.post(reconciliation_endpoint, data=json.loads(query_string))
print(json.dumps(response.json(), indent=2))

I had been unable to get anything but a 404 from the API when I tried making the call using Postman. Using this script I did get an error message from the server:

{
  "arguments": {
    "lang": "en",
    "queries": "query"
  },
  "details": "Expecting value: line 1 column 1 (char 0)",
  "message": "invalid query",
  "status": "error"
}

This script and query are so incredibly simple and stripped down that I cannot see what the problem is. The actual construction of the query in your code is spread around enough that I can't really compare my query_string with one your code generates. But based on the error message, the problem doesn't seem to be a malformed query. Rather, the API seems not to be seeing the query at all. I didn't see anything in the documentation or your code about any particular required Content-Type headers or authorization. So I am mystified as to why it doesn't work.

I don't know how closely you monitor this issue tracker, but if you have any advice on what I'm doing wrong, I would greatly appreciate it. Thanks!

Steve Baskauf
Vanderbilt University Libraries
Nashville, Tennessee, USA
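One likely explanation, offered as a sketch rather than a confirmed diagnosis: the script passes the parsed dictionary as data, so requests form-encodes the nested list and the server receives only the key names (hence "queries": "query" in the error). As far as I understand the Reconciliation Service API, the service expects a form field named queries whose value is a JSON string, holding an object keyed by arbitrary query IDs. The helper names below (build_form_payload, post_queries) and the q0-style keys are illustrative, not part of reconciler.

```python
import json

RECONCILIATION_ENDPOINT = "https://wikidata.reconci.link/en/api"


def build_form_payload(names, type_id=None, limit=5):
    """Build the form data for a batch of reconciliation queries.

    The batch is a JSON object keyed by arbitrary IDs (q0, q1, ...),
    serialized to a string and sent as the single form field "queries".
    """
    queries = {}
    for index, name in enumerate(names):
        query = {"query": name, "limit": limit}
        if type_id is not None:
            query["type"] = type_id
        queries[f"q{index}"] = query
    return {"queries": json.dumps(queries)}


def post_queries(payload, endpoint=RECONCILIATION_ENDPOINT):
    """Send the payload as form data; requires requests and network access."""
    import requests  # imported lazily so the payload builder works offline

    response = requests.post(endpoint, data=payload, timeout=30)
    response.raise_for_status()
    return response.json()


payload = build_form_payload(["Christel Hanewinckel"], type_id="Q5")
print(payload["queries"])
# results = post_queries(payload)  # uncomment when network access is available
```

The key difference from the script above is that the value of queries is a string produced by json.dumps, not a Python list or dict.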

Examples not using default reconciliation service

Hi there, thanks for this project! Might it be useful for the README to have at least one example using a reconciliation service other than the default one?

Also, I think exposing the data extension service through this API might also be useful.

Bug report: function crash and burn with "['features'] not found in axis"

Hey @jvfe,

I was trying to split and reconcile this SPARQL query result.

The result is a 77-item DataFrame of names. I ran:

reconciled = reconcile(new_df["itemLabel"], type_id="Q16521")

It splits the input into 7 chunks and runs, but one of the chunks fails with:

KeyError: "['features'] not found in axis"

A quick fix would be to add a try-except block in:

        input_keys, response = return_reconciled_raw(
            column,
            type_id,
            has_property,
            reconciliation_endpoint,
        )
        parsed = parse_raw_results(input_keys, response)

        dfs.append(parsed.drop(["features"], axis=1))

For more details, see this notebook.

Idea: Replicate error with small example and add as test
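As a small replication, here is a sketch that assumes the failing chunk is one whose parsed results simply lack a features column (this cause is a guess, not confirmed by the traceback). Instead of a try-except block, it uses the errors="ignore" option of pandas' DataFrame.drop. The drop_features helper and the toy frames are hypothetical.

```python
import pandas as pd


def drop_features(parsed):
    """Drop the 'features' column if present; leave the frame alone otherwise."""
    return parsed.drop(columns=["features"], errors="ignore")


# Two toy "parsed" chunks: one with the column, one without it.
with_features = pd.DataFrame({"id": ["Q1"], "features": [[]]})
without_features = pd.DataFrame({"id": ["Q2"]})

# A plain .drop(["features"], axis=1) on the second frame raises the KeyError
# from the report; with errors="ignore" both chunks pass through.
cleaned = [drop_features(df) for df in (with_features, without_features)]
combined = pd.concat(cleaned, ignore_index=True)
print(list(combined.columns))
```

The same two toy frames could serve as the basis for the regression test suggested above.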

Support properties by mapping values from additional columns in dataframe

The reconciliation API supports optional properties which can be used to supply contextual information: https://reconciliation-api.github.io/specs/latest/#structure-of-a-reconciliation-query

Could the reconciler package accept a mapping between property names and additional columns in the dataframe, which would then be used to build the properties in the query object? This is particularly relevant for non-Wikidata reconciliation services (as implemented in issue #4).

Example using reconciliation service: http://data1.kew.org/reconciliation/reconcile/IpniName

Dataframe:

id	scientific_name	genus	species	author
1	Quercus robur	Quercus	robur	L

Calling reconcile using just the value in the scientific_name column would return 3 potential matches. If we were able to add contextual information about the author via properties, we would return a single match. So when reconciling the column scientific_name, we'd like to add extra data from the columns genus, species and author as properties epithet_1, epithet_2 and publishing_author:

Query object:

{"query":"Quercus robur"
,"properties":[{"p":"epithet_1","pid":"epithet_1","v":"Quercus"}
,{"p":"epithet_2","pid":"epithet_2","v":"robur"}
,{"p":"publishing_author","pid":"publishing_author","v":"L."}]
}

Could be implemented by mapping dataframe column name (source of property value) to property id and name:

col_prop_mapper = {
    "genus": {"id": "epithet_1", "name": "epithet_1"},
    "species": {"id": "epithet_2", "name": "epithet_2"},
    "author": {"id": "publishing_author", "name": "publishing_author"},
}

... and supplying this mapping as an optional parameter to the reconcile function.
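The proposal above can be sketched in a few lines. This is only an illustration of the idea, not reconciler's implementation: build_query is a hypothetical helper, and rows are plain dictionaries (with pandas, df.to_dict("records") produces rows in this shape).

```python
import json

# Mapping from dataframe column name to property id and name, as proposed above.
COL_PROP_MAPPER = {
    "genus": {"id": "epithet_1", "name": "epithet_1"},
    "species": {"id": "epithet_2", "name": "epithet_2"},
    "author": {"id": "publishing_author", "name": "publishing_author"},
}


def build_query(row, query_column, col_prop_mapper):
    """Build one query object, adding a property for each mapped, non-empty column."""
    properties = [
        {"p": prop["name"], "pid": prop["id"], "v": row[column]}
        for column, prop in col_prop_mapper.items()
        if row.get(column) not in (None, "")
    ]
    query = {"query": row[query_column]}
    if properties:
        query["properties"] = properties
    return query


row = {
    "id": 1,
    "scientific_name": "Quercus robur",
    "genus": "Quercus",
    "species": "robur",
    "author": "L.",
}
query = build_query(row, "scientific_name", COL_PROP_MAPPER)
print(json.dumps(query, indent=2))
```

Columns that are missing or empty in a row are skipped, so a partially filled row still yields a valid query with fewer properties.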

Make testing suite faster and more self-contained

Currently, the tests are slow to run and could be improved by making them more self-contained and less reliant on external data services, to account for situations such as the reconciliation service being under heavy load.

For this, csv-reconcile seems like a promising solution, since it would allow us to spin up our own reconciliation service inside a pytest fixture.

Idea by @gitonthescene on #12
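To illustrate what "self-contained" could look like, here is a sketch of a tiny stub reconciliation service built only on the standard library. Note that this is a different technique from csv-reconcile: it is a hand-rolled fake that echoes each query back as a perfect match, and it assumes the service receives a form field named queries containing a JSON object keyed by query ID. All names (StubReconcileHandler, start_stub_server, query_stub) are hypothetical.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlencode
from urllib.request import urlopen


class StubReconcileHandler(BaseHTTPRequestHandler):
    """Answers every query with a single fake candidate that echoes the input."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        form = parse_qs(self.rfile.read(length).decode("utf-8"))
        queries = json.loads(form["queries"][0])
        body = {
            key: {"result": [{"id": "stub-1", "name": q["query"],
                              "score": 100, "match": True}]}
            for key, q in queries.items()
        }
        payload = json.dumps(body).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep test output quiet


def start_stub_server():
    """Start the stub on a free local port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), StubReconcileHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def query_stub(server, names):
    """POST a batch of names to the stub and return the decoded JSON response."""
    url = f"http://127.0.0.1:{server.server_port}/"
    queries = {f"q{i}": {"query": name} for i, name in enumerate(names)}
    data = urlencode({"queries": json.dumps(queries)}).encode("utf-8")
    with urlopen(url, data=data) as response:
        return json.loads(response.read().decode("utf-8"))


server = start_stub_server()
try:
    result = query_stub(server, ["Natal", "Rio de Janeiro"])
finally:
    server.shutdown()
    server.server_close()
print(sorted(result))
```

In a pytest suite, start_stub_server and the shutdown calls would naturally live in a fixture, with the stub's URL passed as the reconciliation_endpoint.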

Add progress bar for reconciliation

Currently, reconciliation runs without any indication of progress, unlike in OpenRefine. It would be nice to give the user some feedback on how the process is going.

This would probably require some major refactoring of the POST request I use to reconcile, perhaps uploading in chunks and updating a progress bar with tqdm, which is beyond my skills at the moment.
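The chunk-and-report idea can be sketched without committing to a particular progress library. In the example below, reconcile_chunk stands in for whatever function posts one chunk to the service, and on_progress is a callback that a tqdm bar (or a simple print) could be driven from. Both names, and the helper functions themselves, are assumptions for illustration.

```python
def iter_chunks(values, chunk_size):
    """Yield consecutive slices of at most chunk_size items."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def reconcile_in_chunks(values, chunk_size, reconcile_chunk, on_progress=None):
    """Process values chunk by chunk, reporting progress after each chunk."""
    results = []
    done = 0
    total = len(values)
    for chunk in iter_chunks(values, chunk_size):
        results.extend(reconcile_chunk(chunk))
        done += len(chunk)
        if on_progress is not None:
            on_progress(done, total)
    return results


# Stand-in for the real request: it just upper-cases each value.
def fake_reconcile_chunk(chunk):
    return [value.upper() for value in chunk]


progress_log = []
names = ["a", "b", "c", "d", "e", "f", "g"]
output = reconcile_in_chunks(
    names,
    chunk_size=3,
    reconcile_chunk=fake_reconcile_chunk,
    on_progress=lambda done, total: progress_log.append((done, total)),
)
print(progress_log)
```

Keeping the progress reporting behind a callback means the chunking logic stays testable without a terminal, and the choice of progress bar can be made later.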