opensanctions / nomenklatura

Framework and command-line tools for integrating FollowTheMoney data streams from multiple sources

License: MIT License

Topics: data-integration, deduplication, record-link

nomenklatura's Introduction

nomenklatura

Nomenklatura de-duplicates and integrates different Follow the Money entities. It serves to clean up messy data and to find links between different datasets.

screenshot

Usage

You can install nomenklatura via PyPI:

$ pip install nomenklatura

Command-line usage

Much of the functionality of nomenklatura can be used as a command-line tool. In the following example, we'll assume that you have a file containing Follow the Money entities in your local directory, named entities.ijson. If you just want to try it out, you can use the file tests/fixtures/donations.ijson in this repository (it contains German campaign finance data).

With the file in place, you can cross-reference the entities to generate de-duplication candidates, run the interactive de-duplication UI in your console, and finally apply the judgements to generate a new file with merged entities:

# generate merge candidates using an in-memory index:
$ nomenklatura xref -r resolver.json entities.ijson
# note there is now a new file, `resolver.json` that contains de-duplication info.
$ nomenklatura dedupe -r resolver.json entities.ijson
# will pop up a user interface.
$ nomenklatura apply entities.ijson -o merged.ijson -r resolver.json
# de-duplicated data goes into `merged.ijson`:
$ cat entities.ijson | wc -l 
474
$ cat merged.ijson | wc -l 
468 

Programmatic usage

The command-line use of nomenklatura is targeted at small datasets which need to be de-duplicated. For more involved scenarios, the package also offers a Python API which can be used to control the semantics of de-duplication.

  • nomenklatura.Dataset - implements a basic dataset for describing a set of entities.
  • nomenklatura.Store - a general purpose access mechanism for entities. By default, a store is used to access entity data stored in files as an in-memory cache, but the store can be subclassed to work with entities from a database system.
  • nomenklatura.Index - a full-text in-memory search index for FtM entities. In the application, this is used to block de-duplication candidates, but the index can also be used to drive an API etc.
  • nomenklatura.Resolver - the core of the de-duplication process, the resolver is essentially a graph with edges made out of entity judgements. The resolver can be used to store judgements or get the canonical ID for a given entity.

All of the API classes have extensive type annotations, which should make it simpler to integrate them into any modern Python codebase.

Design

This package offers an implementation of an in-memory data deduplication framework centered around the FtM data model. The idea is the following workflow:

  • Accept FtM-shaped entities from a given loader (e.g. a JSON file, or a database)
  • Build an in-memory inverted index of the entities for dedupe blocking
  • Generate merge candidates using the blocking index and FtM compare
  • Provide a file-based storage format for merge challenges and decisions
  • Provide a text-based user interface to let users make merge decisions
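The blocking step in the workflow above can be illustrated with a minimal, self-contained sketch (an illustration of the idea only, not nomenklatura's actual index): build an inverted index from name tokens to entity IDs, and propose candidate pairs only among entities that share at least one token.

```python
from collections import defaultdict
from itertools import combinations

def tokenize(name):
    # crude normalisation; drops very short tokens like "AG"
    return {t for t in name.lower().split() if len(t) > 2}

def candidate_pairs(entities):
    """Block on shared name tokens: only entities that share at least
    one token are proposed as merge candidates."""
    index = defaultdict(set)  # token -> IDs of entities containing it
    for eid, name in entities.items():
        for token in tokenize(name):
            index[token].add(eid)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

entities = {
    "e1": "Solihull Care Trust",
    "e2": "Solihull Primary Care Trust",
    "e3": "Deutsche Bank AG",
}
# e1/e2 share several tokens; e3 shares none with either
assert candidate_pairs(entities) == {("e1", "e2")}
```

Blocking like this keeps the comparison step roughly linear in practice, since the expensive FtM compare only runs on pairs that share an index token.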

Later on, the following might be added:

  • A web application to let users make merge decisions on the web

Resolver graph

The key implementation detail of nomenklatura is the Resolver, a graph structure that manages user decisions regarding entity identity. Edges are Judgements of whether two entity IDs are the same, not the same, or undecided. The resolver implements an algorithm for computing connected components, which can then be used to find the best available ID for a cluster of entities. It can also be used to evaluate transitive judgements, e.g. if A <> B and B = C, then we don't need to ask whether A = C.
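The component and transitivity logic described above can be sketched with a plain union-find structure. This is an illustration of the idea only; the class and method names here are invented and do not reflect nomenklatura's actual Resolver API.

```python
class ResolverSketch:
    """Positive judgements merge clusters; negative judgements are
    remembered between clusters, so transitive questions need not be
    put to the user again."""

    def __init__(self):
        self.parent = {}      # union-find parent pointers
        self.negative = set() # pairs of IDs judged distinct

    def _find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def judge_same(self, a, b):
        self.parent[self._find(a)] = self._find(b)

    def judge_distinct(self, a, b):
        self.negative.add((a, b))

    def decided(self, a, b):
        """True if a judgement for (a, b) follows from earlier ones."""
        ra, rb = self._find(a), self._find(b)
        if ra == rb:
            return True
        for x, y in self.negative:
            if {self._find(x), self._find(y)} == {ra, rb}:
                return True
        return False

r = ResolverSketch()
r.judge_distinct("A", "B")  # A <> B
r.judge_same("B", "C")      # B = C
assert r.decided("A", "C")  # A <> C follows transitively; no need to ask
```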

Reading

Contact, contributions etc.

This codebase is licensed under the terms of an MIT license (see LICENSE).

We're keen for any contributions, bug fixes and feature suggestions; please use the GitHub issue tracker for this repository.

Nomenklatura is currently developed thanks to a Prototypefund grant for OpenSanctions. Previous iterations of the package were developed with support from Knight-Mozilla OpenNews and the Open Knowledge Foundation Labs.

nomenklatura's People

Contributors

crccheck · dependabot[bot] · jbothma · mihi-tr · pudo · robbi5 · rufuspollock · simonwoerpel · vikt0rs


nomenklatura's Issues

Know when the dataset's data has changed

It would be nice to ask the Nomen API "has your dataset changed since last time I asked?"

When building an ETL pipeline, you optimise by only re-running steps whose inputs have changed. For a reconciliation step against Nomen, those inputs are the data you are reconciling and the Nomen data itself. The Nomen data often doesn't change between runs, so the step could frequently be skipped.

Currently I think the only way to tell if the Nomen data has changed is to call the API for the aliases:

curl http://nomenklatura.okfnlabs.org/uk25k-column-names/aliases -H 'Accept: application/json'

and then compute a hash of the response. This seems like a waste of server processing and bandwidth. Could you compute the hash of the Nomen data directly on the server and return it with the API call?
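For reference, the client-side fingerprint described in this issue is best computed over the decoded JSON rather than the raw bytes, so that irrelevant differences such as key ordering don't register as a change. A generic sketch, not part of the Nomen API:

```python
import hashlib
import json

def content_fingerprint(data):
    """Stable hash of a JSON-serialisable payload: serialise with
    sorted keys and fixed separators so semantically equal payloads
    always hash identically."""
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# key order differs, content is the same -> identical fingerprint
v1 = [{"name": "Dept of Health", "canonical": "Department of Health"}]
v2 = [{"canonical": "Department of Health", "name": "Dept of Health"}]
assert content_fingerprint(v1) == content_fingerprint(v2)
```

A pipeline would store the fingerprint alongside its outputs and re-run the reconciliation step only when a freshly fetched payload hashes differently.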

Restore upload functionality

The NK2 version has no working upload support, but restoring it should yield a much simpler and more user-friendly workflow with the new data model.

Data enrichment pipeline

We want to pull in extra details and graph links for existing targets in the OpenSanctions database from related-topic databases like OpenCorporates, Wikidata (cf. opensanctions/opensanctions#1), ICIJ OffshoreLeaks, GLEIF and the UK PSC registry. Preference here should be given to public databases that mint stable identifiers. We also want to explore how we can use BODS data and the datasets listed on the OpenOwnership registry.

In order to do this, we will introduce a new type of dataset, the Enricher. At the moment, we have a) data sources that entities are imported from, b) collections that bundle them up into sets that a data user might want to use. Enrichers would be different in that they don't originate target entities, but only provide additional details for the existing targets.

The process in general would be:

  1. All the targets (or entities) from an existing dataset are sent through a matching mechanism provided by the enricher. That matching mechanism will return a list of possible matching entities with a score. These matches must be kept in some sort of short-term storage.
  2. A user is shown the dedupe interface for each possible match in descending order of their score. If they decline the match, the entity is block-listed. For matches and undecided, the information is recorded into a new database table.
  3. When the enricher is next run, the list of decided matches is used to pull in the full data for each matching entity and their immediate graph environment. That data is then pushed to the main OpenSanctions statement database and will be included in future data exports.

While some of the Enrichers should be API-based (e.g. OpenCorporates), in other cases it might make most sense to just convert the whole source dataset into FtM format and then run the enrichment process against a local ElasticSearch index that contains the full dataset (e.g. Wikidata, PSC, ICIJ). For this purpose, we can move the ES indexing code from the API to the main codebase.
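The enricher concept could be sketched as an abstract interface along these lines (hypothetical; the class and method names are illustrative and not an existing nomenklatura API):

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, List, Tuple

Entity = Dict[str, Any]  # stand-in for an FtM entity


class Enricher(ABC):
    """A dataset that originates no targets of its own, but provides
    additional detail for existing targets."""

    @abstractmethod
    def match(self, entity: Entity) -> List[Tuple[Entity, float]]:
        """Return candidate matches, each with a score for review."""

    @abstractmethod
    def expand(self, match: Entity) -> Iterable[Entity]:
        """Fetch the full record plus its immediate graph environment."""


class ExactNameEnricher(Enricher):
    """Toy implementation: matches on exact name equality only."""

    def __init__(self, records: List[Entity]):
        self.records = records

    def match(self, entity: Entity) -> List[Tuple[Entity, float]]:
        return [(r, 1.0) for r in self.records
                if r.get("name") == entity.get("name")]

    def expand(self, match: Entity) -> Iterable[Entity]:
        yield match


enricher = ExactNameEnricher([{"name": "Siemens AG", "jurisdiction": "de"}])
matches = enricher.match({"name": "Siemens AG"})
assert matches and matches[0][1] == 1.0
```

A real enricher would implement `match` against a remote API or a local ElasticSearch index, and `expand` would walk the source's graph links once a match is confirmed.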

Replace messytables with CSVKit

Messytables has become quite chaotic and we don't really need a lot of guessing, so we could use csvkit, which has a clear API and support for a variety of formats.

Re-design entity view page.

This should contain:

  • List of aliases
  • List of attributes
  • Boxes for invalid/alias etc.
  • Link to review function
  • Delete button
  • Edit function for name

Delete a dataset

It's likely that when trialling nomenklatura, new users (such as myself) will generate one or more test datasets to get a feel for the service and how it works.

It could be useful to give users an easy way of deleting their own datasets, also freeing up the namespace they used to inhabit so that it can be reused?

There is a danger that someone might delete a dataset that someone else is reusing, which could be mitigated by allowing users to clone other peoples' datasets? The datasets may then diverge of course, as each user updates their own copy.

Couldn't write to canonical property

I want to make SCT an alias of SPCT using the Python client, and it doesn't seem to be writing the canonical property. I'm not clear whether the problem is in the client or the API.

>>> import nomenklatura
>>> opennames = nomenklatura.Dataset('public-bodies-uk')
>>> entity = opennames.entity_by_name('Solihull Care Trust')
>>> sct = opennames.entity_by_name('Solihull Care Trust')
>>> sct.canonical
>>> spct = opennames.entity_by_name('Solihull Primary Care Trust')
>>> spct.canonical
>>> sct.__data__['canonical'] = spct.__data__
>>> sct
<Entity(public-bodies-uk:28555:Solihull Care Trust:Solihull Primary Care Trust)>
>>> print sct.__data__
{u'name': u'Solihull Care Trust', u'num_aliases': 1, u'creator': {u'updated_at': u'2014-06-02T10:14:59.913031', u'created_at': u'2014-06-02T10:14:59.913020', u'login': u'davidread', u'github_id': 307612, u'id': 37}, u'created_at': u'2014-06-02T10:33:13.424085', u'updated_at': u'2014-07-22T15:14:41.926716', u'invalid': False, u'dataset': u'public-bodies-uk', u'attributes': {u'dgu-name': u'solihull-nhs-care-trust', u'dgu-uri': u'http://data.gov.uk/publisher/solihull-nhs-care-trust'}, u'reviewed': True, u'id': 28555, u'canonical': {u'name': u'Solihull Primary Care Trust', u'num_aliases': 0, u'creator': {u'updated_at': u'2014-06-02T10:14:59.913031', u'created_at': u'2014-06-02T10:14:59.913020', u'login': u'davidread', u'github_id': 307612, u'id': 37}, u'created_at': u'2014-06-02T11:02:47.969079', u'updated_at': u'2014-07-22T16:59:04.201602', u'invalid': False, u'dataset': u'public-bodies-uk', u'attributes': {}, u'reviewed': True, u'id': 29838, u'canonical': None}}
>>> sct.update()
>>> sct.canonical
<Entity(public-bodies-uk:29838:Solihull Primary Care Trust:None)>
>>> sct2 = opennames.entity_by_name('Solihull Care Trust')
>>> sct2
<Entity(public-bodies-uk:28555:Solihull Care Trust:None)>
>>> sct2.canonical

The POST data has canonical set to the full dict of the other entity:

{"name": "Solihull Care Trust",
"num_aliases": 1,
"creator": {"updated_at": "2014-06-02T10:14:59.913031", "created_at": "2014-06-02T10:14:59.913020", "login": "davidread", "github_id": 307612, "id": 37}, 
"created_at": "2014-06-02T10:33:13.424085", 
"updated_at": "2014-07-22T15:14:41.926716", 
"invalid": false, 
"dataset": "public-bodies-uk", 
"attributes": {"dgu-name": "solihull-nhs-care-trust", "dgu-uri": "http://data.gov.uk/publisher/solihull-nhs-care-trust"}, 
"reviewed": true, 
"id": 28555, 
"canonical": {"name": "Solihull Primary Care Trust", "num_aliases": 0, "creator": {"updated_at": "2014-06-02T10:14:59.913031", "created_at": "2014-06-02T10:14:59.913020", "login": "davidread", "github_id": 307612, "id": 37}, "created_at": "2014-06-02T11:02:47.969079", "updated_at": "2014-07-22T16:59:04.201602", "invalid": false, "dataset": "public-bodies-uk", "attributes": {}, "reviewed": true, "id": 29838, "canonical": null}
}

Any idea what's wrong?

nomenklatura installation

Hello,
It seems that the hosted version of nomenklatura (http://nomenklatura.okfnlabs.org/) is not working any more. I also tried to upload a CSV file with 238 lines and two columns, and all I get is a 502 Bad Gateway error.

So I tried to install nomenklatura on my computer: I downloaded the ZIP file https://github.com/pudo/nomenklatura/archive/master.zip, created a Python virtual environment with Miniconda and ran python setup.py install. But I don't really know what to do next to launch nomenklatura. Could you help?

I am also wondering whether the files you provide on GitHub are for installing a replica of http://nomenklatura.okfnlabs.org/?

Thanks,

Create and edit entities

@arc64 has started a branch at feature-edit-entity to edit and create entities. Here's a micro-todo:

  • Make it a directive or sub-view (directive for simplicity)
  • Make the "edit value" aspect a directive
  • Allow creating entities (select type first, then fill in fields)
  • Inline editing of link types (e.g. Post on a Person and Company)

Internal Server Error on Import

I've not been able to upload my names. I've got a CSV with 2 columns (UniqueID & Name) and ~110,000 rows. I've tried comma-, tab-, and Excel-formatted versions. Is it just too many entities?

Login with GitHub is broken

As the title might indicate, if I click on the Login with GitHub button I don't see any request going to GH, and I'm returned to the same page. The console has some errors; I'm not sure how related they are, but they might help identify a potential issue (attached).
hc5

I can open a new issue for the console errors, as they might well be unrelated. I haven't looked at the sources to follow the failing request, so I can't assume anything just yet.

Edit: This issue is based on the version deployed here.

Names vs id, eg in OpenRefine

Reconciliation in OpenRefine is often used to match name elements to an abstract id. In nomenklatura, the intention seems to be to reconcile one name against another using fuzzy matching, but there doesn't seem to be explicit support for matching to an id?

I guess an id could also be a name, and then aliases represent possible variants of that id, but then how would an OpenRefine user know what the canonical name was if the matched name was actually a UID?

UIDs are presumably created in nomenklatura datasets to allow for mapping between names and aliases, so I guess the question is: should users be allowed to submit their own UIDs or UID/canonical-name pairs? (That was the assumption I'd made when I originally heard about nomenklatura.)

[Super] Bring ODIS changes into Nomenklatura

After working on the ODIS/nomenklatura2 code base for a bit, I've decided that there is little reason for a non-progressive development and hope to backport changes into this code base.

Issue with Importing CSV

I am trying to import a CSV, saved on my end as comma-delimited and encoded as UTF-8. I enter it and hit import, and I see the percentage count up, but nothing happens in the end, and I can never access my files to begin processing. Any help?

Implement clustering function

A clustering function would attempt to guess a set of clusters (i.e. multiple identical entities) and then propose merging them to the user.

This can be based on the dedupe Python library.
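As a stand-in for the dedupe library, the idea can be sketched with a naive greedy clustering over pairwise string similarity (illustrative only; the real library learns a distance model from training pairs rather than using a fixed threshold):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def propose_clusters(names, threshold=0.85):
    """Greedy single-link clustering: each name joins the first cluster
    in which it resembles any existing member above the threshold,
    otherwise it starts a new cluster."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if any(similarity(name, m) >= threshold for m in cluster):
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

names = ["Deutsche Bank", "Deutsche Bank AG", "Siemens AG"]
clusters = propose_clusters(names)
assert clusters == [["Deutsche Bank", "Deutsche Bank AG"], ["Siemens AG"]]
```

Each multi-member cluster would then be proposed to the user as a merge candidate set, rather than being merged automatically.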

What is it good for?

Pudo, I don't (yet) get the sense behind this service. So users can store canonical names of entities and their aliases in the cloud. What advantage does this provide over storage in a local database, for example?

I suppose (but I am not sure, because it's not documented) that if I want to reconcile an entity (e.g. I have many aliases, but they're not yet mapped to an entity), I can use the already existing datasets of ALL users to map my own aliases (because somebody has already done that?). If yes, maybe you should add some documentation. But maybe I'm just too stupid :-)


Internal Server Error

Not sure whether to report problems like this here or somewhere site-specific for the OKFN-hosted version, but when I attempted to import a TSV spreadsheet, I got an internal server error message with no traceback or other error information.

The upload URL after this was http://nomenklatura.okfnlabs.org/test-tom/upload/7e0a7fb56f56cfddbed0d9a68e6b0907929553cf

I can make the file available, but it was basically just a de-duplicated version of the Nomenklatura openinterests-entities dataset that I downloaded and cleaned with OpenRefine.
