opensanctions / nomenklatura

Framework and command-line tools for integrating FollowTheMoney data streams from multiple sources

License: MIT License

Topics: data-integration, deduplication, record-link

nomenklatura's Introduction

nomenklatura

Nomenklatura de-duplicates and integrates different Follow the Money entities. It serves to clean up messy data and to find links between different datasets.

screenshot

Usage

You can install nomenklatura via PyPI:

$ pip install nomenklatura

Command-line usage

Much of the functionality of nomenklatura can be used as a command-line tool. In the following example, we'll assume that you have a file containing Follow the Money entities in your local directory, named entities.ijson. If you just want to try it out, you can use the file tests/fixtures/donations.ijson in this repository (it contains German campaign finance data).

With the file in place, you can cross-reference the entities to generate de-duplication candidates, run the interactive de-duplication UI in your console, and finally apply the judgements to generate a new file with merged entities:

# generate merge candidates using an in-memory index:
$ nomenklatura xref -r resolver.json entities.ijson
# note there is now a new file, `resolver.json` that contains de-duplication info.
$ nomenklatura dedupe -r resolver.json entities.ijson
# will pop up a user interface.
$ nomenklatura apply entities.ijson -o merged.ijson -r resolver.json
# de-duplicated data goes into `merged.ijson`:
$ cat entities.ijson | wc -l 
474
$ cat merged.ijson | wc -l 
468 

Programmatic usage

The command-line use of nomenklatura is targeted at small datasets which need to be de-duplicated. For more involved scenarios, the package also offers a Python API which can be used to control the semantics of de-duplication.

  • nomenklatura.Dataset - implements a basic dataset for describing a set of entities.
  • nomenklatura.Store - a general purpose access mechanism for entities. By default, a store is used to access entity data stored in files as an in-memory cache, but the store can be subclassed to work with entities from a database system.
  • nomenklatura.Index - a full-text in-memory search index for FtM entities. In the application, this is used to block de-duplication candidates, but the index can also be used to drive an API etc.
  • nomenklatura.Resolver - the core of the de-duplication process, the resolver is essentially a graph with edges made out of entity judgements. The resolver can be used to store judgements or get the canonical ID for a given entity.

All of the API classes have extensive type annotations, which should make it simpler to integrate them into any modern Python codebase.

Design

This package offers an implementation of an in-memory data deduplication framework centered around the FtM data model. The idea is the following workflow:

  • Accept FtM-shaped entities from a given loader (e.g. a JSON file, or a database)
  • Build an in-memory inverted index of the entities for dedupe blocking
  • Generate merge candidates using the blocking index and FtM compare
  • Provide a file-based storage format for merge challenges and decisions
  • Provide a text-based user interface to let users make merge decisions
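The blocking step in the workflow above can be illustrated with a minimal, self-contained sketch (an illustration of the idea only, not nomenklatura's actual index): build an inverted index from name tokens to entity IDs, and propose candidate pairs only among entities that share at least one token.

```python
from collections import defaultdict
from itertools import combinations

def tokenize(name):
    # crude normalisation; drops very short tokens like "AG"
    return {t for t in name.lower().split() if len(t) > 2}

def candidate_pairs(entities):
    """Block on shared name tokens: only entities that share at least
    one token are proposed as merge candidates."""
    index = defaultdict(set)  # token -> IDs of entities containing it
    for eid, name in entities.items():
        for token in tokenize(name):
            index[token].add(eid)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

entities = {
    "e1": "Solihull Care Trust",
    "e2": "Solihull Primary Care Trust",
    "e3": "Deutsche Bank AG",
}
# e1/e2 share several tokens; e3 shares none with either
assert candidate_pairs(entities) == {("e1", "e2")}
```

Blocking like this keeps the comparison step roughly linear in practice, since the expensive FtM compare only runs on pairs that share an index token.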

Later on, the following might be added:

  • A web application to let users make merge decisions on the web

Resolver graph

The key implementation detail of nomenklatura is the Resolver, a graph structure that manages user decisions regarding entity identity. Edges are Judgements of whether two entity IDs are the same, not the same, or undecided. The resolver implements an algorithm for computing connected components, which can then be used to find the best available ID for a cluster of entities. It can also be used to evaluate transitive judgements, e.g. if A <> B and B = C, then we don't need to ask whether A = C.
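The component and transitivity logic described above can be sketched with a plain union-find structure. This is an illustration of the idea only; the class and method names here are invented and do not reflect nomenklatura's actual Resolver API.

```python
class ResolverSketch:
    """Positive judgements merge clusters; negative judgements are
    remembered between clusters, so transitive questions need not be
    put to the user again."""

    def __init__(self):
        self.parent = {}      # union-find parent pointers
        self.negative = set() # pairs of IDs judged distinct

    def _find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def judge_same(self, a, b):
        self.parent[self._find(a)] = self._find(b)

    def judge_distinct(self, a, b):
        self.negative.add((a, b))

    def decided(self, a, b):
        """True if a judgement for (a, b) follows from earlier ones."""
        ra, rb = self._find(a), self._find(b)
        if ra == rb:
            return True
        for x, y in self.negative:
            if {self._find(x), self._find(y)} == {ra, rb}:
                return True
        return False

r = ResolverSketch()
r.judge_distinct("A", "B")  # A <> B
r.judge_same("B", "C")      # B = C
assert r.decided("A", "C")  # A <> C follows transitively; no need to ask
```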

Reading

Contact, contributions etc.

This codebase is licensed under the terms of an MIT license (see LICENSE).

We're keen for any contributions, bug fixes and feature suggestions; please use the GitHub issue tracker for this repository.

Nomenklatura is currently developed thanks to a Prototypefund grant for OpenSanctions. Previous iterations of the package were developed with support from Knight-Mozilla OpenNews and the Open Knowledge Foundation Labs.

nomenklatura's People

Contributors

crccheck · dependabot[bot] · jbothma · mihi-tr · pudo · robbi5 · rufuspollock · simonwoerpel · vikt0rs


nomenklatura's Issues

Know when the dataset's data has changed

It would be nice to ask the Nomen API "has your dataset changed since last time I asked?"

When building an ETL pipeline, you optimise by only re-running steps whose inputs have changed. For a reconciliation step against Nomen, those inputs are the data you are reconciling and the Nomen data itself. The Nomen data often doesn't change between runs, so the step could frequently be skipped.

Currently I think the only way to tell if the Nomen data has changed is to call the API for the aliases:

curl http://nomenklatura.okfnlabs.org/uk25k-column-names/aliases -H 'Accept: application/json'

and then compute a hash of the response. This seems like a waste of server processing and bandwidth. Could you compute the hash of the Nomen data directly on the server and return it with the API call?
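For reference, the client-side fingerprint described in this issue is best computed over the decoded JSON rather than the raw bytes, so that irrelevant differences such as key ordering don't register as a change. A generic sketch, not part of the Nomen API:

```python
import hashlib
import json

def content_fingerprint(data):
    """Stable hash of a JSON-serialisable payload: serialise with
    sorted keys and fixed separators so semantically equal payloads
    always hash identically."""
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# key order differs, content is the same -> identical fingerprint
v1 = [{"name": "Dept of Health", "canonical": "Department of Health"}]
v2 = [{"canonical": "Department of Health", "name": "Dept of Health"}]
assert content_fingerprint(v1) == content_fingerprint(v2)
```

A pipeline would store the fingerprint alongside its outputs and re-run the reconciliation step only when a freshly fetched payload hashes differently.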

Restore upload functionality

The NK2 version has no working upload support, but restoring it should yield a much simpler and more user-friendly workflow with the new data model.

Data enrichment pipeline

We want to pull in extra details and graph links for existing targets in the OpenSanctions database from related-topic databases like OpenCorporates, Wikidata (cf. opensanctions/opensanctions#1), ICIJ OffshoreLeaks, GLEIF and the UK PSC registry. Preference here should be given to public databases that mint stable identifiers. We also want to explore how we can use BODS data and the datasets listed on the OpenOwnership registry.

In order to do this, we will introduce a new type of dataset, the Enricher. At the moment, we have a) data sources that entities are imported from, b) collections that bundle them up into sets that a data user might want to use. Enrichers would be different in that they don't originate target entities, but only provide additional details for the existing targets.

The process in general would be:

  1. All the targets (or entities) from an existing dataset are sent through a matching mechanism provided by the enricher. That matching mechanism will return a list of possible matching entities with a score. These matches must be kept in some sort of short-term storage.
  2. A user is shown the dedupe interface for each possible match in descending order of their score. If they decline the match, the entity is block-listed. For matches and undecided, the information is recorded into a new database table.
  3. When the enricher is next run, the list of decided matches is used to pull in the full data for each matching entity and their immediate graph environment. That data is then pushed to the main OpenSanctions statement database and will be included in future data exports.

While some of the Enrichers should be API-based (e.g. OpenCorporates), in other cases it might make most sense to just convert the whole source dataset into FtM format and then run the enrichment process against a local ElasticSearch index that contains the full dataset (e.g. Wikidata, PSC, ICIJ). For this purpose, we can move the ES indexing code from the API to the main codebase.
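The enricher concept could be sketched as an abstract interface along these lines (hypothetical; the class and method names are illustrative and not an existing nomenklatura API):

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, List, Tuple

Entity = Dict[str, Any]  # stand-in for an FtM entity


class Enricher(ABC):
    """A dataset that originates no targets of its own, but provides
    additional detail for existing targets."""

    @abstractmethod
    def match(self, entity: Entity) -> List[Tuple[Entity, float]]:
        """Return candidate matches, each with a score for review."""

    @abstractmethod
    def expand(self, match: Entity) -> Iterable[Entity]:
        """Fetch the full record plus its immediate graph environment."""


class ExactNameEnricher(Enricher):
    """Toy implementation: matches on exact name equality only."""

    def __init__(self, records: List[Entity]):
        self.records = records

    def match(self, entity: Entity) -> List[Tuple[Entity, float]]:
        return [(r, 1.0) for r in self.records
                if r.get("name") == entity.get("name")]

    def expand(self, match: Entity) -> Iterable[Entity]:
        yield match


enricher = ExactNameEnricher([{"name": "Siemens AG", "jurisdiction": "de"}])
matches = enricher.match({"name": "Siemens AG"})
assert matches and matches[0][1] == 1.0
```

A real enricher would implement `match` against a remote API or a local ElasticSearch index, and `expand` would walk the source's graph links once a match is confirmed.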

Replace messytables with CSVKit

Messytables has become quite chaotic and we don't really need a lot of guessing, so we could use csvkit, which has a clear API and support for a variety of formats.

Re-design entity view page.

This should contain:

  • List of aliases
  • List of attributes
  • Boxes for invalid/alias etc.
  • Link to review function
  • Delete button
  • Edit function for name

Delete a dataset

It's likely that when trialling nomenklatura, new users (such as myself) will generate one or more test datasets to get a feel for the service and how it works.

It could be useful to give users an easy way of deleting their own datasets, also freeing up the namespace they used to inhabit so that it can be reused?

There is a danger that someone might delete a dataset that someone else is reusing, which could be mitigated by allowing users to clone other peoples' datasets? The datasets may then diverge of course, as each user updates their own copy.

Couldn't write to canonical property

I want to make SCT an alias of SPCT using the Python client, and it doesn't seem to be writing the canonical property. I'm not clear whether the problem is in the client or the API.

>>> import nomenklatura
>>> opennames = nomenklatura.Dataset('public-bodies-uk')
>>> entity = opennames.entity_by_name('Solihull Care Trust')
>>> sct = opennames.entity_by_name('Solihull Care Trust')
>>> sct.canonical
>>> spct = opennames.entity_by_name('Solihull Primary Care Trust')
>>> spct.canonical
>>> sct.__data__['canonical'] = spct.__data__
>>> sct
<Entity(public-bodies-uk:28555:Solihull Care Trust:Solihull Primary Care Trust)>
>>> print sct.__data__
{u'name': u'Solihull Care Trust', u'num_aliases': 1, u'creator': {u'updated_at': u'2014-06-02T10:14:59.913031', u'created_at': u'2014-06-02T10:14:59.913020', u'login': u'davidread', u'github_id': 307612, u'id': 37}, u'created_at': u'2014-06-02T10:33:13.424085', u'updated_at': u'2014-07-22T15:14:41.926716', u'invalid': False, u'dataset': u'public-bodies-uk', u'attributes': {u'dgu-name': u'solihull-nhs-care-trust', u'dgu-uri': u'http://data.gov.uk/publisher/solihull-nhs-care-trust'}, u'reviewed': True, u'id': 28555, u'canonical': {u'name': u'Solihull Primary Care Trust', u'num_aliases': 0, u'creator': {u'updated_at': u'2014-06-02T10:14:59.913031', u'created_at': u'2014-06-02T10:14:59.913020', u'login': u'davidread', u'github_id': 307612, u'id': 37}, u'created_at': u'2014-06-02T11:02:47.969079', u'updated_at': u'2014-07-22T16:59:04.201602', u'invalid': False, u'dataset': u'public-bodies-uk', u'attributes': {}, u'reviewed': True, u'id': 29838, u'canonical': None}}
>>> sct.update()
>>> sct.canonical
<Entity(public-bodies-uk:29838:Solihull Primary Care Trust:None)>
>>> sct2 = opennames.entity_by_name('Solihull Care Trust')
>>> sct2
<Entity(public-bodies-uk:28555:Solihull Care Trust:None)>
>>> sct2.canonical

The POST data has canonical set to the full dict of the other entity:

{"name": "Solihull Care Trust",
"num_aliases": 1,
"creator": {"updated_at": "2014-06-02T10:14:59.913031", "created_at": "2014-06-02T10:14:59.913020", "login": "davidread", "github_id": 307612, "id": 37}, 
"created_at": "2014-06-02T10:33:13.424085", 
"updated_at": "2014-07-22T15:14:41.926716", 
"invalid": false, 
"dataset": "public-bodies-uk", 
"attributes": {"dgu-name": "solihull-nhs-care-trust", "dgu-uri": "http://data.gov.uk/publisher/solihull-nhs-care-trust"}, 
"reviewed": true, 
"id": 28555, 
"canonical": {"name": "Solihull Primary Care Trust", "num_aliases": 0, "creator": {"updated_at": "2014-06-02T10:14:59.913031", "created_at": "2014-06-02T10:14:59.913020", "login": "davidread", "github_id": 307612, "id": 37}, "created_at": "2014-06-02T11:02:47.969079", "updated_at": "2014-07-22T16:59:04.201602", "invalid": false, "dataset": "public-bodies-uk", "attributes": {}, "reviewed": true, "id": 29838, "canonical": null}
}

Any idea what's wrong?

nomenklatura installation

Hello,
It seems that the hosted version of nomenklatura (http://nomenklatura.okfnlabs.org/) is not working any more. I also tried to upload a CSV file with 238 lines and two columns, and all I get is a 502 Bad Gateway error.

So I tried to install nomenklatura on my computer: I downloaded the ZIP file https://github.com/pudo/nomenklatura/archive/master.zip, created a Python virtual environment with Miniconda and ran python setup.py install. But I don't really know what to do next to launch nomenklatura. Could you help?

I am also wondering whether the files you provide on GitHub are for installing a replica of http://nomenklatura.okfnlabs.org/?

Thanks,

Create and edit entities

@arc64 has started a branch at feature-edit-entity to edit and create entities. Here's a micro-todo:

  • Make it a directive or sub-view (directive for simplicity)
  • Make the "edit value" aspect a directive
  • Allow creating entities (select type first, then fill in fields)
  • Inline editing of link types (e.g. Post on a Person and Company)

Internal Server Error on Import

I've not been able to upload my names. I've got a CSV with 2 columns (UniqueID & Name) and ~110,000 rows. I've tried comma-, tab-, and Excel-formatted versions. Is it just too many entities?

Login with GitHub is broken

As the title might indicate, if I click on the Login with GitHub button I don't see any request going to GH, and I'm returned to the same page. The console has some errors; I'm not sure how related they are, but they might help identify a potential issue (attached).
hc5

I can open a new issue for the console errors, as they might well be unrelated. I haven't looked at the sources to follow the failing request, so I can't assume anything just yet.

Edit: This issue is based on the version deployed here.

Names vs id, eg in OpenRefine

Reconciliation in OpenRefine is often used to match name elements to an abstract id. In nomenklatura, the intention seems to be to reconcile one name against another using fuzzy matching, but there doesn't seem to be explicit support for matching to an id?

I guess an id could also be a name, and then aliases represent possible variants of that id, but then how would an OpenRefine user know what the canonical name was if the matched name was actually a UID?

UIDs are presumably created in nomenklatura datasets to allow for mapping between names and aliases, so I guess the question is: should users be allowed to submit their own UIDs or UID/canonical-name pairs? (That was the assumption I'd made when I originally heard about nomenklatura.)

[Super] Bring ODIS changes into Nomenklatura

After working on the ODIS/nomenklatura2 code base for a bit, I've decided that there is little reason for a non-progressive development and hope to backport changes into this code base.

Issue with Importing CSV

I am trying to import a CSV, saved on my end as comma-delimited and encoded as UTF-8. I enter it and hit import, and I see the percentage count up, but nothing happens in the end, and I can never access my files to begin processing. Any help?

Implement clustering function

A clustering function would attempt to guess a set of clusters (i.e. multiple identical entities) and then propose merging them to the user.

This can be based on the dedupe Python library.
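As a stand-in for the dedupe library, the idea can be sketched with a naive greedy clustering over pairwise string similarity (illustrative only; the real library learns a distance model from training pairs rather than using a fixed threshold):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def propose_clusters(names, threshold=0.85):
    """Greedy single-link clustering: each name joins the first cluster
    in which it resembles any existing member above the threshold,
    otherwise it starts a new cluster."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if any(similarity(name, m) >= threshold for m in cluster):
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

names = ["Deutsche Bank", "Deutsche Bank AG", "Siemens AG"]
clusters = propose_clusters(names)
assert clusters == [["Deutsche Bank", "Deutsche Bank AG"], ["Siemens AG"]]
```

Each multi-member cluster would then be proposed to the user as a merge candidate set, rather than being merged automatically.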

What is it good for?

Pudo, I don't (yet) get the sense behind this service. So users can store canonical names of entities and their aliases in the cloud. What advantage does this provide over storage in a local database, for example?

I suppose (but I am not sure, because it's not documented) that if I want to reconcile an entity (e.g. I have many aliases, but they're not yet mapped to an entity), I can use the already existing datasets of ALL users to map my own aliases (because somebody has already done that?). If yes, maybe you should add some documentation. But maybe I'm just too stupid :-)


Internal Server Error

Not sure whether to report problems like this here or somewhere site-specific for the OKFN-hosted version, but when I attempted to import a TSV spreadsheet, I got an internal server error message with no traceback or other error information.

The upload URL after this was http://nomenklatura.okfnlabs.org/test-tom/upload/7e0a7fb56f56cfddbed0d9a68e6b0907929553cf

I can make the file available, but it was basically just a de-duplicated version of the Nomenklatura openinterests-entities dataset that I downloaded and cleaned with OpenRefine.
