datasets / publicbodies

A database of public bodies such as government departments, ministries etc.

Home Page: http://publicbodies.org

License: MIT License

CSS 36.47% Python 7.17% HTML 17.40% Less 38.96%

open-government open-data open-knowledge-international ministries department police fire

publicbodies's Introduction

A database of public bodies (or organizations):

Government-run or controlled organizations or entities which may or may not have distinct corporate existence

Examples are:

Government Ministries or Departments
State-run Health organizations
Police and fire departments

Visit the site: https://publicbodies.org/

Data

Data is stored in CSVs partitioned by country or region (e.g. EU) in the data folder. Files are named by two-letter ISO code.

Contribute data

Please just add a CSV file and submit a pull request or open an issue.

The set of fields required in the CSV file can be seen in the field list on: public-body-schema.json. You can also check out the existing data in data/ for hints. To learn more about Data Packages, visit https://specs.frictionlessdata.io/.
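A quick check before opening a pull request can catch missing columns. The sketch below assumes that public-body-schema.json follows the Table Schema layout (a top-level `fields` array whose entries each have a `name`); the helper name `missing_fields` and the sample schema and rows are hypothetical.

```python
import csv
import io
import json


def missing_fields(schema_text, csv_text):
    """Return schema field names that are absent from the CSV header.

    Assumes a Table Schema style document: {"fields": [{"name": ...}, ...]}.
    """
    schema = json.loads(schema_text)
    expected = [f["name"] for f in schema["fields"]]
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    return [name for name in expected if name not in header]


# Hypothetical schema and data, for illustration only.
SCHEMA = '{"fields": [{"name": "id"}, {"name": "name"}, {"name": "jurisdiction_code"}]}'
CSV_DATA = "id,name\nxx/example,Example Ministry\n"

print(missing_fields(SCHEMA, CSV_DATA))  # ['jurisdiction_code']
```

In practice the two strings would be read from public-body-schema.json and from the new CSV file.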

If you can, developing a bot to automatically and periodically collect the data is even better.

For developers of the website

The website is a Jekyll site. To get it running locally:

Install Docker.

Get the code

git clone https://github.com/okfn/publicbodies
cd publicbodies

Run Jekyll

cd website
export JEKYLL_VERSION=4.2.0
docker run --rm --volume="$PWD:/srv/jekyll" -it jekyll/minimal:$JEKYLL_VERSION jekyll build --baseurl $PWD/_site/ --watch

The built website will appear in the website/_site folder.

The list of outstanding issues is at: https://github.com/okfn/publicbodies/issues

For developers of data collector bots

Data is kept automatically up-to-date by bots that collect and update data once a week. The scripts are kept in the scripts/import directory, in a subdirectory named after the international place code (e.g. br for Brazil, it for Italy).

The script MUST be runnable from a command line interface. It should display the available options if run with the --help parameter, and output data to the file chosen by the --output parameter. For example:

python3 scripts/import/br/import_br.py --help

usage: import_br.py [-h] [--output file_name]

Imports Brazilian public body data from the official source and complements it
with data from several auxiliary sources. Official source: [SIORG's open data
API](https://dados.gov.br/dataset/siorg)

optional arguments:
  -h, --help          show this help message and exit
  --output file_name  filename for the data output as CSV
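A minimal skeleton for a new bot that satisfies the --help and --output requirements might look like the sketch below. It is not taken from import_br.py; the function names, the placeholder row, and the choice to write to standard output when --output is omitted are all assumptions made for illustration.

```python
import argparse
import csv
import sys


def build_parser():
    parser = argparse.ArgumentParser(
        description="Imports public body data for one jurisdiction (example bot)."
    )
    parser.add_argument(
        "--output",
        metavar="file_name",
        help="filename for the data output as CSV",
    )
    return parser


def collect_rows():
    # Placeholder: a real bot would fetch and transform the official data here.
    return [{"id": "xx/example", "name": "Example Ministry"}]


def main(argv=None):
    args = build_parser().parse_args(argv)
    rows = collect_rows()
    # Assumption: fall back to standard output when no file is given.
    if args.output:
        out = open(args.output, "w", newline="", encoding="utf-8")
    else:
        out = sys.stdout
    try:
        writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    finally:
        if out is not sys.stdout:
            out.close()


if __name__ == "__main__":
    main()
```

argparse generates the -h/--help text automatically from the description and the argument help strings.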

When making requests, bots MUST use the Public Bodies Bot user agent string to identify themselves to servers:

PublicBodiesBot (https://github.com/okfn/publicbodies)
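In Python, the header can be set as in the sketch below. It uses only the standard library because the contents of the project's requirements file are not shown here; if an HTTP library is already listed there, prefer it. The helper names `build_request` and `fetch` are hypothetical.

```python
import urllib.request

USER_AGENT = "PublicBodiesBot (https://github.com/okfn/publicbodies)"


def build_request(url):
    """Create a request that identifies the bot via the User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch(url, timeout=30):
    """Download a resource and return its body as bytes."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```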

If using Python, use the libraries already defined in scripts/requirements.txt to keep the project dependencies tidy, and only add new ones if strictly necessary.

After creating a new bot, make sure to add it to the update data workflow so that it runs regularly and keeps the data up-to-date.

Original preparation

Details of the automated data extraction to build the original database.

Data sources:

Brazil
- Brazilian Government's SIORG – https://dados.gov.br/dataset/siorg
European Union
- AskTheEU.org
Italy
- Opendata IPA (amministrazioni / enti)
Germany
- FragDenStaat.de – (private GoogleDoc)
- Bund.de – https://www.bund.de/Content/DE/Behoerden/Suche/Formular.html
United Kingdom
- WhatDoTheyKnow.com – https://www.whatdotheyknow.com/body/all-authorities.csv
United States of America
- A-Z Index of U.S. Government Departments and Agencies – https://www.usa.gov/federal-agencies/a

publicbodies's Issues

Re-Add google analytics

Lost them in node upgrade ...

Specify source code license

Is the license for the source code of this project (not the data, as that is a separate issue) specified somewhere? I couldn't find it. Please include a (preferably) open source license or, if there is already one, make it more evident (e.g. mention it in the README and/or include a COPYING.txt file).

Note: it may be necessary to:

Propose a license here; and
Obtain consent from each contributor of source code in this project to license his/her work under said proposed license.

Add the Italian (IT) public bodies from the Public Administration Index

Add the list of Italian public administrations using the data available in the CSV maintained by the Italian Public Administration Index.

Lower case country leads to dead page

The search leads to links like publicbodies.org/gb which is an empty page but the front page leads to publicbodies.org/GB. These need to be harmonised.

Data for China

Data from Shen: http://ubercheckout.com/cn.csv

Push data to CKAN DataStore for querying

Decide whether or not organizational units are in scope

Which of the following is in scope of the data for this project:

a) only organizations (as in org:Organization ); or
b) organization and their respective hierarchy of organizational units (as in org:OrganizationalUnit )?

Basic tests

See https://github.com/okfn/opendatacensus/tree/master/tests for our preferred approach (using mocha, superagent etc)

Search support

Options

JS solr (lunrjs etc)
Separate solr
Google custom search (require us to build a site-map or list everything on the front page)
No search

Integrate Swiss Federal Data

Wrote a quick scraper for the directory of Swiss federal entities, see https://scraperwiki.com/scrapers/public_bodies_of_the_swiss_federation/

Names are extracted in German only, but are available in French and Italian as well
Parsing of addresses/phone numbers could be improved
Not sure if everything needed is covered and present in the right form, just tried to guess from the CSV files available - feedback very welcome!

Integrate EU WhoIsWho data

http://europa.eu/whoiswho/public/

Implement hierarchy browser

For countries for which we have a good tree structure being able to browse that tree in the UI would be very helpful.

Requirements:

Go from a public body to its parent body (done)
See a list of child bodies per public body
Present overview per jurisdiction in tree / forest form

Use Info from OpenTED

List Bodies On Per-Country Pages

The index page is quite long, and atm ~75% is probably not relevant to a given user. I spent ~15 minutes working on splitting them out. Should I continue? Thoughts?

I'm slightly confused by the website tagline

The Public Bodies tagline is "A URL for every part of government"

yet clearly non-government entities pop up on the UK list, e.g. ASDA

It would be less catchy a tagline but perhaps, "A URL for every FoI-able public sector organisation" might be more accurate, less confusing?

Document contributor workflow

Add i18n support

The web application should support internationalization.

I also suggest we create a project to localize it in Transifex.

That should help users of other languages to browse for public bodies in their native language.

datapackage.json

Add to repository all scripts that load publicly available data

We should create "scripts/import/XX" directories as needed in the repository to hold scripts to update the data, where available from public sources. That way it would be much easier to keep the data up-to-date.

United States csv

I'm on it!

Licence for whatdotheyknow data

Has the licence for the whatdotheyknow list of public bodies been established? We asked a few months ago and they didn't have one, although no doubt with a good nudge they would be happy to.

Consider adding related-bodies/related-agencies to schema

To make the data set more useful, I think we should add a field to the schema for related bodies/agencies. Perhaps the field could be populated with the values of the key field.

Thoughts?

Display country name not just code

Load country code info from e.g. http://data.okfn.org/data/country-codes and use them ...

Get descriptions for all EU items

No descriptions at the moment. This could be perfect for http://crowdcrafting.org/ or we could just put in a google spreadsheet and ask people to jump in.

Just grab the CSV from https://github.com/okfn/publicbodies/blob/master/data/raw/eu.csv and start updating the description field ...

Connect with relevant FOI sites

Would be nice to link out from a given public body to all requests related to it on relevant FoI sites

/cc @wombleton NZ could be a test case for this ...

Organisation identifiers (for discussion)

This is an idea that I've been thinking about for a while. I discussed it with @rgrp a couple of weeks ago and wanted to share it with the list to see what everyone thinks.

The short version: could public bodies be used to generate usable organisation identifiers?

Background

The IATI Standard is an XML based format for sharing detailed information about aid projects. Fundamentally, the model shows resource flows from one organisation to another, with various classifications in between and many financial transactions as part of each project. So like this:

activity (DFID -> World Health Organisation)
  - transaction (GBP 500 disbursed on 2013-05-01)
  - transaction (GBP 500 disbursed on 2013-07-05)

For the private sector and NGOs, the methodology for uniquely identifying organisations is:

Jurisdiction-National registration body-Number
e.g. for Oxfam GB, registered at the Charity Commission, with reg number 202918:
GB-CHC-202918

For governments, the following methodology is used:
Jurisdiction-OECD/DAC Agency code
e.g. for the UK's Department for International Development:
GB-1

For multilaterals, we use the following methodology:
OECD/DAC Channel code
e.g. for the World Bank's International Development Association (IDA):
44002
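The three patterns above all reduce to joining parts with hyphens. A minimal sketch, where the helper name `org_identifier` is hypothetical and not part of the IATI Standard:

```python
def org_identifier(*parts):
    """Join non-empty identifier parts with hyphens, e.g. GB-CHC-202918."""
    cleaned = [str(p).strip() for p in parts if str(p).strip()]
    if not cleaned:
        raise ValueError("at least one identifier part is required")
    return "-".join(cleaned)


# Private sector / NGO: jurisdiction, registration body, registration number.
print(org_identifier("GB", "CHC", "202918"))  # GB-CHC-202918
# Government: jurisdiction and OECD/DAC agency code.
print(org_identifier("GB", 1))                # GB-1
# Multilateral: OECD/DAC channel code on its own.
print(org_identifier(44002))                  # 44002
```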

Problems

Agency codes

Agency codes only include donor agencies. So the Ministry of Finance in Botswana, for example, does not have a code.
Agency codes don't even include all donor agencies: for example, parts of the European Commission or the United States, even though they give aid, don't have their own identifier - they're categorised under Miscellaneous.
The process for adding new agency codes is slow (even if it took a day, that might be too long)

Channel codes

Channel codes only contain a subset of all of the multilateral / international / intergovernmental organisations in the world, and many of them are not listed in a very usable way. For example, the World Health Organisation has two codes:
a) World Health Organisation - core voluntary contributions account
b) World Health Organisation - assessed contributions
--> but there isn't one for just "World Health Organisation", for example if you're contracting them to deliver a project.

Many organisations publishing IATI data will therefore struggle to provide unique organisation identifiers for many of the public sector / international organisations that they are working with.

Rationale

Official lists of organisations should be used if possible.
Official lists of organisations don't exist in most cases.
The exact identifier assigned to an organisation is not fundamentally important (whether it's BW-1 or BW-21, the Botswana Ministry of Finance just needs a code).
Organisation identifiers should be cross-mapped to other codes / identifiers for those organisations so that the data is easily interoperable.

Proposal

Fuzzy reconciliation / text matching of organisations, with an API that assigns an existing identifier where available, and creates a new one where it's not available

Organisations (initially, preferably those with a large amount of data) throw four key pieces of data at the API:

organisation name (text) - e.g. MINISTRY OF FINANCE
organisation country (code) - e.g. BW (for Botswana)
language (code) - e.g. en
last recorded transaction with this organisation (date) - e.g. 2013-07-05

the API responds with one of the following (possibly using HTTP status codes?):
a) Organisation found => use code BW-1
b) Organisation not found => created code BW-21

it also stores the data about the last recorded transaction, so that other people know that that organisation may have existed on that date.
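To make the proposal concrete, here is a minimal in-memory sketch of the lookup-or-create step. Everything in it is an assumption for discussion: the class and method names, the use of difflib's similarity ratio as the fuzzy matcher, the 0.85 threshold, and the sequential numbering of new codes.

```python
from dataclasses import dataclass, field
from datetime import date
from difflib import SequenceMatcher


def normalise(name):
    """Lower-case and collapse whitespace before comparing names."""
    return " ".join(name.lower().split())


@dataclass
class Organisation:
    code: str
    name: str
    last_seen: date


@dataclass
class Registry:
    orgs: dict = field(default_factory=dict)  # country code -> [Organisation]
    threshold: float = 0.85                   # assumed cut-off for a match

    def reconcile(self, name, country, language, last_transaction):
        """Return ("found", code) or ("created", code).

        `language` is accepted to mirror the proposal but unused in this sketch.
        """
        candidates = self.orgs.setdefault(country, [])
        best, best_score = None, 0.0
        for org in candidates:
            score = SequenceMatcher(None, normalise(name), normalise(org.name)).ratio()
            if score > best_score:
                best, best_score = org, score
        if best is not None and best_score >= self.threshold:
            # Remember the most recent date on which the organisation was seen.
            best.last_seen = max(best.last_seen, last_transaction)
            return "found", best.code
        code = f"{country}-{len(candidates) + 1}"
        candidates.append(Organisation(code, name, last_transaction))
        return "created", code


registry = Registry()
registry.orgs["BW"] = [Organisation("BW-1", "Ministry of Finance", date(2013, 5, 1))]
print(registry.reconcile("MINISTRY OF FINANCE", "BW", "en", date(2013, 7, 5)))
print(registry.reconcile("Botswana Police Service", "BW", "en", date(2013, 7, 5)))
```

The first call matches the existing BW-1 entry and the second creates a new code. A real service would need persistent storage and a matcher tuned against real data.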

Another source could be Charts of Accounts, existing lists (like those that exist on PB already), budget documents, and structured spending data, e.g. from OpenSpending.

Dealing with duplicates

This will probably lead to some duplicates being created. There could be some manual reconciliation for this. Organisations could have a primary identifier and several secondary identifiers that were used by duplicate organisations.

Dealing with changing organisations

Organisations can be created / deleted / merged in the real world. This should probably lead to:
a) created - a new identifier gets created;
b) merged - a new identifier gets created for the new organisation; and (manually) the old organisations are linked / related to the new organisation;
c) deleted - the identifier continues to exist, because old (and possibly future) data will still refer to it. However, it should be (manually) marked as no longer existing, pointing to a successor organisation if one exists (with some flag to explain whether it's a wholly .

Questions

Does this sound sensible? Is it a good idea? Is there a better alternative?
Will the fuzzy matching be accurate enough to be useful? Is it likely to assign organisations an incorrect code?
How should the identifiers be identified as being created by Public Bodies - just a prefix like PB-?

OECD-DAC codelists:

http://www.oecd.org/dac/stats/dacandcrscodelists.htm
IATI Standard:
http://iatistandard.org

Broken links

All the CSV downloads on the homepage link to "undefined.csv": http://publicbodies.org/

Lifecycle issues

Public bodies change frequently and it would be good to agree how to deal with this. I think having a sense of permanence for URLs is useful, so I suggest:

URLs for a body must never change
Title should not change. If a body changes its name then it should be handled as if it died and a new one was created.
When a body dies it should be marked as inactive.
If a body takes over the main role of a previous body, then the old body should have a 'redirect' to the new body stored with it.
If a body's abbreviation or other property changes then that is ok (e.g. DBIS -> BIS)
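One way to picture these rules is a record with an inactive flag and an optional redirect. The sketch below is only an illustration; the field names `active` and `redirect`, the function `resolve`, and the sample keys are hypothetical and not part of the current schema.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Body:
    key: str                        # never changes once published
    title: str
    active: bool = True             # set to False when the body dies
    redirect: Optional[str] = None  # key of the body that took over the role


def resolve(bodies: Dict[str, Body], key: str) -> Body:
    """Follow redirects from a body to its current successor."""
    seen = set()
    body = bodies[key]
    while body.redirect is not None:
        if body.key in seen:
            raise ValueError(f"redirect loop at {body.key}")
        seen.add(body.key)
        body = bodies[body.redirect]
    return body


bodies = {
    "gb/old-department": Body("gb/old-department", "Old Department",
                              active=False, redirect="gb/new-department"),
    "gb/new-department": Body("gb/new-department", "New Department"),
}
print(resolve(bodies, "gb/old-department").title)  # New Department
```

The old URL keeps working because the old record is never deleted, only marked inactive and pointed at its successor.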

Home page issue with Firefox

With Firefox, the two main DIVs (the one with the jurisdictions and the sidebar on the right) overlap a bit. In both Chrome and Safari they are positioned correctly.

Normalize dates to ISO 8601

Instructions for data contributors

This should probably go on the wiki once finished.

Fields

key names:
- should be url suitable: alphanumeric + '-' only
- use - rather than _
- use abbreviations where appropriate
use iso formatted date / times
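The key rule could be checked mechanically, as in the sketch below. The function names are hypothetical, and requiring lower case is an extra assumption on top of "alphanumeric + '-' only". Note that `to_key` replaces non-ASCII letters with hyphens rather than transliterating them.

```python
import re

# Alphanumeric groups separated by single hyphens (lower case is assumed here).
KEY_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_key(key):
    """Check that a key is URL-suitable: alphanumerics and '-' only."""
    return KEY_PATTERN.fullmatch(key) is not None


def to_key(text):
    """Derive a candidate key from free text, using '-' rather than '_'."""
    lowered = text.strip().lower()
    return re.sub(r"[^a-z0-9]+", "-", lowered).strip("-")


print(is_valid_key("home-office"))   # True
print(is_valid_key("home_office"))   # False
print(to_key("Department for Business, Innovation & Skills"))
# department-for-business-innovation-skills
```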

To discuss

Do we need last modified and created?
Do we want both parent and parent_key?

What Public Bodies

National or local departments or agencies
(Probably) Not every school or fire station in existence.

Asides

Write up a description of the columns

New United States data source

https://github.com/GSA-OCSIT/govt-urls

Source: http://www.infodocket.com/2014/01/29/reference-list-of-government-urls-that-do-not-end-in-gov-or-mil-crawled-by-usa-gov/

Will ingest soon.

JSON output from frontend conforms to Popolo schema

Now that #29 is done and we are in line with Popolo in the CSV this should be pretty easy

http://popoloproject.com/specs/organization.html

Slovenian government account holders

http://www.ujp.gov.si/dokumenti/dokument.asp?id=127 -- first excel links :)

Change key to use slug

Let's get rid of random generated uuid parts for keys and use slug instead.

Check that slugs are unique per jurisdiction
implement the change
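The uniqueness check could be a few lines of Python. This is a sketch only: it assumes each row carries `jurisdiction_code` and `slug` columns, which are hypothetical names for this example.

```python
from collections import Counter


def duplicate_slugs(rows):
    """Return (jurisdiction_code, slug) pairs that occur more than once."""
    counts = Counter((row["jurisdiction_code"], row["slug"]) for row in rows)
    return sorted(pair for pair, n in counts.items() if n > 1)


# Hypothetical rows, for illustration only.
rows = [
    {"jurisdiction_code": "GR", "slug": "minedu"},
    {"jurisdiction_code": "GR", "slug": "minedu"},
    {"jurisdiction_code": "GB", "slug": "minedu"},
    {"jurisdiction_code": "GR", "slug": "gak"},
]
print(duplicate_slugs(rows))  # [('GR', 'minedu')]
```

The same slug in two different jurisdictions is not reported, since uniqueness is only required per jurisdiction.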

Also:

What about rename key => id?

Ensure id present for all Greek public bodies

@okfngr just noticed that in #43 PR a lot of public bodies were missing a key field (now called id).

Would it be possible to generate and add an id field to all records - an id field is required and is necessary for the frontend to work.

We also seem to be missing jurisdiction codes (which are required) for

gr/dpa
gr/adae
gr/asep
gr/esr
gr/synigoros
gr/minedu
gr/neagenia
gr/gsae
gr/culture
gr/gss
gr/gsrt
gr/minedu
gr/minedu
gr/minedu
gr/gak
gr/gak
gr/iky

Support for sending corrections / additions

Several options:

Fork and pull (good for bulk corrections and submissions)
We could load the CSVs into google docs and have people edit then remerge
- perhaps we can / should have them permanently there
Submission of individual corrections (feedback form style) - Suggest the google forms hack approach (we'll just submit stuff into gforms via js ...) - cf http://github.com/okfn/opendatacensus which uses this technique for city submissions

Link to CSV files broken

#51 made a change uppercasing jurisdiction codes, but links on the front page are lowercase.

Check we have everything from https://www.gov.uk/government/organisations

https://www.gov.uk/government/organisations

German Public Bodies from FragDenStaat.de

The ever growing list of German public bodies on FragDenStaat.de can be accessed via the FragDenStaat.de API:

https://fragdenstaat.de/api/v1/publicbody/?format=json

It's a bit verbose. If CSV is a better fit, I can also provide a dump.

`npm run-script make` throws an error

It looks like npm install is enough to install the site now. npm run-script make fails since there’s no longer a site directory.

Switch to simple web app with templating

e.g. nodejs + nunjucks + deploy on heroku

Note we would still just load the raw CSV when the app loads - the Heroku 512 MB limit should be fine given the amount of data we have so far ...

Add keys for US data

US data is missing key field in many cases - cf #39

Use info from http://datahub.io/dataset/uk-public-bodies

Data for Quebec

I have a scraper for Quebec's public bodies (my boss authored it, and wants to contribute). It's written in Ruby, and can be seen here: https://gist.github.com/jpmckinney/5022490. How do we go about integrating this?

Build to flat files and deploy to s3

Build
Deploy

Build

Let's use nunjucks

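var nunjucks = require('nunjucks');
// Load templates from the "views" directory (example path).
var env = new nunjucks.Environment(new nunjucks.FileSystemLoader('views'));
var tmpl = env.getTemplate('test.html');
console.log(tmpl.render({ username: "james" }));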

Seems a great idea!

Current fields

Current fields and suggested changes (e.g. to be in line with popolo as much as possible). Note the list of changes is in progress and incomplete.

title => name (in org name)
abbr => abbreviation
key => id (?)
category => classification
parent => DELETE (just have parent_id)
parent_key => parent_id
description
url
jurisdiction => DELETE (just have jurisdiction code)
jurisdiction_code = ISO two-letter code where that exists. Otherwise we coin one.
source => DELETE in favour of source URL (??)
source_url => keep
- make clear there is no point pointing at exactly the same API endpoint - much more useful to point at a specific location
- (??) DELETE entirely and just credit in contributor notes (we already have a bunch of different sources for data and as people add the problem will get worse)
- Could have multiple sources per entry (??)
address
contact => What's the difference from address?
email
tags => keep
- at the moment several of the files use tags (though not necessarily consistently)
created_at => DELETE (little value ...)
updated_at => DELETE (ditto)

Add:

other_names: semi-colon separated list of alternate names
founding_date: ISO 8601
dissolution_date: ISO 8601
image
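The renames and deletions above could be applied to existing rows with a small mapping, as sketched below. It reflects the in-progress list as written, including the entries still marked with question marks (key => id, dropping source), so treat it as an illustration and not a settled migration. The names `RENAMES`, `DROPPED`, `ADDED` and `migrate_row` are hypothetical.

```python
RENAMES = {
    "title": "name",
    "abbr": "abbreviation",
    "key": "id",                 # marked (?) in the list above
    "category": "classification",
    "parent_key": "parent_id",
}
# "source" is marked (??) in the list above; dropped here for illustration.
DROPPED = {"parent", "jurisdiction", "source", "created_at", "updated_at"}
ADDED = ("other_names", "founding_date", "dissolution_date", "image")


def migrate_row(row):
    """Return a copy of a CSV row using the proposed field names."""
    new_row = {}
    for name, value in row.items():
        if name in DROPPED:
            continue
        new_row[RENAMES.get(name, name)] = value
    for name in ADDED:
        new_row.setdefault(name, "")  # new fields start out empty
    return new_row


old = {"title": "Example Ministry", "abbr": "EM", "key": "xx/example",
       "parent": "Example Government", "jurisdiction_code": "XX",
       "created_at": "2013-01-01"}
print(migrate_row(old))
```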

Consider switch to JSON from CSV

Pros / Cons

(+) Greater flexibility, ability to directly match org spec
- In particular can handle multiple values, multiple identifiers
(-) Much bigger and less compact. Harder for people to work with (e.g. CSV usable in spreadsheets etc)
(-) More complexity (but perhaps necessary)