RCDatasets

This repo provides the datasets.json file, used as "ground truth" for the knowledge graph work in ADRF and Rich Context.

For a diagram of how this dataset list fits within the overall ETL workflow used to update the knowledge graph, see the OmniGraffle source at docs/kg_etl_workflow.graffle in this repo.

Managing Updates

Having a separate repo helps us manage changes carefully. This is metadata, not data, and it serves as the basis for linking. Any change must therefore be audited, so that an update does not break links downstream in the graph.

Consequently, each update must be handled through a pull request and audited in a code review.

  1. work in a separate branch and update from master
  2. look for other PRs (work in progress) and note the IDs used
  3. request a range of up to 5 IDs on the rich_context channel on Slack
  4. make edits in your branch
  5. confirm through unit tests: python test.py

At that point, create a PR and have someone else on the team review it.

Also, don't commit code here except for consistency checks used on the dataset list itself.

Required Fields

At a minimum, each record in the datasets.json file must have these required fields:

  • provider -- name of the data provider in providers.json
  • title -- name of the dataset
  • id -- a unique sequential identifier

For the names, use what the data provider shows on their web page and try to be as concise as possible.
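The required-field rule above could be checked with a few lines of Python. This is only a sketch: it assumes datasets.json parses to a list of objects, and `find_missing_fields` and the sample records are hypothetical, not taken from test.py.

```python
import json

REQUIRED_FIELDS = ("provider", "title", "id")


def find_missing_fields(records):
    """Return (index, missing_field_names) for each incomplete record."""
    problems = []
    for index, record in enumerate(records):
        # A field counts as missing if it is absent, None, or an empty string.
        missing = [
            field for field in REQUIRED_FIELDS
            if field not in record or record[field] in (None, "")
        ]
        if missing:
            problems.append((index, missing))
    return problems


# Hypothetical sample standing in for the contents of datasets.json.
sample = json.loads(
    '[{"provider": "Example Provider", "title": "Example Survey", "id": 1},'
    ' {"title": "Record Without Provider", "id": 2}]'
)
print(find_missing_fields(sample))
```

In practice the list would come from `json.load` on datasets.json rather than from an inline string.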

When adding records:

  • first, make sure the providers.json entry is correct
  • add to the bottom of the file
  • increment the id number manually
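As a rough illustration of the "append and increment" steps, the sketch below computes the next id and appends a record at the end of the list. It assumes ids are plain integers, which may not match the actual format in datasets.json. `next_dataset_id` and `append_record` are hypothetical helpers.

```python
from collections import Counter


def next_dataset_id(records):
    """Return the next sequential id, assuming integer ids (an assumption)."""
    ids = [record["id"] for record in records]
    duplicates = sorted(i for i, count in Counter(ids).items() if count > 1)
    if duplicates:
        raise ValueError(f"duplicate ids: {duplicates}")
    return max(ids, default=0) + 1


def append_record(records, provider, title):
    """Append a new record at the bottom of the list and return it."""
    new_record = {
        "provider": provider,
        "title": title,
        "id": next_dataset_id(records),
    }
    records.append(new_record)
    return new_record


# Hypothetical existing records.
existing = [
    {"provider": "Example Provider", "title": "First Dataset", "id": 1},
    {"provider": "Example Provider", "title": "Second Dataset", "id": 2},
]
added = append_record(existing, "Example Provider", "Third Dataset")
print(added["id"])
```

A helper like this only sees the local branch, so it does not replace coordinating id ranges with other open PRs.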

Other fields that may be included:

  • alt_title -- list of alternative titles or abbreviations, aka "mentions"
  • url -- URL for the main page describing the dataset
  • doi -- a unique persistent identifier assigned by the data provider
  • alt_ids -- stored as a list, other unique identifiers (alternative DOIs, etc.)
  • description -- a brief (tweet sized) text description of the dataset
  • date -- date of publication, which may help resolve conflicting identifiers
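A consistency check for the optional fields might look like the sketch below. The list types for alt_title and alt_ids follow the descriptions above. Treating the remaining fields as strings and reading "tweet sized" as 280 characters are assumptions, and `check_optional_fields` is a hypothetical name.

```python
LIST_FIELDS = ("alt_title", "alt_ids")
TEXT_FIELDS = ("url", "doi", "description", "date")  # string type is assumed
MAX_DESCRIPTION_CHARS = 280  # assumed meaning of "tweet sized"


def check_optional_fields(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field in LIST_FIELDS:
        if field in record and not isinstance(record[field], list):
            problems.append(f"{field} should be a list")
    for field in TEXT_FIELDS:
        if field in record and not isinstance(record[field], str):
            problems.append(f"{field} should be a string")
    description = record.get("description")
    if isinstance(description, str) and len(description) > MAX_DESCRIPTION_CHARS:
        problems.append("description is longer than the assumed limit")
    return problems


# Hypothetical record with alt_title given as a string instead of a list.
record = {
    "provider": "Example Provider",
    "title": "Example Survey",
    "id": 1,
    "alt_title": "ES",
    "alt_ids": ["example-alt-id"],
}
print(check_optional_fields(record))
```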

To Do

Quality checks on dataset entries

  • spot checks on URLs, titles, etc.
  • unify naming conventions
  • is 'program data' a dataset? revisit after November workshop

Additions to test.py

  • add check for commas within entries
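One possible reading of this item is to flag commas inside title and alt_title values, where they could be confused with separators. The sketch below follows that reading only. The choice of fields and the name `find_comma_entries` are assumptions, not part of the current test.py.

```python
def find_comma_entries(records, fields=("title", "alt_title")):
    """Return (id, field, text) for every checked value that contains a comma."""
    flagged = []
    for record in records:
        for field in fields:
            value = record.get(field, [])
            # Normalize single strings and lists to one iteration path.
            values = value if isinstance(value, list) else [value]
            for text in values:
                if isinstance(text, str) and "," in text:
                    flagged.append((record.get("id"), field, text))
    return flagged


# Hypothetical records for illustration.
records = [
    {"id": 1, "title": "Example Survey", "alt_title": ["ES", "Survey, Example"]},
    {"id": 2, "title": "Income, Poverty and Health Tables"},
]
print(find_comma_entries(records))
```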

Enrich datasets.json with additional metadata

The datasets enumerated in datasets.json may have additional metadata, which would be given to us by the data provider or client using the dataset.

These fields might include (but are not limited to):

  • keywords and categories - list of terms associated with the dataset
  • geographical coverage - geography that the dataset covers, e.g., New York State, Germany
  • temporal coverage - time period of the dataset. If the dataset is regularly released, e.g. the U.S. Census, the value could be 'decennial'
  • data steward - person responsible for protecting and sharing the dataset - id should come from data_stewards.json (not yet in existence)
  • customer - client or partner who requested that the dataset be entered into our knowledge graph - id should come from customers.json (not yet in existence)
  • long_description - longer form description of dataset
  • in_adrf - boolean value indicating whether or not the dataset is in the ADRF
  • funder - organization (could be the agency) that funded creation or dissemination of the dataset

Contributors

andrewhnorris, benjamin-feder, ceteri, claytonrsh, jasonzhangzy1757, menoah, srand525
