
catalyst-cooperative / pudl-zenodo-storage


Tools for creating versioned archives of raw data on Zenodo using Frictionless data packages.

License: MIT License

Language: Python
Topics: zenodo, api, archiver, census-data, dataset, doi, eia, electricity, energy, epa

pudl-zenodo-storage's Introduction

PUDL Utils for Zenodo storage and packaging

Deprecated

This repo has been replaced by the new pudl-archiver repo, which combines both the scraping and archiving processes.

Background on Zenodo

Zenodo is an open repository maintained by CERN that allows users to archive research-related digital artifacts for free. Catalyst uses Zenodo to archive raw datasets scraped from the likes of FERC, EIA, and the EPA to ensure reliable, versioned access to the data PUDL depends on. Take a look at our archives here. In the event that any of the publishers change the format or contents of their data, remove old years, or simply cease to exist, we will have a permanent record of the data. All data uploaded to Zenodo is assigned a DOI for streamlined access and citing.

Whenever the historical data changes substantially or new years are added, we make new Zenodo archives and build out new versions of PUDL that are compatible. Pairing specific Zenodo archives with PUDL releases ensures a functioning ETL for users and developers.

Once created, Zenodo archives cannot be deleted. This is, in fact, their purpose! It also means that one ought to be sparing with the information uploaded. We don't want to wade through tons of test uploads when looking for the most recent version of data. Luckily, Zenodo has created a sandbox environment for testing API integration. Unlike the regular environment, the sandbox can be wiped clean at any time. When testing uploads, you'll want to upload to the sandbox first. Because we want to keep our Zenodo archives as clean as possible, we keep the upload tokens internal to Catalyst. If there's data you want to see integrated, and you're not part of the team, send us an email at [email protected].

One last thing: Zenodo archives for particular datasets are referred to as "depositions". Each dataset is its own deposition, created when the dataset is first uploaded to Zenodo and versioned each time the source releases new data that we upload.

Installation

We recommend using mamba to create and manage your environment.

In your terminal, run:

$ mamba env create -f environment.yml
$ mamba activate pudl-zenodo-storage

Adding a New Data Source

When you're adding an entirely new dataset to PUDL, your first course of action is building a scrapy script in the pudl-scrapers repo (https://github.com/catalyst-cooperative/pudl-scrapers). Once you've done that, you're ready to archive.

First, you'll need to fill in some metadata in the pudl repo. Start by adding a new key-value pair to the SOURCE dict in the pudl/metadata/source.py module. It's best to keep the key (the source name) simple and consistent across all repos that reference the data. Once you've done this, you'll need to install your local version of pudl (rather than the default version from GitHub) so that the Zenodo archiver script picks up the changes you made to the pudl repo.
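For reference, a new entry might look roughly like the sketch below. The key and every field value shown are hypothetical placeholders; model the real entry on the existing ones in the module.

# Hypothetical addition to the SOURCE dict in pudl/metadata/source.py.
# The key ("newdata") and all field values are placeholders.
SOURCE = {
    # ... existing entries ...
    "newdata": {
        "title": "New Example Dataset",
        "path": "https://www.example.gov/newdata",
        "description": "Raw data published by the Example agency.",
    },
}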

While in the pudl-zenodo-storage environment, navigate to the pudl repo and run:

$ pip install -e ./

You don't need to worry about the fields.py module until you're ready to transform the data in pudl.

Now, come back to this repo and create a module for the dataset in the frictionless directory. Give it the same name as the key you made for the data in the SOURCE dict. Use the existing modules as a model for your new one. The main function, datapackager(), produces the JSON datapackage descriptor for the Zenodo archival collection.
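As a rough illustration (the argument structure here is an assumption, and the real modules build much of this from pudl metadata), a datapackager() function returns a dict shaped like a frictionless datapackage descriptor:

# frictionless/newdata_raw.py -- an illustrative sketch, not the real module.
# Assumes each scraped file is described by a dict with a "filename" and a
# download link; the actual modules in this repo differ in the details.

def datapackager(dfiles):
    """Return the frictionless datapackage descriptor (a dict) for the archive."""
    return {
        "name": "pudl-newdata-raw",
        "title": "PUDL Raw New Data",
        "profile": "data-package",
        "resources": [
            {
                "name": df["filename"],
                "path": df["links"]["download"],
                "format": "zip",
                "mediatype": "application/zip",
            }
            for df in dfiles
        ],
    }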

Lastly, you need to:

  • Add archive metadata for the new dataset in the zs/metadata.py module. This includes creating a UUID (universally unique identifier) for the data. UUIDs are used to uniquely distinguish the archive prior to the creation of a DOI. You can generate one with the uuid.uuid4() function from the Python standard library (see the snippet after this list).
  • Add the chosen deposition name to the list of acceptable names output by the zenodo_store --help flag. See parse_main() in zs/cli.py.
  • Add specifications for your new deposition to the archive_selection() function, also in zs/cli.py.
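Generating the UUID is a one-off step; for example:

import uuid

# Run this once and paste the printed value into zs/metadata.py;
# don't regenerate it on every run.
print(uuid.uuid4())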

Updating an Existing Data Source

If updating an existing data source (say, one that has released a new year's worth of data), you don't need to add any new metadata to the pudl repo. Simply run the scraper for the data and then run the Zenodo script as described below. The code is built to detect any changes in the data and automatically create a new version of the same deposition when uploaded.

Running the Zenodo Archiver Script

Before you can archive data, you'll need to run the scrapy script you just created in the pudl-scrapers repo. Once you've scraped the data, you can come back and run the archiver. The archiver script, zenodo_store, is defined as an entry point in setup.py.
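For context, a console-script entry point of this kind is declared in setup.py roughly as follows; the target module and function named here are assumptions, not the repo's exact values.

# Sketch of a console_scripts entry point declaration. The "zs.cli:main"
# target is an assumed location for the script's main function.
from setuptools import setup, find_packages

setup(
    name="pudl_zenodo_storage",
    packages=find_packages(),
    entry_points={
        "console_scripts": ["zenodo_store = zs.cli:main"],
    },
)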

Next, you'll need to define ZENODO_SANDBOX_TOKEN_UPLOAD and ZENODO_TOKEN_UPLOAD environment variables on your local machine. As mentioned above, we keep these values internal to Catalyst so as to maintain a clean and reliable archive.
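For example, in your shell (the token values below are placeholders; the real ones are shared internally):

$ export ZENODO_SANDBOX_TOKEN_UPLOAD="your-sandbox-token"
$ export ZENODO_TOKEN_UPLOAD="your-production-token"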

The zenodo_store script requires you to include the name of the Zenodo deposition as an argument. This is a string value that indicates which dataset you're going to upload. Use the --help flag to see a list of supported strings. You can also find a list of the deposition names in the archive_selection() function in the cli.py module.

When you're testing an archive, you'll want to make sure you use the Zenodo sandbox rather than the official Zenodo archive (see above for more info about the sandbox). Adding the --verbose flag will print out logging messages that are helpful for debugging. Adding the --noop flag will show you whether the data you scraped differs from the data you already have uploaded to Zenodo, without uploading anything (so long as there is an existing upload to compare it to).

If the dataset is brand new, you'll also need to add the --initialize flag so that it knows to create a new deposition for the data.

Make sure a new deposition knows where to grab scraped data:

$ zenodo_store newdata --noop --verbose
Archive would contain: path/to/scraped/data

Compare a newly scraped deposition to the currently archived deposition of the same dataset. If you get the output depicted below then the archive data is the same as the scraped data, and you don't need to make a new version!

$ zenodo_store newdata --noop
{
    "create": {},
    "delete": {},
    "update": {}
}

Test run a new deposition in the sandbox (the output link is fake!):

$ zenodo_store newdata --sandbox --verbose --initialize
Uploaded path/to/scraped/data
Your new deposition archive is ready for review at https://sandbox.zenodo.org/deposit/number

Once you're confident with your upload, you can go ahead and run the script without any flags.

$ zenodo_store newdata

Repo Contents

zs

The zs.ZenodoStorage class provides an interface to create archives and upload files to Zenodo.
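Under the hood this amounts to calls against the public Zenodo deposit REST API. The sketch below shows the general shape of those calls using requests; it is illustrative only and does not reflect the class's actual method names.

# Illustrative sketch of the Zenodo REST calls that zs.ZenodoStorage wraps;
# not the class's actual interface.
import requests

API = "https://sandbox.zenodo.org/api"  # use https://zenodo.org/api in production

def create_deposition(token, metadata):
    """Create a new, empty deposition and return its JSON record."""
    resp = requests.post(
        f"{API}/deposit/depositions",
        params={"access_token": token},
        json={"metadata": metadata},
    )
    resp.raise_for_status()
    return resp.json()

def upload_file(token, deposition, path, name):
    """Upload a local file into the deposition's file bucket."""
    bucket = deposition["links"]["bucket"]
    with open(path, "rb") as f:
        resp = requests.put(f"{bucket}/{name}", params={"access_token": token}, data=f)
    resp.raise_for_status()
    return resp.json()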

frictionless

Package metadata in dict formats, as necessary to support the frictionless datapackage specification.

pudl-zenodo-storage's People

Contributors

aesharpe, bendnorman, cmgosnell, dependabot[bot], pre-commit-ci[bot], ptvirgo, zaneselvans, zschira


pudl-zenodo-storage's Issues

Create archiver for FERC Form 2

We have a real scraper for FERC Form 2, now we need a way to convert that into an archive on Zenodo.

This one is a little bit complicated, since there are now 3 different data formats, they overlap in time, and the earlier data is split into two files per year.

  • 1991-1999: Data is available in DAT format, split into 2 files per year (A-M, and N-Z)
  • 1996-2021Q2: Data is available in DBF format, one file per year.
  • 2021-present: Data is available in XBRL format via an RSS feed that has one filing per feed item.

Organize Zenodo storage repo into a python package

Right now there are multiple Python packages being defined in this repo at the top level, which doesn't conform to common assumptions about how packages work... and also leads to the local (rather than installed) package getting tested. Stuff to do:

  • Create an isolated src/pudl-zenodo-storage directory to combine the zs and frictionless packages into a single package.
  • Turn the bin/zenodo_store.py script into an entrypoint script that lives inside the single package.
  • Update setup.py to look for its package contents in src/pudl-zenodo-storage
  • Fix all the local / relative imports to work with the new package structure

Update FERC Form 1 archiver to work with DBF + XBRL

The FERC Form 1 is now published as a combination of historical VisualFoxPro DBF databases, and XBRL data, and we need a way to archive them together as a single dataset.

  • 1994-2021Q2: DBF zipfiles, one per year.
  • 2021-present: XBRL data published via RSS feed, one filing per feed item.

Set up black autoformatter as a test

We're interested in moving to using the black deterministic auto formatter in the PUDL repo, but it might be nice to experiment with setting it up here in a smaller, less busy repo first.

  • Replace autopep8 pre-commit hook with black.
  • Blacken all the existing code by running pre-commit.
  • Check in one big formatting commit.

Create Archiver for FERC Form 6

We now have a scraper for the FERC Form 6, but need to be able to archive the data on Zenodo. There are two separate sources with non-overlapping coverage:

  • 2000-2020: DBF files, one per year.
  • 2021-present: XBRL published in an RSS feed, with one filing per feed item.

Create archiver for FERC Form 60

We have a scraper for the FERC Form 60, but we need an archiver to go with it:

  • 2006-2020: DBF zipfiles, one per year
  • 2021-present: XBRL data published via an RSS feed, one filing per feed item.

Update FERC Form 714 archiver for use with CSV + XBRL

Old FERC Form 714 data is published as a collection of CSVs. New FERC Form 714 data is published as XBRL. We need an archiver that can store both formats.

  • 2006-2020: CSV files (exported from VisualFoxPro DBF apparently)
  • 2021-present: XBRL data published via RSS, one filing per feed item.

Refactor archiving process to simplify the workflow, and make it easier to add new datasets

The archiver is currently fairly confusing, which makes it difficult to add new datasets. Refactoring this workflow so there is only one place where everything related to a single dataset is defined could make this process easier. In the refactoring process, we can also simplify some of the general logic flow and improve logging functionality to make debugging the archiver much easier.
