catalyst-cooperative / pudl-catalog

An Intake catalog for distributing open energy system data liberated by Catalyst Cooperative.

Home Page: https://catalyst.coop/pudl/

License: MIT License

Languages: Jupyter Notebook 41.2%, Python 58.8%
Topics: data-catalog, database, eia, electricity, energy, epa, ferc, intake, natural-gas, open-data

pudl-catalog's Introduction

REPO IN STASIS

After some experimentation we have decided not to use Intake data catalogs, so this repository is not currently being updated or maintained. We may repurpose it for use with another data catalog system, but for the time being, please see the PUDL Data Access Documentation for instructions on how to access the data we publish.

The PUDL Data Catalog


This repository houses a data catalog distributing open energy system data liberated by Catalyst Cooperative as part of our Public Utility Data Liberation Project (PUDL). It uses the Intake library developed by Anaconda to provide a uniform interface to versioned data releases hosted on publicly accessible cloud resources.

Catalog Contents

Currently available datasets

  • Raw FERC Form 1 DB (SQLite) -- browse DB online
  • PUDL DB (SQLite) -- browse DB online
  • Census Demographic Profile 1 (SQLite)
  • Hourly Emissions from the EPA CEMS (Apache Parquet)

Ongoing Development

To track ongoing development of the PUDL Catalog, you can follow the related issues in the main PUDL repository.

PUDL Catalog Usage

Installation

You can install the PUDL Catalog using conda:

conda install -c conda-forge catalystcoop.pudl_catalog

or pip:

pip install catalystcoop.pudl-catalog

Import the Intake Catalogs

The pudl_catalog registers itself as an available data source within Intake when it's installed, so you can grab it from the top-level Intake catalog. To see what data sources are available within the catalog, you turn it into a list (yes, this is weird).

import intake
import pandas as pd
from pudl_catalog.helpers import year_state_filter

pudl_cat = intake.cat.pudl_cat
list(pudl_cat)
[
  'hourly_emissions_epacems',
  'hourly_emissions_epacems_partitioned',
  'pudl',
  'ferc1',
  'censusdp1tract'
]

Inspect the catalog data source

Printing the data source will show you the YAML that defines the source, but with all the Jinja template fields interpolated and filled in:

pudl_cat.hourly_emissions_epacems
hourly_emissions_epacems:
  args:
    engine: pyarrow
    storage_options:
      simplecache:
        cache_storage: /home/zane/.cache/intake
    urlpath: simplecache::gs://intake.catalyst.coop/dev/hourly_emissions_epacems.parquet
  description: Hourly pollution emissions and plant operational data reported via
    Continuous Emissions Monitoring Systems (CEMS) as required by 40 CFR Part 75.
    Includes CO2, NOx, and SO2, as well as the heat content of fuel consumed and gross
    power output. Hourly values reported by US EIA ORISPL code and emissions unit
    (smokestack) ID.
  driver: intake_parquet.source.ParquetSource
  metadata:
    catalog_dir: /home/zane/code/catalyst/pudl-catalog/src/pudl_catalog/
    license:
      name: CC-BY-4.0
      path: https://creativecommons.org/licenses/by/4.0
      title: Creative Commons Attribution 4.0
    path: https://ampd.epa.gov/ampd
    provider: US Environmental Protection Agency Air Markets Program
    title: Continuous Emissions Monitoring System (CEMS) Hourly Data
    type: application/parquet

Data source specific metadata

The source.discover() method will show you some internal details of the data source, including what columns are available and their data types:

pudl_cat.hourly_emissions_epacems.discover()
{'dtype': {'plant_id_eia': 'int32',
  'unitid': 'object',
  'operating_datetime_utc': 'datetime64[ns, UTC]',
  'year': 'int32',
  'state': 'int64',
  'facility_id': 'int32',
  'unit_id_epa': 'object',
  'operating_time_hours': 'float32',
  'gross_load_mw': 'float32',
  'heat_content_mmbtu': 'float32',
  'steam_load_1000_lbs': 'float32',
  'so2_mass_lbs': 'float32',
  'so2_mass_measurement_code': 'int64',
  'nox_rate_lbs_mmbtu': 'float32',
  'nox_rate_measurement_code': 'int64',
  'nox_mass_lbs': 'float32',
  'nox_mass_measurement_code': 'int64',
  'co2_mass_tons': 'float32',
  'co2_mass_measurement_code': 'int64'},
 'shape': (None, 19),
 'npartitions': 1,
 'metadata': {'title': 'Continuous Emissions Monitoring System (CEMS) Hourly Data',
  'type': 'application/parquet',
  'provider': 'US Environmental Protection Agency Air Markets Program',
  'path': 'https://ampd.epa.gov/ampd',
  'license': {'name': 'CC-BY-4.0',
   'title': 'Creative Commons Attribution 4.0',
   'path': 'https://creativecommons.org/licenses/by/4.0'},
  'catalog_dir': '/home/zane/code/catalyst/pudl-catalog/src/pudl_catalog/'}}

Note

If the data has not been cached this method might take a while to finish depending on your internet speed. The EPA CEMS parquet data is almost 5 GB.

Read some data from the catalog

To read data from the source, you call it with some arguments. Here we’re supplying filters (in “disjunctive normal form”) that select only a subset of the available years and states. This limits the set of Parquet files that need to be scanned to find the requested data (since the files are partitioned by year and state) and also ensures that you don’t get back a 100 GB dataframe that crashes your laptop. These arguments are passed through to dask.dataframe.read_parquet(), since Dask dataframes are the default container for Parquet data. Given those arguments, you convert the source to a Dask dataframe and then use .compute() on that dataframe to actually read the data and return a pandas dataframe:

filters = year_state_filter(
    years=[2019, 2020],
    states=["ID", "CO", "TX"],
)
epacems_df = (
    pudl_cat.hourly_emissions_epacems(filters=filters)
    .to_dask()
    .compute()
)
cols = [
    "plant_id_eia",
    "operating_datetime_utc",
    "year",
    "state",
    "operating_time_hours",
    "gross_load_mw",
    "heat_content_mmbtu",
    "co2_mass_tons",
]
epacems_df[cols].head()
plant_id_eia operating_datetime_utc year state operating_time_hours gross_load_mw heat_content_mmbtu co2_mass_tons
469 2019-01-01 07:00:00+00:00 2019 CO 1.0 203.0 2146.2 127.2
469 2019-01-01 08:00:00+00:00 2019 CO 1.0 203.0 2152.7 127.6
469 2019-01-01 09:00:00+00:00 2019 CO 1.0 204.0 2142.2 127.0
469 2019-01-01 10:00:00+00:00 2019 CO 1.0 204.0 2129.2 126.2
469 2019-01-01 11:00:00+00:00 2019 CO 1.0 204.0 2160.6 128.1

For more usage examples, see the Jupyter notebook at notebooks/pudl-catalog.ipynb.

Planned data distribution system

We’re in the process of implementing automated nightly builds of all of our data products for each development branch with new commits in the main PUDL repository. This will allow us to do exhaustive integration testing and data validation on a daily basis. If all of the tests and data validation pass, then a new version of the data products (SQLite databases and Parquet files) will be produced, and placed into cloud storage.

These outputs will be made available via a data catalog on a corresponding branch in this pudl-catalog repository. In general only the catalogs and data resources corresponding to the HEAD of development and feature branches will be available. Releases that are tagged on the main branch will be retained long term.

The idea is that for any released version of PUDL, you should also be able to install a corresponding data catalog, and know that the software and the data are compatible. You can also install just the data catalog with minimal dependencies, and not need to worry about the PUDL software that produced it at all, if you simply want to access the DBs or Parquet files directly.

In development, this arrangement will mean that every morning you should have access to a fully processed set of data products that reflect the branch of code that you’re working on, rather than the data and code getting progressively further out of sync as you do development, until you take the time to re-run the full ETL locally yourself.

Benefits of Intake Catalogs

The Intake docs list a bunch of potential use cases. Here are some features that we’re excited to take advantage of:

Rich Metadata

The Intake catalog provides a human and machine readable container for metadata describing the underlying data, so that you can understand what the data contains before downloading all of it. We intend to automate the production of the catalog using PUDL’s metadata models so it’s always up to date.

Local data caching

Rather than downloading the same data repeatedly, in many cases it’s possible to transparently cache the data locally for faster access later. This is especially useful when you’ve got plenty of disk space and a slower network connection, or typically only work with a small subset of a much larger dataset.
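
Below is a minimal sketch of the caching layer this relies on, using fsspec's simplecache:: protocol directly (the URL and cache directory are modeled on the catalog entry shown earlier; gcsfs must be installed and the object readable). Note that simplecache downloads the whole remote file on first access and serves later reads from the local copy.

import fsspec

# First open() copies the remote Parquet file into the local cache directory;
# subsequent opens read the cached copy instead of re-fetching from GCS.
url = "simplecache::gs://intake.catalyst.coop/dev/hourly_emissions_epacems.parquet"
with fsspec.open(url, simplecache={"cache_storage": "/tmp/pudl-intake-cache"}) as f:
    print(f.read(4))  # b"PAR1", the Parquet magic bytes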

Manage data like software

Intake data catalogs can be packaged and versioned just like Python software packages, allowing us to manage dependencies between different versions of software and the data it operates on to ensure they are compatible. It also allows you to have multiple versions of the same data installed locally, and to switch between them seamlessly when you change software environments. This is especially useful when doing a mix of development and analysis, where we need to work with the newest data (which may not yet be fully integrated) as well as previously released data and software that’s more stable.

A Uniform API

All the data sources of a given type (parquet, SQL) would have the same interface, reducing the number of things a user needs to remember to access the data.
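
As a minimal sketch of what that looks like in practice (entry names are taken from the catalog listing and issues elsewhere in this document, and are not guaranteed to be current):

import intake
from pudl_catalog.helpers import year_state_filter

pudl_cat = intake.cat.pudl_cat

# Parquet-backed source: instantiate with filters, then compute a pandas dataframe.
epacems = pudl_cat.hourly_emissions_epacems(
    filters=year_state_filter(years=[2020], states=["ID"])
).to_dask().compute()

# SQLite-backed table: the same Intake idiom, via the nested pudl sub-catalog.
gens = pudl_cat.pudl.generators_eia860.read()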

Decoupling Data Location and Format

Having users access the data through the catalog rather than directly means that the underlying storage location and file formats can change over time as needed without requiring the user to change how they are accessing the data.

Additional Intake Resources

Other Related Energy & Climate Data Catalogs

CarbonPlan is a non-profit research organization focused on climate and energy system data analysis. They manage their data inputs and products using Intake, and the catalogs are public.

Pangeo Forge is a collaborative project providing analysis-ready, cloud-optimized (ARCO) scientific datasets, primarily related to the earth sciences, including climate data. The motivation and benefits of this approach are described in this paper: Pangeo Forge: Crowdsourcing Analysis-Ready, Cloud Optimized Data Production

Licensing

Our code, data, and other work are permissively licensed for use by anybody, for any purpose, so long as you give us credit for the work we've done.

  • For software we use the MIT License.
  • For data, documentation, and other non-software works we use the CC-BY-4.0 license.

Contact Us

  • For general support, questions, or other conversations around the project that might be of interest to others, check out the GitHub Discussions
  • If you'd like to get occasional updates about our projects sign up for our email list.
  • Want to schedule a time to chat with us one-on-one? Join us for Office Hours
  • Follow us on Twitter: @CatalystCoop
  • More info on our website: https://catalyst.coop
  • For private communication about the project or to hire us to provide customized data extraction and analysis, you can email the maintainers: [email protected]

About Catalyst Cooperative

Catalyst Cooperative is a small group of data wranglers and policy wonks organized as a worker-owned cooperative consultancy. Our goal is a more just, livable, and sustainable world. We integrate public data and perform custom analyses to inform public policy (Hire us!). Our focus is primarily on mitigating climate change and improving electric utility regulation in the United States.

Funding

This work is supported by a generous grant from the Alfred P. Sloan Foundation and their Energy & Environment Program.

Storage and egress fees for this data are covered by Amazon Web Services' Open Data Sponsorship Program.

pudl-catalog's People

Contributors

Contributors: bendnorman, dependabot[bot], jdangerx, pre-commit-ci[bot], zaneselvans


pudl-catalog's Issues

Replace storage_option with AWS S3 bucket

PUDL has been accepted to the Open Data Sponsorship Program on AWS, which covers storage and egress fees of S3 buckets that contain our data. This is great news because our users won't have to set up a GCP account to deal with requester pays.

Tasks:

  • Follow the onboarding steps for the AWS program
  • Update the base URLs in the intake catalog.
  • Update the nightly build script to load the outputs to S3.
  • Add AWS credential secrets to github.
  • Share the program with OpenAddresses.
  • Update CATALOG_VERSION to v2022.11.30.
  • Copy v2022.11.30 outputs to AWS bucket.
  • Add install instructions to pudl-catalog.
  • Add a link to the readthedocs site on the pudl-catalog GitHub.
  • Try interacting with the intake catalog using the s3 bucket.
  • Update documentation explaining AWS bucket. Remove the requester pays documentation.
  • Figure out how to download objects from the S3 bucket for internal use. Add to PUDL documentation.
  • Double check the logs are working as expected (interaction from intake is logged, IP, size, file). Log fields
  • Get MFA recovery code and make sure there is a second MFA method.
  • Give Zane awsopendata credentials.
  • Create a tutorial on how to use the intake catalogs in AWS. Can we just say: run this code in a jupyter notebook running in EMR Studio? Maybe include Athena queries from parquet files? See example.
  • Add yaml file to the AWS open data github.
  • Add s3 integration tests to pudl-catalog.

AWS Notes

AWS MFA Notes

Speed up querying of partitioned Parquet data on GCS

Using pd.read_parquet()

When using pd.read_parquet(), reading data from a collection of remote parquet files over the gcs:// protocol takes twice as long as reading from a single parquet file, but no similar slowdown occurs locally:

# Select ~1% of the 800M rows in the dataset, from 6 of 1274 row groups: 
filters = [
    [('year', '=', 2019), ('state', '=', 'ID')],
    [('year', '=', 2019), ('state', '=', 'CO')],
    [('year', '=', 2019), ('state', '=', 'TX')],
    [('year', '=', 2020), ('state', '=', 'ID')],
    [('year', '=', 2020), ('state', '=', 'CO')],
    [('year', '=', 2020), ('state', '=', 'TX')]
]

single_file_local = pd.read_parquet("../data/hourly_emissions_epacems.parquet", filters=filters)
# CPU times: user 2.58 s, sys: 778 ms, total: 3.35 s
# Wall time: 2.23 s

multi_file_local = pd.read_parquet("../data/hourly_emissions_epacems", filters=filters)
# CPU times: user 4.57 s, sys: 1.01 s, total: 5.58 s
# Wall time: 2.67 s

single_file_remote = pd.read_parquet("gcs://catalyst.coop/intake/test/hourly_emissions_epacems.parquet", filters=filters)
# CPU times: user 5.33 s, sys: 1.22 s, total: 6.56 s
# Wall time: 25 s

multi_file_remote = pd.read_parquet("gcs://catalyst.coop/intake/test/hourly_emissions_epacems", filters=filters)
# CPU times: user 16.2 s, sys: 2.61 s, total: 18.8 s
# Wall time: 51.7 s
  • Is it not able to use the pushdown filtering remotely to only scan the files / block groups that have the requested data?
  • Looking at the reports from %%time the user time does double locally for the partitioned data, but the elapsed time doesn't. Is it working with multiple threads locally, but only a single thread remotely?

Using intake_parquet

Even ignoring the close to 12 minutes of apparent network transfer time, the same query only took 25 seconds with pd.read_parquet() and here it took 3 minutes. Really need to be able to toggle caching on and off before I can experiment here.

# Not sure giving it empty storage options had the effect of disabling caching.
# It seems to have re-downloaded the whole dataset and put it... where?
single_file_intake = pudl_cat.hourly_emissions_epacems(
    storage_options={}, filters=filters
).to_dask().compute()
# CPU times: user 2min 17s, sys: 44.2 s, total: 3min 1s
# Wall time: 14min 49s

Improve requester-pays documentation

Working with requester-pays data requires a little bit of cloud setup, which may be a challenge for some users. Update instructions based on suggestions from @cmgosnell in #23

  • Create a separate page in the docs explaining how to set up requester pays.
  • Grab some screenshots from the GCP console for setting up the project, setting IAM permissions.
  • Make it clear which options to choose when doing gcloud init
    • Re-initialize this configuration with new settings
    • Pick the cloud project that you just created in GCP
  • Update README to point at requester pays documentation for setup.

Add FERC XBRL derived SQLite DBs to PUDL Catalog

We now have a host of additional XBRL derived FERC databases, which are output by the nightly builds and should be included in the catalog, along with their metadata. JSON files can also be made into intake catalog sources, so we can provide direct access to the datapackage and XBRL taxonomy derived metadata.

In addition we should export intake catalog metadata in YAML for use with each of these datasets.

  • FERC 1 XBRL
    • SQLite DB
    • Datapackage descriptor (JSON)
    • Taxonomy metadata (JSON)
    • Intake catalog metadata (YAML)
  • FERC 2 XBRL
    • SQLite DB
    • Datapackage descriptor (JSON)
    • Taxonomy metadata (JSON)
    • Intake catalog metadata (YAML)
  • FERC 6 XBRL
    • SQLite DB
    • Datapackage descriptor (JSON)
    • Taxonomy metadata (JSON)
    • Intake catalog metadata (YAML)
  • FERC 60 XBRL
    • SQLite DB
    • Datapackage descriptor (JSON)
    • Taxonomy metadata (JSON)
    • Intake catalog metadata (YAML)
  • FERC 714 XBRL
    • SQLite DB
    • Datapackage descriptor (JSON)
    • Taxonomy metadata (JSON)
    • Intake catalog metadata (YAML)

Add catalystcoop.pudl_catalog to our JupyterHub

Once we have an initial release of the data catalog on PyPI / conda-forge (see #10):

  • Add catalystcoop.pudl_catalog to the environment specified in the pudl-examples repo which builds the Docker container for our JupyterHub.
  • Test accessing the cloud data on the JupyterHub both using pd.read_parquet() and through the Intake catalog, both for the monolithic and partitioned datasets, with and without caching turned on to see what the user experience is like.
  • Make this the default way of accessing the EPA CEMS data on the JupyterHub, so we can reduce our disk usage and avoid the hassle of uploading new data to the hub whenever we do a data release.

Constrain allowable years and states for filtering

The EPA CEMS dataset is composed of ~1300 row groups, each containing a unique combination of year and state to allow efficient pushdown filtering by time and location. Only a certain range of years (1995-2020) and set of state abbreviations (continental US plus DC) are valid for filtering. It would be nice if we could at least suggest, and preferably require that users only attempt to filter with valid values, so that if they ask for something outside of the allowable values they get an error, rather than waiting a long time for a query that won't give them anything useful.

Is this easy to set up with the intake catalog? Can we designate an allowable set of values for years and states to be used as filters? How are user parameters meant to be used? I've seen that you can enumerate allowable values there, but they seem only to be for use in Jinja templating of the filenames, and not for things like the filters.
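
One possible shape for this, sketched as a plain Python guard rather than anything Intake-specific (the helper name and the exact membership sets are assumptions; the valid ranges mirror the constraints described above):

from pudl_catalog.helpers import year_state_filter

VALID_YEARS = set(range(1995, 2021))  # 1995-2020, per the partitioning described above
VALID_STATES = {"CO", "ID", "TX", "DC"}  # ...plus the rest of the continental US states


def checked_year_state_filter(years, states):
    """Fail fast on invalid filter values instead of running a slow, empty query."""
    bad_years = set(years) - VALID_YEARS
    bad_states = set(states) - VALID_STATES
    if bad_years or bad_states:
        raise ValueError(f"Invalid filter values: years={bad_years}, states={bad_states}")
    return year_state_filter(years=years, states=states)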

Automate generation of EPA CEMS metadata for data catalog export

We want to integrate column and table metadata (e.g. text descriptions) into the source definition in pudl_catalog.yaml so that users can understand what data is available when browsing the catalog. This information is currently being written into the column and table metadata within the Parquet files during ETL, so it could be read from there. It could be exported from our Pydantic metadata models when we generate pudl_catalog.yaml.

  • Identify or create an appropriate structure / format for table & column level metadata in the pudl_catalog.yaml. This should include at least:
    • Text description of the table (Resource.description)
    • Primary key of the table (Resource.schema.primary_key)
    • Text descriptions for each column (Field.description)
    • Licensing terms for the data (Resource.license)
    • The original source of the data (Resource.sources)
    • Creator(s) / Maintainer(s) of the dataset (Resource.contributors)
  • Add a Resource.to_intake_data_source() method that can generate the Intake data source level metadata entry.

Fix missing / incomplete Parquet & Intake metadata

The source.discover() method shows some details about the internals of a data source within an intake catalog. E.g.

import intake
pudl_cat = intake.cat.pudl_cat
pudl_cat.hourly_emissions_epacems.discover()
{
    'dtype': {
        'plant_id_eia': 'int32',
        'unitid': 'object',
        'operating_datetime_utc': 'datetime64[ns, UTC]',
        'year': 'int32',
        'state': 'int64',
        'facility_id': 'int32',
        'unit_id_epa': 'object',
        'operating_time_hours': 'float32',
        'gross_load_mw': 'float32',
        'heat_content_mmbtu': 'float32',
        'steam_load_1000_lbs': 'float32',
        'so2_mass_lbs': 'float32',
        'so2_mass_measurement_code': 'int64',
        'nox_rate_lbs_mmbtu': 'float32',
        'nox_rate_measurement_code': 'int64',
        'nox_mass_lbs': 'float32',
        'nox_mass_measurement_code': 'int64',
        'co2_mass_tons': 'float32',
        'co2_mass_measurement_code': 'int64'
    },
    'shape': (None, 19),
    'npartitions': 1,
    'metadata': {
        'title': 'Continuous Emissions Monitoring System (CEMS) Hourly Data',
        'type': 'application/parquet',
        'provider': 'US Environmental Protection Agency Air Markets Program',
        'path': 'https://ampd.epa.gov/ampd',
        'license': {
            'name': 'CC-BY-4.0',
            'title': 'Creative Commons Attribution 4.0',
            'path': 'https://creativecommons.org/licenses/by/4.0'
        },
        'catalog_dir': '/home/zane/code/catalyst/pudl-data-catalog/src/pudl_catalog/'
    }
}

However, some of this information doesn't reflect what's in the parquet files as well as it could. We should make sure:

  • unitid and unit_id_epa show up as string not object
  • The category columns state, so2_mass_measurement_code, nox_rate_measurement_code, nox_mass_measurement_code, and co2_mass_measurement_code show up as category instead of int64 (presumably they're appearing as integers because integers are keys in a dictionary of categorical values?)
  • The shape tuple should indicate the number of rows in the dataset, rather than None since that information is stored in the Parquet file metadata.
  • The nullability of columns dtypes should be preserved.

Some of these issues seem to be arising from Intake, and some of them seem to arise from the metadata that's getting written to the Parquet files in ETL. Looking at the type information for a sample of the data after it's been read back into a pandas dataframe:

filters = year_state_filter(
    years=[2019, 2020],
    states=["CO", "ID", "TX"],
)
epacems_df = (
    pudl_cat.hourly_emissions_epacems(filters=filters)
    .to_dask().compute()
)
epacems_df.info(show_counts=True, memory_usage="deep")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8006424 entries, 0 to 8006423
Data columns (total 19 columns):
 #   Column                     Non-Null Count    Dtype              
---  ------                     --------------    -----              
 0   plant_id_eia               8006424 non-null  int32              
 1   unitid                     8006424 non-null  object             
 2   operating_datetime_utc     8006424 non-null  datetime64[ns, UTC]
 3   year                       8006424 non-null  int32              
 4   state                      8006424 non-null  category           
 5   facility_id                8006424 non-null  int32              
 6   unit_id_epa                8006424 non-null  object             
 7   operating_time_hours       8003928 non-null  float32            
 8   gross_load_mw              8006424 non-null  float32            
 9   heat_content_mmbtu         8006424 non-null  float32            
 10  steam_load_1000_lbs        33252 non-null    float32            
 11  so2_mass_lbs               3586052 non-null  float32            
 12  so2_mass_measurement_code  3586052 non-null  category           
 13  nox_rate_lbs_mmbtu         3716001 non-null  float32            
 14  nox_rate_measurement_code  3716001 non-null  category           
 15  nox_mass_lbs               3716549 non-null  float32            
 16  nox_mass_measurement_code  3716549 non-null  category           
 17  co2_mass_tons              3688397 non-null  float32            
 18  co2_mass_measurement_code  3688397 non-null  category           
dtypes: category(5), datetime64[ns, UTC](1), float32(8), int32(3), object(2)
memory usage: 1.3 GB

The categorical values show up correctly as categories, but the other type issues (nullability, string vs. object) remain. In my experimentation with different ways of writing out the files I think I did see strings, nullable types, and category types coming through fine in this information in the past, so I think there's something wrong with the Parquet metadata. Reading in one file and looking at the metadata directly, they all appear to be correct:

import pyarrow.parquet as pq
epacems_pq = pq.read_table("../data/hourly_emissions_epacems/epacems-2020-ID.parquet")
{name: dtype for name, dtype in zip(epacems_pq.schema.names, epacems_pq.schema.types)}
{
    'plant_id_eia': DataType(int32),
    'unitid': DataType(string),
    'operating_datetime_utc': TimestampType(timestamp[ms, tz=UTC]),
    'year': DataType(int32),
    'state': DictionaryType(dictionary<values=string, indices=int32, ordered=0>),
    'facility_id': DataType(int32),
    'unit_id_epa': DataType(string),
    'operating_time_hours': DataType(float),
    'gross_load_mw': DataType(float),
    'heat_content_mmbtu': DataType(float),
    'steam_load_1000_lbs': DataType(float),
    'so2_mass_lbs': DataType(float),
    'so2_mass_measurement_code': DictionaryType(dictionary<values=string, indices=int32, ordered=0>),
    'nox_rate_lbs_mmbtu': DataType(float),
    'nox_rate_measurement_code': DictionaryType(dictionary<values=string, indices=int32, ordered=0>),
    'nox_mass_lbs': DataType(float),
    'nox_mass_measurement_code': DictionaryType(dictionary<values=string, indices=int32, ordered=0>),
    'co2_mass_tons': DataType(float),
    'co2_mass_measurement_code': DictionaryType(dictionary<values=string, indices=int32, ordered=0>)
}
epacems_pq.schema
plant_id_eia: int32 not null
  -- field metadata --
  description: 'The unique six-digit facility identification number, also' + 69
unitid: string not null
  -- field metadata --
  description: 'Facility-specific unit id (e.g. Unit 4)'
operating_datetime_utc: timestamp[ms, tz=UTC] not null
  -- field metadata --
  description: 'Date and time measurement began (UTC).'
year: int32 not null
  -- field metadata --
  description: 'Year the data was reported in, used for partitioning EPA ' + 5
state: dictionary<values=string, indices=int32, ordered=0>
  -- field metadata --
  description: 'Two letter US state abbreviation.'
facility_id: int32
  -- field metadata --
  description: 'New EPA plant ID.'
unit_id_epa: string
  -- field metadata --
  description: 'Emissions (smokestake) unit monitored by EPA CEMS.'
operating_time_hours: float
  -- field metadata --
  description: 'Length of time interval measured.'
gross_load_mw: float not null
  -- field metadata --
  description: 'Average power in megawatts delivered during time interval' + 10
heat_content_mmbtu: float not null
  -- field metadata --
  description: 'The energy contained in fuel burned, measured in million ' + 4
steam_load_1000_lbs: float
  -- field metadata --
  description: 'Total steam pressure produced by a unit during the report' + 8
so2_mass_lbs: float
  -- field metadata --
  description: 'Sulfur dioxide emissions in pounds.'
so2_mass_measurement_code: dictionary<values=string, indices=int32, ordered=0>
  -- field metadata --
  description: 'Identifies whether the reported value of emissions was me' + 47
nox_rate_lbs_mmbtu: float
  -- field metadata --
  description: 'The average rate at which NOx was emitted during a given ' + 12
nox_rate_measurement_code: dictionary<values=string, indices=int32, ordered=0>
  -- field metadata --
  description: 'Identifies whether the reported value of emissions was me' + 47
nox_mass_lbs: float
  -- field metadata --
  description: 'NOx emissions in pounds.'
nox_mass_measurement_code: dictionary<values=string, indices=int32, ordered=0>
  -- field metadata --
  description: 'Identifies whether the reported value of emissions was me' + 47
co2_mass_tons: float
  -- field metadata --
  description: 'Carbon dioxide emissions in short tons.'
co2_mass_measurement_code: dictionary<values=string, indices=int32, ordered=0>
  -- field metadata --
  description: 'Identifies whether the reported value of emissions was me' + 47
-- schema metadata --
description: 'Hourly emissions and plant operational data reported via Co' + 68
primary_key: 'plant_id_eia,unitid,operating_datetime_utc'

However... the epacems_pq.schema.pandas_metadata is None, so it's relying on the default mapping of PyArrow types to Pandas types, which isn't what we want it to do.
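
One way to check this directly (a sketch; the file path is the local example used elsewhere in this issue) is to read just the schema and inspect its pandas metadata, which pyarrow stores under the b"pandas" key of the schema-level metadata:

import pyarrow.parquet as pq

schema = pq.read_schema("../data/hourly_emissions_epacems/epacems-2020-ID.parquet")
print(schema.pandas_metadata)  # None here, so readers fall back to the default type mapping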

Why isn't the pandas metadata being embedded in the Parquet file? Is it possible to explicitly insert it? The function that's writing the Parquet files is: pudl.etl._etl_one_year_epacems() and it's using pa.Table.from_pandas() so.... wtf?

def _etl_one_year_epacems(
    year: int,
    states: List[str],
    pudl_db: str,
    out_dir: str,
    ds_kwargs: Dict[str, Any],
) -> None:
    """Process one year of EPA CEMS and output year-state paritioned Parquet files."""
    pudl_engine = sa.create_engine(pudl_db)
    ds = Datastore(**ds_kwargs)
    schema = Resource.from_id("hourly_emissions_epacems").to_pyarrow()

    for state in states:
        with pq.ParquetWriter(
            where=Path(out_dir) / f"epacems-{year}-{state}.parquet",
            schema=schema,
            compression="snappy",
            version="2.6",
        ) as pqwriter:
            logger.info(f"Processing EPA CEMS hourly data for {year}-{state}")
            df = pudl.extract.epacems.extract(year=year, state=state, ds=ds)
            df = pudl.transform.epacems.transform(df, pudl_engine=pudl_engine)
            pqwriter.write_table(
                pa.Table.from_pandas(df, schema=schema, preserve_index=False)
            )

Add tests for `pudl.sqlite` and `ferc1.sqlite`

Right now CI only tests whether the EPA CEMS parquet data is working, but we've included the pudl.sqlite and ferc1.sqlite databases in the manifest as well, so they also need to be tested.

Messing around with the v2022.11.30 data I found that there were a variety of issues with some tables in the PUDL DB, and none of the data in the ferc1 DB was accessible so... there's work to be done here. I've implemented just the most basic tests as an example of some of these problems in #75 and marked the ones that aren't working with xfail.

Some potential tests to implement (a minimal sketch of the first two appears after this list)

  • Check that urlpath to pudl.sqlite looks reasonable
  • Check that urlpath to ferc1.sqlite looks reasonable
  • Check that a few expected tables exist in pudl.sqlite
  • Check that a few expected tables exist in ferc1.sqlite
  • Check that the number of tables in pudl.sqlite is at least some minimum.
  • Check that the number of tables in ferc1.sqlite is at least some minimum.
  • Read a table from pudl.sqlite and check that it has a reasonable shape and contents.
  • Read a table from ferc1.sqlite and check that it has a reasonable shape and contents.
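
A minimal pytest sketch of the first two checks above (hypothetical test code, not the repo's actual tests; it assumes the pudl and ferc1 entries behave as sub-catalogs whose entries are DB tables, consistent with usage elsewhere in this document, and the table names are just examples):

import intake
import pytest


@pytest.mark.parametrize(
    "entry,table",
    [("pudl", "plants_entity_eia"), ("ferc1", "f1_steam")],
)
def test_expected_table_is_listed(entry, table):
    # Listing a SQLite-backed sub-catalog should enumerate its tables.
    pudl_cat = intake.cat.pudl_cat
    assert table in list(pudl_cat[entry])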

Improve catalog-level CI tests

The CarbonPlan data catalog repo provides some examples of tests that could apply to our catalog as well.

These include (a minimal sketch of the metadata check appears after this list):

  • Ensure all catalog entries have a set of required metadata attributes.
  • Verify that the expected files are actually available in cloud storage
  • Use to_dask() to avoid the need to download data while verifying that EPA CEMS dataframes have the right columns.
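
A minimal sketch of the required-metadata check (hypothetical test code; the required field names are taken from the catalog metadata shown earlier in this document):

import intake

REQUIRED_METADATA = {"title", "license", "provider", "path"}


def test_all_sources_have_required_metadata():
    pudl_cat = intake.cat.pudl_cat
    for name in list(pudl_cat):
        missing = REQUIRED_METADATA - set(pudl_cat[name].metadata or {})
        assert not missing, f"{name} is missing metadata fields: {missing}"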

Use external metadata to improve filter/cache performance

Maybe this shouldn't be surprising, but when you query the whole collection of Parquet files with caching on, they all get downloaded, even if you're only reading data from a few of them, because as it is now you still need to access metadata inside the Parquet files to figure out which ones contain the data you're looking for.

This defeats some of the purpose of caching, since the first time you do a query/filter, you have to wait 10+ minutes for it all to download. Probably this wouldn't be an issue on cloud resources with 1-10 Gbps of network bandwidth, but it's a pain on our home network connections.

It looks like pyarrow supports _metadata sidecar files, which would hopefully speed up scanning the whole dataset considerably. But it also looks like it's tied to writing out a PyArrow dataset, rather than just a collection of files with the same schema in the same directory (which means all the columns are in all the files, and the schema applies simply to all of them without needing to munge around in the partitioning columns).
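
For reference, here is a hedged sketch of generating a _metadata sidecar after the fact for a directory of Parquet files that all share one schema, using pyarrow.parquet.read_metadata() and write_metadata() (paths are placeholders; whether this plays nicely with the year/state partitioning is exactly the open question here):

import pyarrow.parquet as pq
from pathlib import Path

data_dir = Path("../data/hourly_emissions_epacems")  # placeholder path
parquet_files = sorted(data_dir.glob("*.parquet"))
schema = pq.read_schema(parquet_files[0])

collected = []
for path in parquet_files:
    md = pq.read_metadata(path)
    md.set_file_path(path.name)  # record each file's path relative to the dataset root
    collected.append(md)

pq.write_metadata(schema, str(data_dir / "_metadata"), metadata_collector=collected)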

So far as I can tell, writing pandas_metadata into the parquet files (see #7) also requires using df.to_parquet() rather than a ParquetWriter directly or other methods for writing the dataframes out to parquet files, which is frustrating.

  • Can I write the data out using df.to_parquet() using the same schema for all of them, and then generate the metadata sidecar file after the fact?
  • Is there any way to append to an existing Parquet file using df.to_parquet()?
  • What defines whether a collection of Parquet files is considered a "dataset"? If they all use the same schema, how is a directory full of Parquet files different from a Parquet dataset?


Allow local file caching to be disabled when appropriate

Local file caching via simplecache:: is hugely valuable when you have a lot of cheap disk and a slower net connection (WFH), but it's not necessarily appropriate in a cloud computing context (e.g. our JupyterHub or CI/CD) where the network is extremely fast, there are no data egress fees, and fast disk is more likely to be constrained.

If we are going to use our Intake data catalog as a primary means of accessing versioned, processed data, the user should be able to turn off caching when appropriate. Is this as easy as not setting PUDL_INTAKE_CACHE so there's no designated location for the cache? Or can it / should it be set explicitly in the arguments to the data source?

Set up requester pays

After having a few $20 days, we've decided to limit our exposure to unexpected data egress fees by turning on Requester Pays for the storage buckets containing the pudl-catalog. (A sketch of what a requester-pays read might look like appears after the task list below.)

  • Enable requester pays on gs://intake.catalyst.coop
  • Make the tests supply a billing project so they can access the storage buckets.
  • Check that we're only downloading a minimal amount of data (a couple of state-years) in the tests.
  • Update the example notebook to use requester_pays and new dd.read_parquet() args.
  • Provide documentation / links for setting up a user billing project in the README.
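
A hedged sketch of what such a requester-pays read might look like with pandas + gcsfs (the billing project name is a placeholder, and the requester_pays/project storage options are assumptions about gcsfs's interface rather than settled catalog usage):

import pandas as pd
from pudl_catalog.helpers import year_state_filter

epacems_2020_id = pd.read_parquet(
    "gs://intake.catalyst.coop/dev/hourly_emissions_epacems.parquet",
    filters=year_state_filter(years=[2020], states=["ID"]),
    storage_options={"requester_pays": True, "project": "your-billing-project"},
)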

Set up CI for the PUDL data catalog repo

We have a nominally working PUDL Intake Catalog! And now it needs to be tested.

  • Flesh out the README
    • link to Intake examples, talks
    • link to CarbonPlan repo for additional examples
    • link to Pangeo Forge as another example
    • Link to any Intake issues in the main PUDL repository
    • Outline the flow of data from PUDL to the data catalog through CI/CD processes.
    • Add CI status, CodeCov, PyPi, conda-forge, python version compatibility, and black badges to README
    • Add Catalyst contact info from template repo
    • Convert README to RST
    • Add code of conduct from template repo
  • Set up Tox to run linters, pytest, and generate test coverage
  • Create a GitHub action to run Tox automatically
  • Set up dependabot to update dependencies
  • Define a Python (conda?) environment for the data catalog repo
  • Test that pudl_catalog.yaml is valid
  • Create a dev branch
  • Make sure that pre-commit.ci is set up and working on the repository
  • Set up tests to work without storing data in the repository
  • Test that a small amount of data (e.g. 2020 Idaho):
    • can be read from monolithic & partitioned versions of the remote data.
    • is available from gcs:// and https:// URLs
    • is identical whether we get it from Intake or via pd.read_parquet()
  • Set up CodeCov to receive coverage reports and track test coverage over time

Error with Pandas >= 2.0

I installed catalystcoop.pudl-catalog via mamba as part of a larger environment.yml file and had 2 issues:

  1. sqlalchemy 2.0.15 was installed. I see that setup.py has been modified to limit it below 2.0.
  2. Pandas 2.0.2 was installed. This causes an issue with (at least) the pudl.generators_eia860 table.

To reproduce:

import intake

pudl_cat = intake.cat.pudl_cat

gens = pudl_cat.pudl.generators_eia860.read()

The error is:

---------------------------------------------------------------------------
IntCastingNaNError                        Traceback (most recent call last)
Cell In[6], line 1
----> 1 gens = pudl_cat.pudl.generators_eia860.read()
      2 gens.head()

File /opt/miniconda3/envs/powergenome_catalog/lib/python3.10/site-packages/intake_sql/intake_sql.py:173, in SQLSourceAutoPartition.read(self)
    171 def read(self):
    172     self._get_schema()
--> 173     return self._dataframe.compute()

File /opt/miniconda3/envs/powergenome_catalog/lib/python3.10/site-packages/dask/base.py:310, in DaskMethodsMixin.compute(self, **kwargs)
    286 def compute(self, **kwargs):
    287     """Compute this dask collection
    288 
    289     This turns a lazy Dask collection into its in-memory equivalent.
   (...)
    308     dask.compute
    309     """
--> 310     (result,) = compute(self, traverse=False, **kwargs)
    311     return result

File /opt/miniconda3/envs/powergenome_catalog/lib/python3.10/site-packages/dask/base.py:595, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
    592     keys.append(x.__dask_keys__())
    593     postcomputes.append(x.__dask_postcompute__())
--> 595 results = schedule(dsk, keys, **kwargs)
    596 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])

File /opt/miniconda3/envs/powergenome_catalog/lib/python3.10/site-packages/dask/threaded.py:89, in get(dsk, keys, cache, num_workers, pool, **kwargs)
     86     elif isinstance(pool, multiprocessing.pool.Pool):
     87         pool = MultiprocessingPoolExecutor(pool)
---> 89 results = get_async(
     90     pool.submit,
     91     pool._max_workers,
     92     dsk,
     93     keys,
     94     cache=cache,
     95     get_id=_thread_get_id,
     96     pack_exception=pack_exception,
     97     **kwargs,
     98 )
    100 # Cleanup pools associated to dead threads
    101 with pools_lock:

File /opt/miniconda3/envs/powergenome_catalog/lib/python3.10/site-packages/dask/local.py:511, in get_async(submit, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, chunksize, **kwargs)
    509         _execute_task(task, data)  # Re-execute locally
    510     else:
--> 511         raise_exception(exc, tb)
    512 res, worker_id = loads(res_info)
    513 state["cache"][key] = res

File /opt/miniconda3/envs/powergenome_catalog/lib/python3.10/site-packages/dask/local.py:319, in reraise(exc, tb)
    317 if exc.__traceback__ is not tb:
    318     raise exc.with_traceback(tb)
--> 319 raise exc

File /opt/miniconda3/envs/powergenome_catalog/lib/python3.10/site-packages/dask/local.py:224, in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
    222 try:
    223     task, data = loads(task_info)
--> 224     result = _execute_task(task, data)
    225     id = get_id()
    226     result = dumps((result, id))

File /opt/miniconda3/envs/powergenome_catalog/lib/python3.10/site-packages/dask/core.py:121, in _execute_task(arg, cache, dsk)
    117     func, args = arg[0], arg[1:]
    118     # Note: Don't assign the subtask results to a variable. numpy detects
    119     # temporaries by their reference count and can execute certain
    120     # operations in-place.
--> 121     return func(*(_execute_task(a, cache) for a in args))
    122 elif not ishashable(arg):
    123     return arg

File /opt/miniconda3/envs/powergenome_catalog/lib/python3.10/site-packages/dask/core.py:121, in (.0)
    117     func, args = arg[0], arg[1:]
    118     # Note: Don't assign the subtask results to a variable. numpy detects
    119     # temporaries by their reference count and can execute certain
    120     # operations in-place.
--> 121     return func(*(_execute_task(a, cache) for a in args))
    122 elif not ishashable(arg):
    123     return arg

File /opt/miniconda3/envs/powergenome_catalog/lib/python3.10/site-packages/dask/core.py:121, in _execute_task(arg, cache, dsk)
    117     func, args = arg[0], arg[1:]
    118     # Note: Don't assign the subtask results to a variable. numpy detects
    119     # temporaries by their reference count and can execute certain
    120     # operations in-place.
--> 121     return func(*(_execute_task(a, cache) for a in args))
    122 elif not ishashable(arg):
    123     return arg

File /opt/miniconda3/envs/powergenome_catalog/lib/python3.10/site-packages/dask/utils.py:73, in apply(func, args, kwargs)
     42 """Apply a function given its positional and keyword arguments.
     43 
     44 Equivalent to ``func(*args, **kwargs)``
   (...)
     70 >>> dsk = {'task-name': task}  # adds the task to a low level Dask task graph
     71 """
     72 if kwargs:
---> 73     return func(*args, **kwargs)
     74 else:
     75     return func(*args)

File /opt/miniconda3/envs/powergenome_catalog/lib/python3.10/site-packages/dask/dataframe/io/sql.py:412, in _read_sql_chunk(q, uri, meta, engine_kwargs, **kwargs)
    410     return df
    411 else:
--> 412     return df.astype(meta.dtypes.to_dict(), copy=False)

File /opt/miniconda3/envs/powergenome_catalog/lib/python3.10/site-packages/pandas/core/generic.py:6305, in NDFrame.astype(self, dtype, copy, errors)
   6303 else:
   6304     try:
-> 6305         res_col = col.astype(dtype=cdt, copy=copy, errors=errors)
   6306     except ValueError as ex:
   6307         ex.args = (
   6308             f"{ex}: Error while type casting for column '{col_name}'",
   6309         )

File /opt/miniconda3/envs/powergenome_catalog/lib/python3.10/site-packages/pandas/core/generic.py:6324, in NDFrame.astype(self, dtype, copy, errors)
   6317     results = [
   6318         self.iloc[:, i].astype(dtype, copy=copy)
   6319         for i in range(len(self.columns))
   6320     ]
   6322 else:
   6323     # else, only a single dtype is given
-> 6324     new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   6325     return self._constructor(new_data).__finalize__(self, method="astype")
   6327 # GH 33113: handle empty frame or series

File /opt/miniconda3/envs/powergenome_catalog/lib/python3.10/site-packages/pandas/core/internals/managers.py:451, in BaseBlockManager.astype(self, dtype, copy, errors)
    448 elif using_copy_on_write():
    449     copy = False
--> 451 return self.apply(
    452     "astype",
    453     dtype=dtype,
    454     copy=copy,
    455     errors=errors,
    456     using_cow=using_copy_on_write(),
    457 )

File /opt/miniconda3/envs/powergenome_catalog/lib/python3.10/site-packages/pandas/core/internals/managers.py:352, in BaseBlockManager.apply(self, f, align_keys, **kwargs)
    350         applied = b.apply(f, **kwargs)
    351     else:
--> 352         applied = getattr(b, f)(**kwargs)
    353     result_blocks = extend_blocks(applied, result_blocks)
    355 out = type(self).from_blocks(result_blocks, self.axes)

File /opt/miniconda3/envs/powergenome_catalog/lib/python3.10/site-packages/pandas/core/internals/blocks.py:511, in Block.astype(self, dtype, copy, errors, using_cow)
    491 """
    492 Coerce to the new dtype.
    493 
   (...)
    507 Block
    508 """
    509 values = self.values
--> 511 new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
    513 new_values = maybe_coerce_values(new_values)
    515 refs = None

File /opt/miniconda3/envs/powergenome_catalog/lib/python3.10/site-packages/pandas/core/dtypes/astype.py:242, in astype_array_safe(values, dtype, copy, errors)
    239     dtype = dtype.numpy_dtype
    241 try:
--> 242     new_values = astype_array(values, dtype, copy=copy)
    243 except (ValueError, TypeError):
    244     # e.g. _astype_nansafe can fail on object-dtype of strings
    245     #  trying to convert to float
    246     if errors == "ignore":

File /opt/miniconda3/envs/powergenome_catalog/lib/python3.10/site-packages/pandas/core/dtypes/astype.py:187, in astype_array(values, dtype, copy)
    184     values = values.astype(dtype, copy=copy)
    186 else:
--> 187     values = _astype_nansafe(values, dtype, copy=copy)
    189 # in pandas we don't store numpy str dtypes, so convert to object
    190 if isinstance(dtype, np.dtype) and issubclass(values.dtype.type, str):

File /opt/miniconda3/envs/powergenome_catalog/lib/python3.10/site-packages/pandas/core/dtypes/astype.py:105, in _astype_nansafe(arr, dtype, copy, skipna)
    100     return lib.ensure_string_array(
    101         arr, skipna=skipna, convert_na_value=False
    102     ).reshape(shape)
    104 elif np.issubdtype(arr.dtype, np.floating) and is_integer_dtype(dtype):
--> 105     return _astype_float_to_int_nansafe(arr, dtype, copy)
    107 elif is_object_dtype(arr.dtype):
    108     # if we have a datetime/timedelta array of objects
    109     # then coerce to datetime64[ns] and use DatetimeArray.astype
    111     if is_datetime64_dtype(dtype):

File /opt/miniconda3/envs/powergenome_catalog/lib/python3.10/site-packages/pandas/core/dtypes/astype.py:150, in _astype_float_to_int_nansafe(values, dtype, copy)
    146 """
    147 astype with a check preventing converting NaN to an meaningless integer value.
    148 """
    149 if not np.isfinite(values).all():
--> 150     raise IntCastingNaNError(
    151         "Cannot convert non-finite values (NA or inf) to integer"
    152     )
    153 if dtype.kind == "u":
    154     # GH#45151
    155     if not (values >= 0).all():

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer: Error while type casting for column 'utility_id_eia'

Add SQLite entries to the catalog

Once intake-sqlite has been released (catalyst-cooperative/pudl#1156), it's time to add our SQLite outputs to the pudl-catalog. Using the SQLiteCatalog / sqlite_cat driver, define basic sub-catalog entries for the following SQLite DB outputs:

  • ferc1
  • pudl
  • censusdp1tract

With these three SQLite DBs alongside the EPA CEMS parquet dataset, we'll have a nominally complete PUDL catalog that lets people install our data like software.

Release catalystcoop.pudl_catalog on PyPI and conda-forge

Once we have a version of the pudl_catalog that is functional for public users, package it up and release for installation with pip and conda so we can get other people using it and receive feedback.

  • Initial v0.1.0 release of catalystcoop.pudl_catalog on PyPI
  • Initial v0.1.0 release of catalystcoop.pudl_catalog on conda-forge

Automatically disable caching of local data catalog sources

Reading parquet files which are stored on the local filesystem through the current PUDL catalog still results in caching. This slows things down dramatically, and quickly uses an enormous amount of disk space. Especially in development when we've got data that we've just generated locally it could be nice to be working with it using the same mechanism as remote data (the data catalog), but not if we end up with a bunch of unnecessary caching happening continuously in the background.

Identify a way to disable caching when we're working with local data. Ideally this would be done automatically without the user having to think about it. Maybe it's as simple as making the simplecache:: prefix to urlpath conditional based on the value of PUDL_INTAKE_PATH using Jinja templating features?

If that's not possible then maybe caching can be turned off with an argument that's passed to the data source by the user.
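
For concreteness, the user-facing knob referred to above already appears elsewhere in this document as passing empty storage_options when instantiating the source; whether that truly bypasses simplecache is the open question:

import intake
from pudl_catalog.helpers import year_state_filter

pudl_cat = intake.cat.pudl_cat
filters = year_state_filter(years=[2020], states=["ID"])

# Overriding storage_options at instantiation time removes the simplecache
# configuration from the source's arguments; it is not yet clear whether that
# fully disables local caching.
src = pudl_cat.hourly_emissions_epacems(storage_options={}, filters=filters)
df = src.to_dask().compute()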

Make anonymous public data access work

Right now none of the truly public data access methods seems to be functional, which defeats most of the purpose of publishing the data catalog. Several issues have come up:

  • Reading from https://storage.googleapis.com is catastrophically slow and tries to read the entire monolithic parquet file into memory rather than filtering it first, for both read_parquet() and intake
  • Reading partitioned data from https://storage.googleapis.com with read_parquet() doesn't work easily, since you need to actually list all of the files specifically. No wildcards or directories allowed.
  • Reading partitioned data from https://storage.googleapis.com with intake using a *.parquet wildcard fails with a 403 Forbidden error because the public user doesn't have permission to list all of the objects in the bucket.
  • Unauthenticated users cannot access data via gcs://catalyst.coop/intake/test even though everything in the (pseudo) directory is publicly readable. Do we need to create a separate gcs://intake.catalyst.coop bucket that is entirely public, rather than using object-level ACLs? Or is it simply not possible to provide public access over gcs://?
  • intake_parquet complains when you give it a directory (filled with parquet files) as its urlpath even though pandas and dask are happy to read from a directory. E.g. {{ env(INTAKE_PATH) }}/epacems/ does not work. The error says that all paths have to end with .parq or .parquet.

Ideally we would be able to provide public access both via gcs:// (which seems to provide much more "filesystem" like access) and over https:// (which has much better support in generic download tools for the less cloud-literate).

Need to understand the intended patterns of usage with public cloud accessible data, and how to make the public resource as functional / convenient as it can be.

May also need to understand better how to limit the risk of a bajillion downloads costing us on data egress fees, which might mean going requester-pays.
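
A small sketch for probing what anonymous access currently allows (token="anon" is gcsfs's way of skipping credentials entirely; listing may still fail with 403 if the public user lacks bucket-list permission, as described above):

import gcsfs

fs = gcsfs.GCSFileSystem(token="anon")
print(fs.ls("intake.catalyst.coop"))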
