
NEMOSIS: NEM Open-source information service. A Python package for downloading historical data published by the Australian Energy Market Operator (AEMO)

License: Other


nemosis's Introduction

NEMOSIS

A Python package for downloading historical data published by the Australian Energy Market Operator (AEMO)


Download Windows Application (GUI)

Choose the exe from the latest release


Documentation

Support NEMOSIS

Cite our paper in your publications that use data from NEMOSIS.

Get Updates, Ask Questions

Join the NEMOSIS forum group.


Using the Python Interface (API)

Installing NEMOSIS

pip install nemosis

Data from dynamic tables

Dynamic tables contain a datetime column that allows NEMOSIS to filter their content by a start and end time.

To learn more about each dynamic table visit the wiki.

You can view the dynamic tables available by printing the NEMOSIS default settings.

from nemosis import defaults

print(defaults.dynamic_tables)

#['DISPATCHLOAD', 'DUDETAILSUMMARY', 'DUDETAIL', 'DISPATCHCONSTRAINT', 'GENCONDATA', 'DISPATCH_UNIT_SCADA', 'DISPATCHPRICE', . . .

Workflows

Your workflow may determine how you use NEMOSIS. Because the GUI relies on data being stored as strings (rather than numeric types such as integers or floats), we suggest the following:

  • If you are using NEMOSIS' API in your code, or using the same cache for the GUI and API, use dynamic_data_compiler. This will allow your data to be handled by both the GUI and the API. Data read in via the API will be typed, i.e. datetime columns will be a datetime type, numeric columns will be integer/float, etc. See this section.
  • If you are using NEMOSIS to cache data in feather or parquet format for use with another application, use cache_compiler. This will ensure that cached feather/parquet files are appropriately typed to make further external processing easier. It will also cache faster as it doesn't prepare a DataFrame for further analysis. See this section.
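The practical difference between the two workflows comes down to column types. Below is a plain-pandas sketch (with made-up data, not actual NEMOSIS output) of converting string-typed, GUI-style data into the typed form the API returns:

```python
import pandas as pd

# A hypothetical string-typed frame, as the GUI would store it
raw = pd.DataFrame({
    'SETTLEMENTDATE': ['2017/01/01 00:05:00', '2017/01/01 00:10:00'],
    'RRP': ['60.5', '72.1'],
})

# String columns must be converted before analysis; typed data skips this step
typed = raw.copy()
typed['SETTLEMENTDATE'] = pd.to_datetime(typed['SETTLEMENTDATE'],
                                         format='%Y/%m/%d %H:%M:%S')
typed['RRP'] = pd.to_numeric(typed['RRP'])

print(typed.dtypes)
```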
Dynamic data compiler

dynamic_data_compiler can be used to download and compile data from dynamic tables.

from nemosis import dynamic_data_compiler

start_time = '2017/01/01 00:00:00'
end_time = '2017/01/01 00:05:00'
table = 'DISPATCHPRICE'
raw_data_cache = 'C:/Users/your_data_storage'

price_data = dynamic_data_compiler(start_time, end_time, table, raw_data_cache)

Using the default settings of dynamic_data_compiler will download CSV data from AEMO's NEMWeb portal and save it to the raw_data_cache directory. It will also create a feather file version of each CSV (feather files have a faster read time). Subsequent dynamic_data_compiler calls will check whether any data in raw_data_cache matches the query and load it from there, so repeat queries are faster as long as the cached data is available.

A number of options are available to configure filtering (i.e. what data NEMOSIS returns as a pandas DataFrame) and caching.

Filter options

dynamic_data_compiler can be used to filter data before returning results.

To return only a subset of a particular table's columns, use the select_columns argument.

from nemosis import dynamic_data_compiler

price_data = dynamic_data_compiler(start_time, end_time, table, raw_data_cache,
                                   select_columns=['REGIONID', 'SETTLEMENTDATE', 'RRP'])

To see what columns a table has, you can inspect NEMOSIS' defaults.

from nemosis import defaults

print(defaults.table_columns['DISPATCHPRICE'])
# ['SETTLEMENTDATE', 'REGIONID', 'INTERVENTION', 'RRP', 'RAISE6SECRRP', 'RAISE60SECRRP', 'RAISE5MINRRP', . . .

Columns can also be filtered by value. To do this, provide the columns to filter on (filter_cols) and, for each of those columns, the value or values to keep (filter_values). A column must be included in filter_cols before it can be filtered by value.

In the example below, the table will be filtered to only return rows where REGIONID == 'SA1'.

from nemosis import dynamic_data_compiler

price_data = dynamic_data_compiler(start_time, end_time, table, raw_data_cache,
                                   filter_cols=['REGIONID'], filter_values=(['SA1'],))
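Conceptually, this filtering behaves like a pandas boolean mask over the cached data. The sketch below uses made-up data and plain pandas, not a NEMOSIS call:

```python
import pandas as pd

df = pd.DataFrame({
    'REGIONID': ['SA1', 'NSW1', 'SA1', 'VIC1'],
    'RRP': [60.5, 55.0, 72.1, 58.3],
})

filter_cols = ['REGIONID']
filter_values = (['SA1'],)

# Keep only rows where each filter column matches one of its allowed values
mask = pd.Series(True, index=df.index)
for col, values in zip(filter_cols, filter_values):
    mask &= df[col].isin(values)

filtered = df[mask]
print(filtered)
```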

Several filters can be applied simultaneously. A common filter is to extract pricing data excluding any physical intervention dispatch runs (INTERVENTION == 0 is the appropriate filter, see here). Below is an example of filtering to get data for Gladstone Unit 1 and Hornsdale Wind Farm 2 excluding any physical dispatch runs:

from nemosis import dynamic_data_compiler

unit_dispatch_data = dynamic_data_compiler(start_time, end_time, 'DISPATCHLOAD', raw_data_cache,
                                           filter_cols=['DUID', 'INTERVENTION'],
                                           filter_values=(['GSTONE1', 'HDWF2'], [0]))
Caching options

By default the options fformat='feather' and keep_csv=True are used.

If the option fformat='csv' is used then no feather files will be created, and all caching will be done using CSVs.

price_data = dynamic_data_compiler(start_time, end_time, table, raw_data_cache, fformat='csv')

If you supply fformat='feather', the original AEMO CSVs will still be cached by default. To save disk space but still ensure your data will work with the API & GUI, use keep_csv=False in combination with fformat='feather' (which is the default option). This will delete the AEMO CSVs after the feather file is created.

price_data = dynamic_data_compiler(start_time, end_time, table, raw_data_cache, keep_csv=False)

If the option fformat='parquet' is provided then no feather files will be created, and a parquet file will be used instead. While feather might have faster read/write times, parquet has excellent compression characteristics and good compatibility with packages for handling large in-memory or on-cluster datasets (e.g. Dask). This helps with local storage (especially for Causer Pays data) and file size for version control.

Cache compiler

This may be useful if you're using NEMOSIS to build a data cache, but then process the cache using other packages or applications. It is particularly useful because cache_compiler will infer the data types of the columns before saving to parquet or feather, thereby eliminating the need to type convert data that is obtained using dynamic_data_compiler.

cache_compiler can be used to compile a cache of parquet or feather files. Parquet files will likely be smaller, but feather can be read faster. cache_compiler will not run if it detects the appropriate files in the raw_data_cache directory. Otherwise, it will download CSVs, convert them to the requested format and then delete the CSVs. Unlike dynamic_data_compiler, it does not return any data.

The example below downloads parquet data into the cache.

from nemosis import cache_compiler

cache_compiler(start_time, end_time, table, raw_data_cache, fformat='parquet')
Accessing additional table columns

By default NEMOSIS only includes a subset of an AEMO table's columns. The full set of columns is listed in the MMS Data Model Reports, or can be seen by inspecting the CSVs in the raw data cache. Users of the Python interface can add additional columns as shown below. If you are using a feather or parquet based cache, set the rebuild option to True so the additional columns are added to the cache files when they are rebuilt. This method of adding columns should also work with the cache_compiler function.

from nemosis import defaults, dynamic_data_compiler

defaults.table_columns['BIDPEROFFER_D'] += ['PASAAVAILABILITY']

start_time = '2017/01/01 00:00:00'
end_time = '2017/01/01 00:05:00'
table = 'BIDPEROFFER_D'
raw_data_cache = 'C:/Users/your_data_storage'

volume_bid_data = dynamic_data_compiler(start_time, end_time, table, raw_data_cache, rebuild=True)

Data from static tables

Static tables do not include a time column and cannot be filtered by start and end time.

To learn more about each static table visit the wiki.

You can view the static tables available by printing the tables in NEMOSIS' defaults:

from nemosis import defaults

print(defaults.static_tables)
# ['ELEMENTS_FCAS_4_SECOND', 'VARIABLES_FCAS_4_SECOND', 'Generators and Scheduled Loads', 'FCAS Providers']

static_table

The static_table function can be used to access these tables.

from nemosis import static_table

fcas_variables = static_table('VARIABLES_FCAS_4_SECOND', raw_data_cache)

Disable logging

NEMOSIS uses the Python logging module to print messages to the console. If desired, this can be disabled after imports, as shown below. This suppresses all log messages below the WARNING level.

import logging

from nemosis import dynamic_data_compiler

logging.getLogger("nemosis").setLevel(logging.WARNING)

nemosis's People

Contributors

finn2019 avatar jurasofish avatar mdavis-xyz avatar nick-gorman avatar prakaa avatar rccolle avatar


nemosis's Issues

No request or HTTP error handling

When requesting FCAS variables, the error code returned was related to table handling (i.e. code handling the response after the request). The URL was incorrect and the table handling code was not the issue.

For debugging purposes (especially if AEMO's website arrangement changes), the code should handle request or HTTP errors. This could be implemented in the download functions within downloader.py.
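As a sketch of the suggestion, a download helper could wrap the request and raise a clearer error. The function name and structure below are hypothetical, not NEMOSIS code; the `getter` parameter is injectable purely so the behaviour can be exercised without a network:

```python
import requests

def fetch_with_error_handling(url, getter=requests.get):
    """Download a URL, surfacing request and HTTP failures clearly."""
    try:
        response = getter(url, timeout=30)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.exceptions.RequestException as e:
        raise RuntimeError(
            f"Could not download {url}. "
            f"AEMO's website layout may have changed. Original error: {e}"
        ) from e
    return response.content
```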

Make missing values consistent in Generators and Scheduled Loads

In the 'Generators and Scheduled Loads' table, columns such as Fuel Source - Primary have '', '-' and nan values. Do they mean something different?

Note that these different values appear to come from the raw excel file:


If these values mean the same thing, I think this library should coerce them into the same nan value. (Even if it's AEMO's 'fault')
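In the meantime, users can coerce the placeholder values themselves with pandas. The column name is taken from the issue; the data below is illustrative:

```python
import numpy as np
import pandas as pd

gen = pd.DataFrame({'Fuel Source - Primary': ['Solar', '', '-', np.nan, 'Wind']})

# Treat '', '-' and NaN as the same missing value
gen['Fuel Source - Primary'] = gen['Fuel Source - Primary'].replace(['', '-'], np.nan)

print(gen['Fuel Source - Primary'].isna().sum())
```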

Missing required dependency: xlrd

Hi Nick,

Running nemosis in a clean environment with pandas installed.

When I try to query the Generators and Scheduled Loads file, the read_excel function errors out due to xlrd not being installed. While setup.py contains pandas (see below), it seems as though xlrd is an optional install with pandas. Should xlrd be added to setup.py?

install_requires=['requests', 'joblib', 'pyarrow', 'feather-format', 'pandas']

Abi

Can't download PREDISPATCHPRICE and P5MIN_REGIONSOLUTION

I'm unable to download data from either of these tables: PREDISPATCHPRICE or P5MIN_REGIONSOLUTION

For example, this code

from nemosis import defaults
print(defaults.table_columns['PREDISPATCHPRICE'])

gives me a key error. It seems like these both should be dynamic tables, like, for example, DISPATCHPRICE.

But is that not the case?

Add ROOFTOP_PV_ACTUAL table, or all tables?

Hi,

This looks like a great library. How can I use it to download ROOFTOP_PV_ACTUAL? It seems like only some tables are supported. But the table schemas don't appear to be hard-coded within this library. What does it take to add another table, as a developer?

Option 1: Please just add ROOFTOP_PV_ACTUAL

Or tell me which files to modify, and I'll write a PR. Then as a user I can load it.

Option 2: add a way for end-users to load obscure tables

Maybe a function where the user says what the table name is, whether it's static, which columns to deduplicate by, etc.

Option 3: add all tables to the library

This would certainly be best for the user experience. Although it is hard to do.

I have previously web-scraped the schemas from the data model report. So I could give you a big json of all the column types. (And the script that produced them.) That might help.

Some tricky things to note for this approach:

  • there is actually a table that's in the dataset, but not the documentation. (A CO2 one)
  • there are two columns for plant capacity that are documented as whole numbers, but are in fact decimals
  • the market notices sometimes contain newlines, which breaks some CSV parsing

Enhance logging + user feedback

  • Move print user feedback to logging module
    • severity of logging - e.g. warning, error
    • can be extended to debugging for development
  • More precise error logging
    • currently limited by which variables are available within each private function
  • More logging for other user input issues (e.g. invalid table names)

Causer Pays Elements fetching needs to be fixed

Causer Pays elements mapping file was previously a static url link to a table updated on ad-hoc basis.

Now, "versioned" through NEMWeb.

Need to implement fetching similar to DUDETAILSUMMARY (search_type = end) to fetch the latest .csv when user requests it.

Static table download error

I'm trying to download the static Generator table using the name Generators and Scheduled Loads with the code:

gen_table = "Generators and Scheduled Loads"
gen = data_fetch_methods.static_table(start_time, end_time, 
                                gen_table, raw_data_cache)

but it fails with:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte

It works to fetch the pricing data:

# download pricing data
start_time = '2017/01/01 00:00:00' # start time, exclusive
end_time = '2017/01/31 23:59:00' # end time, inclusive
table = 'DISPATCHPRICE' # the table the data is requested from
raw_data_cache = 'data/raw_data_cache' # raw data is cached

df = data_fetch_methods.dynamic_data_compiler(start_time, end_time, table, raw_data_cache)

So not sure what I'm doing wrong for the static table. I'm using python 3 on Google colab.

Is full data range necessary for tests?

Three tests in test_processing_info_maps.py draw data from the start of the data model (see code below) to 2018. In some cases this is 9 years' worth of data and quite large. This slows testing down. Is this range necessary for these tests?

Based on tests run for #15, testing time including downloads is ~20 minutes on a high-end computer.

start_test_window = defaults.nem_data_model_start_time

where

nem_data_model_start_time = '2009/07/01 00:00:00'

Bid data unavailable after March 2021

Bidding data in the tables BIDPEROFFER_D and BIDDAYOFFER_D isn't available after March 2021, because AEMO has stopped uploading these tables.

These tables can probably be recreated from the tables BIDPEROFFER and BIDDAYOFFER, which contain all bids submitted by participants (not just those used in dispatch), by applying the appropriate filtering.

This fix is currently being worked on.
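The filtering idea can be sketched in pandas: keep only the most recently submitted offer per unit, approximating the bids that applied in dispatch. Column names and data below are illustrative, not the real table schema:

```python
import pandas as pd

# Hypothetical BIDDAYOFFER-style data: every submitted bid, not just those used in dispatch
bids = pd.DataFrame({
    'DUID': ['GSTONE1', 'GSTONE1', 'HDWF2'],
    'OFFERDATE': pd.to_datetime(['2021-04-01 09:00', '2021-04-01 11:30',
                                 '2021-04-01 10:00']),
    'PRICEBAND1': [10.0, 12.0, 0.0],
})

# Keep the most recently submitted bid per unit
latest = bids.sort_values('OFFERDATE').groupby('DUID').tail(1)
print(latest)
```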

Pandas issue for read-only workbook

I tried running the example code for generator bidding data (https://github.com/UNSW-CEEM/NEMOSIS/blob/master/examples/generator_bidding_data.ipynb)

On running cell 4 (downloading the static table) I get a large error which I believe traces back to pandas' use of openpyxl and its handling of read-only workbooks (which I assume the generator and load static table is).

If I change line 569 in pandas' _openpyxl.py file (venv/lib/pandas/io/excel/_openpyxl.py) to read_only=False, the error disappears and I can download the data.

I assume NEMOSIS should pass an option to pandas to change the default for this setting?

table `TRADINGPRICE` does not have an `INTERVENTION` column

When retrieving data for TRADINGPRICE, no data is returned for the INTERVENTION column, and thus the cached CSV file is saved without it.

However, a subsequent retrieval from feather causes the following error:
ArrowInvalid: Field named INTERVENTION is not found
because the schema as defined in defaults.py includes this column.

See PR #12 for a fix. I have tested locally and there are no other column inconsistencies for that table that I can see.

STPASA tables

Is it possible to download the STPASA tables? I am specifically looking for STPASA_REGIONSOLUTION.

AEMO must produce load forecasts for each region for the following timeframes:

  • Each day for the day ahead – pre-dispatch forecast
  • Each day for the period two to seven days ahead – short term projected assessment of system adequacy (STPASA) forecast

In this AEMO data model report (PDF), it's STPASA_REGIONSOLUTION which is updated each STPASA run (i.e. every 2 hours).

STPASA_REGIONSOLUTION shows the results of the regional capacity,
maximum surplus reserve and maximum spare capacity evaluations for
each period of the study

Migrate user warnings to `logging`, instead of printing

I am currently writing examples for NEMSEER and using NEMOSIS' compilers.

In a notebook example, I should be able to disable user info/warning messages. This can be achieved with logging, but right now all info/error/warning messages are printed.

Can we change printing to logging? It looks like a fairly straightforward change.
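The pattern such a change would follow can be sketched as below (hypothetical module code, not NEMOSIS' actual internals): create a named logger once and replace print calls with logger methods, so users can filter messages by level.

```python
import logging

logger = logging.getLogger("nemosis")

def compile_table(table):
    # Was: print(f"Compiling data for table {table}.")
    logger.info("Compiling data for table %s.", table)

# Downstream users can now silence informational messages:
logging.getLogger("nemosis").setLevel(logging.WARNING)
compile_table("DISPATCHPRICE")  # emits nothing at WARNING level
```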

NEMOSIS dynamic_data_compiler

Hi

I was using NEMOSIS in conjunction with NEMSEER. When I pull data from the dynamic data compiler in parquet format, it seems like all the data is saved as datetime values.

    nemosis_data = nemosis.dynamic_data_compiler(
        nemosis_start,
        time,
        "TRADINGPRICE",
        nemosis_cache,
        filter_cols=["REGIONID"],
        filter_values=(["SA1"],),
        fformat="parquet",
    )
    
    actual_price = nemosis_data.groupby("SETTLEMENTDATE")["RRP"].sum()[time]

As a result it is unable to undertake the sum() operation, as that is restricted to float/int objects. This seems to be a recent issue, as I ran the same function a few days ago and did not have any issues. A printout of the nemosis_data variable is set out below.


INFO: Query raw data already downloaded to nemseer_cache
INFO: Converting PRICE data to xarray.
Compiling data for table TRADINGPRICE.
Returning TRADINGPRICE.
          SETTLEMENTDATE REGIONID                           RRP  \
2729 2021-01-01 00:30:00      SA1 1970-01-01 00:00:00.000000035   

                      RAISE6SECRRP                 RAISE60SECRRP RAISE5MINRRP  \
2729 1970-01-01 00:00:00.000000001 1970-01-01 00:00:00.000000001   1970-01-01   

                       RAISEREGRRP                  LOWER6SECRRP  \
2729 1970-01-01 00:00:00.000000010 1970-01-01 00:00:00.000000001   

                     LOWER60SECRRP LOWER5MINRRP                   LOWERREGRRP  \
2729 1970-01-01 00:00:00.000000003   1970-01-01 1970-01-01 00:00:00.000000012   

     PRICE_STATUS  
2729         FIRM  
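Until the underlying bug is fixed, the numeric values can be recovered from the misread columns, since the printout above suggests prices were reinterpreted as nanoseconds since the epoch (e.g. an RRP of 35 becomes 1970-01-01 00:00:00.000000035). A sketch of that workaround, assuming the interpretation holds:

```python
import pandas as pd

# Hypothetical reproduction: a price column misread as epoch-nanosecond datetimes
rrp = pd.Series(pd.to_datetime([35, 50], unit='ns'))

# Recover the original numbers from the datetime64[ns] representation
recovered = rrp.astype('int64').astype(float)
print(recovered.tolist())
```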
