buds-lab / building-data-genome-project-2

Whole building non-residential hourly energy meter data from the Great Energy Predictor III competition

License: Other

Jupyter Notebook 100.00%

open-source open-data open-data-science energy-efficiency energy-consumption building-energy building-automation smart-city smart-meter electricity-meter

building-data-genome-project-2's Issues

Create 'raw' and 'cleaned' version of data

It would be good to document in the wiki the differences between the raw, cleaned, and processed folders.

Switch the UID for Eagle and Peacock and redo the code names

I had the UID/code names switched for Eagle and Peacock! Therefore we need to redo the unique ID code names. I think we should take the opportunity to reorder the code name to "AnimalName_SimplifiedSpaceUse_HumanName", so that when people sort by UID, the buildings fall into more logical groups. Also, I think some of the Space Uses can be simplified so there is no "/" in the names.
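A minimal sketch of the proposed naming scheme. The simplification rule (keep only the first use before any "/", lowercase, letters only) and the sample names are assumptions for illustration, not the project's actual mapping.

```python
def make_code_name(animal: str, space_use: str, human: str) -> str:
    """Build an ID of the form AnimalName_SimplifiedSpaceUse_HumanName."""
    # Hypothetical simplification: keep the first use before any "/" and
    # drop every non-letter, so neither "/" nor "_" survives in the middle part.
    primary = space_use.split("/")[0]
    simplified = "".join(ch for ch in primary if ch.isalpha()).lower()
    return f"{animal}_{simplified}_{human}"


# Made-up buildings: sorting the IDs groups them by site, then by space use.
ids = [
    make_code_name("Peacock", "Office", "Ana"),
    make_code_name("Eagle", "Education/Lab", "Bo"),
    make_code_name("Eagle", "Office", "Cy"),
]
sorted_ids = sorted(ids)
```

Putting the site name first is what makes a plain alphabetical sort cluster buildings by site and then by use.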

Why did Swan not get added to the Kaggle competition?

@anjukan -- do you have any idea why the Swan data set was never added to the Kaggle competition data set?

Was the BDG2 data taken from the data after the filtering for the Kaggle competition?

@anjukan put this table in the BDG2 paper:

@ponybiam -- did you use the data before or after this filtering step?

It's OK to have this table, but we need to be clear about what the filtering process pertains to.

Site Moose converted from MJ, not KJ

Just discovered that I messed up one more conversion!

Chilled water and electricity for Moose should be converted from megajoules (MJ), not kilojoules (kJ).
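To show the size of the error, here is a small sketch of both conversions. The function names and the sample reading are illustrative; the factors follow from 1 kWh = 3.6 MJ = 3600 kJ.

```python
KJ_PER_KWH = 3600.0  # 1 kWh = 3600 kJ
MJ_PER_KWH = 3.6     # 1 kWh = 3.6 MJ


def mj_to_kwh(value_mj: float) -> float:
    """Convert a reading in megajoules to kilowatt-hours."""
    return value_mj / MJ_PER_KWH


def kj_to_kwh(value_kj: float) -> float:
    """Convert a reading in kilojoules to kilowatt-hours."""
    return value_kj / KJ_PER_KWH


reading = 360.0                # made-up raw value that is actually in MJ
correct = mj_to_kwh(reading)   # treats the value as MJ
wrong = kj_to_kwh(reading)     # treats the same value as kJ by mistake
```

Treating MJ readings as kJ understates the energy by a factor of 1000.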

Change all water meters from gallons to liters

Change all non-energy meters to liters from gallons
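A minimal sketch of the conversion, assuming the meters report US liquid gallons (an assumption; imperial gallons would need a different factor). The function name is illustrative.

```python
LITERS_PER_US_GALLON = 3.785411784  # exact definition of the US liquid gallon


def gallons_to_liters(values):
    """Convert a list of readings from US gallons to liters.

    Missing readings stored as NaN stay NaN, since NaN times a number is NaN.
    """
    return [v * LITERS_PER_US_GALLON for v in values]


liters = gallons_to_liters([1.0, 100.0, float("nan")])
```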

Add available data start-date in metadata

Should we add a feature with the date on which data starts being available for each building? A lot of them have missing values at the beginning of the period, and a user suggested that this might be useful.
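One way such a column could be computed is sketched below in plain Python. The building names, timestamps, and readings are made up, and missing values are assumed to be stored as NaN or None.

```python
import math


def first_available(timestamps, readings):
    """Return the first timestamp whose reading is not missing, or None."""
    for ts, value in zip(timestamps, readings):
        if value is not None and not math.isnan(value):
            return ts
    return None


# Made-up hourly data for two hypothetical buildings.
timestamps = ["2016-01-01 00:00:00", "2016-01-01 01:00:00", "2016-01-01 02:00:00"]
meters = {
    "Bldg_A": [float("nan"), float("nan"), 12.5],
    "Bldg_B": [3.0, 3.1, 3.2],
}
start_dates = {name: first_available(timestamps, vals) for name, vals in meters.items()}
```

With the meter files loaded as pandas DataFrames, `Series.first_valid_index()` applied per column should give the same result.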

Building IDs?

Create a deanonymized version of the meta file for UC Berkeley.

Calculate the number of meters per site

In the meter data analysis portion, calculate the number of meters from each site for Table 1 of the BDG2 paper. Also calculate the total number of meters in the dataset
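A sketch of the count, assuming each meter file has one column per building and that the site is the part of the building ID before the first underscore. The column names below are invented for illustration.

```python
from collections import Counter


def meters_per_site(columns_by_meter):
    """Count meter columns per site across all meter types."""
    counts = Counter()
    for columns in columns_by_meter.values():
        for column in columns:
            if column == "timestamp":
                continue  # the time column is not a meter
            site = column.split("_", 1)[0]
            counts[site] += 1
    return counts


# Hypothetical headers of two meter files.
columns_by_meter = {
    "electricity": ["timestamp", "Eagle_office_Cy", "Moose_education_Di"],
    "chilledwater": ["timestamp", "Moose_education_Di"],
}
per_site = meters_per_site(columns_by_meter)
total_meters = sum(per_site.values())
```

A building with two meter types counts as two meters here, which matches counting meters rather than buildings.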

Mapping between BDG1 and BDG2

As discussed on Kaggle, it would be very helpful to find out which buildings are present in both BDG1 and in BDG2. For those buildings that are present in both sites, is there also an actual overlap in data or are the timeframes of measurement different?

Wishing you all the best.

Predictive models

Predictive models with cleaned data:

Long-term: 1 year train, 1 year prediction
Short-term: 30 days train, 3 days prediction
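The two setups can be sketched as simple window splits over an hourly series. The helper name, the 365-day year, and the stand-in series are assumptions for illustration only.

```python
HOURS_PER_DAY = 24


def split_window(series, train_hours, test_hours, start=0):
    """Return (train, test) slices taken back to back from an hourly series."""
    train_end = start + train_hours
    train = series[start:train_end]
    test = series[train_end:train_end + test_hours]
    return train, test


# Stand-in for two years of hourly readings (365-day years assumed).
series = list(range(2 * 365 * HOURS_PER_DAY))

# Long-term: 1 year train, 1 year prediction.
long_train, long_test = split_window(series, 365 * HOURS_PER_DAY, 365 * HOURS_PER_DAY)

# Short-term: 30 days train, 3 days prediction.
short_train, short_test = split_window(series, 30 * HOURS_PER_DAY, 3 * HOURS_PER_DAY)
```

Moving `start` forward would give a rolling version of the short-term setup.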

meter reading units

Hello! You mentioned that all meter units were converted to kWh in the cleaned dataset. However, I cannot find where you did this. Can you please confirm this and point me to where you did the conversion? Thank you.

Measurement Units Recheck

@anjukan -- is there a reason that Bear and Hog are left blank -- is it because you didn't know what those units were?

Synchronize the sqm and sqft on the meta file

Convert all sqft to sqm where there are gaps in sqm and vice versa
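A minimal sketch of the gap filling, assuming missing areas are represented as None. The function name is illustrative; the factor is exact because one foot is defined as 0.3048 m.

```python
SQM_PER_SQFT = 0.09290304  # exact: 0.3048 m squared


def fill_area(sqm, sqft):
    """Fill whichever of the two area values is missing from the other."""
    if sqm is None and sqft is not None:
        sqm = sqft * SQM_PER_SQFT
    elif sqft is None and sqm is not None:
        sqft = sqm / SQM_PER_SQFT
    return sqm, sqft


filled_sqm, _ = fill_area(None, 1000.0)   # sqft known, sqm missing
_, filled_sqft = fill_area(100.0, None)   # sqm known, sqft missing
```

Rows where both values are present are left untouched, so any existing disagreement between them would still need a separate check.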

Change all energy meters to kWh

(This issue will be updated and worked on once the units are checked)

Site ID "Wolf" has incorrect coordinates

According to Miller et al. (2020), site ID "Wolf" should correspond to University College Dublin. The coordinates for "Wolf" in the metadata file correspond to a point near Lauwersoog in the Netherlands.

Accordingly, the latitude and longitude coordinates for Wolf should be updated from (53.3498, 6.2603) to a point near (53.3667, 6.2583), which are the Google Maps coordinates for University College Dublin.

I was concerned that the weather data would also be mismatched. Miller et al. (2020) use the weather data from NOAA ISD Station 039690-99999, which I verified is in or near Dublin, so the weather data set does not appear to be affected.
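As a sanity check, the sketch below measures how far the listed point moves if only the sign of the longitude is flipped. It assumes the usual convention that longitudes west of Greenwich are negative; whether a dropped minus sign is the actual cause is a guess that the issue does not confirm.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points in decimal degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


as_listed = (53.3498, 6.2603)      # value quoted in the issue (east longitude)
sign_flipped = (53.3498, -6.2603)  # same point with a west longitude
gap_km = haversine_km(*as_listed, *sign_flipped)
```

The two points are several hundred kilometers apart, so the sign of the longitude alone decides whether the point lands near the Dutch coast or near Dublin.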

Remove the `lat` and `lng` for the anonymous sites

In the metadata.csv file, remove the lat and lng values

Update figures for paper

Now that raw data has changed, figures for the publication must be updated.

Created some streamlit apps for BDG2

Hello friends,

I'm starting to create some Streamlit apps for BDG2 for interactive data exploration and model building. These apps can be used to better understand the dataset, and I plan to use them in my future teaching. My plan is to:

For context, Streamlit is a very easy-to-use tool for creating data apps in Python. Compared to traditional dashboards like Plotly Dash or Superset, you can utilize the full potential of Python, such as creating ML models and performing in-depth analysis. They will also host the apps for free (at least for now).

I noticed there is already a notebook folder in this repository. How would the app files fit in? They are not really markdown files, as they need to be rendered by Streamlit or run locally. Would it be possible to include the app links in the readme.md once they are complete?

Oh, and for those interested, I will write a tutorial on how to create these apps. Also, please feel free to comment on what kind of analysis you would like to see. Cheers.

Add the Kaggle building ID to the metadata file

Add a new column in the metadata file for the mapping between BDG2 IDs and the Kaggle building ID: https://github.com/buds-lab/building-data-genome-project-2/wiki/BDG-Kaggle-mapping

zeros in electricity_cleaned.csv

The 10_Cleaned-dataset.ipynb notebook contains code to convert electricity.csv into electricity_cleaned.csv by replacing the zeros with NaN. But when I checked electricity_cleaned.csv, it still contained the original zeros?
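A small check along these lines can confirm whether a cleaned column still holds zeros. This is a plain-Python sketch with made-up readings, not the notebook's actual code.

```python
def zeros_to_nan(readings):
    """Replace exact zeros with NaN, leaving all other readings unchanged."""
    return [float("nan") if value == 0 else value for value in readings]


def count_zeros(readings):
    """Count exact zeros; NaN never compares equal to 0, so it is not counted."""
    return sum(1 for value in readings if value == 0)


raw = [0.0, 12.5, 0.0, 7.25]   # made-up raw meter column
cleaned = zeros_to_nan(raw)
zeros_before = count_zeros(raw)
zeros_after = count_zeros(cleaned)
```

Running `count_zeros` on each column of the published cleaned file would show directly whether the replacement was applied before the file was written.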

Also, at first glance many sites don't have data from the earliest date in the file. Ideally, the earliest date on which data is available for each site would be included in the metadata?

Also maybe it's because I cloned the repo or didn't have git-lfs installed first, but I had to do a bit of googling to work out how to fetch the csv files. It might pay to add some details on LFS in the readme?

misaligned timestamps?

I loaded the weather.csv and electricity.csv files into Pandas. The timestamp fields seem to be inconsistent.

In weather.csv for site_id="Hog" there are two non-duplicate entries for 2017-11-05, which I assume is due to the end of daylight saving time (DST). There is also a missing entry for 2017-03-12 02:00, at the start of DST.

But electricity.csv contains 24 records per day: there is no duplicate record on 2017-11-05 and no missing record on 2017-03-12.

So it appears weather.csv is in wall-clock time in the local timezone, and electricity.csv is perhaps in standard time?

I noticed in particular that in 11_Models.ipynb you join the data and weather datasets on timestamp, but you haven't added an index on these columns. If you add an index, the join will fail. Probably of more concern, I think the temperature and electricity will be shifted by 1 hour for half the year?
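The check described above can be sketched by counting records per calendar day: a wall-clock series shows 23 or 25 hourly records on DST transition days, while a standard-time series always shows 24. The timestamp format and the sample data are assumptions.

```python
from collections import Counter
from datetime import datetime


def irregular_days(timestamps, expected=24):
    """Return {date: record count} for days whose count differs from expected."""
    per_day = Counter(
        datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").date() for ts in timestamps
    )
    return {day.isoformat(): n for day, n in per_day.items() if n != expected}


# Made-up wall-clock series: 02:00 does not exist on the DST start day.
wall_clock = [f"2017-03-12 {hour:02d}:00:00" for hour in range(24) if hour != 2]
# Made-up standard-time series: all 24 hours are present.
standard = [f"2017-03-12 {hour:02d}:00:00" for hour in range(24)]

weather_gaps = irregular_days(wall_clock)
meter_gaps = irregular_days(standard)
```

If the weather file reports irregular days and the meter file does not, the two timestamp columns follow different conventions, and joining on the raw strings would misalign them for part of the year.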

"Mandatory" meta data?

@anjukan -- You mentioned in your part that "For the metadata of the buildings, it was deemed that `square_feet`, `primary_use`, `year_built` and `floor_count` would be the only mandatory attributes."

Does this mean these were the filters for the Kaggle competition? I'm surprised that we required floor count and year built as mandatory.

@ponybiam -- there are tons of buildings in the BDG2 that don't have these two features, so I'm guessing we didn't make those same filtering steps?

Accurate statement about the issues fixed between Kaggle and BDG2

I have the following passage in the paper -- is it correct?

The data contained in this repository has several differences from that found on the Kaggle competition website. These were issues that were detected in the midst of the competition and were fixed in this updated data set. The first difference is that the BDG2.0 data set only has timestamps that are in the local time zone, including the weather data. The weather data released in the Kaggle competition had a timestamp that was set to UTC and the contestants had to come up with ways to find the right alignment for the weather data in order to use it properly. The other issue fixed was a conversion mistake in which Site X was not properly converted to \emph{kWh$_{sum}$} and was instead left in its raw form. This conversion has been fixed in this data set.

Add 'kaggle' data set

This is the 2017 data used for the public leaderboard in the competition. A copy of this data set will be part of this repository.

Where did the weather data come from for Shrew and Swan?

@anjukan @ponybiam --- this is from the paper -- I guess someone got the weather data for those last two sites?
