
chem150's Introduction

CHEM150 Atmospheric Chemistry ML Modelling

Author: Thomas Cintra

This is my lab GitHub repo :)

Outline

Below we enumerate what is currently in this repository.

  • Defines a Python API to fetch AQS data (```datafetcher.py```).
  • Provides a sample notebook to determine the best site to sample from in Los Angeles County (```lab_notebook.ipynb```).
  • Provides a sample notebook to fetch relevant AQS data (```lab_notebook.ipynb```).
  • Provides a sample notebook to explore relevant CEDS emissions data (```hemco_data_exploration.ipynb```).
  • Provides a Python script to generate a 2018 dataset for LA North Main St (```generate.py```), including AQS and CEDS data.
  • Provides a sample notebook to run a Random Forest model on the above dataset (```lab_notebook_2.ipynb```).
Setup

    Please note that I use a virtual environment to manage the modules used for the models and data processing. To begin using this repo, run the following commands from the root of the repository:

    $ python3 -m venv venv # Creates a virtual environment
    $ source venv/bin/activate # Activates virtual environment
    $ python3 -m pip install --upgrade pip
    $ python3 -m pip install -r requirements.txt # Download requirements
    

    You must also create a .env file and populate it with the email and key for your AQS API account. Your .env file should look like:

    EMAIL="user@example.com"
    KEY="example"
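
A minimal sketch of how a script could read these two values, using only the standard library. The helper name `load_env` is hypothetical and not part of this repo; the repo's own code may well rely on a package such as python-dotenv instead.

```python
from pathlib import Path


def load_env(path=".env"):
    """Parse simple NAME="value" lines from a .env file into a dict."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not NAME=value
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value
        values[name.strip()] = value.strip().strip('"').strip("'")
    return values
```

Calling `load_env()` from the repository root would then return a dict with the `EMAIL` and `KEY` entries shown above.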
    

    Using DataFetcher

    Please refer to lab_notebook.ipynb for examples of how to use the DataFetcher class. It has three primary purposes: (1) finding the site with the most data in a particular county/state; (2) fetching AQS data; and (3) fetching CEDS data. Important functions have detailed docstrings.

    Finding the best site:

    We can't realistically fetch a full year of data for every site and every parameter code across multiple years. To get an idea of what critical data was available over the span of decades, we need to do some sampling. As a first pass we sample 1 random day every 5 years starting in 2000. Then we get the top 5 sites with the most critical data for a given 5-year period; the best of these happens to be Los Angeles-North Main Street no matter what date range we pick.
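
The sampling step can be sketched roughly as follows. This is an illustration and not the code in datafetcher.py: the function names, the `(site, parameter_code)` record shape, and the reading of "1 random day every 5 years" as one random day per 5-year window are all assumptions.

```python
import random
from collections import Counter
from datetime import date, timedelta


def sample_days(start_year=2000, end_year=2020, step=5, seed=0):
    """Pick one random day from each `step`-year window in [start_year, end_year)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    days = []
    for year in range(start_year, end_year, step):
        first = date(year, 1, 1)
        window_length = (date(year + step, 1, 1) - first).days
        days.append(first + timedelta(days=rng.randrange(window_length)))
    return days


def top_sites(records, n=5):
    """Rank sites by how many records they returned on the sampled days.

    `records` is an iterable of (site_name, parameter_code) pairs. This
    shape is a placeholder, not the format returned by the AQS API.
    """
    counts = Counter(site for site, _ in records)
    return [site for site, _ in counts.most_common(n)]
```

With the defaults, `sample_days()` yields four dates, one in each of the windows starting in 2000, 2005, 2010 and 2015.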

    From the top 5 we then rank them based on the availability of PAMS_VOC data. We only need to check the date that resulted in the most available data from the previous query.
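
As a rough sketch of this second ranking pass, assuming the PAMS_VOC record counts for the chosen date have already been collected into a dict (the function name and the dict shape are hypothetical, not taken from this repo):

```python
def rank_by_voc(sites, voc_counts):
    """Order candidate sites by PAMS_VOC record count, most data first.

    `voc_counts` maps site name -> number of PAMS_VOC records found on the
    single date being checked; sites missing from the mapping count as zero.
    """
    return sorted(sites, key=lambda site: voc_counts.get(site, 0), reverse=True)
```

Because Python's sort is stable, sites with equal counts keep the order they had in the earlier top-5 ranking.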

    However, we want to eventually find more relevant sites to train on, so we could replicate the above logic to find the best sites across multiple states, etc. Even better, we could find a clever way to aggregate data across neighbouring sites.

    Building a dataset

    Running the generate.py script will create 3 datasets in the data/clean/Los_Angeles-North_Main_Street/2018 directory (the structure is data/clean/\<site\>/\<year\>). The core dataset contains CRITERIA and MET data, the vocs dataset contains VOCs data, and the emissions dataset contains CEDS data. Use the following command:

    $ python3 generate.py
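
To illustrate the data/clean/\<site\>/\<year\> layout, here is a small sketch of how the output directory could be derived from a site name and year. The helper `dataset_dir` is hypothetical, and the space-to-underscore rule is inferred from the directory name above and not confirmed against generate.py.

```python
from pathlib import Path


def dataset_dir(site, year, root="data/clean"):
    """Return the output directory for a site and year: <root>/<site>/<year>."""
    # Spaces become underscores, matching Los_Angeles-North_Main_Street
    return Path(root) / site.replace(" ", "_") / str(year)


out = dataset_dir("Los Angeles-North Main Street", 2018)
print(out.as_posix())  # data/clean/Los_Angeles-North_Main_Street/2018
```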
    
