
nwf-process-geodata's Introduction

National Wildlife Federation

Accessing, processing, documenting wildlife and environmental datasets for the National Wildlife Federation

Project Description

The National Wildlife Federation (NWF) is a large non-profit dedicated to conservation and wildlife advocacy. NWF is working with another organization to create an interactive mapping tool that shows the intersection of potential carbon management projects with wildlife and environmental considerations in the state of Wyoming. Data Clinic is tasked with finding and processing these wildlife and environmental datasets and providing them to NWF. NWF has given us a spreadsheet outlining the desired datasets, which we have augmented with additional metadata.

The data pipeline will contain a few distinct steps, each depending on the previous. Roughly, these steps are as follows (illustrative sketches of each step follow the list):

  1. Access data and upload to S3
    • The metadata spreadsheet contains links to APIs and hosted files matching the requested datasets. The code in download.py should iterate through the datasets with links and download each locally before uploading to the nwf-dataclinic S3 bucket.
  2. Simple data processing
    • The raw data on S3 will have different file formats, projections, and extents. We want to provide NWF with data that has been minimally processed to ensure compatibility. The code in process.py should traverse the raw datasets, apply these basic processing steps, and save the results to S3.
  3. Documenting processed data
    • The final step is to create simple documentation for each dataset: a PDF file generated for each processed dataset. These documents will contain information from the metadata (dataset description, license, years covered, etc.) as well as some additional information derived from the data itself, such as column names/types and the number of rows. The code in document.py will iterate through the processed datasets and create the documentation for each.
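
The download step might look something like the following minimal sketch; fetch_and_upload, its parameters, and the use of requests are illustrative assumptions, not the actual contents of download.py:

```python
import boto3
import requests

def fetch_and_upload(url: str, local_path: str, bucket: str, key: str) -> None:
    """Download one linked dataset locally, then upload it to S3."""
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()  # fail loudly on broken links
    with open(local_path, "wb") as f:
        f.write(resp.content)
    boto3.client("s3").upload_file(local_path, bucket, key)
```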
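
For the processing step, a plausible per-dataset function, assuming geopandas and EPSG:4326 as the shared projection (the real process.py may differ on both counts):

```python
import geopandas as gpd

def process_dataset(raw_path: str, out_path: str, target_crs: str = "EPSG:4326") -> None:
    """Read a raw vector dataset, normalize its projection, and save it as GeoJSON."""
    gdf = gpd.read_file(raw_path)  # handles shapefiles, GeoJSON, etc.
    gdf = gdf.to_crs(target_crs)   # reproject to the shared CRS
    gdf.to_file(out_path, driver="GeoJSON")
```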
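
And for the documentation step, the data-derived fields (column names/types, row count) could be collected as below; PDF rendering is left out, and summarize_dataset is a hypothetical name:

```python
import geopandas as gpd

def summarize_dataset(path: str) -> dict:
    """Gather the data-derived fields that feed a dataset's PDF documentation."""
    gdf = gpd.read_file(path)
    return {
        "columns": {col: str(dtype) for col, dtype in gdf.dtypes.items()},
        "num_rows": len(gdf),
    }
```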

These steps are composed in run.py, which also exports the full contents of the repository to a specified local folder. You can run the full pipeline by executing poetry run python3 src/run.py --s3bucket <your_bucket> (or by running the script from the environment of your choice). Flags exist to export the data locally, skip the S3 upload, or overwrite data that has already been processed. To see the full list, run poetry run python3 src/run.py --help.
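
As a rough sketch of how run.py might wire these steps together (only the --s3bucket flag is confirmed above; the other flag names are assumptions):

```python
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(description="Run the full NWF geodata pipeline.")
    parser.add_argument("--s3bucket", required=True, help="target S3 bucket")
    parser.add_argument("--export-dir", help="also export results to this local folder")
    parser.add_argument("--skip-upload", action="store_true", help="skip the S3 upload")
    parser.add_argument("--overwrite", action="store_true", help="redo already-processed data")
    args = parser.parse_args()
    # download -> process -> document, each step consuming the previous step's output

if __name__ == "__main__":
    main()
```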

Poetry Environment Set-up

This project uses Poetry to provide an easy way to manage dependencies. You can set it up by following these steps:

  1. Ensure you have Python 3.9 or higher installed on your machine
  2. Install poetry following the instructions here
  3. From the root project directory, install the dependencies with poetry install
  4. Ensure the environment has been installed by running poetry shell. You should see something like (nwf-process-geodata-py3.9) in your terminal.

Git stuff

We encourage people to follow the git feature-branch workflow, which you can read more about here: How to use git as a Data Scientist

For each feature you are adding to the code:

  1. Switch to the main branch and pull the most recent changes
git checkout main
git pull
  2. Make a new branch for your addition
git checkout -b cleaning_script
  3. Write your awesome code.
  4. Once it's done, add it to git
git status
git add {files that have changed}
git commit -m {some descriptive commit message}
  5. Push the branch to GitHub
git push -u origin cleaning_script
  6. Go to GitHub and create a pull request.
  7. Either merge the branch yourself if you're confident it's good, or request that someone else reviews the changes and merges it in.
  8. Repeat
  9. ...
  10. Profit.

Project based on the cookiecutter data science project template. #cookiecutterdatascience


nwf-process-geodata's Issues

Path changes

Currently the download paths look like data/raw/New Category/Subtype/Individual.xxx (e.g. data/raw/crucial_critical/wgfd_aquatic/aquatic_crucial_habitat_priorities.geojson). This structure is preserved for the processed data and associated documentation. I think it would be better if the data files and documentation each had their own folder, so you'd have aquatic_crucial_habitat_priorities/aquatic_crucial_habitat_priorities.geojson and aquatic_crucial_habitat_priorities/aquatic_crucial_habitat_priorities.pdf. This should be just a small change to the path creation function in download.py but will require deleting everything on the bucket first (which I wanted to do anyway).
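
A hypothetical sketch of the proposed layout (the function name and signature are illustrative, not the actual path creation function in download.py):

```python
from pathlib import Path

def make_output_paths(root: Path, dataset_name: str, ext: str) -> tuple[Path, Path]:
    """Return (data_path, doc_path), giving each dataset its own folder."""
    folder = root / dataset_name
    return folder / f"{dataset_name}{ext}", folder / f"{dataset_name}.pdf"

# e.g. make_output_paths(Path("data/processed"), "aquatic_crucial_habitat_priorities", ".geojson")
```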

Function to delete all files on the bucket

It's going to be useful to be able to clear out the bucket and rerun everything from scratch. We should have a function like `delete_bucket_contents` that does this. It can go in common.py or in its own script.
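
A minimal sketch of such a function, assuming boto3 (deletion is irreversible, so the real version may want a confirmation prompt):

```python
import boto3

def delete_bucket_contents(bucket_name: str) -> None:
    """Remove every object in the bucket (the bucket itself is kept)."""
    bucket = boto3.resource("s3").Bucket(bucket_name)
    bucket.objects.all().delete()  # batched delete of all keys
```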
