
nwf-process-geodata's Introduction

National Wildlife Federation

Accessing, processing, documenting wildlife and environmental datasets for the National Wildlife Federation

Project Description

The National Wildlife Federation (NWF) is a large non-profit dedicated to conservation and wildlife advocacy. NWF is working with another organization to create an interactive mapping tool that shows the intersection of potential carbon management projects with wildlife and environmental considerations in the state of Wyoming. Data Clinic is tasked with finding and processing these wildlife and environmental datasets and providing them to NWF. NWF has given us a spreadsheet outlining the desired datasets, which we have augmented with additional metadata.

The data pipeline will contain a few distinct steps, each depending on the previous. Roughly, these steps are as follows (illustrative sketches of each step follow the list):

  1. Access data and upload to S3
    • The metadata spreadsheet contains links to APIs and hosted files matching the requested datasets. The code in download.py should iterate through the datasets with links and download each locally before uploading to the nwf-dataclinic S3 bucket.
  2. Simple data processing
    • The raw data on S3 will have different file formats, projections, and extents. We want to provide NWF with data that has been minimally processed to ensure compatibility. The code in process.py should traverse the raw datasets, apply these basic processing steps, and save the results to S3.
  3. Documenting processed data
    • The final step is to create simple documentation for each dataset: a PDF file generated for each processed dataset. These documents will contain information from the metadata (dataset description, license, years covered, etc.) as well as some additional information derived from the data itself, such as column names/types and the number of rows. The code in document.py will iterate through the processed datasets and create the documentation for each.
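
The download step might look something like the following minimal sketch; fetch_and_upload, its parameters, and the use of requests are illustrative assumptions, not the actual contents of download.py:

```python
import boto3
import requests

def fetch_and_upload(url: str, local_path: str, bucket: str, key: str) -> None:
    """Download one linked dataset locally, then upload it to S3."""
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()  # fail loudly on broken links
    with open(local_path, "wb") as f:
        f.write(resp.content)
    boto3.client("s3").upload_file(local_path, bucket, key)
```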
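
For the processing step, a plausible per-dataset function, assuming geopandas and EPSG:4326 as the shared projection (the real process.py may differ on both counts):

```python
import geopandas as gpd

def process_dataset(raw_path: str, out_path: str, target_crs: str = "EPSG:4326") -> None:
    """Read a raw vector dataset, normalize its projection, and save it as GeoJSON."""
    gdf = gpd.read_file(raw_path)  # handles shapefiles, GeoJSON, etc.
    gdf = gdf.to_crs(target_crs)   # reproject to the shared CRS
    gdf.to_file(out_path, driver="GeoJSON")
```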
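
And for the documentation step, the data-derived fields (column names/types, row count) could be collected as below; PDF rendering is left out, and summarize_dataset is a hypothetical name:

```python
import geopandas as gpd

def summarize_dataset(path: str) -> dict:
    """Gather the data-derived fields that feed a dataset's PDF documentation."""
    gdf = gpd.read_file(path)
    return {
        "columns": {col: str(dtype) for col, dtype in gdf.dtypes.items()},
        "num_rows": len(gdf),
    }
```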

These steps are composed in run.py, which also exports the full contents of the repository to a specified local folder. You can run the full pipeline by executing poetry run python3 src/run.py --s3bucket <your_bucket> (or by running the script from the environment of your choice). Flags exist to export the data locally, skip the S3 upload, or overwrite data that has already been processed. To see the full list, run poetry run python3 src/run.py --help.
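
As a rough sketch of how run.py might wire these steps together (only the --s3bucket flag is confirmed above; the other flag names are assumptions):

```python
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(description="Run the full NWF geodata pipeline.")
    parser.add_argument("--s3bucket", required=True, help="target S3 bucket")
    parser.add_argument("--export-dir", help="also export results to this local folder")
    parser.add_argument("--skip-upload", action="store_true", help="skip the S3 upload")
    parser.add_argument("--overwrite", action="store_true", help="redo already-processed data")
    args = parser.parse_args()
    # download -> process -> document, each step consuming the previous step's output

if __name__ == "__main__":
    main()
```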

Poetry Environment Set-up

This project uses Poetry to provide an easy way to manage dependencies. You can set it up by following these steps:

  1. Ensure you have Python 3.9 or higher installed on your machine
  2. Install poetry following the instructions here
  3. From the root project directory, install the dependencies with poetry install
  4. Ensure the environment has been installed by running poetry shell. You should see something like (nwf-process-geodata-py3.9) in your terminal.

Git stuff

We encourage people to follow the git feature-branch workflow, which you can read more about here: How to use git as a Data Scientist

For each feature you are adding to the code:

  1. Switch to the main branch and pull the most recent changes
git checkout main
git pull
  2. Make a new branch for your addition
git checkout -b cleaning_script
  3. Write your awesome code.
  4. Once it's done, add it to git
git status
git add {files that have changed}
git commit -m {some descriptive commit message}
  5. Push the branch to GitHub
git push -u origin cleaning_script
  6. Go to GitHub and create a pull request.
  7. Either merge the branch yourself if you're confident it's good, or request that someone else reviews the changes and merges it in.
  8. Repeat
  9. ...
  10. Profit.

Project based on the cookiecutter data science project template. #cookiecutterdatascience


nwf-process-geodata's Issues

Path changes

Currently the download paths look like data/raw/New Category/Subtype/Individual.xxx (e.g. data/raw/crucial_critical/wgfd_aquatic/aquatic_crucial_habitat_priorities.geojson). This structure is preserved for the processed data and associated documentation. I think it would be better if the data files and documentation each had their own folder, so you'd have aquatic_crucial_habitat_priorities/aquatic_crucial_habitat_priorities.geojson and aquatic_crucial_habitat_priorities/aquatic_crucial_habitat_priorities.pdf. This should be just a small change to the path creation function in download.py but will require deleting everything on the bucket first (which I wanted to do anyway).
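
A hypothetical sketch of the proposed layout (the function name and signature are illustrative, not the actual path creation function in download.py):

```python
from pathlib import Path

def make_output_paths(root: Path, dataset_name: str, ext: str) -> tuple[Path, Path]:
    """Return (data_path, doc_path), giving each dataset its own folder."""
    folder = root / dataset_name
    return folder / f"{dataset_name}{ext}", folder / f"{dataset_name}.pdf"

# e.g. make_output_paths(Path("data/processed"), "aquatic_crucial_habitat_priorities", ".geojson")
```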

Function to delete all files on the bucket

It's going to be useful to be able to clear out the bucket and rerun everything from scratch. We should have a function like `delete_bucket_contents` that does this. It can go in common.py or in its own script.
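
A minimal sketch of such a function, assuming boto3 (deletion is irreversible, so the real version may want a confirmation prompt):

```python
import boto3

def delete_bucket_contents(bucket_name: str) -> None:
    """Remove every object in the bucket (the bucket itself is kept)."""
    bucket = boto3.resource("s3").Bucket(bucket_name)
    bucket.objects.all().delete()  # batched delete of all keys
```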
