discovery_gtr: GtR to S3 Pipeline

👋 About

The GtR to S3 Pipeline is a workflow that fetches paginated bulk data resources from the GtR (Gateway to Research) API and saves them to an AWS S3 bucket. It is designed to run as a GitHub Action, but it can also be executed locally.

This is an experimental script and should be used with caution, as it can consume Nesta's GitHub Actions minutes.

Workflow Overview

The pipeline performs the following steps:

  1. Calls the GtR API to get the total number of pages for a specified endpoint.
  2. Calls the GtR API to fetch data for each page.
  3. Saves the fetched data to an AWS S3 bucket.
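
In outline, a minimal version of these three steps might look like the sketch below. The function names and the S3 key layout are illustrative rather than taken from gtr_to_s3.py, and the sketch assumes the GtR v2 API conventions: JSON via an Accept header, `p`/`s` query parameters for page and size, and a `totalPages` field in the response.

```python
import json
import os

import boto3
import requests

# Public GtR API base URL; the script's own constants may differ.
GTR_API = "https://gtr.ukri.org/gtr/api"


def fetch_total_pages(endpoint: str, page_size: int = 100) -> int:
    """Step 1: ask the endpoint how many pages it has."""
    response = requests.get(
        f"{GTR_API}/{endpoint}",
        headers={"Accept": "application/json"},
        params={"p": 1, "s": page_size},
    )
    response.raise_for_status()
    return response.json()["totalPages"]


def fetch_page(endpoint: str, page: int, page_size: int = 100) -> dict:
    """Step 2: fetch one page of results as JSON."""
    response = requests.get(
        f"{GTR_API}/{endpoint}",
        headers={"Accept": "application/json"},
        params={"p": page, "s": page_size},
    )
    response.raise_for_status()
    return response.json()


def save_page_to_s3(data: dict, endpoint: str, page: int) -> None:
    """Step 3: write the page to the configured S3 bucket."""
    s3 = boto3.client(
        "s3",
        aws_access_key_id=os.environ["AWS_ACCESS_KEY"],
        aws_secret_access_key=os.environ["AWS_SECRET_KEY"],
    )
    # Key layout is assumed for illustration.
    key = f"{os.environ['DESTINATION_S3_PATH']}/{endpoint}/page_{page}.json"
    s3.put_object(Bucket=os.environ["MY_BUCKET_NAME"], Key=key, Body=json.dumps(data))


if __name__ == "__main__":
    endpoint = os.environ["ENDPOINT"]
    for page in range(1, fetch_total_pages(endpoint) + 1):
        save_page_to_s3(fetch_page(endpoint, page), endpoint, page)
```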

Usage

GitHub Action

When using the pipeline as a GitHub Action, each instance of the workflow fetches data from a specific endpoint. The ENDPOINT environment variable is set to specify the endpoint to fetch. Additionally, the following environment variables should be set as GitHub Secrets:

  • AWS_ACCESS_KEY: AWS access key for S3.
  • AWS_SECRET_KEY: AWS secret key for S3.
  • MY_BUCKET_NAME: Name of the S3 bucket to store the data.
  • DESTINATION_S3_PATH: Path to the S3 destination folder.
  • ENDPOINT: The endpoint to fetch.

To trigger the GitHub Action, you can include it in your GitHub Actions workflow configuration file.
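
As an illustration only (the repository's own main.yaml may differ), a workflow could expose the secrets as environment variables like this:

```yaml
# Illustrative sketch; not the repository's actual main.yaml.
name: gtr-to-s3
on:
  workflow_dispatch:

jobs:
  fetch:
    runs-on: ubuntu-latest
    env:
      AWS_ACCESS_KEY: ${{ secrets.AWS_ACCESS_KEY }}
      AWS_SECRET_KEY: ${{ secrets.AWS_SECRET_KEY }}
      MY_BUCKET_NAME: ${{ secrets.MY_BUCKET_NAME }}
      DESTINATION_S3_PATH: ${{ secrets.DESTINATION_S3_PATH }}
      ENDPOINT: projects
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      # Illustrative; this repo manages its environment with conda via `make install`.
      - run: pip install boto3 requests
      - run: python gtr_to_s3.py
```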

To activate the GitHub Action, move the .github/main.yaml file to .github/workflows/main.yaml. It is currently kept at the former path so that the action stays disabled.

Local Execution

To run the pipeline locally, you need to set the following environment variables in a .env file:

  • AWS_ACCESS_KEY: AWS access key for S3.
  • AWS_SECRET_KEY: AWS secret key for S3.
  • MY_BUCKET_NAME: Name of the S3 bucket to store the data.
  • DESTINATION_S3_PATH: Path to the S3 destination folder.
  • ENDPOINTS: A list of endpoints to fetch data from.
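
For example, a .env file might look like the following. The values are placeholders, and the encoding of ENDPOINTS (assumed here to be comma-separated) should match whatever the script actually parses:

```bash
# Placeholder values; replace with real credentials.
AWS_ACCESS_KEY=your-access-key
AWS_SECRET_KEY=your-secret-key
MY_BUCKET_NAME=my-gtr-bucket
DESTINATION_S3_PATH=gtr/raw
# Comma-separated encoding is assumed; check how the script parses ENDPOINTS.
ENDPOINTS=projects,funds,organisations,persons
```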

You can execute the pipeline locally using the command:

```bash
python gtr_to_s3.py
```

Data Extraction

The pipeline extracts relevant data from the GtR API response using an identifier (key_to_extract). The identifier is generated based on the endpoint being fetched. Here's how the identifier is determined for commonly used endpoints:

  • Funds Endpoint: The identifier is set to "fund".
  • Projects Endpoint: The identifier is set to "project".
  • Organisations Endpoint: The identifier is set to "organisation".
  • Persons Endpoint: The identifier is set to "person".

For each endpoint, the pipeline extracts data by iterating through the API response, mapping specific headers to the extracted data, and creating a list of dictionaries, where each dictionary represents a single data entry.

The pipeline is designed to accommodate variations in data structure for different endpoints, allowing for flexible data extraction.
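
A simplified sketch of this extraction logic is shown below, with illustrative names; the actual implementation in gtr_to_s3.py may differ:

```python
def key_to_extract(endpoint: str) -> str:
    """Derive the identifier from the endpoint name, e.g. "funds" -> "fund"."""
    # Illustrative shortcut: the script may use an explicit mapping instead.
    return endpoint.rstrip("s")


def extract_records(response_json: dict, endpoint: str, headers: list[str]) -> list[dict]:
    """Build one dictionary per entry, keeping only the requested headers."""
    entries = response_json.get(key_to_extract(endpoint), [])
    return [{header: entry.get(header) for header in headers} for entry in entries]
```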

Customization

If you need to customize data extraction for a specific endpoint, you can modify the ENDPOINT_HEADERS dictionary within the script. This dictionary is intended to map endpoint names to lists of headers to extract; at present it is not populated, so the data is produced with the default GtR field mapping.
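
If you do populate it, the shape would presumably look something like the following (the header names here are illustrative and not confirmed against the GtR schema):

```python
# Illustrative shape only; header names are not confirmed against the GtR schema.
ENDPOINT_HEADERS = {
    "projects": ["id", "title", "status"],
    "funds": ["id", "start", "end"],
}
```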

Setup

  • Meet the data science cookiecutter requirements, in brief:
    • Install: direnv and conda
  • Run make install to configure the development environment:
    • Setup the conda environment
    • Configure pre-commit

Contributor guidelines

Technical and working style guidelines


Project based on Nesta's data science project template (Read the docs here).
