
AWS Batch-based ETL processing for OpenAddresses/Machine

Home Page: https://batch.openaddresses.io/

License: MIT License

Topics: openaddresses, geocoding, addresses, gis, geospatial, geocoder

batch's Introduction

OpenAddresses Batch

Deploy

Before you are able to deploy infrastructure, you must first set up the OpenAddresses Deploy tools.

Once these are installed, you can create the production stack (note: it should already exist!) via:

deploy create prod

Or update to the latest GitSha or CloudFormation template via:

deploy update prod

Parameters

Whenever you deploy, you will be prompted for the following parameters:

GitSha

On every commit, GitHub Actions builds the latest Docker image and pushes it to the batch ECR. This parameter is populated automatically by the deploy CLI and simply points the stack at the corresponding Docker image in ECR.

MapboxToken

A read-only Mapbox API token for displaying base maps underneath our address data. (Token should start with pk.)

Bucket

The bucket to which assets should be saved. See the S3 Assets section of this document for more information.

Branch

The branch from which weekly sources should be built. When deployed into production this is generally master. When testing new features it can be any openaddresses/openaddresses branch.

DatabaseType

The AWS RDS database class that powers the backend.

DatabasePassword

The password to set on the backend database. It is passed to the API via Docker environment variables.

SharedSecret

Public API functions currently do not require any auth at all. Internal functions, however, are protected by a stack-wide shared secret: an alphanumeric string included in a secret header to authenticate internal API calls.

This value can be any secure alpha-numeric combination of characters and is safe to change at any time.
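
As a sketch, an internal caller would attach the secret on every request. The header name `shared-secret` below is an assumption for illustration, not taken from this README; check the API code for the real header.

```javascript
// Hypothetical sketch: attaching the stack-wide shared secret to internal
// API calls. The header name 'shared-secret' is an assumed name.
function internalHeaders(sharedSecret) {
    return {
        'shared-secret': sharedSecret,
        'content-type': 'application/json'
    };
}

// Usage (Node 18+ global fetch):
// await fetch('https://batch.openaddresses.io/api/...', {
//     method: 'POST',
//     headers: internalHeaders(process.env.SharedSecret)
// });
```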

GithubSecret

This is the secret that GitHub uses to sign API events sent to this API. The shared signature allows us to verify that events are from GitHub. Only the production stack should use this parameter.

Components

The project is divided into several components:

Component Purpose
cloudformation Deploy Configuration
api Dockerized server for handling all API interactions
api/web Subfolder for UI specific components
cli CLI for manually queueing work to batch
lambda Lambda responsible for instantiating a batch job environment and submitting it
task Docker container for running a batch job

S3 Assets

By default, processed job assets are uploaded to the v2.openaddresses.io bucket in the following format:

s3://v2.openaddresses.io/<stack>/job/<job_id>/source.png
s3://v2.openaddresses.io/<stack>/job/<job_id>/source.geojson
s3://v2.openaddresses.io/<stack>/job/<job_id>/cache.zip

Manual sources (sources cached to S3 via the upload tool) are stored in the following format:

s3://v2.openaddresses.io/<stack>/upload/<user_id>/<file_name>
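
A tiny helper illustrating the two layouts above; the bucket and path shapes come from this section, while the helper functions themselves are just for illustration:

```javascript
// Build S3 locations for processed job assets and manual uploads,
// following the layouts documented above.
function jobAssetKey(stack, jobId, asset) {
    return `s3://v2.openaddresses.io/${stack}/job/${jobId}/${asset}`;
}

function uploadKey(stack, userId, fileName) {
    return `s3://v2.openaddresses.io/${stack}/upload/${userId}/${fileName}`;
}

// jobAssetKey('batch-prod', 1004, 'source.png')
// => 's3://v2.openaddresses.io/batch-prod/job/1004/source.png'
```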

API

API documentation is available here.

Development

In order to set up an effective dev environment, first obtain a copy of the metastore.

Create a local copy via:

./clone prod

Then from the /api directory, run

npm run dev

Now, to build the latest UI, navigate to the /api/web directory in a separate tab and run:

npm run build --watch

Note: changes to the website will now be automatically rebuilt; just refresh the page to see them.

Finally, to access the API, navigate to http://localhost:5000 in your web browser.

Database

All data is persisted in an AWS RDS-managed Postgres database.

dbdiagram.io

batch's People

Contributors

dependabot[bot], iandees, ingalls, missinglink, rzmk


batch's Issues

DotMap

Context

machine currently performs the update of the openaddresses.io dotmap. We should create a new scheduled event that downloads the global collection and performs the dotmap update.

Actions

  • Create dotmap event
  • Download global collection
  • Use tippecanoe to create vector layer
  • Upload layer tile-by-tile via new Mapbox API to avoid size restrictions

Collection Size

Context

Track the collection size in the database and display it via the UI

Actions

  • Add size col in collection table
  • Self report the collection size upon collection update
  • Display the collection size via the UI

Source Removal

Context

On each weekly run, the currently stored data results should be compared against the source JSONs. If there are source JSONs that no longer exist, the admin should be prompted to remove the data sources from the batch platform

Warn On Non-200 Website

Context

In the check_sources portion of the task, attempt to curl the website and check for a 200 status code.
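
One way to sketch the check, assuming Node 18+'s global fetch and an invented warning format:

```javascript
// Hypothetical sketch of the check_sources website probe. Only "check for
// a 200 status code" comes from the issue; the warning text is invented.
function websiteWarning(url, status) {
    if (status === 200) return null;
    return `Website check failed for ${url}: HTTP ${status}`;
}

// Usage in the task (Node 18+):
// const res = await fetch(source.website, { method: 'HEAD' });
// const warn = websiteWarning(source.website, res.status);
```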

Failed Data Jobs

Context

To make the process of fixing broken sources easier, there should be a page where the user is able to view recently failing runs

Actions

  • Add job endpoint to only retrieve live runs
  • Add job endpoint to filter by status
  • Update UI to use these filters on a recent job failures page

Job Errors Loading

Context

The JobErrors page does not show a loading bar and instead flashes "No Errors found" every time a user navigates to the page.

Actions

  • Add better loading behavior
  • Update the error count in the upper tab when the user refreshes the list
  • Don't reload page on suppress, just splice out of results

Don't use "optimal" in batch compute environments

The compute environments that run the fetch tasks use the "optimal" instance type in the AWS Batch configuration. That defaults to using some pretty big and old instance types. It'd be better if we could be more specific with instance types ("t3.small") or even just an instance type family ("t3"). We should continue using Spot instances, though. This would save significant amounts of money.

User Paging

Context

The admin page is starting to overflow due to the number of users

Actions

  • Add default limit to returned usernames
  • Add paging system for users

Unchanged Data

Context

If data hasn't changed in over a year, add it to the warn list

Job Error

Show job error on Job Page if one exists

Error Handling

Context

At the moment the Login component is the only component with solid error handling. Every API call should inform the user if it cannot fall back to a safe backup option.

Actions

  • Ensure every fetch has a catch that potentially triggers the Error component
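
A possible shape for a shared wrapper, with the error hook injectable for testing; `showError` stands in for triggering the Error component and is an assumed name:

```javascript
// Sketch: wrap every API fetch so failures always reach the user.
// showError is a placeholder for the UI's Error component trigger;
// fetchFn is injectable so the wrapper can be tested without a network.
async function safeFetch(url, opts, showError, fetchFn = fetch) {
    try {
        const res = await fetchFn(url, opts);
        if (!res.ok) throw new Error(`HTTP ${res.status}`);
        return await res.json();
    } catch (err) {
        showError(err);
        return null;
    }
}
```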

Upload Papercuts

Context

The upload function now works as expected but could use a couple improvements.

Actions

  • Add a close button in the upper right hand corner while uploading that will call xhr.cancel
  • Add a warning if the user tries to navigate away from the page while uploading, notifying them that this will kill the current upload
  • Add an API for listing past uploads

Filter by Map Click

Context

The interactive map on the /data page should allow click events to filter the list of sources

Actions

  • Add bounds column to job table
  • Populate bounds column with stats data from batch task
  • Make job => map relational instead of transactional updates
  • Add point filter to data endpoint
  • Update UI to show click event & perform filter
  • Update UI to be able to cancel filter

Set the correct Content-Type on S3 uploads

e.g. for https://v2.openaddresses.io/batch-prod/job/1004/source.png the content-type is not set, so it defaults to an octet stream and the browser downloads instead of displays the image.
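
A sketch of the fix: derive a Content-Type from the key and pass it to S3 on upload. The MIME map below is a minimal assumption covering the assets this stack produces.

```javascript
// Hypothetical sketch: map asset extensions to Content-Types so the
// browser displays source.png instead of downloading an octet stream.
const MIME = {
    '.png': 'image/png',
    '.geojson': 'application/geo+json',
    '.zip': 'application/zip'
};

function contentTypeFor(key) {
    const ext = key.slice(key.lastIndexOf('.'));
    return MIME[ext] || 'application/octet-stream';
}

// With the AWS SDK the fix is passing it on upload, e.g.:
// s3.putObject({ Bucket, Key, Body, ContentType: contentTypeFor(Key) })
```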

Register Component

Context

The register component has neither error nor success reporting

Actions

  • Add error handling
  • Add red color to input fields that are empty/invalid
  • Add success handling

Login Component

Context

The login component has neither error nor success reporting

Actions

  • Add error handling
  • Add red color to input fields that are empty/invalid
  • Add success handling
  • When a page is initially loaded - determine if a user is authenticated and set auth object

Fully Doc & Host API Endpoints

Context

Fully document and host in-code generated API documentation.

Actions

  • Investigate APIDoc
  • Document existing routes
  • Document POST/Patch Bodies
  • Document general return JSON
  • Document important non-generic Error states

Summary Statistics

Context

Although the results.openaddresses.io stats have been broken for some time, we should create a chart that is similar

Screenshot from 2020-08-22 07-55-33

Invalid Coordinates

Context

As I was trying to add new Wyoming sources, they succeeded but had invalid coordinates. The stats module should also track the number of valid vs invalid lat/lngs

Actions

  • Track number of valid coords
  • Issue warning/fail if above a certain %
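
A sketch of the validity check, assuming WGS84 bounds and lon/lat pairs; any warn/fail threshold would sit on top of `invalidPct`:

```javascript
// Hypothetical sketch of the valid/invalid coordinate counter.
function isValidLonLat(lon, lat) {
    return Number.isFinite(lon) && Number.isFinite(lat)
        && lon >= -180 && lon <= 180
        && lat >= -90 && lat <= 90;
}

// Percentage of coordinates outside valid WGS84 bounds.
function invalidPct(coords) {
    const bad = coords.filter(([lon, lat]) => !isValidLonLat(lon, lat)).length;
    return coords.length ? (bad / coords.length) * 100 : 0;
}
```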

Bin#Match

Context

The Bin#match function is called when a job from a live run is finished successfully. It takes the source and updates the map with coverage information as necessary.

Actions

Ensure the Bin#Match function is able to successfully match all of the following

  • Country Level Sources
  • Region Level Sources (Province/State/Territory)
  • District Level Sources
  • Custom GeoJSON
  • Add point support to map backend
  • Add point support to map UI
  • Add full test coverage
  • Add zoom based scaling of points

preview first 10 features in job output

Is your feature request related to a problem? Please describe.
When uploading a new source, or making changes to an existing one, ideally I would verify that it's being processed as expected before the PR is merged. The job preview like at https://batch.openaddresses.io/job/21189/ shows a map which is really helpful for a quick scan to make sure the projection is roughly correct and that most data is being loaded, but it doesn't show whether the attributes are parsed correctly.

Describe the solution you'd like
I can download the processed data, however for large sources that means I need to download the whole dataset when really I just want to see the first 5 or 10 features as that is usually enough for a first pass scan to make sure the parsing is correct.
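
Since processed output is line-delimited GeoJSON (GeoJSONLD, one feature per line), a preview endpoint could parse just the first N lines instead of the whole file; a sketch with an invented function name:

```javascript
// Hypothetical sketch: return the first n features of a line-delimited
// GeoJSON document without reading the rest of the dataset.
function previewFeatures(geojsonld, n = 10) {
    return geojsonld
        .split('\n')
        .filter((line) => line.trim().length)
        .slice(0, n)
        .map((line) => JSON.parse(line));
}
```

In practice the server would stream the S3 object and stop reading after n newlines, rather than loading the whole body into memory.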

Reduce costs

After switching to batch, our costs have increased dramatically. At first it was because we were using "optimal" in the AWS Batch configuration, which started rather large instances and kept them running unnecessarily, then after #80 we switched to c5 instances, but AWS Batch is still starting up larger instances and keeping them around for longer than needed. This makes our AWS bill ~twice what it was before we switched to batch.

Can we try to use c5.large instances and reduce the maximum number of vCPUs in the batch compute environment? Maybe reduce the requested memory or CPU for each task?

Data download links point to wrong files

As reported by an external data user:

The files downloaded do not match the file labels. For instance, attempting to download the Canadian province of Alberta (ca/ab/province) instead opens file for a city in Brazil. (See attached screenshot.)


Skip Sources

Context

If a source has the skip: true property, don't even fire a batch task
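
The check itself is trivial; a sketch, with the `skip: true` property taken from the issue and the function name ours:

```javascript
// Return true when a source declares skip: true and no batch task
// should be fired for it.
function shouldSkip(source) {
    return Boolean(source && source.skip === true);
}

// shouldSkip({ skip: true })   => true
// shouldSkip({ coverage: {} }) => false
```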

Generate Weekly Data Dumps

Context

Most users download our large data dumps, but generating these dumps is not currently supported

Actions

  • Add global dump
  • Add config definition for data dumps
  • Add data dump batch task
  • Add data dump API -> fire batch task
  • Add lambda schedule to hit data dump API
  • Add UI panel for displaying data dumps

Stats Check

Context

There is currently no protection against an address source dropping significantly. We should have at least basic protection from a source degrading if its total address count drops significantly between runs

Actions

  • Add stats UI to job page
  • Add stats diff UI to job page
  • Add BBOX UI to job page
  • Add BBOX UI diff to job page
  • Add Warn type to job status
  • Add session management
  • Add basic Admin Component
  • Add tab for marking Warn jobs as Success or Failure (Authenticated)
  • Task runner must perform stats comparison and potentially create a job error
  • If Job is part of live run and fails, create a job error
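
The stats comparison in the task runner could look like this sketch; the 20% threshold is an assumption, since the issue only says "drops significantly":

```javascript
// Hypothetical sketch: compare a job's address count against the
// previous run and return a warning message when it drops too far.
function statsWarn(prevCount, newCount, maxDropPct = 20) {
    if (!prevCount) return null; // first run, nothing to compare against
    const dropPct = ((prevCount - newCount) / prevCount) * 100;
    return dropPct > maxDropPct
        ? `Address count dropped ${dropPct.toFixed(1)}% (${prevCount} -> ${newCount})`
        : null;
}
```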

5xx Error Master Ticket

Context

Screenshot from 2021-02-27 06-14-57

I'm seeing a small but consistent number of 5xx errors from the API that all appear to be from the Job Error API.

Need to track this down so I stop getting emails.

Rerun Restrictions

Context

Allow GitHub reruns indefinitely, but only allow job reruns if they are less than ~1 week old, to prevent a super old job from overwriting a new job
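
A sketch of the guard, assuming millisecond epoch timestamps; the one-week cutoff comes from the issue:

```javascript
// Hypothetical sketch: only permit a job rerun when the job is less
// than a week old. GitHub-triggered reruns would bypass this check.
const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

function canRerunJob(jobCreatedMs, nowMs = Date.now()) {
    return (nowMs - jobCreatedMs) <= WEEK_MS;
}
```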

Year Tag

Context

Many sources have a year tag for ensuring static data is updated. If a year tag is older than 1 year, start to consistently WARN the source.
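
A sketch of the staleness check, interpreting "older than 1 year" as the tag predating last year; both the interpretation and the function name are ours:

```javascript
// Hypothetical sketch: flag a source's year tag as stale when it is
// more than one year behind the current year.
function yearTagStale(yearTag, currentYear = new Date().getFullYear()) {
    const year = Number(yearTag);
    if (!Number.isInteger(year)) return false; // no/invalid tag: nothing to warn on
    return year < currentYear - 1;
}
```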

Verify Email

Context

Not unexpectedly, we immediately got a large number of spam email accounts. At the very least we should send an email verification

US-TX-McLennan Will Run for Days

Context

US-TX-McLennan will run for days if not manually terminated

Job: https://console.aws.amazon.com/batch/v2/home?region=us-east-1#jobs/detail/cc97737d-a9c7-479d-9423-d3d11d48b2e3
CWL: https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Faws$252Fbatch$252Fjob/log-events/batch-prod-job$252Fdefault$252Fddd79241891746339c201b6dd93052d2

Actions

  • Add default timeout value
  • Add ability to increase timeout value in schema
  • Fix McLennan to not consume infinite resources
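
AWS Batch's SubmitJob accepts a `timeout: { attemptDurationSeconds }` parameter, so the plumbing could look like this sketch; the schema field name `timeout` and the 6-hour default are assumptions:

```javascript
// Hypothetical sketch: resolve a per-source timeout, falling back to
// an assumed 6-hour default when the schema doesn't set one.
const DEFAULT_TIMEOUT_S = 6 * 60 * 60;

function jobTimeoutSeconds(source) {
    const t = source && source.timeout;
    return Number.isInteger(t) && t > 0 ? t : DEFAULT_TIMEOUT_S;
}

// When submitting the batch job:
// batch.submitJob({ ..., timeout: { attemptDurationSeconds: jobTimeoutSeconds(source) } })
```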

Data Backfill

Context

We should backfill the v2 service with all of the last good runs from the results service.

Actions

  • Write script to get list of s3 locations of latest runs
  • Write script to download and convert to GeoJSONLD
  • Directly override these files into the database - skipping the runner

cc/ @iandees

Don't run CI again on merge

The data please bot seems to run on merge to master, resulting in another scrape + image being posted on the PR after it was merged. It shouldn't run on merge.

GH Issue Bot

Context

The Github CI integration should post images & stats to the PR once a job completes

Actions

  • Create an issue on ci job success
  • Include picture in issue
  • Include stats table in issue

Find another way to build dotmap

The dotmap on openaddresses.io is sourced from this code that uploads the complete listing of all addresses in OA to Mapbox. This costs a bunch of money, so I disabled the cronjob that runs the dotmap + upload process in AWS.

We need to find another way of generating this layer and keeping it up to date.

Run Auto Live

Context

The current results page does not show CI data that is successfully merged into master, meaning that users are forced to scrape the CI runs page to get the latest data if they can't wait for the scheduled runs

Actions

  • Monitor GH Actions for merge event
  • If a PR is merged into master, mark the run as live
