
AWS Batch-based ETL processing for OpenAddresses/Machine

Home Page: https://batch.openaddresses.io/

License: MIT License

Topics: openaddresses, geocoding, addresses, gis, geospatial, geocoder

batch's Introduction

OpenAddresses Batch

Deploy

Before you are able to deploy infrastructure, you must first set up the OpenAddresses Deploy tools.

Once these are installed, you can create the production stack (note: it should already exist!) via:

deploy create prod

Or update to the latest GitSha or CloudFormation template via:

deploy update prod

Parameters

Whenever you deploy, you will be prompted for the following parameters:

GitSha

On every commit, GitHub Actions builds the latest Docker image and pushes it to the batch ECR. This parameter is populated automatically by the deploy CLI and simply points the stack at the corresponding Docker image in ECR.

MapboxToken

A read-only Mapbox API token for displaying base maps underneath our address data. (Token should start with pk.)

Bucket

The bucket to which assets should be saved. See the S3 Assets section of this document for more information.

Branch

The branch from which weekly sources should be built. When deployed into production this is generally master. When testing new features it can be any openaddresses/openaddresses branch.

DatabaseType

The AWS RDS database class that powers the backend.

DatabasePassword

The password to set on the backend database. It is passed to the API via Docker environment variables.

SharedSecret

Public API functions currently do not require any auth at all. Internal functions, however, are protected by a stack-wide shared secret: an alphanumeric string included in a secret header to authenticate internal API calls.

This value can be any secure alpha-numeric combination of characters and is safe to change at any time.
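
As a sketch, an internal caller would attach the secret on every request. The header name `shared-secret` below is an assumption for illustration, not taken from this README; check the API code for the real header.

```javascript
// Hypothetical sketch: attaching the stack-wide shared secret to internal
// API calls. The header name 'shared-secret' is an assumed name.
function internalHeaders(sharedSecret) {
    return {
        'shared-secret': sharedSecret,
        'content-type': 'application/json'
    };
}

// Usage (Node 18+ global fetch):
// await fetch('https://batch.openaddresses.io/api/...', {
//     method: 'POST',
//     headers: internalHeaders(process.env.SharedSecret)
// });
```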

GithubSecret

This is the secret that GitHub uses to sign API events sent to this API. The shared signature allows us to verify that events are from GitHub. Only the production stack should use this parameter.

Components

The project is divided into several components:

Component Purpose
cloudformation Deploy Configuration
api Dockerized server for handling all API interactions
api/web Subfolder for UI specific components
cli CLI for manually queueing work to batch
lambda Lambda responsible for instantiating a batch job environment and submitting it
task Docker container for running a batch job

S3 Assets

By default, processed job assets are uploaded to the v2.openaddresses.io bucket in the following format:

s3://v2.openaddresses.io/<stack>/job/<job_id>/source.png
s3://v2.openaddresses.io/<stack>/job/<job_id>/source.geojson
s3://v2.openaddresses.io/<stack>/job/<job_id>/cache.zip

Manual sources (sources cached to S3 via the upload tool) are stored in the following format:

s3://v2.openaddresses.io/<stack>/upload/<user_id>/<file_name>
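
A tiny helper illustrating the two layouts above; the bucket and path shapes come from this section, while the helper functions themselves are just for illustration:

```javascript
// Build S3 locations for processed job assets and manual uploads,
// following the layouts documented above.
function jobAssetKey(stack, jobId, asset) {
    return `s3://v2.openaddresses.io/${stack}/job/${jobId}/${asset}`;
}

function uploadKey(stack, userId, fileName) {
    return `s3://v2.openaddresses.io/${stack}/upload/${userId}/${fileName}`;
}

// jobAssetKey('batch-prod', 1004, 'source.png')
// => 's3://v2.openaddresses.io/batch-prod/job/1004/source.png'
```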

API

API documentation is available here.

Development

In order to set up an effective dev environment, first obtain a copy of the metastore.

Create a local copy via:

./clone prod

Then from the /api directory, run

npm run dev

Now, to build the latest UI, navigate to the /api/web directory in a separate tab and run:

npm run build --watch

Note: changes to the website will now be automatically rebuilt; just refresh the page to see them.

Finally, to access the API, navigate to http://localhost:5000 in your web browser.

Database

All data is persisted in an AWS RDS-managed Postgres database.

dbdiagram.io

batch's People

Contributors

dependabot[bot], iandees, ingalls, missinglink, rzmk


batch's Issues

DotMap

Context

machine currently performs the update of the openaddresses.io dotmap. We should create a new scheduled event that downloads the global collection and performs the dotmap update.

Actions

  • Create dotmap event
  • Download global collection
  • Use tippecanoe to create vector layer
  • Upload layer tile-by-tile via new Mapbox API to avoid size restrictions

Collection Size

Context

Track the collection size in the database and display it via the UI

Actions

  • Add size col in collection table
  • Self report the collection size upon collection update
  • Display the collection size via the UI

Source Removal

Context

On each weekly run, the currently stored data results should be compared against the source JSONs. If there are source JSONs that no longer exist, the admin should be prompted to remove the data sources from the batch platform

Warn On Non-200 Website

Context

In the check_sources portion of the task, attempt to curl the website and check for a 200 status code.
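
One way to sketch the check, assuming Node 18+'s global fetch and an invented warning format:

```javascript
// Hypothetical sketch of the check_sources website probe. Only "check for
// a 200 status code" comes from the issue; the warning text is invented.
function websiteWarning(url, status) {
    if (status === 200) return null;
    return `Website check failed for ${url}: HTTP ${status}`;
}

// Usage in the task (Node 18+):
// const res = await fetch(source.website, { method: 'HEAD' });
// const warn = websiteWarning(source.website, res.status);
```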

Failed Data Jobs

Context

To make the process of fixing broken sources easier, there should be a page where the user is able to view recently failing runs

Actions

  • Add job endpoint to only retrieve live runs
  • Add job endpoint to filter by status
  • Update UI to use these filters on a recent job failures page

Job Errors Loading

Context

The JobErrors page does not show a loading bar and instead flashes "No Errors found" every time a user navigates to the page.

Actions

  • Add better loading behavior
  • Update the error count in the upper tab when the user refreshes the list
  • Don't reload page on suppress, just splice out of results

Don't use "optimal" in batch compute environments

The compute environments that run the fetch tasks use the "optimal" instance type in the AWS Batch configuration. That defaults to using some pretty big and old instance types. It'd be better if we could be more specific with instance types ("t3.small") or even just an instance type family ("t3"). We should continue using Spot instances, though. This would save significant amounts of money.

User Paging

Context

The admin page is starting to overflow due to the number of users

Actions

  • Add default limit to returned usernames
  • Add paging system for users

Unchanged Data

Context

If data hasn't changed in over a year, add it to the warn list

Job Error

Show job error on Job Page if one exists

Error Handling

Context

At the moment the Login component is the only component with solid error handling. Every API call should inform the user if it cannot fall back to a safe backup option.

Actions

  • Ensure every fetch has a catch that potentially triggers the Error component
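
A possible shape for a shared wrapper, with the error hook injectable for testing; `showError` stands in for triggering the Error component and is an assumed name:

```javascript
// Sketch: wrap every API fetch so failures always reach the user.
// showError is a placeholder for the UI's Error component trigger;
// fetchFn is injectable so the wrapper can be tested without a network.
async function safeFetch(url, opts, showError, fetchFn = fetch) {
    try {
        const res = await fetchFn(url, opts);
        if (!res.ok) throw new Error(`HTTP ${res.status}`);
        return await res.json();
    } catch (err) {
        showError(err);
        return null;
    }
}
```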

Upload Papercuts

Context

The upload function now works as expected but could use a couple improvements.

Actions

  • Add a close button in the upper right hand corner while uploading that will call xhr.cancel
  • Add a warning if the user tries to navigate away from the page while uploading, notifying them that this will kill the current upload
  • Add an API for listing past uploads

Filter by Map Click

Context

The interactive map on the /data page should allow click events to filter the list of sources

Actions

  • Add bounds column to job table
  • Populate bounds column with stats data from batch task
  • Make job => map relational instead of transactional updates
  • Add point filter to data endpoint
  • Update UI to show click event & perform filter
  • Update UI to be able to cancel filter

Set the correct Content-Type on S3 uploads

e.g. for https://v2.openaddresses.io/batch-prod/job/1004/source.png the content-type is not set, so it defaults to an octet stream and the browser downloads instead of displays the image.
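
A sketch of the fix: derive a Content-Type from the key and pass it to S3 on upload. The MIME map below is a minimal assumption covering the assets this stack produces.

```javascript
// Hypothetical sketch: map asset extensions to Content-Types so the
// browser displays source.png instead of downloading an octet stream.
const MIME = {
    '.png': 'image/png',
    '.geojson': 'application/geo+json',
    '.zip': 'application/zip'
};

function contentTypeFor(key) {
    const ext = key.slice(key.lastIndexOf('.'));
    return MIME[ext] || 'application/octet-stream';
}

// With the AWS SDK the fix is passing it on upload, e.g.:
// s3.putObject({ Bucket, Key, Body, ContentType: contentTypeFor(Key) })
```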

Register Component

Context

The register component has neither error nor success reporting

Actions

  • Add error handling
  • Add red color to input fields that are empty/invalid
  • Add success handling

Login Component

Context

The login component has neither error nor success reporting

Actions

  • Add error handling
  • Add red color to input fields that are empty/invalid
  • Add success handling
  • When a page is initially loaded - determine if a user is authenticated and set auth object

Fully Doc & Host API Endpoints

Context

Fully document and host in-code generated API documentation.

Actions

  • Investigate APIDoc
  • Document existing routes
  • Document POST/Patch Bodies
  • Document general return JSON
  • Document important non-generic Error states

Summary Statistics

Context

Although the results.openaddresses.io stats have been broken for some time, we should create a chart that is similar

Screenshot from 2020-08-22 07-55-33

Invalid Coordinates

Context

As I was trying to add new Wyoming sources, they succeeded but had invalid coordinates. The stats module should also track the number of valid vs invalid lat/lngs

Actions

  • Track number of valid coords
  • Issue warning/fail if above a certain %
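
A sketch of the validity check, assuming WGS84 bounds and lon/lat pairs; any warn/fail threshold would sit on top of `invalidPct`:

```javascript
// Hypothetical sketch of the valid/invalid coordinate counter.
function isValidLonLat(lon, lat) {
    return Number.isFinite(lon) && Number.isFinite(lat)
        && lon >= -180 && lon <= 180
        && lat >= -90 && lat <= 90;
}

// Percentage of coordinates outside valid WGS84 bounds.
function invalidPct(coords) {
    const bad = coords.filter(([lon, lat]) => !isValidLonLat(lon, lat)).length;
    return coords.length ? (bad / coords.length) * 100 : 0;
}
```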

Bin#Match

Context

The Bin#match function is called when a job from a live run is finished successfully. It takes the source and updates the map with coverage information as necessary.

Actions

Ensure the Bin#Match function is able to successfully match all of the following

  • Country Level Sources
  • Region Level Sources (Province/State/Territory)
  • District Level Sources
  • Custom GeoJSON
  • Add point support to map backend
  • Add point support to map UI
  • Add full test coverage
  • Add zoom based scaling of points

preview first 10 features in job output

Is your feature request related to a problem? Please describe.
When uploading a new source, or making changes to an existing one, ideally I would verify that it's being processed as expected before the PR is merged. The job preview like at https://batch.openaddresses.io/job/21189/ shows a map which is really helpful for a quick scan to make sure the projection is roughly correct and that most data is being loaded, but it doesn't show whether the attributes are parsed correctly.

Describe the solution you'd like
I can download the processed data, however for large sources that means I need to download the whole dataset when really I just want to see the first 5 or 10 features as that is usually enough for a first pass scan to make sure the parsing is correct.
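
Since processed output is line-delimited GeoJSON (GeoJSONLD, one feature per line), a preview endpoint could parse just the first N lines instead of the whole file; a sketch with an invented function name:

```javascript
// Hypothetical sketch: return the first n features of a line-delimited
// GeoJSON document without reading the rest of the dataset.
function previewFeatures(geojsonld, n = 10) {
    return geojsonld
        .split('\n')
        .filter((line) => line.trim().length)
        .slice(0, n)
        .map((line) => JSON.parse(line));
}
```

In practice the server would stream the S3 object and stop reading after n newlines, rather than loading the whole body into memory.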

Reduce costs

After switching to batch, our costs have increased dramatically. At first it was because we were using "optimal" in the AWS Batch configuration, which started rather large instances and kept them running unnecessarily, then after #80 we switched to c5 instances, but AWS Batch is still starting up larger instances and keeping them around for longer than needed. This makes our AWS bill ~twice what it was before we switched to batch.

Can we try to use c5.large instances and reduce the maximum number of vCPUs in the batch compute environment? Maybe reduce the requested memory or CPU for each task?

Data download links point to wrong files

As reported by an external data user:

The files downloaded do not match the file labels. For instance, attempting to download the Canadian province of Alberta (ca/ab/province) instead opens file for a city in Brazil. (See attached screenshot.)


Skip Sources

Context

If a source has the skip: true property, don't even fire a batch task
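
The check itself is trivial; a sketch, with the `skip: true` property taken from the issue and the function name ours:

```javascript
// Return true when a source declares skip: true and no batch task
// should be fired for it.
function shouldSkip(source) {
    return Boolean(source && source.skip === true);
}

// shouldSkip({ skip: true })   => true
// shouldSkip({ coverage: {} }) => false
```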

Generate Weekly Data Dumps

Context

Most users download our large data dumps, but generating these dumps is not currently supported

Actions

  • Add global dump
  • Add config definition for data dumps
  • Add data dump batch task
  • Add data dump API -> fire batch task
  • Add lambda schedule to hit data dump API
  • Add UI panel for displaying data dumps

Stats Check

Context

There is currently no protection against an address source dropping significantly. We should have at least basic protection from a source degrading if its total address count drops significantly between runs

Actions

  • Add stats UI to job page
  • Add stats diff UI to job page
  • Add BBOX UI to job page
  • Add BBOX UI diff to job page
  • Add Warn type to job status
  • Add session management
  • Add basic Admin Component
  • Add tab for marking Warn jobs as Success or Failure (Authenticated)
  • Task runner must perform stats comparison and potentially create a job error
  • If Job is part of live run and fails, create a job error
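
The stats comparison in the task runner could look like this sketch; the 20% threshold is an assumption, since the issue only says "drops significantly":

```javascript
// Hypothetical sketch: compare a job's address count against the
// previous run and return a warning message when it drops too far.
function statsWarn(prevCount, newCount, maxDropPct = 20) {
    if (!prevCount) return null; // first run, nothing to compare against
    const dropPct = ((prevCount - newCount) / prevCount) * 100;
    return dropPct > maxDropPct
        ? `Address count dropped ${dropPct.toFixed(1)}% (${prevCount} -> ${newCount})`
        : null;
}
```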

5xx Error Master Ticket

Context

Screenshot from 2021-02-27 06-14-57

I'm seeing a small but consistent number of 5xx errors from the API that all appear to be from the Job Error API.

Need to track this down so I stop getting emails.

Rerun Restrictions

Context

Allow GitHub reruns indefinitely, but only allow job reruns if they are less than ~1 week old, to prevent a super old job from overwriting a new job
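
A sketch of the guard, assuming millisecond epoch timestamps; the one-week cutoff comes from the issue:

```javascript
// Hypothetical sketch: only permit a job rerun when the job is less
// than a week old. GitHub-triggered reruns would bypass this check.
const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

function canRerunJob(jobCreatedMs, nowMs = Date.now()) {
    return (nowMs - jobCreatedMs) <= WEEK_MS;
}
```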

Year Tag

Context

Many sources have a year tag for ensuring static data is updated. If a year tag is older than 1 year, start to consistently WARN the source.
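
A sketch of the staleness check, interpreting "older than 1 year" as the tag predating last year; both the interpretation and the function name are ours:

```javascript
// Hypothetical sketch: flag a source's year tag as stale when it is
// more than one year behind the current year.
function yearTagStale(yearTag, currentYear = new Date().getFullYear()) {
    const year = Number(yearTag);
    if (!Number.isInteger(year)) return false; // no/invalid tag: nothing to warn on
    return year < currentYear - 1;
}
```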

Verify Email

Context

Not unexpectedly, we immediately got a large number of spam email accounts. At the very least we should send an email verification

US-TX-McLennan Will Run for Days

Context

US-TX-McLennan will run for days if not manually terminated

Job: https://console.aws.amazon.com/batch/v2/home?region=us-east-1#jobs/detail/cc97737d-a9c7-479d-9423-d3d11d48b2e3
CWL: https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Faws$252Fbatch$252Fjob/log-events/batch-prod-job$252Fdefault$252Fddd79241891746339c201b6dd93052d2

Actions

  • Add default timeout value
  • Add ability to increase timeout value in schema
  • Fix McLennan to not consume infinite resources
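
AWS Batch's SubmitJob accepts a `timeout: { attemptDurationSeconds }` parameter, so the plumbing could look like this sketch; the schema field name `timeout` and the 6-hour default are assumptions:

```javascript
// Hypothetical sketch: resolve a per-source timeout, falling back to
// an assumed 6-hour default when the schema doesn't set one.
const DEFAULT_TIMEOUT_S = 6 * 60 * 60;

function jobTimeoutSeconds(source) {
    const t = source && source.timeout;
    return Number.isInteger(t) && t > 0 ? t : DEFAULT_TIMEOUT_S;
}

// When submitting the batch job:
// batch.submitJob({ ..., timeout: { attemptDurationSeconds: jobTimeoutSeconds(source) } })
```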

Data Backfill

Context

We should backfill the v2 service with all of the last good runs from the results service.

Actions

  • Write script to get list of s3 locations of latest runs
  • Write script to download and convert to GeoJSONLD
  • Directly override these files into the database - skipping the runner

cc/ @iandees

Don't run CI again on merge

The data please bot seems to run on merge to master, resulting in another scrape + image being posted on the PR after it was merged. It shouldn't run on merge.

GH Issue Bot

Context

The Github CI integration should post images & stats to the PR once a job completes

Actions

  • Create an issue on ci job success
  • Include picture in issue
  • Include stats table in issue

Find another way to build dotmap

The dotmap on openaddresses.io is sourced from this code that uploads the complete listing of all addresses in OA to Mapbox. This costs a bunch of money, so I disabled the cronjob that runs the dotmap + upload process in AWS.

We need to find another way of generating this layer and keeping it up to date.

Run Auto Live

Context

The current results page does not show CI data that is successfully merged into master, meaning that users are forced to scrape the CI runs page to get the latest data if they can't wait for the scheduled runs

Actions

  • Monitor GH Actions for merge event
  • If a PR is merged into master, mark the run as live
