cityxr-data-engine's Introduction

City XR Data Project

We’ll develop a data engine package to handle ingestion and querying of data. This will consist of two programs and supporting scripts:

The cxd-ingest program will run and continuously sample data according to its source plugins.
The cxd-server program will run and serve batch queries over some kind of GraphQL or RESTy interface, and will also route real-time messages to dataset subscribers via websockets.

Assumptions

The engine will use Postgres 11 as a backing data store.
The engine will target deployment on Linux servers.
Operators will have basic business analyst knowledge--e.g., ability to modify example stub scripts.

Data sources supported will be:

SQL queries via ODBC
Document fetching via HTTP
JSON payloads received via HTTP POST
Local filesystem access

Datasets will be one of:

Point-in-time data (“These businesses were at these locations 12 days ago.”)
Time-series data (“These locations had this temperature between May 5 and Nov 5.”)
Instantaneous data (“This is the current value of water in this flood plain.”)

Features

The overall engine will include:

Thorough documentation of engine architecture.
Tutorials for common engine use-cases.
Helper scripts to generate stub data source plugins.
Helper scripts to manage the engine.
Instructions for deployment.

The cxd-ingest program will:

Have a shell script for generating stub plugins to add data sources.
Have a plugin architecture for supporting data sources as described above.
Plugins will be written in Python 3.
Plugins will register themselves with the ingest program and run their setup (creation of database tables, etc.).
Plugins may either stay resident (if they need to receive webhooks) or schedule themselves for periodic servicing.
When serviced, plugins will run in two parts:
- A fetch phase, where they receive or gather their data.
- An ingest phase, where they process and clean the data.
Log errors and data quality conditions.
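
The plugin lifecycle described above (register, run setup, then a fetch phase followed by an ingest phase) could be sketched roughly as follows. This is only an illustration under stated assumptions: SourcePlugin, PluginRegistry, TemperaturePlugin, and every method name are hypothetical and not taken from the actual cxd-ingest code, and the fetch phase returns canned values instead of contacting a real source.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

# Hypothetical names throughout: nothing here is the real cxd-ingest API.
Row = Dict[str, Any]


class SourcePlugin(ABC):
    """One data source, serviced in a fetch phase and then an ingest phase."""

    name: str = "unnamed"
    interval_seconds: int = 3600  # assumed knob for periodic servicing

    def setup(self) -> None:
        """Create database tables, etc. Does nothing by default."""

    @abstractmethod
    def fetch(self) -> Any:
        """Fetch phase: receive or gather the raw data."""

    @abstractmethod
    def ingest(self, raw: Any) -> List[Row]:
        """Ingest phase: process and clean the raw data into rows."""

    def service(self) -> List[Row]:
        """Run both phases in order and return the cleaned rows."""
        return self.ingest(self.fetch())


class PluginRegistry:
    """Keeps track of registered plugins and services them on demand."""

    def __init__(self) -> None:
        self._plugins: Dict[str, SourcePlugin] = {}

    def register(self, plugin: SourcePlugin) -> None:
        plugin.setup()  # setup runs once, at registration time
        self._plugins[plugin.name] = plugin

    def service_all(self) -> Dict[str, List[Row]]:
        return {name: p.service() for name, p in self._plugins.items()}


class TemperaturePlugin(SourcePlugin):
    """Example plugin; the fetch phase returns canned strings."""

    name = "temperature"

    def fetch(self) -> Any:
        return [" 21.5 ", "bad", "19.0"]

    def ingest(self, raw: Any) -> List[Row]:
        rows: List[Row] = []
        for value in raw:
            try:
                rows.append({"temp_c": float(value)})
            except ValueError:
                continue  # a real plugin would log a data quality condition
        return rows
```

Keeping fetch and ingest as separate methods means a failed download and a malformed record can be logged as different kinds of problem.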

The cxd-server program will:

Provide at least one of a RESTy or GraphQL interface for querying ingested data.
Provide ability to retrieve batched data updates since some previous time (“Give me all new data since yesterday”).
Provide periodic data updates to clients subscribed to various datasets.
Provide a maintenance/debugging interface to see what data sources are available, what data they have, what their run status is, and to visualize that data.
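
As a rough illustration of the “all new data since yesterday” requirement, the sketch below filters rows by an ingestion timestamp. It assumes each row carries an `ingested_at` field holding a timezone-aware datetime. That field name and the `rows_since` helper are hypothetical, and a real server would more likely push this filter into a Postgres query than run it in memory.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def rows_since(rows: Iterable[Row], since: datetime) -> List[Row]:
    """Return rows ingested strictly after `since`, oldest first.

    Assumes every row has an `ingested_at` key (hypothetical name)
    holding a timezone-aware datetime.
    """
    newer = [row for row in rows if row["ingested_at"] > since]
    return sorted(newer, key=lambda row: row["ingested_at"])


# Usage: three rows, one day apart, queried from the middle timestamp.
start = datetime(2020, 5, 5, tzinfo=timezone.utc)
sample = [
    {"id": 3, "ingested_at": start + timedelta(days=2)},
    {"id": 1, "ingested_at": start},
    {"id": 2, "ingested_at": start + timedelta(days=1)},
]
batch = rows_since(sample, start + timedelta(days=1))
```

Using a strict comparison lets a client pass the timestamp of the last row it received without getting that row back again.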

cityxr-data-engine's Issues

Server -- Need to have a way of querying point-in-time data for a particular data source

The REST API needs to allow users to query a data source's values at a particular time.

Users should specify:

What data source they want
What fields they care about
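
One way a client might express such a request is sketched below. Everything about the route is an assumption made for illustration: the `/sources/<id>/values` path, the `fields` and `at` parameter names, and the `point_in_time_url` helper are hypothetical, since the issue does not fix a URL scheme.

```python
from datetime import datetime, timezone
from typing import Sequence
from urllib.parse import quote, urlencode


def point_in_time_url(source_id: str, fields: Sequence[str], at: datetime) -> str:
    """Build a relative URL for a point-in-time query.

    The path and parameter names are hypothetical, not a real cxd-server route.
    """
    if at.tzinfo is None:
        raise ValueError("`at` must be timezone-aware")
    query = urlencode({"fields": ",".join(fields), "at": at.isoformat()})
    return f"/sources/{quote(source_id, safe='')}/values?{query}"


url = point_in_time_url(
    "weather",
    ["temp_c", "humidity"],
    datetime(2020, 5, 5, 12, 0, tzinfo=timezone.utc),
)
```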

Ingest -- Need to be able to listen for HTTP POSTs on a route and forward them to the appropriate plugin for processing, if that plugin accepts POSTed data (the whole pipeline here)

Ingest -- Need to have a way of notifying subscribers in the database (via NOTIFY, probably) that new data is available or that a source has updated or failed.
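
If NOTIFY is used, the notification could be a small JSON payload sent through Postgres's `pg_notify(channel, payload)` function. The sketch below only builds the payload and the parameterized statement and does not open a database connection. The payload keys, the status values, and the helper names are assumptions. The size check reflects the default Postgres limit, under which a NOTIFY payload must be shorter than 8000 bytes.

```python
import json
from typing import Tuple

MAX_PAYLOAD_BYTES = 7999  # payload must be shorter than 8000 bytes by default
ALLOWED_STATUSES = {"updated", "failed"}  # assumed status vocabulary


def build_notification(source_id: str, status: str) -> str:
    """Serialize a notification payload (keys are hypothetical)."""
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    payload = json.dumps(
        {"source": source_id, "status": status}, separators=(",", ":")
    )
    if len(payload.encode("utf-8")) > MAX_PAYLOAD_BYTES:
        raise ValueError("payload too large for NOTIFY")
    return payload


def notify_statement(channel: str, payload: str) -> Tuple[str, Tuple[str, str]]:
    """Return SQL and parameters for a DB-API driver that uses %s placeholders."""
    return "SELECT pg_notify(%s, %s)", (channel, payload)


sql, params = notify_statement("cxd_updates", build_notification("weather", "updated"))
```

Because of the size limit, the payload only announces that something changed. Subscribers would fetch the data itself separately.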

Server -- Need to have a way of querying the available data sources

Server -- Need a maintenance page for reviewing data.

We need a page to view the run statuses and data values of the data sources. This can live in the ingest server, which already has a lot of this.

This needs to:

show what data sources are available
show what data schema the data sources have
show what their run status and run history are
visualize the data

Ingest -- Need to have a way of connecting to other databases as data sources

Ingest -- Need to have a way of pausing datasources

(check the config table is_disabled flag for where we might be able to store this)
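
If the `is_disabled` flag is used for pausing, the scheduler could skip paused sources before each servicing pass. The sketch below assumes config rows have already been loaded as dictionaries. Apart from `is_disabled`, the field names and the `runnable_sources` helper are hypothetical.

```python
from typing import Any, Dict, Iterable, List


def runnable_sources(configs: Iterable[Dict[str, Any]]) -> List[str]:
    """Return names of sources whose is_disabled flag is not set.

    A missing flag is treated as "not disabled" (an assumption).
    """
    return [c["name"] for c in configs if not c.get("is_disabled", False)]


configs = [
    {"name": "weather", "is_disabled": False},
    {"name": "traffic", "is_disabled": True},  # paused by an operator
    {"name": "floodplain"},
]
active = runnable_sources(configs)
```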

Ingest -- Need to accept addresses and attempt to geocode them

Ingest -- Need to accept geojson for shapes
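
Accepting GeoJSON shapes implies some validation before storage. As a minimal sketch covering only the Polygon geometry type, the check below follows the GeoJSON rule (RFC 7946) that each linear ring has four or more positions and that its first and last positions are identical. The function names are hypothetical, and a real ingest step would need to handle the other geometry types as well.

```python
from typing import Any, Dict, List


def is_closed_ring(ring: List[Any]) -> bool:
    """A linear ring needs at least 4 positions, first equal to last."""
    return len(ring) >= 4 and ring[0] == ring[-1]


def validate_polygon(geometry: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the polygon looks valid."""
    if geometry.get("type") != "Polygon":
        return ["type must be 'Polygon'"]
    rings = geometry.get("coordinates")
    if not isinstance(rings, list) or not rings:
        return ["coordinates must be a non-empty list of rings"]
    problems: List[str] = []
    for index, ring in enumerate(rings):
        if not isinstance(ring, list) or not is_closed_ring(ring):
            problems.append(
                f"ring {index} must have at least 4 positions and be closed"
            )
    return problems


triangle = {"type": "Polygon", "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 0]]]}
open_ring = {"type": "Polygon", "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1]]]}
```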

Server -- Need to have a way of forwarding ingest notifications to subscribed clients

The data server needs to be able to send updates to clients that are subscribed to a data source over websockets.

Subscribers are notified once a run completes. Historical data must be fetched using the API.

The websocket subscription process is:

Server maintains a list of active data sources and their subscribers
Client subscribes to a data source
On update from the database (NOTIFY perhaps? see example gist somebody used here), notify subscribers through websockets that new data is available.

Clients subscribe by sending this message:

{
  "ds": "<datasource id>",
  "a": "sub"
}

To unsubscribe:

{
  "ds": "<datasource id>",
  "a": "unsub"
}
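
The subscribe and unsubscribe messages above could be handled on the server with a small in-memory registry. This is a sketch that leaves out the websocket transport itself. SubscriptionRegistry and the idea of identifying clients by a string id are assumptions, and only the `ds` and `a` keys come from the message format above.

```python
import json
from collections import defaultdict
from typing import Dict, Set


class SubscriptionRegistry:
    """Tracks which clients are subscribed to which data source."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, Set[str]] = defaultdict(set)

    def handle(self, client_id: str, raw_message: str) -> None:
        """Apply one client message of the form {"ds": ..., "a": "sub"|"unsub"}."""
        message = json.loads(raw_message)
        source_id, action = message["ds"], message["a"]
        if action == "sub":
            self._subscribers[source_id].add(client_id)
        elif action == "unsub":
            self._subscribers[source_id].discard(client_id)
        else:
            raise ValueError(f"unknown action: {action!r}")

    def subscribers(self, source_id: str) -> Set[str]:
        """Return a copy of the subscriber set for one data source."""
        return set(self._subscribers.get(source_id, ()))


registry = SubscriptionRegistry()
registry.handle("client-1", '{"ds": "weather", "a": "sub"}')
registry.handle("client-2", '{"ds": "weather", "a": "sub"}')
registry.handle("client-1", '{"ds": "weather", "a": "unsub"}')
```

When a notification arrives for a data source, the server would look up `registry.subscribers(...)` and send the update to each of those connections.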

Once subscribed, a client should expect messages of the form:

{
  "t": "<RFC3339 timestamp of sending time>",
  "rid": "<run id>",
  "ds": "<datasource id>",
  "d" : [
   <data frames as objects whose keys are the column names and whose values are the data points> 
  ]
}
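
A message of this form could be assembled as sketched below. The `build_update` helper is hypothetical. The format above does not say what `rid` holds, so the sketch assumes it carries an identifier for the completed run. Python's `datetime.isoformat()` on a timezone-aware UTC value produces a string such as `2020-05-05T12:00:00+00:00`, which is a valid RFC 3339 timestamp.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


def build_update(
    source_id: str,
    run_id: str,
    frames: List[Dict[str, Any]],
    sent_at: Optional[datetime] = None,
) -> str:
    """Serialize one update message for subscribers.

    `run_id` filling the "rid" key is an assumption. Each frame maps
    column names to data points.
    """
    sent_at = sent_at or datetime.now(timezone.utc)
    if sent_at.tzinfo is None:
        raise ValueError("sent_at must be timezone-aware")
    return json.dumps(
        {"t": sent_at.isoformat(), "rid": run_id, "ds": source_id, "d": frames}
    )


message = json.loads(
    build_update(
        "weather",
        "run-1",
        [{"temp_c": 21.5}],
        datetime(2020, 5, 5, 12, 0, tzinfo=timezone.utc),
    )
)
```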