Above is the system design for this application. For simplicity, the 'Worker' and 'Visualizer' logic are both handled in the same Go program.
- Pull-based subscription, so that multiple workers can consume from the same subscription and writes can be scaled up.
- A time-series database (InfluxDB) for query efficiency, since we are interested in how a metric changes over time.
- Batched writes, to minimize I/O overhead on the database.
- Separate data processing, storage, and visualization. We may want to visualize the same data points in various ways (rides / hr, avg meter reading) or scale these concerns independently.
- At larger scale or with more complex requirements, we may want to ingest the data into a pipeline of Apache Spark functions.
- At larger scale, a single InfluxDB node may not be sufficient for high availability and resilience. An enterprise installation of InfluxDB adds features for running in distributed mode.
- Ensure you have minikube set up.
- Add helm and tiller for easier installation of InfluxDB.
- Install InfluxDB: `helm install --name v1 stable/influxdb`
- You should now be able to get the hostname of InfluxDB, e.g. `http://v1-influxdb.default:8086`
- Create a database called `taxianalytics`.
- Set up your Google Cloud project and topic for PubSub. We will need the project ID and topic name later.
- Get your Google Cloud service account key and save it in `<project_root>/key.json`.
- Set the key: `export GCLOUD_KEY=$(cat key.json)`
- Set your Google Cloud project: `export TAXI_PROJECT=<gcloud_project_id>`
- Set the PubSub subscription name: `export TAXI_SUB_NAME=<gcloud_pubsub_topic>`
- Set the database host name: `export DB_HOST=<influx_db_host>`
- Run the app: `go run main.go`
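
Before the first run, the `taxianalytics` database from the setup steps has to exist on the host given by `DB_HOST`. As a sketch, it could be created from Go through the HTTP `/query` endpoint, assuming the chart installs InfluxDB 1.x and that `DB_HOST` holds a full URL such as `http://v1-influxdb.default:8086`. The helper names `createDBQuery` and `createDatabase` are hypothetical and not part of this repository.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"os"
	"strings"
)

// createDBQuery builds the InfluxQL statement that creates the named database.
// The name is wrapped in double quotes; this is only intended for simple names.
func createDBQuery(name string) string {
	return fmt.Sprintf("CREATE DATABASE %q", name)
}

// createDatabase posts the statement to the /query endpoint of an
// InfluxDB 1.x server (assumed version). It only checks the HTTP status,
// not the JSON body of the response.
func createDatabase(host, name string) error {
	endpoint := strings.TrimRight(host, "/") + "/query"
	resp, err := http.PostForm(endpoint, url.Values{"q": {createDBQuery(name)}})
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("influxdb returned %s", resp.Status)
	}
	return nil
}

func main() {
	host := os.Getenv("DB_HOST")
	if host == "" {
		fmt.Fprintln(os.Stderr, "DB_HOST is not set")
		os.Exit(1)
	}
	if err := createDatabase(host, "taxianalytics"); err != nil {
		fmt.Fprintln(os.Stderr, "could not create database:", err)
		os.Exit(1)
	}
	fmt.Println("database taxianalytics is ready")
}
```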
- All of the above environment variables must be set.
- Run the deploy script: `sh deploy.sh`