Playground around vector and webserver logs: vector and influxdb2, with a docker-compose to run it quickly.
The influxdb bucket, org, token... are created automatically at first run. The docker-compose should honor dependencies between services through healthchecks.
When launching the stack, it also launches the pipeline computation (i.e. vector runs). It can take a while when processing big files; use `vector tap` (cf. below) to see what happens!
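Note that `vector tap` talks to vector's GraphQL API, which is disabled by default. If the config doesn't already enable it, something like this in `vector.toml` is needed (the address is an assumption, adjust to your setup):

```toml
# Required for `vector tap` to connect: the API is off by default.
[api]
enabled = true
address = "0.0.0.0:8686"
```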
- Upload yesterday's logs to Minio (should be done with a cron)
- Run `python script/download_logfile_and_catalogs.py` to download data.gouv.fr's catalogs and store them in the `./tables/` directory, and to download yesterday's log file to `./logs`.
TL;DR: count resource downloads, aggregate them over 1 minute and send them to influxdb. Lives in vector.toml.
- Open the haproxy log file(s) designated by `sources.haproxy.include`
- Parse them as syslog, then via a custom regex as haproxy log components (HTTPLog)
- Filter based on haproxy backend (data.gouv.fr) and status code (no errors)
- Enrich with business info in `detect_type`: this is where you detect if it's a dataset, a resource... an api call or not...
- Route based on `detect_type`: this dispatches the lines based on their type and allows custom logic for each type
- `map_to_resource` is specific to, well, resources. It uses a predefined enrichment table, which is the resources catalog. This is where you can find a resource and dataset id based on the request
- `metric_count_resources` transforms the log line to a metric: basically it keeps only the fields we need and defines a pivot field for metrics computation (what to count)
- `aggregate_resources`: aggregate (sum) over 1 minute
- Push the results to influxdb (a sketch of the whole pipeline follows this list)
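Not the repo's actual config, but a minimal sketch of what such a `vector.toml` can look like, reusing the transform names above. The file paths, regex, catalog columns, credentials and the extra `route_type` transform are illustrative assumptions:

```toml
# Illustrative sketch only: paths, the regex, catalog columns and the extra
# route_type transform are assumptions, not the repo's actual vector.toml.

[sources.haproxy]
type = "file"
include = ["/logs/haproxy*.log"]

# The resources catalog downloaded into ./tables, used as an enrichment table
[enrichment_tables.resources]
type = "file"
file.path = "/tables/resources.csv"
file.encoding.type = "csv"

# Parse the syslog envelope, then the haproxy HTTPLog components
[transforms.parse]
type = "remap"
inputs = ["haproxy"]
source = '''
. = parse_syslog!(.message)
# Placeholder regex: the real HTTPLog pattern is more involved
. = merge(., parse_regex!(.message, r'(?P<backend>\S+) (?P<status_code>\d{3}) "(?P<method>\w+) (?P<path>\S+)'))
'''

# Keep only data.gouv.fr hits that are not errors
[transforms.keep_datagouv]
type = "filter"
inputs = ["parse"]
condition = '.backend == "data.gouv.fr" && to_int!(.status_code) < 400'

# Business enrichment: what kind of hit is this?
[transforms.detect_type]
type = "remap"
inputs = ["keep_datagouv"]
source = '''
.type = "other"
if contains(string!(.path), "/resources/") { .type = "resource" }
if contains(string!(.path), "/datasets/") { .type = "dataset" }
.request_api = starts_with(string!(.path), "/api/")
'''

# Dispatch per type so each one can get custom logic
[transforms.route_type]
type = "route"
inputs = ["detect_type"]
route.resource = '.type == "resource"'
route.dataset = '.type == "dataset"'

# Find the resource and dataset ids in the catalog
[transforms.map_to_resource]
type = "remap"
inputs = ["route_type.resource"]
source = '''
row = get_enrichment_table_record!("resources", { "url": .path })
.resource_id = row.id
.dataset_id = row.dataset_id
'''

# Log -> metric: count occurrences of resource_id, keep only useful tags
[transforms.metric_count_resources]
type = "log_to_metric"
inputs = ["map_to_resource"]

[[transforms.metric_count_resources.metrics]]
type = "counter"
field = "resource_id"
name = "resource_id"
namespace = "resource"
tags.dataset_id = "{{dataset_id}}"
tags.resource_id = "{{resource_id}}"
tags.method = "{{method}}"
tags.request_api = "{{request_api}}"
tags.status_code = "{{status_code}}"

# Sum the counters over 1-minute windows
[transforms.aggregate_resources]
type = "aggregate"
inputs = ["metric_count_resources"]
interval_ms = 60000

[sinks.influxdb]
type = "influxdb_metrics"
inputs = ["aggregate_resources"]
endpoint = "http://influxdb:8086"
org = "vector-org"
bucket = "vector-bucket"
token = "${INFLUXDB_TOKEN}"
```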
This is the current data model sent from vector to influxdb.
```json
{
  "name": "resource_id",
  "namespace": "resource",
  "tags": {
    "dataset_id": "5f733777722fc12a413290eb",
    "method": "GET",
    "request_api": "false",
    "resource_id": "01466800-c1cb-48f4-b7f6-bf1615c34e7f",
    "status_code": "200"
  },
  "timestamp": "2022-06-04T00:00:21Z",
  "kind": "incremental",
  "counter": {
    "value": 2.0
  }
}
```
influxdb indexes `tags` as fields, in a measurement named `namespace.name`, with the associated `timestamp` and `counter.value` as the value.
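For the sample metric above, this maps to a line-protocol point roughly like the following (assuming the counter lands in a field named `value`; the exact field name depends on the sink):

```
resource.resource_id,dataset_id=5f733777722fc12a413290eb,method=GET,request_api=false,resource_id=01466800-c1cb-48f4-b7f6-bf1615c34e7f,status_code=200 value=2 1654300821000000000
```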
influxdb exposes a dashboard on http://localhost:8086/ (influxdb/influxdb) where it's possible to query the time series. For example, this query shows all entries in your influxdb:
```flux
from(bucket: "vector-bucket")
  |> range(start: 0, stop: v.timeRangeStop)
```
You can filter the query to get the download count over time for a given resource.
```flux
from(bucket: "vector-bucket")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "resource.resource_id")
  |> filter(fn: (r) => r["resource_id"] == "e5f40fbc-7a84-4c4a-94e4-55ac4299b222")
  |> aggregateWindow(every: v.windowPeriod, fn: sum, createEmpty: false)
  |> yield(name: "sum")
```
- clickhouse sink: big heavy stuff, couldn't manage to insert proper data into it (but the transform pipeline was not great at that point)
- logs vs metrics: logs would be one unfiltered line, metrics associate a value to a line or an aggregate of lines. Seems more appropriate now that I've understood how it works, not easy at first
- metric type: counter vs set: with aggregation, counter works pretty well
- `vector tap` in docker-compose: pretty neat to log stuff when streaming, but does not work for a test with a small file (the file is processed before vector opens its tap)
- handle resource duplication when going through a permalink then static.data.gouv.fr: could be done by querying the resources table and ignoring the hit when it belongs to a resource with a static.data.gouv.fr URL. VRL seems able to handle that (see the sketch after this list)
- handle datasets, reuses, organizations... hits
- Is influx the right sink? Flux needs some getting used to... Needs to be battle tested. The time-series pattern looks promising for our use case though.
- Aggregation: should we aggregate and on what window?
- This stack might seem overkill for a single metric computation. Still, I believe it's very flexible for future use cases (ops logs, other business computations...). Vector can be deployed as a distributed agent on multiple servers communicating with a central aggregator easily (it's just a source and a sink!). We could deploy it everywhere we need to monitor or query something, whatever it is. We can even plug it natively into our kafka stream. Also, the pattern is coherent with what we're building with Kafka and data analysis services: it's kind of an event-based logging thingy, with consumers and stuff.
- FYI, vector is written in Rust and influxdb in Golang
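A hedged VRL sketch of the dedup idea above, assuming the resources catalog has a `url` column and that the request's host is available on the event as `.host` (both assumptions, untested):

```toml
# Hypothetical and untested: drop the permalink hit when the resource is
# actually served from static.data.gouv.fr, so it isn't counted twice.
[transforms.drop_static_duplicates]
type = "filter"
inputs = ["map_to_resource"]
condition = '''
row, err = get_enrichment_table_record("resources", { "id": .resource_id })
# Assumed columns/fields: row.url (catalog) and .host (request host header)
served_from_static = err == null && contains(string!(row.url), "static.data.gouv.fr")
hit_is_permalink = .host != "static.data.gouv.fr"
# Keep everything except permalink hits for statically-served resources
!(served_from_static && hit_is_permalink)
'''
```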