fastly / fastly-exporter

A Prometheus exporter for the Fastly Real-time Analytics API

License: Apache License 2.0

Languages: Go 99.66%, Dockerfile 0.09%, Makefile 0.25%
Topics: tool, fastly-oss-tier1

fastly-exporter's Introduction

fastly-exporter

This program consumes from the Fastly Real-time Analytics API and makes the data available to Prometheus. It should behave like you expect: dynamically adding new services, removing old services, and reflecting changes to service metadata like name and version.

Getting

Binary

Go to the releases page.

Docker

Available on the packages page as fastly/fastly-exporter.

docker pull ghcr.io/fastly/fastly-exporter:latest

Note that version latest will track RCs, alphas, etc. -- always use an explicit version in production.

Helm chart

Helm must be installed to use the prometheus-fastly-exporter chart from the prometheus-community Helm repository. Please refer to Helm's documentation to get started.

Once Helm is set up properly, add the repo as follows:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

And install:

helm upgrade --install fastly-exporter prometheus-community/prometheus-fastly-exporter --namespace monitoring --set token="fastly_api_token"

Source

If you have a working Go installation, you can clone the repo and install the binary from any revision, including HEAD.

git clone git@github.com:fastly/fastly-exporter
cd fastly-exporter
go build ./cmd/fastly-exporter
./fastly-exporter -h

Using

Basic

For simple use cases, all you need is a Fastly API token. See the Fastly documentation on creating API tokens for details. The token can be provided via the -token flag or the FASTLY_API_TOKEN environment variable.

fastly-exporter -token XXX

This will collect real-time stats for all Fastly services visible to your token, and make them available as Prometheus metrics on 127.0.0.1:8080/metrics.
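
Once it's running, a quick sanity check is to query the endpoint directly (assuming the default listen address, and the fastly_rt_ metric prefix seen in the examples further down):

curl -s http://127.0.0.1:8080/metrics | grep -c '^fastly_rt_'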

Filtering services

By default, all services available to your token will be exported. You can specify an explicit set of service IDs to export by using the -service xxx flag. (Service IDs are available at the top of your Fastly dashboard.) You can also include only those services whose name matches a regex by using the -service-allowlist '^Production' flag, or exclude any service whose name matches a regex by using the -service-blocklist '.*TEST.*' flag.
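
For example, to export only services whose names start with Production while skipping anything containing TEST (patterns are illustrative), the flags can be combined:

fastly-exporter -token XXX -service-allowlist '^Production' -service-blocklist '.*TEST.*'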

For tokens with access to a lot of services, it's possible to "shard" the services among different fastly-exporter instances by using the -service-shard flag. For example, to shard all services between 3 exporters, you would start each exporter as

fastly-exporter [common flags] -service-shard 1/3
fastly-exporter [common flags] -service-shard 2/3
fastly-exporter [common flags] -service-shard 3/3

Filtering metrics

By default, all metrics provided by the Fastly real-time stats API are exported as Prometheus metrics. You can export only those metrics whose name matches a regex by using the -metric-allowlist 'bytes_total$' flag, or elide any metric whose name matches a regex by using the -metric-blocklist imgopto flag.

Filter semantics

All flags that filter services or metrics are repeatable. Repeating the same flag causes its condition to be combined with OR semantics. For example, -service A -service B would include both services A and B (but not service C). Or, -service-blocklist Test -service-blocklist Staging would skip any service whose name contained Test or Staging.

Different flags (for the same filter target) combine with AND semantics. For example, -metric-allowlist 'bytes_total$' -metric-blocklist imgopto would only export metrics whose names ended in bytes_total, but didn't include imgopto.
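
As a concrete command line, that last example corresponds to:

fastly-exporter -token XXX -metric-allowlist 'bytes_total$' -metric-blocklist imgopto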

Service discovery

Per-service metrics are available via /metrics?target=<service ID>. Available services are enumerated as targets on the /sd endpoint, which is compatible with the generic HTTP service discovery feature of Prometheus. An example Prometheus scrape config for the Fastly exporter follows.

scrape_configs:
  - job_name: fastly-exporter
    http_sd_configs:
      - url: http://127.0.0.1:8080/sd
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: service
      - target_label: __address__
        replacement: 127.0.0.1:8080

Dashboards and Alerting

Data from the Fastly exporter can be used to build dashboards and alerts with Grafana and Alertmanager. For a fully working example, see fastly-dashboards created by @mrnetops. Fastly-dashboards contains a Docker Compose setup, which boots up a full fastly-exporter + Prometheus + Alertmanager + Grafana + Fastly dashboard stack with Slack alerting integration.

fastly-exporter's People

Contributors

arslanbekov, bridgetlane, davidbirdsong, dependabot[bot], froesef, gaashh, leklund, ljagiello, mrnetops, peterbourgon, phamann, shawnps, skgsergio, superq, takanabe, thommahoney, tomhughes, xamebax

fastly-exporter's Issues

metric consolidation for http2/http3

Similar to how we consolidated

  • tls_v10
  • tls_v11
  • tls_v12
  • tls_v13

into fastly_rt_tls_total{tls_version=$version}, we should consolidate

  • http2
  • http3
  • the silent but implicit http1

into fastly_rt_http_total{version=$version}.

Right now we need to do hinky things like

sum(rate(fastly_rt_requests_total{}[1m])) 
- sum(rate(fastly_rt_http2_total{}[1m]))
- sum(rate(fastly_rt_http3_total{}[1m])) 

to derive http1 requests, which requires us to explicitly know every potential metric name in play and derive the value by exclusion.

That is, if http4 is ever added, the exclusion calculation above breaks unless it is explicitly adjusted.

fastly_rt_http_total {version=1} would be soooo much nicer and way less fragile.
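
With the proposed metric in place, deriving HTTP/1 traffic would reduce to a single selector (label name taken from the proposal above):

sum(rate(fastly_rt_http_total{version="1"}[1m]))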

Default endpoint value

I ran into an issue deploying this exporter on k8s where I hadn't set the endpoint argument, which ended up defaulting to http://127.0.0.1:8080/metrics. As a result, Prometheus wasn't able to scrape it: the pod's metrics endpoint wasn't reachable from other pods even though they all had access. While my team and I were troubleshooting, we realized that according to the Dockerfile, the entrypoint actually sets the endpoint to http://0.0.0.0:8080/metrics. So shouldn't the default value of this argument also be http://0.0.0.0:8080/metrics?
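
As a workaround, the address can be set explicitly. Here is a minimal Kubernetes sketch, assuming the -endpoint flag used by the Dockerfiles shown elsewhere on this page and a hypothetical secret holding the token:

containers:
  - name: fastly-exporter
    image: ghcr.io/fastly/fastly-exporter:latest
    args: ["-endpoint", "http://0.0.0.0:8080/metrics"]
    env:
      - name: FASTLY_API_TOKEN            # the token can be supplied via this env var
        valueFrom:
          secretKeyRef:
            name: fastly-api-token        # hypothetical secret name
            key: token
    ports:
      - containerPort: 8080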

Docker tag for latest binary release

It looks like the Docker image tags aren't up-to-date/in sync with the binary releases on Github. While I'm successfully using the latest tag on Docker for now, in general it's preferable to tag the docker image to a specific release for compatibility reasons etc. Just wondering if that's something you'd be able to do, at least for major versions such as v3.

Also, thanks for writing this tool! It's been working out great so far. Cheers!

missing metrics

Hello,

Correct me if I missed something big here, but looking at the Fastly historical stats API documentation (https://docs.fastly.com/api/stats), I see many metrics which are not exported to my Prometheus server, such as status_5xx. Did I miss something, or is this currently not supported?

Consider adding a way to (optionally) track tokens

Tracking tokens in order to monitor when they get old and should be renewed, or when they stop being used and could be destroyed, is a must for security departments.

So it would be great to be able to get such stats via Prometheus, but that requires a service to expose the data.

This applies to both user and automation tokens.

Minimum data required:

  • type (automation/user)
  • creation timestamp
  • last used timestamp
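
For illustration only, this data could be exposed along these lines (the metric names here are hypothetical, not implemented):

# hypothetical metric names, for illustration only
fastly_token_info{token_id="abc123", type="user"} 1
fastly_token_created_timestamp_seconds{token_id="abc123"} 1.6338816e+09
fastly_token_last_used_timestamp_seconds{token_id="abc123"} 1.6660512e+09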

Expose fastly rt api 'recorded' timestamp in prometheus metrics

The fastly real-time api includes a recorded field in the response, which is a timestamp of when a metric was generated. This is currently represented in the APIResponse struct.

Prometheus can be asked to honour the metric timestamps seen when scraping a target.

And the prometheus golang client allows for the creation of metrics with timestamps.

This issue is to update the fastly-exporter to use the recorded field as the metric timestamp when converting fastly metrics to prometheus ones. Doing this could help keep fastly metrics from being offset from other scrape targets.
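
A minimal sketch of the idea using the Prometheus Go client, assuming a custom collector that holds the decoded values and the recorded field for one service (names are illustrative, not the exporter's actual types):

package rt

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// rtCollector holds the most recent real-time API values for one service.
type rtCollector struct {
	requestsDesc *prometheus.Desc
	serviceID    string
	requests     float64
	recorded     int64 // the "recorded" field from the API response (Unix seconds)
}

func (c *rtCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.requestsDesc }

func (c *rtCollector) Collect(ch chan<- prometheus.Metric) {
	m := prometheus.MustNewConstMetric(c.requestsDesc, prometheus.CounterValue, c.requests, c.serviceID)
	// Stamp the sample with the API-provided timestamp instead of the scrape time.
	ch <- prometheus.NewMetricWithTimestamp(time.Unix(c.recorded, 0), m)
}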

Metric Timestamps

While it's normally not best practice to expose timestamps, I think this exporter may need them.

I have been seeing issues with update vs scrape time alignment.

This may also be an artifact of the latency for ingesting updates from the real-time API. If the update comes in just slightly after the scrape, the update will not include some data. But the next scrape will catch up with the current value.

This causes artifacts in the graphs.

(Screenshot, 2021-10-04: graph showing the scrape-alignment artifacts.)

Miss duration histogram has fewer buckets in v6.0.0 alpha.1 release

Hello! Thanks for your work on this project. It's super useful getting this data into Prometheus.

I've started using the v6.0.0-alpha.1 release since it makes it easier to get 429 response rates. Overall the release is working great.

The problem I'm having is that the MissDurationSeconds histogram only has three buckets for durations greater than 1 second (2.5, 5, and 10). In v5.0.0, there were double the number of buckets for durations greater than 1 second (2, 4, 8, 16, 32, 60).

In practice, I think this means I'm getting less accurate data on p99 miss latency. I'm seeing about a 300-400ms difference compared to before. Obviously, this issue will be experienced differently by users based on their specific response time patterns.

If it's desirable, I'm happy to submit a PR to either switch the bucket values back to their previous configuration, or to add a command-line option (e.g. -miss-duration-buckets 0.005,0.01,0.025,0.05,0.1,0.25,0.5,1,2,4,8,10) to allow the bucket configuration to be specified at runtime.

Fastly's API returns a significant amount of buckets:

The miss_histogram object is a histogram. Each key is the upper bound of a span of 10 milliseconds, and the values are the number of requests to origin during that 10ms period. Any origin request that takes more than 60 seconds to return will be in the 60000 bucket.

From my limited querying of the API, I seem to see the following pattern for buckets from Fastly:

  • 1ms buckets from 0-10ms
  • 10ms buckets from 10-250ms
  • 50ms buckets from 250-1000ms
  • 100ms buckets from 1000-3000ms
  • 500ms buckets from 3000-60000ms

Exporter should set its own custom user-agent

It would be nice if the exporter could set a custom User-Agent string so that it can be identified by downstream APIs, such as User-Agent: fastly-exporter v1.2.3.

I'm happy to take this work on next week, but detailing here before I forget.
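
A small sketch of one way to do this in Go: wrap the HTTP client's transport so every request to the Fastly APIs carries the version string (type and field names are illustrative):

package rt

import "net/http"

// userAgentTransport stamps every outgoing request with a custom User-Agent.
type userAgentTransport struct {
	userAgent string
	next      http.RoundTripper
}

func (t *userAgentTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	req = req.Clone(req.Context()) // avoid mutating the caller's request
	req.Header.Set("User-Agent", t.userAgent)
	return t.next.RoundTrip(req)
}

The client would then be built as &http.Client{Transport: &userAgentTransport{userAgent: "fastly-exporter v1.2.3", next: http.DefaultTransport}}.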

Pagination is not working properly (not fetching every page)

Hi!
After updating to the latest version we noticed that we were missing some services.
On version 7:
level=debug component=api.fastly.com refresh_took=749.283548ms total_service_count=127 accepted_service_count=0
On version 6 or before:
level=debug component=api.fastly.com refresh_took=2.052766648s total_service_count=7528

We have 76 pages of services, and the 127 services found by version 7 come from only the first page and the last one. So the pagination in the exporter is not working properly: it fetches only the first and last pages rather than every page.

Thank you!

RT API meta-metric

It would be useful to have a meta-metric like fastly_rt_up that indicates if the exporter has a valid connection to the RT API for each service.
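
A rough sketch of how such a gauge could be declared with the Prometheus Go client, reusing the fastly/rt namespace and subsystem the exporter already uses (the label set is an assumption):

package rt

import "github.com/prometheus/client_golang/prometheus"

// rtUp would be set to 1 while the exporter has a healthy rt.fastly.com
// subscription for a service, and 0 otherwise.
var rtUp = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Namespace: "fastly",
		Subsystem: "rt",
		Name:      "up",
		Help:      "Whether the exporter currently has a working connection to rt.fastly.com for this service.",
	},
	[]string{"service_id", "service_name"},
)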

Missing Docker images

It looks like mrnetops/fastly-exporter is missing the v6.1.0 release on Docker Hub.

Support scraping multiple services

It looks like we can only specify a single Fastly service ID in the exporter right now.

Some shops have multiple services/service IDs set up, so it would be useful to be able to pass a comma-separated string or something and scrape multiple services.

`-metric-blocklist` doesn't work with `fastly_rt_datacenter_info`

The -metric-blocklist can't filter out the metric fastly_rt_datacenter_info.

Here is how to reproduce the issue with the latest stable version:

docker run \
  --env FASTLY_API_TOKEN="<your token>" \
  --interactive \
  --publish="0.0.0.0:8080:8080" \
  --rm \
  --tty \
  ghcr.io/fastly/fastly-exporter:v7.6.1 \
  -metric-blocklist='^fastly_rt_datacenter_info$'
curl -s http://127.0.0.1:8080/metrics | grep fastly_rt_datacenter_info

As you can see, the fastly_rt_datacenter_info metric continues to be exported, even when explicitly filtered out.


PS: Kudos for maintaining this exporter. It is really handy! πŸ™πŸΌ

Better logging for rt.fastly.com (Client.Timeout exceeded while awaiting headers)

Because of how fastly-exporter will wait for new stats to be published for services, we tend to get a ton of logging like this for services that are simply not handling requests, and so not generating stats.

level=error component=rt.fastly.com service_id=xxx during="execute request" err="Get "https://rt.fastly.com/v1/channel/xxx/ts/1666656765\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)"

This can make it hard to suss out whether there are in fact errors with, or connecting to, rt.fastly.com, versus simply having a number of idle services. This is a problem that is going to scale with the number of services in play in the account in question (assuming more services overall increases the incidence and volume of idle services).

Possibly these errors should be reclassified as info, as they are byproducts of the intended use case of connecting and listening for stat updates. And/or we should have better logging for when there are real issues (connection refused, non-2xx responses, etc.).

Short term, I have attempted to minimize the spurious errors with -rt-timeout 120s to increase the likelihood of a service request resulting in a stat response.

Interestingly, that seems to have tentatively addressed all of the errors, which makes me wonder if there is an interaction with a maximum time to stat response from rt.fastly.com, even if stats are zero. So possibly, raise that default to > the maximum stat response time from rt.fastly.com (if that is in fact what is happening)?

Race condition between processing and scraping

There is a race condition when a scrape happens while the per-datacenter metrics are being incremented. When the results are processed from the real-time stats API, the exporter iterates over the response and increments the metrics per datacenter. If a scrape happens during that processing loop, the reported metrics won't include all metrics for all datacenters, since the response from the real-time API hasn't finished processing yet. As a result, that scrape doesn't report all of the data from the last second of real-time data. I was able to easily reproduce this by adding an artificial delay in the processing loop to force the scrape to happen in the middle of the loop. This can cause interesting graphs when running queries like:

(sum(rate(fastly_rt_requests_total[1m])) by(service_id)- (
sum(rate(fastly_rt_tls_total[1m]))by(service_id) ))

This line should be flat:

(Screenshot, 2023-07-06: the query result shows spikes instead of a flat line.)

A potential solution is to add some locking so that every scrape is guaranteed to have a full set of data from any given response from the API. This has some performance implications especially when running against many services.
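
A sketch of that locking idea (type and field names are illustrative, not the exporter's actual code): a full API response is applied under a write lock, and scrapes read under a read lock, so a scrape never observes a partially applied response.

package rt

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

type serviceMetrics struct {
	mu       sync.RWMutex
	requests *prometheus.CounterVec // per-datacenter counters, e.g. fastly_rt_requests_total
}

// apply processes one real-time API response (one entry per datacenter) atomically.
func (s *serviceMetrics) apply(byDatacenter map[string]float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for dc, n := range byDatacenter {
		s.requests.WithLabelValues(dc).Add(n)
	}
}

// Collect implements prometheus.Collector; it runs either before or after a
// full apply, never in the middle of one.
func (s *serviceMetrics) Collect(ch chan<- prometheus.Metric) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	s.requests.Collect(ch)
}

func (s *serviceMetrics) Describe(ch chan<- *prometheus.Desc) {
	s.requests.Describe(ch)
}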

Thanks to @mrnetops for reporting.

Continue starting up if fastly is down

Right now if Fastly is down or a temporary network glitch is occurring during startup the exporter crashes:

fastly # [    7.238528] prometheus-fastly-exporter-start[935]: level=error component=api.fastly.com during="initial API calls" err="error executing API services request: Get \"https://api.fastly.com/service\": dial tcp: lookup api.fastly.com: Temporary failure in name resolution"
fastly # [    7.242486] systemd[1]: prometheus-fastly-exporter.service: Main process exited, code=exited, status=1/FAILURE

For me it would be preferable for the exporter to continue starting up but emit a metric saying Fastly seems to be down.

re https://github.com/NixOS/nixpkgs/pull/151427/files#diff-e669f3682eb07a05197060a93f278ff57f160465c8168906e12f1f2c472026d8R262-R273
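
A minimal sketch of what that could look like (the gauge name and initialRefresh function are hypothetical, standing in for the exporter's "initial API calls"):

package main

import (
	"log"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// initialRefresh stands in for the exporter's initial api.fastly.com calls.
func initialRefresh() error { return nil }

func main() {
	// fastly_api_up is a hypothetical gauge: 1 when the last refresh succeeded, 0 otherwise.
	apiUp := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "fastly_api_up",
		Help: "Whether the last refresh against api.fastly.com succeeded.",
	})
	prometheus.MustRegister(apiUp)

	// Retry with backoff instead of exiting on the first failure.
	delay := time.Second
	for {
		err := initialRefresh()
		if err == nil {
			apiUp.Set(1)
			break
		}
		apiUp.Set(0)
		log.Printf("initial API refresh failed: %v; retrying in %s", err, delay)
		time.Sleep(delay)
		if delay < time.Minute {
			delay *= 2
		}
	}
	// ... continue with normal startup (HTTP listener, per-service subscriptions) ...
}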

Reduce output size of metrics endpoint

Problem

Currently, when collecting stats for 201 services after running the exporter for 13 days with 12 shards the metrics endpoint output size is as follows:

Shard   Services   Payload (KB)
1       18         30,792
2       25         55,378
3       16         40,243
4       15         29,123
5       21         34,345
6       22         40,100
7       10         19,948
8       19         47,790
9       11         20,234
10      15         40,499
11      19         37,366
12      19         29,092
Total   210        424,910

With a scrape interval of 60 seconds, the bandwidth requirement becomes 7,082 KB/s. In terms of storage, this is 424,910 KB per scrape × 60 scrapes per hour × 24 hours ≈ 584 GB of raw data per day.

This can cause considerable impact on Prometheus scraping performance as this is a very large payload.

Proposal

Currently, each datacenter is a label, which multiplies the number of series for each metric. When combined with a metric that has a status_code label, this can explode the number of metrics returned.

A possible solution to reduce the output size of the metrics endpoint would be to aggregate the datacenter.

Analysis of how this might impact the output size for the earlier example is as follows:

Shard   Services   Payload (KB)
1       17         645
2       25         934
3       16         607
4       14         531
5       21         796
6       21         786
7       10         394
8       18         686
9       10         398
10      15         582
11      19         718
12      15         569
Total   201        7,646

With a scrape interval of 60 seconds, the bandwidth requirement becomes 127 KB/s. In terms of storage, this is 7,646 KB per scrape × 60 scrapes per hour × 24 hours ≈ 11 GB of raw data per day.

A comparison to the results with having individual datacenter metrics shows the following improvements:

Datacenter   Payload (KB)   Rate (KB/s)   Storage (daily, GB)   Reduction
Individual   424,910        7,082         584
Aggregated   7,646          127           11                    98%

A side effect of aggregating datacenter metrics is that memory consumption should also be reduced. It is hard to determine the exact impact, but there should certainly be some improvement.

Conclusion

Aggregated datacenter metrics would provide an option for users who wish to reduce the metrics endpoint output size. By providing this as an option (not the default), users could decide whether the benefits of reducing the output size outweigh the loss of individual datacenter metrics.

fastly-exporter-2.1.0-linux-amd64 doesn't work in alpine

fastly-exporter-2.1.0-linux-amd64 doesn't work in an Alpine container.

fastly-exporter-2.0.0-linux-amd64:

➜  fastly-exporter cat Dockerfile-2.0.0
FROM alpine:3.8

RUN apk add --no-cache ca-certificates

RUN wget https://github.com/peterbourgon/fastly-exporter/releases/download/v2.0.0/fastly-exporter-2.0.0-linux-amd64 -O /fastly-exporter && chmod a+x fastly-exporter

ENTRYPOINT ["/fastly-exporter", "-endpoint", "http://0.0.0.0:8080/metrics"]
➜  fastly-exporter docker build --pull -t fastly-exporter-2.0.0 -f Dockerfile-2.0.0 .
[…]
➜  fastly-exporter docker run --rm fastly-exporter-2.0.0:latest
level=error err="-token is required"

fastly-exporter-2.1.0-linux-amd64:

➜  fastly-exporter cat Dockerfile-2.1.0
FROM alpine:3.8

RUN apk add --no-cache ca-certificates

RUN wget https://github.com/peterbourgon/fastly-exporter/releases/download/v2.1.0/fastly-exporter-2.1.0-linux-amd64 -O /fastly-exporter && chmod a+x fastly-exporter

ENTRYPOINT ["/fastly-exporter", "-endpoint", "http://0.0.0.0:8080/metrics"]
➜  fastly-exporter docker build --pull -t fastly-exporter-2.1.0 -f Dockerfile-2.1.0 .
[…]
➜  fastly-exporter docker run --rm fastly-exporter-2.1.0:latest
standard_init_linux.go:190: exec user process caused "no such file or directory"

It looks like the 2.1.0 binary is not statically linked.

/ # ldd fastly-exporter-2.1.0-linux-amd64
	/lib64/ld-linux-x86-64.so.2 (0x7fa4e59f2000)
	libpthread.so.0 => /lib64/ld-linux-x86-64.so.2 (0x7fa4e59f2000)
	libc.so.6 => /lib64/ld-linux-x86-64.so.2 (0x7fa4e59f2000)
/ # ldd fastly-exporter-2.0.0-linux-amd64
ldd: fastly-exporter-2.0.0-linux-amd64: Not a valid dynamic program
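
A common fix for this class of problem, assuming the dynamic linkage comes from cgo (for example the net package's DNS resolver), is to build the release binary with cgo disabled so it is statically linked and runs on Alpine:

CGO_ENABLED=0 go build -o fastly-exporter ./cmd/fastly-exporter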

remove tls_version="any" from fastly_rt_tls_total

This essentially includes a total field in the metric itself, which violates Prometheus best practices.

Per https://prometheus.io/docs/practices/naming/

As a rule of thumb, either the sum() or the avg() over all dimensions of a given metric should be meaningful (though not necessarily useful).

This means that a simple sum(rate(fastly_rt_tls_total[1m])) by (service_name) ends up double counting the values (v10 + v11 + v12 + v13 + any), as seen in this normalized snippet:

fastly_rt_tls_total{datacenter="YYZ",service_id="XXX",service_name="XXX",tls_version="any"} 11686
fastly_rt_tls_total{datacenter="YYZ",service_id="XXX",service_name="XXX",tls_version="v10"} 0
fastly_rt_tls_total{datacenter="YYZ",service_id="XXX",service_name="XXX",tls_version="v11"} 0
fastly_rt_tls_total{datacenter="YYZ",service_id="XXX",service_name="XXX",tls_version="v12"} 11686
fastly_rt_tls_total{datacenter="YYZ",service_id="XXX",service_name="XXX",tls_version="v13"} 0

We shouldn't be adding any total or sum subvalue that will impact totaling or summing the metric itself.
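
Until that changes, one workaround is to exclude the aggregate series at query time, for example:

sum(rate(fastly_rt_tls_total{tls_version!="any"}[1m])) by (service_name)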

Cannot pull new Docker image

Hello,

I'm receiving this message when trying to pull from the new place:

$ docker pull ghcr.io/fastly/fastly-exporter:latest
Error response from daemon: Head "https://ghcr.io/v2/fastly/fastly-exporter/manifests/latest": unauthorized

Thank you.

How can we get a Dockerfile for this app?

I would love to work with this exporter in a container environment like AWS or GCP. Have you stopped working on a Dockerfile for this project?

Thanks for this project so far!

iterating http_sd support towards generic discovery

I believe we can make http_sd work generically out of the box without relabel_configs.

By explicitly setting the __param_target label per target (and ideally using the host header and port from the request to set the target:port), we can get service discovery to work without any jiggery pokery.

I used https://github.com/pagarme/static-response-server to host the following

[
  {
    "targets": [
      "fastly-exporter:8080"
    ],
    "labels": {
      "__param_target": "0AizkuJPvMmqhulU7fXXXX"
    }
  },
  {
    "targets": [
      "fastly-exporter:8080"
    ],
    "labels": {
      "__param_target": "0KO5PPKDAMlzAQ22fsXXXX"
    }
  }
]

along with the following minimal http_sd_configs

  - job_name: 'fastly-exporter'

    http_sd_configs:
            - url: http://static-content:7070

and everything worked out of the box.

Idle services also trigger net/http "Client.Timeout exceeded while awaiting headers"

I just noticed that one of my idle developer services is regularly reporting timeouts

level=error component=monitors service_id=xxxxxxxxxxxxxxxxxx service_name=tmol.com err="Get https://rt.fastly.com/v1/channel/xxxxxxxxxxxxxxxxxx/ts/1541720127: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
level=error component=monitors service_id=xxxxxxxxxxxxxxxxxx service_name=tmol.com err="Get https://rt.fastly.com/v1/channel/xxxxxxxxxxxxxxxxxx/ts/1541720219: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
level=error component=monitors service_id=xxxxxxxxxxxxxxxxxx service_name=tmol.com err="Get https://rt.fastly.com/v1/channel/xxxxxxxxxxxxxxxxxx/ts/1541720441: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
level=error component=monitors service_id=xxxxxxxxxxxxxxxxxx service_name=tmol.com err="Get https://rt.fastly.com/v1/channel/xxxxxxxxxxxxxxxxxx/ts/1541720828: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
level=error component=monitors service_id=xxxxxxxxxxxxxxxxxx service_name=tmol.com err="Get https://rt.fastly.com/v1/channel/xxxxxxxxxxxxxxxxxx/ts/1541721003: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"

Looking at the corresponding Fastly dashboard, there is a note saying "There is currently no new data. Graphs will resume when data is received."

Should we update the error messaging to note "This may also be an idle service", or is there a better way to differentiate no data vs. timeout/connectivity problems?

cardinality explosion with fastly_rt_datacenter_info

fastly_rt_datacenter_info looks to have an unintended cardinality explosion.

Instead of the expected ~100 pop time series, it's getting multiplied by service, so we're getting 10s of thousands of time series instead.

i.e.
fastly_rt_datacenter_info{datacenter="ACC", group="Africa", latitude="5.573", longitude="-0.203", name="Ghana", service="XXX" }

fastly_rt_service_info for group_left/group_right metadata

https://www.robustperception.io/exposing-the-software-version-to-prometheus
https://www.robustperception.io/how-to-have-labels-for-machine-roles

Technically, we could do all kinds of fun stuff like

  • fastly_rt_service_info
  • fastly_rt_datacenter_info
  • fastly_rt_domain_info
  • fastly_rt_customer_info

and stuff all sorts of other one-off tidbits in without binding them to the individual metrics.

We could even do service_name that way

something like

fastly_rt_service_info{service_id="$SID", service_name="$NAME", customer_id="$CID"} 1
fastly_rt_datacenter_info{datacenter_code="$CODE", datacenter_name="$NAME", datacenter_group="$GROUP", datacenter_shield="$SHIELD"} 1
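
Assuming a fastly_rt_service_info metric like the one sketched above existed, the metadata could then be joined onto any series at query time, e.g.:

sum(rate(fastly_rt_requests_total[1m])) by (service_id)
  * on (service_id) group_left (customer_id)
    fastly_rt_service_info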

Add discovery/scrape-time service selection

With Prometheus 2.28, there is now a generic http service discovery.

The exporter can now produce an API output that lists all of the available services so that they can be scraped independently. This improves the performance of ingestion by spreading it out over time and allowing Prometheus to ingest the data over multiple target threads.

On the Prometheus side, you would configure the job like this:

scrape_configs:
- job_name: fastly
  metrics_path: /fastly
  http_sd_configs:
  - url: http://fastly-exporter:8080/sd
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: fastly-exporter:8080

The /service-discovery endpoint would output json like this:

[
  { 
    "targets": [
      "<Service ID 1>",
      "<Service ID 2>",
      "<Service ID 3>",
      "<Service ID ...>"
    ]
  }
]

The relabel_config would then produce exporter URLs like /fastly?service=<Service ID 1>.

Question: Strategy for large number of services

Hi There,
Thanks again for your time developing this software, it's helped us out immensely thus far. We've been running an old version (version 0.x) for quite a while and it's been good to us. Right now we're running two fastly_exporter instances with approximately 150 services each on two VMs to share the load. I'm interested in the auto-discovery feature that you've implemented in the new versions of this exporter but I have concerns about how I can manage a large number of Fastly services with it.

In total, we have approximately 900 Fastly services deployed to one Fastly account. As one can imagine, if I were to even attempt to boot the fastly_exporter with autodiscovery enabled, it would only be a bad time. Up until this point, I'd been manually curating our list of 'important services' to monitor with the exporter by manually filtering out our Staging environments, etc.

I was wondering if there were any existing strategies out there for dealing with an excessive number of Fastly properties with the exporter, and how one might go about architecting the exporter and the prometheus ingestion to deal with these volumes.

A couple of key things come to mind:

  • Might be nice/necessary to distribute & co-ordinate chunks of services to different instances of the exporter
  • Perhaps a feature to be able to exclude/include services based on a regular expression command line flag? This, in combination with the already-existing autodiscovery feature could be a viable method of dynamically and predictably consuming a sub-section of work. (Lots of our services are convention-based names, and this would make it easy to filter out in bulk)

Support rt.fastly.com "demo" channel

 ./fastly-exporter-3.0.1-linux-amd64 -token xyz -service demo
level=info prometheus_addr=127.0.0.1:8080 path=/metrics namespace=fastly subsystem=rt
level=info component=api.fastly.com filtering_on="explicit service IDs" count=1
level=error component=api.fastly.com during="initial service refresh" err="error decoding API services response: json: cannot unmarshal object into Go value of type []api.Service"

There is a special channel ID demo which is used for the analytics on the fastly.com
(https://www.fastly.com) homepage.

Source: https://docs.fastly.com/api/analytics#channels

Useful for populating test data
