
Prometheus exporter for the Confluent Cloud Metrics API

Home Page: https://docs.confluent.io/current/cloud/metrics-api.html


ccloudexporter's Introduction

DEPRECATED - Use export endpoint instead

As of December 2021, Confluent recommends using the export endpoint of the Confluent Cloud Metrics API to extract metrics instead of running a separate service such as the ccloudexporter. This endpoint can be scraped directly with a Prometheus server or other Open Metrics compatible scrapers.
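
For reference, scraping the export endpoint directly from Prometheus might look roughly like the sketch below. This is only an illustration, not part of ccloudexporter: the endpoint path, the resource.kafka.id parameter, and the job name are assumptions to verify against the Metrics API documentation linked above.

# Hypothetical prometheus.yml snippet scraping the Metrics API export endpoint directly.
scrape_configs:
  - job_name: confluent-cloud          # assumed job name
    scrape_interval: 1m
    scrape_timeout: 1m
    honor_timestamps: true
    scheme: https
    metrics_path: /v2/metrics/cloud/export   # verify against the Metrics API docs
    params:
      resource.kafka.id:
        - lkc-abc123                   # cluster ID, as used elsewhere in this README
    basic_auth:
      username: <CCLOUD_API_KEY>
      password: <CCLOUD_API_SECRET>
    static_configs:
      - targets:
          - api.telemetry.confluent.cloud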

Prometheus exporter for Confluent Cloud Metrics API

A simple Prometheus exporter that can be used to extract metrics from the Confluent Cloud Metrics API. By default, the exporter exposes the metrics on port 2112. When launched with Docker Compose, metrics are displayed on a Grafana dashboard at http://localhost:3000 (admin/admin).

To use the exporter, the following environment variables need to be specified:

  • CCLOUD_API_KEY: The API Key created with ccloud api-key create --resource cloud
  • CCLOUD_API_SECRET: The API Key Secret created with ccloud api-key create --resource cloud

The CCLOUD_API_KEY and CCLOUD_API_SECRET environment variables are used to authenticate against the https://api.telemetry.confluent.cloud endpoint.

Usage

./ccloudexporter [-cluster <cluster_id>] [-connector <connector_id>] [-ksqlDB <app_id>] [-schemaRegistry <sr_id>]

Options

Usage of ./ccloudexporter:
  -cached-second int
    	Number of seconds that data will be cached in memory and returned to Prometheus. This is a mechanism to protect the Metrics API from being flooded. (default 30)
  -cluster string
    	Comma-separated list of cluster IDs to fetch metrics for. If not specified, the environment variable CCLOUD_CLUSTER will be used
  -config string
    	Path to the configuration file used to override the default behavior of ccloudexporter
  -connector string
    	Comma-separated list of connector IDs to fetch metrics for. If not specified, the environment variable CCLOUD_CONNECTOR will be used
  -delay int
    	Delay, in seconds, before fetching the metrics, in order to avoid temporary data points. (default 120)
  -endpoint string
    	Base URL for the Metrics API (default "https://api.telemetry.confluent.cloud/")
  -granularity string
    	Granularity for the metrics query (default "PT1M", i.e. 1 minute)
  -ksqlDB string
    	Comma-separated list of ksqlDB applications to fetch metrics for. If not specified, the environment variable CCLOUD_KSQL will be used
  -schemaRegistry string
    	Comma-separated list of Schema Registry IDs to fetch metrics for. If not specified, the environment variable CCLOUD_SCHEMA_REGISTRY will be used
  -listener string
    	Listener for the HTTP interface (default ":2112")
  -log-pretty-print
    	Pretty print the JSON log output (default true)
  -no-timestamp
    	Do not propagate the timestamp from the metrics API to Prometheus
  -timeout int
    	Timeout, in seconds, for all REST calls to the Metrics API (default 60)
  -verbose
    	Print trace level logs to stdout
  -version
    	Print the current version and exit

Examples

Building and executing

go get github.com/Dabz/ccloudexporter/cmd/ccloudexporter
go install github.com/Dabz/ccloudexporter/cmd/ccloudexporter
export CCLOUD_API_KEY=ABCDEFGHIKLMNOP
export CCLOUD_API_SECRET=XXXXXXXXXXXXXXXX
./ccloudexporter -cluster lkc-abc123

Using Docker

docker run \
  -e CCLOUD_API_KEY=$CCLOUD_API_KEY \
  -e CCLOUD_API_SECRET=$CCLOUD_API_SECRET \
  -e CCLOUD_CLUSTER=lkc-abc123 \
  -p 2112:2112 \
  dabz/ccloudexporter:latest

Using Docker Compose

export CCLOUD_API_KEY=ABCDEFGHIKLMNOP
export CCLOUD_API_SECRET=XXXXXXXXXXXXXXXX
export CCLOUD_CLUSTER=lkc-abc123
docker-compose up -d

In addition to the metrics exporter and Prometheus containers, the Docker Compose launch starts a Grafana instance on http://localhost:3000 (admin/admin), pre-provisioned with a Prometheus datasource for the Confluent Cloud metrics and a default dashboard.

The Docker Compose service definitions include data volumes for both Prometheus and Grafana, so metrics data will be retained following docker-compose down and restored when containers are started again. To remove these volumes and start with empty Prometheus and Grafana databases, run docker-compose down --volumes.

Using Kubernetes

Kubernetes deployment with the Prometheus Operator. The following steps assume a Prometheus Operator is already running in the cluster with the label release=monitoring. Add the list of cluster IDs, separated by spaces, in ./kubernetes/ccloud_exporter.env, for example: CCLOUD_CLUSTERS=<cluster_id1> <cluster_id2> ....

cp ./ccloud_exporter.env-template ./kubernetes/ccloud_exporter.env
cd ./kubernetes
vim ./ccloud_exporter.env
make install

A Deployment and a Service object are deployed to a unique namespace. A ServiceMonitor CRD is deployed to the Prometheus Operator namespace.
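
For illustration, the ServiceMonitor could look roughly like the sketch below. This is not the object generated by make install; the resource name, the app label, and the port name are assumptions and should be adjusted to the actual manifests in ./kubernetes.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ccloud-exporter            # hypothetical name
  labels:
    release: monitoring            # must match the Prometheus Operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: ccloud-exporter         # assumed label on the exporter Service
  namespaceSelector:
    any: true
  endpoints:
    - port: metrics                # assumed port name exposing :2112
      interval: 60s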

To delete the deployment: cd ./kubernetes && make remove

Configuration file

For more advanced deployments, you can specify a YAML configuration file with the -config flag. If you do not provide a configuration file, the exporter builds one from the provided flags.

Configuration

Global configuration

Key Description Default value
config.http.baseurl Base URL for the Metrics API https://api.telemetry.confluent.cloud/
config.http.timeout Timeout, in seconds, for all REST calls to the Metrics API 60
config.listener Listener for the HTTP interface :2112
config.noTimestamp Do not propagate the timestamp from the metrics API to Prometheus false
config.delay Delay, in seconds, before fetching the metrics, in order to avoid temporary data points 120
config.granularity Granularity for the metrics query, by default 1 minute PT1M
config.cachedSecond Number of seconds that data will be cached in memory and returned to Prometheus 30
rules List of rules that need to be executed to fetch metrics

Rule configuration

Key Description
rules.clusters List of Kafka clusters to fetch metrics for
rules.connectors List of connectors to fetch metrics for
rules.ksqls List of ksqlDB applications to fetch metrics for
rules.schemaRegistries List of Schema Registry IDs to fetch metrics for
rules.labels Labels to expose to Prometheus and to group by in the query
rules.topics Optional list of topics to filter the metrics
rules.metrics List of metrics to gather

Examples of configuration files

  • A simple configuration to fetch metrics for a cluster: simple.yaml
  • A configuration to fetch metrics at the partition granularity for a few topics: partition.yaml

Default configuration

config:
  http:
    baseurl: https://api.telemetry.confluent.cloud/
    timeout: 60
  listener: 0.0.0.0:2112
  noTimestamp: false
  delay: 60
  granularity: PT1M
  cachedSecond: 30
rules:
  - clusters:
      - $CCLOUD_CLUSTER
    connectors:
      - $CCLOUD_CONNECTOR
    ksqls:
      - $CCLOUD_KSQL
    schemaRegistries:
      - $CCLOUD_SCHEMA_REGISTRY
    metrics:
      - io.confluent.kafka.server/received_bytes
      - io.confluent.kafka.server/sent_bytes
      - io.confluent.kafka.server/received_records
      - io.confluent.kafka.server/sent_records
      - io.confluent.kafka.server/retained_bytes
      - io.confluent.kafka.server/active_connection_count
      - io.confluent.kafka.server/request_count
      - io.confluent.kafka.server/partition_count
      - io.confluent.kafka.server/successful_authentication_count
      - io.confluent.kafka.connect/sent_bytes
      - io.confluent.kafka.connect/received_bytes
      - io.confluent.kafka.connect/received_records
      - io.confluent.kafka.connect/sent_records
      - io.confluent.kafka.connect/dead_letter_queue_records
      - io.confluent.kafka.ksql/streaming_unit_count
      - io.confluent.kafka.schema_registry/schema_count
    labels:
      - kafka_id
      - topic
      - type

Limits

In order to avoid reaching the limit of 1,000 points set by the Confluent Cloud Metrics API, the following soft limits have been established in the exporter (see the example after this list):

  • In order to group by partition, you need to specify one or more topics
  • You cannot specify more than 100 topics in a single rule
  • clusters, labels and metrics are required in each rule
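
As an illustration, a rule that respects these limits and groups by partition might look like the sketch below. The cluster ID, topic names, and the partition label name are assumptions for the example, not values from this repository.

rules:
  - clusters:
      - lkc-abc123
    topics:                        # required when grouping by partition
      - orders
      - payments
    metrics:
      - io.confluent.kafka.server/received_bytes
      - io.confluent.kafka.server/sent_bytes
    labels:                        # clusters, labels and metrics are required in each rule
      - kafka_id
      - topic
      - partition                  # assumed label name for partition-level grouping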

How to build

go get github.com/Dabz/ccloudexporter/cmd/ccloudexporter

Grafana

A Grafana dashboard is provided in the ./grafana/ folder.

Grafana Screenshot

Deprecated configuration

cluster_id is deprecated

Historically, the exporter and the Metrics API exposed the ID of the cluster with the label cluster_id. In the Metrics API V2, this label has been renamed to resource.kafka.id. It is now exposed by the exporter as kafka_id instead.

To avoid breaking existing dashboards, the exporter currently exposes the ID of the cluster as both cluster_id and kafka_id.
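
If you want to keep old dashboards working once cluster_id is eventually dropped, one option is to copy kafka_id back into cluster_id at scrape time with a Prometheus metric_relabel_configs rule. A minimal sketch, where the job name and target are assumptions:

scrape_configs:
  - job_name: ccloudexporter            # hypothetical job name
    static_configs:
      - targets: ['ccloudexporter:2112']
    metric_relabel_configs:
      - source_labels: [kafka_id]       # copy the new label ...
        regex: (.+)
        target_label: cluster_id        # ... into the legacy label expected by old dashboards
        replacement: $1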

Username/Password authentication is deprecated

In previous versions, it was possible to rely on a username and password to authenticate to Confluent Cloud. Now, only the API key/secret is officially supported to connect to the Metrics API.

To ensure backward compatibility, previous environment variables are still available. Nonetheless, username/password is now deprecated and you must rely on API key/secret.

Integrations

In the real world, customers want to integrate with their existing logging, monitoring, and alerting solutions. This section collects tooling examples to showcase how the Metrics API can be integrated.

Splunk

Let's take a look at how to view the Confluent Cloud metrics on a Splunk dashboard. A docker-compose file is provided that includes:

  • the ccloudexporter image, to pull metrics from the Confluent Cloud Metrics API
  • the kafka-lag-exporter image, to pull consumer-group metrics via the Admin client
  • Splunk's OpenTelemetry Collector image, which receives metrics from the Prometheus /metrics endpoints
  • Splunk's standalone container, to view the Analytics dashboard
  • This setup is done by tweaking the needed details here

How to run

See Also

For a tutorial that showcases how to use ccloudexporter, and steps through various failure scenarios to see how they are reflected in the provided metrics, see the Observability for Apache Kafka® Clients to Confluent Cloud tutorial.

ccloudexporter's People

Contributors

aaron-trout, angoothachap, awalther28, carlessanagustin, dabz, ganeshs, javabrett, kutysam, oorobfuoo, raytung, sebco59, sirianni, subkanthi, vdesabou, zohimi


ccloudexporter's Issues

Remove grouping per partition

It seems that grouping per partition generates a lot of data and makes the exporter harder to use. On top of that, it seems that not all metrics will be able to be grouped per partition in the future.

We should remove the grouping per partition; maybe we could reintroduce it later with more restrictions (e.g. only when a single topic is specified).

docker.errors.DockerException: Error while fetching server API version

While running 'docker-compose up -d' I am getting the error below.
I installed Docker and Docker Compose, and added my user to the docker group.

Traceback (most recent call last):
File "urllib3/connectionpool.py", line 677, in urlopen
File "urllib3/connectionpool.py", line 392, in _make_request
File "http/client.py", line 1277, in request
File "http/client.py", line 1323, in _send_request
File "http/client.py", line 1272, in endheaders
File "http/client.py", line 1032, in _send_output
File "http/client.py", line 972, in send
File "docker/transport/unixconn.py", line 43, in connect
PermissionError: [Errno 13] Permission denied

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "requests/adapters.py", line 449, in send
File "urllib3/connectionpool.py", line 727, in urlopen
File "urllib3/util/retry.py", line 410, in increment
File "urllib3/packages/six.py", line 734, in reraise
File "urllib3/connectionpool.py", line 677, in urlopen
File "urllib3/connectionpool.py", line 392, in _make_request
File "http/client.py", line 1277, in request
File "http/client.py", line 1323, in _send_request
File "http/client.py", line 1272, in endheaders
File "http/client.py", line 1032, in _send_output
File "http/client.py", line 972, in send
File "docker/transport/unixconn.py", line 43, in connect
urllib3.exceptions.ProtocolError: ('Connection aborted.', PermissionError(13, 'Permission denied'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "docker/api/client.py", line 214, in _retrieve_server_version
File "docker/api/daemon.py", line 181, in version
File "docker/utils/decorators.py", line 46, in inner
File "docker/api/client.py", line 237, in _get
File "requests/sessions.py", line 543, in get
File "requests/sessions.py", line 530, in request
File "requests/sessions.py", line 643, in send
File "requests/adapters.py", line 498, in send
requests.exceptions.ConnectionError: ('Connection aborted.', PermissionError(13, 'Permission denied'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "docker-compose", line 3, in
File "compose/cli/main.py", line 80, in main
File "compose/cli/main.py", line 189, in perform_command
File "compose/cli/command.py", line 70, in project_from_options
File "compose/cli/command.py", line 153, in get_project
File "compose/cli/docker_client.py", line 43, in get_client
File "compose/cli/docker_client.py", line 170, in docker_client
File "docker/api/client.py", line 197, in init
File "docker/api/client.py", line 222, in _retrieve_server_version
docker.errors.DockerException: Error while fetching server API version: ('Connection aborted.', PermissionError(13, 'Permission denied'))
[3257] Failed to execute script docker-compose

Do not expose all available metrics by default

The number of metrics available in the Confluent Cloud Metrics API is increasing every day. As we have more and more data, exposing all of it by default does not make sense. Instead, we should:

  • Have a default configuration with the most important metrics
  • Have a way to expose new metrics (either through the command line or through a configuration file).

Error on active_connection_count

Hi,

This kind of error happens for the active_connection_count metric.

Received status code 400 instead of 200 for POST on https://api.telemetry.confluent.cloud/v1/metrics/cloud/query with {"aggregations":[{"agg":"SUM","metric":"io.confluent.kafka.server/active_connection_count"}],"filter":{"op":"AND","filters":[{"field":"metric.label.cluster_id","op":"EQ","value":"lkc-aaaaa"}]},"granularity":"PT1M","group_by":["metric.label.topic"],"intervals":["2020-03-17T09:38:25+01:00/2020-03-17T09:39:25+01:00"],"limit":1000}

According to Confluent Cloud support, you can remove "group_by":["metric.label.topic"]

Gather consumer lag

Currently, the Metrics API does not expose the consumer lag, but we could retrieve it in other ways; e.g. the exporter could rely on the Admin API to expose it.

Request to add swagger.yaml

Even though this has only two endpoints, it would be nice to add a swagger.yaml to the repo and serve it as a path. This would formalise the API and also make automation easier.

Metrics using Docker / Kubernetes

Hello.
Is it possible to pass a configuration with other metrics, or to change the metrics we export from Confluent, when using Docker or Kubernetes?
Thank you in advance.

Integration with Azure Monitor Logs Question

Hi all,
I have followed https://docs.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-prometheus-integration
documentation, and currently we have a cluster-wide omsagent replicaset (singleton) scraping the ccloudexporter deployment in AKS via a Kubernetes service.

kubectl logs for the <ccloudexporter> pod says "Listening on http://:2112/metrics\n", and here is an example of one of the lines from kubectl logs for the <omsagent rs> pod:

> prometheus, address=x.x.x.x, scrapeUrl=http://ccloudexportersvc.ccloudexporternamespace.svc.cluster.local:2112/metrics go_memstats_heap_objects=2577

However, in the Azure portal > Kubernetes service > one of the clusters > Logs, I've run queries such as

InsightsMetrics
| where Namespace contains "prometheus"
| where Computer contains "<hostname/node of the omsagent rs>"

but the query returns no result.

Add a custom user agent to trace requests

The User-Agent should be specified with the format ccloudexporter/<commit version>, in order to help the Confluent Cloud team identify the origin of requests and easily trace the source of unusual workloads.

Feature request: add new config, pause duration between retries

Description

When running the ccloud-exporter container within Kubernetes, the CCloud Metrics API rate limit of 50 requests per second is easily hit, even when the pod has livenessProbe and readinessProbe disabled. An example of such an error:

  "error": "Received status code 429 instead of 200 for POST on https://api.telemetry.confluent.cloud//v2/metrics/cloud/query ()",
  "level": "error",
  "msg": "Query did not succeed",
  "optimizedQuery": {
    "aggregations": [
      {
        "agg": "SUM",
        "metric": "io.confluent.kafka.server/partition_count"
      }
    ],
    "filter": {
      "op": "AND",
      "filters": [
        {
          "op": "OR",
          "filters": [
            {
              "field": "resource.kafka.id",
              "op": "EQ",
              "value": "<redacted>"
            }
          ]
        }
      ]
    },
    "granularity": "PT1M",
    "group_by": [],
    "intervals": [
      "2021-09-06T17:09:00Z/PT1M"
    ],
    "limit": 1000
  },
  "response": {
    "data": null
  },
  "time": "2021-09-06T17:11:30Z"
}

Once the API rate limit is triggered, it tends to sustain itself in an infinite loop, probably because ccloud-exporter retries in quick succession without enough wait time between metric collections.

Proposed solution

Add a new config secondsBetweenRetry as the waiting time between retries when an access to the CCloud Metrics API has failed. Ideally this pause should apply to the individual API request, and not to the batch of requests (e.g. 9 at a time as in config.simple.yaml).

Even better, this pause duration could follow an exponential backoff: for example, start at 5 seconds, double at each new retry, and be capped at, say, 5 minutes.

Health endpoint

In order to incorporate this into a production monitoring solution, I'm looking for a "health" endpoint that would fail if this app is down. Is there such an endpoint, other than the metrics endpoint, which seems to hit the cluster every time?

Thanks a lot for this project

Gaps in data due to interval misses

I'm seeing pretty regular gaps in the data, probably due to interval misses. I noticed here we're scraping between now and the previous minute:
https://github.com/Dabz/ccloudexporter/blob/master/cmd/internal/scrapper/query.go#L67-L76

If prometheus easily supports updating values, I'd recommend just widening the query window to 5 minutes so you get data points. Updates on the fly would also help with values that are still stabilizing (you'll notice that we expose data right away rather than waiting for late arrivals, then update on the fly).

Repository License

Please add a central LICENSE document declaring the whole repository under the MIT license.

Grouping per cluster is broken

It seems that, if you have multiple clusters configured in the configuration file, the data points are no longer grouped by cluster.

Cause: The exporter relies on the "labels" field of the descriptor endpoint to find out which labels can be used to "group by" the metrics (https://github.com/Dabz/ccloudexporter/blob/master/cmd/internal/collector/collector.go#L144-L152). It seems that the Metrics API is no longer exposing the cluster_id in the list of labels.

Workaround: Have multiple rules in your configuration file, or multiple instances of the exporter.

Better indicate deprecation of username/password authentication

Username/password authentication to the Metrics API was only supported in the preview phase. For GA release, only API key/secret authentication is officially supported.

  • Rename the CCLOUD_USER and CCLOUD_PASSWORD environment variables to CCLOUD_APIKEY and CCLOUD_APISECRET respectively
    • The legacy environment variables may still need to be supported to maintain backwards compatibility
  • Update the README accordingly

Feature Request: Backfill data

I am not sure if this feature already exists. This may be another major iteration, or may need a different app altogether.

The Metrics API recently saw a downtime of a few hours.
The idea here is to have something like a simple API call to back-fill those few hours of data, with an overridable endpoint and metric format (e.g. InfluxDB or Datadog), all while maintaining the same metric names as exposed by Prometheus in real time.

Support exporting for multiple clusters

It would be useful to be able to support exporting for multiple clusters. The Metrics API can do this by using an OR:

"filter": { "filters": [ { "field": "metric.label.cluster_id", "op": "EQ", "value": "lkc-XXXX1" }, { "field": "metric.label.cluster_id", "op": "EQ", "value": "lkc-XXXX2" } ], "op": "OR" },

Results from the query can also be grouped by both the cluster id and the topic name to avoid collisions of topic names across clusters:

"group_by": [ "metric.label.cluster_id", "metric.label.topic" ],

Finally, you can also use the new GROUPED format by passing in the format parameter to make the grouping more explicit if it is helpful.

Feature request: expose additional endpoints for self health check (liveness and readiness)

Description

As of 2021-09-02, ccloud-exporter exposes only the endpoint localhost:2112/metrics. When an HTTP request is made on this /metrics endpoint, ccloud-exporter makes outgoing requests to the Confluent Cloud Metrics API, which is the normal and expected behaviour.

In the context of Kubernetes, ccloud-exporter runs within a pod that may have a livenessProbe and readinessProbe. As /metrics is the only endpoint exposed by ccloud-exporter, we might be tempted to use this endpoint to probe the readiness status of the ccloud-exporter container.

As a result, each probe of the /metrics endpoint triggers a batch of requests to the Confluent Cloud Metrics API. When the probe frequency is high (every 5 seconds in this example), the quick repeats of probing on the /metrics endpoint exhaust the CCloud Metrics API rate limit of 50 requests per minute.

{
  "Endpoint": "https://api.telemetry.confluent.cloud//v2/metrics/cloud/query",
  "StatusCode": 429,
  "body": "",
  "level": "error",
  "msg": "Received invalid response",
  "time": "2021-09-02T14:36:40Z"
}
{
  "error": "Received status code 429 instead of 200 for POST on https://api.telemetry.confluent.cloud//v2/metrics/cloud/query ()",
  "level": "error",
  "msg": "Query did not succeed",
  ... etc...
}

In this example, the API rate limit error (status 429) occurs within 15 seconds.
Then ccloud-exporter is stuck in an infinite loop of "StatusCode": 429, because Kubernetes endlessly probes the /metrics endpoint to check the health of the pod.

Suggestion

Add a separate endpoint for self health-check. For example: localhost:2113/selfcheck which returns OK if ccloud-exporter is in good shape. This helps Kubernetes to manage the life cycle of the container. For example, to restart the container if it is stuck in a non-functional state.

To reproduce the "StatusCode": 429

  • Uncomment the livenessProbe and readinessProbe sections in the manifest below and deploy it on your Kubernetes cluster.
  • Configure the value for the environment variables CCLOUD_...
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ccloud-exporter
  namespace: monitoring
  labels:
    app: ccloud-exporter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ccloud-exporter
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: ccloud-exporter
    spec:
      containers:
        - name: ccloud-exporter
          image: dabz/ccloudexporter:latest
          imagePullPolicy: IfNotPresent
          env:
            - name: CCLOUD_API_KEY
              value: CloudAPIKey?????
            - name: CCLOUD_API_SECRET
              value: CloudAPISecret?????
            - name: CCLOUD_CLUSTER
              value: lkc-?????
          ports:
            - name: metrics
              containerPort: 2112
              protocol: TCP
#         livenessProbe:
#           httpGet:
#             path: /metrics
#             port: metrics
#             scheme: HTTP
#           initialDelaySeconds: 30
#           timeoutSeconds: 30
#           periodSeconds: 15
#           successThreshold: 1
#           failureThreshold: 3
#         readinessProbe:
#           httpGet:
#             path: /metrics
#             port: metrics
#             scheme: HTTP
#           initialDelaySeconds: 30
#           timeoutSeconds: 30
#           periodSeconds: 5
#           successThreshold: 1
#           failureThreshold: 3
          resources:
            requests:
              cpu: "250m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "256Mi"
---
apiVersion: v1
kind: Service
metadata:
  name: ccloud-exporter-service
  namespace: monitoring
  labels:
    app: ccloud-exporter
spec:
  ports:
    - name: metrics
      protocol: TCP
      port: 2112
      targetPort: 2112
  selector:
    app: ccloud-exporter

Allow an exclusion list for topics

Currently, ccloudexporter allows filtering based on the topics specified.
It should also allow an exclusion list, where the listed topics are excluded from the Prometheus metrics endpoint.

Send queries asynchronously instead of synchronously

All queries are sent synchronously; as the number of metrics increases (and thus the number of queries to send), the scrape duration increases.
The exporter should execute the queries asynchronously in order to reduce the scrape duration (and avoid reaching the scrape timeout).

Collect schema_registry metrics (initial io.confluent.kafka.schema_registry/schema_count)

The V2 API added the resource io.confluent.kafka.schema_registry and an initial metric io.confluent.kafka.schema_registry/schema_count. The exporter could collect this metric.

Sample query:

{
    "aggregations": [
        {
            "agg": "SUM",
            "metric": "io.confluent.kafka.schema_registry/schema_count"
        }
    ],
    "filter": {
        "field": "resource.schema_registry.id",
        "op": "EQ",
        "value": "lsrc-xxxxx"
    },
    "granularity": "PT1H",
    "intervals": [
        "2021-02-23T11:00:00+11:00/P0Y0M0DT1H0M0S"
    ],
    "group_by": [
    "resource.schema_registry.id"
    ]
}

Example response/metric:

{
    "data": [
        {
            "timestamp": "2021-02-23T00:00:00Z",
            "value": 9.0,
            "resource.schema_registry.id": "lsrc-rw6m7"
        }
    ]
}

Fix CVE-2019-11254 in gopkg.in/yaml.v2 by upgrading to v2.2.8

During a recent vulnerability scan we ran internally, this was identified in the ccloudexporter binary.
Could I ask for a fix for this, please?

{
    "Target": "ccloudexporter",
    "Type": "gobinary",
    "Vulnerabilities": [
      {
        "VulnerabilityID": "CVE-2019-11254",
        "PkgName": "gopkg.in/yaml.v2",
        "InstalledVersion": "v2.2.5",
        "FixedVersion": "v2.2.8",
        "Layer": {
          "DiffID": "sha256:c87148c01e568bde3a58ce90550eb43596a0d9c36bb0bfcb25d31df097c8439f"
        },
        "SeveritySource": "nvd",
        "PrimaryURL": "https://nvd.nist.gov/vuln/detail/CVE-2019-11254",
        "Title": "kubernetes: Denial of service in API server via crafted YAML payloads by authorized users",
        "Description": "The Kubernetes API Server component in versions 1.1-1.14, and versions prior to 1.15.10, 1.16.7 and 1.17.3 allows an authorized user who sends malicious YAML payloads to cause the kube-apiserver to consume excessive CPU cycles while parsing YAML.",
        "Severity": "MEDIUM",
        "CVSS": {
          "nvd": {
            "V2Vector": "AV:N/AC:L/Au:S/C:N/I:N/A:P",
            "V3Vector": "CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H",
            "V2Score": 4,
            "V3Score": 6.5
          },
          "redhat": {
            "V3Vector": "CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H",
            "V3Score": 6.5
          }
        },
        "References": [
          "https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-11254",
          "https://github.com/kubernetes/kubernetes/issues/89535",
          "https://groups.google.com/d/msg/kubernetes-announce/ALL9s73E5ck/4yHe8J-PBAAJ",
          "https://groups.google.com/forum/#!topic/kubernetes-security-announce/wuwEwZigXBc",
          "https://linux.oracle.com/cve/CVE-2019-11254.html",
          "https://linux.oracle.com/errata/ELSA-2020-5653.html",
          "https://security.netapp.com/advisory/ntap-20200413-0003/"
        ],
        "PublishedDate": "2020-04-01T21:15:00Z",
        "LastModifiedDate": "2020-10-02T17:37:00Z"
      }
    ]
  }

Rely on a custom collector instead of GaugeVec

It seems that you have more control over the timestamp of the metrics if you implement a custom collector. This could be useful in the case of Confluent Cloud as the latest data point might not be accurate and we might need to update old data points.

Not able to access more than one cluster using docker-compose option

Hi

Using this code, I tried to connect to Confluent Cloud using the docker-compose method. It is able to connect and pull the metrics only when I pass CCLOUD_CLUSTER with a single cluster. I am having the following issues and need help resolving them.

  1. When passing a single cluster through CCLOUD_CLUSTER, I am not able to get the topics/partitions information.
  2. How do I make the docker-compose method pull information for more than one cluster, along with topics/partitions, etc.?

I would appreciate it if anyone is able to help me with this.

  • Bhaskar

403 when using latest image

Hello,

I've followed the exact instructions for both docker and Go.

I'm getting the following error message, even though the API key is valid:

{
   "Endpoint": "https://api.telemetry.confluent.cloud/v1/metrics/cloud/descriptors",
   "StatusCode": 403,
   "body": "eyJlcnJvciI6eyJjb2RlIjo0MDMsIm1lc3NhZ2UiOiJpbnZhbGlkIEFQSSBrZXkifX0K",
   "level": "fatal",
   "msg": "Received status code 403 instead of 200 for GET on https://api.telemetry.confluent.cloud/v1/metrics/cloud/descriptors. \n\n{\"error\":{\"code\":403,\"message\":\"invalid API key\"}}\n\n\n",
   "time": "2021-01-14T00:20:31Z"
 }

I would appreciate any help with this.

Pod keeps crashing/restarting

We are running this exporter and the Kubernetes pod keeps crashing and restarting. We aren't sure what's happening since no logging is occurring; we just know that the probe on /metrics fails and Kubernetes restarts the pod. Is there a way to get additional logging to figure out what's going on?

Scrapper --> Scraper

Recommend a find-and-replace of scrapper -> scraper to fix the typo, since it's used in many places.

x509: certificate signed by unknown authority

Hi All,

I am trying to run the exporter using the Docker command below to extract metrics from our Confluent Cloud setup.

docker run \
  -e CCLOUD_API_KEY=$CCLOUD_API_KEY \
  -e CCLOUD_API_SECRET=$CCLOUD_API_SECRET \
  -e CCLOUD_CLUSTER=lkc-abc123 \
  -p 2112:2112 \
  dabz/ccloudexporter:latest

But I am getting the following error:

{
  "error": "Get \"https://api.telemetry.confluent.cloud/v2/metrics/cloud/descriptors/resources\": x509: certificate signed by unknown authority",
  "level": "fatal",
  "msg": "HTTP query for the descriptor endpoint failed",
  "time": "2021-12-08T08:42:53Z"
}

Is it because we enabled the X.509 certificate at Confluent Cloud? Does anyone know how to solve it? I appreciate your help. Thanks.

Consider crashing if a 403 response is returned

It's easy to forget you have a ccloudexporter instance running against a cluster that has been torn down. This results in spamming 403 errors until someone notices. Since 403 errors are usually permanent failures that require user intervention to fix, consider crashing the ccloudexporter to fail fast if one is returned.

Misleading logs for metrics listener

This looks like a substitution error, but it's very misleading: it makes you think that there is something wrong with the host name.

{
"level": "info",
"msg": "Listening on http://:2112/metrics\n",
"time": "2021-07-22T18:34:31Z"
}

5xx response when querying telemetry api

When I raised this issue with Confluent support, they simply asked us to implement a retry mechanism, as recommended in the documentation. Is there a way this can be implemented in ccloudexporter?

{
"Endpoint": "https://api.telemetry.confluent.cloud//v1/metrics/cloud/query",
"StatusCode": 503,
"body": "upstream connect error or disconnect/reset before headers. reset reason: overflow",
"level": "error",
"msg": "Received invalid response",
"time": "2021-04-08T05:53:22Z"
}



{
"Endpoint": "https://api.telemetry.confluent.cloud//v1/metrics/cloud/query",
"StatusCode": 500,
"body": "{\"errors\":[{\"status\":\"500\",\"detail\":\"There was an error processing your request. It has been logged (ID xxxxxxxxxxxxxxx).\"}]}",
"level": "error",
"msg": "Received invalid response",
"time": "2021-04-08T05:59:30Z"
}

Kafka Output

The Prometheus format is available and works really well. To elevate ccloudexporter as the de facto standard for gathering CCloud metrics, instead of relying on multiple different solutions, I want to propose adding a Kafka sink to the code base.

The defaults don't need to change at all, but if we can provide a Kafka sink, it would give customers an option to stream the data to Kafka. Considering they might have a system streaming data off a sink connector to aggregation platforms like Splunk, ES, etc., this would mean just adding the topic name from this component to the sink connector config to stream API data from Kafka as well.

P.S.: I do understand that keeping the data of the system being monitored on the system itself is an anti-pattern, but if we can give a choice, a lot of customers might like it.

Environment Variables in config.yml are not resolved

Hi,

I created a config.yml file based on the default configuration in the README.md page, using:

rules:
  - clusters:
      - $CCLOUD_CLUSTER

On metrics fetch, I see errors such as:

Received status code 403 instead of 200 for POST on https://api.telemetry.confluent.cloud//v2/metrics/cloud/query ({\"errors\":[{\"status\":\"403\",\"detail\":\"Query must filter by at least one of your authorized resources

Running with docker-compose, I exec a shell in the ccloud_exporter container and display the environment variables:

$ env
...
CCLOUD_CLUSTER=lkc-xxxx1
...

When I set the real value (lkc-xxxx1) instead of the environment variable $CCLOUD_CLUSTER, metrics are correctly fetched on the cluster.

Timeout issue on EKS

I converted the ccloudexporter Kubernetes files into a Helm chart and am running into a timeout issue.

The deployment has these env vars set:

env:
- name: CCLOUD_API_KEY
  value: "vault:secret/grafana/kafka/ccloud#CCLOUD_API_KEY"
- name: CCLOUD_API_SECRET
  value: "vault:secret/grafana/kafka/ccloud#CCLOUD_API_SECRET"
- name: CCLOUD_CLUSTER
  value: {{ .Values.cluster }}

Seeing:

kubectl logs -n grafana ccloud-exporter-deployment-cdcbbbb67-wq9hr -f
{
  "error": "Get \"https://api.telemetry.confluent.cloud/v2/metrics/cloud/descriptors/resources\": dial tcp 52.38.184.52:443: i/o timeout",
  "level": "fatal",
  "msg": "HTTP query for the descriptor endpoint failed",
  "time": "2021-08-31T18:18:57Z"
}

which tells me the env vars (api_key|secret) are valid but the request is timing out.

Did a little test with a test pod:

› cat test.yaml                                                                                                                                                ☠️
apiVersion: v1
kind: Pod
metadata:
  name: test-pod-name
  namespace: grafana
spec:
  containers:
  - name: test-pod-name
    env:
    - name: CCLOUD_API_KEY
      value: vault:secret/grafana/kafka/ccloud#CCLOUD_API_KEY
    - name: CCLOUD_API_SECRET
      value: vault:secret/grafana/kafka/ccloud#CCLOUD_API_SECRET
    command: ["/bin/bash", "-c"]
    args:
    - curl -u $CCLOUD_API_KEY:$CCLOUD_API_SECRET https://api.telemetry.confluent.cloud/v2/metrics/cloud/descriptors/resources\?resource_type\=kafka

and I see:

› kubectl logs -n grafana test-pod-name -f
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   590  100   590    0     0      2      0  0:04:55  0:03:35  0:01:20   131
{"data":[{"type":"kafka","description":"A Kafka cluster","labels":[{"description":"ID of the Kafka cluster","key":"kafka.id"}]},{"type":"connector","description":"A Kafka Connector","labels":[{"description":"ID of the connector","key":"connector.id"}]},{"type":"ksql","description":"A ksqlDB application","labels":[{"description":"ID of the ksqlDB application","key":"ksql.id"}]},{"type":"schema_registry","description":"A schema registry","labels":[{"description":"ID of the schema registry","key":"schema_registry.id"}]}],"meta":{"pagination":{"page_size":100,"total_size":4}},"links":{}}%

and sometimes:

› kubectl logs -n grafana test-pod-name -f
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:04:17 --:--:--     0
curl: (28) Failed to connect to api.telemetry.confluent.cloud port 443: Connection timed out

Looks like it's taking too long and ends up timing out at times.

Any idea what could cause this in EKS?

timeout issue with docker compose

The ccloud_exporter container logs are below:

{
  "error": "Get \"https://api.telemetry.confluent.cloud/v2/metrics/cloud/descriptors/resources\": dial tcp: lookup api.telemetry.confluent.cloud on 127.0.0.11:53: read udp 127.0.0.1:57113-\u003e127.0.0.11:53: i/o timeout",
  "level": "fatal",
  "msg": "HTTP query for the descriptor endpoint failed",
  "time": "2021-09-21T14:12:54Z"
}

I tried to increase the timeout to 120 seconds as described in the README.md, but no luck.
flag.IntVar(&Context.HTTPTimeout, "timeout", 120, "Timeout, in second, to use for all REST call with the Metric API")

Thanks in advance!

Remove superfluous group_by when filtering by single value

When the query is filtering to a single metric.label.cluster_id:

"filter" : {
  "op": "EQ",
  "field": "metric.label.cluster_id",
  "value": "lkc-12345"
}

there is no need to also specify a group_by, since we know that all results have the same cluster_id (lkc-12345 in this example).

"group_by": [
  "metric.label.cluster_id"
]

This superfluous group_by causes the query to be more expensive on the backend. We can explore optimizing this out on the backend, but it is difficult to do since the filter can contain arbitrarily complex boolean expressions.

Kafka Output channel

The Prometheus format is available and works really well. To elevate ccloudexporter as the de facto standard for gathering CCloud metrics, instead of relying on multiple different solutions, I want to propose adding a Kafka sink to the code base.

The defaults don't need to change at all, but if we can provide a Kafka sink, it would give customers an option to stream the data to Kafka. For customers already streaming metrics from Kafka to an end system, this means only adding the topic name to the sink connector config to stream API data from Kafka as well.

P.S.: I do understand that keeping the data of the system being monitored on the system itself is an anti-pattern, but if we can give a choice, a lot of customers might like it.

Query intervals should be aligned to minute boundaries

The query interval timeFrom is taken by applying the configured delay to time.Now()

timeFrom := time.Now().Add(time.Duration(Context.Delay*-1) * time.Second) // the last minute might contains data that is not yet finalized

Instead, the start time should be time.Now() with the seconds truncated (i.e. rounded down to the nearest minute). Since the Metrics API only stores data at minutely granularity, using time.Now() is effectively rounding up to the next minute, which makes the effective delay less than the configured delay.

For example:

  • Delay (configured): 120 seconds
  • Current time: 00:10:05
  • Query interval (actual): 00:08:05 / PT1M
  • Query interval (effective): 00:09:00 / PT1M (only metrics with timestamp 00:09:00 will be matched)
  • Delay (effective): 65 seconds (00:10:05 - 00:09:00)

Connector metrics don't have any kafka.id label

Hi,

We have connectors running in dev clusters, and we have more than one dedicated cluster in our environments. We are able to see the metrics of the Kafka cluster, which carry the kafka.id dimension, whereas connector metrics don't have any dimension (kafka.id). Would it be possible to add the cluster for the connectors, so that we can find which kafka.id a running connector is associated with?

I don't see the kafka.id label when I query manually from the Metrics API.

thanks
Niranjan
