
canonical / grafana-agent-k8s-operator


This charmed operator automates the operational procedures of running Grafana Agent, an open-source telemetry collector.

Home Page: https://charmhub.io/grafana-agent-k8s

License: Apache License 2.0

Python 100.00%
cos grafana grafana-agent juju kubernetes observability telemetry hacktoberfest juju-charm

grafana-agent-k8s-operator's Introduction

Grafana Agent Charmed Operator for Kubernetes


⚠️ Are you looking for the Grafana Agent machine charm?

The charms have been split into separate repositories. The machine charm is now available here.

Description

Grafana Agent is a telemetry collector for sending metrics, logs, and trace data to the opinionated Grafana observability stack.

The Grafana Agent Charmed Operator deploys Grafana Agent in Kubernetes using Juju and the Charmed Operator Lifecycle Manager (OLM).

As a single entry point to the Canonical Observability Stack, the Grafana Agent charm brings several conveniences when deployed inside a monitored cluster:

  • Charms are related to the Grafana Agent charm instead of to Prometheus and Loki individually. In typical deployments, this reduces the number of cross-model relations that would otherwise be needed.
  • Conversion from scraping to remote writing: Grafana Agent collects telemetry inside the cluster network and pushes it to the COS cluster (via loki_push_api and prometheus_remote_write), which simplifies firewall configuration, as only outgoing connections need to be established.

See deployment scenarios for further detail.

Usage

Create a Juju model for your operators, say "cos":

juju add-model cos

The Grafana Agent charm may be deployed using the Juju command line:

juju deploy grafana-agent-k8s --trust

If required, you can remove the deployment completely:

juju destroy-model -y cos --no-wait --force --destroy-storage

Relations

Currently supported relations are:

requires:
  send-remote-write:
    interface: prometheus_remote_write
  metrics-endpoint:
    interface: prometheus_scrape
  logging-consumer:
    interface: loki_push_api

provides:
  self-metrics-endpoint:
    interface: prometheus_scrape
  grafana-dashboard:
    interface: grafana_dashboard
  logging-provider:
    interface: loki_push_api

More detailed information about these relations can be found on the Charmhub docs page.

OCI Images

This charm, by default, uses the latest release of the grafana-agent OCI image.

grafana-agent-k8s-operator's People

Contributors

abuelodelanada, chanchiwai-ray, dragomirp, dstathis, ghislainbourgeois, ibraaoad, jneo8, lucabello, mmkay, neoaggelos, observability-noctua-bot, pietropasotti, rbarry82, rgildein, samuelallan72, sed-i, simskij, sudeephb


grafana-agent-k8s-operator's Issues

Missing relation in the charm based on the INTEGRATION documentation

Bug Description

The relation is present in the INTEGRATION documentation, but there is no implementation in the charm (yet).
It would be great to have it implemented, or to have the documentation note that it is expected to land in version xx.yy.

To Reproduce

  1. Install microk8s and install cos-lite on top of it
  2. Install Charmed Kubernetes separately
  3. Create cross model relation between grafana-agent-k8s and grafana-k8s

Environment

I ran it using Mk8s on LXD with COS lite and Charmed Kubeflow on bare-metal. The grafana-agent-k8s charm was deployed on CK8s.

Relevant log output

$ juju add-relation admin/cos-kubeflow.grafana-dashboards grafana-agent-k8s:grafana-dashboard
ERROR application "grafana-agent-k8s" has no "grafana-dashboard" relation

$ juju status grafana-agent-k8s
Model     Controller        Cloud/Region  Version  SLA          Timestamp
kubeflow  foundations-maas  ck8s/default  2.9.37   unsupported  07:49:56Z

SAAS                             Status  Store             URL
grafana-dashboards               active  foundations-maas  admin/cos-kubeflow.grafana-dashboards
prometheus-receive-remote-write  active  foundations-maas  admin/cos-kubeflow.prometheus-receive-remote-write

App                Version  Status  Scale  Charm              Channel  Rev  Address         Exposed  Message
grafana-agent-k8s  0.26.1   active      1  grafana-agent-k8s  stable    24  122.33.123.123  no       

Unit                  Workload  Agent  Address         Ports  Message
grafana-agent-k8s/0*  active    idle   123.123.123.123

Additional context

No response

Grafana Dashboard with "creative" datasource names for Prometheus and Loki breaks the Grafana operator

Bug Description

The Grafana Dashboard below borks the Grafana operator:

Model  Controller   Cloud/Region        Version  SLA          Timestamp
cos    development  microk8s/localhost  2.9.27   unsupported  17:07:56+01:00

App           Version  Status   Scale  Charm             Channel  Rev  Address         Exposed  Message
alertmanager           active       1  alertmanager-k8s  edge      13  10.152.183.177  no       
grafana                waiting      1  grafana-k8s       edge      31  10.152.183.43   no       installing agent
loki                   active       1  loki-k8s          edge      15  10.152.183.145  no       
prometheus             active       1  prometheus-k8s    edge      24  10.152.183.70   no       
spring-music           active       1  spring-music                 0  10.152.183.201  no       

Unit             Workload  Agent  Address      Ports  Message
alertmanager/0*  active    idle   10.1.151.86         
grafana/0*       error     idle   10.1.151.87         hook failed: "grafana-dashboard-relation-changed"
loki/0*          active    idle   10.1.151.88         
prometheus/0*    active    idle   10.1.151.89         
spring-music/0*  active    idle   10.1.151.90  

To Reproduce

Add the following dashboard to the Spring Music charm, or provide it to Grafana via the COS Configuration charm:

{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": "-- Grafana --",
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "description": "Dashboard for the Spring Music application, powered by Juju",
  "editable": true,
  "gnetId": 9845,
  "graphTooltip": 1,
  "id": 3,
  "iteration": 1639059314688,
  "links": [],
  "panels": [
    {
      "cacheTimeout": null,
      "datasource": "${promds}",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "custom": {},
          "mappings": [
            {
              "from": "",
              "id": 1,
              "text": "0",
              "to": "",
              "type": 1,
              "value": ""
            }
          ],
          "max": 100,
          "min": 0,
          "noValue": "0",
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "#299c46",
                "value": null
              },
              {
                "color": "rgba(237, 129, 40, 0.89)",
                "value": 1
              },
              {
                "color": "#d44a3a",
                "value": 5
              }
            ]
          },
          "unit": "none"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 7,
        "x": 0,
        "y": 0
      },
      "id": 31,
      "interval": null,
      "links": [],
      "maxDataPoints": 100,
      "options": {
        "orientation": "horizontal",
        "reduceOptions": {
          "calcs": [
            "lastNotNull"
          ],
          "fields": "",
          "values": false
        },
        "showThresholdLabels": false,
        "showThresholdMarkers": true,
        "text": {}
      },
      "pluginVersion": "7.4.1",
      "targets": [
        {
          "expr": "sum(increase(http_server_requests_seconds_count{juju_model=\"$juju_model\",juju_model_uuid=\"$juju_model_uuid\",juju_application=\"$juju_application\",juju_unit=\"$juju_unit\", outcome!=\"SUCCESS\", outcome!=\"REDIRECTION\"}[$__range])) / sum(increase(http_server_requests_seconds_count{juju_model=\"$juju_model\",juju_model_uuid=\"$juju_model_uuid\",juju_application=\"$juju_application\",juju_unit=\"$juju_unit\"}[$__range])) * 100",
          "format": "time_series",
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "",
          "refId": "A"
        }
      ],
      "title": "Error Rate",
      "type": "gauge"
    },
    {
      "cacheTimeout": null,
      "datasource": "${promds}",
      "description": "",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "custom": {},
          "decimals": 0,
          "mappings": [
            {
              "id": 0,
              "op": "=",
              "text": "0",
              "type": 1,
              "value": "null"
            }
          ],
          "noValue": "0",
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "blue",
                "value": null
              }
            ]
          },
          "unit": "locale"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 17,
        "x": 7,
        "y": 0
      },
      "id": 39,
      "interval": "1m",
      "links": [],
      "maxDataPoints": null,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "horizontal",
        "reduceOptions": {
          "calcs": [
            "lastNotNull"
          ],
          "fields": "",
          "values": false
        },
        "text": {},
        "textMode": "auto"
      },
      "pluginVersion": "7.4.1",
      "targets": [
        {
          "expr": "sum(rate(http_server_requests_seconds_count{juju_model=\"$juju_model\",juju_model_uuid=\"$juju_model_uuid\",juju_application=\"$juju_application\",juju_unit=\"$juju_unit\"}[4m]))",
          "format": "time_series",
          "interval": "1m",
          "intervalFactor": 1,
          "legendFormat": "",
          "refId": "A"
        }
      ],
      "timeFrom": null,
      "timeShift": null,
      "title": "Req / Min",
      "type": "stat"
    },
    {
      "collapsed": false,
      "datasource": null,
      "gridPos": {
        "h": 1,
        "w": 24,
        "x": 0,
        "y": 8
      },
      "id": 2,
      "panels": [],
      "title": "Application",
      "type": "row"
    },
    {
      "aliasColors": {
        "Erroneous Requests": "rgb(255, 53, 75)",
        "Error": "red",
        "Success": "blue",
        "Successful Requests": "rgb(47, 117, 46)",
        "{level=\"ERROR\"}": "red",
        "{level=\"INFO\"}": "blue"
      },
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "${promds}",
      "fieldConfig": {
        "defaults": {
          "color": {},
          "custom": {},
          "thresholds": {
            "mode": "absolute",
            "steps": []
          },
          "unit": "short"
        },
        "overrides": []
      },
      "fill": 1,
      "fillGradient": 0,
      "gridPos": {
        "h": 9,
        "w": 24,
        "x": 0,
        "y": 9
      },
      "hiddenSeries": false,
      "id": 43,
      "interval": "60s",
      "legend": {
        "alignAsTable": false,
        "avg": false,
        "current": false,
        "hideEmpty": true,
        "hideZero": false,
        "max": false,
        "min": false,
        "rightSide": false,
        "show": true,
        "total": true,
        "values": true
      },
      "lines": true,
      "linewidth": 3,
      "maxDataPoints": null,
      "nullPointMode": "null as zero",
      "options": {
        "alertThreshold": true
      },
      "percentage": false,
      "pluginVersion": "7.4.1",
      "pointradius": 0.5,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": true,
      "steppedLine": false,
      "targets": [
        {
          "expr": "sum(rate(http_server_requests_seconds_count{juju_model=\"$juju_model\",juju_model_uuid=\"$juju_model_uuid\",juju_application=\"$juju_application\",juju_unit=\"$juju_unit\", outcome!~\"SUCCESS|REDIRECTION\"}[4m]))",
          "hide": false,
          "instant": false,
          "interval": "1m",
          "intervalFactor": 1,
          "legendFormat": "Error",
          "refId": "B"
        },
        {
          "expr": "sum(rate(http_server_requests_seconds_count{juju_model=\"$juju_model\",juju_model_uuid=\"$juju_model_uuid\",juju_application=\"$juju_application\",juju_unit=\"$juju_unit\", outcome=~\"SUCCESS|REDIRECTION\"}[4m]))",
          "hide": false,
          "interval": "1m",
          "legendFormat": "Success",
          "refId": "A"
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "Requests/Outcome",
      "tooltip": {
        "shared": true,
        "sort": 0,
        "value_type": "individual"
      },
      "transformations": [],
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "$$hashKey": "object:502",
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "$$hashKey": "object:503",
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    },
    {
      "aliasColors": {
        "{level=\"ERROR\"}": "red",
        "{level=\"INFO\"}": "blue"
      },
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "${lokids}",
      "fieldConfig": {
        "defaults": {
          "custom": {}
        },
        "overrides": []
      },
      "fill": 1,
      "fillGradient": 0,
      "gridPos": {
        "h": 9,
        "w": 24,
        "x": 0,
        "y": 18
      },
      "hiddenSeries": false,
      "id": 46,
      "interval": null,
      "legend": {
        "avg": false,
        "current": false,
        "max": false,
        "min": false,
        "show": true,
        "total": true,
        "values": true
      },
      "lines": true,
      "linewidth": 3,
      "nullPointMode": "null as zero",
      "options": {
        "alertThreshold": true
      },
      "percentage": false,
      "pluginVersion": "7.4.1",
      "pointradius": 2,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": true,
      "steppedLine": false,
      "targets": [
        {
          "expr": "sum(rate({juju_model=\"$juju_model\",juju_model_uuid=\"$juju_model_uuid\",juju_application=\"$juju_application\",juju_unit=\"$juju_unit\"}[1m])) by (level)",
          "queryType": "randomWalk",
          "refId": "A"
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "Logs by severity",
      "tooltip": {
        "shared": true,
        "sort": 0,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "$$hashKey": "object:594",
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "$$hashKey": "object:595",
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    },
    {
      "datasource": "${lokids}",
      "fieldConfig": {
        "defaults": {
          "custom": {}
        },
        "overrides": []
      },
      "gridPos": {
        "h": 13,
        "w": 24,
        "x": 0,
        "y": 27
      },
      "id": 45,
      "options": {
        "showLabels": false,
        "showTime": false,
        "sortOrder": "Descending",
        "wrapLogMessage": false
      },
      "pluginVersion": "7.4.1",
      "targets": [
        {
          "expr": "{juju_model=\"$juju_model\",juju_model_uuid=\"$juju_model_uuid\",juju_application=\"$juju_application\",juju_unit=\"$juju_unit\"}",
          "queryType": "randomWalk",
          "refId": "A"
        }
      ],
      "title": "Logs",
      "type": "logs"
    }
  ],
  "refresh": "5s",
  "schemaVersion": 27,
  "style": "dark",
  "tags": [],
  "templating": {
    "list": [
      {
        "current": {
          "selected": false,
          "text": "juju_lma_51fd2a3a-f900-4606-8c12-dc8f9febe555_prometheus_0",
          "value": "juju_lma_51fd2a3a-f900-4606-8c12-dc8f9febe555_prometheus_0"
        },
        "description": null,
        "error": null,
        "hide": 0,
        "includeAll": false,
        "label": "Prometheus Datasource",
        "multi": false,
        "name": "promds",
        "options": [],
        "query": "prometheus",
        "refresh": 1,
        "regex": "",
        "skipUrlSync": false,
        "type": "datasource"
      },
      {
        "current": {
          "selected": false,
          "text": "juju_lma_51fd2a3a-f900-4606-8c12-dc8f9febe555_loki_0",
          "value": "juju_lma_51fd2a3a-f900-4606-8c12-dc8f9febe555_loki_0"
        },
        "description": null,
        "error": null,
        "hide": 0,
        "includeAll": false,
        "label": "Loki Datasource",
        "multi": false,
        "name": "lokids",
        "options": [],
        "query": "loki",
        "queryValue": "",
        "refresh": 1,
        "regex": "",
        "skipUrlSync": false,
        "type": "datasource"
      },
      {
        "allValue": null,
        "current": {
          "selected": false,
          "text": "spring",
          "value": "spring"
        },
        "datasource": "${promds}",
        "definition": "label_values(up,juju_model)",
        "description": null,
        "error": null,
        "hide": 0,
        "includeAll": false,
        "label": "Juju model",
        "multi": false,
        "name": "juju_model",
        "options": [],
        "query": {
          "query": "label_values(up,juju_model)",
          "refId": "StandardVariableQuery"
        },
        "refresh": 1,
        "regex": "",
        "skipUrlSync": false,
        "sort": 0,
        "tagValuesQuery": "",
        "tags": [],
        "tagsQuery": "",
        "type": "query",
        "useTags": false
      },
      {
        "allValue": null,
        "current": {
          "selected": false,
          "text": "c129a417-465c-4570-8c8b-b6e3402af25d",
          "value": "c129a417-465c-4570-8c8b-b6e3402af25d"
        },
        "datasource": "${promds}",
        "definition": "label_values(up{juju_model=\"$juju_model\"},juju_model_uuid)",
        "description": null,
        "error": null,
        "hide": 0,
        "includeAll": false,
        "label": "Juju model uuid",
        "multi": false,
        "name": "juju_model_uuid",
        "options": [],
        "query": {
          "query": "label_values(up{juju_model=\"$juju_model\"},juju_model_uuid)",
          "refId": "StandardVariableQuery"
        },
        "refresh": 1,
        "regex": "",
        "skipUrlSync": false,
        "sort": 0,
        "tagValuesQuery": "",
        "tags": [],
        "tagsQuery": "",
        "type": "query",
        "useTags": false
      },
      {
        "allValue": null,
        "current": {
          "selected": false,
          "text": "spring-music",
          "value": "spring-music"
        },
        "datasource": "${promds}",
        "definition": "label_values(up{juju_model=\"$juju_model\",juju_model_uuid=\"$juju_model_uuid\"},juju_application)",
        "description": null,
        "error": null,
        "hide": 0,
        "includeAll": false,
        "label": "Juju application",
        "multi": false,
        "name": "juju_application",
        "options": [],
        "query": {
          "query": "label_values(up{juju_model=\"$juju_model\",juju_model_uuid=\"$juju_model_uuid\"},juju_application)",
          "refId": "StandardVariableQuery"
        },
        "refresh": 1,
        "regex": "",
        "skipUrlSync": false,
        "sort": 0,
        "tagValuesQuery": "",
        "tags": [],
        "tagsQuery": "",
        "type": "query",
        "useTags": false
      },
      {
        "allValue": null,
        "current": {
          "selected": true,
          "text": "spring-music/0",
          "value": "spring-music/0"
        },
        "datasource": "${promds}",
        "definition": "label_values(up{juju_model=\"$juju_model\",juju_model_uuid=\"$juju_model_uuid\",juju_application=\"$juju_application\"},juju_unit)",
        "description": null,
        "error": null,
        "hide": 0,
        "includeAll": false,
        "label": "Juju unit",
        "multi": false,
        "name": "juju_unit",
        "options": [],
        "query": {
          "query": "label_values(up{juju_model=\"$juju_model\",juju_model_uuid=\"$juju_model_uuid\",juju_application=\"$juju_application\"},juju_unit)",
          "refId": "StandardVariableQuery"
        },
        "refresh": 1,
        "regex": "",
        "skipUrlSync": false,
        "sort": 0,
        "tagValuesQuery": "",
        "tags": [],
        "tagsQuery": "",
        "type": "query",
        "useTags": false
      }
    ]
  },
  "time": {
    "from": "now-1h",
    "to": "now"
  },
  "timepicker": {
    "refresh_intervals": [
      "5s",
      "10s",
      "30s",
      "1m",
      "5m",
      "15m",
      "30m",
      "1h",
      "2h",
      "1d"
    ],
    "time_options": [
      "5m",
      "15m",
      "1h",
      "6h",
      "12h",
      "24h",
      "2d",
      "7d",
      "30d"
    ]
  },
  "timezone": "browser",
  "title": "Spring Music",
  "uid": "XwUYhT27k",
  "version": 7
}

Environment

A Juju controller next to you. Buh!

Relevant log output

unit-grafana-0: 17:07:17 ERROR unit.grafana/0.juju-log grafana-dashboard:6: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "./src/charm.py", line 680, in <module>
    main(GrafanaCharm, use_juju_for_storage=True)
  File "/var/lib/juju/agents/unit-grafana-0/charm/venv/ops/main.py", line 431, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-grafana-0/charm/venv/ops/main.py", line 142, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-grafana-0/charm/venv/ops/framework.py", line 283, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-grafana-0/charm/venv/ops/framework.py", line 743, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-grafana-0/charm/venv/ops/framework.py", line 790, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-grafana-0/charm/lib/charms/grafana_k8s/v0/grafana_dashboard.py", line 1029, in _on_grafana_dashboard_relation_changed
    changes = self._render_dashboards_and_signal_changed(event.relation)
  File "/var/lib/juju/agents/unit-grafana-0/charm/lib/charms/grafana_k8s/v0/grafana_dashboard.py", line 1127, in _render_dashboards_and_signal_changed
    content = _encode_dashboard_content(_convert_dashboard_fields(content))
  File "/var/lib/juju/agents/unit-grafana-0/charm/lib/charms/grafana_k8s/v0/grafana_dashboard.py", line 562, in _convert_dashboard_fields
    dict_content = _replace_template_fields(dict_content, datasources, existing_templates)
  File "/var/lib/juju/agents/unit-grafana-0/charm/lib/charms/grafana_k8s/v0/grafana_dashboard.py", line 596, in _replace_template_fields
    ds = re.sub(r"(\$|\{|\})", "", panel["datasource"])
  File "/usr/lib/python3.8/re.py", line 210, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object

Additional context

I took the Spring Music dashboard from the Spring Music charm and replaced prometheusds with promds because I am devious.

Use pyproject.toml instead of requirements.txt?

It turns out that if you use pyproject.toml, you don't really need requirements.txt. So yeah, we should drop it and unify.

[project]
name = "grafana-agent-k8s-operator"
# FIXME: Packing the charm with 2.2.0+139.gd011d92 will not include dependencies in PYDEPS key:
# https://chat.charmhub.io/charmhub/pl/wngp665ycjnb78ar9ojrfhxjkr
# That's why we are including cosl here until the bug in charmcraft is solved
dependencies = [
    "cosl",
    "ops>=2.0",
    "pydantic",
    "requests",
    "kubernetes",
    "lightkube",
    "lightkube-models",
]
version = "0.1"

# required for tox to work with isolated_build
[build-system]
requires = [
    "setuptools >= 35.0.2",
    "setuptools_scm >= 2.0.0, <3"
]

build-backend = "setuptools.build_meta"

Originally posted by @PietroPasotti in #160 (comment)

Invalid cos-agent relation data raises instead of blocking

Bug Description

The new version of the cos-agent lib switched to a new schema plus pydantic validation.
As a result, the charm code raises and the unit ends up in an error state.
Instead, it should probably set a blocked status.
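
A minimal sketch of that behaviour, assuming pydantic v1 and the import path shown in the tracebacks on this page; the helper name is hypothetical and this is not the library's actual code:

import json
import logging

import pydantic
from ops.model import BlockedStatus

from charms.grafana_agent.v0.cos_agent import CosAgentProviderUnitData

logger = logging.getLogger(__name__)

def load_provider_data(charm, raw: str):
    # Parse the remote unit's databag; block instead of erroring on bad payloads.
    try:
        return CosAgentProviderUnitData(**json.loads(raw))
    except (json.JSONDecodeError, pydantic.ValidationError) as e:
        logger.error("Invalid cos-agent relation data: %s", e)
        charm.unit.status = BlockedStatus("invalid cos-agent relation data")
        return None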

To Reproduce

Relate gagent to zookeeper (revision 96).

Environment

Model                    Controller  Cloud/Region         Version  SLA          Timestamp
test-machine-agent-2za1  lxd         localhost/localhost  2.9.42   unsupported  10:29:04-04:00

App                  Version  Status  Scale  Charm          Channel        Rev  Exposed  Message
agent                         error       4  grafana-agent  edge             7  no       hook failed: "cos-agent-relation-joined"
principal-cos-agent           active      2  zookeeper      edge            96  no       
principal-juju-info  22.04    active      2  ubuntu         latest/stable   22  no       

Relevant log output

unit-agent-2: 10:23:10.930 DEBUG unit.agent/2.juju-log cos-agent:4: Emitting Juju event cos_agent_relation_joined.
unit-agent-2: 10:23:11.028 ERROR unit.agent/2.juju-log cos-agent:4: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-agent-2/charm/./src/charm.py", line 473, in <module>
    main(GrafanaAgentMachineCharm)
  File "/var/lib/juju/agents/unit-agent-2/charm/venv/ops/main.py", line 441, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-agent-2/charm/venv/ops/main.py", line 149, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-agent-2/charm/venv/ops/framework.py", line 354, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-agent-2/charm/venv/ops/framework.py", line 830, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-agent-2/charm/venv/ops/framework.py", line 919, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-agent-2/charm/lib/charms/grafana_agent/v0/cos_agent.py", line 477, in _on_relation_data_changed
    provider_data = CosAgentProviderUnitData(**json.loads(raw))
  File "/var/lib/juju/agents/unit-agent-2/charm/venv/pydantic/main.py", line 341, in __init__
    raise validation_error
pydantic.error_wrappers.ValidationError: 5 validation errors for CosAgentProviderUnitData
metrics_alert_rules
  field required (type=value_error.missing)
log_alert_rules
  field required (type=value_error.missing)
dashboards
  field required (type=value_error.missing)
metrics_scrape_jobs
  field required (type=value_error.missing)
log_slots
  field required (type=value_error.missing)
unit-agent-2: 10:23:11.251 ERROR juju.worker.uniter.operation hook "cos-agent-relation-joined" (via hook dispatching script: dispatch) failed: exit status 1

Additional context

No response

Add support for configuring node-exporter at deploy time

Enhancement Proposal

Problem

  • As a user, I want to build alert rules for particular services running on my machine charm, based on various /proc and sysctl metrics.

Issue

  • node-exporter can expose these metrics; however, the Grafana Agent machine subordinate charm currently does not allow configuring node-exporter at deploy time.
  • It also isn't clear how one might specify "get me the /proc/$SERVICE_PID/limits metrics ONLY for specific service PIDs". Grabbing them all would be overkill.

Feature Request

  • To be able to pass configuration to node-exporter, running within the Grafana Agent Machine subordinate, that allows for enabling and configuring additional collectors

    • Assuming that is all that's needed. If not, ALSO that the metrics appear in Grafana and alert rules can be made based on them
  • NICE TO HAVE - If possible, as the grafana-agent charm runs as a subordinate on the same machine, it would be cool if we could have some way of specifying 'grab only metrics for XYZ service pid', maybe with a regex match as it might change?

    • On this, snap logs returns something like 2023-03-29T22:39:31+01:00 charmed-kafka.daemon[$SERVICE_PID]:, maybe that's helpful, maybe not
    • It's possible the process collector allows for this, can't tell. It says it's aggregate though which isn't ideal.

Example Usage

  • NOTE - going off of these disabled by default options
  • I deploy grafana-agent, specifying that I would like to enable --collector.sysctl or --collector.process or --collector.sysctl.include=vm.swappiness and --collector.sysctl.include=vm.max_map_count

add typing for endpoint

if typing.TYPE_CHECKING:
    from typing import TypedDict  # if the smallest Python version we support has it! Otherwise, you could wrap this all in a try/except and, if there's an error, define `_EndpointDict = dict`.

    class _EndpointDict(TypedDict):
        path: str
        port: int  # ?

-        metrics_endpoints: List[dict] = [DEFAULT_METRICS_ENDPOINT],
+        metrics_endpoints: List["_EndpointDict"] = [DEFAULT_METRICS_ENDPOINT],

Originally posted by @PietroPasotti in #95 (comment)

(machine) Use the `positions_directory` config option

I played around a little with this. You can also set a folder in which grafana-agent will generate its own position files based on the job name, so you don't have to set the name manually for every job. There's a field on the same level as "configs" called "positions_directory" (see https://grafana.com/docs/agent/latest/configuration/logs-config/). I had to put it in /var/snap/[...] to get it to work permissions-wise, but that did the trick for me at least. Have you considered it?

Originally posted by @awnns in #116 (comment)
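
A minimal sketch of where positions_directory sits relative to configs in the agent's logs block, following the linked docs; the directory path and Loki URL below are hypothetical placeholders (the comment above only says the directory had to live under /var/snap/[...]):

import yaml

logs_block = {
    "positions_directory": "/var/snap/grafana-agent/current/positions",  # hypothetical snap-writable path
    "configs": [
        {
            "name": "log_file_scraper",
            "clients": [{"url": "http://loki:3100/loki/api/v1/push"}],  # placeholder Loki endpoint
            "scrape_configs": [],
        }
    ],
}
print(yaml.safe_dump({"logs": logs_block}, sort_keys=False))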

(machine_charm): create new jobs for each connected target plug

Bug Description

  • connected snaps show up in $SNAP/shared-logs/PLUGNAME-ID
  • when target slots change: create a loki job for each and pass it to loki, NOT a single job for $SNAP/shared-logs/*
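
A rough sketch of the per-plug job construction described in the bullets above, assuming the $SNAP/shared-logs/PLUGNAME-ID layout from the first bullet; the job shape mirrors a typical logs scrape_config, and the function name is hypothetical:

from pathlib import Path

def loki_jobs_for_plugs(shared_logs_dir: str) -> list:
    # One job per connected plug directory, instead of a single wildcard job.
    jobs = []
    for plug_dir in sorted(Path(shared_logs_dir).iterdir()):
        if not plug_dir.is_dir():
            continue
        jobs.append(
            {
                "job_name": plug_dir.name,
                "static_configs": [
                    {"targets": ["localhost"], "labels": {"__path__": f"{plug_dir}/*"}}
                ],
            }
        )
    return jobs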

To Reproduce

Environment

Relevant log output

-

Additional context

No response

Node exporter dashboard needs correct label.

Bug Description

Right now, the node exporter dashboard (node-exporter-full.json) shows errors when selected in Grafana. The ${DS_PROMETHEUS} entries in the JSON would need to be changed to ${prometheusds}, I think.
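
A one-off sketch of the substitution suggested above; it assumes the dashboard JSON is available in the current directory:

from pathlib import Path

dashboard = Path("node-exporter-full.json")
dashboard.write_text(dashboard.read_text().replace("${DS_PROMETHEUS}", "${prometheusds}"))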

To Reproduce


Environment


Relevant log output

---

Additional context

No response

Upgrade Grafana Agent OCI image

Enhancement Proposal

In metadata.yaml we have:

resources:
  agent-image:
    type: oci-image
    upstream-source: grafana/agent:v0.26.1
    description: OCI image for Grafana Agent

But the latest Grafana Agent version is 0.28.0.

The drop stage rule makes the machine agent crash

Bug Description

With the addition of the drop pipeline stage, grafana agent now crashes on machines.
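
For reference, the shape the agent rejects versus the shape it accepts, per the error in the log output below (illustrative dicts, not the charm's actual config-building code):

rejected_stage = {"drop": {"source": ["level", "msg"]}}  # list -> "'source' expected type 'string'"
accepted_stage = {"drop": {"source": "level"}}  # this agent version expects a single string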

To Reproduce

Environment

Relevant log output

Apr 27 13:29:04 kafka-node3 grafana-agent.grafana-agent[232392]: ts=2023-04-27T13:29:04.079151534Z caller=wal.go:198 level=info agent=prometheus instance=eb098155f879b1feac4a3afa141b0a27 msg="replaying WAL, this may take a while" dir=/tmp/agent/data/eb098155f879b1feac4a3afa141b0a27/wal
Apr 27 13:29:04 kafka-node3 grafana-agent.grafana-agent[232392]: ts=2023-04-27T13:29:04.079469032Z caller=main.go:64 level=error msg="error creating the agent server entrypoint" err="unable to apply config for log_file_scraper: unable to create logs instance: failed to make file target manager: invalid drop stage config: 1 error(s) decoding:\n\n* 'source' expected type 'string', got unconvertible type '[]interface {}', value: '[level msg]'"
Apr 27 13:29:04 kafka-node3 systemd[1]: snap.grafana-agent.grafana-agent.service: Main process exited, code=exited, status=1/FAILURE
Apr 27 13:29:04 kafka-node3 systemd[1]: snap.grafana-agent.grafana-agent.service: Failed with result 'exit-code'.
Apr 27 13:29:04 kafka-node3 systemd[1]: snap.grafana-agent.grafana-agent.service: Scheduled restart job, restart counter is at 5.
Apr 27 13:29:04 kafka-node3 systemd[1]: Stopped Service for snap application grafana-agent.grafana-agent.
Apr 27 13:29:04 kafka-node3 systemd[1]: snap.grafana-agent.grafana-agent.service: Start request repeated too quickly.
Apr 27 13:29:04 kafka-node3 systemd[1]: snap.grafana-agent.grafana-agent.service: Failed with result 'exit-code'.
Apr 27 13:29:04 kafka-node3 systemd[1]: Failed to start Service for snap application grafana-agent.grafana-agent.
Apr 27 13:29:21 kafka-node3 charmed-kafka.daemon[14267]: [2023-04-27 13:29:21,302] ERROR [ReplicaManager broker=2] Error while describing replica in dir /var/snap/charmed-kafka/common/var/lib/kafka/data/2 (kafka.server.ReplicaManager)

Additional context

No response

(machine charm) Grafana dashboard files are stored in a bad format (or not stored at all)

Bug Description

Grafana dashboards from a principal charm are not reaching Grafana through the Grafana Agent machine charm.

Let's assume we have the following model, which is related to COS Lite (running in another model) through CMR:

╭─ubuntu@charm-dev-juju-29 ~ 
╰─$ juju status --color --relations -m lxd:applications                
Model         Controller  Cloud/Region         Version  SLA          Timestamp
applications  lxd         localhost/localhost  2.9.38   unsupported  18:37:52-03:00

SAAS                             Status  Store                URL
grafana-dashboards               active  charm-dev-batteries  admin/cos.grafana-dashboards
loki-logging                     active  charm-dev-batteries  admin/cos.loki-logging
prometheus-receive-remote-write  active  charm-dev-batteries  admin/cos.prometheus-receive-remote-write

App            Version  Status  Scale  Charm          Channel  Rev  Exposed  Message
grafana-agent           active      1  grafana-agent             2  no       
zookeeper               active      1  zookeeper                 2  no       

Unit                Workload  Agent  Machine  Public address  Ports  Message
zookeeper/2*        active    idle   2        10.77.61.145           
  grafana-agent/4*  active    idle            10.77.61.145           

Machine  State    Address       Inst id        Series  AZ  Message
2        started  10.77.61.145  juju-ca2168-2  jammy       Running

Relation provider    Requirer                 Interface   Type         Message
zookeeper:cluster    zookeeper:cluster        cluster     peer         
zookeeper:cos-agent  grafana-agent:cos-agent  cos_agent   subordinate  
zookeeper:restart    zookeeper:restart        rolling_op  peer  

Zookeeper has 2 Grafana dashboards.

After relating zookeeper to grafana-agent, those dashboards should be in grafana agent, but they aren't:

$ juju ssh grafana-agent/4 ls -1 /var/lib/juju/agents/unit-grafana-agent-4/charm/grafana_dashboards
grafana-agent-node-exporter-quickstart_rev2.json
'juju_('\''cos-agent-zookeeper-13'\'',).rules'
node-exporter-full.json

As you may notice, there is a file with a malformed name: 'juju_('\''cos-agent-zookeeper-13'\'',).rules'; besides, its extension is .rules. It shouldn't be here.

But if we look inside this file, we'll discover something interesting:

ubuntu@juju-ca2168-2:~$ tail /var/lib/juju/agents/unit-grafana-agent-4/charm/grafana_dashboards/juju_\(\'cos-agent-zookeeper-13\'\,\).rules
  : \"\",\n        \"tags\": [],\n        \"tagsQuery\": \"\",\n        \"type\":\
  \ \"query\",\n        \"useTags\": false\n      }\n    ]\n  },\n  \"time\": {\n\
  \    \"from\": \"now-1h\",\n    \"to\": \"now\"\n  },\n  \"timepicker\": {\n   \
  \ \"refresh_intervals\": [\n      \"5s\",\n      \"10s\",\n      \"30s\",\n    \
  \  \"1m\",\n      \"5m\",\n      \"15m\",\n      \"30m\",\n      \"1h\",\n     \
  \ \"2h\",\n      \"1d\"\n    ],\n    \"time_options\": [\n      \"5m\",\n      \"\
  15m\",\n      \"1h\",\n      \"6h\",\n      \"12h\",\n      \"24h\",\n      \"2d\"\
  ,\n      \"7d\",\n      \"30d\"\n    ]\n  },\n  \"timezone\": \"\",\n  \"title\"\
  : \"ZooKeeper by Prometheus\",\n  \"uid\": \"SDE76m7Zzz\",\n  \"version\": 342\n\
  }\n"

The content of this file is, in fact, this Zookeeper dashboard.

To Reproduce

  1. Follow this tutorial
  2. Check the dashboard files are not in /var/lib/juju/agents/unit-grafana-agent-{UNIT_NUMBER}/charm/grafana_dashboards

Environment

Relevant log output

-

Additional context

No response

(machine) Error when relating to a principal charm

Bug Description

After relating zookeeper to grafana-agent, the charm ends up in an error state:

╭─ubuntu@charm-dev-juju-30 ~ [lxd:applications]
╰─$ juju status --color --relations  -m lxd:applications           
Model         Controller  Cloud/Region         Version  SLA          Timestamp
applications  lxd         localhost/localhost  3.0.3    unsupported  17:39:46-03:00

SAAS                             Status  Store     URL
grafana-dashboards               active  microk8s  admin/cos.grafana-dashboards
loki-logging                     active  microk8s  admin/cos.loki-logging
prometheus-receive-remote-write  active  microk8s  admin/cos.prometheus-receive-remote-write

App            Version  Status  Scale  Charm          Channel  Rev  Exposed  Message
grafana-agent           error       1  grafana-agent  edge       7  no       hook failed: "cos-agent-relation-joined"
zookeeper               active      1  zookeeper      edge      96  no       

Unit                Workload  Agent  Machine  Public address  Ports  Message
zookeeper/0*        active    idle   0        10.201.4.178           
  grafana-agent/3*  error     idle            10.201.4.178           hook failed: "cos-agent-relation-joined" for grafana-agent:cos-agent

Machine  State    Address       Inst id        Base          AZ  Message
0        started  10.201.4.178  juju-028aca-0  ubuntu@22.04      Running

Relation provider    Requirer                 Interface              Type         Message
grafana-agent:peers  grafana-agent:peers      grafana_agent_replica  peer         
zookeeper:cluster    zookeeper:cluster        cluster                peer         
zookeeper:cos-agent  grafana-agent:cos-agent  cos_agent              subordinate  
zookeeper:restart    zookeeper:restart        rolling_op             peer          

To Reproduce

  1. juju deploy zookeeper --channel edge
  2. juju deploy grafana-agent --channel edge
  3. juju relate zookeeper:cos-agent grafana-agent

Environment

  • juju 3.0.3
  • lxd 5.0.2-838e1b2

Relevant log output

unit-grafana-agent-3: 17:39:32.674 WARNING unit.grafana-agent/3.juju-log cos-agent:12: An incoming 'cos-agent' relation does not yet have any matching outgoing relation(s): [send-remote-write|grafana-cloud-config]
unit-grafana-agent-3: 17:39:32.800 ERROR unit.grafana-agent/3.juju-log cos-agent:12: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-grafana-agent-3/charm/./src/charm.py", line 473, in <module>
    main(GrafanaAgentMachineCharm)
  File "/var/lib/juju/agents/unit-grafana-agent-3/charm/venv/ops/main.py", line 441, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-grafana-agent-3/charm/venv/ops/main.py", line 149, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-grafana-agent-3/charm/venv/ops/framework.py", line 354, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-grafana-agent-3/charm/venv/ops/framework.py", line 830, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-grafana-agent-3/charm/venv/ops/framework.py", line 919, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-grafana-agent-3/charm/lib/charms/grafana_agent/v0/cos_agent.py", line 477, in _on_relation_data_changed
    provider_data = CosAgentProviderUnitData(**json.loads(raw))
  File "/var/lib/juju/agents/unit-grafana-agent-3/charm/venv/pydantic/main.py", line 341, in __init__
    raise validation_error
pydantic.error_wrappers.ValidationError: 5 validation errors for CosAgentProviderUnitData
metrics_alert_rules
  field required (type=value_error.missing)
log_alert_rules
  field required (type=value_error.missing)
dashboards
  field required (type=value_error.missing)
metrics_scrape_jobs
  field required (type=value_error.missing)
log_slots
  field required (type=value_error.missing)
unit-grafana-agent-3: 17:39:33.148 ERROR juju.worker.uniter.operation hook "cos-agent-relation-joined" (via hook dispatching script: dispatch) failed: exit status 1
unit-grafana-agent-3: 17:39:33.149 INFO juju.worker.uniter awaiting error resolution for "relation-joined" hook

Additional context

No response

Password shown on action changes after scaling

Hello!

After deploying the cos-lite bundle:

juju add-model cos
juju deploy cos-lite --channel=edge --trust

I can get the dashboard password with:

juju run-action grafana get-admin-password

If I scale down the grafana app with:

juju remove-unit grafana --num-units 1

and then scale it back up:

juju add-unit grafana --num-units 1

Using get-admin-password shows a different one, but the dashboard is accessible with the old one.

Address deprecated config structure

Bug Description

Grafana Agent has deprecated the prometheus and loki keywords in its agent.yml config in favour of metrics and logs, respectively.
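
A minimal sketch of the rename the deprecation notices ask for, applied to whatever dict the charm renders into agent.yml (the function name is hypothetical):

def migrate_deprecated_keys(config: dict) -> dict:
    # `prometheus` -> `metrics`, `loki` -> `logs`, per the deprecation notices below.
    migrated = dict(config)
    if "prometheus" in migrated:
        migrated["metrics"] = migrated.pop("prometheus")
    if "loki" in migrated:
        migrated["logs"] = migrated.pop("loki")
    return migrated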

To Reproduce

juju deploy grafana-agent-k8s
microk8s.kubectl exec -it grafana-agent-k8s -n -c agent -- bash
apt update
apt install -y curl
curl localhost/-/reload -v

Environment

Grafana Agent on edge

Relevant log output

2022-04-07T12:03:26.917Z [agent] ts=2022-04-07T12:03:26.916091613Z caller=config.go:146 level=warn msg="DEPRECATION NOTICE: `prometheus` has been deprecated in favor of `metrics`"
2022-04-07T12:03:26.917Z [agent] ts=2022-04-07T12:03:26.916117061Z caller=config.go:146 level=warn msg="DEPRECATION NOTICE: `loki` has been deprecated in favor of `logs`"

Additional context

No response

Static check failing

➜  grafana-agent-operator git:(main) tox -e static
static installed: attrs==21.2.0,backcall==0.2.0,bcrypt==3.2.0,cachetools==4.2.4,certifi==2021.5.30,cffi==1.14.6,charset-normalizer==2.0.6,cryptography==35.0.0,decorator==5.1.0,google-auth==2.2.1,idna==3.2,iniconfig==1.1.1,ipdb==0.13.9,ipython==7.28.0,jedi==0.18.0,Jinja2==3.0.1,juju==2.9.3,jujubundlelib==0.5.6,kubernetes==18.20.0,macaroonbakery==1.3.1,MarkupSafe==2.0.1,matplotlib-inline==0.1.3,mypy==0.910,mypy-extensions==0.4.3,oauthlib==3.1.1,ops==1.2.0+42.g7420d84,packaging==21.0,paramiko==2.7.2,parso==0.8.2,pexpect==4.8.0,pickleshare==0.7.5,pluggy==1.0.0,prompt-toolkit==3.0.20,protobuf==3.18.0,ptyprocess==0.7.0,py==1.10.0,pyasn1==0.4.8,pyasn1-modules==0.2.8,pycparser==2.20,Pygments==2.10.0,pymacaroons==0.13.0,PyNaCl==1.4.0,pyparsing==2.4.7,pyRFC3339==1.1,pytest==6.2.5,pytest-asyncio==0.15.1,pytest-operator==0.8.3,python-dateutil==2.8.2,pytz==2021.1,PyYAML==5.4.1,requests==2.26.0,requests-oauthlib==1.3.0,rsa==4.7.2,six==1.16.0,theblues==0.5.2,toml==0.10.2,toposort==1.7,traitlets==5.1.0,types-PyYAML==5.4.10,types-requests==2.25.9,typing-extensions==3.10.0.2,typing-inspect==0.7.1,urllib3==1.26.7,wcwidth==0.2.5,websocket-client==1.2.1,websockets==7.0
static run-test-pre: PYTHONHASHSEED='495031126'
static run-test: commands[0] | mypy /home/jose/trabajos/canonical/repos/grafana-agent-operator/src/ /home/jose/trabajos/canonical/repos/grafana-agent-operator/tests/
tests/unit/test_charm.py:8: error: Cannot find implementation or library stub for module named "responses"  [import]
    import responses
    ^
tests/unit/test_charm.py:8: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
Found 1 error in 1 file (checked 4 source files)
ERROR: InvocationError for command /home/jose/trabajos/canonical/repos/grafana-agent-operator/.tox/static/bin/mypy src tests (exited with code 1)
__________________________________________________________________________________________________________________________________________________________________________ summary __________________________________________________________________________________________________________________________________________________________________________
ERROR:   static: commands failed
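
A minimal way to unblock this check, assuming it is acceptable to suppress the missing stub on the import itself (alternatives would be installing a stubs package or configuring mypy to ignore missing imports for that module):

import responses  # type: ignore[import]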

Reload configuration using the web api

While executing _reload_config, we now get:

could not reload configuration: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
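
A hedged sketch of the reload call, hitting the same /-/reload endpoint exercised with curl elsewhere on this page; the base URL is a placeholder for wherever the agent's HTTP server actually listens:

import requests

def reload_agent_config(base_url: str = "http://localhost") -> None:
    # Ask the running agent to reload its configuration via the web API.
    response = requests.get(f"{base_url}/-/reload", timeout=10)
    response.raise_for_status()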

(machine) Exception raised on teardown: `juju_topology.InvalidUUIDError: 'None' is not a valid UUID`

Bug Description

During a CI itest teardown, the charm raised an exception.
Perhaps it's because the removal was forceful.

To Reproduce

juju remove-application.

Environment

Model                    Controller           Cloud/Region         Version  SLA          Timestamp
test-machine-agent-bnmx  github-pr-506db-lxd  localhost/localhost  3.1.2    unsupported  04:47:54Z

SAAS        Status  Store  URL
grafana     active  local  admin/test-machine-agent-bnmx.grafana-dashboards
loki        active  local  admin/test-machine-agent-bnmx.loki-logging
prometheus  active  local  admin/test-machine-agent-bnmx.prometheus-receive-remote-write

App                  Version  Status  Scale  Charm          Channel        Rev  Exposed  Message
agent                         active      4  grafana-agent  edge             7  no       
principal-cos-agent           active      2  zookeeper      edge            97  no       
principal-juju-info  22.04    active      2  ubuntu         latest/stable   22  no       

Relevant log output

unit-agent-3: 03:55:30 ERROR unit.unit-agent-3.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-agent-3/charm/./src/charm.py", line 473, in <module>
    main(GrafanaAgentMachineCharm)
  File "/var/lib/juju/agents/unit-agent-3/charm/venv/ops/main.py", line 429, in main
    charm = charm_class(framework)
  File "/var/lib/juju/agents/unit-agent-3/charm/./src/charm.py", line 161, in __init__
    super().__init__(*args)
  File "/var/lib/juju/agents/unit-agent-3/charm/src/grafana_agent.py", line 108, in __init__
    self._remote_write = PrometheusRemoteWriteConsumer(
  File "/var/lib/juju/agents/unit-agent-3/charm/lib/charms/prometheus_k8s/v0/prometheus_remote_write.py", line 641, in __init__
    self.topology = JujuTopology.from_charm(charm)
  File "/var/lib/juju/agents/unit-agent-3/charm/lib/charms/observability_libs/v0/juju_topology.py", line 154, in from_charm
    return cls(
  File "/var/lib/juju/agents/unit-agent-3/charm/lib/charms/observability_libs/v0/juju_topology.py", line 123, in __init__
    raise InvalidUUIDError(model_uuid)
charms.observability_libs.v0.juju_topology.InvalidUUIDError: 'None' is not a valid UUID.

Additional context

No response

The "Release" workflow fails because no entrypoint is rendered

Bug Description

The release workflow recently failed because at that point there is no charm.py rendered in src.

To Reproduce

Simply run charmcraft pack without tox -e render-k8s first.

Environment

Not relevant.

Relevant log output

$ charmcraft pack
Packing the charm
Lint Errors:
- entrypoint: Cannot find the entrypoint file: '/root/prime/src/charm.py' (https://juju.is/docs/sdk/charmcraft-analyzers-and-linters#heading--entrypoint)
Aborting due to lint errors (use --force to override).
Failed to build charm for bases index '0'.

Additional context

No response

Loki config missing if the VM running Juju and MicroK8s is restarted.

Loki config missing in grafana-agent charm.

How can it be that, with the latest Grafana Agent edge, when I relate Grafana Agent with Loki, no Loki configuration is generated?

michele@boombox:~$ juju status --relations
Model  Controller   Cloud/Region        Version  SLA          Timestamp
lma    development  microk8s/localhost  2.9.22   unsupported  18:56:09+01:00

App            Version  Status  Scale  Charm              Store     Channel  Rev  OS          Address         Message
alertmanager            active      1  alertmanager-k8s   charmhub  edge       7  kubernetes  10.152.183.187  
grafana                 active      1  grafana-k8s        charmhub  edge      14  kubernetes  10.152.183.218  
grafana-agent           active      2  grafana-agent-k8s  charmhub  edge       4  kubernetes  10.152.183.232  
loki                    active      1  loki-k8s           charmhub  edge      11  kubernetes  10.152.183.168  
prometheus              active      1  prometheus-k8s     charmhub  edge      13  kubernetes  10.152.183.139  

Unit              Workload  Agent  Address      Ports  Message
alertmanager/0*   active    idle   10.1.151.76         
grafana-agent/0*  active    idle   10.1.151.80         
grafana-agent/1   active    idle   10.1.151.81         
grafana/0*        active    idle   10.1.151.77         
loki/0*           active    idle   10.1.151.78         
prometheus/0*     active    idle   10.1.151.79         

Relation provider          Requirer                 Interface              Type     Message
alertmanager:alerting      prometheus:alertmanager  alertmanager_dispatch  regular  
alertmanager:replicas      alertmanager:replicas    alertmanager_replica   peer     
grafana:grafana-peers      grafana:grafana-peers    grafana_peers          peer     
loki:grafana-source        grafana:grafana-source   grafana_datasource     regular  
loki:logging               grafana-agent:logging    loki_push_api          regular  
prometheus:grafana-source  grafana:grafana-source   grafana_datasource     regular 
michele@boombox:~$ microk8s.kubectl exec -it grafana-agent-0 -n lma -c agent -- cat /etc/agent/agent.yaml
integrations:
  agent:
    enabled: true
    relabel_configs:
    - regex: (.*)
      replacement: lma_f149bca2-4d3e-4289-8b72-6b66b40d14dc_grafana-agent_grafana-agent/0
      target_label: instance
    - replacement: grafana-agent-k8s
      source_labels:
      - __address__
      target_label: juju_charm
    - replacement: lma
      source_labels:
      - __address__
      target_label: juju_model
    - replacement: f149bca2-4d3e-4289-8b72-6b66b40d14dc
      source_labels:
      - __address__
      target_label: juju_model_uuid
    - replacement: grafana-agent
      source_labels:
      - __address__
      target_label: juju_application
    - replacement: grafana-agent/0
      source_labels:
      - __address__
      target_label: juju_unit
  prometheus_remote_write: []
loki: {}
prometheus:
  configs:
  - name: agent_scraper
    remote_write: []
    scrape_configs: []
server:
  log_level: info

Grafana Agent Subordinate Errors out for non-leader units

Bug Description

I'm deploying Kafka (on VMs) with multiple units (n=3). When I deploy and relate the grafana-agent charm, the grafana-agent unit related to the Kafka Juju leader unit works fine, but the other two (related to the non-leader units) error out because of a permission error in the relation data bag.
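
A hedged sketch of the kind of leader guard that would avoid the write failing in the traceback further down; the helper name is hypothetical, and whether the cos_agent library should skip, defer, or restructure the databag is a separate design question:

from ops.charm import CharmBase

def update_cos_agent_app_data(charm: CharmBase, relation, content: str) -> None:
    # Only the leader may write application relation data; non-leaders skip the write.
    if not charm.unit.is_leader():
        return
    relation.data[charm.app].update({"config": content})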

This is the output of juju status:

Model    Controller  Cloud/Region         Version  SLA          Timestamp
default  dev         localhost/localhost  2.9.42   unsupported  23:14:08Z

App            Version  Status  Scale  Charm          Channel  Rev  Exposed  Message
grafana-agent           active      3  grafana-agent  edge       2  no       
kafka                   active      3  kafka          edge     106  no       
zookeeper               active      3  zookeeper      edge      92  no       

Unit                Workload  Agent  Machine  Public address  Ports  Message
kafka/6*            active    idle   24       10.21.168.7            
  grafana-agent/2   active    idle            10.21.168.7            
kafka/7             error     idle   25       10.21.168.75           hook failed: "cos-agent-relation-joined"
  grafana-agent/0*  active    idle            10.21.168.75           
kafka/8             error     idle   26       10.21.168.72           hook failed: "cos-agent-relation-joined"
  grafana-agent/1   active    idle            10.21.168.72           
zookeeper/6         active    idle   27       10.21.168.30           
zookeeper/7*        active    idle   28       10.21.168.101          
zookeeper/8         active    idle   29       10.21.168.85           

To Reproduce

  1. juju deploy kafka -n3 --channel edge
  2. juju deploy zookeeper -n3 --channel edge
  3. juju relate kafka zookeeper
  4. juju deploy grafana-agent --channel edge
  5. (wait everything to be active)
  6. juju relate grafana-agent kafka

Environment

juju version
2.9.42-ubuntu-amd64

lxd version
5.11

Relevant log output

unit-kafka-8: 23:13:51 ERROR unit.kafka/8.juju-log cos-agent:27: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-kafka-8/charm/./src/charm.py", line 503, in <module>
    main(KafkaCharm)
  File "/var/lib/juju/agents/unit-kafka-8/charm/venv/ops/main.py", line 436, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-kafka-8/charm/venv/ops/main.py", line 144, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-kafka-8/charm/venv/ops/framework.py", line 354, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-kafka-8/charm/venv/ops/framework.py", line 830, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-kafka-8/charm/venv/ops/framework.py", line 919, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-kafka-8/charm/lib/charms/grafana_agent/v0/cos_agent.py", line 251, in _on_refresh
    relation.data[self._charm.app].update({"config": self._generate_databag_content()})
  File "/usr/lib/python3.10/_collections_abc.py", line 994, in update
    self[key] = other[key]
  File "/var/lib/juju/agents/unit-kafka-8/charm/venv/ops/model.py", line 1473, in __setitem__
    self._validate_write(key, value)
  File "/var/lib/juju/agents/unit-kafka-8/charm/venv/ops/model.py", line 1459, in _validate_write
    raise RelationDataAccessError(
ops.model.RelationDataAccessError: kafka/8 is not leader and cannot write application data.
unit-kafka-8: 23:13:51 ERROR juju.worker.uniter.operation hook "cos-agent-relation-joined" (via hook dispatching script: dispatch) failed: exit status 1

Additional context

No response

Grafana agent fails with hook failed: "send-remote-write-relation-joined"

Bug Description

In SQA test run https://solutions.qa.canonical.com/v2/testruns/d6241418-2e99-4c93-95c0-51aad671b834, the grafana-agent unit fails with:

2023-04-25 15:31:21 DEBUG unit.zookeeper-agent/1.juju-log server.go:316 send-remote-write:17: Emitting custom event <PrometheusRemoteWriteEndpointsChangedEvent via GrafanaAgentMachineCharm/PrometheusRemoteWriteConsumer[send-remote-write]/on/endpoints_changed[68]>.
2023-04-25 15:31:21 ERROR unit.zookeeper-agent/1.juju-log server.go:316 send-remote-write:17: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/lib/charms/operator_libs_linux/v1/snap.py", line 309, in _snap_daemons
    return subprocess.run(_cmd, universal_newlines=True, check=True, capture_output=True)
  File "/usr/lib/python3.10/subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['snap', 'restart', 'grafana-agent']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/./src/charm.py", line 286, in restart
    self.snap.restart()
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/lib/charms/operator_libs_linux/v1/snap.py", line 424, in restart
    self._snap_daemons(args, services)
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/lib/charms/operator_libs_linux/v1/snap.py", line 311, in _snap_daemons
    raise SnapError("Could not {} for snap [{}]: {}".format(_cmd, self._name, e.stderr))
charms.operator_libs_linux.v1.snap.SnapError: Could not ['snap', 'restart', 'grafana-agent'] for snap [grafana-agent]: error: cannot perform the following tasks:
- Run service command "restart" for services ["grafana-agent"] of snap "grafana-agent" (systemctl command [start snap.grafana-agent.grafana-agent.service] failed with exit status 1: Job for snap.grafana-agent.grafana-agent.service failed because the control process exited with error code.
See "systemctl status snap.grafana-agent.grafana-agent.service" and "journalctl -xeu snap.grafana-agent.grafana-agent.service" for details.
)


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/./src/charm.py", line 482, in <module>
    main(GrafanaAgentMachineCharm)
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/venv/ops/main.py", line 441, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/venv/ops/main.py", line 149, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/venv/ops/framework.py", line 354, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/venv/ops/framework.py", line 830, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/venv/ops/framework.py", line 919, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/lib/charms/prometheus_k8s/v0/prometheus_remote_write.py", line 673, in _handle_endpoints_changed
    self.on.endpoints_changed.emit(relation_id=event.relation.id)
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/venv/ops/framework.py", line 354, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/venv/ops/framework.py", line 830, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/venv/ops/framework.py", line 919, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/src/grafana_agent.py", line 333, in on_remote_write_changed
    self._update_config()
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/src/grafana_agent.py", line 403, in _update_config
    self.restart()
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/./src/charm.py", line 288, in restart
    raise GrafanaAgentServiceError("Failed to restart grafana-agent") from e
GrafanaAgentServiceError: Failed to restart grafana-agent
2023-04-25 15:31:21 ERROR juju.worker.uniter.operation runhook.go:153 hook "send-remote-write-relation-joined" (via hook dispatching script: dispatch) failed: exit status 1
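
The snap restart fails because the grafana-agent service itself cannot start; the real cause is in the journal, as the error message says. A hedged sketch of how the charm could surface this instead of failing the hook (this is not the charm's actual code; the BlockedStatus handling is an assumption):

    import logging

    from ops.model import BlockedStatus
    from charms.operator_libs_linux.v1 import snap

    logger = logging.getLogger(__name__)

    def restart(self):
        try:
            self.snap.restart()
        except snap.SnapError as e:
            # Usually means the rendered /etc/grafana-agent.yaml is invalid; see
            # journalctl -u snap.grafana-agent.grafana-agent.service for details.
            logger.error("Failed to restart grafana-agent: %s", e)
            self.unit.status = BlockedStatus("grafana-agent failed to restart")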

To Reproduce

Deploy the kafka bundle and relate it to cos on microk8s

Environment

Model       Controller        Cloud/Region        Version  SLA          Timestamp
controller  foundations-maas  maas_cloud/default  2.9.42   unsupported  15:31:46Z

Machine  State    Address         Inst id  Series  AZ     Message
0        started  10.246.164.122  juju1-7  focal   zone1  Deployed
1        started  10.246.167.8    juju1-8  focal   zone2  Deployed
2        started  10.246.166.182  juju2-9  focal   zone3  Deployed
Model  Controller        Cloud/Region        Version  SLA          Timestamp
kafka  foundations-maas  maas_cloud/default  2.9.42   unsupported  15:31:47Z

SAAS          Status  Store         URL
alertmanager  active  popocatepetl  admin/cos.alertmanager
grafana       active  popocatepetl  admin/cos.grafana
loki          active  popocatepetl  admin/cos.loki
prometheus    active  popocatepetl  admin/cos.prometheus

App                        Version  Status       Scale  Charm                      Channel           Rev  Exposed  Message
kafka                               blocked          3  kafka                      latest/edge       114  no       missing required zookeeper relation
kafka-agent                         maintenance      2  grafana-agent              latest/edge         8  no       Installing grafana-agent snap
ntp                        4.2      active           2  ntp                        latest/candidate   50  no       chrony: Ready
tls-certificates-operator           active           1  tls-certificates-operator  latest/edge        23  no       
zookeeper                           active           3  zookeeper                  latest/edge        98  no       
zookeeper-agent                     error            3  grafana-agent              latest/edge         8  no       hook failed: "send-remote-write-relation-joined"

Unit                          Workload     Agent      Machine  Public address  Ports    Message
kafka/0                       maintenance  executing  0        10.246.166.210           (install) installing charm software
kafka/1                       blocked      executing  1        10.246.167.131           missing required zookeeper relation
  kafka-agent/0*              maintenance  executing           10.246.167.131           (install) Installing grafana-agent snap
  ntp/0*                      active       executing           10.246.167.131  123/udp  (install) chrony: Ready
kafka/2*                      blocked      executing  2        10.246.165.79            missing required zookeeper relation
  kafka-agent/1               maintenance  executing           10.246.165.79            (install) Installing grafana-agent snap
  ntp/1                       active       executing           10.246.165.79   123/udp  (install) chrony: Ready
tls-certificates-operator/0*  active       idle       3        10.246.166.101           
zookeeper/0                   active       idle       4        10.246.167.32            
  zookeeper-agent/1           error        idle                10.246.167.32            hook failed: "send-remote-write-relation-joined"
zookeeper/1*                  active       executing  5        10.246.167.17            
  zookeeper-agent/0*          active       idle                10.246.167.17            
zookeeper/2                   active       idle       6        10.246.165.50            
  zookeeper-agent/2           active       idle                10.246.165.50            

Machine  State    Address         Inst id             Series  AZ     Message
0        started  10.246.166.210  vault2-5            jammy   zone1  Deployed
1        started  10.246.167.131  grafana2-3          jammy   zone3  Deployed
2        started  10.246.165.79   landscapeha-23-2-2  jammy   zone2  Deployed
3        started  10.246.166.101  vault1-7            jammy   zone1  Deployed
4        started  10.246.167.32   grafana1-3          jammy   zone3  Deployed
5        started  10.246.167.17   landscapeha-23-1-2  jammy   zone2  Deployed
6        started  10.246.165.50   microk8s1-4         jammy   zone1  Deployed
Model           Controller        Cloud/Region              Version  SLA          Timestamp
metallb-system  foundations-maas  microk8s_cloud/localhost  2.9.42   unsupported  15:31:48Z

App                 Version                         Status  Scale  Charm               Channel  Rev  Address         Exposed  Message
metallb-controller  res:metallb-controller-imag...  active      1  metallb-controller  stable    41  10.152.183.173  no       
metallb-speaker     res:metallb-speaker-image@6...  active      3  metallb-speaker     stable    36  10.152.183.144  no       

Unit                   Workload  Agent  Address         Ports     Message
metallb-controller/0*  active    idle   10.1.168.196    7472/TCP  
metallb-speaker/0*     active    idle   10.246.165.172  7472/TCP  
metallb-speaker/1      active    idle   10.246.165.74   7472/TCP  
metallb-speaker/2      active    idle   10.246.167.42   7472/TCP  

Model     Controller        Cloud/Region        Version  SLA          Timestamp
microk8s  foundations-maas  maas_cloud/default  2.9.42   unsupported  15:31:48Z

App       Version  Status  Scale  Charm     Channel  Rev  Exposed  Message
microk8s           active      3  microk8s  stable    35  yes      

Unit         Workload  Agent  Machine  Public address  Ports                     Message
microk8s/0*  active    idle   0        10.246.165.74   80/tcp,443/tcp,16443/tcp  
microk8s/1   active    idle   1        10.246.165.172  80/tcp,443/tcp,16443/tcp  
microk8s/2   active    idle   2        10.246.167.42   80/tcp,443/tcp,16443/tcp  

Machine  State    Address         Inst id      Series  AZ     Message
0        started  10.246.165.74   microk8s1-1  jammy   zone1  Deployed
1        started  10.246.165.172  microk8s1-2  jammy   zone2  Deployed
2        started  10.246.167.42   microk8s1-3  jammy   zone3  Deployed

Model  Controller    Cloud/Region              Version  SLA          Timestamp
cos    popocatepetl  microk8s_cloud/localhost  2.9.42   unsupported  15:31:49Z

App           Version  Status  Scale  Charm             Channel  Rev  Address         Exposed  Message
alertmanager  0.23.0   active      1  alertmanager-k8s  stable    47  10.152.183.150  no       
catalogue              active      1  catalogue-k8s     stable    13  10.152.183.94   no       
grafana       9.2.1    active      1  grafana-k8s       stable    64  10.152.183.192  no       
loki          2.4.1    active      1  loki-k8s          stable    60  10.152.183.41   no       
prometheus    2.33.5   active      1  prometheus-k8s    stable   103  10.152.183.184  no       
traefik       2.9.6    active      1  traefik-k8s       stable   110  10.246.167.226  no       

Unit             Workload  Agent      Address      Ports  Message
alertmanager/0*  active    idle       10.1.107.13         
catalogue/0*     active    idle       10.1.166.7          
grafana/0*       active    executing  10.1.166.13         
loki/0*          active    idle       10.1.107.14         
prometheus/0*    active    executing  10.1.166.12         
traefik/0*       active    idle       10.1.107.12         

Offer         Application   Charm             Rev  Connected  Endpoint              Interface                Role
alertmanager  alertmanager  alertmanager-k8s  47   0/0        karma-dashboard       karma_dashboard          provider
grafana       grafana       grafana-k8s       64   2/2        grafana-dashboard     grafana_dashboard        requirer
loki          loki          loki-k8s          60   2/2        logging               loki_push_api            provider
prometheus    prometheus    prometheus-k8s    103  2/2        metrics-endpoint      prometheus_scrape        requirer
                                                              receive-remote-write  prometheus_remote_write  provider

Relevant log output

See Bug Description; crashdumps and other configs can be found [here](https://oil-jenkins.canonical.com/artifacts/d6241418-2e99-4c93-95c0-51aad671b834/index.html)

Additional context

No response

(machine) How to go about having two gagent subords related to the same principal

Enhancement Proposal

If I understand correctly, the second charm would attempt to install a snap that is already installed, resulting in

snap "grafana-agent" is already installed, see 'snap help refresh'

which, by the way, exits with return code 0 (i.e. success).

Specifically, it seems like this cannot give us redundancy, and may create an unintended race, though I'm not sure.

Similarly, what if the principal already happens to have the grafana-agent snap installed?
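
One way to make the install step tolerant of an already-present snap (a sketch assuming the charm keeps using the operator_libs_linux snap helper; the function name is made up):

    from charms.operator_libs_linux.v1 import snap

    def install_agent_snap():
        cache = snap.SnapCache()
        agent = cache["grafana-agent"]
        if agent.present:
            # Already installed by another subordinate or by the principal itself;
            # installing again would only print "already installed" and succeed.
            return
        agent.ensure(snap.SnapState.Latest)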

Machine charm does not currently traverse log directories

Bug Description

If a snap slot provides one or multiple nested folders for logs, we currently do not properly recurse these directories to find files. See the attached screenshot.

[screenshot]

To Reproduce

  1. juju deploy zookeeper, built from source
  2. juju deploy grafana-agent, built from source
  3. juju relate grafana-agent zookeeper
  4. relate via CMRs to Grafana

Environment

Relevant log output

2023-03-31 16:56:57	
Mar 31 14:56:56 juju-8c976f-6 grafana-agent.grafana-agent[10431]: ts=2023-03-31T14:56:56.954835365Z caller=filetarget.go:317 level=info component=logs logs_config=log_file_scraper msg="failed to tail file" error="file is a directory" filename=/snap/grafana-agent/11/shared-logs/zookeeper/version-2
2023-03-31 16:56:56	
ts=2023-03-31T14:56:56.954835365Z caller=filetarget.go:317 level=info component=logs logs_config=log_file_scraper msg="failed to tail file" error="file is a directory" filename=/snap/grafana-agent/11/shared-logs/zookeeper/version-2
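
A possible fix is to glob recursively in the generated logs scrape config instead of listing only the top-level directory entries. A sketch of what that could look like (the path comes from the log above; the '**' pattern and the .log suffix are assumptions about how the charm templates __path__):

    logs:
      configs:
        - name: log_file_scraper
          scrape_configs:
            - job_name: zookeeper
              static_configs:
                - targets: [localhost]
                  labels:
                    # '**' descends into nested directories such as
                    # shared-logs/zookeeper/version-2/ instead of tailing them as files
                    __path__: /snap/grafana-agent/11/shared-logs/**/*.log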

Additional context

No response

(machine_charm) When related over loki_push_api, grafana-agent fails to start

Bug Description

After relating to loki, the grafana-agent service fails to start.

To Reproduce

  1. Deploy kafka or zookeeper
  2. Deploy grafana-agent and relate to the above charm
  3. Relate to loki via cross-model relation
  4. systemctl status snap.grafana-agent.grafana-agent.service

Environment

Relevant log output

ubuntu@juju-488a58-4:~$ /usr/bin/snap run grafana-agent
2023/03/07 15:09:25 error loading config file /etc/grafana-agent.yaml: Loki configs push_api_server and log_file_scraper must have different positions file paths
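
The error means both logs configs resolve to the same positions file. A sketch of a config shape that avoids the clash, either via a shared positions_directory or per-config positions filenames (paths here are placeholders; the key layout follows Grafana Agent's static-mode logs block):

    logs:
      configs:
        - name: push_api_server
          positions:
            filename: /var/lib/grafana-agent/positions-push-api.yaml
          clients: []
        - name: log_file_scraper
          positions:
            filename: /var/lib/grafana-agent/positions-file-scraper.yaml
          clients: []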

Additional context

No response

Downscaling Loki results in a bad config

Bug Description

When a grafana-agent charm is related with (let's say) 3 Loki units, and Loki is scaled down to (let's say) 2 units, the loki section of the config is emptied.

To Reproduce

  1. Deploy grafana-agent
  2. Deploy 3 loki units
  3. Relate both charms juju add-relation grafana-agent loki
  4. Down scale Loki juju scale-application loki 2
  5. Verify that the loki section in the grafana-agent config file is empty: juju ssh --container agent agent/0 cat /etc/agent/agent.yaml

Environment

Grafana agent: main branch

Relevant log output

This is the config file of a grafana-agent charm related with 2 Loki units and then downscaled to 1 Loki unit. 

juju ssh --container agent agent/0 cat /etc/agent/agent.yaml
integrations:
  agent:
    enabled: true
    relabel_configs:
    - regex: (.*)
      replacement: cos_1651a423-4b8c-4f27-8fbd-bff940439eed_agent_agent/0
      target_label: instance
    - replacement: grafana-agent-k8s
      source_labels:
      - __address__
      target_label: juju_charm
    - replacement: cos
      source_labels:
      - __address__
      target_label: juju_model
    - replacement: 1651a423-4b8c-4f27-8fbd-bff940439eed
      source_labels:
      - __address__
      target_label: juju_model_uuid
    - replacement: agent
      source_labels:
      - __address__
      target_label: juju_application
    - replacement: agent/0
      source_labels:
      - __address__
      target_label: juju_unit
  prometheus_remote_write: []
loki: {}
prometheus:
  configs:
  - name: agent_scraper
    remote_write: []
    scrape_configs: []
server:
  log_level: info

Additional context

We must fix the _loki_config method in charm.py.
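
A sketch of what a rebuilt _loki_config could look like (illustrative only; it assumes the loki_push_api consumer exposes the currently-advertised endpoints, e.g. via a loki_endpoints property):

    def _loki_config(self) -> dict:
        # Rebuild the clients list from the endpoints present right now, so
        # removing a Loki unit shrinks the list instead of emptying the section.
        endpoints = self._loki_consumer.loki_endpoints
        if not endpoints:
            return {}
        return {
            "configs": [
                {
                    "name": "push_api_server",
                    "clients": [{"url": ep["url"]} for ep in endpoints],
                }
            ]
        }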

Relating several applications only stores dashboards and alerts from first relation

Bug Description

Relating several applications to grafana-agent will partially fail. Only the dashboards and alerts from the first relation will be sent over to COS.
Logs and metrics seem fine, I can see them from both applications.
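
A hedged sketch of the aggregation that appears to be missing: iterate over every relation on the endpoint rather than taking only the first one (the endpoint name and the databag layout are assumptions based on the cos_agent interface):

    import json

    def _all_dashboards(self):
        dashboards = []
        for relation in self.model.relations["cos-agent"]:
            raw = relation.data[relation.app].get("config", "")
            if raw:
                dashboards.extend(json.loads(raw).get("dashboards", []))
        return dashboards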

To Reproduce

juju relate grafana-agent kafka
juju relate grafana-agent zookeeper
# This will lead to dashboards and alerts only from kafka
-------
juju relate grafana-agent zookeeper
juju relate grafana-agent kafka
# This will lead to dashboards and alerts only from zookeeper

Environment

grafana-agent: channel edge, revision 4
zookeeper: https://github.com/canonical/zookeeper-operator/tree/feature/grafana_agent_integration
kafka: https://github.com/deusebio/kafka-operator/tree/wip-logs-integration

Relevant log output

-

Additional context

No response

Refactor Grafana agent config generation.

This issue is related to #34.

The config file is built on pebble_ready and completed once you relate grafana agent with loki.

We're saving the configuration in stored state because, if you have the grafana agent and loki charms related and you reboot the VM in which microk8s is running, the loki_push_api_endpoint_joined event is not fired again and the config ends up in a different state (without the loki section).

The safest way to make this work is to have a method that builds the whole config file and to call it from both pebble_ready and relation_joined. The method can simply return early if resources are not yet ready; it will be called again when the second hook fires.

This is better because we have all the information needed to construct the config file statelessly, while adding stored state introduces another moving part which could introduce bugs.
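
A minimal sketch of that pattern (hook names, container name and paths are illustrative):

    import yaml

    def _on_agent_pebble_ready(self, event):
        self._update_config()

    def _on_loki_push_api_endpoint_joined(self, event):
        self._update_config()

    def _update_config(self):
        container = self.unit.get_container("agent")
        if not container.can_connect():
            return  # not ready yet; the next hook will call us again
        config = self._build_full_config()  # stateless: derived from relations and charm config
        container.push("/etc/agent/agent.yaml", yaml.safe_dump(config), make_dirs=True)
        container.restart("agent")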

Unit tests failing

➜  grafana-agent-operator git:(main) tox -e unit
unit installed: attrs==21.2.0,cachetools==4.2.4,certifi==2021.5.30,charset-normalizer==2.0.6,coverage==6.0,google-auth==2.2.1,idna==3.2,iniconfig==1.1.1,kubernetes==18.20.0,oauthlib==3.1.1,ops==1.2.0+42.g7420d84,packaging==21.0,pluggy==1.0.0,py==1.10.0,pyasn1==0.4.8,pyasn1-modules==0.2.8,pyparsing==2.4.7,pytest==6.2.5,python-dateutil==2.8.2,PyYAML==5.4.1,requests==2.26.0,requests-oauthlib==1.3.0,responses==0.14.0,rsa==4.7.2,six==1.16.0,toml==0.10.2,urllib3==1.26.7,websocket-client==1.2.1
unit run-test-pre: PYTHONHASHSEED='477063725'
unit run-test: commands[0] | coverage run --source=/home/jose/trabajos/canonical/repos/grafana-agent-operator/src/ -m unittest discover -v /home/jose/trabajos/canonical/repos/grafana-agent-operator/tests//unit
Can't read 'pyproject.toml' without TOML support. Install with [toml] extra
ERROR: InvocationError for command /home/jose/trabajos/canonical/repos/grafana-agent-operator/.tox/unit/bin/coverage run --source=/home/jose/trabajos/canonical/repos/grafana-agent-operator/src/ -m unittest discover -v tests/unit (exited with code 1)
__________________________________________________________________________________________________________________________________________________________________________ summary __________________________________________________________________________________________________________________________________________________________________________
ERROR:   unit: commands failed
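
The failure comes from coverage lacking TOML support when it tries to read pyproject.toml. The likely fix is to request coverage's toml extra in the unit testenv, for example (the deps list is illustrative):

    [testenv:unit]
    deps =
        coverage[toml]
        pytest
    commands =
        coverage run --source={toxinidir}/src -m unittest discover -v tests/unit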

Alert rules are injected with incorrect label filters

Bug Description

The Grafana agent charm has default alert rules for Prometheus. However, when the charm is deployed and related to COS, the alert rules are injected with incorrect JujuTopology (juju_*) labels, and those labels match no timeseries.

Take, for example, the HostOomKillDetected alert rule. In the Prometheus web UI, it was modified to the following expr:

[screenshot: modified HostOomKillDetected expr]

However, in Prometheus, the node_vmstat_oom_kill metric does not have any timeseries with those labels.

[screenshot: node_vmstat_oom_kill labels in Prometheus]

So there's a label mismatch between the actual metric coming from grafana-agent via node_exporter and the labels injected into the alert rule.
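
For illustration only (the label values below are invented), the mismatch amounts to the rule querying something like the first expression while the scraped series carries different juju_* values, so nothing matches:

    # expr as injected into the alert rule (hypothetical labels)
    node_vmstat_oom_kill{juju_application="grafana-agent", juju_unit="grafana-agent/0"}
    # labels actually present on the scraped series (hypothetical)
    node_vmstat_oom_kill{juju_application="ubuntu", juju_unit="ubuntu/0"}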

To Reproduce

  1. Deploy COS-lite, or use an existing deployment
  2. Deploy ubuntu charm and grafana-agent machine charm
  3. relate ubuntu and grafana-agent; relate grafana-agent to COS-lite
  4. Open Prometheus web UI, check the alert rules and inspect the expr

Environment

Built from latest source @ 73543ef, or latest/edge

Relevant log output

None

Additional context

No response

Network related panels in the System Resources dashboard don't keep track of interface names

Bug Description

All the network-related panels in the System Resources dashboard display a hardcoded legend text that does not distinguish among interface names, rendering filtering by interface impossible.

[screenshot]

The device names do exist in the metrics so it's just a matter of exposing them in the dashboard.
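
A sketch of the dashboard-side fix: surface the device label in the legend so each interface gets its own series name (the metric name is node_exporter's standard one; the real dashboard query will also carry topology filters, omitted here):

    {
      "expr": "rate(node_network_receive_bytes_total[5m])",
      "legendFormat": "{{device}} rx"
    }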

To Reproduce

  1. juju deploy ubuntu
  2. juju deploy grafana-agent
  3. juju relate ubuntu grafana-agent
  4. relate grafana-agent with cos (cross-model)

Environment

This is a small COS-on-OpenStack environment.

Sending side (the microk8s unit contains the COS):

ubuntu@aieri-bastion:~$ juju status -m machines --relations
Model     Controller        Cloud/Region             Version  SLA          Timestamp
machines  aieri-controller  serverstack/serverstack  2.9.42   unsupported  00:10:05Z

SAAS                             Status  Store             URL
grafana-dashboards               active  aieri-controller  admin/cos-lite.grafana-dashboards
loki-logging                     active  aieri-controller  admin/cos-lite.loki-logging
prometheus-receive-remote-write  active  aieri-controller  admin/cos-lite.prometheus-receive-remote-write

App            Version  Status  Scale  Charm          Channel  Rev  Exposed  Message
grafana-agent           active      2  grafana-agent  edge       5  no       
microk8s                active      1  microk8s       stable    35  no       
ubuntu                  active      2  ubuntu         stable    22  no       

Unit                Workload  Agent  Machine  Public address  Ports                     Message
microk8s/1*         active    idle   1        10.5.1.159      80/tcp,443/tcp,16443/tcp  
ubuntu/0*           active    idle   2        10.5.3.110                                
  grafana-agent/0*  active    idle            10.5.3.110                                
ubuntu/1            active    idle   3        10.5.1.146                                
  grafana-agent/1   active    idle            10.5.1.146                                

Machine  State    Address     Inst id                               Series  AZ    Message
1        started  10.5.1.159  cf92d2b1-8ed7-47cb-803d-39ede6b65b02  focal   nova  ACTIVE
2        started  10.5.3.110  f64a8889-6356-495e-9cc9-48ce60d852f7  focal   nova  ACTIVE
3        started  10.5.1.146  2cb13a22-d7c6-40f3-8966-eb88e178e558  focal   nova  ACTIVE

Relation provider                                     Requirer                              Interface                Type         Message
grafana-agent:grafana-dashboards-provider             grafana-dashboards:grafana-dashboard  grafana_dashboard        regular      
loki-logging:logging                                  grafana-agent:logging-consumer        loki_push_api            regular      
microk8s:cluster                                      microk8s:cluster                      microk8s-cluster         peer         
prometheus-receive-remote-write:receive-remote-write  grafana-agent:send-remote-write       prometheus_remote_write  regular      
ubuntu:juju-info                                      grafana-agent:juju-info               juju-info                subordinate  

Receiving side:

ubuntu@aieri-bastion:~$ juju status -m cos-lite --relations
Model     Controller        Cloud/Region     Version  SLA          Timestamp
cos-lite  aieri-controller  micro/localhost  2.9.42   unsupported  00:12:21Z

App           Version  Status  Scale  Charm             Channel  Rev  Address         Exposed  Message
alertmanager  0.25.0   active      1  alertmanager-k8s  edge      64  10.152.183.57   no       
catalogue              active      1  catalogue-k8s     edge      14  10.152.183.103  no       
grafana       9.2.1    active      1  grafana-k8s       edge      76  10.152.183.97   no       
loki          2.7.4    active      1  loki-k8s          edge      80  10.152.183.215  no       
prometheus    2.42.0   active      1  prometheus-k8s    edge     119  10.152.183.139  no       
traefik       2.9.6    active      1  traefik-k8s       edge     124  10.5.1.159      no       

Unit             Workload  Agent  Address       Ports  Message
alertmanager/0*  active    idle   10.1.102.153         
catalogue/0*     active    idle   10.1.102.154         
grafana/0*       active    idle   10.1.102.155         
loki/0*          active    idle   10.1.102.156         
prometheus/0*    active    idle   10.1.102.157         
traefik/0*       active    idle   10.1.102.158         

Offer                            Application   Charm             Rev  Connected  Endpoint              Interface                Role
alertmanager-karma-dashboard     alertmanager  alertmanager-k8s  64   0/0        karma-dashboard       karma_dashboard          provider
grafana-dashboards               grafana       grafana-k8s       76   1/1        grafana-dashboard     grafana_dashboard        requirer
loki-logging                     loki          loki-k8s          80   1/1        logging               loki_push_api            provider
prometheus-receive-remote-write  prometheus    prometheus-k8s    119  1/1        receive-remote-write  prometheus_remote_write  provider
prometheus-scrape                prometheus    prometheus-k8s    119  0/0        metrics-endpoint      prometheus_scrape        requirer

Relation provider                   Requirer                     Interface              Type     Message
alertmanager:alerting               loki:alertmanager            alertmanager_dispatch  regular  
alertmanager:alerting               prometheus:alertmanager      alertmanager_dispatch  regular  
alertmanager:grafana-dashboard      grafana:grafana-dashboard    grafana_dashboard      regular  
alertmanager:grafana-source         grafana:grafana-source       grafana_datasource     regular  
alertmanager:replicas               alertmanager:replicas        alertmanager_replica   peer     
alertmanager:self-metrics-endpoint  prometheus:metrics-endpoint  prometheus_scrape      regular  
catalogue:catalogue                 alertmanager:catalogue       catalogue              regular  
catalogue:catalogue                 grafana:catalogue            catalogue              regular  
catalogue:catalogue                 prometheus:catalogue         catalogue              regular  
grafana:grafana                     grafana:grafana              grafana_peers          peer     
grafana:metrics-endpoint            prometheus:metrics-endpoint  prometheus_scrape      regular  
loki:grafana-dashboard              grafana:grafana-dashboard    grafana_dashboard      regular  
loki:grafana-source                 grafana:grafana-source       grafana_datasource     regular  
loki:metrics-endpoint               prometheus:metrics-endpoint  prometheus_scrape      regular  
prometheus:grafana-dashboard        grafana:grafana-dashboard    grafana_dashboard      regular  
prometheus:grafana-source           grafana:grafana-source       grafana_datasource     regular  
prometheus:prometheus-peers         prometheus:prometheus-peers  prometheus_peers       peer     
traefik:ingress                     alertmanager:ingress         ingress                regular  
traefik:ingress                     catalogue:ingress            ingress                regular  
traefik:ingress-per-unit            loki:ingress                 ingress_per_unit       regular  
traefik:ingress-per-unit            prometheus:ingress           ingress_per_unit       regular  
traefik:metrics-endpoint            prometheus:metrics-endpoint  prometheus_scrape      regular  
traefik:traefik-route               grafana:ingress              traefik_route          regular  

Relevant log output

N/A

Additional context

No response
