grafana / agent
Vendor-neutral programmable observability pipelines.
Home Page: https://grafana.com/docs/agent/
License: Apache License 2.0
The configuration reference document is currently hand written, large, and subject to change when updating vendors. It would be nice to have a script that autogenerates the entire document. Cortex has something like this for its equivalent config reference that could be used for inspiration.
See full configuration reference; there are many fields marked as secret that aren’t being converted to a string for storage. For example, the basic auth credentials in the scrape configs.
We need a better solution than replicating all the types that have a secret field.
I stumbled across the deployment manifest here: https://github.com/grafana/agent/blob/master/production/kubernetes/agent.yaml and I have a few questions:
You are using minReadySeconds (https://github.com/grafana/agent/blob/master/production/kubernetes/agent.yaml#L250), I assume because there are no liveness/readiness probes (yet). Could I use the /metrics endpoint instead, or what's the purpose behind using minReadySeconds?
Security context is set to root user / privileged (https://github.com/grafana/agent/blob/master/production/kubernetes/agent.yaml#L274-L276). Why does the agent need root permissions? Opening a socket on port 80 is probably one reason, but I think the port could easily be changed to another port which does not require root permissions?
There are two different paths specified for the wal_directory. Does the argument override the setting from the YAML config, or are these two different config options? See: https://github.com/grafana/agent/blob/master/production/kubernetes/agent.yaml#L203 and https://github.com/grafana/agent/blob/master/production/kubernetes/agent.yaml#L262
Work needs to be done to investigate what the overhead is of running (for example) 100 scrape configs within a single instance vs 100 scrape configs spread across 100 instances.
If the overhead is small enough, no action is needed and this ticket can be closed. Otherwise, if the overhead is non-trivial, it needs to:
a) be documented with a warning to avoid multiple instances where possible
b) have an issue opened for allowing the various Agent systems (integrations, scraping service) to run within a single instance
We need a CI action to create and push the grafana/agent image whenever a merge happens.
The process of migrating from a Prometheus/OpenMetrics system that supports recording rules to the Agent is made more difficult by the Agent's total lack of knowledge about recording rules. While there are currently no plans to support recording rules within the Agent, the Agent could act as a client to the Cortex rules API. This would let users use the Agent with a more complete Prometheus-like config and avoid installing yet another tool for their migration (e.g., the very useful cortextool).
For example, we could add rule_files to the instance config:
server:
  log_level: info
  http_listen_port: 12345

prometheus:
  global:
    scrape_interval: 5s
  configs:
    - name: test
      host_filter: false
      ###################
      ### THIS IS NEW ###
      ###################
      rule_files:
        - recording-rules.yml
      scrape_configs:
        - job_name: local_scrape
          static_configs:
            - targets: ['127.0.0.1:12345']
              labels:
                cluster: 'localhost'
      remote_write:
        - url: http://localhost:9009/api/prom/push
/cc @gotjosh
The rfratto/prometheus fork is being used while memory improvements are being investigated and tested. Any changes made in the fork should eventually make their way back upstream through a Prometheus PR.
The Agent doesn't use strict YAML parsing, so invalid YAML fields are silently ignored, which makes it hard to track issues down.
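One low-risk fix would be switching the config loader to strict unmarshaling so unknown fields are rejected up front. A minimal sketch, assuming gopkg.in/yaml.v2 (already vendored through Prometheus); the Config struct and file name here are placeholders, not the Agent's real ones:

package main

import (
	"fmt"
	"io/ioutil"
	"log"

	"gopkg.in/yaml.v2"
)

// Config is a stand-in for the Agent's real config struct.
type Config struct {
	Server struct {
		LogLevel       string `yaml:"log_level"`
		HTTPListenPort int    `yaml:"http_listen_port"`
	} `yaml:"server"`
}

func main() {
	buf, err := ioutil.ReadFile("agent.yaml")
	if err != nil {
		log.Fatal(err)
	}

	var cfg Config
	// UnmarshalStrict fails on fields that don't map to the struct instead
	// of silently dropping them like Unmarshal does.
	if err := yaml.UnmarshalStrict(buf, &cfg); err != nil {
		log.Fatalf("invalid config: %v", err)
	}
	fmt.Printf("loaded config: %+v\n", cfg)
}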
It appears that the overhead of Kubernetes SD increases across the agents as:
a) the number of nodes increases
b) the number of pods increases
We noticed that a DaemonSet of 20 agents was using more CPU combined than a single instance of Prometheus that included more CPU-intensive tasks, such as TSDB, Recording Rules, and Alerts.
When the agents are running as a DaemonSet with host_filter, it would be less CPU-intensive if each agent only scraped pods on the node it's running on rather than pods across the entire cluster.
I'm not sure how/if this can be done.
As part of the agent we want to be able to bundle multiple exporters into the agent itself; node_exporter is one of them.
We'd like to understand what work is needed to embed it. There are multiple options here which are worth considering:
We can get the agent to scrape itself with whatever node_exporter exposes. This translates to just exposing the exporter metrics on /metrics and letting the agent scrape the endpoint.
We can get the contents of the node_exporter metrics and ship them through the remote-write storage.
We also need to understand what node_exporter is doing in its main file. Should the exporter's metrics be exposed on a separate /metrics endpoint? We believe Prometheus might have increased memory usage if we scrape endpoints with many metrics, plus we'd like to keep concerns separate per exporter.

I'm not sure if anything needs to change, but I noticed the two following Loki issues this morning:
grafana/loki#2080
grafana/loki#2091
The Jsonnet configs from the agent might be outdated and need to be synced up with the Prometheus Jsonnet.
Prometheus v2.18.1 came out recently; we should update for its remote_write fixes.
Per the maintenance guide, here are the grafana/prometheus branches that need to be updated to the latest release:
agent/production/kubernetes/agent.yaml (line 171 in 3cbb06b): this should be metric_relabel_configs instead, I believe?
This issue is meant to track new versions of Prometheus and decide whether they introduce new functionality that the Agent would benefit from. We will note each version as:
The current vendored version of Prometheus is v2.21.0.
It can be confusing for users to call into the config management API only to get a 404 - it is hard to tell from the client side whether the 404 is returned because the URL is wrong or because the Agent being invoked does not have the scraping service mode enabled.
All of the API endpoints should be wired up even when scraping service mode is disabled; when it is disabled, they should return HTTP 405 with a helpful error message.
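A rough sketch of what that wiring could look like, assuming the config API is mounted on a gorilla/mux router; the route prefix and error message are illustrative, not the Agent's actual values:

package api

import (
	"net/http"

	"github.com/gorilla/mux"
)

// registerConfigAPI always registers the config management routes. When the
// scraping service mode is disabled, the routes answer with HTTP 405 and an
// explanatory message instead of falling through to a generic 404.
func registerConfigAPI(r *mux.Router, scrapingServiceEnabled bool, api http.Handler) {
	if scrapingServiceEnabled {
		r.PathPrefix("/agent/api/v1/").Handler(api)
		return
	}
	r.PathPrefix("/agent/api/v1/").HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		http.Error(w,
			"the config management API is disabled; start the Agent with the scraping service mode enabled to use it",
			http.StatusMethodNotAllowed)
	})
}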
Currently, make agent defaults to building the debug binary. This is fine and makes sense, but the case for debug by default makes less sense for the Docker containers. make agent should continue to be the debug build by default, while the Dockerfile should default to the release build.
We need Jsonnet configs and better dashboards for deploying the agent. The dashboards are lower priority, but the configs are important to be able to create an install script.
We're not including the Prometheus Remote Write dashboard right now; we should add it in for the Agent.
Per the docs, it looks like one should be able to use password_file in a remote_write configuration (as in Prometheus; desirable because it plays well with K8s secrets, etc.). Best I can tell, though, a password_file value appears to be checked for / validated, but never makes it into the actual runtime configuration used? (I was led to this by 401 Unauthorized errors which I believe are attributable to this.)
More documentation is needed around the WAL appender and spawned "Prometheus" instances, namely:
It'd be nice to have a document that describes all metrics the Agent creates, similar to the one found for Loki.
For debugging, it would be useful to have an API to:
This can be built using the MetadataList method that is exposed in the scrape targets.
Depends on #83
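A rough sketch of such a debug endpoint, assuming each instance exposes its Prometheus scrape.Manager; the handler and route are hypothetical:

package api

import (
	"encoding/json"
	"net/http"

	"github.com/prometheus/prometheus/scrape"
)

// metadataHandler dumps the metric metadata known by every active scrape
// target, keyed by job name.
func metadataHandler(mgr *scrape.Manager) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		out := map[string][]scrape.MetricMetadata{}
		for job, targets := range mgr.TargetsActive() {
			for _, t := range targets {
				out[job] = append(out[job], t.MetadataList()...)
			}
		}
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(out)
	}
}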
A node_exporter integration should be added to the Agent, implementing the Integration interface.
If an Agent goes down and a config hashes to it, that config will not be loaded until the node is forgotten from the ring. Can we remove the dependency on the quorum checking for the Agent? It's not clear if we benefit from the same guarantees that Cortex needs when looking up something in the ring.
The shutdown process of the Agent is different than how Prometheus does it, since the Agent writes staleness markers on shutdown by default.
Documentation around this process should help.
For a bonus, the log messages when waiting to write staleness markers should log how much longer is left before it times out.
Can we add up{job="<job-name>", instance="<instance-id>"} into this agent? This metric is especially useful for monitoring any scrape failure. Are there any alternative metrics we can use if there are concerns about adding the up metric to the agent?
My one use case on this is to confirm which agent is scraping which set of pods. This is especially helpful to confirm sharding works.
Currently host filtering just checks if the system's hostname matches a target's __address__ or __host__ field. This won't work when the URL specified in one of those fields contains an alias to the node running the agent.
The host filtering mechanism should be improved to match on all of the following:
Then, matching a target's hostname should first be done on its original value, followed by its DNS resolved IP if the original value doesn't match.
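A minimal sketch of that matching logic; matchesHost and localAddresses are hypothetical helpers, and the target's host would be extracted from __address__ or __host__ with any port stripped first:

package hostfilter

import (
	"net"
	"os"
)

// matchesHost compares the target's host against this machine's hostname and
// interface addresses first, then falls back to comparing the target's
// DNS-resolved addresses against the machine's interface addresses.
func matchesHost(targetHost string, localAddrs map[string]struct{}) bool {
	if hostname, err := os.Hostname(); err == nil && targetHost == hostname {
		return true
	}
	if _, ok := localAddrs[targetHost]; ok {
		return true
	}
	resolved, err := net.LookupHost(targetHost)
	if err != nil {
		return false
	}
	for _, addr := range resolved {
		if _, ok := localAddrs[addr]; ok {
			return true
		}
	}
	return false
}

// localAddresses collects the IPs assigned to this machine's interfaces,
// used as the right-hand side of the comparison above.
func localAddresses() (map[string]struct{}, error) {
	addrs, err := net.InterfaceAddrs()
	if err != nil {
		return nil, err
	}
	out := make(map[string]struct{}, len(addrs))
	for _, a := range addrs {
		if ipNet, ok := a.(*net.IPNet); ok {
			out[ipNet.IP.String()] = struct{}{}
		}
	}
	return out, nil
}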
I noticed an agent with an infinitely growing WAL today. This was found in the logs:
level=warn ts=2020-04-21T21:26:44.741058582Z caller=instance.go:365 agent=prometheus instance=agent msg="could not truncate WAL" err="create checkpoint: read segments: corruption in segment /tmp/agent/data/agent/wal/00001087 at 131133: unexpected full record"
This is preventing truncation from running and required a full delete of the WAL.
When validating the config fails, the process should exit rather than continuing on as normal. See line 77 in b0016d2.
At the moment, node_exporter is controlled using a set of query parameters on the scrape URL. We'd like to avoid this and configure it using a similar structure to what we have today: YAML config.
A key named integrations on the Config struct would hold a collection of exporters; e.g., the entry for node_exporter would look like:

server:
  log_level: info
  http_listen_port: 12345

prometheus:
  global:
    scrape_interval: 5s

integrations:
  node_exporter:
    an_example_option_here:
We need a copy-and-paste installation script to automatically generate k8s configs for the agent. It should use the same scrape configs that the Tanka deployment uses, but assist the user in inserting the remote_write URL, username, and API key.
Ideally, the script should have an interactive mode as well as a flags mode:
# Flags mode
$ ./grafana_cloud_deploy.sh -l $REMOTE_WRITE_URL -u $AUTH_USER -p $AUTH_PASSWORD | kubectl apply
# Interactive mode
$ ./grafana_cloud_deploy.sh | kubectl apply
Enter your remote write URL: https://example.com/api/prom/push
Enter your remote write username: 12345
Enter your remote write password: s3cur3p455w0rd
The flags mode makes it easy for tooling to generate a copy and paste set of commands (set up environment variables, run script) while the interactive mode is provided for convenience: users can just copy and paste each section and not have to modify anything before running it in the terminal.
Looks like the filesystem config regexes aren't working for Linux
As part of the project to embed exporters within the Agent, we need a generalized system that can run an "integration." An integration will initially be defined as:
a) something that exposes metrics over HTTP (e.g., at /integrations/<integration>/metrics)
b) the ability to Start and Stop an integration

Integrations will have to implement an interface to expose these functionalities, roughly something like the following:
type Integration interface {
	// Name returns the name of the integration. Must be unique.
	Name() string

	// RegisterRoutes should register HTTP handlers for the integration.
	RegisterRoutes(r *mux.Router) error

	// MetricsEndpoints should return the endpoint(s) for the integration
	// that expose Prometheus metrics.
	MetricsEndpoints() []string

	// Run should start the integration and block until it fails or ctx is canceled.
	Run(ctx context.Context) error
}
The Agent must run each integration and make sure that they stay alive - if an integration exits unexpectedly, it should be restarted.
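A minimal sketch of that supervision loop, assuming the Integration interface above; the package name and backoff duration are arbitrary:

package integrations

import (
	"context"
	"log"
	"time"
)

// runIntegration runs an integration and restarts it with a short backoff if
// it exits unexpectedly, stopping only when the context is canceled.
func runIntegration(ctx context.Context, i Integration) {
	for {
		err := i.Run(ctx)
		if ctx.Err() != nil {
			// The Agent is shutting down; stop supervising.
			return
		}
		log.Printf("integration %s exited unexpectedly: %v; restarting", i.Name(), err)
		select {
		case <-time.After(5 * time.Second):
		case <-ctx.Done():
			return
		}
	}
}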
The Agent should also create a special non-configurable Prometheus Instance that doesn't run any SD and collects its targets from the running integrations. This special Prometheus instance will be dedicated to scraping metrics from integrations.
There should be one "sample" integration added as part of the initial implementation: an agent integration, where the Agent collects its own metrics. A second integration, an embedded node_exporter, should eventually be added, but is out of scope for this issue.
Integrations should be placed in the Agent config file under its own dedicated "namespace":
server:
  <server_config>

prometheus:
  <prometheus_config>

integrations:
  # Settings for the "agent" integration
  agent:
    enabled: true

  # Directory to store the WAL for integration samples
  wal_dir: <string>

  # All integrations will remote write to these endpoints
  prometheus_remote_write:
    - <remote_write_config>
The implementation for this feature should be split across multiple PRs.
Should do roughly the same thing that Loki does when a tag is made (e.g., v1.2.3). At least for the first release, just build prebuilt binaries for the following platforms:
Health and Readiness handlers should be added to the agent. This should be pretty easy to do; Prometheus' existing handlers don't do anything but return HTTP 200.
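For illustration, a sketch of what those handlers might look like; the route paths follow Prometheus' /-/healthy and /-/ready convention, and whether the Agent uses the same paths is an open choice:

package api

import "net/http"

// registerHealthEndpoints wires up health and readiness handlers that, like
// Prometheus' own, do nothing but return HTTP 200.
func registerHealthEndpoints(mux *http.ServeMux) {
	mux.HandleFunc("/-/healthy", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte("Agent is Healthy.\n"))
	})
	mux.HandleFunc("/-/ready", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte("Agent is Ready.\n"))
	})
}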
Today, users using integrations alongside normal scrape configs have to configure remote_write twice: once for the Prometheus instance config and once for the integrations config.
It'd be nice to have a global remote write section within the prometheus block that affects all instance configs and integration configs, removing the need to put the same remote_write config twice.
In lieu of a Helm chart, it would be useful to provide rendered YAML in the production folder. The rendered YAML may even be useful for creating the deploy script: it could download the YAML, replace environment variables, and print out the result.
When running the agent with a bare config that includes a scrape_config but omits scrape_interval, I see a crash due to the zero scrape interval:
panic: runtime error: integer divide by zero
goroutine 291 [running]:
github.com/prometheus/prometheus/scrape.(*Target).offset(0xc000409900, 0x0, 0xeab27cf2a6289c0a, 0x0)
/home/rob/grafana/agent/vendor/github.com/prometheus/prometheus/scrape/target.go:161 +0x12d
github.com/prometheus/prometheus/scrape.(*scrapeLoop).run(0xc000409a40, 0x0, 0x0, 0x0)
/home/rob/grafana/agent/vendor/github.com/prometheus/prometheus/scrape/scrape.go:914 +0x77
created by github.com/prometheus/prometheus/scrape.(*scrapePool).sync
/home/rob/grafana/agent/vendor/github.com/prometheus/prometheus/scrape/scrape.go:423 +0x6ef
For reference, the config is
server:
  http_listen_address: localhost
  http_listen_port: 9898

prometheus:
  wal_directory: ./wal
  configs:
    - name: myconfig
      remote_write:
        - url: https://prometheus-us-central1.grafana.net/api/prom/push
          basic_auth:
            username: xxx
            password: yyy
      scrape_configs:
        - job_name: 'node'
          static_configs:
            - targets: ['localhost:9100']
(I'll submit a PR with a tentative fix; thought I'd file this separately anyway for documentation purposes.)
The project needs documentation. Loki's docs are a good starting point (although I'm biased since I wrote them), but in general we need to document:
Prometheus stores basic_auth credentials as a Secret type, which implements yaml.Marshaler and always forces the string to be <secret>. This breaks the configuration storage.
The yamlCodec in pkg/prometheus/ha should wrap the instance.Config type and store secret values separately. The following are secrets:
Note that, for the same reason, hashing configs to detect whether a config changed is broken; the value being hashed is not the secret but rather the string <secret>. To fix this, the secret values above should be added separately to the hashing function if they are non-nil and non-empty.
The release script is hard coded with a release version, meaning we have to manually update it every release. It's unlikely users will want to install a release that's not the latest, so we should modify the script to find the latest tag and use that.
Related to the discussion from #35.
It would be useful to uniquely identify incoming requests from grafana/agent within the remote_write API implementation. One way to do this would be to set the User-Agent header to identify it as the Agent.
Currently, Prometheus hard-codes the User-Agent header; a PR is needed upstream to allow for overriding this specific header or a mechanism to arbitrarily change headers (e.g., passing a custom http.Client).
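For reference, the custom http.Client route would look roughly like this; the User-Agent string and helper names here are made up for illustration:

package useragent

import "net/http"

// userAgentTransport wraps a RoundTripper and overwrites the User-Agent
// header on every outgoing remote_write request.
type userAgentTransport struct {
	next http.RoundTripper
	ua   string
}

func (t *userAgentTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	// Clone the request before mutating it, per the RoundTripper contract.
	r := req.Clone(req.Context())
	r.Header.Set("User-Agent", t.ua)
	return t.next.RoundTrip(r)
}

// NewClient returns an http.Client whose requests identify themselves as the
// Agent.
func NewClient(version string) *http.Client {
	return &http.Client{
		Transport: &userAgentTransport{
			next: http.DefaultTransport,
			ua:   "GrafanaAgent/" + version,
		},
	}
}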
#60 introduced a command line tool that allows users to sync a directory of config files to the Agent's config management server. It requires that all files are valid instance config YAMLs before running. If one of the files isn't valid YAML, nothing will be uploaded.
For users that want to see if all config files in a directory are valid, it would be useful to have a dry-run flag that stops after validating the directory and doesn't upload anything.
agent/production/tanka/grafana-agent/config.libsonnet, lines 363 to 370 in 8a24eec:
Using kubernetes_sd_configs with host_filter: true only works if the role for SD is set to node or pod. If it's anything else, targets won't have any of the labels the host filterer depends on for filtering:

- __address__ won't be set to the node's IP
- __meta_kubernetes_pod_node_name won't exist
- __meta_kubernetes_node_name won't exist

And, as such, all targets will be dropped. This is preventing all Kubernetes API metrics from being scraped. We're still seeing some apiserver metrics, but based on the host filtering rules, the scrape job for role: endpoints should never work fully.
While you could set up a second config with host_filter: false in the Agent, this will cause a problem for the DaemonSet deployment, as all agents across all Kubernetes nodes will scrape the same target and run into out-of-order timestamp errors. host_filter: false as a DaemonSet would only work if Cortex clients could configure how HA deduplication should be applied, which they currently can't.
The other solution, although ugly, is to run a second Agent deployment with one replica dedicated to scraping metrics from targets where host_filter: true does not work.
This second deployment should be added to the Agent Tanka configs, since this issue affects all users of the Agent using Tanka and the YAML generated from it.
The Agent is missing some validation checks that Prometheus has:
Check the links to see what Prometheus does.
The instance configs don't do validation checks on unmarshaling like Prometheus does; rather, validation is broken up into ApplyDefaults and Validate. However, ApplyDefaults and Validate aren't being called from the instance config API; this should also be fixed.
Implementing this makes the agent behave a little differently than Prometheus, but it makes sense for the agent: a node running the agent may be removed, and in that case we'd need to write the staleness markers.
This requires shutting down the scrape manager first so no new samples get appended. Then, for each instance that is stopping, it should append a staleness marker to all active series. The instance should then wait until the staleness markers were written by the remote storage, then the instance can stop.
After all instances stop, the remote storage can stop.
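A minimal sketch of the staleness-marker step, using a locally defined appender interface so it doesn't pin a specific Prometheus storage.Appender signature; writeStalenessMarkers and its arguments are hypothetical:

package instance

import (
	"time"

	"github.com/prometheus/prometheus/pkg/labels"
	"github.com/prometheus/prometheus/pkg/value"
)

// appender is a minimal stand-in for the storage.Appender methods used here.
type appender interface {
	Add(l labels.Labels, t int64, v float64) error
	Commit() error
}

// writeStalenessMarkers appends a StaleNaN sample for every active series so
// downstream storage marks them stale instead of waiting for them to age out.
func writeStalenessMarkers(app appender, activeSeries []labels.Labels) error {
	ts := time.Now().UnixNano() / int64(time.Millisecond)
	for _, ls := range activeSeries {
		if err := app.Add(ls, ts, value.StaleNaN); err != nil {
			return err
		}
	}
	return app.Commit()
}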
Memory usage of the agent for ~30K series is around 350 MB of in-use heap memory. We should try as much as we can to investigate where that memory usage is coming from and what can be done to improve it. This will likely have to be done in a temporary fork of Prometheus before the memory improvements can be reintroduced upstream.
TODO
[Image: memory usage comparing Cortex, the agent, and Prometheus, where both the agent and Prometheus are scraping the same sources.]
It's easy to make changes to the Agent and forget to run some things before opening a PR:
go mod tidy
go mod vendor
make example-dashboards (produces example dashboards used in the Docker-Compose example)
make example-kubernetes (produces Kubernetes manifest used for installing the Agent)

It would be nice if GitHub Actions ran these commands against a PR and failed if any of them cause git diffs.