grafana / agent
Vendor-neutral programmable observability pipelines.
Home Page: https://grafana.com/docs/agent/
License: Apache License 2.0
The configuration reference document is currently hand written, large, and subject to change when updating vendors. It would be nice to have a script that autogenerates the entire document. Cortex has something like this for its equivalent config reference that could be used for inspiration.
See full configuration reference; there are many fields marked as secret that aren’t being converted to a string for storage. For example, the basic auth credentials in the scrape configs.
We need a better solution than replicating all the types that have a secret field.
I stumbled across the deployment manifest here: https://github.com/grafana/agent/blob/master/production/kubernetes/agent.yaml and I have a few questions:
You are using minReadySeconds (https://github.com/grafana/agent/blob/master/production/kubernetes/agent.yaml#L250), I assume because there are no liveness/readiness probes (yet). Could I use the /metrics endpoint instead, or what's the purpose behind using minReadySeconds?
Security context is set to root user / privileged (https://github.com/grafana/agent/blob/master/production/kubernetes/agent.yaml#L274-L276). Why does the agent need root permissions? Opening a socket on port 80 is probably one reason, but I think the port could easily be changed to another port which does not require root permissions?
There are two different paths specified for the wal_directory. Does the argument override the setting from the YAML config, or are these two different config options? See: https://github.com/grafana/agent/blob/master/production/kubernetes/agent.yaml#L203 and https://github.com/grafana/agent/blob/master/production/kubernetes/agent.yaml#L262
Work needs to be done to investigate what the overhead is of running (for example) 100 scrape configs within a single instance vs 100 scrape configs spread across 100 instances.
If the overhead is small enough, no action is needed and this ticket can be closed. Otherwise, if the overhead is non-trivial, it needs to:
a) be documented with a warning to avoid multiple instances where possible
b) have an issue opened for allowing the various Agent systems (integrations, scraping service) to run within a single instance
We need a CI action to create and push the grafana/agent image whenever a merge happens.
The process of migrating from a Prometheus/OpenMetrics system that supports recording rules to the Agent is made more difficult by the Agent's total lack of knowledge about recording rules. While there are currently no plans to support recording rules within the Agent, the Agent could act as a client to the Cortex rules API. This would let users use the Agent with a more complete Prometheus-like config and avoid installing yet another tool for their migration (e.g., the very useful cortextool).
For example, we could add rule_files to the instance config:
server:
  log_level: info
  http_listen_port: 12345

prometheus:
  global:
    scrape_interval: 5s
  configs:
    - name: test
      host_filter: false
      ###################
      ### THIS IS NEW ###
      ###################
      rule_files:
        - recording-rules.yml
      scrape_configs:
        - job_name: local_scrape
          static_configs:
            - targets: ['127.0.0.1:12345']
              labels:
                cluster: 'localhost'
      remote_write:
        - url: http://localhost:9009/api/prom/push
/cc @gotjosh
The rfratto/prometheus fork is being used while memory improvements are being investigated and tested. Any changes made in the fork should eventually make their way back upstream through a Prometheus PR.
The Agent doesn't use strict YAML parsing, so invalid YAML fields are silently ignored, which makes it hard to track issues down.
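One low-risk fix would be switching the config loader to strict unmarshaling so unknown fields are rejected up front. A minimal sketch, assuming gopkg.in/yaml.v2 (already vendored through Prometheus); the Config struct and file name here are placeholders, not the Agent's real ones:

package main

import (
	"fmt"
	"io/ioutil"
	"log"

	"gopkg.in/yaml.v2"
)

// Config is a stand-in for the Agent's real config struct.
type Config struct {
	Server struct {
		LogLevel       string `yaml:"log_level"`
		HTTPListenPort int    `yaml:"http_listen_port"`
	} `yaml:"server"`
}

func main() {
	buf, err := ioutil.ReadFile("agent.yaml")
	if err != nil {
		log.Fatal(err)
	}

	var cfg Config
	// UnmarshalStrict fails on fields that don't map to the struct instead
	// of silently dropping them like Unmarshal does.
	if err := yaml.UnmarshalStrict(buf, &cfg); err != nil {
		log.Fatalf("invalid config: %v", err)
	}
	fmt.Printf("loaded config: %+v\n", cfg)
}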
It appears that the overhead of Kubernetes SD increases across the agents as:
a) the number of nodes increases
b) the number of pods increases
We noticed that a DaemonSet of 20 agents was using more CPU combined than a single instance of Prometheus that included more CPU-intensive tasks, such as TSDB, Recording Rules, and Alerts.
When the agents are running as a DaemonSet with host_filter, it would be less CPU-intensive if each agent only scraped pods on the node it's running on rather than pods across the entire cluster.
I'm not sure how/if this can be done.
As part of the agent we want to be able to bundle multiple exporters into the agent itself; node_exporter is one of them.
We'd like to understand what work is needed to embed it. There are multiple options here which are worth considering:
We can get the agent to scrape itself with whatever node_exporter exposes. This translates to just exposing the exporter metrics on /metrics and letting the agent scrape the endpoint.
We can get the contents of the node_exporter metrics and ship them through the remote-write storage.
We also need to understand what node_exporter is doing in its main file. Should the exporter's metrics be exposed on a separate /metrics endpoint? We believe Prometheus might have increased memory usage if we scrape endpoints with many metrics, plus we'd like to keep concerns separate per exporter.

I'm not sure if anything needs to change, but I noticed the two following Loki issues this morning:
grafana/loki#2080
grafana/loki#2091
The Jsonnet configs from the agent might be outdated and need to be synced up with the Prometheus Jsonnet.
Prometheus v2.18.1 came out recently; we should update for its remote_write fixes.
Per the maintenance guide, here are the grafana/prometheus branches that need to be updated to the latest release:
agent/production/kubernetes/agent.yaml (line 171 in 3cbb06b): this should be metric_relabel_configs instead, I believe?
This issue is meant to track new versions of Prometheus and decide whether they introduce new functionality that the Agent would benefit from. We will note each version as:
The current vendored version of Prometheus is v2.21.0.
It can be confusing for users to call into the config management API only to get a 404 - it is hard to tell from the client side whether the 404 is returned because the URL is wrong or because the Agent being invoked does not have the scraping service mode enabled.
All of the API endpoints should be wired up even when scraping service mode is disabled; when it is disabled, they should return HTTP 405 with a helpful error message.
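A rough sketch of what that wiring could look like, assuming the config API is mounted on a gorilla/mux router; the route prefix and error message are illustrative, not the Agent's actual values:

package api

import (
	"net/http"

	"github.com/gorilla/mux"
)

// registerConfigAPI always registers the config management routes. When the
// scraping service mode is disabled, the routes answer with HTTP 405 and an
// explanatory message instead of falling through to a generic 404.
func registerConfigAPI(r *mux.Router, scrapingServiceEnabled bool, api http.Handler) {
	if scrapingServiceEnabled {
		r.PathPrefix("/agent/api/v1/").Handler(api)
		return
	}
	r.PathPrefix("/agent/api/v1/").HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		http.Error(w,
			"the config management API is disabled; start the Agent with the scraping service mode enabled to use it",
			http.StatusMethodNotAllowed)
	})
}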
Currently, make agent defaults to building the debug binary. This is fine and makes sense, but the case for debug by default makes less sense for the Docker containers. make agent should continue to be the debug build by default, while the Dockerfile should default to the release build.
We need Jsonnet configs and better dashboards for deploying the agent. The dashboards are lower priority, but the configs are important to be able to create an install script.
We're not including the Prometheus Remote Write dashboard right now; we should add it in for the Agent.
Per the docs, it looks like one should be able to use password_file in a remote_write configuration (as in Prometheus; desirable because it plays well with K8s secrets, etc.). Best I can tell, though, a password_file value appears to be checked for / validated, but never makes it into the actual runtime configuration used? (I was led to this by 401 Unauthorized errors which I believe are attributable to this.)
More documentation is needed around the WAL appender and spawned "Prometheus" instances, namely:
It'd be nice to have a document that describes all metrics the Agent creates, similar to the one found for Loki.
For debugging, it would be useful to have an API to:
This can be built using the MetadataList method that is exposed in the scrape targets.
Depends on #83
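A rough sketch of such a debug endpoint, assuming each instance exposes its Prometheus scrape.Manager; the handler and route are hypothetical:

package api

import (
	"encoding/json"
	"net/http"

	"github.com/prometheus/prometheus/scrape"
)

// metadataHandler dumps the metric metadata known by every active scrape
// target, keyed by job name.
func metadataHandler(mgr *scrape.Manager) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		out := map[string][]scrape.MetricMetadata{}
		for job, targets := range mgr.TargetsActive() {
			for _, t := range targets {
				out[job] = append(out[job], t.MetadataList()...)
			}
		}
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(out)
	}
}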
A node_exporter integration should be added to the Agent, implementing the Integration interface.
If an Agent goes down and a config hashes to it, that config will not be loaded until the node is forgotten from the ring. Can we remove the dependency on the quorum checking for the Agent? It's not clear if we benefit from the same guarantees that Cortex needs when looking up something in the ring.
The shutdown process of the Agent is different than how Prometheus does it, since the Agent writes staleness markers on shutdown by default.
Documentation around this process should help.
For a bonus, the log messages when waiting to write staleness markers should log how much longer is left before it times out.
Can we add up{job="<job-name>", instance="<instance-id>"} into this agent? This metric is especially useful for monitoring any scrape failure. Are there any alternative metrics we can use if there are concerns about adding the up metric to the agent?
My one use case on this is to confirm which agent is scraping which set of pods. This is especially helpful to confirm sharding works.
Currently host filtering just checks if the system's hostname matches a target's __address__ or __host__ field. This won't work when the URL specified in one of those fields contains an alias to the node running the agent.
The host filtering mechanism should be improved to match on all of the following:
Then, matching a target's hostname should first be done on its original value, followed by its DNS resolved IP if the original value doesn't match.
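A minimal sketch of that matching logic; matchesHost and localAddresses are hypothetical helpers, and the target's host would be extracted from __address__ or __host__ with any port stripped first:

package hostfilter

import (
	"net"
	"os"
)

// matchesHost compares the target's host against this machine's hostname and
// interface addresses first, then falls back to comparing the target's
// DNS-resolved addresses against the machine's interface addresses.
func matchesHost(targetHost string, localAddrs map[string]struct{}) bool {
	if hostname, err := os.Hostname(); err == nil && targetHost == hostname {
		return true
	}
	if _, ok := localAddrs[targetHost]; ok {
		return true
	}
	resolved, err := net.LookupHost(targetHost)
	if err != nil {
		return false
	}
	for _, addr := range resolved {
		if _, ok := localAddrs[addr]; ok {
			return true
		}
	}
	return false
}

// localAddresses collects the IPs assigned to this machine's interfaces,
// used as the right-hand side of the comparison above.
func localAddresses() (map[string]struct{}, error) {
	addrs, err := net.InterfaceAddrs()
	if err != nil {
		return nil, err
	}
	out := make(map[string]struct{}, len(addrs))
	for _, a := range addrs {
		if ipNet, ok := a.(*net.IPNet); ok {
			out[ipNet.IP.String()] = struct{}{}
		}
	}
	return out, nil
}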
I noticed an agent with an infinitely growing WAL today. This was found in the logs:
level=warn ts=2020-04-21T21:26:44.741058582Z caller=instance.go:365 agent=prometheus instance=agent msg="could not truncate WAL" err="create checkpoint: read segments: corruption in segment /tmp/agent/data/agent/wal/00001087 at 131133: unexpected full record"
This is preventing truncation from running and required a full delete of the WAL.
When validating the config fails, the process should exit rather than continuing on as normal. See line 77 in b0016d2.
At the moment, node_exporter is controlled using a set of query parameters on the scrape URL. We'd like to avoid this and configure it using a similar structure to what we have today: YAML config.
A key named integrations on the Config struct would hold a collection of exporters; e.g., the entry for node_exporter would look like:

server:
  log_level: info
  http_listen_port: 12345

prometheus:
  global:
    scrape_interval: 5s

integrations:
  node_exporter:
    an_example_option_here:
We need a copy-and-paste installation script to automatically generate k8s configs for the agent. It should use the same scrape configs that the Tanka deployment uses, but assist the user in inserting the remote_write URL, username, and API key.
Ideally, the script should have an interactive mode as well as a flags mode:
# Flags mode
$ ./grafana_cloud_deploy.sh -l $REMOTE_WRITE_URL -u $AUTH_USER -p $AUTH_PASSWORD | kubectl apply
# Interactive mode
$ ./grafana_cloud_deploy.sh | kubectl apply
Enter your remote write URL: https://example.com/api/prom/push
Enter your remote write username: 12345
Enter your remote write password: s3cur3p455w0rd
The flags mode makes it easy for tooling to generate a copy and paste set of commands (set up environment variables, run script) while the interactive mode is provided for convenience: users can just copy and paste each section and not have to modify anything before running it in the terminal.
Looks like the filesystem config regexes aren't working for Linux
As part of the project to embed exporters within the Agent, we need a generalized system that can run an "integration." An integration will initially be defined as:
a) something that exposes metrics over HTTP (e.g., at /integrations/<integration>/metrics)
b) the ability to Start and Stop an integration

Integrations will have to implement an interface to expose these functionalities, roughly something like the following:
type Integration interface {
	// Name returns the name of the integration. Must be unique.
	Name() string

	// RegisterRoutes should register HTTP handlers for the integration.
	RegisterRoutes(r *mux.Router) error

	// MetricsEndpoints should return the endpoint(s) for the integration
	// that expose Prometheus metrics.
	MetricsEndpoints() []string

	// Run should start the integration and block until it fails or ctx is canceled.
	Run(ctx context.Context) error
}
The Agent must run each integration and make sure that they stay alive - if an integration exits unexpectedly, it should be restarted.
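A minimal sketch of that supervision loop, assuming the Integration interface above; the package name and backoff duration are arbitrary:

package integrations

import (
	"context"
	"log"
	"time"
)

// runIntegration runs an integration and restarts it with a short backoff if
// it exits unexpectedly, stopping only when the context is canceled.
func runIntegration(ctx context.Context, i Integration) {
	for {
		err := i.Run(ctx)
		if ctx.Err() != nil {
			// The Agent is shutting down; stop supervising.
			return
		}
		log.Printf("integration %s exited unexpectedly: %v; restarting", i.Name(), err)
		select {
		case <-time.After(5 * time.Second):
		case <-ctx.Done():
			return
		}
	}
}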
The Agent should also create a special non-configurable Prometheus Instance that doesn't run any SD and collects its targets from the running integrations. This special Prometheus instance will be dedicated to scraping metrics from integrations.
There should be one "sample" integration added as part of the initial implementation: an agent integration, where the Agent collects its own metrics. A second integration, an embedded node_exporter, should eventually be added, but is out of scope for this issue.
Integrations should be placed in the Agent config file under its own dedicated "namespace":
server:
  <server_config>

prometheus:
  <prometheus_config>

integrations:
  # Settings for the "agent" integration
  agent:
    enabled: true

  # Directory to store the WAL for integration samples
  wal_dir: <string>

  # All integrations will remote write to these endpoints
  prometheus_remote_write:
    - <remote_write_config>
The implementation for this feature should be split across multiple PRs.
Should do roughly the same thing that Loki does when a tag is made (e.g., v1.2.3). At least for the first release, just build prebuilt binaries for the following platforms:
Health and Readiness handlers should be added to the agent. This should be pretty easy to do; Prometheus' existing handlers don't do anything but return HTTP 200.
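For illustration, a sketch of what those handlers might look like; the route paths follow Prometheus' /-/healthy and /-/ready convention, and whether the Agent uses the same paths is an open choice:

package api

import "net/http"

// registerHealthEndpoints wires up health and readiness handlers that, like
// Prometheus' own, do nothing but return HTTP 200.
func registerHealthEndpoints(mux *http.ServeMux) {
	mux.HandleFunc("/-/healthy", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte("Agent is Healthy.\n"))
	})
	mux.HandleFunc("/-/ready", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte("Agent is Ready.\n"))
	})
}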
Today, users using integrations alongside normal scrape configs have to configure remote_write twice: once for the Prometheus instance config and once for the integrations config.
It'd be nice to have a global remote write section within the prometheus block that affects all instance configs and integration configs, removing the need to put the same remote_write config twice.
In lieu of a Helm chart, it would be useful to provide rendered YAML in the production folder. The rendered YAML may even be useful for creating the deploy script: it could download the YAML, replace environment variables, and print out the result.
When running the agent with a bare config that includes a scrape_config but omits scrape_interval, I see a crash due to the zero scrape interval:
panic: runtime error: integer divide by zero
goroutine 291 [running]:
github.com/prometheus/prometheus/scrape.(*Target).offset(0xc000409900, 0x0, 0xeab27cf2a6289c0a, 0x0)
/home/rob/grafana/agent/vendor/github.com/prometheus/prometheus/scrape/target.go:161 +0x12d
github.com/prometheus/prometheus/scrape.(*scrapeLoop).run(0xc000409a40, 0x0, 0x0, 0x0)
/home/rob/grafana/agent/vendor/github.com/prometheus/prometheus/scrape/scrape.go:914 +0x77
created by github.com/prometheus/prometheus/scrape.(*scrapePool).sync
/home/rob/grafana/agent/vendor/github.com/prometheus/prometheus/scrape/scrape.go:423 +0x6ef
For reference, the config is
server:
  http_listen_address: localhost
  http_listen_port: 9898

prometheus:
  wal_directory: ./wal
  configs:
    - name: myconfig
      remote_write:
        - url: https://prometheus-us-central1.grafana.net/api/prom/push
          basic_auth:
            username: xxx
            password: yyy
      scrape_configs:
        - job_name: 'node'
          static_configs:
            - targets: ['localhost:9100']
(I'll submit a PR with a tentative fix; thought I'd file this separately anyway for documentation purposes.)
The project needs documentation. Loki's docs are a good starting point (although I'm biased since I wrote them), but in general we need to document:
Prometheus stores basic_auth credentials as a Secret type, which implements yaml.Marshaler and always forces the string to be <secret>. This breaks the configuration storage.
The yamlCodec in pkg/prometheus/ha should wrap the instance.Config type and store secret values separately. The following are secrets:
Note that, for the same reason, hashing configs to detect whether a config changed is broken; the value being hashed is not the secret but rather the string <secret>. To fix this, the secret values above should be added separately to the hashing function if they are non-nil and non-empty.
The release script is hard coded with a release version, meaning we have to manually update it every release. It's unlikely users will want to install a release that's not the latest, so we should modify the script to find the latest tag and use that.
Related to the discussion from #35.
It would be useful to uniquely identify incoming requests from grafana/agent within the remote_write API implementation. One way to do this would be to set the User-Agent header to identify it as the Agent.
Currently, Prometheus hard-codes the User-Agent header; a PR is needed upstream to allow for overriding this specific header or a mechanism to arbitrarily change headers (e.g., passing a custom http.Client).
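For reference, the custom http.Client route would look roughly like this; the User-Agent string and helper names here are made up for illustration:

package useragent

import "net/http"

// userAgentTransport wraps a RoundTripper and overwrites the User-Agent
// header on every outgoing remote_write request.
type userAgentTransport struct {
	next http.RoundTripper
	ua   string
}

func (t *userAgentTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	// Clone the request before mutating it, per the RoundTripper contract.
	r := req.Clone(req.Context())
	r.Header.Set("User-Agent", t.ua)
	return t.next.RoundTrip(r)
}

// NewClient returns an http.Client whose requests identify themselves as the
// Agent.
func NewClient(version string) *http.Client {
	return &http.Client{
		Transport: &userAgentTransport{
			next: http.DefaultTransport,
			ua:   "GrafanaAgent/" + version,
		},
	}
}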
#60 introduced a command line tool that allows users to sync a directory of config files to the Agent's config management server. It requires that all files are valid instance config YAMLs before running. If one of the files isn't valid YAML, nothing will be uploaded.
For users that want to see if all config files in a directory are valid, it would be useful to have a dry-run flag that stops after validating the directory and doesn't upload anything.
agent/production/tanka/grafana-agent/config.libsonnet, lines 363 to 370 in 8a24eec:
Using kubernetes_sd_configs with host_filter: true only works if the role for SD is set to node or pod. If it's anything else, targets won't have any of the labels the host filterer depends on for filtering:

- __address__ won't be set to the node's IP
- __meta_kubernetes_pod_node_name won't exist
- __meta_kubernetes_node_name won't exist

And, as such, all targets will be dropped. This is preventing all Kubernetes API metrics from being scraped. We're still seeing some apiserver metrics, but based on the host filtering rules, the scrape job for role: endpoints should never work fully.
While you could set up a second config with host_filter: false in the Agent, this will cause a problem for the DaemonSet deployment, as all agents across all Kubernetes nodes will scrape the same target and run into out-of-order timestamp errors. host_filter: false as a DaemonSet would only work if Cortex clients could configure how HA deduplication should be applied, which they currently can't.
The other solution, although ugly, is to run a second Agent deployment with one replica dedicated to scraping metrics from targets where host_filter: true does not work.
This second deployment should be added to the Agent Tanka configs, since this issue affects all users of the Agent using Tanka and the YAML generated from it.
The Agent is missing some validation checks that Prometheus has:
Check the links to see what Prometheus does.
The instance configs don't do validation checks on unmarshaling like Prometheus does; rather, validation is broken up into ApplyDefaults and Validate. However, ApplyDefaults and Validate aren't being called from the instance config API; this should also be fixed.
Implementing this makes the agent behave a little differently than Prometheus, but it makes sense for the agent: a node running the agent may be removed, and in that case we'd need to write the staleness markers.
This requires shutting down the scrape manager first so no new samples get appended. Then, for each instance that is stopping, it should append a staleness marker to all active series. The instance should then wait until the staleness markers were written by the remote storage, then the instance can stop.
After all instances stop, the remote storage can stop.
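A minimal sketch of the staleness-marker step, using a locally defined appender interface so it doesn't pin a specific Prometheus storage.Appender signature; writeStalenessMarkers and its arguments are hypothetical:

package instance

import (
	"time"

	"github.com/prometheus/prometheus/pkg/labels"
	"github.com/prometheus/prometheus/pkg/value"
)

// appender is a minimal stand-in for the storage.Appender methods used here.
type appender interface {
	Add(l labels.Labels, t int64, v float64) error
	Commit() error
}

// writeStalenessMarkers appends a StaleNaN sample for every active series so
// downstream storage marks them stale instead of waiting for them to age out.
func writeStalenessMarkers(app appender, activeSeries []labels.Labels) error {
	ts := time.Now().UnixNano() / int64(time.Millisecond)
	for _, ls := range activeSeries {
		if err := app.Add(ls, ts, value.StaleNaN); err != nil {
			return err
		}
	}
	return app.Commit()
}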
Memory usage of the agent for ~30K series is around 350 MB of in-use heap memory. We should try as much as we can to investigate where that memory usage is coming from and what can be done to improve it. This will likely have to be done in a temporary fork of Prometheus before the memory improvements can be reintroduced upstream.
TODO
[Image: memory usage comparing Cortex, the agent, and Prometheus, where both the agent and Prometheus are scraping the same sources.]
It's easy to make changes to the Agent and forget to run some things before opening a PR:
go mod tidy
go mod vendor
make example-dashboards (produces example dashboards used in the Docker-Compose example)
make example-kubernetes (produces Kubernetes manifest used for installing the Agent)

It would be nice if GitHub Actions ran these commands against a PR and failed if any of them cause git diffs.