grafana / cortex-tools
If you're using this tool with Grafana Mimir, please switch to "mimirtool" instead: https://github.com/grafana/mimir
License: Apache License 2.0
It involves a number of manual steps, which can be seen here: #51
Queries that have a matcher would still work, as they don't use the GetReadQueriesForMetricName function.
The go.mod file specifies cortextool as the name, while the repo is actually named cortex-tools.
While this is not a big problem, it is confusing when naively trying to import code from this repo:
go: github.com/sh0rez/gctl/pkg/spec imports
github.com/grafana/cortex-tools/pkg/client: github.com/grafana/[email protected]: parsing go.mod:
module declares its path as: github.com/grafana/cortextool
but was required as: github.com/grafana/cortex-tools
To not actually break go imports, the "clean" solution would probably be to rename this repo to grafana/cortextool
Using an address with a trailing slash can cause unexpected behaviour.
e.g.
$ CORTEX_ADDRESS=https://prometheus-us-central1.grafana.net/ cortextool rules load test.yml
ERRO[0000] unable to load rule group error="requested resource not found" group=up_job namespace=example_namespace
cortextool: error: load operation unsuccessful, try --help
In some cases the trailing slash makes no difference, e.g. when using the diff command:
$ CORTEX_ADDRESS=https://prometheus-us-central1.grafana.net/ cortextool rules diff --rule-files=test.yml
Changes are indicated with the following symbols:
+ updated
The following changes will be made if the provided rule set is synced:
~ Namespace: example_namespace
~ Group: up_job
Diff Summary: 0 Groups Created, 1 Groups Updated, 0 Groups Deleted
I think this is because the diff command does not make use of any endpoints where the trailing slash matters (e.g. subroutes on the API).
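One way to make the behaviour consistent would be to normalise the address once, before any API path is appended to it. The following is only a sketch: normalizeAddress is a hypothetical helper, not a function that exists in cortextool, and the path it is joined with is taken from the debug logs elsewhere on this page.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeAddress is a hypothetical helper: it strips trailing slashes so
// that joining the address with an API path never produces "//".
func normalizeAddress(addr string) string {
	return strings.TrimRight(addr, "/")
}

func main() {
	addr := normalizeAddress("https://prometheus-us-central1.grafana.net/")
	// The same URL is built whether or not the user typed a trailing slash.
	fmt.Println(addr + "/api/prom/rules")
}
```

With this in place, load and diff would see the same base address regardless of how CORTEX_ADDRESS was written.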
It would be nice if cortex-tools gained a block upload feature that enables posting a block to block storage, e.g. after a Cortex downtime.
It would be helpful if the output format of cortextool rules get <namespace> <group> matched the exact format expected by cortextool rules sync. This would make it easier to get down all rules and commit them into Git, then sync them back again. Right now I have to munge the YAML slightly to make it compatible with sync.
Right now the format is this:
$ cortextool rules get somenamespace anygroup
name: anygroup
rules:
  - alert: FrontEnd Prometheus
    expr: .......
Ideally it would be this:
namespace: somenamespace
groups:
  - name: anygroup
    rules:
      - alert: FrontEnd Prometheus
        expr: ........
You can have your Ruler and Alertmanager at separate URLs. As a result, it becomes tedious to switch between commands. We should make it more explicit that these are two separate components.
Hi!
I've been running Cortex on k8s for a while and I've protected the API with client TLS authentication with the help of the ingress-nginx controller.
Right now I want to use cortex-tools to lint and load rules in an automated fashion from a CI pipeline. Thus I would like to authenticate the HTTP client with a TLS client certificate.
I saw you're using the Go HTTP client, so it shouldn't be hard to add TLS certs to the CortexClient struct:
cortex-tools/pkg/client/client.go
Line 31 in 94327d5
It would be something like https://gist.github.com/michaljemala/d6f4e01c4834bf47a9c4
The cli flags would look like:
cortextool rules load my-rule.yml --address=ADDRESS --id=ID --cacert ca.pem --key client.key --cert client.pem
and also adding environment variables:
CORTEX_TLS_CA_CERT
CORTEX_TLS_CLIENT_KEY
CORTEX_TLS_CLIENT_CERT
I'm wondering if you consider tls client auth useful for the project. In that case I'm willing to send a PR.
Thanks!
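A rough sketch of what the request above could look like with only the Go standard library follows. The function name newTLSClient is hypothetical, and the environment variable names are the ones proposed in this issue, not necessarily the ones the tool ended up using.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"net/http"
	"os"
)

// newTLSClient is a hypothetical constructor. It returns an HTTP client that
// presents a client certificate (if certPath/keyPath are set) and verifies
// the server against a custom CA (if caPath is set).
func newTLSClient(caPath, certPath, keyPath string) (*http.Client, error) {
	cfg := &tls.Config{}

	if certPath != "" || keyPath != "" {
		cert, err := tls.LoadX509KeyPair(certPath, keyPath)
		if err != nil {
			return nil, fmt.Errorf("loading client key pair: %w", err)
		}
		cfg.Certificates = []tls.Certificate{cert}
	}

	if caPath != "" {
		pem, err := os.ReadFile(caPath)
		if err != nil {
			return nil, fmt.Errorf("reading CA certificate: %w", err)
		}
		pool := x509.NewCertPool()
		if !pool.AppendCertsFromPEM(pem) {
			return nil, errors.New("no certificates found in CA file")
		}
		cfg.RootCAs = pool
	}

	return &http.Client{Transport: &http.Transport{TLSClientConfig: cfg}}, nil
}

func main() {
	// Environment variable names as proposed in the issue text.
	client, err := newTLSClient(
		os.Getenv("CORTEX_TLS_CA_CERT"),
		os.Getenv("CORTEX_TLS_CLIENT_CERT"),
		os.Getenv("CORTEX_TLS_CLIENT_KEY"),
	)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("client ready: %T\n", client)
}
```

Leaving all three values empty yields a client with the default TLS settings, so the feature would stay opt-in.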
Running cortextool rules list when no rules are configured returns the following error message:
$ cortextool rules list
time="2019-12-17T12:49:29-05:00" level=fatal msg="unable to read rules from cortex, requested resource not found"
This initially led me to believe there was an issue with my CORTEX_ADDRESS value. Should it return an empty list / fail silently instead?
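The "empty list" behaviour suggested above could be sketched as follows. This is an illustration, not the tool's code: listRules and fakeRuler are hypothetical names, the /api/prom/rules path comes from the debug logs on this page, and treating 404 as "no rules yet" is an assumption (a wrong address can also answer 404, so a hint in the message may still be worthwhile).

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// listRules is a hypothetical client call. It assumes that a 404 from the
// rules endpoint means "this tenant has no rules yet" and reports that as an
// empty result instead of an error.
func listRules(baseURL string) (string, error) {
	resp, err := http.Get(baseURL + "/api/prom/rules")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case http.StatusOK:
		body, err := io.ReadAll(resp.Body)
		return string(body), err
	case http.StatusNotFound:
		return "", nil
	default:
		return "", fmt.Errorf("unexpected response status %s", resp.Status)
	}
}

// fakeRuler starts a local test server that answers every request with the
// given status code, standing in for a real Cortex ruler.
func fakeRuler(status int) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(status)
	}))
}

func main() {
	srv := fakeRuler(http.StatusNotFound)
	defer srv.Close()

	rules, err := listRules(srv.URL)
	fmt.Printf("rules=%q err=%v\n", rules, err)
}
```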
usage: cortextool rules diff --address=ADDRESS --id=ID [<flags>]
diff a set of rules to a designated cortex endpoint
Flags:
--help Show context-sensitive help (also try --help-long and --help-man).
--log.level="info" set level of the logger
--push-gateway.endpoint=PUSH-GATEWAY.ENDPOINT
url for the push-gateway to register metrics
--push-gateway.job=PUSH-GATEWAY.JOB
job name to register metrics
--push-gateway.interval=1m interval to forward metrics to the push gateway
--key="" Api key to use when contacting cortex, alternatively set $CORTEX_API_KEY.
--backend=cortex Backend type to interact with: <cortex|loki>
--address=ADDRESS Address of the cortex cluster, alternatively set CORTEX_ADDRESS.
--id=ID Cortex tenant id, alternatively set CORTEX_TENANT_ID.
--tls-ca-path="" TLS CA certificate to verify cortex API as part of mTLS, alternatively set CORTEX_TLS_CA_PATH.
--tls-cert-path="" TLS client certificate to authenticate with cortex API as part of mTLS, alternatively set CORTEX_TLS_CERT_PATH.
--tls-key-path="" TLS client certificate private key to authenticate with cortex API as part of mTLS, alternatively set CORTEX_TLS_KEY_PATH.
--ignored-namespaces=IGNORED-NAMESPACES
comma-separated list of namespaces to ignore during a diff.
--rule-files=RULE-FILES The rule files to check. Flag can be reused to load multiple files.
--rule-dirs=RULE-DIRS Comma separated list of paths to directories containing rules yaml files. Each file in a directory with a .yml or .yaml
suffix will be parsed.
--disable-color disable colored output
Currently, as you can see above, the diff command accepts a list of namespaces to ignore, which makes it very hard to diff any one particular namespace.
I'd like to suggest adding an accept list of namespaces as well, and making the two flags mutually exclusive, to make this easier.
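The suggestion above could be sketched like this. namespaceFilter and its parameters are hypothetical names chosen for illustration; no such function or accept-list flag is confirmed by this page.

```go
package main

import (
	"errors"
	"fmt"
)

// namespaceFilter is a hypothetical helper. Given either an accept list or an
// ignore list (never both), it returns a predicate that reports whether a
// namespace should take part in the diff.
func namespaceFilter(allowed, ignored []string) (func(string) bool, error) {
	if len(allowed) > 0 && len(ignored) > 0 {
		return nil, errors.New("the accept list and the ignore list are mutually exclusive")
	}

	set := make(map[string]struct{})
	for _, ns := range allowed {
		set[ns] = struct{}{}
	}
	for _, ns := range ignored {
		set[ns] = struct{}{}
	}

	acceptMode := len(allowed) > 0
	return func(ns string) bool {
		_, listed := set[ns]
		if acceptMode {
			return listed // keep only listed namespaces
		}
		return !listed // keep everything except listed namespaces
	}, nil
}

func main() {
	keep, err := namespaceFilter([]string{"example_namespace"}, nil)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(keep("example_namespace"), keep("other")) // true false
}
```

With both lists empty the predicate keeps every namespace, which matches the current default.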
When running cortextool version on a host that can't reach GitHub, the program crashes.
± .cortextool version
version 0.3.2
checking latest version... panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1e491ce]
goroutine 1 [running]:
github.com/grafana/cortex-tools/pkg/version.getLatestFromGitHub(0xc0013c2d00, 0x1a)
/build/source/pkg/version/version.go:40 +0x10e
github.com/grafana/cortex-tools/pkg/version.CheckLatest()
/build/source/pkg/version/version.go:21 +0x49
main.main.func1(0xc00152c510, 0x40c5a3, 0x20d0500)
/build/source/cmd/cortextool/main.go:33 +0x98
gopkg.in/alecthomas/kingpin%2ev2.(*actionMixin).applyActions(0xc0002cbd58, 0xc00152c510, 0x0, 0x0)
/build/go/pkg/mod/gopkg.in/alecthomas/[email protected]/actions.go:28 +0x6d
gopkg.in/alecthomas/kingpin%2ev2.(*Application).applyActions(0xc00120c0f0, 0xc00152c510, 0x0, 0x0)
/build/go/pkg/mod/gopkg.in/alecthomas/[email protected]/app.go:557 +0xdc
gopkg.in/alecthomas/kingpin%2ev2.(*Application).execute(0xc00120c0f0, 0xc00152c510, 0xc00103afe0, 0x1, 0x1, 0x0, 0x0, 0x0, 0xc001747f08)
/build/go/pkg/mod/gopkg.in/alecthomas/[email protected]/app.go:390 +0x8f
gopkg.in/alecthomas/kingpin%2ev2.(*Application).Parse(0xc00120c0f0, 0xc00000e090, 0x1, 0x1, 0x1, 0xc000b8ac38, 0x0, 0x1)
/build/go/pkg/mod/gopkg.in/alecthomas/[email protected]/app.go:222 +0x1fe
main.main()
/build/source/cmd/cortextool/main.go:38 +0x1bf
[1] 142838 exit 2 cortextool version
Dereference at issue: https://github.com/grafana/cortex-tools/blob/v0.3.2/pkg/version/version.go#L40
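A panic of this kind typically comes from reading a response value after the request itself has failed. The code at version.go:40 is not reproduced here; the sketch below only illustrates the general pattern of checking the error before touching the response. fetchLatestTag, the latestRelease type and its tag_name field are assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// latestRelease holds the single field this sketch reads from a release
// document. The field name is an assumption for illustration.
type latestRelease struct {
	TagName string `json:"tag_name"`
}

// fetchLatestTag is a hypothetical version check. It returns an error instead
// of panicking when the host cannot be reached.
func fetchLatestTag(url string) (string, error) {
	client := &http.Client{Timeout: 2 * time.Second}

	resp, err := client.Get(url)
	if err != nil {
		// resp is nil here; returning early avoids dereferencing it.
		return "", fmt.Errorf("checking latest version: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("checking latest version: unexpected status %s", resp.Status)
	}

	var rel latestRelease
	if err := json.NewDecoder(resp.Body).Decode(&rel); err != nil {
		return "", fmt.Errorf("decoding release: %w", err)
	}
	return rel.TagName, nil
}

func main() {
	// 127.0.0.1:1 stands in for an unreachable host.
	tag, err := fetchLatestTag("http://127.0.0.1:1/")
	if err != nil {
		fmt.Println("could not check latest version:", err)
		return
	}
	fmt.Println("latest version:", tag)
}
```

The version command could then print the local version and a short warning, rather than crashing, when the lookup fails.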
I don't think we need three (and maybe more?) images, we could pack all the binaries in a single image making the release process simpler.
With #47 we introduced a breaking change that wouldn't allow us to build docker images - it'll be good to have an image building process as part of the pipeline to catch these a bit earlier.
#131 shows an example where GetRuleGroup would escape the space in escaped namespace and buildRequest would re-escape it, resulting in the failed test result of %2520 from %20.
At the moment we only have cortextool, but it would be good to include the other binaries as well.
I tried to delete a rule group that includes a % in the name. I got an error with the following reasoning:
invalid URL escape "% n"
Command line inputs need to be sanitized before being used in a URL.
At the moment, we're using both go-yaml.v2 and go-yaml.v3. It would be good to standardize on one of them and avoid any potential pitfalls of having two versions.
We don't have a version command.
It would be useful to know which server and tenant we're syncing or diffing rules against, given that these can be configured through environment variables as well as command-line arguments.
Right now, when you use the diff command it will only tell you which groups are going to be changed, but does not tell you which individual rules/alerts are changing.
It would be good to have an idea of what exactly is changing when you run this or the sync command.
The cortex-tools project includes a client for the cortex-ruler. This, in itself, seems pretty simple. However, including the client into a simple golang app causes that app to go from 12->72Mb, importing Cortex, AWS CLI, and lots of other things.
Is it possible to simplify the client such that it doesn't increase binary size so much?
Recently, github.com/prometheus/[email protected] introduced the ability to define durations using mixed units, e.g. 1h30m. This change is not backwards compatible, which creates issues for this project.
This change has been in the cortexproject/cortex master branch for a while, which some extremely notable users (Grafana Cloud!) are using. These changes are also in this project's master branch, but are unreleased. This means that there is no released version of cortextool that can properly interact with Prometheus rules stored in these Cortex instances.
All this requires is a new release of this project! Please cut a new release!
Running cortextool alertmanager get when no rules are configured returns the following error message:
$ cortextool alertmanager get
cortextool.exe: error: requested resource not found, try --help
This initially led me to believe there was an issue with my CORTEX_ADDRESS value. Should it return an empty list / fail silently instead?
The alertmanager subcommands use CORTEX_TENANT_ID as the id parameter while the rules subcommands use CORTEX_TENTANT_ID.
Right now we keep a central changelog for all the binaries. Consider giving each binary its own separate changelog.
$ kubectl -n cortex port-forward deploy/ruler 8080:80
A rule file with a multi-line expr field, with a line break at the beginning:
$ cat >> test.yml << EOF
groups:
  - name: rule-group-name
    rules:
      - alert: alert-name
        expr: |

          up{jop="my-awesome-job"} == 0
EOF
$ cortextool rules load --address "http://localhost:8080" --id=0 --log.level="debug" test.yml
INFO[0000] log level set to debug
DEBU[0000] path built to request rule group url=/api/prom/rules/test/rule-group-name
DEBU[0000] sending request to cortex api method=GET url="http://localhost:8080/api/prom/rules/test/rule-group-name"
DEBU[0000] checking response status="404 Not Found"
DEBU[0000] resource not found fields.msg="request failed with response body group does not exist\n" status="404 Not Found"
DEBU[0000] sending request to cortex api method=POST url="http://localhost:8080/api/prom/rules/test"
DEBU[0000] checking response status="400 Bad Request"
ERRO[0000] requests failed fields.msg="request failed with response body unable to decoded rule group\n" status="400 Bad Request"
ERRO[0000] unable to load rule group error="failed request to the cortex api" group=rule-group-name namespace=test
cortextool: error: load operation unsuccessful, try --help
cortex ruler version 1.4.0
cortex-tools compiled with current master HEAD 432ad77
http request dump
POST /api/prom/rules/test HTTP/1.1
Host: localhost:8080
User-Agent: Go-http-client/1.1
Content-Length: 107
X-Scope-Orgid: 0
Accept-Encoding: gzip
name: rule-group-name
rules:
- alert: alert-name
expr: |4
up{jop="my-awesome-job"} == 0
As you can see, the body content is not valid YAML.
cortex-tools should ensure it is sending valid YAML before the request reaches the Cortex API.
I'm willing to help fix it with some guidance.
Thanks!
In cases where the metrics reported by the cortextool and Cortex diverge, tracing would be useful.
I am a Cortex administrator. At the moment, it is unclear which tenants create or delete rule groups.
Is there an API that can export the rule groups of all users?
I want to record all rule groups in a MySQL database and sync them to Cortex periodically.
At the moment, it is unclear what the exact difference between both commands is. From a quick peek at the code, it seems like load is more of an "only upload if the rule group does not exist under that namespace" while sync is more of a "replace everything but tell me about it".
It'd be good to make clear how we support each of the following use cases:
Often when preparing rules (rules prepare) you would like to have a clear idea of what changed in the diff. Given there's no homogeneous way of formatting rules YAML (e.g. a Prometheus rules linter), a side effect of marshalling/unmarshalling rule files is that your expressions and the file itself end up being reformatted by either the PromQL parser or the Go YAML library.
This makes it difficult to have a consistent diff.
Given there's a promfmt for rules in the works, let's do something simple as an intermediary step: take the file(s), unmarshal them into our struct, marshal them, and format the PromQL expressions in the rules file. With this, our users can "lint" their files before running them through the prepare command and have a more consistent diff of what changed.
Given bug.yaml:
namespace: bug
groups:
  - name: bug
    rules:
      - alert: AlwaysFire
        expr: vector(1)
cortextool rules lint --backend=loki bug.yaml
gives:
ERRO[0000] unable parse rules file error="could not parse expression: parse error at line 1, col 1: syntax error: unexpected IDENTIFIER" file=bug.yaml
cortextool: error: prepare operation unsuccessful, unable to parse rules files: file read error, try --help
There's nothing obviously wrong at line 1, col 1. I have an invalid LogQL expression and cortextool should tell me that directly.
cortextool 0.3.2
I want to use the jsonnet framework to template alerts. Does cortextool work with the JSON format?
This is for alert creation and configuration as well.
It seems that cortextool sync/cortextool load isn't working in my environment. We're running Cortex v1.5.0 and cortextool v0.5.0.
I've created a new rule group in a new namespace using the config below:
namespace: collector-rules
groups:
  - name: collector-status
    rules:
      - record: ""
        alert: PrometheuServerIsDown
        expr: absent(up)
        for: 10m
        labels:
          severity: critical
        annotations:
          assignment_group: Site Reliability Engineering
          company: REDACTED
          description: Cortex has not received any metrics from the n4monitoring tenant for 10 minutes
          impact: "1"
          suggested_actions: Check if Prometheus is running in the REDACTED namespace
          summary: 'Cortex has not received metrics for REDACTED for 10minutes'
          urgency: "1"
The namespace collector-rules does not exist. cortextool rules load throws an error:
ryan@WINDOWS-H8Q4C40:~/rw170/Documents/cortex-config$ cortextool rules load n4monitoring/rulegroups/collector.yml
ERRO[0000] unable to load rule group error="requested resource not found" group=collector-status namespace=collector-rules
cortextool: error: load operation unsuccessful, try --help
cortextool rules sync --rule-dirs=<dir> also throws an error:
ryan@WINDOWS-H8Q4C40:~/rw170/Documents/cortex-config$ cortextool rules sync --rule-dirs=n4monitoring/rulegroups
INFO[0000] creating group group=collector-status namespace=collector-rules
cortextool: error: sync operation unsuccessful, unable to complete executing changes.: requested resource not found, try --help
I've got a feeling it's potentially related to our ingress rules, but I can't spot anything. Here is the ingress:
spec:
  rules:
    - host: REDACTED
      http:
        paths:
          - backend:
              serviceName: alertmanager
              servicePort: 80
            path: /multitenant_alertmanager/status
          - backend:
              serviceName: alertmanager
              servicePort: 80
            path: /alertmanager
          - backend:
              serviceName: alertmanager
              servicePort: 80
            path: /api/v1/alerts
          - backend:
              serviceName: ruler
              servicePort: 80
            path: /ruler/ring
          - backend:
              serviceName: ruler
              servicePort: 80
            path: /api/v1/rules
          - backend:
              serviceName: ruler
              servicePort: 80
            path: /api/prom/api/v1/alerts
          - backend:
              serviceName: ruler
              servicePort: 80
            path: /api/prom/rules
          - backend:
              serviceName: distributor
              servicePort: 80
            path: /distributor/all_user_stats
Loading the following namespace/file into Cortex:
groups:
  - name: my_group
    rules:
      - record: value
        expr: vector(0)
        labels:
          val: '0'
And then immediately diffing with cortextool rules diff will produce a report that the group my_group will be updated.
Changes are indicated with the following symbols:
+ updated
The following changes will be made if the provided rule set is synced:
~ Namespace: my_namespace
~ Group: my_group
Diff Summary: 0 Groups Created, 1 Groups Updated, 0 Groups Deleted
This is because the rules file is unmarshalled to a Prometheus Rule struct with Annotations map[string]string = nil while Cortex assigns the same field to an empty map. This leads the deep equality check to report a difference.
Users can work around this bug by adding an empty annotations map to their rule files. But this is a little counterintuitive given that the documentation doesn't show annotations as a valid field for recording rules. It might be better for cortextool to fill this field in itself if it is going to rely on reflect.DeepEqual.
When running Cortex locally I tried loading/diffing a local rules file against the rules endpoint multiple times. Since rulefmt started using yaml.v3 the RuleNode struct contains extra information about the formatting of the underlying yaml file. With Cortex the yaml returned from the API will not have the same formatting which can lead to diffs when none exist:
INFO[0000] updating group difference="rule #0 does not match {{8 0 !!str sum_up <nil> [] 3 15} {0 0 <nil> [] 0 0} {8 0 !!str sum(up) <nil> [] 4 13} 0s map[] map[]} != {{8 0 !!str sum_up <nil> [] 5 13} {0 0 <nil> [] 0 0} {8 0 !!str sum(up) <nil> [] 4 11} 0s map[] map[]}" group=test_rules namespace=rules
The differences in the above string are due to the yaml column and row and not the rules themselves.
Many tools do not handle shell colors very well. In my case, I am trying to parse the YAML output using https://github.com/kislyuk/yq, but it fails due to the colorized input. As a workaround, I have to pipe the output through sed -e $'s/\x1b\[[0-9;]*m//g', which frankly I barely understand.
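For reference, the sed expression above matches an escape character (\x1b), a "[", any run of digits and semicolons, and a closing "m", which is the shape of a terminal color sequence, and replaces each match with nothing. The same pattern in Go looks like this; stripColor is a hypothetical helper name and the sample string is made up for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

// ansiColor matches terminal color sequences: ESC, "[", digits and
// semicolons, then "m". It is the same pattern as the sed workaround.
var ansiColor = regexp.MustCompile(`\x1b\[[0-9;]*m`)

// stripColor is a hypothetical helper that removes color sequences and leaves
// the rest of the text untouched.
func stripColor(s string) string {
	return ansiColor.ReplaceAllString(s, "")
}

func main() {
	colored := "\x1b[32mname: anygroup\x1b[0m"
	fmt.Println(stripColor(colored)) // name: anygroup
}
```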
This project currently pulls in version 0.13 of alertmanager. The current version is 0.20.
The result of this is that there are many valid alertmanager configurations that do not work. For example, the image_url and actions fields in the Slack configuration.
This should be updated.
When you have no rules loaded yet and try to run cortextool rules list, the output you'll get is the following:
$ cortextool rules list
FATA[0000] unable to read rules from cortex, requested resource not found
This is a bit deceiving given that we were able to load the rules; there just aren't any yet.
I loaded this:
namespace: test
groups:
- name: default
rules:
- alert: AlwaysFiring
record: ""
for: 0s
expr: 1 == bool 1
- record: "agent:custom_server_info:up"
alert: ""
expr: |2
custom_server_info * 0
unless on (agent_hostname)
up{job="integrations/agent"}
or on (agent_hostname)
custom_server_info
Since then, I can't print the rules anymore.
FATA[0000] unable to read rules from cortex, yaml: line 10: did not find expected key
I can query the API and get the expected YAML.
curl -u $CORTEX_USER:$CORTEX_KEY "$CORTEX_URL/api/v1/rules"
bug:
- name: default
rules:
- record: test:scalar:bug
expr: vector(1)
test:
- name: default
rules:
- alert: AlwaysFiring
expr: 1 == bool 1
- record: agent:custom_server_info:up
expr: |4
custom_server_info * 0
unless on (agent_hostname)
up{job="integrations/agent"}
or on (agent_hostname)
custom_server_info
To try to isolate the bug, I deleted all the rules and I tried loading this one:
namespace: bug
groups:
- name: test
rules:
- record: "test:scalar:bug"
expr: |2
vector(1)
or
vector(2)
cortextool rules load rules-bug.yml \
--address=$CORTEX_URL \
--id=$CORTEX_USER \
--key=$CORTEX_KEY \
--log.level=debug
INFO[0000] log level set to debug
DEBU[0000] path built to request rule group url=/api/prom/rules/bug/test
DEBU[0000] sending request to cortex api method=GET url="https://prometheus-us-central1.grafana.net/api/prom/rules/bug/test"
DEBU[0000] checking response status="404 Not Found"
DEBU[0000] resource not found fields.msg="request failed with response body group does not exist\n" status="404 Not Found"
DEBU[0000] sending request to cortex api method=POST url="https://prometheus-us-central1.grafana.net/api/prom/rules/bug"
DEBU[0000] checking response status="400 Bad Request"
ERRO[0000] requests failed fields.msg="request failed with response body unable to decoded rule group\n" status="400 Bad Request"
ERRO[0000] unable to load rule group error="failed request to the cortex api" group=test namespace=bug
I was able to load this rule group using curl.
name: bug
rules:
- record: "test:scalar:bug"
expr: |2
vector(1)
or
vector(2)
curl -u $CORTEX_USER:$CORTEX_KEY "$CORTEX_URL/api/prom/rules/bug" -H "Content-Type: application/yaml" --data-binary @rules-bug-api.yml -i
HTTP/2 202
content-length: 58
content-type: application/json
date: Fri, 20 Nov 2020 16:41:23 GMT
via: 1.1 google
alt-svc: clear
{"status":"success","data":null,"errorType":"","error":""}
I'm still unable to print the rules:
INFO[0000] log level set to debug
DEBU[0000] sending request to cortex api method=GET url="https://prometheus-us-central1.grafana.net/api/prom/rules"
DEBU[0000] checking response status="200 OK"
FATA[0000] unable to read rules from cortex, yaml: line 3: did not find expected key
But I can GET them from the API:
curl -u $CORTEX_USER:$CORTEX_KEY "$CORTEX_URL/api/prom/rules"
bug:
- name: bug
rules:
- record: test:scalar:bug
expr: |4
vector(1)
or
vector(2)
Also, this rule does not run! I don't see the test:scalar:bug metric in my database.
If I create the same rule on a single line, then it works, so I think both Cortex and cortextool have an issue with YAML block scalars that use an indentation indicator, as described in the Prometheus docs.
Right now when you try to delete, it only allows a group specification.
Recording rules have best practices to follow; one of them is the level:metric:operations naming convention.
To ensure rules are named properly within a file, I think we could add a flag to rules lint that ensures the metric name follows this convention.
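A check along these lines could be quite small. The sketch below is one strict reading of the convention (exactly three non-empty, colon-separated segments made of letters, digits and underscores); followsNamingConvention is a hypothetical name and the exact rules a real lint flag should apply are an open question.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// segment matches one part of a level:metric:operations name. Restricting
// each part to letters, digits and underscores is an assumption of this sketch.
var segment = regexp.MustCompile(`^[a-zA-Z_][a-zA-Z0-9_]*$`)

// followsNamingConvention is a hypothetical lint check: it reports whether a
// recording rule name has exactly three valid, colon-separated segments.
func followsNamingConvention(name string) bool {
	parts := strings.Split(name, ":")
	if len(parts) != 3 {
		return false
	}
	for _, p := range parts {
		if !segment.MatchString(p) {
			return false
		}
	}
	return true
}

func main() {
	for _, name := range []string{"agent:custom_server_info:up", "sum_up", "value"} {
		fmt.Printf("%s -> %v\n", name, followsNamingConvention(name))
	}
}
```

Of the recording rule names that appear on this page, agent:custom_server_info:up and test:scalar:bug would pass, while sum_up and value would be flagged.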
The help text for alertmanager load shows the usage as:
usage: cortextool alertmanager load <config> [<template-files>...]
However, the implementation does not use the template files for anything.
https://github.com/grafana/cortextool/blob/master/pkg/commands/alerts.go#L76
I'd like to be able to list rules in Cortex, then programmatically process them. The output of rules list is great for the human eye, but needlessly difficult to program around.
We should add an -o, --output flag to support YAML output. JSON output would also be appreciated, though plenty of client tools can make this conversion as necessary.
As we lint/prepare we should not include the tags that are empty.
I created an issue on Cortex (cortexproject/cortex#3357) about weird directory processing behavior on the alertmanager side, but @gotjosh suggested I create an issue here to address the root problem. I wouldn't assume the path that I'm asking cortextool to read it from on the local machine would have any effect on how cortex processes the file on the backend. I'm not sure the path should be sent to the backend. Even the filename is kind of annoying to have to be sent, but to match up the template name with the alert yaml I think that is necessary.
This seems like an issue where the tool should ignore the directory specified and not send that full path to the backend.
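Dropping the directory part before upload is a one-line change with path/filepath. The helper name templateName and the sample path below are hypothetical; the sketch only shows the idea of sending the file name rather than the full local path.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// templateName is a hypothetical helper: it keeps only the file name, so the
// local directory layout is never sent to the backend.
func templateName(localPath string) string {
	return filepath.Base(localPath)
}

func main() {
	fmt.Println(templateName("configs/alertmanager/templates/slack.tmpl")) // slack.tmpl
}
```

One consequence of this choice is that two templates with the same file name in different directories would collide, so the tool would probably need to reject that case.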
If a user does not front cortex with an authentication gateway, adding the tenant ID by default can be useful.
It would be great if this was packaged for homebrew to make it easier to update/manage via brew.
Currently, cortextool always sets the namespace based on the name of the file. This behavior results in rule organization that feels quite unnatural. For example, if we want to define one alert per file, we end up with an absurd number of namespaces.
Additionally, not having control over the namespace makes the new sync functionality difficult to use. It allows us to ignore namespaces, but since there are so many namespaces that are dynamically created, this flag doesn't do anything particularly useful for us. We would much rather use sync with a specific namespace, then have all the changes applied within that namespace.