fairwindsops / astro

[alpha] Emit Datadog monitors based on Kubernetes state.

Home Page: https://fairwinds.com

License: Apache License 2.0

Go 94.65% Dockerfile 0.77% Shell 4.58%
datadog kubernetes monitoring fairwinds-incubator

astro's Introduction

Astro


Astro is designed to simplify Datadog monitor administration. It is an operator that emits Datadog monitors based on Kubernetes state: it responds to changes to resources in your Kubernetes cluster and manages Datadog monitors to match the configured state.

Installing

The Astro helm chart is the preferred way to install Astro into your cluster.
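For example, a minimal values file might look like the following. This is a sketch only; the custom_config and secret keys are assumptions based on the helm invocation quoted in the issues below, so verify them against the chart's values.yaml.

custom_config:
  enabled: true      # load monitor definitions from the data below
  data: |
    ---
    rulesets: []     # your rulesets, per the Configuration File section below
secret:
  create: true       # have the chart create the secret holding the Datadog API/APP keys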

Configuration

The application is configured through a combination of environment variables and a YAML file. An example configuration file is available here.

Environment Variables

  • DD_API_KEY: The API key for your Datadog account. Required.
  • DD_APP_KEY: The app key for your Datadog account. Required.
  • OWNER: A unique name to designate as the owner of the generated monitors. A tag with the owner's value is applied to managed monitors. If deploying Astro on multiple clusters, provide a different owner value for each cluster to segregate monitor management. Optional; defaults to astro.
  • DEFINITIONS_PATH: The path to the monitor definition configurations. This can be a local path or a URL; separate multiple paths with a ;. Optional; defaults to conf.yml.
  • DRY_RUN: When set to true, monitors will not be managed in Datadog. Optional; defaults to false.
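For example, the keys might be supplied to the operator's container like this (a sketch; the secret name datadog-keys is hypothetical):

env:
- name: DD_API_KEY
  valueFrom:
    secretKeyRef:
      name: datadog-keys   # hypothetical secret holding the Datadog keys
      key: DD_API_KEY
- name: DD_APP_KEY
  valueFrom:
    secretKeyRef:
      name: datadog-keys
      key: DD_APP_KEY
- name: OWNER
  value: prod-cluster-astro  # unique per cluster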

Configuration File

A configuration file is used to define your monitors. These are organized as rulesets; each ruleset consists of the type of resource it applies to, annotations that must be present on a resource for it to match, and a set of monitors to manage for that resource. Go templating syntax may be used in your monitors, and values are inserted from each Kubernetes object that matches the ruleset. There is also a cluster_variables section where you can define your own variables, which can likewise be inserted into monitor templates.

---
cluster_variables:
  var1: test
  var2: test2
rulesets:
- type: deployment
  match_annotations:
  - name: astro/owner
    value: astro
  monitors:
    dep-replica-alert:
      name: "Deployment Replica Alert - {{ .ObjectMeta.Name }}"
      type: metric alert
      query: "max(last_10m):max:kubernetes_state.deployment.replicas_available{kubernetescluster:foobar,deployment:{{ .ObjectMeta.Name }}} <= 0"
      message: |-
        {{ "{{#is_alert}}" }}
        Available replicas is currently 0 for {{ .ObjectMeta.Name }}
        {{ "{{/is_alert}}" }}
        {{ "{{^is_alert}}" }}
        Available replicas is no longer 0 for {{ .ObjectMeta.Name }}
        {{ "{{/is_alert}}" }}
      tags: []
      options:
        no_data_timeframe: 60
        notify_audit: false
        notify_no_data: false
        renotify_interval: 5
        new_host_delay: 5
        evaluation_delay: 300
        timeout_h: 1
        escalation_message: ""
        thresholds:
          critical: 2
          warning: 1
          unknown: -1
          ok: 0
          critical_recovery: 0
          warning_recovery: 0
        include_tags: true
        require_full_window: true
        locked: false

  • cluster_variables: (dict). A collection of variables that can be referenced in monitors by prefixing them with ClusterVariables, e.g. {{ ClusterVariables.var1 }}.
  • rulesets: (List). A collection of rulesets. A ruleset consists of a Kubernetes resource type, annotations the resource must have to be considered valid, and a collection of monitors to manage for the resource.
    • type: (String). The type of resource the ruleset applies to. Currently supported values are deployment, namespace, binding, and static.
    • match_annotations: (List). A collection of name/value pairs of annotations that must be present on the resource to manage it.
    • bound_objects: (List). A collection of object types that are bound to this object. For instance, if you have a ruleset for a namespace, you can bind other objects like deployments and services to it; when a bound object in the namespace is updated, the namespace's rulesets apply to it (see the sketch after this list).
    • monitors: (Map). A collection of monitors to manage for any resource that matches the rules defined.
      • Monitor Identifier (map key: unique and arbitrary, it should only include alpha characters and -)
        • name: Name of the Datadog monitor.
        • type: The type of the monitor, chosen from:
          • metric alert
          • service check
          • event alert
          • query alert
          • composite
          • log alert
        • query: The monitor query to notify on.
        • message: A message included in monitor notifications.
        • tags: A list of tags to add to your monitor.
        • options: A dict of options, consisting of the following:
          • no_data_timeframe: Number of minutes before a monitor will notify if data stops reporting.
          • notify_audit: boolean that indicates whether tagged users are notified if the monitor changes.
          • notify_no_data: boolean that indicates if the monitor notifies if data stops reporting.
          • renotify_interval: Number of minutes after the last notification a monitor will re-notify.
          • new_host_delay: Number of seconds to wait for a new host before evaluating the monitor status.
          • evaluation_delay: Number of seconds to delay evaluation.
          • timeout_h: Number of hours before the monitor automatically resolves if it's not reporting data.
          • escalation_message: Message to include with re-notifications.
          • thresholds: Map of thresholds for the alert. Valid options are:
            • ok
            • critical
            • warning
            • unknown
            • critical_recovery
            • warning_recovery
          • include_tags: When true, notifications from this monitor automatically insert triggering tags into the title.
          • require_full_window: boolean indicating if a monitor needs a full window of data to be evaluated.
          • locked: boolean indicating if changes are only allowed from the creator or admins.
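
For instance, a namespace ruleset that binds deployments might look like the following. This is a sketch: the monitor is illustrative, and the exact templating context for bound objects may differ.

---
rulesets:
- type: namespace
  match_annotations:
  - name: astro/owner
    value: astro
  bound_objects:
  - deployment
  monitors:
    ns-deploy-replicas:
      name: "Replica Alert - Namespace {{ .ObjectMeta.Name }}"
      type: metric alert
      query: "max(last_10m):max:kubernetes_state.deployment.replicas_available{namespace:{{ .ObjectMeta.Name }}} <= 0"
      message: "A deployment in namespace {{ .ObjectMeta.Name }} has 0 available replicas"

Here, updates to deployments in a matching namespace also trigger this ruleset.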

Static monitors

A static monitor is one that does not depend on the presence of a resource in the Kubernetes cluster. An example of a static monitor would be Host CPU Usage. There are a variety of example static monitors in the static_conf.yml example.
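For reference, a minimal static ruleset might look like this (a sketch following the schema above; the query and threshold are illustrative):

---
rulesets:
- type: static
  monitors:
    host-cpu-usage:
      name: "Host CPU Usage"
      type: metric alert
      query: "avg(last_10m):avg:system.cpu.user{*} by {host} > 90"
      message: |-
        {{#is_alert}}
        CPU usage is above 90% on {{host.name}}
        {{/is_alert}}
      options:
        thresholds:
          critical: 90

Note that the Datadog template variables are written directly here, without the escaping described in the next section.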

A Note on Templating

Since Datadog's monitor templating language is very similar to Go templating, a template variable intended for Datadog must be "escaped" by inserting it as a template literal:

{{ "{{/is_alert}}" }}

The above note does not apply to static monitors; if the extra brackets are present in a static monitor, its creation will fail.
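For example (illustrative message fields only):

# In a templated ruleset (e.g. type: deployment), escape Datadog's braces:
message: '{{ "{{#is_alert}}" }} Replicas are low {{ "{{/is_alert}}" }}'
# In a static monitor, write them directly:
message: '{{#is_alert}} CPU is high on {{host.name}} {{/is_alert}}'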

Overriding Configuration

It is possible to override monitor elements using Kubernetes resource annotations.

You can annotate an object like so to override the name of the monitor:

annotations:
  astro.fairwinds.com/override.dep-replica-alert.name: "Deployment Replicas Alert"

In the example above, we modify the dep-replica-alert monitor (dep-replica-alert is the Monitor Identifier from the config) to give it a new name. As of now, the only fields that can be overridden are:

  • name
  • message
  • query
  • type
  • threshold-critical
  • threshold-warning

Templating in the override is currently not available.
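For example, overriding both the name and the critical threshold of the same monitor might look like this (a sketch; note that an issue below reports threshold overrides not working as documented):

annotations:
  astro.fairwinds.com/override.dep-replica-alert.name: "Deployment Replicas Alert"
  astro.fairwinds.com/override.dep-replica-alert.threshold-critical: "1"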

Contributing

PRs welcome! Check out the Contributing Guidelines, Code of Conduct, and Roadmap for more information.

Further Information

A history of changes to this project can be viewed in the Changelog.

If you'd like to learn more about Astro, or if you'd like to speak with a Kubernetes expert, you can contact [email protected] or visit our website.

License

Apache License 2.0

Join the Fairwinds Open Source Community

The goal of the Fairwinds Community is to exchange ideas, influence the open source roadmap, and network with fellow Kubernetes users. Chat with us on Slack or join the user group to get involved!

Love Fairwinds Open Source? Share your business email and job title and we'll send you a free Fairwinds t-shirt!

Other Projects from Fairwinds

Enjoying Astro? Check out some of our other projects:

  • Polaris - Audit, enforce, and build policies for Kubernetes resources, including over 20 built-in checks for best practices
  • Goldilocks - Right-size your Kubernetes Deployments by comparing your memory and CPU settings against actual usage
  • Pluto - Detect Kubernetes resources that have been deprecated or removed in future versions
  • Nova - Check to see if any of your Helm charts have updates available
  • rbac-manager - Simplify the management of RBAC in your Kubernetes clusters

astro's People

Contributors

alexandre-bolduc-braze, dependabot-preview[bot], isaaguilar, lucasreed, r0bobo, rbren, reactiveops-bot


astro's Issues

Overriding configurations for alert thresholds with annotations does not work

Context

Based on the documentation here, overriding alert configurations with the field "threshold-critical" does not work.

Process

  • Created a custom monitor (apdex-alert) in configuration file
  • Added annotations to override monitor name, query, and message in a Kubernetes deployment and there are no logged errors for these.
  • Added an annotation to override the alert threshold in the same deployment: "astro.fairwinds.com/override.apdex-alert.threshold-critical: 0.80"

Expected Result

  • Monitor is created
  • Newly created monitor reflects the new values

Current Result

  • Monitor is not created.
  • The following can be seen in the logs

time="2021-06-02T15:55:58Z" level=warning msg="override provided does mot match any monitor fields. provided field: threshold-critical"

Please help fix or suggest any necessary modifications.

Mute monitors with an annotation

Support for muting monitors by adding an annotation would be beneficial. For example, an annotation could be added to a kubernetes object:

kubernetes.io/astro/mute-until: "2019-12-30 00:00:00"

and a downtime would be created to turn off alerting for that object.

Unit Testing

Add unit testing and ensure at least 80% test coverage.

Not removing monitor when annotation is removed while astro is not running

This is based on a test I was doing when checking out a dependabot PR. I annotated a deployment for astro to create a monitor which worked fine. I stopped astro and pulled down code from a different PR, removed the annotation, then started astro and the monitor was not removed. It should be able to handle this.

I'd like to reproduce this a few times, but wanted to log this issue so I didn't forget.

Ignore resources

Add the ability to ignore resources. For example, say a deployment exists in a namespace, and you want to monitor every deployment in the namespace EXCEPT for this one. Being able to exclude it from configured monitors via an annotation, as sketched below, would be very helpful.
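One possible shape for such an annotation (purely illustrative; see also the filter proposal later in this list):

annotations:
  astro/ignore-resource: "true"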

Static global monitors in config file

This is something @mjhuber and I were chatting about recently. Right now in the example config we have things like master node High System Load Average listed as namespaced monitors, which really doesn't make sense to me.

I was thinking having a new config section for static monitors that don't rely on kube state would make sense here. This would allow all monitoring for a given cluster to be fully defined and owned by astro without the weird namespaced workaround we currently have.

Interested in opinions on this.

Duplicate reconciliation when object is updated or added

I've noticed this behavior for a while and finally started looking into it but can't figure out why it's happening. The below logs are generated when updating the kstats-2 deployment in the kstats namespace. It runs this UpdateFunc code twice which also causes the reconciliation code to run twice. Any ideas on this? I can't seem to find anything in the code that would cause this, but I could be missing something.

INFO[0033] deployment/kstats/kstats-2 has been updated. 
INFO[0033] Handler got an OnUpdate event of type deployment 
INFO[0033] Loading rulesets from conf.yml               
INFO[0033] deployment/kstats/kstats-2 has been updated. 
INFO[0033] Annotation astro/admin with value fairwinds does not exist, so monitor 1 does not match 
INFO[0033] Reconcile monitor Deployment Replica Alert - kstats-2 
INFO[0033] Update templated monitor: Deployment Replica Alert - kstats-2 
INFO[0033] Monitor 14633828 needs updating.             
INFO[0033] Handler got an OnUpdate event of type deployment 
INFO[0033] Loading rulesets from conf.yml               
INFO[0033] Annotation astro/admin with value fairwinds does not exist, so monitor 1 does not match 
INFO[0033] Reconcile monitor Deployment Replica Alert - kstats-2 
INFO[0033] Update templated monitor: Deployment Replica Alert - kstats-2 
INFO[0033] Monitor 14633828 needs updating.

Documentation improvement

The docs don't make it super clear that we need to annotate deployments/namespaces in order to have monitors created. Also we can point to the how-to kube video that describes astro and how it works.

Dependabot can't resolve your Go dependency files

Dependabot can't resolve your Go dependency files.

As a result, Dependabot couldn't update your dependencies.

The error Dependabot encountered was:

github.com/fairwindsops/astro/cmd: cannot find module providing package github.com/fairwindsops/astro/cmd

If you think the above is an error on Dependabot's side please don't hesitate to get in touch - we'll do whatever we can to fix it.

View the update logs.

Objects aren't properly reconciled when an annotation is removed from a monitored object

Currently the update handler only looks at the new object. This won't work in the use case of an object that is currently monitored, but the annotations are removed from the object. The monitors won't get removed. To function correctly, the update handler should examine the new and old object. If the old object had monitors that the new one does not, they should be removed.

Support Secrets for API/APP Keys

The new Datadog charts support specifying a secret for loading in the API or app keys. This might be an option for this project too. Happy to discuss if you'd like; a rough sketch of the idea is below.
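The secret might look something like this (a sketch; the secret and key names are assumptions):

apiVersion: v1
kind: Secret
metadata:
  name: datadog-keys   # hypothetical name
type: Opaque
stringData:
  api-key: <your Datadog API key>
  app-key: <your Datadog app key>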

changes to configmap are not synced to datadog api

Steps to reproduce

Deployed astro using the helm chart with the following command:

helm --tiller-namespace ops --namespace ops upgrade um-astro fairwinds-stable/astro --debug -i --set custom_config.enabled=true --set secret.create=false --set secret.name=um-datadog-astro-secret -f custom-config.yaml

The above command deploys the following versions:

# helm --tiller-namespace ops list um-astro --output yaml
Next: ""
Releases:
- AppVersion: 1.5.3
  Chart: astro-1.0.4
  Name: um-astro
  Namespace: ops
  Revision: 7
  Status: DEPLOYED
  Updated: Mon Jul 13 21:33:48 2020

The docker image deployed is: image: quay.io/fairwinds/astro:v1.5.3

Data in custom-config.yaml

custom_config:
  enabled: false
  data: |
    ---
    cluster_variables:
      warning_notifications: "@[email protected]"
    rulesets:
    - type: deployment
      match_annotations:
        - name: astro/owner
          value: astro
      monitors:
        deploy-replica-alert:
          name: "Deployment Replica Alert - {{ .ObjectMeta.Name }}"
          type: metric alert
          query: "max(last_10m):max:kubernetes_state.deployment.replicas_available{deployment:{{ .ObjectMeta.Name }},namespace:{{ .ObjectMeta.Namespace }}} <= 0"
          message: |-
            {{ "{{#is_alert}}" }}
            Available replicas is currently 0 for {{ .ObjectMeta.Name }}
            {{ "{{/is_alert}}" }}
            {{ "{{^is_alert}}" }}
            Available replicas is no longer 0 for {{ .ObjectMeta.Name }}
            {{ "{{/is_alert}}" }}
            {{ ClusterVariables.warning_notifications }}
          tags: []
          options:
            no_data_timeframe: 60
            notify_audit: false
            notify_no_data: false
            renotify_interval: 5
            new_host_delay: 5
            evaluation_delay: 300
            timeout: 300
            escalation_message: ""
            threshold_count:
              critical: 0
            require_full_window: true
            locked: false

When I make a configuration change to custom-config.yaml, e.g. last_5m to last_10m, and run the helm upgrade command again, the configmap is updated appropriately, but the changes never make it to the Datadog monitor (it continues to show last_5m).

Configmap after helm upgrade command:

$kubectl -nops get cm um-astro-data -oyaml | grep query
          query: "max(last_10m):max:kubernetes_state.deployment.replicas_available{deployment:{{ .ObjectMeta.Name }},namespace:{{ .ObjectMeta.Namespace }}} <= 0"

pods logs after helm upgrade:

time="2020-07-13T21:53:07Z" level=info msg="Loading rulesets from /etc/config/config.yml"
time="2020-07-13T21:54:07Z" level=info msg="Loading rulesets from /etc/config/config.yml"
time="2020-07-13T21:55:07Z" level=info msg="Loading rulesets from /etc/config/config.yml"

Also, the configmap changes are only synced if I bounce the old pods.


Filter Monitors with annotations

Add filters to change monitor fields/behaviors based on annotations. The filtering system must be extensible so that new filters can be added as they're requested. This could be done by parsing annotations and returning functions that perform operations based on the results.

Some possible examples I've thought about:

  • given this annotation:
astro/ignore-resource

perhaps a function is returned that causes all monitors for the object to NOT get created. Taking that example a step further, given this annotation:

astro/ns-io-wait-times/ignore-resource

would cause only the monitor ns-io-wait-times to get ignored.

  • override any monitor field based on an annotation

For example, given this monitor definition:

    ns-io-wait-times:
      name: "I/O Wait Times"
      type: metric alert
      query: "avg(last_10m):avg:system.cpu.iowait{*} by {host} > 50"
      message: |
        {{ "{{^is_recovery }}" }}
        The I/O wait time for {{ "{host.ip}" }} is very high
        - Is the EBS volume out of burst capacity for iops?
        - Is something writing lots of errors to the journal?
        - Is there a pod doing something unexpected (crash looping, etc)?
        {{ "{{/is_recovery}}" }}
        {{ "{{#is_recovery}}" }}
        I/O wait time for {{ "{host.ip}" }} has returned to normal.
        {{ "{{/is_recovery}}" }}
        {{ "{{#is_alert}}" }}
        {{ ClusterVariables.critical_notifications }}
        {{ "{{/is_alert}}" }}
        {{ "{{#is_alert_recovery}}" }}
        {{ ClusterVariables.critical_notifications }}
        {{ "{{/is_alert_recovery}}" }}
        {{ "{{#is_warning}}" }}
        {{ ClusterVariables.warning_notifications }}
        {{ "{{/is_warning}}" }}
        {{ "{{#is_warning_recovery}}" }}
        {{ ClusterVariables.warning_notifications }}
        {{ "{{/is_warning_recovery}}" }}
      tags:
        - astro
      options:
        notify_audit: false
        new_host_delay: 300
        notify_no_data: false
        require_full_window: true
        locked: false
        thresholds:
          critical: 50
          warning: 30

We might want to override the critical threshold value from 50 to 40. (Note: the query is also affected here.)

monitors of type "log alert" are created and deleted immediately in Datadog

Docker image: v1.6.0
k8s version: 1.15.11

I am using the custom config as follows:

custom_config:
  enabled: true
  data: |
    ---
    cluster_variables:
      warning_notifications: "@[email protected]"
      cluster_name: "sandbox"
    rulesets:
    - type: deployment
      match_annotations:
        - name: astro/owner
          value: falco
      monitors:
        falco-alert:
          name: "[ASTRO TEST] Falco Alert: - {{ ClusterVariables.cluster_name }} "
          type: log alert
          query: "logs(\"service:falco (\"A shell was spawned in a container with an attached terminal\" OR \"Privileged container started\" OR \"File below /etc opened for writing\")\").index(\"main\").rollup(\"count\").last(\"5m\") > 10"
          message: |-
            {{ ClusterVariables.warning_notifications }}
          tags: []
          options:
            no_data_timeframe: 60
            notify_audit: false
            notify_no_data: false
            renotify_interval: 5
            new_host_delay: 5
            evaluation_delay: 300
            timeout: 300
            escalation_message: ""
            threshold_count:
              critical: 10
              warning: 5
            require_full_window: true
            locked: false

On adding the annotation astro/owner=falco to a deployment, the logs show that a Datadog monitor is created and then deleted right after its creation.

Logs:

time="2020-07-16T19:52:41Z" level=debug msg="deployment/ops-managed/stateful-sample-app-drupal-example has been updated."
time="2020-07-16T19:52:41Z" level=debug msg="Handler got an OnUpdate event of type deployment"
time="2020-07-16T19:52:41Z" level=debug msg="Loading rulesets from /etc/config/config.yml"
time="2020-07-16T19:52:41Z" level=debug msg="Reconcile monitor [ASTRO TEST] Falco Alert: - sandbox "
time="2020-07-16T19:52:41Z" level=debug msg="Update templated monitor: [ASTRO TEST] Falco Alert: - sandbox "
time="2020-07-16T19:52:41Z" level=debug msg="deployment/ops-managed/stateful-sample-app-drupal-example has been updated."
time="2020-07-16T19:52:41Z" level=info msg="Creating new monitor: [ASTRO TEST] Falco Alert: - sandbox "
time="2020-07-16T19:52:41Z" level=info msg="Removing monitor: [ASTRO TEST] Falco Alert: - sandbox"
time="2020-07-16T19:52:41Z" level=debug msg="Handler got an OnUpdate event of type deployment"
time="2020-07-16T19:52:41Z" level=debug msg="Old annotations match new, not updating: ops-managed/stateful-sample-app-drupal-example"

My expectation was that the monitor should only be deleted when the annotation is removed.

feature: add exponential backoff based delay when creating monitors to deal with api-throttling

While creating 5 Datadog monitors as part of my regression testing, I encountered the following API error (Rate limit of 5 requests in 600 seconds reached):

time="2020-07-21T15:59:09Z" level=error msg="Error creating monitor [ASTRO TEST] Deployment Replica Alert for Namespace: um-astro-poc in sandbox : API error 429 Too Many Requests: {\"errors\":[\"Can not create duplicate monitors: Rate limit of 5 requests in 600 seconds reached. Please try again later.\"]}"

Implementing exponential backoff and parameterizing it via a configurable parameter (maybe via the helm chart) might be a desirable option.

reload isn't creating or deleting monitors

The 1-minute reload is not triggering the addition and removal of new monitors. Is this expected behavior?

I can confirm that the updates to some of the fields are working, like modifying the message of a monitor.

The options `require_full_window` isn't working

I'm using the image: quay.io/fairwinds/astro:v1.6.0 version of the Astro controller and I'm having an issue with the require_full_window option in all of my monitors. Setting require_full_window: true in my astro-data ConfigMap is not translating to the correct setting in the actual DD monitor.

Here's a snippet of a monitor for which I'm trying to set require_full_window: true. The monitor gets created with all the right settings except that one.

# ...
    - monitors:
        deployment:
          message: |-
            {{#is_alert}}
            Daemonset 'aws-node' has an issue
            {{/is_alert}}

            {{#is_recovery}}
            Daemonset 'aws-node' ready count recovered
            {{/is_recovery}}
          name: 'aws-node Daemonset Alert'
          options:
            evaluation_delay: 0
            locked: false
            no_data_timeframe: 60
            notify_audit: false
            notify_no_data: false
            renotify_interval: 0
            require_full_window: true
            threshold_count:
              critical: 0
          query: min(last_10m):abs( max:kubernetes_state.daemonset.desired{cluster_name:my-cluster,daemonset:aws-node} - max:kubernetes_state.daemonset.ready{cluster_name:my-cluster,daemonset:aws-node} ) > 0
          type: metric alert
      type: static
