
autoheal's Introduction

Auto-heal Service

This project contains the auto-heal service. It receives alert notifications from the Prometheus alert manager and executes Ansible playbooks to resolve the root cause.

Configuration

Most of the configuration of the auto-heal service is kept in a YAML configuration file. The name of the configuration file is specified using the --config-file command line option. If this option isn't explicitly given then the service will try to load the autoheal.yml file from the current working directory.

In addition to the configuration file, the auto-heal service also uses command line options to configure the connection to the Kubernetes API and the log level. Use the -h option to get a complete list of these command line options.

The --kubeconfig command line option is used to specify the location of the Kubernetes client configuration file. When running outside of a Kubernetes cluster the auto-heal service will use $HOME/.kube/config by default, the same file used by the kubectl command. When running inside a Kubernetes cluster it will use the configuration that Kubernetes mounts automatically in the pod file system. So in most cases this command line option won't have to be given explicitly.
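
For example, to point the service at a specific client configuration file:

$ autoheal server --config-file=my.yml --kubeconfig=/path/to/kubeconfig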

Assuming that you want to use your own my.yml configuration file, a typical command line looks like this:

$ autoheal server --config-file=my.yml --logtostderr

See the autoheal.yml file for a complete example.

AWX or Ansible Tower configuration

The first section of the configuration file is named awx and it contains all the details needed to connect to the AWX or Ansible Tower server:

awx:
  address: https://myawx.example.com/api
  proxy: http://myproxy.example.com:3128
  credentialsRef:
    namespace: my-namespace
    name: my-awx-credentials
  tlsRef:
    namespace: my-namespace
    name: my-awx-ca
  project: "Auto-heal"

The address parameter is the URL of the API of the AWX server. It should contain the /api suffix, but not the /v1 or /v2 suffix, as the auto-heal service will internally decide which version to use.

The proxy parameter is optional, and it indicates what HTTP proxy should be used to connect to the AWX API. If this parameter is not specified, or if it is empty, then the connection will be direct to the AWX server, without a proxy.

The credentialsRef parameter is a reference to the Kubernetes secret that contains the user name and password used to connect to the AWX API. That secret should contain the username and password keys. For example:

apiVersion: v1
kind: Secret
metadata:
  namespace: my-namespace
  name: my-awx-credentials
data:
  username: YWxlcnQtaGVhbGVy
  password: ...
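
Such a secret can also be created directly from the command line; for example (the username here matches the base64 value above, and the password is a placeholder):

$ kubectl create secret generic my-awx-credentials \
    --namespace=my-namespace \
    --from-literal=username=alert-healer \
    --from-literal=password=...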

Alternatively it is also possible to specify the user name and password directly inside the configuration file, using the credentials section. For example:

credentials:
  username: autoheal
  password: ...

This is very convenient for development environments, but it is not recommended for production environments, as then the configuration file needs to be protected very carefully. For example, you can create a separate file for the credentials, give it restricted permissions, and then load it using the --config-file option twice:

$ cat > general.yml <<.
awx:
  address: https://myawx.example.com/api
.
$ cat > credentials.yml <<.
credentials:
  username: "autoheal"
  password: "..."
.
$ chmod u=r,g=,o= credentials.yml
$ autoheal server --config-file=general.yml --config-file=credentials.yml

The tlsRef parameter is a reference to the Kubernetes secret that contains the certificates used to connect to the AWX API. That secret should contain the ca.crt key, for example:

apiVersion: v1
kind: Secret
metadata:
  namespace: my-namespace
  name: my-awx-ca
data:
  ca.crt: |-
    LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUMvVENDQWVXZ0F3SUJBZ0lKQUxNRXB6OWxa
    VkVzdzI3Sm5BYlMyejNhbUF0YTc1QmNnVGcvOUFCdDV0VVc2VTJOKzkKbXc9PQotLS0tLUVORCBD
    ...
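
Such a secret can be created directly from a PEM file; for example:

$ kubectl create secret generic my-awx-ca \
    --namespace=my-namespace \
    --from-file=ca.crt=my-ca.pem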

Alternatively it is also possible to specify the CA certificates directly inside the configuration file, using the tls section. For example:

tls:
  caCerts: |-
    -----BEGIN CERTIFICATE-----
    MIIFgzCCA2ugAwIBAgIPXZONMGc2yAYdGsdUhGkHMA0GCSqGSIb3DQEBCwUAMDsx
    CzAJBgNVBAYTAkVTMREwDwYDVQQKDAhGTk1ULVJDTTEZMBcGA1UECwwQQUMgUkFJ
    ...
    -----END CERTIFICATE-----

They can also be specified indirectly, by putting the name of a PEM file in the caFile parameter:

tls:
  caFile: /etc/autoheal/my-ca.pem

The insecure parameter controls whether to use an insecure connection to the AWX server. If the connection is insecure then the server's TLS certificate will not be verified. It should always be set to false (the default) in production environments.
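
For a local development setup that could look like the following sketch; whether insecure sits alongside the other tls parameters is an assumption, so check autoheal.yml for the authoritative placement:

tls:
  insecure: true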

The project parameter is the name of the AWX project that contains the job templates that will be used to run the playbooks.

Throttling configuration

The throttling section of the configuration describes how to throttle the execution of healing actions. This is intended to prevent healing storms that could happen if the same alerts are sent repeatedly to the service.

The interval parameter controls the time that the service will remember an executed healing action. If an action is triggered more than once in the given interval it will be executed only the first time; the rest of the times it will be logged and ignored (see autoheal.yml for an example).

The default interval value is one hour. Setting the interval parameter to 0 disables throttling altogether.

Note that for throttling purposes actions are considered the same if they have exactly the same fields with exactly the same values after processing them as templates. For example, an action defined like this:

awxJob:
  template: "Restart {{ $labels.service }}"

Will have different values for the template field if the triggering alerts have different service labels.

The auto-heal service performs a periodic job status check against the AWX server, to check the status of the active jobs that it triggered. The jobStatusCheckInterval parameter determines how often to perform this check. It is optional, and the default is '5m' (every 5 minutes).
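
Putting this together, a minimal throttling configuration could look like the following sketch (the top-level placement of jobStatusCheckInterval is an assumption; see autoheal.yml for the authoritative layout):

throttling:
  interval: 1h

jobStatusCheckInterval: 5m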

Healing rules configuration

The second important section of the configuration file is rules. It contains the list of healing rules used by the auto-heal service to decide which action to run for each received alert. For example:

rules:

- metadata:
    name: start-node
  labels:
    alertname: "NodeDown"
  awxJob:
    template: "Start node"
    extraVars: 
      node: "{{ $labels.instance }}"

- metadata:
    name: start-service
  labels:
    alertname: ".*Down"
    service: ".*"
  awxJob:
    template: "Start service"

The above example contains two healing rules. The first rule will be executed when the alert received contains a label named alertname with a value that matches the regular expression NodeDown.

The second rule will be executed when the alert received contains the labels alertname and service, matching the regular expressions .*Down and .* respectively.

The metadata parameter of each rule is used to specify the name of the rule, which is used by the auto-heal service to reference it in log messages and in metrics.

The labels and annotations parameters of a rule are maps of strings used to specify the labels and annotations that the alerts should contain in order to match the rule. The keys of these maps are the names of the labels or annotations. The values of these maps are regular expressions that the values of those labels or annotations should match.
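
For example, a rule that matches on an annotation in addition to a label might look like this sketch (the severity annotation is illustrative):

- metadata:
    name: restart-on-critical
  labels:
    alertname: ".*Down"
  annotations:
    severity: "critical"
  awxJob:
    template: "Restart service"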

The awxJob parameter indicates which job template should be executed when an alert matches the rule.

The template parameter is the name of the AWX job template.

The extraVars parameter is optional, and if specified it is used to pass additional variables to the playbook, like with the --extra-vars option of the ansible-playbook command.

Regardless of the extraVars setting, the content of the alert that triggered the AWX job will be passed to the playbook as part of extraVars, in a variable named alert.
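
On the playbook side that variable can be used like any other extra variable; a hypothetical task, assuming the alert keeps the usual Alertmanager structure with labels and annotations keys:

- name: Report which instance triggered the healing job
  debug:
    msg: "Healing triggered by an alert on {{ alert.labels.instance }}"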

The limit parameter is optional, and if specified it is passed to AWX to constrain the list of hosts managed or affected by the playbook. Multiple patterns can be separated by colons (:). As with core Ansible, a:b means "in group a or b", a:b:&c means "in a or b but must be in c", and a:!b means "in a, and definitely not in b".

Note that in order to be able to use the extraVars and limit mechanisms the AWX job template must have the Prompt on launch box checked; otherwise the variables passed will be ignored.

The values of all the parameters inside awxJob are processed as Go templates before executing the job. These templates receive the details of the alert inside the $labels and $annotations variables. For example, to generate dynamically the name of the job templates to execute from the value of the template annotation of the alert:

awxJob:
  template: "{{ $annotations.template }}"

Or to pass a variable node to the playbook, calculated from the instance label:

awxJob:
  template: "My template"
  extraVars: 
    node: "{{ $labels.node }}"

Limit execution to a host, calculated from the instance label:

awxJob:
  template: "My template"
  limit: "{{ $labels.instance }}"

Alertmanager Configuration

Follow the upstream Prometheus Alertmanager documentation to configure alerts.

For reference, here is an example Alertmanager configuration that sends an alert to the auto-heal service with authentication. This example assumes autoheal and the Alertmanager are running on the same OpenShift cluster, and requires Alertmanager 0.15 or newer.

global:
  resolve_timeout: 1m

route:
  group_wait: 1s
  group_interval: 1s
  repeat_interval: 5m
  receiver: autoheal
  routes:
  - match:
      alertname: DeadMansSwitch
    repeat_interval: 5m
    receiver: autoheal 
receivers:
- name: default
- name: deadmansswitch
- name: autoheal
  webhook_configs:
  - url: https://autoheal.openshift-autoheal.svc/alerts
    http_config:
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt

When using the cluster-monitoring-operator, save the configuration as alertmanager.yaml and use this command to apply it:

$ oc create secret generic alertmanager-main \
   --namespace=openshift-monitoring \
   --from-literal=alertmanager.yaml="$(< alertmanager.yaml)" \
   --dry-run -oyaml \
   | \
   oc replace secret \
   --namespace=openshift-monitoring \
   --filename=-

Building

To build the binary run this command:

$ make

To build the RPM and the images, run this command:

$ make build-images

Testing

To run the automated tests of the project run this command:

$ make check

To manually test the service, without needing a running Prometheus alert manager to generate the alert notifications, you can use the *-alert.json files that are inside the examples directory. For example, to simulate the NodeDown alert start the server and then use curl to send the alert notification:

$ autoheal server --config-file=my.yml
$ curl --data @examples/node-down-alert.json http://localhost:9099/alerts

Installing

To install the service to an OpenShift cluster use the template contained in the template.yml file. This template requires at the very minimum the address and the credentials to connect to the AWX or Ansible Tower server. See the template.sh script for an example of how to use it.
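
For illustration, an OpenShift template like this is typically instantiated with oc process and piped into oc create; the parameter name below is hypothetical, see template.sh for the real ones:

$ oc process --filename=template.yml \
    --param=AWX_ADDRESS=https://myawx.example.com/api \
    | oc create --filename=-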

Development

If needed for development, we can run the server without an OpenShift cluster, simulating OpenShift's alert manager using curl commands.

In the examples dir we have examples of firing alerts, and a configuration file that does not require a connection to a working OpenShift cluster.

To run autoheal in dev mode (without a running OpenShift cluster) developers can use the dev config file in the examples dir.

To simulate alerts firing, developers can use the example alerts.

$ make build
$ make run-dev
$ curl --data @examples/node-down-alert.json http://localhost:9099/alerts

When developing features that don't require an AWX server, developers can use the mock AWX server from the examples dir. The mock server listens on port 8080.

$ cd examples/mock-awx
$ go run mock-awx.go

autoheal's People

Contributors

cben, jhernand, nimrodshn, openshift-merge-robot, twiest, yaacov, zgalor


autoheal's Issues

Avoid compiling and running the tests of `tools`

The tools directory of the project contains the source of some tools that have been copied from the openshift/release project, as described here. A side effect of that is that when running make check the build scripts also try to compile the tests of those tools, in particular the tests of the junitreport tool. That fails because of missing dependencies:

tools/junitreport/pkg/builder/flat/test_suites_builder.go:4:2: cannot find package "github.com/openshift/origin/tools/junitreport/pkg/api" in any of:
	/files/projects/openshift/autoheal/_output/local/go/src/github.com/openshift/autoheal/vendor/github.com/openshift/origin/tools/junitreport/pkg/api (vendor tree)
	/usr/lib/golang/src/github.com/openshift/origin/tools/junitreport/pkg/api (from $GOROOT)
	/files/projects/openshift/autoheal/_output/local/go/src/github.com/openshift/origin/tools/junitreport/pkg/api (from $GOPATH)

Currently we solved this by removing those test files from the repository, but that isn't a good long-term solution, as the tests may be accidentally re-added when the tools directory is updated in the future.

A possible solution suggested by @ironcladlou in #6 is to tell dep to ignore that directory, adding the following to the Gopkg.toml file:

ignored = ["github.com/openshift/autoheal/tools*"]

That seems like a more correct solution, but in our initial tests it didn't solve the problem. In addition, even if it did, the tests of the junitreport tool would then be executed in addition to the tests of the project itself, polluting the output:

ok  	github.com/openshift/autoheal/cmd/autoheal	1.452s	coverage: 11.8% of statements
ok  	github.com/openshift/autoheal/tools/junitreport/pkg/builder/flat	1.009s	coverage: 100.0% of statements
ok  	github.com/openshift/autoheal/tools/junitreport/pkg/builder/nested	1.008s	coverage: 98.5% of statements
ok  	github.com/openshift/autoheal/tools/junitreport/pkg/parser/oscmd	1.103s	coverage: 100.0% of statements
ok  	github.com/openshift/autoheal/tools/junitreport/pkg/parser/stack	1.009s	coverage: 17.2% of statements

The optimal solution would preserve the test files, wouldn't add the junitreport project or its dependencies to the vendor directory and wouldn't run the junitreport tests together with the project tests. We should try to find that solution.

Running autoheal w/o Kubernetes

Why is using k8s a must here? I have a Prometheus and AWX and I'd like it to simply run playbooks according to alerts, but autoheal forces me to also use Kubernetes. Shouldn't some workaround be possible?

Rename the API group from `monitoring` to `autoheal`

The auto-heal service uses the Kubernetes API infrastructure to define the HealingRule type. This type is inside a monitoring.openshift.io API group because originally we planned to have other types in that group unrelated to auto-heal. This does not make sense now, so we should rename that group to autoheal.openshift.io, or similar.

Improve exposed metrics usability

Currently we expose one counter, called autoheal_actions_initiated_total, with the labels template, type and rule, that is incremented each time an AWX job is successfully launched [1][2].

Current gaps:

  • No reporting on the recently added job statuses.
  • If a heal failed prior to launching the action it is not reported.
  • Maybe we need to expose received alerts as well.

[1]

h.incrementAwxActions(action, rule.ObjectMeta.Name)

[2] https://github.com/openshift/autoheal/blob/b46f71b591080365a2238362ac2525f697b9fe3a/documentation/metrics.md
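
For reference, on the metrics endpoint that counter renders along these lines (the label values and count are illustrative):

autoheal_actions_initiated_total{rule="start-node",template="Start node",type="awxJob"} 3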

Send all labels and annotations as `extraVars` by default

Currently the auto heal service can send to the AWX job a set of extraVars that are defined as a JSON document:

awxJob:
  template: "My template"
  extraVars: |-
    {
      "myvar": "myvalue",
      "yourvar": "yourvalue",
    }

This is very useful to send values of labels or annotations of the alert; for example, to send the value of the instance label:

awxJob:
  template: "My template"
  extraVars: |-
    {
      "instance": "{{ $labels.instance }}",
    }

It is so useful that it should be the default: if the extraVars field isn't used then we should automatically populate it with all the labels and annotations of the alert. For example, if the alert is like this:

labels:
  instance: 192.168.100.7:9100
  job: node-exporter-123
annotations:
  message: "Node '192.168.100.7:9100' is down"

Then we should automatically populate the extraVars field like this:

extraVars: |-
  {
    "labels": {
      "instance": "192.168.100.7:9100",
      "job": "node-exporter-123"
    },
    "annotations": {
      "message": "Node '192.168.100.7:9100' is down"
    }
  }

Actually we should probably just provide the full alert description:

extraVars: |-
  {
    "alerts": [
     ...
    ]
  }

This should be compatible with other custom extraVars that the user may want to add. For example, the following action:

awxJob:
  template: "My template"
  extraVars:
    myvar: myvalue
    yourvalue: yourvalue

Should be equivalent to this:

awxJob:
  template: "My template"
  extraVars: |-
    {
      "myvar": "myvalue",
      "yourvar": "yourvalue",
      "alerts": [...]
    }

Automatically create AWX projects and job templates

Currently, when a healing rule specifies a job template it has to exist in advance: the administrator of the AWX server needs to create it before creating the healing rule. Instead, the auto-heal service could manage its own git repository, and it could automatically create the job template and playbook when the rule is added. For example, the healing rule could be defined like this:

apiVersion: autoheal.openshift.io/v1alpha2
kind: HealingRule
metadata:
  name: my-rule
labels:
  alertname: "MyAlert"
awxJob:
  template: "My template"
  extraVars:
    instance: "{{ $labels.instance }}"
  playbook: |+
    The text of the playbook ...

When this rule is added to the configuration the auto-heal service should take the text of the playbook and commit it to its own git repository. Then it could also create the My template job template in the AWX server, if it doesn't exist yet, using the committed playbook.

The location and credentials of the git repository could then be part of the global configuration of the auto-heal service:

git:
  address: git://github.com/myuser/myrepo
  credentials:
    username: ...
    password: ...

Or we could run a small git repository within the auto-heal service pod.

In order to create the templates (and maybe other objects) in the AWX server the auto-heal service should have the credentials of an AWX user with the required permissions, probably different from the one used to run the jobs. This should also go in the global configuration of the service, for example:

awx:
  adminCredentials:
    username: "admin"
    password: ...

Make `server` the default command for autoheal cli

Today the CLI of the autoheal service has the following interface:

Usage:
  autoheal [command]

Available Commands:
  help        Help about any command
  server      Starts the auto-heal server

Since running autoheal as a server will be the most common usage, I think a simpler interface would make server the default command and turn the help command into a --help flag.

@jhernand @elad661 @moolitayer @ironcladlou your thoughts on this?

Removed configuration files aren't detected

The auto-heal service has the capability to be configured specifying a configuration directory in the command line, for example:

$ autoheal server --config-file=/etc/autoheal/config.d

In this scenario the auto-heal service should read all the .yml or .yaml files from that directory, and it should reload them when they are modified, when a new file is added and also when a file is removed. But removing files seems to have no effect. That needs to be fixed.

Automatically reload configuration

Currently the auto-heal service loads the configuration only when it is started. If any changes happen to the configuration file or to the Kubernetes secrets that it uses then the service needs to be restarted. Instead of that the service should watch for changes in the configuration file and in the secrets and reload them automatically. This watching should be implemented internally in the Config type. Other parts of the service should have a mechanism to be notified of changes in the configuration, for example:

type ConfigChangeEvent struct {
}

type ConfigChangeListener func(*ConfigChangeEvent)

config := ...
awxAddress := config.AWX().Address()
config.AddConfigChangeListener(func(event *ConfigChangeEvent) {
    awxAddress = config.AWX().Address()
})

Handle errors correctly when checking if files exist

Currently the code that checks if files exist doesn't correctly handle all the errors potentially returned by os.Stat. In particular, the code that checks if the Kubernetes client configuration file exists doesn't take into account that there may be errors other than the file not existing. See here:

_, err = os.Stat(kubeConfig)
if os.IsNotExist(err) {
    glog.Infof(
        "The Kubernetes configuration file '%s' doesn't exist, will try to use the "+
            "in-cluster configuration",
        kubeConfig,
    )
    config, err = rest.InClusterConfig()
    if err != nil {
        glog.Fatalf(
            "Error loading in-cluster REST client configuration: %s",
            err.Error(),
        )
    }
} else {
    config, err = clientcmd.BuildConfigFromFlags(kubeAddress, kubeConfig)
    if err != nil {
        glog.Fatalf(
            "Error loading REST client configuration from file '%s': %s",
            kubeConfig,
            err.Error(),
        )
    }
}

That should be fixed so that it checks and reports all potential errors.

Verify AWX `PROMPT ON LAUNCH` option before running

In order to support passing limit and extraVars to AWX template, the PROMPT ON LAUNCH option must be checked.

We should verify it is checked for each of the fields in 2 stages:

  1. When the rule is processed
  2. Before the template is launched

If one of those fields is defined in the rule, but not set to PROMPT ON LAUNCH in the AWX template, we should log a warning to the user.

Hide passwords and tokens from the log

These come from HTTP debugging of payloads and headers. We should apply regular expressions to hide them. In the output below you can see two issues: the basic auth header as well as the returned OAuth token:

I0411 12:41:53.363281   20313 connection.go:398] Sending POST request to 'http://localhost:9100/api/v2/authtoken/'.
I0411 12:41:53.363314   20313 connection.go:399] Request body:
{
  "username": "admin",
  "password": "password"
}
I0411 12:41:53.363344   20313 connection.go:400] Request headers:
I0411 12:41:53.363370   20313 connection.go:402] 	Content-Type: [application/json]
I0411 12:41:53.363400   20313 connection.go:402] 	Accept: [application/json]
I0411 12:41:53.363424   20313 connection.go:402] 	User-Agent: []
I0411 12:41:53.363449   20313 connection.go:402] 	Authorization: [Basic YWRtaW46cGFzc3dvcmQ=]
I0411 12:41:53.525526   20313 connection.go:418] Response body:
{
  "detail": "The requested resource could not be found."
}
I0411 12:41:53.525560   20313 connection.go:419] Response headers:
I0411 12:41:53.525568   20313 connection.go:421] 	Connection: [keep-alive]
I0411 12:41:53.525577   20313 connection.go:421] 	Vary: [Accept, Accept-Language, Cookie]
I0411 12:41:53.525584   20313 connection.go:421] 	X-Api-Total-Time: [0.006s]
I0411 12:41:53.525590   20313 connection.go:421] 	Content-Language: [en]
I0411 12:41:53.525597   20313 connection.go:421] 	Server: [nginx/1.12.2]
I0411 12:41:53.525604   20313 connection.go:421] 	Content-Type: [application/json]
I0411 12:41:53.525610   20313 connection.go:421] 	Content-Length: [55]
I0411 12:41:53.525617   20313 connection.go:421] 	Date: [Wed, 11 Apr 2018 09:41:53 GMT]
W0411 12:41:53.525626   20313 connection.go:229] Failed to aquire authtoken 'Status code '404' returned from server: '404 Not Found'', attempting PAT
I0411 12:41:53.526114   20313 connection.go:398] Sending POST request to 'http://localhost:9100/api/v2/users/admin/personal_tokens/'.
I0411 12:41:53.526128   20313 connection.go:399] Request body:
{
  "description": "AWX Go Client",
  "application": null,
  "scope": "write"
}
I0411 12:41:53.526138   20313 connection.go:400] Request headers:
I0411 12:41:53.526144   20313 connection.go:402] 	User-Agent: []
I0411 12:41:53.526152   20313 connection.go:402] 	Authorization: [Basic YWRtaW46cGFzc3dvcmQ=]
I0411 12:41:53.526159   20313 connection.go:402] 	Content-Type: [application/json]
I0411 12:41:53.526166   20313 connection.go:402] 	Accept: [application/json]
I0411 12:41:54.005212   20313 connection.go:418] Response body:
{
  "id": 1,
  "type": "o_auth2_access_token",
  "url": "/api/v2/tokens/1/",
  "related": {
    "user": "/api/v2/users/1/",
    "activity_stream": "/api/v2/tokens/1/activity_stream/"
  },
  "summary_fields": {
    "user": {
      "id": 1,
      "username": "admin",
      "first_name": "",
      "last_name": ""
    }
  },
  "created": "2018-04-11T09:41:53.732408Z",
  "modified": "2018-04-11T09:41:53.740174Z",
  "description": "AWX Go Client",
  "user": 1,
  "token": "DIfUPU5Br8hXbwb5BTTZOWiY8chVKJ",
  "refresh_token": null,
  "application": null,
  "expires": "3017-08-12T09:41:53.731847Z",
  "scope": "write"
}
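
A minimal sketch of the kind of regular-expression redaction that could be applied to each line before it is logged; the helper below is hypothetical, not part of the current code, and covers only the two leaks visible above:

package redact

import "regexp"

// secretRe matches the value part of Authorization headers and of
// password/token fields in JSON request or response bodies.
var secretRe = regexp.MustCompile(
    `((?:Authorization: \[(?:Basic|Bearer) )|(?:"(?:password|token|refresh_token)": "))[^"\]]*`)

// Redact masks the secret value while keeping the surrounding context.
func Redact(line string) string {
    return secretRe.ReplaceAllString(line, "$1***")
}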

Send logs to stdout or stderr by default

Currently the autoheal application doesn't send the logs to stderr or stdout by default. To do so it requires the --logtostderr command line option. We should change that so that logs go to stderr or stdout by default, without requiring any option.

Configure throttling interval per heal?

We are currently adding a configurable throttling interval to avoid running the same heal twice.
Use cases:

  • Alert fired again while heal is still running
  • Several different alerts are connected to the same heal

We might want to consider adding an override interval within each heal, as some heals might take longer than others.
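
One possible shape for such an override, mirroring the global throttling section inside each rule (hypothetical syntax):

- metadata:
    name: start-node
  throttling:
    interval: 2h
  awxJob:
    template: "Start node"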

Add support for passing the `limit` parameter to AWX jobs

When an AWX job is started it is possible to pass a limit parameter to limit the hosts where it will run. Currently the auto-heal service doesn't support this, but it should allow passing it. For example, to pass the value of the instance label of the alert:

awxJob:
  template: "My template"
  extraVars:
    myvar: myvalue
    yourvar: yourvalue
  limit: "{{ $labels.instance }}"

To be able to use that limit parameter the job template has to be configured with the Prompt on launch flag, so we should also check that it is set.

Add support for Ansible Runner

Currently the auto-heal service can react to alerts running AWX jobs or Kubernetes batch jobs. It should also be able to run playbooks using Ansible Runner.

The description of the healing action should contain all the data needed to run the playbook, in a structure similar to the one used by Ansible Runner itself. For example:

apiVersion: autoheal.openshift.io/v1alpha2
kind: HealingRule
metadata:
  name: my-rule
labels:
  alertname: "MyAlert"
ansibleRunner:
  env:
    envvars:
      TESTVAR: aval
    extravars:
      ansible_connection: local
      test: val
    passwords:
      "Password:\\s*?$": "some_password"
    settings:
      idle_timeout: 600
      job_timeout: 3600
      pexpect_timeout: 10
    ssh_key: ...
  inventory:
    hosts:
      ...
  project:
    test.yml:
      ...

The auto-heal service should take all this information and create a config map (or secret) containing it. It should then start a Kubernetes job that runs the Ansible Runner image with the config map (or secret) mounted, so that Ansible Runner can use it.

In order to be able to pass passwords and SSH keys to Ansible Runner the auto-heal service should also be able to access Kubernetes secrets, and inject their values into the generated config map. For example, assuming that the SSH private key will be stored in a secret named mysshkey, the auto-heal service should support something like this:

ssh_key: $secrets.mysshkey['ssh-privatekey']

Use etcd and an API server to manage the healing rules

Currently the healing rules are stored in the configuration file of the auto-heal service. This is OK when there are only a few rules, and when reconfiguring the auto-heal service to add/remove/modify healing rules isn't frequent. But to be really useful the auto-heal service should be very easy to reconfigure, because problems and their temporary solutions are very dynamic.

One possible solution to make the auto-heal service more dynamic is to create an API server that manages the healing rules, storing them in an etcd database. This would allow other components to add/remove/modify healing rules via this API, without having to change the configuration of the auto-heal service directly. We could even build a GUI on top of that API, to allow human users to add/remove/modify healing rules manually.

Initially this API server doesn't need to be integrated with the rest of API servers of the cluster, but it could, in the future.

I have created #84 to explore this.

Add tests

Currently different parts of autoheal are not tested (rules_worker, config, files under pkg/, etc.); we should add tests for these.

cc: @jhernand

Check if a job is still running before relaunching it

The idea is not to run a heal for a recurring alert if the previous heal is still running.
This is relevant for cases where the healing playbook runs longer than the throttling interval.

If #18 is implemented, it could address this issue, but it will require the user to learn how long a playbook takes, which could be different on different setups.

The implementation is not trivial: we connect the alert to a template, but we don't store the running jobs.
The connection between template and job is done in AWX, and we currently don't save that data.

After unmarshaling, rules have no name

Log has:

I0403 19:00:11.428969   15651 rules_worker.go:83] Rule '' was added

Not sure why this is suddenly happening. I have seen it on master with the default configuration.
