chaostoolkit / chaostoolkit-lib

The Chaos Toolkit core library

Home Page: https://chaostoolkit.org/

License: Apache License 2.0

Python 100.00%

chaos-engineering chaostoolkit chaostoolkit-core reliability-engineering

chaostoolkit-lib's Introduction

Chaos Toolkit - Chaos Engineering for Everyone

Community • ChangeLog

Purpose

The purpose of this library is to provide the core of the Chaos Toolkit model and functions it needs to render its services.

Features

The library provides the following features:

- Discover capabilities from extensions. This lets you explore what an extension supports, which helps you initialize an experiment against the system that extension targets.
- Validate a given experiment's syntax. The validation looks at various keys in the experiment and raises errors whenever something doesn't look right. As a nice addition, when a probe calls a Python function with arguments, it tries to validate that the given argument list matches the signature of the function to apply.
- Run your steady state before and after the method. The former acts as a gate to decide whether the experiment can be executed. The latter checks whether the system deviated from normal.
- Run probes and actions declared in an experiment. It runs the steps in an experiment method sequentially, applying first steady probes, then actions and finally close probes. A journal, as a JSON payload, is returned from the experiment run. The library supports running probes and actions defined as Python functions, from importable Python modules, processes and HTTP calls.
- Run the experiment's rollbacks when provided.
- Load secrets from the experiment, the environment or Vault.
- Provide event notifications from the Chaos Toolkit flow (although the actual events are published by the CLI itself, not by this library). Supported events are:
  - on experiment validation: started, failed or completed
  - on discovery: started, failed or completed
  - on initialization of experiments: started, failed or completed
  - on experiment runs: started, failed or completed
  For each event, the corresponding payload is part of the event, as well as a UTC timestamp.
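
To make the argument validation mentioned in the features above more concrete, here is a minimal sketch of how a declared argument list could be checked against a Python function's signature. It is not the library's actual code: `validate_arguments` and `count_pods` are hypothetical names, and the approach (binding with the standard `inspect` module) is an assumption about one reasonable way to do it.

```python
import inspect


def validate_arguments(func, arguments):
    """Raise ValueError if `arguments` cannot be bound to `func`'s signature."""
    signature = inspect.signature(func)
    try:
        # bind() applies the usual call rules without calling the function
        signature.bind(**arguments)
    except TypeError as exc:
        raise ValueError(
            "arguments do not match the signature of '{}': {}".format(
                func.__name__, exc)
        ) from exc


# Hypothetical probe function, used for illustration only
def count_pods(label_selector, ns="default"):
    return 0


# Passes silently because the required argument is provided
validate_arguments(count_pods, {"label_selector": "app=web"})
```

A missing required argument or an unknown argument name both surface as a `ValueError` before the experiment ever runs.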

Install

If you are a user of the Chaos Toolkit, you probably do not need to install this package yourself as it comes along with the chaostoolkit CLI.

However, should you wish to integrate this library in your own Python code, please install it as usual:

$ pip install -U chaostoolkit-lib

Specific dependencies

In addition to essential dependencies, the package can install a couple of other extra dependencies for specific use-cases. They are not mandatory and the library will warn you if you try to use a feature that requires them.

Vault

If you need Vault support to read secrets, run the following command:

$ pip install -U chaostoolkit-lib[vault]

To authenticate with Vault, you can either:

- Use a token through the vault_token configuration key
- Use an AppRole via the vault_role_id, vault_secret_id pair of configuration keys
- Use a service account configured with an appropriate role via the vault_sa_role configuration key. The vault_sa_token_path, vault_k8s_mount_point, and vault_secrets_mount_point configuration keys can optionally be specified to point to a location containing a service account token, a different Kubernetes authentication method mount point, or a different secrets mount point, respectively.

JSON Path

If you need JSON Path support for tolerance probes in the hypothesis, also run the following command:

$ pip install -U chaostoolkit-lib[jsonpath]

Contribute

Contributors to this project are welcome as this is an open-source effort that seeks discussions and continuous improvement.

From a code perspective, if you wish to contribute, you will need to run a Python 3.6+ environment. Please, fork this project, write unit tests to cover the proposed changes, implement the changes, ensure they meet the formatting standards set out by black, flake8, and isort, add an entry into CHANGELOG.md, and then raise a PR to the repository for review.

Please refer to the formatting section for more information on the formatting standards.

The Chaos Toolkit projects require all contributors to sign a Developer Certificate of Origin on each commit they would like to merge into the master branch of the repository. Please make sure you can abide by the rules of the DCO before submitting a PR.

Develop

If you wish to develop on this project, make sure to install the development dependencies. To do so, first install pdm.

$ pdm install --dev

Now, you can edit the files and they will automatically be seen by your environment, even when running from the chaos command locally.

Test

To run the tests for the project execute the following:

$ pdm run test

Formatting and Linting

We use ruff to perform linting and code style.

Before raising a Pull Request, we recommend you run formatting against your code with:

$ pdm run format

This will automatically format any code that doesn't adhere to the formatting standards.

As some things are not picked up by the formatting, we also recommend you run:

$ pdm run lint

This ensures that unused import statements, strings that are too long, etc. are also picked up.

chaostoolkit-lib's Issues

Add authorization support to HTTP activities

Most HTTP APIs sit behind authorization, so it should be straightforward to provide credentials to experiments when needed.

Make sense of the ply dependency requirement failure

Reading vault secrets is not working as you'd expect

Currently, the whole Vault payload of a secret is read into the chaostoolkit secret section (including the vault secret metadata). This is not what you'd expect. Also, it's not intuitive that the key argument refers to the path.

Fix needed for activity-level controls to work as expected

Currently there is a bug when a controls block is applied at the activity level (i.e. not at the top level) and no top-level controls are applied either.

A fix such as the following needs to be applied to the chaos lib from line 201 onwards:

    for c in controls.copy():
        if "ref" in c:
            for top_level_control in top_level_controls:
                if c["ref"] == top_level_control["name"]:
                    controls.append(deepcopy(top_level_control))
                    break
        else:
            tc = None
            for tc in top_level_controls:
                if c.get("name") == tc.get("name"):
                    break
            else:
                if tc and tc.get("automatic", True):
                    controls.append(deepcopy(tc))

HTTP provider must allow requests against HTTPS endpoint that are self-signed

When performing local tests, a user may rely on a self-signed certificate for their server. The HTTP probe must therefore accept a parameter to disable TLS verification.
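
As a rough illustration of the request above, the sketch below turns an HTTP provider block into keyword arguments shaped for `requests.request`, whose `verify` parameter controls certificate verification. The `verify_tls` key name and the `build_request_kwargs` helper are assumptions made for this example, not a confirmed part of the library.

```python
def build_request_kwargs(provider):
    """Translate an HTTP provider block into keyword arguments for an HTTP client."""
    return {
        "method": provider.get("method", "GET"),
        "url": provider["url"],
        # `verify_tls` is an assumed key name; verification stays on by default
        "verify": provider.get("verify_tls", True),
    }


# A provider pointing at a local server with a self-signed certificate
local_provider = {
    "type": "http",
    "url": "https://localhost:8443/health",
    "verify_tls": False,
}
request_kwargs = build_request_kwargs(local_provider)
```

Keeping verification enabled unless the experiment explicitly opts out avoids silently weakening TLS checks for everyone else.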

Provide a way to mark an action as "dangerous"

While Chaos Engineering may degrade the system, we should be careful not to cause too much harm. A mechanism to mark an action as "dangerous" could help with those use cases.

From the CLI side, this could translate into asking the user for confirmation before running an experiment?

Catch expired vault secret id

When the Vault secret id of the AppRole has expired, the whole process blows up. Catch this and fail gracefully.

JSON Path cannot be empty

Fail validation when the JSON Path is empty.

Allow config/secrets to be passed directly to an action or probe, overriding global defaults

Call to hvac read_secret fails on KV v2

I got tricked into thinking that you could call client.secrets.kv.read_secret(path) as per the documentation but it seems the documentation is quite out of sync with the code.

Add requirements.txt for test dependencies

Can we pass results from one activity to the rest of the experiment?

The toolkit does its best not to have a global state and, so far, there was never really a need to take the output of an activity and feed it into another activity. But there are cases when this is useful (when an operation returns an ID, for instance).

Let's see how we can add this.

Control is duplicated

Controls seem to be duplicated while they are applied

Add support for saving settings

At the moment there is a load_settings function, but no counterpart to save settings back once they have been changed in some way.

control can't handle ref activity

When a before_activity control is executed, if the activity references another activity, the reference is not resolved beforehand, so the control has no real context.

Log a debug message of the file where an activity was loaded from

For debugging purposes, it could be handy to log a message stating where a particular activity was loaded from.

Pass the experiment to all controls

Currently, only the current context (steady-state, method, activity...) is passed down to a control function. We should also pass the experiment, as it contains a larger context that may be useful as well.

Ideally this should go into 1.0.0rc2

Use safe_load from pyyaml

It is not recommended to use pyyaml.load so let's use safe_load instead.

Allow validation of experiment without importing modules

Sometimes, we want to validate the experiment in a more shallow fashion and we can't load the Python providers in those cases. Add a flag to support that case.

Set author name to contact address.

Make wording around steady-state-hypothesis more informative, or remove if hypothesis block is optional

when using probe in `method` tolerance is not validated

Hi,

I just noticed that it is allowed to use probes inside method as opposed to steady-state-hypothesis but at the same time tolerance is not validated when doing so.

I find it useful to use probes inside method, first example use case I have is when I don't want to run probe both before and after some actions. Instead the probe can be used to validate some conditions in the middle of experiment. For example, I stop a random instance in ASG and if it's not marked unhealthy (or perhaps is not replaced fast) I don't want to continue with the experiment.

Please advise whether this behavior is intentional or accidental.

Control level is overridden

The control level used to determine the Python function to call is overridden and should be preserved.

Correct activity-level control behaviour where missing controls are simply warned of in the logging

Bail cleanly when environment key was not found

It appears the toolkit doesn't tell you when a key couldn't be found in the environment.
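
One way to bail cleanly, sketched here under assumptions, is to collect every missing key up front and raise a single descriptive error. `MissingEnvironmentKey` and `load_from_environment` are hypothetical names for this illustration and do not come from the library.

```python
import os


class MissingEnvironmentKey(Exception):
    """Hypothetical error raised when declared environment keys are absent."""


def load_from_environment(keys, environ=None):
    """Return the values for `keys`, or raise naming every key that is missing."""
    environ = os.environ if environ is None else environ
    missing = sorted(k for k in keys if k not in environ)
    if missing:
        raise MissingEnvironmentKey(
            "environment variables not found: " + ", ".join(missing))
    return {k: environ[k] for k in keys}
```

Reporting all missing keys at once spares the user from fixing them one failed run at a time.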

Do not fail on discovery of modules which don't export __all__

Right now the discovery mechanism expects modules to have an __all__ attribute. Do not fail when it is missing.
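
A tolerant lookup could look like the sketch below, which falls back to the module's public names when `__all__` is absent. The `exported_functions` helper is hypothetical, and the fallback rule (every name not starting with an underscore) is an assumption; note that it would also pick up functions the module merely imported from elsewhere.

```python
import inspect
import types


def exported_functions(module):
    """List function names exposed by `module`, tolerating a missing __all__."""
    names = getattr(module, "__all__", None)
    if names is None:
        # Assumed fallback: treat every public attribute as exported
        names = [n for n in dir(module) if not n.startswith("_")]
    return sorted(
        n for n in names if inspect.isfunction(getattr(module, n, None)))


# Build a throwaway module without __all__ to exercise the fallback
demo = types.ModuleType("demo")
demo.kill_pod = lambda name: None
demo._private_helper = lambda: None
demo.VERSION = "0.1.0"
```

With this module, `exported_functions(demo)` lists only `kill_pod`: the private helper and the non-function attribute are skipped.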

NameError: name 'ModuleNotFoundError' is not defined

We can't rely on ModuleNotFoundError, which was introduced in Python 3.6 and does not exist in 3.5.
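
Since ModuleNotFoundError is a subclass of ImportError, catching ImportError covers both interpreter versions. A minimal sketch, where `import_or_none` is a hypothetical helper name:

```python
import importlib


def import_or_none(module_name):
    """Import a module by name, returning None when it cannot be imported."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # On 3.6+ this also catches ModuleNotFoundError, its subclass;
        # on 3.5 a missing module raises plain ImportError
        return None
```

The trade-off is that other import failures, such as a broken dependency inside the module, are swallowed as well, so real code would probably want to log the exception.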

HTTP and process activities should not fail on unexpected response codes

ability to perform tolerance check on 'stdout' property of process probe, rather than 'status'

My hypothesis probe defines a curl request, which outputs its total time to stdout stream.
The probe defines a range tolerance of [0, 1] intended to check if the total time is within one second. Here's that probe for reference:

{
...
    "steady-state-hypothesis": {
        "title": "cURL www.google.com",
        "probes": [
            {
                "type": "probe",
                "name": "http google",
                "tolerance": [0,1],
                "provider": {
                    "type" : "process",
                    "path" : "curl",
                    "arguments": "-o /dev/null -w \"%{time_total}\" -s https://www.google.com"
                }
            }
        ]
    },
...
}

What appears to happen is that the tolerance range of [0, 1] is checked against the status value of the process probe rather than its stdout. This means that if the output is 10.234, the hypothesis is still met.

For example, I would expect this to succeed, as stdout is between 0 and 1:

[2019-03-19 13:04:23 DEBUG] [process:54] Running: /usr/bin/curl -o /dev/null -w "%{time_total}" -s https://www.google.com
[2019-03-19 13:04:23 DEBUG] [__init__:115] Data encoding detected as 'ascii' with a confidence of 1.0
[2019-03-19 13:04:23 DEBUG] [activity:179]   => succeeded with '{'stderr': '', 'stdout': '0.420', 'status': 0}'
[2019-03-19 13:04:23 DEBUG] [hypothesis:177] allowed tolerance is [0, 1]
[2019-03-19 13:04:23 INFO] [hypothesis:184] Steady state hypothesis is met!

and I would expect the following to fail as stdout is greater than 1, but it passes as status is 0.

[2019-03-19 13:04:23 DEBUG] [process:54] Running: /usr/bin/curl -o /dev/null -w "%{time_total}" -s https://www.google.com
[2019-03-19 13:04:27 DEBUG] [__init__:115] Data encoding detected as 'ascii' with a confidence of 1.0
[2019-03-19 13:04:27 DEBUG] [activity:179]   => succeeded with '{'stderr': '', 'stdout': '3.397', 'status': 0}'
[2019-03-19 13:04:27 DEBUG] [hypothesis:177] allowed tolerance is [0, 1]
[2019-03-19 13:04:27 INFO] [hypothesis:184] Steady state hypothesis is met!

Is there some way to instruct the process probe tolerance which value it needs to check, rather than just using 'status'? Looking at using the HTTP probe type there is no response time property either.
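
To illustrate what is being asked for, here is a sketch of a range check that can be pointed at a chosen field of the process result. The `within_range` function and its `target` parameter are assumptions made for this example, not the toolkit's API; the two result dictionaries are copied from the logs above.

```python
def within_range(result, tolerance, target="status"):
    """Check that result[target], read as a number, lies within [lower, upper]."""
    lower, upper = tolerance
    try:
        value = float(result[target])
    except (KeyError, TypeError, ValueError):
        # A missing or non-numeric field cannot satisfy a numeric range
        return False
    return lower <= value <= upper


fast = {"stderr": "", "stdout": "0.420", "status": 0}
slow = {"stderr": "", "stdout": "3.397", "status": 0}

fast_ok = within_range(fast, [0, 1], target="stdout")
slow_ok = within_range(slow, [0, 1], target="stdout")
slow_status_ok = within_range(slow, [0, 1])
```

Checked against stdout, the fast run passes and the slow run fails, while checking the slow run against the default status field still passes, which is the behaviour reported in this issue.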

Add discover feature

The current interface of chaostoolkit supports running experiments. However, as part of the goal for chaostoolkit, we have always wanted to make it simpler to get into chaos engineering.

The new discover command aims to collect information about a specific target and offer suggestions about potential chaos engineering experiments.

The goal of discover is to let you look up what an extension is capable of doing, along with a summary of the platform/application this extension targets (if available) and a list of chaos experiment suggestions.

As not all extensions are Python packages, discover should eventually be able to load a spec file which describes an extension made of process calls or HTTP calls. This may be done in a second iteration of the command.

Add and document some strategies around time-constrained experiments

Original content from docs:

Adding Time Constraints to an Experiment

It is a common requirement to execute a chaos experiment for a certain period of
time, ending the experiment if it goes on indefinitely.

We've been very careful to rely on other tools for these types of concerns, and
so timing constraints are not a built-in feature of the Chaos Toolkit's
experiments.

Expand Tolerance options

According to the documentation, I can only get responses from an HTTP method as a string.
But a tolerance of [int, int] only allows an int, so it fails on the steady state validation.

Example:

{
    "type": "probe",
    "name": "CCCCC",
    "tolerance": [1, 10000],
    "provider": {
      "type": "http",
      "url": "http://localhost:3000/metrics/query",
      "method": "POST",
      "arguments": {
        "query": "any service returning a string",
        "datasource": 161
      }
    }
}

Support non Python based extension providers

This issue's goal is for the community to discuss interest and solutions to support extension providers implemented in languages other than Python.

As a reminder, currently, the toolkit supports three extension providers:

- http: whereby you declare a URL to call and the toolkit does it for you
- process: where you provide the path to a binary which is executed by the toolkit
- python: where you define a Python function that is imported from a module extension

While Python is considered a good choice for the core and most extensions, we have always cared about more than a single community. @dastergon asked about this on the community Slack and suggested I kick things off with a high-level view of what would need to be done.

Generally speaking, it seems the simplest/easiest integration for calling native code from Python is to export a native library that exports its symbols (much like a C library). When doing that, Python has facilities to call them for you with ctypes.

This is what people seem to generally do:

Alternatives to ctypes are CFFI and cython. The latter is quite interesting because you provide a C-like wrapper on your native extensions and the generated Python code makes it look fairly native. It is popular but requires more work.

There could be two paths:

The core of the toolkit makes it loud and clear it officially supports ctypes/CFFI and you declare it like this:

{
   "type": "probe",
   "name": "my-go-blah",
   "provider": {
        "type": "go",
        "lib_name": "my-go-lib.so",
        "func": "func_name_in_lib",
        "arguments": { ... }
   }
}

This mirrors what is done for Python, except that here the toolkit would simply expect a native library.

An extension author wraps entirely the native code inside a Python extension using cython. In that case, the "python" provider is enough and would work as it already does.

I think both are valuable but I wonder what communities would prefer.

Present warning on the command line output when a HTTP or process activity fails

For process calls, anything other than a 0 return code should result in a warning. For HTTP, a status code greater than 399 should trigger a warning message.
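
The two rules above can be sketched as a small helper that returns a warning message, or None when nothing needs to be reported. `failure_warning` is a hypothetical name, and the assumption that both result payloads carry a `status` key is taken from the process output shown elsewhere in these issues.

```python
def failure_warning(provider_type, result):
    """Return a warning message for a failed process or HTTP call, else None."""
    status = result.get("status")
    if provider_type == "process" and status != 0:
        return "process exited with code {}".format(status)
    if provider_type == "http" and status is not None and status > 399:
        return "HTTP call returned status {}".format(status)
    return None
```

Returning a message instead of raising keeps the activity from failing outright, which matches the idea of warning on the command line rather than aborting.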

Consider using schema validation

While the current validation does an okay job, it can't handle some important cases in the data that are being passed on.

It might be useful to rely on schema validation https://github.com/keleshev/schema

name is not declared in control/python.py

In the validation function, the name variable is undeclared.

Support for hooks/events

It would be nice to be able to perform actions before and after certain points in an experiment.
Some ideas/examples:

Before running an experiment, announce to a Slack channel that we are about to run a chaos experiment.
After finishing the experiment, announce that it has finished.
Before and after each probe, log the results to some log server in a customised format.
If we had to run a rollback, send an email to the service owner so they can check that everything looks fine.

Chaos Toolkit model link to doc is broken

In the readme, the Chaos Toolkit model link (http://chaostoolkit.org/overview/concepts/) is broken (404)

By the way, your 404 page contains weird content: "Cloud bread lo-fi woke echo park cronut plaid banjo hammock fingerstache ennui gentrify fashion axe poke. ... " Is this intended?

Fail more gracefully when process doesn't return utf-8

When a process returns non-utf-8 data, the activity fails quite poorly. Try to be smarter here.
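
A more forgiving decode could try a list of candidate encodings and only then fall back to replacing undecodable bytes. This is a sketch under assumptions: `decode_output` is a hypothetical helper and the choice of candidate encodings is left to the caller.

```python
def decode_output(raw, encodings=("utf-8",)):
    """Decode process output, trying each encoding before degrading gracefully."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep what is readable and mark the rest with U+FFFD
    return raw.decode("utf-8", errors="replace")
```

The activity still gets a usable string for logging and tolerance checks, at the cost of some characters being replaced when no candidate encoding fits.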

Log HTTP notifications

HTTP-based notifications aren't logged into the chaostoolkit.log (unless there is an error), so it's hard to know if they worked.

Allow setting headers when loading experiments over HTTP

Currently loading experiments over HTTP forces the Accept header to static values of:

"application/json, application/x-yaml"

In some cases, this should be amended by the operator.
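
One possible shape for this, shown as a sketch, is to start from the static default and let operator-supplied headers override or extend it. `build_headers` and the `extra` parameter are hypothetical names for this illustration.

```python
DEFAULT_HEADERS = {"Accept": "application/json, application/x-yaml"}


def build_headers(extra=None):
    """Merge operator-supplied headers over the defaults."""
    headers = dict(DEFAULT_HEADERS)
    # Operator values win when the key matches exactly (case-sensitive here)
    headers.update(extra or {})
    return headers


headers = build_headers({"Accept": "application/json", "Authorization": "Bearer XYZ"})
```

Copying the defaults before updating keeps `DEFAULT_HEADERS` untouched between calls.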

Improve error handling when using discovery and a conflict occurs with an existing extension

I got the following unfriendly output when there was a collision with an existing integration extension:

chaos discover chaostoolkit-kubernetes
[2018-01-30 15:35:15 INFO] Attempting to download and install package 'chaostoolkit-kubernetes'
[2018-01-30 15:35:19 INFO] Package downloaded and installed in current environment
Traceback (most recent call last):
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/chaoslib/discovery/package.py", line 85, in get_importname_from_package
    name = dist.get_metadata('top_level.txt').split("\n)", 1)[0]
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1493, in get_metadata
    value = self._get(self._fn(self.egg_info, name))
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1605, in _get
    with open(path, 'rb') as stream:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/chaostoolkit_kubernetes-0.8.0.dist-info/top_level.txt'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/russellmiles/.venvs/chaostk/bin/chaos", line 11, in <module>
    sys.exit(cli())
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/chaosiq/cli.py", line 140, in discover
    download_and_install=not no_install)
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/chaoslib/discovery/discover.py", line 30, in discover
    package = load_package(package_name)
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/chaoslib/discovery/package.py", line 45, in load_package
    name = get_importname_from_package(package_name)
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/chaoslib/discovery/package.py", line 89, in get_importname_from_package
    "Was the package installed properly?".format(p=package_name))
chaoslib.exceptions.DiscoveryFailed: failed to load package 'chaostoolkit-kubernetes' metadata. Was the package installed properly?

chaos returns 0 exit code for a failed experiment

I was trying to script chaos to run continuously if the experiment was successful.

The script was simple:

chaos run ./experiments/consensus-recovery.json
while [ $? -eq 0 ]; do
    chaos run ./experiments/consensus-recovery.json
done

Eventually, the experiment stopped being successful but continued to run.
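
For a loop like this to stop, the command has to map the outcome of the run to a non-zero exit code. The sketch below shows one such mapping; the `status` and `deviated` journal keys and the `exit_code_for` helper are assumptions made for illustration, not the CLI's confirmed behaviour.

```python
def exit_code_for(journal):
    """Map a run journal to a process exit code: 0 only for a clean run."""
    completed = journal.get("status") == "completed"
    deviated = journal.get("deviated", False)
    return 0 if completed and not deviated else 1
```

A caller would then finish with `sys.exit(exit_code_for(journal))`, so that a shell test such as `[ $? -eq 0 ]` reflects the experiment's outcome.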

Add settings support for the chaostoolkit

In order to support #33, it will be necessary to store settings for the toolkit.

Migrate FailedActivity exception to ActivityFailed

Ensure steady state hypothesis is met after Rollback

Hi,

I've been testing chaostoolkit and stumbled upon the scenario below:

During a successful experiment run, the rollback was unsuccessful, changing the system and basically bringing the app down, yet the experiment was successful:

Rollback configuration in experiment:

app-must-be-healthy is a probe ref of steady-state-hypothesis


    "rollbacks": [
        {
            "type": "action",
            "name": "restart-app",
            "provider": {
                "type": "process",
                ....
                ....
            },
            "pauses": {
                "after": 5
            }
        },
        {
            "ref": "app-must-be-healthy"
        }
    ]

Experiment logs:

chaostoolkit_1  | [2019-04-04 13:48:13 INFO] Steady state hypothesis is met!
chaostoolkit_1  | [2019-04-04 13:48:13 INFO] Let's rollback...
chaostoolkit_1  | [2019-04-04 13:48:13 INFO] Rollback: restart-app
chaostoolkit_1  | [2019-04-04 13:48:13 INFO] Action: restart-app
chaostoolkit_1  | [2019-04-04 13:48:13 INFO] Pausing after activity for 5s...
chaostoolkit_1  | [2019-04-04 13:48:18 INFO] Rollback: None
chaostoolkit_1  | [2019-04-04 13:48:18 INFO] Probe: app-must-be-healthy
chaostoolkit_1  | [2019-04-04 13:48:18 ERROR]   => failed: failed to connect to http://nginx:80/health: HTTPConnectionPool(host='nginx', port=80): Max retries exceeded with url: /health (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f91218d54e0>: Failed to establish a new connection: [Errno -2] Name does not resolve',))
chaostoolkit_1  | [2019-04-04 13:48:18 INFO] Experiment ended with status: completed

Would it make sense to re-evaluate steady-state-hypothesis and experiment result after rollback?

P.S. I hope I opened this issue correctly here and not in https://github.com/chaostoolkit :)