chaostoolkit / chaostoolkit-lib

The Chaos Toolkit core library

Home Page: https://chaostoolkit.org/

License: Apache License 2.0

Python 100.00%
chaos-engineering chaostoolkit chaostoolkit-core reliability-engineering

chaostoolkit-lib's Introduction


Chaos Toolkit - Chaos Engineering for Everyone


Community • ChangeLog


Purpose

The purpose of this library is to provide the core of the Chaos Toolkit model and functions it needs to render its services.

Features

The library provides the following features:

  • discover capabilities from extensions: allows you to explore what an extension supports, which helps you initialize an experiment against the system this extension targets

  • validate a given experiment syntax: the validation looks at various keys in the experiment and raises errors whenever something doesn't look right. As a nice addition, when a probe calls a Python function with arguments, it tries to validate that the given argument list matches the signature of the function to apply.

  • run your steady state before and after the method: the former acts as a gate to decide if the experiment can be executed, the latter shows whether the system deviated from normal.

  • run probes and actions declared in an experiment: it runs the steps in an experiment method sequentially, applying first steady probes, then actions and finally close probes.

    A journal, as a JSON payload, is returned from the experiment run.

    The library supports running probes and actions defined as Python functions, from importable Python modules, processes and HTTP calls.

  • run experiment's rollbacks when provided

  • Load secrets from the experiments, the environ or vault

  • Provides event notification from Chaos Toolkit flow (although the actual events are published by the CLI itself, not from this library), supported events are:

    • on experiment validation: started, failed or completed
    • on discovery: started, failed or completed
    • on initialization of experiments: started, failed or completed
    • on experiment runs: started, failed or completed

    For each event, the corresponding payload is part of the event, along with a UTC timestamp.
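The argument validation mentioned in the features above can be illustrated with the standard library alone. The following is a minimal sketch, not the library's actual implementation: `check_arguments` and `count_pods` are hypothetical names, and the approach (binding the arguments to the function signature without calling it) is only one way to perform such a check.

```python
import inspect


def check_arguments(func, arguments):
    """Return a list of problems found when calling func(**arguments)."""
    signature = inspect.signature(func)
    try:
        # bind() applies the same rules as a real call, without running func
        signature.bind(**arguments)
    except TypeError as exc:
        return [f"{func.__name__}: {exc}"]
    return []


# Hypothetical probe function standing in for one exposed by an extension
def count_pods(label_selector, ns="default"):
    return 0


# Matches the signature, so no problem is reported
print(check_arguments(count_pods, {"label_selector": "app=web"}))
# Uses an unknown argument name, so one problem is reported
print(check_arguments(count_pods, {"selector": "app=web"}))
```

Because nothing is executed, a check of this kind can run at validation time, before any probe touches the target system.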

Install

If you are a user of the Chaos Toolkit, you probably do not need to install this package yourself as it comes along with the chaostoolkit CLI.

However, should you wish to integrate this library in your own Python code, please install it as usual:

$ pip install -U chaostoolkit-lib

Specific dependencies

In addition to essential dependencies, the package can install a couple of other extra dependencies for specific use-cases. They are not mandatory and the library will warn you if you try to use a feature that requires them.

Vault

If you need Vault support to read secrets, run the following command:

$ pip install -U chaostoolkit-lib[vault]

To authenticate with Vault, you can either:

  • Use a token through the vault_token configuration key
  • Use an AppRole via the vault_role_id, vault_secret_id pair of configuration keys
  • Use a service account configured with an appropriate role via the vault_sa_role configuration key. The vault_sa_token_path, vault_k8s_mount_point, and vault_secrets_mount_point configuration keys can optionally be specified to point to a location containing a service account token, a different Kubernetes authentication method mount point, or a different secrets mount point, respectively.
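To make the three options easier to compare, here is a small sketch that inspects a configuration mapping and names the authentication method it implies. The configuration key names come from the list above; the function `pick_vault_auth`, the returned labels, and the order of precedence between the options are assumptions made for illustration and may not match what the library does.

```python
def pick_vault_auth(configuration):
    """Name the Vault authentication method implied by the configuration keys."""
    # Assumed precedence: token, then AppRole, then service account
    if "vault_token" in configuration:
        return "token"
    if "vault_role_id" in configuration and "vault_secret_id" in configuration:
        return "approle"
    if "vault_sa_role" in configuration:
        return "service-account"
    return None


# AppRole needs both keys of the pair to be present
print(pick_vault_auth({"vault_role_id": "my-role", "vault_secret_id": "my-secret"}))
# A role id on its own is not enough to select any method
print(pick_vault_auth({"vault_role_id": "my-role"}))
```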

JSON Path

If you need JSON Path support for tolerance probes in the hypothesis, also run the following command:

$ pip install -U chaostoolkit-lib[jsonpath]

Contribute

Contributors to this project are welcome as this is an open-source effort that seeks discussions and continuous improvement.

From a code perspective, if you wish to contribute, you will need to run a Python 3.6+ environment. Please, fork this project, write unit tests to cover the proposed changes, implement the changes, ensure they meet the formatting standards set out by black, flake8, and isort, add an entry into CHANGELOG.md, and then raise a PR to the repository for review.

Please refer to the formatting section for more information on the formatting standards.

The Chaos Toolkit projects require that all contributors sign a Developer Certificate of Origin on each commit they would like to merge into the master branch of the repository. Please make sure you can abide by the rules of the DCO before submitting a PR.

Develop

If you wish to develop on this project, make sure to install the development dependencies. To do so, first install pdm.

$ pdm install --dev

Now, you can edit the files and they will automatically be seen by your environment, even when running from the chaos command locally.

Test

To run the tests for the project execute the following:

$ pdm run test

Formatting and Linting

We use ruff to perform linting and code style.

Before raising a Pull Request, we recommend you run formatting against your code with:

$ pdm run format

This will automatically format any code that doesn't adhere to the formatting standards.

As some things are not picked up by the formatting, we also recommend you run:

$ pdm run lint

This ensures that issues such as unused import statements or overly long strings are also picked up.

chaostoolkit-lib's People

Contributors

alexshemeshwix, arpiagar, charliemoon37, ciaranevans, claymccoy, devatoria, dmartin35, idanto, joshuaroot, lawouach, mattiascockburn, mirimi, ojongerius, roeik-wix, russmiles, snej-, tam-lin, twuyts, wixoleo, ykskb


chaostoolkit-lib's Issues

Reading vault secrets is not working as you'd expect

Reading vault secrets is not working as you'd expect.

Currently, the whole Vault payload of a secret is read into the chaostoolkit secret section (including the vault secret metadata). This is not what you'd expect. Also, it's not intuitive that the key argument refers to the path.
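As an illustration of the behaviour one would expect instead, the sketch below keeps only the secret values and drops the metadata. It assumes the payload has the shape returned by Vault's KV version 2 engine, where values sit under `data.data` next to `data.metadata`; the issue does not say which engine version is involved, and `extract_secret` is a hypothetical helper, not part of the library.

```python
def extract_secret(payload, key=None):
    """Return only the secret values from a KV v2 style read response."""
    # KV v2 nests the values under data.data, next to data.metadata
    values = payload["data"]["data"]
    return values if key is None else values[key]


# Hand-written payload mimicking a KV v2 read response
payload = {
    "data": {
        "data": {"username": "admin", "password": "s3cr3t"},
        "metadata": {"version": 3, "destroyed": False},
    }
}

print(extract_secret(payload))              # values only, no metadata
print(extract_secret(payload, "username"))  # a single value
```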

Fix needed for activity-level controls to work as expected

Currently there is a bug when a controls block is applied at the activity level, i.e. not at the top level, and there are no top-level controls applied either.

A fix such as the following needs to be applied to the chaos lib from line 201 onwards:

    for c in controls.copy():
        if "ref" in c:
            for top_level_control in top_level_controls:
                if c["ref"] == top_level_control["name"]:
                    controls.append(deepcopy(top_level_control))
                    break
        else:
            tc = None
            for tc in top_level_controls:
                if c.get("name") == tc.get("name"):
                    break
            else:
                if tc and tc.get("automatic", True):
                    controls.append(deepcopy(tc))

Provide a way to mark an action as "dangerous"

While Chaos Engineering may degrade the system, we should be careful not to cause too much harm. A mechanism to mark an action as "dangerous" could help with those use cases.

From the CLI side, this could translate into asking the users before running an experiment?

Catch expired vault secret id

When the vault secret id of the app role has expired, this blows up the whole process. Catch and fail gracefully.

Can we pass results from one activity to the rest of the experiment?

The toolkit does its best to not have a global state and, so far, there has never really been a need to take the output of an activity and feed it into another activity. But there are cases when this is useful (when an operation returns an ID, for instance).

Let's see how we can add this.
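One possible direction, sketched here purely for discussion: keep the output of each activity under its name and let later activities refer to it in their arguments. The `$output:` marker and the `resolve_arguments` helper are invented for this example and are not an existing toolkit feature.

```python
def resolve_arguments(arguments, outputs):
    """Replace '$output:<activity name>' values with recorded activity outputs."""
    marker = "$output:"
    resolved = {}
    for name, value in arguments.items():
        if isinstance(value, str) and value.startswith(marker):
            # Look up the output recorded for the referenced activity
            resolved[name] = outputs[value[len(marker):]]
        else:
            resolved[name] = value
    return resolved


# Output recorded after a hypothetical "create-instance" action has run
outputs = {"create-instance": "i-0abc"}

arguments = {"instance_id": "$output:create-instance", "force": True}
print(resolve_arguments(arguments, outputs))
```

Keeping the recorded outputs in a mapping that is passed around explicitly, rather than in a module-level variable, would stay close to the goal of avoiding global state.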

Add support for saving settings

At the moment there is a load_settings function but not one to then save settings back if settings have been changed in some way.

control can't handle ref activity

When a before_activity control is executed, if the activity references another activity, it is not looked up beforehand so the control has no real context.

Pass the experiment to all controls

Currently, only the current context (steady-state, method, activity...) is passed down to a control function. We should also pass the experiment, as it contains a larger context that may be useful as well.

Ideally this should go into 1.0.0rc2

when using probe in `method` tolerance is not validated

Hi,

I just noticed that it is allowed to use probes inside method as opposed to steady-state-hypothesis but at the same time tolerance is not validated when doing so.

I find it useful to use probes inside method, first example use case I have is when I don't want to run probe both before and after some actions. Instead the probe can be used to validate some conditions in the middle of experiment. For example, I stop a random instance in ASG and if it's not marked unhealthy (or perhaps is not replaced fast) I don't want to continue with the experiment.

Please advise whether this behavior is intentional or accidental.

ability to perform tolerance check on 'stdout' property of process probe, rather than 'status'

My hypothesis probe defines a curl request, which outputs its total time to stdout stream.
The probe defines a range tolerance of [0, 1] intended to check if the total time is within one second. Here's that probe for reference:

{
...
    "steady-state-hypothesis": {
        "title": "cURL www.google.com",
        "probes": [
            {
                "type": "probe",
                "name": "http google",
                "tolerance": [0,1],
                "provider": {
                    "type" : "process",
                    "path" : "curl",
                    "arguments": "-o /dev/null -w \"%{time_total}\" -s https://www.google.com"
                }
            }
        ]
    },
...
}

What appears to happen is that the tolerance range of [0, 1] checks the status value of the process probe rather than the stdout. This means that if the output is 10.234, the hypothesis is still met.

For example, I would expect this to succeed, as stdout is between 0 and 1

[2019-03-19 13:04:23 DEBUG] [process:54] Running: /usr/bin/curl -o /dev/null -w "%{time_total}" -s https://www.google.com
[2019-03-19 13:04:23 DEBUG] [__init__:115] Data encoding detected as 'ascii' with a confidence of 1.0
[2019-03-19 13:04:23 DEBUG] [activity:179]   => succeeded with '{'stderr': '', 'stdout': '0.420', 'status': 0}'
[2019-03-19 13:04:23 DEBUG] [hypothesis:177] allowed tolerance is [0, 1]
[2019-03-19 13:04:23 INFO] [hypothesis:184] Steady state hypothesis is met!

and I would expect the following to fail as stdout is greater than 1, but it passes as status is 0.

[2019-03-19 13:04:23 DEBUG] [process:54] Running: /usr/bin/curl -o /dev/null -w "%{time_total}" -s https://www.google.com
[2019-03-19 13:04:27 DEBUG] [__init__:115] Data encoding detected as 'ascii' with a confidence of 1.0
[2019-03-19 13:04:27 DEBUG] [activity:179]   => succeeded with '{'stderr': '', 'stdout': '3.397', 'status': 0}'
[2019-03-19 13:04:27 DEBUG] [hypothesis:177] allowed tolerance is [0, 1]
[2019-03-19 13:04:27 INFO] [hypothesis:184] Steady state hypothesis is met!

Is there some way to instruct the process probe tolerance which value it needs to check, rather than just using 'status'? Looking at using the HTTP probe type there is no response time property either.
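The behaviour being asked for can be sketched as a range check that takes the name of the field to inspect. This is an illustration only: `within_range` and its `target` parameter are hypothetical and do not describe how the library evaluates tolerances. The two result dictionaries are copied from the log lines above.

```python
def within_range(result, tolerance, target="status"):
    """Check that result[target], read as a number, lies within [lower, upper]."""
    lower, upper = tolerance
    try:
        # stdout is a string such as '0.420', so convert before comparing
        value = float(result[target])
    except (KeyError, TypeError, ValueError):
        return False
    return lower <= value <= upper


fast = {"stderr": "", "stdout": "0.420", "status": 0}
slow = {"stderr": "", "stdout": "3.397", "status": 0}

print(within_range(slow, [0, 1]))                   # checks status: True
print(within_range(fast, [0, 1], target="stdout"))  # checks stdout: True
print(within_range(slow, [0, 1], target="stdout"))  # checks stdout: False
```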

Add discover feature

The current interface of chaostoolkit supports the run of experiments. However, as part of the goal for chaostoolkit, we always wanted to make it simpler to get into chaos engineering.

The new discover command aims to collect information about a specific target and offer suggestions about potential chaos engineering experiments.

discover has the goal of letting you look up what an extension is capable of doing, as well as a summary of the platform/application this extension targets (if available) and a list of chaos experiment suggestions.

As not all extensions are Python packages, discover should eventually be able to load a spec file which describes an extension made of process calls or HTTP calls. This may be done in a second iteration of the command.

Add and document some strategies around time-constrained experiments

Original content from docs:

# Adding Time Constraints to an Experiment

It is a common requirement to execute a chaos experiment for a certain period of
time, ending the experiment if it goes on indefinitely.

We've been very careful to rely on other tools for these types of concerns, and
so timing constraints are not a built-in feature of the Chaos Toolkit's
experiments.
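As one example of relying on an outside tool, a wrapper script can run the experiment as a child process and stop it after a deadline. The sketch below uses Python's `subprocess.run` with its `timeout` parameter; `run_with_deadline` is a hypothetical helper, and a short sleeping command stands in for the experiment so the example runs anywhere.

```python
import subprocess
import sys


def run_with_deadline(command, seconds):
    """Run command; return its exit code, or None if the deadline was hit."""
    try:
        completed = subprocess.run(command, timeout=seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising TimeoutExpired
        return None
    return completed.returncode


# Stand-in for a long experiment; in practice the command might be
# ["chaos", "run", "experiment.json"]
stand_in = [sys.executable, "-c", "import time; time.sleep(5)"]
print(run_with_deadline(stand_in, 0.5))
```

Note that stopping the process this way gives the experiment no chance to run its rollbacks, which is worth weighing when choosing a strategy.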

Expand Tolerance options

According to the documentation, I can only get responses from an HTTP method as a string.
But a tolerance using [int, int] only allows int and fails the steady state validation.

Example:

{
    "type": "probe",
    "name": "CCCCC",
    "tolerance": [1, 10000],
    "provider": {
      "type": "http",
      "url": "http://localhost:3000/metrics/query",
      "method": "POST",
      "arguments": {
        "query": "any service returning a string",
        "datasource": 161
      }
    }
}

Support non Python based extension providers

This issue's goal is for the community to discuss interest and solutions to support extension providers implemented in languages other than Python.

As a reminder, currently, the toolkit supports three extension providers:

  • http: whereby you declare a URL to call and the toolkit does it for you
  • process: where you provide the path to a binary which is executed by the toolkit
  • python: where you define a Python function that is imported from a module extension

While Python is considered a good choice for the core and most extensions, we have always cared about more than a single community. @dastergon asked about this on the community Slack and suggested I kick things off with a high-level view of what would need to be done.

Generally speaking, it seems the simplest integration for calling native code from Python is to build a native library that exports its symbols (much like a C library). When doing that, Python has facilities to call them for you with ctypes.

This is what people seem to generally do:

Alternatives to ctypes are CFFI and cython. The latter is quite interesting because you provide a C-like wrapper on your native extensions and the generated Python code makes it look fairly native. It is popular but requires more work.
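For readers unfamiliar with ctypes, here is a minimal sketch of calling an exported symbol from a shared library. It uses the C standard library's `abs` as a stand-in for a function exported by a Go or Rust library, and assumes a POSIX system where the C library can be loaded this way.

```python
import ctypes
import ctypes.util

# Locate the C standard library; on POSIX, CDLL(None) falls back to the
# symbols already loaded in the current process.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: int abs(int)
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int

print(libc.abs(-42))  # prints 42
```

A provider built on this idea would need the same three pieces of information: the library to load, the symbol to call, and the argument and return types.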

There could be two paths:

  1. The core of the toolkit makes it loud and clear it officially supports ctypes/CFFI and you declare it like this:
{
   "type": "probe",
   "name": "my-go-blah",
   "provider": {
        "type": "go",
        "lib_name": "my-go-lib.so",
        "func": "func_name_in_lib",
        "arguments": { ... }
   }
}

This mirrors what is done for Python, except that here the toolkit would simply expect a native library.

  2. An extension author wraps the native code entirely inside a Python extension using cython. In that case, the "python" provider is enough and would work as it already does.

I think both are valuable but I wonder what communities would prefer.

Support for hooks/events

It would be nice to be able to perform actions before and after certain points in an experiment.
Some ideas/examples:

  • Before running an experiment, announce to a Slack channel that we are about to run a chaos experiment.
  • After finishing the experiment, announce that it has finished.
  • Before and after each probe, log the results to some log server in a customised format.
  • If we had to run a rollback, send an email to the service owner so they can check that everything looks fine.
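The ideas above amount to a small publish/subscribe mechanism. The sketch below shows one way such hooks could be registered and triggered; the event names, the `on` decorator and the `emit` function are all invented for this illustration and are not an existing toolkit API.

```python
from collections import defaultdict

# Maps an event name to the functions registered for it
_hooks = defaultdict(list)


def on(event):
    """Decorator registering a function to be called when event is emitted."""
    def register(func):
        _hooks[event].append(func)
        return func
    return register


def emit(event, **payload):
    """Call every hook registered for event and collect what they return."""
    return [hook(**payload) for hook in _hooks[event]]


@on("before_experiment")
def announce_start(title):
    # A real hook might post this message to a chat channel instead
    return f"About to run chaos experiment: {title}"


print(emit("before_experiment", title="consensus recovery"))
```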

Log HTTP notifications

HTTP-based notifications aren't logged into the chaostoolkit.log (unless there is an error) so it's hard to know if they worked.

Improve error handling when using discovery and a conflict occurs with an existing extension

I got the following unfriendly output when there was a collision with an existing integration extension:

chaos discover chaostoolkit-kubernetes
[2018-01-30 15:35:15 INFO] Attempting to download and install package 'chaostoolkit-kubernetes'
[2018-01-30 15:35:19 INFO] Package downloaded and installed in current environment
Traceback (most recent call last):
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/chaoslib/discovery/package.py", line 85, in get_importname_from_package
    name = dist.get_metadata('top_level.txt').split("\n)", 1)[0]
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1493, in get_metadata
    value = self._get(self._fn(self.egg_info, name))
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1605, in _get
    with open(path, 'rb') as stream:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/chaostoolkit_kubernetes-0.8.0.dist-info/top_level.txt'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/russellmiles/.venvs/chaostk/bin/chaos", line 11, in <module>
    sys.exit(cli())
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/chaosiq/cli.py", line 140, in discover
    download_and_install=not no_install)
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/chaoslib/discovery/discover.py", line 30, in discover
    package = load_package(package_name)
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/chaoslib/discovery/package.py", line 45, in load_package
    name = get_importname_from_package(package_name)
  File "/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/chaoslib/discovery/package.py", line 89, in get_importname_from_package
    "Was the package installed properly?".format(p=package_name))
chaoslib.exceptions.DiscoveryFailed: failed to load package 'chaostoolkit-kubernetes' metadata. Was the package installed properly?

chaos returns 0 exit code for a failed experiment

I was trying to script chaos to run continuously if the experiment was successful.

The script was simple:

chaos run ./experiments/consensus-recovery.json
while [ $? -eq 0 ]; do
    chaos run ./experiments/consensus-recovery.json
done

Eventually, the experiment stopped being successful but continued to run.

Ensure steady state hypothesis is met after Rollback

Hi,

I've been testing chaostoolkit and stumbled upon the scenario below:

During a successful experiment run, the rollback was unsuccessful, changing the system and basically bringing the app down, yet the experiment was reported as successful:

Rollback configuration in experiment:

app-must-be-healthy is a probe ref of steady-state-hypothesis


    "rollbacks": [
        {
            "type": "action",
            "name": "restart-app",
            "provider": {
                "type": "process",
                ....
                ....
            },
            "pauses": {
                "after": 5
            }
        },
        {
            "ref": "app-must-be-healthy"
        }
    ]


Experiment logs:

chaostoolkit_1  | [2019-04-04 13:48:13 INFO] Steady state hypothesis is met!
chaostoolkit_1  | [2019-04-04 13:48:13 INFO] Let's rollback...
chaostoolkit_1  | [2019-04-04 13:48:13 INFO] Rollback: restart-app
chaostoolkit_1  | [2019-04-04 13:48:13 INFO] Action: restart-app
chaostoolkit_1  | [2019-04-04 13:48:13 INFO] Pausing after activity for 5s...
chaostoolkit_1  | [2019-04-04 13:48:18 INFO] Rollback: None
chaostoolkit_1  | [2019-04-04 13:48:18 INFO] Probe: app-must-be-healthy
chaostoolkit_1  | [2019-04-04 13:48:18 ERROR]   => failed: failed to connect to http://nginx:80/health: HTTPConnectionPool(host='nginx', port=80): Max retries exceeded with url: /health (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f91218d54e0>: Failed to establish a new connection: [Errno -2] Name does not resolve',))
chaostoolkit_1  | [2019-04-04 13:48:18 INFO] Experiment ended with status: completed

Would it make sense to re-evaluate steady-state-hypothesis and experiment result after rollback?

P.S. I hope I opened this issue correctly here and not in https://github.com/chaostoolkit :)
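The flow suggested in the question above can be sketched as follows. This is a simplified model for discussion, not the library's run loop: `run_experiment_sketch`, the journal keys and the stand-in callables are assumptions, and the only point illustrated is the extra steady-state check after the rollbacks.

```python
def run_experiment_sketch(steady_state, method, rollbacks):
    """Run one experiment and report whether the system ended up healthy."""
    journal = {"status": "completed", "deviated": False}
    if not steady_state():
        # Gate: do not start when the system is already unhealthy
        journal["status"] = "failed"
        return journal
    method()
    journal["deviated"] = not steady_state()
    rollbacks()
    # Extra step proposed in the issue: check again once rollbacks have run
    if not steady_state():
        journal["status"] = "failed"
    return journal


# Stand-in system whose rollback leaves the application down
app = {"healthy": True}
journal = run_experiment_sketch(
    steady_state=lambda: app["healthy"],
    method=lambda: None,
    rollbacks=lambda: app.update(healthy=False),
)
print(journal["status"])  # prints failed
```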
