traas-stack / chaosmeta

A chaos engineering platform for supporting the complete fault drill lifecycle.

Home Page: https://chaosmeta.gitbook.io/chaosmeta-cn

License: Apache License 2.0

Shell 0.92% Go 71.34% C 0.40% Java 0.24% Dockerfile 0.30% Makefile 0.99% TypeScript 22.41% Less 0.14% Smarty 3.08% JavaScript 0.01% HTML 0.18%
automated chaos chaos-engineering chaos-testing drill fault fault-injection golang kubernetes microservice

chaosmeta's Introduction

Chinese README

Official Document

Introduction

ChaosMeta is a cloud-native chaos engineering platform open sourced by Ant Group. It embodies the methodologies, technologies and products that Ant Group has accumulated over many years of large-scale, company-level red-team/blue-team attack-and-defense drills. Guided by the "Risk Catalog" (an internal handbook of general risk scenarios for technical components in various fields) and combined with technical practice, it has safeguarded Ant Group's major promotional events for many years.

ChaosMeta is a platform dedicated to supporting all stages of fault drills, covering platform capabilities for access detection, traffic injection, fault injection, fault measurement, fault recovery, and recovery measurement. Besides freeing up users' productivity, it pursues the future form of chaos engineering: one-click automated drills and, eventually, intelligent drills.

Core advantages

Simple and easy to use: provides a user interface and a low barrier to entry

Supports a visual user interface, the Kubernetes API, the command line, an HTTP API, and other access methods. (Figure: docs/static/componentlink.png)

Fully verified by a large amount of practical experience, high reliability

The Blue Army team of Ant Group has been deeply involved in chaos engineering for many years. Every year it holds large-scale, company-level red-team/blue-team attack-and-defense drills covering all of the company's businesses, and many businesses also run 24/7 drills and regular monthly drills.

Internal drill object types cover cloud products, Kubernetes, Operator applications, databases (OceanBase, Etcd, etc.), middleware (message queues, distributed scheduling, configuration centers, etc.), business applications (Java applications, C++ applications, Golang applications)

High flexibility, supporting a variety of user needs

Whether users want a complete chaos engineering platform, only the underlying platform capabilities such as remote injection, orchestration and scheduling, only the single-machine fault injection capability, or the ability to manage and inject faults into targets on or off the cloud, there is a corresponding deployment plan.

Rich fault injection capabilities, cloud native chaos engineering

Because Ant Group attaches great importance to attack-and-defense drills, drills are run at large scale and high frequency, which in turn has driven the construction of a wide range of fault injection capabilities. And because Ant's internal infrastructure is huge and finance has a low tolerance for faults, the stability requirements for infrastructure such as Kubernetes and middleware are very high. As a result, Ant's chaos engineering practice has accumulated rich fault injection capabilities and drill experience in the cloud-native field.

The platform has powerful capabilities, supports the complete "chaos engineering life cycle", and is oriented towards automation.

ChaosMeta covers platform capabilities for access detection, traffic injection, fault injection, fault measurement, fault recovery, recovery measurement, and other stages, which serve as the technical basis of "automated chaos engineering".

Beyond platform support for the drill process, the other major obstacle to automated drills is experiment design. At present it is difficult to rely entirely on machines to design experiments automatically. However, reusable experience can be systematically abstracted and compiled into a handbook, which can then be quickly reused when running chaos engineering drills on components of the same type. This is the original intention behind the Risk Catalog.

ChaosMeta aims to build a one-click "physical examination" drill capability on the technical foundation of the "chaos engineering life cycle" and the theoretical basis of the "Risk Catalog", directly generating a stability score for the target and freeing users from much of the manual work of chaos engineering.

Architecture overview

User layer (Client)

The Client layer consists mainly of the chaosmeta-platform component. Its main task is to lower the barrier to use by providing a visual interface for platform capabilities such as drill planning, orchestration, experiment configuration, experiment record details, and Agent management (Pods/Nodes of k8s clusters, cross-cluster objects, non-k8s physical machines/containers, etc.).

Engine layer (Engine)

The Engine layer includes the core platform capabilities of ChaosMeta and the implementation of some cloud-native fault capabilities, including the following components:

  • chaosmeta-CRD: ChaosMeta's platform capabilities are built on the Operator framework, so each type of capability has a corresponding CRD, and a corresponding Operator watches its status and performs the corresponding operations. For example, the CRD of the fault injection capability is experiments.inject.chaosmeta.io and the Operator that watches it is chaosmeta-inject-operator. Users can therefore create CR instances through kubectl or a Kubernetes client to invoke the corresponding capability;

  • chaosmeta-inject-operator: Watches the fault injection CR instances created by users. In its control loop it compares the actual status of each CR in the cluster with the expected status, executes the relevant fault injection logic and state transitions, and drives the actual state toward the desired state. Different operations are performed depending on the fault type defined by the CR instance. For example, a system resource fault requires remote injection through chaosmeta-daemonset, HTTP, or a command channel; a cloud-native fault is injected through the Kubernetes APIServer; and a fault that involves dynamic admission additionally requests chaosmeta-webhook to update its tampering and interception rules;

  • chaosmeta-webhook: Every request handled by the APIServer goes through authentication, authorization, and admission, and the admission stage passes through the Mutating Admission Webhook (tampering) and Validating Admission Webhook (verification) stages. chaosmeta-webhook updates its resource matching rules according to the fault definition and can intercept, tamper with, delay, or inject errors into users' Kubernetes resource creation requests. This is very useful for fault drill scenarios involving Operator applications and the robustness of the Kubernetes cluster itself.

  • chaosmeta-measure-operator: The component that performs measurement, used mainly in two phases: fault measurement and recovery measurement. Fault measurement gauges how effective the fault injection was, while recovery measurement gauges how effective the defense platform's recovery capability is. Measurement is the key capability for achieving automation and intelligence in chaos engineering.

For example, suppose a drill expects the number of successful requests for a certain service to drop by 50%, and expects the corresponding defense platform to detect this within 5 minutes and recover within 10 minutes, with the fault injected by saturating the CPU. The fault measurement phase must then find the point in time at which the number of successful requests has dropped by 50% compared with before the injection (the fault effective point). The recovery measurement phase must find the point at which the corresponding alarm is raised (the fault discovery point) and the point after it at which the number of successful requests returns to its pre-drill level (the fault recovery point). Finally, an analysis report of the drill is generated, identifying areas where the defense platform can improve.

  • chaosmeta-workflow-operator: Provides fault orchestration. In practice, besides single-fault scenarios, there is also demand for many complex fault scenarios that must be simulated by combining different fault injection capabilities in series and in parallel. Orchestration is not limited to fault injection: it can also include nodes of other capability types such as traffic injection, fault admission detection, fault measurement, and recovery measurement. This is also a key capability for automating drills.

  • chaosmeta-flow-operator: The component that performs traffic injection, used mainly to simulate traffic to the target services. A fault drill often needs traffic to be present for the fault to take effect. For example, to trigger a latency alarm for a service, injecting delay into the service's container network is not enough: if there are no requests, the corresponding monitoring alarm will not fire.
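
The drill example given under chaosmeta-measure-operator can be sketched in a few lines of Python. This is only an illustration of how the three points could be located in a time series, not how chaosmeta-measure-operator is implemented. The names `Sample`, `first_time` and `measure_drill` are hypothetical, the samples are assumed to be sorted by time, and counting both durations from the fault effective point is an assumption, since the text does not say which reference point the 5 and 10 minutes are measured from.

```python
from typing import Callable, Optional, Sequence, Tuple

# (timestamp in seconds, successful requests observed in that interval)
Sample = Tuple[float, float]


def first_time(samples: Sequence[Sample], start: float,
               predicate: Callable[[float], bool]) -> Optional[float]:
    """Return the first timestamp at or after `start` whose value satisfies `predicate`."""
    for ts, value in samples:
        if ts >= start and predicate(value):
            return ts
    return None


def measure_drill(samples: Sequence[Sample], baseline: float, inject_time: float,
                  alarm_time: Optional[float], drop_ratio: float = 0.5) -> dict:
    """Locate the fault effective, discovery and recovery points of one drill."""
    # Fault measurement: first sample after injection at or below the expected drop.
    effective = first_time(samples, inject_time,
                           lambda v: v <= baseline * (1 - drop_ratio))

    # Recovery measurement: the alarm time is taken as the discovery point; the
    # recovery point is the first later sample back at the pre-drill baseline.
    recovery = None
    if alarm_time is not None:
        recovery = first_time(samples, alarm_time, lambda v: v >= baseline)

    def elapsed(point: Optional[float]) -> Optional[float]:
        # Assumption: durations are counted from the fault effective point.
        if effective is None or point is None:
            return None
        return point - effective

    return {
        "fault_effective_point": effective,
        "fault_discovery_point": alarm_time,
        "fault_recovery_point": recovery,
        "detect_seconds": elapsed(alarm_time),
        "recover_seconds": elapsed(recovery),
    }
```

A report step could then compare `detect_seconds` with 300 and `recover_seconds` with 600 to check the 5-minute and 10-minute expectations from the example.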

Kernel layer (Kernel)

The Kernel layer implements the single-machine fault injection capability. It consists mainly of the chaosmetad component, which can run either as a resident HTTP service or through the command line, and which is also packaged as a DaemonSet component (chaosmeta-daemonset). These can be combined flexibly to suit the needs of different drill platforms.

Capabilities of the current version

The current version has released: user interface, fault injection scheduling engine, measurement engine, traffic injection engine, single machine fault injection tool and other components

User Interface

  • Provides experiment orchestration capabilities and lowers the threshold for use (the current version of the UI does not yet support traffic injection type and measurement type nodes);
  • Provides the ability to inject and filter remote targets of Pod/Node in the cluster (the UI will support targets outside the cluster in the future);
  • Provides space management capabilities and can separate and manage data on demand;
  • Provides an account and permission management system.

Fault injection capability

  • System Resources Exception: CPU, memory, network, disk, process, file, etc.;
  • Kernel Resource Exception: fd, nproc, etc.;
  • JVM Dynamic Injection: function call delay, function return value tampering, function throwing exception, etc.;
  • Container Fault Injection: kill container, suspend container, CPU, memory, network, disk, process, file, JVM injection and other experimental scenarios in the container;
  • Kubernetes Injection: execute experimental scenarios such as CPU, memory, network, disk, process, file, JVM injection on any pod;
  • Cloud-Native Faults: abnormalities in cluster resources, such as the accumulation of a large number of Pending Pods or Completed Jobs, and abnormalities in instances of cloud-native resources such as Deployment, Node, and Pod, for example tampering with the replica count of a Deployment instance or injecting Finalizers into a Pod instance.

Measuring Capabilities

  • monitor: Make expected judgments on the values of monitoring items, such as whether the CPU usage monitoring value of a certain machine is greater than 90%. Prometheus is supported by default.
  • pod: Make expected judgments on pod-related data, such as whether the number of pod instances of an application is greater than 3
  • http: Make expected judgments on http requests. For example, when making a specified http request, whether the return status code is 200
  • tcp: Make expected judgments on tcp requests, such as testing whether the 8080 port of a certain server is connectable
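
As a rough illustration of what an "expected judgment" means, the sketch below compares a measured value with an expected one (as in the monitor and pod examples) and tests whether a TCP port accepts connections (as in the tcp example). It is not ChaosMeta's implementation; `judge` and `tcp_connectable` are hypothetical helper names, and the set of comparison operators is an assumption.

```python
import operator
import socket

# Comparison operators assumed for this illustration.
_OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}


def judge(actual: float, op: str, expected: float) -> bool:
    """Return True if `actual <op> expected` holds, e.g. CPU usage > 90."""
    return _OPS[op](actual, expected)


def tcp_connectable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For instance, `judge(cpu_usage, ">", 90)` expresses the monitor example, and `tcp_connectable(server_ip, 8080)` expresses the tcp example, where `cpu_usage` and `server_ip` stand for values obtained elsewhere.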

Traffic injection capability

  • http: http traffic injection

Getting Started

Quickly try the single-machine injection capability

# Pull the Docker image and run a container
docker run --privileged -it registry.cn-hangzhou.aliyuncs.com/chaosmeta/chaosmetad-demo:v0.5.3 /bin/bash

# Start the test service
cd /tmp && python -m SimpleHTTPServer 8080 > server.log 2>&1 &
curl 127.0.0.1:8080

# Create an experiment to inject a 2s network delay into the lo network card, and it will automatically recover after 10 minutes
chaosmetad inject network delay -i lo -l 2s --uid test-fg3g4 -t 10m

# View experiment information, test effect
chaosmetad query
curl 127.0.0.1:8080

# Manually recover the experiment
chaosmetad recover test-fg3g4

Fault Ability Usage

For details, see: Function Instructions

Installation Guide

For details, see: Installation Guide

Communicate

You are welcome to report defects, ask questions, and propose suggestions and new features. All of these can be submitted to GitHub Issues, or you can contact us in the following ways:

  • DingTalk group: 21765030887
  • Slack group: ChaosMeta
  • WeChat public account: ChaosMeta混沌工程
  • Twitter: AntChaosMeta
  • Email: [email protected]
  • WeChat group: email communication/WeChat public account to obtain QR code invitation

RoadMap

Platform capabilities

The future evolution of ChaosMeta platform capabilities is divided into three stages

Phase 1 - Manual Configuration

The goal of this phase is to release all the components in the architecture diagram publicly. At that point ChaosMeta can support the complete life cycle of chaos engineering and enters the initial stage of automated chaos engineering, with the "Risk Catalog" as a theoretical reference: configure once manually, then run automatically many times. The components are being released in the following order (if you have related needs, you are welcome to submit an issue, and priority adjustments will be considered):

  • Stand-alone fault injection tool: chaosmetad
  • Fault Remote Injection Engine: chaosmeta-inject-operator
  • Platform Dashboard: chaosmeta-platform
  • Orchestration Engine: chaosmeta-workflow-operator
  • Measure Engine: chaosmeta-measure-operator
  • Traffic Injection Engine: chaosmeta-flow-operator
  • Risk Catalog: Common Risk Scenario Handbook for Technical Components in Each Field
  • Cloud-Native Dynamic Admission Fault Injection Capability: chaosmeta-webhook

Phase 2 - Automation

At this stage, the "Risk Catalog" will play a greater role. It gives not only the risks of a class of applications but also the corresponding prevention and emergency recommendations and a score for each item. ChaosMeta will integrate the "Risk Catalog" into risk "physical examination" packages for general components, realizing a one-click "physical examination" capability: input the target application's information and directly get a risk score and a risk analysis report.

Phase 3 - Intelligence

Explore combining chaos engineering with artificial intelligence.

Fault Injection Capability

The following is only a classification of fault capabilities. For the specific atomic fault capabilities provided, please refer to the description of fault capabilities (you are welcome to submit issues proposing new capabilities; those in higher demand are given priority):

License

ChaosMeta is licensed under the Apache License 2.0; see LICENSE for details.

chaosmeta's People

Contributors

c1erman, cycwll, delphisfang, hlt1997, kingsonkai, samson-samson, xvwenyuan, zwk1091

chaosmeta's Issues

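Request: install the chaosmeta platform server in a pure Docker environment

Problem description:
According to the official documentation, chaosmeta can currently only be deployed on a K8S cluster; there is no deployment option for a pure Docker setup.
I pulled the chaosmeta frontend Docker image myself, but it seems to be missing some dependencies. Is there a docker compose file or a deployment document for pure Docker?

Many thanks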

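Configuration problem with the "accumulate Pending Pods" experiment

The experiment that accumulates Pods in the Pending state needs a namespace that does not exist in the current cluster, but during configuration only an existing namespace can be selected; a non-existent namespace cannot be entered manually.
The version is v0.6.0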

Open up more Risk Catalog and Metric Engine

Thanks for providing this great tool. My environment uses MySQL, Elasticsearch, ClickHouse, Kafka, etc. I hope you will open up more of the Risk Catalog and Metric Engine so that they can be used in production environments.

"parallel" or "serial" experiments, but the actual results are opposite

When I create a “parallel” experiment, it is actually executed serially. When I create a “serial” experiment, it is actually executed in parallel.
The following screenshot shows that the two experiments in row “1” are actually executed in parallel, while the experiment in row "2" is executed after the experiment in row 1 is completed.

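Frontend bug: pagination on the "Space overview - Recently run experiment results" page

As shown in the screenshot, I am already on page 2, but page 1 is still displayed as the current page, and I cannot return to the actual page 1.
A further suggestion: this list should be sorted in reverse chronological order, so that the most recently run experiments come first, which seems more reasonable.

In addition, I understand "upcoming experiments" to mean scheduled experiments and the time of their next run. But as the screenshot shows, the experiment I created is run manually and has already finished. In theory I may never run it again, yet it also appears among the upcoming experiments. I am not sure whether my understanding is correct.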

Please support ARM Architecture

We hope to perform fault injection on ARM architecture machines, but chaosmeta only supports the x86 architecture. Other chaos engineering tools also support the ARM architecture.

Bug: Unable to perform memory fill by "bytes"

pod --> mem --> fill
The front end of the page makes percent a required field that must be between 1 and 100 and cannot be 0. Once percent != 0, the number of injected bytes is calculated from percent (source code: chaosmetad pkg/utils/memory/mem.go, line 34). As a result, a fixed bytes value cannot be specified; or rather, specifying one has no effect, because percent cannot be set to 0.

In addition, I personally feel that percent has little practical meaning for a pod here, because it is calculated from the host's memory figures and is unrelated to the pod's request and limit. Should this logic be improved? I am not sure whether my understanding is correct, or whether this design targets a specific scenario.
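
The precedence described in this issue can be illustrated with a minimal Python sketch. It paraphrases the behaviour as the reporter describes it and is not the code in chaosmetad's mem.go; `fill_kbytes` and its parameters are hypothetical names, and the real calculation may also take memory already in use into account.

```python
def fill_kbytes(percent: int, fixed_kb: int, total_kb: int) -> int:
    """Return how many KB to fill, following the precedence described in the issue.

    percent  -- requested share of total memory (0 means "not set")
    fixed_kb -- explicitly requested fixed amount in KB
    total_kb -- total memory used as the base; per the issue this is the host's
                total, not the pod's request or limit
    """
    if percent != 0:
        # A non-zero percent always wins, so fixed_kb is ignored.
        return total_kb * percent // 100
    return fixed_kb


# The UI only accepts percent in 1..100, so the fixed_kb branch is never reached:
host_total_kb = 16 * 1024 * 1024  # hypothetical 16 GiB host
for ui_percent in (1, 50, 100):
    assert fill_kbytes(ui_percent, 1024, host_total_kb) != 1024
```

Only a caller that can pass `percent=0`, for example one that bypasses the UI validation, reaches the fixed-size branch.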

create experiment failed

2023-12-11 06:00:24 error experiment/routine.go:128 convertToExperimentInstance:{"uuid":"17340907200016588801","name":"test","description":"","creator":1,"namespace_id":1,"create_time":"","update_time":"","status":"Running","message":"","workflow_nodes":[{"uuid":"768e691097ea11eea3611b6e38aa27a3","name":"增删Pod标签","row":0,"column":0,"duration":"60s","scope_id":3,"target_id":23,"exec_name":"增删Pod标签","exec_type":"fault","exec_id":68,"status":"","message":"","create_time":"","update_time":"","args_value":[{"args_id":251,"value":"app=demo"}],"subtasks":{"id":0,"workflow_node_instance_uuid":"","target_name":"chaosmeta-measure-controller-manager-85c4f44449-fqnvh","target_ip":"","target_hostname":"","target_label":"","target_app":"","target_namespace":"chaosmeta","range_type":"","exec_log":"","status":"","message":"","create_time":"0001-01-01T00:00:00Z","update_time":"0001-01-01T00:00:00Z"},"flow_subtasks":null,"measure_subtasks":null}]}
2023-12-11 06:00:35 error experiment/routine.go:192 fault CR get failed, err:experiments.chaosmeta.io "inject-fault-kubernetes-pod-e-173409073969391616011702274424node" not found

bug: "内存填充" (memory fill) type fault, execution error

I created a "内存填充" (memory fill) type fault with the configuration parameters shown in the screenshot below, but an error occurred during execution.


# k -n chaosmeta get experiments.chaosmeta.io   inject-pod-mem-170359246454102835211695003069experiment-170359246457458278411695003069node -o yaml

......
    recover:
    - injectObjectName: pod/default/httpserver-8cb888b6d-klfbj/httpserver
      message: 'inject error: container cp from [/tmp/chaosmetad-0.2.0/tools/chaosmeta_memfill]
        to [/tmp/chaosmeta_memfill] error: task start error: OCI runtime exec failed:
        exec failed: unable to start container process: exec: "/bin/bash": stat /bin/bash:
        no such file or directory: unknown'
......

dial tcp 10.99.2.161:443: connect: connection timed out

Simulating the Kubernetes atomic fault injection capability: delete pod
Error: kubectl apply -f 111.yaml
Error from server (InternalError): error when creating "111.yaml": Internal error occurred: failed calling webhook "mexperiment.kb.io": failed to call webhook: Post "https://chaosmeta-inject-webhook-service.chaosmeta.svc:443/mutate-inject-chaosmeta-io-v1alpha1-experiment?timeout=10s": dial tcp 10.99.2.161:443: connect: connection timed out

$kubectl get pod -n obcluster
NAME                      READY   STATUS    RESTARTS   AGE
sapp-ob-test-cn-zone1-0   2/2     Running   0          25h
sapp-ob-test-cn-zone2-0   2/2     Running   0          25h
sapp-ob-test-cn-zone3-0   2/2     Running   0          25h

The content of the 111.yaml configuration file is as follows
$cat 111.yaml
apiVersion: inject.chaosmeta.io/v1alpha1
kind: Experiment
metadata:
  name: kubernetes-pod-delete-experiment
  namespace: chaosmeta
spec:
  scope: kubernetes
  targetPhase: inject
  rangeMode:
    type: count
    value: 2
  experiment:
    target: pod
    fault: delete
    duration: 10m
  selector:
  - namespace: obcluster
    name:
    - sapp-ob-test-cn-zone3-0
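
For readers who prefer to generate such a manifest from a script, here is a minimal sketch that builds the same Experiment object as a Python dict and prints it as JSON, which `kubectl apply -f` accepts as well as YAML. The field names and values are copied from the file above; their nesting is an assumption based on the other Experiment object quoted on this page, and `pod_delete_experiment` is a hypothetical helper name. It does not address the webhook timeout reported in this issue.

```python
import json


def pod_delete_experiment(name, namespace, target_namespace, pod_names,
                          duration="10m", count=2):
    """Build a pod-delete Experiment manifest mirroring the 111.yaml above."""
    return {
        "apiVersion": "inject.chaosmeta.io/v1alpha1",
        "kind": "Experiment",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "scope": "kubernetes",
            "targetPhase": "inject",
            "rangeMode": {"type": "count", "value": count},
            "experiment": {"target": "pod", "fault": "delete", "duration": duration},
            "selector": [{"namespace": target_namespace, "name": list(pod_names)}],
        },
    }


manifest = pod_delete_experiment(
    name="kubernetes-pod-delete-experiment",
    namespace="chaosmeta",
    target_namespace="obcluster",
    pod_names=["sapp-ob-test-cn-zone3-0"],
)
# Redirect this output to a file and pass it to `kubectl apply -f`.
print(json.dumps(manifest, indent=2))
```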

"Network" fault: error msg=\"unknown args: [true]

The following error occurred when injecting network-type faults.

    - injectObjectName: pod/default/httpserver-8cb888b6d-klfbj/httpserver
      message: "experiment inject error: kubectl exec error: exec remote cmd error:
        command terminated with exit code 1 time=\"2023-09-18 13:02:59\" level=error
        msg=\"unknown args: [true], please add -h to get more info\"\n "

Experiment detail:

apiVersion: chaosmeta.io/v1alpha1
kind: Experiment
metadata:
  creationTimestamp: "2023-09-18T05:02:59Z"
  finalizers:
  - chaosmeta/experiment
  generation: 1
  name: inject-pod-network-170363570037876736011695013377experiment-170363570039135027211695013377node
  namespace: chaosmeta
  resourceVersion: "213682137"
  uid: a35922f5-a377-41b7-8701-41a69c09781a
spec:
  experiment:
    args:
    - key: percent
      value: "30"
      valueType: int
    - key: interface
      value: eth0
      valueType: string
    - key: mode
      value: normal
      valueType: string
    - key: force
      value: "true"
      valueType: bool
    - key: containername
      value: firstcontainer
      valueType: string
    duration: 60s
    fault: loss
    target: network
  scope: pod
  selector:
  - name:
    - httpserver-8cb888b6d-klfbj
    namespace: default
  targetPhase: inject
status:
  createTime: "2023-09-18 05:02:59"
  detail:
    inject:
    - injectObjectName: pod/default/httpserver-8cb888b6d-klfbj/httpserver
      message: "experiment inject error: kubectl exec error: exec remote cmd error:
        command terminated with exit code 1 time=\"2023-09-18 13:02:59\" level=error
        msg=\"unknown args: [true], please add -h to get more info\"\n "
      startTime: "2023-09-18 05:02:59"
      status: failed

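About the measurement engine and traffic injection

hi~
I deployed chaosmeta with the 0.5.0 deployment files to try it out and found that the measurement engine and traffic injection are disabled. Is my configuration wrong, or have they not been released yet?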

Unable to inject memory faults when using minimalist container image (such as distroless or scratch)

An error occurs when injecting memory faults by percentage into a container built from a minimalist image (such as distroless or scratch).

chaosmetad log:

./chaosmetad inject mem fill --percent=30 --mode=ram --timeout 180s --container-runtime containerd --container-id 0737e2e63ccd4c0df86b3b5a4c287c5732d9f2b92d0d1ecba390a8e5c4ae174e --log-level debug
DEBU[2023-10-17 14:38:06] get containerd client                        
DEBU[2023-10-17 14:38:06] new containerd client, ns: k8s.io, socket: /run/containerd/containerd.sock 
INFO[2023-10-17 14:38:06] uid: 202310171438066116                      
INFO[2023-10-17 14:38:06] args: {"percent":30,"mode":"ram"}            
DEBU[2023-10-17 14:38:06] get containerd client                        
DEBU[2023-10-17 14:38:06] container exec cmd: [/bin/bash -c /root/chaosmeta-github/chaosmeta/chaosmetad/tools/chaosmeta_execns -t 1083599  -m -c "grep -m1 MemTotal /proc/meminfo | sed 's/[^0-9]*//g'"] 
DEBU[2023-10-17 14:38:07] container exec result: exit code: 0, output: , err: <nil> 
ERRO[2023-10-17 14:38:07] inject error: calculateFillKBytes error: get total mem error: get total mem[] error: strconv.ParseFloat: parsing "": invalid syntax 
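
The log shows the in-container command exiting with code 0 but empty output, after which parsing the empty string as a number fails. The sketch below reproduces the same extraction that the logged `grep -m1 MemTotal /proc/meminfo | sed 's/[^0-9]*//g'` pipeline performs, and raises an explicit error when nothing usable comes back. It is only an illustration of handling the empty case, not chaosmetad's code; `parse_mem_total_kb` is a hypothetical name.

```python
import re


def parse_mem_total_kb(meminfo_text: str) -> int:
    """Extract the MemTotal value (in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal"):
            # Same effect as the sed expression in the log: keep digits only.
            digits = re.sub(r"[^0-9]", "", line)
            if digits:
                return int(digits)
    # Empty or unexpected output, as in the log above.
    raise ValueError("MemTotal not found: the command produced no usable output")
```

Called with the empty output from the log, this raises a `ValueError` with a message that points at the missing command output instead of surfacing a number-parsing error.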
