The aim of this project is to create a monitoring check that reliably detects failure modes of HTTP APIs, taking into account:
- Partial failures (10% of all requests are failing)
- Flapping states (Service changing between Up/Down quickly)

It does so by using:
- An enriched state model that includes partial failure modes.
- Adaptive change of the measurement frequency.
- ? for flapping states.

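
As a rough illustration of the first two points above, the following sketch classifies a window of recent probe results into three states and adjusts the probe interval accordingly. It is not taken from `probe.py`: the names `State`, `classify` and `next_interval`, the thresholds, and the halve/double rule are all assumptions chosen for the example.

```python
from enum import Enum


class State(Enum):
    """Hypothetical enriched state model: PARTIAL sits between UP and DOWN."""
    UP = "up"
    PARTIAL = "partial"
    DOWN = "down"


def classify(results, partial_threshold=0.05, down_threshold=0.9):
    """Map a window of probe outcomes (True = success) to a State.

    The thresholds are illustrative assumptions, not values from the project.
    """
    if not results:
        raise ValueError("need at least one probe result")
    failures = sum(1 for ok in results if not ok)
    failure_rate = failures / len(results)
    if failure_rate >= down_threshold:
        return State.DOWN
    if failure_rate >= partial_threshold:
        return State.PARTIAL
    return State.UP


def next_interval(state, current, min_interval=1.0, max_interval=60.0):
    """Probe more often while something looks wrong, back off while healthy."""
    if state is State.UP:
        return min(current * 2, max_interval)
    return max(current / 2, min_interval)
```

With a window of nine successes and one failure, `classify` reports `State.PARTIAL`, and `next_interval` halves the current interval so that the partial failure is sampled more densely.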
The experimental UI is written in Python 3 using Tornado. It is recommended to use a virtualenv when running it.

Install the requirements:

    pip install -r requirements.txt

Then run the probe server:

    python probe.py

And the simulation server:

    python simulation_server.py

Currently, configuration must be done in the code.

WiP:
- Make a simple Bayes model
- Add transition probability parameter
- Gather real-world data
- Make a simulation of partial failure modes
- Make the parameters configurable in the UI
- Alert on transition to "bad" state
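
One possible shape for the first two items in this list is a small discrete Bayes filter over the service states. The sketch below is only an assumption about how such a model could look, not the project's design: the state names, the per-state failure probabilities in `FAILURE_PROB`, the `update` function, and the single `p_transition` parameter are all hypothetical.

```python
# Assumed probability that a single request fails in each state.
FAILURE_PROB = {"up": 0.01, "partial": 0.10, "down": 0.99}


def update(belief, success, p_transition=0.05):
    """Return the new belief over states after one probe result.

    belief: dict mapping state name -> probability (sums to 1).
    success: True if the probe request succeeded.
    p_transition: assumed chance that the service changed state since
        the last probe, spread evenly over the other states.
    """
    n = len(belief)
    # Predict step: allow for a state change between two probes.
    predicted = {
        s: (1 - p_transition) * belief[s]
        + p_transition * (1 - belief[s]) / (n - 1)
        for s in belief
    }
    # Update step: weight each state by how well it explains the result.
    weighted = {
        s: predicted[s] * ((1 - FAILURE_PROB[s]) if success else FAILURE_PROB[s])
        for s in predicted
    }
    total = sum(weighted.values())
    return {s: w / total for s, w in weighted.items()}


def most_likely(belief):
    """Return the state name with the highest probability."""
    return max(belief, key=belief.get)
```

Starting from a uniform belief and feeding in probe results one at a time, the belief shifts toward the state that best explains the recent results. A larger `p_transition` makes the model react faster to changes but also makes it more sensitive to single outliers.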