Comments (8)
As a user I would expect that severity follows the natural order, so 1 is low and 10 is high.
I too was surprised by this, but this is as-defined and implemented in Elasticsearch 's corresponding endpoint, and corresponds to the severity in support systems where a "Sev 1" is of utmost importance. The goal here is to follow their prior art wherever possible.
from logstash.
`has_status`. I can see `status` always has a value from `unknown` to `green`. Do you think we can remove the layer `has_status`, so `status` pops to the upper level?
I was a little split on this, but ended up breaking `has_status` (and `has_indicators`, etc.) off to reduce repetition and to reduce the scope of what a reader of the schema needs to be aware of at each level. Each of the `has_*` sub-schemas validates both the key and the shape of the value, which I found to be more straightforward than validating the key in multiple separate places and using the `$ref` only for the value. For this one in particular we could also just use an `enum`, but I wanted to be able to include a description for each of the values.
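As a purely illustrative sketch of that trade-off (the `oneOf`/`const` pattern and the per-value wording are my assumptions, not the actual schema text), a `has_status` sub-schema that documents each value might look like:

```json
{
  "$defs": {
    "has_status": {
      "type": "object",
      "required": ["status"],
      "properties": {
        "status": {
          "oneOf": [
            { "const": "unknown", "description": "the health of this component could not be determined" },
            { "const": "green", "description": "fully operational" },
            { "const": "yellow", "description": "degraded, but still functional" },
            { "const": "red", "description": "not functioning as expected" }
          ]
        }
      }
    }
  }
}
```

Other parts of the schema would then pull in the shared definition with `"$ref": "#/$defs/has_status"` wherever an object carries a status, which is the repetition-reduction described above; a plain `enum` would be shorter but loses the per-value descriptions.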
`diagnosis.affected_resources` vs `impact.impact_areas`. Can you tell me what the difference is? They sound like similar things.
This is directly pulled from the Elasticsearch health report's schema, and the difference is still a little vague to me.
In my mental model, a `diagnosis` tells you about something that isn't right, gives you steps to remediate it, and indicates what areas are affected, but an `impact` tells you how your processing is affected by the thing that is wrong. I'm not entirely sure why they are decoupled in the Elasticsearch endpoint, but I don't have a strong reason to break from their prior art.
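To make the distinction concrete, here is an invented example in which a single unhealthy condition produces both: the `diagnosis` names the cause, the remediation, and the affected resources, while the `impact` describes the consequence for processing (all field values here are hypothetical):

```json
{
  "diagnosis": [
    {
      "cause": "persistent queue for pipeline main is nearly full",
      "action": "increase queue.max_bytes or resolve downstream back-pressure",
      "affected_resources": ["pipeline:main"]
    }
  ],
  "impacts": [
    {
      "severity": 1,
      "description": "inputs will stall once the queue is full, and events may be refused",
      "impact_areas": ["ingest"]
    }
  ]
}
```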
a) It'd be nice to have a sample (manually constructed) JSON response from this API to help visualize its benefit.
I've added one, immediately below the schema.
b) A concern I have is that the wider and deeper the indicator tree grows, the more sensitive the top-level status becomes to small perturbations. A suggestion is that we could allow the incubation of indicators. These would be visible in the health report but would not yet bubble up their contribution to the parent indicator. A way to implement this would be to mark their impact to the `impact_areas` as 0.
I see the concern, and think we can handle this in two ways:

- enable a probe to include a `diagnosis` without degrading the status of the indicator, so that we can introduce new "advisory" or "incubator" probes
- enable the acking of an indicator, a probe-type, or a specific probe via an API endpoint, preventing it from contributing to the status either forever or until a TTL has passed (but not preventing it from supplying a diagnosis)
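The acking option could be driven by a small request body; the endpoint and field names below are a hypothetical sketch of the idea, not a settled design:

```json
{
  "indicator": "pipelines",
  "probe_type": "worker_utilization",
  "ttl": "24h"
}
```

Sent to something like a `PUT /_health_report/ack` endpoint, this would suppress the probe's status contribution until the TTL expires while leaving its diagnosis visible.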
c)
probes themselves aren't exposed via the API; rather, they are the internal component that can add a diagnosis and impacts to the indicator and degrade its status.
Not even behind a `?show_probes=true` flag? :) Maybe with a few examples of the API's return it becomes clearer that it's not necessary, but without the probe values I wonder if users will question the overall report.
I hope that this is not needed, but I believe the implementation can defer it until it is needed. One way would be to have the `details` (which is a free-form key/value map) include the probes by their type and their effect on the status in order to add confidence. The Elasticsearch indicators each have a variety of `details` that they present specific to what is being observed, and we can define the specifics of our `details` for each top-level indicator or for the pipeline-indicator type.
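For instance (keys invented for illustration), a pipeline indicator's free-form `details` could carry the probe breakdown without making probes a first-class part of the API:

```json
{
  "details": {
    "probes": {
      "worker_utilization": { "status": "yellow" },
      "queue_backpressure": { "status": "green" }
    }
  }
}
```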
I think memory pressure is more similar to the ES indicator level than resources, and it will give us more options for properties of memory pressure. I understand the desire to have pipeline level information, but a more appropriate indicator at the LS health API level might be pipeline_flow with individual pipelines as properties and pipeline issues as properties.
We can have many probes that contribute to degrading the status of the `resources` indicator, including multiple ones that expose different types of memory pressure (such as tracking new metrics like humongous allocations). Perhaps the phase 1 too-much-time-collecting-garbage probe would be better named. I avoided breaking the top-level `resources` into component parts `memory`, `disk`, `cpu`, etc., because grouping them gives us the freedom to develop probes that view these resources collectively in the future without breaking the shape of the API response.
The Elasticsearch indicators map to either whole cluster-wide systems (`master_is_stable`, `ilm`, `slm`, `repository_integrity`, or `shards_capacity`) or cluster-wide resource availability (`disk`), and we don't have direct analogues to these things in Logstash. The `details` properties feature prominently in the Elasticsearch API documentation, but where the API really shines is the `diagnosis` entries, which are fed by a variety of probes.
In shaping the top-level indicators to be `resources` and `pipelines`, my goal is to let a user of the API know (1) this process has the resources it needs and (2) its pipelines are processing events as-expected, and when one or more pipelines isn't processing as-expected, they can drill down into the specific pipeline before being flooded with diagnosis and guidance.
from logstash.
Really interesting and complete proposal. I have only one concern: the `impact.severity` field is defined as an integer, and from its description:
How important this impact is to functionality. A value of 1 is the highest severity, with larger values indicating lower severity.
As a user I would expect that severity follows the natural order, so 1 is low and 10 is high.
I also think that to provide better information, we should cap the max value; otherwise a user could ask whether 10 is severe enough, or whether there is also a 100. Having the severity in a range, say 1..10, could provide a better measure. I left out 0 intentionally, because 0 means "no problem", so there is no need for the field to exist in that case.
from logstash.
This is a thoughtful proposal. I have two questions regarding the schema.
1. `has_status`. I can see `status` always has a value from `unknown` to `green`. Do you think we can remove the layer `has_status`, so `status` pops to the upper level?
2. `diagnosis.affected_resources` vs `impact.impact_areas`. Can you tell me what the difference is? They sound like similar things.
from logstash.
It really shows the amount of effort put into this; it's much appreciated.
a) It'd be nice to have a sample (manually constructed) JSON response from this API to help visualize its benefit.
b) A concern I have is that the wider and deeper the indicator tree grows, the more sensitive the top-level status becomes to small perturbations. A suggestion is that we could allow the incubation of indicators. These would be visible in the health report but would not yet bubble up their contribution to the parent indicator. A way to implement this would be to mark their impact to the `impact_areas` as 0.
c)
probes themselves aren't exposed via the API; rather, they are the internal component that can add a diagnosis and impacts to the indicator and degrade its status.
Not even behind a `?show_probes=true` flag? :) Maybe with a few examples of the API's return it becomes clearer that it's not necessary, but without the probe values I wonder if users will question the overall report.
from logstash.
This is awesome, and I want to prioritize it as soon as possible. For Phase 1, do you have a high level sense of the amount of development time required?
I really like relying on the prior art and being consistent with the rest of the stack. Looking at the Elasticsearch health API, I would like to verify that our indicators are consistent with that approach: https://www.elastic.co/guide/en/elasticsearch/reference/current/health-api.html
Example ES indicator - shards_availability {unassigned_primaries, initializing_primaries, creating_primaries...}
Example LS indicator - resources {memory_pressure}
I think memory pressure is more similar to the ES indicator level than resources, and it will give us more options for properties of memory pressure. I understand the desire to have pipeline level information, but a more appropriate indicator at the LS health API level might be pipeline_flow with individual pipelines as properties and pipeline issues as properties.
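Under that alternative shaping (the names here only sketch the suggestion and are not part of the proposal), the top level would look closer to Elasticsearch's flat, system-level indicator list:

```json
{
  "indicators": {
    "memory_pressure": {
      "status": "yellow",
      "details": { "heap_used_percent": 91, "gc_overhead_percent": 12 }
    },
    "pipeline_flow": {
      "status": "green",
      "details": {
        "pipelines": { "ingest-main": { "status": "green" } }
      }
    }
  }
}
```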
from logstash.
This is awesome work. I can see a lot of utility here, some potential uses for our dashboards, and maybe even autoscaling in ECK.
- Would you anticipate being able to get subsets of the health report, e.g. `GET /_health_report/pipelines/<pipeline_id>` for individual pipelines?
- Would you anticipate having a "verbose/minimized" feature to reduce payloads/weight for clients that poll just looking for status?
- Would it be possible to turn off status propagation altogether at the pipeline level for non-critical pipelines? I am thinking about configurations that we see in production, where there are a huge number of pipelines, some of which are either low-traffic or QA, etc.
- Would we consider a 'source' or similar field to determine the source probe for the degradation number? Or are the probes an "implementation detail"?
- Probably outside of the scope of this issue, but I'd love to get the `memory_pressure` numbers available, either here or in the general metrics API.
from logstash.