Comments (8)
I'd prefer an implementation that completely disallows the dynamic creation of metric descriptors. One could allow for a set of detailed status codes through configuration or convention, but still mark the family-style status codes (e.g. 2xx). No descriptor would be created for a status code that is outside of the configuration.
Dynamically created metric descriptors have caused a lot of end-user pain in the past. While the suggested approach improves the situation, I'd prefer excluding them completely.
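A minimal sketch of what that could look like, assuming a plain in-memory registry. The class name `StaticStatusMetrics` and its methods are hypothetical and are not part of Apollo; the point is only that every descriptor exists from startup and nothing is created at request time.

```python
class StaticStatusMetrics:
    """Hypothetical registry in which every descriptor is created up front."""

    FAMILIES = ("1xx", "2xx", "3xx", "4xx", "5xx")

    def __init__(self, detailed_codes):
        # Family descriptors always exist, each starting at zero.
        self.counts = {family: 0 for family in self.FAMILIES}
        # Detailed descriptors exist only for explicitly configured codes.
        self.counts.update({str(code): 0 for code in detailed_codes})

    def record(self, status_code):
        family = f"{status_code // 100}xx"
        if family in self.counts:
            self.counts[family] += 1
        key = str(status_code)
        if key in self.counts:
            # Only configured codes get their own series; no new keys are added.
            self.counts[key] += 1


metrics = StaticStatusMetrics(detailed_codes=[200, 404])
metrics.record(200)
metrics.record(418)  # not configured: counted under 4xx only
```

With this shape, an unconfigured code such as 418 still shows up in its family count, but never produces a new series.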
I understand that dynamically created metric descriptors are problematic, but I think it is a far bigger problem to hide unexpected status codes. They tend to be indications of serious problems, which is the main reason I wanted to not use the 4xx/5xx approach but rather retain the granularity of individual status codes.
What exactly have the problems been that we've seen so far? Might it be possible to make improvements to the way that Heroic deals with this situation instead?
from apollo.
I apologize for the wall of text; I'm trying to organize my thoughts as best I can.
I'd classify the issues associated with dynamic descriptors into two categories.
Correctness
When a service is reloaded, a dynamic metric descriptor doesn't exist until at least one sample has been observed. This causes problems for heuristics that do (or don't) take the absence of samples into account. It means we can't distinguish between a service reload and an issue in the pipeline.
It is possible to interpolate and assume a metric has some value if it doesn't exist, but this misrepresents the real state of the system. The absence of a sample doesn't mean anything; we don't know if the system intends for the value to be 3.14, 0, or something else. While this is a very general observation, it is a principle that we have started to push for internally: we should not make decisions based on the absence of data. We are currently designing quality heuristics, and they will include how much data must be available for a time series before we are comfortable making a decision. Dynamic descriptors cause ephemeral gaps in the delivery of metrics and prevent our systems from making good decisions.
The most prominent real-world effect I can think of is flapping or slow-to-react alerts, specifically around service reloads.
Think of it like a heart monitor. You want it to always report data even if the heart hasn't beat yet. And you can't assume that the patient is dead when no data is being reported.
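To illustrate the "no decisions on absent data" principle, here is a rough sketch of an alert evaluation with an explicit no-data state. The `Verdict` enum, the `evaluate` function, and the `min_samples` parameter are all made up for this example and do not describe how Heroic or any alerting system actually works.

```python
from enum import Enum


class Verdict(Enum):
    OK = "ok"
    FIRING = "firing"
    NO_DATA = "no_data"


def evaluate(samples, threshold, min_samples=3):
    """Evaluate a window of samples, where None marks a missing sample."""
    present = [s for s in samples if s is not None]
    # Too little data: refuse to decide rather than treat gaps as zero.
    if len(present) < min_samples:
        return Verdict.NO_DATA
    mean = sum(present) / len(present)
    return Verdict.FIRING if mean > threshold else Verdict.OK


# A freshly reloaded service with a dynamic descriptor reports nothing at all.
after_reload_dynamic = evaluate([None, None, None], threshold=5)
# A pre-created descriptor reports explicit zeros, so a decision is possible.
after_reload_static = evaluate([0, 0, 0], threshold=5)
```

The dynamic case can only yield `NO_DATA`, while the pre-created case yields a real `OK`, which is the difference between the silent heart monitor and the one reporting a flat line.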
Discovery
Until at least one sample has been recorded, Heroic cannot provide suggestions. This typically results in an unknowing operator contacting support wondering where their metric is, or in some alert not covering the desired case because the metric was not visible when the alert was created. Status code family classification is a low barrier to entry for inexperienced operators: they need not learn the caveats of dynamic metric descriptors.
The primary effect is that a particular status code is not available for suggestion, or visible in the result groups, when setting up a query or an alert. This makes it hard to determine what a particular alert covers, since that might change over time.
Finally
My hunch is also that families and a few default specific status codes will cover 80% of use cases and reduce the amount of hand-holding needed to get started with basic use cases.
@pettermahlen @udoprog What do you think about adding both status-code and status-family tags to all endpoint-request-rate metrics, and either always pre-creating at least one code from each family, or adding pre-creation for at least one code from each family to the skeleton config?
That way it would be possible to group by family or group by code. It would also be possible to see the actual error codes even when grouping by family.
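As a rough illustration of the two-tag idea, assuming samples are simple `(status_code, count)` pairs: the helper names `tags_for` and `group_by` are invented for this sketch, and only the tag names `status-code` and `status-family` come from the proposal above.

```python
from collections import defaultdict


def tags_for(status_code):
    """Attach both the exact code and its family to a sample."""
    return {
        "status-code": str(status_code),
        "status-family": f"{status_code // 100}xx",
    }


def group_by(samples, tag):
    """Sum request counts per value of the chosen tag."""
    totals = defaultdict(int)
    for status_code, count in samples:
        totals[tags_for(status_code)[tag]] += count
    return dict(totals)


samples = [(200, 5), (204, 1), (500, 2), (503, 3)]
by_family = group_by(samples, "status-family")
by_code = group_by(samples, "status-code")
```

Because every sample carries both tags, the same data can be aggregated by family for alerting and broken down by code when investigating.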
TBH, I think the use of status-family is a workaround that creates metrics that are correct (in some sense) but not very useful. So I would argue against making that a part of OSS Apollo. I agree with everything you say, @udoprog, but I think that the current solution, which allows service owners to pre-create meters for any status codes they are interested in, is sufficient to address any problems I've seen.
As I see it, there are two key things we should achieve:
- correct metrics for expected status codes (because you don't want to alert on unexpected status codes, since you don't know what they mean).
- visibility of unexpected status codes (which means not hiding them through aggregating into status families).
The second can be solved by dynamically created metric descriptors. For the first, you need something like pre-creation.
If anything, it feels like we should have some kind of aggregation of, or metric on, dynamically created descriptors to further highlight that an unexpected status code has been seen.
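A sketch of how the combination might fit together, under the assumption of a simple in-memory meter table. `StatusMeters`, `mark`, and `unexpected_total` are hypothetical names chosen for the example; they are not Apollo's API.

```python
class StatusMeters:
    """Hypothetical meter table: pre-created expected codes plus dynamic ones."""

    def __init__(self, expected_codes):
        self.expected = frozenset(expected_codes)
        # Expected codes are pre-created, so they report zero from startup.
        self.meters = {code: 0 for code in self.expected}
        # Single always-present counter for anything outside the expected set.
        self.unexpected_total = 0

    def mark(self, status_code):
        if status_code not in self.expected:
            self.unexpected_total += 1
        # Unexpected codes still get their own dynamically created meter.
        self.meters[status_code] = self.meters.get(status_code, 0) + 1


meters = StatusMeters(expected_codes=[200, 404, 500])
meters.mark(200)
meters.mark(418)
meters.mark(418)
```

The `unexpected_total` counter exists from startup, so an alert on it does not depend on any dynamically created descriptor, while the per-code meters keep the unexpected codes visible.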
Will meters for commonly used status codes like 200, 404, 500 be automatically created by Apollo?
No, currently all error codes that should be created automatically must be included in the configuration.
We will release 1.2.0 in the current state, but keep this issue open for further discussion.
PR #128, which implements my suggestion above, will also remain open.
Closing this issue, as the project has been moved to be Spotify-internal.