Comments (8)

pettermahlen commented on May 13, 2024

Quoting @udoprog from #122:

I'd prefer an implementation that completely disallows the dynamic creation of metric descriptors. One could allow for a set of detailed status codes through configuration or convention, but still mark the family-style status codes (e.g. 2xx). No descriptor would be created for any status code outside of the configuration.

Dynamically created metric descriptors have caused a lot of end-user pain in the past. While the suggested approach improves on this, I'd prefer excluding them completely.

I understand that dynamically created metric descriptors are problematic, but I think it is a far bigger problem to hide unexpected status codes. They tend to be indications of serious problems, which is the main reason I wanted to not use the 4xx/5xx approach but rather retain the granularity of individual status codes.

What exactly have the problems been that we've seen so far? Might it be possible to make improvements to the way that Heroic deals with this situation instead?

from apollo.

udoprog commented on May 13, 2024

I apologize for the wall of text; I'm trying to organize my thoughts as best I can.

I'd classify the issues associated with dynamic descriptors into two categories.

Correctness

When a service is reloaded, a dynamic metric descriptor doesn't exist until at least one sample has been observed. This causes problems with heuristics that do (or don't) take the absence of samples into account. It means we can't distinguish between a service reload and an issue in the pipeline.

It is possible to interpolate and assume a metric has some value if it doesn't exist, but this misrepresents the real state of the system. The absence of a sample doesn't mean anything: we don't know if the system intends for the value to be 3.14, 0, or something else. While a very general observation, it is a principle that we have started to push for internally: we should not make decisions based on the absence of data. We are currently designing quality heuristics, and they will include how much data must be available for a time series before we are comfortable making a decision. Dynamic descriptors cause ephemeral gaps in the delivery of metrics and prevent our systems from making good decisions.

The most prominent real world effect I can think of around this is flapping or slow-to-react alerts, specifically surrounding service reloads.

Think of it like a heart monitor. You want it to always report data, even if the heart hasn't beaten yet. And you can't assume that the patient is dead when no data is being reported.
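
As a rough illustration of the reload gap, here is a toy Python sketch. It is not Apollo or Heroic code: `MeterRegistry`, `mark`, and `snapshot` are hypothetical names invented for this example. It compares what a reporter would emit right after a restart when meters are created dynamically versus pre-created.

```python
class MeterRegistry:
    """Toy registry (hypothetical API, not Apollo's)."""

    def __init__(self, precreate=()):
        # Pre-created meters exist from startup with a count of zero.
        self._counts = {code: 0 for code in precreate}

    def mark(self, status_code):
        # Dynamic creation: an unseen code gets its meter on the first sample.
        self._counts[status_code] = self._counts.get(status_code, 0) + 1

    def snapshot(self):
        # What a reporter would emit at this instant.
        return dict(self._counts)


# State immediately after a service reload, before any request is served.
dynamic = MeterRegistry()
precreated = MeterRegistry(precreate=[200, 500])

after_reload_dynamic = dynamic.snapshot()        # no series at all
after_reload_precreated = precreated.snapshot()  # explicit zeros
```

With the dynamic registry, a consumer sees no series for 500 and cannot tell "no errors yet" from "data is missing"; with pre-creation it sees an explicit zero.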

Discovery

Until at least one sample has been recorded, Heroic cannot provide suggestions. This typically results in an unknowing operator contacting support to ask where their metric is, or in an alert not covering the desired case because the metric was not visible when the alert was created. Status code family classification has a low barrier to entry for inexperienced operators, since they need not learn the caveats of dynamic metric descriptors.

The primary effect is that a particular status-code is not available for suggestion, or visible in the result groups when setting up a query or an alert. It makes it hard to determine what a particular alert covers since that might change over time.

Finally

My hunch is also that families plus a few specific default status codes will cover 80% of use cases and reduce the amount of hand-holding needed to get started with the basic ones.

jo-ri commented on May 13, 2024

@pettermahlen @udoprog What do you think about adding both status-code and status-family tags to all endpoint-request-rate metrics, and either always pre-creating at least one code from each family or adding pre-creation for at least one code from each family to the skeleton config?

That way it would be possible to group by family or group by code. It would also be possible to see the actual error codes even when grouping by family.
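
A minimal Python sketch of the idea, for illustration only: the tag names `status-code` and `status-family` come from the suggestion above, while `tags_for`, `group_by`, and the sample counts are made up for this example and are not Apollo code.

```python
from collections import defaultdict


def tags_for(status_code):
    """Build both tags for one response."""
    return {
        "status-code": str(status_code),
        "status-family": f"{status_code // 100}xx",
    }


def group_by(samples, tag):
    """Sum request counts per value of the given tag."""
    totals = defaultdict(int)
    for tags, count in samples:
        totals[tags[tag]] += count
    return dict(totals)


# Hypothetical sample data: (tags, request count).
samples = [
    (tags_for(200), 90),
    (tags_for(404), 7),
    (tags_for(418), 1),
    (tags_for(503), 2),
]

by_family = group_by(samples, "status-family")
by_code = group_by(samples, "status-code")
```

Because every series carries both tags, the same data can be aggregated per family for alerting and still be broken down per code when investigating.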

pettermahlen commented on May 13, 2024

TBH, I think the use of status-family is a workaround that creates metrics that are correct (in some sense) but not very useful. So I would argue against making that a part of OSS Apollo. I agree with everything you say, @udoprog, but I think that the current solution, which allows service owners to pre-create meters for any status codes they are interested in, is sufficient to address any problems I've seen.

As I see it, there are two key things we should achieve:

  • correct metrics for expected status codes (because you don't want to alert on unexpected status codes, since you don't know what they mean).
  • visibility of unexpected status codes (which means not hiding them through aggregating into status families).

The second can be solved by dynamically created metric descriptors. For the first, you need something like pre-creation.

If anything, it feels like we should have some kind of aggregation of, or metric on, dynamically created descriptors to further highlight that an unexpected status code has been seen.
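
One possible reading of this, sketched in Python under stated assumptions: `StatusCodeMeters` and `unexpected_total` are hypothetical names, not part of Apollo. Expected codes are pre-created, unexpected ones are still created dynamically, and a separate always-present counter records how often an unexpected code was seen.

```python
class StatusCodeMeters:
    """Pre-created meters for expected codes, dynamic meters for the rest."""

    def __init__(self, expected):
        self.expected = frozenset(expected)
        # Expected codes report zero from startup.
        self.counts = {code: 0 for code in expected}
        # Always present, so it can be alerted on even before any hit.
        self.unexpected_total = 0

    def mark(self, status_code):
        if status_code not in self.expected:
            # Highlight that a dynamically created descriptor was hit.
            self.unexpected_total += 1
        self.counts[status_code] = self.counts.get(status_code, 0) + 1


meters = StatusCodeMeters(expected=[200, 404, 500])
for code in (200, 418, 418):
    meters.mark(code)
```

After these three calls, `meters.unexpected_total` is 2 and the individual 418 meter is still visible, so the unexpected code is highlighted without being hidden inside a family.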

mattnworb commented on May 13, 2024

Will meters for commonly used status codes like 200, 404, 500 be automatically created by Apollo?

jo-ri commented on May 13, 2024

No, currently all error codes that should be created automatically must be included in the configuration.

jo-ri commented on May 13, 2024

We will release 1.2.0 in the current state, but keep this issue open for further discussion.
PR #128, which implements my suggestion above, will also remain open.

klaraward commented on May 13, 2024

Closing this issue, as the project has been moved to be Spotify-internal.
