Retrospectively thinking about what sort of memory implications this will have when we

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

I usually prefer to not make more configuration <p dir=

Memory leak with high cardinality tag and MeterFilter meant to cap cardinality about micrometer HOT 8 CLOSED

shakuzen commented on July 17, 2024

Memory leak with high cardinality tag and MeterFilter meant to cap cardinality

from micrometer.

Comments (8)

shakuzen commented on July 17, 2024 1

I'm still not sure what to do about this yet, but I wanted to get the issue open because this is a critical regression and we need to figure out something before 1.13 GA or we may need to revert the performance enhancement that introduced this behavior which would be rather unfortunate.

from micrometer.

shakuzen commented on July 17, 2024 1

When I said config, I didn't literally mean a property.

I understood. If it is pluggable, it is exposed API that adds complexity users need to understand (as opposed to an implementation detail), which is the concern.

Disallowing a metric would mean the metric is no more usable because its count can be relied upon as it represents only a sample whereas flattening out the high cardinal dimension.

I understand why the example filter results in more useful metrics than denying meters, but metrics instrumentation with a high cardinality tag is essentially a bug that should be fixed if at all possible. It already requires prior knowledge that such a filter needs to be configured for a high cardinality tag on a meter. Such a filter should be a stop-gap solution or for cases where it is out of control of the thing doing the config to ensure tags are limited to low cardinality (for example, Spring Boot configures a high cardinality tag filter for HTTP client metrics because users may not use uri templates - see here). With the most common case of HTTP client instrumentation, there should always be a way that templated paths can be tagged, but this requires users to use the HTTP client in a certain way that isn't (can't be) enforced. Proper instrumentation should make it possible for the cardinality to be low as long as usage follows some pattern. Then it is up to the end user (application code) to use it that way. If not, such MeterFilters will be needed, but they should be an indication the code should be fixed; not a permanent "solution". The filter's primary purpose is to prevent an OOM error and indicate a problem to the application owners to fix.

Does that make sense? Am I missing something?

I like the idea of a pluggable strategy (other than the complexity it adds), but I do worry about getting the API for this right directly in GA with no pre-releases. I'm wondering if maybe the safest way forward is to disable the cache by default and offer some way to enable it or possibly configure a strategy for it in an experimental way that it is clear it is not GA. I'll try to experiment with this.

from micrometer.

lenin-jaganathan commented on July 17, 2024

WDYT about making this cache pluggable via config? We can provide a default cache that would cap the cache at 2x the size of the number of meters registered or any other strategy that would be an ideal default.

from micrometer.

shakuzen commented on July 17, 2024

@lenin-jaganathan I usually prefer to not make more configuration, but I think in this case we may not have a better way forward. I like the general direction of what you mentioned. It somewhat reminds me of the HighCardinalityTagsDetector we already have. I'll think on it some and discuss internally. A challenge is how to describe to users what is configurable, why, and what the tradeoffs are of different configurations. I think (but unfortunately we can't really know) that most users should be unaffected by this issue, which means most should not need to touch the default or know about this configuration. However, for those that do need to know about it and change the default, how do they find out they need to change it? We can log when we reach the cap (or some point before reaching the cap perhaps) but we'll need to figure out what the log message should say that will be clear to users.

A tangential question is: should the cause of this (a filter like the one in the issue description) be something we discourage or eventually even disallow? It doesn't seem unreasonable without the new implementation detail in MeterRegistry that results in this issue. That said, we have in some way, I think, encouraged using a deny filter when reaching max allowable tags rather than the approach in the meter filter implementation here. There may be other use cases to consider as well.

from micrometer.

lenin-jaganathan commented on July 17, 2024

I usually prefer to not make more configuration

When I said config, I didn't literally mean a property. What I was thinking is more of a pluggable interface with a basic implementation which would be ideal. But, can be plugged to have desired behavior based on needs.

should the cause of this (a filter like the one in the issue description) be something we discourage or eventually even disallow?

I probably would disagree a bit on this. There are a lot of open source libraries out there using Micrometer and there are probabilities sometimes that they can have high cardinal dimensions E.g: could HTTP path etc.
Disallowing a metric would mean the metric is no more usable because its count can be relied upon as it represents only a sample whereas flattening out the high cardinal dimension. Sure we are losing out on a dimension, but we can still answer what is the total count of things (distributions in case of timers/summaries).

from micrometer.

lenin-jaganathan commented on July 17, 2024

I do worry about getting the API for this right directly in GA with no pre-releases

I agree with this.

from micrometer.

shakuzen commented on July 17, 2024

After much thought, I eventually went to write a test for this only to realize we don't have the memory leak we thought we would because meters are only added to the preFilterIdToMeterMap if they are not already in the meterMap. So in the case given, or the general case of a filter that maps multiple IDs to a same ID, only the first time will it be added to the cache. After that, other IDs that end up mapping to the same post-filter ID will not use the cache and will have all filters applied to them and the shared meter will be retrieved from the meterMap. Incidentally, this seems like the right behavior here without the need to make anything configurable. I think we get the performance gain in most cases like we want and in cases like this where we don't get the performance boost, it is avoiding a memory leak by not adding to the cache. Here's the test I wrote:

@Test
void differentPreFilterIdsMapToSameId_thenCacheIsBounded() {
    registry.config().meterFilter(MeterFilter.replaceTagValues("secret", s -> "redacted"));
    Counter c1 = registry.counter("counter", "secret", "value");
    Counter c2 = registry.counter("counter", "secret", "value2");

    assertThat(c1).isSameAs(c2);
    assertThat(registry.preFilterIdToMeterMap).hasSize(1);
    assertThat(registry.remove(c1)).isSameAs(c2);
    assertThat(registry.remove(c2)).isNull();
    assertThat(registry.getMeters()).isEmpty();
    assertThat(registry.preFilterIdToMeterMap).isEmpty();
}

from micrometer.

lenin-jaganathan commented on July 17, 2024

Yeah, I can see that now. We won't have a many-to-one mapping in the cache. Sorry, for the confusion and thanks for confirming it.

from micrometer.

Memory leak with high cardinality tag and MeterFilter meant to cap cardinality about micrometer HOT 8 CLOSED

Comments (8)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Jobs