Comments (5)

rtrox commented on August 15, 2024

The Problem

The issue with backfill in general is making sure the counter stays consistent, since we don't keep any sort of external state. So the way it works right now:
Backfill Disabled: When backfill is disabled, we start from 0 on each process start. So you start prowlarr, and in 60 days, you accrue 500 grabs. When you restart, the counter starts again at 0 and monotonically increases from there. Prometheus is pretty smart about restarts: increase(), rate(), etc. will all see the counter decrease and assume the counter was reset. So even if you're grabbing stuff quickly, and the counter resets from 500 to 20 because there were 20 grabs after restart before prometheus polled, increase() over that period will still show 20. In this situation, you lose the benefit of being able to look at a single datapoint as a snapshot of lifetime grabs, but prometheus treats the drop as a reset, and everything else works as expected.
Backfill Enabled: When backfill is enabled, we fill from the entire backlog in prowlarr on each boot. So assume to date you've had 1000 grabs. We'd backfill, start the counter at 1000, and then each poll period we'd get the delta and add it -- so say exportarr runs for 60 days, and accrues another 500 grabs. When you restart, we'll backfill from the beginning of time again, and start our cache from 1500 grabs. To prometheus, there is no interruption to the counter, and it continues monotonically increasing.
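
A rough way to see both cases is to model the reset handling directly. The sketch below is a simplified stand-in for what increase() does with a drop in a counter: it ignores the extrapolation that the real function performs, and `reset_aware_increase` is a hypothetical name, not part of prometheus or exportarr.

```python
def reset_aware_increase(samples):
    """Sum the growth across consecutive samples of a counter.

    Simplified model: any drop is treated as a reset to zero, so the
    whole post-drop value counts as new growth. Real increase() also
    extrapolates to the edges of the range, which is ignored here.
    """
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        if cur < prev:
            total += cur          # counter dropped: assume it restarted from 0
        else:
            total += cur - prev   # normal monotonic growth
    return total


# Backfill disabled: restart takes the counter from 500 down to 20.
disabled = reset_aware_increase([500, 20])

# Backfill enabled (full history): 1000 grows to 1500, then the restart
# backfills to 1500 again, so there is no drop at all.
enabled = reset_aware_increase([1000, 1500, 1500])
```

Under this model the disabled case yields 20 and the enabled case yields 500, matching the numbers in the two descriptions above.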

So now let's look at the idea of a partial backfill: You start prowlarr for the first time, and we backfill one month of data -- in that month, you've accrued 500 grabs. You leave it running for a week, and accrue 20 more grabs, taking the counter to 520. A couple days later, you restart the service, and leave the backfill flag enabled. Prowlarr dutifully looks back one month for the first datapoint again, and sees something like 515 grabs, as the first week of the prior backfill fell off. So now you have a period in your graph where the counter drops from 520 to 515. Now, when you run increase() over this period, prometheus will return 515, when it should return 0 -- prometheus assumes that because the counter decreased, it was reset to zero at some point, and therefore thinks those 515 grabs are new.
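
The numbers in that scenario can be reproduced with a small simulation. This is only a sketch: the event days are invented to match the 500/520/515 figures, the restart is assumed to land on day 39, and `count_in_window` is a hypothetical helper rather than anything in exportarr.

```python
def count_in_window(event_days, start_day, end_day):
    """Count grab events whose day falls in [start_day, end_day]."""
    return sum(1 for day in event_days if start_day <= day <= end_day)


WINDOW = 30  # assumed backfill period, in days

# Invented history: 5 early grabs, 495 mid-month, 20 in the week after first boot.
event_days = [2] * 5 + [15] * 495 + [33] * 20

# First boot on day 30: backfill covers days 0..30.
first_boot = 30
counter = count_in_window(event_days, first_boot - WINDOW, first_boot)

# A week of running adds the new grabs on top of the backfilled value.
counter_before_restart = counter + count_in_window(event_days, first_boot + 1, 37)

# Restart on day 39: the window now covers days 9..39, so the 5 early
# grabs have fallen out of it.
restart_day = 39
counter_after_restart = count_in_window(
    event_days, restart_day - WINDOW, restart_day
)
```

Here the exported value goes 500, then 520, then 515 after the restart, even though no grab was ever removed. That drop is what prometheus reads as a reset.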

What it means

This means that if we add a backfill duration, it will be extremely important to only run the backfill once, and then turn off the backfill option, or you'll see strange results in your metrics. That honestly worries me a little: it's the type of side-effect that's really easy for folks to miss, and it would mess up their metrics. The risk could be reduced by keeping a small amount of external state, but the trade-offs don't seem right (caching the counter value to a file adds unnecessary i/o, and putting it anywhere else adds unnecessary complexity).

So the actual implementation here isn't particularly difficult (probably --enable-backfill and --backfill-period flags for the prowlarr subcommand), but I'm not sure I like the side-effects.

TL;DR

When a counter decreases, prometheus assumes it was reset to zero, and treats any non-zero value after that as a new count. With partial backfill, each restart can cause the counter to decrease, leading to bad metrics in prometheus. We need to figure out how to handle this first.

from exportarr.

rtrox commented on August 15, 2024

Was thinking about this a bit this morning -- one option might be to have a "backfill-since-date" flag, so that the backfill is consistent. This still isn't fire & forget -- at some point that first query may get long enough that it causes timeouts, but it wouldn't cause broken metrics, at least.

Y0ngg4n commented on August 15, 2024

@rtrox I think it would be better not to specify a fixed date. It would be better to calculate that date from a flag like "90 days backwards".

rtrox commented on August 15, 2024

@Y0ngg4n I agree that would be a better user experience, but it will lead to broken metrics -- see the problem I outlined above. With a duration window as you describe, when exportarr is restarted, that window will point to a different start date. As such, the actual count after restart can be lower than before restart. Prometheus will interpret this as a counter reset, and assume that that entire count is "new". So if the counter prior to restart is 1000, and the 90 day backfill after restart comes up with 900 as a count, and you query increase(<metric>), prometheus will return that the counter increased by 900, and assume that it has a total value of 1900, despite there being no activity after the restart.

So using a rolling window is a non-starter, given how prowlarr & prometheus each behave. A fixed date solves this problem, but means that the operator will have to remember to disable backfill, or at some point the backfill query may get long enough that the exporter times out after a restart. I'm not a huge fan of this either, but the failure mode has fewer implications, as it doesn't create broken data.
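
To illustrate the difference between the two options, here is a minimal sketch of how the start of the backfill query would be resolved in each case. Everything in it is hypothetical: `backfill_start`, `since_date`, and `period` are illustrative names standing in for the flags discussed here, and the dates are made up.

```python
from datetime import datetime, timedelta, timezone


def backfill_start(now, since_date=None, period=None):
    """Return the earliest timestamp a backfill query would cover.

    Exactly one of since_date (fixed date) or period (rolling window)
    must be given. Both are hypothetical parameters for illustration.
    """
    if (since_date is None) == (period is None):
        raise ValueError("pass exactly one of since_date or period")
    if since_date is not None:
        return since_date   # fixed: identical on every restart
    return now - period     # rolling: moves forward with every restart


first_boot = datetime(2024, 8, 1, tzinfo=timezone.utc)
restart = first_boot + timedelta(days=9)

fixed = datetime(2024, 5, 1, tzinfo=timezone.utc)
window = timedelta(days=90)

fixed_starts = [backfill_start(t, since_date=fixed) for t in (first_boot, restart)]
rolling_starts = [backfill_start(t, period=window) for t in (first_boot, restart)]
```

With the fixed date, both boots query from the same instant, so the backfilled count can only stay the same or grow. With the rolling window, the second boot starts 9 days later than the first, and any grabs in that gap drop out of the count.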

Y0ngg4n commented on August 15, 2024

@rtrox ah yeah makes sense
