prometheus / prometheus

The Prometheus monitoring system and time series database.

Home Page: https://prometheus.io/

License: Apache License 2.0

Makefile 0.07% Go 90.63% HTML 0.45% CSS 0.05% JavaScript 0.41% Shell 0.26% Lex 0.09% Dockerfile 0.02% TypeScript 7.27% Yacc 0.48% SCSS 0.27%
monitoring metrics alerting graphing time-series prometheus hacktoberfest

prometheus's Issues

Adjust return signature of GetBoundaryValues() metric persistence method

Current:

GetBoundaryValues(*model.LabelSet, *model.Interval, *StalenessPolicy) (*model.Sample, *model.Sample, error)

The return value is hard to use because the caller needs to manually match label sets between the first and second return value, e.g. for computing deltas. And there is nothing in the return types themselves that ensures the labels even match.

So I think it should be the same as the GetRangeValues() return value:

GetBoundaryValues(*model.LabelSet, *model.Interval, *StalenessPolicy) (*model.SampleSet, error)

Downside: it's not explicit from the return type that it contains exactly two datapoints in each timeseries, but it's probably better than introducing yet another special type that ensures that.
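
A sketch of how this could look in the MetricPersistence interface (the interface context and the GetRangeValues() parameter list are assumptions; only the GetBoundaryValues() signatures are taken from above):

// Illustrative sketch only. The proposed GetBoundaryValues() mirrors
// GetRangeValues(), so both boundary values of each timeseries arrive
// grouped under their labels in a SampleSet.
type MetricPersistence interface {
    // Current form, for comparison (caller must match up the two samples):
    // GetBoundaryValues(*model.LabelSet, *model.Interval, *StalenessPolicy) (*model.Sample, *model.Sample, error)

    // Proposed form:
    GetBoundaryValues(*model.LabelSet, *model.Interval, *StalenessPolicy) (*model.SampleSet, error)
    GetRangeValues(*model.LabelSet, *model.Interval, *StalenessPolicy) (*model.SampleSet, error)
}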

Remote storage

Prometheus needs to be able to interface with a remote and scalable data store for long-term storage/retrieval.

Investigate supporting arrays as label values

This would be useful for e.g. storing the list of roles attached to a host and then querying by a single role. I'm not sure whether supporting this is worth the time and complexity, though.

Implement Additional Store of Metric Counter Resets

If we had a mediator around the storage system, we could easily track counter resets with respect to metric values. This would avoid many range (a, b] queries in favor of just querying the endpoints.
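
A rough sketch of such a mediator, with all names hypothetical (this is not existing Prometheus code):

// Hypothetical sketch: a small wrapper on the append path that records a
// reset whenever a counter's value decreases, so (a, b] range queries can
// later be answered from the interval endpoints plus the recorded resets.
type counterResetTracker struct {
    lastValue map[string]float64     // last value seen, keyed by series identity
    resets    map[string][]time.Time // recorded reset times per series
}

func (t *counterResetTracker) observe(seriesKey string, ts time.Time, value float64) {
    if last, seen := t.lastValue[seriesKey]; seen && value < last {
        t.resets[seriesKey] = append(t.resets[seriesKey], ts)
    }
    t.lastValue[seriesKey] = value
}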

Parameterize LevelDB Storage Behaviors

  1. Synchronous I/O should be a static flag for all LevelDB persistence engines. This should default to true for the time being.
  2. LRU cache size should be a flag for each LevelDB we use. We can discuss the safe defaults.
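
For illustration, the two settings could be exposed roughly like this (flag names and defaults here are assumptions, not actual Prometheus flags):

package main

import "flag"

var (
    // Synchronous I/O, shared by all LevelDB persistence engines; defaults to true.
    leveldbUseSynchronousIO = flag.Bool("leveldbUseSynchronousIO", true,
        "Whether LevelDB writes are performed with synchronous I/O.")
    // LRU block cache size, configurable per LevelDB instance.
    leveldbCacheSizeBytes = flag.Int("leveldbCacheSizeBytes", 32*1024*1024,
        "Size in bytes of the LRU cache for each LevelDB instance.")
)

func main() {
    flag.Parse()
    // ... pass the parsed values into each LevelDB persistence engine ...
}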

Expression browser

Implement an expression browser via a web form (user inputs a rule language expression and gets back the evaluated result).

Data model optimization

Let's take another look at optimizing the Prometheus data models after our first experiments.

Implement Links between Graph and Expression Browser Pages

In the Graph Page

  • "View this Graph in the Expression Browser"

In the Expression Browser

  • "View this Expression as a Graph" link. @juliusv can offer some insights into runtime checks to ensure that the right kind of expressions are only allowed this.
  • Use heuristics from the AST to create node-level links of expressions such that these sub expressions can be graphed.

Change MetricPersistence interface to query values by fingerprint instead of by metric

It might make sense to change the interface of e.g. GetValueAtTime(), GetBoundaryValues(), and GetRangeValues() to expect a fingerprint instead of a metric labelset.

The AST currently starts off knowing a labelset, then gets all fingerprints for that, then gets the metrics for those fingerprints, then gets the values for each of those metrics.

It could just be: get all fingerprints for the label set, then fetch values for each fingerprint. One conversion step fewer.
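
Sketched out, the interface would move in this direction (exact parameter and result types are assumptions; only the method names come from the issue):

// Illustrative sketch of the proposed fingerprint-based interface.
type MetricPersistence interface {
    // Resolve the label set to fingerprints once ...
    GetFingerprintsForLabelSet(model.LabelSet) ([]*model.Fingerprint, error)

    // ... then query values by fingerprint directly instead of by metric label set.
    GetValueAtTime(*model.Fingerprint, time.Time, *StalenessPolicy) (*model.Sample, error)
    GetBoundaryValues(*model.Fingerprint, *model.Interval, *StalenessPolicy) (*model.SampleSet, error)
    GetRangeValues(*model.Fingerprint, *model.Interval, *StalenessPolicy) (*model.SampleSet, error)
}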

Rule formatting tool.

Like "gofmt" for Go, we ought to have a "promfmt" for Prometheus since we have a syntax tree. The idea being that the system produces uniform style that minimizes deviation and learning curve.

Update after we have totally moved to YAML rule files: In addition to formatting the PromQL expressions, we also want to format the YAML files to have a fixed structure, while preserving comments for both PromQL expressions and the YAML file.

GetBoundaryValues() and GetRangeValues() should return labels in SampleSets

I've only tested GetRangeValues() so far, but maybe GetBoundaryValues() has the same behavior. The SampleSets that get returned have a nil-map as the Metric member, whereas they should probably contain the labels that the function was called with, to yield a proper timeseries.

It's not a big problem right now, because the caller has the right labels anyway and can insert them. However, I'm not sure if that's intended.

Make graphs linkable

Graphs should be linkable via their current URL from the browser address bar.

Remove temporal aliasing from rate() and delta() functions.

The rate() and delta() functions should consider the times of the actual first and last samples within an interval versus the desired begin and end times of the interval, and compensate for any temporal aliasing that occurs when the graphing resolution is not a multiple of the recorded sample resolution.
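
The compensation boils down to scaling the raw difference by the ratio of the requested interval to the span the samples actually cover. A sketch, with an invented helper rather than the real implementation:

// Illustrative only: scale the observed difference by requestedSpan/sampledSpan
// to undo the aliasing introduced when the first/last samples do not sit exactly
// on the interval boundaries.
func compensatedDelta(firstValue, lastValue float64, firstTime, lastTime, intervalStart, intervalEnd time.Time) float64 {
    rawDelta := lastValue - firstValue
    sampledSpan := lastTime.Sub(firstTime).Seconds()
    requestedSpan := intervalEnd.Sub(intervalStart).Seconds()
    if sampledSpan <= 0 {
        return 0
    }
    return rawDelta * (requestedSpan / sampledSpan)
}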

Invalid iterator crash bug in newSeriesFrontier()

With my expressions benchmark (living in branch "julius-metrics-persistence-benchmarks"), I managed to provoke the following crash in newSeriesFrontier():

$ go run -a expressions_benchmark.go --leveldbFlushOnMutate=false -numTimeseries=10 -populateStorage=true -deleteStorage=true -evalIntervalSeconds=3600 > /tmp/foo.txt
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x18 pc=0x7f9e61c48254]

goroutine 1 [select]:
github.com/prometheus/prometheus/storage/metric.(*tieredStorage).MakeView(0xf840000e00, 0xf8445f8040, 0xf8420e0140, 0xdf8475800, 0x0, ...)
    /home/julius/gosrc/src/github.com/prometheus/prometheus/storage/metric/tiered.go:135 +0x34b
github.com/prometheus/prometheus/rules/ast.viewAdapterForRangeQuery(0xf8400bee00, 0xf8420e0f80, 0x0, 0x0, 0x0, ...)
    /home/julius/gosrc/src/github.com/prometheus/prometheus/rules/ast/query_analyzer.go:138 +0x480
github.com/prometheus/prometheus/rules/ast.EvalVectorRange(0xf844652ac0, 0xf8420e0f80, 0x0, 0x0, 0x0, ...)
    /home/julius/gosrc/src/github.com/prometheus/prometheus/rules/ast/ast.go:274 +0xff
main.doBenchmark(0x6b5ce4, 0xf800000008)
    /home/julius/gosrc/src/github.com/prometheus/prometheus/expressions_benchmark.go:113 +0x45c
main.main()
    /home/julius/gosrc/src/github.com/prometheus/prometheus/expressions_benchmark.go:153 +0x37a

goroutine 2 [syscall]:
created by runtime.main
    /home/julius/go/src/pkg/runtime/proc.c:221

goroutine 9 [syscall]:
github.com/jmhodges/levigo._Cfunc_leveldb_iter_key(0x7f9e480017e0, 0xf843f274e0)
    github.com/jmhodges/levigo/_obj/_cgo_defun.c:178 +0x2f
github.com/jmhodges/levigo.(*Iterator).Key(0xf843f274a0, 0x746920410000000f, 0x7f9e620038a0, 0x100000001)
    github.com/jmhodges/levigo/_obj/batch.cgo1.go:519 +0x44
github.com/prometheus/prometheus/storage/raw/leveldb.levigoIterator.Key(0xf843f274a0, 0xf843f27498, 0xf843f27490, 0xf84009e820, 0x0, ...)
    /home/julius/gosrc/src/github.com/prometheus/prometheus/storage/raw/leveldb/leveldb.go:121 +0xe8
github.com/prometheus/prometheus/storage/raw/leveldb.(*levigoIterator).Key(0xf84195e060, 0x0, 0x0, 0x0)
    /home/julius/gosrc/src/github.com/prometheus/prometheus/storage/raw/leveldb/batch.go:0 +0x8c
github.com/prometheus/prometheus/storage/metric.extractSampleKey(0xf844bbeb40, 0xf84195e060, 0xf844652480, 0x0, 0x0, ...)
    /home/julius/gosrc/src/github.com/prometheus/prometheus/storage/metric/leveldb.go:692 +0xa4
github.com/prometheus/prometheus/storage/metric.newSeriesFrontier(0xf8400d6000, 0xf8400d9b40, 0xf8400d6000, 0xf84195e000, 0x0, ...)
    /home/julius/gosrc/src/github.com/prometheus/prometheus/storage/metric/frontier.go:147 +0x7ee
github.com/prometheus/prometheus/storage/metric.(*tieredStorage).renderView(0xf840000e00, 0xf8445f8040, 0xf8420e0140, 0xf840f596c0)
    /home/julius/gosrc/src/github.com/prometheus/prometheus/storage/metric/tiered.go:384 +0x444
github.com/prometheus/prometheus/storage/metric.(*tieredStorage).Serve(0xf840000e00, 0x0)
    /home/julius/gosrc/src/github.com/prometheus/prometheus/storage/metric/tiered.go:181 +0x143
created by main.main
    /home/julius/gosrc/src/github.com/prometheus/prometheus/expressions_benchmark.go:139 +0x292

goroutine 10 [syscall]:
created by addtimer
    /home/julius/go/src/pkg/runtime/ztime_amd64.c:72
exit status 2

The culprit is the line in newSeriesFrontier() where we rewind the iterator even though it may already be pointing at the first element on disk.

Please add logic to prevent this as well as a regression test.

Harden Closing Behavior

Upon close or close request, …

  1. Prometheus should go into a drain mode immediately whereby no further retrievals or queries are answered.
  2. Once in drain mode, it should flush all pending metrics for appending into the storage infrastructure.
  3. The storage infrastructure should then begin a flush procedure of its own, e.g. moving in-memory values to the on-disk LevelDB store. After this step is finished, we should be safe to shut down.
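
In sketch form, with all receiver and method names purely illustrative:

// Illustrative shutdown sequence; none of these identifiers are real Prometheus code.
func (p *prometheus) Close() {
    // 1. Drain: stop answering scrapes and queries.
    p.enterDrainMode()

    // 2. Flush all samples still queued for appending into storage.
    p.flushPendingSamples()

    // 3. Let the storage layer move in-memory values into the on-disk LevelDB
    //    store; only then is it safe to exit.
    p.storage.Flush()
    p.storage.Close()
}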

Item no. 3 is pertinent now, as it is possible, though unlikely, to introduce metric index corruption, and we do not have any tools to perform referential integrity checks on the LevelDB storage. Quick example:

  1. Metric and sample are requested to be appended.
  2. LevelDB storage checks indices for metric.
  3. No index element is found; it creates a preliminary one for label name and label value pairs.
  4. Finally an index is made for the entire metric.
  5. Sample is stored.

The ordering for no. 2 and no. 3 may be wrong; but although this mutation process is idempotent, we would never set the fingerprints for the new metric correctly.

Proposal:

  1. Fix the problem as I described above.
  2. Create an offline referential integrity scanner and repair utility. This would not take too long to do and would simply require the LevelDB iterator type and the model decoders.

Ruby client

We need a Ruby version of the client instrumentation library.

Support User-Provided Static Asset Serving Directory

What we have right now for dashboard generation is good for ad hoc sharing but does not support good long-term persistent dashboard use cases where additional visual elements or metadata may be required.

Thus, I would like to envision a world where …

  1. A precompiled Prometheus binary could be offered to teams, possibly packaged as a self-contained archive file with all external dependencies: the binary, the compiled-in blob assets, required shared libraries, and such. We're basically there already with the new build system.
  2. A team can take one of these packages mentioned above and vendor it to include a set of static assets that they would like served with their Prometheus. For instance, a custom dashboard with associated templates, HTML, CSS, JS, you-name-it.

./prometheus --userAssets=/path/to/asset/root

/path/to/asset/root may contain

  1. index.html or index.html.tmpl, which is used as the root handler for http://prometheus.host/user.
  2. Go template files, which Go will evaluate and interpolate into rendered content for a list of publicly-defined (via contract) variables.
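
A minimal sketch of the wiring (the flag name follows the invocation above; the route and handler are assumptions):

package web

import (
    "flag"
    "net/http"
)

// Sketch only: serve a user-provided asset directory under /user.
var userAssetsPath = flag.String("userAssets", "",
    "Path to a directory of user-provided assets to serve under /user.")

func registerUserAssets(mux *http.ServeMux) {
    if *userAssetsPath == "" {
        return
    }
    // index.html at the asset root becomes the content behind /user/;
    // template evaluation of *.tmpl files would hook in here as well.
    mux.Handle("/user/", http.StripPrefix("/user/",
        http.FileServer(http.Dir(*userAssetsPath))))
}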

/CC: @discordianfish and @juliusv

Expression evaluation code is not goroutine-safe

To reproduce, create a small program which launches concurrent requests to prometheus/api/query?json=JSON&expr=<anything>. Each request will have one of three possible outcomes:

  1. success (lucky you!)
  2. "Error parsing rules at line X, char Y: syntax error"
  3. crashing Prometheus with a slice-out-of-bounds panic
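
A small reproduction program along those lines (host, port, and expression are placeholders):

package main

import (
    "fmt"
    "io"
    "net/http"
    "net/url"
    "sync"
)

// Fires concurrent query requests to provoke the race described above.
func main() {
    base := "http://localhost:9090/api/query?json=JSON&expr="
    expr := url.QueryEscape("targets_healthy_scrape_latency_ms")

    var wg sync.WaitGroup
    for i := 0; i < 50; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            resp, err := http.Get(base + expr)
            if err != nil {
                fmt.Println("request error:", err)
                return
            }
            defer resp.Body.Close()
            body, _ := io.ReadAll(resp.Body)
            fmt.Println(resp.Status, len(body), "bytes")
        }()
    }
    wg.Wait()
}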

For the record, the panic has this stacktrace:

rules/lexer.l.go:196 (0x44c886)
rules/load.go:51 (0x44ce7e)
rules/parser.y.go:192 (0x44debf)
rules/parser.y.go:265 (0x450c0b)
rules/load.go:75 (0x44d177)
rules/load.go:116 (0x44d634)
rules/load.go:125 (0x44d746)
web/api/query.go:30 (0x5461d5)
web/api/api.go:0 (0x547565)

Simple graph generation interface

For a first demo, we need a rudimentary graph generation interface. It can be a simple web form that allows the input of a metric, labels, and time range and shows a graph accordingly.

Targetpools scrape only first target in pool

Target pools scrape only the first target that was added to the pool (at pool creation time). I haven't figured out the exact problem yet, but maybe something is wrong with the heap handling? I've verified that Add() is called correctly for all targets on the right pools, but after that, in the actual runs, p.Len() is always just 1.

I'll send a minimal config with multiple targets via mail.

Implement UI for graph end-time selection

Currently we only allow choosing a range back in time from a given Unix timestamp as the graph ending time. The interface should instead have arrows that allow skipping back/forwards in time by smart units.

Incorporate Data Resampling and Destruction Policy

The datastore grows ad infinitum right now. We need a couple of capabilities:

  1. The capability to specify a reduction policy with …
    • an interval of a given size (e.g., one hour),
    • a reduction method (e.g., mean, median, minimum, maximum), and
    • a predicate on sample timestamps (e.g., older than one day from now).
  2. A reduction policy should be specifiable on …
    • a global basis (e.g., a median is OK for most things), and
    • a per-metric basis (e.g., downsample input pertaining to an SLA with the most pessimistic method, like a minimum or maximum).
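
As a sketch, such a policy could be described with something like the following (types and field names are invented for illustration, using time.Duration for the intervals):

// Invented types for illustration; not an existing Prometheus configuration.
type ReductionMethod string

const (
    Mean    ReductionMethod = "mean"
    Median  ReductionMethod = "median"
    Minimum ReductionMethod = "minimum"
    Maximum ReductionMethod = "maximum"
)

// ReductionPolicy downsamples matching samples older than OlderThan into one
// value per Interval, using Method. Metric is empty for the global policy or a
// metric name for a per-metric override.
type ReductionPolicy struct {
    Interval  time.Duration   // e.g. time.Hour
    Method    ReductionMethod // e.g. Median globally, Minimum for SLA metrics
    OlderThan time.Duration   // e.g. 24 * time.Hour
    Metric    string
}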

GetFingerprintsForLabelSet() does OR, not AND on labels

The expression:

targets_healthy_scrape_latency_ms{percentile="0.010000"}

Yields 12 vector elements where it should only yield one:

request_metrics_latency_equal_tallying_microseconds{instance='http://localhost:9090/metrics.json',percentile='0.010000'} => 657 @[2013-01-13 05:25:59.236345 +0100 CET]
requests_metrics_latency_equal_accumulating_microseconds{instance='http://localhost:9090/metrics.json',percentile='0.010000'} => 657 @[2013-01-13 05:25:59.236345 +0100 CET]
requests_metrics_latency_logarithmic_accumulating_microseconds{instance='http://localhost:9090/metrics.json',percentile='0.010000'} => 657 @[2013-01-13 05:25:59.236345 +0100 CET]
requests_metrics_latency_logarithmic_tallying_microseconds{instance='http://localhost:9090/metrics.json',percentile='0.010000'} => 657 @[2013-01-13 05:25:59.236345 +0100 CET]
sample_append_disk_latency_microseconds{instance='http://localhost:9090/metrics.json',percentile='0.010000'} => 725 @[2013-01-13 05:25:59.236345 +0100 CET]
targets_healthy_scrape_latency_ms{instance='http://localhost:9090/metrics.json',percentile='0.010000'} => 1.751 @[2013-01-13 05:25:59.236345 +0100 CET]
targets_healthy_scrape_latency_ms{instance='http://localhost:9090/metrics.json',percentile='0.010000'} => 1.751 @[2013-01-13 05:25:59.236345 +0100 CET]
targets_healthy_scrape_latency_ms{instance='http://localhost:9090/metrics.json',percentile='0.050000'} => 1.751 @[2013-01-13 05:25:59.236345 +0100 CET]
targets_healthy_scrape_latency_ms{instance='http://localhost:9090/metrics.json',percentile='0.500000'} => 1.751 @[2013-01-13 05:25:59.236345 +0100 CET]
targets_healthy_scrape_latency_ms{instance='http://localhost:9090/metrics.json',percentile='0.900000'} => 1.751 @[2013-01-13 05:25:59.236345 +0100 CET]
targets_healthy_scrape_latency_ms{instance='http://localhost:9090/metrics.json',percentile='0.990000'} => 1.751 @[2013-01-13 05:25:59.236345 +0100 CET]
targets_unhealthy_scrape_latency_ms{instance='http://localhost:9090/metrics.json',percentile='0.010000'} => NaN @[2013-01-13 05:25:59.236345 +0100 CET]

This is an OR of the labels, not an AND (all timeseries that match either the name of the metric OR the percentile value). The bug is in GetFingerprintsForLabelSet():

fmt.Printf("===========> %v\n", labels);
fingerprints, err := p.persistence.GetFingerprintsForLabelSet(&labels)
fmt.Printf("===========> %v\n", fingerprints);

This outputs:

===========> map[percentile:0.010000 name:targets_healthy_scrape_latency_ms]
===========> [0xf8402fd650 0xf8402fd660 0xf8402fd680 0xf8402fd690 0xf8402fd6a0 0xf8402fd750 0xf8402fd760 0xf8402fd770 0xf8402fd780 0xf8402fd790 0xf8402fd7a0 0xf8402fd7b0]

Note the 12 fingerprints where there should only be one!

The reason is that GetFingerprintsForLabelSet() steps through all labels and fetches the matching metrics for each label, thus resulting in an OR.
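
The fix should be to intersect the per-label fingerprint sets instead of unioning them. A sketch of the intended AND semantics (the per-pair lookup helper and the exact types are assumptions):

// Illustrative sketch: keep only fingerprints matched by every label pair.
func getFingerprintsForLabelSet(labels model.LabelSet) ([]model.Fingerprint, error) {
    var intersection map[model.Fingerprint]bool // assumes Fingerprint is usable as a map key
    for name, value := range labels {
        fps, err := fingerprintsForLabelPair(name, value) // hypothetical per-pair index lookup
        if err != nil {
            return nil, err
        }
        current := map[model.Fingerprint]bool{}
        for _, fp := range fps {
            current[fp] = true
        }
        if intersection == nil {
            intersection = current
            continue
        }
        for fp := range intersection {
            if !current[fp] { // not matched by this label pair, too: drop it
                delete(intersection, fp)
            }
        }
    }
    result := make([]model.Fingerprint, 0, len(intersection))
    for fp := range intersection {
        result = append(result, fp)
    }
    return result, nil
}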
