prometheus / prometheus

The Prometheus monitoring system and time series database.

Home Page: https://prometheus.io/

License: Apache License 2.0

Makefile 0.07% Go 90.63% HTML 0.45% CSS 0.05% JavaScript 0.41% Shell 0.26% Lex 0.09% Dockerfile 0.02% TypeScript 7.27% Yacc 0.48% SCSS 0.27%
monitoring metrics alerting graphing time-series prometheus hacktoberfest

prometheus's Issues

Adjust return signature of GetBoundaryValues() metric persistence method

Current:

GetBoundaryValues(*model.LabelSet, *model.Interval, *StalenessPolicy) (*model.Sample, *model.Sample, error)

The return value is hard to use because the caller needs to manually match label sets between the first and second return value, e.g. for computing deltas. And there is nothing in the return types themselves that ensures the labels even match.

So I think it should be the same as the GetRangeValues() return value:

GetBoundaryValues(*model.LabelSet, *model.Interval, *StalenessPolicy) (*model.SampleSet, error)

Downside: it's not explicit from the return type that it contains exactly two datapoints in each timeseries, but it's probably better than introducing yet another special type that ensures that.
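
A sketch of how this could look in the MetricPersistence interface (the interface context and the GetRangeValues() parameter list are assumptions; only the GetBoundaryValues() signatures are taken from above):

// Illustrative sketch only. The proposed GetBoundaryValues() mirrors
// GetRangeValues(), so both boundary values of each timeseries arrive
// grouped under their labels in a SampleSet.
type MetricPersistence interface {
    // Current form, for comparison (caller must match up the two samples):
    // GetBoundaryValues(*model.LabelSet, *model.Interval, *StalenessPolicy) (*model.Sample, *model.Sample, error)

    // Proposed form:
    GetBoundaryValues(*model.LabelSet, *model.Interval, *StalenessPolicy) (*model.SampleSet, error)
    GetRangeValues(*model.LabelSet, *model.Interval, *StalenessPolicy) (*model.SampleSet, error)
}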

Remote storage

Prometheus needs to be able to interface with a remote and scalable data store for long-term storage/retrieval.

Investigate supporting arrays as label values

This would be useful for e.g. storing the list of roles attached to a host and then querying by a single role. I'm not sure whether supporting this is worth the time and complexity, though.

Implement Additional Store of Metric Counter Resets

If we had a mediator around the storage system, we could easily track counter resets with respect to metric values. This would avoid many range (a, b] queries in favor of just querying the endpoints.
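
A rough sketch of such a mediator, with all names hypothetical (this is not existing Prometheus code):

// Hypothetical sketch: a small wrapper on the append path that records a
// reset whenever a counter's value decreases, so (a, b] range queries can
// later be answered from the interval endpoints plus the recorded resets.
type counterResetTracker struct {
    lastValue map[string]float64     // last value seen, keyed by series identity
    resets    map[string][]time.Time // recorded reset times per series
}

func (t *counterResetTracker) observe(seriesKey string, ts time.Time, value float64) {
    if last, seen := t.lastValue[seriesKey]; seen && value < last {
        t.resets[seriesKey] = append(t.resets[seriesKey], ts)
    }
    t.lastValue[seriesKey] = value
}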

Parameterize LevelDB Storage Behaviors

  1. Synchronous I/O should be a static flag for all LevelDB persistence engines. This should default to true for the time being.
  2. LRU cache size should be a flag for each LevelDB we use. We can discuss the safe defaults.
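
For illustration, the two settings could be exposed roughly like this (flag names and defaults here are assumptions, not actual Prometheus flags):

package main

import "flag"

var (
    // Synchronous I/O, shared by all LevelDB persistence engines; defaults to true.
    leveldbUseSynchronousIO = flag.Bool("leveldbUseSynchronousIO", true,
        "Whether LevelDB writes are performed with synchronous I/O.")
    // LRU block cache size, configurable per LevelDB instance.
    leveldbCacheSizeBytes = flag.Int("leveldbCacheSizeBytes", 32*1024*1024,
        "Size in bytes of the LRU cache for each LevelDB instance.")
)

func main() {
    flag.Parse()
    // ... pass the parsed values into each LevelDB persistence engine ...
}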

Expression browser

Implement an expression browser via a web form (user inputs a rule language expression and gets back the evaluated result).

Data model optimization

Let's take another look at optimizing the Prometheus data models after our first experiments.

Implement Links between Graph and Expression Browser Pages

In the Graph Page

  • "View this Graph in the Expression Browser"

In the Expression Browser

  • "View this Expression as a Graph" link. @juliusv can offer some insights into runtime checks to ensure that the right kind of expressions are only allowed this.
  • Use heuristics from the AST to create node-level links of expressions such that these sub expressions can be graphed.

Change MetricPersistence interface to query values by fingerprint instead of by metric

It might make sense to change the interface of e.g. GetValueAtTime(), GetBoundaryValues(), and GetRangeValues() to expect a fingerprint instead of a metric labelset.

The AST currently starts off knowing a labelset, then gets all fingerprints for that, then gets the metrics for those fingerprints, then gets the values for each of those metrics.

It could just be: get all fingerprints for the label set, then fetch values for each fingerprint. One conversion step fewer.
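
Sketched out, the interface would move in this direction (exact parameter and result types are assumptions; only the method names come from the issue):

// Illustrative sketch of the proposed fingerprint-based interface.
type MetricPersistence interface {
    // Resolve the label set to fingerprints once ...
    GetFingerprintsForLabelSet(model.LabelSet) ([]*model.Fingerprint, error)

    // ... then query values by fingerprint directly instead of by metric label set.
    GetValueAtTime(*model.Fingerprint, time.Time, *StalenessPolicy) (*model.Sample, error)
    GetBoundaryValues(*model.Fingerprint, *model.Interval, *StalenessPolicy) (*model.SampleSet, error)
    GetRangeValues(*model.Fingerprint, *model.Interval, *StalenessPolicy) (*model.SampleSet, error)
}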

Rule formatting tool.

Like "gofmt" for Go, we ought to have a "promfmt" for Prometheus since we have a syntax tree. The idea being that the system produces uniform style that minimizes deviation and learning curve.

Update after we have totally moved to YAML rule files: In addition to formatting the PromQL expressions, we also want to format the YAML files to have a fixed structure, while preserving comments for both PromQL expressions and the YAML file.

GetBoundaryValues() and GetRangeValues() should return labels in SampleSets

I've only tested GetRangeValues() so far, but maybe GetBoundaryValues() has the same behavior. The SampleSets that get returned have a nil-map as the Metric member, whereas they should probably contain the labels that the function was called with, to yield a proper timeseries.

It's not a big problem right now, because the caller has the right labels anyway and can insert them. However, I'm not sure if that's intended.

Make graphs linkable

Graphs should be linkable via their current URL from the browser address bar.

Remove temporal aliasing from rate() and delta() functions.

The rate() and delta() functions should consider the times of the actual first and last samples within an interval versus the desired begin and end times of the interval, and compensate for any temporal aliasing that occurs when the graphing resolution is not a multiple of the recorded sample resolution.
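
The compensation boils down to scaling the raw difference by the ratio of the requested interval to the span the samples actually cover. A sketch, with an invented helper rather than the real implementation:

// Illustrative only: scale the observed difference by requestedSpan/sampledSpan
// to undo the aliasing introduced when the first/last samples do not sit exactly
// on the interval boundaries.
func compensatedDelta(firstValue, lastValue float64, firstTime, lastTime, intervalStart, intervalEnd time.Time) float64 {
    rawDelta := lastValue - firstValue
    sampledSpan := lastTime.Sub(firstTime).Seconds()
    requestedSpan := intervalEnd.Sub(intervalStart).Seconds()
    if sampledSpan <= 0 {
        return 0
    }
    return rawDelta * (requestedSpan / sampledSpan)
}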

Invalid iterator crash bug in newSeriesFrontier()

With my expressions benchmark (living in branch "julius-metrics-persistence-benchmarks"), I managed to provoke the following crash in newSeriesFrontier():

$ go run -a expressions_benchmark.go --leveldbFlushOnMutate=false -numTimeseries=10 -populateStorage=true -deleteStorage=true -evalIntervalSeconds=3600 > /tmp/foo.txt
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x18 pc=0x7f9e61c48254]

goroutine 1 [select]:
github.com/prometheus/prometheus/storage/metric.(*tieredStorage).MakeView(0xf840000e00, 0xf8445f8040, 0xf8420e0140, 0xdf8475800, 0x0, ...)
    /home/julius/gosrc/src/github.com/prometheus/prometheus/storage/metric/tiered.go:135 +0x34b
github.com/prometheus/prometheus/rules/ast.viewAdapterForRangeQuery(0xf8400bee00, 0xf8420e0f80, 0x0, 0x0, 0x0, ...)
    /home/julius/gosrc/src/github.com/prometheus/prometheus/rules/ast/query_analyzer.go:138 +0x480
github.com/prometheus/prometheus/rules/ast.EvalVectorRange(0xf844652ac0, 0xf8420e0f80, 0x0, 0x0, 0x0, ...)
    /home/julius/gosrc/src/github.com/prometheus/prometheus/rules/ast/ast.go:274 +0xff
main.doBenchmark(0x6b5ce4, 0xf800000008)
    /home/julius/gosrc/src/github.com/prometheus/prometheus/expressions_benchmark.go:113 +0x45c
main.main()
    /home/julius/gosrc/src/github.com/prometheus/prometheus/expressions_benchmark.go:153 +0x37a

goroutine 2 [syscall]:
created by runtime.main
    /home/julius/go/src/pkg/runtime/proc.c:221

goroutine 9 [syscall]:
github.com/jmhodges/levigo._Cfunc_leveldb_iter_key(0x7f9e480017e0, 0xf843f274e0)
    github.com/jmhodges/levigo/_obj/_cgo_defun.c:178 +0x2f
github.com/jmhodges/levigo.(*Iterator).Key(0xf843f274a0, 0x746920410000000f, 0x7f9e620038a0, 0x100000001)
    github.com/jmhodges/levigo/_obj/batch.cgo1.go:519 +0x44
github.com/prometheus/prometheus/storage/raw/leveldb.levigoIterator.Key(0xf843f274a0, 0xf843f27498, 0xf843f27490, 0xf84009e820, 0x0, ...)
    /home/julius/gosrc/src/github.com/prometheus/prometheus/storage/raw/leveldb/leveldb.go:121 +0xe8
github.com/prometheus/prometheus/storage/raw/leveldb.(*levigoIterator).Key(0xf84195e060, 0x0, 0x0, 0x0)
    /home/julius/gosrc/src/github.com/prometheus/prometheus/storage/raw/leveldb/batch.go:0 +0x8c
github.com/prometheus/prometheus/storage/metric.extractSampleKey(0xf844bbeb40, 0xf84195e060, 0xf844652480, 0x0, 0x0, ...)
    /home/julius/gosrc/src/github.com/prometheus/prometheus/storage/metric/leveldb.go:692 +0xa4
github.com/prometheus/prometheus/storage/metric.newSeriesFrontier(0xf8400d6000, 0xf8400d9b40, 0xf8400d6000, 0xf84195e000, 0x0, ...)
    /home/julius/gosrc/src/github.com/prometheus/prometheus/storage/metric/frontier.go:147 +0x7ee
github.com/prometheus/prometheus/storage/metric.(*tieredStorage).renderView(0xf840000e00, 0xf8445f8040, 0xf8420e0140, 0xf840f596c0)
    /home/julius/gosrc/src/github.com/prometheus/prometheus/storage/metric/tiered.go:384 +0x444
github.com/prometheus/prometheus/storage/metric.(*tieredStorage).Serve(0xf840000e00, 0x0)
    /home/julius/gosrc/src/github.com/prometheus/prometheus/storage/metric/tiered.go:181 +0x143
created by main.main
    /home/julius/gosrc/src/github.com/prometheus/prometheus/expressions_benchmark.go:139 +0x292

goroutine 10 [syscall]:
created by addtimer
    /home/julius/go/src/pkg/runtime/ztime_amd64.c:72
exit status 2

The culprit is the line in newSeriesFrontier() where we rewind the iterator even though it may already be pointing at the first element on disk.

Please add logic to prevent this as well as a regression test.

Harden Closing Behavior

Upon close or close request, …

  1. Prometheus should go into a drain mode immediately whereby no further retrievals or queries are answered.
  2. Once in drain mode, it should flush all pending metrics for appending into the storage infrastructure.
  3. The storage infrastructure should then begin a flush procedure of its own, e.g. moving in-memory values to the on-disk LevelDB store. After this step is finished, we should be safe to shut down.
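
In sketch form, with all receiver and method names purely illustrative:

// Illustrative shutdown sequence; none of these identifiers are real Prometheus code.
func (p *prometheus) Close() {
    // 1. Drain: stop answering scrapes and queries.
    p.enterDrainMode()

    // 2. Flush all samples still queued for appending into storage.
    p.flushPendingSamples()

    // 3. Let the storage layer move in-memory values into the on-disk LevelDB
    //    store; only then is it safe to exit.
    p.storage.Flush()
    p.storage.Close()
}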

Item no. 3 is pertinent now, as it is possible, though unlikely, to introduce metric index corruption, and we do not have any tools to perform referential integrity checks on the LevelDB storage. Quick example:

  1. Metric and sample are requested to be appended.
  2. LevelDB storage checks indices for metric.
  3. No index element is found; it creates a preliminary one for label name and label value pairs.
  4. Finally an index is made for the entire metric.
  5. Sample is stored.

The ordering for no. 2 and no. 3 may be wrong; but although this mutation process is idempotent, we would never set the fingerprints for the new metric correctly.

Proposal:

  1. Fix the problem as I described above.
  2. Create an offline referential integrity scanner and repair utility. This would not take too long to do and would simply require the LevelDB iterator type and the model decoders.

Ruby client

We need a Ruby version of the client instrumentation library.

Support User-Provided Static Asset Serving Directory

What we have right now for dashboard generation is good for ad hoc sharing but does not support good long-term persistent dashboard use cases where additional visual elements or metadata may be required.

Thus, I would like to envision a world where …

  1. A precompiled Prometheus binary could be offered to teams, possibly packaged as a self-contained archive file with all external dependencies: the binary, the compiled-in blob assets, required shared libraries, and such. We're basically there already with the new build system.
  2. A team can take one of these packages mentioned above and vendor it to include a set of static assets that they would like served with their Prometheus. For instance, a custom dashboard with associated templates, HTML, CSS, JS, you-name-it.

./prometheus --userAssets=/path/to/asset/root

/path/to/asset/root may contain

  1. index.html or index.html.tmpl, which is used as the root handler for http://prometheus.host/user.
  2. Go template files, which Go will evaluate and interpolate into rendered content for a list of publicly-defined (via contract) variables.
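
A minimal sketch of the wiring (the flag name follows the invocation above; the route and handler are assumptions):

package web

import (
    "flag"
    "net/http"
)

// Sketch only: serve a user-provided asset directory under /user.
var userAssetsPath = flag.String("userAssets", "",
    "Path to a directory of user-provided assets to serve under /user.")

func registerUserAssets(mux *http.ServeMux) {
    if *userAssetsPath == "" {
        return
    }
    // index.html at the asset root becomes the content behind /user/;
    // template evaluation of *.tmpl files would hook in here as well.
    mux.Handle("/user/", http.StripPrefix("/user/",
        http.FileServer(http.Dir(*userAssetsPath))))
}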

/CC: @discordianfish and @juliusv

Expression evaluation code is not goroutine-safe

To reproduce, create a small program which launches concurrent requests to prometheus/api/query?json=JSON&expr=<anything>. Each request will have one of three possible outcomes:

  1. success (lucky you!)
  2. "Error parsing rules at line X, char Y: syntax error"
  3. crashing Prometheus with a slice-out-of-bounds panic
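
A small reproduction program along those lines (host, port, and expression are placeholders):

package main

import (
    "fmt"
    "io"
    "net/http"
    "net/url"
    "sync"
)

// Fires concurrent query requests to provoke the race described above.
func main() {
    base := "http://localhost:9090/api/query?json=JSON&expr="
    expr := url.QueryEscape("targets_healthy_scrape_latency_ms")

    var wg sync.WaitGroup
    for i := 0; i < 50; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            resp, err := http.Get(base + expr)
            if err != nil {
                fmt.Println("request error:", err)
                return
            }
            defer resp.Body.Close()
            body, _ := io.ReadAll(resp.Body)
            fmt.Println(resp.Status, len(body), "bytes")
        }()
    }
    wg.Wait()
}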

For the record, the panic has this stacktrace:

rules/lexer.l.go:196 (0x44c886)
rules/load.go:51 (0x44ce7e)
rules/parser.y.go:192 (0x44debf)
rules/parser.y.go:265 (0x450c0b)
rules/load.go:75 (0x44d177)
rules/load.go:116 (0x44d634)
rules/load.go:125 (0x44d746)
web/api/query.go:30 (0x5461d5)
web/api/api.go:0 (0x547565)

Simple graph generation interface

For a first demo, we need a rudimentary graph generation interface. It can be a simple web form that allows the input of a metric, labels, and time range and shows a graph accordingly.

Targetpools scrape only first target in pool

Target pools scrape only the first target that was added to the pool (at pool creation time). I haven't figured out the exact problem yet, but maybe something is wrong with the heap handling? I've verified that Add() is called correctly for all targets on the right pools, but after that, in the actual runs, p.Len() is always just 1.

I'll send a minimal config with multiple targets via mail.

Implement UI for graph end-time selection

Currently we only allow choosing a range back in time from a given Unix timestamp as the graph ending time. The interface should instead have arrows that allow skipping back/forwards in time by smart units.

Incorporate Data Resampling and Destruction Policy

The datastore grows ad infinitum right now. We need a couple of capabilities:

  1. The capability to specify a reduction policy with …
    • an interval of a given size (e.g., one hour),
    • a reduction method (e.g., mean, median, minimum, maximum), and
    • a predicate on sample timestamps (e.g., older than one day from now).
  2. A reduction policy should be specifiable on …
    • a global basis (e.g., a median is OK for most things), and
    • a per-metric basis (e.g., downsample input pertaining to an SLA with the most pessimistic method, like a minimum or maximum).
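
As a sketch, such a policy could be described with something like the following (types and field names are invented for illustration, using time.Duration for the intervals):

// Invented types for illustration; not an existing Prometheus configuration.
type ReductionMethod string

const (
    Mean    ReductionMethod = "mean"
    Median  ReductionMethod = "median"
    Minimum ReductionMethod = "minimum"
    Maximum ReductionMethod = "maximum"
)

// ReductionPolicy downsamples matching samples older than OlderThan into one
// value per Interval, using Method. Metric is empty for the global policy or a
// metric name for a per-metric override.
type ReductionPolicy struct {
    Interval  time.Duration   // e.g. time.Hour
    Method    ReductionMethod // e.g. Median globally, Minimum for SLA metrics
    OlderThan time.Duration   // e.g. 24 * time.Hour
    Metric    string
}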

GetFingerprintsForLabelSet() does OR, not AND on labels

The expression:

targets_healthy_scrape_latency_ms{percentile="0.010000"}

Yields 12 vector elements where it should only yield one:

request_metrics_latency_equal_tallying_microseconds{instance='http://localhost:9090/metrics.json',percentile='0.010000'} => 657 @[2013-01-13 05:25:59.236345 +0100 CET]
requests_metrics_latency_equal_accumulating_microseconds{instance='http://localhost:9090/metrics.json',percentile='0.010000'} => 657 @[2013-01-13 05:25:59.236345 +0100 CET]
requests_metrics_latency_logarithmic_accumulating_microseconds{instance='http://localhost:9090/metrics.json',percentile='0.010000'} => 657 @[2013-01-13 05:25:59.236345 +0100 CET]
requests_metrics_latency_logarithmic_tallying_microseconds{instance='http://localhost:9090/metrics.json',percentile='0.010000'} => 657 @[2013-01-13 05:25:59.236345 +0100 CET]
sample_append_disk_latency_microseconds{instance='http://localhost:9090/metrics.json',percentile='0.010000'} => 725 @[2013-01-13 05:25:59.236345 +0100 CET]
targets_healthy_scrape_latency_ms{instance='http://localhost:9090/metrics.json',percentile='0.010000'} => 1.751 @[2013-01-13 05:25:59.236345 +0100 CET]
targets_healthy_scrape_latency_ms{instance='http://localhost:9090/metrics.json',percentile='0.010000'} => 1.751 @[2013-01-13 05:25:59.236345 +0100 CET]
targets_healthy_scrape_latency_ms{instance='http://localhost:9090/metrics.json',percentile='0.050000'} => 1.751 @[2013-01-13 05:25:59.236345 +0100 CET]
targets_healthy_scrape_latency_ms{instance='http://localhost:9090/metrics.json',percentile='0.500000'} => 1.751 @[2013-01-13 05:25:59.236345 +0100 CET]
targets_healthy_scrape_latency_ms{instance='http://localhost:9090/metrics.json',percentile='0.900000'} => 1.751 @[2013-01-13 05:25:59.236345 +0100 CET]
targets_healthy_scrape_latency_ms{instance='http://localhost:9090/metrics.json',percentile='0.990000'} => 1.751 @[2013-01-13 05:25:59.236345 +0100 CET]
targets_unhealthy_scrape_latency_ms{instance='http://localhost:9090/metrics.json',percentile='0.010000'} => NaN @[2013-01-13 05:25:59.236345 +0100 CET]

This is an OR of the labels, not an AND (all timeseries that match either the name of the metric OR the percentile value). The bug is in GetFingerprintsForLabelSet():

fmt.Printf("===========> %v\n", labels);
fingerprints, err := p.persistence.GetFingerprintsForLabelSet(&labels)
fmt.Printf("===========> %v\n", fingerprints);

This outputs:

===========> map[percentile:0.010000 name:targets_healthy_scrape_latency_ms]
===========> [0xf8402fd650 0xf8402fd660 0xf8402fd680 0xf8402fd690 0xf8402fd6a0 0xf8402fd750 0xf8402fd760 0xf8402fd770 0xf8402fd780 0xf8402fd790 0xf8402fd7a0 0xf8402fd7b0]

Note the 12 fingerprints where there should only be one!

The reason is that GetFingerprintsForLabelSet() steps through all labels and fetches the matching metrics for each label, thus resulting in an OR.
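
The fix should be to intersect the per-label fingerprint sets instead of unioning them. A sketch of the intended AND semantics (the per-pair lookup helper and the exact types are assumptions):

// Illustrative sketch: keep only fingerprints matched by every label pair.
func getFingerprintsForLabelSet(labels model.LabelSet) ([]model.Fingerprint, error) {
    var intersection map[model.Fingerprint]bool // assumes Fingerprint is usable as a map key
    for name, value := range labels {
        fps, err := fingerprintsForLabelPair(name, value) // hypothetical per-pair index lookup
        if err != nil {
            return nil, err
        }
        current := map[model.Fingerprint]bool{}
        for _, fp := range fps {
            current[fp] = true
        }
        if intersection == nil {
            intersection = current
            continue
        }
        for fp := range intersection {
            if !current[fp] { // not matched by this label pair, too: drop it
                delete(intersection, fp)
            }
        }
    }
    result := make([]model.Fingerprint, 0, len(intersection))
    for fp := range intersection {
        result = append(result, fp)
    }
    return result, nil
}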
