prometheus / prometheus
The Prometheus monitoring system and time series database.
Home Page: https://prometheus.io/
License: Apache License 2.0
When a rule file changes on disk, Prometheus should be able to load and apply the changed rules during runtime.
@juliusv Let's work together on this.
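A minimal sketch of how runtime reloading could work, using only modification-time polling from the Go standard library; ReloadRules and WatchRuleFile are hypothetical names, not existing Prometheus APIs:

package main

import (
	"log"
	"os"
	"time"
)

// ReloadRules is a hypothetical hook that re-parses and applies a rule file.
func ReloadRules(path string) error {
	log.Printf("reloading rules from %s", path)
	return nil // a real implementation would parse and swap in the new rule set
}

// WatchRuleFile polls the file's mtime and triggers a reload on change.
func WatchRuleFile(path string, interval time.Duration) {
	var lastMod time.Time
	for range time.Tick(interval) {
		info, err := os.Stat(path)
		if err != nil {
			log.Printf("stat %s: %v", path, err)
			continue
		}
		if mod := info.ModTime(); mod.After(lastMod) {
			lastMod = mod
			if err := ReloadRules(path); err != nil {
				log.Printf("reload failed: %v", err)
			}
		}
	}
}

func main() {
	WatchRuleFile("prometheus.rules", 30*time.Second)
}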
Current:
GetBoundaryValues(*model.LabelSet, *model.Interval, *StalenessPolicy) (*model.Sample, *model.Sample, error)
The return value is hard to use because the caller has to manually match labelsets between the first and second return value, e.g. when computing deltas. And nothing in the return types themselves ensures that the labels even match.
So I think it should be the same as the GetRangeValues() return value:
GetBoundaryValues(*model.LabelSet, *model.Interval, *StalenessPolicy) (*model.SampleSet, error)
Downside: it's not explicit from the return type that each timeseries contains exactly two datapoints, but that's probably better than introducing yet another special type to ensure it.
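As an illustration of why the SampleSet shape is easier on callers, here is a hedged Go sketch; SamplePair, SampleSet, and deltaFromBoundaries are stand-in types and names, not the real model package:

package main

import "fmt"

type SamplePair struct {
	Timestamp int64
	Value     float64
}

// SampleSet pairs a labelset with its values, so boundary samples can never
// be mismatched across separate return values.
type SampleSet struct {
	Metric map[string]string
	Values []SamplePair // for GetBoundaryValues: exactly two entries
}

// deltaFromBoundaries shows the caller side: both boundary points carry the
// same labels by construction, so no manual matching is needed.
func deltaFromBoundaries(s SampleSet) (float64, error) {
	if len(s.Values) != 2 {
		return 0, fmt.Errorf("expected 2 boundary values, got %d", len(s.Values))
	}
	return s.Values[1].Value - s.Values[0].Value, nil
}

func main() {
	s := SampleSet{
		Metric: map[string]string{"name": "http_requests"},
		Values: []SamplePair{{100, 4}, {200, 10}},
	}
	d, _ := deltaFromBoundaries(s)
	fmt.Println(d) // 6
}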
Fix the samples.Values[0] here: prometheus/rules/ast/functions.go, line 100 in db5868f.
For the configuration system, a scrape interval and an evaluation interval must be provided. If either or both are missing, this should trigger a runtime error; I can't think of a sensible default behavior for this.
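A minimal sketch of the proposed validation, with Config and its fields as assumed names rather than the real configuration types:

package main

import (
	"errors"
	"fmt"
	"time"
)

type Config struct {
	ScrapeInterval     time.Duration
	EvaluationInterval time.Duration
}

// Validate returns an error instead of guessing a default.
func (c Config) Validate() error {
	if c.ScrapeInterval <= 0 {
		return errors.New("config: scrape interval must be set")
	}
	if c.EvaluationInterval <= 0 {
		return errors.New("config: evaluation interval must be set")
	}
	return nil
}

func main() {
	cfg := Config{ScrapeInterval: 15 * time.Second} // evaluation interval missing
	if err := cfg.Validate(); err != nil {
		fmt.Println(err) // config: evaluation interval must be set
	}
}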
Prometheus needs to be able to interface with a remote and scalable data store for long-term storage/retrieval.
Build an in-memory store/cache to make access more efficient.
It should be possible to correlate multiple metrics in the same graph. The graphing UI should show (dynamically) multiple expression input fields to enable that.
This would be useful e.g. for the list of roles attached to a host, to then query by a single role. Not sure if supporting this is worth the time and complexity, though.
If the query resolution step (e.g. 5m) is coarser than a metric range requested in the query (like foo[1m]), we actually need to fetch the range at repeated intervals so as not to fetch too much data.
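A sketch of the intended access pattern, with all names illustrative: for a 5m resolution step over a foo[1m] range, fetch one 1m window per step instead of the whole span:

package main

import (
	"fmt"
	"time"
)

type window struct{ from, through time.Time }

// windowsFor returns one [t-rng, t] window per resolution step in [start, end].
func windowsFor(start, end time.Time, step, rng time.Duration) []window {
	var ws []window
	for t := start; !t.After(end); t = t.Add(step) {
		ws = append(ws, window{from: t.Add(-rng), through: t})
	}
	return ws
}

func main() {
	start := time.Date(2013, 1, 1, 0, 0, 0, 0, time.UTC)
	for _, w := range windowsFor(start, start.Add(15*time.Minute), 5*time.Minute, time.Minute) {
		fmt.Println(w.from.Format("15:04"), "-", w.through.Format("15:04"))
	}
	// Prints four 1m windows rather than one 16m span.
}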
Use the client instrumentation library throughout Prometheus itself (to gain insight).
If we had a mediator around the storage system, we could easily track counter resets with respect to metric values. This would reduce the number of range (a, b] queries in favor of just querying the endpoints.
Rules need to be evaluated in topological order to respect data dependencies between them; the current dependency check just returns true for the time being.
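A hedged sketch of such an ordering using Kahn's algorithm; Rule and its Deps field are assumptions for illustration, not the real rules package:

package main

import "fmt"

type Rule struct {
	Name string
	Deps []string // names of rules whose output this rule reads
}

// topoSort returns the rules so that every rule follows its dependencies,
// or an error if the dependency graph contains a cycle.
func topoSort(rules []Rule) ([]Rule, error) {
	byName := map[string]Rule{}
	indegree := map[string]int{}
	dependents := map[string][]string{}
	for _, r := range rules {
		byName[r.Name] = r
		indegree[r.Name] += 0 // ensure an entry even with no deps
		for _, d := range r.Deps {
			indegree[r.Name]++
			dependents[d] = append(dependents[d], r.Name)
		}
	}
	var queue []string
	for name, deg := range indegree {
		if deg == 0 {
			queue = append(queue, name)
		}
	}
	var order []Rule
	for len(queue) > 0 {
		n := queue[0]
		queue = queue[1:]
		order = append(order, byName[n])
		for _, m := range dependents[n] {
			if indegree[m]--; indegree[m] == 0 {
				queue = append(queue, m)
			}
		}
	}
	if len(order) != len(rules) {
		return nil, fmt.Errorf("cycle among rules")
	}
	return order, nil
}

func main() {
	order, err := topoSort([]Rule{
		{Name: "derived", Deps: []string{"base"}},
		{Name: "base"},
	})
	fmt.Println(order, err) // base is evaluated before derived
}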
Implement an expression browser via a web form (the user inputs a rule language expression and gets back the evaluated result).
We can already reevaluate queries on old data, but we should be able to persist that for a certain window from [Oldest, Now).
Prometheus should serve a status page that shows some internal state:
Let's take another look at optimizing the Prometheus data models after our first experiments.
Let's figure out what we can use, i.e.:
It might make sense to change the interface of e.g. GetValueAtTime(), GetBoundaryValues(), and GetRangeValues() to expect a fingerprint instead of a metric labelset.
The AST currently starts off knowing a labelset, then gets all fingerprints for that, then gets the metrics for those fingerprints, then gets the values for each of those metrics.
It could be just: get all fingerprints for the labelset, then fetch values for each fingerprint. That's one conversion step fewer.
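A hedged sketch of the proposed fingerprint-first interface; all type and method names are illustrative stand-ins for the real storage/metric package:

package sketch

import "time"

// Illustrative stand-ins for the real model types.
type Fingerprint uint64
type LabelSet map[string]string
type Interval struct{ OldestInclusive, NewestInclusive time.Time }
type SampleSet struct{} // would carry the metric's labels plus its values
type StalenessPolicy struct{ DeltaAllowance time.Duration }

// MetricPersistence, reworked so the AST resolves a labelset to fingerprints
// once and then fetches values directly, skipping the metric lookup step.
type MetricPersistence interface {
	GetFingerprintsForLabelSet(LabelSet) ([]Fingerprint, error)
	GetValueAtTime(Fingerprint, time.Time, *StalenessPolicy) (*SampleSet, error)
	GetBoundaryValues(Fingerprint, Interval, *StalenessPolicy) (*SampleSet, error)
	GetRangeValues(Fingerprint, Interval, *StalenessPolicy) (*SampleSet, error)
}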
For standing and recorded rules, we should offer statistics about how long they take to evaluate per cycle, as well as summary statistics of the total evaluation duration in Prometheus. Another question this should help answer: are we caught up or behind?
Like "gofmt" for Go, we ought to have a "promfmt" for Prometheus since we have a syntax tree. The idea being that the system produces uniform style that minimizes deviation and learning curve.
Update after we have totally moved to YAML rule files: In addition to formatting the PromQL expressions, we also want to format the YAML files to have a fixed structure, while preserving comments for both PromQL expressions and the YAML file.
rate(haproxy_bytes_in{service="dashbox-web"}[1m])
sum(rate(haproxy_bytes_in{service="dashbox-web"}[1m]))
It should be possible to efficiently stream timeseries from one Prometheus instance to another, with exchanged series determined based on a federation configuration.
A scrape should result in up/down timeseries per target. E.g.
up{job="foo",instance="..."} => 1
down{job="foo",instance="..."} => 0
I've only tested GetRangeValues() so far, but maybe GetBoundaryValues() has the same behavior. The SampleSets that get returned have a nil map as the Metric member, whereas they should probably contain the labels that the function was called with, to yield a proper timeseries.
It's not a big problem right now, because the caller has the right labels anyway and can insert them. However, I'm not sure whether that's intended.
Low priority, but it should be done.
ec41345 introduced a regression in prometheus/rules/ast/functions.go, line 123 in 6cb3c51.
Views should be populated from both memory and disk data, not only the latter.
Graphs should be linkable via their current URL from the browser address bar.
The rate() and delta() functions should consider the times of the actual first and last samples within an interval vs. the desired begin and end times of the interval, and compensate for any temporal aliasing that occurs when the graphing resolution is not a multiple of the recorded sample resolution.
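A hedged sketch of one possible compensation, scaling the raw delta by the ratio of the requested interval to the span actually covered by samples; names and types are illustrative, not the real rules/ast functions:

package main

import "fmt"

type sample struct {
	timestampMs int64
	value       float64
}

// compensatedDelta extrapolates the delta between the first and last sample
// to the full requested interval, correcting the aliasing that occurs when
// the samples do not line up with the interval boundaries.
func compensatedDelta(samples []sample, intervalStartMs, intervalEndMs int64) float64 {
	if len(samples) < 2 {
		return 0
	}
	first, last := samples[0], samples[len(samples)-1]
	rawDelta := last.value - first.value
	sampledSpan := float64(last.timestampMs - first.timestampMs)
	requestedSpan := float64(intervalEndMs - intervalStartMs)
	if sampledSpan == 0 {
		return 0
	}
	return rawDelta * requestedSpan / sampledSpan
}

func main() {
	// Samples cover only 40s of a requested 60s interval, so the observed
	// delta of 4 is scaled up to 6.
	s := []sample{{10_000, 1}, {50_000, 5}}
	fmt.Println(compensatedDelta(s, 0, 60_000)) // 6
}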
With my expressions benchmark (living in branch "julius-metrics-persistence-benchmarks"), I managed to provoke the following crash in newSeriesFrontier():
$ go run -a expressions_benchmark.go --leveldbFlushOnMutate=false -numTimeseries=10 -populateStorage=true -deleteStorage=true -evalIntervalSeconds=3600 > /tmp/foo.txt
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x18 pc=0x7f9e61c48254]
goroutine 1 [select]:
github.com/prometheus/prometheus/storage/metric.(*tieredStorage).MakeView(0xf840000e00, 0xf8445f8040, 0xf8420e0140, 0xdf8475800, 0x0, ...)
/home/julius/gosrc/src/github.com/prometheus/prometheus/storage/metric/tiered.go:135 +0x34b
github.com/prometheus/prometheus/rules/ast.viewAdapterForRangeQuery(0xf8400bee00, 0xf8420e0f80, 0x0, 0x0, 0x0, ...)
/home/julius/gosrc/src/github.com/prometheus/prometheus/rules/ast/query_analyzer.go:138 +0x480
github.com/prometheus/prometheus/rules/ast.EvalVectorRange(0xf844652ac0, 0xf8420e0f80, 0x0, 0x0, 0x0, ...)
/home/julius/gosrc/src/github.com/prometheus/prometheus/rules/ast/ast.go:274 +0xff
main.doBenchmark(0x6b5ce4, 0xf800000008)
/home/julius/gosrc/src/github.com/prometheus/prometheus/expressions_benchmark.go:113 +0x45c
main.main()
/home/julius/gosrc/src/github.com/prometheus/prometheus/expressions_benchmark.go:153 +0x37a
goroutine 2 [syscall]:
created by runtime.main
/home/julius/go/src/pkg/runtime/proc.c:221
goroutine 9 [syscall]:
github.com/jmhodges/levigo._Cfunc_leveldb_iter_key(0x7f9e480017e0, 0xf843f274e0)
github.com/jmhodges/levigo/_obj/_cgo_defun.c:178 +0x2f
github.com/jmhodges/levigo.(*Iterator).Key(0xf843f274a0, 0x746920410000000f, 0x7f9e620038a0, 0x100000001)
github.com/jmhodges/levigo/_obj/batch.cgo1.go:519 +0x44
github.com/prometheus/prometheus/storage/raw/leveldb.levigoIterator.Key(0xf843f274a0, 0xf843f27498, 0xf843f27490, 0xf84009e820, 0x0, ...)
/home/julius/gosrc/src/github.com/prometheus/prometheus/storage/raw/leveldb/leveldb.go:121 +0xe8
github.com/prometheus/prometheus/storage/raw/leveldb.(*levigoIterator).Key(0xf84195e060, 0x0, 0x0, 0x0)
/home/julius/gosrc/src/github.com/prometheus/prometheus/storage/raw/leveldb/batch.go:0 +0x8c
github.com/prometheus/prometheus/storage/metric.extractSampleKey(0xf844bbeb40, 0xf84195e060, 0xf844652480, 0x0, 0x0, ...)
/home/julius/gosrc/src/github.com/prometheus/prometheus/storage/metric/leveldb.go:692 +0xa4
github.com/prometheus/prometheus/storage/metric.newSeriesFrontier(0xf8400d6000, 0xf8400d9b40, 0xf8400d6000, 0xf84195e000, 0x0, ...)
/home/julius/gosrc/src/github.com/prometheus/prometheus/storage/metric/frontier.go:147 +0x7ee
github.com/prometheus/prometheus/storage/metric.(*tieredStorage).renderView(0xf840000e00, 0xf8445f8040, 0xf8420e0140, 0xf840f596c0)
/home/julius/gosrc/src/github.com/prometheus/prometheus/storage/metric/tiered.go:384 +0x444
github.com/prometheus/prometheus/storage/metric.(*tieredStorage).Serve(0xf840000e00, 0x0)
/home/julius/gosrc/src/github.com/prometheus/prometheus/storage/metric/tiered.go:181 +0x143
created by main.main
/home/julius/gosrc/src/github.com/prometheus/prometheus/expressions_benchmark.go:139 +0x292
goroutine 10 [syscall]:
created by addtimer
/home/julius/go/src/pkg/runtime/ztime_amd64.c:72
exit status 2
The culprit is this line, where we rewind the iterator although it may already be pointing at the first element on disk: prometheus/storage/metric/frontier.go, line 133 in b2e4c88.
Please add logic to prevent this as well as a regression test.
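A hedged sketch of such a guard against a hypothetical iterator interface (the real levigo-based iterator API may differ):

package sketch

import "bytes"

// iterator is a hypothetical subset of the on-disk iterator's API.
type iterator interface {
	Key() []byte
	Previous()
}

// rewindUnlessFirst steps the iterator back one element after a seek, but
// only if it is not already positioned on the first element on disk; calling
// Previous() there is what triggered the nil-pointer panic above.
func rewindUnlessFirst(it iterator, firstKey []byte) {
	if bytes.Equal(it.Key(), firstKey) {
		return // already at the first on-disk element; nothing to rewind to
	}
	it.Previous()
}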
Upon close or close request, …
Item no. 3 is pertinent now, as it is possible, though unlikely, to introduce metric index corruption, and we have no tools to perform referential integrity checks on the LevelDB storage. Quick example:
The ordering of no. 2 and no. 3 may be wrong; although this mutation process is idempotent, we would never set the fingerprints for the new metric correctly.
Proposal:
There needs to be a component that evaluates rules periodically during runtime and stores their results.
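A minimal sketch of such an evaluator loop; Rule, Eval, and the store hook are assumed names for illustration, not the real rules package:

package main

import (
	"log"
	"time"
)

// Rule is a stand-in for a real recording rule; real rules yield vectors.
type Rule interface {
	Name() string
	Eval(now time.Time) (float64, error)
}

type constRule struct {
	name  string
	value float64
}

func (r constRule) Name() string                        { return r.name }
func (r constRule) Eval(now time.Time) (float64, error) { return r.value, nil }

// evalLoop evaluates every rule once per interval and hands each result to
// the store callback, which would persist it as a new sample.
func evalLoop(rules []Rule, interval time.Duration, store func(name string, v float64, t time.Time)) {
	for now := range time.Tick(interval) {
		for _, r := range rules {
			v, err := r.Eval(now)
			if err != nil {
				log.Printf("rule %s: %v", r.Name(), err)
				continue
			}
			store(r.Name(), v, now)
		}
	}
}

func main() {
	evalLoop([]Rule{constRule{"one", 1}}, time.Second, func(n string, v float64, t time.Time) {
		log.Printf("%s = %v @ %s", n, v, t.Format(time.RFC3339))
	})
}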
We need a Ruby version of the client instrumentation library.
It would be great if we could distribute and run Prometheus as a single binary without any external files needed for serving e.g. the web interface.
One possible way to do this: https://github.com/jteeuwen/go-bindata
Another option: https://github.com/bradfitz/camlistore/blob/master/pkg/fileembed/fileembed.go
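For illustration, a minimal sketch using Go's standard embed package (a modern alternative to the libraries linked above); the web/static path is a placeholder:

package main

import (
	"embed"
	"io/fs"
	"log"
	"net/http"
)

// The embed directive bakes the asset tree into the binary at build time.
//
//go:embed web/static
var assets embed.FS

func main() {
	static, err := fs.Sub(assets, "web/static")
	if err != nil {
		log.Fatal(err)
	}
	http.Handle("/static/", http.StripPrefix("/static/", http.FileServer(http.FS(static))))
	log.Fatal(http.ListenAndServe(":9090", nil))
}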
It's intermittent, but I do see this every few times I run make in the root directory or go test from storage/metric. Here's the trace: https://gist.github.com/bernerdschaefer/5371918
What we have right now for dashboard generation is good for ad hoc sharing, but it does not support long-term persistent dashboard use cases where additional visual elements or metadata may be required.
Thus, I would like to envision a world where …
./prometheus --userAssets=/path/to/asset/root
/path/to/asset/root may contain
/CC: @discordianfish and @juliusv
To reproduce, create a small program which launches concurrent requests to prometheus/api/query?json=JSON&expr=<anything>. Each request will have one of three possible outcomes:
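A minimal repro sketch along those lines; the host, port, and expression are placeholders:

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"sync"
)

func main() {
	base := "http://localhost:9090/api/query?json=JSON&expr=" + url.QueryEscape("1")
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			resp, err := http.Get(base)
			if err != nil {
				fmt.Printf("request %d: %v\n", n, err)
				return
			}
			defer resp.Body.Close()
			body, _ := io.ReadAll(resp.Body)
			fmt.Printf("request %d: %d bytes, status %s\n", n, len(body), resp.Status)
		}(i)
	}
	wg.Wait()
}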
For the record, the panic has this stacktrace:
rules/lexer.l.go:196 (0x44c886)
rules/load.go:51 (0x44ce7e)
rules/parser.y.go:192 (0x44debf)
rules/parser.y.go:265 (0x450c0b)
rules/load.go:75 (0x44d177)
rules/load.go:116 (0x44d634)
rules/load.go:125 (0x44d746)
web/api/query.go:30 (0x5461d5)
web/api/api.go:0 (0x547565)
The scope is installing TCMalloc and Google perf tools into the overlay_root hierarchy.
For a first demo, we need a rudimentary graph generation interface. It can be a simple web form that allows the input of metric, labels, and time range, and shows a graph accordingly.
Target pools scrape only the first target that was added to the pool (at pool creation time). I haven't figured out the exact problem yet; maybe something is wrong with the heap handling? I've verified that Add() is called correctly for all targets on the right pools, but after that, in the actual runs, p.Len() is always just 1.
I'll send a minimal config with multiple targets via mail.
Currently we only allow choosing a range back in time from a given Unix timestamp as the graph ending time. The interface should instead have arrows that allow skipping backward/forward in time by smart units.
The datastore grows ad infinitum right now. We need a couple of capabilities:
The expression:
targets_healthy_scrape_latency_ms{percentile="0.010000"}
yields 12 vector elements where it should only yield one:
request_metrics_latency_equal_tallying_microseconds{instance='http://localhost:9090/metrics.json',percentile='0.010000'} => 657 @[2013-01-13 05:25:59.236345 +0100 CET]
requests_metrics_latency_equal_accumulating_microseconds{instance='http://localhost:9090/metrics.json',percentile='0.010000'} => 657 @[2013-01-13 05:25:59.236345 +0100 CET]
requests_metrics_latency_logarithmic_accumulating_microseconds{instance='http://localhost:9090/metrics.json',percentile='0.010000'} => 657 @[2013-01-13 05:25:59.236345 +0100 CET]
requests_metrics_latency_logarithmic_tallying_microseconds{instance='http://localhost:9090/metrics.json',percentile='0.010000'} => 657 @[2013-01-13 05:25:59.236345 +0100 CET]
sample_append_disk_latency_microseconds{instance='http://localhost:9090/metrics.json',percentile='0.010000'} => 725 @[2013-01-13 05:25:59.236345 +0100 CET]
targets_healthy_scrape_latency_ms{instance='http://localhost:9090/metrics.json',percentile='0.010000'} => 1.751 @[2013-01-13 05:25:59.236345 +0100 CET]
targets_healthy_scrape_latency_ms{instance='http://localhost:9090/metrics.json',percentile='0.010000'} => 1.751 @[2013-01-13 05:25:59.236345 +0100 CET]
targets_healthy_scrape_latency_ms{instance='http://localhost:9090/metrics.json',percentile='0.050000'} => 1.751 @[2013-01-13 05:25:59.236345 +0100 CET]
targets_healthy_scrape_latency_ms{instance='http://localhost:9090/metrics.json',percentile='0.500000'} => 1.751 @[2013-01-13 05:25:59.236345 +0100 CET]
targets_healthy_scrape_latency_ms{instance='http://localhost:9090/metrics.json',percentile='0.900000'} => 1.751 @[2013-01-13 05:25:59.236345 +0100 CET]
targets_healthy_scrape_latency_ms{instance='http://localhost:9090/metrics.json',percentile='0.990000'} => 1.751 @[2013-01-13 05:25:59.236345 +0100 CET]
targets_unhealthy_scrape_latency_ms{instance='http://localhost:9090/metrics.json',percentile='0.010000'} => NaN @[2013-01-13 05:25:59.236345 +0100 CET]
This is an OR of the labels, not an AND (all timeseries that match either the name of the metric OR the percentile value). The bug is in GetFingerprintsForLabelSet():
fmt.Printf("===========> %v\n", labels);
fingerprints, err := p.persistence.GetFingerprintsForLabelSet(&labels)
fmt.Printf("===========> %v\n", fingerprints);
This outputs this:
===========> map[percentile:0.010000 name:targets_healthy_scrape_latency_ms]
===========> [0xf8402fd650 0xf8402fd660 0xf8402fd680 0xf8402fd690 0xf8402fd6a0 0xf8402fd750 0xf8402fd760 0xf8402fd770 0xf8402fd780 0xf8402fd790 0xf8402fd7a0 0xf8402fd7b0]
Note the 12 fingerprints where there should only be one!
The reason is that GetFingerprintsForLabelSet() steps through all labels and fetches the matching metrics for each label, thus resulting in an OR.
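A hedged sketch of the fix, intersecting the per-label fingerprint sets instead of unioning them; the types are illustrative stand-ins for the real persistence layer:

package main

import "fmt"

// Illustrative stand-ins for the real persistence layer.
type Fingerprint string
type LabelSet map[string]string

// index maps label name -> label value -> fingerprints carrying that pair.
type index map[string]map[string][]Fingerprint

// GetFingerprintsForLabelSet returns only fingerprints present in every
// per-label result set: a set intersection, i.e. AND semantics.
func (idx index) GetFingerprintsForLabelSet(labels LabelSet) []Fingerprint {
	var result map[Fingerprint]bool
	for name, value := range labels {
		current := map[Fingerprint]bool{}
		for _, fp := range idx[name][value] {
			current[fp] = true
		}
		if result == nil {
			result = current // first label seeds the candidate set
			continue
		}
		for fp := range result {
			if !current[fp] {
				delete(result, fp) // drop fingerprints not matching this label
			}
		}
	}
	out := make([]Fingerprint, 0, len(result))
	for fp := range result {
		out = append(out, fp)
	}
	return out
}

func main() {
	idx := index{
		"name":       {"targets_healthy_scrape_latency_ms": {"a", "b"}},
		"percentile": {"0.010000": {"a", "c"}},
	}
	fmt.Println(idx.GetFingerprintsForLabelSet(LabelSet{
		"name":       "targets_healthy_scrape_latency_ms",
		"percentile": "0.010000",
	})) // [a]: only the fingerprint matching BOTH labels
}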