
Remote storage · prometheus/prometheus #10 · 170 comments · CLOSED

prometheus avatar prometheus commented on April 30, 2024 64
Remote storage

from prometheus.

Comments (170)

jkinred avatar jkinred commented on April 30, 2024 5

Just wanted to reiterate the sentiment of others above... even though we're a reasonably sized team, we don't want to operationalise HBase. OpenTSDB isn't even on our short list for this reason.

Disclaimer: We're not technically Prometheus users at this time, but we should be in the next couple of weeks :).

from prometheus.

beorn7 avatar beorn7 commented on April 30, 2024 2

Radio Yerevan: "In principle yes." (Please forgive that Eastern European digression... ;)

from prometheus.

raliste avatar raliste commented on April 30, 2024 2

What about a BigQuery exporter by stream loading? Should be the analogous option to borgmon -> tsdb?

from prometheus.

yhilem avatar yhilem commented on April 30, 2024 2

What about https://github.com/hawkular/hawkular-metrics
Hawkular Metrics is the metric data store for the Hawkular project. It can also be used independently.

See also https://github.com/kubernetes/heapster/blob/master/docs/storage-schema.md#hawkular

Collecting Metrics from Prometheus Endpoints : http://www.hawkular.org/blog/2016/04/22/collecting-metrics-from-prometheus-endpoints.html
The agent now has the ability to monitor Prometheus endpoints and store their metrics to Hawkular Metrics. This means any component that exports Prometheus metric data via either the Prometheus binary or text formats can have those metric data collected by the agent and pushed up into Hawkular Metrics

Prometheus Metrics Scraper : https://github.com/hawkular/hawkular-agent/tree/master/prometheus-scraper

from prometheus.

JensRantil avatar JensRantil commented on April 30, 2024 1

Running hbase in standalone defeats the point of it being distributed ;)

@mattkanwisher True.

Prometheus already has its own store.

True. However, AFAIK Prometheus storage doesn't support downsampling, while OpenTSDB does. This could be a reason for wanting to run OpenTSDB (or something else) non-distributed.

Is anyone actually working on this at this time?

Good question. I wouldn't be surprised to hear no one is working on it, because this issue is large and not well defined. As I see it, there are actually multiple sub-issues:

  • Prometheus doesn't support downsampling of data, which means data takes up more space than necessary. Workaround: use larger disks.
  • Prometheus doesn't support replication of data. Losing a server means that you will lose that data. Workaround: use a distributed file system.

Something missing?

Disclaimer Just like @jkinred, I am not a Prometheus user. However, I keep coming back to it...

from prometheus.

johann8384 avatar johann8384 commented on April 30, 2024

Is there anyone planning to work on this? Is the work done in the opentsdb-integration branch still valid or has the rest of the code-base moved past that?

from prometheus.

beorn7 avatar beorn7 commented on April 30, 2024

The opentsdb-integration branch is indeed completely outdated (still using the old storage backend etc.). Personally, I'm a great fan of the OpenTSDB integration, but where I work, there is not an urgent enough requirement to justify a high priority from my side...

from prometheus.

juliusv avatar juliusv commented on April 30, 2024

To be clear, the outdated "opentsdb-integration" was only for the
proof-of-concept read-back support (querying OpenTSDB through Prometheus).

Writing into OpenTSDB should be experimentally supported in master, but
the last time we tried it was a year ago on a single-node OpenTSDB.

You initially asked on #10:

"I added the storage.remote.url command line flag, but as far as I can tell
Prometheus doesn't attempt to store any metrics there."

A couple of questions:

  • did you enable the OpenTSDB option "tsd.core.auto_create_metrics"?
    Otherwise OpenTSDB won't auto-create metrics for you, as the option is
    false by default. See
    http://opentsdb.net/docs/build/html/user_guide/configuration.html
  • if you run Prometheus with -logtostderr, do you see any relevant log
    output? If there is an error sending samples to TSDB, it should be logged
    (glog.Warningf("error sending %d samples to TSDB: %s", len(s), err))
  • Prometheus also exports metrics itself about sending to OpenTSDB. On
    /metrics of your Prometheus server, you should find the counter metrics
    "prometheus_remote_storage_sent_errors_total" and
    "prometheus_remote_storage_sent_samples_total". What do these say?

Cheers,
Julius


from prometheus.

sammcj avatar sammcj commented on April 30, 2024

I cannot +1 this enough

from prometheus.

mwitkow avatar mwitkow commented on April 30, 2024

Is InfluxDB on the cards in any way? :)

from prometheus.

mwitkow avatar mwitkow commented on April 30, 2024

:D That was slightly before my time ;)

from prometheus.

juliusv avatar juliusv commented on April 30, 2024

See also: https://twitter.com/juliusvolz/status/569509228462931968

We're just waiting for InfluxDB 0.9.0, which has a new data model which
should be more compatible with Prometheus's.


from prometheus.

pires avatar pires commented on April 30, 2024

We're just waiting for InfluxDB 0.9.0, which has a new data model which
should be more compatible with Prometheus's.

Can I say awesome more than once? Awesome!

from prometheus.

fabxc avatar fabxc commented on April 30, 2024

Unfortunately, @juliusv ran some tests with 0.9 and InfluxDB consumed 14x more storage than Prometheus.

Before, the overhead was 11x, but Prometheus has reduced its storage size significantly since then - so in reality InfluxDB has apparently improved in that regard.
Nonetheless, InfluxDB has not turned out to be the eventual answer for long-term storage, yet.

from prometheus.

beorn7 avatar beorn7 commented on April 30, 2024

At least experimental write support is in master as of today, so anybody can play with InfluxDB receiving Prometheus metrics. Quite possibly somebody will find the reason for the blow-up in storage space and everything will be unicorns and rainbows in the end...

from prometheus.

pires avatar pires commented on April 30, 2024

@beorn7 that's great. TBH I'm not concerned about disk space; it's the cheapest resource on the cloud, after all. Not to mention, I'm expecting to hold data with a very small TTL, i.e. a few weeks.

from prometheus.

beorn7 avatar beorn7 commented on April 30, 2024

@pires In that case, why not just run two identically configured Prometheis with a reasonably large disk?
A few weeks or months is usually fine as retention time for Prometheus. (Default is 15d for a reason... :) The only problem is that if your disk breaks, your data is gone, but for that, you have the other server.

from prometheus.

fabxc avatar fabxc commented on April 30, 2024

@pires do you have a particular reason to hold the data in another database for that time? "A few weeks" does not seem to require a long-term storage solution. Prometheus's default retention time is 15 days - increasing that to 30 or even 60 days should not be a problem.

from prometheus.

pires avatar pires commented on April 30, 2024

@beorn7 @fabxc I am currently using a proprietary & very specific solution that writes monitoring metrics into InfluxDB. This can eventually be replaced with Prometheus.

The thing is, I have some tailored apps that read metrics from InfluxDB in order to reactively scale up/down; those would need to be rewritten to read from Prometheus instead. Also, I use continuous queries. Does Prometheus deliver such a feature?

from prometheus.

brian-brazil avatar brian-brazil commented on April 30, 2024

http://prometheus.io/docs/querying/rules/#recording-rules are the equivalent to InfluxDB's continuous queries.

from prometheus.

dever860 avatar dever860 commented on April 30, 2024

+1

from prometheus.

drawks avatar drawks commented on April 30, 2024

πŸ‘

from prometheus.

blysik avatar blysik commented on April 30, 2024

How does remote storage as currently implemented interact with PromDash or grafana?

I have a use case where I want to run Prometheus in a 'heroku-like' environment, where the instances could conceivably go away at any time.

Then I would configure a remote, traditional influxdb cluster to store data in.

Could this configuration function normally?

from prometheus.

matthiasr avatar matthiasr commented on April 30, 2024

This depends on your definition of "normally", but mostly, no.

Remote storage as it is is write-only; from Prometheus you would only get what it has locally.

To get at older data, you need to query OpenTSDB or InfluxDB directly, using their own interfaces and query languages. With PromDash you're out of luck in that regard; AFAIK Grafana knows all of them.

You could build your dashboards fully based on querying them and leave Prometheus to be a collection and rule evaluation engine, but you would miss out on its query language for ad hoc drilldowns over extended time spans.

from prometheus.

matthiasr avatar matthiasr commented on April 30, 2024

Also note that both InfluxDB and OpenTSDB support are somewhat experimental, under-exercised on our side, and in flux.

from prometheus.

mattkanwisher avatar mattkanwisher commented on April 30, 2024

We're kicking around the idea of a flat file exporter, so we can start storing long-term data now and then make use of it once the bulk import issue (#535) is done. Would you guys be open to a PR around this?

from prometheus.

juliusv avatar juliusv commented on April 30, 2024

For #535 take a look at my way outdated branch import-api, where I once added an import API as a proof-of-concept: https://github.com/prometheus/prometheus/commits/import-api. It's from March, so it doesn't apply to master anymore, but it just shows that in principle adding such an API using the existing transfer formats would be trivial. We just need to agree that we want this (it's a contentious issue, /cc @brian-brazil) and whether it should use the same sample transfer format as we use for scraping. The issue with this transfer format is that it's optimized for the many-series-one-sample (scrape) case, while with batch imports you often care more about importing all samples of a series at once, without having to repeat the metric name and labels for each sample (massive overhead). But maybe we don't care about efficiency in the (rare?) bulk import case, so the existing format could be fine.
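To illustrate the format trade-off described above, here is a hypothetical comparison; both payload shapes are invented for this example, not an agreed-upon format:

    // Hypothetical comparison of the two transfer shapes discussed above:
    // repeating labels per sample (scrape-style) vs. sending labels once per
    // series (bulk-import-style). Shapes are invented for illustration only.
    package main

    import (
        "encoding/json"
        "fmt"
    )

    type labeledSample struct {
        Labels      map[string]string `json:"labels"`
        TimestampMs int64             `json:"ts"`
        Value       float64           `json:"value"`
    }

    type point struct {
        TimestampMs int64   `json:"ts"`
        Value       float64 `json:"value"`
    }

    type seriesBatch struct {
        Labels  map[string]string `json:"labels"`
        Samples []point           `json:"samples"`
    }

    func main() {
        labels := map[string]string{"__name__": "http_requests_total", "job": "api", "instance": "host-1:9100"}

        var scrapeStyle []labeledSample
        batch := seriesBatch{Labels: labels}
        for i := int64(0); i < 1000; i++ {
            scrapeStyle = append(scrapeStyle, labeledSample{Labels: labels, TimestampMs: i * 15000, Value: float64(i)})
            batch.Samples = append(batch.Samples, point{TimestampMs: i * 15000, Value: float64(i)})
        }

        a, _ := json.Marshal(scrapeStyle)
        b, _ := json.Marshal(batch)
        fmt.Printf("labels repeated per sample: %d bytes\n", len(a))
        fmt.Printf("labels sent once per series: %d bytes\n", len(b))
    }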

For the remote storage part, there was this discussion
https://groups.google.com/forum/#!searchin/prometheus-developers/json/prometheus-developers/QsjXwQDLHxI/Cw0YWmevAgAJ about decoupling the remote storage in some generic way, but some details haven't been resolved yet. The basic idea was that Prometheus could send all samples in some well-defined format (JSON, protobuf, or whatever) to a user-specified endpoint which could then do anything it wants with it (write it to a file, send it to another system, etc.).

So it might be ok to add a flat file exporter as a remote storage backend directly to Prometheus, or resolve that discussion above and use said well-defined transfer format and an external daemon.
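As a rough sketch of the external-daemon idea, assuming a hypothetical JSON sample format and a made-up /write endpoint (nothing here is an agreed-upon interface):

    // Hypothetical "external daemon": an HTTP endpoint that accepts samples in
    // an assumed JSON format and appends them to a flat file, one object per
    // line. A real daemon would also serialize concurrent writes.
    package main

    import (
        "encoding/json"
        "log"
        "net/http"
        "os"
    )

    type sample struct {
        Metric    map[string]string `json:"metric"`    // label name -> value, including __name__
        Value     float64           `json:"value"`
        Timestamp int64             `json:"timestamp"` // milliseconds since epoch
    }

    func main() {
        f, err := os.OpenFile("samples.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
        if err != nil {
            log.Fatal(err)
        }
        enc := json.NewEncoder(f)

        http.HandleFunc("/write", func(w http.ResponseWriter, r *http.Request) {
            var batch []sample
            if err := json.NewDecoder(r.Body).Decode(&batch); err != nil {
                http.Error(w, err.Error(), http.StatusBadRequest)
                return
            }
            for _, s := range batch {
                if err := enc.Encode(s); err != nil {
                    http.Error(w, err.Error(), http.StatusInternalServerError)
                    return
                }
            }
        })
        log.Fatal(http.ListenAndServe(":9201", nil))
    }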

from prometheus.

brian-brazil avatar brian-brazil commented on April 30, 2024

I think for flat file we'd be talking the external daemon, as it's not something we can ever read back from.

from prometheus.

mattkanwisher avatar mattkanwisher commented on April 30, 2024

So the more I think about it, it would be nice to have this /import-api (a raw-data API), so we can have backup nodes mirroring the data from the primary Prometheus. Would there be appetite for a PR for this, plus the corresponding piece inside of Prometheus to import the data, so you can essentially have read slaves?

from prometheus.

brian-brazil avatar brian-brazil commented on April 30, 2024

For that use case we generally recommend running multiple identical Prometheus servers. Remote storage is about long term data, not redundancy or scaling.

from prometheus.

mattkanwisher avatar mattkanwisher commented on April 30, 2024

I think running multiple scrapers is not a good solution because the data won't match; also, there is no way to backfill data. So we have an issue where I need to spin up some redundant nodes and now they are missing a month of data. If you had an API to raw-import the data you could at least catch them up. The same interface could also be used for backups.

from prometheus.

brian-brazil avatar brian-brazil commented on April 30, 2024

So we have issue where I need to spin up some redundant nodes and now they are missing a month of data. If you have an api to raw import the data you could at least catch them up. Also the same interface could be used for backups

This is the use case for remote storage: you pull the older data from remote storage rather than depending on Prometheus being stateful. Similarly, in such a setup there's no need for backups, as Prometheus doesn't have any notable state.

from prometheus.

mattkanwisher avatar mattkanwisher commented on April 30, 2024

Remote storage is still not that useful because there is no way to query it. It seems like you could make a pretty quick long-term storage solution with the existing primitives if you allowed nodes to do an initial backfill.

from prometheus.

brian-brazil avatar brian-brazil commented on April 30, 2024

The plan is to have a way to query it. The current primitives do not allow for good long term storage, as you're going to be limited by the amount of SSD you can put on a single node. Something more clustery is required.

from prometheus.

mattkanwisher avatar mattkanwisher commented on April 30, 2024

The existing hash-modding scheme already allows you to expand past a single node. If you had a way to spin up a new cluster when you want to resize and import the old data, you could have a poor man's approach to scaling. In the future you could add some more intelligent resharding techniques. I don't think any of the external storage options are even good enough right now to be legitimate solutions.

from prometheus.

brian-brazil avatar brian-brazil commented on April 30, 2024

Hashing is for scaling ingestion and processing, not storage, and comes with significant complexity overhead. It should be avoided unless you've no other choice due to this. As you note you have a discontinuity every time you upshard, and generally you want to keep storage and compute disaggregated for better scaling and efficiency.

We don't want to have to end up implementing a clustered storage system within Prometheus, as that's adding a lot of complexity and potential failure modes to a critical monitoring system. We'd much prefer to use something external and let them solve those hard problems, even though none of the current options is looking too great. If that doesn't work out we can consider writing our own as a separate component, but hopefully it doesn't come to it.

I appreciate that you'd like long term storage. I ask you to wait until we can support it properly, rather than depend on users to hack something together in a way that's against the overall architecture of the system and that ends up being an operational and maintenance burden in the future.

from prometheus.

mattkanwisher avatar mattkanwisher commented on April 30, 2024

It's ok to have grand visions, but it doesn't mean we can't do anything in the short term. Graphite uses sharding for storage and query performance. It can even shard incoming queries, and it's not a particularly sophisticated system.

from prometheus.

juliusv avatar juliusv commented on April 30, 2024

While I agree that long-term and robust storage should be solved properly, I see some benefits of having a batch import endpoint independent of that. It's at least useful for things such as testing (when you want to quickly import a bunch of data into Prometheus to play with it or do benchmarks) or backfilling data in certain situations (like importing batches of metrics generated from delayed Hadoop event log processing that you want to correlate with other metrics).

The downside would of course be that it could attract lots of users to do the wrong thing (e.g. pushing data when they should really pull), and that any feature in general makes a product worse for all users who don't use it (more perceived product complexity, etc.).

from prometheus.

mattkanwisher avatar mattkanwisher commented on April 30, 2024

You can make the batch import a pull process: the new secondary (or slave) Prometheus can pull the existing data from another Prometheus via a /raw endpoint. I'm not sure it would be that confusing to new users, as there are other features like federation which do somewhat similar tasks but are probably unused by 95% of users.

from prometheus.

brian-brazil avatar brian-brazil commented on April 30, 2024

backfilling data in certain situations (like importing batches of metrics generated from delayed Hadoop event log processing that you want to correlate with other metrics).

That's frontfilling, which is a different use case and where the push vs. pull issue lies. Backfilling is what we're talking about here, as we're inserting data before existing data, where the questions are more around expectations of durability of Prometheus storage. Backfilling is less of an issue conceptually, as it's mainly an operational question.

If we were to implement backfilling I think something pushish would make more sense as it's an administrative action against a particular server rather than a more generic "expose data to be used somehow". You'd likely also want a reasonable amount of control around how quickly it's done etc. so as not to interfere with ongoing monitoring.

from prometheus.

mattkanwisher avatar mattkanwisher commented on April 30, 2024

Is it really that much conceptually different than your /federate endpoint, which allows downstream systems to scrape it? I'm just thinking we can expose the entire timeseries with some paging.

from prometheus.

brian-brazil avatar brian-brazil commented on April 30, 2024

Yes, /federate is only about getting in new data to provide high-level aggregations and has no bearing on storage semantics or expectations.

What you're talking about is adding data back in time, which is not supported by the storage engine (and not something that should even be considered outside of this exact use case). This changes the default stance that if a Prometheus loses its data, you bring up a fresh new one and move on.

from prometheus.

brian-brazil avatar brian-brazil commented on April 30, 2024

Just thinking there: the other place we'll have to have backfill is when someone wants to make an expression take effect back in time. So independent of storage-related discussions, we're likely to add in backfill at some point.

from prometheus.

mattkanwisher avatar mattkanwisher commented on April 30, 2024

@brian-brazil are you thinking about something like downsampling, or aggregations backwards in time?

from prometheus.

brian-brazil avatar brian-brazil commented on April 30, 2024

Aggregations/rules back in time. Downsampling is an explicit non-goal, we believe that belongs in long term storage.

from prometheus.

JensRantil avatar JensRantil commented on April 30, 2024

In #10 (comment) @fabxc stated:

ran some tests with 0.9 and InfluxDB consumed 14x more storage than Prometheus.

I just wanted to chime in that the latest InfluxDB 0.10 GA seems to have improved its storage engine considerably. Their blog post states:

Support for Hundreds of Thousands of Writes Per Second and 98% Better Compression

Could be worth revisiting.

from prometheus.

brian-brazil avatar brian-brazil commented on April 30, 2024

With https://influxdata.com/blog/update-on-influxdb-clustering-high-availability-and-monetization/ InfluxDB is no longer on the table.

from prometheus.

jkinred avatar jkinred commented on April 30, 2024

Does Blueflood have a compatible data model?

With InfluxDB off the table, it seems to be the most promising open source TSDB, backed by Cassandra.

Rackspace have a reasonably good reputation for keeping things open.

http://blueflood.io/

from prometheus.

brian-brazil avatar brian-brazil commented on April 30, 2024

Blueflood seems to have millisecond timestamps; it's unclear if it supports float64 as a data type (I'm guessing it does), but it has no notion of labels. It seems to have the right idea architecturally, but doesn't quite fit.

from prometheus.

jkinred avatar jkinred commented on April 30, 2024

Newts is another Cassandra backed option.

The data model is described at https://github.com/OpenNMS/newts/wiki/DataModel, supports labels.

I see a bit of contention about how good Cassandra is for time series, but a few of the TSDBs are building on it (KairosDB, Heroic and Newts).

from prometheus.

brian-brazil avatar brian-brazil commented on April 30, 2024

Newts doesn't really have a notion of labels, and I don't think its Cassandra schema will scale well for the amount of data we're dealing with.

from prometheus.

mattkanwisher avatar mattkanwisher commented on April 30, 2024

Yeah, most of the external options are not very good or require more management than the investment warrants. So currently at DigitalOcean we have some rather large sharded setups with Prometheus. We are investigating the possibility of not needing long-term storage, by having the Prometheus instances allow the data to be backed up to other nodes, and maybe having a way to reshard data. I haven't heard anyone talking yet about just extending the existing capabilities instead of pushing the data to yet another database, which will have a different query language than Prometheus.

from prometheus.

fabxc avatar fabxc commented on April 30, 2024

Your last part is more or less what it will converge to eventually – at least in my head.
With their custom query languages and models, existing solutions come with their own overhead and limitations. For a consistent read/write path it doesn't really make sense to enforce a mapping to work around those.

They are mostly based on Cassandra or HBase anyway, and for good reasons. We have to find a good indexing and chunk storage model that's applicable to similar storage backends, which then might even be selectable.

It's easy to talk about all that – it's not worth much without an implementation, which will take some time of course :)

from prometheus.

brian-brazil avatar brian-brazil commented on April 30, 2024

I see significant challenges with that approach, as it's ultimately making Prometheus into a full-on distributed storage system. I think we need to keep long term storage decoupled from Prometheus itself, so as not to threaten its primary goal of being a critical monitoring system. We wouldn't want a deadlock or code bug in such a complex system taking out monitoring; it's much easier to get deadlines on a few RPC calls right.

I haven't heard of anyone talking yet about just extending the existing capabilities instead of pushing it to yet another database, which will have a different query language then prometheus.

The plan is that however we resolve this, that you'll be able to seamlessly query the old data via Prometheus. If we just wanted to pump the data to another system with no reading back it'd make things far easier - you can already do that if you want.

from prometheus.

fabxc avatar fabxc commented on April 30, 2024

Yes, it makes it a full-on distributed storage system. And it shouldn't be part of the main server, of course.
It would be its own thing, but directly catering to our data and querying model.

I know it has challenging implications. But the implications of waiting for a TSDB that fits our model without limitations are worse. The existing ones seem to be unsuitable. And I'm not aware of anyone working on something that will be.

from prometheus.

brian-brazil avatar brian-brazil commented on April 30, 2024

As long as there's decoupling via a generic-ish RPC system I'm okay with that.

The existing and stable ones seem to be unsuitable.

Some are close.

There are actually more problems with OpenTSDB than the 8 tags; there are also limits on the number of values a label can have, so we can't even really use it as storage - though we've several users planning on putting their data there.

And I'm not aware of anyone working on something that will be.

πŸ˜„

The question it turns out isn't so much how do you solve it, but more what your budget is.

from prometheus.

bobrik avatar bobrik commented on April 30, 2024

There's actually more problems with OpenTSDB than the 8 tags

It's not an issue anymore.

there's also limits on the number of values a label can have so we can't even really use it as storage - though we've several users planning on putting their data there.

Limits are pretty big (16M) and you can make them even bigger.

from prometheus.

johann8384 avatar johann8384 commented on April 30, 2024

I am an OpenTSDB committer, I would be happy to coordinate the roadmap to
meet your needs if possible.

This would be very beneficial to me. I believe the authentication and
startup plugins will be useful for this. You could have a TSD startup and
get its configuration from Prometheus. The startup plugins are designed to
allow tight coupling between systems.

Another way to go about this would potentially be to use the Realtime
Publishing plugin in OpenTSDB to accept data into OpenTSDB and publish
additionally to Prometheus. There are benefits and drawbacks to that I am
sure.

-Jonathan

from prometheus.

johann8384 avatar johann8384 commented on April 30, 2024

Additional items I should mention, I have sustained 10 million writes per
second to OpenTSDB, and repeated that on another setup at a different
company. HBase scales really well for this.

The 2.3.0 branch, which should have an RC1 any day now, has expression
support, things like Sum(), Timeshift(), etc. These should make writing
your query support easier.

There is a new query engine Splicer, written by Turn which provides
significant improvements in query time. It works by breaking up the
incoming queries into slices and querying the TSD that is local to the
Regionserver. It will also cache the results in 1 hour blocks using Redis.
We use it in conjunction with multiple tsd instances per Regionserver
running in Docker containers. This allows us to run queries in parallel
blocks.

These new features and my experience scaling OpenTSDB should help to make
an ideal long term storage solution, in my opinion.

from prometheus.

brian-brazil avatar brian-brazil commented on April 30, 2024

It's not an issue anymore.

That's good to know, how many tags does it support now?

Limits are pretty big (16M) and you can make them even bigger.

I can imagine users hitting that, mostly by accident.

You could have a TSD startup and get its configuration from Prometheus.

That doesn't make sense to me; I'd expect Prometheus to be configured to send information to a given OpenTSDB endpoint, and that'd be the entire configuration required on the write side.

Realtime Publishing plugin in OpenTSDB to accept data into OpenTSDB and publish
additionally to Prometheus.

That's not the Prometheus architecture, Prometheus would be gathering data and also sending it on to OpenTSDB.

If we wanted to pull data in the other direction we'd write an OpenTSDB exporter, similar to how the InfluxDB exporter works.

Additional items I should mention, I have sustained 10 million writes per
second to OpenTSDB, and repeated that on another setup at a different
company. HBase scales really well for this.

Can you give an idea of the hardware involved in that?

These should make writing your query support easier.

The minimal support we need is the ability to specify a vector selector like {__name__="up",job="myjob",somelabel!="foo",otherlabel=~"a|b"} and get back all the data for all matching timeseries for a given time period efficiently. Queries may not include a name (though usually should), and it's not out of the question for a single name to have millions of timeseries across all time; tens of thousands would not be unusual.
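Purely as an illustration of what evaluating such a selector means on the storage side, here is a hypothetical sketch; the types, helper names and fully-anchored regex are invented for this example, not Prometheus APIs:

    // Hypothetical evaluation of a vector selector such as
    // {__name__="up",job="myjob",somelabel!="foo",otherlabel=~"a|b"}
    // against one series' label set on the remote-storage side.
    package main

    import (
        "fmt"
        "regexp"
    )

    type matchType int

    const (
        matchEqual matchType = iota
        matchNotEqual
        matchRegexp
    )

    type matcher struct {
        name  string
        typ   matchType
        value string
        re    *regexp.Regexp // only set for matchRegexp
    }

    func (m matcher) matches(labels map[string]string) bool {
        v := labels[m.name]
        switch m.typ {
        case matchEqual:
            return v == m.value
        case matchNotEqual:
            return v != m.value
        case matchRegexp:
            return m.re.MatchString(v)
        }
        return false
    }

    func main() {
        selector := []matcher{
            {name: "__name__", typ: matchEqual, value: "up"},
            {name: "job", typ: matchEqual, value: "myjob"},
            {name: "somelabel", typ: matchNotEqual, value: "foo"},
            {name: "otherlabel", typ: matchRegexp, re: regexp.MustCompile("^(a|b)$")},
        }
        series := map[string]string{"__name__": "up", "job": "myjob", "otherlabel": "a"}

        match := true
        for _, m := range selector {
            if !m.matches(series) {
                match = false
                break
            }
        }
        fmt.Println("series matches selector:", match)
    }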

I am an OpenTSDB comitter, I would be happy to coordinate the roadmap to
meet your needs if possible.

It's important for us to store full float64s. Given that ye support 64bit integers, could we just send them as that or is full support an option?

Full utf-8 support in tag values would also be useful, though we've already worked around that.

from prometheus.

juliusv avatar juliusv commented on April 30, 2024

@bobrik @johann8384 Thanks for those infos, that's great to know!

For anything more than small or toy use cases, we'll need to move query computation to the long-term storage (otherwise, data sets that need to be transferred back to Prometheus would become too large). So any existing remote storage would have to implement pretty much all of Prometheus's query language features in a semantically compatible way (even if maybe aggregation is the most important one, that's not necessarily at the leaf node of a query).

So having float64 value support is kind of crucial if you want to achieve the above, but the OpenTSDB docs actually mention that that's on the roadmap, so that's good: http://opentsdb.net/docs/build/html/user_guide/writing.html#floating-point-values

Whether OpenTSDB would ever be able to compatibly execute all of Prometheus's query language features is another question.

from prometheus.

beorn7 avatar beorn7 commented on April 30, 2024

For anything more than small or toy use cases, we'll need to move query computation to the long-term storage (otherwise, data sets that need to be transferred back to Prometheus would become too large). So any existing remote storage would have to implement pretty much all of Prometheus's query language features in a semantically compatible way (even if maybe aggregation is the most important one, that's not necessarily at the leaf node of a query).

While that's the ideal case, I don't think that's achievable. I would more think along Brian's lines above: a vector selector gives us all the data for the relevant time interval, and then query evaluation is done on Prometheus's side. Obviously, that limits queries to those that don't require gigabytes of sample data. But that's probably fine. The same caution as usual applies: create recording rules for expensive queries.
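A minimal sketch of that division of labour as a Go interface; all names here are invented for illustration and are not a proposed API:

    // Hypothetical read interface for a remote store: Prometheus sends the
    // matchers of a vector selector plus a time range, the store returns raw
    // series, and all further PromQL evaluation stays on the Prometheus side.
    package remote

    import "time"

    type LabelMatcher struct {
        Name, Value string
        Type        string // "=", "!=", "=~", "!~"
    }

    type Sample struct {
        TimestampMs int64
        Value       float64
    }

    type Series struct {
        Labels  map[string]string
        Samples []Sample
    }

    // Reader would be implemented by the long-term storage bridge.
    type Reader interface {
        // Select returns all series matching every matcher within [start, end].
        Select(matchers []LabelMatcher, start, end time.Time) ([]Series, error)
    }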

from prometheus.

brian-brazil avatar brian-brazil commented on April 30, 2024

Yeah. I can see us pushing down parts of some queries in the future, but likely only ever to other Prometheus servers in a sharded setup. More than that would be nice, but I don't see it happening in the foreseeable future.

For long-term storage, the amount of data you'd need to pull in within the storage system itself is likely to be enough of a bottleneck to prevent a large query from working, before we get to sending the result back to Prometheus.

from prometheus.

fabxc avatar fabxc commented on April 30, 2024

Limits are pretty big (16M) and you can make them even bigger.

I can imagine users hitting that, mostly by accident.

I'd be willing to live with that limitation.

For anything more than small or toy use cases, we'll need to move query computation to the long-term storage

Should OpenTSDB turn out to be suitable after all, I doubt that we can get query feature parity. If we do, it will always be a limiting factor when we want to extend PromQL.
With a "generic" read/write path it will be using a bridge anyway. That could be extended to care about distributed evaluation.

from prometheus.

juliusv avatar juliusv commented on April 30, 2024

Ok, if we're fine not supporting large aggregation use cases (but keep a road open to them in the future), that makes things easier of course. I guess there's an argument to be made for the smaller use cases since usually people don't care about as many dimensions (like instance) for historical data, so you might only be operating on metrics with far fewer series.

from prometheus.

juliusv avatar juliusv commented on April 30, 2024

(edited the comment above for clarification, for the people only reading emails)

from prometheus.

brian-brazil avatar brian-brazil commented on April 30, 2024

I think backfilling would be more important with long-term storage, to support any needed aggregations - not a primary concern though (and backfilling in long-term storage may obviate the need for it in Prometheus itself).

from prometheus.

johann8384 avatar johann8384 commented on April 30, 2024


It's not an issue anymore.

That's good to know, how many tags does it support now?

The tag limitation was previously a hard-coded constant; I usually set this to 16 in my setups. It is now a configuration value rather than a constant in the code.

Limits are pretty big (16M) and you can make them even bigger.

I can imagine users hitting that, mostly by accident.

I use a UID width of 4 instead of three, I didn't do the math but this
significantly increases this number. Even on the largest setups I have
deployed, there have only been 3 or 4 million UID assignments, and this
includes, metrics, tag keys and tag values combined.

You could have a TSD startup and get its configuration from Prometheus.

That doesn't make sense to me, I'd expect Prometheus to be configured to
send information to a given OpenTSB endpoint and that'd be the entire
configuration required on the write side.

Yes, that would be the most common use case, but technically you could have
a TSD node "bound" to each prometheus node, or a set of TSDs per prometheus
node, and allow each Prometheus node to have its own query cluster that
way. Perhaps a more common way to say what I was trying to say is that the
intended use of the startup plugins is for service discovery. When OpenTSDB
starts up it can register itself with Curator (ZooKeeper), Consul, Etcd,
etc. It could technically also get parameters from the service discovery
like the location of the zkQuorum, what HBase tables to use, or what ports
to listen on.


Additional items I should mention, I have sustained 10 million writes per
second to OpenTSDB, and repeated that on a other setup at a different
company. HBase scales really well for this.

Can you give an idea of the hardware involved in that?

I believe the original cluster was 24 Dell R710 machines, maybe 64GB ram, I
don't remember much of the other specs. The cluster at Turn is 36 nodes, 24
cores, 128GB Ram, 8 disks.


I am an OpenTSDB comitter, I would be happy to coordinate the roadmap to
meet your needs if possible.

It's important for us to store full float64s. Given that ye support 64bit
integers, could we just send them as that or is full support an option?

Full utf-8 support in tag values would also be useful, though we've
already worked around that.

As far as I am aware, nothing is off the table, let's work together to
implement that.


from prometheus.

johann8384 avatar johann8384 commented on April 30, 2024

I'm certain I need to learn more about the Prometheus query structure, but
OpenTSDB (seems to be) pretty good at aggregation across series within the
same metric name. There are also new aggregators, aggregator filters, and
of course the expression support.

I would recommend that we find a way to support pre-aggregating and
downsampling the data as we store it to OpenTSDB. So for example, we may
provide a list of tags to strip when writing. Another thought is to write
${metric}.1m-avg, ${metric}.5m-avg and automatically select those
extensions when reading large time ranges. This would mimic the way an RRD
storage system might work. So for the recent part of the query we pull full
resolution but as we get farther back, we can pull from the 1m-avg, and
5m-avg series.

Just thoughts of course.
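A simplified sketch of the read-side selection in that rollup scheme; the suffixes and thresholds are just the hypothetical examples from above, and it picks one resolution per query rather than mixing full resolution and rollups within a single query:

    // Hypothetical helper that picks which rollup series to read from,
    // mimicking the ${metric}, ${metric}.1m-avg, ${metric}.5m-avg naming
    // scheme sketched above. Thresholds are arbitrary examples.
    package main

    import (
        "fmt"
        "time"
    )

    func rollupSeries(metric string, queryRange time.Duration) string {
        switch {
        case queryRange <= 6*time.Hour:
            return metric // full resolution for short/recent queries
        case queryRange <= 7*24*time.Hour:
            return metric + ".1m-avg"
        default:
            return metric + ".5m-avg"
        }
    }

    func main() {
        fmt.Println(rollupSeries("http_requests_total", 2*time.Hour))
        fmt.Println(rollupSeries("http_requests_total", 30*24*time.Hour))
    }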


from prometheus.

brian-brazil avatar brian-brazil commented on April 30, 2024

I would recommend that we find a way to support pre-aggregating and
downsampling the data as we store it to OpenTSDB.

That's not something we can do without user input for pretty much every metric, as we don't know which labels are okay to remove. It'd also break user queries as it's no longer the same timeseries.

from prometheus.

johann8384 avatar johann8384 commented on April 30, 2024

Yes, pre-aggregation would be tricky.

from prometheus.

JensRantil avatar JensRantil commented on April 30, 2024

I'll just chime in that I think a lot of people are excited about Prometheus partially for its simplicity when it comes to deployment. I also think that's why InfluxDB was a good candidate, because it follows that same Go-single-statically-linked-binary fashion.

I understand InfluxDB has been taken out of the equation for good reasons, but I also believe OpenTSDB is a monster when it comes to deployment; no company with under 10 employees wants to run a fully fledged Hadoop with HBase, and I think long-term storage should be a viable alternative for smaller organizations not running Hadoop+HBase as well. I hope that the "long storage" solution, whatever it may be, is a solution that can be easily deployed. Both companies with and without Hadoop will want long-term storage. Those were my two cents...

from prometheus.

johann8384 avatar johann8384 commented on April 30, 2024

That is a completely valid, and good observation. HBase can run in stand-alone mode, it uses local files for storage rather than HDFS. It isn't really talked about in the documentation, but I have used it for a few small OpenTSDB deployments where the durability and performance of the cluster were not important. This may or may not be a good option here, for obvious reasons.

from prometheus.

mattkanwisher avatar mattkanwisher commented on April 30, 2024

Running hbase in standalone defeats the point of it being distributed ;) Prometheus already has its own store. Honestly I'm hoping it doesn't go towards OpenTSDB. We used to run a 50 node cluster and it was a full time job managing the cluster.

from prometheus.

juliusv avatar juliusv commented on April 30, 2024

Yeah, OpenTSDB being complex to operate is a common complaint and makes me very wary about it as well. Of course, you'd still have to weigh that against the likelihood of any other viable alternative materializing anytime soon... still also hoping for something more Go-ey and with fewer dependencies, but I'm not seeing it quite yet :)

from prometheus.

johann8384 avatar johann8384 commented on April 30, 2024

Is anyone actually working on this at this time?

from prometheus.

juliusv avatar juliusv commented on April 30, 2024

@johann8384 Nobody is currently working on a completely new distributed storage system, no. But there's some related work and discussion around a generic remote write API (#1487), but nothing concrete about read-back yet.

from prometheus.

brian-brazil avatar brian-brazil commented on April 30, 2024

The general idea is that it's the remote storage that's distributed and does downsampling. Prometheus local storage is quite efficient, but ultimately you want to keep only a few weeks of data in Prometheus itself for fast/reliable access and depend on remote storage beyond that. Then you don't really care about how much space Prometheus uses (as long as it holds at least a few days, you're good) or if you lose one of a HA pair every now and then.

from prometheus.

brian-brazil avatar brian-brazil commented on April 30, 2024

I don't see any way to sanely use BigQuery here due to the columnar data model, unless you're querying the data extremely rarely.

from prometheus.

brian-brazil avatar brian-brazil commented on April 30, 2024

Hawkular has float64, millisecond timestamps and key/value pair labels. A given metric has only one set of tags, but we could work around that. The API is JSON, so it's unclear if it can handle non-real values. C* is the backend. It seems to support the operations we need on labels.

What's unclear is how exactly it uses C*, and how it has implemented the label lookups. Looking at the schema, neither looks efficient enough for our use case.

from prometheus.

yhilem avatar yhilem commented on April 30, 2024

See:

  • Cassandra 3.4 added support for the SASI custom index.
  • SASI use case: Implement string metric type (https://issues.jboss.org/browse/HWKMETRICS-384)

from prometheus.

yhilem avatar yhilem commented on April 30, 2024

I created the issue "Hawkular metrics as the Long-term storage backend for prometheus.io" (https://issues.jboss.org/browse/HWKMETRICS-400)

from prometheus.

brian-brazil avatar brian-brazil commented on April 30, 2024

I think it'd take a major redesign of Hawkular to make it work for the data volumes Prometheus produces. For example a single label matcher such as job=node can easily match tens of millions of time series on a large setup. That's going to blow out the 4GB row size limit in C* for metrics_tags_idx.

Hawkular also appears to use at least 32 bytes per sample, before replication.

from prometheus.

yhilem avatar yhilem commented on April 30, 2024

I do not know of this 4GB row size limit in C*.
Cassandra has a 2 billion column limit (kairosdb/kairosdb#224).
CQL limits: https://docs.datastax.com/en/cql/3.1/cql/cql_reference/refLimits.html

What's the relation to PR "Generic write" (#1487)?

from prometheus.

jsanda avatar jsanda commented on April 30, 2024

I think it'd take a major redesign of Hawkular to make it work for the data volumes Prometheus produces. For example a single label matcher such as job=node can easily match tens of millions of time series on a large setup. That's going to blow out the 4GB row size limit in C* for metrics_tags_idx.

You are correct that there is the potential that we could wind up with very wide rows in metrics_tags_idx. It has not been a concern because we have not been dealing with data sets large enough to necessitate a change.

Changing the schema for metrics_tags_idx as well as our other index tables is something we certainly could do. One possibility would be to implement some manual sharding. We would add a shard or hash id column to the partition key. This would also allow us to effectively cap the number of rows per partition. As the data set grows we might need to reshard and increase the number of shards. I think the solution would have to take this into account as well.
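To make the manual-sharding idea concrete, a small hypothetical sketch of deriving the extra shard column; the shard count and hash choice are arbitrary examples:

    // Hypothetical sketch of the manual sharding idea for a tags index: the
    // partition key becomes (tag_name, tag_value, shard), where the shard is
    // derived from the series id, so a single very popular tag pair is spread
    // over numShards partitions and a lookup fans out over all shards.
    package main

    import (
        "fmt"
        "hash/fnv"
    )

    const numShards = 16 // example value; growing the data set may require resharding

    func shardID(seriesID string) uint32 {
        h := fnv.New32a()
        h.Write([]byte(seriesID))
        return h.Sum32() % numShards
    }

    func main() {
        // Index entries for job=node no longer form one huge partition;
        // each series lands in one of 16 sub-partitions.
        fmt.Printf("(job, node, %d)\n", shardID(`node_cpu{instance="host-42:9100"}`))
    }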

from prometheus.

jsanda avatar jsanda commented on April 30, 2024

Hawkular also appears to use at least 32 bytes per sample, before replication.

Which table(s) are you referring to? We recently moved to Cassandra 3.x, which includes a major refactoring of the underlying storage engine. I need to review the changes some more before I can say precisely how many bytes are used per sample. Keep the following in mind, though: in Cassandra 2.x the column name portion of clustering columns was repeated on disk for every cell. This is no longer the case in Cassandra 3, which can result in considerable space savings. And Cassandra stores data in compressed format by default.

from prometheus.

leecalcote avatar leecalcote commented on April 30, 2024

Just wanting to make sure I understand the state of this area of enhancement... are the remote providers in /storage/remote supported now? If so, what remains to be completed here?

from prometheus.

brian-brazil avatar brian-brazil commented on April 30, 2024

They are effectively deprecated. They will be removed once we have some form of generic write in Prometheus (e.g. #1487).

from prometheus.

danburkert avatar danburkert commented on April 30, 2024

Hi all, I wanted to bring up another possibility for a storage layer: Apache Kudu (incubating). Kudu bills itself as an analytics storage layer for the Hadoop ecosystem, and while it is that, it's also an excellent platform for storing timeseries metrics data. Kudu brings some interesting features to the table:

  • Columnar on-disk format: supports great compression with timeseries data, and enables extremely fast scans. Fancy encodings like bitshuffle, dictionary, and run length, as well as multiple compression types (LZ4, etc.) are built in
  • Strongly consistent replication via Raft
  • No dependencies: doesn't require HDFS/Zookeeper/anything else
  • Designed with operations in mind: no garbage collection pauses, and advanced heuristics for disk compactions that make them predictable and smooth
  • Advanced partitioning: metrics can be hash-distributed among nodes, and partitions can be organized by time, so that new partitions can be brought online, and old partitions can be dropped as they ttl. This provides great scalability for metrics workloads
  • Designed for scan-heavy workloads: unlike a lot of databases that are optimized first and foremost for single record or value retrieval, Kudu is optimized for scans

My goal is to let you all know about Kudu as a potential storage solution, and ask what Prometheus is looking for in a storage layer. I've had some experimental success with a TSDB-compatible HTTP endpoint in front of Kudu, but perhaps Prometheus is looking for a different sort of API? It would be great to get a sense of what Prometheus needs from a distributed storage layer, and if Kudu could fill the role.

As far as the maturity of Kudu, we are planning to have a 1.0 production ready release later this summer. The project has been under development for more than three years, and we already have some very large production installations (75 nodes, ingesting 250K records/sec continuously), and routinely test on 200+ node clusters.

from prometheus.

brian-brazil avatar brian-brazil commented on April 30, 2024

I expect Kudu to fall down for the same reasons BigQuery did, the access patterns for time series data and columnar data are very different.

75 nodes, ingesting 250K records/sec continuously

A single Prometheus server can generate over three times that load.

from prometheus.

danburkert avatar danburkert commented on April 30, 2024

@brian-brazil could you be more specific about the problematic access patterns? Except for being columnar, our architecture isn't really comparable to BigQuery. Kudu is designed for low-latency, highly concurrent scans.

A single Prometheus server can generate over three times that load.

Yeah, that is probably a bad example; that use case isn't metrics collection, and I believe the record sizes are quite large comparatively. I was able to max out a single-node Kudu setup at ~200K metric datapoint writes/second without Kudu breaking a sweat on my laptop (it was bottlenecked in the HTTP proxy layer). I haven't really dug in and gotten solid numbers yet, though. Definitely on my TODO list.

from prometheus.

danburkert avatar danburkert commented on April 30, 2024

In particular - Kudu keeps data sorted by a primary key index, so scanning for a particular metric and time range only requires reading the required data. As a result, timeseries scans can have <10ms latencies.

from prometheus.

brian-brazil avatar brian-brazil commented on April 30, 2024

If I had a TSDB with 1 billion metrics, and gave you an expression that matched 100 of them over a year how would that work out?

from prometheus.

danburkert avatar danburkert commented on April 30, 2024

With the experimental project I linked to earlier, it keeps a secondary index in a side table of (tag key, tag value) -> [series id], and then uses the resulting IDs to perform the scan. So if you have a billion point dataset but your query only matches 100 points based on the metric name, tagset and time, it will only scan the data table for exactly those 100 points. It's modeled after how OpenTSDB uses HBase, but with a few important differences. There's a bit more info on that here. One huge benefit of having a columnar format instead of rowwise, is that a system like this doesn't need to do any external compactions/datapoint rewriting, like OpenTSDB has to do for HBase. The columnar encoding and compression options are already built in and better than anything that OpenTSDB will do.

That is just a particular instance of how timeseries storage could be done with Kudu. If the data model doesn't look like the OpenTSDB-style metric/tagset/timestamp/value, it could be done differently.

from prometheus.

brian-brazil avatar brian-brazil commented on April 30, 2024

One of those tag matches (of which there's likely 2-4) could easily be for 10M time series. For 100 time series over a year with 5m resolution, that's about 10 million points.

from prometheus.

danburkert avatar danburkert commented on April 30, 2024

Sorry, I don't follow. So for query such as

SELECT * WHERE
   metric = some_metric AND
   timestamp >= 2015-01-01T00:00:00 AND
   timestamp < 2016-01-01T00:00:00 AND
   tag_a = "val_a" AND
   tag_b = "val_b" AND
   tag_c = "val_c";

it finds the set of timeseries where tag_a = "val_a", the set where tag_b = "val_b" and the set where tag_c = "val_c". It takes these three sets, finds the intersection (in order to find the series which match all three predicates), and then issues a scan in parallel for each. Each of these scans can read back only the necessary data (although the data may be spread across multiple partitions). The schema wasn't really designed for the case where an individual (tag_key, tag_value) pair might have millions of matching series, so there is probably a more efficient way to do it with that constraint in mind.
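A minimal sketch of that intersection step, assuming each per-tag lookup returns a sorted, de-duplicated list of series ids (types invented for illustration):

    // Hypothetical intersection of per-tag posting lists as described above:
    // each (tag_key, tag_value) lookup yields a sorted list of series ids,
    // and the query then scans only the ids present in every list.
    package main

    import "fmt"

    // intersect assumes both inputs are sorted ascending and duplicate-free.
    func intersect(a, b []uint64) []uint64 {
        var out []uint64
        i, j := 0, 0
        for i < len(a) && j < len(b) {
            switch {
            case a[i] == b[j]:
                out = append(out, a[i])
                i++
                j++
            case a[i] < b[j]:
                i++
            default:
                j++
            }
        }
        return out
    }

    func main() {
        tagA := []uint64{1, 4, 7, 9, 12} // series ids where tag_a = "val_a"
        tagB := []uint64{2, 4, 9, 12}    // series ids where tag_b = "val_b"
        tagC := []uint64{4, 5, 9}        // series ids where tag_c = "val_c"

        ids := intersect(intersect(tagA, tagB), tagC)
        fmt.Println("series to scan:", ids) // [4 9]
    }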

from prometheus.

brian-brazil avatar brian-brazil commented on April 30, 2024

Yes, a label such as job=node could easily have 10s of millions of matching metrics.

from prometheus.

danburkert avatar danburkert commented on April 30, 2024

The tag to series ID lookup I mentioned earlier is independent of the metric, so many metrics can share a single series ID. A series ID really just describes a unique set of labels, in Prometheus terms. Again, this is how I structured that solution, so it's not inherent to Kudu as a storage layer. Obviously multi-attribute indexing is a difficult problem and there are a lot of ways to go about it.

@brian-brazil more generally, it sounds like you have a pretty good idea of what you are looking for in an external storage system. Is that written up anywhere, or could you elucidate?

from prometheus.

gouthamve avatar gouthamve commented on April 30, 2024

Hi,

This is an interesting thread, and yes, even I was about to ask if there was some proposal somewhere about the long-term storage. It would make it easy for us to get context and understand if this is a task that we can take up and experiment with.

Thanks,
Goutham.

from prometheus.

brian-brazil avatar brian-brazil commented on April 30, 2024

The idea behind #1487 is to expose an interface to allow users to experiment with solutions. The full-on solution is quite difficult, however more constrained forms of the problem (such as not needing indexing) are more tractable.

from prometheus.
