
vonng / pg_exporter

Advanced PostgreSQL & Pgbouncer Metrics Exporter for Prometheus

Home Page: https://pigsty.io

License: Apache License 2.0

Dockerfile 1.06% Makefile 4.18% Go 76.22% Shell 18.55%
postgres prometheus prometheus-exporter pgbouncer pg-exporter monitoring


pg_exporter's Issues

Add option to test selected metric(s) and also options to select metric(s) to test, explain and configuration dump

Hello. I suggest adding:

Flags
       [...]

       --test
           Test run (a single pass that writes the collected '/metrics' output to standard output for review).
           To limit the test run to a certain list of metrics, use the '-M' option.

       -M, --metrics LIST
           Limit the test run (see the '--test' option) to a comma-separated (',') list of metric names.
           Only the SQL queries of the selected metrics will be run (e.g. '--test --metrics pg_up,pg_uptime'
           will run the SQL queries of the 'pg_up' and 'pg_uptime' metrics).

       [...]

I also suggest applying --metrics LIST to --explain in the same way, if possible.

The --dry-run option should produce no output other than warnings, errors, etc. In fact, it must not even attempt to connect to Postgres. That is what "dry run" means (!): everybody expects it to do literally nothing, to take no real action (which a connection certainly is).

To dump the actual configuration, I suggest providing another option, say --show-config, --dump-config, --format-config or similar, and applying --metrics LIST to it as well.

That way you could easily

  1. see what's configured for a certain metric (e.g. --show-config -M pg_up);
  2. check that there is no error or discrepancy in the configuration (e.g. --dry-run);
  3. test the metric of interest (e.g. --test -M pg_up).

The exporter should return proper exit codes so that these features could be used in scripts etc.

Whether to also provide environment variables for the suggested options remains an open question. I generally see no point in not doing so (it would be good for container users, who tend to rely on environment variables).


I think all these things are somewhat related to each other, so I aggregated them into this single feature request; sorry for that.
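
For illustration, a rough sketch of how the proposed flags could be declared in Go, assuming the kingpin flag library (the "try --help" hint in error output elsewhere on this page suggests it is already in use); the flag and function names are just the ones proposed above, not existing options:

    import (
        "strings"

        "github.com/alecthomas/kingpin/v2"
    )

    var (
        // hypothetical flags mirroring the proposal above
        testRun     = kingpin.Flag("test", "Single-pass run printing collected metrics to stdout.").Bool()
        metricsList = kingpin.Flag("metrics", "Comma-separated metric names for --test / --explain.").Short('M').String()
    )

    // selectedMetrics turns the -M value into a lookup set; an empty flag selects everything.
    func selectedMetrics() map[string]bool {
        sel := map[string]bool{}
        for _, name := range strings.Split(*metricsList, ",") {
            if name = strings.TrimSpace(name); name != "" {
                sel[name] = true
            }
        }
        return sel
    }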

Allow to collect metrics from multiple sources

Hello. First, thank you for this unique Postgres exporter! 🥇 IMHO, it could be improved significantly (probably with fairly little effort in Go) by adding a higher level of configuration that lets us collect metrics from multiple sources, distinguished by some "top level" configuration of either tags (e.g. a hostname), a metric prefix (e.g. pg1, pg2), or both.

I don't like the idea of running a separate exporter instance for every single Postgres server I happen to manage. Some of the clusters are also replicated across distant locations (which calls for collecting metrics from a remote point, or from multiple points). Sometimes it is viable to run the exporter on the same server as the Postgres cluster, and sometimes it would be really nice to monitor, from a single point, many Postgres instances belonging to a single cluster spread across various providers etc.

Side note:

I am also toying with a (still vague) idea that having the data of a replicated cluster collected in one place could enable further development of the exporter: e.g. using collected metrics in some sort of templating of the probes to gain additional flexibility and dynamism in the monitoring process, slowing down collection under certain conditions, adding something like a priority queue to the scheduler, or perhaps even writing some results back to the monitored database in a "loop" or to a completely different database (it's just SQL after all, and if the queries and check parameters were templated based on measurements, then...).

😁

AFAIK, no Postgres collector is as finely grained as this one, and none that I know of allows collecting metrics from multiple sources at once and making use of that fact (if nothing else, then at least "administratively" or to save overall resources). Therefore I think this could be a great improvement. Thank you.

Add Column.Default option to set default value when null

It would be convenient to be able to define a default value for a metric when NULL is returned.

E.g. lots of metrics would be better off reporting 0 instead of NaN.

While coalesce(col, 0) would work, it is just too ugly and hacky.

It would be nice to do it in the configuration:

    - exec_time:
        usage: COUNTER
        default: 0
        description: Total time spent executing the statement, in µs
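
A minimal sketch of how such a default might be applied when scanning query results, assuming the column value is read into a sql.NullFloat64 (the function name is hypothetical):

    import "database/sql"

    // columnValue substitutes the configured default when the scanned value is NULL,
    // so the metric is emitted as e.g. 0 instead of NaN.
    func columnValue(raw sql.NullFloat64, def float64) float64 {
        if !raw.Valid {
            return def
        }
        return raw.Float64
    }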

0.5.0 coredumps

The upx-compressed binary in the rpm coredumps; environment: 5.14.0-162.23.1.el9_1.x86_64.
A manually compiled binary without upx works fine.

fail connecting to primary server: fail fetching server version

Thanks for the great work on this! I'm running pg_exporter and I'm hitting an error on the precheck steps.

$ pg_exporter
INFO[0000] retrieve target url  from PG_EXPORTER_URL     source="pg_exporter.go:1938"
INFO[0000] retrieve config path pg_exporter.yaml from PG_EXPORTER_CONFIG  source="pg_exporter.go:2009"
ERRO[0000] fail connecting to primary server: fail fetching server version: driver: bad connection, retrying in 10s  source="pg_exporter.go:1517"

This appears to be where the query in question is made: SHOW server_version_num;

When I connect directly using psql at the same URI, I'm able to run it without issue.

$ psql $PG_EXPORTER_URL
psql (12.8 (Ubuntu 12.8-0ubuntu0.20.04.1), server 14.0 (Debian 14.0-1.pgdg110+1))
WARNING: psql major version 12, server major version 14.
         Some psql features might not work.
Type "help" for help.

postgres=# show server_version_num;
 server_version_num 
--------------------
 140000
(1 row)

I'm not sure how/if it matters but I'm accessing this db over a TCP proxy using kubectl proxy. It doesn't seem to impact any other postgres clients but worth mentioning.

What am I missing?

Add pg_exporter connect PG instance timeout setting

When we deploy pigsty to monitor PG instances in the same datacenter, the pg_exporter connect timeout of 100ms works fine.

But in the real prod env we have PG instances in multiple regions. If we deploy one pigsty to monitor all the PG instances across the different regions, pg_exporter reports this error:

Nov 24 10:27:07 staging-gcp-sg-vm-platform-pigsty-1 pg_exporter_staging-gcp-hk-pgsql12-platform-1-1[28699]: time="2021-11-24T10:27:07Z" level=error msg="fail connecting to primary server: fail fetching server version: driver: bad connection, retrying in 10s" source="pg_exporter.go:1521"
Nov 24 10:27:07 staging-gcp-sg-vm-platform-pigsty-1 pg_exporter: time="2021-11-24T10:27:07Z" level=error msg="fail connecting to primary server: fail fetching server version: driver: bad connection, retrying in 10s" source="pg_exporter.go:1521"

But in fact, we can connect to the PG instance from the pigsty host using the psql command line.

The timeout config in pg_exporter is:

[screenshots: the timeout-related code in pg_exporter and the packet capture]

As we discussed:

Scrapes taking longer than 100ms are actively cancelled and judged as failed, to avoid an avalanche. I had not anticipated cross-datacenter scraping before.
Judging by the packet capture, it lands right around the 100ms threshold.
At +150ms the result has already been returned, but the request is still aborted with a timeout error.

-- But could this timeout threshold become an optional parameter in the next version? Keep the default at 100ms; in cross-region cases it could be raised, say to 1s, which should still be enough for monitoring. The benefit is that pigsty deployed on one VM in one region could then monitor the PG instances of all regions.

--
Reasonable
Feel free to open an issue for me at https://github.com/Vonng/pg_exporter
I will change it in the next release

So I suggest making the pg_exporter connect timeout threshold configurable.
By default, the value stays 100ms.
In special environments, such as using one pigsty to monitor PG instances across multiple datacenters, the value can be increased.
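
A sketch of what a configurable connect timeout could look like, wrapping the version probe in a context deadline; the function and parameter are hypothetical, not the exporter's actual code:

    import (
        "context"
        "database/sql"
        "time"
    )

    // fetchServerVersion runs the version probe with a configurable deadline
    // (default 100ms; could be raised to e.g. 1s for cross-region targets).
    func fetchServerVersion(db *sql.DB, timeout time.Duration) (string, error) {
        ctx, cancel := context.WithTimeout(context.Background(), timeout)
        defer cancel()
        var version string
        err := db.QueryRowContext(ctx, "SHOW server_version_num;").Scan(&version)
        return version, err
    }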

Increase Caller Skip Depth by One Layer

With the current caller skip depth, the caller field only points at the logging helper layer, making it impossible to pinpoint the actual problematic code. The reason is that the logging library skips the first 3 call frames by default, but reaching the real call site requires skipping 4. To solve this, an extra caller skip level must be specified when initializing the logger.

Current code (utils.go:44):

    logger = level.NewFilter(logger, lvl)
    logger = log.With(logger, "timestamp", log.DefaultTimestampUTC, "caller", log.DefaultCaller)

Correct code:

    logger = level.NewFilter(logger, lvl)
    logger = log.With(logger, "timestamp", log.DefaultTimestampUTC, "caller", log.Caller(4))

TLS/SSL support

Are there any plans for the exporter output to have TLS/SSL support? Just looking through docs and didn't see it mentioned.
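
For reference, a bare-bones sketch of what serving the metrics endpoint over TLS could look like with the standard library; the default port 9630 is pg_exporter's, but the certificate paths (and the idea of dedicated flags for them) are assumptions:

    package main

    import (
        "log"
        "net/http"

        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    func main() {
        http.Handle("/metrics", promhttp.Handler())
        // cert/key paths would come from new flags or environment variables
        log.Fatal(http.ListenAndServeTLS(":9630", "server.crt", "server.key", nil))
    }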

Generic "predicate query" to decide whether to collect a metric?

I'm looking at adopting pg_exporter as a potential replacement for the CloudNative-PG built-in scraper I currently use. It looks pretty impressive, apart from the lack of test coverage and test facilities.

One of the few things it doesn't do that the CNPG scraper does is permit the use of a generic boolean expression query to decide "should I run this metric query or not" (added by cloudnative-pg/cloudnative-pg#4503). pg_exporter has tags, but the tags must be specified from the outside by whatever provisions the service, and apply to all DBs being scraped.

I'm wondering if you'd be open to submission of a patch to implement that, assuming that my initial PoC with this scraper finds it otherwise a good fit for my needs.

I need to express things like "collect this metric set if the foo extension is installed and is at least version 2.1", "collect this metric only when running on (some specific vendor postgres flavour)" or "only try to collect this if this specific table exists and has this specific column".

This isn't possible in the plain Postgres SQL dialect due to its lack of bare procedural blocks and the inability of a DO block to return a result set. Hacks with refcursors can be used to get a result out of a DO block, but only with multiple statements, which this scraper doesn't support (and probably shouldn't, as it opens a whole can o' worms).

I propose an extension to the syntax like:

predicate_queries:
  - predicate_query: |
      SELECT EXISTS (SELECT 1 FROM information_schema.tables WHERE table_schema = 'pg_catalog' AND table_name = 'foo')
query: |
  SELECT a, b, c FROM pg_catalog.some_vendor_extension;

The predicate queries would be run in the order specified. If any returns zero rows or a 1-column boolean false, the metric collection is skipped. If it returns true, collection proceeds to check the next predicate (if any) or to collect the metric. A multi-column result or multi-row result logs an error the first time it runs then disables the metric for the lifetime of the process.

The purpose of multiple queries is to permit things like (a bit contrived):

predicate_queries:
  - predicate_query: |
      SELECT 1 FROM pg_extension WHERE extname = 'foo';
  - predicate_query: |
      SELECT foo_is_enabled FROM foo.foo_is_enabled()

where it's first necessary to know that some database object exists before attempting to access it, since Pg requires that names resolve at parse/plan time, so you can't lazy-bind possibly-undefined names in e.g. a CASE.

Predicate query results would support the ttl key and if set, use the cache. If the main metric query reports an error, that would automatically invalidate the predicate query cache. E.g.

predicate_queries:
  - predicate_query: |
      SELECT 1 FROM pg_extension WHERE extname = 'foo';
    ttl: 300

I'm also tempted to add separate positive and negative cache TTLs since it's common to have a metric initially not scrape-able due to an extension not yet being created, but once it's scrape-able once it stays that way. It's probably unnecessary though, and exponential backoff might be better if it proves to be required.

If I get the chance I may be able to add named query indirection so one predicate query set can be used for multiple scrapers to reduce verbosity. I'm not sure if it'll be worth it, will see.

Thoughts? The in-house implementation of this logic turned out to be very simple to do, and should translate well to this scraper.
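
For illustration, a sketch of the predicate evaluation described above; it assumes each predicate returns either zero rows or a single boolean-ish column, and the function name is hypothetical:

    import (
        "context"
        "database/sql"
    )

    // predicatesPass runs the predicate queries in order; the metric is collected
    // only if every predicate returns a row whose single column is true.
    func predicatesPass(ctx context.Context, db *sql.DB, predicates []string) (bool, error) {
        for _, q := range predicates {
            var ok bool
            switch err := db.QueryRowContext(ctx, q).Scan(&ok); {
            case err == sql.ErrNoRows:
                return false, nil // zero rows: skip this collector
            case err != nil:
                return false, err // malformed predicate: caller logs and disables the metric
            case !ok:
                return false, nil
            }
        }
        return true, nil
    }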

Built-in metrics definition

Embed a default metrics definition into pg_exporter that supports PG 10 - 14.

  • Embed the static config into the binary with Go 1.16 embed (see the sketch after this list).
  • Add an option to disable the default metrics.
  • Add an option to append new collectors rather than overwrite the default collectors.
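
A minimal sketch of the embed part, assuming the defaults live in a file named pg_exporter.yml next to the source (the file and variable names are placeholders):

    package exporter

    import _ "embed"

    // defaultMetricsYAML carries the built-in metrics definition, compiled into the
    // binary with Go 1.16's //go:embed; it can be disabled or extended at runtime.
    //go:embed pg_exporter.yml
    var defaultMetricsYAML []byte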

PostgreSQL 14 Support

Add support for PostgreSQL new metrics:

  • pg_stat_database
  • pg_locks
  • pg_stat_wal
  • pg_stat_replication_slots
  • pg_stat_progress_copy
  • pg_prepared_statements

PG_EXPORTER_INCLUDE_DATABASE doesn't work

When I set PG_EXPORTER_AUTO_DISCOVERY=true,
the other databases are only discovered if they are explicitly listed in --include-database.
The PG_EXPORTER_INCLUDE_DATABASE parameter is not honored.
Why is it necessary to provide the list of databases in this parameter?
For example, if it is empty, all databases that are not in --exclude-database could be taken.

extension, namespace tags use values from default DB only?

If I read this code correctly, the scraper collects the list of extensions and namespaces present from the default DB it connects to.

This would result in a particular metric set incorrectly being run on DBs where it should not run, or not being run where it should, if the set of extensions and/or namespaces differs between DBs on the postgres instance.

Am I missing something there? It looks like the namespace and extensions lists probably need to be collected per-discovered-database.

It also appears to collect DB names including those with datallowconn = false or datistemplate = true, which will either fail to scrape or not be useful to scrape.

New `jackc/pgx` driver does not work well with pgbouncer metrics

Two problems with the new jackc/pgx driver when working with PgBouncer:

PgBouncer works only with the simple query protocol, which can be solved by:

config, err := pgx.ParseConfig(s.dsn)
config.DefaultQueryExecMode = pgx.QueryExecModeSimpleProtocol

The driver will send a '-- ping' command to pgbouncer, which triggers an error log periodically:

Jun 30 02:11:47 rocky8 pgbouncer[109508]: C-0x55e5664ad6c0: pgbouncer/dbuser_monitor@unix(5155):6432 pooler error: invalid command '-- ping', use SHOW HELP;
Jun 30 02:11:47 rocky8 pgbouncer[109508]: C-0x55e5664ad6c0: pgbouncer/dbuser_monitor@unix(5155):6432 pooler error: invalid command '-- ping', use SHOW HELP;
Jun 30 02:11:56 rocky8 pgbouncer[109508]: C-0x55e5664ad6c0: pgbouncer/dbuser_monitor@unix(5155):6432 pooler error: invalid command '-- ping', use SHOW HELP;

The latter one seems tricky; I'm looking into it.

;-) Any thoughts on this ? @ringerc

Crash when address already bound

Instead of exiting gracefully, the scraper panics with a segfault if the listening address is already bound:

level=error timestamp=2024-06-12T01:21:25.773821082Z caller=utils.go:76 msg="http server failed: listen tcp :9630: bind: address already in use"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x88f2d5]

goroutine 1 [running]:
database/sql.(*DB).Close(0x0)
	/home/craig/apps/go-1.22/src/database/sql/sql.go:910 +0x35
github.com/Vonng/pg_exporter/exporter.(*Exporter).Close(0xc000228000)
	/home/craig/projects/EDB/pg_exporter/exporter/exporter.go:209 +0x188
github.com/Vonng/pg_exporter/exporter.Run()
	/home/craig/projects/EDB/pg_exporter/exporter/main.go:219 +0xd17
main.main()
	/home/craig/projects/EDB/pg_exporter/main.go:22 +0xf
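
The fix presumably just needs a nil guard before the connection is closed; a sketch, where the struct is a stand-in for the real exporter.Exporter:

    import "database/sql"

    type Exporter struct {
        db *sql.DB // other fields omitted
    }

    // Close guards against a nil *sql.DB, which happens when the HTTP listener
    // fails before any database connection has been established.
    func (e *Exporter) Close() {
        if e.db != nil {
            _ = e.db.Close()
        }
    }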

Add includeDatabase option

There should be two kinds of multiple-database support policy (see the sketch after this list):

  • List target databases with --include-database
    • can be a comma-separated datname list
    • can be a regex used for name matching
  • Exclude specific databases with --exclude-database
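
A sketch of how the two forms could be told apart (try the comma-separated name list first, then fall back to regex matching); purely illustrative, the function name is hypothetical:

    import (
        "regexp"
        "strings"
    )

    // includeDB reports whether datname matches --include-database, which may be
    // a comma-separated list of database names or a regular expression.
    func includeDB(include, datname string) bool {
        if include == "" {
            return true // empty include list means no restriction
        }
        for _, name := range strings.Split(include, ",") {
            if strings.TrimSpace(name) == datname {
                return true
            }
        }
        if re, err := regexp.Compile(include); err == nil {
            return re.MatchString(datname)
        }
        return false
    }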

Add Column.Scale to multiply a factor to scraped metrics

It would be convenient to be able to scale a metric value by a certain factor.

E.g. the original metric is in µs, while common practice is to transform such metrics into standard units, such as seconds.

It would be annoying to do this in the raw SQL, but it would be great to do it in the configuration:

    - exec_time:
        usage: COUNTER
        scale: 1e-6
        description: Total time spent executing the statement, in µs
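
Applying the factor at scrape time would be a one-liner; a sketch with hypothetical names, treating an unset scale as "leave the value alone":

    // scaled multiplies a scraped value by the column's configured scale factor,
    // e.g. 1e-6 to convert microseconds into seconds.
    func scaled(value, scale float64) float64 {
        if scale == 0 {
            return value
        }
        return value * scale
    }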

error with PG_EXPORTER_NAMESPACE="pg"

Hi,

When I put export PG_EXPORTER_NAMESPACE="pg" in the default env file, there is an error:
..;pg_exporter[3999]: pg_exporter: error: strconv.ParseBool: parsing "pg": invalid syntax, try --help

If I exclude PG_EXPORTER_NAMESPACE but set --namespace=pg in PG_EXPORTER_OPTS, it works:
PG_EXPORTER_OPTS='--namespace=pg --log.level=info --log.format="logger:syslog?appname=pg_exporter&local=7"'

Error when scraping with 2 databases

With 2 databases assigned, when I try to scrape the metrics I get the following error:
An error has occurred while serving metrics:

11 error(s) occurred:

  • collected metric "pg_bgwriter_checkpoints_timed" {counter: <value: 73>} was collected before with the same name and label values
  • collected metric "pg_bgwriter_checkpoints_req" {counter: <value: 7>} was collected before with the same name and label values
  • collected metric "pg_bgwriter_checkpoint_write_time" {counter: <value: 51>} was collected before with the same name and label values
  • collected metric "pg_bgwriter_checkpoint_sync_time" {counter: <value: 17>} was collected before with the same name and label values
  • collected metric "pg_bgwriter_buffers_checkpoint" {counter: <value: 84>} was collected before with the same name and label values
  • collected metric "pg_bgwriter_buffers_clean" {counter: <value: 0>} was collected before with the same name and label values
  • collected metric "pg_bgwriter_buffers_backend" {counter: <value: 48>} was collected before with the same name and label values
  • collected metric "pg_bgwriter_maxwritten_clean" {counter: <value: 0>} was collected before with the same name and label values
  • collected metric "pg_bgwriter_buffers_backend_fsync" {counter: <value: 0>} was collected before with the same name and label values
  • collected metric "pg_bgwriter_buffers_alloc" {counter: <value: 1516>} was collected before with the same name and label values
  • collected metric "pg_bgwriter_stats_reset" {counter: <value: 1.621928245e + 09>} was collected before with the same name and label values

Histogram Support

Histograms are perfect for lock & session duration distributions.
e.g.:

pg_lock_histo{le="0"}
pg_lock_histo{le="10"}
pg_lock_histo{le="50"}
pg_lock_histo{le="100"}
pg_lock_histo{le="1000"}
...
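
For reference, client_golang can expose pre-aggregated bucket counts from a query as a constant histogram, without keeping a HistogramVec around; a sketch with made-up numbers and a hypothetical helper:

    import "github.com/prometheus/client_golang/prometheus"

    // emitLockHisto publishes cumulative bucket counts (keyed by upper bound),
    // the total count and the sum as a single histogram sample.
    func emitLockHisto(ch chan<- prometheus.Metric) {
        desc := prometheus.NewDesc("pg_lock_histo", "Lock wait duration distribution", nil, nil)
        buckets := map[float64]uint64{0: 12, 10: 30, 50: 41, 100: 45, 1000: 47}
        ch <- prometheus.MustNewConstHistogram(desc, 47, 1234.5, buckets)
    }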

Dial tcp: lookup postgres on 127.0.0.53:53: server misbehaving, retrying in 10s source="pg_exporter.go:1379"

When I run the following command:

pg_exporter --url=postgresql://user:password@localhost:5432/?sslmode=disable&host=/var/run/postgresql --config=pg_exporter.yaml

I get the following error:

INFO[0000] retrieve target url postgresql://user:password@localhost:5432/?sslmode=disable from command line  source="pg_exporter.go:1626"
INFO[0000] fallback on default config path: pg_exporter.yaml  source="pg_exporter.go:1666"
ERRO[0000] fail connecting to primary server: fail fetching server version: dial tcp: lookup postgres on 127.0.0.53:53: server misbehaving, retrying in 10s  source="pg_exporter.go:1379"
ERRO[0010] fail connecting to primary server: fail fetching server version: dial tcp: lookup postgres on 127.0.0.53:53: server misbehaving, retrying in 10s  source="pg_exporter.go:1383"
ERRO[0020] fail connecting to primary server: fail fetching server version: dial tcp: lookup postgres on 127.0.0.53:53: server misbehaving, retrying in 10s  source="pg_exporter.go:1383"

I am using Ubuntu 20.04 Focal Fossa LTS.
pg_exporter version being used: v0.2.0
Any assistance will be highly appreciated.

Retry backoff on scrape failure to reduce log spam when queries fail?

Do you have any opinion on the idea of having exponential back-off on re-trying failed metric scrapes to reduce log-spam in case of problems?

If it's an idea you're open to I can look at cooking up a patch to support it if my initial PoC of this scraper works out.

CloudNative-PG's built-in scraper, which I currently use, doesn't do this either. But log-spam is a real problem with it if there's a mistake in a query. So it's something I'd like to see if I can implement here if I adopt this scraper.
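
A sketch of the kind of capped exponential backoff I have in mind, applied per failing collector (names are hypothetical):

    import "time"

    // nextDelay doubles the retry delay after each consecutive failure, up to a cap;
    // the caller resets the delay to base after a successful scrape.
    func nextDelay(current, base, ceiling time.Duration) time.Duration {
        if current < base {
            return base
        }
        if current *= 2; current > ceiling {
            return ceiling
        }
        return current
    }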

PostgreSQL 17 support

https://www.postgresql.org/docs/17/release-17.html#RELEASE-17-MIGRATION

  • Rename I/O block read/write timing statistics columns of pg_stat_statements (Nazir Bilal Yavuz): "blk_read_time" becomes "shared_blk_read_time", and "blk_write_time" becomes "shared_blk_write_time".
  • Remove buffers_backend and buffers_backend_fsync from pg_stat_checkpointer (Bharath Rupireddy): these fields are considered redundant to similar columns in pg_stat_io.
  • Change pg_attribute.attstattarget and pg_statistic_ext.stxstattarget to represent the default statistics target as NULL (Peter Eisentraut).
  • Change pg_stat_progress_vacuum columns max_dead_tuples to max_dead_tuple_bytes and num_dead_tuples to dead_tuple_bytes (Masahiko Sawada): these columns now report bytes instead of tuples.
  • Rename SLRU columns in the system view pg_stat_slru (Alvaro Herrera): the names accepted by pg_stat_reset_slru() are also changed.

IMPROVEMENT: Respond with fake pg_up metrics before planning

Since queries (collectors) are dynamically planned, PgExporter needs a live server to fetch facts.

This leads to a dilemma: if the target PostgreSQL instance is dead, pg_exporter cannot gather facts and will either abort or wait for the target to come online, depending on a config parameter. But sometimes we want pg_exporter to report the fact that the target postgres instance is down.

There is a workaround: create a dummy server (responding with a constant pg_up{} 0) while PgExporter is waiting for the dead server to come online, and destroy that server once PgExporter successfully connects to the target server.
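
The workaround amounts to a tiny placeholder handler that speaks the plain text exposition format; a minimal sketch, nothing pg_exporter-specific:

    import (
        "fmt"
        "net/http"
    )

    // dummyMetrics answers a constant pg_up 0 until the real exporter has
    // connected to the target and takes over the /metrics route.
    func dummyMetrics(w http.ResponseWriter, _ *http.Request) {
        fmt.Fprintln(w, "# HELP pg_up whether the last probe could reach the server")
        fmt.Fprintln(w, "# TYPE pg_up gauge")
        fmt.Fprintln(w, "pg_up 0")
    }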

Selective scrape via query parameters

I'm looking at adopting pg_exporter as a potential replacement for the CloudNative-PG built-in scraper I currently use.

CNPG doesn't support this either, but I was considering implementing it for CNPG and wonder if I should do so for pg_exporter instead.

Do you have any advice on how I might go about adding support for scraping selected subsets of metrics, specified by tag-set, with query parameters?

I want to scrape different subsets at different frequencies without having to deploy multiple instances of the scraper. This can't be done with command-line or env-var level tags.

E.g. I want to scrape /metrics?tags=expensive_metrics,wide_cardinality_metrics every 5min, and /metrics?tags=cheap_narrow_metrics every 30s.

The scraper's caching feature makes this a less pressing need than it is in other scrapers, but since it already has tagging I'm wondering if it might be fairly easy to add. I can't immediately see how to access the query params from/within promhttp in order to apply them selectively though. Do you know if this might be possible, and have any ideas about how? If so, I can see if I can cook up a patch if my initial PoC with this scraper works out.

(See related #41, #43 )
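
For what it's worth, promhttp itself doesn't have to expose the query parameters: the /metrics handler can be wrapped, and a throwaway registry built per request from whichever collectors match the requested tags. A sketch, where collectorsByTags stands in for a lookup pg_exporter would need to provide:

    import (
        "net/http"
        "strings"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // collectorsByTags is a hypothetical lookup of the collectors carrying the given tags.
    func collectorsByTags(tags []string) []prometheus.Collector { return nil }

    // metricsHandler serves /metrics?tags=a,b by registering only matching collectors
    // into a per-request registry and delegating to promhttp.
    func metricsHandler(w http.ResponseWriter, r *http.Request) {
        tags := strings.Split(r.URL.Query().Get("tags"), ",")
        reg := prometheus.NewRegistry()
        for _, c := range collectorsByTags(tags) {
            reg.MustRegister(c)
        }
        promhttp.HandlerFor(reg, promhttp.HandlerOpts{}).ServeHTTP(w, r)
    }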
