bucardo / check_postgres Goto Github PK

Nagios check_postgres plugin for checking status of PostgreSQL databases

Home Page: http://bucardo.org/wiki/Check_postgres

License: Other

Perl 82.25% HTML 17.75%

hacktoberfest monitoring nagios postgres postgresql

check_postgres's Introduction

Bucardo - a table-based replication system

DESCRIPTION:
------------

This is version 5.6.0 of Bucardo.

COPYRIGHT:
----------

    Copyright (c) 2005-2023 Greg Sabino Mullane <[email protected]>

REQUIREMENTS:
-------------

    build, test, and install Perl 5                (at least 5.8.3)
    build, test, and install PostgreSQL            (at least 8.2)
    build, test, and install the DBI module        (at least 1.51)
    build, test, and install the DBD::Pg module    (at least 2.0.0)
    build, test, and install the DBIx::Safe module (at least 1.2.4)
    
    You must have at least one database that has PL/pgSQL and PL/Perl installed.
    Target databases may need PL/pgSQL.


INSTALLATION:
-------------

To install this module type the following:

   perl Makefile.PL
   make
   make test (but see below first)
   make install

EXAMPLES:
---------

See the test suite in the t/ subdirectory for some examples.

WEBSITE:
--------

Please visit https://bucardo.org for complete documentation.

DEVELOPMENT:
------------

To follow or participate in the development of Bucardo, use:

git clone [email protected]:bucardo/bucardo.git

GETTING HELP:
-------------

For general questions and troubleshooting, please use the [email protected]
mailing list.  GitHub issues which are support-oriented will be closed and referred to
the mailing list anyway, so help save time for everyone by posting there directly.

Post, subscribe, and see previous archives here:

https://bucardo.org/mailman/listinfo/bucardo-general

check_postgres's People

Contributors

Stargazers

Watchers

Forkers

selenamarie elecnix radekz davidfetter kabalin tosivakumar dalibo aolmezov klando xzilla guniorobot denishpatel f0rk pajju shivrajk greyfairer melor solute imadyukov gibheer waisbrot gavioto mixtli fanyeren ninefold devinfoley davmik rjuju localfilmmaker phunehehe scau amutu glynastill vocatan samukasmk ccdos enter08 jeffjanes irumman matteodurighetto petere timecatalyst kleptog richyen martinmarques anayrat topecz sysnove tdevelioglu kpettijohn gilesw sebastianwebber rodmur lvazquez insomniasalt veerubhai5c1 swrd david-rowley dobrinov gburiticato talkless ormod phoefflin bannama mhagander maletin df7cb dargor subito rmadiwale jwh5566 keithf4 hanattaw vmohanan lemontreehuang theundefined subhankarc popovnv thomasservage stpstp wind39 drazul acscott liuqian1990 shamim-ahmad askrana peteblamire like2k1 swanson8r tejasbhosale009 uu obmondo moench-tegeder yoshi314 heguoya easytechclick mattthehearn foot596 renard thefrancishe

check_postgres's Issues

unknown txn-idle

ID: 49
Version: unspecified
Date: 2010-09-06 10:25 EDT
Author: Martin von Oertzen ([email protected])

check_postgres_txn_idle 2.15.0 (from today) results in
Status unknown, if there are no idle-transactions at all.

whitespace differences get picked up in same_schema

The --same_schema action picks up whitespace differences. Can't paste here and see the output, so adding image:

Note the newline between "u.first_name" and "u.last_name" in the second database.

It might be useful to note that the first database is v. 8.4.17 and the second database is v. 9.2.7

Other examples:

same_schema name filtering does not work

ID: 91
Version: unspecified
Date: 2011-11-23 09:07 EST
Author: Peter Eisentraut ([email protected])

Database test1:

create schema foo;
create table foo.bar1(a int);
create table foo.bar2(a int);

Database test2:

create schema foo;
create table foo.bar1(a int);

Now I would like to exclude schema "foo" from comparison:

$ check_postgres_same_schema -db test1,test2 --filter='noschema=foo'
POSTGRES_SAME_SCHEMA CRITICAL: (databases:test1,test2) Databases were
different. Items not matched: 1 | time=1.85s 
DB 1: port=5432 host=<none> dbname=test1 user=postgres 
DB 1: PG version: 8.4.9
DB 1: Total objects: 31
DB 2: port=5432 host=<none> dbname=test2 user=postgres 
DB 2: PG version: 8.4.9
DB 2: Total objects: 29
Table "foo.bar2" does not exist on all databases:
  Exists on:  1
  Missing on: 2

Or just the table:

$ check_postgres_same_schema -db test1,test2 --filter='notable=bar'
POSTGRES_SAME_SCHEMA CRITICAL: (databases:test1,test2) Databases were
different. Items not matched: 1 | time=1.88s 
DB 1: port=5432 host=<none> dbname=test1 user=postgres 
DB 1: PG version: 8.4.9
DB 1: Total objects: 31
DB 2: port=5432 host=<none> dbname=test2 user=postgres 
DB 2: PG version: 8.4.9
DB 2: Total objects: 29
Table "foo.bar2" does not exist on all databases:
  Exists on:  1
  Missing on: 2

This "radical" solution works:

$ check_postgres_same_schema -db test1,test2 --filter='notables'
POSTGRES_SAME_SCHEMA OK: (databases:test1,test2) All databases have identical
items | time=1.70s

But this doesn't:

$ check_postgres_same_schema -db test1,test2 --filter='noschemas'
POSTGRES_SAME_SCHEMA CRITICAL: (databases:test1,test2) Databases were
different. Items not matched: 1 | time=1.75s 
DB 1: port=5432 host=<none> dbname=test1 user=postgres 
DB 1: PG version: 8.4.9
DB 1: Total objects: 27
DB 2: port=5432 host=<none> dbname=test2 user=postgres 
DB 2: PG version: 8.4.9
DB 2: Total objects: 25
Table "foo.bar2" does not exist on all databases:
  Exists on:  1
  Missing on: 2

It's somewhat unclear what the "schema" filtering option does anyway. In older
releases I was able to compare just the public schema by using
--exclude='^(?!public)'. I was hoping that noschema=regex would provide
that, but then 'noschemas' by itself would make little sense. This should also
be clarified.

same_schema --exclude does not apply to schemas

ID: 74
Version: unspecified
Date: 2011-04-19 08:23 EDT
Author: Peter Eisentraut ([email protected])

The --exclude option when used with the same_schema check does not apply when
comparing schemas, and a few other things such as user and languages, because
$opt{exclude} isn't looked at there, even though the documentation makes no
such distinction. You can exclude these things separately, using noschema=foo
etc., but it would be convenient, say, to exclude an entire schema and contents
using --exclude='^foo'.

Table / Relation Size don't take into account TOAST'd pages

ID: 82
Version: unspecified
Date: 2011-08-01 12:25 EDT
Author: [email protected]

Postgresql Version 8.3.14

Neither check_postgres_table_size or check_postgres_relation_size take into
account TOAST'd data.

SELECT relname, reltoastrelid, relpages FROM pg_class WHERE
relname='my_bigtable';
     relname      | reltoastrelid | relpages 
------------------+---------------+----------
 my_bigtable      |      51687163 |  8351228
(1 row)

SELECT relname, reltoastrelid, relpages FROM pg_class WHERE oid = 51687163;
      relname      | reltoastrelid | relpages 
-------------------+---------------+----------
 pg_toast_51687160 |             0 | 33528180

There are a lot more pages in TOAST'd space.


SELECT pg_size_pretty(pg_total_relation_size('my_bigtable'));
 pg_size_pretty 
----------------
 333 GB

SELECT pg_size_pretty(pg_relation_size('my_bigtable'));
 pg_size_pretty 
----------------
 64 GB
(1 row)

==========================

check_postgres_relation_size --warning='4 GB' --critical='4.5 GB'
--include=my_bigtable --dbname=my_db
Password for user postgres: 
POSTGRES_RELATION_SIZE CRITICAL: DB "my_db" largest relation is table
"public.my_bigtable": 64 GB | time=8.27  public.my_bigtable=69034262528

custom_query performance data duplication

When custom_query returns multiple rows, their performance data string is duplicated in subsequent rows. For example:

./check_postgres.pl --action=custom_query --dbname=postgres --dbuser=peisentraut --warning=100 --query="select 30 as result, 'foo' as data union select 20, 'bar' union select 10, 'baz' order by 1 desc"

POSTGRES_CUSTOM_QUERY OK: DB "postgres" 30 * 20 * 10 | time=0.08s data=foo;100 time=0.08s data=foo;100 data=bar;100 time=0.08s data=foo;100 data=bar;100; data=baz;100

Correct would be

POSTGRES_CUSTOM_QUERY OK: DB "postgres" 30 * 20 * 10 | time=0.08s data=foo;100 time=0.08s data=bar;100 time=0.08s data=baz;100

The fix appears to be to change line 4100

$db->{perf} .= sprintf ' %s=%s;%s;%s',

$db->{perf} = sprintf ' %s=%s;%s;%s',

Compatability with EnterpriseDB

Our production environment uses both EnterpriseDB and PostgreSQL. check_postgres.pl doesn't work pretty well with EnterpriseDB and here is the first cut fix to make things work. My perl is not that great however this is the patch which we have been using in production.

diff --git a/check_postgres.pl b/check_postgres.pl
index 32bd338..f9e55e0 100755
--- a/check_postgres.pl
+++ b/check_postgres.pl
@@ -1024,9 +1024,9 @@ if (! defined $PSQL or ! length $PSQL) {
}
-x $PSQL or ndie msg('opt-psql-noexec', $PSQL);
$res = qx{$PSQL --version};
-$res =~ /^psql (PostgreSQL) (\d+.\d+)(\S_)/ or ndie msg('opt-psql-nover');
-our $psql_version = $1;
-our $psql_revision = $2;
+$res =~ /((?:^edb-)?psql) ((PostgreSQL)|EnterpriseDB) (\d+.\d+)(\S_)/ or ndie msg('opt-psql-nover');
+our $psql_version = $3;
+our $psql_revision = $4;
$psql_revision =~ s/\D//g;

$VERBOSE >= 2 and warn qq{psql=$PSQL version=$psql_version\n};
@@ -1940,10 +1940,10 @@ sub run_command {
if ($db->{error}) {
ndie $db->{error};
}

           if ($db->{slurp} !~ /PostgreSQL (\d+.\d+)/) {

           if ($db->{slurp} !~ /(PostgreSQL|EnterpriseDB) (\d+.\d+)/) {
             ndie msg('die-badversion', $db->{slurp});
         }

```
           $db->{version} = $1;
```

           $db->{version} = $2;
         $db->{ok} = 0;
         delete $arg->{versiononly};
         ## Remove this from the returned hash

@@ -3040,7 +3040,7 @@ sub check_connection {

$db = $info->{db}[0];

my $ver = ($db->{slurp}[0]{v} =~ /PostgreSQL (\d+.\d+\S+)/o) ? $1 : '';
my $ver = ($db->{slurp}[0]{v} =~ /(PostgreSQL|EnterpriseDB) (\d+.\d+\S+)/o) ? $2 : '';

$MRTG and do_mrtg({one => $ver ? 1 : 0});

psql path check is too restrictive

ID: 81
Version: unspecified
Date: 2011-07-28 10:39 EDT
Author: Peter Eisentraut ([email protected])

When using the --PSQL option, the path check is too restrictive, for example:

./check_postgres.pl --PSQL=/usr/lib/postgresql/8.4/bin/psql --action=connection
--db=test
ERROR: Invalid psql argument: must be full path to a file named psql

The code is fairly simple-minded about this:

$PSQL =~ m{^/[\w\d\/]*psql$}

I would just simplify this to something like

$PSQL =~ m{^/.*/psql$}

(or remove it altogether). Consider typical paths on Windows (bug 36).

same_schema gets confused if constraints are not uniquely named

If I have different constraints for two tables, but they have the same name, the --same_schema action will mix them up between the two databases (especially if one database is v. 8.4.x and the other database is v. 9.2.x):

EXAMPLE 1:
Constraint "public.min_password_length_check":
"conkey" is different:
Database 1: {2}
Database 2: {6}
"consrc" is different:
Database 1: (length((join_password)::text) >= 4)
Database 2: (length((enrollment_password)::text) >= 4)
"tname" is different:
Database 1: table_1
Database 2: table_2

EXAMPLE 2:
Constraint "public.$1":
"confdeltype" is different:
Database 1: a
Database 2:
"conffeqop" is different:
Database 1: {96}
Database 2:
"confkey" is different:
Database 1: {1}
Database 2:
"confmatchtype" is different:
Database 1: u
Database 2:
"confupdtype" is different:
Database 1: a
Database 2:
"conkey" is different:
Database 1: {2}
Database 2: {2,3}
"conpfeqop" is different:
Database 1: {96}
Database 2:
"conppeqop" is different:
Database 1: {96}
Database 2:
"consrc" is different:
Database 1:
Database 2: (start_date <= end_date)
"contype" is different:
Database 1: f
Database 2: c
"tname" is different:
Database 1: table_3
Database 2: table_4

same_schema --exclude behaves differently from all other actions

ID: 75
Version: unspecified
Date: 2011-04-19 08:27 EDT
Author: Peter Eisentraut ([email protected])

Although the documentation is technically correct on this, it would be really
helpful to make it crystal clear that in the same_schema action the --exclude
options works on a completely different logic than in most other actions that
follow the scheme explained in the section "BASIC FILTERING".

I would change this paragraph

"You may exclude all objects of a certain name by using the "exclude" option.
It takes a Perl regular expression as its argument."

"You may exclude all objects of a certain name by using the "exclude" option.
It takes a Perl regular expression as its argument. The option can be repeated
to specify multiple patterns to exclude. (Note that the --exclude option for
this action does not follow the logic explained in the "BASIC FILTERING"
section.)"

The alternative would be to eliminate this distinction, but that might break
too many things for users.

clarify permissions for pgbouncer_backends check

ID: 96
Version: Unspecified
Date: 2012-01-03 06:44 EST
Author: Peter Eisentraut ([email protected])

The documentation for the pgbouncer_backends check says:

"Note that the user you are connecting as must be a superuser for this to work
properly."

PgBouncer doesn't really have the concept of a superuser. It has admin_users
and stats_users. I think the permission required for this check is
stats_users. Please correct that in the docs.

Invalid Message when Unexcluded Tables Have Never Been Vacuumed

We're running check_postgres.pl like this:

check_postgres.pl -u nagios --action=last_vacuum --exclude=~^pg --db=csi

The output is:

No matching tables found due to exclusion/inclusion options

But that can't be right. The query that ends up running is (reformatted):

SELECT current_database(), nspname, relname,
       CASE WHEN v IS NULL THEN -1 ELSE round(extract(epoch FROM now()-v)) END,
       CASE WHEN v IS NULL THEN '?' ELSE TO_CHAR(v, 'HH24:MI FMMonth DD, YYYY') END
  FROM (
          SELECT nspname, relname, GREATEST(
              pg_stat_get_last_vacuum_time(c.oid), 
              pg_stat_get_last_autovacuum_time(c.oid)
          ) AS v
            FROM pg_class c, pg_namespace n
           WHERE relkind = 'r'
             AND n.oid = c.relnamespace
             AND n.nspname <> 'information_schema'
           ORDER BY 3
  ) AS foo;

When I run that manually, I get rows like:

csi | pg_catalog  | pg_authid       | 390524 | 23:16 March 20, 2010
csi | public      | foo             |     -1 | ?
csi | pg_catalog  | pg_auth_members |     -1 | ?

So we do have a table, "foo", that is not excluded. However, it's never been vacuumed, so round is set to -1. The error should not be that no matching tables were found because of the exclusion, because a table is found and not excluded, but has never been vacuumed. So it should probably say something like "no unvacuumed tables found" instead.

I think that the reason it works that way is this bit of code in check_last_vacuum_analyze():

    SLURP: while ($db->{slurp} =~ /(\S+)\s+\| (\S+)\s+\| (\S+)\s+\|\s+(\-?\d+) \| (.+)\s*$/gm) {
        my ($dbname,$schema,$name,$time,$ptime) = ($1,$2,$3,$4,$5);
        $maxtime = -3 if $maxtime == -1;
        if (skip_item($name, $schema)) {
            $maxtime = -2 if $maxtime < 1;
            next SLURP;
        }

So looking at the three rows returned above, it looks like:

row one is excluded and $maxtime set to -2
row two is not excluded, but $maxtime is -1 and so gets set to -3
row three is excluded and $maxtime set to -2

Since the last row fetched set $maxtime to -2, this code then gets triggered:

    if ($maxtime == -2) {
        add_unknown msg('no-match-table');
    }

But that's wrong. I think what needs to happen is that it needs to know that unexcluded rows were returned (the second row in this example) but were never vacuumed. Not sure how you'd go about that using $maxtime as a flag; maybe you need some other flag? Maybe something like this?

--- a/check_postgres.pl
+++ b/check_postgres.pl
@@ -3469,6 +3469,7 @@ sub check_last_vacuum_analyze {
        my ($minrel,$maxrel) = ('?','?'); ## no critic
        my $mintime = 0; ## used for MRTG only
        my $count = 0;
+                my $unskipped;
        SLURP: while ($db->{slurp} =~ /(\S+)\s+\| (\S+)\s+\| (\S+)\s+\|\s+(\-?\d+) \| (.+)\s*$/gm) {
            my ($dbname,$schema,$name,$time,$ptime) = ($1,$2,$3,$4,$5);
            $maxtime = -3 if $maxtime == -1;
@@ -3476,6 +3477,7 @@ sub check_last_vacuum_analyze {
                $maxtime = -2 if $maxtime < 1;
                next SLURP;
            }
+                        $unskipped ||= 1;
            $db->{perf} .= " $dbname.$schema.$name=${time}s;$warning;$critical" if $time >= 0;
            if ($time > $maxtime) {
                $maxtime = $time;
@@ -3497,7 +3499,7 @@ sub check_last_vacuum_analyze {
        }

        if ($maxtime == -2) {
-           add_unknown msg('no-match-table');
+           add_unknown msg($unskipped ? 'no-vacuumed-table' : 'no-match-table');
        }
        elsif ($maxtime < 0) {
            add_unknown $type eq 'vacuum' ? msg('vac-nomatch-v') : msg('vac-nomatch-a');

Thanks.

David

Error message: Use of uninitialized value $ENV{"HOME"} in concatenation...

The full error message for query time:
"Use of uninitialized value $ENV{"HOME"} in concatenation (.) or string at /usr/lib64/nagios/plugins/check_postgres.pl line 868. "

This message occurs on icinga's front-end screen, but returns the proper output on the commandline including running it as user icinga.

Here's my config:
check_postgres.pl -H $HOSTADDRESS$ --dbuser=X --dbpass=X --action=query_time -w 10 -c 15

Using Perl v5.10.1., CentOS release 6.3 (Final), icinga 1.8.4 and the latest check_postgres.pl (10454 lines (8340 sloc) 382.487 kb).

Feature Request: option on query_time to include the SQL

We're using check_postgres to email alerts out to interested parties on a select number of actions, one of which is query_time to help identify long running queries.

I keep getting asked if there is a way to see what the query is - it's great having the notification but they don't have access to the live servers, which means the DBA's then need to log in and pull the information out.

An optional switch (Disabled by default) to output the SQL the alert is referring to would be useful if it's at all possible?

Thanks

strange installation locations

ID: 53
Version: unspecified
Date: 2010-11-03 12:42 EDT
Author: Peter Eisentraut ([email protected])

I run

perl Makefile.PL
make
make install

and the resulting installation is

/usr/local/share/perl/5.10.1/check_postgres.pl
/usr/local/man/man3/check_postgres.3pm
/usr/local/bin/check_postgres.pl

I think this is a bit odd. Why is check_postgres.pl installed in two
locations?

I tried tweaking this a little bit. Adding

PM => {},

to %opts prevents the installation under /usr/local/share. Then removing the
line

MAN1PODS => {},

installs the man page, but it's then called check_postgres.pl.1p, whereas the
documentation says to use man check_postgres.

I think the sort of installation layout I'd expect is approximately

/usr/local/bin/check_postgres
/usr/local/man/man1/check_postgres.1p

There is some room for variation, but the current behavior is strange.

replicate_row usage doesn't seem to match docs

With version 2.20.1, the docs say to do:

check_postgres.pl --action=replicate_row --host=master --host2=slave1,slave2

But when I tried that I got "ERROR: No slaves found". I did get it to work with:

check_postgres.pl --action=replicate_row --host=master,slave1,slave2

though.

It's certainly conceivable that I'm doing something wrong, but it seems like a bug. Please let me know if you need any other info from me.

Thanks.

output pg_stat_activity on warn/error

check_postgres works well. If I receive an error or warning, I know that something is wrong.

But I can't see what is going on.

It would be great to see the pg_stat_activity output if there is something wrong.

http://www.postgresql.org/docs/9.2/static/monitoring-stats.html#PG-STAT-ACTIVITY-VIEW

Our applications set the "application_name" which makes the output even more useful, but I guess a lot of applications don't do it.

check_postgres incorrectly tests psql version for server-side query building

In the disk_space', 'txn_idle' and 'txn_time' checks (at least), check_postgres misidentifies the server version when presented with a version like:

PostgreSQL 9.4beta1_bdr0601

It looks like an issue within sub run_command within the version-specific statement code.

I initially thought it was testing the client-side psql version, but while it does separately do that, that isn't the cause of this problem.

bloat has this issue in older releases, but works with the current release.

check_postgres uses server_version instead of server_version_num

check_postgres tries to determine the server's version in a number of places, in order to craft queries that cope with different catalogs on different server versions.

It does so by fetching and parsing the server_version in pg_settings. This is simply wrong - it should be using server_version_num.

It was added in 8.2, so there's no reason to bother with backward compatibility.

commit 04912899e792094ed00766b99b6c604cadf9edf7 refs/tags/REL8_2_BETA1
Author: Bruce Momjian <[email protected]>
Date:   Sat Sep 2 13:12:50 2006 +0000

    Add new variable "server_version_num", which is almost the same as
    "server_version" but uses the handy PG_VERSION_NUM which allows apps to
    do things like if ($version >= 80200) without having to parse apart the
    value of server_version themselves.

    Greg Sabino Mullane [email protected]

This is related to the prior investigation I did on issue #70 .

hitratio won't report on dbs owned by a role

hitratio ( and maybe others ) are joining pg_user to pg_database. If the database is owned by a role ( newer versions of PG ), hitratio is not returned:

SELECT
round(100.*sd.blks_hit/(sd.blks_read+sd.blks_hit), 2) AS dhitratio
d.datname,
u.usename
FROM pg_stat_database sd
JOIN pg_database d ON (d.oid=sd.datid)
JOIN pg_user u ON (u.usesysid=d.datdba)
WHERE sd.blks_read+sd.blks_hit<>0;

should be:

SELECT
round(100.*sd.blks_hit/(sd.blks_read+sd.blks_hit), 2) AS dhitratio,
d.datname,
u.rolname as usename
FROM pg_stat_database sd
JOIN pg_database d ON (d.oid=sd.datid)
JOIN pg_roles u ON (u.oid=d.datdba)
WHERE sd.blks_read+sd.blks_hit<>0;

2.21.0 Documentation bug

OS: SL 2.6.32-358.18.1.el6.x86_64

plugins]# ./check_postgres.pl --action=connection --db=chimera -H localhost -p 5432 -u postgres
Cannot find Time::HiRes, needed if 'showtime' is true at ./check_postgres.pl line 1267.

This was solved by installing:

yum install perl-Time-HiRes.x86_64

New action for bloat using pg_stattuple

ID: 22
Version: unspecified
Date: 2009-12-03 12:58 EST
Author: Greg Sabino Mullane ([email protected])

Use pg_stattuple to get the bloat information, either as a new action or a flag
to the current one.

vacuum full does not update last vacuum statistics

Since we switched to regulary do a "vacuum full", the last_vacuum-Check reports errors, because pg_stat_user_tables has old values.

http://postgresql.1045698.n5.nabble.com/BUG-5722-vacuum-full-does-not-update-last-vacuum-statistics-td3235344.html

If there is no other way to find out the last_full_vacuum-date, that should be mentioned at the check_postgres-Documentation.

custom_query

Hi,
I've some errors using custom_query.
With nagios output :
Use of uninitialized value $data in numeric ge (>=) at ./check_postgres_custom_query line 3121.
Use of uninitialized value $data in numeric ge (>=) at ./check_postgres_custom_query line 3130.
Use of uninitialized value $data in string at ./check_postgres_custom_query line 3139.
With mrtg ad simple output :
Use of uninitialized value $data in string at ./check_postgres_custom_query line 3139.
Action custom_query failed: Unknown error

txn_time/idle flap from OK to UNKNOWN

I have been dealing with a problem in a new installation I was doing today. The problem came up as I saw in Icinga that txn_time and txn_idle were flapping and for no reason I could think of.

Checking the code, I found that the problem came here:

    ## Return unknown if stats_command_string / track_activities is off
    if ($cq =~ /disabled/o or $cq =~ /<command string not enabled>/) {
        add_unknown msg('psa-disabled');
        return;
    }

What would happen if the query had in some part the word 'disabled'? For example a column of a table which has been selected.

I think this is not the best way to check if a postgres setting is set accordinly.

Removing the offending code (as I'm sure the postgres setting has track_activities on) made everything work as expected.

wish: address "waiting queries" through code or docs

I've used an old alternate monitoring plugin called "check_pg_waiting_queries.pl". It's poorly written, so that when it's run against PostgreSQL 9.1, it always reports "success" when in fact the query contains a syntax error, so it should fail.

I presume that check_postgres.pl provides an appropriate upgrade path, or that I don't actually need this monitor at all, but I'm not quite clear. I see that check_postgres.pl provides a check for locks, and I know that locks are closely related to "waiting queries". If the "locks" check is an appropriate replacement for a "waiting queries" check, it would be helpful if a sentence could be added to those docs to clarify it.

When checking backends using negative, check should be less-than not greater-than

ID: 57
Version: 2.14.3
Date: 2010-12-01 14:19 EST
Author: [email protected]

When using negative numbers to check free backends, the checks on line 2541 and
2556 in check_backends procedure should check for less-equal, not greater-equal
for warning and critical status

Improve bloat calculation

ID: 21
Version: unspecified
Date: 2009-12-03 12:57 EST
Author: Greg Sabino Mullane ([email protected])

The current bloat calculation is very rough and fails to account for many
things. Make it more accurate.

check_postgres.pl action custom_query does not show the performance data

ID: 84
Version: unspecified
Date: 2011-10-18 08:12 EDT
Author: tom ([email protected])

In wiki page,you said:
It is required that one of the columns be named "result" and is the item that
will be checked against your warning and critical values. The second column is
for the performance data and any name can be used: this will be the 'value'
inside the performance data section.

It's wrong,It is required that second of the columns be named "data"

The second issue is the message include the performance data can be shown in
the linux console,but nagios page can't get the message,it can only read the
message before the pipe symbol.

I try to delete the line print '| '; or change the pipe symbol to other
character,for exmple 'kk' in the method dumpresult() of source code
'check_postgres.pl' ,it can show the complete message.

console:

POSTGRES_CUSTOM_QUERY CRITICAL: DB "demodb" (host:postgresql.demo.dev) 0 |
time=2.12  Check for records added in the last 4 hours to the demo table,the
result is:0

nagios page:(miss the above messages after a pipe symbol)
POSTGRES_CUSTOM_QUERY CRITICAL: DB "demodb" (host:postgresql.demo.dev) 0

Warning: the following files are missing in your kit: MYMETA.yml

[.... check_postgres]# perl Makefile.PL
Configuring check_postgres 2.21.0
Checking if your kit is complete...
Warning: the following files are missing in your kit:
MYMETA.yml
Please inform the author.
Writing Makefile for check_postgres

Please inform the author. Which I have done by opening this issue

Thanks.

hot_standby_delay doesn't work at all or poorly documented

ID: 95
Version: unspecified
Date: 2011-12-22 09:54 EST
Author: Peter Eisentraut ([email protected])

check_postgres version 2.18.0

I cannot get hot_standby_delay to work at all. Something like this ought to do
something:

$ check_postgres_hot_standby_delay --dbhost=localhost --dbhost2=localhost
--dbport=5435 --dbport=5435 --dbuser=postgres -w 30 -c 100
Use of uninitialized value $slave in numeric eq (==) at
/usr/bin/check_postgres_hot_standby_delay line 4581.
POSTGRES_HOT_STANDBY_DELAY UNKNOWN: DB "postgres" (host:localhost) (port=5435)
Invalid query returned: receive <PIPE> \n replay  <PIPE> \n  | time=0.09s

An actual example in the documentation would be nice, in case this is not a
real bug but just a misunderstanding.

pg_bloat reports UNKNOWN after index-drop

i created a unique index concurrently.
after it was build, i droped the old index.
nagios reports

[2011-02-16 16:48:05] SERVICE ALERT:
pg1;PG-bloat;
CRITICAL;HARD;10;CHECK_NRPE:
Socket timeout after 30 seconds.

[2011-02-16 16:52:45] SERVICE ALERT:
pg1;PG-bloat;UNKNOWN;HARD;10;ERROR:
ERROR: relation with OID 344939906 does not exist

maybe it should not be an error, if the oid does not exists.

dbservice and dbpass options dont work along

Hi, I've just noticed a bug.

./check_postgres.pl --dbservice="theService_in_pg_service.conf"--dbuser=theUser --dbpass=heyhey --action=connection
Use of uninitialized value in printf at ./check_postgres.pl line 1748.
Use of uninitialized value in printf at ./check_postgres.pl line 1748.
Password for user theUser:

So it seems that the script doesnt retrieve the password given as a parameter when using dbservice.

check_disk_space section unable to find pg_logs

ID: 94
Version: unspecified
Date: 2011-12-09 19:56 EST
Author: [email protected]

Great program!

I have installed the RHEL 6 rpm version (Version : 2.18.0) from the
postgresql repository (From repo : pgdg91) and I am having a problem that
unless I execute the program from the pg_hba directory it's unable to find the
correct paths:

/usr/bin/check_postgres.pl -H localhost --dbuser=postgres --action disk_space
ERROR: Invalid result from command "/bin/df -kP "../../pg_logs/instance1"
2>&1": /bin/df: `../../pg_logs/instance1': No such file or directory
/bin/df: no file systems processed

however if I print out "$i{S}{data_directory}" the full data path is valid.

my case "/opt/postgresql/data/instance1."

So not sure what the right way of doing this is but my current workaround was
to apply the following:

###############################################################################

*** /usr/bin/check_postgres.pl    2010-09-12 21:58:34.953908816 -0700
--- /usr/bin/check_postgres.pl.chg    2010-09-12 22:00:24.125192752 -0700
***************
*** 4206,4211 ****
--- 4206,4212 ----
              add_unknown msg('diskspace-nodata');
              next;
          }
+     chdir($i{S}{data_directory});
          my ($datadir,$logdir) =
($i{S}{data_directory},$i{S}{log_directory}||'');

          if (!exists $dir{$datadir}) {

###########################################################################

Please let me know if this is something that can be updated? Thank you-

Use of uninitialized value $slave in numeric eq (==) at ./check_postgres_hot_standby_delay line 4750.

I have an error when running the check_postgres_hot_standby_delay function. I issue the following command:
./check_postgres_hot_standby_delay --host=ipprimary --dbuser=user --dbpass=pass --host2=ipsec --dbuser2=user --dbpass2=pass --warning=50 --critical=1024
When I execute this command the response is the following.
Use of uninitialized value $slave in numeric eq (==) at ./check_postgres_hot_standby_delay line 4750.
POSTGRES_HOT_STANDBY_DELAY UNKNOWN: DB "postgres" (host:ipprimary) Invalid query returned: receive \n replay \n | time=0.23s

I hope that someone can make sens of it.

Filter from application_name in pg_stat_activity

Would it be possible to add a filter at application_name as to not check txn_idle for certain applications (for example long pg_dumps)?

If so, which would be the best aproach? Add a APPNAMEWHERECLAUSE, or just add specific filters in check_txn_time and pass them on to check_txn_idle?

I think the second would do, as I don't fint any where else for this to be needed.

psql version vs server version

On a machine with multiple versions of postgres installed you get some version mismatches because check_postgres uses the version of psql not the version of the server that it connects to.

We have version 8.4 running, but we also have version 9.3 installed but not turned on.

psql on its own tells the state of affairs.

% psql 
psql (9.3.5, server 8.4.22)
Type "help" for help.
...

Asking psql what version it is gets the version of psql not postgres

 % psql --version
 psql (PostgreSQL) 9.3.5

action=sequence returns an error with two databases with the same sequences

ID: 115
Version: 2.19.0
Date: 2012-10-30 03:21 EDT
Author: [email protected]

I have multiple databases with the same schema including same sequences. I want
to check their sequences with --action=sequence.

check_postgres.pl -H 10.1.30.5 --action=sequence -db=staging,production
--critical=95%

ERROR: FEHLER:  Relation »public.customer_addresses_id_seq« existiert nicht
LINE 7: FROM public.customer_addresses_id_seq) foo

When I am doing this with single commands like ...

check_postgres.pl -H 10.1.30.5 --action=sequence -db=staging --critical=95%
check_postgres.pl -H 10.1.30.5 --action=sequence -db=production --critical=95%

... everything goes well.

Seems to be a bug.

sequence-check with multiple databases

ID: 98
Version: 2.19.0
Date: 2012-01-19 05:46 EST
Author: Martin von Oertzen ([email protected])

one part of the --action=sequence needs over 3 minutes with postgres 9.1.2:

SELECT nspname, seq.relname, typname
 FROM pg_attrdef
 JOIN pg_attribute ON (attrelid, attnum) = (adrelid, adnum)
 JOIN pg_type on pg_type.oid = atttypid
 JOIN pg_class rel ON rel.oid = attrelid
 JOIN pg_class seq ON seq.relname = regexp_replace(adsrc,
$re$^nextval\('(.+?)'::regclass\)$$re$, $$\1$$)
 AND seq.relnamespace = rel.relnamespace
 JOIN pg_namespace nsp ON nsp.oid = seq.relnamespace
 WHERE adsrc ~ 'nextval' AND seq.relkind = 'S' AND typname IN ('int2', 'int4',
'int8')

on an other computer i use postgres 8.3.16:

$ check_postgres_sequence --db=postgres --perflimit=1
POSTGRES_SEQUENCE OK: DB "postgres" public.db_clients_id_seq=0% (calls
left=2147483539) | time=0.05s public.db_clients_id_seq=0%;85%;95%

$ check_postgres_sequence --db=mydb --perflimit=1
Can't use an undefined value as an ARRAY reference at check_postgres_sequence
line 7118.

$ check_postgres_sequence --db=postgres,mydb --perflimit=1
ERROR: ERROR:  relation "public.db_clients_id_seq" does not exist

Hostname Confusing When Using Unix Domain Socket Connection

When connecting explicitly by Unix domain socket (rather than over TCP), the output is a bit confusing. To do this, psql and other tools need you to specify the directory that contains the Unix socket (in my case, that's /tmp) in the "-h" option. For example, you might get something like this:

OK: DB "drupal" (host:/tmp) longest txn: 0s

Could this be altered to say something like: "OK: DB "drupal" (host:local) longest txn: 0s"? In terms of perl, it's something like this:

$hostname = 'local' if($hostname =~ /^//);

Thanks!

slony_status does not check all slaves of a cluster

Hi! If I understand the script correctly, the slony-status checks only 1 slave of a cluster, the one which is returned first from the query. This is random. E.g. here a scenario where one slave is behind:

SELECT
ROUND(EXTRACT(epoch FROM st_lag_time)) AS lagtime,
st_origin,
st_received,
current_database() AS cd,
COALESCE(n1.no_comment, '') AS com1,
COALESCE(n2.no_comment, '') AS com2
FROM _regdnscluster.sl_status
JOIN _regdnscluster.sl_node n1 ON (n1.no_id=st_origin)
JOIN _regdnscluster.sl_node n2 ON (n2.no_id=st_received);
lagtime | st_origin | st_received | cd | com1 | com2
---------+-----------+-------------+--------+-------------+------------------
67 | 1 | 3 | regdns | Master Node | regdev-tst2 node
1792 | 1 | 2 | regdns | Master Node | regdev-tst1 node
(2 rows)

I would expect that the script reports ERROR as one of the nodes is behind, but it reports:

./check_postgres.pl --action=slony_status --schema=_regdnscluster --dbname=regdns --warning=300 --critical=600
POSTGRES_SLONY_STATUS OK: DB "regdns" schema:_regdnscluster Slony lag time: 68 (68 seconds) | time=0.08s 'regdns._regdnscluster Node 1(Master Node) -> Node 3(regdev-tst2 node)'=68;300;600

In my opinion, it should either check all slaves, or at least the slave with the highest lag. Here a proposed fix (ORDER BY lag DESC):

--- check_postgres.pl.orig 2013-12-09 09:49:57.000000000 +0000
+++ check_postgres.pl 2013-12-09 09:50:40.000000000 +0000
@@ -7418,7 +7418,8 @@
COALESCE(n2.no_comment, '') AS com2
FROM SCHEMA.sl_status
JOIN SCHEMA.sl_node n1 ON (n1.no_id=st_origin)
-JOIN SCHEMA.sl_node n2 ON (n2.no_id=st_received)};
+JOIN SCHEMA.sl_node n2 ON (n2.no_id=st_received)
+ORDER BY lagtime DESC};

 my $maxlagtime = -1;

regards
Klaus

--same-schema check does not check indexes

ID: 54
Version: unspecified
Date: 2010-11-19 05:11 EST
Author: Aleksey Tsalolikhin ([email protected])

Hi. First of all, thanks for a great and most useful tool!

Secondly, we've just discovered that --same-schema check misses indexes.

We have some indexes that don't involve primary key constraints, and --same-schema check fails to report differences between tables if database A has a table with 1 index, and database B has that same table with 2 indexes. The 2nd index does not involve primary key constraints.

Thanks again for a great tool!

Yours truly,
Aleksey

replicate_row quoting issue

Hi Guys,

A long time ago we had an issue with replicate_row not quoting table names and this was fixed (currently line 6120) with a simple

$table = qq{"$table"}

However I've just had an issue where I want to check two tables with the same name in different schemas and end up getting my table names quoted as "schema.table". I couldn't see any option to pass a schema parameter, and ended up swerving the issue by passing the table name as schema"."tablename - but unless I've missed something wouldn't it be best to replace the above line for something to quote around the full stop too like:

$table =~ s/([^.]+)/"$1"/g;

Feature Request: Nagios multiline output - only include DBs above threshold

Hi,

On a server I have a large number of databases.
I would love it if the multiline nagios plugin output for checks like DB size was only showing the ones that actually exceed the threshold.

The way it is now it's unreadable.

query_time gives warning about use of uninitialized variable SQL2

the full error message is:
"Use of uninitialized value $SQL2 in concatenation (.) or string at ./check_postgres.pl line 7555."

The simple fix for it is the addition of

$SQL2 = $SQL;

in (new) line 7546.

pgbouncer checks should perhaps exclude pgbouncer database

ID: 107
Version: 2.19.0
Date: 2012-07-10 07:19 EDT
Author: Peter Eisentraut ([email protected])

Example, check that there are not less than 10 clients per database:

check_postgres --action=pgb_pool_cl_active -w 10 --reverse
POSTGRES_PGB_POOL_CL_ACTIVE WARNING: DB "pgbouncer" otherdb=53 | time=0.05s

Come to think of it, I don't know whether this output makes sense. With
2.16.0, it says

POSTGRES_PGB_POOL_CL_ACTIVE WARNING:  pgbouncer=1 | time=0.05

which is closer to the problem. Is it really useful to check the special
"pgbouncer" database in these checks? It's easy to exclude them using
--exclude=pgbouncer, of course.

Add a check for correlation dropping below a certain value.

ID: 72
Version: unspecified
Date: 2011-03-29 03:34 EDT
Author: Andy Lester ([email protected])

We had an app that turned out to be very dependent on tuples being clustered in
a certain order. We had an 80M-row table with two columns, keyword and id.
The table happened to be predominantly in keyword order, physically.
Correlation from pg_stats on this table was around 0.90. Throughout the day,
searches would read thousands or tens of thousands of tuples in keyword order.
Life was good.

This weekend, we rebuilt this table, but rebuilt it in ID order. When we
rolled the table out, our performance tanked. Reading thousands of tuples in
keyword order required thousands of seeks throughout the table rows. It
crushed performance. Turns out correlation on the keyword column went down to
about 0.03. Re-clustering the table fixed our performance problem.

And, it's not just this one table. We have about 15 of these tables. As they
get updated, we want to make sure that the correlation on the keyword column in
all these tables never gets below, say, 0.90, and check_postgres seems like the
ideal tool to monitor this.

query_time defaults are wrong in documentation

The documentation claims for query_time:

The values for the --warning and --critical options are amounts of time, and default to '2 minutes' and '5 minutes' respectively.

Actually, there doesn't appear to be any default. If you leave the options off, you get

ERROR: Invalid argument for 'critical' options: must be an integer, time or integer for time

It appears that this might possibly have been broken in commit 1ceb887.

(Personally, I like having no default better. But of course the documentation or the code should be corrected either way.)

check_postgres_sequence errors with "cannot access temporary tables of other sessions

ID: 97
Version: unspecified
Date: 2012-01-06 20:59 EST
Author: Ryan Kelly ([email protected])

check_postgres --command=sequence errors with:

ERROR: ERROR:  cannot access temporary tables of other sessions

This is when run as the 'postgres' user.

uname -a
Linux prodsql 2.6.38-11-virtual #50-Ubuntu SMP Mon Sep 12 21:51:23 UTC 2011
x86_64 x86_64 x86_64 GNU/Linux

psql -c 'select version();'
PostgreSQL 9.0.5 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.5.real
(Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2, 64-bit

check_postgres --version
check_postgres version 2.18.0

custom_query --reverse with critical and warning values returns OK

When doing a --reverse check with custom_query and using both warning and critical values, the result is always OK.

This is caused by the sub validate_range () which returns no values back to custom_query.

I fixed it with the following lines:

2187 if (length $warning and length $critical and $warning > $critical) { {
2188 # Original
2189 #return if $opt{reverse};
2190 # Option 1, following checks won't get executed
2191 #return ($warning,$critical) if $opt{reverse};
2192 # Option 2, break out of the if statement, needs another { }
2193 last if $opt{reverse};
2194 ndie msg('range-warnbig');
2195 } }

As I'm not a programmer, could you please review and if ok, include it in the next release.

Cheers

Tobias

bucardo / check_postgres Goto Github PK

check_postgres's Introduction

check_postgres's People

Contributors

Stargazers

Watchers

Forkers

check_postgres's Issues

Recommend Projects

Recommend Topics

Recommend Org

Jobs