sadsfae / ansible-nagios

Ansible playbook for setting up the Nagios monitoring system and clients.

Home Page: https://hobo.house/2016/06/24/automate-nagios-deployment-with-ansible/

License: Apache License 2.0

Python 11.76% Perl 7.25% Shell 57.25% Jinja 23.73%
nagios ansible monitoring playbook idrac centos rhel

ansible-nagios's Issues

make things more friendly for non-root users

I noticed some issues using this with Amazon EC2 when a non-root user is initially controlling systems, but that user can sudo.

We should wrap everything with become: true; it has no ill effect and helps in these cases.

Split out elasticsearch and elk server templates

There is a need for separate elasticsearch and elkserver templates. If you're using the ELK Ansible playbook, you'll have an all-in-one ELK with different monitoring requirements.

Also, you'll need to append the Kibana username/password for check_http.

If you're using a modular ELK deployment (multiple ES instances, perhaps separate kibana and several master nodes) you'll only care about monitoring elasticsearch.

(idrac) split out SNMP checks to scale better for lots of hosts

After implementing idrac checks across 200+ servers, there seem to be some scalability problems. SNMP queries sometimes take an average of 60-100 seconds to return, and even after increasing service_check_timeout and other parameters, some checks still seem to time out.

We probably need to split out checks into individual queries per component, possibly removing ones that aren't that useful versus how long they take to return.
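
One way to picture the per-component split is to generate a separate Nagios service definition for each component, so every SNMP query gets its own schedule and timeout. The sketch below is only an illustration: the component names, the check_idrac_component command and the generic-service template are assumptions, not names taken from this playbook.

```python
# Component names are illustrative; use whatever your iDRAC check supports.
COMPONENTS = ["FAN", "PS", "DISK", "MEM"]

# Hypothetical service block: "check_idrac_component" and "generic-service"
# are assumed names, not definitions shipped by ansible-nagios.
SERVICE_TEMPLATE = """define service {{
    use                  generic-service
    host_name            {host}
    service_description  iDRAC {component}
    check_command        check_idrac_component!{component}
}}
"""


def render_services(host, components=COMPONENTS):
    """Return one service definition per component for a single host."""
    return "\n".join(
        SERVICE_TEMPLATE.format(host=host, component=component)
        for component in components
    )


if __name__ == "__main__":
    print(render_services("idrac-01"))
```

Dropping a slow or low-value component then means removing one entry from the list rather than editing a combined check.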

fatal: [host-01]: UNREACHABLE!

Your System Details

  • Ansible version: ansible 2.9.18
  • Operating System: CentOS 8

Describe the bug

I tried to install this playbook using the README instructions, but when I run the command ansible-playbook -i hosts install/nagios.yml I get the error: fatal: [host-01]: UNREACHABLE!

I have tried using each of the following inventory entries:

[nagios]
host-01

[nagios]
demo

[nagios]
159.x.x.247

To Reproduce / What were you doing?
Steps to reproduce the behavior:

git clone https://github.com/sadsfae/ansible-nagios
cd ansible-nagios
sed -i 's/host-01/159.x.x.247/' hosts
time ansible-playbook -i hosts install/nagios.yml

Support HTTP Auth for oobserver category

We use the [oobserver] inventory group for things like generic out-of-band interfaces and PDUs, but we need a configurable option to support the check_http Nagios plugin with authentication. This should be turned off by default and be configurable in install/group_vars/all.yml.

Nagios start error

fatal: [hostname]: FAILED! => {"changed": true, "cmd": ["systemctl", "restart", "nagios.service"], "delta": "0:00:00.037135", "end": "2017-01-28 17:05:23.980575", "failed": true, "rc": 1, "start": "2017-01-28 17:05:23.943440", "stderr": "Job for nagios.service failed because the control process exited with error code. See "systemctl status nagios.service" and "journalctl -xe" for details.", "stdout": "", "stdout_lines": [], "warnings": []}

[root@li849-175 ansible-nagios]# journalctl -xe
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Nagios Core 4.0.8
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Copyright (c) 1999-2009 Ethan Galstad
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Last Modified: 08-12-2014
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: License: GPL
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Website: http://www.nagios.org
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Reading configuration data...
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Warning: use_embedded_perl_implicitly is deprecated and will be removed.
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Warning: enable_embedded_perl is deprecated and will be removed.
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Warning: p1_file is deprecated and will be removed.
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Warning: sleep_time is deprecated and will be removed.
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Warning: external_command_buffer_slots is deprecated and will be removed. All commands are always processed upon arriva
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Warning: command_check_interval is deprecated and will be removed. Commands are always handled on arrival
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Read main config file okay...
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Warning: Duplicate definition found for host 'hostname' (config file '/etc/nagios/conf.d/webservers.cfg', starting
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Error: Could not add object property in file '/etc/nagios/conf.d/webservers.cfg' on line 9.
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Error processing object config files!
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: ***> One or more problems was encountered while processing the config files...
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Check your configuration file(s) to ensure that they contain valid
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: directives and data defintions. If you are upgrading from a previous
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: version of Nagios, you should be aware that some variables/definitions
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: may have been removed or modified in this version. Make sure to read
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: the HTML documentation regarding the config files, as well as the
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: 'Whats New' section to find out what has changed.
Jan 28 17:05:23 li849-175.members.linode.com systemd[1]: nagios.service: control process exited, code=exited status=1
Jan 28 17:05:23 li849-175.members.linode.com systemd[1]: Failed to start Nagios Network Monitoring.
-- Subject: Unit nagios.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel

-- Unit nagios.service has failed.

-- The result is failed.
Jan 28 17:05:23 li849-175.members.linode.com systemd[1]: Unit nagios.service entered failed state.
Jan 28 17:05:23 li849-175.members.linode.com systemd[1]: nagios.service failed.

Can not update servers.cfg

Your System Details

  • Ansible version (rpm -qa | grep ansible):
  • ansible-2.9.15-1.el7.noarch
  • Operating System: (cat /etc/redhat-release)
  • CentOS Linux release 7.5.1804 (Core)

Describe the bug
failed: [localhost] (item=servers.cfg) => {"ansible_loop_var": "item", "changed": false, "item": "servers.cfg", "msg": "AnsibleUndefinedVariable: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_os_family'"}

Don't force SuperMicro Perl Deps unless needed

Currently we require the perl-IO-Tty and perl-IPC-Run packages on EL7 for SuperMicro IPMI checks, regardless of whether people are using them. This wouldn't normally be an issue, except that they don't appear in the base RHEL7 repos but do appear for CentOS7.

Let's make supermicro_enable_checks a configurable option with the default set to false.

Add new hosts

Can you add a bunch of hosts to an already installed Nagios?

[RFE] Add alerting via webhook

This is an RFE for adding a contact definition that is a webhook URL, e.g. posting alert information to a chat platform that supports receiving webhooks, such as G-Chat or Slack.
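
A notification command for such a contact could be a small script along these lines. This is a minimal sketch, not something the playbook provides: the function names are hypothetical, and the JSON body with a single "text" field is an assumption about the receiving platform, so check your chat service's webhook documentation for the payload it expects.

```python
import json
import urllib.request


def build_payload(host, service, state, output):
    """Build a JSON body with a single "text" field (assumed payload shape)."""
    text = "[{}] {}/{}: {}".format(state, host, service, output)
    return json.dumps({"text": text}).encode("utf-8")


def send_alert(webhook_url, payload, timeout=10):
    """POST the payload to the webhook URL and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Only builds and prints the body; call send_alert() with a real URL to post it.
    print(build_payload("web01", "HTTP", "CRITICAL", "Connection refused").decode("utf-8"))
```

In a Nagios setup, the host, service, state and output values would typically come from notification macros passed as arguments to the command.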

reloading the iptables service

Restarting the iptables service can be dangerous if you're using conntrack. Reloading is much nicer, I think.

--- a/install/roles/nagios-client/tasks/main.yml
+++ b/install/roles/nagios-client/tasks/main.yml
@@ -99,7 +99,7 @@
   register: iptables_needs_restart

 - name: Restart iptables-services for TCP/{{nrpe_tcp_port}} (iptables-services)
-  shell: systemctl restart iptables.service
+  shell: systemctl reload iptables.service
   ignore_errors: true
   when: iptables_needs_restart != 0 and firewalld_in_use.rc != 0 and firewalld_is_active.rc != 0

Create Jenkins Server Role

We need a Jenkins server role here. It is probably best to use NRPE to monitor localhost:jenkinsport and make the port configurable.

[RFE] FreeNAS Status API Script Status needs Cleanup

The check_freenas.py script returns a dictionary when it should return a more digestible format. The entire status is present but it could be cleaned up a little bit.

{u'meta': {u'previous': None, u'total_count': 1, u'offset': 0, u'limit': 20, u'next': None}, u'objects': [{u'timestamp': 1573610556, u'message': u'Pool pool0 state is DEGRADED: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state.', u'id': u'A:VolumeStatus:["Pool %(volume)s state is %(state)s: %(status)s", {"state": "DEGRADED", "status": "One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state.", "volume": "pool0"}]', u'dismissed': False, u'level': u'CRITICAL'}]} 

https://github.com/sadsfae/ansible-nagios/blob/master/install/roles/nagios/templates/check_freenas.py.j2#L115
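
As a rough idea of what a cleaner status could look like, the sketch below reduces a parsed alert response of the shape shown above to a Nagios exit code and a one-line message. It is not the code in check_freenas.py: the function name and the level-to-state mapping are assumptions (only CRITICAL appears in the sample output), and the sample message is shortened.

```python
# Nagios plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

# Assumed mapping from alert levels to Nagios states; levels other than
# CRITICAL are guesses and may not match what the API really returns.
LEVEL_TO_STATE = {"OK": OK, "INFO": OK, "WARN": WARNING, "WARNING": WARNING, "CRITICAL": CRITICAL}

# Ranking used to pick the worst state (CRITICAL outranks UNKNOWN here).
SEVERITY_ORDER = [OK, UNKNOWN, WARNING, CRITICAL]


def summarize_alerts(response):
    """Return (exit_code, one_line_status) for a parsed alert response."""
    active = [a for a in response.get("objects", []) if not a.get("dismissed")]
    if not active:
        return OK, "OK - no active alerts"
    states = [LEVEL_TO_STATE.get(a.get("level"), UNKNOWN) for a in active]
    worst = max(states, key=SEVERITY_ORDER.index)
    messages = "; ".join(a.get("message", "no message") for a in active)
    return worst, "{} - {}".format(STATE_NAMES[worst], messages)


if __name__ == "__main__":
    sample = {
        "meta": {"total_count": 1},
        "objects": [{
            "level": "CRITICAL",
            "dismissed": False,
            "message": "Pool pool0 state is DEGRADED: One or more devices could not be opened.",
        }],
    }
    code, line = summarize_alerts(sample)
    print(line)
```

Only the message and level reach the status line, so the meta block and the raw id string no longer clutter the Nagios output.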

Refresh NRPE for good measure

I have noticed that NRPE is not always started, so let's employ a service refresh. We're going to remove the nrpe_needs_restart register; although using it is the correct way to do this, I've noticed that it doesn't always work.

Nagios 4.2.4 requires workaround for /var/log/nagios/spool/checkresults

nagios-4.2.4-2 does not properly create the /var/log/nagios/spool/checkresults directory and the following error occurs:

Feb 24 03:26:33 example.com nagios[401]: Nagios Core 4.2.4
Feb 24 03:26:33 example.com nagios[401]: Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Feb 24 03:26:33 example.com nagios[401]: Copyright (c) 1999-2009 Ethan Galstad
Feb 24 03:26:33 example.com nagios[401]: Last Modified: 12-07-2016
Feb 24 03:26:33 example.com nagios[401]: License: GPL
Feb 24 03:26:33 example.com nagios[401]: Website: https://www.nagios.org
Feb 24 03:26:33 example.com nagios[401]: Reading configuration data...
Feb 24 03:26:33 example.com nagios[401]: Error in configuration file '/etc/nagios/nagios.cfg' - Line 454 (Check result path '/var/log/nagios/spool/checkresults' is not a valid direc
Feb 24 03:26:33 shithole.hobopiss.com nagios[401]: Error processing main config file!

Working with large number of hosts

Great work on this project. I'm trying to get this working with a largish number of servers and am running into some issues. From my understanding, this method needs to gather facts for all servers, which can take quite some time, and I've been hitting problems when a server can't be reached.

Are there any good ways to speed things up or work around the fact gathering?

Add Same server in Multiple group

Hi,

I downloaded this playbook, but I have one question: if I add the same server to multiple groups, will it work?
For example, I added one server (10.1.2.86) to the [switches], [webservers] and [servers] groups. I expect it to get the checks defined for each of those groups, i.e. in Nagios my server would look like this:
10.1.2.86:
[switches]: all checks
[webservers]: all checks
[servers]: all checks

How can this be done? Can you please give any suggestions?

Wrap SuperMicro IPMI Checks to Control Status Codes

Using IPMI as the only means of monitoring SuperMicro servers via out-of-band management isn't ideal, but it works. We should wrap the raw IPMI return values better to control the following false positives:

  • Powered-off machines incorrectly alert as being down.
  • Disk checks or chassis intrusion may result in false positives.
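
A wrapper along these lines could sit between the raw IPMI result and Nagios. It is only a sketch under stated assumptions: the function name, the sensor name prefixes and the choice to report a powered-off chassis as OK are all hypothetical, and how the power state and failing sensors are collected is left out.

```python
# Nagios plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical sensor name prefixes to ignore; real names depend on the BMC.
IGNORED_SENSORS = ("Chassis Intru",)


def wrap_ipmi_status(power_on, failing_sensors, ignored=IGNORED_SENSORS):
    """Map a raw IPMI result to (exit_code, message), filtering known noise."""
    if not power_on:
        # Assumed policy: a deliberately powered-off machine should not alert.
        return OK, "OK - chassis is powered off, sensor checks skipped"
    actionable = [
        sensor for sensor in failing_sensors
        if not any(sensor.startswith(prefix) for prefix in ignored)
    ]
    if actionable:
        return CRITICAL, "CRITICAL - " + ", ".join(actionable)
    return OK, "OK - no actionable sensor alerts"


if __name__ == "__main__":
    print(wrap_ipmi_status(True, ["Chassis Intru", "FAN1"]))
```

Keeping the ignore list as a parameter means a site that does care about chassis intrusion can pass an empty tuple instead.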

AnsibleUndefinedVariable: 'dict object' has no attribute 'ansible_default_ipv4'

Hi,

I am facing this issue: "AnsibleUndefinedVariable: 'dict object' has no attribute 'ansible_default_ipv4'"

Here is my hosts file:

[nagios]
103.225.77.2
[servers]

mysql-conversions-1 ansible_host=192.168.139.74


[webservers]

api-docs ansible_host=192.168.152.188
[elkservers]

[elasticsearch]
[switches]

[oobservers]

[idrac]

[RFE] Add mdadm software raid checks

Is your feature request related to a problem? Please describe.

We need to have good mdadm raid checks.

Describe the Possible Solution
Provide Ansible hostgroup entries, starting with the generic Linux server and DNS server groups, that include an option for those using mdadm RAID.
