This is an update of this article, because a) it needed it and b) it's gotten simpler. You can in fact install Nagios completely via DarwinPorts/MacPorts, but that's not how I did it.
Some background for those who may not know what Nagios is. Nagios is a network monitoring tool. It allows you to monitor services running on various hosts on your network. For example, you can monitor your switches to ensure they’re working, or you can monitor various critical server processes on an Xserve, like the KDC process, AFP processes, etc. With various plugins from the Nagios Exchange site at http://www.nagiosexchange.org/, you can monitor internal counters on Windows servers too. If you’re skilled with Perl, or any one of dozens of other programming languages, you can write your own plugins.
While you can install everything Nagios needs manually, I just use MacPorts, nee DarwinPorts. You can get the initial MacPorts download at http://svn.macports.org/repository/macports/downloads/DarwinPorts-1.3.2/. The docs are pretty solid, so just RTFM and you should be set. Since MacPorts installs everything into /opt/local, you'll want to modify your .profile to reflect that. I added the various /opt/local/ trees, and modified my PKG_CONFIG_PATH to account for /opt/local/ as well. Once you have that set up, you can install the basics, if you don't have them already. For my setup it was:
sudo port install zlib
sudo port install libpng
sudo port install jpeg
sudo port install gd2
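For the .profile changes mentioned above, a minimal sketch might look like the following. The exact list of directories is an assumption based on the MacPorts defaults under /opt/local, so adjust it to the trees you actually use.

```shell
# Additions to ~/.profile so MacPorts-installed tools and libraries are found.
# The directory list is an assumption based on the default /opt/local prefix.

# Put the MacPorts binaries ahead of the system ones.
export PATH="/opt/local/bin:/opt/local/sbin:$PATH"

# Let man find the MacPorts man pages too.
export MANPATH="/opt/local/share/man:${MANPATH:-}"

# Let pkg-config find libraries built by MacPorts (gd2, libpng, and so on).
# The ${VAR:+...} form only appends the old value if there was one.
export PKG_CONFIG_PATH="/opt/local/lib/pkgconfig${PKG_CONFIG_PATH:+:$PKG_CONFIG_PATH}"
```

Open a new Terminal window (or source the file) before running the port commands so the new paths take effect.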
Once that's done, you'll need to make sure the Perl Net-SNMP libraries are installed. You can install these however you like. I'm lazy, so I punted:
sudo cpan -i Net::SNMP
Now we can get down to actually installing Nagios.
First, set up the directories. I use /usr/local/nagios/ because that's the default, and it simplifies things. You then need to create a Nagios user, a Nagios group, and a Nagios command group. Again, I'm lazy, so I use nagios/nagios/nagioscmd respectively. Obviously, you don't want the Nagios user to have a shell or a login, so leave the password blank. The Nagios group needs two members: nagios and www (or whatever user you run your web server under).
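If you'd rather script the account creation, something along these lines should work on Mac OS X releases where dscl manages local accounts. Treat it as a sketch rather than what I actually ran: the numeric IDs (401 and 402) are made-up placeholders that you must check for collisions on your system, /var/empty as the home directory is my assumption, and by default the function only prints the commands instead of running them.

```shell
# Dry run by default: set DRY_RUN=0 to really execute the commands via sudo.
DRY_RUN="${DRY_RUN:-1}"
# Hypothetical IDs; verify they are unused on your system first.
NAGIOS_ID="${NAGIOS_ID:-401}"
NAGIOSCMD_ID="${NAGIOSCMD_ID:-402}"

# Print the command in dry-run mode, otherwise run it with sudo.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        sudo "$@"
    fi
}

create_nagios_accounts() {
    # The nagios and nagioscmd groups.
    run dscl . -create /Groups/nagios
    run dscl . -create /Groups/nagios PrimaryGroupID "$NAGIOS_ID"
    run dscl . -create /Groups/nagioscmd
    run dscl . -create /Groups/nagioscmd PrimaryGroupID "$NAGIOSCMD_ID"
    # The nagios user, with no usable shell and no password set.
    run dscl . -create /Users/nagios
    run dscl . -create /Users/nagios UniqueID "$NAGIOS_ID"
    run dscl . -create /Users/nagios PrimaryGroupID "$NAGIOS_ID"
    run dscl . -create /Users/nagios UserShell /usr/bin/false
    run dscl . -create /Users/nagios NFSHomeDirectory /var/empty
    # The nagios group needs both nagios and the web server user (www).
    run dscl . -append /Groups/nagios GroupMembership nagios
    run dscl . -append /Groups/nagios GroupMembership www
}
```

Run create_nagios_accounts once to review what it would do, then again with DRY_RUN=0 when you're happy with it.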
Okay, now we download Nagios and the plugins from http://www.nagios.org/download/. The current version of Nagios as of this article is 2.7 and the plugins are up to 1.4.1. Download both to the Nagios server, unzip and untar the files.
Nagios's configure parameters are a little complicated, but not too hard. For my setup, I used:
./configure --with-gd-lib=/opt/local/lib --with-gd-inc=/opt/local/include \
--prefix=/usr/local/nagios --with-cgiurl=/cgi-bin --with-htmlurl=/ \
--with-nagios-user=nagios --with-nagios-group=nagios --with-command-group=nagioscmd
You should see the Nagios configure screen if all went well. Note the HTML and CGI URLs. If you have to change those, you can easily enough, but wait until it's all built and done, (you only have to fix it in one HTML file).
Then it's just:
make all (build the binaries)
sudo make install (install them)
sudo make install-commandmode (installs and configures the directory for external commands)
sudo make install-config (installs sample Nagios config files in /usr/local/nagios/etc/)
Barring any errors in install-config, the basic installation is done.
The Nagios Directory Tree
To help you understand Nagios better, let's take a look at the structure of the Nagios Directory Tree, (this assumes that nagios/ is in /usr/local/):
/usr/local/nagios/
- /bin
- Contains the Nagios executable and sometimes the nagiostats executable. nagiostats is a utility that reports statistics about the running Nagios process (handy for feeding graphing tools), and as such is beyond the scope of this article
- /etc
- contains the configuration files for Nagios. You'll be spending a lot of time here with text editors, especially at first
- /lib
- various support files
- /libexec
- the directory where the Nagios probe files live. These can be compiled C code, shell scripts, etc. The Nagios web site has some good docs on developing your own probe.
- /sbin
- this is the directory containing the Nagios cgi files. You can either link them to your standard CGI-BIN directory or just copy them. Nagios doesn't update too often, so either is about as easy
- /share
- directory for the Nagios html files
- /var
- location for various log files, lock files, dat files, etc.
Nagios has a fairly simple directory structure, but knowing what lives where always helps.
Initial Setup
Before we start setting up probes, etc., there are a few things we need to take care of. First, you want to make sure that Apache knows where the Nagios html files are. You can do this however you wish. The simple way is to create a site that points to /usr/local/nagios/share/. You'll (obviously) need to enable CGI execution for that site, and make sure that the modules for running Perl et al CGIs are enabled. You'll also want to make sure the evil Performance Cache is not running.
The next step is to make sure the CGIs are usable. You can either link them to /Library/WebServer/CGI-Executables/ or copy them there, whichever you prefer. As I said, Nagios isn't updating every other week, so the pain level is about the same for either method. Use whichever one works best in your environment.
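Either approach can be scripted. Here is a small sketch; the function name and its arguments are my own invention, and the real target directory needs sudo. If Apache is not set up to follow symlinks in the CGI directory, use the copy mode instead.

```shell
# install_cgis MODE SRC DST
#   MODE is "link" (symlink each CGI) or "copy" (copy each CGI, keeping permissions).
install_cgis() {
    mode=$1
    src=$2
    dst=$3
    for cgi in "$src"/*.cgi; do
        # Skip the unexpanded pattern when no .cgi files exist.
        [ -e "$cgi" ] || continue
        case $mode in
            link) ln -sf "$cgi" "$dst/" ;;
            copy) cp -p "$cgi" "$dst/" ;;
            *)    echo "usage: install_cgis link|copy SRC DST" >&2; return 2 ;;
        esac
    done
}
```

For the layout in this article, that would be run from a root shell as: install_cgis copy /usr/local/nagios/sbin /Library/WebServer/CGI-Executables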
Now, we make sure that Nagios's html pages know where its CGIs are. Even though we thought we told Nagios to use the root of our web directory and cgi directory, it really likes to add a /nagios/ in front of things. Luckily, you only have to deal with this in one file, /usr/local/nagios/share/side.html. Pull that up in BBEdit, pico/nano or whatever text editor floats your boat and make sure the CGI paths are correct for your setup.
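If you'd rather hunt down the stray prefix from the command line, a sketch like this will do it. The /nagios/cgi-bin/ string is my guess at what the extra prefix looks like; check the grep output for what your copy of side.html really contains before rewriting anything.

```shell
# Show every line (with line numbers) that still carries a /nagios/ prefix.
find_nagios_prefix() {
    grep -n '/nagios/' "${1:-/usr/local/nagios/share/side.html}"
}

# Write a copy of the file with the assumed /nagios/cgi-bin/ prefix reduced
# to /cgi-bin/. The original is left untouched so you can compare the two.
strip_nagios_prefix() {
    sed 's#/nagios/cgi-bin/#/cgi-bin/#g' "$1" > "$2"
}
```

Once the rewritten copy looks right, move it over the original (keeping a backup, naturally).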
If everything is correct so far, pointing a browser at your Nagios home page should get you:

Now, nothing is going to work yet, because we have to do a little more setup and initial configs, but, at least we have something. The next thing to set up is some .htaccess love and make sure httpd.conf is set up right. But first....
Nagios relies on SNMP. SNMP is simply not secure at all unless you use SNMPv3, which still isn't quite ready for most prime time usage. SNMP v1 / v2c doesn't encrypt anything. I only run SNMP where there's no direct access from the public Internet with a solid firewall and VPN. If someone gets into your network and can use SNMP on your systems, they can have all kinds of evil fun.
That's not to say SNMP is some kind of big evil risk, however, it is completely inappropriate outside of a properly protected network. Even on a properly protected network, you want to limit what machines can request SNMP information, receive traps, etc. SNMP is a really useful tool, but it is a high-speed way to shoot yourself in the foot if you are not careful. If you are new to SNMP, stop reading now, and grab the O'Reilly book “Essential SNMP”. Read it, then come back and set up Nagios.
More Setup
Now we want to configure a .htaccess file to limit who can do what with Nagios (i.e., disable notifications, enable notifications, shut down the process, etc.). Nagios will scream bloody murder if you set it up to bypass its (admittedly) simple security features. So what I did was take this directive:
<Directory "/Library/WebServer/CGI-Executables">
AllowOverride None
Options None
Options ExecCGI
Order allow,deny
Allow from all
</Directory>
and changed it to read:
<Directory "/Library/WebServer/CGI-Executables">
#AllowOverride None
AllowOverride AuthConfig
# Options None
Options ExecCGI
Order allow,deny
Allow from all
</Directory>
I then added a .htaccess file in /Library/WebServer/CGI-Executables/ and set it up as:
AuthName "Nagios Access"
AuthType Basic
AuthUserFile /usr/local/nagios/etc/htpasswd.users
require valid-user
(There are dozens of ways to do this, many much better than this one. Since this server is on a well-protected network, I kept it simple. Your mileage may vary.)
Now you want to create the htpasswd.users file in /usr/local/nagios/etc/ and populate it with a couple users, nagiosadmin and jwelch:
Create the file and give it the first user:
sudo htpasswd -c /usr/local/nagios/etc/htpasswd.users nagiosadmin
Now add the next user:
sudo htpasswd /usr/local/nagios/etc/htpasswd.users jwelch
You only need the -c switch when you initially create the file. Now, we want to make sure that Nagios knows what's what. To do that, we open up the /usr/local/nagios/etc/cgi.cfg file. First, take some time to read the file. Ethan and the rest of the Nagios team have done an excellent job of documenting things, both in the formal documentation, and in the file comments. Read the file comments first, so you can get an idea of what is going on in the file. You'll want to scroll down to the use_authentication line and make sure it is uncommented and reads:
use_authentication=1
From there, we set up the lines that create specific access for users. Again, read the comments in the cgi.cfg file, they explain what's going on much better than I can here. One advantage of this is that you can set up “View only” users, or even set it up so that someone can only view information for hosts that they are contacts for. Again, SNMP is not terribly secure, but Nagios does a good job with the basics of making it harder for people to accidentally do dumb things.
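To see at a glance which access lines are actually active (as opposed to commented out), a quick grep helps. This is just a convenience sketch; it assumes the per-user access directives in cgi.cfg all start with authorized_for_, as they do in the sample file, and the function name is mine.

```shell
# List the active (uncommented) authentication and authorization lines.
show_cgi_auth() {
    grep -E '^(use_authentication|authorized_for_[a-z_]+)=' \
        "${1:-/usr/local/nagios/etc/cgi.cfg}"
}
```

Each authorized_for_ line takes a comma-separated list of the user names you created with htpasswd, for example nagiosadmin,jwelch.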
Once you have this all set up, we're ready to delve into the most critical part of Nagios:
The Config Files
Now, it's tempting to just start setting up configs and adding devices. However, before you do that, let's take a look at some basic Nagios Configuration File Theory. With Nagios, you have commands that are part of services which are run on hosts. Both hosts and services can be grouped together as hostgroups and servicegroups. When certain conditions are met, contacts, which can be part of contactgroups, are notified via email, pagers, what have you.
Now, because you can have one host (a switch for example) that controls many other hosts, (computers and printers) we have dependencies. Dependencies allow you to say “these 96 hosts are dependent on this host, so if this host goes down, DON'T notify me about the other 96!” Dependencies are a huge help not just with notifications, but with how Nagios can show you maps of your network.
Each host has both basic information, such as IP address, name, hours of operation, etc., and extended info, which allows for custom icons, 3-D icons, connecting Nagios to various graphing utilities, etc.
So, before you start adding hosts, you want to think about your hostgroups. For example, Active Directory and/or Open Directory domain controllers can all be in one group, since you're going to look for the same things on each. That way, you can apply a service to that group, and every host in that group gets that service. When you're talking about potentially hundreds or thousands of hosts per service, and tens of services per host, the use of hostgroups is almost required. Bring up a new host, add it to the appropriate group and voila, it has all the services running on it that it should rather than you having to add that host to every service manually. Service groups allow you to group services in the same way. Need to apply ten services at once? Use a servicegroup, save yourself some time.
Now, on to the files themselves, (note that all of these are in /usr/local/nagios/etc/, and I am not going to talk about all of them, just the ones that I primarily use):
nagios.cfg
nagios.cfg is the master config file, and so is less of an 'active' config file, and more of a starting point for the other config files. It contains the listings for:
- Log file path
- command file, contact file, host file, and other config file paths
- cache and resource file paths
- status file paths
- user name for the nagios user
- name for the nagios group
- enable/disable external commands
- external command check intervals
- comment, downtime, and lock file paths
- temp file path
- event broker settings
- log options
- global service and host check options
- scheduling and state settings
- interval settings
- service check enable/disable settings
- other global options
resource.cfg
resource.cfg is a collection of placeholders and shortcuts. You can set up various variables in here that you use in commands and other things. So for example, you can set up the base nagios path here with this:
# Sets $USER1$ to be the path to the plugins
$USER1$=/usr/local/nagios/libexec
Base SNMP community:
#standard SNMP community
$USER3$=mycom
NT4 Domain Name:
#NT Domain name
$USER4$=mydom
LDAP domain:
#LDAP base for myds.net
$USER5$="dc=myds,dc=net"
You can have quite a few of these, and they come in handy for saving you time when setting up check commands, which brings us to our next file:
checkcommands.cfg
checkcommands.cfg is where you set up the commands that use the Nagios probes in the /usr/local/nagios/libexec/ directory. A basic checkcommand looks like this:
#check_macshare command definition
define command{
command_name check_smb
command_line $USER1$/check_disk_smb -H $HOSTADDRESS$ -s MacShare -u username -p password -W $USER4$
}
Remember that you don't directly use commands, but rather incorporate them into services. So here we have a command that is going to use the check_disk_smb command to check the status of an SMB share. (To find out what the options for a specific probe are, run that probe with a -h switch. In this case, we see that this probe uses SMB to check available vs total disk space on an smb server.)
We know that $USER1$ is the root path to the Nagios libexec directory. -H takes the host address; $HOSTADDRESS$ is a standard Nagios macro, which we'll look at when we look at the hosts file. -s is the share name, “MacShare” in this case, -u and -p are a user name and password that can access this share, and -W is the workgroup/NT4 Domain name, in this case, mydom. When I run this command on one of my servers I get:
Disk ok - 1989.4G (97%) free on \\server\sharename
This is the basic check command syntax. If you spend some time with the options for commands, you can get really flexible with them. Even without that, one command can tell you multiple things. For example, if this command runs okay, you know the space stats for that share. You also know that the SMB server on the remote machine is running, because if it wasn't, the command would fail. So right there, you have one command giving you not just disk usage info, but SMB Server up/down status from the client POV.
However, that's only the information as the share sees it. If I want to check from a different viewpoint, I can use the check_disk_freespace service, which uses the check_snmp_storage.pl command, and looks at the disk volume that contains the share. Here's the checkcommand for that:
# 'check_snmp_storage' command definition
define command{
command_name check_disk_freespace
command_line $USER1$/check_snmp_storage.pl -H $HOSTADDRESS$ -C $USER3$ -m $ARG1$ $ARG2$ -w 20% -c 10% -T pl
}
If I run that on the same machine on the disk, I get:
/Volumes/MacShare : 85 %left (2037144Mb/2384544Mb) (> 20) : OK
But they're different numbers. Well, welcome to the wonderful world of network stat monitoring, where how you observe can give you different data. Now, we have to run our commands against something, and what we run them against are hosts.
hosts.cfg
This is the file where you set up your basic host configurations. Like a lot of things in Nagios, you can set up templates for a lot of this, so as to make the individual host entries easier to create. In my case, I have this as my base host template:
# Generic host definition template
define host{
name generic-host ; The name of this host template - referenced in other host definitions, used for template recursion/resolution
check_command check-host-alive ;standard ping test for is it there
notifications_enabled 1 ; Host notifications are enabled
event_handler_enabled 1 ; Host event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restarts
notification_options d,u,r,f ;in order notify on down, unreachable,up, flap
notification_period 24x7 ; Host problems we want to know about immediately
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
}
Going down this template, we see some basic things. The name of this template is generic-host, and this is used by other host entries to apply these settings to the entry's host. You can set up multiple templates, just give them different names and apply them as you need. The next line is an up/down check command that is run against every host using this template. It's just a check to see if the host is responding. The next three lines enable notifications for a host, (i.e. up/down information), enable the event handler for this template, and enable flap detection.
Flap detection is one of the nicer features of Nagios. There are times when you're going to run into a situation where the connection to a given host, or a service on a host is going to be going from up to down and back, good to bad and back, etc. rather rapidly. To avoid sending you gobs of good/bad messages, Nagios has what it calls “flap” detection. Once a host or service on a host goes from good to bad and back at a fast enough pace, it is said to be flapping, and Nagios will stop sending individual messages on that host or service. Instead it will send a “flapping” message, and then when the host or service has stopped flapping, it sends a “not flapping” message. From that point, normal notifications resume. In general, flap detection is your friend, and you should have it enabled by default.
The next three settings deal with processing performance data, (for external applications like graphing addons), and retaining (non) status information across Nagios restarts. When you're modifying the Nagios configs, you restart the process a lot.
The notification options set when Nagios sends out host notifications. Here, we have four conditions:
- Notify when host is down, (d)
- Notify when host recovers, (r)
- Notify when host is unreachable, (u)
- Notify when host starts/stops flapping, (f)
The next line sets up the default notification, “24x7”. This is a period defined by you, and is contained in timeperiods.cfg, (at least in my setup), which we'll talk about later.
The final line tells Nagios that this is not a host definition, but rather just a template, so don't treat it as a host.
Great, we have our defaults, now, let's look at an actual host entry, in this case, for an Xserve:
# 'xserve01' host definition
define host{
use generic-host ; Name of host template to use
host_name xserve01
alias Xserve #1
address <numeric ip address>
parents coreswitch
max_check_attempts 10 ;number of attempts for a bad return
notification_interval 1 ;how long to wait in minutes to notify someone
contact_groups ISadmins
}
The first line tells Nagios to apply our “generic-host” template to this host, setting quite a few defaults. The next line is the host_name. The host_name can be the same as the DNS name, (and that's the convention I use), but it is not a requirement. This is the name used within Nagios to refer to this host. In general, it's a good idea to use DNS naming standards here, (no spaces, no dots in the host name, etc.). The next line is the alias for the host. This is where you can use more “human” names, with spaces, etc. The address line is for the TCP/IP address here. I recommend using the numerical address, it saves you DNS lookup time, and when you think about how often a Nagios setup can be trying to talk to hosts on a network, even a second or two adds up.
The “parents” line is that whole dependencies setup. With the parents line, I can ensure that if the host named "coreswitch" goes down, I'm not going to get hit for every service and host running behind it. Since that could be multiple hosts and even multiple hosts on multiple switches, “parents” is a good thing to use. “max_check_attempts” is the number of times Nagios will retry a check if it gets anything other than an “OK” response, (however that is defined). “notification_interval” is the number of “interval units” to wait before re-notifying a contact (group) that a host is still down or unreachable. In my case, my standard “interval unit” is 60 seconds, so it works in minutes. The length of an interval unit is set by interval_length in nagios.cfg. The final line is the contact group who gets all notifications from this host; “ISadmins” in this case. This is defined in contactgroups.cfg, which we'll get to in a bit.
These aren't all the possible options for a host entry or template. Those are all described, and well, in the Nagios documentation. However, since Nagios has a standard way to set up host entries, it creates a lot of options with regard to how you add, edit, and delete host entries, something that every sysadmin loves.
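Because every change to these files means restarting Nagios, it's worth having Nagios check the configuration before you restart. The -v switch runs that pre-flight check against the main config file. The wrapper function and its messages below are just my own illustration, not something that ships with Nagios, and the paths assume the /usr/local/nagios layout used in this article.

```shell
NAGIOS_BIN="${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}"
NAGIOS_CFG="${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}"

# Returns 0 if Nagios accepts the configuration, 1 otherwise.
check_nagios_config() {
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG" > /dev/null 2>&1; then
        echo "config OK"
    else
        echo "config has errors; run $NAGIOS_BIN -v $NAGIOS_CFG to see them" >&2
        return 1
    fi
}
```

Only restart the Nagios process when check_nagios_config succeeds; a typo in hosts.cfg is much nicer to find this way than by having monitoring quietly stop.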
hostgroups.cfg
Now, as easy as hosts.cfg can be to set up and maintain, if you have to modify individual hosts every time you make changes, you're going to lose your mind. So, we use hostgroups, which are, well, groups of hosts that you can use to make your life easier. If you have ten email servers, what would you rather do, apply an SMTP check to all ten individually, or put them all in a host group and just apply the SMTP check to the group? Right, so now let's look at a hostgroup definition for my Xserves:
# 'Xserves' host group definition
define hostgroup {
hostgroup_name xserves ;name used to refer to the hostgroup
alias Xserves ;More human name
members xserve02,xserve01 ;members of this host group
}
I (obviously) have a fairly simple hostgroup setup. However, that's my choice. You can apply essentially the same settings to a hostgroup as you can a host. The Nagios documentation on this has all the info you need. In my case, I just have the name of the hostgroup that Nagios uses, a more human-readable alias, and the members of the group. You use the host names to list the members of the group, separating each host with a comma.
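Once your groups grow, it's handy to be able to ask "who is in this group?" without opening the file. Here's a rough sketch using awk. It is my own helper, not a Nagios tool, and it assumes the simple layout shown above: one directive per line, and a members list with no spaces after the commas.

```shell
# hostgroup_members GROUP FILE
#   Prints the members of GROUP, one host per line.
hostgroup_members() {
    awk -v group="$1" '
        # Start of a hostgroup block: reset what we know.
        $1 == "define" && $2 ~ /^hostgroup[{]?$/ { in_block = 1; name = ""; members = "" }
        in_block && $1 == "hostgroup_name" { name = $2 }
        in_block && $1 == "members"        { members = $2 }
        # End of the block: print the members if this is the group we want.
        in_block && /}/ {
            if (name == group) {
                n = split(members, list, ",")
                for (i = 1; i <= n; i++) print list[i]
            }
            in_block = 0
        }
    ' "$2"
}
```

For the definition above, hostgroup_members xserves /usr/local/nagios/etc/hostgroups.cfg lists xserve02 and xserve01.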
This is a fairly simple group, but I've others with 20 or more members, and there are installations with FAR more. When you hit one of those, you really appreciate hostgroups.
timeperiods.cfg
This is the file where you define your time periods. Three examples here:
# '24x7' timeperiod definition
define timeperiod {
timeperiod_name 24x7
alias 24x7
sunday 00:00-24:00
monday 00:00-24:00
tuesday 00:00-24:00
wednesday 00:00-24:00
thursday 00:00-24:00
friday 00:00-24:00
saturday 00:00-24:00
}
# 'workhours' timeperiod definition
define timeperiod{
timeperiod_name workhours
alias "Normal" Working Hours
monday 08:00-17:00
tuesday 08:00-17:00
wednesday 08:00-17:00
thursday 08:00-17:00
friday 08:00-17:00
}
# 'nonworkhours' timeperiod definition
define timeperiod{
timeperiod_name nonworkhours
alias Non-Work Hours
sunday 00:00-24:00
monday 00:00-08:00,17:00-24:00
tuesday 00:00-08:00,17:00-24:00
wednesday 00:00-08:00,17:00-24:00
thursday 00:00-08:00,17:00-24:00
friday 00:00-08:00,17:00-24:00
saturday 00:00-24:00
}
Now, the formats here are pretty simple for “24x7” and “workhours”. You just list the time periods for each day using 24hr notation. “nonworkhours” is a bit different. Because that works around workhours, I have to deal with non-contiguous time blocks. However, it's pretty easy, you just separate the timeblocks with commas, and Bob's your uncle.
This gives you a lot of flexibility in defining when you want things to happen, so that you aren't getting yelled at about a non-essential printer being down. It's also handy if you have equipment that needs to be off during certain timeperiods. You could set up a “negative” notify, i.e. if a piece of equipment comes up when it shouldn't be, you can set up a notification for that too using different settings in timeperiods.cfg.
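To make the comma-separated blocks concrete, here's a small sketch that answers "does this time fall inside this day's entry?" for a value like the nonworkhours weekday line. It only illustrates the idea and is not how Nagios itself is implemented; in particular, treating each block as including its start and excluding its end is my assumption.

```shell
# Convert HH:MM to minutes since midnight (awk reads "08" as decimal 8).
to_minutes() {
    echo "$1" | awk -F: '{ print $1 * 60 + $2 }'
}

# in_timeperiod HH:MM "start-end[,start-end...]"
#   Returns 0 if the time falls inside any of the blocks, 1 otherwise.
in_timeperiod() {
    now=$(to_minutes "$1")
    # Split the comma-separated blocks into positional parameters.
    old_ifs=$IFS
    IFS=','
    set -f
    set -- $2
    set +f
    IFS=$old_ifs
    for range in "$@"; do
        start=$(to_minutes "${range%-*}")
        end=$(to_minutes "${range#*-}")
        if [ "$now" -ge "$start" ] && [ "$now" -lt "$end" ]; then
            return 0
        fi
    done
    return 1
}
```

With the weekday nonworkhours entry, in_timeperiod 02:00 "00:00-08:00,17:00-24:00" succeeds, while 09:30 falls in the gap between the two blocks and fails.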
services.cfg
If you recall, when I was talking about checkcommands.cfg, I said that we don't use checkcommands directly, but rather they are a part of services. Well, services.cfg is where you combine commands, hosts and the rest to get the information out of Nagios that you want. Another thing about services is that they let you use the same checkcommand in different ways. Let's take a look at how you can do that, and see how you can combine various Nagios elements to get some really cool flexibility without having to constantly reinvent the wheel.
First, just like hosts.cfg, we can set up some templates for stuff we want to apply to all services by default:
# Generic service definition template
define service{
name generic-service ; The 'name' of this service template, referenced in other service definitions
active_checks_enabled 1 ; Active service checks are enabled
passive_checks_enabled 1 ; Passive service checks are enabled/accepted
parallelize_check 1 ; Active service checks should be parallelized (disabling this can lead to major performance problems)
obsess_over_service 1 ; We should obsess over this service (if necessary)
check_freshness 0 ; Default is to NOT check service 'freshness'
notifications_enabled 1 ; Service notifications are enabled
event_handler_enabled 1 ; Service event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restarts
is_volatile 0
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}
If this looks rather redundant to our hosts.cfg template, well, it is and it isn't. When you're talking about hosts, they're black boxes. The hosts.cfg file doesn't care about what's going on inside the host, only the host itself. To see what's going on inside the hosts, we use services, and so we need to have a separate set of templates for them.
Just like hosts.cfg, we have a name for the template, “generic-service” in this case. The next line enables active checks for all services using this template. Active checks are initiated by Nagios, passive checks are submitted to Nagios. If you think of “The Hunt for Red October” or any other undersea warfare book/movie, active checks are you “pinging” the host to check the service, i.e. “Just one ping”. The passive checks are you sitting quietly and receiving information, i.e. cavitation noise from the other sub, or Nagios receiving SNMP traps. The line after active_checks_enabled allows passive checks for any service using this template.
The "parallelize_check" option allows Nagios to run any services using this template in parallel. (Do read the documentation on parallelization. It's a neat feature, but not a magic spell.) “obsess_over_service” is used where you want to run a command after every service check. Again, read the documentation on this. (If it seems I say this a lot, well, I do, but Nagios has top-notch documentation, so it's a good thing to...obsess over :-P) “check_freshness” is used with passive checks, and is a way to make sure you receive your passive checks on a regular basis. Since I don't use passive checks that way, I don't care, and I leave it in the default state, which is off.
The next few lines, “notifications_enabled”, “event_handler_enabled”, “flap_detection_enabled”, “process_perf_data”, “retain_status_information” and “retain_nonstatus_information”, do the same things for services as they do for hosts. “is_volatile” is used for deciding if a service is volatile. Volatile services are used in special situations, such as detecting port scans, where you want very specific things to happen every time there is a problem, even if it's an ongoing problem. In this case, our generic setting is to have our services not be volatile. Finally, we have the “register” line set to 0 so Nagios treats this as a service template, not a service.
Now, let's look at the checkcommand from our earlier example:
# 'check_snmp_storage' command definition
define command{
command_name check_disk_freespace
command_line $USER1$/check_snmp_storage.pl -H $HOSTADDRESS$ -C $USER3$ -m $ARG1$ $ARG2$ -w 20% -c 10% -T pl
}
Note the $ARG1$ and $ARG2$ variables? Those are going to be very important here in a second. The -T switch says “Calculate percentage of available space”, the -w is our percentage available warning level, and the -c is our percentage available critical level. At 20% it goes yellow, at 10% it goes red.
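The yellow/red logic is simple enough to sketch in a few lines of shell. This is an illustration of the thresholds, not the plugin's actual code: the exact boundary handling (at or below the threshold trips it) is my assumption, based on the “(> 20) : OK” in the output. The return values follow the standard Nagios plugin convention of 0 for OK, 1 for WARNING and 2 for CRITICAL, which is also what you'd use in a probe of your own.

```shell
# classify_percent_left LEFT [WARN] [CRIT]
#   LEFT is the percentage of space left; WARN and CRIT default to 20 and 10.
classify_percent_left() {
    left=$1
    warn=${2:-20}
    crit=${3:-10}
    if [ "$left" -le "$crit" ]; then
        echo "CRITICAL"
        return 2
    elif [ "$left" -le "$warn" ]; then
        echo "WARNING"
        return 1
    else
        echo "OK"
        return 0
    fi
}
```

Note that the critical test has to come first; otherwise a disk with 5% left would only ever be reported as a warning.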
Now, here's my service for using this against Windows boot disks:
# Service definition
define service{
use generic-service ; Name of service template to use
hostgroups win_domain_servers,win_servers
service_description Windows_C_FreeSpace
check_period 24x7
max_check_attempts 5
normal_check_interval 3
retry_check_interval 1
contact_groups CIS-admins
notification_interval 20
notification_period 24x7
notification_options w,u,c,r,f
check_command check_disk_freespace!C:
}
Now, this looks a lot like a host entry, but since it's a service, we have some differences. First off, the "hostgroups" line. We see that I have two groups that this applies to, win_domain_servers and win_servers. That's one of the great things about using hostgroups and services. I can apply the same service to different groups with ease, and by adding a group, I can add a hundred hosts with the same effort as adding one. Next is our description, “Windows_C_FreeSpace”. This is the name of the service we'll see in Nagios when we look at that host or hostgroup. We use our “24x7” time period for the check_period as I want constant monitoring on the free space available on the boot drive of my Windows servers.
“max_check_attempts”, “contact_groups” and “notification_interval” are the same as for hosts. The “normal_check_interval” is just the “normal” interval to wait between checks. In this case, it's every three time units, which in my case is every 3 minutes. (I deal with servers that can run close to the limits at times, so checking more often is a good thing in my situation.) This applies both to “OK” returns, or “!OK” returns where the “max_check_attempts” limit has been reached.
If a service returns a “!OK” status, then the “retry_check_interval” is used to see how long to wait before trying again. This “check it again” interval is used in conjunction with “max_check_attempts”. I have my retry interval set to 1 minute, so if a service returns !OK, Nagios will try it again every minute for the next five minutes. If the service returns “OK” in that time period, then it goes back to a normal cycle. Otherwise, it tries five times, then waits three minutes, and tries 5 more times on the same one-minute interval, repeating this until the service goes to an “OK” state.
If the service is in a continual !OK state, then Nagios is going to notify me every 20 minutes, as determined by the “notification_interval”. Since these servers are critical, I also set my “notification_period” to be 24x7 too. For non-critical devices, I set the notification period to be workhours. Face it, if the CEO's printer gets a little weird at 2am, do I really care? No. At 9am? Yeah, then I care. The “notification_options” setting is similar to those for the hosts.cfg, but there are some differences:
- w is send notifications on a warning state
- u is send notifications on an unknown state, (you see this a lot if snmpd dies)
- c is send notifications on a critical state
- r is send notifications on a recovered state
- f is send notifications when the service starts/stops flapping
Finally, we see the checkcommand used, and the variables passed to the command, check_disk_freespace!C: (Note the ! is how Nagios defines a passed argument from a service. Also note that in this case, even though the checkcommand allows for two arguments, I can leave one blank if I need. This will come into play in a minute.)
When this service runs, it runs the following command against every host in the specified hostgroups:
/usr/local/nagios/libexec/check_snmp_storage.pl -H <host name from hosts.cfg> -C <snmp community from resource.cfg> -m C -w 20% -c 10%
The return I get from a system that's normal looks like this:
C:\ Label: Serial Number 983596ca : 54 %left (9525Mb/17508Mb) (> 20) : OK
This is the return for a host with a C: drive that has a “normal” amount of space left.
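If you ever want to pull the numbers out of a return like that, say for a quick report, a regular expression will do it. This is a sketch that assumes the layout seen in the sample return above, and that the two figures in parentheses are free and total megabytes (the percentage agrees with that reading). It only covers the "(> N)" form shown here, and the field names are my own.

```python
import re

# Layout inferred from the sample return in the text, nothing more
OUTPUT_PATTERN = re.compile(
    r"^(?P<volume>.+?) : (?P<percent_left>\d+) %left "
    r"\((?P<free_mb>\d+)Mb/(?P<total_mb>\d+)Mb\) "
    r"\(> (?P<warn_percent>\d+)\) : (?P<status>\w+)$"
)


def parse_storage_output(line):
    """Return the fields of one plugin output line, or None if it doesn't fit."""
    match = OUTPUT_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields from strings to integers
    for key in ("percent_left", "free_mb", "total_mb", "warn_percent"):
        fields[key] = int(fields[key])
    return fields
```

Anything that doesn't match the pattern comes back as None, so a critical or unknown return in some other layout gets ignored rather than misread.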
But what about my non-Windows servers or domain controllers? What if I want to check a specific volume's space on a specific host? QED with Nagios. For the former, where we want to monitor free space on the root volume, we reuse our checkcommand for Unix systems as follows:
# Service definition
define service{
use generic-service ; Name of service template to use
hostgroups xserves
service_description Unix_Root_FreeSpace
check_period 24x7
max_check_attempts 5
normal_check_interval 3
retry_check_interval 1
contact_groups CIS-admins
notification_interval 20
notification_period 24x7
notification_options w,u,c,r,f
check_command check_disk_freespace!/!-r
}
Note there are only two differences, the “hostgroups” line and the “check_command” line. The former says this service only gets applied to hosts in the “xserves” group. The switches in the checkcommand say “check the root volume, but only check /”. This way, nothing in /Volumes gets checked.
The actual command looks like:
/usr/local/nagios/libexec/check_snmp_storage.pl -H <host name from hosts.cfg> -C <snmp community from resource.cfg> -m / -r -w 20% -c 10%
The answer I get back looks like this:
/ : 69 %left (121297Mb/176700Mb) (> 20) : OK
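To make the ! business a bit more concrete, here's a sketch of how a check_command string could be expanded into a command line. The $ARG1$, $ARG2$, $USER1$ and $HOSTADDRESS$ macros are real Nagios macros, but the template below is only a guess reconstructed from the expanded commands shown in this article, $USER2$ holding the SNMP community is an assumption, and the host and community values are placeholders. Nagios does this substitution itself; the code just mimics the idea.

```python
import re

# Guessed command_line, reconstructed from the expanded commands in the text
COMMAND_LINE_TEMPLATE = (
    "$USER1$/check_snmp_storage.pl -H $HOSTADDRESS$ -C $USER2$ "
    "-m $ARG1$ $ARG2$ -w 20% -c 10%"
)


def expand_check_command(check_command, template, macros):
    """Split 'name!arg1!arg2' and substitute $ARGn$ and the other macros."""
    name, *args = check_command.split("!")
    values = dict(macros)
    for position, arg in enumerate(args, start=1):
        values[f"ARG{position}"] = arg

    def substitute(match):
        # Arguments that were not supplied expand to nothing
        return values.get(match.group(1), "")

    expanded = re.sub(r"\$(\w+)\$", substitute, template)
    # Collapse the doubled space a blank argument leaves (display only)
    return name, " ".join(expanded.split())


# Placeholder values; the real ones come from hosts.cfg and resource.cfg
EXAMPLE_MACROS = {
    "USER1": "/usr/local/nagios/libexec",
    "HOSTADDRESS": "xserve01",
    "USER2": "public",
}
```

Calling expand_check_command("check_disk_freespace!/!-r", COMMAND_LINE_TEMPLATE, EXAMPLE_MACROS) gives a line ending in "-m / -r -w 20% -c 10%", and leaving the second argument off simply drops the -r.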
Now, for the latter case, where I have a specific volume I want to monitor on only a single host? Same checkcommand, different service entry:
# Service definition
define service{
use generic-service ; Name of service template to use
host_name xserve01
service_description MacShare_FreeSpace
check_period 24x7
max_check_attempts 5
normal_check_interval 3
retry_check_interval 1
contact_groups CIS-admins
notification_interval 20
notification_period 24x7
notification_options w,u,c,r,f
check_command check_disk_freespace!/Volumes/MacShare!-r
}
Again, only two differences. First, I don't have a hostgroups entry, I have a host_name entry. Since I only want to run this service on one machine, I use the host name, not the hostgroup name. Second, we change the checkcommand to run on /Volumes/MacShare and only /Volumes/MacShare on xserve01. You can guess what the command will look like, and the return is:
/Volumes/MacShare : 85 %left (2036864Mb/2384544Mb) (> 20) : OK
Same command, used three different ways against two completely different machine types, and even against the same host. All I had to do was make minor changes in each service entry, while re-using the same checkcommand. This is where Nagios rules. Nagios' probes, both the standard probes from the Nagios site and the probes from Nagios Exchange, are quite customizable. So even if you don't see a probe that jumps out at you as the solution to a particular problem, check out the options for the probe. To do this, just run the probe with the -h switch, and you should get a short explanation of the probe's options, and the information it returns.
If you still can't find a probe that does what you want, check out the documentation on writing your own. Nagios probes are fairly simple to create, and new ones pop up all the time. While I was writing this article, Pascal Robert, from the MacEnterprise list, created a probe for the Megaraid controller in the G5 Xserves. I'm so stealing it (with credit of course, thanks Pascal!). Here it is if you're interested:
This is what I did, and at least the man page for megaraid does list the statuses: Optimal, Degraded and Failed. So code like this works:
my $logical_drive_status = undef;
# Run megaraid and read its output through a pipe
my $pid = open(MEGARAID, "/usr/sbin/megaraid -showconfig $logic_drive_id |") or $state = "UNKNOWN";
while (<MEGARAID>) {
    # Look for the "Status = ..." line and map it to a Nagios state
    if ($_ =~ m/\s*Status = (.*)/i) {
        $logical_drive_status = $1;
        if ($logical_drive_status =~ m/OPTIMAL/i) {
            $answer = "Megaraid : $logical_drive_status\n";
        } elsif ($logical_drive_status =~ m/DEGRADED/i) {
            $state = "WARNING";
            $answer = "Megaraid : $logical_drive_status\n";
        } else {
            $state = "CRITICAL";
            $answer = "Megaraid : $logical_drive_status\n";
        }
    }
}
close(MEGARAID);
# No status line at all means we can't say anything about the array
if (!defined($logical_drive_status)) {
    $answer = "Can't find Megaraid status\n";
    $state = "UNKNOWN";
}
Dunno about anyone else, but that's a new check I'll be adding into my own Nagios installation, and soon.
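Since probes can be written in pretty much any language, here's the same status-to-state mapping sketched in Python for comparison. It is not Pascal's code, and it only handles text that has already been captured from megaraid -showconfig. The one thing it assumes about that output is the "Status = ..." line his regex looks for, and unlike his loop it stops at the first status line it finds. The exit codes are the standard Nagios plugin return codes.

```python
import re

# Standard Nagios plugin return codes
EXIT_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def megaraid_state(showconfig_output):
    """Map the first 'Status = ...' line to a (state, message) pair."""
    match = re.search(r"Status = (.*)", showconfig_output, re.IGNORECASE)
    if match is None:
        return "UNKNOWN", "Can't find Megaraid status"
    status = match.group(1).strip()
    if re.search("optimal", status, re.IGNORECASE):
        state = "OK"
    elif re.search("degraded", status, re.IGNORECASE):
        state = "WARNING"
    else:
        # Anything else, including Failed, is treated as critical
        state = "CRITICAL"
    return state, f"Megaraid : {status}"
```

A real probe would print the message and exit with EXIT_CODES[state], which is how Nagios learns the result.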
Testing Your Config
So, you've set up your config the way you want, now we start up Nagios, and go, right?
WRONG
First we test our config, and Nagios gives us an easy way to do this:
/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
If you have any errors, the -v will catch the formatting / “compile” errors for you, and tell you the file, the line number, and give you a really good idea of what the error is. Not to wax too fanboyish, but honestly, the work that Ethan and the others have put into the unglamorous parts, like documentation and error messages, makes Nagios ten times easier to use than it could be, and in a lot of ways, easier to use than many similar commercial products. I've hit very few problems that I can't fix with some RTFM magic, and for those few, the nagios -v error messages take care of the rest.
Starting Nagios
So, now that you've no errors, start it and go. To run Nagios as a daemon, it's simple:
/usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
If you want to make sure it's always running, you can set up a launchd item for it, and it will just keep running along. That's what I did when I started running Nagios under Mac OS X 10.4 Server, and it works really well. I prefer to use Lingon to set up launchd items, as Lingon makes it dead simple to use launchd for more than just a replacement for StartupItems. Here's my Nagios launch daemon:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>org.nagios.nagiosserver</string>
<key>ProgramArguments</key>
<array>
<string>/usr/local/nagios/bin/nagios</string>
<string>-d</string>
<string>/usr/local/nagios/etc/nagios.cfg</string>
</array>
<key>RunAtLoad</key>
<true/>
<key>ServiceDescription</key>
<string>Nagios Daemon</string>
</dict>
</plist>
If you have to make any changes to the config files, you'll have to restart the Nagios process to have Nagios see them. Make sure you test all changes with the nagios -v switch before you restart.
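If you'd like to make the "test before restart" habit harder to skip, the check can be wrapped in a few lines of script. This sketch assumes the paths used throughout this article, and that nagios -v exits with a non-zero status when the config has errors. The function name is made up, and the restart itself is left out since it depends on how you run the daemon.

```python
import subprocess

# Paths as used in this article; adjust them if yours differ
NAGIOS_BIN = "/usr/local/nagios/bin/nagios"
NAGIOS_CFG = "/usr/local/nagios/etc/nagios.cfg"


def config_is_valid(command=(NAGIOS_BIN, "-v", NAGIOS_CFG)):
    """Run the pre-flight check; return (passed, combined output)."""
    try:
        result = subprocess.run(list(command), capture_output=True, text=True)
    except OSError as error:
        # The binary is missing or not executable
        return False, str(error)
    return result.returncode == 0, result.stdout + result.stderr
```

Only restart when the first value comes back True. The second value holds whatever the check printed, so you can see which file and line it complained about.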
What's it look like
Below are some redacted screenshots of my current Nagios setup. It's not complete, but it should give you an idea of what you can do with Nagios as you get used to it.
The Nagios Tactical Overview:

Nagios' Hostgroups Overview

Nagios' Services Overview

Nagios' Host Problem Overview

Nagios' Performance Information

Nagios' Process Information

Nagios' Checkcommands

Nagios' 2-D Map

Conclusion
I'm not even going to try to pretend that what I've done in this article is a complete howto or step-by-step guide to installing Nagios on Mac OS X 10.4 Server under all situations. Nagios is nigh-infinitely flexible, and (fortunately) extremely well documented on its own. If this helps make it seem a tad less impossible, well then I've accomplished my goal here. But really, read the docs and the samples. Nagios is a well-done project, and well worth the effort.
Comments
John -
Great how-to document. Thanks for your detailed and informative notes. I'll be trying to put it to use soon.
One question, the title obviously mentions Tiger Server specifically. After reading and re-reading your doc, it seems that one doesn't have to necessarily use the Server version of Tiger.
Can you confirm or deny and elaborate a little?
Thanks.
PTeeter@vitrorobertson.com
Posted by: Paul Teeter | September 3, 2007 12:04 PM
