Standard Nagios monitoring information

A sample service check, annotated:

    host_name                       dbsrp2076                  <-- name of server
    service_description             SSH                        <-- service being monitored
    servicegroups                   PROD-ssh                   <-- service groups
    is_volatile                     0                          <-- does this service spontaneously start and stop? (always 0, meaning "no")
    check_period                    24x7                       <-- during what hours is this service checked?
    max_check_attempts              10                         <-- How many failed attempts before an alert is generated?
    check_interval                  15                         <-- how often is this service ordinarily checked, in minutes
    retry_interval                  1                          <-- On failed attempts, how many minutes between retries?
    contact_groups                  oipeds_dba,smapigroup      <-- who is notified if there is an issue?
    notification_options            w,u,c,r                    <-- notification option list (see below)
    notification_interval           60                         <-- while a service remains failed, how often (in minutes) is the alert re-sent if the problem is not acknowledged?
    notification_period             24x7                       <-- during what hours of the day are alerts sent?
    check_command                   check_ssh                  <-- script used to execute the check.
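
To see what the timing values in this sample imply, the arithmetic can be sketched in Python. This is a rough sketch, assuming standard Nagios behavior: after the first failed check the service is re-checked every retry_interval minutes until max_check_attempts is reached, and only then is an alert generated. It ignores check run time and scheduler latency, and the function names are illustrative, not part of Nagios.

```python
def minutes_to_alert(max_check_attempts: int, retry_interval: int) -> int:
    """Minutes from the first failed check to the final, alerting attempt."""
    # The first failure counts as attempt 1; the remaining attempts are
    # spaced retry_interval minutes apart.
    return (max_check_attempts - 1) * retry_interval


def worst_case_delay(check_interval: int, max_check_attempts: int,
                     retry_interval: int) -> int:
    """Upper bound if the service breaks right after a passing check."""
    # Up to one full check_interval can pass before the first failed check.
    return check_interval + minutes_to_alert(max_check_attempts, retry_interval)


# Values from the sample SSH service check
print(minutes_to_alert(10, 1))       # 9
print(worst_case_delay(15, 10, 1))   # 24
```

With the sample values, an alert follows about 9 minutes after the first failed check, and at most roughly 24 minutes after the service actually breaks.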

Standard Nagios notification options:

w: Notify on WARNING service states
u: Notify on UNKNOWN service states
c: Notify on CRITICAL service states
r: Notify on service RECOVERY (OK states)
f: Notify when the service starts and stops FLAPPING
n (none): Do not send the contact any service notifications
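
As a quick illustration, a notification_options value such as the sample's w,u,c,r can be expanded into the states listed above. This is a minimal sketch; the function name is hypothetical and the mapping covers only the letters documented on this page.

```python
from typing import List

# Letters and meanings exactly as listed on this page
NOTIFICATION_OPTIONS = {
    "w": "WARNING",
    "u": "UNKNOWN",
    "c": "CRITICAL",
    "r": "RECOVERY",
    "f": "FLAPPING",
    "n": "none",
}


def describe_notification_options(value: str) -> List[str]:
    """Expand a comma-separated option string into state names."""
    flags = [flag.strip() for flag in value.split(",") if flag.strip()]
    unknown = [flag for flag in flags if flag not in NOTIFICATION_OPTIONS]
    if unknown:
        raise ValueError(f"unrecognized notification option(s): {unknown}")
    return [NOTIFICATION_OPTIONS[flag] for flag in flags]


print(describe_notification_options("w,u,c,r"))
# ['WARNING', 'UNKNOWN', 'CRITICAL', 'RECOVERY']
```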

The monitoring scripts and their current thresholds:

  • check_auto_increment: Checks all integer columns in a database to see how close they are to the maximum value defined by their data type.  Warns at 70% of capacity, critical alert at 85% of capacity.  Information on integer data types and their capacity can be found here: dev.mysql.com/doc/refman/5.7/en/integer-types.html
  • check_cpuram: Enumerates current CPU and RAM on the server.
  • check_mysql: Checks to see if MySQL is running.
  • check_mysql_active: Checks the number of active, running threads.  Warns at 20, critical at 40.  Does *not* count sleeping threads.
  • check_mysql_cluster: Checks that all three nodes of a cluster report as available.  Returns the cluster configuration and nodes if OK, and a critical alert if the cluster size is less than three.  Returns an unknown status if any other state is encountered.
  • check_mysql_schemata:  Returns a list of all schemata in the database with the exception of all system schemata.
  • check_mysql_size: Returns the total size in GB of the entire database.  Informational.
  • check_mysql_sleep: Checks the number of sleeping threads.  Warns at 500, critical at 600.
  • check_mysql_version: Checks the running MySQL version.  Approved versions are >= 5.7.28 and 8.0.x.  Other versions return a warning alert.
  • check_os: Returns the distribution name, major, and minor version of the operating system.  Informational only.
  • check_remote_disk: Checks the following mount points for disk usage: /mysql/binlog0 /mysql/tmp0 /mysql/audit0 /mysqladm /mystemp /mysql/data0 /mysqlshare.  Warns if utilization is greater than 85%, sends critical alert if >= 90%.
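
The warning and critical thresholds above can be summarized in a short Python sketch. This is not the actual monitoring scripts, only an illustration of how a measured value maps to the conventional Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The table layout and the function name are hypothetical. Treating "warns at N" as "value >= N" is an assumption; check_remote_disk is the one entry stated as strictly greater than its warning threshold.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# check name -> (warn, crit, warn_is_inclusive), values taken from the list above
THRESHOLDS = {
    "check_auto_increment": (70, 85, True),    # percent of datatype capacity
    "check_mysql_active":   (20, 40, True),    # running threads
    "check_mysql_sleep":    (500, 600, True),  # sleeping threads
    "check_remote_disk":    (85, 90, False),   # percent used; warns above 85
}


def classify(check: str, value: float) -> int:
    """Return the Nagios exit code for a measured value."""
    if check not in THRESHOLDS:
        return UNKNOWN
    warn, crit, warn_inclusive = THRESHOLDS[check]
    if value >= crit:
        return CRITICAL
    if value > warn or (warn_inclusive and value == warn):
        return WARNING
    return OK


print(classify("check_mysql_active", 25))   # 1 (WARNING)
print(classify("check_remote_disk", 85))    # 0 (OK, not yet above 85%)
print(classify("check_mysql_sleep", 600))   # 2 (CRITICAL)
```

Keeping the thresholds in one table makes it easy to compare them side by side, but the real scripts may hold these values elsewhere.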