I’ve been updating my Zabbix monitoring server lately; I added some node statistics and individual SnapMirror interfaces a while ago and have finally updated my previous pages accordingly. I’ve also discovered, as I suspected might happen, that in certain cases my monitoring architecture doesn’t scale. Recently a couple of our clusters have been sending me the following warning emails:
Node: <node name>
Time: <American formatted date>
Severity: ERROR
Message: xinetd.hit.cps.limit: Number of incoming network connections exceeded the configured limit of 10 connections per second for the service ssh. This service will be stopped and restarted after 60 seconds.
Looking at the crontab of my monitoring server I could see that too many SSH connections were being made at the same time, mainly from my netapp_scaleout_throughput_checker.sh script, which my autodiscovery script adds to the crontab to run every 10 minutes. For a refresher go here.
I’ve since manually adjusted the timings of some of these scripts and removed entries that were no longer needed.
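As a rough illustration (the cluster names here are hypothetical), staggering the minute fields spreads the SSH connections out so several clusters aren’t all hit in the same burst:

02,12,22,32,42,52 * * * * /etc/zabbix/zabbix_agentd.d/netapp_scaleout_throughput_checker.sh cluster1
05,15,25,35,45,55 * * * * /etc/zabbix/zabbix_agentd.d/netapp_scaleout_throughput_checker.sh cluster2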
I also noticed that my NetApp volume capacity checking user-defined parameters were each making their own SSH connection, so they could eventually cause the same issue:
UserParameter=netappvol_size_total[*]
UserParameter=netappvol_size_used[*]
UserParameter=netappvol_size_avail[*]
UserParameter=netappvol_snap_total[*]
UserParameter=netappvol_snap_used[*]
UserParameter=netappvol_snap_avail[*]
What I have done is change these parameters to work like my other user parameters that read a text file in the /tmp/ directory populated by a cron’d command. I have managed to reduce this to one SSH command per cluster per 10 minutes, perfect!
I have also done this for the aggregate capacity stats too, though not the performance stats, which still use the statistics command. I hope to have time to look at the Python SDK and make API calls instead of using SSH, which I hope would cause less cluster load (to be fair, I can’t see these commands causing issues) and be more scalable.
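For what it’s worth, a first sketch of the API route could be as simple as a curl call against the ONTAP REST API (available from ONTAP 9.6; the cluster name and credentials below are placeholders, not my real setup):

#!/bin/bash
# Sketch only: fetch volume space stats as JSON over HTTPS instead of SSH.
# "cluster1" and the monitor account are placeholder values.
curl -sk -u monitor:password \
    "https://cluster1/api/storage/volumes?fields=space.size,space.used,space.available"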
First, I created scripts that collect all the aggregates’ and volumes’ capacity stats with a simple df:
/etc/zabbix/zabbix_agentd.d/netapp_aggr_capacity_check.sh
#!/bin/bash
# Dump the cluster's aggregate capacity stats to a text file in /tmp/
if [ -z "$1" ]
then
    printf "supply cluster name!\n"
    exit 1
else
    ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null monitor@"$1" "df -A -x" 2>/dev/null > "/tmp/${1}_netapp_aggr"
fi
/etc/zabbix/zabbix_agentd.d/netapp_volume_capacity_check.sh
#!/bin/bash
# Dump the cluster's volume and snapshot capacity stats to a text file in /tmp/
if [ -z "$1" ]
then
    printf "supply cluster name!\n"
    exit 1
else
    ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null monitor@"$1" "df -fields filesys-name,vserver,total-space,used-space,available-space" 2>/dev/null > "/tmp/${1}_volumes"
fi
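Running either script by hand with a cluster name (a hypothetical “cluster1” here) is a quick way to sanity-check the output file before the cron entries exist:

/etc/zabbix/zabbix_agentd.d/netapp_volume_capacity_check.sh cluster1
head /tmp/cluster1_volumes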
I then tack onto the end of my NetApp aggregate discovery script, after the JSON creation, the code below, which populates the monitoring server’s crontab with the required capacity collection runs. This will not only work for newly added clusters from now on, but will also pick up any clusters already being monitored when it runs every hour:
# Add the capacity collection runs to root's crontab if they aren't already there
if ! grep -q "netapp_aggr_capacity_check.sh $cluster$" /var/spool/cron/root
then
    echo "01,11,21,31,41,51 * * * * /etc/zabbix/zabbix_agentd.d/netapp_aggr_capacity_check.sh $cluster" >> /var/spool/cron/root
fi
if ! grep -q "netapp_volume_capacity_check.sh $cluster$" /var/spool/cron/root
then
    echo "01,11,21,31,41,51 * * * * /etc/zabbix/zabbix_agentd.d/netapp_volume_capacity_check.sh $cluster" >> /var/spool/cron/root
fi
This then creates text files in /tmp/ with all the required info. Now the volume capacity user parameters simply search a single text file; their config changes from:
UserParameter=netappvol_size_total[*], ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null monitor@$1 "df -x -vserver $2 -volume $3" 2>/dev/null | grep $3 | awk '{print $$2}'
UserParameter=netappvol_size_used[*], ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null monitor@$1 "df -x -vserver $2 -volume $3" 2>/dev/null | grep $3 | awk '{print $$3}'
UserParameter=netappvol_size_avail[*], ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null monitor@$1 "df -x -vserver $2 -volume $3" 2>/dev/null | grep $3 | awk '{print $$4}'
UserParameter=netappvol_snap_total[*], ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null monitor@$1 "df -vserver $2 -volume $3" 2>/dev/null | grep snapshot | grep $3 | awk '{print $$2}'
UserParameter=netappvol_snap_used[*], ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null monitor@$1 "df -vserver $2 -volume $3" 2>/dev/null | grep snapshot | grep $3 | awk '{print $$3}'
UserParameter=netappvol_snap_avail[*], ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null monitor@$1 "df -vserver $2 -volume $3" 2>/dev/null | grep snapshot | grep $3 | awk '{print $$4}'
To:
UserParameter=netappvol_size_total[*], grep "$2 " /tmp/$1_volumes | grep "$3 " | grep -v snapshot | awk '{ print $$5 }'
UserParameter=netappvol_size_used[*], grep "$2 " /tmp/$1_volumes | grep "$3 " | grep -v snapshot | awk '{ print $$6 }'
UserParameter=netappvol_size_avail[*], grep "$2 " /tmp/$1_volumes | grep "$3 " | grep -v snapshot | awk '{ print $$7 }'
UserParameter=netappvol_snap_total[*], grep "$2 " /tmp/$1_volumes | grep "$3 " | grep snapshot | awk '{ print $$5 }'
UserParameter=netappvol_snap_used[*], grep "$2 " /tmp/$1_volumes | grep "$3 " | grep snapshot | awk '{ print $$6 }'
UserParameter=netappvol_snap_avail[*], grep "$2 " /tmp/$1_volumes | grep "$3 " | grep snapshot | awk '{ print $$7 }'
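The new parameters can be tested on the monitoring server with the agent’s test mode (the cluster, vserver and volume names here are made up):

zabbix_agentd -t 'netappvol_size_total[cluster1,svm1,vol1]'
zabbix_agentd -t 'netappvol_snap_used[cluster1,svm1,vol1]'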
This change kept all the currently collected stats intact and keeps the number of SSH commands collecting capacity stats static per cluster, no matter how many aggregates or volumes I want to graph. It also created an opportunity to tweak some other parts of my monitoring to give a better “big picture” estate summary and capacity planning. I’ll create another post for that.