I’ve been updating my Zabbix monitoring server lately; I added some node statistics and individual SnapMirror interfaces a while ago and have finally updated my previous pages accordingly. I’ve also discovered, as I suspected might happen, that in certain cases my monitoring architecture doesn’t scale. Recently a couple of our clusters have been sending me the following warning emails:
Node: <node name>
Time: <American formatted date>
Severity: ERROR
Message: xinetd.hit.cps.limit: Number of incoming network connections exceeded the configured limit of 10 connections per second for the service ssh. This service will be stopped and restarted after 60 seconds.
Looking at the crontab of my monitoring server I could see that too many SSH connections were being made at the same time, mainly from my netapp_scaleout_throughput_checker.sh script, which my autodiscovery script adds to the crontab to run every 10 minutes. For a refresher go here.
I’ve since manually adjusted the timings of some of these scripts and removed entries that were no longer needed.
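As a rough illustration (the cluster names here are hypothetical), staggering the minute fields spreads the SSH connections out so several clusters aren’t all hit in the same burst:

02,12,22,32,42,52 * * * * /etc/zabbix/zabbix_agentd.d/netapp_scaleout_throughput_checker.sh cluster1
05,15,25,35,45,55 * * * * /etc/zabbix/zabbix_agentd.d/netapp_scaleout_throughput_checker.sh cluster2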
I also noticed that my NetApp volume capacity checking user-defined parameters were each making their own SSH connection, so they could eventually cause the same issue:
UserParameter=netappvol_size_total[*]
UserParameter=netappvol_size_used[*]
UserParameter=netappvol_size_avail[*]
UserParameter=netappvol_snap_total[*]
UserParameter=netappvol_snap_used[*]
UserParameter=netappvol_snap_avail[*]
What I have done is change these parameters to work like my other user parameters that read a text file in the /tmp/ directory populated by a cron’d command. I have managed to reduce this to one SSH command per cluster per 10 minutes, perfect!
I have also done this for the aggregate capacity stats too, though not the performance stats, which still use the statistics command. I hope to have time to look at the Python SDK and make API calls instead of using SSH, which I hope would cause less cluster load (to be fair, I can’t see these commands causing issues) and be more scalable.
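For what it’s worth, a first sketch of the API route could be as simple as a curl call against the ONTAP REST API (available from ONTAP 9.6; the cluster name and credentials below are placeholders, not my real setup):

#!/bin/bash
# Sketch only: fetch volume space stats as JSON over HTTPS instead of SSH.
# "cluster1" and the monitor account are placeholder values.
curl -sk -u monitor:password \
    "https://cluster1/api/storage/volumes?fields=space.size,space.used,space.available"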
First, I created scripts that collect all the aggregates’ and volumes’ capacity stats with a simple df:
/etc/zabbix/zabbix_agentd.d/netapp_aggr_capacity_check.sh
#!/bin/bash
# Dump the cluster's aggregate capacity stats to a text file in /tmp/
if [ -z "$1" ]
then
    printf "supply cluster name!\n"
    exit 1
else
    ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null monitor@"$1" "df -A -x" 2>/dev/null > "/tmp/${1}_netapp_aggr"
fi
/etc/zabbix/zabbix_agentd.d/netapp_volume_capacity_check.sh
#!/bin/bash
# Dump the cluster's volume and snapshot capacity stats to a text file in /tmp/
if [ -z "$1" ]
then
    printf "supply cluster name!\n"
    exit 1
else
    ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null monitor@"$1" "df -fields filesys-name,vserver,total-space,used-space,available-space" 2>/dev/null > "/tmp/${1}_volumes"
fi
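Running either script by hand with a cluster name (a hypothetical “cluster1” here) is a quick way to sanity-check the output file before the cron entries exist:

/etc/zabbix/zabbix_agentd.d/netapp_volume_capacity_check.sh cluster1
head /tmp/cluster1_volumes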
I then tack onto the end of my NetApp aggregate discovery script, after the JSON creation, the code below, which populates the monitoring server’s crontab with the required capacity collection runs. This will not only work for newly added clusters from now on, but will also pick up any clusters already being monitored when it runs every hour:
# Add the capacity collection runs to root's crontab if they aren't already there
if ! grep -q "netapp_aggr_capacity_check.sh $cluster$" /var/spool/cron/root
then
    echo "01,11,21,31,41,51 * * * * /etc/zabbix/zabbix_agentd.d/netapp_aggr_capacity_check.sh $cluster" >> /var/spool/cron/root
fi
if ! grep -q "netapp_volume_capacity_check.sh $cluster$" /var/spool/cron/root
then
    echo "01,11,21,31,41,51 * * * * /etc/zabbix/zabbix_agentd.d/netapp_volume_capacity_check.sh $cluster" >> /var/spool/cron/root
fi
This then creates text files in /tmp/ with all the required info. Now the volume capacity user parameters simply search a single text file; their config changes from:
UserParameter=netappvol_size_total[*], ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null monitor@$1 "df -x -vserver $2 -volume $3" 2>/dev/null | grep $3 | awk '{print $$2}'
UserParameter=netappvol_size_used[*], ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null monitor@$1 "df -x -vserver $2 -volume $3" 2>/dev/null | grep $3 | awk '{print $$3}'
UserParameter=netappvol_size_avail[*], ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null monitor@$1 "df -x -vserver $2 -volume $3" 2>/dev/null | grep $3 | awk '{print $$4}'
UserParameter=netappvol_snap_total[*], ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null monitor@$1 "df -vserver $2 -volume $3" 2>/dev/null | grep snapshot | grep $3 | awk '{print $$2}'
UserParameter=netappvol_snap_used[*], ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null monitor@$1 "df -vserver $2 -volume $3" 2>/dev/null | grep snapshot | grep $3 | awk '{print $$3}'
UserParameter=netappvol_snap_avail[*], ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null monitor@$1 "df -vserver $2 -volume $3" 2>/dev/null | grep snapshot | grep $3 | awk '{print $$4}'
To:
UserParameter=netappvol_size_total[*], grep "$2 " /tmp/$1_volumes | grep "$3 " | grep -v snapshot | awk '{ print $$5 }'
UserParameter=netappvol_size_used[*], grep "$2 " /tmp/$1_volumes | grep "$3 " | grep -v snapshot | awk '{ print $$6 }'
UserParameter=netappvol_size_avail[*], grep "$2 " /tmp/$1_volumes | grep "$3 " | grep -v snapshot | awk '{ print $$7 }'
UserParameter=netappvol_snap_total[*], grep "$2 " /tmp/$1_volumes | grep "$3 " | grep snapshot | awk '{ print $$5 }'
UserParameter=netappvol_snap_used[*], grep "$2 " /tmp/$1_volumes | grep "$3 " | grep snapshot | awk '{ print $$6 }'
UserParameter=netappvol_snap_avail[*], grep "$2 " /tmp/$1_volumes | grep "$3 " | grep snapshot | awk '{ print $$7 }'
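The new parameters can be tested on the monitoring server with the agent’s test mode (the cluster, vserver and volume names here are made up):

zabbix_agentd -t 'netappvol_size_total[cluster1,svm1,vol1]'
zabbix_agentd -t 'netappvol_snap_used[cluster1,svm1,vol1]'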
This change kept all the currently collected stats intact and keeps the number of SSH commands collecting capacity stats static per cluster, no matter how many aggregates or volumes I want to graph. It also created an opportunity to tweak some other parts of my monitoring to give a better “big picture” estate summary and capacity planning. I’ll create another post for that.