NetApp Cluster Trend Reporting using Zabbix

Preamble

Note: I have updated various parts of this page (May 2018) with additional discovery and collection entries. I have created another post explaining some of the changes here.

The following is a summary of what we’re monitoring of a NetApp cluster:

  • Auto Discovery of Flexgroups
  • Auto Discovery of Infinite Volumes
  • Auto Discovery of Aggregates
  • Auto Discovery of Cluster Nodes
  • Auto Discovery of Cluster SnapMirror LIFs
  • Storage utilisation of Flexgroups, Infinite Volumes, and Aggregates <- UPDATED!
  • IOPS and throughput of Flexgroups, Infinite Volumes, and Aggregates
  • Average latency of Flexgroups and Infinite Volumes, and latency of Aggregates (ONTAP 9)
  • Total Cluster aggregate utilisation
  • Cluster CPU utilisation
  • Cluster Node stats
  • Cluster SnapMirror LIFs

Because there is no single SNMP OID for Flexgroup or Infinite Volume utilisation, the easiest way is to get Zabbix to run df against the manually defined flexgroup.

Most metric collection, apart from Cluster CPU, is done via the statistics command.

Create Monitoring Role and user

Create a restricted role and user, and allow public key authentication using the Zabbix server’s root user’s key. Note that the statistics samples delete permission is critical:

security login role create -role monitor -cmddirname "df" -access readonly
security login role create -role monitor -cmddirname "statistics" -access readonly
security login role create -role monitor -cmddirname "statistics samples delete" -access all
security login role create -role monitor -cmddirname "volume" -access readonly
security login role create -role monitor -cmddirname "storage aggregate" -access readonly
security login role create -role monitor -cmddirname "system" -access readonly
security login role create -role monitor -cmddirname "vserver" -access readonly

ONTAP < 9

security login create -user-or-group-name monitor -application ssh -authmethod publickey -role monitor

ONTAP >= 9

security login create -user-or-group-name monitor -application ssh -authentication-method publickey -role monitor
security login publickey create -username monitor -index 1 -publickey "ssh-rsa <public key>"

Then from the Zabbix monitoring server run the following command so the cluster is added to known hosts:

ssh monitor@<new cluster>

The Zabbix server will now have restricted passwordless ssh access to the cluster.

CLI/Custom Configuration

Auto Discovery

The auto discovery process for everything I’ve done is based on the following parts:

  • Discovery script – this runs a command that returns every instance of what is being checked for in JSON format, setting a variable name for Zabbix to use. It also adds a cron entry if one is missing (for further discoveries after the initial run), which then runs a second script to gather the statistics for that item instance.
  • Statistics gathering script – this is run by cron and runs a specific statistics command to gather valid metrics for the item being polled. It then writes the individual stats out to various files within the /tmp directory for Zabbix to read for stats collection and graphing.
  • The Aggregate script runs the statistics command against the entire cluster for all aggregates in one command, to keep the number of concurrent statistics commands to a minimum.
  • The discovery script is configured as a User Parameter.
  • Any specific metrics to be measured are defined as a User Parameter referred to by Items or Item Prototypes, except when the metric is an SNMP OID.
  • The rest of the config is created within the standard Zabbix GUI:
    • Discovery rules use the predefined discovery User Parameter; the returned JSON is read and each returned entry creates a Zabbix item, which can be referenced by the variable name within the JSON.
    • Item Prototypes are created to monitor the metrics of each instance found by the discovery rule; these are custom User Parameters.
    • Items are created for monitoring metrics that do not need to be auto discovered, such as combined aggregate utilisation and cluster CPU. These can be SNMP OIDs or, again, custom User Parameters.
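
The "adds a cron entry if missing" step can be sketched roughly as below. This is only an illustration of the idempotent check, not the actual discovery script: the function name ensure_cron_line, the scratch file standing in for a crontab, the schedule, and the script name netapp_fg_stats.sh are all assumptions made for the example.

```shell
#!/bin/sh
# ensure_cron_line FILE LINE
# Appends LINE to FILE only when an identical line is not already present.
# FILE stands in for a crontab dump here (an assumption for this sketch); with a
# real crontab the same check could be wrapped around `crontab -l` and `crontab -`.
ensure_cron_line() {
    file=$1
    line=$2
    touch "$file"
    # -F fixed string, -x whole-line match, -q quiet
    if ! grep -Fxq -- "$line" "$file"; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Demo against a scratch file; schedule and script path are illustrative only.
cron_file=$(mktemp)
entry='*/5 * * * * /etc/zabbix/zabbix_agentd.d/netapp_fg_stats.sh cluster1 fg_vol1'
ensure_cron_line "$cron_file" "$entry"
ensure_cron_line "$cron_file" "$entry"   # second call changes nothing
grep -c . "$cron_file"                   # number of non-empty lines in the file
```

Matching the whole line means a repeated discovery run leaves the file untouched, while a changed schedule or a new instance name produces a new entry.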

To validate the auto discovery JSON, use a validator such as JSONLint.

Custom Discovery Rules

Create a script that returns the required objects in JSON format. Place the script in the monitoring server’s Zabbix agent directory:

/etc/zabbix/zabbix_agentd.d

The discovery scripts can be seen here.
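
For orientation, the general shape of the JSON a discovery script returns can be sketched as follows. This is a minimal sketch rather than one of the linked scripts: the function name lld_json and the sample SVM and volume names are made up, and it assumes the names contain no characters that need JSON escaping. The {#FGNAME} and {#FGSVMNAME} macros are the ones used by the item prototypes later on this page, and the "data" wrapper is the standard Zabbix low-level discovery format.

```shell
#!/bin/sh
# lld_json: read "svm volume" pairs on stdin (one pair per line) and print
# Zabbix low-level discovery JSON using the {#FGSVMNAME} and {#FGNAME} macros.
# Assumption: names need no JSON escaping (no quotes or backslashes).
lld_json() {
    awk '
        BEGIN { printf "{\"data\":[" }
        NF >= 2 {
            printf "%s{\"{#FGSVMNAME}\":\"%s\",\"{#FGNAME}\":\"%s\"}", sep, $1, $2
            sep = ","
        }
        END { print "]}" }
    '
}

# Demo with made-up names; a real script would feed in the cluster's volume list.
printf 'svm1 fg_vol1\nsvm2 fg_vol2\n' | lld_json
```

An empty input still yields valid JSON with an empty data array, which is convenient when a cluster has no volumes of the given type.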

The statistics gathering scripts can be found here. Note that the command output is reformatted using sed and tr because some instance stats are wrapped across multiple lines. These commands reformat the output so that all figures for an instance are on a single line and comma separated. This makes data extraction consistent and possible.
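
As a rough illustration of that kind of reformatting (not the commands from the actual scripts), the sketch below joins wrapped continuation lines and turns runs of whitespace into commas. The sample input is invented and real statistics output will look different; the function name flatten_stats and the use of '@' as a temporary record marker are also assumptions, so the data must not contain '@'.

```shell
#!/bin/sh
# flatten_stats: join wrapped records into one comma-separated line each.
# A record is assumed to start in column 1; continuation lines start with
# whitespace. '@' is used as a temporary marker and must not occur in the data.
flatten_stats() {
    sed 's/^\([^[:space:]]\)/@\1/' |      # mark the start of every record
        tr '\n' ' ' |                     # collapse everything onto one line
        { tr '@' '\n'; echo; } |          # split again at the record markers
        sed -e 's/^[[:space:]]*//' \
            -e 's/[[:space:]]*$//' \
            -e '/^$/d' \
            -e 's/[[:space:]][[:space:]]*/,/g'
}

# Demo with invented, wrapped sample output.
printf 'aggr1      10  20\n           30  40\naggr2      1   2\n           3   4\n' |
    flatten_stats
```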

Create Custom Discovery User Parameters

Create a new User Parameter in the file:

/etc/zabbix/zabbix_agentd.d/userparameter_netapp.conf

Add one parameter per discovery type referencing the discovery script and accepting a user input, which will be the cluster name specified in the GUI:

UserParameter=netapp_fg_disco[*],    /etc/zabbix/zabbix_agentd.d/netapp_cluster_flexgroup_disco.sh $1
UserParameter=netapp_ivol_disco[*],  /etc/zabbix/zabbix_agentd.d/netapp_cluster_ivol_disco.sh $1
UserParameter=netapp_aggr_disco[*],  /etc/zabbix/zabbix_agentd.d/netapp_cluster_aggr_disco.sh $1
UserParameter=netapp_smlif_disco[*], /etc/zabbix/zabbix_agentd.d/netapp_cluster_sm_lif_disco.sh $1
UserParameter=netapp_node_disco[*],  /etc/zabbix/zabbix_agentd.d/netapp_cluster_node_disco.sh $1

Create Custom Monitoring Metric User Parameters

Anything being monitored by custom commands will need to be defined as a user parameter located in the same file as the custom discovery rules.

The discovery rules create a cron job that generates a file in /tmp per monitored metric. A user parameter per metric needs to be defined that simply reads the file the cron job creates:

UserParameter=netapp_fg_ops_read[*],     cat /tmp/$1_$2_read_ops.txt
UserParameter=netapp_fg_ops_write[*],    cat /tmp/$1_$2_write_ops.txt
UserParameter=netapp_fg_ops_other[*],    cat /tmp/$1_$2_other_ops.txt
UserParameter=netapp_fg_bps_read[*],     cat /tmp/$1_$2_read_bps.txt
UserParameter=netapp_fg_bps_write[*],    cat /tmp/$1_$2_write_bps.txt
UserParameter=netapp_fg_av_latency[*],   cat /tmp/$1_$2_latency.txt

UserParameter=netapp_ivol_ops_read[*],   cat /tmp/$1_$2_read_ops.txt
UserParameter=netapp_ivol_ops_write[*],  cat /tmp/$1_$2_write_ops.txt
UserParameter=netapp_ivol_ops_other[*],  cat /tmp/$1_$2_other_ops.txt
UserParameter=netapp_ivol_bps_read[*],   cat /tmp/$1_$2_read_bps.txt
UserParameter=netapp_ivol_bps_write[*],  cat /tmp/$1_$2_write_bps.txt
UserParameter=netapp_ivol_av_latency[*], cat /tmp/$1_$2_latency.txt

UserParameter=netapp_aggr_ops_total[*],  grep $2 /tmp/$1_aggr_stats.txt | awk -F, '{print $$3}'
UserParameter=netapp_aggr_ops_read[*],   grep $2 /tmp/$1_aggr_stats.txt | awk -F, '{print $$4}'
UserParameter=netapp_aggr_ops_write[*],  grep $2 /tmp/$1_aggr_stats.txt | awk -F, '{print $$5}'
UserParameter=netapp_aggr_bps_read[*],   grep $2 /tmp/$1_aggr_stats.txt | awk -F, '{print $$6}'
UserParameter=netapp_aggr_bps_write[*],  grep $2 /tmp/$1_aggr_stats.txt | awk -F, '{print $$7}'
UserParameter=netapp_aggr_latency[*],    grep $2 /tmp/$1_aggr_stats.txt | awk -F, '{print $$8}'

UserParameter=netapp_smlif_recpkts[*],   cat /tmp/$1_$2_stats.txt | awk -F" " '{print $$1}'
UserParameter=netapp_smlif_recpbps[*],   cat /tmp/$1_$2_stats.txt | awk -F" " '{print $$2}'
UserParameter=netapp_smlif_recperr[*],   cat /tmp/$1_$2_stats.txt | awk -F" " '{print $$3}'
UserParameter=netapp_smlif_sentkts[*],   cat /tmp/$1_$2_stats.txt | awk -F" " '{print $$4}'
UserParameter=netapp_smlif_sentbps[*],   cat /tmp/$1_$2_stats.txt | awk -F" " '{print $$5}'
UserParameter=netapp_smlif_senterr[*],   cat /tmp/$1_$2_stats.txt | awk -F" " '{print $$6}'
UserParameter=netapp_smlif_recptotal[*], cat '/tmp/'$1'_total_recp_sm_lif.txt'
UserParameter=netapp_smlif_senttotal[*], cat '/tmp/'$1'_total_sent_sm_lif.txt'

UserParameter=netapp_node_cpu[*],        grep $2 /tmp/$1_node_stats.txt | awk '{ print $$2 }'
UserParameter=netapp_node_ops[*],        grep $2 /tmp/$1_node_stats.txt | awk '{ print $$3 }'
UserParameter=netapp_node_bps[*],        grep $2 /tmp/$1_node_stats.txt | awk '{ print $$4 }'
UserParameter=netapp_node_lat[*],        grep $2 /tmp/$1_node_stats.txt | awk '{ print $$5 }'
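
One thing to watch with the grep $2 lookups above is that grep matches substrings, so an aggregate called aggr1 would also match a line for aggr10 and return two values. A possible alternative is an exact match on the name column, sketched below. The function name aggr_field, the sample file contents, and the assumption that the aggregate name is in the first comma-separated column are all illustrative; check the real file layout before using anything like this. Inside a UserParameter line the awk field references would still need the $$ form used above.

```shell
#!/bin/sh
# aggr_field FILE NAME COLUMN
# Print COLUMN from the line whose first comma-separated field equals NAME.
# Assumption for this sketch: the aggregate name is in column 1 of the file.
aggr_field() {
    awk -F, -v name="$2" -v col="$3" '$1 == name { print $col; exit }' "$1"
}

# Demo with an invented stats file containing two similarly named aggregates.
stats_file=$(mktemp)
printf 'aggr1,node1,150,100,50,4096,2048,0.42\n'  > "$stats_file"
printf 'aggr10,node2,15,10,5,512,256,0.30\n'     >> "$stats_file"
aggr_field "$stats_file" aggr1 4
```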

The other custom parameters are for Flexgroup and Infinite Volume utilisations and the total combined aggregate utilisation:

UserParameter=netappvol_size_total[*],   ssh monitor@$1 df -x -vserver $2 -volume $3 | grep $3 | awk '{print $$2}'
UserParameter=netappvol_size_used[*],    ssh monitor@$1 df -x -vserver $2 -volume $3 | grep $3 | awk '{print $$3}'
UserParameter=netappvol_size_avail[*],   ssh monitor@$1 df -x -vserver $2 -volume $3 | grep $3 | awk '{print $$4}'
UserParameter=netapp_combaggr_total[*],  ssh monitor@$1 df -A -x | grep aggr | grep -v aggr0 | awk '{print $$2}' | paste -sd+ - | bc
UserParameter=netapp_combaggr_used[*],   ssh monitor@$1 df -A -x | grep aggr | grep -v aggr0 | awk '{print $$3}' | paste -sd+ - | bc

The volume utilisation parameters take the variables defined by the custom discovery rules, shown below.
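
To show what the combined aggregate pipelines compute, here is a small offline sketch that totals one column across all non-root aggregates. It is an awk-only alternative to the grep, paste and bc chain, not the command used above, and the df-style sample text is invented for the demo; the function name sum_aggr_column is also an assumption.

```shell
#!/bin/sh
# sum_aggr_column COLUMN
# Sum COLUMN over lines that mention "aggr" but not "aggr0" (root aggregates),
# mirroring the grep aggr | grep -v aggr0 filter used in the user parameters.
sum_aggr_column() {
    awk -v col="$1" '
        /aggr/ && !/aggr0/ { sum += $col }
        END { printf "%.0f\n", sum }
    '
}

# Demo with invented df-style output (column 2 = total, column 3 = used).
cat <<'EOF' | sum_aggr_column 2
Aggregate            kbytes       used      avail capacity
aggr0_node1            1000        500        500      50%
aggr1_data             4000       1000       3000      25%
aggr2_data             6000       2000       4000      33%
EOF
```

The header line is skipped because the match is case sensitive, and the root aggregate is excluded by the aggr0 filter, so only the two data aggregates are added together.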

GUI Configuration

Zabbix Discovery Rules

There are three rules that call on the custom discovery scripts/user parameters:

| Name | Type | Key | Host Interface | Interval |
|---|---|---|---|---|
| Aggregate Discovery | Zabbix Agent | netapp_aggr_disco[{HOST.HOST}] | 127.0.0.1:10050 | 3600 |
| Flexgroup Discovery | Zabbix Agent | netapp_fg_disco[{HOST.HOST}] | 127.0.0.1:10050 | 3600 |
| Ivol Discovery | Zabbix Agent | netapp_ivol_disco[{HOST.HOST}] | 127.0.0.1:10050 | 3600 |

Zabbix Item Prototypes

| Discovery Rule | Name | Type | Key | Host Interface | Type of Information | Data Type | Unit | Custom Multiplier | Interval |
|---|---|---|---|---|---|---|---|---|---|
| Aggregate Discovery | {#AGGRNAME} Latency (ms) | Zabbix Agent | netapp_aggr_latency[{HOST.HOST},{#AGGRNAME}] | 127.0.0.1:10050 | Float | - | - | 0.0001 | 3600 |
| Aggregate Discovery | {#AGGRNAME} Read B/s | Zabbix Agent | netapp_aggr_bps_read[{HOST.HOST},{#AGGRNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | B | - | 3600 |
| Aggregate Discovery | {#AGGRNAME} Write B/s | Zabbix Agent | netapp_aggr_bps_write[{HOST.HOST},{#AGGRNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | B | - | 3600 |
| Aggregate Discovery | {#AGGRNAME} Read OPS | Zabbix Agent | netapp_aggr_ops_read[{HOST.HOST},{#AGGRNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | - | - | 3600 |
| Aggregate Discovery | {#AGGRNAME} Write OPS | Zabbix Agent | netapp_aggr_ops_write[{HOST.HOST},{#AGGRNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | - | - | 3600 |
| Aggregate Discovery | {#AGGRNAME} Total OPS | Zabbix Agent | netapp_aggr_ops_total[{HOST.HOST},{#AGGRNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | - | - | 3600 |
| Flexgroup Discovery | {#FGNAME} Latency (ms) | Zabbix Agent | netapp_fg_av_latency[{HOST.HOST},{#FGNAME}] | 127.0.0.1:10050 | Float | - | - | 0.0001 | 3600 |
| Flexgroup Discovery | {#FGNAME} Read B/s | Zabbix Agent | netapp_fg_bps_read[{HOST.HOST},{#FGNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | Bps | - | 3600 |
| Flexgroup Discovery | {#FGNAME} Write B/s | Zabbix Agent | netapp_fg_bps_write[{HOST.HOST},{#FGNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | Bps | - | 3600 |
| Flexgroup Discovery | {#FGNAME} Read OPS | Zabbix Agent | netapp_fg_ops_read[{HOST.HOST},{#FGNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | - | - | 3600 |
| Flexgroup Discovery | {#FGNAME} Write OPS | Zabbix Agent | netapp_fg_ops_write[{HOST.HOST},{#FGNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | - | - | 3600 |
| Flexgroup Discovery | {#FGNAME} Other OPS | Zabbix Agent | netapp_fg_ops_other[{HOST.HOST},{#FGNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | - | - | 3600 |
| Flexgroup Discovery | {#FGNAME} Total | Zabbix Agent | netappvol_size_total[{HOST.HOST},{#FGSVMNAME},{#FGNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | B | - | 3600 |
| Flexgroup Discovery | {#FGNAME} Used | Zabbix Agent | netappvol_size_used[{HOST.HOST},{#FGSVMNAME},{#FGNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | B | - | 3600 |
| Infinite Volume Discovery | {#IVOLNAME} Latency (ms) | Zabbix Agent | netapp_ivol_av_latency[{HOST.HOST},{#IVOLNAME}] | 127.0.0.1:10050 | Float | - | - | 0.0001 | 3600 |
| Infinite Volume Discovery | {#IVOLNAME} Read B/s | Zabbix Agent | netapp_ivol_bps_read[{HOST.HOST},{#IVOLNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | Bps | - | 3600 |
| Infinite Volume Discovery | {#IVOLNAME} Write B/s | Zabbix Agent | netapp_ivol_bps_write[{HOST.HOST},{#IVOLNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | Bps | - | 3600 |
| Infinite Volume Discovery | {#IVOLNAME} Read OPS | Zabbix Agent | netapp_ivol_ops_read[{HOST.HOST},{#IVOLNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | - | - | 3600 |
| Infinite Volume Discovery | {#IVOLNAME} Write OPS | Zabbix Agent | netapp_ivol_ops_write[{HOST.HOST},{#IVOLNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | - | - | 3600 |
| Infinite Volume Discovery | {#IVOLNAME} Other OPS | Zabbix Agent | netapp_ivol_ops_other[{HOST.HOST},{#IVOLNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | - | - | 3600 |
| Infinite Volume Discovery | {#IVOLNAME} Total | Zabbix Agent | netappvol_size_total[{HOST.HOST},{#IVOLSVMNAME},{#IVOLNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | B | - | 3600 |
| Infinite Volume Discovery | {#IVOLNAME} Used | Zabbix Agent | netappvol_size_used[{HOST.HOST},{#IVOLSVMNAME},{#IVOLNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | B | - | 3600 |

Zabbix Items

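| Name | Type | Key | Host Interface | SNMP OID | Type of Information | SNMP Community | Data Type | Unit | Custom Multiplier | Interval |
|---|---|---|---|---|---|---|---|---|---|---|
| Combined Aggregate Total | Zabbix Agent | netapp_combaggr_total[{HOST.HOST}] | 127.0.0.1:10050 | - | Unsigned | | Decimal | B | 1000 | 3600 |
| Combined Aggregate Used | Zabbix Agent | netapp_combaggr_used[{HOST.HOST}] | 127.0.0.1:10050 | - | Unsigned | | Decimal | B | 1000 | 3600 |
| Combined Cluster CPU Utilisation | SNMPv2 | cpuBusyTimePerCent | :161 | | Unsigned | | Decimal | % | - | 300 |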

Zabbix Graph Prototypes

| Discovery Rule | Name | Y axis MIN value | Items | Func | Draw Type | Colour |
|---|---|---|---|---|---|---|
| Aggregate Discovery | {#AGGRNAME} Latency (ms) | Fixed | {#AGGRNAME} Latency (ms) | avg | Bold Line | DD0000 |
| Aggregate Discovery | {#AGGRNAME} OPS | Fixed | {#AGGRNAME} Write OPS | avg | Bold Line | DD0000 |
| Aggregate Discovery | ::: | ::: | {#AGGRNAME} Read OPS | avg | Bold Line | 00DD00 |
| Aggregate Discovery | ::: | ::: | {#AGGRNAME} Total OPS | avg | Bold Line | DDDD00 |
| Aggregate Discovery | {#AGGRNAME} Throughput | Fixed | {#AGGRNAME} Write B/s | avg | Bold Line | DD0000 |
| Aggregate Discovery | ::: | ::: | {#AGGRNAME} Read B/s | avg | Bold Line | 00DD00 |
| Flexgroup Discovery | {#FGNAME} Average Latency (ms) | Fixed | {#FGNAME} Latency (ms) | avg | Bold Line | DD0000 |
| Flexgroup Discovery | {#FGNAME} IOPS | Fixed | {#FGNAME} Write OPS | avg | Bold Line | DD0000 |
| Flexgroup Discovery | ::: | ::: | {#FGNAME} Other OPS | avg | Bold Line | DDDD00 |
| Flexgroup Discovery | ::: | ::: | {#FGNAME} Read OPS | avg | Bold Line | 00DD00 |
| Flexgroup Discovery | ::: | ::: | {#FGNAME} Total OPS | avg | Bold Line | DDDD00 |
| Flexgroup Discovery | {#FGNAME} Throughput | Fixed | {#FGNAME} Write B/s | avg | Bold Line | DD0000 |
| Flexgroup Discovery | ::: | ::: | {#FGNAME} Read B/s | avg | Bold Line | 00DD00 |
| Flexgroup Discovery | {#FGNAME} Utilisation | Fixed | {#FGNAME} Total | avg | Filled region | 00DD00 |
| Flexgroup Discovery | ::: | ::: | {#FGNAME} Used | avg | Filled region | DD0000 |
| Infinite Volume Discovery | {#IVOLNAME} Average Latency (ms) | Fixed | {#IVOLNAME} Latency (ms) | avg | Bold Line | DD0000 |
| Infinite Volume Discovery | {#IVOLNAME} IOPS | Fixed | {#IVOLNAME} Write OPS | avg | Bold Line | DD0000 |
| Infinite Volume Discovery | ::: | ::: | {#IVOLNAME} Other OPS | avg | Bold Line | DDDD00 |
| Infinite Volume Discovery | ::: | ::: | {#IVOLNAME} Read OPS | avg | Bold Line | 00DD00 |
| Infinite Volume Discovery | ::: | ::: | {#IVOLNAME} Total OPS | avg | Bold Line | DDDD00 |
| Infinite Volume Discovery | {#IVOLNAME} Throughput | Fixed | {#IVOLNAME} Write B/s | avg | Bold Line | DD0000 |
| Infinite Volume Discovery | ::: | ::: | {#IVOLNAME} Read B/s | avg | Bold Line | 00DD00 |
| Infinite Volume Discovery | {#IVOLNAME} Utilisation | Fixed | {#IVOLNAME} Total | avg | Filled region | 00DD00 |
| Infinite Volume Discovery | ::: | ::: | {#IVOLNAME} Used | avg | Filled region | DD0000 |

Zabbix Graphs

| Name | Y axis MIN value | Items | Func | Draw Type | Colour |
|---|---|---|---|---|---|
| Cluster CPU Utilisation | Fixed | Combined Cluster CPU Utilisation | avg | Bold Line | DD0000 |
| Total Aggregate Utilisation | Fixed | Combined Aggregate Total | avg | Filled region | 00DD00 |
| ::: | Fixed | Combined Aggregate Used | avg | Filled region | DD0000 |
