Preamble
Note: I have updated various parts of this page (May 2018) with additional discovery and collection entries. I have created another post explaining some of the changes here.
The following is a summary of what we’re monitoring on a NetApp cluster:
- Auto Discovery of Flexgroups
- Auto Discovery of Infinite Volumes
- Auto Discovery of Aggregates
- Auto Discovery of Cluster Nodes
- Auto Discovery of Cluster SnapMirror Lifs
- Storage utilisation of Flexgroups, Infinite Volumes, and Aggregates <- UPDATED!
- IOPS and throughput of Flexgroups, Infinite Volumes, and Aggregates
- Average latency of Flexgroups and Infinite Volumes, and latency of Aggregates (ONTAP 9)
- Total Cluster aggregate utilisation
- Cluster CPU utilisation
- Cluster Node stats
- Cluster SnapMirror Lifs
Because there is no single SNMP OID for Flexgroup or Infinite Volume utilisation, the easiest approach is to have Zabbix run a df against the manually defined flexgroup.
Most metric collection, apart from Cluster CPU, is done via the statistics command.
Create Monitoring Role and user
Create a restricted role and user, and allow public key authentication using the Zabbix server’s root user’s key. Note that the statistics samples delete permission is critical:
security login role create -role monitor -cmddirname "df" -access readonly
security login role create -role monitor -cmddirname "statistics" -access readonly
security login role create -role monitor -cmddirname "statistics samples delete" -access all
security login role create -role monitor -cmddirname "volume" -access readonly
security login role create -role monitor -cmddirname "storage aggregate" -access readonly
security login role create -role monitor -cmddirname "system" -access readonly
security login role create -role monitor -cmddirname "vserver" -access readonly
ONTAP < 9
security login create -user-or-group-name monitor -application ssh -authmethod publickey -role monitor
ONTAP >= 9
security login create -user-or-group-name monitor -application ssh -authentication-method publickey -role monitor
security login publickey create -username monitor -index 1 -publickey "ssh-rsa <public key>"
Then from the Zabbix monitoring server run the following command so the cluster is added to known hosts:
ssh monitor@<new cluster>
The Zabbix server will now have restricted passwordless ssh access to the cluster.
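Because the Zabbix agent cannot answer a password or host-key prompt, it can be worth confirming that the login really works non-interactively before going further. The following is a minimal sketch, assuming the `monitor` user created above and bash on the monitoring server; the function name `check_cluster_ssh` and the `SSH_BIN` override are illustrative and not part of the original scripts.

```shell
# Sketch only: check_cluster_ssh and SSH_BIN are hypothetical names.
SSH_BIN="${SSH_BIN:-ssh}"

check_cluster_ssh() {
    local cluster="$1"
    # BatchMode=yes makes ssh fail instead of prompting for a password
    # or a host-key confirmation, which mirrors how the agent will call it.
    if "$SSH_BIN" -o BatchMode=yes -o ConnectTimeout=10 \
            "monitor@${cluster}" df -A -x >/dev/null 2>&1; then
        echo "OK: ${cluster}"
    else
        echo "FAIL: ${cluster}"
        return 1
    fi
}
```

Run it as the same user whose key was installed on the cluster, for example `check_cluster_ssh <new cluster>`.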
CLI/Custom Configuration
Auto Discovery
The auto discovery process for everything I’ve done is built from the following parts:
- Discovery script – this runs a command that returns every instance of the object being checked for, in JSON format, setting a variable name for Zabbix to use. It also adds a cron entry if one is missing (for further discoveries after the initial run); that cron entry runs a second script to gather the statistics for that item instance.
- Statistics gathering script – this is run by cron and runs a specific statistics command to gather valid metrics for the item being polled. It then writes the individual stats out to files within the /tmp directory for Zabbix to read for stats collection and graphing.
- The Aggregate script runs the statistics command against the entire cluster for all aggregates in one command to keep the number of concurrent statistics commands to a minimum.
- The discovery script is configured as a User Parameter.
- Any specific metrics to be measured are defined as User Parameters referred to by Items or Item Prototypes, except when the metric is an SNMP OID.
- The rest of the config is created within the standard Zabbix GUI:
- Discovery rules use the predefined discovery User Parameter; the returned JSON is read and each returned object creates a Zabbix item, referenceable by the variable name within the JSON.
- Item Prototypes are created to monitor the metrics of each instance discovered by the discovery rule; these are custom User Parameters.
- Items are created for monitoring metrics that do not need to be auto discovered, such as combined aggregate utilisation and cluster CPU. These can be SNMP OIDs or again custom User Parameters.
To validate the auto discovery’s JSON use a validator like JsonLint.
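To make the JSON shape concrete, here is a minimal sketch of the output half of a discovery script. It assumes the SVM and volume names have already been extracted from the cluster as one "svm volume" pair per line; the function name `lld_json` is hypothetical, and the macro names are the ones the Flexgroup item prototypes use. It does no JSON escaping, so it assumes names without quotes or backslashes.

```shell
# Hypothetical helper: turn "svm volume" pairs on stdin into Zabbix
# low-level discovery JSON ({"data":[...]} wrapper).
lld_json() {
    local first=1 svm vol
    printf '{"data":['
    while read -r svm vol; do
        [ -z "$svm" ] && continue          # skip blank lines
        [ "$first" -eq 1 ] || printf ','   # comma between objects
        printf '{"{#FGSVMNAME}":"%s","{#FGNAME}":"%s"}' "$svm" "$vol"
        first=0
    done
    printf ']}\n'
}
```

For example, `printf 'svm1 fg_data\n' | lld_json` prints a single-element `data` array that can be pasted into a JSON validator.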
Custom Discovery Rules
Create a script that returns the required objects in JSON format. Place the script in the monitoring server’s Zabbix agent directory:
/etc/zabbix/zabbix_agentd.d
The discovery scripts can be seen here.
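The "add a cron entry if missing" step of a discovery script can be sketched as a small filter over the existing crontab. This is an illustration under assumptions rather than the code from the linked scripts: `ensure_cron_line` is a hypothetical name, and the schedule and script path in the usage comment are placeholders.

```shell
# Hypothetical helper: print the crontab given on stdin, appending LINE
# only when an identical line is not already present.
ensure_cron_line() {
    local line="$1" current
    current="$(cat)"
    if printf '%s\n' "$current" | grep -Fxq -- "$line"; then
        printf '%s\n' "$current"                  # already there, unchanged
    elif [ -z "$current" ]; then
        printf '%s\n' "$line"                     # empty crontab
    else
        printf '%s\n%s\n' "$current" "$line"      # append
    fi
}

# Usage (placeholder schedule and script name):
#   crontab -l 2>/dev/null \
#     | ensure_cron_line "*/5 * * * * /path/to/stats_script.sh cluster1" \
#     | crontab -
```

Matching the whole line with `grep -Fx` keeps repeated discovery runs from adding duplicate entries.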
The statistics gathering scripts can be found here. Note the formatting of the command output using sed and tr is needed because some instance stats are wrapped across multiple lines. These commands reformat the output so all figures are on single lines and comma separated, which makes data extraction consistent and possible.
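As an illustration of that reformatting step, the sketch below joins wrapped lines and converts whitespace to commas. It uses awk rather than the sed and tr pipeline of the actual scripts, and it assumes that a wrapped continuation line starts with whitespace; both the function name `join_wrapped` and the sample data in the usage note are invented.

```shell
# Hypothetical helper: join indented continuation lines onto the previous
# line, then emit each record as comma-separated fields.
join_wrapped() {
    awk '
        function emit() {
            if (buf != "") { gsub(/[ \t]+/, ",", buf); print buf }
            buf = ""
        }
        {
            cont = ($0 ~ /^[ \t]/)            # indented line = continuation
            gsub(/^[ \t]+|[ \t]+$/, "")       # trim both ends
            if ($0 == "") next                # ignore blank lines
            if (cont && buf != "") buf = buf " " $0
            else { emit(); buf = $0 }
        }
        END { emit() }
    '
}
```

Given the two input lines `aggr1 node1 120 80` and an indented `40 1024 2048 350`, it prints `aggr1,node1,120,80,40,1024,2048,350`, which `awk -F,` can then index by column.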
Create Custom Discovery User Parameters
Create a new User Parameter in the file:
/etc/zabbix/zabbix_agentd.d/userparameter_netapp.conf
Add one parameter per discovery type referencing the discovery script and accepting a user input, which will be the cluster name specified in the GUI:
UserParameter=netapp_fg_disco[*], /etc/zabbix/zabbix_agentd.d/netapp_cluster_flexgroup_disco.sh $1
UserParameter=netapp_ivol_disco[*], /etc/zabbix/zabbix_agentd.d/netapp_cluster_ivol_disco.sh $1
UserParameter=netapp_aggr_disco[*], /etc/zabbix/zabbix_agentd.d/netapp_cluster_aggr_disco.sh $1
UserParameter=netapp_smlif_disco[*], /etc/zabbix/zabbix_agentd.d/netapp_cluster_sm_lif_disco.sh $1
UserParameter=netapp_node_disco[*], /etc/zabbix/zabbix_agentd.d/netapp_cluster_node_disco.sh $1
Create Custom Monitoring Metric User Parameters
Anything being monitored by custom commands will need to be defined as a user parameter located in the same file as the custom discovery rules.
The discovery rules create a cron job that generates a file in /tmp per monitored metric. A user parameter per metric needs to be defined that simply reads the file the cron job creates:
UserParameter=netapp_fg_ops_read[*], cat /tmp/$1_$2_read_ops.txt
UserParameter=netapp_fg_ops_write[*], cat /tmp/$1_$2_write_ops.txt
UserParameter=netapp_fg_ops_other[*], cat /tmp/$1_$2_other_ops.txt
UserParameter=netapp_fg_bps_read[*], cat /tmp/$1_$2_read_bps.txt
UserParameter=netapp_fg_bps_write[*], cat /tmp/$1_$2_write_bps.txt
UserParameter=netapp_fg_av_latency[*], cat /tmp/$1_$2_latency.txt
UserParameter=netapp_ivol_ops_read[*], cat /tmp/$1_$2_read_ops.txt
UserParameter=netapp_ivol_ops_write[*], cat /tmp/$1_$2_write_ops.txt
UserParameter=netapp_ivol_ops_other[*], cat /tmp/$1_$2_other_ops.txt
UserParameter=netapp_ivol_bps_read[*], cat /tmp/$1_$2_read_bps.txt
UserParameter=netapp_ivol_bps_write[*], cat /tmp/$1_$2_write_bps.txt
UserParameter=netapp_ivol_av_latency[*], cat /tmp/$1_$2_latency.txt
UserParameter=netapp_aggr_ops_total[*], grep $2 /tmp/$1_aggr_stats.txt | awk -F, '{print $$3}'
UserParameter=netapp_aggr_ops_read[*], grep $2 /tmp/$1_aggr_stats.txt | awk -F, '{print $$4}'
UserParameter=netapp_aggr_ops_write[*], grep $2 /tmp/$1_aggr_stats.txt | awk -F, '{print $$5}'
UserParameter=netapp_aggr_bps_read[*], grep $2 /tmp/$1_aggr_stats.txt | awk -F, '{print $$6}'
UserParameter=netapp_aggr_bps_write[*], grep $2 /tmp/$1_aggr_stats.txt | awk -F, '{print $$7}'
UserParameter=netapp_aggr_latency[*], grep $2 /tmp/$1_aggr_stats.txt | awk -F, '{print $$8}'
UserParameter=netapp_smlif_recpkts[*], cat /tmp/$1_$2_stats.txt | awk -F" " '{print $$1}'
UserParameter=netapp_smlif_recpbps[*], cat /tmp/$1_$2_stats.txt | awk -F" " '{print $$2}'
UserParameter=netapp_smlif_recperr[*], cat /tmp/$1_$2_stats.txt | awk -F" " '{print $$3}'
UserParameter=netapp_smlif_sentkts[*], cat /tmp/$1_$2_stats.txt | awk -F" " '{print $$4}'
UserParameter=netapp_smlif_sentbps[*], cat /tmp/$1_$2_stats.txt | awk -F" " '{print $$5}'
UserParameter=netapp_smlif_senterr[*], cat /tmp/$1_$2_stats.txt | awk -F" " '{print $$6}'
UserParameter=netapp_smlif_recptotal[*], cat '/tmp/'$1'_total_recp_sm_lif.txt'
UserParameter=netapp_smlif_senttotal[*], cat '/tmp/'$1'_total_sent_sm_lif.txt'
UserParameter=netapp_node_cpu[*], grep $2 /tmp/$1_node_stats.txt | awk '{ print $$2 }'
UserParameter=netapp_node_ops[*], grep $2 /tmp/$1_node_stats.txt | awk '{ print $$3 }'
UserParameter=netapp_node_bps[*], grep $2 /tmp/$1_node_stats.txt | awk '{ print $$4 }'
UserParameter=netapp_node_lat[*], grep $2 /tmp/$1_node_stats.txt | awk '{ print $$5 }'
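Two details of the aggregate parameters are easy to miss. In a UserParameter with `[*]`, `$1` and `$2` are replaced by the key's arguments, so a literal awk field has to be written as `$$3`. Also, `grep $2` matches substrings, so a lookup for `aggr1` would also return the row for `aggr10` if both exist. The sketch below shows an exact-match alternative; `aggr_field` is a hypothetical name, and it assumes the aggregate name is the first column of the comma-separated stats file, which is not confirmed by the original scripts.

```shell
# Hypothetical helper: print column COL of the row whose first
# comma-separated field equals AGGR exactly.
# Assumption: the aggregate name is column 1 of the stats file.
aggr_field() {
    local file="$1" aggr="$2" col="$3"
    awk -F, -v name="$aggr" -v col="$col" \
        '$1 == name { print $col; exit }' "$file"
}

# Usage: aggr_field /tmp/<cluster>_aggr_stats.txt aggr1 3
```

If the name sits in a different column, change `$1` in the awk condition accordingly.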
The other custom parameters are for Flexgroup and Infinite Volume utilisations and the total combined aggregate utilisation:
UserParameter=netappvol_size_total[*], ssh monitor@$1 df -x -vserver $2 -volume $3 | grep $3 | awk '{print $$2}'
UserParameter=netappvol_size_used[*], ssh monitor@$1 df -x -vserver $2 -volume $3 | grep $3 | awk '{print $$3}'
UserParameter=netappvol_size_avail[*], ssh monitor@$1 df -x -vserver $2 -volume $3 | grep $3 | awk '{print $$4}'
UserParameter=netapp_combaggr_total[*], ssh monitor@$1 df -A -x | grep aggr | grep -v aggr0 | awk '{print $$2}' | paste -sd+ - | bc
UserParameter=netapp_combaggr_used[*], ssh monitor@$1 df -A -x | grep aggr | grep -v aggr0 | awk '{print $$3}' | paste -sd+ - | bc
The volume utilisation parameters take the custom discovery-defined variables, shown below.
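The combined aggregate parameters filter the df output down to data aggregates (anything matching aggr but not aggr0) and add up one column. The sketch below reproduces that logic offline so it can be tried against saved output. It sums in awk instead of `paste -sd+ - | bc`, so bc is not needed; `sum_aggr_column` is a hypothetical name, and the sample rows in the usage note are invented rather than real `df -A -x` output.

```shell
# Hypothetical helper: sum column COL over lines that mention "aggr"
# but not "aggr0" (the same filter as the grep pair above).
sum_aggr_column() {
    local col="$1"
    awk -v col="$col" \
        '/aggr/ && !/aggr0/ { total += $col } END { printf "%.0f\n", total }'
}

# Usage: ssh monitor@<cluster> df -A -x | sum_aggr_column 2
```

With invented rows `aggr0_node1 1000 500`, `aggr1 2000 1500` and `aggr2 3000 1000`, column 2 sums to 5000 and column 3 to 2500, because the root aggregate row is skipped.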
GUI Configuration
Zabbix Discovery Rules
There are three rules that call on the custom discovery scripts/user parameters:
Name | Type | Key | Host Interface | Interval |
---|---|---|---|---|
Aggregate Discovery | Zabbix Agent | netapp_aggr_disco[{HOST.HOST}] | 127.0.0.1:10050 | 3600 |
Flexgroup Discovery | Zabbix Agent | netapp_fg_disco[{HOST.HOST}] | 127.0.0.1:10050 | 3600 |
Infinite Volume Discovery | Zabbix Agent | netapp_ivol_disco[{HOST.HOST}] | 127.0.0.1:10050 | 3600 |
Zabbix Item Prototypes
Discovery Rule | Name | Type | Key | Host Interface | Type of Information | Data Type | Unit | Custom Multiplier | Interval |
---|---|---|---|---|---|---|---|---|---|
Aggregate Discovery | {#AGGRNAME} Latency (ms) | Zabbix Agent | netapp_aggr_latency[{HOST.HOST},{#AGGRNAME}] | 127.0.0.1:10050 | Float | - | - | 0.0001 | 3600 |
Aggregate Discovery | {#AGGRNAME} Read B/s | Zabbix Agent | netapp_aggr_bps_read[{HOST.HOST},{#AGGRNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | B | - | 3600 |
Aggregate Discovery | {#AGGRNAME} Write B/s | Zabbix Agent | netapp_aggr_bps_write[{HOST.HOST},{#AGGRNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | B | - | 3600 |
Aggregate Discovery | {#AGGRNAME} Read OPS | Zabbix Agent | netapp_aggr_ops_read[{HOST.HOST},{#AGGRNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | - | - | 3600 |
Aggregate Discovery | {#AGGRNAME} Write OPS | Zabbix Agent | netapp_aggr_ops_write[{HOST.HOST},{#AGGRNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | - | - | 3600 |
Aggregate Discovery | {#AGGRNAME} Total OPS | Zabbix Agent | netapp_aggr_ops_total[{HOST.HOST},{#AGGRNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | - | - | 3600 |
Flexgroup Discovery | {#FGNAME} Latency (ms) | Zabbix Agent | netapp_fg_av_latency[{HOST.HOST},{#FGNAME}] | 127.0.0.1:10050 | Float | - | - | 0.0001 | 3600 |
Flexgroup Discovery | {#FGNAME} Read B/s | Zabbix Agent | netapp_fg_bps_read[{HOST.HOST},{#FGNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | Bps | - | 3600 |
Flexgroup Discovery | {#FGNAME} Write B/s | Zabbix Agent | netapp_fg_bps_write[{HOST.HOST},{#FGNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | Bps | - | 3600 |
Flexgroup Discovery | {#FGNAME} Read OPS | Zabbix Agent | netapp_fg_ops_read[{HOST.HOST},{#FGNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | - | - | 3600 |
Flexgroup Discovery | {#FGNAME} Write OPS | Zabbix Agent | netapp_fg_ops_write[{HOST.HOST},{#FGNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | - | - | 3600 |
Flexgroup Discovery | {#FGNAME} Other OPS | Zabbix Agent | netapp_fg_ops_other[{HOST.HOST},{#FGNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | - | - | 3600 |
Flexgroup Discovery | {#FGNAME} Total | Zabbix Agent | netappvol_size_total[{HOST.HOST},{#FGSVMNAME},{#FGNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | B | - | 3600 |
Flexgroup Discovery | {#FGNAME} Used | Zabbix Agent | netappvol_size_used[{HOST.HOST},{#FGSVMNAME},{#FGNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | B | - | 3600 |
Infinite Volume Discovery | {#IVOLNAME} Latency (ms) | Zabbix Agent | netapp_ivol_av_latency[{HOST.HOST},{#IVOLNAME}] | 127.0.0.1:10050 | Float | - | - | 0.0001 | 3600 |
Infinite Volume Discovery | {#IVOLNAME} Read B/s | Zabbix Agent | netapp_ivol_bps_read[{HOST.HOST},{#IVOLNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | Bps | - | 3600 |
Infinite Volume Discovery | {#IVOLNAME} Write B/s | Zabbix Agent | netapp_ivol_bps_write[{HOST.HOST},{#IVOLNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | Bps | - | 3600 |
Infinite Volume Discovery | {#IVOLNAME} Read OPS | Zabbix Agent | netapp_ivol_ops_read[{HOST.HOST},{#IVOLNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | - | - | 3600 |
Infinite Volume Discovery | {#IVOLNAME} Write OPS | Zabbix Agent | netapp_ivol_ops_write[{HOST.HOST},{#IVOLNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | - | - | 3600 |
Infinite Volume Discovery | {#IVOLNAME} Other OPS | Zabbix Agent | netapp_ivol_ops_other[{HOST.HOST},{#IVOLNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | - | - | 3600 |
Infinite Volume Discovery | {#IVOLNAME} Total | Zabbix Agent | netappvol_size_total[{HOST.HOST},{#IVOLSVMNAME},{#IVOLNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | B | - | 3600 |
Infinite Volume Discovery | {#IVOLNAME} Used | Zabbix Agent | netappvol_size_used[{HOST.HOST},{#IVOLSVMNAME},{#IVOLNAME}] | 127.0.0.1:10050 | Unsigned | Decimal | B | - | 3600 |
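An item prototype whose key has no matching UserParameter shows up as unsupported in Zabbix, so it helps to list the keys the agent actually defines and compare them with the Key column above. A minimal sketch follows, with `list_userparameter_keys` as a hypothetical helper name.

```shell
# Hypothetical helper: print the unique key names defined in a Zabbix
# UserParameter file (the part between "UserParameter=" and "[" or ",").
list_userparameter_keys() {
    sed -n 's/^UserParameter=\([^[,]*\).*/\1/p' "$1" | sort -u
}

# Usage:
#   list_userparameter_keys /etc/zabbix/zabbix_agentd.d/userparameter_netapp.conf
```

Piping the result into `grep -Fx <key name>` gives a quick yes or no for a single key before it is typed into the GUI.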
Zabbix Items
Name | Type | Key | Host Interface | SNMP OID / Type | SNMP Community | Data Type | Unit | Custom Multiplier | Interval |
---|---|---|---|---|---|---|---|---|---|
Combined Aggregate Total | Zabbix Agent | netapp_combaggr_total[{HOST.HOST}] | 127.0.0.1:10050 | - | Unsigned | Decimal | B | 1000 | 3600 |
Combined Aggregate Used | Zabbix Agent | netapp_combaggr_used[{HOST.HOST}] | 127.0.0.1:10050 | - | Unsigned | Decimal | B | 1000 | 3600 |
Combined Cluster CPU Utilisation | SNMPv2 | cpuBusyTimePerCent | Unsigned | Decimal | % | - | 300 |
Zabbix Graph Prototypes
Discovery Rule | Name | Y axis MIN value | Items | Func | Draw Type | Colour |
---|---|---|---|---|---|---|
Aggregate Discovery | {#AGGRNAME} Latency (ms) | Fixed | {#AGGRNAME} Latency (ms) | avg | Bold Line | DD0000 |
Aggregate Discovery | {#AGGRNAME} OPS | Fixed | {#AGGRNAME} Write OPS | avg | Bold Line | DD0000 |
Aggregate Discovery | ::: | ::: | {#AGGRNAME} Read OPS | avg | Bold Line | 00DD00 |
Aggregate Discovery | ::: | ::: | {#AGGRNAME} Total OPS | avg | Bold Line | DDDD00 |
Aggregate Discovery | {#AGGRNAME} Throughput | Fixed | {#AGGRNAME} Write B/s | avg | Bold Line | DD0000 |
Aggregate Discovery | ::: | ::: | {#AGGRNAME} Read B/s | avg | Bold Line | 00DD00 |
Flexgroup Discovery | {#FGNAME} Average Latency (ms) | Fixed | {#FGNAME} Latency (ms) | avg | Bold Line | DD0000 |
Flexgroup Discovery | {#FGNAME} IOPS | Fixed | {#FGNAME} Write OPS | avg | Bold Line | DD0000 |
Flexgroup Discovery | ::: | ::: | {#FGNAME} Other OPS | avg | Bold Line | DDDD00 |
Flexgroup Discovery | ::: | ::: | {#FGNAME} Read OPS | avg | Bold Line | 00DD00 |
Flexgroup Discovery | ::: | ::: | {#FGNAME} Total OPS | avg | Bold Line | DDDD00 |
Flexgroup Discovery | {#FGNAME} Throughput | Fixed | {#FGNAME} Write B/s | avg | Bold Line | DD0000 |
Flexgroup Discovery | ::: | ::: | {#FGNAME} Read B/s | avg | Bold Line | 00DD00 |
Flexgroup Discovery | {#FGNAME} Utilisation | Fixed | {#FGNAME} Total | avg | Filled region | 00DD00 |
Flexgroup Discovery | ::: | ::: | {#FGNAME} Used | avg | Filled region | DD0000 |
Infinite Volume Discovery | {#IVOLNAME} Average Latency (ms) | Fixed | {#IVOLNAME} Latency (ms) | avg | Bold Line | DD0000 |
Infinite Volume Discovery | {#IVOLNAME} IOPS | Fixed | {#IVOLNAME} Write OPS | avg | Bold Line | DD0000 |
Infinite Volume Discovery | ::: | ::: | {#IVOLNAME} Other OPS | avg | Bold Line | DDDD00 |
Infinite Volume Discovery | ::: | ::: | {#IVOLNAME} Read OPS | avg | Bold Line | 00DD00 |
Infinite Volume Discovery | ::: | ::: | {#IVOLNAME} Total OPS | avg | Bold Line | DDDD00 |
Infinite Volume Discovery | {#IVOLNAME} Throughput | Fixed | {#IVOLNAME} Write B/s | avg | Bold Line | DD0000 |
Infinite Volume Discovery | ::: | ::: | {#IVOLNAME} Read B/s | avg | Bold Line | 00DD00 |
Infinite Volume Discovery | {#IVOLNAME} Utilisation | Fixed | {#IVOLNAME} Total | avg | Filled region | 00DD00 |
Infinite Volume Discovery | ::: | ::: | {#IVOLNAME} Used | avg | Filled region | DD0000 |
Zabbix Graphs
Name | Y axis MIN value | Items | Func | Draw Type | Colour |
---|---|---|---|---|---|
Cluster CPU Utilisation | Fixed | Combined Cluster CPU Utilisation | avg | Bold Line | DD0000 |
Total Aggregate Utilisation | Fixed | Combined Aggregate Total | avg | Filled region | 00DD00 |
::: | Fixed | Combined Aggregate Used | avg | Filled region | DD0000 |