Recently our new(ish) GPFS cluster has been used more and more by people at work. We’ve seen network throughput of up to 200Gb/s across our storage nodes, which is seriously impressive.
I’m enjoying learning all this new HPC storage stuff, and it brings the storage and compute teams closer together as there’s some crossover involved.
Last week, however, we started to suffer some serious performance issues and node expulsions. Our storage nodes run RHEL 6.3 (2.6.32-279.22.1.el6.x86_64) with four 10GbE interfaces bonded into one LACP interface. For some reason one of our nodes couldn’t ping two of the other five nodes and vice versa; we’d seen this before and a temporary fix was to bounce the bonding interface. We also saw a substantial number of overrun frames on all of the interfaces on all of the nodes.
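If you want to check the overrun counters yourself, something along these lines works on RHEL 6 (bond0 and eth0 below are just placeholders for whichever interfaces you care about):

# kernel view of the RX error/overrun counters on the bond
ip -s link show bond0
# driver-level statistics, often with a more detailed breakdown
ethtool -S eth0 | grep -iE 'drop|over|miss'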
We updated the following settings, which required both the client and storage clusters to be shut down:
socketMaxListenConnections
From:
socketMaxListenConnections 128
To:
socketMaxListenConnections 1500
WARNING Disruptive change! An mmshutdown -a and mmstartup -a were needed to apply this change.
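For reference, something along these lines is how you’d make the change with mmchconfig (with GPFS down, per the warning above; the grep is just to confirm the new value):

mmshutdown -a
mmchconfig socketMaxListenConnections=1500
mmstartup -a
mmlsconfig | grep socketMaxListenConnections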
net.core.somaxconn
This is a sysctl.conf change; it’s best practice to have this match the GPFS setting socketMaxListenConnections above.
From:
net.core.somaxconn = 1024
To:
net.core.somaxconn = 1500
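To apply this without waiting for a reboot, something like the following should do it; if net.core.somaxconn is already in /etc/sysctl.conf, edit the existing line instead of appending:

echo 'net.core.somaxconn = 1500' >> /etc/sysctl.conf
sysctl -p                       # reload sysctl.conf
sysctl net.core.somaxconn       # confirm the running value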
Disable NIC offloads
This is another change to help with the large number of overrun errors seen on all of the storage nodes’ network interfaces, including the bonding interface. These errors are a known issue mentioned in an IBM KB article.
Red Hat has documentation covering txqueuelen and bonding here.
The following changes were made:
for i in `seq 0 5` ; do ethtool -K eth$i tso off gso off lro off tx off rx off ; done
/sbin/ifconfig bond0 txqueuelen 10000
The NIC changes are currently not persistent across reboots.
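One way to make them persistent (a sketch only, not something we’ve rolled out yet) would be to re-apply the same commands from /etc/rc.local on each storage node:

cat >> /etc/rc.local <<'EOF'
# re-apply NIC offload and txqueuelen settings at boot
for i in `seq 0 5` ; do ethtool -K eth$i tso off gso off lro off tx off rx off ; done
/sbin/ifconfig bond0 txqueuelen 10000
EOF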
failureDetectionTime
This timeout was increased to help with the node and client expulsion issue. The default is meant to be 35 seconds, but our setting was an unusual -1.
From:
failureDetectionTime = -1
To:
failureDetectionTime = 60
WARNING Disruptive change! An mmshutdown -a and mmstartup -a were needed to apply this change.
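Again for reference, something like this is what the change looks like with mmchconfig (with GPFS down, per the warning):

mmshutdown -a
mmchconfig failureDetectionTime=60
mmstartup -a
mmlsconfig | grep failureDetectionTime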
IBM GPFS Best practices
Below are some IBM wiki documents relating to performance tuning that we’ll be going through and possibly implementing. Our GPFS implementation isn’t a dedicated HPC cluster with a private InfiniBand data network but more of a general-purpose compute farm with a diverse set of storage systems attached, so many of these settings won’t be relevant on one or both of the GPFS clusters (client/server).