So we believe our expels have been resolved. Colleagues led the investigation, which basically consisted of watching the waiters on the storage cluster side, noting which client was listed in a growing waiter entry (we're talking seconds here), logging in to said client and then watching its waiters. They'd then try to work out which GPFS command/process was stuck waiting before the client was expelled.
So the basic command to view the current waiters is:
/usr/lpp/mmfs/bin/mmfsadm dump waiters
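On a busy system there will always be plenty of sub-second waiters, so it can be worth filtering those out so the growing ones stand out. A rough grep based on the output format shown further down (just a quick filter, adjust to taste):
# Hide the sub-second waiters so the long-running ones stand out
/usr/lpp/mmfs/bin/mmfsadm dump waiters | grep -v 'waiting 0\.'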
To get a fuller picture you'd want to watch the waiters of all the nodes in the cluster. To do this I'd hacked together a simple script:
#!/bin/bash
#
# 21/08/14
#
# Script to watch the waiters of all nodes in the storage
# cluster. I have to set the WCOLL variable as a file; when
# I tried to set the nodes in an array and then exported that
# array, it errored with "couldn't open file". WCOLL stands
# for "working collective" and is expected to be set before
# running the mmdsh command.
#
# I've tried to write the script so it's dynamic: it works on
# both the client and storage clusters even if the number of
# nodes has changed.

# Work out the short cluster name from mmlscluster
cluster=$(mmlscluster | grep "GPFS cluster name" | awk -F: '{print $2}' | awk '{print $1}' | awk -F. '{print $1}')

temp_file1=/tmp/nodelist.tmp
temp_file2=/tmp/nodelist.tmp2
node_list=/tmp/nodelist.txt

# Build a one-node-per-line list for WCOLL
mmlsnode | grep $cluster | awk -F$cluster '{print $2}' > $temp_file1
tr ' ' '\n' < $temp_file1 > $temp_file2
grep -v '^$' $temp_file2 > $node_list

export WCOLL=$node_list
watch -n 1 mmdsh /usr/lpp/mmfs/bin/mmfsadm dump waiters
However, a colleague worked out the following command, which does the same thing without a bespoke script:
mmlsnode -N waiters -L
So if you use watch and the command above you get the same functionality!
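In other words, something like this gives you a rolling view across the whole cluster (the one-second interval is just what I'd used in the script above):
watch -n 1 mmlsnode -N waiters -L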
What we were seeing was waiters similar to the one below (I don't have the exact output from the troubleshooting sessions as I wasn't involved directly):
0x7F2F4002AE30 waiting 0.151472644 seconds, FileBlockReadFetchHandlerThread: on ThCond 0x7F2D5C003F68 (0x7F2D5C003F68) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node xxx.xxx.xxx.xxx <c1n1>
These wait times would increase steadily until a waiter exceeded the configured GPFS setting idleSocketTimeout, at which point the node would be expelled.
Why were these occurring? Well, a colleague assumed (correctly) that the client was waiting on the File System Manager to give it a lock on a file that was already being written to. This was verified by another colleague running some simple dd tests against a single large file from a couple of client nodes.
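I don't have the exact commands they ran, but the contention test was along these lines: two clients both writing into the same large file on the GPFS mount at the same time (the path and sizes here are made up):
# On client A, start writing a large file (hypothetical GPFS path and size)
dd if=/dev/zero of=/gpfs/fs1/test/bigfile bs=1M count=100000 &
# At roughly the same time on client B, write to the same file
dd if=/dev/zero of=/gpfs/fs1/test/bigfile bs=1M count=100000 &
# Then watch the waiters climb from the storage cluster side
watch -n 1 mmlsnode -N waiters -L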
We had a very good description of the GPFS locking mechanism from someone on the GPFS user group email list:
Once Client B opens the file for writing it would have contacted the File System Manager to obtain the lock. The FS manager would have told Client B that Client A has the lock and that Client B would have to contact Client A and revoke the write lock token. If Client A does not respond to Client B’s request to revoke the write token, then Client B will ask that Client A be expelled from the cluster for NOT adhering to the proper protocol for write lock contention.
Once we knew the cause of the expulsions, it was suggested that, barring any network-related issues (which people say are nearly always the cause of expels), we revisit the storage and client cluster settings and bring them in line with IBM's best practices.
During a previous problem session we had intentionally not changed every setting on either cluster to be in line with IBM's network and OS recommendations. This time we did, and we also used another document given to us by IBM support (who were very slow to start off with).
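For anyone doing the same exercise, the current cluster-wide settings can be dumped with mmlsconfig and then compared against the recommendations:
# Dump the current cluster configuration to compare against IBM's recommended settings
/usr/lpp/mmfs/bin/mmlsconfig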
Since then the system has been stable, with no further unexpected expulsions. I believe the major setting that helped was changing idleSocketTimeout from 60 to 0, which basically means wait indefinitely.
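For reference, that sort of change is made with mmchconfig, something along these lines (double-check against your GPFS version's documentation, as some settings only take effect after the daemon is restarted):
# Stop dropping idle sockets: 0 = wait indefinitely
/usr/lpp/mmfs/bin/mmchconfig idleSocketTimeout=0
# Verify the new value
/usr/lpp/mmfs/bin/mmlsconfig idleSocketTimeout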
UPDATE!
I forgot to add that someone visited last year, looked at the architecture of the system and said that was the underlying issue. Basically, if the GPFS storage traffic shares the same network as general data traffic, rather than having a dedicated network (think Fibre Channel), there will always be issues.
Colleagues are currently migrating users' data off it before a full rebuild and re-architecture.