NetApp NFS chgrp issue – Rejected I/O

Had an interesting issue that I’ve resolved just now. My background is in networking and Windows administration, so I’m not as familiar with the history behind Linux and NFS as I am with Windows.

So the issue was that a virtual user’s script, which changes the group of various files to various groups, had stopped fully working once the files were migrated to a NetApp Flexgroup. The user account was a member of all the groups. When the files were on an Isilon and the script ran against it, it worked.

When testing, it became apparent time and time again that the same groups worked and the same groups failed on the NetApp, for example:

chgrp group_1 /nfs/production/<path to file>/test01
chgrp: changing group of `/nfs/production/<path to file>/test01': Operation not permitted

Doing the same on a local file system or Isilon worked.

Doing the same against a 7-mode and C-mode flexvol had the same behaviour as the Flexgroup.

Interestingly if we changed the user’s primary group for that session via newgrp to a failing group, it worked. Changing it back, the broken group broke again.

The user was a member of a lot of groups, so maybe group membership was being truncated by the NetApps. Internally we use LDAP and NIS, but as the LDAP servers have been hardened too much for the NetApps (even for 9.2, TLS issues) they have to use NIS.

production::> set adv
production::*> name-service ns-switch show -vserver vnas-production
  (vserver services name-service ns-switch show)
                               Source
Vserver         Database       Order
--------------- ------------   ---------
vnas-production hosts          files,
                               dns
vnas-production group          nis
vnas-production passwd         nis
vnas-production netgroup       nis
vnas-production namemap        files
5 entries were displayed.
production::*> name-service getxxbyyy getgrlist -node nas60a -vserver vnas-production -username <username>
(vserver services name-service getxxbyyy getgrlist)
pw_name: <username>
Groups: 1098 1367 1368 1316 1307 1369 1370 1371 1341 1372 1303 1373 1374 1375 1338 1376 1377 1378 1379 1380 1310 1381 1382 1285 1383 1384 1385 1386 1387 1342 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1311 1402 1403 1309 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1312 1421 1422 1290 1423 1424 1425 1426 1427 1292 1428 1429 1430 1431 1306 1432 1433 1434 1435 1436 1437 1438 1439 1334 1440 1441 1442 1344 1343 1443 1444 1340 1339 1445 1301 1446 1314 1447 1286 1448 1449 1450 1451 1452 1453 1454 1308 1345 1313 1455 1456 1457 1458 1459 1460 1461 1093 1110 1234

So it seems the group lookup on the NetApp side was fine.

To help get to the bottom of the issue, our support people suggested a packet trace for the test host. First I found the storage system’s IP address that the test host was using for the mount by looking in its /proc/mounts file, as we run on-box DNS.
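A minimal sketch of that /proc/mounts lookup, assuming the usual server:/export layout in the first field. The helper name nfs_server_for and the mount point in the usage line are made up for illustration.

```shell
# Hypothetical helper: print the server (host or IP) of the NFS mount at a
# given mount point. The mounts file defaults to /proc/mounts.
nfs_server_for() {
    mountpoint=$1
    mounts_file=${2:-/proc/mounts}
    awk -v mp="$mountpoint" '
        # Field 1 is server:/export, field 2 the mount point, field 3 the type.
        $2 == mp && $3 ~ /^nfs/ {
            sub(/:\/.*$/, "", $1)   # drop ":/export", keep the server part
            print $1
        }' "$mounts_file"
}

# Usage (the mount point is an assumption based on the paths above):
#   nfs_server_for /nfs/production
```

The address it prints is the one to grep for in the LIF listing.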

Then I discovered which cluster node was hosting that IP/LIF via:

ssh production net int show -vserver vnas-production | grep <ip address>

On the cluster I started the trace via:

ssh production network tcpdump start -node nas60a -port a0a-4001 -address <test host IP>

Then I ran the same tests, trying to change the group of various files, to generate a trace file. To access the trace file, go to the following in a web browser:

https://<cluster name>/spi/<node name>/etc/log/packet_traces/

When analysing the trace file with Wireshark I noticed something interesting. The client was sending over its UID, primary GID and only 16 auxiliary GIDs.

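So it seems that the client itself was truncating the user’s group list. That also explains the newgrp behaviour: the primary GID is always sent, so a failing group works once it becomes the primary group. After some research I came back to the point I made at the start of the post: there’s history here. AUTH_SYS, the default NFS authentication flavour, only has room for 16 auxiliary GIDs in its credential, so it seems everyone is following the required standards and behaving as they should. In all the years running NetApps and NFS this issue has never happened before.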
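A rough way to spot accounts that could hit this limit is to count their groups on the client. This is only a sketch: count_gids and exceeds_auth_sys are hypothetical helper names, and because `id -G` also lists the primary group the check is approximate rather than exact.

```shell
# AUTH_SYS credentials have room for at most 16 auxiliary GIDs.
AUTH_SYS_MAX_GIDS=16

# Hypothetical helper: count the whitespace-separated GIDs in a string.
count_gids() {
    # Unquoted on purpose so the shell splits the list into words.
    set -- $1
    echo "$#"
}

# Hypothetical helper: succeed if the list is too long for AUTH_SYS.
exceeds_auth_sys() {
    [ "$(count_gids "$1")" -gt "$AUTH_SYS_MAX_GIDS" ]
}

# Example: check the current user (rough, as id -G includes the primary group).
if exceeds_auth_sys "$(id -G)"; then
    echo "more than $AUTH_SYS_MAX_GIDS groups: the NFS client may truncate the list"
else
    echo "group list fits in an AUTH_SYS credential"
fi
```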

So by default the NetApp does not use NIS to enumerate a user’s groups for NFS access; it relies on the group list the client sends. Hunting through the NFS best practices document, page 170 confirmed my thoughts. The main (advanced) setting to fix this is -auth-sys-extended-groups, with -extended-groups-limit used to raise the limit above the default of 32.

Once I’d run the following command the issue had gone:

production::> set adv
production::*> vserver nfs modify -vserver vnas-production -extended-groups-limit 200 -auth-sys-extended-groups enabled

Excellent, fixed!

Or so I’d thought. Coming into work the next day, the entire compute farm had lost its mount of the Flexgroup. Netgroup, group and DNS resolution all checked out:

production::*> name-service nis-domain show-bound -vserver vnas-production -domain ebi
 (vserver services name-service nis-domain show-bound)

Vserver: vnas-production
 NIS Domain: <domain name>
Bound NIS Servers: <list of NIS server IPs>

production::*> name-service getxxbyyy netgrp -node nas60a -vserver vnas-production -netgroup <netgroup name> -client-name <FQDN host>
 (vserver services name-service getxxbyyy netgrp)
<FQDN host> is a member of <netgroup name>
production::*> name-service getxxbyyy getgrbyname -node nas60a -vserver vnas-production -groupname <group>
 (vserver services name-service getxxbyyy getgrbyname)
name: <group>
gid: 1367
gr_mem: <account names>

Even adding a host to an export policy rule via IP or hostname rather than netgroup didn’t help. To try and understand this a bit better I enabled secd tracing to view what was going on:

production::*> set diag
production::*> diag secd trace set -node nas60a -trace-all yes
Trace spec set successfully for trace-all.

Access the log file via the path:

https://<cluster name>/spi/<node name>/etc/log/mlog/secd.log

Going straight to the bottom of the log, it clicked straight away why this was failing:

[kern_secd:info:8732] .------------------------------------------------------------------------------.
[kern_secd:info:8732] | TRACE MATCH |
[kern_secd:info:8732] | RPC is being dumped because of a tracing match on: |
[kern_secd:info:8732] | All |
[kern_secd:info:8732] .------------------------------------------------------------------------------.
[kern_secd:info:8732] | RPC FAILURE: |
[kern_secd:info:8732] | secd_rpc_auth_user_id_to_unix_ext_creds has failed |
[kern_secd:info:8732] | Result = 0, RPC Result = 6909 |
[kern_secd:info:8732] | RPC received at <SNIP> |
[kern_secd:info:8732] |------------------------------------------------------------------------------'
[kern_secd:info:8732] Failure Summary:
[kern_secd:info:8732] Error: Acquire UNIX extended credentials procedure failed
[kern_secd:info:8732] [ 0 ms] Entry for user-id: 0 not found in the current source: NIS. Entry for user-id: 0 not found in any of the available sources
[kern_secd:info:8732] **[ 1] FAILURE: Unable to retrieve credentials for UNIX user with UID 0
[kern_secd:info:8732] Details:

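There is no entry for root within our NIS domain, and as root is the user that mounts, this explains why the mount requests were being denied. With extended groups enabled, it seems the NetApp looks up every incoming UID, including 0, in the configured name service sources. I then added files to the ns-switch databases: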

ssh production name-service ns-switch modify -vserver vnas-production -database group -sources files,nis
ssh production name-service ns-switch modify -vserver vnas-production -database passwd -sources files,nis

Instantly (thankfully) mounts were being allowed and everything was back to normal.
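A check along these lines before enabling extended groups might flag a missing UID 0 entry. It is a minimal sketch: has_uid is a hypothetical helper name, and the usage line assumes the ypcat client tool is installed on a host bound to the NIS domain.

```shell
# Hypothetical helper: read passwd-format lines on stdin and succeed only
# if one of them has the given UID in its third field.
has_uid() {
    awk -F: -v uid="$1" '$3 == uid { found = 1 } END { exit !found }'
}

# Usage, on a host bound to the NIS domain (assumes ypcat is installed):
#   ypcat passwd | has_uid 0 || echo "no entry for UID 0 in NIS"
```

If the check fails, adding files ahead of nis in the passwd and group databases, as above, gives the NetApp somewhere to find root.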