NetApp C-mode Autoassign disk gotchas

A couple of recent events led me to discover a change in how C-mode handles disk auto assignment. In 7-mode, for me at least, swapping disks was a simple process; in C-mode, it seems, that is not always so.

One of our systems is a small dual-shelf system with some SSDs for a flash storage pool. The disks are allocated one shelf per head, and disk auto assignment is left at the default:

ssh netapp06 disk show -fields disk,owner,type
disk   owner  type
------ ------ ----
1.0.0  nas42a SSD
1.0.1  nas42a SSD
1.0.2  nas42a SSD
1.0.3  nas42a SSD
1.0.4  nas42a FSAS
1.0.5  nas42a FSAS
1.0.6  nas42a FSAS
1.0.7  nas42a FSAS
1.0.8  nas42a FSAS
1.0.9  nas42a FSAS
1.0.10 nas42a FSAS
1.0.11 nas42a FSAS
1.0.12 nas42a FSAS
1.0.13 nas42a FSAS
1.0.14 nas42a FSAS
1.0.15 nas42a FSAS
1.0.16 nas42a FSAS
1.0.17 nas42a FSAS
1.0.18 nas42a FSAS
1.0.19 nas42a FSAS
1.0.20 nas42a FSAS
1.0.21 nas42a FSAS
1.0.22 nas42a FSAS
1.0.23 nas42a FSAS
1.1.0  nas42b SSD
1.1.1  nas42b SSD
1.1.2  nas42b SSD
1.1.3  nas42b SSD
1.1.4  nas42b FSAS
1.1.5  nas42b FSAS
1.1.6  nas42b FSAS
1.1.7  nas42b FSAS
1.1.8  nas42b FSAS
1.1.9  nas42b FSAS
1.1.10 nas42b FSAS
1.1.11 nas42b FSAS
1.1.12 nas42b FSAS
1.1.13 nas42b FSAS
1.1.14 nas42b FSAS
1.1.15 nas42b FSAS
1.1.16 nas42b FSAS
1.1.17 nas42b FSAS
1.1.18 nas42b FSAS
1.1.19 nas42b FSAS
1.1.20 nas42b FSAS
1.1.21 nas42b FSAS
1.1.22 nas42b FSAS
1.1.23 nas42b FSAS
48 entries were displayed.
ssh netapp06 disk option show
Node           BKg. FW. Upd.  Auto Copy     Auto Assign    Auto Assign Policy
-------------  -------------  ------------  -------------  ------------------
nas42a         on             on            on             Default
nas42b         on             on            on             Default
2 entries were displayed.

I had a disk failure on node A. Fine: I swapped it out and thought nothing more of it for a couple of days, until I discovered that the new disk was still unowned. I manually assigned it to the A head and forgot about it again!
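A quick check after each swap would have caught this sooner. The following is only a sketch: it assumes an unowned disk shows "-" in the owner column of the disk show -fields disk,owner,type output used above, and unowned_disks is a name made up for this post.

```shell
# List disks that have no owner.
# Assumption: unowned disks show "-" in the owner column of
#   disk show -fields disk,owner,type
unowned_disks() {
  # Keep only rows whose first field looks like stack.shelf.bay
  awk '$1 ~ /^[0-9]+\.[0-9]+\.[0-9]+$/ && $2 == "-" { print $1 }'
}

# Hypothetical usage (not run here):
#   ssh netapp06 disk show -fields disk,owner,type | unowned_disks
```

Anything it prints is a disk that auto assignment has not picked up.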

A colleague and I have since had a discussion about disk assignment for a couple of very large HA pairs. They had come across the Auto Assign Policy setting (shown above, but not known about until then) while researching disk layout and RAID groups, as the main system’s SAS stacks had been moved about from the initial design.

After discussing this new setting, and how assignment can now be based on bay, shelf or stack, I thought this could be the cause of the failed auto assignment on the smaller system. For example, page 48 of the Physical Storage Management Guide says:

When you assign ownership for disks, you need to follow certain guidelines to maximize fault isolation
and to keep automatic ownership assignment working. The guidelines are impacted by your autoassignment policy.

So I went to have a look through the messages logs of the cluster (https://<cluster>/spi/<node>/etc/log/mlog/) and found this very telling entry:

0000000c.0047ce6e 07364b0b <date> 16:43:03 +00:00 [diskown.AutoAssign.MultipleOwners:warning] Automatic
  assigning failed for disk 0a.00.19 (S/N Z1Z9J77G) because the disks on the loop are owned by multiple 
  systems. Automatic assigning failed for all unowned disks on this loop.
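
Entries like this are easy to miss by eye, so a small filter over a saved copy of the log can help. A minimal sketch, assuming each log entry sits on a single line in the real file (the entry above is wrapped for display); autoassign_failures is a made-up name.

```shell
# Print the disk named in each MultipleOwners auto-assign warning.
# Assumption: every log entry is on one line in the real log file.
autoassign_failures() {
  grep -F 'diskown.AutoAssign.MultipleOwners' |
    sed -n 's/.*failed for disk \([^ ]*\) .*/\1/p'
}

# Hypothetical usage, with a copy of the log already saved locally:
#   autoassign_failures < saved-copy-of-the-log
```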

BINGO! The system is so small that it only has one SAS stack, and both heads own disks on it, so the default, out-of-the-factory setting is wrong for it! I’ve now changed the policy to “Shelf” for all our systems where the disk allocation is done by shelf.
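
To see which nodes still need changing, the disk option show output from earlier can be filtered. This is a rough sketch: nodes_not_on_policy is a made-up name, and the five-column row layout is assumed to match the listing above. The change itself should be storage disk option modify with the -autoassign-policy parameter; treat that as an assumption and check the man page for your release.

```shell
# Print nodes whose Auto Assign Policy differs from the wanted one.
# $1 = wanted policy (compared case-insensitively)
# Assumption: stdin is "disk option show" output with five columns per node row.
nodes_not_on_policy() {
  awk -v want="$1" '
    NF == 5 && $1 != "Node" && $1 !~ /^-+$/ && tolower($5) != tolower(want) { print $1 }'
}

# Hypothetical usage (not run here):
#   ssh netapp06 disk option show | nodes_not_on_policy shelf
```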

Back to the big system: even though it has 4 stacks, they’re not equally balanced, as I requested one stack per rack for aesthetic and possible expansion reasons. This left the system with 3 full stacks and 1 with only 4 shelves. I did this without appreciating the effect it would have on auto assignment, or on the manual disk assignment I did at the start, explained below. I used the following bash code (assuming ssh keys have been set up) to unassign all the 0-11 disks in all shelves from node B and assign them to A, then unassign all the 12-23 disks in all shelves from node A and assign them to B, so I had pretty much a 50/50 split across all shelves, ignoring the aggr0 allocation:

# Turn auto assignment off on both nodes (the * is quoted so the local shell cannot expand it)
ssh netapp10 storage disk option modify -node '*' -autoassign off
shelves=(00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 30 31 32 33 34 35 36 37 38 39)
first_half_disks=(0 1 2 3 4 5 6 7 8 9 10 11)
second_half_disks=(12 13 14 15 16 17 18 19 20 21 22 23)
# Unassign bays 0-11 on every shelf from node B (grep -F matches the dots literally), then let A claim them
for zz in "${shelves[@]}"; do
  for a in "${first_half_disks[@]}"; do
    for b in $(ssh netapp10 node run -node nas47b disk show -o nas47b | grep -F ".$zz.$a " | awk '{print $1}'); do
      ssh netapp10 node run -node nas47b disk assign -s unowned "$b"
    done
  done
done
ssh netapp10 node run -node nas47a disk assign all
# Unassign bays 12-23 on every shelf from node A, then let B claim them
for zz in "${shelves[@]}"; do
  for a in "${second_half_disks[@]}"; do
    for b in $(ssh netapp10 node run -node nas47a disk show -o nas47a | grep -F ".$zz.$a " | awk '{print $1}'); do
      ssh netapp10 node run -node nas47a disk assign -s unowned "$b"
    done
  done
done
ssh netapp10 node run -node nas47b disk assign all

Note that the above commands ignore any disks not assigned to the node doing the unassigning, and the request is overridden when it tries to unassign an aggr0 disk, so it’s safe (for an empty system). This configuration doesn’t fit with the auto assignment policies available, though, so I may see similar auto assignment issues when disks start to fail. I’m still trying to find out the exact policy for shelf (I imagine it simply auto assigns based on the current owner of a disk, or of disk 0), but to summarise:

Bay – Even and Odd per shelf
Shelf – ?
Stack – per stack
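
To make the bay entry concrete, here is a tiny sketch of an even/odd split. Which head takes the even bays is not something the policy summary above tells us, so both node names are passed in as parameters; owner_by_bay is a made-up helper, not a NetApp command.

```shell
# Print the node a disk would belong to under an even/odd bay split.
# $1 = disk name ending in .bay, $2 = node for even bays, $3 = node for odd bays
# Assumption: which node takes the even bays is the caller's choice.
owner_by_bay() {
  # Look at the last digit of the bay number, which avoids octal surprises
  case ${1##*.} in
    *[02468]) echo "$2" ;;
    *)        echo "$3" ;;
  esac
}

# Example: owner_by_bay 0a.03.4 nas47a nas47b
```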

For me, initially, the best option is to go for the bay policy and then reassign all 700 or so disks with a slight adjustment of the code above, after destroying all the aggregates and storage pools (I have one shelf of SSDs):

# Turn auto assignment off on both nodes (the * is quoted so the local shell cannot expand it)
ssh netapp10 storage disk option modify -node '*' -autoassign off
shelves=(00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 30 31 32 33 34 35 36 37 38 39)
even_disks=(0 2 4 6 8 10 12 14 16 18 20 22)
odd_disks=(1 3 5 7 9 11 13 15 17 19 21 23)
# Unassign the even bays on every shelf from node B (grep -F matches the dots literally), then let A claim them
for zz in "${shelves[@]}"; do
  for a in "${even_disks[@]}"; do
    for b in $(ssh netapp10 node run -node nas47b disk show -o nas47b | grep -F ".$zz.$a " | awk '{print $1}'); do
      ssh netapp10 node run -node nas47b disk assign -s unowned "$b"
    done
  done
done
ssh netapp10 node run -node nas47a disk assign all
# Unassign the odd bays on every shelf from node A, then let B claim them
for zz in "${shelves[@]}"; do
  for a in "${odd_disks[@]}"; do
    for b in $(ssh netapp10 node run -node nas47a disk show -o nas47a | grep -F ".$zz.$a " | awk '{print $1}'); do
      ssh netapp10 node run -node nas47a disk assign -s unowned "$b"
    done
  done
done
ssh netapp10 node run -node nas47b disk assign all

The annoying thing is that the system won’t allow me to choose bay. It’s not the recommended configuration in NetApp’s Physical Storage Management Guide, but there’s nothing saying I can’t. After speaking to our support, the thinking is that the current disk assignment stops me choosing it because the policy wouldn’t work. I’m not sure this is the case, as I could set the shelf policy with no problem and I have other systems on the default setting that can’t auto assign disks. I suspect that the bay policy isn’t supported on the 8000 series. I’ll just leave the SSD shelf split down the middle and then assign alternate shelves to each head:

# Turn auto assignment off on both nodes (the * is quoted so the local shell cannot expand it)
ssh netapp10 storage disk option modify -node '*' -autoassign off
even_shelves=(02 04 06 08 10 12 14 16 18 20 22 30 32 34 36 38)
odd_shelves=(01 03 05 07 09 11 13 15 17 19 21 23 31 33 35 37 39)
disks=(0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23)
# Unassign every bay on the even shelves from node B (grep -F matches the dots literally), then let A claim them
for zz in "${even_shelves[@]}"; do
  for a in "${disks[@]}"; do
    for b in $(ssh netapp10 node run -node nas47b disk show -o nas47b | grep -F ".$zz.$a " | awk '{print $1}'); do
      ssh netapp10 node run -node nas47b disk assign -s unowned "$b"
    done
  done
done
ssh netapp10 node run -node nas47a disk assign all
# Unassign every bay on the odd shelves from node A, then let B claim them
for zz in "${odd_shelves[@]}"; do
  for a in "${disks[@]}"; do
    for b in $(ssh netapp10 node run -node nas47a disk show -o nas47a | grep -F ".$zz.$a " | awk '{print $1}'); do
      ssh netapp10 node run -node nas47a disk assign -s unowned "$b"
    done
  done
done
ssh netapp10 node run -node nas47b disk assign all
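
After a reshuffle like this it is worth checking that every disk really landed on the head its shelf belongs to. A minimal sketch, assuming a disk listing already trimmed to two columns (disk name, then plain owner name); misplaced_by_shelf is a made-up name. The SSD shelf, being split down the middle, will always show up in the output and can be ignored.

```shell
# Print "disk owner" for every disk whose owner does not match its shelf parity.
# $1 = node for even shelves, $2 = node for odd shelves
# Assumption: stdin has two columns, disk (x.shelf.bay) and owner.
misplaced_by_shelf() {
  awk -v even="$1" -v odd="$2" '
    split($1, p, ".") == 3 && p[2] ~ /^[0-9]+$/ {
      # awk reads "08" as decimal 8, so zero-padded shelf numbers are safe
      want = (p[2] % 2 == 0) ? even : odd
      if ($2 != want) print $1, $2
    }'
}

# Example: misplaced_by_shelf nas47a nas47b < two-column-disk-listing
```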

Interesting times, dealing with a new disk option introduced so quietly in 8.3 (it is covered very lightly in the man pages and other documentation) that some of our systems come incorrectly configured out of the factory and our supplier’s field engineers aren’t aware of the setting!

As a test I moved the aggr0/vol0 of each head to disks in line with the new assignment policy. Then, with auto assignment enabled, I unassigned a couple of disks from the old aggr0 that sat on the wrong shelves, starting with one owned by head A on an odd shelf (odd shelves belong to head B). I ran the command:

ssh netapp10 node run -node nas47a disk assign -s unowned 0a.03.0
Note: Disks may be automatically assigned to this node, since option disk.auto_assign is on.

Checked the B head messages log:

0000000d.00002c53 008525e9 <date> 12:38:18 +00:00 [diskown.changingOwner:info] Changing ownership
   of disk 0a.03.0 (S/N 1EK3Z71F) from node "unowned" (ID 4294967295, DR home ID 4294967295) to
   node "nas47b" (ID 536989303, DR home ID 4294967295).

Did the same to a couple from B to see if A would grab them as well:

ssh netapp10 node run -node nas47b disk assign -s unowned 3a.10.0
Note: Disks may be automatically assigned to this node, since option disk.auto_assign is on.
ssh netapp10 node run -node nas47b disk assign -s unowned 3d.20.0
Note: Disks may be automatically assigned to this node, since option disk.auto_assign is on.

Checked the A head messages logs:

0000000d.00001790 00853f64 <date> 12:48:12 +00:00 [diskown.changingOwner:info] Changing ownership
   of disk 3a.10.0 (S/N 1EK3YZZF) from node "unowned" (ID 4294967295, DR home ID 4294967295) to
   node "nas47a" (ID 536989265, DR home ID 4294967295). 
0000000d.000017f1 008541bb <date> 12:49:12 +00:00 [diskown.changingOwner:info] Changing ownership
   of disk 0c.20.0 (S/N 1EK3Y8HF) from node "unowned" (ID 4294967295, DR home ID 4294967295) to
   node "nas47a" (ID 536989265, DR home ID 4294967295).

It works! 🙂
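
For repeating this kind of test, the ownership changes can be pulled out of a saved copy of the messages log rather than read by eye. Again only a sketch, assuming one log entry per line in the real file (the entries above are wrapped for display); ownership_changes is a made-up name.

```shell
# Print "disk new-owner" for each changingOwner entry.
# Assumption: every log entry is on one line in the real log file.
ownership_changes() {
  grep -F 'diskown.changingOwner' |
    sed -n 's/.*of disk \([^ ]*\) .* to node "\([^"]*\)".*/\1 \2/p'
}

# Hypothetical usage, with a copy of the log already saved locally:
#   ownership_changes < saved-copy-of-the-log
```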
