This morning I received an email warning that one of my drives is problematic:
Device: /dev/ada9, 1 Currently unreadable (pending) sectors
Checking the device from the box itself, it doesn't look good!
freenas# smartctl -a /dev/ada9
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p13 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Ultrastar 7K3000
Device Model:     Hitachi HUA723030ALA640
Serial Number:    MK0351YHGUXEHA
LU WWN Device Id: 5 000cca 225cbc805
Firmware Version: MKAOA5C0
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed May 13 18:16:16 2015 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
<SNIP>
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   090   090   016    Pre-fail  Always       -       1835041
  2 Throughput_Performance  0x0005   135   135   054    Pre-fail  Offline      -       86
  3 Spin_Up_Time            0x0007   122   122   024    Pre-fail  Always       -       634 (Average 628)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       98
  5 Reallocated_Sector_Ct   0x0033   092   092   005    Pre-fail  Always       -       321
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   135   135   020    Pre-fail  Offline      -       26
  9 Power_On_Hours          0x0012   096   096   000    Old_age   Always       -       28404
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       95
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1170
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       1170
194 Temperature_Celsius     0x0002   153   153   000    Old_age   Always       -       39 (Min/Max 18/65)
196 Reallocated_Event_Count 0x0032   085   085   000    Old_age   Always       -       447
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
OK, so despite the overall PASSED verdict, the raw values for Reallocated_Sector_Ct (321) and Current_Pending_Sector (1) tell me I need to fail this drive and replace it. Having not done this before, I thought it a good time to document the procedure!
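Before pulling the drive it's also worth kicking off a long SMART self-test to confirm the pending sector. A quick sketch (the test runs in the background and can take several hours on a 3 TB drive):

# Start an extended (long) self-test on the suspect drive
freenas# smartctl -t long /dev/ada9

# Check the self-test log once it has finished; a failing read will list the LBA
freenas# smartctl -l selftest /dev/ada9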
Interestingly (or not), the zpool looks fine; maybe a scrub would show the errors?
freenas# zpool status
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Wed Apr 22 03:45:16 2015
config:

        NAME            STATE     READ WRITE CKSUM
        freenas-boot    ONLINE       0     0     0
          ada12p2       ONLINE       0     0     0

errors: No known data errors

  pool: storage01
 state: ONLINE
  scan: scrub repaired 0 in 1h22m with 0 errors on Sun Apr 26 01:22:05 2015
config:

        NAME                                            STATE     READ WRITE CKSUM
        storage01                                       ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/2e1acf56-ccc1-11e4-bfca-002590a19320  ONLINE       0     0     0
            gptid/2e79b57a-ccc1-11e4-bfca-002590a19320  ONLINE       0     0     0
            gptid/2ed9cb08-ccc1-11e4-bfca-002590a19320  ONLINE       0     0     0
            gptid/2f3d2255-ccc1-11e4-bfca-002590a19320  ONLINE       0     0     0
            gptid/2fa89090-ccc1-11e4-bfca-002590a19320  ONLINE       0     0     0
            gptid/300d6a0b-ccc1-11e4-bfca-002590a19320  ONLINE       0     0     0
          raidz2-1                                      ONLINE       0     0     0
            gptid/30720ee2-ccc1-11e4-bfca-002590a19320  ONLINE       0     0     0
            gptid/30d928c3-ccc1-11e4-bfca-002590a19320  ONLINE       0     0     0
            gptid/313faf6d-ccc1-11e4-bfca-002590a19320  ONLINE       0     0     0
            gptid/31a19255-ccc1-11e4-bfca-002590a19320  ONLINE       0     0     0
            gptid/3209af25-ccc1-11e4-bfca-002590a19320  ONLINE       0     0     0
            gptid/327278da-ccc1-11e4-bfca-002590a19320  ONLINE       0     0     0

errors: No known data errors

freenas# zpool status -x
all pools are healthy
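To answer my own question: a manual scrub forces ZFS to read every allocated block, so an unreadable sector should surface as a read or checksum error against that disk. Roughly:

# Start a scrub of the main pool, then watch its progress and per-device errors
freenas# zpool scrub storage01
freenas# zpool status -v storage01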
Initially I thought I'd do it via the CLI, but I saw an old post where the GUI ended up out of sync afterwards, so I'll just follow the GUI procedure.
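For reference, the raw ZFS side of the CLI route would look roughly like the commands below. This is only a sketch: FreeNAS partitions each pool member (swap plus a data partition referenced by gptid) and records it in its own database, which is exactly why a pure CLI replacement can leave the GUI out of sync.

# Take the failing member offline, referenced by its gptid from zpool status
freenas# zpool offline storage01 gptid/31a19255-ccc1-11e4-bfca-002590a19320

# After swapping the physical disk, replace the offlined member with the new device
# (using the raw ada9 device here skips the swap/gptid partition layout FreeNAS expects)
freenas# zpool replace storage01 gptid/31a19255-ccc1-11e4-bfca-002590a19320 ada9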
Even though all of the disks are set up as AHCI and I have a server-grade CPU, motherboard and RAM, I noticed when testing that my system isn't truly hot-swap; I guess that's due to the cheap SATA controllers I've used (see this post for a hardware rundown). Not a problem: it's a personal server, so I'll just shut it down.
So, offline the drive (ada9): go to Storage, click on my main volume, storage01, then click on the Volume Status button:
Then, in the list of disks, click on the problem one, ada9p2, and then the OFFLINE button:
Now the zpool should show some warnings:
freenas# zpool status -x
  pool: storage01
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0 in 1h22m with 0 errors on Sun Apr 26 01:22:05 2015
config:

        NAME                                            STATE     READ WRITE CKSUM
        storage01                                       DEGRADED     0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/2e1acf56-ccc1-11e4-bfca-002590a19320  ONLINE       0     0     0
            gptid/2e79b57a-ccc1-11e4-bfca-002590a19320  ONLINE       0     0     0
            gptid/2ed9cb08-ccc1-11e4-bfca-002590a19320  ONLINE       0     0     0
            gptid/2f3d2255-ccc1-11e4-bfca-002590a19320  ONLINE       0     0     0
            gptid/2fa89090-ccc1-11e4-bfca-002590a19320  ONLINE       0     0     0
            gptid/300d6a0b-ccc1-11e4-bfca-002590a19320  ONLINE       0     0     0
          raidz2-1                                      DEGRADED     0     0     0
            gptid/30720ee2-ccc1-11e4-bfca-002590a19320  ONLINE       0     0     0
            gptid/30d928c3-ccc1-11e4-bfca-002590a19320  ONLINE       0     0     0
            gptid/313faf6d-ccc1-11e4-bfca-002590a19320  ONLINE       0     0     0
            14192876183538871392                        OFFLINE      0     0     0  was /dev/gptid/31a19255-ccc1-11e4-bfca-002590a19320
            gptid/3209af25-ccc1-11e4-bfca-002590a19320  ONLINE       0     0     0
            gptid/327278da-ccc1-11e4-bfca-002590a19320  ONLINE       0     0     0

errors: No known data errors
Shut down the server, replace the drive (I have previously labelled each drive enclosure) and power it back on.
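The label on the enclosure is the main guide, but the serial number from the smartctl output above (MK0351YHGUXEHA) is a handy cross-check against the sticker on the drive before powering off; for example:

# Note the serial of the failing drive before shutting down
freenas# smartctl -i /dev/ada9 | grep -i serial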
Once logged in, check that the new ada9 is there via Storage > View Disks:
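This can also be verified from the shell; the serial reported for ada9 should now be the new drive rather than MK0351YHGUXEHA:

# Confirm the controller sees a drive at ada9 again
freenas# camcontrol devlist | grep ada9

# The identity info should now show the replacement drive's serial
freenas# smartctl -i /dev/ada9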
Then go back to the Volume Status screen, click on the offlined disk, click the Replace button and replace it with the new ada9 drive:
This will then start a resilver, which is the ZFS rebuild. Once that is done I’ll initiate a ZFS scrub via Storage > main volume > Scrub button.
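The resilver progress (and the scrub afterwards) can also be watched from the shell; roughly:

# Shows resilver/scrub progress, estimated time remaining and any errors
freenas# zpool status storage01

# Once the resilver has completed, verify the whole pool with a scrub
freenas# zpool scrub storage01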
Thanks to the following sources of information:
http://blog.henrikandersen.se/2013/03/13/replacing-a-harddrive-in-a-zfs-pool-on-freenas/
https://www.freebsdnews.com/2015/02/28/freenas-replace-failed-hdds/
http://doc.freenas.org/9.3/freenas_storage.html#replacing-a-failed-drive