RAID Bad Stripe Corruption and VMFS Recovery


Early on a Monday morning, we found our VMware ESXi 5.0 server showing all VMFS datastores located on direct-attached storage (DAS) disks as offline. The disks themselves were offline as well. The server appeared to function, but all local disks were gone. A graceful shutdown attempt was in vain. After a cold boot, one of the VMFS volumes was still offline, although the underlying RAID volume was online.

Controller Status

This server uses an Adaptec RAID 51245 controller, so we use the ARCCONF utility for monitoring and management. A quick glance at the controller’s state and configuration didn’t reveal much: no failed or degraded disks, and the number of logical disks was correct.

arcconf getconfig 1 AD

----------------------------------------------------------------------
Controller information
----------------------------------------------------------------------
   Controller Status                        : Optimal
   Channel description                      : SAS/SATA
   Controller Model                         : Adaptec 51245
   Controller Serial Number                 : 8A11102B779
   Physical Slot                            : 3
   Temperature                              : 67 C/ 152 F (Normal)
   Installed memory                         : 512 MB
   Copyback                                 : Enabled
   Background consistency check             : Enabled
   Automatic Failover                       : Enabled
   Global task priority                     : High
   Performance Mode                         : Default/Dynamic
   Stayawake period                         : Disabled
   Spinup limit internal drives             : 0
   Spinup limit external drives             : 0
   Defunct disk drive count                 : 0
   Logical devices/Failed/Degraded          : 4/0/0
   SSDs assigned to MaxIQ Cache pool        : 0
   Maximum SSDs allowed in MaxIQ Cache pool : 8
   MaxIQ Read Cache Pool Size               : 0.000 GB
   MaxIQ cache fetch rate                   : 0
   MaxIQ Cache Read, Write Balance Factor   : 3,1
   NCQ status                               : Enabled
   Statistics data collection mode          : Enabled
   --------------------------------------------------------
   Controller Version Information
   --------------------------------------------------------
   BIOS                                     : 5.2-0 (18252)
   Firmware                                 : 5.2-0 (18252)
   Driver                                   : 1.1-7 (28000)
   Boot Flash                               : 5.2-0 (18252)

VMware ESXi logs

Then we examined the logs of the ESXi server. To access the full logs, one may either use the vSphere client, choosing Administration > Export System Logs, or connect to ESXi via SSH and use less or a similar command.

SSH Access:

  1. Use PuTTY or any other SSH client to connect to your ESXi host.
  2. Issue cd /var/log to change to the log directory.
  3. There are a number of interesting logs; issue the ls -l command to see all the files.
  4. Open vmkernel.log using less ./vmkernel.log
  5. Press q to exit when done.

We found that vmkernel.log was continually being populated with the same message: “failed to initialize vmfs3 distributed locking on volume”. This error alone doesn’t tell us much. There is a VMware Knowledge Base article mentioning this error, but since the Adaptec 51245 doesn’t support VAAI, it wasn’t our case. Since we knew that the problem affected all disks, for a moment we adopted the assumption that the problem was at the RAID controller level.
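
To confirm how often the message was being logged, the log can be filtered right from the SSH session. A minimal sketch, assuming the default log location on ESXi 5.x:

# count occurrences of the locking error in the current vmkernel log
grep -c "distributed locking" /var/log/vmkernel.log

# watch new occurrences arrive in real time (Ctrl+C to stop)
tail -f /var/log/vmkernel.log | grep -i "distributed locking"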

Controller logs

A thorough examination of state and controller logs revealed some interesting details.

arcconf getlogs 1 dead 

This command returns the log entries associated with failed drives. But since we had no failed or degraded RAID volumes, this log turned out to be empty (as expected).

arcconf getlogs 1 device

This returns log entries for any event associated with the hard drives. This log was full: every disk in the system appeared in it, indicating numerous aborted commands (abortedCmds).

<ControllerLog controllerID="0" type="0" time="1343024575" version="3" tableFull="false">
    <driveErrorEntry smartError="false" vendorID="WDC     " serialNumber="WD-WMAYXXXXXXXX" wwn="0000000000000000" deviceID="0" productID="WD2003FY" numParityErrors="0" linkFailures="0" hwErrors="0" abortedCmds="196" mediumErrors="0" smartWarning="0" />
    <driveErrorEntry smartError="false" vendorID="WDC     " serialNumber="WD-WMAYXXXXXXXX" wwn="0000000000000000" deviceID="1" productID="WD2003FY" numParityErrors="0" linkFailures="0" hwErrors="0" abortedCmds="298" mediumErrors="0" smartWarning="0" />
    <driveErrorEntry smartError="false" vendorID="WDC     " serialNumber="WD-WMAYXXXXXXXX" wwn="0000000000000000" deviceID="2" productID="WD2003FY" numParityErrors="0" linkFailures="0" hwErrors="0" abortedCmds="187" mediumErrors="0" smartWarning="0" />
    ...and so on...
</ControllerLog>

An aborted command usually indicates that a disk operation has timed out. An HDD pre-failure condition is a likely cause of such errors, but having all ten disks experience the same problem at the same time was suspicious. It pointed to a common point of failure: the controller itself.
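
Because the device log is returned as XML, the per-drive counters are easier to eyeball when filtered. A rough sketch, assuming grep is available wherever arcconf is run:

# pull the device IDs and aborted-command counters out of the XML
arcconf getlogs 1 device | grep -oE 'deviceID="[0-9]+"|abortedCmds="[0-9]+"'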

arcconf getconfig 1

This returns the whole configuration, including the current state of the controller, of each logical disk (RAID volume), and of each physical disk. Carefully going through the configuration, we found that one of the logical volumes (the one associated with the failed VMFS) had the status Impacted and reported Failed Stripes.

Logical device number 1
   Logical device name                      : Array Bravo
   RAID level                               : 5
   Status of logical device                 : Impacted
   Size                                     : 1429494 MB
   Stripe-unit size                         : 256 KB
   Read-cache mode                          : Enabled
   MaxIQ preferred cache setting            : Disabled
   MaxIQ cache setting                      : Disabled
   Write-cache mode                         : Enabled (write-back)
   Write-cache setting                      : Enabled (write-back) when protected by battery/ZMM
   Partitioned                              : No
   Protected by Hot-Spare                   : Yes
   Global Hot-Spare                         : 0,10
   Bootable                                 : No
-> Failed stripes                           : Yes          <<<
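
For controllers with several logical devices, the output can also be limited to logical device information and filtered for the fields of interest. A minimal sketch, assuming grep is available wherever arcconf is run:

# show only logical device information and highlight status and failed stripes
arcconf getconfig 1 LD | grep -iE "Logical device number|Status of logical device|Failed stripes"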

Failed or Bad Stripes

A Failed Stripe or Bad Stripe is a form of RAID volume corruption that can occur when a write operation hangs at a critical moment. Typical causes of this kind of RAID corruption are power failures and controller faults. Using a write cache without a battery (the controller battery, not the UPS) greatly increases the risk. Usually there is a setting that enables the write cache only in the presence of a functioning battery, so that if a power failure occurs, the information held in the cache is written to disk as soon as the power is back and the system is running. We had this battery and the appropriate setting. But we were “lucky”: further analysis showed that we had suffered a double failure. The battery had died at the same time, which may have contributed to the RAID corruption. To plan further recovery attempts, one needs to understand the implications of such a failure. An Adaptec Knowledge Base article states the following important points about Bad Stripes:

  1. A bad stripe doesn’t take the entire drive offline but only the data located in that stripe.
  2. The data in the affected stripe cannot be guaranteed to be consistent; therefore access to the data is blocked.
  3. Bad stripes are tracked using the Bad Stripe Table (BST), located in the area reserved for the RAID configuration on the physical disks grouped into an array.
  4. Most importantly: there is no way to fix or clear a BST entry without destroying data.

Recovery

Preparations

Before attempting any recovery action, we wanted to secure the original disk. Due to business requirements, the server had to stay up and keep running virtual machines on the unaffected storage, which caused a small complication. The ESXi logs were filling up with “failed to initialize vmfs3 distributed locking on volume” messages, indicating that the server was continuously trying to mount the damaged disk, and concurrent I/O during recovery may cause even more damage. Since it was a DAS disk, it was impossible to prevent ESXi from trying to mount the partition without shutting down the whole server, so a decision was made to delete the partition. It is possible to delete a partition from the drive and then restore it without losing the data, and by deleting the partition we get rid of the unwanted I/O operations on our disk. The whole operation is done using the simple fdisk tool.

  1. First, log in to ESXi using SSH.
  2. Issue cd /vmfs/devices/disks to change directory.
  3. Locate the required volume using: ls -l
    ...
    lrwxrwxrwx    1 root     root                 19 Sep 17 13:21 vml.0000000000766d686261323a313a30 -> mpx.vmhba2:C0:T1:L0
    ...
  4. This directory contains the mapping between a datastore name/ID and its physical address.
  5. vmhba2 is our Adaptec controller (it can be seen in the vSphere Configuration tab), and T1 corresponds to our logical drive 1, the damaged RAID.
  6. Run fdisk /vmfs/devices/disks/vml.0000000000766d686261323a313a30
  7. Delete the partition (press d and follow the instructions; a sketch of the dialog follows this list).
  8. In the vSphere interface, issue a “Rescan for datastores…” for both devices and VMFS by right-clicking on the server.
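
For reference, the interactive session looks roughly like this. A sketch based on the busybox fdisk shipped with ESXi 5.x; the exact prompts on your build may differ:

~ # fdisk /vmfs/devices/disks/vml.0000000000766d686261323a313a30

Command (m for help): d
Partition number (1-4): 1

Command (m for help): w
The partition table has been altered!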

After completing this procedure, ESXi should no longer see the damaged disk and cease all attempts to mount it.
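
The rescan can also be triggered from the same SSH session instead of the vSphere client. A short sketch, assuming the controller is vmhba2 as identified above:

# rescan the adapter, then ask the VMkernel to re-probe for VMFS volumes
esxcfg-rescan vmhba2
vmkfstools -V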

Recovery Strategy

From Adaptec’s knowledge base we know that BST entries are maintained in the configuration area on the disks in the RAID and that there is no documented procedure to clear the BST without deleting data. One should know, however, that array controllers usually provide a way to create an array without initialization. Adaptec calls it the SKIP Initialization method (Adaptec KB). It is an emergency measure to restore limited functionality of an array after multiple disk failures: it updates metadata only, so data loss is implied, but some of the data may still be accessible. Knowing that, we decided to destroy the RAID, recreate it using SKIP Initialization, restore the previously deleted partition, and try to mount it.

WARNING: this is a last-resort method. Since there is no way to assure that the recovered data is consistent, the recovered array should not be kept in use. Recover what you can to other drives and verify the integrity of the data before using it, then destroy the damaged volume and recreate it using normal procedures.

Recovery – RAID

  1. To recreate the RAID, it is important to write down all the details about the logical drive: its size, the drives used in the array, the stripe size… just about everything. Earlier in this post, there is the output of the arcconf getconfig command that shows the information about the logical volume.
  2. Delete the volume. Our disk is logical drive 1, so we issue arcconf delete 1 logicaldrive 1
    arcconf delete 1 logicaldrive 1
    Controllers found: 1
    
    WARNING: logical device 1 may be mounted.
    All data in logical device 1 will be lost and all application data that is currently live will be disrupted.
    Are you sure to delete the logical device?
    Press y, then ENTER to continue or press ENTER to abort:y
    
    Deleting: logical device 1 ("Array Bravo")
    
    Command completed successfully.
  3. Recreate the same drive using the SKIP initialization method. In our case the original array used default settings: the whole available size, RAID 5, and disks 0:4, 0:5, 0:6. Issue: arcconf create 1 logicaldrive Method SKIP MAX 5 0 4 0 5 0 6
    arcconf create 1 logicaldrive Method SKIP MAX 5 0 4 0 5 0 6
    Controllers found: 1
    You have chosen not to initialize this logical device.
    This option is provided as a last resort for recovering failed logical devices.
    You should verify the integrity of the data on any logical devices created with this option.
    
    Do you want to add a logical device to the configuration?
    Press y, then ENTER to continue or press ENTER to abort:y
    
    Creating logical device: Device 1
    
    Command completed successfully.

Now we have the same logical drive with no BST entries. Next, we need to restore the partition.
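
Before touching the partition table, it is worth confirming that the recreated logical device no longer reports failed stripes. A sketch using the same kind of filtered query as before (if your arcconf build does not accept a device number after LD, drop it and filter the full listing instead):

# logical device 1 should now report an Optimal status and no failed stripes
arcconf getconfig 1 LD 1 | grep -iE "Status of logical device|Failed stripes"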

Recovery – VMFS Partition

  1. Connect to the ESXi host using SSH.
  2. Remember the drive where we deleted the partition? Run fdisk with the -u flag (a sketch of the whole dialog follows this list): fdisk -u /vmfs/devices/disks/vml.0000000000766d686261323a313a30
  3. Now create the partition:
    1. Press n and press Enter to open the dialog for creating a new partition.
    2. Press p and press Enter to create a primary partition.
    3. Press 1 and press Enter to make it the first partition.
    4. Type 128 and press Enter to align the starting position to sector 128.
    5. Press Enter again to use the default value for the last sector.
  4. Change partition type to fb (VMFS)
    1. Press t and press Enter.
    2. Enter fb and press Enter.
  5. Press w and press Enter to save and exit.
  6. Now issue a “Rescan for datastores…” using the vSphere GUI or type vmkfstools -V in SSH.
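
Put together, the fdisk dialog for steps 2 through 5 looks roughly like this. A sketch only; the exact prompts and the sector range offered by the busybox fdisk on your ESXi build will differ:

~ # fdisk -u /vmfs/devices/disks/vml.0000000000766d686261323a313a30

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First sector: 128
Last sector: (press Enter to accept the default)

Command (m for help): t
Hex code (type L to list codes): fb

Command (m for help): w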

The official recovery procedure can be found in the VMware Knowledge Base.

VMFS Volume as a Snapshot

If the VMFS partition didn’t show up after the rescan operation, the volume may now be detected as a snapshot. This happens because the RAID controller has changed the volume ID. Read the VMware Knowledge Base article about a LUN detected as a snapshot. It is possible to mount such a volume persistently so that it stays mounted across reboots, but since we need this volume for recovery purposes only, we’ll just mount it temporarily using esxcfg-volume -m VMFS_UUID|label. Example: esxcfg-volume -m "49d22e2e-996a0dea-b555-001f2960aed8". Now our VMFS datastore should be accessible.
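
To find the UUID or label to pass to esxcfg-volume -m, the snapshot volumes detected by the host can be listed first. A short sketch:

# list VMFS volumes detected as snapshots, with their UUIDs and labels
esxcfg-volume -l

# mount the volume temporarily (the mount does not persist across reboots)
esxcfg-volume -m "49d22e2e-996a0dea-b555-001f2960aed8"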

Conclusion

In our case, we successfully recovered all VMs from the datastore without any visible damage. Your mileage may vary. The datastore and the RAID were then destroyed and recreated using the normal RAID initialization. Remember that RAID is not a replacement for a backup: RAID will eventually fail. In our case, we performed the whole recovery for practice and for a better understanding of what had happened to our server. And since double or even triple failures are not so uncommon, every such practice opportunity should be taken.