Objective: Impact assessment of loss of connectivity / All Paths Down (APD) to a USB/SD card boot device
Overview of the ESXi boot process:
- The ESXi hypervisor can be embedded or installed on USB storage or an SD card, taking advantage of its small installation footprint.
- During boot, the entire ESXi OS is loaded into memory (a RAM disk): the complete OS, the configuration state, and the logs (if no persistent scratch partition or syslog target is configured).
- The configuration state is periodically backed up to disk. This can be a USB device, SD card, local disk, or remote disk (boot from SAN). See http://kb.vmware.com/kb/2001780
- The disk holds the current and previous configuration in the form of bootbank and altbootbank.
So what really happens when access to the USB or SD card holding the installation is lost, and how is that different from losing a device behind a standard local controller or a boot-from-SAN disk?
ESXi hosts conventionally send inquiries to storage devices and read the Vital Product Data (VPD) pages they return to ascertain device details and capabilities. Among various other exchanges, this lets the host identify the device uniquely.
USB media does not support VPD pages and is therefore tracked by a runtime name.
On boot, the USB/SD device is enumerated as mpx.vmhba32, mpx.vmhba33, and so on (at boot time you would see log messages indicating that device mpx.vmhba32 claimed its path successfully).
If connectivity to a USB/SD device hosting the bootbank is lost during runtime, access to it cannot be regained. Even if connectivity is restored immediately, the device obtains a new name because the previous runtime name is blocked.
For instance, a bootbank that was enumerated and accessed as vmhba32 will most likely reappear as vmhba33.
ESXi can then no longer save configuration or any other data back to the bootbank.
This can lead to virtual machine crashes and/or erratic behaviour of the ESXi host.
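The renaming described above can be illustrated with a minimal Python sketch. It splits a runtime name of the form seen in the logs (mpx.vmhba32:C0:T0:L0) into adapter, channel, target, and LUN, then compares the adapter part before and after a reconnect. The helper names parse_runtime_name and adapter_changed are hypothetical, and the vmhba33 value is an assumed example, not output captured from a host.

```python
import re
from typing import NamedTuple

# Runtime name layout as it appears in the log excerpts: mpx.vmhbaN:C#:T#:L#
RUNTIME_NAME = re.compile(r"^mpx\.(vmhba\d+):C(\d+):T(\d+):L(\d+)$")


class RuntimeName(NamedTuple):
    adapter: str
    channel: int
    target: int
    lun: int


def parse_runtime_name(name: str) -> RuntimeName:
    """Split a runtime name into its adapter, channel, target and LUN parts."""
    match = RUNTIME_NAME.match(name)
    if match is None:
        raise ValueError(f"not an mpx runtime name: {name!r}")
    adapter, channel, target, lun = match.groups()
    return RuntimeName(adapter, int(channel), int(target), int(lun))


def adapter_changed(before: str, after: str) -> bool:
    """True when the same device came back under a different adapter name."""
    return parse_runtime_name(before).adapter != parse_runtime_name(after).adapter


if __name__ == "__main__":
    # Assumed example: the boot device reappears as vmhba33 after a reset.
    print(adapter_changed("mpx.vmhba32:C0:T0:L0", "mpx.vmhba33:C0:T0:L0"))
```

Because the host keeps addressing the bootbank by the name it had at boot, a changed adapter part means the original path is gone from the host's point of view.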
I recently worked on an escalation in which, during a firmware upgrade, the SD card reset after the host had booted, causing an APD condition on the bootbank. A virtual machine that was vMotioned to this host crashed immediately.
*timestamp* The OS booted.
*timestamp* Server reset.
*timestamp* The OS booted.
ESXi VMkernel logs
*timestamp* StorageApdHandler: StorageAPDSendEvent:395: Device or filesystem with identifier [mpx.vmhba32:C0:T0:L0] has entered the All Paths Down state.
*timestamp* StorageApdHandler: Storage_APDStart:846: APD Start for ident [mpx.vmhba32:C0:T0:L0]!
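As a rough aid for spotting this condition, here is a minimal Python sketch that scans VMkernel log lines for the "has entered the All Paths Down state" message quoted above and checks whether the boot device is among the affected identifiers. The function names and the boot_ident parameter are hypothetical. The message wording is taken from the excerpt above and may differ between ESXi releases.

```python
import re
from typing import Iterable, List

# Matches the identifier in the APD message shown in the log excerpt.
APD_ENTERED = re.compile(
    r"identifier \[([^\]]+)\] has entered the All Paths Down state"
)


def apd_devices(lines: Iterable[str]) -> List[str]:
    """Return identifiers reported as entering APD, in first-seen order."""
    seen: List[str] = []
    for line in lines:
        match = APD_ENTERED.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def boot_device_in_apd(lines: Iterable[str], boot_ident: str) -> bool:
    """True when the given boot device identifier has an APD entry."""
    return boot_ident in apd_devices(lines)


if __name__ == "__main__":
    sample = [
        "*timestamp* StorageApdHandler: StorageAPDSendEvent:395: Device or "
        "filesystem with identifier [mpx.vmhba32:C0:T0:L0] has entered the "
        "All Paths Down state.",
    ]
    print(boot_device_in_apd(sample, "mpx.vmhba32:C0:T0:L0"))
```

The boot device identifier has to be supplied by the caller, since an mpx name alone does not prove that a device is the USB/SD boot media.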
There has also been an earlier report of electrical signal loss to an SD card leading to the same issue on different hardware.
In conclusion:
- The loss of a USB/SD card boot device is very costly and disruptive to an ESXi host.
- Consult your hardware vendor to ascertain the cause of the device loss.
- Always ensure that the USB/SD media is supported.
- Avoid configuring the log repository on a USB device; the more I/O the device receives, the shorter its lifetime.