VPS Node #2 Service Disruption - Resolved
Posted in Service Status by Jon on 29/12/2015 @ 03:35
We are aware of a service impacting issue on VPS Node #2 and are currently diagnosing the fault. Further updates will be posted as more information becomes available.

Update @ 04:07 on 29/12/15 by Jon

Our investigations so far have discovered that the relevant Xen kernel isn't fully loading on boot, and we believe this issue is related to a degraded RAID array.
We're currently attempting to get the LSI RAID controller to rebuild the array with one of the two available global hot spares, but this has so far proven problematic.
We'll continue to update this task with our progress.

Update @ 04:47 on 29/12/15 by Jon
Unfortunately we've reached a point in our diagnosis where we can make no further progress until the RAID array issue has been resolved.
We've been able to boot the VPS node into its standard, non-Xen kernel, and from there we can launch the LSI MegaRAID tools without a problem, which gives us much better tooling and diagnostics than the RAID BIOS does.
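For the curious, checking array health and rebuild progress from a booted Linux kernel looks something like the following. This is an illustrative sketch using LSI's MegaCli utility; the adapter number and the enclosure/slot IDs shown are placeholders and will differ on any given system:

```shell
# Show the state of all logical drives ("Optimal" vs "Degraded") on all adapters
MegaCli -LDInfo -Lall -aALL

# List physical drives, including any configured as global hot spares
MegaCli -PDList -aALL

# Check rebuild progress for one physical drive (placeholder enclosure 8, slot 3, adapter 0)
MegaCli -PDRbld -ShowProg -PhysDrv [8:3] -a0
```

These commands require root privileges and only report; they don't change the array configuration.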
Currently we are rebuilding the RAID array using one of the hot spares, but are refraining from re-attempting to load the Xen kernel until the RAID array's health is optimal; the rebuild is due to complete in approximately 50 minutes.
As such, we expect to post a further update in approximately 50-60 minutes.

Update @ 05:47 on 29/12/15 by Jon
The RAID array has just finished rebuilding and, as such, we're just about to re-try the Xen kernel.

Update @ 05:58 on 29/12/15 by Jon
Despite the RAID array now being healthy, loading the Xen kernel still isn't working properly and, as such, we're resuming our investigation.

Update @ 06:14 on 29/12/15 by Jon
After being sent down the wrong path for the last few hours by the discovery of the degraded RAID array, it seems that the root cause was actually an automatically applied Ksplice kernel patch.
We utilise tools like Ksplice and KernelCare because they are designed to provide our customers with uninterrupted service whilst retaining the benefits of automatic security updates, but clearly, in this instance, something went terribly wrong!
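For administrators running Ksplice Uptrack on their own systems, the patches applied to a running kernel can be inspected and, if a bad patch is suspected, rolled back without a reboot. A sketch using the standard Uptrack client commands (root privileges required):

```shell
# List the Ksplice updates currently applied to the running kernel
uptrack-show

# Remove all applied Ksplice updates from the running kernel
uptrack-remove --all
```

Automatic installation of updates is governed by the `autoinstall` setting in /etc/uptrack/uptrack.conf, which can be set to `no` while an incident is investigated.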
In due course, we'll be looking into exactly what went wrong and how we can better protect ourselves and our clients from similar issues in future, but to conclude this status announcement thread:
- Service to VPS2 should now be operational as of 06:15 GMT; if your VPS is not already back online, please contact support
- The initially reported RAID degradation was not the cause, and seems to have occurred during the forced reboot of the VPS node after it initially crashed
- The RAID array of VPS #2 is back to optimal, with global hot spares available
Please accept our sincere apologies for any inconvenience this extended period of service disruption may have caused.