[bitfolk] Why "hobgoblin"'s maintenance took so long today (Was: Re: Security reboot needed, likely to be weekend of 21/22/23 Apr)

Author: Andy Smith
To: announce
Old-Topics: [bitfolk] Security reboot needed, likely to be weekend of 21/22/23 Apr
Subject: [bitfolk] Why "hobgoblin"'s maintenance took so long today (Was: Re: Security reboot needed, likely to be weekend of 21/22/23 Apr)

gpg: Signature made Sun Apr 22 04:04:28 2018 UTC using DSA key ID BF15490B
gpg: Good signature from "Andy Smith <andy@strugglers.net>"

I made a mistake, and as a result today's scheduled maintenance for
host "hobgoblin" took much longer than expected. The first customer
VPSes on this host were shut down at around 01:30Z, and VPSes did not
begin to start up again until 01:57Z, which is about twice as long as
the maintenance normally takes.

I first noticed there was a problem when the host reached its GRUB
bootloader and the desired hypervisor version wasn't present.

At this point I didn't want to allow a normal boot into the previous
hypervisor because all of the suspended VPSes would be started, and
then all of the others would be booted. This would take significant
time and would of course result in two suspend/shutdown events for
most customers on that server.

Instead I had to boot the server into single-user mode, disable
automatic starting of VPSes, then tell it to boot normally. I then
installed the new hypervisor that should have been there already and
updated the bootloader. It was then possible to reboot into the
intended hypervisor.
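For illustration, the recovery sequence above might look roughly like the following on a Debian-style Xen dom0 using the xendomains auto-start mechanism. This is a hypothetical sketch, not BitFolk's actual tooling; the exact package and service names on "hobgoblin" are not stated in this post.

```shell
# From single-user mode: stop VPSes auto-starting on the next boot.
# (Assumption: auto-start is handled by the xendomains service.)
systemctl disable xendomains

# After booting normally: install the hypervisor that should have
# been copied across already (package name is an assumption).
apt-get install --reinstall xen-hypervisor-amd64

# Regenerate the GRUB configuration so the new hypervisor is the
# default boot entry.
update-grub

# Re-enable auto-start and reboot into the intended hypervisor.
systemctl enable xendomains
reboot
```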

The majority of the excess time was spent waiting for the BIOS and
other firmware to go through the motions on each reboot.

After the rest of the work was completed I looked into what happened
here. It seems that, due to a typo, the new version of the hypervisor
was not copied to this one particular host, and this went unnoticed
until I came to boot into it.

I will see if I can revise my procedures to detect problems of this
nature in future. I'm not yet sure what I can do, but after the last
incident of human error¹ I started writing up a plan for each
maintenance and have found this useful even for work which I once
considered routine. At the very least I can include an explicit step
in such plans to check that the hypervisor and kernel version to be
booted in to are actually present before a reboot takes place.
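As a sketch of what such a pre-flight check could look like: the script below pulls the image paths out of a grub.cfg (GRUB loads the hypervisor via "multiboot" and the kernel/initrd via "linux" or "module" lines) and verifies each file exists before a reboot is allowed. This is a hypothetical example, not BitFolk's actual procedure; paths and the optional root prefix are illustrative.

```shell
#!/bin/sh
# Pre-reboot sanity check: confirm every boot image referenced by the
# GRUB configuration is actually present on disk.
#
# Usage: check_boot_files <grub.cfg> [root-prefix]
# Returns 0 if all images exist, 1 otherwise.
check_boot_files() {
    cfg="$1"
    root="$2"
    status=0
    # The second field of linux/multiboot/module lines is the image
    # path (normally under /boot).
    for path in $(awk '/^[[:space:]]*(linux|multiboot|module)[[:space:]]/ {print $2}' "$cfg" | sort -u)
    do
        if [ ! -e "$root$path" ]; then
            echo "MISSING: $root$path" >&2
            status=1
        fi
    done
    return $status
}
```

Run against the live config, e.g. `check_boot_files /boot/grub/grub.cfg || echo "do not reboot"`, as an explicit step in the maintenance plan.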

Apologies again for the extended outage period.


¹ https://lists.bitfolk.com/lurker/message/20170914.003408.5d4ddfaf.en.html