
Author: Andy Smith
To: announce
Subject: [bitfolk] Emergency reboot of jack.bitfolk.com, 2021-04-25 ~06:47Z

gpg: Signature made Sun Apr 25 07:11:02 2021 UTC
gpg: using DSA key 2099B64CBF15490B
gpg: Good signature from "Andy Smith <andy@strugglers.net>" [unknown]
gpg: aka "Andrew James Smith <andy@strugglers.net>" [unknown]
gpg: aka "Andy Smith (UKUUG) <andy.smith@ukuug.org>" [unknown]
gpg: aka "Andy Smith (BitFolk Ltd.) <andy@bitfolk.com>" [unknown]
gpg: aka "Andy Smith (Linux User Groups UK) <andy@lug.org.uk>" [unknown]
gpg: aka "Andy Smith (Cernio Technology Cooperative) <andy.smith@cernio.com>" [unknown]

After receiving a number of alerts for VMs hosted on server "jack",
I investigated and found the server largely unresponsive.
Unfortunately I had no option but to forcibly reboot it, which I did
at about 06:47Z.

It's now 07:01Z and monitoring says everything is back up, except
for two customer VMs which are waiting for a LUKS passphrase on
their console.

This is the same problem that affected some of the other servers a
few months ago. With the months-long gap I had hoped it was some
undiagnosed kernel issue which we had got past, but apparently not,
as "jack" is on the latest available kernel package.

I'm pursuing some ideas about a config change that may help, and I
managed to put that in place on "jack" before it was rebooted. The
change requires a reboot to take effect, so if it does help, the
other servers won't benefit until their next reboot. On the other
hand it doesn't hurt either, so I've made the same change elsewhere
also.

If that doesn't fix things then the next line of investigation will
be an upgrade of the hypervisor to the latest stable release, though
that is a rather major undertaking.

Apologies for the disruption. It is challenging to debug a problem
that can take several months to occur, with no reliable way of
triggering it. :(


https://bitfolk.com/ -- No-nonsense VPS hosting