Author: Andy Smith
Date:  
To: announce
Subject: [bitfolk] 2021-06-09 ~23:33Z - 2021-06-10 ~00:16Z: Emergency reboot of server "clockwork"

gpg: Signature made Thu Jun 10 00:57:29 2021 UTC
gpg: using DSA key 0E4236CB52951E14536066222099B64CBF15490B
gpg: Good signature from "Andy Smith <andy@strugglers.net>" [unknown]
gpg: aka "Andrew James Smith <andy@strugglers.net>" [unknown]
gpg: aka "Andy Smith (UKUUG) <andy.smith@ukuug.org>" [unknown]
gpg: aka "Andy Smith (BitFolk Ltd.) <andy@bitfolk.com>" [unknown]
gpg: aka "Andy Smith (Linux User Groups UK) <andy@lug.org.uk>" [unknown]
gpg: aka "Andy Smith (Cernio Technology Cooperative) <andy.smith@cernio.com>" [unknown]
Hi,

At around 00:33 BST (23:33Z) we started to receive alerts regarding
services on host "clockwork". Upon investigation it was showing all
the signs of being the intermittent "frozen I/O" problem we've been
having:

    https://lists.bitfolk.com/lurker/message/20210425.071102.9d9a1cc5.en.html


As mentioned in that earlier email, I'd decided that the next step
would be to prepare new hypervisor packages, and I did that the next
day.

As this issue seems to happen only every few months and on different
servers we do not yet know if the new packages fix the problem.
They've been in use on other servers since late April without
incident, but that isn't yet proof enough given the long periods
between occurrences.

Anyway, after "clockwork" was power cycled the new packages were
installed there and then all VMs were started again. This was
completed by about 01:16 BST (00:16Z).

There are still many of our servers where we know this is going to
happen again at some point. I don't feel comfortable scheduling
maintenance to upgrade them when I don't know if the upgrade will be
effective. If we can go a significant period of time on the newer
version without incident then we will schedule a maintenance window
to get the remaining servers on those versions too. It is also
possible that a security patch will force maintenance, in which
case we'll upgrade the hypervisor packages to the newer version at
the same time.

There are also some servers still left to be emptied so that their
operating systems can be upgraded. Those are "hen" and "paradox".
Once they are emptied and upgraded they will of course be put on the
newer version of the hypervisor. It is expected that customer
services we move from these servers will be put on servers that
already have the newer hypervisor version.

Thank you for your patience and I apologise for the disruption. I'm
doing all that I can to try to find a solution.

Thanks,
Andy

--
https://bitfolk.com/ -- No-nonsense VPS hosting
_______________________________________________
announce mailing list
announce@???
https://lists.bitfolk.com/mailman/listinfo/announce