Author: Andy Smith
To: announce
Subject: [bitfolk] 2021-07-03 ~05:23Z - ~06:08Z: Emergency reboot of server "jack"

At about 05:23Z we started receiving alerts for customer services on
server "jack". There had been some alerts for about 40 minutes
before that, but they weren't serious enough to send push
notifications, only emails.

On investigation it quickly became apparent that this was the
intermittent "I/O stall" problem we've been seeing on all servers
and have been grappling with for months now.

All I could do was power cycle the server, which happened at about

My current line of investigation is to upgrade both the hypervisor
and the kernel when this happens, and so far it hasn't reoccurred on
any of the servers where that has been done, though the sometimes
months-long gap between incidents means it's not possible to be
sure yet.

With the upgrades done, the server was rebooted again and at about
05:54Z customer VMs started booting again. This was complete by
about 06:08Z.


