[bitfolk] 2021-07-03 ~05:23Z - ~06:08Z: Emergency reboot of …

Author: Andy Smith
Date:  
To: announce
Subject: [bitfolk] 2021-07-03 ~05:23Z - ~06:08Z: Emergency reboot of server "jack"

gpg: Signature made Sat Jul 3 13:46:59 2021 UTC
gpg: using DSA key 0E4236CB52951E14536066222099B64CBF15490B
gpg: Good signature from "Andy Smith <andy@strugglers.net>" [unknown]
gpg: aka "Andrew James Smith <andy@strugglers.net>" [unknown]
gpg: aka "Andy Smith (UKUUG) <andy.smith@ukuug.org>" [unknown]
gpg: aka "Andy Smith (BitFolk Ltd.) <andy@bitfolk.com>" [unknown]
gpg: aka "Andy Smith (Linux User Groups UK) <andy@lug.org.uk>" [unknown]
gpg: aka "Andy Smith (Cernio Technology Cooperative) <andy.smith@cernio.com>" [unknown]
Hi,

At about 05:23Z we started receiving alerts for customer services on
server "jack". There had been some alerts for about 40 minutes
before that, but they weren't serious enough to send push
notifications, only emails.

On investigation it quickly became apparent that this was the
intermittent "I/O stall" problem we've been seeing on all servers
and have been grappling with for months now.

All I could do was power cycle the server, which happened at about
05:30Z.

My current line of investigation is to upgrade both the hypervisor
and the kernel whenever this happens. So far the problem hasn't
recurred on any of the servers where that has been done, though the
sometimes months-long gap between incidents means it's not possible
to be sure.

With the upgrades done, the server was rebooted again and at about
05:54Z customer VMs started booting again. This was complete by
about 06:08Z.

Cheers,
Andy

--
https://bitfolk.com/ -- No-nonsense VPS hosting