[bitfolk] 2021-07-19 ~19:30Z - ~20:45Z: Emergency reboot of …

Author: Andy Smith
Date:  
To: announce
Subject: [bitfolk] 2021-07-19 ~19:30Z - ~20:45Z: Emergency reboot of server "limoncello"

gpg: Signature made Mon Jul 19 21:11:32 2021 UTC
gpg: using DSA key 0E4236CB52951E14536066222099B64CBF15490B
gpg: Good signature from "Andy Smith <andy@strugglers.net>" [unknown]
gpg: aka "Andrew James Smith <andy@strugglers.net>" [unknown]
gpg: aka "Andy Smith (UKUUG) <andy.smith@ukuug.org>" [unknown]
gpg: aka "Andy Smith (BitFolk Ltd.) <andy@bitfolk.com>" [unknown]
gpg: aka "Andy Smith (Linux User Groups UK) <andy@lug.org.uk>" [unknown]
gpg: aka "Andy Smith (Cernio Technology Cooperative) <andy.smith@cernio.com>" [unknown]
Hi,

At about 19:30Z we started receiving alerts for customer services on
server "limoncello".

On investigation it quickly became apparent that this was the
intermittent "I/O stall" problem we've been seeing on all servers
and have been grappling with for months now.

All I could do was power cycle the server.

My current line of investigation is to upgrade both the hypervisor
and the kernel when this happens. So far it hasn't recurred on any
of the servers where that has been done, though the sometimes
months-long gap between incidents means it's not possible to be
sure.

Although this last happened 16 days ago, that was on a different
server ("jack").

With the upgrades done, the server was rebooted again, and at about
20:28Z customer VMs started booting. This was complete by about
20:45Z.

Cheers,
Andy

--
https://bitfolk.com/ -- No-nonsense VPS hosting
_______________________________________________
announce mailing list
announce@???
https://lists.bitfolk.com/mailman/listinfo/announce