Author: Andy Smith
Date:  
To: announce
Subject: [bitfolk] 2021-06-19 ~20:37Z - 2021-06-19 ~21:33Z: Emergency reboot of server "hobgoblin"

gpg: Signature made Sat Jun 19 22:15:51 2021 UTC
gpg: using DSA key 0E4236CB52951E14536066222099B64CBF15490B
gpg: Good signature from "Andy Smith <andy@strugglers.net>" [unknown]
gpg: aka "Andrew James Smith <andy@strugglers.net>" [unknown]
gpg: aka "Andy Smith (UKUUG) <andy.smith@ukuug.org>" [unknown]
gpg: aka "Andy Smith (BitFolk Ltd.) <andy@bitfolk.com>" [unknown]
gpg: aka "Andy Smith (Linux User Groups UK) <andy@lug.org.uk>" [unknown]
gpg: aka "Andy Smith (Cernio Technology Cooperative) <andy.smith@cernio.com>" [unknown]
Hi,

At about 20:37Z we started receiving alerts for customer services on
server "hobgoblin". It quickly became apparent that this was the
intermittent "I/O stall" problem we've been seeing on all servers
and have been grappling with for months now.

All I could do was power cycle the server, which happened at about
20:58Z.

We're still not able to reproduce the problem on demand and it can
be several months between incidents. We've tried upgrading the
hypervisor and that hasn't helped. It's looking more like a problem
in the Linux kernel, so I upgraded that as well, to a newer
self-made package.

I've been communicating with a couple of the linux-raid devs and we
have some ideas, but gathering information and making changes is
going slowly because of the lack of reproducibility and the long
time between incidents. It's basically a case of making a single
change any time there is an issue.

With the upgrades done, the server was rebooted again and at about
21:19Z customer VMs started booting again. This was complete by
about 21:33Z.

Obviously I am not happy with these outages and I'm doing everything
I can to find the root cause.

Cheers,
Andy

--
https://bitfolk.com/ -- No-nonsense VPS hosting
_______________________________________________
announce mailing list
announce@???
https://lists.bitfolk.com/mailman/listinfo/announce