Author: Andy Smith
Date:  
To: announce
Subject: [bitfolk] 2021-05-15 ~21:00 BST (20:00Z) onwards forced reboot of host "jack"

gpg: Signature made Sat May 15 21:11:28 2021 UTC
gpg: using DSA key 0E4236CB52951E14536066222099B64CBF15490B
gpg: Good signature from "Andy Smith <andy@strugglers.net>" [unknown]
gpg: aka "Andrew James Smith <andy@strugglers.net>" [unknown]
gpg: aka "Andy Smith (UKUUG) <andy.smith@ukuug.org>" [unknown]
gpg: aka "Andy Smith (BitFolk Ltd.) <andy@bitfolk.com>" [unknown]
gpg: aka "Andy Smith (Linux User Groups UK) <andy@lug.org.uk>" [unknown]
gpg: aka "Andy Smith (Cernio Technology Cooperative) <andy.smith@cernio.com>" [unknown]
Hi,

Tonight from around 21:00 BST onwards we started getting alerts and
support tickets regarding customer services on host
jack.bitfolk.com.

I had a look and unfortunately it appears to be a recurrence of the
stalled IO issue described in these earlier posts:

    https://lists.bitfolk.com/lurker/message/20210425.071102.9d9a1cc5.en.html
    https://lists.bitfolk.com/lurker/message/20210220.032844.00dc9600.en.html
    https://lists.bitfolk.com/lurker/message/20201116.003514.25278824.en.html


The host was largely unresponsive, so I had to forcibly reboot it.
All customer services were booted or in the process of booting by
about 21:53 BST.

What we know so far:

- It must be a software issue, not a hardware issue, as it's
happened on multiple servers of different specifications.

- It's only happening with servers that we've upgraded to Xen
version 4.12.x.

- It's going to be really difficult to track down because there can
be months between occurrences.

Each time this has occurred I've made some change that I'd hoped
would lead to a solution, but I've now tried all the easy things and
so all that remains is to do another software upgrade.

I think we're going to have to build some packages for Xen 4.14,
install them on a test host and see how it goes. The difficulty is
that even if that seems to work, we won't really know whether the
problem is fixed, because we could just be in one of the long
periods where it is not triggered. Clearly, once we have
seemingly-working packages we can't leave them running on a test
host for 6 months just to reassure ourselves.

I am also unsure whether it is a good idea to force additional
downtime on you in order to upgrade servers to 4.14.x when I don't
yet know if that will fix the issue. What I can do is have the
upgrade ready and then, if/when the issue recurs, do the upgrade at
that point so the host boots into 4.14.x.

Anyway, all I can say is that this is a really unfortunate state of
affairs that I'm obviously not happy with, and I'm doing all that I
can to resolve it. These outages are unacceptable, and rest assured
they are aggravating me more than anyone else.

Thanks,
Andy Smith
BitFolk Ltd

--
https://bitfolk.com/ -- No-nonsense VPS hosting