Author: Andy Smith
Date:  
To: announce
Subject: Re: [bitfolk] Reboots will be necessary to address security issues, probably early hours 19/20/21 September

gpg: Signature made Mon Sep 21 03:01:13 2020 UTC
gpg: using DSA key 2099B64CBF15490B
gpg: Good signature from "Andy Smith <andy@strugglers.net>" [unknown]
gpg: aka "Andrew James Smith <andy@strugglers.net>" [unknown]
gpg: aka "Andy Smith (UKUUG) <andy.smith@ukuug.org>" [unknown]
gpg: aka "Andy Smith (BitFolk Ltd.) <andy@bitfolk.com>" [unknown]
gpg: aka "Andy Smith (Linux User Groups UK) <andy@lug.org.uk>" [unknown]
gpg: aka "Andy Smith (Cernio Technology Cooperative) <andy.smith@cernio.com>" [unknown]
Hello,

On Tue, Sep 08, 2020 at 04:51:52PM +0000, Andy Smith wrote:
> Unfortunately some serious security bugs have been discovered in the
> Xen hypervisor and fixes for these have now been pre-disclosed, with
> an embargo that ends at 1200Z on 22 September 2020.
>
> As a result we will need to apply these fixes and reboot everything
> before that time. We are likely to do this in the early hours of the
> morning UK time, on 19, 20 and 21 September.


This maintenance work has now been completed, without incident. The
details of the security issues which were fixed will appear at:

    https://xenbits.xen.org/xsa/


after 1200Z on 22 September. We also took the opportunity to upgrade
CPU microcode where available.

Thanks for your patience during this disruption.

The rest of this email is some comments about suspend and restore so
if you have no interest in that it's safe to stop reading now.

During the course of this work 3 VMs were almost-live migrated¹. All
three worked fine.

96 VMs were suspended and restored². 94 of them appeared to cope
fine; 2 failed to restore properly.

One of the failures was a Debian buster VPS which didn't respond to
pings after restore. This was noticed by monitoring and the VPS was
then cleanly shut down and booted, after which it worked. Many
Debian buster VMs were suspended and restored so I do not think this
is a general problem with the kernel in buster but perhaps something
with the particular kernel modules in use in that case.

The other failure was an Ubuntu 16.04 VPS. Unfortunately this one
continued to respond to pings even though every process was hung, so
it was not noticed until the customer investigated many hours later.
They then had to use the Xen Shell "destroy" command and boot it again.

When customers opt in to suspend and restore we add a ping monitor so
we stand some chance of noticing if the restore fails, and can then
take action on their behalf. Clearly there are failure modes where the
kernel is able to respond to a ping but some or all processes don't
work properly. If you opt in, it would be a good idea to ask us for
additional checks of whatever services you are running.
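
As a rough illustration of what a check beyond ping could look like,
here is a minimal sketch in Python. It is not how our monitoring is
implemented; the function name, its parameters and the example host
below are all hypothetical. The idea is to require the service to
actually send some data back, because the kernel can complete a TCP
connection on behalf of a listening process that is itself hung.

```python
import socket


def service_responds(host, port, probe=b"", timeout=5.0):
    """Return True if the service at host:port sends back at least one byte.

    Hypothetical helper for illustration only. A successful TCP connect
    is not treated as success, since the kernel can accept connections
    into the listen backlog while the userland process is hung.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            if probe:
                # Some protocols (e.g. HTTP) stay silent until asked something.
                sock.sendall(probe)
            # An empty read means the peer closed without replying.
            return len(sock.recv(1)) > 0
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
```

For a service that greets the client first, such as SSH, something like
service_responds("vps.example.com", 22) would be enough. For HTTP you
would need to pass a request as the probe, for example
probe=b"HEAD / HTTP/1.0\r\n\r\n".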

We're not really in a position to actively debug suspend and restore
problems aside from recommending that as new a kernel as
possible/convenient is used. We can certainly provide information if
any of you want to open a bug report with your Linux distribution or
the upstream Linux kernel.

You can learn more about suspend and restore here:

    https://tools.bitfolk.com/wiki/Suspend_and_restore


Cheers,
Andy

¹ This involves syncing the storage and a dump of the memory image
between servers, so typically involves a pause in execution of
30–60 seconds. It is still experimental so we won't do it unless
you specifically ask and we have patched destination hardware
available.

² Memory dumped to storage, restored again after the bare metal host
is rebooted. Typically involves a pause in execution of 10–20
minutes. We will use this method if you opt in to it from:

    https://panel.bitfolk.com/account/config/


--
https://bitfolk.com/ -- No-nonsense VPS hosting
_______________________________________________
announce mailing list
announce@???
https://lists.bitfolk.com/mailman/listinfo/announce