Author: Andy Smith
Date:  
To: users
Subject: [bitfolk] We need to make some decisions about suspend+restore

Hello,

TL;DR: There were some serious problems with suspend+restore during
the last maintenance and 2 customer VMs were shredded. It looks like
that was due to a kernel bug on the guest side which was fixed in
Linux v4.2, but until we can test that hypothesis we won't be doing
any more suspend+restore. If that holds true then we have to decide
how/if to reintroduce the feature. Unless you have an interest in
this topic you can ignore this email. There's nothing you can or
should do at this stage, though regardless of this you should of
course keep your operating system up to date.

Detailed version follows:

As you're probably aware, we allow customers to opt in to something
we call "suspend and restore". There's more info about that here:

    https://tools.bitfolk.com/wiki/Suspend_and_restore


A summary is: if you opt in to it then any time we need to shut the
bare metal host down or move your VM between bare metal hosts, we
save your VM's state to storage and then restore it again
afterwards. To the VM the effect is a period of paused execution:
everything remains running, and often even TCP sessions stay
alive. It's a lot less disruptive than a shutdown and boot.
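
For the curious, under Xen this is conceptually just the following
on the host (a sketch only; the guest name and state file path are
made up):

# xl save myguest /var/lib/xen/save/myguest.state
  ... host maintenance or reboot happens here ...
# xl restore /var/lib/xen/save/myguest.state

"xl save" pauses the guest and writes its memory and device state
out to the file; "xl restore" recreates the guest from that file
and execution carries on from where it stopped.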

It hasn't always worked perfectly. Sometimes, usually with older
kernels, what is restored doesn't work properly. It locks up or
spews error messages on the console and has to be forcibly
terminated and then booted. These problems have tended to be
deterministic, i.e. either your kernel has problems or it's fine,
and this is repeatable, so when someone has had problems we've
advised them to opt back out of suspend+restore.

Also, when there have been problems, it hasn't been destructive.
I've never seen the on-boot fsck say more than "unclean shutdown".
So I've always regarded suspend+restore as pretty safe.

All BitFolk infrastructure VMs use suspend+restore; that's
currently 35 VMs. I've done it several hundred times at this point,
and before this week the failure rate was very low and the failure
mode not too terrible.

Suspend+restore was used during the maintenance this week on the
first two servers, "clockwork" and "hobgoblin", for customers who
opted in (and BitFolk stuff). There was nothing to indicate problems
with the 10 VMs on "clockwork" that were suspended and restored.

Of the 17 VMs on "hobgoblin" that had opted in, two of them failed
to restore and were unresponsive. Destroy and boot was the only
option left, and then it was discovered that they had both suffered
serious filesystem corruption which was not recoverable.

Losing customer data is the worst. It hasn't happened in something
like 8 years, and even then it was down to (my) human error, not a
bug in the software we use to provide the service.

After that happened we did not honour requests to suspend+restore
for the remainder of the maintenance, and the work on the other
servers proceeded without further incident.

We've tracked the bug down to this fix:

    https://lore.kernel.org/lkml/1437449441-2964-1-git-send-email-bob.liu@oracle.com/


This is fixing a bug in the Linux kernel on the guest side. It's in
the block device driver that Xen guests use. It made its way
upstream in v4.2-rc7 of the Linux kernel.
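
If you want a rough idea of whether your guest kernel is new enough
(rough, because distributions sometimes backport fixes without
changing the version number), you can compare "uname -r" against
4.2 using a version sort. A sketch:

$ [ "$(printf '%s\n' 4.2 "$(uname -r)" | sort -V | head -n1)" = "4.2" ] \
      && echo "4.2 or newer" || echo "older than 4.2"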

If I understand it correctly, across a migration (and a
suspend+restore is effectively a migration onto the same host) this
change is necessary for a guest to notice when a particular
feature/capability of the backend has changed, and to act
accordingly.

So, without the patch there can be a mismatch between what the
backend driver on the bare metal host and the frontend driver in
your VM understand about the protocol they use, and that is what
caused the corruption for these two VMs.
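
If you're curious, you can see what the two sides have negotiated
by reading the feature keys out of xenstore from inside your guest.
This is a sketch: it assumes the xenstore-read utility from your
distribution's Xen utilities package, and that your first disk is
Xen device 51712 (xvda); the values shown are illustrative:

# backend="$(xenstore-read device/vbd/51712/backend)"
# xenstore-read "$backend/feature-persistent"
1
# xenstore-read "$backend/feature-max-indirect-segments"
256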

It is believed that this has never been seen before because we only
recently upgraded our host kernels from 4.19 to 5.10, which uses
some new features. There hadn't been any suspending and restoring
with those newer host kernels until this last week. I did test it
beforehand, but admittedly not with any EOL guest operating
systems.

The two VMs were running obsolete kernels: a Debian 8 (jessie) 3.16
kernel and a CentOS 6 2.6.32 kernel. These kernels will never
receive official backports of that patch because they're out of
support by their distributors.

Since I reported this issue, another person has told me that
they've now tested migration with Debian 8 guests and it breaks
every time for them, sometimes with disastrous consequences like
those we saw.

I am at this time unable to explain why several other customer VMs
did not suffer the same fate, though I am of course glad that they
did not. Of the 27 VMs that were suspended and restored across the
two servers, many were running kernels as old as these or even
older, but only 2 exploded like this.

So what are we going to do about it?

I think initially we are of course not going to be able to use
suspend+restore any more even if you have opted in to it. We just
won't honour that setting for now.

Meanwhile I think I will test the hypothesis that it's okay with
guest kernels of 4.2 or newer. Obviously if it's not then the
feature
remains disabled. But assuming I cannot replicate this issue with
kernels that have the fix, then we have to decide if and how we're
going to reintroduce this feature.

The rest of this email assumes that guest kernels of 4.2+ are okay,
in which case I am minded to:

1. Reset everyone back to opting out

2. Add a warning to the opt-in setting saying that it mustn't be
used with kernels older than 4.2 because of this known and serious
bug

3. Post something on the announce list saying to opt back in again
if you want (with details of what's been tested etc.)

We can't easily tell what kernels people are running so we don't
have the option of disabling it just for those running older
kernels. There are 85 customer VMs that have opted in to
suspend+restore.

When the time comes we can perhaps do some testing for people who
are interested in re-enabling it but want more reassurance. I can
imagine that this testing would take the form of the following (a
sketch of the host-side commands appears after the list):

1. Snapshot your storage while your VM is running

2. Suspend+restore your VM.

3. If it works, great. If it explodes then we roll back your
snapshot, which would take a couple of minutes. To your operating
system this would look like a power off or abrupt crash, which is
something that all software should be robust against, but
occasionally isn't.
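
On our side that would look something like the following, assuming
LVM-backed storage (the volume group, guest name, snapshot size and
file path are all illustrative):

# lvcreate --snapshot --size 10G --name myguest-snap /dev/vg0/myguest
# xl save myguest /var/lib/xen/save/myguest.state
# xl restore /var/lib/xen/save/myguest.state
  ... and if the restored guest turns out to be broken ...
# xl destroy myguest
# lvconvert --merge /dev/vg0/myguest-snap

Once the merge completes, the guest can be booted again from its
pre-test state.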

I'm undecided on whether it will be worth sending a direct email to
the contacts of those 85 VMs with the background info and offer of
this testing or whether just posting something to the announce list
will be enough.

If you are running a kernel this old then I would of course always
recommend an upgrade or reinstall anyway, regardless of this. You
don't have any security support in your kernel and it's at least 6
years since its release.

On Debian jessie it is fairly easy to just install the
jessie-backports kernel (a 4.9 kernel):

    https://www.lucas-nussbaum.net/blog/?p=947


Summary:

# echo 'Acquire::Check-Valid-Until "false";' > /etc/apt/apt.conf.d/99no-check-valid-until
# echo 'deb http://archive.debian.org/debian/ jessie-backports main' > /etc/apt/sources.list.d/backports.list
# apt update
# apt install linux-image-amd64/jessie-backports

or:

# apt install linux-image-686/jessie-backports

Doing this will (a) disable the Valid-Until expiry check for all of
your package sources (archive.debian.org no longer refreshes its
Release files, so without this apt refuses to use it), and (b)
still leave you on a kernel that has no security support. But, you
already were in that position.
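
Once the backports kernel is installed and you've rebooted into it,
you can check that you're now past 4.2 (the exact version string
will vary; this one is illustrative):

$ uname -r
4.9.0-0.bpo.2-amd64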

No one needs to take any action at this stage because we just won't
be doing any more suspend+restore until we know more, and probably
the next step after that will be to opt everyone back out and tell
you that.

Your comments and thoughts on all of this are welcome.

Cheers,
Andy

--
https://bitfolk.com/ -- No-nonsense VPS hosting