Author: Andy Smith
Date: 2021-03-20 (taken from the GPG signature timestamp below)
To: announce
Subject: [bitfolk] Major maintenance scheduled for 2021-04-20 (1 server) and 2021-04-27 (another 6 servers)

gpg: Signature made Sat Mar 20 12:34:50 2021 UTC
gpg: using DSA key 2099B64CBF15490B
gpg: Good signature from "Andy Smith <andy@strugglers.net>" [unknown]
gpg: aka "Andrew James Smith <andy@strugglers.net>" [unknown]
gpg: aka "Andy Smith (UKUUG) <andy.smith@ukuug.org>" [unknown]
gpg: aka "Andy Smith (BitFolk Ltd.) <andy@bitfolk.com>" [unknown]
gpg: aka "Andy Smith (Linux User Groups UK) <andy@lug.org.uk>" [unknown]
gpg: aka "Andy Smith (Cernio Technology Cooperative) <andy.smith@cernio.com>" [unknown]
Hello all,

A web version of this email with any updates that have come to light
since posting is available here:

    https://tools.bitfolk.com/wiki/Maintenance/2021-04-Re-racking


== TL;DR

We need to relocate some servers to a different rack within
Telehouse.

On Tuesday 20 April 2021 at some point in the 2 hour window starting
at 20:00Z (21:00 BST) all customers on the following server will
have their VMs either powered off or suspended to storage:

* hen.bitfolk.com

We expect to have it powered back on within 30 minutes.

On Tuesday 27 April 2021 at some point in the 4 hour window starting
at 22:00Z (23:00 BST) all customers on the following servers will
have their VMs either powered off or suspended to storage:

* clockwork.bitfolk.com
* hobgoblin.bitfolk.com
* jack.bitfolk.com
* leffe.bitfolk.com
* macallan.bitfolk.com
* paradox.bitfolk.com

We expect the work on each server to take less than 30 minutes.

See "Frequently Asked Questions" at the bottom of this email for how
to determine which server your VM is on.

If you can't tolerate a ~30 minute outage at these times then please
contact support as soon as possible to ask for your VM to be moved
to a server that won't be part of this maintenance.
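
For anyone outside the UK, the two windows above can be converted to
a local time zone with a few lines of Python. This is only an
illustrative sketch: the start times and window lengths come from
this email, while the `WINDOWS` table and the `local_window` helper
are made-up names for the example.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

# Window start (UTC) and length, as stated in this announcement.
# The dictionary layout and labels are illustrative only.
WINDOWS = {
    "hen.bitfolk.com": (
        datetime(2021, 4, 20, 20, 0, tzinfo=timezone.utc),
        timedelta(hours=2),
    ),
    "other six servers": (
        datetime(2021, 4, 27, 22, 0, tzinfo=timezone.utc),
        timedelta(hours=4),
    ),
}


def local_window(start_utc, length, tz):
    """Return (start, end) of a maintenance window in time zone `tz`."""
    return start_utc.astimezone(tz), (start_utc + length).astimezone(tz)


if __name__ == "__main__":
    # Substitute your own zone name, e.g. "Europe/Berlin".
    tz = ZoneInfo("Europe/London")
    for label, (start, length) in WINDOWS.items():
        begin, end = local_window(start, length, tz)
        print(f"{label}: {begin:%a %d %b %H:%M %Z} to {end:%a %d %b %H:%M %Z}")
```

Note that the second window crosses midnight in UK local time, so it
ends on the morning of 28 April.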

== Maintenance Background

Our colo provider needs to rebuild one of their racks, which houses
7 of our servers. This is required because the infrastructure in the
rack (PDUs, switches etc.) is around ten years old and all needs
replacing. To facilitate this, all customer hardware in that rack
will need to be moved to a different rack or sit outside of the rack
while it is rebuilt. We are going to have to move our 7 servers to a
different rack.

This is a significant piece of work which is going to affect several
hundred of our customers, more than 70% of the customer base.
Unfortunately it is unavoidable.

== Networking upgrade

We will also take the opportunity to install 10 gigabit NICs in the
servers that are moved. The main benefit will be faster inter-server
data transfer when we need to move customer services between
servers. The current 1GE NICs limit this to about 90 MiB/sec.
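
To put that figure in context, here is a rough back-of-the-envelope
calculation in Python. The 90 MiB/sec rate is the one quoted above;
the 100 GiB data size is a hypothetical example, not a real customer
figure, and the 10GE number is only the raw line rate, not a measured
or promised throughput.

```python
def line_rate_mib_per_sec(gigabits_per_sec):
    """Raw line rate in MiB/sec, ignoring all protocol overhead."""
    bytes_per_sec = gigabits_per_sec * 1e9 / 8
    return bytes_per_sec / 2**20


def transfer_seconds(size_gib, rate_mib_per_sec):
    """Time to copy `size_gib` GiB at a sustained `rate_mib_per_sec`."""
    return size_gib * 1024 / rate_mib_per_sec


if __name__ == "__main__":
    print(f"1GE raw line rate:  {line_rate_mib_per_sec(1):.1f} MiB/sec")
    print(f"10GE raw line rate: {line_rate_mib_per_sec(10):.1f} MiB/sec")

    # Hypothetical 100 GiB of data at the ~90 MiB/sec quoted above.
    minutes = transfer_seconds(100, 90) / 60
    print(f"100 GiB at 90 MiB/sec: about {minutes:.0f} minutes")
```

The raw 1GE line rate works out at roughly 119 MiB/sec, so about
90 MiB/sec in practice is consistent with the usual protocol and
storage overheads.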

== Suspend & Restore

If you opt in to suspend & restore then, instead of shutting your VM
down, we will suspend it to storage and restore it when the server
boots again. That means that you should not experience a reboot,
just a period of paused execution. You may find this less disruptive
than a reboot, but it is not without risk. Read more here:

    https://tools.bitfolk.com/wiki/Suspend_and_restore


== Avoiding the Maintenance

If you cannot tolerate a ~30 minute outage during the maintenance
windows listed above then please contact support to agree a time
when we can move your VM to a server that won't be part of the
maintenance.

Doing so will typically take just a few seconds plus the time it
takes your VM to shut down and boot again. Nothing will change about
your VM.

If you have opted in to suspend & restore then we'll use this to do
a "semi-live" migration. This will appear to be a minute or two of
paused execution.

Moving your VM is extra work for us, which is why we're not doing it
by default for all customers. If you prefer that to experiencing the
outage then we're happy to do it at a time convenient to you, as
long as we have time to do it and spare capacity available to move
you to. If you need this then please ask as soon as possible to
avoid disappointment.

It won't be possible to change the date/time of the planned work on
an individual customer basis. This work involves 7 of our servers,
will affect several hundred of our customers, and has also had to be
scheduled with our colo provider and some of their other customers.
The only per-customer thing we may be able to do is move your
service ahead of time at a time convenient to you.

== Rolling Upgrades Confusion

We're currently in a cycle of rolling software upgrades to our
servers. Many of you have already received individual support
tickets to schedule that. It involves us moving your VM from one of
our servers to another and full details are given in the support
ticket.

This has nothing to do with the maintenance that's under discussion
here and we realise that it's unfortunately very confusing to have
both things happening at the same time. We did not know that moving
our servers would be necessary when we started the rolling upgrades.

We believe we can avoid moving any customer from a server that is
not part of this maintenance onto one that will be part of this
maintenance. We cannot avoid moving customers between servers that
are both going to be affected by this maintenance. For example, at
the time of writing, customer services are being moved off
jack.bitfolk.com and most of them will end up on
hobgoblin.bitfolk.com.

== Further Notifications

Every customer is supposed to be subscribed to this announcement
mailing list, but no doubt some aren't. The movement of customer
services between our servers may also be confusing for people, so we
will send a direct email notification to the main contact of
affected customers a week before the work is due to take place.

So, on Tuesday 13 April we'll send a direct email about this to
customers that are hosted on hen.bitfolk.com, and then on Tuesday 20
April we'll send a similar email to customers on all the rest of the
affected servers.

== 20 April Will Be a Test Run

We are only planning to move one server on 20 April. The reasons for
this are:

* We want to check our assumptions about how long this work will
take, per server.
* We're changing the hardware configuration of the server by adding
10GE NICs, and we want to make sure that configuration is stable.

The timings for the maintenance on 27 April may need to be altered
if the work on 20 April shows our guesses to be wildly wrong.

== Frequently Asked Questions

=== How do I know if I will be affected?

If your VM is hosted on one of the servers that will be moved then
you are going to be affected. There are a few different ways that
you can tell which server you are on:

1. It's listed on https://panel.bitfolk.com/
2. It's in DNS when you resolve <youraccountname>.console.bitfolk.com
3. It's on your data transfer email summaries
4. You can see it on a `traceroute` or `mtr` to or from your VPS.
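
As an illustration of option 2, the lookup can be scripted in Python.
This is a sketch under assumptions: it assumes the console name is an
alias whose canonical name is the host server, which this email does
not spell out, so treat the result as a hint and confirm it against
the panel. The helper names `console_name`, `find_server` and
`is_affected` are invented for the example; the server list is the
one given in this announcement.

```python
import socket

# Servers named in this announcement as part of the maintenance.
AFFECTED = {
    "hen.bitfolk.com",
    "clockwork.bitfolk.com",
    "hobgoblin.bitfolk.com",
    "jack.bitfolk.com",
    "leffe.bitfolk.com",
    "macallan.bitfolk.com",
    "paradox.bitfolk.com",
}


def console_name(account):
    """Build the console DNS name for an account name."""
    return f"{account}.console.bitfolk.com"


def find_server(account, resolve=socket.gethostbyname_ex):
    """Return the canonical host name the console name resolves to.

    Assumption: the canonical name is the host server's name.
    `resolve` can be swapped out, e.g. for offline testing.
    """
    canonical, _aliases, _addresses = resolve(console_name(account))
    return canonical.rstrip(".").lower()


def is_affected(server):
    """True if `server` is one of the servers being moved."""
    return server.rstrip(".").lower() in AFFECTED
```

Calling `is_affected(find_server("youraccountname"))` with your own
account name would then report whether the resolved server is on the
list above.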

=== If you can "semi-live" migrate VMs, why don't you just do that?

* This maintenance will involve some 70% of our customer base, so we
don't actually have enough spare hardware to move customers to.
* Moving the data takes significant time at 1GE network speeds.

For these reasons we think that it will be easier for most customers
to just accept a ~30 minute outage. Those who can't tolerate such a
disruption will be able to have their VMs moved to servers that
aren't going to be part of the maintenance.

=== Why do you need to test adding 10GE NICs to a live server? Isn't it already tested?

The main reason for running through this process first on one server
only (hen) is to check timings and procedure before doing it on
another six servers all at once. The issue of installing 10GE NICs
is a secondary concern and considered low risk.

The hardware for all 7 of the servers that are going to be moved is
now obsolete, so it's not possible to obtain identical spares. The
10GE NICs have been tested in general, but not with this specific
hardware, so it's just an extra cautionary measure.

The 10GE NICs will not be in use immediately, in order to avoid too
much change at once. This does still involve plugging in a PCIe card
which will load a kernel module on boot, though, so while the risk
is considered low, it's not zero.

=== Further questions?

If there's anything we haven't covered or that you need clarified,
please ask here or contact support privately.

--
https://bitfolk.com/ -- No-nonsense VPS hosting