Author: Conrad Wood
To: users, announce
Subject: Re: [bitfolk] Server "elephant" has crashed a few times, ongoing problems
On Fri, 2020-10-23 at 16:37 +0000, Andy Smith wrote:
> Hi,
>
> On Fri, Oct 23, 2020 at 11:46:11AM +0000, Andy Smith wrote:
> > On Fri, Oct 23, 2020 at 11:19:21AM +0000, Andy Smith wrote:
> > > I'm trying to isolate the issue to one particular VM because if a
> > > guest can crash the host then it's a bug in the hypervisor and
> > > just moving guests around won't solve the problem.
> >
> > I can't find it. As we have had problems with elephant before I'm
> > going to assume hardware problem and start moving customer VMs to
> > other hosts.
>
> While moving customer VMs to other hosts, booting one of them caused
> server "macallan" to crash in exactly the same way. So, I am ruling
> out hardware issues with "elephant".
>
> By preventing this particular VM from booting I was able to boot all
> of the other VMs on "macallan". I have some hope that it is just
> this one VM that is tickling a particularly nasty bug.
>
> I am going to try now starting the remainder of VMs on "elephant".
>
> If that is successful I will then take the suspect VM to test
> hardware to see if I can further reproduce.
>
> I am confused because I am sure I tried reverting last weekend's
> hypervisor upgrade to the previous version while investigating
> matters on "elephant", yet it still crashed. Possibly I made a
> mistake (e.g. booted with wrong hypervisor).
>
> Also, everything obviously booted up fine last weekend when I did
> the maintenance so possibly this customer has found a new and
> unrelated bug.
>
> The best case at this point is that I can reproduce the problem with
> just that one VM, report it, get it fixed and then have to reboot
> everything to deploy the fix.
>
> Cheers,
> Andy


Interesting - could you share the kernel panic or does it contain too
much sensitive information?

Conrad