Re: [bitfolk] Host "hen" crashed again (Was: Re: Host "hen" …

Author: Johnathon Tinsley
Date:  
To: users
Subject: Re: [bitfolk] Host "hen" crashed again (Was: Re: Host "hen" unexpectedly rebooted 2018-11-26 22:24)
This unfortunately has happened again today, at about 14:23Z.
> This time I was logging the serial console to a file and so am able
> to see that there was the equivalent of a kernel panic in the
> hypervisor.
>
> That is, I do not believe that hen's hardware is at fault. I think
> it's tripping against a bug in Xen, and it's happened to the same
> host twice because it's been triggered by the same guest doing
> something (I do not believe malicious at this stage).
>
> I've not got a quick fix to this because moving all customers on hen
> to new hardware is likely just going to crash the hypervisor on the
> other hardware. I need to discuss the problem with the Xen
> developers and see if I get anywhere.
>
> In between last time and this I also built a new version of the
> hypervisor and set every host to boot into it, so hen is now
> actually running a very slightly newer version than everything else
> (and also compared to what it was running before). This possibly
> could help, just by chance, though as far as I am aware it is not a
> known bug.
>
> So I am very sorry but I am going to have to ask you to bear with me
> for a little while, while I investigate this more. Until I can
> establish which guest triggered it I can't move any of the customers
> on host hen to other hosts because that possibly just triggers it
> elsewhere. And it could still happen elsewhere anyway.
>
> If I don't make headway with this then I can revert to earlier
> versions that we've been stable on for a long time, but security
> issues have been fixed since then so I'm not going to do that except
> as a last resort.
>
> I will provide more information as soon as I can.
>
> Thanks,
> Andy
>


If you have a spare HV, why not try to identify the guest involved by
moving half the guests off hen? Whichever HV crashes next has the errant
guest; move half of that HV's guests away and see which HV crashes. Continue
until you've identified the guest, or have a small enough number that it's
worth contacting the users to ask what their guests were doing at the time
of the panic.
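
To make the idea concrete, here is a rough sketch of the halving procedure in Python. It is only an illustration: the guest names, the `crashed_half` callback (standing in for "split the guests across two HVs, wait, and note which HV panicked") and the `stop_at` threshold are all made up for the example, and it assumes a single guest is the trigger and that the crash recurs.

```python
from typing import Callable, List, Sequence


def bisect_guests(
    guests: Sequence[str],
    crashed_half: Callable[[List[str], List[str]], int],
    stop_at: int = 1,
) -> List[str]:
    """Narrow down the suspect guests by repeatedly halving the set.

    crashed_half(stay, moved) is a hypothetical observation step: keep
    `stay` on the current HV, move `moved` to the spare HV, wait for the
    next crash, and return 0 if the HV holding `stay` crashed or 1 if the
    HV holding `moved` crashed.
    """
    stop_at = max(stop_at, 1)  # a single guest cannot be split further
    suspects = list(guests)
    while len(suspects) > stop_at:
        mid = len(suspects) // 2
        stay, moved = suspects[:mid], suspects[mid:]
        # Keep only the half whose HV crashed; the other half is cleared.
        suspects = stay if crashed_half(stay, moved) == 0 else moved
    return suspects


# Simulated run: pretend "vm11" is the guest that triggers the panic.
guests = [f"vm{i}" for i in range(16)]
culprit = "vm11"
oracle = lambda stay, moved: 0 if culprit in stay else 1

single = bisect_guests(guests, oracle)
shortlist = bisect_guests(guests, oracle, stop_at=4)
```

With N guests this takes roughly log2(N) rounds, and each round costs one more crash, so stopping early at a shortlist (the `stop_at` argument) and contacting those users may be the kinder trade-off.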