[bitfolk] Re Robert's VPS Setup Guide

Author: Keith Williams
Date:  
Subject: [bitfolk] Re Robert's VPS Setup Guide
Subject: Re: [bitfolk] Sudden time shifts with ntpd
X-List-Received-Date: Tue, 20 Sep 2011 20:20:18 -0000


Hi Mathew,

On Tue, Sep 20, 2011 at 08:04:30PM +0100, Mathew Newton wrote:
> You may recall recently a couple of us had time issues on Dunkel and I
> gave your suggestion a try:


[...]

> Unfortunately the problem is back!


It's now looking like a problem with that host (dunkel). I suspect
a software rather than a hardware problem, because it ran for over
a year under Xen 3.x before it was recently taken out of service
and upgraded to Xen 4.x.

Yours is not the only VM on there to have seen a sudden 47s skew in
time; two of BitFolk's own VMs saw something similar at around the
same time (15:54Z today):

panel0:
Sep 20 16:00:07 panel0 ntpd[1802]: no servers reachable
[...]
Sep 20 16:11:39 panel0 ntpd[1802]: time reset -46.869813 s

resntp2:
Sep 20 15:54:56 resntp2 heartbeat: [1407]: WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 46000 ms (> 510 ms) before being called (GSource: 0x86a9720)
Sep 20 15:54:56 resntp2 heartbeat: [1407]: WARN: Late heartbeat: Node resntp2: interval 47000 ms
Sep 20 15:54:56 resntp2 heartbeat: [1407]: WARN: node resntp0: is dead
Sep 20 15:54:56 resntp2 heartbeat: [1407]: WARN: node resntp1: is dead
Sep 20 15:54:56 resntp2 heartbeat: [1407]: WARN: node resntp3: is dead

(yes, heartbeat got *really* upset and caused a cluster failover)

There are only three BitFolk VMs on that server: panel0, resntp2 and
spamd0. There's an ntpd on all of the VMs, but only the ntpd on
panel0 complained. On resntp2 only heartbeat complained, not ntpd.
Nothing complained on spamd0, but that could just be because it runs
no software that would notice.
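
For a VM that runs nothing which would notice, a rough sketch of a
watcher might look like the following. It is only an illustration:
the function names, the 10 s sleep, the 5 s tolerance and the
"clockwatch" syslog tag are all made up for this example, and it
assumes a date(1) that supports +%s.

```shell
# Succeeds when the wall clock moved by noticeably more or less than
# the time we expected to pass.
# Arguments: PREV NOW EXPECTED TOLERANCE (all in whole seconds).
gap_is_suspicious() {
    drift=$(( $2 - $1 - $3 ))
    [ "$drift" -gt "$4" ] || [ "$drift" -lt "-$4" ]
}

# Hypothetical watcher: sleep 10 s at a time and log to syslog when
# the wall clock disagrees with that by more than 5 s.
watch_clock() {
    prev=$(date +%s)
    while sleep 10; do
        now=$(date +%s)
        if gap_is_suspicious "$prev" "$now" 10 5; then
            logger -t clockwatch "wall clock moved $(( now - prev )) s during a 10 s sleep"
        fi
        prev=$now
    done
}
```

Running watch_clock in the background should log both a forward
jump and the later step back by ntpd, assuming the jump shows up in
the wall clock the way it did for heartbeat.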

        | Distro  | Kernel              | Clock  | Available
        | version | version             | source | clock sources
--------+---------+---------------------+--------+----------------
panel0  | lenny   | 2.6.26-2-686-bigmem | xen    | xen tsc jiffies
--------+---------+---------------------+--------+----------------
resntp2 | squeeze | 2.6.32-5-686-bigmem | tsc    | xen tsc
--------+---------+---------------------+--------+----------------
spamd0  | squeeze | 2.6.32-5-686-bigmem | xen    | xen tsc
--------+---------+---------------------+--------+----------------


Interesting how it is always ~47s.

Also:

panel0 $ zgrep "time reset" /var/log/syslog*
/var/log/syslog:Sep 20 16:11:39 panel0 ntpd[1802]: time reset -46.869813 s
/var/log/syslog.2.gz:Sep 6 12:58:41 panel0 ntpd[1742]: time reset -46.869301 s
/var/log/syslog.32.gz:Jul 27 11:19:35 panel0 ntpd[1742]: time reset -46.869786 s
/var/log/syslog.59.gz:Jun 29 19:05:10 panel0 ntpd[1708]: time reset -46.868480 s
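
To put a number on how consistent those resets are, something like
this could be used. It is a sketch that assumes the syslog format
shown above, where the offset is the second-to-last field; the
function name is made up for the example.

```shell
# Summarise ntpd "time reset" offsets read from stdin.
# Assumes each matching line ends in "<offset> s", as in the log
# lines quoted above.
summarise_resets() {
    awk '/time reset/ {
        v = $(NF-1) + 0                 # signed offset in seconds
        if (n == 0 || v < min) min = v
        if (n == 0 || v > max) max = v
        sum += v; n++
    }
    END {
        if (n > 0)
            printf "n=%d mean=%.6f spread=%.6f\n", n, sum / n, max - min
    }'
}

# Usage:
#   zgrep "time reset" /var/log/syslog* | summarise_resets
```

On the four panel0 lines above this works out to a mean of about
-46.8693 s with a spread of roughly 1.3 ms between the largest and
smallest reset.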

resntp2 $ zgrep "Late heartbeat" /var/log/syslog*
/var/log/syslog:Sep 20 15:54:56 resntp2 heartbeat: [1407]: WARN: Late heartbeat: Node resntp2: interval 47000 ms
/var/log/syslog:Sep 20 15:54:56 resntp2 heartbeat: [1407]: WARN: Late heartbeat: Node resntp0: interval 47780 ms
/var/log/syslog:Sep 20 15:54:56 resntp2 heartbeat: [1407]: WARN: Late heartbeat: Node resntp1: interval 47780 ms
/var/log/syslog:Sep 20 15:54:57 resntp2 heartbeat: [1407]: WARN: Late heartbeat: Node resntp3: interval 48280 ms
/var/log/syslog.2.gz:Sep 6 12:43:02 resntp2 heartbeat: [1407]: WARN: Late heartbeat: Node resntp2: interval 47590 ms
/var/log/syslog.2.gz:Sep 6 12:43:02 resntp2 heartbeat: [1407]: WARN: Late heartbeat: Node resntp0: interval 47970 ms
/var/log/syslog.2.gz:Sep 6 12:43:02 resntp2 heartbeat: [1407]: WARN: Late heartbeat: Node resntp3: interval 47590 ms
/var/log/syslog.2.gz:Sep 6 12:43:02 resntp2 heartbeat: [1407]: WARN: Late heartbeat: Node resntp1: interval 48100 ms

(logs on this host only go back to Aug 20th)

dunkel was last booted on June 16th so I suspect this has always
been the case on this host and software combination.

Can anyone else experiencing similar things please send me an email
off-list with what you saw and the distribution, kernel versions and
clocksources? You can find the clocksource info in:

/sys/devices/system/clocksource/clocksource0/current_clocksource
/sys/devices/system/clocksource/clocksource0/available_clocksource
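
A quick way to collect the kernel and clocksource details in one go
might be the following sketch. The function name and its optional
directory argument are made up for the example; it assumes the
standard sysfs location /sys/devices/system/clocksource/clocksource0
and does not try to detect the distribution.

```shell
# Print the kernel version and clock source details for a report.
# The optional argument overrides the sysfs directory, which is only
# useful for trying the function out against a copy of the files.
report_clocksource() {
    dir=${1:-/sys/devices/system/clocksource/clocksource0}
    echo "kernel: $(uname -r)"
    echo "current: $(cat "$dir/current_clocksource")"
    echo "available: $(cat "$dir/available_clocksource")"
}

# Usage:
#   report_clocksource
```

Pasting its output into the email along with the distribution
version and a description of what you saw should cover the details
asked for.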

I'm afraid this could be tricky to fix and I'm going to have to ask
you and others to bear with me while I investigate.

I would rather not take the whole server out of service right now,
but if you or anyone else wants to be moved to another server please
send in a support ticket and it will be done.

If I don't find a solution in a reasonable period of time then I am
absolutely willing to put you on different hardware and take this
server out of service (and probably bring it home and beat it to
pieces in the front garden, Office Space style).

Cheers,
Andy

--
http://bitfolk.com/ -- No-nonsense VPS hosting

