[Linux-HA] two node firewall using heartbeat v2 problems [SOLVED]
Dejan Muhamedagic
dejanmm at fastmail.fm
Tue Oct 2 04:44:41 MDT 2007
Hi,
On Mon, Oct 01, 2007 at 03:02:06PM -0500, Matt Zagrabelny wrote:
>
> On Mon, 2007-10-01 at 21:20 +0200, Dejan Muhamedagic wrote:
> > Hi,
> >
> > On Mon, Oct 01, 2007 at 01:56:46PM -0500, Matt Zagrabelny wrote:
> > >
> > > On Mon, 2007-10-01 at 14:37 +0200, Andrew Beekhof wrote:
> > > > On 10/1/07, Dejan Muhamedagic <dejanmm at fast
> > >
> > > [...]
> > >
> > > > > > (there will be no <status/> element in the following file, I believe
> > > > > > that this is due to me manually 'kill -9'ing the processes after they
> > > > > > would not stop nicely)
> > > > >
> > > > > No, the status section is never saved to a file. It only exists
> > > > > in running nodes.
> > >
> > > I know that the actual status doesn't get written out, but doesn't the "<status/>" tag get written out when the processes exit?
> > >
> > > > >
> > > > > > Here are some snippets from the log files, I am not sure what are the
> > > > > > valuable pieces and what are not. The files themselves are long (600 and
> > > > > > 900 lines for the primary and backup servers). Locations of the (almost
> > > > > > > complete) log files are:
> > > > > >
> > > > > > http://www.d.umn.edu/~mzagrabe/ha-log.cody.txt
> > > > > > http://www.d.umn.edu/~mzagrabe/ha-log.tim.txt
> > > > >
> > > > > > From cody:
> > > > >
> > > > > heartbeat[18326]: 2007/09/28_11:29:27 WARN: string2msg_ll: node [tim] failed authentication
> > > > >
> > > > > This one's interesting. It shouldn't be happening.
> > > > >
> > > > > heartbeat[18326]: 2007/09/28_11:29:27 WARN: 6 lost packet(s) for [tim] [253:260]
> > > > > heartbeat[18326]: 2007/09/28_11:29:27 WARN: Late heartbeat: Node tim: interval 3000 ms
> > > > >
> > > > > Flaky network?
> > > > >
> > > > > heartbeat[18330]: 2007/09/28_11:29:29 WARN: glib: TTY write timeout on [/dev/ttyS0] (no connection or bad cable? [see documentation])
> > > > >
> > > > > Problems with serial?
> > > >
> > > > one of the nice things about v2 is that it keeps the resource config
> > > > in sync between nodes. however this also includes the status section
> > > > and means that the data being transferred could quite conceivably
> > > > max-out a serial connection.
> > > >
> > > > a second NIC and a crossover cable is usually a good alternative
> > >
> > > I am already using a pair of NICs (between the nodes) for heartbeat, in
> > > addition to the serial link. Are you suggesting using two NICs per node
> > > to send heartbeat messages?
> >
> > No, it's just that you are better off with some redundancy in
> > communication links.
>
> Sure. That is what I currently have, a dedicated NIC on each node (via
> crossover cable) and a dedicated serial port on each node (via null
> modem) connecting the two nodes.
>
> > > Are the status messages sent across both links? (ie. do they go across
> > > the serial link and the ethernet link between the nodes?) I would assume
> > > they would, but I thought I would ask for clarification.
> >
> > No, I don't think so. The heartbeats go over both links, but the
> > messages only over one.
>
> Hmmm. I wonder if this is a place that could use some better logic to
> choose a faster link to send the messages (if one is available)? I
> currently have no intentions of digging into the source code, but am
> just thinking aloud.

Perhaps the messages are still travelling over both links. And
perhaps it should be that way, because otherwise how would the
cluster know that both links are functional?

> > > > > heartbeat[18326]: 2007/09/28_11:29:56 CRIT: Cluster node tim returning after partition.
> > > > >
> > > > > The node is leaving and coming back. Looks like the
> > > > > network/serial connection doesn't deliver what we expect. Perhaps
> > > > > you could try some other combinations:
> > > > >
> > > > > - without serial/higher baud
> > >
> > > Yes! Both of these solutions fix the problem. Should the default baud
> > > rate for a serial line be higher than 19200? What baud rate do others
> > > use for v2 heartbeat configurations? The reason I ask is that currently
> > > I have it set to 115200 and I am wondering if I am just above the
> > > threshold of saturating the serial link. Perhaps I will run some tests
> > > as well to see when the serial link gets saturated and report the
> > > findings.
> >
> > Yes, that would be interesting. And we should probably print a
> > warning for v2 configurations and low speed serial links.
>
> It looks like 38400 is an okay speed for the serial line. 19200 causes it to
> implode. I don't know if the keepalive directive has any bearing on
> the equation, but here is a snippet of my "working" config file:
>
> keepalive 1
> deadtime 5
> initdead 120
>
> baud 38400
Thanks. This should definitely be a minimum, though bigger CIBs
would probably need still more bandwidth.
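
For a rough feel of why a slow serial line may fall over, here is a
back-of-the-envelope sketch in Python. It assumes 8N1 framing (10 bits
on the wire per byte), which is a guess about the port setup, and a
made-up payload size: cib_bytes is purely hypothetical, since nobody
measured the CIB in this thread. The function names are only for
illustration and are not part of heartbeat.

```python
# Rough serial-link budget. Assumes 8N1 framing: 1 start + 8 data + 1 stop bit,
# i.e. 10 bits on the wire per payload byte (an assumption, not from the thread).
BITS_PER_BYTE_8N1 = 10


def bytes_per_second(baud):
    """Upper bound on payload bytes per second at the given baud rate."""
    return baud / BITS_PER_BYTE_8N1


def seconds_to_send(payload_bytes, baud):
    """Best-case time to push payload_bytes over the link, ignoring any
    protocol overhead and any other traffic sharing the line."""
    return payload_bytes / bytes_per_second(baud)


if __name__ == "__main__":
    # Hypothetical size of one CIB-plus-status transfer; replace with a
    # measured value from your own cluster.
    cib_bytes = 15_000
    for baud in (19200, 38400, 115200):
        rate = bytes_per_second(baud)
        t = seconds_to_send(cib_bytes, baud)
        print(f"{baud:>6} baud: {rate:7.0f} B/s, {t:5.1f} s for {cib_bytes} bytes")
```

Comparing the printed times with deadtime (5 seconds in the config
above) gives a crude idea of how much headroom a given baud rate
leaves, though real traffic carries extra overhead on top of this.
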
Dejan
> --
> Matt Zagrabelny - mzagrabe at d.umn.edu - (218) 726 8844
> University of Minnesota Duluth
> Information Technology Systems & Services
> PGP key 1024D/84E22DA2 2005-11-07
> Fingerprint: 78F9 18B3 EF58 56F5 FC85 C5CA 53E7 887F 84E2 2DA2
>
> He is not a fool who gives up what he cannot keep to gain what he cannot
> lose.
> -Jim Elliot