[Linux-HA] two node firewall using heartbeat v2 problems [SOLVED]
Matt Zagrabelny
mzagrabe at d.umn.edu
Mon Oct 1 14:02:06 MDT 2007
On Mon, 2007-10-01 at 21:20 +0200, Dejan Muhamedagic wrote:
> Hi,
>
> On Mon, Oct 01, 2007 at 01:56:46PM -0500, Matt Zagrabelny wrote:
> >
> > On Mon, 2007-10-01 at 14:37 +0200, Andrew Beekhof wrote:
> > > On 10/1/07, Dejan Muhamedagic <dejanmm at fast
> >
> > [...]
> >
> > > > > (there will be no <status/> element in the following file, I believe
> > > > > that this is due to me manually 'kill -9'ing the processes after they
> > > > > would not stop nicely)
> > > >
> > > > No, the status section is never saved to a file. It only exists
> > > > in running nodes.
> >
> > I know that the actual status doesn't get written out, but doesn't the "<status/>" tag get written out when the processes exit?
> >
> > > >
> > > > > Here are some snippets from the log files, I am not sure what are the
> > > > > valuable pieces and what are not. The files themselves are long (600 and
> > > > > 900 lines for the primary and backup servers). Locations of the (almost
> > > > > complete) log files is:
> > > > >
> > > > > http://www.d.umn.edu/~mzagrabe/ha-log.cody.txt
> > > > > http://www.d.umn.edu/~mzagrabe/ha-log.tim.txt
> > > >
> > > > From cody:
> > > >
> > > > heartbeat[18326]: 2007/09/28_11:29:27 WARN: string2msg_ll: node [tim] failed authentication
> > > >
> > > > This one's interesting. It shouldn't be happening.
> > > >
> > > > heartbeat[18326]: 2007/09/28_11:29:27 WARN: 6 lost packet(s) for [tim] [253:260]
> > > > heartbeat[18326]: 2007/09/28_11:29:27 WARN: Late heartbeat: Node tim: interval 3000 ms
> > > >
> > > > Flaky network?
> > > >
> > > > heartbeat[18330]: 2007/09/28_11:29:29 WARN: glib: TTY write timeout on [/dev/ttyS0] (no connection or bad cable? [see documentation])
> > > >
> > > > Problems with serial?
> > >
> > > one of the nice things about v2 is that it keeps the resource config
> > > in sync between nodes. however this also includes the status section
> > > and means that the data being transferred could quite conceivably
> > > max-out a serial connection.
> > >
> > > a second NIC and a crossover cable is usually a good alternative
> >
> > I am already using a pair of NICs (between the nodes) for heartbeat, in
> > addition to the serial link. Are you suggesting using two NICs per node
> > to send heartbeat messages?
>
> No, it's just that you are better off with some redundancy in
> communication links.
Sure. That is what I currently have, a dedicated NIC on each node (via
crossover cable) and a dedicated serial port on each node (via null
modem) connecting the two nodes.
> > Are the status messages sent across both links? (ie. do they go across
> > the serial link and the ethernet link between the nodes?) I would assume
> > they would, but I thought I would ask for clarification.
>
> No, I don't think so. The heartbeats go over both links, but the
> messages only over one.
Hmmm. I wonder if this is a place that could use some better logic to
choose a faster link to send the messages (if one is available)? I
currently have no intention of digging into the source code, but am
just thinking aloud.
>
> > > > heartbeat[18326]: 2007/09/28_11:29:56 CRIT: Cluster node tim returning after partition.
> > > >
> > > > The node is leaving and coming back. Looks like the
> > > > network/serial connection doesn't deliver what we expect. Perhaps
> > > > you could try some other combinations:
> > > >
> > > > - without serial/higher baud
> >
> > Yes! Both of these solutions fix the problem. Should the default baud
> > rate for a serial line be higher than 19200? What baud rate do others
> > use for v2 heartbeat configurations? The reason I ask is that currently
> > I have it set to 115200 and I am wondering if I am just above the
> > threshold of saturating the serial link. Perhaps I will run some tests
> > as well to see when the serial link gets saturated and report the
> > findings.
>
> Yes, that would be interesting. And we should probably print a
> warning for v2 configurations and low speed serial links.
It looks like 38400 is an okay speed for the serial line; 19200 causes it to
implode. I don't know whether the keepalive directive has any bearing on
the equation, but here is a snippet of my "working" config file:
keepalive 1
deadtime 5
initdead 120
baud 38400
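
For a rough feel of why the baud rate matters, here is a back-of-the-envelope
sketch. It assumes 8N1 framing (10 bits on the wire per byte) and uses a
made-up 16 KiB message size purely for illustration; neither number is taken
from heartbeat itself, and the function names are my own.

```python
# Rough estimate of how long a message takes to cross a serial link.
# Assumption: 8N1 framing, i.e. 1 start + 8 data + 1 stop = 10 bits per byte.
# The 16 KiB message size below is hypothetical, not a measured CIB size.

def serial_bytes_per_second(baud, bits_per_byte=10):
    """Approximate payload throughput of an asynchronous serial line."""
    return baud / bits_per_byte


def seconds_to_send(message_bytes, baud, bits_per_byte=10):
    """Time to push message_bytes over the line, ignoring protocol overhead."""
    return message_bytes / serial_bytes_per_second(baud, bits_per_byte)


if __name__ == "__main__":
    hypothetical_message = 16 * 1024  # bytes; illustrative only
    for baud in (19200, 38400, 115200):
        rate = serial_bytes_per_second(baud)
        delay = seconds_to_send(hypothetical_message, baud)
        print(f"{baud:>6} baud: {rate:8.0f} bytes/s, {delay:5.2f} s per message")
```

If the time per message comes anywhere near deadtime, I would expect the kind
of "returning after partition" trouble seen in the logs, but that is a guess
until I actually measure it.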
--
Matt Zagrabelny - mzagrabe at d.umn.edu - (218) 726 8844
University of Minnesota Duluth
Information Technology Systems & Services
PGP key 1024D/84E22DA2 2005-11-07
Fingerprint: 78F9 18B3 EF58 56F5 FC85 C5CA 53E7 887F 84E2 2DA2
He is not a fool who gives up what he cannot keep to gain what he cannot
lose.
-Jim Elliot