[Linux-HA] one dead ping node caused partially group restart

Andreas Kurz andreas.kurz at gmail.com
Thu Sep 13 08:34:42 MDT 2007


On 9/13/07, Andrew Beekhof <beekhof at gmail.com> wrote:
> On 8/30/07, Andreas Kurz <andreas.kurz at gmail.com> wrote:
> > On 8/24/07, Andrew Beekhof <beekhof at gmail.com> wrote:
> > > On 8/21/07, Andreas Kurz <andreas.kurz at gmail.com> wrote:
> > > > Hello all,
> > > >
> > > > I have a Heartbeat 2.1.2 two-node cluster installation using three
> > > > ping nodes to check network connectivity. Today one of the ping nodes
> > > > was restarted and the loss of one ping node was detected by both nodes
> > > > at the same time (according to the logs).
> > > >
> > > > The problem was that the pingd score_attribute was at first only
> > > > decreased for one node (the one holding two resource groups), so
> > > > Heartbeat began stopping resources to migrate the groups away.
> > > > During the resource shutdown the pingd score_attribute of the
> > > > second node was also decreased, the migration was aborted, and the
> > > > resources were restarted on the current node. A few seconds later
> > > > the third ping node was up again, the pingd score_attribute was
> > > > updated for both nodes, and the resources were left untouched.
> > > >
> > > > My question is: is there a way to tune the configuration to avoid
> > > > such resource restarts, and why was the pingd score_attribute
> > > > updated 'simultaneously' when the ping node came back up, but not
> > > > when it went down?
> > >
> > > If you're using the RA, increase the value of "dampen".
> > >
> > > When an event happens, we wait 'dampen' seconds (or milliseconds) to
> > > see if another one occurs on another node, so that we can update the
> > > CIB with both of them at the same time.
> >
> > According to my logs the pingds on both nodes recognized the loss of
> > the ping node at the same time, but immediately after the CIB was
> > updated with the first pingd attribute, Heartbeat started to stop the
> > resources; the second pingd attribute update happened a second later
> > and the already-stopped resources were started again on the same
> > host. Was this some sort of race condition? Should Heartbeat maybe
> > wait one additional ping interval for pingd attribute updates before
> > recalculating the scores, in case one node is a little late sending
> > its update, or does this make no sense?
>
> In the current design of attrd there is a small chance that the node
> that triggers the update gets its change in a little too quickly, so
> the updates don't show up close enough together. That is basically
> what happened here.
>
> The real solution is to have all peers supply their changes to one
> node that performs the update, ensuring the updates are truly atomic.
>
> We know what we need to do; it's just a matter of finding the time to do it...
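[Editor's note: the dampening behaviour discussed above, where updates arriving within the "dampen" window are coalesced into a single CIB write, can be sketched as follows. This is an illustrative model, not attrd's actual code or API; all names are invented.]

```python
import threading
import time

class DampenedWriter:
    """Coalesce attribute updates: wait `dampen` seconds after the first
    change so that near-simultaneous updates from several nodes land in
    the CIB in one write (illustrative sketch, not attrd code)."""

    def __init__(self, dampen, commit):
        self.dampen = dampen      # seconds to wait before writing
        self.commit = commit      # callback that performs the "CIB update"
        self.pending = {}         # node -> value collected so far
        self.timer = None
        self.lock = threading.Lock()

    def update(self, node, value):
        with self.lock:
            self.pending[node] = value
            # Start the dampen timer on the first change only; later
            # changes arriving within the window ride along for free.
            if self.timer is None:
                self.timer = threading.Timer(self.dampen, self._flush)
                self.timer.start()

    def _flush(self):
        with self.lock:
            batch, self.pending = self.pending, {}
            self.timer = None
        self.commit(batch)

# Usage: two nodes detect the lost ping node about a second apart.
written = []
w = DampenedWriter(dampen=1.5, commit=written.append)
w.update("node1", 2000)
time.sleep(1.0)
w.update("node2", 2000)   # arrives inside the dampen window
time.sleep(1.0)           # timer fires; both values written together
assert written == [{"node1": 2000, "node2": 2000}]
```

With too small a dampen value the second update misses the window and two separate writes occur, which is the partial-restart scenario described in this thread.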

OK ... I see ... Thanks for your reply and the information Andrew.

Regards,
Andreas

>
> > ..
> > attrd[7171]: 2007/08/21_10:09:55 info: attrd_perform_update: Sent
> > update 16: pingd=2000
> > tengine[932]: 2007/08/21_10:09:55 info: extract_event: Aborting on
> > transient_attributes changes for 738e0605-7e82-47b8-b21a-e69b733eb98b
> > ...
> > tengine[932]: 2007/08/21_10:09:56 info: extract_event: Aborting on
> > transient_attributes changes for dceade77-b3bf-40c7-a4b6-cc8995133aa1
> >
> _______________________________________________
> Linux-HA mailing list
> Linux-HA at lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>
