[Linux-HA] Heartbeat cannot stop
Andrew Beekhof
beekhof at gmail.com
Tue Nov 13 06:18:54 MST 2007
On Nov 8, 2007, at 8:19 AM, Dejan Muhamedagic wrote:
> Hi,
>
> On Thu, Nov 08, 2007 at 11:32:07AM +0900, HIDEO YAMAUCHI wrote:
>> Hi,
>>
>> I tested behavior of Heartbeat related to split-brain.
>> I just checked recovery from split-brain.
>>
>> I assume the following situation.
>>
>> 1) A two-node Active/Standby cluster.
>> 2) Heartbeat is started while the LAN used for Heartbeat
>> communication has a problem.
>> 3) A DC starts on each node within a few minutes.
>> 4) A resource starts on each node.
>> 5) Heartbeat communication recovers.
>>
>> After this, the nodes did not recognize each other correctly.
>> I then tried to stop the Heartbeat service on each node.
>> Heartbeat stopped on one node, but did not stop on the
>> other node.
>>
>> Version 2.1.2 and the development version gave the same results.
>>
>> I think it is a problem that Heartbeat does not stop on both
>> nodes.
>
> Not sure, but this looks suspicious:
>
> dl380g5c/ha-log:crmd[31979]: 2007/11/08_10:40:16 info:
> do_shutdown_req: Sending shutdown request to DC: <null>
>
> After that, crmd makes no effort to exit.
>
> Another issue could be that for about two minutes, after the
> split brain healed, that node couldn't set the DC:
>
> crmd[31979]: 2007/11/08_10:38:42 info: update_dc: Set DC to <null>
> (<null>)
> ...
>
> There's also an uncommon period of inactivity:
>
> crmd[31979]: 2007/11/08_10:38:48 notice: populate_cib_nodes: Node:
> dl380g5c (uuid: a9abdd7e-0a39-40cd-bea5-74494ad97f89)
> crmd[31979]: 2007/11/08_10:40:11 notice:
> crmd_client_status_callback: Status update: Client dl380g5d/crmd now
> has status [offline]

The root cause seems to be that heartbeat is not providing client
status messages (to say that the crmd processes are active) once the
split-brain heals.

crmd[1350]: 2007/11/08_10:38:43 info: join_make_offer: Peer process on
dl380g5c is not active (yet?)
crmd[1350]: 2007/11/08_10:40:11 WARN: do_state_transition: Only 1 of 2
cluster nodes are eligible to run resources - continue 0

Because of this, the crm doesn't consider dl380g5c online and the PE
can't shut it down.

I think you need to file a bug for Alan about this.
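
For anyone wanting to check their own ha-log for the same symptoms before filing, here is a rough sketch. It is not part of Heartbeat: the function name scan_log and the labels are made up for illustration, and the patterns are simply copied from the log excerpts quoted in this thread, so they may not match other versions' messages.

```python
import re

# Patterns copied from the log excerpts quoted in this thread.
# The labels on the left are hypothetical names chosen for this sketch.
SYMPTOMS = [
    ("shutdown_sent_to_null_dc",
     re.compile(r"do_shutdown_req: Sending shutdown request to DC: <null>")),
    ("dc_unset",
     re.compile(r"update_dc: Set DC to <null>")),
    ("peer_not_active",
     re.compile(r"join_make_offer: Peer process on \S+ is not active")),
    ("peer_crmd_offline",
     re.compile(r"Client \S+/crmd now has status \[offline\]")),
]


def scan_log(lines):
    """Return (label, line) pairs for every line that shows a known symptom."""
    hits = []
    for line in lines:
        for label, pattern in SYMPTOMS:
            if pattern.search(line):
                hits.append((label, line.rstrip("\n")))
    return hits
```

It takes any iterable of lines, so an open log file can be passed directly, e.g. scan_log(open("/path/to/ha-log")). It assumes each log record is on a single line, as in the file itself rather than as wrapped in this mail.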