[Linux-HA] Heartbeat - Node Status Woes

Andrew Beekhof beekhof at gmail.com
Wed Feb 14 02:23:17 MST 2007


[insert general note about attaching only fragments of logs]

[insert general note about not including version information]

other comments below

On 2/14/07, Jim Wong <jwong at sharpcast.com> wrote:
> Folks,
>
> We're having a bear of a time with the production cluster we're working
> on setting up.  Bits and pieces of it have been up and down over the
> last week or so as we test out various components with and without
> Heartbeat.  Now, we're working on getting the whole thing put together,
> but Heartbeat is having all sorts of problems.
>
> As I mentioned, Heartbeat has been up and running on each of our nodes
> at various points during the setup process, so each of the nodes seems
> to know something about the others, even after we've wiped out the CIB
> using "cibadmin -E".

you mean like in the <nodes> and <status> sections?

<nodes> is populated from the heartbeat messaging layer.
try stopping all the nodes are removing /var/lib/heartbeat/hostcache -
this should allow heartbeat to forget about old nodes.

<status> is populated from the lrm (so we'll continue to be aware of
any active resources) and also from the heartbeat messaging layer.

>  We're trying to get things rolling by bringing up
> Heartbeat on all of the nodes, without any resources, and then adding
> the resources one at a time.  Unfortunately, we can't get the nodes to
> talk to one another consistently.
>
> The first node we start (abg01 in the logs below) seems to come up fine.
> It knows about the other nodes, but they appear as offline, while it
> shows up as online.  The problems begin when we start initializing
> heartbeat on other systems.
>
> For example, one of the other nodes, abg00, seemed to come up fine at
> first, but after a while started bouncing between online and offline in
> the crm_mon output on abg01.  Finally, it seems to have settled
> unhappily offline.  The logs from abg01 reflect this:
>
> Feb 13 23:21:12 abg01 crmd: [19741]: notice:
> crmd_client_status_callback:callbacks.c Status update: Client abg00/crmd
> now has status [online]
> Feb 13 23:21:12 abg01 crmd: [19741]: info:
> crmd_client_status_callback:callbacks.c Uncaching UUID for abg00
> Feb 13 23:21:12 abg01 heartbeat: [18305]: WARN: 1 lost packet(s) for
> [abg00] [4500:4502]
> Feb 13 23:21:12 abg01 heartbeat: [18305]: info: No pkts missing from
> abg00!
> Feb 13 23:21:14 abg01 crmd: [19741]: notice:
> crmd_client_status_callback:callbacks.c Status update: Client abg00/crmd
> now has status [offline]
> Feb 13 23:21:15 abg01 crmd: [19741]: notice:
> crmd_client_status_callback:callbacks.c Status update: Client abg00/crmd
> now has status [online]
> Feb 13 23:21:15 abg01 crmd: [19741]: info:
> crmd_client_status_callback:callbacks.c Uncaching UUID for abg00
> Feb 13 23:21:16 abg01 heartbeat: [18305]: WARN: 1 lost packet(s) for
> [abg00] [4514:4516]
> Feb 13 23:21:16 abg01 heartbeat: [18305]: info: No pkts missing from
> abg00!
> Feb 13 23:21:24 abg01 crmd: [19741]: notice:
> crmd_client_status_callback:callbacks.c Status update: Client abg00/crmd
> now has status [offline]
> Feb 13 23:21:24 abg01 crmd: [19741]: WARN: join_make_offer:join_dc.c
> Peer process on abg00 is not active
> Feb 13 23:21:24 abg01 crmd: [19741]: WARN: join_make_offer:join_dc.c
> Peer process on abg00 is not active
> Feb 13 23:21:25 abg01 crmd: [19741]: notice:
> crmd_client_status_callback:callbacks.c Status update: Client abg00/crmd
> now has status [online]
> Feb 13 23:21:25 abg01 crmd: [19741]: info:
> crmd_client_status_callback:callbacks.c Uncaching UUID for abg00
> Feb 13 23:21:25 abg01 heartbeat: [18305]: WARN: 1 lost packet(s) for
> [abg00] [4536:4538]
> Feb 13 23:21:25 abg01 heartbeat: [18305]: info: No pkts missing from
> abg00!
> Feb 13 23:21:27 abg01 crmd: [19741]: notice:
> crmd_client_status_callback:callbacks.c Status update: Client abg00/crmd
> now has status [offline]
> Feb 13 23:21:28 abg01 crmd: [19741]: notice:
> crmd_client_status_callback:callbacks.c Status update: Client abg00/crmd
> now has status [online]
> Feb 13 23:21:28 abg01 crmd: [19741]: info:
> crmd_client_status_callback:callbacks.c Uncaching UUID for abg00
> Feb 13 23:21:29 abg01 heartbeat: [18305]: WARN: 1 lost packet(s) for
> [abg00] [4548:4550]
>
> Around the same time, the logs from abg00 seem to paint a picture of a
> very unhealthy system:

All those cib_apply_diff logs don't necessarily imply an unhealthy
system, but they do imply a very busy one.  But without the rest of
the logs its hard to know what it might be doing.

>
> Feb 13 23:21:12 abg00 cib: [14595]: info: cib_diff_notify:notify.c
> Update (client: 19741, call:22): 0.1682.9177 -> 0.1682.9178 (ok)
> Feb 13 23:21:12 abg00 cibmon: [14601]: info: cibmon_diff:cibmon.c
> [cib_diff_notify] cib_apply_diff confirmed
> Feb 13 23:21:12 abg00 cibmon: [14601]: info: cib_apply_diff: Diff: ---
> 0.1682.9177
> Feb 13 23:21:12 abg00 cibmon: [14601]: info: cib_apply_diff: Diff: +++
> 0.1682.9178
> Feb 13 23:21:12 abg00 cibmon: [14601]: info: cib_apply_diff: - <cib
> num_updates="9177"/>
> Feb 13 23:21:12 abg00 cibmon: [14601]: info: cib_apply_diff: + <cib
> num_updates="9178"/>
> Feb 13 23:21:12 abg00 cib: [14595]: info: cib_diff_notify:notify.c
> Update (client: 19741, call:23): 0.1682.9178 -> 0.1683.9179 (ok)
> Feb 13 23:21:12 abg00 cibmon: [14601]: info: cibmon_diff:cibmon.c
> [cib_diff_notify] cib_apply_diff confirmed
> Feb 13 23:21:12 abg00 cibmon: [14601]: info: cib_apply_diff: Diff: ---
> 0.1682.9178
> Feb 13 23:21:12 abg00 cibmon: [14601]: info: cib_apply_diff: Diff: +++
> 0.1683.9179
> Feb 13 23:21:12 abg00 cibmon: [14601]: info: cib_apply_diff: - <cib
> epoch="1682" num_updates="9178"/>
> Feb 13 23:21:12 abg00 cibmon: [14601]: info: cib_apply_diff: + <cib
> epoch="1683" num_updates="9179"/>
> Feb 13 23:21:12 abg00 heartbeat: [14583]: WARN: 1 lost packet(s) for
> [avol02] [2336:2338]
> Feb 13 23:21:12 abg00 heartbeat: [14583]: info: No pkts missing from
> avol02!
> Feb 13 23:21:12 abg00 cib: [14595]: info: cib_diff_notify:notify.c
> Update (client: 19741, call:24): 0.1683.9179 -> 0.1683.9180 (ok)
> Feb 13 23:21:12 abg00 cibmon: [14601]: info: cibmon_diff:cibmon.c
> [cib_diff_notify] cib_apply_diff confirmed
> Feb 13 23:21:12 abg00 cibmon: [14601]: info: cib_apply_diff: Diff: ---
> 0.1683.9179
> Feb 13 23:21:12 abg00 cibmon: [14601]: info: cib_apply_diff: Diff: +++
> 0.1683.9180
> Feb 13 23:21:12 abg00 cibmon: [14601]: info: cib_apply_diff: - <cib
> num_updates="9179"/>
> Feb 13 23:21:12 abg00 cibmon: [14601]: info: cib_apply_diff: + <cib
> num_updates="9180"/>
> Feb 13 23:21:12 abg00 cib: [14595]: info: cib_diff_notify:notify.c
> Update (client: 19741, call:25): 0.1683.9180 -> 0.1683.9181 (ok)
> Feb 13 23:21:12 abg00 cib: [14595]: WARN: cib_peer_callback:callbacks.c
> Discarding cib_replace message (cfc) from asqlweb01: not in our
> membership
>
> Note that it's writing log entries _very_ quickly, and had been doing so
> for some time.
>
> A couple seconds later, more bad stuff:
>
> Feb 13 23:21:14 abg00 cib: [14595]: WARN: cib_peer_callback:callbacks.c
> Discarding cib_apply_diff message (d24) from asqlweb01: not in our
> membership
> Feb 13 23:21:14 abg00 cib: [14595]: WARN: cib_peer_callback:callbacks.c
> Discarding cib_apply_diff message (d26) from asqlweb01: not in our
> membership
> Feb 13 23:21:14 abg00 cib: [14595]: WARN: cib_peer_callback:callbacks.c
> Discarding cib_apply_diff message (d28) from asqlweb01: not in our
> membership
> Feb 13 23:21:14 abg00 cib: [14595]: WARN: cib_peer_callback:callbacks.c
> Discarding cib_replace message (d2a) from asqlweb01: not in our
> membership
> Feb 13 23:21:14 abg00 crmd: [15751]: WARN:
> crmd_ha_msg_callback:callbacks.c Ignoring HA message (op=join_ack_nack)
> from asqlweb01: not in our membership list (size=4)
> Feb 13 23:21:14 abg00 cib: [14595]: WARN: cib_peer_callback:callbacks.c
> Discarding cib_apply_diff message (d2c) from asqlweb01: not in our
> membership
> Feb 13 23:21:14 abg00 crmd: [15751]: ERROR:
> do_cl_join_finalize_respond:join_client.c Join join-117 with asqlweb01
> failed.  NACK'd
> Feb 13 23:21:14 abg00 crmd: [15751]: ERROR: do_log:misc.c [[FSA]] Input
> I_ERROR from do_cl_join_finalize_respond() received in state (S_PENDING)
> Feb 13 23:21:14 abg00 crmd: [15751]: info: do_state_transition:fsa.c
> abg00: State transition S_PENDING -> S_RECOVERY [ input=I_ERROR
> cause=C_FSA_INTERNAL origin=do_cl_join_finalize_respond ]
> Feb 13 23:21:14 abg00 cib: [14595]: WARN: cib_peer_callback:callbacks.c
> Discarding cib_apply_diff message (d2e) from asqlweb01: not in our
> membership
> Feb 13 23:21:14 abg00 crmd: [15751]: ERROR: do_recover:control.c Action
> A_RECOVER (0000000001000000) not supported
> Feb 13 23:21:14 abg00 crmd: [15751]: ERROR: do_log:misc.c [[FSA]] Input
> I_STOP from do_recover() received in state (S_RECOVERY)
> Feb 13 23:21:14 abg00 crmd: [15751]: info: do_state_transition:fsa.c
> abg00: State transition S_RECOVERY -> S_STOPPING [ input=I_STOP
> cause=C_FSA_INTERNAL origin=do_recover ]
> Feb 13 23:21:14 abg00 crmd: [15751]: info: do_dc_release:election.c DC
> role released
> Feb 13 23:21:14 abg00 crmd: [15751]: WARN: do_log:misc.c [[FSA]] Input
> I_RELEASE_SUCCESS from do_dc_release() received in state (S_STOPPING)
> Feb 13 23:21:14 abg00 crmd: [15751]: info: do_state_transition:fsa.c
> abg00: State transition S_STOPPING -> S_TERMINATE [ input=I_TERMINATE
> cause=C_FSA_INTERNAL origin=do_shutdown ]
>
> This goes on and on, and needless to say, this node isn't working
> particularly well.
>
> Another node, avol00, just dumps many, many messages like these:
>
> Feb 13 22:57:26 avol00 heartbeat: [13480]: WARN: 1 lost packet(s) for
> [asqlweb01] [805:807]
> Feb 13 22:57:26 avol00 heartbeat: [13480]: WARN: Rexmit of seq 449
> requested. 92 is max.
> Feb 13 22:57:26 avol00 heartbeat: [13480]: WARN: Rexmit of seq 466
> requested. 92 is max.
> Feb 13 22:57:26 avol00 heartbeat: [13480]: WARN: Rexmit of seq 421
> requested. 92 is max.
> Feb 13 22:57:26 avol00 heartbeat: [13480]: WARN: Rexmit of seq 486
> requested. 92 is max.
> Feb 13 22:57:27 avol00 heartbeat: [13480]: WARN: Rexmit of seq 421
> requested. 92 is max.
> Feb 13 22:57:27 avol00 heartbeat: [13480]: WARN: Rexmit of seq 449
> requested. 92 is max.
> Feb 13 22:57:27 avol00 heartbeat: [13480]: WARN: Rexmit of seq 421
> requested. 92 is max.
> Feb 13 22:57:27 avol00 heartbeat: [13480]: WARN: Rexmit of seq 486
> requested. 92 is max.
> Feb 13 22:57:27 avol00 heartbeat: [13480]: WARN: Rexmit of seq 466
> requested. 92 is max.
> Feb 13 22:57:27 avol00 heartbeat: [13480]: WARN: Rexmit of seq 449
> requested. 92 is max.
> Feb 13 22:57:27 avol00 heartbeat: [13480]: WARN: Rexmit of seq 421
> requested. 92 is max.
> Feb 13 22:57:27 avol00 heartbeat: [13480]: WARN: Rexmit of seq 449
> requested. 92 is max.
> Feb 13 22:57:27 avol00 heartbeat: [13480]: WARN: Rexmit of seq 466
> requested. 92 is max.
> Feb 13 22:57:27 avol00 heartbeat: [13480]: WARN: Rexmit of seq 486
> requested. 92 is max.
> Feb 13 22:57:27 avol00 heartbeat: [13480]: WARN: Rexmit of seq 449
> requested. 92 is max.
> Feb 13 22:57:27 avol00 heartbeat: [13480]: WARN: Rexmit of seq 421
> requested. 92 is max.
> Feb 13 22:57:27 avol00 heartbeat: [13480]: WARN: Rexmit of seq 466
> requested. 92 is max.
> Feb 13 22:57:27 avol00 heartbeat: [13480]: WARN: Rexmit of seq 466
> requested. 92 is max.
> Feb 13 22:57:27 avol00 heartbeat: [13480]: WARN: Rexmit of seq 449
> requested. 92 is max.
>
> At this point, I've more or less reached the conclusion that something
> in our cluster state is hopelessly confused, and that I'm going to have
> to start wiping things out and starting over, but I'd be interested in
> learning a) what I should be looking for in order to effect this cleanup

the hostcache file is a good starting point

> and b) how we got to this point in the first place, so we can avoid the
> problem in the future.
>
> Can anyone help?
> --
> Jim Wong (jwong at sharpcast.com)
> _______________________________________________
> Linux-HA mailing list
> Linux-HA at lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>


More information about the Linux-HA mailing list