[Linux-HA] Help understand an incident
Andrew Beekhof
beekhof at gmail.com
Tue Jul 3 09:15:08 MDT 2007
On 7/3/07, Peter Kruse <pk at q-leap.com> wrote:
> Hello list!
>
> today in one of our clusters a failover occurred. Good news: it
> succeeded. But... while looking through the logs we found
> that messages are missing on one node so we can not say exactly
> what happened. Attached is the syslog from node-2 from the
> time where there are no messages on node-1. Is it possible
> to say from that log what happened on node-1?
if it was just resource actions, then yes. they'll all be recorded
in the CIB and produce updates like the one below. look out for
failing monitor operations, which probably triggered the failover.
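a rough sketch of how one might scan the surviving node's log for such
updates. it assumes each <lrm_rsc_op .../> element sits on a single log
line (the mail client wrapped the example below), and the helper names
get_attr() and op_not_done() are made up for illustration, they are not
part of heartbeat or the CRM:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy the value of attribute `name` (as in
 * name="value") from one log line into `out`.
 * Returns 0 on success, -1 if the attribute is absent or too long. */
static int get_attr(const char *line, const char *name,
                    char *out, size_t outlen)
{
    size_t nlen = strlen(name);
    const char *p = line;

    if (nlen == 0)
        return -1;

    while ((p = strstr(p, name)) != NULL) {
        /* Require whitespace before the name so that "id" does not
         * match the tail of "call_id". */
        int at_boundary = (p == line) || p[-1] == ' ' || p[-1] == '\t';

        if (at_boundary && p[nlen] == '=' && p[nlen + 1] == '"') {
            const char *val = p + nlen + 2;
            const char *end = strchr(val, '"');
            size_t vlen;

            if (end == NULL)
                return -1;
            vlen = (size_t)(end - val);
            if (vlen >= outlen)
                return -1;
            memcpy(out, val, vlen);
            out[vlen] = '\0';
            return 0;
        }
        p += nlen;
    }
    return -1;
}

/* Hypothetical helper: returns 1 if the line holds an lrm_rsc_op whose
 * op_status is anything other than 0 (LRM_OP_DONE), else 0. */
static int op_not_done(const char *line)
{
    char buf[16];

    if (strstr(line, "<lrm_rsc_op") == NULL)
        return 0;
    if (get_attr(line, "op_status", buf, sizeof buf) != 0)
        return 0;
    return atoi(buf) != 0;
}
```

this only looks at op_status. an operation can finish (op_status 0) and
still report a failure through rc_code, so that attribute needs checking
as well.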
> Especially there are messages like this:
>
> Jul 3 11:22:59 beosrv-c-2 cibmon: [16501]: info: mask(cib_apply_diff):
> + <lrm_rsc_op id="nfs:maillastnfs_stop_0" operation="stop"
> crm-debug-origin="do_update_resource"
> transition_key="6:ad6f57b8-295b-4c20-8e0f-e01494577dfb"
> transition_magic="2:152;6:ad6f57b8-295b-4c20-8e0f-e01494577dfb"
> call_id="45" rc_code="152" op_status="2" interval="0"
> __crm_diff_marker__="added:top"/>
>
> Does that mean the action maillastnfs_stop_0 was run but returned
> the status 2?
correct
> Or is it possible that the action never was run
> on node 1?
you'd have to match the uuid from the enclosing <node_state> object
to confirm which node it ran on, but yes, it did actually get run.
according to the enum below, op_status 2 := LRM_OP_TIMEOUT, i.e. the
stop operation timed out.
typedef enum {
    LRM_OP_PENDING = -1,
    LRM_OP_DONE,          /* 0 */
    LRM_OP_CANCELLED,     /* 1 */
    LRM_OP_TIMEOUT,       /* 2 */
    LRM_OP_NOTSUPPORTED,  /* 3 */
    LRM_OP_ERROR          /* 4 */
} op_status_t;
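as a small sketch of decoding these values by hand: in the log line
quoted above, transition_magic looks like
"op_status:rc_code;transition_key" (that layout is inferred from the one
example, so treat it as an assumption). op_status_name() and
split_magic() are illustrative names, not functions from the LRM or CRM
sources:

```c
#include <stdio.h>
#include <string.h>

/* Copy of the enum quoted above, so the sketch compiles on its own. */
typedef enum {
    LRM_OP_PENDING = -1,
    LRM_OP_DONE,
    LRM_OP_CANCELLED,
    LRM_OP_TIMEOUT,
    LRM_OP_NOTSUPPORTED,
    LRM_OP_ERROR
} op_status_t;

/* Hypothetical helper: map a numeric op_status to its enum name. */
static const char *op_status_name(int status)
{
    switch (status) {
    case LRM_OP_PENDING:      return "LRM_OP_PENDING";
    case LRM_OP_DONE:         return "LRM_OP_DONE";
    case LRM_OP_CANCELLED:    return "LRM_OP_CANCELLED";
    case LRM_OP_TIMEOUT:      return "LRM_OP_TIMEOUT";
    case LRM_OP_NOTSUPPORTED: return "LRM_OP_NOTSUPPORTED";
    case LRM_OP_ERROR:        return "LRM_OP_ERROR";
    default:                  return "unknown";
    }
}

/* Hypothetical helper: split "op_status:rc_code;transition_key".
 * On success returns 0 and points *key at the text after the ';'
 * inside `magic` (no copy is made). Returns -1 if malformed. */
static int split_magic(const char *magic, int *op_status,
                       int *rc_code, const char **key)
{
    const char *semi = strchr(magic, ';');

    if (semi == NULL || sscanf(magic, "%d:%d", op_status, rc_code) != 2)
        return -1;
    *key = semi + 1;
    return 0;
}
```

fed the value from the log above, split_magic() yields op_status 2,
rc_code 152 and the same transition_key that appears as its own
attribute, and op_status_name(2) gives LRM_OP_TIMEOUT.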
for rc values, refer to ocf-returncodes somewhere under /usr/lib/ocf