[Linux-HA] Ungraceful shutdown problem
Magnus Brown
mbrown at nexagent.com
Wed Dec 20 08:07:20 MST 2006
Hi Andrew, thanks for your reply.
I have included the original cib.xml and ha-log.
I had a problem with the timing of the fuego status and fuego start
commands. The fuego status command could sometimes take up to 45 seconds
and the timeout on the fuego monitor was only set to 30s, so heartbeat
would fail the monitor and then try to restart fuego. When eth0 is down
fuego can't actually start, but the start command would take up to 10
minutes to return a failed status, again exceeding both the
transition_idle_timeout and default_action_timeout settings.
(Incidentally, what is the difference between these two variables? I
couldn't find out.)
So I have changed the fuego init script to return the failed status in
less time than the timeout and will test again.
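
A minimal sketch of that kind of change (not the actual fuego script):
it assumes the coreutils "timeout" utility is installed, and the
run_bounded name is made up for illustration. The idea is only that the
status and start actions always return before the cluster's operation
timeout expires.

```shell
#!/bin/sh
# Hypothetical helper: run a command but give up after a deadline, so the
# init script always answers before the cluster's operation timeout.
# Assumption: the coreutils `timeout` utility is available on the node.
run_bounded() {
    limit="$1"; shift
    timeout "$limit" "$@"
    rc=$?
    # timeout(1) exits with status 124 when the deadline expired
    if [ "$rc" -eq 124 ]; then
        echo "command exceeded ${limit}s, reporting failure" >&2
        return 1
    fi
    # Otherwise pass the command's own exit status through unchanged
    return "$rc"
}
```

An init script could then wrap its real status or start command in
run_bounded with a limit kept below the monitor timeout (30s here) and
return the result as its exit status.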
Thank you very much
Magnus
-----Original Message-----
From: linux-ha-bounces at lists.linux-ha.org
[mailto:linux-ha-bounces at lists.linux-ha.org] On Behalf Of Andrew Beekhof
Sent: 20 December 2006 13:54
To: General Linux-HA mailing list
Subject: Re: [Linux-HA] Ungraceful shutdown problem
On 12/20/06, Magnus Brown <mbrown at nexagent.com> wrote:
> Hi all,
>
> Sorry I forgot I had unsubscribed from the list before sending this
> email so it will go to a moderator first.
>
> I have some more info though.
>
> I have tried removing the eth1 connection (as opposed to the eth0 one
> which gives the problem) and the resources all remain running on their
> respective nodes as they should - hurrah. So the problem only occurs
> when I remove the eth0 connection.
it looks like heartbeat is panicking, which makes life hard for its
clients
if it's repeatable, which it sounds like it is, can you log a heartbeat
bug for this please and attach the complete logs?
> Another problem is that when the eth0 connection is restored and the
> node which went down ungracefully has heartbeat restarted it starts both
> ldap and weblogic as predicted (well actually it finds that ldap and
> weblogic are already running),
right, that's what the crmd is warning about as it exits
> but if the other node where fuego is
> running is shutdown gracefully, fuego is not moved over to the other
> node. So I have a situation where a resource required to run is not
> running in the cluster, nor does the previously failed node try to
> restart it.
can you attach the result of "cibadmin -Ql" when the cluster is in
this state (and before you reset failcounts etc)?
if there is a bug i'll be able to fix it pretty quick with this
information
>
> I thought that maybe the failure count had been set for fuego so tried
> to check it with: -
>
> crm_failcount -V -G -U edlapp02.eds.lcms.com -r fuego_res
> name=fail-count-fuego_res value=(null) Error performing operation: The
> object/attribute does not exist
>
> If I try and reset the failure count with: -
>
> crm_failcount -D -U edlapp02.eds.lcms.com -r fuego_res
>
> It has no effect. In order to get fuego to run on this previously failed
> node I have to stop heartbeat, remove the following: -
>
> rm -f /var/lib/heartbeat/cores/root/*
> rm -f /var/lib/heartbeat/cores/nobody/*
> rm -f /var/lib/heartbeat/cores/hacluster/*
> rm -f /var/lib/heartbeat/hb_generation
> rm -f /var/lib/heartbeat/hb_uuid
> rm -f /var/lib/heartbeat/hostcache
> rm -f /var/lib/heartbeat/pengine/*
> rm -f /var/lib/heartbeat/crm/cib.xml.last
> rm -f /var/lib/heartbeat/crm/cib.xml.sig
> rm -f /var/lib/heartbeat/crm/cib.xml.sig.last
>
> and copy back the initial cib.xml I used to start with.
>
> If I could get the same behaviour with eth0 as eth1 I would be happy as
> fuego fails to run correctly without eth0 and so is failed over
> correctly when eth0 is down. I just need to stop heartbeat shutting
> itself down when eth0 is taken down.
>
> Thank you
> Magnus
>
> -----Original Message-----
> From: Magnus Brown
> Sent: 19 December 2006 12:39
> To: 'linux-ha at lists.linux-ha.org'
> Subject: Ungraceful shutdown problem
>
> Hi all,
>
> I have a problem with heartbeat shutting down ungracefully and leaving
> managed processes still running. I have attached the cib.xml and a
> zipped ha-log.
>
> I have 2 nodes which are connected via 2 lan connections. The ha.cf is
> shown below: -
>
> use_logd on
> udpport 694
> keepalive 1
> deadtime 45
> mcast eth0 239.192.0.1 694 1 0
> mcast eth1 239.192.0.2 694 1 0
> node edlapp01.eds.lcms.com edlapp02.eds.lcms.com
> crm yes
>
> When I pull the eth0 cable on edlapp01, fuego is successfully moved to
> edlapp02. I am then expecting ldap and weblogic to continue running on
> edlapp01 but I get the following messages in ha-log: -
>
> tengine[31320]: 2006/12/18_11:59:13 info: te_update_diff:callbacks.c
> Processing diff (cib_update): 0.51.7185 -> 0.51.7186
> tengine[31320]: 2006/12/18_11:59:13 info: match_graph_event:events.c
> Action fuego_res_stop_0 (2) confirmed
> tengine[31320]: 2006/12/18_11:59:13 info: te_pseudo_action:actions.c
> Pseudo action 31 confirmed
> tengine[31320]: 2006/12/18_11:59:13 info: te_pseudo_action:actions.c
> Pseudo action 28 confirmed
> tengine[31320]: 2006/12/18_11:59:13 info: send_rsc_command:actions.c
> Initiating action 26: fuego_res_start_0 on edlapp02.eds.lcms.com
> cib[2433]: 2006/12/18_11:59:13 info: write_cib_contents:io.c Wrote
> version 0.51.7186 of the CIB to disk (digest:
> 3e4b41e1e8ce2f632e64696ae11c8b9d)
> heartbeat[31250]: 2006/12/18_12:00:52 ERROR: Cannot write to media pipe
> 0: Resource temporarily unavailable
> heartbeat[31250]: 2006/12/18_12:00:52 ERROR: Shutting down.
> heartbeat[31250]: 2006/12/18_12:00:52 ERROR: Cannot write to media pipe
> 0: Resource temporarily unavailable
> heartbeat[31250]: 2006/12/18_12:00:52 ERROR: Shutting down.
> heartbeat[31250]: 2006/12/18_12:00:52 ERROR: Cannot write to media pipe
> 0: Resource temporarily unavailable
> heartbeat[31250]: 2006/12/18_12:00:52 ERROR: Shutting down.
> heartbeat[31250]: 2006/12/18_12:00:52 ERROR: Cannot write to media pipe
> 0: Resource temporarily unavailable
> heartbeat[31250]: 2006/12/18_12:00:52 ERROR: Shutting down.
> heartbeat[31250]: 2006/12/18_12:00:52 ERROR: Cannot write to media pipe
> 0: Resource temporarily unavailable
> heartbeat[31250]: 2006/12/18_12:00:52 ERROR: Shutting down.
> heartbeat[31250]: 2006/12/18_12:00:52 ERROR: Cannot write to media pipe
> 0: Resource temporarily unavailable
> heartbeat[31250]: 2006/12/18_12:00:52 ERROR: Shutting down.
> heartbeat[31250]: 2006/12/18_12:00:52 ERROR: Cannot write to media pipe
> 0: Resource temporarily unavailable
> heartbeat[31250]: 2006/12/18_12:00:52 ERROR: Shutting down.
>
> These "Cannot write" and "Shutting down" messages continue until: -
>
> heartbeat[31250]: 2006/12/18_12:01:04 ERROR: Shutting down.
> heartbeat[31250]: 2006/12/18_12:01:04 ERROR: Message hist queue is
> filling up (200 messages in queue)
> ccm[31309]: 2006/12/18_12:01:04 ERROR: Lost connection to heartbeat
> service. Need to bail out.
> cib[31310]: 2006/12/18_12:01:04 ERROR: cib_ha_connection_destroy:main.c
> Heartbeat connection lost! Exiting.
> stonithd[31312]: 2006/12/18_12:01:04 ERROR: Disconnected with heartbeat
> daemon
> mgmtd[31315]: 2006/12/18_12:01:04 ERROR: Lost connection to heartbeat
> service.
> tengine[31320]: 2006/12/18_12:01:04 ERROR: stonithd_op_result_ready:
> failed due to not on signon status.
> cib[31310]: 2006/12/18_12:01:04 info: uninitializeCib:io.c The CIB has
> been deallocated.
> attrd[31313]: 2006/12/18_12:01:04 CRIT: attrd_ha_dispatch:attrd.c Lost
> connection to heartbeat service.
> stonithd[31312]: 2006/12/18_12:01:04 notice: /usr/lib/heartbeat/stonithd
> normally quit.
> crmd[31314]: 2006/12/18_12:01:04 CRIT: crmd_ha_msg_dispatch:callbacks.c
> Lost connection to heartbeat service.
> mgmtd[31315]: 2006/12/18_12:01:04 ERROR:
> cib_native_msgready:cib_native.c Message pending on command channel
> [31310]
> tengine[31320]: 2006/12/18_12:01:04 ERROR:
> tengine_stonith_connection_destroy:callbacks.c Fencing daemon has left
> us
> pengine[31321]: 2006/12/18_12:01:04 info: pengine_shutdown:main.c
> Exiting PEngine (SIGTERM)
> attrd[31313]: 2006/12/18_12:01:04 CRIT:
> attrd_ha_connection_destroy:attrd.c Lost connection to heartbeat
> service!
> crmd[31314]: 2006/12/18_12:01:04 info: mem_handle_func:IPC broken, ccm
> is dead before the client!
>
> And then: -
>
>
>
> crmd[31314]: 2006/12/18_12:01:13 info: verify_stopped:lrm.c Checking for
> active resources before exit
> crmd[31314]: 2006/12/18_12:01:13 ERROR: verify_stopped:lrm.c Resource
> ldap_res:0 was active at shutdown. You may ignore this error if it is
> unmanaged.
> crmd[31314]: 2006/12/18_12:01:13 ERROR: verify_stopped:lrm.c Resource
> weblogic_res:0 was active at shutdown. You may ignore this error if it
> is unmanaged.
> crmd[31314]: 2006/12/18_12:01:13 info: verify_stopped:lrm.c Checking for
> active resources before exit
> crmd[31314]: 2006/12/18_12:01:13 ERROR: verify_stopped:lrm.c Resource
> ldap_res:0 was active at shutdown. You may ignore this error if it is
> unmanaged.
> crmd[31314]: 2006/12/18_12:01:13 ERROR: verify_stopped:lrm.c Resource
> weblogic_res:0 was active at shutdown. You may ignore this error if it
> is unmanaged.
> crmd[31314]: 2006/12/18_12:01:13 info: do_lrm_control:lrm.c Disconnected
> from the LRM
> crmd[31314]: 2006/12/18_12:01:13 info: do_ha_control:control.c
> Disconnected from Heartbeat
> crmd[31314]: 2006/12/18_12:01:13 info: do_cib_control:cib.c
> Disconnecting CIB
> crmd[31314]: 2006/12/18_12:01:13 ERROR: send_ipc_message:ipc.c IPC
> Channel to 31310 is not connected
>
> and then heartbeat shuts down leaving ldap and weblogic running.
>
> I would like heartbeat to shut down gracefully and shut ldap and weblogic
> down if it is going to go down, but even better would be for it not to
> shut down at all, as only fuego should move over to the other node.
>
> I'm guessing that there is some config I can add to the cib.xml file to
> achieve this but I am unable to work out what it is.
>
> Many thanks
>
> Magnus
>
>
> ______________________________________________________________________
> This email has been scanned by the MessageLabs Email Security System.
> For more information please visit http://www.messagelabs.com/email
> ______________________________________________________________________
> _______________________________________________
> Linux-HA mailing list
> Linux-HA at lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ha-log.gz
Type: application/x-gzip
Size: 16267 bytes
Desc: ha-log.gz
Url : http://lists.community.tummy.com/pipermail/linux-ha/attachments/20061220/2d80dec9/ha-log-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cib.xml
Type: text/xml
Size: 4338 bytes
Desc: cib.xml
Url : http://lists.community.tummy.com/pipermail/linux-ha/attachments/20061220/2d80dec9/cib-0001.bin