[Linux-HA] Ungraceful shutdown problem

Magnus Brown mbrown at nexagent.com
Wed Dec 20 02:15:00 MST 2006


Hi all,

Sorry, I forgot I had unsubscribed from the list before sending this
email, so it will go to a moderator first.

I have some more info though.

I have tried removing the eth1 connection (as opposed to the eth0 one
which gives the problem) and the resources all remain running on their
respective nodes as they should - hurrah. So the problem only occurs
when I remove the eth0 connection.

Another problem is that when the eth0 connection is restored and
heartbeat is restarted on the node which went down ungracefully, it
starts both ldap and weblogic as predicted (well, actually it finds that
ldap and weblogic are already running). But if the other node, where
fuego is running, is then shut down gracefully, fuego is not moved over
to the restarted node. So I have a situation where a required resource
is not running anywhere in the cluster, and the previously failed node
does not try to start it.

I thought that maybe the failure count had been set for fuego so tried
to check it with: -

crm_failcount -V -G -U edlapp02.eds.lcms.com -r fuego_res
name=fail-count-fuego_res value=(null) Error performing operation: The
object/attribute does not exist
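
To rule out a fail count recorded against the other node, the same query
could be repeated for each node in turn. This is only a minimal sketch:
it uses just the -G, -U and -r options shown above, and the function
name failcount_all_nodes is made up for illustration.

```shell
#!/bin/sh
# Sketch: run the crm_failcount query from above once per node.
# failcount_all_nodes is a hypothetical helper name, not part of heartbeat.
failcount_all_nodes() {
    rsc="$1"
    shift
    for node in "$@"; do
        printf '%s: ' "$node"
        # Same options as the command above: -G to query, -U for the
        # node, -r for the resource. A failed lookup does not stop the loop.
        crm_failcount -G -U "$node" -r "$rsc" 2>&1 || true
    done
}
```

It would be called as:
failcount_all_nodes fuego_res edlapp01.eds.lcms.com edlapp02.eds.lcms.com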

If I try and reset the failure count with: -

crm_failcount -D -U edlapp02.eds.lcms.com -r fuego_res

It has no effect. In order to get fuego to run on this previously failed
node I have to stop heartbeat, remove the following files: -

rm -f /var/lib/heartbeat/cores/root/*
rm -f /var/lib/heartbeat/cores/nobody/*
rm -f /var/lib/heartbeat/cores/hacluster/*
rm -f /var/lib/heartbeat/hb_generation
rm -f /var/lib/heartbeat/hb_uuid
rm -f /var/lib/heartbeat/hostcache
rm -f /var/lib/heartbeat/pengine/*
rm -f /var/lib/heartbeat/crm/cib.xml.last
rm -f /var/lib/heartbeat/crm/cib.xml.sig
rm -f /var/lib/heartbeat/crm/cib.xml.sig.last

and copy back the initial cib.xml I used to start with.
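
For reference, the cleanup above could be wrapped in a small function.
This is a sketch under assumptions, not a recommended procedure: it
expects heartbeat to be stopped already, the HA_DIR variable and the
function name are made up, and I am assuming the initial cib.xml is
copied back to crm/cib.xml under the same directory. File ownership and
permissions are not handled.

```shell
#!/bin/sh
# Sketch only: reset heartbeat's on-disk state as described above.
# Run only after heartbeat has been stopped.
# HA_DIR and reset_heartbeat_state are hypothetical names.
reset_heartbeat_state() {
    ha_dir="${HA_DIR:-/var/lib/heartbeat}"
    initial_cib="$1"   # saved copy of the initial cib.xml

    # Refuse to delete anything if there is no CIB to restore afterwards.
    if [ ! -f "$initial_cib" ]; then
        echo "initial cib not found: $initial_cib" >&2
        return 1
    fi

    # Same files as the rm -f list above.
    rm -f "$ha_dir"/cores/root/* \
          "$ha_dir"/cores/nobody/* \
          "$ha_dir"/cores/hacluster/* \
          "$ha_dir"/hb_generation \
          "$ha_dir"/hb_uuid \
          "$ha_dir"/hostcache \
          "$ha_dir"/pengine/* \
          "$ha_dir"/crm/cib.xml.last \
          "$ha_dir"/crm/cib.xml.sig \
          "$ha_dir"/crm/cib.xml.sig.last

    # Assumed destination for the restored CIB.
    cp "$initial_cib" "$ha_dir/crm/cib.xml"
}
```

Checking for the saved CIB first means a mistyped path cannot leave the
node with its state removed and nothing to start from.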

If I could get the same behaviour with eth0 as with eth1 I would be
happy, as fuego fails to run correctly without eth0 and so is failed
over correctly when eth0 is down. I just need to stop heartbeat shutting
itself down when eth0 is taken down.

Thank you
Magnus

-----Original Message-----
From: Magnus Brown
Sent: 19 December 2006 12:39
To: 'linux-ha at lists.linux-ha.org'
Subject: Ungraceful shutdown problem

Hi all,

I have a problem with heartbeat shutting down ungracefully and leaving
managed processes still running. I have attached the cib.xml and a
zipped ha-log.

I have two nodes which are connected via two LAN connections. The ha.cf
is shown below: -

use_logd on
udpport 694
keepalive 1
deadtime 45
mcast eth0 239.192.0.1 694 1 0
mcast eth1 239.192.0.2 694 1 0
node edlapp01.eds.lcms.com edlapp02.eds.lcms.com
crm yes

When I pull the eth0 cable on edlapp01, fuego is successfully moved to
edlapp02. I am then expecting ldap and weblogic to continue running on
edlapp01 but I get the following messages in ha-log: -

tengine[31320]: 2006/12/18_11:59:13 info: te_update_diff:callbacks.c
Processing diff (cib_update): 0.51.7185 -> 0.51.7186
tengine[31320]: 2006/12/18_11:59:13 info: match_graph_event:events.c
Action fuego_res_stop_0 (2) confirmed
tengine[31320]: 2006/12/18_11:59:13 info: te_pseudo_action:actions.c
Pseudo action 31 confirmed
tengine[31320]: 2006/12/18_11:59:13 info: te_pseudo_action:actions.c
Pseudo action 28 confirmed
tengine[31320]: 2006/12/18_11:59:13 info: send_rsc_command:actions.c
Initiating action 26: fuego_res_start_0 on edlapp02.eds.lcms.com
cib[2433]: 2006/12/18_11:59:13 info: write_cib_contents:io.c Wrote
version 0.51.7186 of the CIB to disk (digest:
3e4b41e1e8ce2f632e64696ae11c8b9d)
heartbeat[31250]: 2006/12/18_12:00:52 ERROR: Cannot write to media pipe
0: Resource temporarily unavailable
heartbeat[31250]: 2006/12/18_12:00:52 ERROR: Shutting down.
heartbeat[31250]: 2006/12/18_12:00:52 ERROR: Cannot write to media pipe
0: Resource temporarily unavailable
heartbeat[31250]: 2006/12/18_12:00:52 ERROR: Shutting down.
heartbeat[31250]: 2006/12/18_12:00:52 ERROR: Cannot write to media pipe
0: Resource temporarily unavailable
heartbeat[31250]: 2006/12/18_12:00:52 ERROR: Shutting down.
heartbeat[31250]: 2006/12/18_12:00:52 ERROR: Cannot write to media pipe
0: Resource temporarily unavailable
heartbeat[31250]: 2006/12/18_12:00:52 ERROR: Shutting down.
heartbeat[31250]: 2006/12/18_12:00:52 ERROR: Cannot write to media pipe
0: Resource temporarily unavailable
heartbeat[31250]: 2006/12/18_12:00:52 ERROR: Shutting down.
heartbeat[31250]: 2006/12/18_12:00:52 ERROR: Cannot write to media pipe
0: Resource temporarily unavailable
heartbeat[31250]: 2006/12/18_12:00:52 ERROR: Shutting down.
heartbeat[31250]: 2006/12/18_12:00:52 ERROR: Cannot write to media pipe
0: Resource temporarily unavailable
heartbeat[31250]: 2006/12/18_12:00:52 ERROR: Shutting down.

These "Cannot write" and "Shutting down" messages continue until: -

heartbeat[31250]: 2006/12/18_12:01:04 ERROR: Shutting down.
heartbeat[31250]: 2006/12/18_12:01:04 ERROR: Message hist queue is
filling up (200 messages in queue)
ccm[31309]: 2006/12/18_12:01:04 ERROR: Lost connection to heartbeat
service. Need to bail out.
cib[31310]: 2006/12/18_12:01:04 ERROR: cib_ha_connection_destroy:main.c
Heartbeat connection lost!  Exiting.
stonithd[31312]: 2006/12/18_12:01:04 ERROR: Disconnected with heartbeat
daemon
mgmtd[31315]: 2006/12/18_12:01:04 ERROR: Lost connection to heartbeat
service.
tengine[31320]: 2006/12/18_12:01:04 ERROR: stonithd_op_result_ready:
failed due to not on signon status.
cib[31310]: 2006/12/18_12:01:04 info: uninitializeCib:io.c The CIB has
been deallocated.
attrd[31313]: 2006/12/18_12:01:04 CRIT: attrd_ha_dispatch:attrd.c Lost
connection to heartbeat service.
stonithd[31312]: 2006/12/18_12:01:04 notice: /usr/lib/heartbeat/stonithd
normally quit.
crmd[31314]: 2006/12/18_12:01:04 CRIT: crmd_ha_msg_dispatch:callbacks.c
Lost connection to heartbeat service.
mgmtd[31315]: 2006/12/18_12:01:04 ERROR:
cib_native_msgready:cib_native.c Message pending on command channel
[31310]
tengine[31320]: 2006/12/18_12:01:04 ERROR:
tengine_stonith_connection_destroy:callbacks.c Fencing daemon has left
us
pengine[31321]: 2006/12/18_12:01:04 info: pengine_shutdown:main.c
Exiting PEngine (SIGTERM)
attrd[31313]: 2006/12/18_12:01:04 CRIT:
attrd_ha_connection_destroy:attrd.c Lost connection to heartbeat
service!
crmd[31314]: 2006/12/18_12:01:04 info: mem_handle_func:IPC broken, ccm
is dead before the client!

And then: -

crmd[31314]: 2006/12/18_12:01:13 info: verify_stopped:lrm.c Checking for
active resources before exit
crmd[31314]: 2006/12/18_12:01:13 ERROR: verify_stopped:lrm.c Resource
ldap_res:0 was active at shutdown.  You may ignore this error if it is
unmanaged.
crmd[31314]: 2006/12/18_12:01:13 ERROR: verify_stopped:lrm.c Resource
weblogic_res:0 was active at shutdown.  You may ignore this error if it
is unmanaged.
crmd[31314]: 2006/12/18_12:01:13 info: verify_stopped:lrm.c Checking for
active resources before exit
crmd[31314]: 2006/12/18_12:01:13 ERROR: verify_stopped:lrm.c Resource
ldap_res:0 was active at shutdown.  You may ignore this error if it is
unmanaged.
crmd[31314]: 2006/12/18_12:01:13 ERROR: verify_stopped:lrm.c Resource
weblogic_res:0 was active at shutdown.  You may ignore this error if it
is unmanaged.
crmd[31314]: 2006/12/18_12:01:13 info: do_lrm_control:lrm.c Disconnected
from the LRM
crmd[31314]: 2006/12/18_12:01:13 info: do_ha_control:control.c
Disconnected from Heartbeat
crmd[31314]: 2006/12/18_12:01:13 info: do_cib_control:cib.c
Disconnecting CIB
crmd[31314]: 2006/12/18_12:01:13 ERROR: send_ipc_message:ipc.c IPC
Channel to 31310 is not connected

and then heartbeat shuts down leaving ldap and weblogic running.
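
To see at a glance which resources were left running, the crmd
verify_stopped errors could be pulled out of ha-log. A rough sketch,
assuming each log entry sits on a single line in the real file (the
excerpts above are wrapped by the mailer); the function name is made up
and the log path is whatever your logging is configured to use.

```shell
#!/bin/sh
# Sketch: list each resource that crmd reported as still active at shutdown.
# active_at_shutdown is a hypothetical helper; pass the ha-log path(s) as
# arguments, or pipe the log in on stdin.
active_at_shutdown() {
    # Keep only the resource name from lines such as:
    #   ... verify_stopped:lrm.c Resource ldap_res:0 was active at shutdown. ...
    sed -n 's/.*verify_stopped:lrm\.c Resource \([^ ]*\) was active at shutdown.*/\1/p' "$@" |
        sort -u
}
```

It would be called as: active_at_shutdown /path/to/ha-log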

I would like heartbeat to shut down gracefully and stop ldap and
weblogic if it is going to go down, but even better would be for it not
to shut down at all, as only fuego should move over to the other node.

I'm guessing that there is some config I can add to the cib.xml file to
achieve this, but I am unable to work out what it is.

Many thanks

Magnus

