[Linux-HA] Simple cluster hung in (transition,
none) state for 2 hours
Martin Gazak
martin.gazak at microstep-mis.com
Mon Jan 8 23:32:46 MST 2007
I am resending this post because I did not realized that my e-mail
settings changed and therefore the original post is waiting
for admin approval. Sorry for any incovenience.
Hello,
today I have found our cluster hung in a very special state
(not good of course) - one node hung in "transition" state
for more than 2 hours, the other in "none" state.
It happened also once last week, but I did not have log
files from that case.
We use heartbeat on 2 linux boxes (ims0 and ims1) just for failover
of the IP (192.168.145.173). The ha.cf and haresources files including
portions of ha.log files are included at the bottom of the e-mail
(I have also ha-debug logs available, I just does not want to send
too much data).
Box ims0 runs SuSE Enterprise Linux and heartbeat 2.0.5.
Box ims1 runs OpenSuSE Linux and heartbeat 2.0.7.
(The reason why we use heterogenous cluster is the fact, that we were
having troubles with some hardware in SuSE Enterprise Linux and
therefore we are testing also OpenSUSE).
They are interconnected by a crossover cable (eth1) and null modem cable
(seems not to work according to logs).
Situation:
14:22:.. Both nodes running, ims0 with status "none", ims1 with status
"all". Everything OK.
14:22:57 we restarted node ims1 (the heartbeat was killed by
"/usr/bin/killall -9 heartbeat" )
14:22:57 node ims0 changed to "transition" and remained in this state
for another 2 hours (!)
14:25:01 node ims1 (heartbeat) up, however nothing happened with ims0 -
remained in state transition
14:26:.. The system remained in this state for 2 hours:
on ims1 both "clstatus nodestatus ims0" and "clstatus
nodestatus ims1" were "active"
on ims1 both "clstatus rscstatus ims0" and "clstatus rscstatus
ims1" were "none"
on ims0 both "clstatus nodestatus ims0" and "clstatus
nodestatus ims1" were "active"
on ims0 both "clstatus rscstatus ims0" and "clstatus rscstatus
ims1" were "transition"
The shared IP was unavailable.
Restart of node ims1 did not help.
Finally after restart of node ims0 (/usr/bin/killall -9 heartbeat) the
ims1 became MAIN (resources all)
and ims0 became after restart STANDBY (resources none).
Does somebody has any idea what happened ?
I thank you for any suggestion how to prevent such state in advance
(will the use of same heartbeat version 2.0.7 prevent this ?).
Best regards
Martin Gazak
haresources:
---------
ims0 192.168.145.173
---------
ha.cf:
---------
keepalive 400ms
deadtime 2
warntime 800ms
initdead 10
baud 115200
serial /dev/ttyS0 # Linux
ucast eth1 ims-other-local
auto_failback off
node ims0
node ims1
respawn hacluster /usr/lib/heartbeat/ipfail
realtime on
debug 3
apiauth ims uid=root gid=root
use_logd yes
---------
ha.log from ims0 (the node stuck in TRANSITION):
---------------------------------------------------
heartbeat[7101]: 2007/01/08_14:17:40 info: MSG stats: 0/0 ms age
524564656 [pid7118/HBREAD]
heartbeat[7101]: 2007/01/08_14:17:40 info: ha_malloc stats: 469/40321
57028/29925 [pid7118/HBREAD]
heartbeat[7101]: 2007/01/08_14:17:40 info: RealMalloc stats: 58224 total
malloc bytes. pid [7118/HBREAD]
heartbeat[7101]: 2007/01/08_14:17:40 info: Current arena value: 0
heartbeat[7101]: 2007/01/08_14:17:40 info: These are nothing to worry about.
heartbeat[7101]: 2007/01/08_14:22:57 WARN: node ims1: is dead
heartbeat[7101]: 2007/01/08_14:22:57 WARN: No STONITH device configured.
heartbeat[7101]: 2007/01/08_14:22:57 WARN: Shared disks are not protected.
heartbeat[7101]: 2007/01/08_14:22:57 info: Resources being acquired from
ims1.
heartbeat[7101]: 2007/01/08_14:22:57 info: Link ims1:eth1 dead.
harc[9445][9458]: 2007/01/08_14:22:57 info: Running
/etc/ha.d/rc.d/status status
mach_down[9509][9563]: 2007/01/08_14:22:57 info:
/usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired
mach_down[9509][9571]: 2007/01/08_14:22:57 info: mach_down takeover
complete for node ims1.
heartbeat[7101]: 2007/01/08_14:22:57 info: Exiting status process 9445
returned rc 0.
IPaddr[9476][9589]: 2007/01/08_14:22:58 INFO: IPaddr Resource is stopped
req_resource[9459][9590]: 2007/01/08_14:22:58 debug: in
/usr/lib/heartbeat/req_resource 192.168.145.173
req_resource[9459][9591]: 2007/01/08_14:22:58 debug: dont_ask: yes
nice_failback: yes
heartbeat[9446]: 2007/01/08_14:22:58 info: 1 local resources from
[/usr/lib/heartbeat/ResourceManager listkeys ims0]
heartbeat[9446]: 2007/01/08_14:22:58 info: Local Resource acquisition
completed.
heartbeat[9446]: 2007/01/08_14:22:58 info: Writing type [resource]
message to FIFO
heartbeat[9446]: 2007/01/08_14:22:58 info: FIFO message [type resource]
written rc=79
heartbeat[7101]: 2007/01/08_14:22:58 info: Exiting req_our_resources
process 9446 returned rc 0.
heartbeat[7101]: 2007/01/08_14:22:58 info: AnnounceTakeover(local 1,
foreign 1, reason 'req_our_resources' (1))
heartbeat[7101]: 2007/01/08_14:25:01 info: Heartbeat restart on node ims1
heartbeat[7101]: 2007/01/08_14:25:01 info: Link ims1:eth1 up.
heartbeat[7101]: 2007/01/08_14:25:01 info: Status update for node ims1:
status init
heartbeat[7101]: 2007/01/08_14:25:01 info: Status update for node ims1:
status up
harc[9607][9610]: 2007/01/08_14:25:01 info: Running
/etc/ha.d/rc.d/status status
heartbeat[7101]: 2007/01/08_14:25:01 info: Exiting status process 9607
returned rc 0.
harc[9613][9616]: 2007/01/08_14:25:01 info: Running
/etc/ha.d/rc.d/status status
heartbeat[7101]: 2007/01/08_14:25:01 info: Exiting status process 9613
returned rc 0.
heartbeat[7101]: 2007/01/08_14:25:01 info: all clients are now paused
heartbeat[7101]: 2007/01/08_14:25:03 WARN: Late heartbeat: Node ims1:
interval 1940 ms
heartbeat[7101]: 2007/01/08_14:25:03 info: Status update for node ims1:
status active
heartbeat[7101]: 2007/01/08_14:25:03 WARN: T_STARTING received during
takeover.
heartbeat[7101]: 2007/01/08_14:25:03 info: AnnounceTakeover(local 1,
foreign 1, reason 'T_RESOURCES(us)' (1))
heartbeat[7101]: 2007/01/08_14:25:03 info: all clients are now resumed
harc[9621][9624]: 2007/01/08_14:25:03 info: Running
/etc/ha.d/rc.d/status statusg
heartbeat[7101]: 2007/01/08_14:25:03 info: Exiting status process 9621
returned rc 0.
heartbeat[7101]: 2007/01/08_14:25:03 info: other_holds_resources: 0
heartbeat[7101]: 2007/01/08_14:25:03 info: remote resource transition
completed.
heartbeat[7101]: 2007/01/08_14:25:03 info: AnnounceTakeover(local 1,
foreign 1, reason 'T_RESOURCES(us)' (1))
heartbeat[7101]: 2007/01/08_14:25:03 info: other_holds_resources: 0
heartbeat[7101]: 2007/01/08_14:27:40 info: Daily informational memory
statistics
heartbeat[7101]: 2007/01/08_14:27:40 info: MSG stats: 14/84784 ms age 0
[pid7101/MST_CONTROL]
heartbeat[7101]: 2007/01/08_14:27:40 info: ha_malloc stats: 964/2710428
143120/70947 [pid7101/MST_CONTROL]
heartbeat[7101]: 2007/01/08_14:27:40 info: RealMalloc stats: 348756
total malloc bytes. pid [7101/MST_CONTROL]
heartbeat[7101]: 2007/01/08_14:27:40 info: Current arena value: 0
heartbeat[7101]: 2007/01/08_14:27:40 info: MSG stats: 0/0 ms age
525164656 [pid7114/HBFIFO]
heartbeat[7101]: 2007/01/08_14:27:40 info: ha_malloc stats: 461/621
47716/24968 [pid7114/HBFIFO]
heartbeat[7101]: 2007/01/08_14:27:40 info: RealMalloc stats: 47916 total
malloc bytes. pid [7114/HBFIFO]
heartbeat[7101]: 2007/01/08_14:27:40 info: Current arena value: 0
heartbeat[7101]: 2007/01/08_14:27:40 info: MSG stats: 0/0 ms age
525164656 [pid7115/HBWRITE]
heartbeat[7101]: 2007/01/08_14:27:40 info: ha_malloc stats: 466/23232
56776/29793 [pid7115/HBWRITE]
heartbeat[7101]: 2007/01/08_14:27:40 info: RealMalloc stats: 65892 total
malloc bytes. pid [7115/HBWRITE]
heartbeat[7101]: 2007/01/08_14:27:40 info: Current arena value: 0
heartbeat[7101]: 2007/01/08_14:27:40 info: MSG stats: 0/0 ms age
525164656 [pid7116/HBREAD]
heartbeat[7101]: 2007/01/08_14:27:40 info: ha_malloc stats: 466/658
48648/25740 [pid7116/HBREAD]
heartbeat[7101]: 2007/01/08_14:27:40 info: RealMalloc stats: 49312 total
malloc bytes. pid [7116/HBREAD]
heartbeat[7101]: 2007/01/08_14:27:40 info: Current arena value: 0
heartbeat[7101]: 2007/01/08_14:27:40 info: MSG stats: 0/0 ms age
525164656 [pid7117/HBWRITE]
heartbeat[7101]: 2007/01/08_14:27:40 info: ha_malloc stats: 468/23254
56944/29881 [pid7117/HBWRITE]
heartbeat[7101]: 2007/01/08_14:27:40 info: RealMalloc stats: 66060 total
malloc bytes. pid [7117/HBWRITE]
heartbeat[7101]: 2007/01/08_14:27:40 info: Current arena value: 0
heartbeat[7101]: 2007/01/08_14:27:40 info: MSG stats: 0/0 ms age
525164656 [pid7118/HBREAD]
heartbeat[7101]: 2007/01/08_14:27:40 info: ha_malloc stats: 469/43195
57028/29925 [pid7118/HBREAD]
heartbeat[7101]: 2007/01/08_14:27:40 info: RealMalloc stats: 58224 total
malloc bytes. pid [7118/HBREAD]
heartbeat[7101]: 2007/01/08_14:27:40 info: Current arena value: 0
heartbeat[7101]: 2007/01/08_14:27:40 info: These are nothing to worry about.
............ Daily memory statistics just removed, TRANSITION state
continues.
heartbeat[7101]: 2007/01/08_16:27:40 info: Current arena value: 0
heartbeat[7101]: 2007/01/08_16:27:40 info: MSG stats: 0/0 ms age
532364686 [pid7118/HBREAD]
heartbeat[7101]: 2007/01/08_16:27:40 info: ha_malloc stats: 469/86427
57028/29925 [pid7118/HBREAD]
heartbeat[7101]: 2007/01/08_16:27:40 info: RealMalloc stats: 58224 total
malloc bytes. pid [7118/HBREAD]
heartbeat[7101]: 2007/01/08_16:27:40 info: Current arena value: 0
heartbeat[7101]: 2007/01/08_16:27:40 info: These are nothing to worry about.
heartbeat[7101]: 2007/01/08_16:29:08 WARN: node ims1: is dead
heartbeat[7101]: 2007/01/08_16:29:08 info: Dead node ims1 gave up resources.
heartbeat[7101]: 2007/01/08_16:29:08 info: Link ims1:eth1 dead.
heartbeat[7101]: 2007/01/08_16:31:01 info: Heartbeat restart on node ims1
heartbeat[7101]: 2007/01/08_16:31:01 info: Link ims1:eth1 up.
heartbeat[7101]: 2007/01/08_16:31:01 info: Status update for node ims1:
status init
heartbeat[7101]: 2007/01/08_16:31:01 info: Status update for node ims1:
status up
heartbeat[7101]: 2007/01/08_16:31:01 info: all clients are now paused
harc[11567][11571]: 2007/01/08_16:31:02 info: Running
/etc/ha.d/rc.d/status status
heartbeat[7101]: 2007/01/08_16:31:02 info: Exiting status process 11567
returned rc 0.
harc[11574][11577]: 2007/01/08_16:31:02 info: Running
/etc/ha.d/rc.d/status status
heartbeat[7101]: 2007/01/08_16:31:02 info: Exiting status process 11574
returned rc 0.
heartbeat[7101]: 2007/01/08_16:31:02 info: Status update for node ims1:
status active
heartbeat[7101]: 2007/01/08_16:31:02 WARN: T_STARTING received during
takeover.
heartbeat[7101]: 2007/01/08_16:31:02 info: AnnounceTakeover(local 1,
foreign 1, reason 'T_RESOURCES(us)' (1))
harc[11580][11583]: 2007/01/08_16:31:02 info: Running
/etc/ha.d/rc.d/status status
heartbeat[7101]: 2007/01/08_16:31:02 info: other_holds_resources: 0
heartbeat[7101]: 2007/01/08_16:31:02 info: remote resource transition
completed.
heartbeat[7101]: 2007/01/08_16:31:02 info: AnnounceTakeover(local 1,
foreign 1, reason 'T_RESOURCES(us)' (1))
heartbeat[7101]: 2007/01/08_16:31:02 info: other_holds_resources: 0
heartbeat[7101]: 2007/01/08_16:31:02 info: Exiting status process 11580
returned rc 0.
heartbeat[7101]: 2007/01/08_16:31:03 info: all clients are now resumed
logd[12088]: 2007/01/08_16:35:26 info: logd started with /etc/logd.cf.
logd[12089]: 2007/01/08_16:35:26 info: G_main_add_SignalHandler: Added
signal handler for signal 15
logd[12088]: 2007/01/08_16:35:26 info: G_main_add_SignalHandler: Added
signal handler for signal 15
heartbeat[12095]: 2007/01/08_16:35:27 info: Enabling logging daemon
heartbeat[12095]: 2007/01/08_16:35:27 info: logfile and debug file are
those specified in logd config file (default /etc/logd.cf)
heartbeat[12095]: 2007/01/08_16:35:28 info: AUTH: i=2: key = 0x80f3490,
auth=0xb7f90f00, authname=crc
heartbeat[12095]: 2007/01/08_16:35:28 info: **************************
heartbeat[12095]: 2007/01/08_16:35:28 info: Configuration validated.
Starting heartbeat 2.0.5
heartbeat[12105]: 2007/01/08_16:35:28 info: heartbeat: version 2.0.5
heartbeat[12105]: 2007/01/08_16:35:37 info: Heartbeat generation: 529
heartbeat[12105]: 2007/01/08_16:35:37 info: G_main_add_TriggerHandler:
Added signal manual handler
heartbeat[12105]: 2007/01/08_16:35:37 info: G_main_add_TriggerHandler:
Added signal manual handler
heartbeat[12105]: 2007/01/08_16:35:37 info: Creating FIFO
/var/lib/heartbeat/fifo.
heartbeat[12105]: 2007/01/08_16:35:37 info: Removing
/var/run/heartbeat/rsctmp failed, recreating.
heartbeat[12105]: 2007/01/08_16:35:37 info: glib: Starting serial
heartbeat on tty /dev/ttyS0 (115200 baud)
heartbeat[12105]: 2007/01/08_16:35:37 info: glib: ucast: write socket
priority set to IPTOS_LOWDELAY on eth1
heartbeat[12105]: 2007/01/08_16:35:37 info: glib: ucast: bound send
socket to device: eth1
heartbeat[12105]: 2007/01/08_16:35:37 info: glib: ucast: bound receive
socket to device: eth1
heartbeat[12105]: 2007/01/08_16:35:37 info: glib: ucast: started on port
694 interface eth1 to 10.10.10.2
heartbeat[12105]: 2007/01/08_16:35:37 info: G_main_add_SignalHandler:
Added signal handler for signal 17
heartbeat[12105]: 2007/01/08_16:35:37 info: Local status now set to: 'up'
heartbeat[12105]: 2007/01/08_16:35:38 info: Link ims1:eth1 up.
heartbeat[12105]: 2007/01/08_16:35:38 info: Status update for node ims1:
status active
harc[12137][12141]: 2007/01/08_16:35:38 info: Running
/etc/ha.d/rc.d/status statush
heartbeat[12105]: 2007/01/08_16:35:38 info: Exiting status process 12137
returned rc 0.
heartbeat[12105]: 2007/01/08_16:35:39 info: Comm_now_up(): updating
status to active
heartbeat[12105]: 2007/01/08_16:35:39 WARN: Late heartbeat: Node ims0:
interval 870 ms
heartbeat[12105]: 2007/01/08_16:35:39 info: Local status now set to:
'active'
heartbeat[12105]: 2007/01/08_16:35:39 info: Starting child client
"/usr/lib/heartbeat/ipfail" (90,90)
heartbeat[12105]: 2007/01/08_16:35:39 WARN: G_CH_dispatch_int: Dispatch
function for read child took too long to execute: 570 ms (GSource:
0x80fe7b8)
heartbeat[12105]: 2007/01/08_16:35:39 WARN: Gmain_timeout_dispatch:
Dispatch function for send local status was delayed 470 ms before being
called (GSource: 0x80fe688)
heartbeat[12105]: 2007/01/08_16:35:39 info: Gmain_timeout_dispatch:
started at 1771271259 should have started at 1771271212
heartbeat[12105]: 2007/01/08_16:35:39 WARN: Gmain_timeout_dispatch:
Dispatch function for check for signals was delayed 480 ms before being
called (GSource: 0x80ff150)
heartbeat[12105]: 2007/01/08_16:35:39 info: Gmain_timeout_dispatch:
started at 1771271260 should have started at 1771271212
heartbeat[12148]: 2007/01/08_16:35:39 info: Starting
"/usr/lib/heartbeat/ipfail" as uid 90 gid 90 (pid 12148)
heartbeat[12105]: 2007/01/08_16:35:39 info: AnnounceTakeover(local 0,
foreign 1, reason 'T_RESOURCES' (0))
heartbeat[12105]: 2007/01/08_16:35:39 info: remote resource transition
completed.
heartbeat[12105]: 2007/01/08_16:35:39 info: AnnounceTakeover(local 0,
foreign 1, reason 'T_RESOURCES' (0))
heartbeat[12105]: 2007/01/08_16:35:39 info: STATE 1 => 3
heartbeat[12105]: 2007/01/08_16:35:39 info: other_holds_resources: 3
heartbeat[12105]: 2007/01/08_16:35:39 info: remote resource transition
completed.
heartbeat[12105]: 2007/01/08_16:35:39 info: Local Resource acquisition
completed. (none)
heartbeat[12105]: 2007/01/08_16:35:39 info: AnnounceTakeover(local 1,
foreign 1, reason 'T_RESOURCES(them)' (0))
heartbeat[12105]: 2007/01/08_16:35:39 info: Initial resource acquisition
complete (T_RESOURCES(them))
heartbeat[12105]: 2007/01/08_16:35:39 info: STATE 3 => 4
heartbeat[12105]: 2007/01/08_16:35:40 info: other_holds_resources: 3
heartbeat[12626]: 2007/01/08_16:39:07 info: Enabling logging daemon
heartbeat[12626]: 2007/01/08_16:39:07 info: logfile and debug file are
those specified in logd config file (default /etc/logd.cf)
heartbeat[12626]: 2007/01/08_16:39:07 info: AUTH: i=2: key = 0x80f3490,
auth=0xb7fd5f00, authname=crc
heartbeat[12626]: 2007/01/08_16:39:07 info: **************************
heartbeat[12626]: 2007/01/08_16:39:07 info: Configuration validated.
Starting heartbeat 2.0.5
heartbeat[12626]: 2007/01/08_16:39:07 info: heartbeat: already running
[pid 12105].
--------------------------
ha.log from ims1 (the node stuck in NONE):
---------------------------------------------------
heartbeat[27535]: 2007/01/08_14:19:22 info: MSG stats: 0/0 ms age
1042697416 [pid27550/HBREAD]
heartbeat[27535]: 2007/01/08_14:19:22 info: ha_malloc stats: 425/104144
50548/24697 [pid27550/HBREAD]
heartbeat[27535]: 2007/01/08_14:19:22 info: RealMalloc stats: 51744
total malloc bytes. pid [27550/HBREAD]
heartbeat[27535]: 2007/01/08_14:19:22 info: Current arena value: 0
heartbeat[27535]: 2007/01/08_14:19:22 info: These are nothing to worry
about.
logd[30450]: 2007/01/08_14:24:57 info: logd started with /etc/logd.cf.
logd[30450]: 2007/01/08_14:24:57 WARN: Core dumps could be lost if
multiple dumps occur
logd[30450]: 2007/01/08_14:24:57 WARN: Consider setting
/proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum
supportability
logd[30451]: 2007/01/08_14:24:57 info: G_main_add_SignalHandler: Added
signal handler for signal 15
logd[30450]: 2007/01/08_14:24:57 info: G_main_add_SignalHandler: Added
signal handler for signal 15
heartbeat[30457]: 2007/01/08_14:24:58 info: Enabling logging daemon
heartbeat[30457]: 2007/01/08_14:24:58 info: logfile and debug file are
those specified in logd config file (default /etc/logd.cf)
heartbeat[30457]: 2007/01/08_14:24:58 info: AUTH: i=2: key = 0x80f5f30,
auth=0xb7f4a030, authname=crc
heartbeat[30457]: 2007/01/08_14:24:58 WARN: Core dumps could be lost if
multiple dumps occur
heartbeat[30457]: 2007/01/08_14:24:58 WARN: Consider setting
/proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum
supportability
heartbeat[30457]: 2007/01/08_14:24:58 info: **************************
heartbeat[30457]: 2007/01/08_14:24:58 info: Configuration validated.
Starting heartbeat 2.0.7
heartbeat[30460]: 2007/01/08_14:24:58 info: heartbeat: version 2.0.7
heartbeat[30460]: 2007/01/08_14:24:59 info: Heartbeat generation: 661
heartbeat[30460]: 2007/01/08_14:24:59 info: G_main_add_TriggerHandler:
Added signal manual handler
heartbeat[30460]: 2007/01/08_14:24:59 info: G_main_add_TriggerHandler:
Added signal manual handler
heartbeat[30460]: 2007/01/08_14:24:59 info: Creating FIFO
/var/lib/heartbeat/fifo.
heartbeat[30460]: 2007/01/08_14:24:59 info: Removing
/var/run/heartbeat/rsctmp failed, recreating.
heartbeat[30460]: 2007/01/08_14:24:59 info: glib: Starting serial
heartbeat on tty /dev/ttyS0 (115200 baud)
heartbeat[30460]: 2007/01/08_14:24:59 info: glib: ucast: write socket
priority set to IPTOS_LOWDELAY on eth1
heartbeat[30460]: 2007/01/08_14:24:59 info: glib: ucast: bound send
socket to device: eth1
heartbeat[30460]: 2007/01/08_14:24:59 info: glib: ucast: bound receive
socket to device: eth1
heartbeat[30460]: 2007/01/08_14:24:59 info: glib: ucast: started on port
694 interface eth1 to 10.10.10.1
heartbeat[30460]: 2007/01/08_14:24:59 info: G_main_add_SignalHandler:
Added signal handler for signal 17
heartbeat[30460]: 2007/01/08_14:24:59 info: Local status now set to: 'up'
heartbeat[30460]: 2007/01/08_14:25:01 info: Link ims0:eth1 up.
heartbeat[30460]: 2007/01/08_14:25:01 info: Status update for node ims0:
status active
harc[30487][30491]: 2007/01/08_14:25:01 info: Running
/etc/ha.d/rc.d/status status
heartbeat[30460]: 2007/01/08_14:25:02 info: Comm_now_up(): updating
status to active
heartbeat[30460]: 2007/01/08_14:25:02 WARN: Late heartbeat: Node ims1:
interval 1750 ms
heartbeat[30460]: 2007/01/08_14:25:02 info: Local status now set to:
'active'
heartbeat[30460]: 2007/01/08_14:25:02 info: Starting child client
"/usr/lib/heartbeat/ipfail" (90,90)
heartbeat[30460]: 2007/01/08_14:25:02 WARN: G_CH_dispatch_int: Dispatch
function for read child took too long to execute: 1550 ms (> 50 ms)
(GSource: 0x81003e0)
heartbeat[30460]: 2007/01/08_14:25:02 WARN: Gmain_timeout_dispatch:
Dispatch function for send local status was delayed 1350 ms (> 210 ms)
before being called (GSource: 0x8100478)
heartbeat[30460]: 2007/01/08_14:25:02 info: Gmain_timeout_dispatch:
started at 1822290685 should have started at 1822290550
heartbeat[30460]: 2007/01/08_14:25:02 WARN: Late heartbeat: Node ims0:
interval 1550 ms
heartbeat[30460]: 2007/01/08_14:25:02 WARN: G_SIG_dispatch: Dispatch
function for SIGCHLD was delayed 1550 ms (> 100 ms) before being called
(GSource: 0x80fa078)
heartbeat[30460]: 2007/01/08_14:25:02 info: G_SIG_dispatch: started at
1822290685 should have started at 1822290530
heartbeat[30460]: 2007/01/08_14:25:02 info: Exiting status process 30487
returned rc 0.
heartbeat[30460]: 2007/01/08_14:25:02 WARN: Gmain_timeout_dispatch:
Dispatch function for check for signals was delayed 1320 ms (> 210 ms)
before being called (GSource: 0x8100780)
heartbeat[30460]: 2007/01/08_14:25:02 info: Gmain_timeout_dispatch:
started at 1822290685 should have started at 1822290553
heartbeat[30460]: 2007/01/08_14:25:02 WARN: Gmain_timeout_dispatch:
Dispatch function for send_reqnodes_msg was delayed 720 ms (> 400 ms)
before being called (GSource: 0x8109888)
heartbeat[30460]: 2007/01/08_14:25:02 info: Gmain_timeout_dispatch:
started at 1822290685 should have started at 1822290613
heartbeat[30503]: 2007/01/08_14:25:02 info: Starting
"/usr/lib/heartbeat/ipfail" as uid 90 gid 90 (pid 30503)
heartbeat[30460]: 2007/01/08_14:25:03 info: AnnounceTakeover(local 0,
foreign 1, reason 'T_RESOURCES' (0))
heartbeat[30460]: 2007/01/08_14:25:03 info: remote resource transition
completed.
heartbeat[30460]: 2007/01/08_14:25:03 info: AnnounceTakeover(local 0,
foreign 1, reason 'T_RESOURCES' (0))
heartbeat[30460]: 2007/01/08_14:25:03 info: STATE 1 => 3
heartbeat[30460]: 2007/01/08_14:25:03 info: other_holds_resources: 3
heartbeat[30460]: 2007/01/08_14:25:03 info: remote resource transition
completed.
heartbeat[30460]: 2007/01/08_14:25:03 info: Local Resource acquisition
completed. (none)
heartbeat[30460]: 2007/01/08_14:25:03 info: AnnounceTakeover(local 1,
foreign 1, reason 'T_RESOURCES(them)' (0))
heartbeat[30460]: 2007/01/08_14:25:03 info: Initial resource acquisition
complete (T_RESOURCES(them))
heartbeat[30460]: 2007/01/08_14:25:03 info: STATE 3 => 4
heartbeat[30460]: 2007/01/08_14:25:03 info: other_holds_resources: 3
heartbeat[30460]: 2007/01/08_14:34:59 info: Daily informational memory
statistics
heartbeat[30460]: 2007/01/08_14:34:59 info: MSG stats: 8/7184 ms age 0
[pid30460/MST_CONTROL]
heartbeat[30460]: 2007/01/08_14:34:59 info: ha_malloc stats: 731/230361
122428/59315 [pid30460/MST_CONTROL]
heartbeat[30460]: 2007/01/08_14:34:59 info: RealMalloc stats: 164092
total malloc bytes. pid [30460/MST_CONTROL]
heartbeat[30460]: 2007/01/08_14:34:59 info: Current arena value: 0
heartbeat[30460]: 2007/01/08_14:34:59 info: MSG stats: 0/0 ms age
1043634716 [pid30471/HBFIFO]
heartbeat[30460]: 2007/01/08_14:34:59 info: ha_malloc stats: 419/582
41852/20304 [pid30471/HBFIFO]
heartbeat[30460]: 2007/01/08_14:34:59 info: RealMalloc stats: 42052
total malloc bytes. pid [30471/HBFIFO]
heartbeat[30460]: 2007/01/08_14:34:59 info: Current arena value: 0
heartbeat[30460]: 2007/01/08_14:34:59 info: MSG stats: 0/0 ms age
1043634716 [pid30476/HBWRITE]
heartbeat[30460]: 2007/01/08_14:34:59 info: ha_malloc stats: 422/2505
50296/24565 [pid30476/HBWRITE]
heartbeat[30460]: 2007/01/08_14:34:59 info: RealMalloc stats: 59412
total malloc bytes. pid [30476/HBWRITE]
heartbeat[30460]: 2007/01/08_14:34:59 info: Current arena value: 0
heartbeat[30460]: 2007/01/08_14:34:59 info: MSG stats: 0/0 ms age
1043634716 [pid30477/HBREAD]
heartbeat[30460]: 2007/01/08_14:34:59 info: ha_malloc stats: 422/617
42168/20512 [pid30477/HBREAD]
heartbeat[30460]: 2007/01/08_14:34:59 info: RealMalloc stats: 42832
total malloc bytes. pid [30477/HBREAD]
heartbeat[30460]: 2007/01/08_14:34:59 info: Current arena value: 0
heartbeat[30460]: 2007/01/08_14:34:59 info: MSG stats: 0/0 ms age
1043634716 [pid30478/HBWRITE]
heartbeat[30460]: 2007/01/08_14:34:59 info: ha_malloc stats: 424/2526
50464/24653 [pid30478/HBWRITE]
heartbeat[30460]: 2007/01/08_14:34:59 info: RealMalloc stats: 59580
total malloc bytes. pid [30478/HBWRITE]
heartbeat[30460]: 2007/01/08_14:34:59 info: Current arena value: 0
heartbeat[30460]: 2007/01/08_14:34:59 info: MSG stats: 0/0 ms age
1043634716 [pid30479/HBREAD]
heartbeat[30460]: 2007/01/08_14:34:59 info: ha_malloc stats: 425/4254
50548/24697 [pid30479/HBREAD]
heartbeat[30460]: 2007/01/08_14:34:59 info: RealMalloc stats: 51744
total malloc bytes. pid [30479/HBREAD]
heartbeat[30460]: 2007/01/08_14:34:59 info: Current arena value: 0
heartbeat[30460]: 2007/01/08_14:34:59 info: These are nothing to worry
about.
...
heartbeat[30460]: 2007/01/08_16:25:00 info: Daily informational memory
statistics
heartbeat[30460]: 2007/01/08_16:25:00 info: MSG stats: 11/86454 ms age 0
[pid30460/MST_CONTROL]
heartbeat[30460]: 2007/01/08_16:25:00 info: ha_malloc stats: 821/2755779
128548/61646 [pid30460/MST_CONTROL]
heartbeat[30460]: 2007/01/08_16:25:00 info: RealMalloc stats: 164368
total malloc bytes. pid [30460/MST_CONTROL]
heartbeat[30460]: 2007/01/08_16:25:00 info: Current arena value: 0
heartbeat[30460]: 2007/01/08_16:25:00 info: MSG stats: 0/0 ms age
1050234756 [pid30471/HBFIFO]
heartbeat[30460]: 2007/01/08_16:25:00 info: ha_malloc stats: 419/582
41852/20304 [pid30471/HBFIFO]
heartbeat[30460]: 2007/01/08_16:25:00 info: RealMalloc stats: 42052
total malloc bytes. pid [30471/HBFIFO]
heartbeat[30460]: 2007/01/08_16:25:00 info: Current arena value: 0
heartbeat[30460]: 2007/01/08_16:25:00 info: MSG stats: 0/0 ms age
1050234756 [pid30476/HBWRITE]
heartbeat[30460]: 2007/01/08_16:25:00 info: ha_malloc stats: 422/23217
50296/24565 [pid30476/HBWRITE]
heartbeat[30460]: 2007/01/08_16:25:00 info: RealMalloc stats: 59412
total malloc bytes. pid [30476/HBWRITE]
heartbeat[30460]: 2007/01/08_16:25:00 info: Current arena value: 0
heartbeat[30460]: 2007/01/08_16:25:00 info: MSG stats: 0/0 ms age
1050234756 [pid30477/HBREAD]
heartbeat[30460]: 2007/01/08_16:25:00 info: ha_malloc stats: 422/617
42168/20512 [pid30477/HBREAD]
heartbeat[30460]: 2007/01/08_16:25:00 info: RealMalloc stats: 42832
total malloc bytes. pid [30477/HBREAD]
heartbeat[30460]: 2007/01/08_16:25:00 info: Current arena value: 0
heartbeat[30460]: 2007/01/08_16:25:00 info: MSG stats: 0/0 ms age
1050234756 [pid30478/HBWRITE]
heartbeat[30460]: 2007/01/08_16:25:00 info: ha_malloc stats: 424/23238
50464/24653 [pid30478/HBWRITE]
heartbeat[30460]: 2007/01/08_16:25:00 info: RealMalloc stats: 59580
total malloc bytes. pid [30478/HBWRITE]
heartbeat[30460]: 2007/01/08_16:25:00 info: Current arena value: 0
heartbeat[30460]: 2007/01/08_16:25:00 info: MSG stats: 0/0 ms age
1050234756 [pid30479/HBREAD]
heartbeat[30460]: 2007/01/08_16:25:00 info: ha_malloc stats: 425/43884
50548/24697 [pid30479/HBREAD]
heartbeat[30460]: 2007/01/08_16:25:00 info: RealMalloc stats: 51744
total malloc bytes. pid [30479/HBREAD]
heartbeat[30460]: 2007/01/08_16:25:00 info: Current arena value: 0
heartbeat[30460]: 2007/01/08_16:25:00 info: These are nothing to worry
about.
logd[892]: 2007/01/08_16:30:58 info: logd started with /etc/logd.cf.
logd[892]: 2007/01/08_16:30:58 WARN: Core dumps could be lost if
multiple dumps occur
logd[892]: 2007/01/08_16:30:58 WARN: Consider setting
/proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum
supportability
logd[893]: 2007/01/08_16:30:58 info: G_main_add_SignalHandler: Added
signal handler for signal 15
logd[892]: 2007/01/08_16:30:58 info: G_main_add_SignalHandler: Added
signal handler for signal 15
heartbeat[899]: 2007/01/08_16:30:59 info: Enabling logging daemon
heartbeat[899]: 2007/01/08_16:30:59 info: logfile and debug file are
those specified in logd config file (default /etc/logd.cf)
heartbeat[899]: 2007/01/08_16:30:59 info: AUTH: i=2: key = 0x80f5f30,
auth=0xb7f35030, authname=crc
heartbeat[899]: 2007/01/08_16:30:59 WARN: Core dumps could be lost if
multiple dumps occur
heartbeat[899]: 2007/01/08_16:30:59 WARN: Consider setting
/proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum
supportability
heartbeat[899]: 2007/01/08_16:30:59 info: **************************
heartbeat[899]: 2007/01/08_16:30:59 info: Configuration validated.
Starting heartbeat 2.0.7
heartbeat[900]: 2007/01/08_16:30:59 info: heartbeat: version 2.0.7
heartbeat[900]: 2007/01/08_16:31:00 info: Heartbeat generation: 662
heartbeat[900]: 2007/01/08_16:31:00 info: G_main_add_TriggerHandler:
Added signal manual handler
heartbeat[900]: 2007/01/08_16:31:00 info: G_main_add_TriggerHandler:
Added signal manual handler
heartbeat[900]: 2007/01/08_16:31:00 info: Creating FIFO
/var/lib/heartbeat/fifo.
heartbeat[900]: 2007/01/08_16:31:00 info: Removing
/var/run/heartbeat/rsctmp failed, recreating.
heartbeat[900]: 2007/01/08_16:31:00 info: glib: Starting serial
heartbeat on tty /dev/ttyS0 (115200 baud)
heartbeat[900]: 2007/01/08_16:31:00 info: glib: ucast: write socket
priority set to IPTOS_LOWDELAY on eth1
heartbeat[900]: 2007/01/08_16:31:00 info: glib: ucast: bound send socket
to device: eth1
heartbeat[900]: 2007/01/08_16:31:00 info: glib: ucast: bound receive
socket to device: eth1
heartbeat[900]: 2007/01/08_16:31:00 info: glib: ucast: started on port
694 interface eth1 to 10.10.10.1
heartbeat[900]: 2007/01/08_16:31:00 info: G_main_add_SignalHandler:
Added signal handler for signal 17
heartbeat[900]: 2007/01/08_16:31:00 info: Local status now set to: 'up'
heartbeat[900]: 2007/01/08_16:31:01 info: Link ims0:eth1 up.
heartbeat[900]: 2007/01/08_16:31:01 info: Status update for node ims0:
status active
harc[928][933]: 2007/01/08_16:31:02 info: Running /etc/ha.d/rc.d/status
status
heartbeat[900]: 2007/01/08_16:31:02 info: Comm_now_up(): updating status
to active
heartbeat[900]: 2007/01/08_16:31:02 info: Local status now set to: 'active'
heartbeat[900]: 2007/01/08_16:31:02 info: Starting child client
"/usr/lib/heartbeat/ipfail" (90,90)
heartbeat[900]: 2007/01/08_16:31:02 WARN: G_CH_dispatch_int: Dispatch
function for read child took too long to execute: 620 ms (> 50 ms)
(GSource: 0x81003e0)
heartbeat[900]: 2007/01/08_16:31:02 WARN: Gmain_timeout_dispatch:
Dispatch function for send local status was delayed 240 ms (> 210 ms)
before being called (GSource: 0x8100478)
heartbeat[900]: 2007/01/08_16:31:02 info: Gmain_timeout_dispatch:
started at 1823046613 should have started at 1823046589
heartbeat[900]: 2007/01/08_16:31:02 WARN: G_SIG_dispatch: Dispatch
function for SIGCHLD was delayed 620 ms (> 100 ms) before being called
(GSource: 0x80fa078)
heartbeat[900]: 2007/01/08_16:31:02 info: G_SIG_dispatch: started at
1823046613 should have started at 1823046551
heartbeat[900]: 2007/01/08_16:31:02 WARN: Gmain_timeout_dispatch:
Dispatch function for check for signals was delayed 230 ms (> 210 ms)
before being called (GSource: 0x8100780)
heartbeat[900]: 2007/01/08_16:31:02 info: Gmain_timeout_dispatch:
started at 1823046613 should have started at 1823046590
heartbeat[934]: 2007/01/08_16:31:02 info: Starting
"/usr/lib/heartbeat/ipfail" as uid 90 gid 90 (pid 934)
heartbeat[900]: 2007/01/08_16:31:02 info: Exiting status process 928
returned rc 0.
heartbeat[900]: 2007/01/08_16:31:02 info: AnnounceTakeover(local 0,
foreign 1, reason 'T_RESOURCES' (0))
heartbeat[900]: 2007/01/08_16:31:02 info: remote resource transition
completed.
heartbeat[900]: 2007/01/08_16:31:02 info: AnnounceTakeover(local 0,
foreign 1, reason 'T_RESOURCES' (0))
heartbeat[900]: 2007/01/08_16:31:02 info: STATE 1 => 3
heartbeat[900]: 2007/01/08_16:31:02 info: other_holds_resources: 3
heartbeat[900]: 2007/01/08_16:31:02 info: remote resource transition
completed.
heartbeat[900]: 2007/01/08_16:31:02 info: Local Resource acquisition
completed. (none)
heartbeat[900]: 2007/01/08_16:31:02 info: AnnounceTakeover(local 1,
foreign 1, reason 'T_RESOURCES(them)' (0))
heartbeat[900]: 2007/01/08_16:31:02 info: Initial resource acquisition
complete (T_RESOURCES(them))
heartbeat[900]: 2007/01/08_16:31:02 info: STATE 3 => 4
heartbeat[900]: 2007/01/08_16:31:02 info: other_holds_resources: 3
heartbeat[900]: 2007/01/08_16:32:36 WARN: node ims0: is dead
heartbeat[900]: 2007/01/08_16:32:36 WARN: No STONITH device configured.
heartbeat[900]: 2007/01/08_16:32:36 WARN: Shared disks are not protected.
heartbeat[900]: 2007/01/08_16:32:36 info: Resources being acquired from
ims0.
heartbeat[900]: 2007/01/08_16:32:36 info: Link ims0:eth1 dead.
harc[964][969]: 2007/01/08_16:32:36 info: Running /etc/ha.d/rc.d/status
statusZ
heartbeat[965]: 2007/01/08_16:32:36 info: No local resources
[/usr/lib/heartbeat/ResourceManager listkeys ims1] to acquire.
heartbeat[965]: 2007/01/08_16:32:36 info: Writing type [resource]
message to FIFO
heartbeat[965]: 2007/01/08_16:32:36 info: FIFO message [type resource]
written rc=79
mach_down[972][995]: 2007/01/08_16:32:36 info: Taking over resource
group 192.168.145.173\
heartbeat[900]: 2007/01/08_16:32:36 info: AnnounceTakeover(local 1,
foreign 1, reason 'T_RESOURCES(us)' (1))
heartbeat[900]: 2007/01/08_16:32:36 info: Exiting req_our_resources
process 965 returned rc 0.
heartbeat[900]: 2007/01/08_16:32:36 info: AnnounceTakeover(local 1,
foreign 1, reason 'req_our_resources' (1))
ResourceManager[996][1004]: 2007/01/08_16:32:36 info: Acquiring resource
group: ims0 192.168.145.173
IPaddr[1017][1034]: 2007/01/08_16:32:36 INFO: Resource is stopped
ResourceManager[996][1050]: 2007/01/08_16:32:36 info: Running
/etc/ha.d/resource.d/IPaddr 192.168.145.173 start
IPaddr[1061][1070]: 2007/01/08_16:32:36 INFO: Using calculated nic for
192.168.145.173: eth0
IPaddr[1061][1075]: 2007/01/08_16:32:36 INFO: Using calculated netmask
for 192.168.145.173: 255.255.255.0
IPaddr[1061][1080]: 2007/01/08_16:32:36 INFO: Using calculated broadcast
for 192.168.145.173: 192.168.145.255
IPaddr[1061][1101]: 2007/01/08_16:32:36 DEBUG: Sending Gratuitous Arp
for 192.168.145.173 on eth0:0 [eth0]
IPaddr[1052][1109]: 2007/01/08_16:32:36 INFO: Success
mach_down[972][1111]: 2007/01/08_16:32:36 info:
/usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired_
mach_down[972][1115]: 2007/01/08_16:32:36 info: mach_down takeover
complete for node ims0.
heartbeat[900]: 2007/01/08_16:32:36 info: AnnounceTakeover(local 1,
foreign 1, reason 'T_RESOURCES(us)' (1))
heartbeat[900]: 2007/01/08_16:32:36 info: mach_down takeover complete.
heartbeat[900]: 2007/01/08_16:32:36 info: AnnounceTakeover(local 1,
foreign 1, reason 'mach_down' (1))
heartbeat[900]: 2007/01/08_16:32:36 info: Exiting status process 964
returned rc 0.
heartbeat[900]: 2007/01/08_16:35:38 info: Heartbeat restart on node ims0
heartbeat[900]: 2007/01/08_16:35:38 info: Link ims0:eth1 up.
heartbeat[900]: 2007/01/08_16:35:38 info: Status update for node ims0:
status init
heartbeat[900]: 2007/01/08_16:35:38 info: Status update for node ims0:
status up
harc[1188][1191]: 2007/01/08_16:35:38 info: Running
/etc/ha.d/rc.d/status status
heartbeat[900]: 2007/01/08_16:35:38 info: Exiting status process 1188
returned rc 0.
harc[1194][1197]: 2007/01/08_16:35:38 info: Running
/etc/ha.d/rc.d/status status
heartbeat[900]: 2007/01/08_16:35:38 info: Exiting status process 1194
returned rc 0.
heartbeat[900]: 2007/01/08_16:35:39 info: all clients are now paused
heartbeat[900]: 2007/01/08_16:35:39 info: Status update for node ims0:
status active
heartbeat[900]: 2007/01/08_16:35:39 info: AnnounceTakeover(local 1,
foreign 1, reason 'T_RESOURCES(us)' (1))
heartbeat[900]: 2007/01/08_16:35:39 info: all clients are now resumed
harc[1200][1203]: 2007/01/08_16:35:39 info: Running
/etc/ha.d/rc.d/status status
heartbeat[900]: 2007/01/08_16:35:39 info: Exiting status process 1200
returned rc 0.
heartbeat[900]: 2007/01/08_16:35:39 info: other_holds_resources: 0
heartbeat[900]: 2007/01/08_16:35:39 info: remote resource transition
completed.
heartbeat[900]: 2007/01/08_16:35:39 info: AnnounceTakeover(local 1,
foreign 1, reason 'T_RESOURCES(us)' (1))
heartbeat[900]: 2007/01/08_16:35:39 info: other_holds_resources: 0
---------------------------
--
RNDr. Martin Gazak, MicroStep-MIS
Cavojskeho 1, 841 04 Bratislava, Slovakia
Tel: ++421 2 602 00 128, 111
Fax: ++421 2 602 00 180
e-mail: matog at microstep-mis.com
www: http://www.microstep-mis.com
--
RNDr. Martin Gazak, MicroStep-MIS
Cavojskeho 1, 841 04 Bratislava, Slovakia
Tel: ++421 2 602 00 128, 111
Fax: ++421 2 602 00 180
e-mail: matog at microstep-mis.com
www: http://www.microstep-mis.com
More information about the Linux-HA
mailing list