[Linux-HA] Re: Putting code of a wrong takeover of services in emails

Fabrice Durand durand.fabrice at gmail.com
Fri Sep 9 08:18:50 MDT 2005


Hello, 
I would like to restate my message, because the beginning of the
previous one was off topic. Here is the problem:
Suppose I have an active/passive cluster with 3 resources and a MailTo.
**************************************haresources**************************************************
Node1 IpAddress1 Resource1 Resource2 Resource3 MailTo::fabrice at fabrice.com::Group
******************************************************************************************************
A node wants to take over the resources:
- it fails to start resource 1,
- it fails to start resource 2,
- it succeeds in starting resource 3 (if heartbeat fails to take
over a resource, it tries to take over the next one anyway),
- heartbeat signals an error (code 256) in the resource takeover,
- MailTo succeeds in sending an email saying the resource group has
been taken over successfully.
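
A side note on the value 256: it looks like a raw wait status rather
than a plain exit code. Assuming heartbeat reports the status exactly as
returned by a system()/wait-style call (an assumption, not something the
log confirms), the conventional Linux encoding can be decoded as in this
small sketch. The function name is made up for illustration.

```python
def decode_wait_status(status):
    """Decode a wait status using the conventional Linux encoding.

    Returns the exit code if the process exited normally,
    or None if the low 7 bits indicate termination by a signal.
    """
    if status & 0x7F:              # non-zero low bits: killed by a signal
        return None
    return (status >> 8) & 0xFF    # exit code lives in the next byte


# Under the assumption above, 256 would correspond to exit code 1.
print(decode_wait_status(256))
```

If that assumption holds, "returned 256" would simply mean that
ResourceManager exited with status 1.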

Here is an example heartbeat log showing this behaviour (without MailTo,
but the fact that the email is sent anyway has been verified in a
similar situation):

****************************************************************************************************
heartbeat: 2005/08/30_15:23:25 info: standby: acquire [all] resources from eepclu1
heartbeat: 2005/08/30_15:23:25 info: acquire all HA resources (standby).
heartbeat: 2005/08/30_15:23:25 info: Acquiring resource group: eepclu1 135.9.216.51 drbddisk Filesystem::/dev/drbd0::/montagedrbd::ext3:: wu-ftpd
heartbeat: 2005/08/30_15:23:25 info: Running /etc/ha.d/resource.d/IPaddr 135.9.216.51 start
heartbeat: 2005/08/30_15:23:26 info: /sbin/ifconfig eth0:0 135.9.216.51 netmask 255.255.248.0	broadcast 135.9.159.255
heartbeat: 2005/08/30_15:23:26 info: Sending Gratuitous Arp for 135.9.216.51 on eth0:0 [eth0]
heartbeat: 2005/08/30_15:23:26 /usr/lib/heartbeat/send_arp -i 1010 -r 5 -p /var/lib/heartbeat/rsctmp/send_arp/send_arp-135.9.216.51 eth0 135.9.216.51 auto 135.9.216.51 ffffffffffff
heartbeat: 2005/08/30_15:23:26 info: Running /etc/ha.d/resource.d/drbddisk  start
heartbeat: 2005/08/30_15:23:31 ERROR: Return code 20 from /etc/ha.d/resource.d/drbddisk
heartbeat: 2005/08/30_15:23:31 info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /montagedrbd ext3  start
heartbeat: 2005/08/30_15:23:31 ERROR: Couldn't mount filesystem /dev/drbd0 on /montagedrbd
heartbeat: 2005/08/30_15:23:31 ERROR: Return code 1 from /etc/ha.d/resource.d/Filesystem
heartbeat: 2005/08/30_15:23:31 info: Running /etc/ha.d/resource.d/wu-ftpd  start
heartbeat: 2005/08/30_15:23:31 ERROR: /usr/lib/heartbeat/ResourceManager takegroup 135.9.216.51 returned 256
heartbeat: 2005/08/30_15:23:31 info: all HA resource acquisition completed (standby).
heartbeat: 2005/08/30_15:23:31 info: Standby resource acquisition done [all].
heartbeat: 2005/08/30_15:23:31 info: remote resource transition completed.
********************************************************************************************************

Now here is the question: do you know how to get the error code and
put it in an email to signal that there was a failure when starting
one of the resources (i.e. the group of resources was not successfully
taken over)?
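
To make the question concrete, here is a rough sketch of the kind of
report I have in mind. It is not a heartbeat feature: it only scans log
text for the two ERROR patterns seen in the excerpt above and builds a
mail message from them. The function names and the example.com addresses
are hypothetical, and the patterns assume the log lines are not wrapped
the way a mail client wraps them.

```python
import re
from email.message import EmailMessage

# Patterns modeled on the log excerpt above (one log entry per line).
RESOURCE_ERR = re.compile(r"ERROR: Return code (\d+) from (\S+)")
GROUP_ERR = re.compile(r"ERROR: \S*ResourceManager takegroup (\S+) returned (\d+)")


def summarize_takeover(log_text):
    """Return ([(script, code), ...], group_status) found in the log text."""
    failures = [(m.group(2), int(m.group(1)))
                for m in RESOURCE_ERR.finditer(log_text)]
    group = GROUP_ERR.search(log_text)
    group_status = int(group.group(2)) if group else 0
    return failures, group_status


def build_report(failures, group_status, sender, recipient):
    """Build (but do not send) an email describing the takeover result."""
    msg = EmailMessage()
    ok = not failures and group_status == 0
    msg["Subject"] = ("Takeover succeeded" if ok
                      else f"Takeover FAILED (takegroup returned {group_status})")
    msg["From"] = sender
    msg["To"] = recipient
    lines = [f"{script} returned {code}" for script, code in failures]
    msg.set_content("\n".join(lines) or "No resource errors found in the log.")
    return msg


# Three ERROR lines taken from the log excerpt above.
SAMPLE = """\
heartbeat: 2005/08/30_15:23:31 ERROR: Return code 20 from /etc/ha.d/resource.d/drbddisk
heartbeat: 2005/08/30_15:23:31 ERROR: Return code 1 from /etc/ha.d/resource.d/Filesystem
heartbeat: 2005/08/30_15:23:31 ERROR: /usr/lib/heartbeat/ResourceManager takegroup 135.9.216.51 returned 256
"""

failures, status = summarize_takeover(SAMPLE)
report = build_report(failures, status, "ha@example.com", "admin@example.com")
print(report["Subject"])
print(report.get_content())
```

Actually sending the message could be done with something like
smtplib.SMTP("localhost").send_message(report), assuming a local mail
relay is available. How to trigger such a script at the right moment of
the takeover is the part I do not know.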

Thanks for your answers !
Fabrice


