[Linux-HA] Test of HA

Philip Juels pjuels at rics.bwh.harvard.edu
Tue Oct 18 13:43:03 MDT 2005


Yes, that would be my guess as well...it's trying to shut down a resource 
that is already dead and just waits until it gets the appropriate 
shutdown response...which it never does and so it sits there twiddling 
its thumbs.  I should have expected this since I've essentially set up a 
version 1 heartbeat failover scheme using version 2 syntax.
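
To illustrate the failure mode, here is a minimal sketch in plain Python. It is not heartbeat code and does not use heartbeat's API; ToyService, OK, and ERR are hypothetical names made up for this example. It contrasts a stop action that reports an error when the service is already dead with one that treats "already stopped" as success, which is what a cluster manager needs in order to move on.

```python
# Hypothetical sketch, not heartbeat code: strict vs. idempotent stop.
OK = 0   # assumed "success" return code for this example
ERR = 1  # assumed "generic failure" return code for this example


class ToyService:
    """Stand-in for a managed service such as httpd."""

    def __init__(self):
        self.running = False

    def start(self):
        self.running = True
        return OK

    def crash(self):
        # Simulates killing the daemon behind the cluster manager's back.
        self.running = False

    def stop_strict(self):
        # Reports an error if the service is already dead, so a caller
        # that retries until it sees OK will retry forever.
        if not self.running:
            return ERR
        self.running = False
        return OK

    def stop_idempotent(self):
        # "Already stopped" counts as success; the caller can proceed.
        self.running = False
        return OK
```

With the strict variant, a stop issued after a crash keeps returning ERR, which would match the repeated "stop ... ERROR" pattern in the log quoted below if that is indeed the cause.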

Which brings me to the question of how to set up a version 2 heartbeat 
scheme where the primary node attempts to restart a crashed service 
first before failing over to the secondary node?  We run a jboss 
application service, and occasionally the Java VM will crash, which 
doesn't muck up the entire server.  To recover, we simply restart 
jboss.  It would be nice to be able to automatically restart jboss on 
the same node in the case of a VM failure.  Perhaps the answer is not to 
use heartbeat for a local recovery of a service (application) failure, 
but use heartbeat in the case of a more drastic server (hardware, OS, 
etc) crash?
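
The policy being asked about can be sketched as follows. This is generic Python, not heartbeat configuration or API; the recover function, its restart callback, and the max_local_restarts limit are all hypothetical and only show the decision order: try a local restart a few times, and fail over only if that does not work.

```python
# Hypothetical sketch of "restart locally first, then fail over".
def recover(restart, max_local_restarts=3):
    """Call restart() up to max_local_restarts times.

    restart is any callable returning True when the service came back.
    Returns "local" if a local restart succeeded, otherwise "failover".
    """
    for _ in range(max_local_restarts):
        if restart():
            return "local"
    return "failover"
```

For example, recover(lambda: True) returns "local", while a restart callback that always fails yields "failover" after three attempts.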

--PJ

Alan Robertson wrote:

> Philip Juels wrote:
>
>> Hi all,
>>
>> I've successfully set up a simple two-node v2 active/passive Apache 
>> cluster.  In order to test failover of a crashed httpd service, I 
>> killed the httpd daemon and watched the ha.log to see if HA would 
>> recover the service.  Well, my "test" did not work...killing the 
>> daemon (or executing a httpd stop) only resulted in a series of 
>> errors in the ha.log:
>>
>> crmd[28210]: 2005/10/17_11:23:02 info: mask(lrm.c:do_lrm_rsc_op): 
>> Performing op start on group_1:httpd_2
>> crmd[28210]: 2005/10/17_11:23:03 info: mask(lrm.c:do_lrm_rsc_op): 
>> Performing op monitor on group_1:httpd_2
>> crmd[28210]: 2005/10/17_11:27:03 ERROR: mask(lrm.c:do_lrm_event): LRM 
>> operation (5) monitor on group_1:httpd_2 ERROR: invalid parameter
>> crmd[28210]: 2005/10/17_11:27:03 info: mask(lrm.c:do_lrm_rsc_op): 
>> Performing op stop on group_1:httpd_2
>> crmd[28210]: 2005/10/17_11:27:03 WARN: mask(lrm.c:do_lrm_event): LRM 
>> operation (5) monitor on group_1:httpd_2 cancelled
>> crmd[28210]: 2005/10/17_11:27:03 ERROR: mask(lrm.c:do_lrm_event): LRM 
>> operation (7) stop on group_1:httpd_2 ERROR: unknown error
>> crmd[28210]: 2005/10/17_11:27:05 info: mask(lrm.c:do_lrm_rsc_op): 
>> Performing op stop on group_1:httpd_2
>> crmd[28210]: 2005/10/17_11:27:05 ERROR: mask(lrm.c:do_lrm_event): LRM 
>> operation (8) stop on group_1:httpd_2 ERROR: unknown error
>> crmd[28210]: 2005/10/17_11:27:07 info: mask(lrm.c:do_lrm_rsc_op): 
>> Performing op stop on group_1:httpd_2
>>
>> These errors continued until I restarted httpd on the primary node, 
>> after which heartbeat switched the httpd service over to the 
>> secondary node:
>
>
> My guess is that your httpd_2 resource failed when given a stop 
> operation when it was already stopped.
>
>



