[Linux-HA] How can you clean up a degraded node w/out killing it (and not manually)?

Andrew Beekhof beekhof at gmail.com
Tue Sep 25 08:22:10 MDT 2007


On 9/25/07, Peter Farrell <peter.d.farrell at gmail.com> wrote:
> I'm back again.
> *the moderator held my previous answer to your questions, Andrew - I should
> have bzip'd those log files :-) No matter, as it ended up working
> in any case.
>
> Versions:
> heartbeat-stonith-2.1.2-3.el4.centos
> heartbeat-pils-2.1.2-3.el4.centos
> heartbeat-ldirectord-2.1.2-3.el4.centos
> heartbeat-2.1.2-3.el4.centos
>
> Active / Passive set up.
> 2 nodes, one resource (ldirectord) balancing traffic for IP addresses
> on 2 web servers.
> 2 nics [eth0: dmz facing - eth1: crossover cable, on 10.0.0.1/2]
>
> abstracted CIB:
> group1:{dmz1-IPAddr_1, dmz2-IPAddr_2, ldirectord}
> clone:pingd (ping router)
> constraint:prefer one host (dmz1) over the other and when pingd fails,
> failover group_1
>
>
> 1. How can you clean up a degraded node w/out killing it (and not manually)?
> =============================================
>
> With above setup, it fails perfectly.
> I'm assuming that the first node that loses connectivity is in a
> 'degraded' state.
> That's why it doesn't fail back once connectivity is restored.
> (Either automatically or via 'crm_resource -M')
> I understand that I can 'clean it up' by running a number of
> 'crm_resource' commands against all my resources:
> ...
> crm_resource -r IPaddr_212_140_130_37 -C -H dmz1.example.com
> crm_resource -r IPaddr_212_140_130_38 -C -H dmz1.example.com
> crm_resource -r ldirectord_3 -C -H dmz1.example.com
> crm_resource -r pingd-child:0 -C -H dmz1.example.com
> crm_resource -r pingd-child:1 -C -H dmz1.example.com
> ...
> Once this has been done - if I 'ifup' the interface or plug it back in
> - the resource will migrate (as expected) back to its preferred host.
>
> The point I'm missing conceptually or in the documentation is:
> Why can't I kill the node so it cleans itself up?
> *Actually - I don't really want to kill the node - but what about
> killing heartbeat? That seems to clean things up when I do it
> manually.

So there are a couple of points here...

There are three types of failures:
- monitors
- starts
- stops
and all are handled differently^.

Monitor failures increment the failcount for that resource on that node. If
you set one of the failure-stickiness options, that may mean the resource
gets moved.

Start failures set up a block for that resource on that node, which means
the resource is definitely moved, but also that it can't run there again
until you run crm_resource -C as you noted.

Stop failures are pretty fatal, since we can't start the resource anywhere
else until we're sure it's not still running, and we can't be sure of that
because the stop keeps failing... so if stonith is enabled we'll shoot the
node, otherwise we'll wait forever for the admin to get it sorted out.


So... based on what you described, I'm guessing that the monitor failed,
but the resource wasn't moved until the start failed too, which is why you
needed to clean it up manually.

It's therefore possible that this problem is actually an artifact of how
you're testing (i.e. pulling the interface down). Is that a realistic thing
to happen? Maybe. Either way, you _could_ set on_fail=fence for the start
action (there's a sketch of that further down). Should you? Maybe not.

^ Actually, this is no longer true for newer code, which handles start
failures by setting the failcount to infinity (so they're only terminal if
you set a value for default-resource-failure-stickiness, or whatever I
called that option).

In addition to using the failcount, we will soon be able to "time out"
failures, i.e. after a defined period of time passes, they are ignored.
This should remove the need for manual intervention in most cases.

> I've seen threads where Alan is saying "Why would you want to? It's
> only going to loop w/ reboots until connectivity is restored."
> Ok - I get that, as well as accepting that in certain situations /
> config you want a human to intervene - but not in this case.
> How can I get the 'cleanup' element to happen automatically?
>
> I tried to attach the 'failstop_type=stonith' to the pingd clone while
> adding an additional clone 'DoFence' with suicide.
> It loads with the suicide clone, but fails when I add the
> 'failstop_type' bit to pingd - it says it's not in the DTD.
> I checked and didn't see it in there either.

Use on_fail, and add it to the operation, not the resource.
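
Something like this - a sketch against the 2.1.x CIB, reusing your
ldirectord_3 id (adjust class/provider to match how you actually defined
it, and double-check attribute spellings against the crm.dtd shipped with
your packages):

  <primitive id="ldirectord_3" class="ocf" provider="heartbeat" type="ldirectord">
    <operations>
      <!-- monitor failures just bump the failcount -->
      <op id="ldirectord_3_mon" name="monitor" interval="10s" timeout="30s"/>
      <!-- a failed start fences the node instead of only blocking the resource there -->
      <op id="ldirectord_3_start" name="start" timeout="60s" on_fail="fence"/>
    </operations>
  </primitive>

The same applies to the pingd clone's operations if that's where you want
the fencing behaviour.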

>
> *the documentation reads like a bloody 'choose your own adventure'
> story in some parts,

I'd laugh, but alas it's very true.
The Novell docs have made a good start, but there is still much to be written.

> it's just so all over the place it feels really
> anarchic. I searched through the archive for suicide and
> failstop_type, there were only a few messages and nothing was
> resolved.
> Is "failstop_type" legit?

No (or at least not any more... where did you see this? I don't recall
such an option).

> Is "suicide" usage documented anywhere?
>
> Do I even _need_ a stonith device for this type of setup?

If you want to use on_fail=fence and/or recover from failed stops
automagically, yes.

> Why can't you use ssh as a stonith device when all the nodes are
> connected via 2 nics?

You can, as long as you understand and accept the risks - and it seems you do.
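
A minimal sketch of an ssh-based stonith clone, assuming the external/ssh
plugin and its hostlist parameter (something like 'stonith -t external/ssh -n'
should list the required parameter names on your build):

  <clone id="DoFence">
    <instance_attributes id="DoFence_ia">
      <attributes>
        <nvpair id="DoFence_clone_max" name="clone_max" value="2"/>
        <nvpair id="DoFence_clone_node_max" name="clone_node_max" value="1"/>
      </attributes>
    </instance_attributes>
    <primitive id="fence_ssh" class="stonith" type="external/ssh">
      <instance_attributes id="fence_ssh_ia">
        <attributes>
          <!-- nodes this stonith resource is allowed to shoot -->
          <nvpair id="fence_ssh_hostlist" name="hostlist" value="dmz1.example.com dmz2.example.com"/>
        </attributes>
      </instance_attributes>
    </primitive>
  </clone>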

> I don't have any fancy hardware rebooting capability and saw in
> 'stonith -h' that 'ssh isn't for production'.
> I accept the point that 'If you can ssh into it - you probably don't
> need to kill it'.

Not _always_ true. If you're trying to handle a comatose node, then ssh
isn't going to help. But if "all" you're trying to protect against is
resource-level failures, ssh is probably "good enough".

> So again - I'm back to square one.
>
>
> Any advice?
>
> -Peter Farrell
> _______________________________________________
> Linux-HA mailing list
> Linux-HA at lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>


