[Linux-HA] Node failure causes peer host to reboot?!?
Andrew Beekhof
beekhof at gmail.com
Thu Apr 17 05:03:09 MDT 2008
On Thu, Apr 17, 2008 at 12:58 PM, Andrew Beekhof <beekhof at gmail.com> wrote:
>
> On Thu, Apr 17, 2008 at 12:56 PM, Andrew Beekhof <beekhof at gmail.com> wrote:
> > On Thu, Apr 17, 2008 at 12:35 PM, Luis Motta Campos
> > <luismottacampos at yahoo.co.uk> wrote:
> > > Dejan Muhamedagic wrote:
> > > > Hi,
> > >
> > > >> respawn hacluster /usr/lib64/heartbeat/ipfail
> > > >
> > > > ipfail doesn't work with crm. You should use pingd instead.
> > >
> > > Well, I don't think this helps. :( I'm using the suggested (reasonable
> > > for me) defaults:
> > >
> > > respawn root /usr/lib64/heartbeat/pingd -m 100 -d 5s
> > >
> > > (yes, I'm running CentOS x86_64).
> > >
> > > I still have problems, but they seem to be worse, now. Before, if I
> > > restarted heartbeat (/etc/init.d/heartbeat restart), any service running
> > > on the machine jumped away before the restart, and heartbeat was able to
> > > restart ok.
> > >
> > > Using pingd instead of the ipfail, even this is crippled, and heartbeat
> > > reboots the peer host (the one supposed to keep services running) if I
> > > try to restart the heartbeat service on one of the machines.
> > >
> > > I presume I'm doing something really stupid, but I can't understand it.
> > > Please help me out. I used hb_report to fetch all I know about my
> > > system, please find the report attached.
> > >
> >
> > random question - did you install from source or packages? where did
> > you get them from?
> >
>
> and a followup... you can't just make up values for target_role:
>
> <nvpair name="target_role" value="Started:Master"
> id="d54bdbb8-5d79-4d12-a95f-9b9b015176e3"/>
>
> makes no sense. Just "Master" would be correct.
>
Then there is the failed start operation... that won't be helping at all.

pengine[13743]: 2008/04/17_12:23:22 WARN: unpack_rsc_op: Processing
failed op database-filesystem_start_0 on db-sql1.ripe.net: Error

And finally, it looks like there was a crash in the pengine process.

crmd[12352]: 2008/04/17_12:23:22 WARN: Managed pengine process 13743
killed by signal 11 [SIGSEGV - Segmentation violation].
crmd[12352]: 2008/04/17_12:23:22 ERROR: Managed pengine process 13743
dumped core

Can you have a look for a core file in
/var/lib/heartbeat/cores/hacluster/ and post the backtrace?
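If it helps, here is a rough sketch of one way to pull the backtrace out of the newest core in that directory. It is only an illustration under a few assumptions: that gdb is installed, that the core file name starts with "core", and that the pengine binary lives at /usr/lib64/heartbeat/pengine (a guess based on the lib64 paths in your ha.cf; adjust it to wherever your packages put it). The helper names are made up for this example.

```python
import subprocess
from pathlib import Path

# Directory mentioned above; the binary path is an ASSUMPTION, adjust as needed.
CORE_DIR = Path("/var/lib/heartbeat/cores/hacluster")
PENGINE = "/usr/lib64/heartbeat/pengine"


def newest_core(core_dir):
    """Return the most recently modified file named core*, or None if there is none."""
    core_dir = Path(core_dir)
    if not core_dir.is_dir():
        return None
    cores = [p for p in core_dir.iterdir()
             if p.is_file() and p.name.startswith("core")]
    return max(cores, key=lambda p: p.stat().st_mtime, default=None)


def gdb_command(binary, core):
    """Build a gdb batch invocation that prints a full backtrace and exits."""
    return ["gdb", "-batch", "-ex", "bt full", str(binary), str(core)]


if __name__ == "__main__":
    core = newest_core(CORE_DIR)
    if core is None:
        print(f"no core file found in {CORE_DIR}")
    else:
        # Run gdb non-interactively and show whatever it produced.
        result = subprocess.run(gdb_command(PENGINE, core),
                                capture_output=True, text=True)
        print(result.stdout or result.stderr)
```

Run it as root (or as a user that can read the cores directory) and paste the output into your reply. The backtrace is far more useful if the matching debug symbols are installed.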