[Linux-HA] strange monitor behaviour

Andrew Beekhof beekhof at gmail.com
Wed Jan 10 07:16:45 MST 2007


On 1/9/07, Pavol Gono <palo.gono at gmail.com> wrote:
> On 1/9/07, Andrew Beekhof <beekhof at gmail.com> wrote:
> > > A)
> > > It would be nice to have some list of necessary software installed
> > > when one wants to run it. E.g. on SLES10 you need python-xml package.
> > > On debian (debo machine), installing python-dev or python-xml
> > > decreased number of 'BadNews' from 26 to 2. Maybe python version is
> > > also important...
> >
> > can you send me both outputs?  that shouldn't be the case.
>
> I looked at logs, it seems so.

i think you just got a little luckier the second time.
i don't believe installing python-xml will have made any difference.

>
> >
> > > B)
> > > On my notebook I use debian sarge, python version 2.4. When using HB
> > > sources directly (changeset 9918) and configure options equal to debo
> > > machine, BasicSanityCheck raised a strange exception. Snippet from
> > > linux-ha.testlog:
> > > ... CTS: Warn: Startup pattern not found: crmd.*pgnotas: State
> > > transition.*-> S_IDLE
> > > ... CTS: Node pgnotas status:
> > > ... CTS: Node status for pgnotas is down but we think it should be up
> > > ... CTS: Warn: Start failed for node pgnotas
> > > ... CTS: Tearing down partial setup
> > > ... CTS: Stopping Cluster Manager on BSC node(s).
> > > ... CTS: Exception by exceptions.TypeError
> > > ... CTS: Traceback (most recent call last):
> > > ... CTS:   File "/usr/local/lib/heartbeat/cts/CTSlab.py", line 791, in ?
> > > ... CTS:     overall, detailed = tests.run(NumIter)
> > > ... CTS: TypeError: unpack non-sequence
> > > ... CTS: ****************
> > > ... CTS: Overall Results:{'failure': 0, 'success': 0, 'BadNews': 0}
> > > ... CTS: ****************
> > > ... CTS: Detailed Results
> > > ... CTS: Test AddResource:  {'auditfail': 0, 'failure': 0, 'skipped':
> > > 0, 'success': 0, 'calls': 0}
> > > ... CTS: <<<<<<<<<<<<<<<< TESTS COMPLETED
> > > ... CTS: No failure count but success != requested iterations
> > > CRM tests failed (rc=1).
> > > (end of linux-ha.testlog now)
> >
> > can you send me the whole file?
>
> I packed two such logs.

heartbeat[12265]: 2007/01/08_10:46:05 info: heartbeat: already running [pid 11835].

that would explain it.  now, why it was already running... that's
another question altogether

>
> > > D)
> > > On one SLES10 machine my colleague used HB sources of changeset 9909.
> > > Configure options were similar to debo & fico machines.
> > > There is one error reported at the end. It is triggered when the 'Does
> > > not look like we ARPed the address' message is displayed. At the very
> > > beginning there is also the message 'RTNETLINK answers: Network is
> > > unreachable', and I do not know where it comes from.
> > > Snippets from output of BasicSanityCheck:
> > > RTNETLINK answers: Network is unreachable
> > > Using interface: eth3
> > > Starting base64 and md5 algorithm tests
> > > base64 and md5 algorithm tests succeeded.
> > > Starting heartbeat
> > > Starting High-Availability services:
> > > 2007/01/08_14:56:02 INFO:  Resource is stopped
> > >    done
> > >
> > > Does not look like we ARPed the address
> > > Reloading heartbeat
> > > Reloading heartbeat
> > > Stopping heartbeat
> > > ...
> > > Starting CRM tests
> > > CRM tests passed.
> > > 1 errors. Log file is stored in /tmp/linux-ha.testlog
> >
> > i think the easiest solution is just change line 515 to:
> >        LookForString ARP >/dev/null
>
> yes, maybe committing this change makes sense

upstream now

> > > E)
> > > On another SLES10 machine (HB sources of changeset 9918, conf options
> > > similar to debo & fico), the start of the BasicSanityCheck output
> > > looked like this:
> > > RTNETLINK answers: Network is unreachable
> > > Using interface: eth0
> > > Starting base64 and md5 algorithm tests
> > > base64 and md5 algorithm tests succeeded.
> > >
> > > The interesting thing is eth0 - no such interface was active on
> > > machine. The first one was eth2 with IP address 10.54.0.13,
> > > Mask:255.255.0.0, TESTIP in script was 10.54.0.2
> > > So maybe GuessIFname function has problems...
> >
> > any more information on why it can't guess correctly?
> > "set -x" etc
>
> The machine didn't have a default route. From line 113:
> # /sbin/ip r g 123.0.0.1
> RTNETLINK answers: Network is unreachable
>
> IMHO the script should check the return value of ip. If it is nonzero, it
> should print something like:
> We can't detect the correct NIC, modify DEFAULTINTERFACE variable.
> If the script exits at that point, no confusion arises.

upstream soon
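
For illustration only, a minimal sketch of the check suggested above. This
is not the actual GuessIFname code from BasicSanityCheck: the function name
guess_ifname and the IP_CMD variable are hypothetical, and the sketch assumes
that the "ip route get" output names the interface after the "dev" keyword.

```shell
#!/bin/sh
# Sketch only: guess_ifname and IP_CMD are hypothetical names, not the
# real BasicSanityCheck code.

IP_CMD=${IP_CMD:-/sbin/ip}

guess_ifname() {
    # Ask the kernel which route it would use for an arbitrary remote address.
    route=$("$IP_CMD" route get 123.0.0.1 2>/dev/null)
    rc=$?
    if [ "$rc" -ne 0 ]; then
        # No usable route (e.g. no default route): report it and fail
        # instead of silently falling back to a wrong interface.
        echo "We can't detect the correct NIC, modify DEFAULTINTERFACE variable." >&2
        return 1
    fi
    # Assumption: the interface name follows the "dev" keyword.
    echo "$route" | sed -n 's/.* dev \([^ ]*\).*/\1/p' | head -n 1
}
```

A caller could then stop early with something like
IFNAME=$(guess_ifname) || exit 1, so the tests never run against a
guessed interface that does not exist.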

> > > G)
> > > Many times I experienced these messages (output of BasicSanityCheck):
> > >
> > > ...
> > > Reloading heartbeat
> > > Reloading heartbeat
> > > Stopping heartbeat
> > > Stopping High-Availability services:
> > > Done.
> > >
> > > Looks like heartbeat did not really stop.
> > > You\'ll probably need to kill some processes yourself.
> > > Checking STONITH basic sanity.
> > > ...
> > >
> > > What does it mean? Can't heartbeat stop itself?
> >
> > possible - but without the logs it's impossible to say why
> >
>
> All four attached log files contain this message

strange, i don't see anything like that in the attachment

