[Linux-HA] strange monitor behaviour
Pavol Gono
palo.gono at gmail.com
Tue Jan 9 08:56:18 MST 2007
On 1/9/07, Andrew Beekhof <beekhof at gmail.com> wrote:
> > A)
> > It would be nice to have some list of necessary software installed
> > when one wants to run it. E.g. on SLES10 you need python-xml package.
> > On debian (debo machine), installing python-dev or python-xml
> > decreased number of 'BadNews' from 26 to 2. Maybe python version is
> > also important...
>
> can you send me both outputs? that shouldn't be the case.
I looked at the logs, and it seems so.
>
> > B)
> > On my notebook I use debian sarge, python version 2.4. When using HB
> > sources directly (changeset 9918) and configure options equal to debo
> > machine, BasicSanityCheck made a strange exception. Snippet from
> > linux-ha.testlog:
> > ... CTS: Warn: Startup pattern not found: crmd.*pgnotas: State
> > transition.*-> S_IDLE
> > ... CTS: Node pgnotas status:
> > ... CTS: Node status for pgnotas is down but we think it should be up
> > ... CTS: Warn: Start failed for node pgnotas
> > ... CTS: Tearing down partial setup
> > ... CTS: Stopping Cluster Manager on BSC node(s).
> > ... CTS: Exception by exceptions.TypeError
> > ... CTS: Traceback (most recent call last):
> > ... CTS: File "/usr/local/lib/heartbeat/cts/CTSlab.py", line 791, in ?
> > ... CTS: overall, detailed = tests.run(NumIter)
> > ... CTS: TypeError: unpack non-sequence
> > ... CTS: ****************
> > ... CTS: Overall Results:{'failure': 0, 'success': 0, 'BadNews': 0}
> > ... CTS: ****************
> > ... CTS: Detailed Results
> > ... CTS: Test AddResource: {'auditfail': 0, 'failure': 0, 'skipped':
> > 0, 'success': 0, 'calls': 0}
> > ... CTS: <<<<<<<<<<<<<<<< TESTS COMPLETED
> > ... CTS: No failure count but success != requested iterations
> > CRM tests failed (rc=1).
> > (end of linux-ha.testlog now)
>
> can you send me the whole file?
I packed two such logs into the attached archive.
> > D)
> > On one SLES10 machine my colleague used HB sources of changeset 9909.
> > Configure options were similar to debo & fico machines.
> > There is one error reported at the end. It is triggered when the 'Does
> > not look like we ARPed the address' message is displayed. At the very
> > beginning there is also the message 'RTNETLINK answers: Network is
> > unreachable'; I do not know where it comes from.
> > Snippets from output of BasicSanityCheck:
> > RTNETLINK answers: Network is unreachable
> > Using interface: eth3
> > Starting base64 and md5 algorithm tests
> > base64 and md5 algorithm tests succeeded.
> > Starting heartbeat
> > Starting High-Availability services:
> > 2007/01/08_14:56:02 INFO: Resource is stopped
> > done
> >
> > Does not look like we ARPed the address
> > Reloading heartbeat
> > Reloading heartbeat
> > Stopping heartbeat
> > ...
> > Starting CRM tests
> > CRM tests passed.
> > 1 errors. Log file is stored in /tmp/linux-ha.testlog
>
> i think the easiest solution is just change line 515 to:
> LookForString ARP >/dev/null
Yes, committing this change probably makes sense.
>
> > E)
> > On another SLES10 machine (HB sources of changeset 9918, conf options
> > similar to debo & fico), the start of output of BasicSanityCheck
> > looked:
> > RTNETLINK answers: Network is unreachable
> > Using interface: eth0
> > Starting base64 and md5 algorithm tests
> > base64 and md5 algorithm tests succeeded.
> >
> > The interesting thing is eth0 - no such interface was active on
> > machine. The first one was eth2 with IP address 10.54.0.13,
> > Mask:255.255.0.0, TESTIP in script was 10.54.0.2
> > So maybe GuessIFname function has problems...
>
> any more information on why it cant guess correctly?
> "set -x" etc
The machine didn't have a default route. From line 113:
# /sbin/ip r g 123.0.0.1
RTNETLINK answers: Network is unreachable
IMHO the script should check the return value of ip. If it is nonzero,
the script should print something like:
We can't detect the correct NIC, modify DEFAULTINTERFACE variable.
If the script stops at that point, there is no confusion.
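
A rough sketch of the kind of check I mean. This is not the actual
GuessIFname code from BasicSanityCheck: the function name guess_ifname
and the IP_CMD and PROBE_ADDR variables are made up for illustration.
Only the ip call and the error message come from the text above.

```shell
#!/bin/sh
# Hypothetical helper, NOT the real GuessIFname from BasicSanityCheck.
# IP_CMD and PROBE_ADDR are illustrative names (assumptions).
IP_CMD=${IP_CMD:-/sbin/ip}
PROBE_ADDR=${PROBE_ADDR:-123.0.0.1}

guess_ifname() {
    # Ask the kernel which route it would use for the probe address.
    # Without a default route, ip exits nonzero ("Network is unreachable").
    if ! route=$($IP_CMD route get "$PROBE_ADDR" 2>/dev/null); then
        echo "We can't detect the correct NIC, modify DEFAULTINTERFACE variable." >&2
        return 1
    fi

    # Pick the word that follows "dev" in the route output.
    ifname=$(printf '%s\n' "$route" |
        awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }')

    # Treat output without a "dev" field as a failure too.
    if [ -z "$ifname" ]; then
        echo "We can't detect the correct NIC, modify DEFAULTINTERFACE variable." >&2
        return 1
    fi
    echo "$ifname"
}

# Intended use (stop early instead of silently falling back to eth0):
#   IFNAME=$(guess_ifname) || exit 1
```

With something like this, the script would stop right after the
RTNETLINK error instead of continuing with a wrong interface.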
> > G)
> > Many times I experienced these messages (output of BasicSanityCheck):
> >
> > ...
> > Reloading heartbeat
> > Reloading heartbeat
> > Stopping heartbeat
> > Stopping High-Availability services:
> > Done.
> >
> > Looks like heartbeat did not really stop.
> > You\'ll probably need to kill some processes yourself.
> > Checking STONITH basic sanity.
> > ...
> >
> > What does it mean - Can't heartbeat stop itself?
>
> possible - but without the logs its impossible to say why
>
All four attached log files contain this message.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: logs.tar.bz2
Type: application/x-bzip2
Size: 69136 bytes
Desc: not available
Url : http://lists.community.tummy.com/pipermail/linux-ha/attachments/20070109/d39f0c3f/logs.tar-0001.bin