[Linux-HA] strange monitor behaviour

Pavol Gono palo.gono at gmail.com
Mon Jan 8 09:32:39 MST 2007


On 1/8/07, Andrew Beekhof <beekhof at gmail.com> wrote:
> On 1/5/07, Pavol Gono <palo.gono at gmail.com> wrote:
> > In attachment there is the log from fico.
> > The only difference in installation is beginning of configure options
> > (because debo is debian, fico is gentoo distro):
> > ./configure --with-group-name=cluster --with-ccmuser-name=cluster
> > --with-group-id=65 --with-ccmuser-id=65 "CFLAGS=-fno-unit-at-a-time -g
> > -O0" ...
>
> any reason you're not using the debian packages?
>
> you might also be better off with: ./ConfigureMe bootstrap
> which will call configure with the correct options for most distros

My reasons for using the latest sources:
- I am using different distros on my machines and I want to have the same
code running (without special patches from distro maintainers) with
similar configure options
- The CIB configuration is still changing a lot, so I want to have the
latest XML config possible (updates during runtime of the servers won't
be so painful)
- Fewer software dependencies when using custom configure options
- I expect the current code is much closer to the future 2.0.8 than to
the latest 2.0.7 version :)


Concerning the BasicSanityCheck script, I see many issues, and I don't
know how much I should trust it:

A)
It would be nice to have a list of the software that must be installed
before one runs it. E.g. on SLES10 you need the python-xml package.
On debian (debo machine), installing python-dev or python-xml
decreased the number of 'BadNews' from 26 to 2. Maybe the python version
is also important...


B)
On my notebook I use debian sarge with python version 2.4. When using the
HB sources directly (changeset 9918) with the same configure options as on
the debo machine, BasicSanityCheck raised a strange exception. Snippet from
linux-ha.testlog:
... CTS: Warn: Startup pattern not found: crmd.*pgnotas: State
transition.*-> S_IDLE
... CTS: Node pgnotas status:
... CTS: Node status for pgnotas is down but we think it should be up
... CTS: Warn: Start failed for node pgnotas
... CTS: Tearing down partial setup
... CTS: Stopping Cluster Manager on BSC node(s).
... CTS: Exception by exceptions.TypeError
... CTS: Traceback (most recent call last):
... CTS:   File "/usr/local/lib/heartbeat/cts/CTSlab.py", line 791, in ?
... CTS:     overall, detailed = tests.run(NumIter)
... CTS: TypeError: unpack non-sequence
... CTS: ****************
... CTS: Overall Results:{'failure': 0, 'success': 0, 'BadNews': 0}
... CTS: ****************
... CTS: Detailed Results
... CTS: Test AddResource:  {'auditfail': 0, 'failure': 0, 'skipped':
0, 'success': 0, 'calls': 0}
... CTS: <<<<<<<<<<<<<<<< TESTS COMPLETED
... CTS: No failure count but success != requested iterations
CRM tests failed (rc=1).
(end of linux-ha.testlog now)
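
The "unpack non-sequence" TypeError above is what Python 2.4 reports when
a single non-sequence value (for example None) is unpacked into two names.
A minimal sketch of how that can happen, assuming tests.run() takes an
early exit after the failed startup and returns no (overall, detailed)
pair. The functions run() and unpack_results() below are hypothetical
stand-ins, not the real CTS code:

```python
# Hypothetical stand-in for the CTS call `tests.run(NumIter)`; not the real code.
def run(num_iter, startup_ok):
    if not startup_ok:
        # Assumed early exit after a failed startup: one value, not a pair.
        return None
    overall = {"success": num_iter, "failure": 0, "BadNews": 0}
    detailed = {}
    return overall, detailed


def unpack_results(num_iter, startup_ok):
    """Return (overall, detailed), or None if run() did not hand back a pair."""
    try:
        overall, detailed = run(num_iter, startup_ok)
    except TypeError:
        # Unpacking None into two names raises TypeError
        # (Python 2.4 words it as "unpack non-sequence").
        return None
    return overall, detailed
```

If this guess is right, the exception is only a consequence of the failed
node start and not a separate problem.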


C)
When using the official debian package of heartbeat -
heartbeat-2_2.0.7-2_i386.deb, a similar error occurred!
The only difference was the line number:
... CTS:   File "/usr/lib/heartbeat/cts/CTSlab.py", line 781, in ?

There were also some additional log messages after the line
CRM tests failed (rc=1).

The output of BasicSanityCheck contained:
cib[23598]: 2007/01/08_14:20:06 ERROR: cib_ccm_dispatch:callbacks.c
CCM connection appears to have failed: rc=-1.
crmd[23601]: 2007/01/08_14:20:06 ERROR: cib_native_signon:cib_native.c
No reply message - disconnected - 0
attrd[23599]: 2007/01/08_14:20:06 ERROR:
cib_native_signon:cib_native.c No reply message - disconnected - 0
crmd[23601]: 2007/01/08_14:20:06 ERROR: crm_timer_start:utils.c Tried
to start Shutdown Escalation (I_STOP:-1ms) with a -ve period
crmd[23601]: 2007/01/08_14:20:06 debug: actions:trace:  // A_ERROR
crmd[23601]: 2007/01/08_14:20:06 ERROR: do_log:misc.c [[FSA]] Input
I_SHUTDOWN from crm_shutdown() received in state (S_STARTING)
crmd[23601]: 2007/01/08_14:20:06 ERROR: send_ha_message:ipc.c No
heartbeat connection specified
cibmon[23595]: 2007/01/08_14:20:06 ERROR:
cib_native_signon:cib_native.c No reply message - disconnected - 0
mgmtd[23602]: 2007/01/08_14:20:06 ERROR:
cib_native_signon:cib_native.c No reply message - disconnected - 0
cibmon[23595]: 2007/01/08_14:20:06 ERROR: main:cibmon.c Signon to CIB
failed: not connected
cibmon[23595]: 2007/01/08_14:20:06 ERROR: main:cibmon.c Setup failed,
could not monitor CIB actions
stonithd[23597]: 2007/01/08_14:20:07 ERROR: Disconnected with heartbeat daemon
OOPS! Looks like we had some errors come up.
4 errors. Log file is stored in /tmp/linux-ha.testlog


D)
On one SLES10 machine my colleague used the HB sources of changeset 9909.
The configure options were similar to the debo & fico machines.
There is one error reported at the end. It is triggered when the 'Does
not look like we ARPed the address' message is displayed. At the very
beginning there is also the message 'RTNETLINK answers: Network is
unreachable', and I do not know where it comes from.
Snippets from the output of BasicSanityCheck:
RTNETLINK answers: Network is unreachable
Using interface: eth3
Starting base64 and md5 algorithm tests
base64 and md5 algorithm tests succeeded.
Starting heartbeat
Starting High-Availability services:
2007/01/08_14:56:02 INFO:  Resource is stopped
   done

Does not look like we ARPed the address
Reloading heartbeat
Reloading heartbeat
Stopping heartbeat
...
Starting CRM tests
CRM tests passed.
1 errors. Log file is stored in /tmp/linux-ha.testlog


E)
On another SLES10 machine (HB sources of changeset 9918, configure options
similar to debo & fico), the start of the BasicSanityCheck output
looked like this:
RTNETLINK answers: Network is unreachable
Using interface: eth0
Starting base64 and md5 algorithm tests
base64 and md5 algorithm tests succeeded.

The interesting thing is eth0 - no such interface was active on the
machine. The first one was eth2 with IP address 10.54.0.13,
Mask:255.255.0.0, and TESTIP in the script was 10.54.0.2.
So maybe the GuessIFname function has problems...
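
To show what I would expect here: TESTIP 10.54.0.2 lies inside the network
of eth2 (10.54.0.13 with mask 255.255.0.0), so eth2 looks like the natural
choice. A minimal sketch of that check using Python's standard ipaddress
module. This is not how GuessIFname in the script works (I did not analyse
it); same_subnet() and guess_ifname() are hypothetical helpers for
illustration only:

```python
import ipaddress


def same_subnet(addr, netmask, test_ip):
    """True if test_ip belongs to the network defined by addr/netmask."""
    iface = ipaddress.ip_interface(f"{addr}/{netmask}")
    return ipaddress.ip_address(test_ip) in iface.network


def guess_ifname(interfaces, test_ip):
    """Return the first interface whose network contains test_ip, else None.

    interfaces is a list of (name, addr, netmask) tuples.
    """
    for name, addr, netmask in interfaces:
        if same_subnet(addr, netmask, test_ip):
            return name
    return None


# Values taken from the SLES10 machine described above.
active = [("eth2", "10.54.0.13", "255.255.0.0")]
chosen = guess_ifname(active, "10.54.0.2")
```

With only eth2 active, a lookup of this kind has no way to arrive at eth0.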


F)
A small curiosity:
Once, after running the BasicSanityCheck script on the debo machine, I
noticed the following virtual IP addresses (in addition to eth0 and lo):
lo:0      Link encap:Local Loopback
          inet addr:127.0.0.11  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1

lo:1      Link encap:Local Loopback
          inet addr:127.0.0.12  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1


G)
Many times I saw these messages (output of BasicSanityCheck):

...
Reloading heartbeat
Reloading heartbeat
Stopping heartbeat
Stopping High-Availability services:
Done.

Looks like heartbeat did not really stop.
You\'ll probably need to kill some processes yourself.
Checking STONITH basic sanity.
...

What does this mean? Can't heartbeat stop itself?


H)
Another thing to improve:

When using the default ./ConfigureMe bootstrap on debian sarge (gcc
version 4.1.2), the compiler gives a warning that stops the build:
cc1: warnings being treated as errors
cl_msg.c: In function 'ha_msg_new':
cl_msg.c:234: warning: type of 'nfields' defaults to 'int'
So at least the configure option --disable-fatal-warnings must be used (HB
sources of changeset 9918).


I)
I tried this installation on the debo machine:
./ConfigureMe bootstrap --disable-fatal-warnings
make
make install

And again I got these types of errors when running BasicSanityCheck:
Looks like heartbeat did not really stop.
You\'ll probably need to kill some processes yourself.
...
2 errors. Log file is stored in /tmp/linux-ha.testlog
...
Overall Results:{'failure': 0, 'success': 2, 'BadNews': 22}



I also have one piece of good news: I was able to run BasicSanityCheck
successfully on SLES10. I used the source package
http://linux-ha.org/download/heartbeat-2.0.7-1.src.rpm, I just needed
to hack heartbeat.spec a little. But still, together with the fico machine
these are only 2 successful installations :(

If you want any of the linux-ha.testlog files, BasicSanityCheck outputs,
or configure / make / make install outputs, I can send them. It's just
too much data at once.

Could you give me some hints on how to install your latest heartbeat and
get a successful BasicSanityCheck run? I am facing many problems.

Thanks

Palo

