[Linux-HA] strange monitor behaviour
Andrew Beekhof
beekhof at gmail.com
Tue Jan 9 07:10:55 MST 2007
don't like short emails, do you ;-)
On 1/8/07, Pavol Gono <palo.gono at gmail.com> wrote:
> On 1/8/07, Andrew Beekhof <beekhof at gmail.com> wrote:
> > On 1/5/07, Pavol Gono <palo.gono at gmail.com> wrote:
> > > In attachment there is the log from fico.
> > > The only difference in installation is beginning of configure options
> > > (because debo is debian, fico is gentoo distro):
> > > ./configure --with-group-name=cluster --with-ccmuser-name=cluster
> > > --with-group-id=65 --with-ccmuser-id=65 "CFLAGS=-fno-unit-at-a-time -g
> > > -O0" ...
> >
> > any reason you're not using the debian packages?
> >
> > you might also be better off with: ./ConfigureMe bootstrap
> > which will call configure with the correct options for most distros
>
> So my reasons for using latest sources:
> - I am using different distros on machines and I want to have the same
> code running (without special patches from distro maintainers) and the
> similar configure options
> - CIB configuration is still changing a lot, so I want to have as
> recent an XML config as possible (updates during runtime of servers
> won't be so painful)
> - Less software dependencies when using custom configure options
> - I expect the current code is much closer to the future 2.0.8 than to
> the latest 2.0.7 version :)
makes sense :-)
>
>
> Concerning the BasicSanityCheck script, I see many issues, and I don't
> know how much I should trust it:
>
> A)
> It would be nice to have some list of necessary software installed
> when one wants to run it. E.g. on SLES10 you need python-xml package.
> On debian (debo machine), installing python-dev or python-xml
> decreased number of 'BadNews' from 26 to 2. Maybe python version is
> also important...
can you send me both outputs? that shouldn't be the case.
> B)
> On my notebook I use debian sarge, python version 2.4. When using HB
> sources directly (changeset 9918) and configure options equal to debo
> machine, BasicSanityCheck made a strange exception. Snippet from
> linux-ha.testlog:
> ... CTS: Warn: Startup pattern not found: crmd.*pgnotas: State
> transition.*-> S_IDLE
> ... CTS: Node pgnotas status:
> ... CTS: Node status for pgnotas is down but we think it should be up
> ... CTS: Warn: Start failed for node pgnotas
> ... CTS: Tearing down partial setup
> ... CTS: Stopping Cluster Manager on BSC node(s).
> ... CTS: Exception by exceptions.TypeError
> ... CTS: Traceback (most recent call last):
> ... CTS: File "/usr/local/lib/heartbeat/cts/CTSlab.py", line 791, in ?
> ... CTS: overall, detailed = tests.run(NumIter)
> ... CTS: TypeError: unpack non-sequence
> ... CTS: ****************
> ... CTS: Overall Results:{'failure': 0, 'success': 0, 'BadNews': 0}
> ... CTS: ****************
> ... CTS: Detailed Results
> ... CTS: Test AddResource: {'auditfail': 0, 'failure': 0, 'skipped':
> 0, 'success': 0, 'calls': 0}
> ... CTS: <<<<<<<<<<<<<<<< TESTS COMPLETED
> ... CTS: No failure count but success != requested iterations
> CRM tests failed (rc=1).
> (end of linux-ha.testlog now)
can you send me the whole file?
> C)
> When using the official debian package of heartbeat -
> heartbeat-2_2.0.7-2_i386.deb, a similar error occurred!
> The only difference was the line number:
> ... CTS: File "/usr/lib/heartbeat/cts/CTSlab.py", line 781, in ?
>
> And some additional log messages after
> CRM tests failed (rc=1).
>
> The output of BasicSanityCheck contained:
> cib[23598]: 2007/01/08_14:20:06 ERROR: cib_ccm_dispatch:callbacks.c
> CCM connection appears to have failed: rc=-1.
> crmd[23601]: 2007/01/08_14:20:06 ERROR: cib_native_signon:cib_native.c
> No reply message - disconnected - 0
> attrd[23599]: 2007/01/08_14:20:06 ERROR:
> cib_native_signon:cib_native.c No reply message - disconnected - 0
> crmd[23601]: 2007/01/08_14:20:06 ERROR: crm_timer_start:utils.c Tried
> to start Shutdown Escalation (I_STOP:-1ms) with a -ve period
> crmd[23601]: 2007/01/08_14:20:06 debug: actions:trace: // A_ERROR
> crmd[23601]: 2007/01/08_14:20:06 ERROR: do_log:misc.c [[FSA]] Input
> I_SHUTDOWN from crm_shutdown() received in state (S_STARTING)
> crmd[23601]: 2007/01/08_14:20:06 ERROR: send_ha_message:ipc.c No
> heartbeat connection specified
> cibmon[23595]: 2007/01/08_14:20:06 ERROR:
> cib_native_signon:cib_native.c No reply message - disconnected - 0
> mgmtd[23602]: 2007/01/08_14:20:06 ERROR:
> cib_native_signon:cib_native.c No reply message - disconnected - 0
> cibmon[23595]: 2007/01/08_14:20:06 ERROR: main:cibmon.c Signon to CIB
> failed: not connected
> cibmon[23595]: 2007/01/08_14:20:06 ERROR: main:cibmon.c Setup failed,
> could not monitor CIB actions
> stonithd[23597]: 2007/01/08_14:20:07 ERROR: Disconnected with heartbeat daemon
> OOPS! Looks like we had some errors come up.
> 4 errors. Log file is stored in /tmp/linux-ha.testlog
this is caused by the whole PID mismatch thingy that made me ask you
to run BSC in the first place. so in this case it's doing its job
correctly :-)
> D)
> On one SLES10 machine my colleague used HB sources of changeset 9909.
> Configure options were similar to debo & fico machines.
> There is one error reported at the end. It is triggered when the 'Does
> not look like we ARPed the address' message is displayed. At the very
> beginning there is also the message 'RTNETLINK answers: Network is
> unreachable', and I do not know where it comes from.
> Snippets from output of BasicSanityCheck:
> RTNETLINK answers: Network is unreachable
> Using interface: eth3
> Starting base64 and md5 algorithm tests
> base64 and md5 algorithm tests succeeded.
> Starting heartbeat
> Starting High-Availability services:
> 2007/01/08_14:56:02 INFO: Resource is stopped
> done
>
> Does not look like we ARPed the address
> Reloading heartbeat
> Reloading heartbeat
> Stopping heartbeat
> ...
> Starting CRM tests
> CRM tests passed.
> 1 errors. Log file is stored in /tmp/linux-ha.testlog
i think the easiest solution is to just change line 515 to:
LookForString ARP >/dev/null
> E)
> On another SLES10 machine (HB sources of changeset 9918, conf options
> similar to debo & fico), the start of the BasicSanityCheck output
> looked like this:
> RTNETLINK answers: Network is unreachable
> Using interface: eth0
> Starting base64 and md5 algorithm tests
> base64 and md5 algorithm tests succeeded.
>
> The interesting thing is eth0 - no such interface was active on
> machine. The first one was eth2 with IP address 10.54.0.13,
> Mask:255.255.0.0, TESTIP in script was 10.54.0.2
> So maybe GuessIFname function has problems...
any more information on why it can't guess correctly?
a trace from running it with "set -x" etc would help
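
for the tracing, here is a minimal sketch of what i mean, in plain sh.
none of it is taken from BasicSanityCheck: guess_ifname, same_network,
ip_to_int, trace_call and the name/address/netmask argument format are
all made up for illustration, and the real GuessIFname may work quite
differently. it only shows how to capture a "set -x" trace of one
function in a file, using the eth2 numbers from your report.

```shell
#!/bin/sh
# Hypothetical stand-ins for illustration; NOT the real GuessIFname
# from BasicSanityCheck.

# Convert a dotted quad such as 10.54.0.13 to an integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if test IP $1 is in the network of address $2 with netmask $3.
same_network() {
    ip=$(ip_to_int "$1")
    addr=$(ip_to_int "$2")
    mask=$(ip_to_int "$3")
    [ $(( $ip & $mask )) -eq $(( $addr & $mask )) ]
}

# Print the first interface whose network contains the test IP.
# Arguments after the IP are "name/address/netmask" triples (made-up format).
guess_ifname() {
    testip=$1
    shift
    for entry in "$@"; do
        name=${entry%%/*}
        rest=${entry#*/}
        if same_network "$testip" "${rest%%/*}" "${rest#*/}"; then
            echo "$name"
            return 0
        fi
    done
    return 1
}

# Run any command with tracing on and save the trace (stderr) to a file.
trace_call() {
    tracefile=$1
    shift
    ( set -x; "$@" ) 2>"$tracefile"
}

trace_call /tmp/guess.trace guess_ifname 10.54.0.2 \
    lo/127.0.0.1/255.0.0.0 eth2/10.54.0.13/255.255.0.0
```

with those values this prints eth2, and /tmp/guess.trace holds the
trace, which is the part worth sending along.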
> F)
> A little curiosity:
> Once after running BasicSanityCheck script, on debo machine I noticed
> following virtual IP addresses (in addition to eth0 and lo):
> lo:0 Link encap:Local Loopback
> inet addr:127.0.0.11 Mask:255.0.0.0
> UP LOOPBACK RUNNING MTU:16436 Metric:1
>
> lo:1 Link encap:Local Loopback
> inet addr:127.0.0.12 Mask:255.0.0.0
> UP LOOPBACK RUNNING MTU:16436 Metric:1
that would be due to various failures you've seen
> G)
> Many times I experienced these messages (output of BasicSanityCheck):
>
> ...
> Reloading heartbeat
> Reloading heartbeat
> Stopping heartbeat
> Stopping High-Availability services:
> Done.
>
> Looks like heartbeat did not really stop.
> You\'ll probably need to kill some processes yourself.
> Checking STONITH basic sanity.
> ...
>
> What does it mean - Can't heartbeat stop itself?
possible - but without the logs it's impossible to say why
> H)
> Another thing to improve:
>
> When using default ./ConfigureMe bootstrap on debian sarge (gcc
> version 4.1.2), the compiler gives warning:
> cc1: warnings being treated as errors
> cl_msg.c: In function 'ha_msg_new':
> cl_msg.c:234: warning: type of 'nfields' defaults to 'int'
> So at least configure option --disable-fatal-warnings must be used (HB
> sources of changeset 9918).
fixed upstream
> I)
> I tried this installation on the debo machine:
> ./ConfigureMe bootstrap --disable-fatal-warnings
> make
> make install
>
> And again these types of errors when running BasicSanityCheck:
> Looks like heartbeat did not really stop.
> You\'ll probably need to kill some processes yourself.
> ...
> 2 errors. Log file is stored in /tmp/linux-ha.testlog
> ...
> Overall Results:{'failure': 0, 'success': 2, 'BadNews': 22}
i don't know what the problem is, but starting from a clean base (ie.
freshly rebooted machine) is always a good idea when testing one
thing/version vs. another.
> I also have one piece of good news: I was able to run BasicSanityCheck
> successfully on SLES10. I used the source package
> http://linux-ha.org/download/heartbeat-2.0.7-1.src.rpm, I just needed
> to hack heartbeat.spec a little. But still, with the fico machine these
> are the only 2 successful installations :(
>
> If you want any of linux-ha.testlog, BasicSanityCheck outputs,
> configure / make / make install outputs, I can send them. It's just
> too much data at once.
bzipping it first should be enough
> Could you give me some hints on how to install your latest heartbeat and
> get a successful BasicSanityCheck? Because I am facing many
> problems.