[Linux-HA] Re: pengine process killed by signal 11 (SIGSEGV)
Daniel van Ham Colchete
daniel.colchete at gmail.com
Sat Dec 9 07:05:42 MST 2006
Hi again,
Problem solved: running chown -R cluster:cluster /var/lib/heartbeat
/var/run/heartbeat on the 'www0' node fixed it.
Suggestion: there is a bug in pengine. It does not check whether it
can create a file inside one of those directories and tries to write
it anyway, and in that case you get a SIGSEGV. This condition could be
checked and the node marked as ineligible for the DC election.
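A minimal sketch of the kind of check meant here, in Python for brevity
(pengine itself is C, and none of these names come from its source): try
to create a scratch file in each directory and report the failure instead
of writing blindly. The function names and the probe-file prefix are
assumptions for illustration only.

```python
import os
import tempfile


def can_create_file(directory):
    """Return True if the current user can create a file in `directory`."""
    try:
        # Creating a real file is more reliable than os.access(), which can
        # disagree with the actual outcome (ACLs, read-only mounts, etc.).
        fd, path = tempfile.mkstemp(prefix=".probe-", dir=directory)
    except OSError:
        return False
    os.close(fd)
    os.unlink(path)
    return True


def unwritable_dirs(directories):
    """Return the subset of `directories` where file creation fails."""
    return [d for d in directories if not can_create_file(d)]


if __name__ == "__main__":
    # The two directories named in this thread; adjust for your setup.
    for d in unwritable_dirs(["/var/lib/heartbeat", "/var/run/heartbeat"]):
        print("cannot create files in", d)
```

Run as the user the cluster daemons run as (cluster, in my case); a
non-empty result would be the point at which to refuse DC duty rather
than crash.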
How did I find out? I started the www0 node first and nothing worked.
The DC election algorithm always chooses www0 as DC because of
its lower UUID, so whenever both nodes start together the broken node
becomes DC and its pengine crashes.
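To illustrate why the start order mattered, here is a toy model in Python.
It assumes only what I observed, that the node with the lowest UUID wins;
the real election logic in Heartbeat may well consider more than that.
The UUID strings and the "eligible" flag are made up for the example and
do not come from the Heartbeat source.

```python
def elect_dc(nodes):
    """Pick a DC from `nodes`, a list of (name, uuid, eligible) tuples.

    Toy rule: among eligible nodes, the lowest UUID wins.
    Returns the winning node name, or None if no node is eligible.
    """
    candidates = [(uuid, name) for name, uuid, eligible in nodes if eligible]
    if not candidates:
        return None
    return min(candidates)[1]


if __name__ == "__main__":
    # Hypothetical UUIDs; www0 is given the lower one, as in this thread.
    both_up = [("www0", "1111", True), ("mail0", "2222", True)]
    print(elect_dc(both_up))

    # Same cluster, with www0 marked ineligible (e.g. unwritable directories).
    www0_excluded = [("www0", "1111", False), ("mail0", "2222", True)]
    print(elect_dc(www0_excluded))
```

With both nodes eligible the sketch picks www0; with www0 excluded it
falls back to mail0, which is the behaviour the suggestion above is after.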
Best regards,
Daniel Colchete
On 12/9/06, Daniel van Ham Colchete <daniel.colchete at gmail.com> wrote:
> Hi,
>
> I'm trying to set up a 2-node Heartbeat 2.0 system here. I'm using
> version 2.0.7 on a Gentoo system with kernel 2.6.18.
>
> When I start one of the nodes (mail0) first and then the second,
> everything works fine. My problem is that when I start both at the
> same time, nothing works.
>
> Doing some digging, I found that pengine is having some sort of
> segmentation fault (signal 11).
>
> First, the important logs:
>
> Dec 9 11:43:05 www0 crmd: [24314]: info: crm_timer_popped:utils.c
> Election Trigger (I_DC_TIMEOUT) just popped!
> Dec 9 11:43:05 www0 crmd: [24314]: info: update_dc:utils.c Set DC to
> <null> (<null>)
> Dec 9 11:43:05 www0 crmd: [24314]: info: start_subsystem:subsystems.c
> Starting sub-system "pengine"
> Dec 9 11:43:05 www0 crmd: [24314]: info: do_dc_takeover:election.c
> Taking over DC status for this partition
> Dec 9 11:43:05 www0 cib: [13106]: info:
> cib_process_readwrite:messages.c We are now in R/W mode
> Dec 9 11:43:05 www0 pengine: [24321]: info: init_start:main.c Starting pengine
> Dec 9 11:43:05 www0 crmd: [24314]: info: update_dc:utils.c Set DC to
> www0 (1.0.6)
> Dec 9 11:43:06 www0 crmd: [24314]: info: do_state_transition:fsa.c
> All 2 cluster nodes responded to the join offer.
> Dec 9 11:43:06 www0 cib: [13106]: info: sync_our_cib:messages.c
> Syncing CIB to all peers
> Dec 9 11:43:06 www0 crmd: [24314]: info: update_dc:utils.c Set DC to
> www0 (1.0.6)
> Dec 9 11:43:07 www0 crmd: [24314]: info: do_state_transition:fsa.c
> www0: State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [
> input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]
> Dec 9 11:43:07 www0 crmd: [24314]: info: do_state_transition:fsa.c
> All 2 cluster nodes are eligable to run resources.
> Dec 9 11:43:07 www0 crmd: [24314]: info:
> crmd_ipc_msg_callback:callbacks.c pengine: no message this time
> Dec 9 11:43:07 www0 crmd: [24314]: info:
> process_client_disconnect:utils.c Received HUP from pengine:[24321]
> Dec 9 11:43:07 www0 crmd: [24314]: WARN: Exiting pengine process
> 24321 killed by signal 11.
> Dec 9 11:43:07 www0 crmd: [24314]: info:
> crmdManagedChildDied:subsystems.c Process pengine:[24321] exited
> (signal=11, exitcode=0)
> Dec 9 11:43:07 www0 crmd: [24314]: ERROR:
> crmdManagedChildDied:subsystems.c The pengine subsystem terminated
> unexpectedly
> Dec 9 11:43:07 www0 crmd: [24314]: ERROR: do_log:misc.c [[FSA]] Input
> I_ERROR from crmdManagedChildDied() received in state
> (S_TRANSITION_ENGINE)
> Dec 9 11:43:07 www0 crmd: [24314]: info: do_dc_release:election.c DC
> role released
>
> And it repeats indefinitely.
>
> You can access my cib.xml at http://pastebin.ca/272979.
>
> Thanks for any help.
>
> Best regards,
> Daniel Colchete
>