[Linux-HA] Re: pengine process killed by signal 11 (SIGSEGV)

Andrew Beekhof beekhof at gmail.com
Mon Dec 11 02:49:37 MST 2006


On 12/9/06, Daniel van Ham Colchete <daniel.colchete at gmail.com> wrote:
> Hi again,
>
> problem solved: I ran chown -R cluster:cluster /var/lib/heartbeat
> /var/run/heartbeat on the 'www0' node.

can i ask what the original permissions were?
you might want to report this problem to the packager
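
For anyone wanting to capture that information before changing anything, here is a minimal sketch in Python. It is not part of Heartbeat; the describe() helper is a hypothetical name, and the two paths are simply the ones from the chown command above. It prints the mode, owner and group of each directory, much like one line of ls -ld.

```python
import grp
import os
import pwd
import stat


def describe(path):
    """Return 'mode owner:group path' for path, similar to one line of ls -ld."""
    st = os.lstat(path)
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        # No passwd entry for this uid; fall back to the number.
        owner = str(st.st_uid)
    try:
        group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        # No group entry for this gid; fall back to the number.
        group = str(st.st_gid)
    return f"{stat.filemode(st.st_mode)} {owner}:{group} {path}"


if __name__ == "__main__":
    # Paths taken from the chown command quoted above.
    for directory in ("/var/lib/heartbeat", "/var/run/heartbeat"):
        try:
            print(describe(directory))
        except OSError as exc:
            print(f"{directory}: {exc}")
```

This only looks at the top-level directories; the chown above was recursive, so files underneath may also have had different ownership.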

>
> Suggestion: there is one bug within pengine,

actually it's a bug in the packaging.  that directory is required to be
writable by the PE.

granted we could handle it more gracefully than we do currently.
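
to illustrate the kind of up-front check that would allow this, here is a rough sketch in Python. It is not the pengine code (which is C), and can_write_in() and unwritable() are made-up names for this example. The idea is to probe each directory by actually creating a temporary file and to report a readable error when that fails, instead of writing blindly.

```python
import os
import tempfile


def can_write_in(directory):
    """Return (True, "") if a file can be created in directory, else (False, reason)."""
    try:
        # Creating a real file is the most direct test; it is deleted on close.
        with tempfile.NamedTemporaryFile(dir=directory, prefix=".probe-"):
            pass
    except OSError as exc:
        return False, f"{directory}: {exc.strerror or exc}"
    return True, ""


def unwritable(directories):
    """Return the failure reason for every directory that did not pass the probe."""
    reasons = []
    for directory in directories:
        ok, reason = can_write_in(directory)
        if not ok:
            reasons.append(reason)
    return reasons


if __name__ == "__main__":
    # Paths taken from the chown command quoted earlier in this thread.
    problems = unwritable(["/var/lib/heartbeat", "/var/run/heartbeat"])
    for problem in problems:
        print("ERROR:", problem)
    raise SystemExit(1 if problems else 0)
```

the probe is only meaningful when run as the user the daemon runs as (the chown above suggests "cluster" on this install); run as root it will normally succeed regardless of ownership.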

> it's not checking whether it
> can create a file inside one of those directories and tries to write
> it anyway, and in that case you get a SIGSEGV. This condition could be
> checked and the node barred from entering the DC election.
>
> How did I find out? I started the www0 node first and nothing worked.
> The DC election algorithm always chose www0 as DC because of
> its lower UUID.
>
> Best regards,
> Daniel Colchete
>
> On 12/9/06, Daniel van Ham Colchete <daniel.colchete at gmail.com> wrote:
> > Hi,
> >
> > I'm trying to set up a 2-node Heartbeat 2.0 system here. I'm using
> > version 2.0.7 on a Gentoo system with kernel 2.6.18.
> >
> > When I start one of the nodes (mail0) first and then the second,
> > everything works fine. My problem is that when I start both at the
> > same time, nothing works.
> >
> > Doing some digging, I found that pengine is having some sort of
> > segmentation fault (signal 11).
> >
> > First, the important logs:
> >
> > Dec  9 11:43:05 www0 crmd: [24314]: info: crm_timer_popped:utils.c
> > Election Trigger (I_DC_TIMEOUT) just popped!
> > Dec  9 11:43:05 www0 crmd: [24314]: info: update_dc:utils.c Set DC to
> > <null> (<null>)
> > Dec  9 11:43:05 www0 crmd: [24314]: info: start_subsystem:subsystems.c
> > Starting sub-system "pengine"
> > Dec  9 11:43:05 www0 crmd: [24314]: info: do_dc_takeover:election.c
> > Taking over DC status for this partition
> > Dec  9 11:43:05 www0 cib: [13106]: info:
> > cib_process_readwrite:messages.c We are now in R/W mode
> > Dec  9 11:43:05 www0 pengine: [24321]: info: init_start:main.c Starting pengine
> > Dec  9 11:43:05 www0 crmd: [24314]: info: update_dc:utils.c Set DC to
> > www0 (1.0.6)
> > Dec  9 11:43:06 www0 crmd: [24314]: info: do_state_transition:fsa.c
> > All 2 cluster nodes responded to the join offer.
> > Dec  9 11:43:06 www0 cib: [13106]: info: sync_our_cib:messages.c
> > Syncing CIB to all peers
> > Dec  9 11:43:06 www0 crmd: [24314]: info: update_dc:utils.c Set DC to
> > www0 (1.0.6)
> > Dec  9 11:43:07 www0 crmd: [24314]: info: do_state_transition:fsa.c
> > www0: State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [
> > input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]
> > Dec  9 11:43:07 www0 crmd: [24314]: info: do_state_transition:fsa.c
> > All 2 cluster nodes are eligable to run resources.
> > Dec  9 11:43:07 www0 crmd: [24314]: info:
> > crmd_ipc_msg_callback:callbacks.c pengine: no message this time
> > Dec  9 11:43:07 www0 crmd: [24314]: info:
> > process_client_disconnect:utils.c Received HUP from pengine:[24321]
> > Dec  9 11:43:07 www0 crmd: [24314]: WARN: Exiting pengine process
> > 24321 killed by signal 11.
> > Dec  9 11:43:07 www0 crmd: [24314]: info:
> > crmdManagedChildDied:subsystems.c Process pengine:[24321] exited
> > (signal=11, exitcode=0)
> > Dec  9 11:43:07 www0 crmd: [24314]: ERROR:
> > crmdManagedChildDied:subsystems.c The pengine subsystem terminated
> > unexpectedly
> > Dec  9 11:43:07 www0 crmd: [24314]: ERROR: do_log:misc.c [[FSA]] Input
> > I_ERROR from crmdManagedChildDied() received in state
> > (S_TRANSITION_ENGINE)
> > Dec  9 11:43:07 www0 crmd: [24314]: info: do_dc_release:election.c DC
> > role released
> >
> > And it repeats indefinitely.
> >
> > You can access my cib.xml at http://pastebin.ca/272979.
> >
> > Thanks for any help.
> >
> > Best regards,
> > Daniel Colchete
> >
> _______________________________________________
> Linux-HA mailing list
> Linux-HA at lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>

