[Linux-HA] Corosync startup at boot and stonith device failed start
Dejan Muhamedagic
dejanmm at fastmail.fm
Tue Oct 12 02:02:13 MDT 2010
Hi,
On Mon, Oct 11, 2010 at 10:17:08PM -0600, Eric Schoeller wrote:
> Good evening,
>
> I noticed that when corosync is set to start at boot my stonith devices
> don't start up correctly.
>
> Here is some version info:
>
> cluster-glue: 1.0.6
> Corosync Cluster Engine, version '1.2.7' SVN revision '3008'
> Name : pacemaker
> Version : 1.0.9.1
> Release : 1.15.el5
>
> I've read in many places that stonith devices may rely upon atd. I
> haven't looked around enough to fully understand the necessity of this
> dependency, but I believe it's the cause of the problem I'm
> experiencing.
Wrong. atd is needed only by external/ssh, and then only for the
fencing operations. You're running into a different problem.
> The corosync init script is configured to start and stop
> at 20, and atd is configured to start and stop at 95 and 5 on my RHEL5.5
> system. If I move corosync up to 98 (after atd) my stonith devices start
> just fine. If I add a start-delay to the stonith device that delays it
> past the startup of atd, the stonith device also starts just fine. Using
> the default init script and no start-delay ends with a Failed Action for
> the stonith device, and it never recovers without manual intervention.
>
> My questions are: Why is the default init script shipped with the RPM
> from the clusterlabs repo configured to start before atd if atd is a
> dependency of certain parts of the pacemaker framework (if this is indeed
> the case)? Is it safe/recommended to add a start-delay of several
> minutes to a stonith device to work around this problem?
Well, if possible, it is much better to fix the underlying problem.
Otherwise, a start-delay on the start action may at times slow down
fencing, for instance during startup fencing.
> Thanks!!
>
> Eric Schoeller
>
>
> Here are some logs:
>
> Oct 11 20:33:14 nodea crmd: [3156]: info: do_lrm_rsc_op: Performing
> key=52:56:0:be604143-3a5a-4086-8e5c-d3d052804091 op=st-nodeb-ipmi_start_0 )
> Oct 11 20:33:14 nodea lrmd: [3153]: info: rsc:st-nodeb-ipmi:8: start
> Oct 11 20:33:14 nodea lrmd: [3397]: info: Try to start STONITH resource
> <rsc_id=st-nodeb-ipmi> : Device=external/ipmi
> Oct 11 20:33:14 nodea crmd: [3156]: info: do_lrm_rsc_op: Performing
> key=12:56:0:be604143-3a5a-4086-8e5c-d3d052804091 op=drbd_nfs:0_start_0 )
> Oct 11 20:33:14 nodea lrmd: [3153]: info: rsc:drbd_nfs:0:9: start
>
> Oct 11 20:33:37 nodea external/ipmi[3433]: ERROR: error executing
> ipmitool: Error: Unable to establish IPMI v2 / RMCP+ session^M Unable to
> get Chassis Power Status
ipmitool fails here: it cannot establish a session with the IPMI
device. Perhaps the network is not yet fully operational at that
point in the boot sequence.
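
To check this by hand, one could run the same kind of ipmitool query
in a retry loop early in the boot sequence and see how long it takes
to start succeeding. The following is only a rough sketch: retry_cmd
is a hypothetical helper (not part of cluster-glue), the attempt count
and delay are arbitrary, and the ipmitool arguments in the trailing
comment are assumed from the st-nodeb-ipmi primitive quoted below.

```shell
#!/bin/sh
# Hypothetical helper: retry a command until it succeeds or the
# attempts run out.
# usage: retry_cmd ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry_cmd() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Run the command, discarding its output; stop on first success
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        # Sleep between attempts, but not after the final one
        if [ "$i" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example invocation (assumed values, taken from the quoted configuration):
# retry_cmd 30 10 ipmitool -I lanplus -H 1.2.3.25 -U coolguy -P changeme \
#     chassis power status
```

If the query only starts succeeding some time after corosync has
started, that would point at the network (or the IPMI device) not
being ready yet, rather than at atd.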
Thanks,
Dejan
> Oct 11 20:33:38 nodea stonithd: [3432]: info: external_run_cmd: Calling
> '/usr/lib64/stonith/plugins/external/ipmi status' returned 256
> Oct 11 20:33:38 nodea stonithd: [3432]: CRIT: external_status: 'ipmi
> status' failed with rc 256
> Oct 11 20:33:38 nodea stonithd: [3151]: WARN: start st-nodeb-ipmi
> failed, because its hostlist is empty
> Oct 11 20:33:38 nodea lrmd: [3153]: WARN: Managed st-nodeb-ipmi:start
> process 3397 exited with return code 1.
> Oct 11 20:33:38 nodea crmd: [3156]: info: process_lrm_event: LRM
> operation st-nodeb-ipmi_start_0 (call=8, rc=1, cib-update=16,
> confirmed=true) unknown error
> Oct 11 20:33:38 nodea attrd: [3154]: info: attrd_ais_dispatch: Update
> relayed from nodeb.domain.com
> Oct 11 20:33:38 nodea attrd: [3154]: info: attrd_trigger_update: Sending
> flush op to all hosts for: fail-count-st-nodeb-ipmi (INFINITY)
> Oct 11 20:33:38 nodea attrd: [3154]: info: attrd_perform_update: Sent
> update 26: fail-count-st-nodeb-ipmi=INFINITY
> Oct 11 20:33:38 nodea attrd: [3154]: info: attrd_ais_dispatch: Update
> relayed from nodeb.domain.com
> Oct 11 20:33:38 nodea attrd: [3154]: info: attrd_trigger_update: Sending
> flush op to all hosts for: last-failure-st-nodeb-ipmi (1286850818)
> Oct 11 20:33:38 nodea attrd: [3154]: info: attrd_perform_update: Sent
> update 29: last-failure-st-nodeb-ipmi=1286850818
> Oct 11 20:33:38 nodea crmd: [3156]: info: do_lrm_rsc_op: Performing
> key=1:58:0:be604143-3a5a-4086-8e5c-d3d052804091 op=st-nodeb-ipmi_stop_0 )
> Oct 11 20:33:38 nodea lrmd: [3153]: info: rsc:st-nodeb-ipmi:12: stop
> Oct 11 20:33:38 nodea lrmd: [5063]: info: Try to stop STONITH resource
> <rsc_id=st-nodeb-ipmi> : Device=external/ipmi
> Oct 11 20:33:38 nodea stonithd: [3151]: notice: try to stop a resource
> st-nodeb-ipmi who is not in started resource queue.
> Oct 11 20:33:38 nodea lrmd: [3153]: info: Managed st-nodeb-ipmi:stop
> process 5063 exited with return code 0.
> Oct 11 20:33:38 nodea crmd: [3156]: info: process_lrm_event: LRM
> operation st-nodeb-ipmi_stop_0 (call=12, rc=0, cib-update=17,
> confirmed=true) ok
>
> Here is my cluster configuration:
>
> node nodea.domain.com \
>     attributes standby="off"
> node nodeb.domain.com \
>     attributes standby="off"
> primitive drbd_nfs ocf:linbit:drbd \
>     params drbd_resource="r0" \
>     op monitor interval="15s"
> primitive fs_nfs ocf:heartbeat:Filesystem \
>     params device="/dev/drbd0" directory="/mnt/drbd0" fstype="ext3" \
>     meta is-managed="true"
> primitive ip_nfs ocf:heartbeat:IPaddr2 \
>     params ip="1.2.3.20" cidr_netmask="32" nic="bond0"
> primitive nfsserver ocf:heartbeat:nfsserver \
>     params nfs_shared_infodir="/mnt/drbd0/nfs" nfs_ip="1.2.3.20" nfs_init_script="/etc/init.d/nfs"
> primitive st-nodea-ipmi stonith:external/ipmi \
>     params hostname="nodea.domain.com" ipaddr="1.2.3.23" userid="coolguy" passwd="changeme" interface="lanplus" \
>     op monitor interval="20m" timeout="1m" \
>     op start interval="0" timeout="1m" start-delay="360s" \
>     meta target-role="Started"
> primitive st-nodeb-ipmi stonith:external/ipmi \
>     params hostname="nodeb.domain.com" ipaddr="1.2.3.25" userid="coolguy" passwd="changeme" interface="lanplus" \
>     op monitor interval="20m" timeout="1m" \
>     op start interval="0" timeout="1m" start-delay="360s" \
>     meta target-role="Started"
> group nfs fs_nfs ip_nfs nfsserver \
>     meta target-role="Started"
> ms ms_drbd_nfs drbd_nfs \
>     meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started" is-managed="true"
> location l-st-nodea st-nodea-ipmi -inf: nodea.domain.com
> location l-st-nodeb st-nodeb-ipmi -inf: nodeb.domain.com
> colocation nfs_on_drbd inf: nfs ms_drbd_nfs:Master
> order nfs_after_drbd inf: ms_drbd_nfs:promote nfs:start
> property $id="cib-bootstrap-options" \
>     dc-version="1.0.9-89bd754939df5150de7cd76835f98fe90851b677" \
>     cluster-infrastructure="openais" \
>     expected-quorum-votes="2" \
>     stonith-enabled="true" \
>     no-quorum-policy="ignore" \
>     last-lrm-refresh="1286851694"
> rsc_defaults $id="rsc-options" \
>     resource-stickiness="100"
>