[Linux-HA] Corosync startup at boot and stonith device failed start

Dejan Muhamedagic dejanmm at fastmail.fm
Tue Oct 12 02:02:13 MDT 2010


Hi,

On Mon, Oct 11, 2010 at 10:17:08PM -0600, Eric Schoeller wrote:
> Good evening,
> 
> I noticed that when corosync is set to start at boot, my stonith devices 
> don't start up correctly.
> 
> Here is some version info:
> 
> cluster-glue: 1.0.6
> Corosync Cluster Engine, version '1.2.7' SVN revision '3008'
> Name        : pacemaker
> Version     : 1.0.9.1
> Release     : 1.15.el5
> 
> I've read in many places that stonith devices may rely upon atd. I 
> haven't looked around enough to fully understand the necessity of this 
> dependency, but I believe it's the cause of the problem I'm 
> experiencing.

Wrong. atd is needed only for external/ssh and then only for the
fencing operations. You're running into a different problem.
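
For the record, external/ssh needs atd only because it schedules the
reboot on the victim through at(1), so that the ssh session can return
before the node goes down. Roughly along these lines, as a sketch and
not the plugin verbatim:

    # approximately what external/ssh does for a reset;
    # $target stands for the node to be fenced
    ssh -q -x root@$target \
        "echo 'sleep 2; /sbin/reboot -nf' | SHELL=/bin/sh at now"

That is the only place where atd enters the picture.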

> The corosync init script is configured to start and stop 
> at 20, and atd is configured to start and stop at 95 and 5 on my RHEL5.5 
> system. If I move corosync up to 98 (after atd) my stonith devices start 
> just fine. If I add a start-delay to the stonith device that delays it 
> past the startup of atd, the stonith device also starts just fine. Using 
> the default init script and no start-delay ends with a Failed Action for 
> the stonith device, and it never recovers without manual intervention.
> 
> My questions are: Why is the default init script shipped with the RPM 
> from the clusterlabs repo configured to start before atd if atd is a 
> dependency of certain parts of the pacemaker framework (if this is indeed 
> the case)? Is it safe/recommended to add a start-delay of several 
> minutes to a stonith device to work around this problem?

Well, if possible it's much better to fix the underlying problem.
Otherwise, a start-delay on the start action may at times delay
fencing when it is actually needed, for instance startup fencing.
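
If it does turn out to be boot ordering, for instance corosync coming
up before the network is fully functional, the thing to check on RHEL5
is the chkconfig priorities rather than papering over it with a
start-delay. Something along these lines, illustrative only:

    # see where corosync sits relative to the network and atd
    chkconfig --list corosync
    ls /etc/rc3.d/ | egrep 'corosync|network|atd'
    # the default S/K numbers come from the "# chkconfig:" header in
    # /etc/init.d/corosync; one way to pick up a changed header is to
    # remove and re-add the service links:
    chkconfig corosync off && chkconfig corosync on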

> Thanks!!
> 
> Eric Schoeller
> 
> 
> Here are some logs:
> 
> Oct 11 20:33:14 nodea crmd: [3156]: info: do_lrm_rsc_op: Performing 
> key=52:56:0:be604143-3a5a-4086-8e5c-d3d052804091 op=st-nodeb-ipmi_start_0 )
> Oct 11 20:33:14 nodea lrmd: [3153]: info: rsc:st-nodeb-ipmi:8: start
> Oct 11 20:33:14 nodea lrmd: [3397]: info: Try to start STONITH resource 
> <rsc_id=st-nodeb-ipmi> : Device=external/ipmi
> Oct 11 20:33:14 nodea crmd: [3156]: info: do_lrm_rsc_op: Performing 
> key=12:56:0:be604143-3a5a-4086-8e5c-d3d052804091 op=drbd_nfs:0_start_0 )
> Oct 11 20:33:14 nodea lrmd: [3153]: info: rsc:drbd_nfs:0:9: start
> 
> Oct 11 20:33:37 nodea external/ipmi[3433]: ERROR: error executing 
> ipmitool: Error: Unable to establish IPMI v2 / RMCP+ session^M Unable to 
> get Chassis Power Status

ipmitool fails here. Perhaps the network is not yet fully operational
at that point in the boot sequence.
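
You can verify that by running the same query by hand, once right
after boot and once later on, using the parameters from your
st-nodeb-ipmi primitive:

    # manual equivalent of the plugin's status check
    ipmitool -I lanplus -H 1.2.3.25 -U coolguy -P changeme \
        chassis power status

Note also that the failed start pushed the fail-count to INFINITY (see
the attrd messages below), so once the real cause is fixed you still
need to clean it up, e.g.

    crm resource cleanup st-nodeb-ipmi

before the cluster will try to start it again.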

Thanks,

Dejan

> Oct 11 20:33:38 nodea stonithd: [3432]: info: external_run_cmd: Calling 
> '/usr/lib64/stonith/plugins/external/ipmi status' returned 256
> Oct 11 20:33:38 nodea stonithd: [3432]: CRIT: external_status: 'ipmi 
> status' failed with rc 256
> Oct 11 20:33:38 nodea stonithd: [3151]: WARN: start st-nodeb-ipmi 
> failed, because its hostlist is empty
> Oct 11 20:33:38 nodea lrmd: [3153]: WARN: Managed st-nodeb-ipmi:start 
> process 3397 exited with return code 1.
> Oct 11 20:33:38 nodea crmd: [3156]: info: process_lrm_event: LRM 
> operation st-nodeb-ipmi_start_0 (call=8, rc=1, cib-update=16, 
> confirmed=true) unknown error
> Oct 11 20:33:38 nodea attrd: [3154]: info: attrd_ais_dispatch: Update 
> relayed from nodeb.domain.com
> Oct 11 20:33:38 nodea attrd: [3154]: info: attrd_trigger_update: Sending 
> flush op to all hosts for: fail-count-st-nodeb-ipmi (INFINITY)
> Oct 11 20:33:38 nodea attrd: [3154]: info: attrd_perform_update: Sent 
> update 26: fail-count-st-nodeb-ipmi=INFINITY
> Oct 11 20:33:38 nodea attrd: [3154]: info: attrd_ais_dispatch: Update 
> relayed from nodeb.domain.com
> Oct 11 20:33:38 nodea attrd: [3154]: info: attrd_trigger_update: Sending 
> flush op to all hosts for: last-failure-st-nodeb-ipmi (1286850818)
> Oct 11 20:33:38 nodea attrd: [3154]: info: attrd_perform_update: Sent 
> update 29: last-failure-st-nodeb-ipmi=1286850818
> Oct 11 20:33:38 nodea crmd: [3156]: info: do_lrm_rsc_op: Performing 
> key=1:58:0:be604143-3a5a-4086-8e5c-d3d052804091 op=st-nodeb-ipmi_stop_0 )
> Oct 11 20:33:38 nodea lrmd: [3153]: info: rsc:st-nodeb-ipmi:12: stop
> Oct 11 20:33:38 nodea lrmd: [5063]: info: Try to stop STONITH resource 
> <rsc_id=st-nodeb-ipmi> : Device=external/ipmi
> Oct 11 20:33:38 nodea stonithd: [3151]: notice: try to stop a resource 
> st-nodeb-ipmi who is not in started resource queue.
> Oct 11 20:33:38 nodea lrmd: [3153]: info: Managed st-nodeb-ipmi:stop 
> process 5063 exited with return code 0.
> Oct 11 20:33:38 nodea crmd: [3156]: info: process_lrm_event: LRM 
> operation st-nodeb-ipmi_stop_0 (call=12, rc=0, cib-update=17, 
> confirmed=true) ok
> 
> Here is my cluster configuration:
> 
> node nodea.domain.com \
>         attributes standby="off"
> node nodeb.domain.com \
>         attributes standby="off"
> primitive drbd_nfs ocf:linbit:drbd \
>         params drbd_resource="r0" \
>         op monitor interval="15s"
> primitive fs_nfs ocf:heartbeat:Filesystem \
>         params device="/dev/drbd0" directory="/mnt/drbd0" fstype="ext3" \
>         meta is-managed="true"
> primitive ip_nfs ocf:heartbeat:IPaddr2 \
>         params ip="1.2.3.20" cidr_netmask="32" nic="bond0"
> primitive nfsserver ocf:heartbeat:nfsserver \
>         params nfs_shared_infodir="/mnt/drbd0/nfs" nfs_ip="1.2.3.20" nfs_init_script="/etc/init.d/nfs"
> primitive st-nodea-ipmi stonith:external/ipmi \
>         params hostname="nodea.domain.com" ipaddr="1.2.3.23" userid="coolguy" passwd="changeme" interface="lanplus" \
>         op monitor interval="20m" timeout="1m" \
>         op start interval="0" timeout="1m" start-delay="360s" \
>         meta target-role="Started"
> primitive st-nodeb-ipmi stonith:external/ipmi \
>         params hostname="nodeb.domain.com" ipaddr="1.2.3.25" userid="coolguy" passwd="changeme" interface="lanplus" \
>         op monitor interval="20m" timeout="1m" \
>         op start interval="0" timeout="1m" start-delay="360s" \
>         meta target-role="Started"
> group nfs fs_nfs ip_nfs nfsserver \
>         meta target-role="Started"
> ms ms_drbd_nfs drbd_nfs \
>         meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started" is-managed="true"
> location l-st-nodea st-nodea-ipmi -inf: nodea.domain.com
> location l-st-nodeb st-nodeb-ipmi -inf: nodeb.domain.com
> colocation nfs_on_drbd inf: nfs ms_drbd_nfs:Master
> order nfs_after_drbd inf: ms_drbd_nfs:promote nfs:start
> property $id="cib-bootstrap-options" \
>         dc-version="1.0.9-89bd754939df5150de7cd76835f98fe90851b677" \
>         cluster-infrastructure="openais" \
>         expected-quorum-votes="2" \
>         stonith-enabled="true" \
>         no-quorum-policy="ignore" \
>         last-lrm-refresh="1286851694"
> rsc_defaults $id="rsc-options" \
>         resource-stickiness="100"
> 
> _______________________________________________
> Linux-HA mailing list
> Linux-HA at lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems


