[Linux-HA] Corosync startup at boot and stonith device failed start

Eric Schoeller eric.schoeller at colorado.edu
Tue Oct 12 11:07:39 MDT 2010


Dejan,

Thanks for your comments. It's taking 37 seconds for my LACP bond to 
come online, and that bond is the interface that provides the route to 
the IPMI device. This is most certainly the problem, so it really has 
nothing to do with Pacemaker. Sorry for the post. On a side note, if 
anyone has hints about slow initialization of LACP bonds between Cisco 
gear and Linux boxes, I'd appreciate it! Our on-site network engineer 
can't explain it. My only guess was the lacp_rate setting, but even at 
'fast' the bond takes a long time to come up.
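
In case anyone wants to compare notes, the bond is set up roughly like
this on RHEL 5.5 (interface names, addresses and the exact BONDING_OPTS
line are paraphrased from memory, so treat this as a sketch rather than
a verbatim copy of my ifcfg files):

/etc/sysconfig/network-scripts/ifcfg-bond0:
    DEVICE=bond0
    ONBOOT=yes
    BOOTPROTO=static
    IPADDR=1.2.3.21
    NETMASK=255.255.255.0
    # mode=802.3ad is LACP; lacp_rate=fast asks the switch for LACPDUs
    # every second instead of the 30-second default
    BONDING_OPTS="mode=802.3ad miimon=100 lacp_rate=fast"

/etc/sysconfig/network-scripts/ifcfg-eth0 (and the same for eth1):
    DEVICE=eth0
    MASTER=bond0
    SLAVE=yes
    ONBOOT=yes
    BOOTPROTO=none

Watching 'cat /proc/net/bonding/bond0' right after boot is how I
measured the ~37 seconds before the aggregator comes up.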

Eric
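
P.S. For anyone who hits the same failed-start symptom, the quickest way
I found to tell a stonith configuration problem from a plain network
problem is to run, by hand right after boot, roughly the same check the
external/ipmi plugin performs for its status operation (the parameters
here are simply the ones from the st-nodeb-ipmi primitive quoted below):

    ipmitool -I lanplus -H 1.2.3.25 -U coolguy -P changeme chassis power status

If that reports "Chassis Power is on", the stonith resource should start
fine; if it fails with "Unable to establish IPMI v2 / RMCP+ session",
the route to the BMC simply isn't up yet, which is exactly what the logs
below show.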


Dejan Muhamedagic wrote:
> Hi,
>
> On Mon, Oct 11, 2010 at 10:17:08PM -0600, Eric Schoeller wrote:
>   
>> Good evening,
>>
>> I noticed that when corosync is set to start at boot my stonith devices 
>> don't start up correctly.
>>
>> Here is some version info:
>>
>> cluster-glue: 1.0.6
>> Corosync Cluster Engine, version '1.2.7' SVN revision '3008'
>> Name        : pacemaker
>> Version     : 1.0.9.1
>> Release     : 1.15.el5
>>
>> I've read in many places that stonith devices may rely upon atd. I 
>> haven't looked around enough to fully understand the necessity of this 
>> dependency, but I believe it's the cause of the problem I'm 
>> experiencing.
>>     
>
> Wrong. atd is needed only for external/ssh and then only for the
> fencing operations. You're running into a different problem.
>
>   
>> The corosync init script is configured to start and stop 
>> at 20, and atd is configured to start and stop at 95 and 5 on my RHEL5.5 
>> system. If I move corosync up to 98 (after atd) my stonith devices start 
>> just fine. If I add a start-delay to the stonith device that delays it 
>> past the startup of atd, the stonith device also starts just fine. Using 
>> the default init script and no start-delay ends with a Failed Action for 
>> the stonith device, and it never recovers without manual intervention.
>>
>> My questions are: Why is the default init script shipped with the RPM 
>> from the clusterlabs repo configured to start before atd, if atd is a 
>> dependency of certain parts of the Pacemaker framework (if this is 
>> indeed the case)? And is it safe/recommended to add a start-delay of 
>> several minutes to a stonith device to work around this problem?
>>     
>
> Well, if possible it's much better to fix the underlying problem.
> Otherwise, a start-delay on the start action may at times slow down
> the fencing action, for instance in the case of startup fencing.
>
>   
>> Thanks!!
>>
>> Eric Schoeller
>>
>>
>> Here are some logs:
>>
>> Oct 11 20:33:14 nodea crmd: [3156]: info: do_lrm_rsc_op: Performing 
>> key=52:56:0:be604143-3a5a-4086-8e5c-d3d052804091 op=st-nodeb-ipmi_start_0 )
>> Oct 11 20:33:14 nodea lrmd: [3153]: info: rsc:st-nodeb-ipmi:8: start
>> Oct 11 20:33:14 nodea lrmd: [3397]: info: Try to start STONITH resource 
>> <rsc_id=st-nodeb-ipmi> : Device=external/ipmi
>> Oct 11 20:33:14 nodea crmd: [3156]: info: do_lrm_rsc_op: Performing 
>> key=12:56:0:be604143-3a5a-4086-8e5c-d3d052804091 op=drbd_nfs:0_start_0 )
>> Oct 11 20:33:14 nodea lrmd: [3153]: info: rsc:drbd_nfs:0:9: start
>>
>> Oct 11 20:33:37 nodea external/ipmi[3433]: ERROR: error executing 
>> ipmitool: Error: Unable to establish IPMI v2 / RMCP+ session^M Unable to 
>> get Chassis Power Status
>>     
>
> ipmitool fails here. Perhaps the network is not fully
> operational.
>
> Thanks,
>
> Dejan
>
>   
>> Oct 11 20:33:38 nodea stonithd: [3432]: info: external_run_cmd: Calling 
>> '/usr/lib64/stonith/plugins/external/ipmi status' returned 256
>> Oct 11 20:33:38 nodea stonithd: [3432]: CRIT: external_status: 'ipmi 
>> status' failed with rc 256
>> Oct 11 20:33:38 nodea stonithd: [3151]: WARN: start st-nodeb-ipmi 
>> failed, because its hostlist is empty
>> Oct 11 20:33:38 nodea lrmd: [3153]: WARN: Managed st-nodeb-ipmi:start 
>> process 3397 exited with return code 1.
>> Oct 11 20:33:38 nodea crmd: [3156]: info: process_lrm_event: LRM 
>> operation st-nodeb-ipmi_start_0 (call=8, rc=1, cib-update=16, 
>> confirmed=true) unknown error
>> Oct 11 20:33:38 nodea attrd: [3154]: info: attrd_ais_dispatch: Update 
>> relayed from nodeb.domain.com
>> Oct 11 20:33:38 nodea attrd: [3154]: info: attrd_trigger_update: Sending 
>> flush op to all hosts for: fail-count-st-nodeb-ipmi (INFINITY)
>> Oct 11 20:33:38 nodea attrd: [3154]: info: attrd_perform_update: Sent 
>> update 26: fail-count-st-nodeb-ipmi=INFINITY
>> Oct 11 20:33:38 nodea attrd: [3154]: info: attrd_ais_dispatch: Update 
>> relayed from nodeb.domain.com
>> Oct 11 20:33:38 nodea attrd: [3154]: info: attrd_trigger_update: Sending 
>> flush op to all hosts for: last-failure-st-nodeb-ipmi (1286850818)
>> Oct 11 20:33:38 nodea attrd: [3154]: info: attrd_perform_update: Sent 
>> update 29: last-failure-st-nodeb-ipmi=1286850818
>> Oct 11 20:33:38 nodea crmd: [3156]: info: do_lrm_rsc_op: Performing 
>> key=1:58:0:be604143-3a5a-4086-8e5c-d3d052804091 op=st-nodeb-ipmi_stop_0 )
>> Oct 11 20:33:38 nodea lrmd: [3153]: info: rsc:st-nodeb-ipmi:12: stop
>> Oct 11 20:33:38 nodea lrmd: [5063]: info: Try to stop STONITH resource 
>> <rsc_id=st-nodeb-ipmi> : Device=external/ipmi
>> Oct 11 20:33:38 nodea stonithd: [3151]: notice: try to stop a resource 
>> st-nodeb-ipmi who is not in started resource queue.
>> Oct 11 20:33:38 nodea lrmd: [3153]: info: Managed st-nodeb-ipmi:stop 
>> process 5063 exited with return code 0.
>> Oct 11 20:33:38 nodea crmd: [3156]: info: process_lrm_event: LRM 
>> operation st-nodeb-ipmi_stop_0 (call=12, rc=0, cib-update=17, 
>> confirmed=true) ok
>>
>> Here is my cluster configuration:
>>
>> node nodea.domain.com \                                               
>>         attributes standby="off"                                         
>> node nodeb.domain.com \                                                  
>>         attributes standby="off"                                         
>> primitive drbd_nfs ocf:linbit:drbd \                                     
>>         params drbd_resource="r0" \                                      
>>         op monitor interval="15s"                                        
>> primitive fs_nfs ocf:heartbeat:Filesystem \                              
>>         params device="/dev/drbd0" directory="/mnt/drbd0" fstype="ext3" \
>>         meta is-managed="true"                                           
>> primitive ip_nfs ocf:heartbeat:IPaddr2 \                                 
>>         params ip="1.2.3.20" cidr_netmask="32" nic="bond0"               
>> primitive nfsserver ocf:heartbeat:nfsserver \
>>         params nfs_shared_infodir="/mnt/drbd0/nfs" nfs_ip="1.2.3.20" nfs_init_script="/etc/init.d/nfs"
>> primitive st-nodea-ipmi stonith:external/ipmi \
>>         params hostname="nodea.domain.com" ipaddr="1.2.3.23" userid="coolguy" passwd="changeme" interface="lanplus" \
>>         op monitor interval="20m" timeout="1m" \
>>         op start interval="0" timeout="1m" start-delay="360s" \
>>         meta target-role="Started"
>> primitive st-nodeb-ipmi stonith:external/ipmi \
>>         params hostname="nodeb.domain.com" ipaddr="1.2.3.25" userid="coolguy" passwd="changeme" interface="lanplus" \
>>         op monitor interval="20m" timeout="1m" \
>>         op start interval="0" timeout="1m" start-delay="360s" \
>>         meta target-role="Started"
>> group nfs fs_nfs ip_nfs nfsserver \
>>         meta target-role="Started"
>> ms ms_drbd_nfs drbd_nfs \
>>         meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started" is-managed="true"
>> location l-st-nodea st-nodea-ipmi -inf: nodea.domain.com
>> location l-st-nodeb st-nodeb-ipmi -inf: nodeb.domain.com
>> colocation nfs_on_drbd inf: nfs ms_drbd_nfs:Master
>> order nfs_after_drbd inf: ms_drbd_nfs:promote nfs:start
>> property $id="cib-bootstrap-options" \
>>         dc-version="1.0.9-89bd754939df5150de7cd76835f98fe90851b677" \
>>         cluster-infrastructure="openais" \
>>         expected-quorum-votes="2" \
>>         stonith-enabled="true" \
>>         no-quorum-policy="ignore" \
>>         last-lrm-refresh="1286851694"
>> rsc_defaults $id="rsc-options" \
>>         resource-stickiness="100"
>>
> _______________________________________________
> Linux-HA mailing list
> Linux-HA at lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>   


