[Linux-HA] lrmd CPU intensive?
Brian O'Neill
oneill at oinc.net
Fri Aug 24 15:12:42 MDT 2007
Dejan Muhamedagic wrote:
> On Thu, Aug 23, 2007 at 01:29:38PM -0400, Brian O'Neill wrote:
>> I can confirm that we are seeing this on a 2.0.7 installation currently
>> (upgrading not a good option - production cluster).
>
> I'd still suggest upgrading. Many things have been fixed since
> 2.0.7. lrmd was also thoroughly examined. If you have a spare
> cluster to test the new release, upgrading is really a
> breeze---just put the cluster in an unmanaged mode, upgrade, and
> then put it back to the managed mode.
I'll bring it up with the customer whose services are on it. We are
actually hoping that they may eliminate the need for floating an IP
around and go fully load-balanced soon anyway, but that may be a ways off.
>
>> lrmd has produced
>> more than 273k of these messages in 7 hours. The cluster itself has been
>> up since January 31st.
>>
>> It has also produced over 100k of the following:
>>
>> Aug 23 06:27:37 node0 lrmd: [4583]: WARN: on_repeat_op_readytorun:
>> Operations list for admin0_ip is suspicously long [69]
>
> This is not an error. What it says is that for that resource
> there are too many operations queued. Where too many is four.
> That could be a sign that some operations are taking too long to
> execute, more than the interval defined for the monitor (I guess
> that it is a monitor operation). But, is that really a group?
> Because no operations should be defined for a group.
You are correct, it is one resource in the group. It is just an IPaddr
resource.
> Yes, this should happen very very seldom. Are there any other
> warnings? Such as, for example, about operations delayed or
> max_child_count reached. At any rate, you should investigate this
> thoroughly. You can also post the configuration and logs.
>
Yes, I found these as well:
Aug 19 06:31:10 node0 lrmd: [4583]: WARN: perform_ra_op: the operation
operation monitor[353] on ocf::IPaddr::admin0_ip for client 4586, its
parameters: CRM_meta_interval=[15000] ip=[10.16.8.30]
CRM_meta_op_target_rc=[7] netmask=[27] CRM_meta_id=[admin0_ip_mon_id]
CRM_meta_timeout=[3000] crm_feature_set=[1.0.6] CRM_meta_nam stayed in
operation list for 6420 ms (longer than 5000 ms)
It appears to be logging this extremely frequently as well - not sure
why I didn't notice it. This is a simple IPaddr resource - nothing fancy
about it.
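
To get a rough idea of how often each warning shows up and how long
operations sat in the list, something like the sketch below could be run
over the log. It is only an illustration: the regular expressions are
inferred from the two messages quoted in this thread, not from the lrmd
source, and the names summarize, LRMD_LINE, DELAY and SAMPLE are made up
for the example.

```python
import re
from collections import Counter

# Patterns inferred from the log lines quoted in this thread (assumption:
# other lrmd messages follow the same "lrmd: [pid]: LEVEL: function:" shape).
LRMD_LINE = re.compile(r"lrmd: \[(?P<pid>\d+)\]: (?P<level>[A-Z]+): (?P<func>\w+):")
DELAY = re.compile(
    r"stayed in operation list for (?P<ms>\d+) ms \(longer than (?P<limit>\d+) ms\)"
)

def summarize(text):
    """Count lrmd messages per reporting function and collect queue delays in ms."""
    # Collapse whitespace so messages wrapped across lines are matched as one.
    joined = " ".join(text.split())
    counts = Counter(m.group("func") for m in LRMD_LINE.finditer(joined))
    delays = [int(m.group("ms")) for m in DELAY.finditer(joined)]
    return counts, delays

# The two messages from this thread, wrapped as they appear in the mail.
SAMPLE = """
Aug 23 06:27:37 node0 lrmd: [4583]: WARN: on_repeat_op_readytorun:
Operations list for admin0_ip is suspicously long [69]
Aug 19 06:31:10 node0 lrmd: [4583]: WARN: perform_ra_op: the operation
operation monitor[353] on ocf::IPaddr::admin0_ip for client 4586, its
parameters: CRM_meta_interval=[15000] ip=[10.16.8.30]
CRM_meta_op_target_rc=[7] netmask=[27] CRM_meta_id=[admin0_ip_mon_id]
CRM_meta_timeout=[3000] crm_feature_set=[1.0.6] CRM_meta_nam stayed in
operation list for 6420 ms (longer than 5000 ms)
"""

counts, delays = summarize(SAMPLE)
for func, n in counts.most_common():
    print(func, n)
if delays:
    print("max delay (ms):", max(delays))
```

For a real log, the file contents would be passed to summarize() in place
of SAMPLE; the whitespace collapsing is only there because the messages
are wrapped in this mail.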