[Linux-ha-dev] Add real monitoring capabilities to IPaddr2 resource agent

Robert Euhus euhus-liste1 at rrzn.uni-hannover.de
Fri Feb 11 10:48:48 MST 2011


Thanks for your comments. I have attached an updated version, although 
not everything has been worked in yet (see below).

Lars Ellenberg schrieb:
>> [..]
>> +	RX_PACKETS_NEW="`get_rx_packets`"
>> +	RX_PACKETS_OLD="$RX_PACKETS_NEW"
> 
> ^^^ assign old here, only
ok, done.

>> +	for n in `seq $OCF_RESKEY_pktcnt_timeout`; do
>> +		sleep 1
> 
> maybe even sub second sleep?
> likely we are busy anyways,
> so we should see some packets very quickly
> 
Sounds reasonable to me. Would 0.1s steps be okay? Although I would 
think it doesn't really matter that much.

>> +		RX_PACKETS_OLD="$RX_PACKETS_NEW"
> 
> not here. because, if it changed, you return anyways.
> if not, well, it has not changed right?
I like this :-) done.

>> +		RX_PACKETS_NEW="`get_rx_packets`"
>> +		ocf_log debug "RX_PACKETS_OLD: $RX_PACKETS_OLD    RX_PACKETS_NEW: $RX_PACKETS_NEW"
>> +		if [ "$RX_PACKETS_OLD" -ne "$RX_PACKETS_NEW" ]; then
>> +			ocf_log debug "we received some packages."
> 
> s/packages/packets/
ok.

>> +			return
> 
> explicitly return 0 here, or you will be returning the result of
> ocf_log, and you don't want to cause recovery action because the logging
> failed somewhere.
> 
ok.
(Funny I that have never stumbled about "return" returning the result of 
the last used function...)

>> +		fi
>> +	done
>> +	return 1
>> +}
> 
> I do like this packet counter monitoring.
Thanks! :)

>> +# returns list of chached ARP entries for $NIC
> 
> s/chached/cached
ok.

>>[..]
>> +	$IP2UTIL -s neighbour show dev $NIC \
>> +		| grep -v "^$BASEIP " \
> 
> my own ip is not included anyways,
> at least on my box?
I just wanted to make sure and didn't check. But now I think you are 
right - my own IP should never come up in the arp cache.

> do you want to grep REACHABLE here, too?
> not sure.
I don't see a problem with also trying stale, delay and probe entries, 
but maybe we should filter out at least the "incomplete" and "failed" 
state? Or others to?

Maybe we should somehow make sure that we are only trying to arping IPs 
which are on the subnet for the IP we manage. There could be several IPs 
with different subnets assigned to one device. What do you think?

> 
>> +		| tr '/' ' ' | sort -n -k8 | cut -d' ' -f1 \
> 
>  | sort -t/ -k2,2n | cut -d' ' -f1 \
thanks.

> 
>> +		| head -n $OCF_RESKEY_arping_cache_entries | tr '\n' ' '
> 
> what for do you tr \n ' ' ?
> no need.
right. I thought that it was needed, because piping the command into 
less always gave me newlines.

>>[..]
>> +do_arping () {
>> +	if $DEBUG ; then arping -c $OCF_RESKEY_arping_count -I $NIC $1; return $?; fi
>> +	arping -q -c $OCF_RESKEY_arping_count -I $NIC $1
> 
> I think you should add the source IP.
I will see into this.

> Is this the only arping out there?  I think there are at least two
> variants, with slightly different options.
Ooops, I did not know that.

> If your invokation is compatible, add a comment that it is.
> If not, find a way to deal with both.
Will go and investigate this...

>> +	return $?
> 
> return $? can go, functions return with the exit code of the last command anyways.
Glad I learned this now. :)

>> [..]
>> +ip_monitor_if_check () {
>> +	local CURRENT_CHECK_LEVEL
>> +	CURRENT_CHECK_LEVEL=$1
>> +	# always check link status first
>> +	link_status="`get_link_status`"
>> +	ocf_log debug "link_status: $link_status"
>> +	case $link_status in
>> +		UP)
>> +			if [ "$CURRENT_CHECK_LEVEL" -eq 0 ]; then
>> +				return $OCF_SUCCESS
>> +			fi
>> +			;;
>> +		DOWN)
>> +			# remove address from NIC
>> +			return $OCF_NOT_RUNNING
>> +			;;
>> +		*) # this should not happen.
>> +			return $OCF_ERR_GENERIC
>> +			;;
>> +	esac
> 
> # $CURRENT_CHECK_LEVEL > 0 and link_status UP is implicit from now on
added.

>>[..]
>> +	# watch for packet counter changes
>> +	if [ "$CURRENT_CHECK_LEVEL" -eq 10 ]; then
> 
> Do you want -le here?
> if not, add a comment why you want -eq.
doesn't make any difference, but -le definitely looks more consistent.

>>[..]
>> +	# check arping_ip_list
>> +	if [ $CURRENT_CHECK_LEVEL -le 19 ]; then
>> +		ocf_log debug "check arping_ip_list" 
>> +		if [ "$OCF_RESKEY_arping_ip_list" != "" ]; then
> 
> No need to check for != "",
> the for loop will just not execute if it is.
ok.

>> +			for ip in $OCF_RESKEY_arping_ip_list; do
>> +				do_arping $ip && return $OCF_SUCCESS
>> +			done
>> +		fi
>> +	fi
>> +
>> +	# check arping ARP cache entries
>> +	if [ $CURRENT_CHECK_LEVEL -le 20 ]; then
>> +		ocf_log debug "check arping ARP cache entries" 
>> +		if [ "$OCF_RESKEY_arping_ip_list" != "" ]; then
> 
> no need here either, probably even wrong.
> You want this even if there is no arping_ip_list explicitly set, right?
Right. Ooops, c&p error.

>> +			for ip in `get_arp_list`; do
>> +				do_arping $ip && return $OCF_SUCCESS
>> +			done
>> +		fi
>> +	fi
>> +
>> +	# watch for packet counter changes in promiscios mode
>> +	if [ $CURRENT_CHECK_LEVEL -le 30 ]; then
>> +		ocf_log debug "watch for packet counter changes in promiscios mode" 
>> +		# be sure switch off promiscios mode in any case
>> +		trap "$IP2UTIL link set dev $NIC promisc off; exit" INT TERM EXIT
> 
> I don't like this.
> You assume that the dev does not operate in promisc mode.
> This will break the (presumably minority of) cases that actually
> run promisc intentionally during "normal operation".
> 
Oh yes, you are right. I honestly just didn't think about this.

>> +			$IP2UTIL link set dev $NIC promisc on
>> +			watch_pkt_counter && return $OCF_SUCCESS
>> +			$IP2UTIL link set dev $NIC promisc off
>> +		trap - INT TERM EXIT
>> +	fi
> 
> How about you remember the packet stats first thing,
> then do all the other checks,
> finally you do a broadcast ping or something,
> then go again waiting for packets,
> comparing with the first remembered count?

This sounds like a good idea to me, although I'm not sure about that 
"broadcast ping or something". Maybe as an option, but I wouldn't want 
to do any broadcasts by default. (and I have no idea, what that other 
something could be)

> If you really want the promisc check,
> maybe first check if it was on,
> and if so, leave it on.
> Or make that an advanced option: keep_promisc_on...
> 
> Maybe I'm just wrong here, and it's ok as you have it.
> 
No, you are right. I think will add a check and only switch the mode if 
necessary. But right now I do not yet see a way to omit a possible race 
condition where someone switches to promisc mode just right after the 
check, so that we switch it off afterwards. I will have to think a bit 
about this.
About the advanced option: I would rather not alter the mode by default, 
but maybe offer an option always_switch_promisc_off. Although I don't 
know how useful this would be.

>> +	# looks like it's not working (for whatever reason)
>> +	# remove address from NIC
>> +	return $OCF_NOT_RUNNING
>> +}
>> +
> 
> Enough for today...
More than enough for me to do for now!

> Thanks again for this valuable contribution.

Thank you very much for your comments.

Sorry for the slow replies -- I'm on part time right now and quite busy.
But you can be assured that I will keep working on this.

Robert.
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: IPaddr2-monitor
Url: http://lists.linux-ha.org/pipermail/linux-ha-dev/attachments/20110211/7b1bb3ef/attachment.txt 


More information about the Linux-HA-Dev mailing list