[Linux-HA] ERROR: NV failure (msgfromsteam)

Ken P Maier kpm at pt.com
Thu Jul 9 13:55:09 MDT 2009


Hi,  Sorry if this is a repeat.  I have checked the faq.

I get this error 1 out of 20 times when starting heartbeat for the first 
time.

I'm running version 2.99.3.

Here's my log.

Jul  9 14:28:45 (none) heartbeat: [760]: WARN: Core dumps could be lost if 
multiple dumps occur.
Jul  9 14:28:45 (none) heartbeat: [760]: WARN: Consider setting 
non-default value in /proc/sys/kernel/core_pattern (or equivalent) f
or maximum supportability
Jul  9 14:28:45 (none) heartbeat: [760]: WARN: Consider setting 
/proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supp
ortability
Jul  9 14:28:45 (none) heartbeat: [760]: info: Version 2 support: false
Jul  9 14:28:45 (none) heartbeat: [760]: WARN: Logging daemon is disabled 
--enabling logging daemon is recommended
Jul  9 14:28:46 (none) heartbeat: [760]: info: **************************
Jul  9 14:28:46 (none) heartbeat: [760]: info: Configuration validated. 
Starting heartbeat 2.99.3
Jul  9 14:28:46 (none) heartbeat: [761]: info: heartbeat: version 2.99.3
Jul  9 14:28:46 (none) heartbeat: [761]: info: Heartbeat generation: 
1247147626
Jul  9 14:50:08 (none) last message repeated 61 timesUDP Broadcast 
heartbeat started on port 694 (694) interface eth1
Jul  9 14:28:46 (none) heartbeat: [761]: info: glib: UDP Broadcast 
heartbeat closed on port 694 interface eth1 - Status: 1
Jul  9 14:28:46 (none) heartbeat: [761]: info: glib: ucast: write socket 
priority set to IPTOS_LOWDELAY on eth0
Jul  9 14:28:46 (none) heartbeat: [761]: info: glib: ucast: bound send 
socket to device: eth0
Jul  9 14:28:46 (none) heartbeat: [761]: info: glib: ucast: bound receive 
socket to device: eth0
Jul  9 14:28:46 (none) heartbeat: [761]: info: glib: ucast: started on 
port 694 interface eth0 to 147.139.22.45
Jul  9 14:28:46 (none) heartbeat: [761]: info: glib: ucast: write socket 
priority set to IPTOS_LOWDELAY on eth0
Jul  9 14:28:46 (none) heartbeat: [761]: info: glib: ucast: bound send 
socket to device: eth0
Jul  9 14:28:46 (none) heartbeat: [761]: info: glib: ucast: bound receive 
socket to device: eth0
Jul  9 14:28:46 (none) heartbeat: [761]: info: glib: ucast: started on 
port 694 interface eth0 to 147.139.22.44
Jul  9 14:28:46 (none) heartbeat: [761]: info: glib: ping heartbeat 
started.
Jul  9 14:28:46 (none) heartbeat: [761]: info: G_main_add_TriggerHandler: 
Added signal manual handler
Jul  9 14:28:46 (none) heartbeat: [761]: info: G_main_add_TriggerHandler: 
Added signal manual handler
Jul  9 14:28:46 (none) heartbeat: [761]: info: G_main_add_SignalHandler: 
Added signal handler for signal 17
Jul  9 14:28:46 (none) heartbeat: [761]: info: Local status now set to: 
'up'
Jul  9 14:28:47 (none) heartbeat: [761]: info: Link 
147.139.16.250:147.139.16.250 up.
Jul  9 14:28:47 (none) heartbeat: [761]: info: Status update for node 
147.139.16.250: status ping
Jul  9 14:28:47 (none) heartbeat: [761]: info: Link ems2:eth1 up.
Jul  9 14:28:48 (none) heartbeat: [764]: ERROR: fifo_child: here???
Jul  9 14:29:06 (none) last message repeated 18 times
Jul  9 14:29:06 (none) heartbeat: [761]: WARN: node ems1: is dead
Jul  9 14:29:06 (none) heartbeat: [761]: info: Comm_now_up(): updating 
status to active
Jul  9 14:29:06 (none) heartbeat: [761]: info: Local status now set to: 
'active'
Jul  9 14:29:06 (none) heartbeat: [761]: info: Starting child client 
"/usr/lib/heartbeat/ipfail" (6343,65)
Jul  9 14:29:06 (none) heartbeat: [761]: WARN: No STONITH device 
configured.
Jul  9 14:29:06 (none) heartbeat: [761]: WARN: Shared disks are not 
protected.
Jul  9 14:29:06 (none) heartbeat: [761]: info: Resources being acquired 
from ems1.
Jul  9 14:29:06 (none) heartbeat: [775]: info: Starting 
"/usr/lib/heartbeat/ipfail" as uid 6343  gid 65 (pid 775)
Jul  9 14:29:07 (none) heartbeat: [764]: WARN: ha_msg_add_nv_depth: line 
doesn't contain '='
Jul  9 14:29:07 (none) heartbeat: [764]: info: >>>
Jul  9 14:29:07 (none) heartbeat: [764]: ERROR: NV failure (msgfromsteam): 
[>>> ]
Jul  9 14:29:07 (none) heartbeat: [764]: ERROR: fifo_child: here???
Jul  9 14:29:07 (none) heartbeat: [777]: info: Local Resource acquisition 
completed.
Jul  9 14:29:07 (none) heartbeat: [761]: info: Initial resource 
acquisition complete (T_RESOURCES(us))
Jul  9 14:29:08 (none) heartbeat: [764]: ERROR: fifo_child: here???
Jul  9 14:29:16 (none) last message repeated 8 times
Jul  9 14:29:17 (none) heartbeat: [761]: info: Local Resource acquisition 
completed. (none)
Jul  9 14:29:17 (none) heartbeat: [761]: info: local resource transition 
completed.
Jul  9 14:29:17 (none) heartbeat: [764]: ERROR: fifo_child: here???


Is this a know problem or is it me?

I did add some debug in function fifo_child(), as shown below.

    /* Make sure we check for death of parent every so often... */
    for (;;) {

        setmsalarm(1000L);
        msg = msgfromstream(fifo);
        setmsalarm(0L);
        hb_check_mcp_alive();
        hb_signal_process_pending();

        if (msg) {
            IPC_Message*    m;
            if (DEBUGDETAILS) {
                cl_log(LOG_DEBUG, "fifo_child message:");
                cl_log_message(LOG_DEBUG, msg);
            }
            m = hamsg2ipcmsg(msg, chan);
            if (m) {
                /* Send frees "m" "at the right time" */
                chan->ops->send(chan, m);
                hb_check_mcp_alive();
                hb_signal_process_pending();
                chan->ops->waitout(chan);
                hb_check_mcp_alive();
                hb_signal_process_pending();
            }
            ha_msg_del(msg);
            msg = NULL;

        }else if (feof(fifo)) {
            cl_log(LOG_ERR, "fifo_child: EOF on FIFO");
            if (ANYDEBUG) {
                return_to_orig_privs();
                cl_log(LOG_DEBUG
                ,   "fifo_child: EOF on FIFO");
            }
            hb_check_mcp_alive();
            exit(2);
        } else {
            cl_log(LOG_ERR, "fifo_child: here???");
        }
        cl_cpu_limit_update();
        cl_realtime_malloc_check();
        hb_check_mcp_alive();
        hb_signal_process_pending();
    }
    /*notreached*/


Many Thanks,
Ken


More information about the Linux-HA mailing list