[Linux-HA] ERROR: NV failure (msgfromsteam)

Dejan Muhamedagic dejanmm at fastmail.fm
Fri Jul 10 05:48:30 MDT 2009


Hi,

On Thu, Jul 09, 2009 at 03:55:09PM -0400, Ken P Maier wrote:
> Hi,  Sorry if this is a repeat.  I have checked the faq.
> 
> I get this error 1 out of 20 times when starting heartbeat for the first 
> time.

It always happens only on startup?

> I'm running version 2.99.3.
> 
> Here's my log.
> 
> Jul  9 14:28:45 (none) heartbeat: [760]: WARN: Core dumps could be lost if 
> multiple dumps occur.
> Jul  9 14:28:45 (none) heartbeat: [760]: WARN: Consider setting 
> non-default value in /proc/sys/kernel/core_pattern (or equivalent) f
> or maximum supportability
> Jul  9 14:28:45 (none) heartbeat: [760]: WARN: Consider setting 
> /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supp
> ortability
> Jul  9 14:28:45 (none) heartbeat: [760]: info: Version 2 support: false
> Jul  9 14:28:45 (none) heartbeat: [760]: WARN: Logging daemon is disabled 
> --enabling logging daemon is recommended
> Jul  9 14:28:46 (none) heartbeat: [760]: info: **************************
> Jul  9 14:28:46 (none) heartbeat: [760]: info: Configuration validated. 
> Starting heartbeat 2.99.3
> Jul  9 14:28:46 (none) heartbeat: [761]: info: heartbeat: version 2.99.3
> Jul  9 14:28:46 (none) heartbeat: [761]: info: Heartbeat generation: 
> 1247147626
> Jul  9 14:50:08 (none) last message repeated 61 timesUDP Broadcast 
> heartbeat started on port 694 (694) interface eth1
> Jul  9 14:28:46 (none) heartbeat: [761]: info: glib: UDP Broadcast 
> heartbeat closed on port 694 interface eth1 - Status: 1
> Jul  9 14:28:46 (none) heartbeat: [761]: info: glib: ucast: write socket 
> priority set to IPTOS_LOWDELAY on eth0
> Jul  9 14:28:46 (none) heartbeat: [761]: info: glib: ucast: bound send 
> socket to device: eth0
> Jul  9 14:28:46 (none) heartbeat: [761]: info: glib: ucast: bound receive 
> socket to device: eth0
> Jul  9 14:28:46 (none) heartbeat: [761]: info: glib: ucast: started on 
> port 694 interface eth0 to 147.139.22.45
> Jul  9 14:28:46 (none) heartbeat: [761]: info: glib: ucast: write socket 
> priority set to IPTOS_LOWDELAY on eth0
> Jul  9 14:28:46 (none) heartbeat: [761]: info: glib: ucast: bound send 
> socket to device: eth0
> Jul  9 14:28:46 (none) heartbeat: [761]: info: glib: ucast: bound receive 
> socket to device: eth0
> Jul  9 14:28:46 (none) heartbeat: [761]: info: glib: ucast: started on 
> port 694 interface eth0 to 147.139.22.44
> Jul  9 14:28:46 (none) heartbeat: [761]: info: glib: ping heartbeat 
> started.
> Jul  9 14:28:46 (none) heartbeat: [761]: info: G_main_add_TriggerHandler: 
> Added signal manual handler
> Jul  9 14:28:46 (none) heartbeat: [761]: info: G_main_add_TriggerHandler: 
> Added signal manual handler
> Jul  9 14:28:46 (none) heartbeat: [761]: info: G_main_add_SignalHandler: 
> Added signal handler for signal 17
> Jul  9 14:28:46 (none) heartbeat: [761]: info: Local status now set to: 
> 'up'
> Jul  9 14:28:47 (none) heartbeat: [761]: info: Link 
> 147.139.16.250:147.139.16.250 up.
> Jul  9 14:28:47 (none) heartbeat: [761]: info: Status update for node 
> 147.139.16.250: status ping
> Jul  9 14:28:47 (none) heartbeat: [761]: info: Link ems2:eth1 up.
> Jul  9 14:28:48 (none) heartbeat: [764]: ERROR: fifo_child: here???
> Jul  9 14:29:06 (none) last message repeated 18 times
> Jul  9 14:29:06 (none) heartbeat: [761]: WARN: node ems1: is dead
> Jul  9 14:29:06 (none) heartbeat: [761]: info: Comm_now_up(): updating 
> status to active
> Jul  9 14:29:06 (none) heartbeat: [761]: info: Local status now set to: 
> 'active'
> Jul  9 14:29:06 (none) heartbeat: [761]: info: Starting child client 
> "/usr/lib/heartbeat/ipfail" (6343,65)
> Jul  9 14:29:06 (none) heartbeat: [761]: WARN: No STONITH device 
> configured.
> Jul  9 14:29:06 (none) heartbeat: [761]: WARN: Shared disks are not 
> protected.
> Jul  9 14:29:06 (none) heartbeat: [761]: info: Resources being acquired 
> from ems1.
> Jul  9 14:29:06 (none) heartbeat: [775]: info: Starting 
> "/usr/lib/heartbeat/ipfail" as uid 6343  gid 65 (pid 775)
> Jul  9 14:29:07 (none) heartbeat: [764]: WARN: ha_msg_add_nv_depth: line 
> doesn't contain '='
> Jul  9 14:29:07 (none) heartbeat: [764]: info: >>>
> Jul  9 14:29:07 (none) heartbeat: [764]: ERROR: NV failure (msgfromsteam): 
> [>>> ]
> Jul  9 14:29:07 (none) heartbeat: [764]: ERROR: fifo_child: here???
> Jul  9 14:29:07 (none) heartbeat: [777]: info: Local Resource acquisition 
> completed.
> Jul  9 14:29:07 (none) heartbeat: [761]: info: Initial resource 
> acquisition complete (T_RESOURCES(us))
> Jul  9 14:29:08 (none) heartbeat: [764]: ERROR: fifo_child: here???
> Jul  9 14:29:16 (none) last message repeated 8 times
> Jul  9 14:29:17 (none) heartbeat: [761]: info: Local Resource acquisition 
> completed. (none)
> Jul  9 14:29:17 (none) heartbeat: [761]: info: local resource transition 
> completed.
> Jul  9 14:29:17 (none) heartbeat: [764]: ERROR: fifo_child: here???
> 
> 
> Is this a know problem or is it me?

Don't know :) The message heartbeat gets is not formatted
correctly. Can't recall anybody reporting that myself, though I
can see one message with the same footprint in the archives. The
question is who's sending this message. Perhaps you can set debug
to 2 and see if there are any clues in the logs.

Thanks,

Dejan

> I did add some debug in function fifo_child(), as shown below.
> 
>     /* Make sure we check for death of parent every so often... */
>     for (;;) {
> 
>         setmsalarm(1000L);
>         msg = msgfromstream(fifo);
>         setmsalarm(0L);
>         hb_check_mcp_alive();
>         hb_signal_process_pending();
> 
>         if (msg) {
>             IPC_Message*    m;
>             if (DEBUGDETAILS) {
>                 cl_log(LOG_DEBUG, "fifo_child message:");
>                 cl_log_message(LOG_DEBUG, msg);
>             }
>             m = hamsg2ipcmsg(msg, chan);
>             if (m) {
>                 /* Send frees "m" "at the right time" */
>                 chan->ops->send(chan, m);
>                 hb_check_mcp_alive();
>                 hb_signal_process_pending();
>                 chan->ops->waitout(chan);
>                 hb_check_mcp_alive();
>                 hb_signal_process_pending();
>             }
>             ha_msg_del(msg);
>             msg = NULL;
> 
>         }else if (feof(fifo)) {
>             cl_log(LOG_ERR, "fifo_child: EOF on FIFO");
>             if (ANYDEBUG) {
>                 return_to_orig_privs();
>                 cl_log(LOG_DEBUG
>                 ,   "fifo_child: EOF on FIFO");
>             }
>             hb_check_mcp_alive();
>             exit(2);
>         } else {
>             cl_log(LOG_ERR, "fifo_child: here???");
>         }
>         cl_cpu_limit_update();
>         cl_realtime_malloc_check();
>         hb_check_mcp_alive();
>         hb_signal_process_pending();
>     }
>     /*notreached*/
> 
> 
> Many Thanks,
> Ken
> _______________________________________________
> Linux-HA mailing list
> Linux-HA at lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems



More information about the Linux-HA mailing list