#479 Xen7 crash
Closed: Fixed None Opened 16 years ago by toshio.

== Summary ==
* xen7 crashed at 2008-03-29 14:05 UTC
* This took down app1 which has the writable moin instance
* Other services were load balanced so end users were not affected
* Xen guests brought back up ~ 2008-03-29 18:30 UTC

== Log ==
* Noticed problem through report on #fedora-admin at 11:00 PDT
* Saw nagios alerts for app1, app3, and proxy2
* Looked for a xen server down
* First run of scanXen.sh reported xen5 and xen7 down.
* Able to ping xen5 but not xen7
* A new run of scanXen.sh found both hosts -- xen7 had no guests running.
* Started app1, app3, fas1, and proxy2
* Outage over

== Thoughts ==

Since xen5 and xen7 weren't found by scanXen at first, I wonder if we're having a network issue that causes the disconnects in the log.

Log is similar to ticket:477 but the last entry here is an iscsi recovery.
{{{
Mar 29 13:58:36 xen7 puppetd[3754]: Finished catalog run in 12.47 seconds
Mar 29 14:06:14 xen7 iscsid: Nop-out timedout after 15 seconds on connection 2:0 state (3). Dropping session.
Mar 29 14:06:20 xen7 iscsid: connection2:0 is operational after recovery (2 attempts)
Mar 29 14:11:35 xen7 syslogd 1.4.1: restart.
}}}

Like ticket:477, there are multiple iscsi connect failures during the boot process before we reconnect to the server.
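To check whether the last entry before each crash is consistently an iscsi recovery, /var/log/messages could be scanned for every syslogd restart and the entry logged just before it. The following is only a minimal sketch, assuming the classic syslog line format shown in the excerpt above; the function names (`parse_line`, `gaps_before_restart`) are hypothetical, and the year has to be supplied because syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# Matches classic syslog lines such as:
#   Mar 29 14:06:20 xen7 iscsid: connection2:0 is operational ...
LINE_RE = re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) (\S+) (.*)$")


def parse_line(line, year=2008):
    """Return (timestamp, host, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    stamp, host, message = m.groups()
    # Syslog timestamps carry no year, so the caller supplies one.
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return when, host, message


def gaps_before_restart(lines, year=2008):
    """For each 'syslogd ... restart.' entry, return the previous message and the silent gap."""
    previous = None
    results = []
    for line in lines:
        entry = parse_line(line, year)
        if entry is None:
            continue
        when, _host, message = entry
        if message.startswith("syslogd") and message.endswith("restart.") and previous:
            results.append((previous[2], when - previous[0]))
        previous = entry
    return results


# Lines copied from the 2008-03-29 excerpt in this ticket.
sample = [
    "Mar 29 13:58:36 xen7 puppetd[3754]: Finished catalog run in 12.47 seconds",
    "Mar 29 14:06:14 xen7 iscsid: Nop-out timedout after 15 seconds on connection 2:0 state (3). Dropping session.",
    "Mar 29 14:06:20 xen7 iscsid: connection2:0 is operational after recovery (2 attempts)",
    "Mar 29 14:11:35 xen7 syslogd 1.4.1: restart.",
]

for message, gap in gaps_before_restart(sample):
    print(gap, "|", message)
# prints: 0:05:15 | iscsid: connection2:0 is operational after recovery (2 attempts)
```

Run against the full log, this would list every reboot together with the last message seen before it, which makes it easier to compare this ticket with ticket:477.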


Happened again this morning. ausil restarted the guests. Log messages from around that time:
{{{
Mar 30 14:03:21 xen7 kernel: xen_net: Memory squeeze in netback driver.
Mar 30 14:05:29 xen7 kernel: xen_net: Memory squeeze in netback driver.
Mar 30 14:06:37 xen7 last message repeated 11 times
Mar 30 14:09:06 xen7 kernel: xen_net: Memory squeeze in netback driver.
Mar 30 14:09:28 xen7 last message repeated 5 times
Mar 30 14:13:58 xen7 iscsid: Nop-out timedout after 15 seconds on connection 2:0 state (3). Dropping session.
Mar 30 14:14:04 xen7 iscsid: connection2:0 is operational after recovery (2 attempts)
Mar 30 14:19:18 xen7 syslogd 1.4.1: restart.
Mar 30 14:19:18 xen7 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Mar 30 14:19:18 xen7 kernel: Bootdata ok (command line is ro root=LABEL=/ console=ttyS0,9600n8)
}}}

And again this morning. spot reported it and I brought it back up at 15:40 UTC on 2008-03-31. I was unable to ssh to the box at 15:33 but was able to ping it. Logged in on the serial console, verified that sshd was running, tried to ssh again, and it worked.

/var/log/secure has this to say from the time of the reboot:
{{{
Mar 31 14:25:49 xen7 sshd[3480]: Server listening on :: port 22.
Mar 31 14:25:49 xen7 sshd[3480]: error: Bind to port 22 on 0.0.0.0 failed: Address already in use.
}}}

/var/log/messages from the time of the crash:

{{{
Mar 31 14:21:48 xen7 iscsid: Nop-out timedout after 15 seconds on connection 2:0 state (3). Dropping session.
Mar 31 14:21:53 xen7 iscsid: connection2:0 is operational after recovery (2 attempts)
Mar 31 14:25:13 xen7 syslogd 1.4.1: restart.
Mar 31 14:25:13 xen7 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Mar 31 14:25:13 xen7 kernel: Bootdata ok (command line is ro root=LABEL=/ console=ttyS0,9600n8)
Mar 31 14:25:13 xen7 kernel: Linux version 2.6.18-53.1.14.el5xen (brewbuilder@hs20-bc2-3.build.redhat.com) (gcc version 4.1.2 20070626 (Red Hat 4.1.2-14)) #1 SMP Tue Feb 19 07:33:17 EST 2008
}}}

This was related to a poorly configured network interface: an IP conflict between the management port and the primary NIC.
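A conflict of this kind can be caught by checking that no address is assigned to more than one interface. The sketch below is illustrative only: the ticket does not record the real interface names or addresses, so the data is hypothetical (192.0.2.0/24 is the documentation range from RFC 5737), and `find_ip_conflicts` is a made-up helper name.

```python
from collections import defaultdict


def find_ip_conflicts(assignments):
    """Return {address: [interfaces]} for every address claimed by more than one interface."""
    by_address = defaultdict(list)
    for interface, address in assignments:
        by_address[address].append(interface)
    return {addr: names for addr, names in by_address.items() if len(names) > 1}


# Hypothetical example data: the ticket does not record the real addresses.
assignments = [
    ("eth0", "192.0.2.10"),        # primary NIC
    ("management", "192.0.2.10"),  # management port, same address by mistake
    ("eth1", "192.0.2.11"),
]

print(find_ip_conflicts(assignments))
# prints: {'192.0.2.10': ['eth0', 'management']}
```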
