== Summary ==
 * xen1 crashed at 2008-03-29 8:49UTC.
 * buildsystem was down until ~ 2008-03-29 11:50 UTC.
 * Other services were load balanced or not serving end users so no outage was visible

== Log of Actions ==
 * Found out koji was down at 9:50UTC
 * Found that koji1/publictest8 wasn't running anywhere and neither was puppet1
 * Started koji1 and puppet1 on xen2
 * puppet1 came up
 * koji1's root filesystem mounted itself read-only
 * Rebooted koji1
 * Manual fsck was needed and required a root password
 * Found that xen1 didn't have any guests running and had a configuration for both koji1 and puppet1
 * Logs showed koji1, puppet1, and others had been running on xen1 and xen1 had crashed at 8:49UTC
 * Shut down koji1 and puppet1 on xen2
 * Brought up koji1, puppet1, app4, proxy4, fas2, smtp, publictest3, and noc1 on xen1
 * koji1 still required root for fsck
 * publictest3 did not restart because xen reported not enough memory
 * Called Dennis, who had the root password and was able to fsck koji1
 * Dennis ran xm memset so that publictest3 could be brought back up as well
 * Outage over

== Possible Causes ==
/var/log/messages and /var/log/dmesg on xen1 have been saved to ~root/outage-2008-03-29.messages and ~root/outage-2008-03-29.dmesg in case someone can pull some useful information from them later.

/var/log/messages around the crash::
{{{
Mar 29 08:45:35 xen1 puppetd[19997]: Finished catalog run in 21.25 seconds
Mar 29 08:46:57 xen1 iscsid: Nop-out timedout after 15 seconds on connection 2:0 state (3). Dropping session.
Mar 29 08:49:38 xen1 syslogd 1.4.1: restart.
Mar 29 08:49:38 xen1 kernel: klogd 1.4.1, log source = /proc/kmsg started.
}}}

The iscsi timeout could be a red herring, since we see these timeouts throughout the logs. On reboot, however, we do have multiple failures to connect to iscsi before it finally succeeds::
{{{
[Others like this]
Mar 29 08:50:47 xen1 iscsid: Nop-out timedout after 15 seconds on connection 1:0 state (3). Dropping session.
Mar 29 08:50:51 xen1 dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 8
Mar 29 08:50:53 xen1 iscsid: connect failed (113)
Mar 29 08:50:53 xen1 iscsid: Nop-out timedout after 15 seconds on connection 2:0 state (3). Dropping session.
Mar 29 08:50:56 xen1 iscsid: connect failed (113)
Mar 29 08:50:58 xen1 iscsid: connect failed (113)
[...]
Mar 29 08:51:02 xen1 iscsid: connection2:0 is operational after recovery (3 attempts)
Mar 29 08:51:02 xen1 iscsid: connection1:0 is operational after recovery (4 attempts)
}}}

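To judge whether the Nop-out timeouts really are background noise, it may help to count them per connection across the saved log. The following is only a minimal sketch, not something that was run during the outage: the function name `count_nop_timeouts` and the `SAMPLE` list are hypothetical, and the pattern assumes the syslog line layout seen in the excerpts on this page.

```python
import re
from collections import Counter

# Assumed syslog layout, taken from the excerpts on this page, e.g.:
# "Mar 29 08:46:57 xen1 iscsid: Nop-out timedout after 15 seconds on connection 2:0 state (3). ..."
NOP_TIMEOUT = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+iscsid: "
    r"Nop-out timedout after \d+ seconds on connection (?P<conn>\d+:\d+)"
)


def count_nop_timeouts(lines):
    """Return a Counter mapping iSCSI connection id to its number of Nop-out timeouts."""
    counts = Counter()
    for line in lines:
        match = NOP_TIMEOUT.match(line)
        if match:
            counts[match.group("conn")] += 1
    return counts


# Hypothetical sample: a few lines copied from the log excerpts on this page.
SAMPLE = [
    "Mar 29 08:45:35 xen1 puppetd[19997]: Finished catalog run in 21.25 seconds",
    "Mar 29 08:46:57 xen1 iscsid: Nop-out timedout after 15 seconds on connection 2:0 state (3). Dropping session.",
    "Mar 29 08:50:47 xen1 iscsid: Nop-out timedout after 15 seconds on connection 1:0 state (3). Dropping session.",
    "Mar 29 08:50:53 xen1 iscsid: connect failed (113)",
    "Mar 29 08:50:53 xen1 iscsid: Nop-out timedout after 15 seconds on connection 2:0 state (3). Dropping session.",
]

if __name__ == "__main__":
    for conn, n in sorted(count_nop_timeouts(SAMPLE).items()):
        print(f"connection {conn}: {n} Nop-out timeout(s)")
```

To run it against the saved copy instead of the sample, pass an open file object for ~root/outage-2008-03-29.messages to `count_nop_timeouts`; comparing the counts before and after 08:49 would show whether the rate changed around the crash.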
This is likely related to:
https://bugzilla.redhat.com/show_bug.cgi?id=429469
Side note: we've got two NICs listening as the iscsi target. We can start looking into adjusting the timeout settings in /etc/iscsid as well as getting multipath set up properly.
Closing this; xen1 seems to have calmed down quite a bit with the newer 5.2 kernel.