== Summary ==
* xen1 crashed at 2008-03-29 08:49 UTC.
* The build system was down until approximately 2008-03-29 11:50 UTC.
* Other services were load balanced or not serving end users, so no outage was visible for those services.
== Log of Actions ==
* Found out koji was down at 09:50 UTC
* Found that koji1/publictest8 wasn't running anywhere and neither was puppet1
* Started koji1 and puppet1 on xen2
* puppet1 came up
* koji1's root filesystem mounted itself read-only
* Rebooted koji1
* Manual fsck was needed and required a root password
* Found that xen1 didn't have any guests running and had a configuration for both koji1 and puppet1
* Logs showed koji1, puppet1, and others had been running on xen1, and xen1 had crashed at 08:49 UTC
* Shut down koji1 and puppet1 on xen2
* Brought up koji1, puppet1, app4, proxy4, fas2, smtp, publictest3, and noc1 on xen1
* koji1 still required root for fsck
* publictest3 did not restart because Xen reported that there was not enough memory
* Called Dennis, who had the root password and was able to fsck koji1
* Dennis ran xm memset so that publictest3 could be brought back up as well
* Outage over
== Possible Causes ==
/var/log/messages and /var/log/dmesg on xen1 have been saved to ~root/outage-2008-03-29.messages and ~root/outage-2008-03-29.dmesg in case someone can pull some useful information from them later.
/var/log/messages around the crash::
Mar 29 08:45:35 xen1 puppetd: Finished catalog run in 21.25 seconds
Mar 29 08:46:57 xen1 iscsid: Nop-out timedout after 15 seconds on connection 2:0 state (3). Dropping session.
Mar 29 08:49:38 xen1 syslogd 1.4.1: restart.
Mar 29 08:49:38 xen1 kernel: klogd 1.4.1, log source = /proc/kmsg started.
The iSCSI timeout could be a red herring, since we see these timeouts throughout the logs, but on reboot we do have multiple failures to connect to iSCSI before it finally succeeds::
[Others like this]
Mar 29 08:50:47 xen1 iscsid: Nop-out timedout after 15 seconds on connection 1:0 state (3). Dropping session.
Mar 29 08:50:51 xen1 dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 8
Mar 29 08:50:53 xen1 iscsid: connect failed (113)
Mar 29 08:50:53 xen1 iscsid: Nop-out timedout after 15 seconds on connection 2:0 state (3). Dropping session.
Mar 29 08:50:56 xen1 iscsid: connect failed (113)
Mar 29 08:50:58 xen1 iscsid: connect failed (113)
Mar 29 08:51:02 xen1 iscsid: connection2:0 is operational after recovery (3 attempts)
Mar 29 08:51:02 xen1 iscsid: connection1:0 is operational after recovery (4 attempts)
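One way to judge whether the Nop-out timeouts are a red herring is to count them per hour across the saved messages file and see whether the hour of the crash stands out. The following is a minimal sketch, not something that was run during the outage. The function names `count_nop_timeouts` and `report` are hypothetical, and the regular expression assumes the syslog line format shown in the excerpts above.

```python
import re
from collections import Counter

# Assumed line format, taken from the excerpts above:
# Mar 29 08:46:57 xen1 iscsid: Nop-out timedout after 15 seconds on connection 2:0 ...
NOP_TIMEOUT = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2}) (?P<hour>\d{2}):\d{2}:\d{2} \S+ iscsid: "
    r"Nop-out timedout after \d+ seconds on connection (?P<conn>\d+:\d+)"
)


def count_nop_timeouts(lines):
    """Count iscsid Nop-out timeouts per (month, day, hour) bucket.

    Buckets appear in the order they are first seen, which is log order
    for a chronologically written syslog file.
    """
    counts = Counter()
    for line in lines:
        match = NOP_TIMEOUT.match(line)
        if match:
            bucket = (match["month"], int(match["day"]), int(match["hour"]))
            counts[bucket] += 1
    return counts


def report(path):
    """Print one line per hour that contained at least one Nop-out timeout."""
    with open(path, errors="replace") as handle:
        counts = count_nop_timeouts(handle)
    for (month, day, hour), total in counts.items():
        print(f"{month} {day:2d} {hour:02d}:00  {total} Nop-out timeout(s)")
```

Calling `report()` with the path to the saved copy of /var/log/messages prints the hourly totals. If the hours before 08:49 on Mar 29 look like every other hour, the timeouts are probably background noise rather than the cause.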
This is likely related to:
Side note: we've got two NICs listening as the iSCSI target. We can start looking into adjusting the timeout settings in /etc/iscsid as well as getting multipath properly set up.
Closing this; xen1 seems to have calmed down quite a bit with the newer 5.2 kernel.