At roughly 19:00 UTC, xen1 crashed. This brought down the koji hub but no other external services. The loss of puppet1, noc1, and smtp affected internal services. The loss of fas2 and app4 did not have an immediate impact, as they are load balanced.
At some point between 19:02 and 19:25 UTC app5.vpn crashed and sat at the console (console messages attached). The good news is that this does not seem to have brought about a loss of web services as we still had app2 serving TurboGears content. However, all requests were going to app2 at that point.
By 19:25 UTC we had brought app5 back up. skvidal had all the guests back up by 12:40. mbonnet restarted kojid on the builders that needed to know that the koji hub was back. The outage was officially over by 19:51 UTC.
console output of app5 crash
skvidal thinks we might be able to script the starting of xen guests. This would make recovery from a xen host reboot much quicker.
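As a rough illustration of what such a script could look like, here is a minimal Python sketch. It is not the script anyone on the team wrote. It assumes the host uses the xm toolstack ("xm create <config>"), that each guest has a config file named after it under /etc/xen, and that the guest list is maintained by hand. The function names, the config directory, and the example guest list are all hypothetical.

```python
import subprocess
from pathlib import PurePosixPath


def build_start_commands(guests, config_dir="/etc/xen"):
    """Return one 'xm create <config>' command per guest.

    Assumption: each guest's config file is named after the guest and
    lives directly in config_dir.
    """
    return [["xm", "create", str(PurePosixPath(config_dir) / name)]
            for name in guests]


def start_guests(guests, config_dir="/etc/xen", dry_run=True,
                 runner=subprocess.call):
    """Start each guest in order and return the configs that failed.

    With dry_run=True the commands are only printed, so the script can
    be checked on a machine that has no xen tools installed.
    """
    failed = []
    for cmd in build_start_commands(guests, config_dir):
        if dry_run:
            print(" ".join(cmd))
            continue
        # A non-zero exit status means the guest did not start.
        if runner(cmd) != 0:
            failed.append(cmd[-1])
    return failed


if __name__ == "__main__":
    # Hypothetical guest list for one host; adjust per machine.
    start_guests(["puppet1", "noc1", "fas2", "app4"], dry_run=True)
```

Running it with dry_run=True prints the commands it would issue. Passing dry_run=False on the xen host would actually start the guests and return the list of configs that failed, which could then be retried by hand.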
We didn't experience an outage from the combined app4 and app5 downtime, as app2 was still running, but it is a bit worrisome that we only had one app server serving requests at that point. app5 will be rebuilt as x86_64 soon to see whether a 64-bit guest on a 64-bit host resolves its crashes.
I've been working on a script that auto-restarts xen guests via a simple CLI and SNMP. I also haven't seen any more issues with app5 or xen1. This seems to have been a one-off. app5 is i686 now.
Closing ticket, it's overdue.