Outage: App appXX servers - 2012-04-03 11:20 UTC
There was an outage starting at 2012-04-03 11:20 UTC, which lasted approximately 2.5 hours.
To convert UTC to your local time, take a look at http://fedoraproject.org/wiki/Infrastructure/UTCHowto or run:
date -d '2012-04-03 11:20 UTC'
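For readers without GNU date, the same conversion can be sketched in Python. This is only an illustration: the names start_utc, approx_end_utc and to_local are made up here, and the end time is derived from the "approximately 2.5 hours" stated above, so it is approximate too.

```python
from datetime import datetime, timedelta, timezone

# Outage start as stated in this report (UTC).
start_utc = datetime(2012, 4, 3, 11, 20, tzinfo=timezone.utc)

# The report says "approximately 2.5 hours", so this end time is an estimate.
approx_end_utc = start_utc + timedelta(hours=2.5)


def to_local(dt):
    """Convert an aware datetime to the system's local time zone."""
    return dt.astimezone()


if __name__ == "__main__":
    print("start (local):      ", to_local(start_utc).isoformat())
    print("approx. end (local):", to_local(approx_end_utc).isoformat())
```

The printed values depend on the time zone configured on the machine running the script.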
Reason for outage: All appXX servers had a very high load.
The problem started at 11:20 UTC.
What we know so far: Rebooting the appXX servers did not resolve the problem.
Stopping mirrorlist_server and restarting it manually appeared to resolve it.
Root cause is still under investigation. There is a memory image and LVM snapshot image of app05 during this outage, on ibiblio02, at: /root/app05-2012-04-03-break.mem.img and /root/app05_broken_snap.img.gz respectively.
Affected Services:
Bodhi - https://admin.fedoraproject.org/updates/
Fedora Community - https://admin.fedoraproject.org/community/
Mirror List - https://mirrors.fedoraproject.org/
Mirror Manager - https://admin.fedoraproject.org/mirrormanager/
Package Database - https://admin.fedoraproject.org/pkgdb/
Wiki - http://fedoraproject.org/wiki/
Unaffected Services:
Ask Fedora - http://ask.fedoraproject.org/
BFO - http://boot.fedoraproject.org/
Buildsystem - http://koji.fedoraproject.org/
GIT / Source Control
DNS - ns1.fedoraproject.org, ns2.fedoraproject.org
Docs - http://docs.fedoraproject.org/
Email system
Fedora Account System - https://admin.fedoraproject.org/accounts/
Fedora Hosted - https://fedorahosted.org/
Fedora Insight - https://insight.fedoraproject.org/
Fedora People - http://fedorapeople.org/
Main Website - http://fedoraproject.org/
QA Services
Secondary Architectures
Smolt - http://smolts.org/
Spins - http://spins.fedoraproject.org/
Start - http://start.fedoraproject.org/
Torrent - http://torrent.fedoraproject.org/
Ticket Link:
Contact Information:
Please join #fedora-admin or #fedora-noc on irc.freenode.net or add comments to the ticket for this outage above.
I've dug around and can't really find too much related to this outage. ;(
Apr 3 11:22:01 app04.phx2.fedoraproject.org kernel: Killed process 13039, UID 441, (mirrorlist_serv) total-vm:322028kB, anon-rss:310424kB, file-rss:216kB
mirrorlist_serv was definitely getting killed by the OOM killer.
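For anyone going back through the logs, a rough Python sketch for pulling the fields out of kill lines like the one above. The regex is written against that single sample line only, and the names OOM_KILL_RE and parse_oom_kill are made up for this example; kernel messages in other formats would need a different pattern.

```python
import re

# Pattern written against the one log line quoted in this ticket (an assumption:
# other kernel versions may format the message differently).
OOM_KILL_RE = re.compile(
    r"Killed process (?P<pid>\d+), UID (?P<uid>\d+), \((?P<comm>[^)]+)\) "
    r"total-vm:(?P<total_vm_kb>\d+)kB, anon-rss:(?P<anon_rss_kb>\d+)kB, "
    r"file-rss:(?P<file_rss_kb>\d+)kB"
)


def parse_oom_kill(line):
    """Return the kill record as a dict, or None if the line does not match."""
    match = OOM_KILL_RE.search(line)
    if match is None:
        return None
    # Everything except the command name is numeric.
    return {
        key: (value if key == "comm" else int(value))
        for key, value in match.groupdict().items()
    }


sample = (
    "Apr 3 11:22:01 app04.phx2.fedoraproject.org kernel: Killed process 13039, "
    "UID 441, (mirrorlist_serv) total-vm:322028kB, anon-rss:310424kB, file-rss:216kB"
)

if __name__ == "__main__":
    record = parse_oom_kill(sample)
    print(record["comm"], record["pid"], record["anon_rss_kb"] // 1024, "MiB anon RSS")
```

Run over the full syslog from each appXX host, this would show how often mirrorlist_serv was killed and how large it was each time.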
It's unclear to me whether anything at all differs between bringing mirrorlist_server up on a fresh boot and restarting it manually. Clearly something did, but I'm not sure what it could be. Adding Matt here for any ideas.
I'll note that bapp01 didn't die or have problems. Could it have been bad mm data, and what really cured it was not the restart (as opposed to the reboot) but the bad mm data expiring out of bapp01 and the result getting synced to the apps?
Additional info: Another mirrormanager instance had issues at exactly the same time.
We now think bad ASN data was messing things up, but we don't have any easy way to confirm that. ;(
I'm going to close this now; feel free to reopen if there are any actions we can take.