Issue #3226: Outage Report: 2012-04-03, 11:20 UTC, all app servers - fedora-infrastructure

Closed: Fixed. Opened 12 years ago by codeblock.

Outage: App appXX servers - 2012-04-03 11:20 UTC

There was an outage starting at 2012-04-03 11:20 UTC, which lasted approximately 2.5 hours.

To convert UTC to your local time, take a look at
http://fedoraproject.org/wiki/Infrastructure/UTCHowto
or run:

date -d '2012-04-03 11:20 UTC'
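For readers without GNU date, the same conversion can be sketched in Python. This is only an illustration: the 2.5 hour duration is the approximate figure from this report, so the computed end time is an estimate, and the names start_utc, approx_end_utc and to_local are made up for this example.

```python
from datetime import datetime, timedelta, timezone

# Outage start as stated in this report.
start_utc = datetime(2012, 4, 3, 11, 20, tzinfo=timezone.utc)

# "Approximately 2.5 hours" per the report; the exact end time is not given,
# so this is an estimate only.
approx_end_utc = start_utc + timedelta(hours=2, minutes=30)


def to_local(dt):
    """Convert an aware datetime to the time zone configured on this machine."""
    return dt.astimezone()


print("start (UTC):      ", start_utc.isoformat())
print("start (local):    ", to_local(start_utc).isoformat())
print("approx. end (UTC):", approx_end_utc.isoformat())
```

The local values printed depend on the time zone of the machine running the script.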

Reason for outage: All appXX servers had a very high load.

The problem started at 11:20 UTC.
What we know so far:
Rebooting the appXX servers did not resolve the problem.
Stopping mirrorlist_server and restarting it manually seemed to resolve the problem.
The root cause is still under investigation. A memory image and an LVM snapshot image of app05, both taken during this outage, are on ibiblio02 at /root/app05-2012-04-03-break.mem.img and /root/app05_broken_snap.img.gz respectively.

Affected Services:

Bodhi - https://admin.fedoraproject.org/updates/

Fedora Community - https://admin.fedoraproject.org/community/

Mirror List - https://mirrors.fedoraproject.org/

Mirror Manager - https://admin.fedoraproject.org/mirrormanager/

Package Database - https://admin.fedoraproject.org/pkgdb/

Wiki - http://fedoraproject.org/wiki/

Unaffected Services:

Ask Fedora - http://ask.fedoraproject.org/

BFO - http://boot.fedoraproject.org/

Buildsystem - http://koji.fedoraproject.org/

GIT / Source Control

DNS - ns1.fedoraproject.org, ns2.fedoraproject.org

Docs - http://docs.fedoraproject.org/

Email system

Fedora Account System - https://admin.fedoraproject.org/accounts/

Fedora Hosted - https://fedorahosted.org/

Fedora Insight - https://insight.fedoraproject.org/

Fedora People - http://fedorapeople.org/

Main Website - http://fedoraproject.org/

QA Services

Secondary Architectures

Smolt - http://smolts.org/

Spins - http://spins.fedoraproject.org/

Start - http://start.fedoraproject.org/

Torrent - http://torrent.fedoraproject.org/

Ticket Link:

Contact Information:

Please join #fedora-admin or #fedora-noc on irc.freenode.net or add comments to the ticket for this outage above.

kevin commented 12 years ago

I've dug around and can't really find too much related to this outage. ;(

Apr 3 11:22:01 app04.phx2.fedoraproject.org kernel: Killed process 13039, UID 441, (mirrorlist_serv) total-vm:322028kB, anon-rss:310424kB, file-rss:216kB

mirrorlist_serv was definitely getting killed by the OOM killer.
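To see how often this happened across the app servers, the kernel line above can be picked apart with a small script. This is a rough sketch: the regular expression is inferred from the single log line quoted in this ticket and may not match other kernel versions, and the names OOM_KILL_RE and parse_oom_kill are made up for the example. Note that the kernel truncates process names to 15 characters, which is why the log says mirrorlist_serv rather than mirrorlist_server.

```python
import re

# Pattern inferred from the one log line quoted in this ticket (an assumption;
# other kernel versions format this message differently).
OOM_KILL_RE = re.compile(
    r"Killed process (?P<pid>\d+), UID (?P<uid>\d+), \((?P<comm>[^)]*)\) "
    r"total-vm:(?P<total_vm_kb>\d+)kB, anon-rss:(?P<anon_rss_kb>\d+)kB, "
    r"file-rss:(?P<file_rss_kb>\d+)kB"
)


def parse_oom_kill(line):
    """Return the fields of an OOM-kill log line as a dict, or None if no match."""
    match = OOM_KILL_RE.search(line)
    if match is None:
        return None
    # Everything except the process name is numeric.
    return {
        key: (value if key == "comm" else int(value))
        for key, value in match.groupdict().items()
    }


line = (
    "Apr 3 11:22:01 app04.phx2.fedoraproject.org kernel: Killed process 13039, "
    "UID 441, (mirrorlist_serv) total-vm:322028kB, anon-rss:310424kB, file-rss:216kB"
)
info = parse_oom_kill(line)
print(info)
print("anon-rss in MiB:", round(info["anon_rss_kb"] / 1024, 1))
```

Running the same function over each line of a syslog file and counting the non-None results per host would show whether every appXX server hit the OOM killer or only some of them.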

It's unclear to me whether anything at all is different between bring-up on a fresh boot and a manual restart. Clearly something was, but I'm not sure what it could be. Adding Matt here for any ideas.

I'll note that bapp01 didn't die or have problems. Could it have been bad mm (MirrorManager) data, with the real cure being not the restart (as opposed to the reboot) but the bad data expiring out of bapp01 and that change getting synced to the app servers?

kevin commented 12 years ago

Additional info: Another mirrormanager instance had issues at exactly the same time.

We now think bad ASN data was messing things up, but we don't have any easy way to confirm that. ;(

I'm going to close this now; feel free to reopen if there are any actions we can take.
