#3058 Outage: unplanned nfs server outage - 2011-12-08
Closed: Fixed None Opened 8 years ago by kevin.

Our nfs01 server, which holds packages and repos for release engineering, had a failure at approximately 08:05 UTC on 2011-12-08. The server has been restarted and a filesystem check has been started. Service will be restored as soon as that check is complete. Build jobs submitted during this time will queue and be built as soon as the nfs server returns to service.

To convert UTC to your local time, take a look at
http://fedoraproject.org/wiki/Infrastructure/UTCHowto
or run:

date -d '2011-12-08 08:05 UTC'
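
If you want the outage time in a specific time zone rather than your own, a small wrapper like the one below may help. This is only a sketch: it assumes GNU date (the `-d` option) and an installed tzdata, and `utc_to_zone` is a hypothetical helper name, not an existing tool.

```shell
# Hypothetical helper: show a UTC timestamp in another time zone.
# Assumes GNU date; the zone names come from the system's tzdata.
utc_to_zone() {
    # $1: time zone name (e.g. Europe/Berlin)
    # $2: timestamp, interpreted as UTC
    TZ="$1" date -d "$2 UTC" '+%Y-%m-%d %H:%M %Z'
}

# The start of this outage as seen from two example zones.
utc_to_zone America/New_York '2011-12-08 08:05'
utc_to_zone Europe/Berlin '2011-12-08 08:05'
```

On a system with tzdata installed this should print 03:05 EST for New York and 09:05 CET for Berlin, both on 2011-12-08.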

Reason for outage:

Unplanned outage with nfs services. Root cause still being investigated.

Affected Services:

Buildsystem - http://koji.fedoraproject.org/

Unaffected services:

BFO - http://boot.fedoraproject.org/
Bodhi - https://admin.fedoraproject.org/updates/
GIT / Source Control
DNS - ns1.fedoraproject.org, ns2.fedoraproject.org
Docs - http://docs.fedoraproject.org/
Email system
Fedora Account System - https://admin.fedoraproject.org/accounts/
Fedora Community - https://admin.fedoraproject.org/community/
Fedora Hosted - https://fedorahosted.org/
Fedora Insight - https://insight.fedoraproject.org/
Fedora People - http://fedorapeople.org/
Fedora Talk - http://talk.fedoraproject.org/
Main Website - http://fedoraproject.org/
Mirror List - https://mirrors.fedoraproject.org/
Mirror Manager - https://admin.fedoraproject.org/mirrormanager/
Package Database - https://admin.fedoraproject.org/pkgdb/
Smolt - http://smolts.org/
Spins - http://spins.fedoraproject.org/
Start - http://start.fedoraproject.org/
Torrent - http://torrent.fedoraproject.org/
Translation Services - http://translate.fedoraproject.org/
Wiki - http://fedoraproject.org/wiki/

Ticket Link: https://fedorahosted.org/fedora-infrastructure/ticket/3058

Contact Information:

Please join #fedora-admin on irc.freenode.net or add comments to the ticket for this outage above.


The filesystem check has completed; however, it found quite a lot of issues.

We have a snapshot from earlier this morning which is in much better shape.

We are going to:

  • back up the snapshot to our backup store.
  • merge the snapshot back in as the main volume.
  • drop any builds in the koji db that were completed today.
  • reboot servers and bring everything back up.

Backup to backup store is done.

Merge of snapshot is in progress now. ETA: 30 minutes or so.

After that's done, I will be doing reboots.

Finally we need to prune some builds before we can bring the builders online.

Merge of snapshot is done.

Reboots are done.

Everything is back up, but the koji builders are not yet re-enabled.
We need to prune any builds that happened after our snapshot.
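
As a rough illustration of how builds newer than the snapshot could be picked out, here is a minimal sketch. It is not the procedure actually used on the koji database: `builds_after`, the input format (one build ID and one ISO 8601 UTC completion time per line), the sample builds and the cutoff time are all made up for the example, and the real snapshot time is not stated in this ticket.

```shell
# Hypothetical helper: print the IDs of builds completed after a cutoff.
# Input on stdin: "<build-id> <completion time in ISO 8601 UTC>" per line.
# ISO 8601 timestamps in the same zone sort lexicographically, so a
# string comparison is enough; appending "" forces awk to compare strings.
builds_after() {
    awk -v cutoff="$1" '($2 "") > (cutoff "") { print $1 }'
}

# Made-up sample data and cutoff, for illustration only.
printf '%s\n' \
    'build-a 2011-12-08T02:10:00Z' \
    'build-b 2011-12-08T05:45:00Z' \
    'build-c 2011-12-08T07:30:00Z' |
    builds_after '2011-12-08T05:00:00Z'
```

With this sample input the sketch lists build-b and build-c, the two entries completed after the example cutoff.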

All should be back up and running as of 3-4 hours ago.

Please report any issues you find with builds in a new ticket.
