#2516 PHX2 Netapp problems

Created 6 years ago by smooge
Modified 6 years ago

= phenomenon =

Due to problems with NFS storage in our PHX2 facility, several of our services are running at diminished capacity. We are working with our provider and engineers to resolve this issue as soon as possible.

To convert UTC to your local time, take a look at
or run:

date -d '2010-12-15 22:00 UTC'

Reason for outage:
NFS operations against the filer are peaking below expected rates, causing hangs on NFS clients.
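
One way to spot a hanging NFS client is to stat a path on the mount under a time limit. The following is only a sketch, not what was run during this outage: the function name probe_path and the 5-second default are assumptions, and on some kernels a process blocked on a hard mount may not respond to the signal at all.

```shell
#!/bin/sh
# probe_path PATH [SECS]: return 0 if PATH can be stat'ed within SECS
# seconds (default 5), non-zero if stat fails or has to be killed.
# probe_path is a hypothetical helper name, not an existing tool.
probe_path() {
    path=$1
    secs=${2:-5}
    # timeout kills stat with SIGKILL if it is still running after $secs.
    timeout -s KILL "$secs" stat "$path" >/dev/null 2>&1
}
```

Run it against a directory on the suspect mount, for example `probe_path /path/to/mount 5 || echo "mount looks hung"`, substituting the real mount point.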

Affected Services:

BFO - http://boot.fedoraproject.org/
Buildsystem - http://koji.fedoraproject.org/
CVS / Source Control
Main Website - http://fedoraproject.org/
Mirror List - https://mirrors.fedoraproject.org/
Mirror Manager - https://admin.fedoraproject.org/mirrormanager/
Package Database - https://admin.fedoraproject.org/pkgdb/

Unaffected Services:

Ticket Link:

Contact Information:

Please join #fedora-admin on irc.freenode.net or respond to this email to
track the status of this outage.

= reason =

= recommendation =

Mitigations being worked on:

  1. Move fi-repo from NFS to disks on puppet.
  2. Move lookaside cache to disks on EqualLogic.
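
Both mitigations come down to copying a tree off NFS onto other storage and checking the copy before switching over. A minimal sketch of that step follows, assuming plain cp and diff are acceptable; the function name copy_off_nfs and both directory arguments are hypothetical, and this is not the procedure actually used for fi-repo or the lookaside cache.

```shell
#!/bin/sh
# copy_off_nfs SRC DST: copy the tree at SRC (assumed to live on NFS)
# to DST (assumed to be on local or other non-NFS disk), then verify it.
# copy_off_nfs is a hypothetical helper name.
copy_off_nfs() {
    src=$1
    dst=$2
    mkdir -p "$dst" || return 1
    # "src/." copies the contents of src, including dotfiles, into dst,
    # and -a preserves ownership, modes and timestamps where permitted.
    cp -a "$src/." "$dst/" || return 1
    # diff -r exits non-zero if any file differs or is missing.
    diff -r "$src" "$dst" >/dev/null
}
```

Only after the function returns 0 would the service be pointed at the new location. The source is left untouched so the change can be rolled back.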

Still trying to figure out next steps.


<skvidal> okay
<skvidal> 1. performance problems - those are likely to continue since we've not removed any load
<skvidal> 2. nothing-works-not-even-a-mount problem appears to have been some dns issues which we are expecting an explanation on "soon" - but the changes, thus far, do appear to be solving them
<skvidal> the next steps are:
<skvidal> a. see if the performance issues gets better w/o the svn repos adding load
<skvidal> b. if the answer to a is yes - see if we can limp along through to the new year so we don't have to play silly buggers over the holiday
<skvidal> c. if the answer to a is no then come up with a new plan
<skvidal> long-ish term (mid feb) is to transition to a new netapp and magically solve all our problems (and find some new ones)
<skvidal> wow, echoing silence as a reply
<skvidal> fantastic


The outage seems to have been solved by cleaning up a bad DNS connection. It looks like at some point in November a change caused the RHIT servers to no longer get DNS from Fedora. When the phx2.fedoraproject.org tables timed out, systems trying to get new mounts failed and other issues followed.

DNS problems were corrected and other jobs that were causing high CPU usage on the server were removed. Traffic seems to be moving back to normal.
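
Since new mounts failed once names stopped resolving, a quick resolver check before mounting can separate a DNS failure from a filer problem. This is a sketch under assumptions: the function names are hypothetical, no filer hostname is hard-coded because the ticket does not name one, and getent is assumed to be available on the client.

```shell
#!/bin/sh
# resolves HOST: return 0 if HOST resolves through the system resolver.
resolves() {
    getent hosts "$1" >/dev/null 2>&1
}

# check_before_mount HOST: print a one-line status for HOST and return
# non-zero if it does not resolve. Both function names are hypothetical.
check_before_mount() {
    if resolves "$1"; then
        echo "ok: $1 resolves"
    else
        echo "FAIL: $1 does not resolve; new mounts from it would likely fail"
        return 1
    fi
}
```

Calling check_before_mount with the filer's name from an affected client shows whether that client can still look it up.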

