#1059 db3 outage
Closed: Fixed None Opened 15 years ago by ricky.

db3 went down around 2008-12-16 08:10 UTC. I had to kick it on the PDU, and I have now confirmed file corruption/loss on the / filesystem, at the very least. I have paged Mike to take a look at this issue.


K, I'm on it now. We've heard there are issues on / (not a big deal, we store no data there), though we've actually found issues on /backup. Normally that's not a problem, but db1 is running from there right now. I'm in the process of bringing db1 back up.

So initial thoughts almost have to be "db3 can't handle the load of running two databases". The problem is that db1, under normal load (which it's been getting), isn't terribly busy, and db3 is a pretty beefy box. Load was always fairly low, and it never swapped. Aside from some disk issues, there's nothing indicating what caused the crash. The RSA II management card is not reporting any faults.

RSA card finally throwing some errors including:

aacraid: Host adapter abort request (0,0,5,0)
aacraid: Host adapter abort request (0,0,3,0)
aacraid: Host adapter abort request (0,0,3,0)
aacraid: Host adapter abort request (0,0,3,0)
aacraid: Host adapter abort request (0,0,3,0)
aacraid: Host adapter abort request (0,0,3,0)
aacraid: Host adapter abort request (0,0,3,0)
aacraid: Host adapter abort request (0,0,3,0)
aacraid: Host adapter abort request (0,0,3,0)
aacraid: Host adapter abort request (0,0,3,0)
aacraid: Host adapter abort request (0,0,3,0)
aacraid: Host adapter abort request (0,0,3,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,4,0)
aacraid: Host adapter abort request (0,0,4,0)
aacraid: Host adapter abort request (0,0,4,0)
aacraid: Host adapter abort request (0,0,4,0)
aacraid: Host adapter abort request (0,0,4,0)
aacraid: Host adapter abort request (0,0,4,0)
aacraid: Host adapter abort request (0,0,4,0)
aacraid: Host adapter abort request (0,0,4,0)
aacraid: Host adapter abort request (0,0,4,0)
aacraid: Host adapter abort request (0,0,4,0)
aacraid: Host adapter abort request (0,0,1,0)
aacraid: Host adapter reset request. SCSI hang ?
AAC: Host adapter BLINK LED 0xa
AAC0: adapter kernel panic'd a.
AAC0: adapter kernel panic'd a.

and:
SERVPROC 12/16/08, 11:00:08 Hard Drive 3 Fault
SERVPROC 12/16/08, 10:59:59 Hard Drive 5 Fault
SERVPROC 12/16/08, 10:59:59 Hard Drive 2 Fault
SERVPROC 12/16/08, 10:59:59 Hard Drive 1 Fault
SERVPROC 12/16/08, 10:59:58 Hard Drive 0 Fault
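For anyone triaging a dump like the one above, here is a minimal Python sketch that tallies the abort requests per device and lists the drives flagged by the service processor. The function name `summarize` and the regular expressions are illustrative, not part of any tooling used in this ticket, and it assumes the third number in each aacraid tuple is the SCSI target id.

```python
import re
from collections import Counter

# Patterns matching the two kinds of lines pasted in this ticket.
ABORT_RE = re.compile(
    r"aacraid: Host adapter abort request \((\d+),(\d+),(\d+),(\d+)\)"
)
FAULT_RE = re.compile(r"SERVPROC \S+ \S+ Hard Drive (\d+) Fault")


def summarize(log_text):
    """Return (abort counts keyed by id, sorted list of faulted drive numbers)."""
    aborts = Counter()
    faulted = set()
    for line in log_text.splitlines():
        match = ABORT_RE.search(line)
        if match:
            # Assumption: the third field of the tuple is the SCSI target id.
            aborts[int(match.group(3))] += 1
            continue
        match = FAULT_RE.search(line)
        if match:
            faulted.add(int(match.group(1)))
    return aborts, sorted(faulted)


# A short excerpt of the lines above, used only as sample input.
sample = """\
aacraid: Host adapter abort request (0,0,5,0)
aacraid: Host adapter abort request (0,0,3,0)
aacraid: Host adapter abort request (0,0,3,0)
SERVPROC 12/16/08, 11:00:08 Hard Drive 3 Fault
SERVPROC 12/16/08, 10:59:59 Hard Drive 5 Fault
"""

aborts, faulted = summarize(sample)
print(dict(aborts), faulted)
```

Feeding it the full paste instead of the excerpt gives a quick view of which ids account for most of the aborts.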

IBM is going to replace the RAID controller and motherboard. The motherboard is going to take a while to get in. I've tried repeatedly to rebuild the array and bring postgres back online, and it's failed every time. However, in degraded mode I seem to be able to read from the arrays. The last backup did work as required, so at worst we're looking at 9 hours of data loss.

I've decided it's best (it is the middle of the night right now) to just take the downtime and go for 0 hours of data loss. I'm syncing the raw data files off now. At 83G it will take a while (the drives are borked). It's about 5% done. I'll have a better estimate of how long it will take soon.
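As a rough illustration of how such an estimate can be made, the sketch below extrapolates the remaining copy time from the fraction done so far. The 83G size and 5% progress come from this ticket; the elapsed time is a made-up placeholder, and the function name is hypothetical. It assumes the copy rate stays constant, which failing drives may well not honor.

```python
def estimate_remaining(total_bytes, fraction_done, elapsed_seconds):
    """Return (seconds remaining, bytes per second), assuming a constant rate."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    copied = total_bytes * fraction_done
    rate = copied / elapsed_seconds
    return (total_bytes - copied) / rate, rate


# 83G and 5% are from the ticket; 600 seconds elapsed is a placeholder.
seconds_left, rate = estimate_remaining(83 * 1024**3, 0.05, 600)
print(f"{seconds_left / 3600:.1f} hours left at {rate / 1024**2:.1f} MiB/s")
```

Re-running it as the copy progresses gives a better figure, since early throughput on a degraded array is not necessarily representative.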

Replying to [comment:4 mmcgrath]:

RSA card finally throwing some errors including:

aacraid: Host adapter reset request. SCSI hang ?
AAC: Host adapter BLINK LED 0xa
AAC0: adapter kernel panic'd a.
AAC0: adapter kernel panic'd a.

Actually, the errors quoted above were from the kernel; the others were from the RSA card (just to avoid confusion).

Ok, db3's files are copied off. There was corruption of:

/var/lib/pgsql/data/base/19461/pg_internal.init

But that seems to be the only file.
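The ticket doesn't say how the damaged file was found. One simple way to spot files that can no longer be read in full is to walk the tree and read every file to the end, recording any I/O errors. The sketch below does that; `find_unreadable` is a hypothetical helper, and it only detects read failures, not silent corruption of data that still reads back.

```python
import os


def find_unreadable(root, chunk_size=1 << 20):
    """Walk root and return (path, error message) for files that fail to read."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    # Read the whole file in chunks; the data itself is discarded.
                    while handle.read(chunk_size):
                        pass
            except OSError as exc:
                bad.append((path, exc.strerror))
    return bad


if __name__ == "__main__":
    # The data directory path is taken from the ticket.
    for path, reason in find_unreadable("/var/lib/pgsql/data"):
        print(f"{path}: {reason}")
```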

Ok.

  1. postgres files moved from db3 to xen3 in guest 'db3tmp'
  2. postgres turned back on
  3. koji turned back on
  4. a build is processing now.
  5. db3 is currently at IP 10.8.34.188
  6. db3tmp is at 10.8.34.213
  7. db3 has had its network interface disabled, it won't come back up on reboot
  8. puppet has been disabled on both
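A quick sanity check after a move like this is to confirm that the temporary guest accepts connections and the old host does not. The sketch below is one way to do it from Python. The IP addresses are from the list above; port 5432 is the PostgreSQL default and is an assumption here, as is the helper name `port_is_open`.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 5432 is the default PostgreSQL port (assumed, not stated in the ticket).
    print("db3tmp reachable:", port_is_open("10.8.34.213", 5432))
    print("db3 reachable:", port_is_open("10.8.34.188", 5432))
```

This only shows that something is listening on the port; it says nothing about whether postgres itself is healthy.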

The current plan is to wait until the IBM tech gets on site. He'll replace the motherboard and backplane. Once we're confident it's working properly we'll schedule another outage, probably asap, to copy those files back.

Differences between xen3 and db3: xen3 only has one other app on it, app4, and we can safely disable it if we need to. db3 has 18G of RAM; db3tmp (on xen3) has 20G. The biggest difference is that xen3 just has a single RAID1 mirror for its data, while db3 had a RAID1 array for its logs and a RAID10 array for the data. IO is likely to be the biggest cause of issues until we move back.

IBM guy left. He replaced the backplane but not the motherboard. I'm going to run some fscks, let the arrays get back in shape, and put some general load on it for the next 24 hours (provided koji continues to run ok where it is).

K, db3's arrays have synced. Everything looks good. I'm going to set up bonnie to stress it over the next 12 hours or so. If it's still working I'll schedule downtime to move back to it.
