#6384 update postgres version for datagrepper/datanommer
Closed: Fixed 6 years ago Opened 6 years ago by kellin.

@ralph and @kevin asked me to look into a datagrepper query speed issue.

It seems that the queries in question use an EXISTS combined with ORDER BY that make incorrect estimates thus the query planner is using the wrong optimizations.

This is fixed in Postgres 9.3 according to the patch notes; tested it locally on Postgres 9.6 and the query that times out comes back immediately.

I have already spoke to @kevin about it - we should test in stage as best we can even though that's not full load to ensure this fix is all we need to resolve the timeout issues in datagrepper.


ok, we are on postgresql-server-9.2.21-1.el7.x86_64 on db-datanommer02.

I assume we need to do a dump/restore to move to 9.6? And once we upgrade, we would need to do a dump/restore to downgrade again?

Currently a dump (and xz compress) takes about 16 hours. So, I think we are looking at 2 days downtime here. ;( Unless there's some more clever way to do things?

I wasn't able to log into the machine, but here's what I would think should work:

Provided the filesystem is on LVM:

  • stop the database and service
  • take an LVM snapshot
  • upgrade to latest postgres 9.6
  • start database, run pg_upgrade
  • start services see if everything is OK
  • if not OK, then stop service/database and revert to LVM snapshot

I'm not sure if there is space for the snapshot, but that would be the 'quick' way to go about it and I would definitely test that in stage first to be sure there are no other complications from our environment or setup.

Yeah, we could look at doing that.

We may have to add space to allow for the snapshot, but I think that should not be a problem.

testing in stage is largely going to be useless here. staging datagrepper is on a bdr cluster thats using 9.4.12, which it shares with a number of other databases, so it wouldn't be easy to upgrade. We could look to see if the issue is not there on the queries that are slow in prod, but not sure that data is even there.

I'll ponder on if this is worth a freeze break...

Since this is pretty intrusive and a pretty critical service we are going to wait until after freeze to do this.

Note: I was able to adjust the wsgi datagrepper web app to handle things a lot better... so queries are no longer timing out/erroring. We should probibly still do this, but it's not as critical.

So, we did some more digging and found a way to make this better:

  • We added some more indexes. Should help some queries.

  • We found that the websites (the big thing we wanted to fix here) was doing a query that was basically asking the db to get all fedimg messages since the beginning of time and send back a subset that had a specific content. We changed the websites queries to have a definite time bound and to just get all the query data back and look at it itself for the content). That seems to have made the db much happier.

So, I think we will just leave it on el7 for now and try and see if we can keep it working.

Thanks for all the assistance here.

:police_car:

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

6 years ago

Login to comment on this ticket.

Metadata