#8292 pg_dump causes high load and poor performance in koji
Closed: Fixed 13 days ago by kevin. Opened a month ago by kevin.

In order to make the koji db more performant, I moved the vm from 32 to 120gb memory last weekend.

Then, I tried to adjust postgresql options several times.

Currently the normal performance is just fine, but when a pg_dump runs the load on the db server goes from 15-20 to 150 and performance of the web app goes way down.

We did discover today an odd breakage in multipath (it's multipathing to a single disk in the array). This may be related to the issues.

At 21UTC today we are going to try and fix the multipath issue and tune things again to get dumps working moving forward. We hope to do this without causing any outages.


Don't know if related, today, starting around 14:15 UTC the page https://koji.fedoraproject.org/koji/taskinfo?taskID=38235370 took in average 45 seconds to load for me. This behavior is still present. I consider that time much slower than usual.

@codeblock looked into this last night and found several of the pages have poorly performing (aka wat???) queries with multiple joins which seem to be taking a very long time to process now. There was a slight schema change between 1.17 and 1.18 which may be the cause of this.. or we have been very very lucky for a while and now it is not.

Currently we are going to need a koji developer and someone with dba chops to figure out why this is happening.. as the dump should not be doing anything which would cause this.

Are the problematic queries and plans available? If so, @tkopecek may be able to provide guidance

https://infrastructure.fedoraproject.org/infra/db-dumps/koji-2019-10-09.dump.xz is the last db backup we have that completed (excluding the sessions table).

The dumps get to the buildroot_listing table when the load goes up. That table is 203 GB in size. ;(

https://infrastructure.fedoraproject.org/cgit/ansible.git/tree/roles/postgresql_server/templates/postgresql.conf is our config
https://infrastructure.fedoraproject.org/cgit/ansible.git/tree/inventory/host_vars/db-koji01.phx2.fedoraproject.org is the host config.

Happy to provide any further info or try things...

I do see some old locks while running a dump. All accesssharelocks tho.

datname | relation | transactionid | mode |
granted | usename |

  query

                                                                                            |  
    query_start          |       age        |  pid

---------+---------------------------------------------+---------------+--------------------------+
---------+----------+------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------+--
-----------------------------+------------------+-------
koji | pg_ts_config_oid_index | | AccessShareLock |
t | postgres | COPY public.buildroot_listing (buildroot_id, rpm_id, is_update) TO stdout;

                                                                                            | 2

019-10-12 17:11:41.319834+00 | 02:42:32.160655 | 18185
koji | build_notifications_block | | AccessShareLock |
t | postgres | COPY public.buildroot_listing (buildroot_id, rpm_id, is_update) TO stdout;

                                                                                            | 2

019-10-12 17:11:41.319834+00 | 02:42:32.160655 | 18185
koji | lock_monitor.transaction_locks | | AccessShareLock |
t | postgres | COPY public.buildroot_listing (buildroot_id, rpm_id, is_update) TO stdout;

                                                                                            | 2

019-10-12 17:11:41.319834+00 | 02:42:32.160655 | 18185
koji | pg_attrdef_adrelid_adnum_index | | AccessShareLock |
t | postgres | COPY public.buildroot_listing (buildroot_id, rpm_id, is_update) TO stdout;

                                                                                            | 2

019-10-12 17:11:41.319834+00 | 02:42:32.160655 | 18185

select count(*) from pg_locks;

count

3549

So although #8293 was closed as a duplicate of this bug, now f31/f30 newrepo does not appear for more than 210 min (as of 2019-10-14 10:10 UTC = 19:10 JST) after I submitted override request, and I cannot proceed further builds. And f32 newrepo does not appeare for more than 150 min.

Can't there be any immediate workaround for newrepo issue regardless of the status of this bug?

Anyone with 'admin' or 'repo' privs can trigger newRepo task manually. Nevertheless, it seems, that kojira is maybe not running at all? But yes, it could be also those ultra-slow queries.

@kevin Do you've some listing of long-running queries? Or pg plans for them?

Also some queries in koji require exclusive locks, which can interfere with pg_dump, theoretically causing deadlocks. It should be visible in long-running queries log.

I've stopped the current dump, so things should go back to more normal. kojira is definitely running, it's just not firing off newrepo tasks because of the database dump.

We are currently not logging any queries, can put that in place later today.

I'm going to try:

log_lock_waits = on
log_statement = 'all'
log_rotation_size = 10000

but that might swamp us in logs... Any suggestions on how we should best set logging here?

Oh, one other curious thing: pg_top notes:

DB activity: 74 tps, 190 rollbs/s, 2381 buffer r/s, 99 hit%, 313591 row r/s, 160 row w/s

should we really be seeing that many rollbacks per second? is there some common path thing that does rollbacks in normal operation?

ok, we have put in some long query logging and I have started another dump to show the issue.

so far we are seeing a lot of:

LOG:  process 21131 acquired ShareLock on transaction 3337332504 after 2769.780 ms
CONTEXT:  SQL statement "SELECT 1 FROM ONLY "public"."sessions" x WHERE "id" OPERATOR(pg_catalog.=)
 $1 FOR SHARE OF x"
STATEMENT:  UPDATE sessions
                SET expired=TRUE,exclusive=NULL
                WHERE id = 75626281 OR master = 75626281

(of course where the id and master and transaction are different, but the statement is the same form.

We do have close to 100k sessions:

select count(*) from sessions;

count

93419
(1 row)

We have also seen a number of these:

LOG:  duration: 11500.139 ms  statement: 
        SELECT buildroot.container_arch, buildroot.br_type, buildroot.cg_id, content_generator.name
, buildroot.cg_version, buildroot.container_arch, buildroot.container_type, create_events.id, creat
e_events.time, EXTRACT(EPOCH FROM create_events.time), buildroot.extra, buildroot.host_arch, host.i
d, host.name, buildroot.host_os, buildroot.id, buildroot_listing.is_update, repo_create.id, repo_cr
eate.time, repo.id, repo.state, retire_events.id, retire_events.time, EXTRACT(EPOCH FROM retire_eve
nts.time), standard_buildroot.state, tag.id, tag.name, standard_buildroot.task_id
          FROM buildroot
          JOIN buildroot_listing ON buildroot.id = buildroot_listing.buildroot_id
          LEFT OUTER JOIN standard_buildroot ON standard_buildroot.buildroot_id = buildroot.id
          LEFT OUTER JOIN content_generator ON buildroot.cg_id = content_generator.id
          LEFT OUTER JOIN host ON host.id = standard_buildroot.host_id
          LEFT OUTER JOIN repo ON repo.id = standard_buildroot.repo_id
          LEFT OUTER JOIN tag ON tag.id = repo.tag_id
          LEFT OUTER JOIN events AS create_events ON create_events.id = standard_buildroot.create_e
vent
          LEFT OUTER JOIN events AS retire_events ON standard_buildroot.retire_event = retire_event
s.id
          LEFT OUTER JOIN events AS repo_create ON repo_create.id = repo.create_event
         WHERE (buildroot_listing.rpm_id = 13571464)
         ORDER BY buildroot.id DESC

         LIMIT 50

I expect when we hit the slowness we should see a lot of stuff logging waiting for locks...

Another really slow one:

LOG:  duration: 509550.809 ms  statement: 
        SELECT task.arch, task.awaited, task.channel_id, task.completion_time, EXTRACT(EPOCH FROM completion_time), task.create_time, EXTRACT(EPOCH FROM create_time), task.host_id, task.id, task.label, task.method, task.owner, users.name, users.usertype, task.parent, task.priority, task.request, task.result, task.start_time, EXTRACT(EPOCH FROM task.start_time), task.state, task.waiting, task.weight
          FROM task
          LEFT JOIN users ON task.owner = users.id
         WHERE (method = 'createLiveCD')
           AND (state IN (2))
         ORDER BY task.id DESC

         LIMIT 50

Now tons of these:

LOG:  duration: 11852.985 ms  statement: 
        SELECT buildroot.container_arch, buildroot.br_type, buildroot.cg_id, content_generator.name
, buildroot.cg_version, buildroot.container_arch, buildroot.container_type, create_events.id, creat
e_events.time, EXTRACT(EPOCH FROM create_events.time), buildroot.extra, buildroot.host_arch, host.i
d, host.name, buildroot.host_os, buildroot.id, repo_create.id, repo_create.time, repo.id, repo.stat
e, retire_events.id, retire_events.time, EXTRACT(EPOCH FROM retire_events.time), standard_buildroot
.state, tag.id, tag.name, standard_buildroot.task_id
          FROM buildroot
          LEFT OUTER JOIN standard_buildroot ON standard_buildroot.buildroot_id = buildroot.id
          LEFT OUTER JOIN content_generator ON buildroot.cg_id = content_generator.id
          LEFT OUTER JOIN host ON host.id = standard_buildroot.host_id
          LEFT OUTER JOIN repo ON repo.id = standard_buildroot.repo_id
          LEFT OUTER JOIN tag ON tag.id = repo.tag_id
          LEFT OUTER JOIN events AS create_events ON create_events.id = standard_buildroot.create_e
vent
          LEFT OUTER JOIN events AS retire_events ON standard_buildroot.retire_event = retire_event
s.id
          LEFT OUTER JOIN events AS repo_create ON repo_create.id = repo.create_event
         WHERE (host.id = 287)
           AND (standard_buildroot.state IN (0, 1, 2))

(about 2100 of them so far...)

And these:

LOG:  duration: 28410.968 ms  statement: 
            SELECT host.id,name,arches,task_load,capacity FROM host
                JOIN sessions USING (user_id)
                JOIN host_config ON host.id = host_config.host_id
            WHERE enabled = TRUE AND ready = TRUE
                AND expired = FALSE
                AND master IS NULL
                AND update_time > NOW() - '5 minutes'::interval
                AND active IS TRUE

How many records are in sessions table? Not sure, if you've cleaning of the table set up. If there is more than few thousands, you should run:

DELETE FROM sessions WHERE update_time < NOW() - '1 day'::interval;
VACUUM sessions;

koji PR: #1492

Yes, we have a daily cron that does the delete at least. It doesn't currently do a vacuum.

Should we try running this more often? Or do you thing the vacuum matters that much?

It needn't to be done often, but I've noticed, as it is a table which gets updated often, that it can easily grow to many pages. Than even index becomes cluttered and joins (which are run on any call from builder to hub) are pain, taking tens of seconds. Running vacuum/analyze cca weekly helped me a lot to improve overall system performance. Not sure, if that is your case, but it looks from that host select, that it could be.

Well, I tried running a pg_dump and then periodically vacuuming the sessions table... didn't seem to help. ;( Afer a while things got slow again... it perhaps did a bit better, but it still wasn't really acceptable.

Also, our dump excludes the sessions table entirely.

Any other ideas on what could be happening here? If we can't come up with a db fix, I may need to look at making a hot standby and try doing the backups on that...

ok, so I setup a hot-standby and ran the dump on there. Sadly, that caused the master to do the same thing.

I am pretty sure whats happening here is:

  • The buildroot_listing table is 203GB.
  • Lots of things insert new data into this table (anytime a new buildroot exists or a build finishes right?) And worse, they insert one row per rpm that is in the buildroot. Thats tons of inserts * tons of processes.
  • when doing a pg_dump, postgres at least partially locks that table so it can get a consistent dump of it.
  • This drives the load way up and all the things that insert into that table are delayed.

Perhaps this table could be partitioned? say into 'recent' and 'archive' ? Then all the inserts would be done on the small 'recent' partition and things could slowly be moved to the large (but mostly not used) archive over time?
Or even a new partition per month or something? ie, always insert into the current months partition, if there isn't one, create one and insert to it. With constraint exclusion, this could even make queries a lot faster (it would avoid partitions that couldn't possibly satisfy the query).

Or if thats too complex, perhaps the buildroot_listing could be changed to be a buildroot id and then an array of the rpm_ids instead of pairs of values (which results in tons more inserts). That would collapse a bunch of inserts into one.

The only other really large table is 'task' (for us it's 47 GB). It has a number of the same constraints... it's inserted into a lot, but often all queries are for recent tasks. So, perhaps it could be also partitioned?

Thoughts?

We did manage to get a goo backup from the hot standby yesterday. Not sure what changed...

ok, I managed to I think tweak things so we can get a good, consistent backup from the hot standby.

I'd really like to ask for improvement upstream on this however. I am sure it will hit more folks, and koji could definitely be more efficent here.

Theres several upstream tickets already related to this:

https://pagure.io/koji/issue/1707 (dealing with old, rarely accessed data)
https://pagure.io/koji/issue/1712 (use bulk inserts instead of inserting 100's to 1000's of times per buildroot)

Do you want me to file a more general database performance ticket? Or just leave it to those two?

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

13 days ago

I've added also https://pagure.io/koji/issue/1746 for partitioning the tables. I don't think, that general db performance is useful.

Login to comment on this ticket.

Metadata