#8183 Koji response is slow for user queries with tens of thousands of tasks
Closed: Fixed 4 years ago by kevin. Opened 4 years ago by churchyard.

See https://pagure.io/koji/issue/129 for the original upstream report.

The following pages take ages to load, to the point where they are not usable:

Note that I use my tasks page quite frequently and it sometimes loads fast (within seconds), but sometimes it takes ~5 minutes, possibly depending on the amount of Koji load.


Were they fast(er) in the past? Can you pinpoint when they started to be slow?

Metadata Update from @kevin:
- Issue priority set to: Waiting on Assignee (was: Needs Review)

4 years ago

This was always very slow and flaky, but now I hit it with my own user more often. I can say that a couple of years ago, when https://pagure.io/koji/issue/129 was opened, I only experienced this with the all-tasks view or with the kojira user, but not with my own. However, I assume the problem is only with the number of builds (I have made a couple of builds since then).

Well, we could give the db host more memory; that might help things out.

That will require a (short) outage tho. I could try and do that this weekend...

I really don't think we want to start adding non-upstreamed indexes and such. If there's someone who knows postgres and wants to look at what could be improved upstream, we have a dump of our db publicly available, so they could look and propose changes.
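For anyone who wants to dig in: a rough workflow against a local restore of that dump would be to capture the plan of a slow query, add a candidate index, and compare. The query used here is the per-owner task listing that shows up later in this ticket; the index name and column list are only illustrative, not a vetted proposal:

-- Baseline: capture the current plan and timing for the slow query.
EXPLAIN ANALYZE SELECT * FROM task WHERE owner = 2274 ORDER BY id;

-- Add a candidate index (name and column list are hypothetical).
CREATE INDEX task_owner_id_test_idx ON task (owner, id);
ANALYZE task;

-- Re-run the same query and compare plan and execution time.
EXPLAIN ANALYZE SELECT * FROM task WHERE owner = 2274 ORDER BY id;

-- Clean up the experiment before proposing the change upstream.
DROP INDEX task_owner_id_test_idx;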

Metadata Update from @cverna:
- Issue tagged with: backlog

4 years ago

I increased the memory in the database server a lot (from 32GB to 120GB).

The first link still takes 1:40 for me. The second one is about 5 seconds. The third about 10s.

I stupidly didn't measure them before the memory increase. Is that better? Is it good enough?

I'm not sure there's much more I can do. If upstream can add an index here it might help a lot, but I don't want to diverge from upstream.

The response times were not predictable before, sometimes rather slow, sometimes faster, so I'm not sure if this is better now or not. I will try to report back after a while.

> if upstream can add an index here

I think they would, but what index?
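One candidate, judging from the query plans posted further down in this ticket (a sketch, not something upstream has confirmed): a composite index covering both the owner filter and the id sort order would let PostgreSQL return the rows already ordered and skip the on-disk merge sort entirely. Something like:

-- Hypothetical composite index: serves WHERE owner = ? together with
-- ORDER BY id from a single index scan, avoiding the external sort.
-- CONCURRENTLY avoids holding a long write lock on the busy task table.
CREATE INDEX CONCURRENTLY task_owner_id_idx ON task (owner, id);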

Metadata Update from @mizdebsk:
- Issue assigned to mizdebsk
- Issue tagged with: koji

4 years ago

We already have a deployment-specific index, task_owner_idx1 (which is not in upstream Koji), that is used in these queries. On the production db server the queries are hundreds of times slower than on the staging server, possibly because the server is overloaded, or due to the different PostgreSQL version, e.g.:

Production (PostgreSQL 9.2.24, 120 GB RAM):

koji=# explain analyze select * from task where owner=2274 order by id;
                                                                    QUERY PLAN                                                                    
--------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=434367.52..434703.27 rows=134303 width=1181) (actual time=48640.823..48830.073 rows=141983 loops=1)
   Sort Key: id
   Sort Method: external merge  Disk: 151072kB
   ->  Index Scan using task_owner_idx1 on task  (cost=0.00..304098.17 rows=134303 width=1181) (actual time=1.428..47084.944 rows=141983 loops=1)
         Index Cond: (owner = 2274)
 Total runtime: 48885.472 ms

Staging (PostgreSQL 10.10, 16 GB RAM):

koji=# explain analyze select * from task where owner=2274 order by id;
                                                                        QUERY PLAN                                                                        
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather Merge  (cost=235645.91..245090.04 rows=80944 width=1166) (actual time=216.425..387.668 rows=101025 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Sort  (cost=234645.89..234747.07 rows=40472 width=1166) (actual time=209.557..239.009 rows=33675 loops=3)
         Sort Key: id
         Sort Method: external merge  Disk: 37280kB
         ->  Parallel Index Scan using task_owner_idx1 on task  (cost=0.56..213881.84 rows=40472 width=1166) (actual time=0.046..40.651 rows=33675 loops=3)
               Index Cond: (owner = 2274)
 Planning time: 0.235 ms
 Execution time: 398.720 ms

@mizdebsk

Interesting - the staging one notes that a parallel index scan was used, where production just notes a plain index scan. A lot of the time on production is spent in that index scan. In staging, little of the time is spent on the index scan and most of it is spent on the sort.

Staging also spends about half as much time on sorting as production, though I'd call it the same order of magnitude.
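As an aside, this is probably part of the "different PostgreSQL version" explanation: parallel query only exists from PostgreSQL 9.6, and parallel index scans from 10, so the 9.2 production server cannot use them at all. On staging the relevant setting can be checked with:

SHOW server_version;
SHOW max_parallel_workers_per_gather;  -- PostgreSQL 9.6+ only; 0 disables parallel plans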

I also note that the temporary sort file written to disk for production was several times larger (about 151 MB vs 37 MB). Do we use slow storage for the path that postgres uses for temporary files? If so, that could be an area to investigate. This query wrote a 100 MB+ file to disk. If there were several queries all doing this at once, that would certainly be consistent with the suggestion that the load on production is the difference. I don't know why staging would write a smaller file, unless that is also explained by either the newer PG version or by the lower load (perhaps it only does this when it doesn't have much RAM free).
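If the on-disk sort itself is the concern, the relevant settings would be work_mem (a sort spills to disk once it exceeds that) and temp_tablespaces (which controls where the spill files land). A rough way to check and experiment, without knowing the actual production config:

-- Where do temporary sort files go, and how much memory may a sort use
-- before spilling to disk?
SHOW temp_tablespaces;  -- empty means pgsql_tmp under the default tablespace
SHOW work_mem;

-- Per-session experiment: a work_mem larger than the ~150 MB spill seen
-- above might keep the sort entirely in memory.
SET work_mem = '256MB';
EXPLAIN ANALYZE SELECT * FROM task WHERE owner = 2274 ORDER BY id;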

Just more data: prod db-koji01 is on ssd storage. stg is on spinning disk.

OK, I did a bit more tuning on prod. It turns out we were not using memory very efficiently there at all. Can you retry your queries and see if they are good enough now?
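The exact values applied aren't recorded in this ticket; for reference, the memory-related settings that usually matter after a RAM bump like this are roughly the ones below (the values in the comments are purely illustrative, and on 9.2 they live in postgresql.conf and need a reload or restart):

SHOW shared_buffers;        -- commonly ~25% of RAM, e.g. ~30GB on a 120GB host
SHOW effective_cache_size;  -- planner hint, commonly ~50-75% of RAM
SHOW work_mem;              -- per-sort memory; too low forces on-disk sorts
SHOW maintenance_work_mem;  -- used by VACUUM and CREATE INDEX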

They seem to be. Thanks. Should we keep this open to gather more long term data points, or close and only reopen if needed?

I'd say close and re-open if needed. :)

Thanks!

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

4 years ago
