#2095 Dispatcher is slow for queue with 70k+ jobs
Opened 5 months ago by praiskup. Modified 4 months ago

I measured today, and:

  • got a timeout on the /pending-jobs/ route
  • 07:11:33 requesting jobs...
  • 07:12:02 tasks requested, and backend started filling priority queue
  • 07:12:33 priority queue filled (31s!), and manager.run() started
  • several builds (~20) have started, but we did not process the whole queue
  • 07:12:49 last build started
  • 07:12:53 run() stopped (only 20s in the run() method, because sleeptime=20)
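For context, merely filling an in-memory priority queue with 70k entries should take a tiny fraction of a second, so the 31s is almost certainly spent elsewhere (fetching or constructing the tasks, not queueing them). A quick `heapq` sketch with hypothetical task tuples, not the actual Copr code:

```python
import heapq
import random
import time

# Hypothetical stand-ins for the backend's pending tasks; the real
# objects come from the /pending-jobs/ route.
tasks = [(random.randint(0, 10), i) for i in range(70_000)]

start = time.monotonic()
queue = []
for priority, task_id in tasks:
    heapq.heappush(queue, (priority, task_id))
elapsed = time.monotonic() - start

print(f"filled a 70k-entry heap in {elapsed:.3f}s")
```

On ordinary hardware this finishes orders of magnitude faster than the observed 31s, which points the finger at task retrieval/construction rather than the queue itself.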

As an immediate hotfix, I increased the sleep time to 40s (to start more jobs per cycle). But I'm sure we should be able to gather the low-hanging fruit:

  • fix the slow priority-queue filling
  • make the run() timeout dynamic (depending on how many tasks are in the queue)
  • generally speed up the run() loop
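A dynamic run() timeout could be as simple as scaling the sleep time with the queue length. A minimal sketch; `BASE_SLEEP`, `PER_TASK_BUDGET`, and `MAX_SLEEP` are made-up constants, not existing Copr settings:

```python
# Sketch of a dynamic sleeptime: scale with the number of queued tasks,
# clamped to a sane upper bound. All constants here are hypothetical.
BASE_SLEEP = 20          # seconds; the current static value
PER_TASK_BUDGET = 0.01   # rough per-task processing cost in seconds
MAX_SLEEP = 300          # never block the dispatcher longer than this

def dynamic_sleeptime(queue_length: int) -> float:
    """Return how long run() should keep processing the queue."""
    return min(BASE_SLEEP + queue_length * PER_TASK_BUDGET, MAX_SLEEP)

print(dynamic_sleeptime(0))       # 20.0 with an empty queue
print(dynamic_sleeptime(70_000))  # capped at 300 for the huge queue
```

The point is that a short queue keeps the current 20s behavior, while a 70k-job backlog gets enough budget to actually drain.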

And slightly more work:
- the /pending-jobs/ route should be optimized
- we should speed up the worker-starting mechanism (the logic that starts a backend worker seems to take 0.5s to 1s, and it is not done in parallel)
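Starting workers concurrently could hide most of that 0.5-1s per-worker cost. A sketch with a thread pool; `start_worker` is a hypothetical stand-in for the real spawning logic:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def start_worker(task_id):
    """Hypothetical stand-in for the backend's worker-spawning logic,
    which reportedly takes 0.5s-1s per worker."""
    time.sleep(0.5)
    return task_id

task_ids = list(range(20))

start = time.monotonic()
with ThreadPoolExecutor(max_workers=20) as pool:
    started = list(pool.map(start_worker, task_ids))
elapsed = time.monotonic() - start

# Sequentially the 20 starts above would take ~10s; in parallel ~0.5s.
print(f"started {len(started)} workers in {elapsed:.2f}s")
```

Since the per-worker delay is presumably dominated by I/O (forking, SSH, HTTP, ...), threads should be enough; no multiprocessing needed.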


Btw., the turn-around (if we are lucky and there are no timeouts on the /pending-jobs/ route) is about 1.5 minutes. So:
- user submits the build
- the task is up to 1.5 minutes in pending
- then the source build is relatively quickly processed and imported
- and then the binary RPM builds are waiting, again 1.5 minutes

That's weird. I imported the database dump with the 70k pending builds locally, and:

$ time curl http://127.0.0.1:5000/backend/pending-jobs/
...
real    0m16.874s
user    0m0.011s
sys     0m0.079s

Curl is I/O bound in this case. Or can you elaborate?

Sorry, my mistake. I should have explained myself better. My point is that in my container, loading http://127.0.0.1:5000/backend/pending-jobs/ with the gigantic 70k-build queue takes only ~16s (while in production it is a minute and a half). I measured it, and 99% of the time is spent querying things from the database; converting the result to JSON is almost instant.
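One way to double-check that split is to time the two phases separately. A sketch; `fetch_pending_jobs` is a hypothetical stand-in for the ORM query behind /pending-jobs/, not the actual Copr code:

```python
import json
import time

def fetch_pending_jobs():
    """Hypothetical stand-in for the DB query behind /pending-jobs/.
    In production this is where ~99% of the time reportedly goes."""
    return [{"id": i, "priority": i % 10, "state": "pending"}
            for i in range(70_000)]

t0 = time.monotonic()
jobs = fetch_pending_jobs()
t1 = time.monotonic()
payload = json.dumps(jobs)
t2 = time.monotonic()

# With a real ORM the first delta dominates; serializing 70k small
# dicts to JSON is comparatively cheap.
print(f"query: {t1 - t0:.3f}s, json: {t2 - t1:.3f}s")
```

If the same split shows up in production, the optimization target is the query (fewer columns, eager loading, or raw SQL), not the serializer.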

Maybe it is possible that our postgres is somehow messed up (remember the slow monitor issue on copr-fe-dev)? Or am I drawing wrong conclusions?

(while in production it is a minute and a half)

It was about ~30 seconds, 07:11:33 to 07:12:02. The other code on the backend side added the extra 60s penalty.

Maybe it is possible that our postgres is somehow messed up (remember the slow monitor issue on copr-fe-dev)? Or am I drawing wrong conclusions?

This doesn't seem to be the case, in my opinion. I suspect that even on copr-fe-dev the situation was about a paralyzed PostgreSQL (starved on memory, swap, I/O, CPU..., dunno) rather than broken DB files.

Commit 5a55d53 relates to this ticket


Metadata
Related Pull Requests
  • #2139 Merged 4 months ago