#9528 koji db issues with koji-gc jobs
Closed: Fixed 3 years ago by kevin. Opened 3 years ago by kevin.

Since upgrading the koji hubs (koji01/koji02) to Fedora 33, we have been seeing a database issue every night around the time the koji-gc jobs run. ;(

The koji db is normally very busy all the time, but during these episodes the load drops to almost nothing and koji processes things very, very slowly.

Killing the koji-gc jobs brings everything back to normal after a minute or two.

There are some wsgi/python crashes in httpd on koji02, but I am not sure if they are related.

I'm attempting to get more debugging from the koji-gc jobs now.


I've seen some issues with the mod_wsgi update in f33 leading to a DoS (it crashes frequently). I've not had much time to look into that yet, but I can reproduce it if needed.

So, I disabled the koji-gc prune job yesterday because I thought that was the one causing issues, but then we got issues again this morning. :(

We are seeing a lot of httpd crashes, but I am not at all sure they are related to the koji-gc issue. Here are the last 2 hours:

TIME                         PID      UID   GID SIG COREFILE  EXE
Mon 2020-12-14 18:01:38 UTC  1460176    48    48  11 present   /usr/sbin/httpd
Mon 2020-12-14 18:01:38 UTC  1460098    48    48  11 present   /usr/sbin/httpd
Mon 2020-12-14 18:01:38 UTC  1460023    48    48  11 present   /usr/sbin/httpd
Mon 2020-12-14 18:13:46 UTC  1460384    48    48   6 present   /usr/sbin/httpd
Mon 2020-12-14 18:15:07 UTC  1460480    48    48   6 present   /usr/sbin/httpd
Mon 2020-12-14 18:15:15 UTC  1460546    48    48   6 present   /usr/sbin/httpd
Mon 2020-12-14 18:15:52 UTC  1460629    48    48  11 present   /usr/sbin/httpd
Mon 2020-12-14 18:21:48 UTC  1460743    48    48  11 present   /usr/sbin/httpd
Mon 2020-12-14 18:22:01 UTC  1460820    48    48  11 present   /usr/sbin/httpd
Mon 2020-12-14 18:22:11 UTC  1460888    48    48  11 present   /usr/sbin/httpd
Mon 2020-12-14 18:22:11 UTC  1461044    48    48   6 present   /usr/sbin/httpd
Mon 2020-12-14 18:22:12 UTC  1460886    48    48  11 present   /usr/sbin/httpd
Mon 2020-12-14 18:22:13 UTC  1461068    48    48  11 present   /usr/sbin/httpd
Mon 2020-12-14 18:22:14 UTC  1461021    48    48  11 present   /usr/sbin/httpd
Mon 2020-12-14 18:39:57 UTC  1461493    48    48   6 present   /usr/sbin/httpd
Mon 2020-12-14 18:41:09 UTC  1461654    48    48   6 present   /usr/sbin/httpd
Mon 2020-12-14 18:41:12 UTC  1461579    48    48  11 present   /usr/sbin/httpd
Mon 2020-12-14 18:41:22 UTC  1461749    48    48  11 present   /usr/sbin/httpd
Mon 2020-12-14 18:41:23 UTC  1461720    48    48  11 present   /usr/sbin/httpd
Mon 2020-12-14 18:41:26 UTC  1461922    48    48  11 present   /usr/sbin/httpd
Mon 2020-12-14 18:41:27 UTC  1461856    48    48  11 present   /usr/sbin/httpd
Mon 2020-12-14 18:41:29 UTC  1461930    48    48  11 present   /usr/sbin/httpd
Mon 2020-12-14 18:44:19 UTC  1462140    48    48  11 present   /usr/sbin/httpd
Mon 2020-12-14 19:01:00 UTC  1462324    48    48  11 present   /usr/sbin/httpd
Mon 2020-12-14 19:01:22 UTC  1462413    48    48  11 present   /usr/sbin/httpd
Mon 2020-12-14 19:01:29 UTC  1462545    48    48  11 present   /usr/sbin/httpd
Mon 2020-12-14 19:01:31 UTC  1462479    48    48  11 present   /usr/sbin/httpd
Mon 2020-12-14 19:01:35 UTC  1462547    48    48  11 present   /usr/sbin/httpd
Mon 2020-12-14 19:01:44 UTC  1462776    48    48  11 present   /usr/sbin/httpd
Mon 2020-12-14 19:02:58 UTC  1462686    48    48  11 present   /usr/sbin/httpd

The last trace:

           PID: 1462686 (httpd)
           UID: 48 (apache)
           GID: 48 (apache)
        Signal: 11 (SEGV)
     Timestamp: Mon 2020-12-14 19:02:55 UTC (3min 44s ago)
  Command Line: /usr/sbin/httpd -DFOREGROUND
    Executable: /usr/sbin/httpd
 Control Group: /system.slice/httpd.service
          Unit: httpd.service
         Slice: system.slice
       Boot ID: 652df1e82f7e406787f45c4a995bbc3d
    Machine ID: ea5b65e667ab426fb7ea71abaeed265b
      Hostname: koji02.iad2.fedoraproject.org
       Storage: /var/lib/systemd/coredump/core.httpd.48.652df1e82f7e406787f45c4a995bbc3d.1462686.1607972575000000.z>
       Message: Process 1462686 (httpd) of user 48 dumped core.

                Stack trace of thread 1462880:
                #0  0x00007fe02135d001 tstate_delete_common (libpython3.9.so.1.0 + 0x1b6001)
                #1  0x00007fe021374d83 _PyThreadState_DeleteCurrent (libpython3.9.so.1.0 + 0x1cdd83)
                #2  0x00007fe021373b0b t_bootstrap (libpython3.9.so.1.0 + 0x1ccb0b)
                #3  0x00007fe021373984 pythread_wrapper (libpython3.9.so.1.0 + 0x1cc984)
                #4  0x00007fe0225713f9 start_thread (libpthread.so.0 + 0x93f9)
                #5  0x00007fe022497903 __clone (libc.so.6 + 0x101903)

                Stack trace of thread 1462686:
                #0  0x00007fe0225779e8 pthread_cond_timedwait@@GLIBC_2.3.2 (libpthread.so.0 + 0xf9e8)
                #1  0x00007fe02129d5ad take_gil (libpython3.9.so.1.0 + 0xf65ad)
                #2  0x00007fe0212cf3c2 PyEval_RestoreThread (libpython3.9.so.1.0 + 0x1283c2)
                #3  0x00007fdfd812cfc3 connection_dealloc (_psycopg.cpython-39-x86_64-linux-gnu.so + 0x18fc3)
                #4  0x00007fe0212ab718 free_keys_object (libpython3.9.so.1.0 + 0x104718)
                #5  0x00007fe0212ab5d8 dict_dealloc (libpython3.9.so.1.0 + 0x1045d8)
                #6  0x00007fe0212c7e21 subtype_dealloc (libpython3.9.so.1.0 + 0x120e21)
                #7  0x00007fe0212ab718 free_keys_object (libpython3.9.so.1.0 + 0x104718)
                #8  0x00007fe0212ab5d8 dict_dealloc (libpython3.9.so.1.0 + 0x1045d8)
                #9  0x00007fe0212ab62f dict_dealloc (libpython3.9.so.1.0 + 0x10462f)
                #10 0x00007fe0212c7e21 subtype_dealloc (libpython3.9.so.1.0 + 0x120e21)
                #11 0x00007fe0212a360b insertdict (libpython3.9.so.1.0 + 0xfc60b)
                #12 0x00007fe021302fc8 _PyModule_ClearDict (libpython3.9.so.1.0 + 0x15bfc8)
                #13 0x00007fe021358dc0 _PyImport_Cleanup (libpython3.9.so.1.0 + 0x1b1dc0)
                #14 0x00007fe0212930b3 Py_EndInterpreter (libpython3.9.so.1.0 + 0xec0b3)
                #15 0x00007fe0215097cd Interpreter_dealloc (mod_wsgi_python3.so + 0xf7cd)
                #16 0x00007fe0212ab718 free_keys_object (libpython3.9.so.1.0 + 0x104718)
                #17 0x00007fe021511944 wsgi_python_child_cleanup (mod_wsgi_python3.so + 0x17944)
                #18 0x00007fe0225b1fbe apr_pool_destroy (libapr-1.so.0 + 0x27fbe)
                #19 0x00007fe021c0faf5 clean_child_exit (mod_mpm_worker.so + 0x3af5)
                #20 0x00007fe021c114e0 child_main (mod_mpm_worker.so + 0x54e0)
                #21 0x00007fe021c11ba5 make_child (mod_mpm_worker.so + 0x5ba5)
                #22 0x00007fe021c126ca worker_run (mod_mpm_worker.so + 0x66ca)
                #23 0x000055a8f09528e0 ap_run_mpm (httpd + 0x288e0)
                #24 0x000055a8f0940eec main (httpd + 0x16eec)
                #25 0x00007fe0223be1e2 __libc_start_main (libc.so.6 + 0x281e2)
                #26 0x000055a8f094103e _start (httpd + 0x1703e)

                Stack trace of thread 1462879:
                #0  0x00007fe022497c4e epoll_wait (libc.so.6 + 0x101c4e)
                #1  0x00007fe00c23bd81 select_epoll_poll (select.cpython-39-x86_64-linux-gnu.so + 0x2d81)
                #2  0x00007fe0212c8c78 method_vectorcall_FASTCALL_KEYWORDS (libpython3.9.so.1.0 + 0x121c78)
                #3  0x00007fe0212af456 _PyEval_EvalFrameDefault (libpython3.9.so.1.0 + 0x108456)
                #4  0x00007fe0212bc4bb function_code_fastcall (libpython3.9.so.1.0 + 0x1154bb)
                #5  0x00007fe0212af456 _PyEval_EvalFrameDefault (libpython3.9.so.1.0 + 0x108456)
                #6  0x00007fe0212bc4bb function_code_fastcall (libpython3.9.so.1.0 + 0x1154bb)
                #7  0x00007fe0212af456 _PyEval_EvalFrameDefault (libpython3.9.so.1.0 + 0x108456)
                #8  0x00007fe0212ade81 _PyEval_EvalCode (libpython3.9.so.1.0 + 0x106e81)
                #9  0x00007fe0212bc206 _PyFunction_Vectorcall (libpython3.9.so.1.0 + 0x115206)
                #10 0x00007fe0212c4f60 method_vectorcall (libpython3.9.so.1.0 + 0x11df60)
                #11 0x00007fe0212b011d _PyEval_EvalFrameDefault (libpython3.9.so.1.0 + 0x10911d)
                #12 0x00007fe0212bc4bb function_code_fastcall (libpython3.9.so.1.0 + 0x1154bb)
                #13 0x00007fe0212c5150 method_vectorcall (libpython3.9.so.1.0 + 0x11e150)
                #14 0x00007fe0212b207c _PyEval_EvalFrameDefault (libpython3.9.so.1.0 + 0x10b07c)
                #15 0x00007fe0212bc4bb function_code_fastcall (libpython3.9.so.1.0 + 0x1154bb)
                #16 0x00007fe0212af456 _PyEval_EvalFrameDefault (libpython3.9.so.1.0 + 0x108456)
                #17 0x00007fe0212bc4bb function_code_fastcall (libpython3.9.so.1.0 + 0x1154bb)
                #18 0x00007fe0212af456 _PyEval_EvalFrameDefault (libpython3.9.so.1.0 + 0x108456)
                #19 0x00007fe0212bc4bb function_code_fastcall (libpython3.9.so.1.0 + 0x1154bb)
                #20 0x00007fe0212c5150 method_vectorcall (libpython3.9.so.1.0 + 0x11e150)
                #2

Seems like it's in cleanup?

One thought I had the other day was to set a tag_filter for koji-gc so it only has to look at the active tags; there shouldn't be anything left to GC in old tags. But it doesn't seem to work:
anything I add to the config file as 'tag_filter' seems to be ignored. ;( Adding it on the command line does work, i.e.,

--tag-filter 'f34' --tag-filter 'f33' --tag-filter 'f32' --tag-filter 'eln' --tag-filter 'coreos' --tag-filter 'epel7' --tag-filter 'epel8'

works. I am not sure why the config file doesn't. Could it be that the option handling isn't in place for tag_filter?
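
For reference, what I was putting in the config file looked roughly like this (assuming the usual [main] section of /etc/koji-gc/koji-gc.conf; the exact multi-value syntax is my guess):

    [main]
    ; hypothetical stanza - this appeared to be ignored entirely
    tag_filter = f34 f33 f32 eln coreos epel7 epel8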

Anyhow, when the issue happens the db is very unloaded... making me think it's a lock somehow? I guess I will try to duplicate it again and get pg_locks output...
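
Something along these lines against the koji db should show who is waiting and on what - just the standard pg_stat_activity / pg_locks views, so only a sketch:

    SELECT a.pid, a.state, a.wait_event_type, a.wait_event,
           now() - a.query_start AS running_for,
           left(a.query, 60) AS query
      FROM pg_stat_activity a
     WHERE a.datname = 'koji' AND a.state <> 'idle'
     ORDER BY running_for DESC;

    -- plus any lock requests that have not been granted yet
    SELECT l.pid, l.locktype, l.mode, l.granted
      FROM pg_locks l
     WHERE NOT l.granted;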

So, I ran koji-gc again manually with debug (doing all the actions).

It actually seems it's the delete... the log output has:

...
Deleting build: xstream-1.4.12-3.eln101
Deleting build: xstream-1.4.12-6.eln103
Deleting build: xterm-330-3.module_470ee1ff
Deleting build: xterm-354-1.fc33
Deleting build: xterm-358-2.eln103
Deleting build: xtl-0.6.12-1.fc33
Deleting build: xtl-0.6.16-1.fc34
Deleting build: xxhash-0.7.1-1.el6
Deleting build: xxhash-0.7.2-1.el6

In the code I see:

        print("Deleting build: %s" % binfo['nvr'])
        mcall.untagBuildBypass(trashcan_tag, binfo['id'])
        mcall.deleteBuild(binfo['id'])

so, it's in these 2 multicalls somehow? We do have a gigantic pile of things to delete, since gc wasn't moving things to the trash for a while.

grep Deleting /tmp/koji-gc-20201214.1 | wc -l
22598

So, can we break down/debug those multicalls? Or split them up so it does them in smaller batches?

That multicall is internally broken into batches of 1000 calls (see the mcall instance). The easiest way to see whether anything changes is to lower this number (e.g. to 100).
Could it be that some other part is slow? E.g. the delete call involves disk access, and deleting builds like texlive could take a long time without any db access.
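
Concretely, lowering the number means changing the construction of the mcall instance - a sketch, with the batch value from the suggestion above:

    # lower the multicall batch size (koji-gc's default is 1000)
    mcall = koji.MultiCallSession(session, batch=100)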

I can also see 'psycopg2.ProgrammingError: can't adapt type 'tuple'' on f33 at the moments when mod_wsgi is also failing. I've no clue why it has started to manifest, as it is independent of the type of call.

So I tried changing it to 10, but I guess I didn't do it correctly. ;(

            print("Deleting build: %s" % binfo['nvr'])
            mcall = koji.MultiCallSession(session, batch=10)
            mcall.untagBuildBypass(trashcan_tag, binfo['id'])
            mcall.deleteBuild(binfo['id'])

But the db load still dropped very low and everything got very slow.

I saw:

 105774 | koji | 3643659 | 105769 | koji | | 10.3.169.104 | | 57130 | 2020-12-16 23:37:59.167789+00 | 2020-12-16 23:53:00.401419+00 | 2020-12-16 23:53:01.395239+00 | 2020-12-16 23:53:01.395239+00 | LWLock | SubtransControlLock | active | | 2220028013 | FETCH 1000 FROM qp_cursor_140285404039344_1794024_0

I guess those are from the multicall? And still 1000?

Any idea how to change it? And is there any way to do some kind of query log here, so we can see what it's sending and what it's getting hung up on? Perhaps drop multicall entirely and make the calls one at a time?
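
For the query-log side, plain postgres slow-statement logging would probably do it; a sketch, assuming superuser access on the db:

    -- log every statement that runs longer than 1s; takes effect on reload
    ALTER SYSTEM SET log_min_duration_statement = '1s';
    SELECT pg_reload_conf();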

Just put it before the loop:

     print("5th pass: deletion")
+    mcall = koji.MultiCallSession(session, batch=10)
     for binfo in continuing:

So, I did that, but it still seemed to be doing batches of 1000?

    print("5th pass: deletion")
    mcall = koji.MultiCallSession(session, batch=10)
    for binfo in continuing:
        # go ahead and delete
        if options.test:
            print("Would have deleted build from trashcan: %s" % binfo['nvr'])
        else:
            print("Deleting build: %s" % binfo['nvr'])
            mcall.untagBuildBypass(trashcan_tag, binfo['id'])
            mcall.deleteBuild(binfo['id'])

    for binfo, result in zip(continuing, mcall.call_all()):
        if isinstance(result, dict):
            print("Warning: deletion failed: %s" % result['faultString'])
            # TODO - log details for delete failures

 105774 | koji | 370720 | 105769 | koji | | 10.3.169.105 | | 38488 | 2020-12-17 20:24:54.225684+00 | 2020-12-17 20:42:07.36945+00 | 2020-12-17 20:42:07.836651+00 | 2020-12-17 20:42:07.836652+00 | LWLock | SubtransControlLock | active | | 2316777238 | FETCH 1000 FROM qp_cursor_140598145378384_1988591_0 | client backend

Any further ideas to track this down?

I don't think it is directly related - there shouldn't be such a big query in any of these calls.

If you're sure that it is coming from this part, I would disable multicall completely and measure the time of each call; let's see which of them is the problem. (I suspect untagBuildBypass - it seems to be quite slow these days, not sure what change is affecting it.)

import time

    print("5th pass: deletion")
    for binfo in continuing:
        # go ahead and delete
        if options.test:
            print("Would have deleted build from trashcan: %s" % binfo['nvr'])
        else:
            print("Deleting build: %s" % binfo['nvr'])
            try:
                t1 = time.time()
                session.untagBuildBypass(trashcan_tag, binfo['id'])
                t2 = time.time()
                print("untagBuildBypass: %.2f" % t2 - t1)
                session.deleteBuild(binfo['id'])
                print("deleteBuild: %.2f" % time.time() - t2)
            except Exception:
                print("Warning: deletion failed: %s" % result['faultString'])

btw, I'm not sure how you are running koji-gc, but it is better for us to run each action separately - the run is shorter if there is a problem, and it can be debugged more easily.

Yeah, we are currently running all the passes separately.

I'll try the above and see what that gets us.
It might be that I wasn't running the correctly modified koji-gc when I changed the multicall... which would explain why it was still 1000. ;(

So, I had to modify that code some... the prints didn't like time.time() (the % formatting binds before the subtraction, so it ended up mixing float and str), and there's no place defining result, so that failed. After tweaking that, though, it ran and finished. :)
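
For the record, the version that actually ran looked roughly like this (parenthesizing the subtractions for the % formatting, and printing the caught exception instead of the undefined result):

    import time

    print("5th pass: deletion")
    for binfo in continuing:
        # go ahead and delete
        if options.test:
            print("Would have deleted build from trashcan: %s" % binfo['nvr'])
        else:
            print("Deleting build: %s" % binfo['nvr'])
            try:
                t1 = time.time()
                session.untagBuildBypass(trashcan_tag, binfo['id'])
                t2 = time.time()
                # parentheses needed: % binds tighter than -
                print("untagBuildBypass: %.2f" % (t2 - t1))
                session.deleteBuild(binfo['id'])
                print("deleteBuild: %.2f" % (time.time() - t2))
            except Exception as e:
                print("Warning: deletion failed: %s" % e)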

I'm going to try the multicall again with 10 or 100... I think I was just running the unmodified file before.

So it really looks like it's the batches of 1000 causing the issues...

So, it ran fine with batch=10... I am now trying the default of 1000 again. My theory is that it will work now, and that it didn't before only because it had too many deletes to do all at once (deletion hadn't been happening for a number of months).

I guess it might be good for upstream to lower the batch size here, in case others hit this?

Will report the results of the next run.

ok, it finished just fine. I think we are back on track... I'll leave it up to you @tkopecek whether you want to adjust anything upstream in this area. Thanks for the help!

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago

Yes, we definitely want to lower the default. We've hit a similar problem with the clone-tag command: there is the same batch size, and it causes many session locks waiting for a long time. The long-term solution is rewriting the session handling to not hold the lock for every operation.
Created https://pagure.io/koji/issue/2616

Issue status updated to: Open (was: Closed)

3 years ago

Issue status updated to: Closed (was: Open)
Issue close_status updated to: Fixed

3 years ago
