#2535 koji-gc trash traceback
Closed: Fixed 2 years ago by tkopecek. Opened 2 years ago by kevin.

With koji-1.22.1, trying to run koji-gc --action=trash:

Using option ('main', 'keytab') from config file
Using option ('main', 'principal') from config file
Using option ('main', 'serverca') from config file
Using option ('main', 'server') from config file
Using option ('main', 'weburl') from config file
Using option ('main', 'smtp_host') from config file
Using option ('main', 'from_addr') from config file
Using option ('main', 'unprotected_keys') from config file
delay: 432000 seconds
grace_period: 2419200 seconds
successfully connected to hub
Getting untagged builds...
...got 145234 builds
1st pass: blacklist
2nd pass: references
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/koji/__init__.py", line 2735, in _callMethod
    return self._sendCall(handler, headers, request)
  File "/usr/lib/python3.8/site-packages/koji/__init__.py", line 2653, in _sendCall
    return self._sendOneCall(handler, headers, request)
  File "/usr/lib/python3.8/site-packages/koji/__init__.py", line 2699, in _sendOneCall
    ret = self._read_xmlrpc_response(r)
  File "/usr/lib/python3.8/site-packages/koji/__init__.py", line 2711, in _read_xmlrpc_response
    result = u.close()
  File "/usr/lib64/python3.8/xmlrpc/client.py", line 655, in close
    raise Fault(**self._stack[0])
xmlrpc.client.Fault: <Fault 1009: 'unable to retry call 80 (method multiCall) for session 114124448'>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/sbin/koji-gc", line 1012, in <module>
    rv = main(args)
  File "/sbin/koji-gc", line 443, in main
    globals()['handle_' + x]()
  File "/sbin/koji-gc", line 474, in handle_trash
    for binfo, [refs] in six.moves.zip(continuing, mcall.call_all()):
  File "/usr/lib/python3.8/site-packages/koji/__init__.py", line 3168, in call_all
    _results = self._session._callMethod('multiCall', args, {})
  File "/usr/lib/python3.8/site-packages/koji/__init__.py", line 2754, in _callMethod
    raise err
koji.RetryError: unable to retry call 80 (method multiCall) for session 114124448
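For context, a hedged sketch of the hub-side replay protection behind this error (not actual koji code; the real logic lives in koji/auth.py): the hub tracks the highest call number seen per session. When a client retries a request it already sent (for example, after a dropped connection), the hub can only answer if it cached that call's response; otherwise it raises RetryError, which is what the client surfaces above.

```python
# Illustrative sketch of per-session replay protection, not koji's
# actual implementation.
class RetryError(Exception):
    pass

class SessionState:
    def __init__(self):
        self.lastcall = -1
        self.cached = {}  # callnum -> saved response, for replayable calls

    def dispatch(self, callnum, method, run):
        if callnum <= self.lastcall:
            # This is a retry of a call we already saw.
            if callnum in self.cached:
                return self.cached[callnum]
            raise RetryError(
                "unable to retry call %d (method %s)" % (callnum, method))
        self.lastcall = callnum
        return run()
```

In this model, a retried multiCall whose response was never cached fails exactly the way the traceback shows.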

Happy to gather more info...


Looks like a temporary refusal from the hub. Can you see anything related to this in the hub log?

Metadata Update from @tkopecek:
- Custom field Size adjusted to None

2 years ago
pid 2522737:tid 2522786] [client 10.3.163.76:34088] 2020-10-12 12:46:07,087 [WARNING] m=multiCall u=None p=2522737 r=10.3.163.76:34088 koji.xmlrpc: Traceback (most recent call last): 
pid 2522737:tid 2522786] [client 10.3.163.76:34088]   File "/usr/share/koji-hub/kojixmlrpc.py", line 253, in _wrap_handler
pid 2522737:tid 2522786] [client 10.3.163.76:34088]     response = handler(environ)
pid 2522737:tid 2522786] [client 10.3.163.76:34088]   File "/usr/share/koji-hub/kojixmlrpc.py", line 277, in handle_rpc
pid 2522737:tid 2522786] [client 10.3.163.76:34088]     return self._dispatch(method, params)
pid 2522737:tid 2522786] [client 10.3.163.76:34088]   File "/usr/share/koji-hub/kojixmlrpc.py", line 301, in _dispatch
pid 2522737:tid 2522786] [client 10.3.163.76:34088]     self.check_session()
pid 2522737:tid 2522786] [client 10.3.163.76:34088]   File "/usr/share/koji-hub/kojixmlrpc.py", line 283, in check_session
pid 2522737:tid 2522786] [client 10.3.163.76:34088]     context.session = koji.auth.Session()
pid 2522737:tid 2522786] [client 10.3.163.76:34088]   File "/usr/lib/python3.8/site-packages/koji/auth.py", line 147, in __init__
pid 2522737:tid 2522786] [client 10.3.163.76:34088]     raise koji.RetryError(
pid 2522737:tid 2522786] [client 10.3.163.76:34088] koji.RetryError: unable to retry call 80 (method multiCall) for session 114456532
pid 2522737:tid 2522786] [client 10.3.163.76:34088] 
pid 2522737:tid 2522786] [client 10.3.163.76:34088] 87 [WARNING] m=None u=None p=2522737 r=?:? koji.xmlrpc: Traceback (most recent call last): 
pid 2522737:tid 2522786] [client 10.3.163.76:34088]   File "/usr/share/koji-hub/kojixmlrpc.py", line 253, in _wrap_handler
pid 2522737:tid 2522786] [client 10.3.163.76:34088]     response = handler(environ)
pid 2522737:tid 2522786] [client 10.3.163.76:34088]   File "/usr/share/koji-hub/kojixmlrpc.py", line 277, in handle_rpc
pid 2522737:tid 2522786] [client 10.3.163.76:34088]     return self._dispatch(method, params)
pid 2522737:tid 2522786] [client 10.3.163.76:34088]   File "/usr/share/koji-hub/kojixmlrpc.py", line 301, in _dispatch
pid 2522737:tid 2522786] [client 10.3.163.76:34088]     self.check_session()
pid 2522737:tid 2522786] [client 10.3.163.76:34088]   File "/usr/share/koji-hub/kojixmlrpc.py", line 283, in check_session
pid 2522737:tid 2522786] [client 10.3.163.76:34088]     context.session = koji.auth.Session()
pid 2522737:tid 2522786] [client 10.3.163.76:34088]   File "/usr/lib/python3.8/site-packages/koji/auth.py", line 147, in __init__
pid 2522737:tid 2522786] [client 10.3.163.76:34088]     raise koji.RetryError(
pid 2522737:tid 2522786] [client 10.3.163.76:34088] koji.RetryError: unable to retry call 80 (method multiCall) for session 114456532

Hmm, interesting that it got into the logs. Is it still happening, or is it working ok now?

It has been happening every day since 2020-09-11, which is the day we upgraded to 1.22.1. ;)

I see it was actually failing on 1.22.0 also, but with a different traceback:

Getting untagged builds...
...got 130833 builds
1st pass: blacklist
2nd pass: references
Traceback (most recent call last):
  File "/usr/sbin/koji-gc", line 1012, in <module>
    rv = main(args)
  File "/usr/sbin/koji-gc", line 443, in main
    globals()['handle_' + x]()
  File "/usr/sbin/koji-gc", line 474, in handle_trash
    for binfo, [refs] in six.moves.zip(continuing, mcall.call_all()):
ValueError: too many values to unpack (expected 1)
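This ValueError is consistent with koji's multiCall result convention: each per-call result is either a one-element list [value] or a fault dict with faultCode/faultString keys. Unpacking a two-key fault dict into a single name raises exactly "too many values to unpack (expected 1)". A minimal illustration (illustrative data, not koji-gc's actual values):

```python
# Reproduces the unpack crash shape, assuming koji's multiCall
# convention: each result is [value] or a fault dict such as
# {"faultCode": ..., "faultString": ...}.
def unpack_results(results):
    refs_list = []
    for [refs] in results:  # assumes the [value] form
        refs_list.append(refs)
    return refs_list

unpack_results([[["some-ref"]]])   # fine: refs == ["some-ref"]
# unpack_results([{"faultCode": 1009, "faultString": "..."}])
# raises ValueError: too many values to unpack (expected 1),
# because unpacking a dict iterates over its two keys.
```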

and... actually, on 2020-08-07 we upgraded to 1.22.0 (from 1.21.1) and it was giving the same traceback as it is now:

Getting untagged builds...
...got 120740 builds
1st pass: blacklist
2nd pass: references
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/koji/__init__.py", line 2791, in _callMethod
    return self._sendCall(handler, headers, request)
  File "/usr/lib/python3.8/site-packages/koji/__init__.py", line 2709, in _sendCall
    return self._sendOneCall(handler, headers, request)
  File "/usr/lib/python3.8/site-packages/koji/__init__.py", line 2755, in _sendOneCall
    ret = self._read_xmlrpc_response(r)
  File "/usr/lib/python3.8/site-packages/koji/__init__.py", line 2767, in _read_xmlrpc_response
    result = u.close()
  File "/usr/lib64/python3.8/xmlrpc/client.py", line 655, in close
    raise Fault(**self._stack[0])
xmlrpc.client.Fault: <Fault 1009: 'unable to retry call 89 (method multiCall) for session 107743863'>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/sbin/koji-gc", line 1026, in <module>
    rv = main(args)
  File "/usr/sbin/koji-gc", line 463, in main
    globals()['handle_' + x]()
  File "/usr/sbin/koji-gc", line 494, in handle_trash
    for binfo, [refs] in six.moves.zip(continuing, mcall.call_all()):
  File "/usr/lib/python3.8/site-packages/koji/__init__.py", line 3223, in call_all
    _results = self._session._callMethod('multiCall', args, {})
  File "/usr/lib/python3.8/site-packages/koji/__init__.py", line 2810, in _callMethod
    raise err
koji.RetryError: unable to retry call 89 (method multiCall) for session 107743863

The last time it got past this was 2020-06-26. The output from that run:

Getting untagged builds...
...got 109805 builds
1st pass: blacklist
2nd pass: references
Skipping build trac-0.11.4-1.fc12. Keys: ['d2933660', 'f13bd4d5']
Skipping build trac-git-plugin-0.11.0.2-4.20090511svn5396.fc12. Keys: ['d2933660']
Skipping build rpld-1.8-0.10.beta1.fc14. Keys: ['97a1071f']
Skipping build quagga-0.99.20-1.fc14. Keys: ['97a1071f']
Skipping build system-config-date-1.9.66-1.fc14. Keys: ['97a1071f']
Skipping build tito-0.3.3-1.fc14. Keys: ['97a1071f']
Skipping build amide-0.9.2-4.fc14. Keys: ['97a1071f']
Skipping build radvd-1.8.2-1.fc14. Keys: ['97a1071f']
Skipping build hplip-3.11.10-2.fc14. Keys: ['97a1071f']
...snip...
[109359/109805] Adding build to trash list: Fedora-Security-Live-Rawhide-20200525.n.0
[109364/109805] Adding build to trash list: OpenImageIO-2.1.15.0-2.el8
Traceback (most recent call last):
  File "/usr/sbin/koji-gc", line 1026, in <module>
    rv = main(args)
  File "/usr/sbin/koji-gc", line 463, in main
    globals()['handle_' + x]()
  File "/usr/sbin/koji-gc", line 550, in handle_trash
    last = max(history)[1]
TypeError: '>' not supported between instances of 'dict' and 'dict'
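This TypeError is the classic Python 3 breakage: max() over a list of dicts compares the dicts themselves, which Python 3 refuses (Python 2 allowed arbitrary dict ordering, so it silently "worked" there). The fix referenced later in the thread (PR #2327) is along the lines of comparing on an explicit key; "create_event" below is a stand-in field, not necessarily what koji-gc actually sorts on.

```python
# Illustrative sketch of the py2 -> py3 breakage and the fix shape.
history = [{"create_event": 100}, {"create_event": 250}]

# max(history)  # raises TypeError on Python 3: dicts aren't orderable

# Fixed form: compare on an explicit key.
last = max(history, key=lambda h: h["create_event"])  # {"create_event": 250}
```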

So, I guess it used to work in 1.21.1 (but hit the "'>' not supported" error toward the end), and then 1.22.0 and 1.22.1 broke in different ways?

Anyhow, happy to run things manually or try a debug version to help track this down. It's somewhat important, as it means we are not freeing up any storage, and that's not good. ;(

@kevin Try current master. The dict issue was solved in https://pagure.io/koji/pull-request/2327 (the current version nevertheless runs only on py3; we've dropped py2 support in the meantime). There are a few more changes that could help, and we're going to release 1.23 in a few days.

Well, the dict thing was a while back when we were running 1.21.0 I think... all the recent tracebacks were the multicall retry.

I'll try the one from current master....

The one from master crashes the same way:

successfully connected to hub
Getting untagged builds...
...got 146427 builds
1st pass: blacklist
2nd pass: references
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/koji/__init__.py", line 2735, in _callMethod
    return self._sendCall(handler, headers, request)
  File "/usr/lib/python3.8/site-packages/koji/__init__.py", line 2653, in _sendCall
    return self._sendOneCall(handler, headers, request)
  File "/usr/lib/python3.8/site-packages/koji/__init__.py", line 2699, in _sendOneCall
    ret = self._read_xmlrpc_response(r)
  File "/usr/lib/python3.8/site-packages/koji/__init__.py", line 2711, in _read_xmlrpc_response
    result = u.close()
  File "/usr/lib64/python3.8/xmlrpc/client.py", line 655, in close
    raise Fault(**self._stack[0])
xmlrpc.client.Fault: <Fault 1009: 'unable to retry call 80 (method multiCall) for session 114750472'>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/fedora/kevin/koji-gc", line 1008, in <module>
    rv = main(args)
  File "/home/fedora/kevin/koji-gc", line 439, in main
    globals()['handle_' + x]()
  File "/home/fedora/kevin/koji-gc", line 470, in handle_trash
    for binfo, [refs] in zip(continuing, mcall.call_all()):
  File "/usr/lib/python3.8/site-packages/koji/__init__.py", line 3168, in call_all
    _results = self._session._callMethod('multiCall', args, {})
  File "/usr/lib/python3.8/site-packages/koji/__init__.py", line 2754, in _callMethod
    raise err
koji.RetryError: unable to retry call 80 (method multiCall) for session 114750472

Can you try this? It just doesn't use multicall. So, if it fails, we would see the exact build, etc.

https://pagure.io/fork/tkopecek/koji/c/8078dbcefb99eaa68e98a5f3737179c48c062b71?branch=issue2535

@kevin If it works without multicall, I've remembered this one: #2382 - it should help in other cases (if an exception is thrown in the multicall, the call can fail on the server side).
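The shape of the no-multicall approach described above can be sketched like this (hypothetical helper, not the actual patch): query references one build at a time, so a failure pinpoints the exact build instead of killing a whole batched multiCall.

```python
# Hedged sketch: per-build calls instead of one big multiCall.
# "reference_pass" is a hypothetical name; buildReferences is the
# hub call koji-gc uses for this pass.
def reference_pass(session, builds, log=print):
    results = []
    for binfo in builds:
        try:
            refs = session.buildReferences(binfo["id"])
        except Exception as e:
            # The failing build is now identifiable.
            log("buildReferences failed for %s: %s" % (binfo["nvr"], e))
            refs = None
        results.append((binfo, refs))
    return results
```

The trade-off is one hub round-trip per build, which is much slower than batching but far easier to debug.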

Indeed, it seems to be getting much further now at least... it's past where it tracebacked before. Will update when/if it finally finishes.

I can't tell if it's stuck or just slow, but it's still running:

Skipping build gdouros-asea-fonts-6.31-1.fc26. Keys: ['bfb2d6e3']
Skipping build gdouros-avdira-fonts-6.31-1.fc26. Keys: ['bfb2d6e3']
Skipping build gdouros-musica-fonts-3.12-0.5.20150430.fc26. Keys: ['64dab85d', 'bfb2d6e3']
Skipping build gdouros-aroania-fonts-6.31-1.fc26. Keys: ['bfb2d6e3']
Skipping build scudcloud-1.31-1.fc26. Keys: ['64dab85d', 'bfb2d6e3']
Skipping build python-heatclient-1.1.0-2.fc26. Keys: ['bfb2d6e3']
Skipping build cinnamon-session-3.0.1-3.fc26. Keys: ['64dab85d', 'bfb2d6e3']
Skipping build cabal-rpm-0.10.0-1.fc26. Keys: ['64dab85d', 'bfb2d6e3']
Skipping build glpi-0.90.5-1.fc26. Keys: ['64dab85d', 'bfb2d6e3']
Skipping build openms-2.0.0-35.20160121git6f51b3.fc26. Keys: ['64dab85d', 'bfb2d6e3']

It's been sitting there since yesterday at 19:03UTC.

Still sitting in the same place now, and if I strace it I get:

strace: Process 3736937 attached
restart_syscall(<... resuming interrupted read ...>

So, I think it's pretty much stuck. ;( Any more ideas? Not running GC since August is starting to eat into our disk space. ;(

Do you have a link to your GC conf? I can try to run/debug it in test mode; it can be run without privileges.
Isn't it a memory problem? GC can be a bit memory-hungry, but 150k builds shouldn't be that big a problem.
You can run it with --debug, which could show us more exactly where it is hanging. I don't see a reason why it should die just there (after "Skipping build" and before "Moving to trashcan"). Maybe you can add some debug prints there; it is around line 550.

Do you have a link to your GC conf? I can try to run/debug it in test mode; it can be run without privileges.

https://pagure.io/fedora-infra/ansible/blob/master/f/roles/koji_hub/templates/koji-gc.conf.j2
(it's a jinja template, so you will need to replace any {{ variables }} )

Isn't it a memory problem? GC can be a bit memory-hungry, but 150k builds shouldn't be that big a problem.

I don't think so. nothing looked like it was taking up much...

You can run it with --debug, which could show us more exactly where it is hanging. I don't see a reason why it should die just there (after "Skipping build" and before "Moving to trashcan"). Maybe you can add some debug prints there; it is around line 550.

yeah, it's odd. I restarted it with --debug... will let you know when it gets stuck...

It seems to have stopped this time at:

[77396/147532] Build has 10 rpm references: zanata-parent-28-3.fc26                                                
[77397/147532] Build has 19 rpm references: libsodium-1.0.11-1.fc26                                                
[77398/147532] Build has been used in a buildroot: python-ofxparse-0.15-2.fc26                                     
Last_used: 2016-09-29T11:43:18.871453
Build: python-ofxparse-0.15-2.fc26, Keys: ['64dab85d', 'bfb2d6e3']                                                 
Skipping build python-ofxparse-0.15-2.fc26. Keys: ['64dab85d', 'bfb2d6e3']                                         
[77399/147532] Build has 2 rpm references: php-nette-di-2.3.12-1.fc26   

I'm still not able to find a reason. It looks like it behaves better without multicalls. Try editing koji-gc and replacing the batch argument with 1 everywhere (it is currently 1000). It will be really slow, but it could get past those retries and other potential problems. I'll look at it more in the meantime.
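For reference, what the batch argument controls: koji's client-side multicall queues calls and flushes them in chunks of `batch` per multiCall round-trip, so batch=1 degenerates to one hub call per request. An illustrative chunker, mirroring the splitting behavior (not koji's internal code):

```python
# Illustrative: N queued calls become ceil(N / batch) multiCall
# round-trips. batch=1 means one call per request.
def batches(calls, batch):
    for i in range(0, len(calls), batch):
        yield calls[i:i + batch]

list(batches(["c1", "c2", "c3", "c4", "c5"], 2))
# -> [['c1', 'c2'], ['c3', 'c4'], ['c5']]
list(batches(["c1", "c2"], 1))
# -> [['c1'], ['c2']]
```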

One of the trouble-makers is https://koji.fedoraproject.org/koji/buildinfo?buildID=947008 - a simple buildReferences call takes hours (it hasn't finished yet). You can try to delete it manually if it is eligible. Btw, can you see any problems in the db performance log for this query?

I am trying... but it's been an hour at:

...
5th pass: deletion
Deleting build: libreoffice-5.4.0.3-4.fc27

Yes, the db host has been having problems since I started trying to investigate this. High loads, etc. ;(

The query seems to be:

(pg_stat_activity row, reflowed from the wrapped psql output: pid 1245993, user koji, client 10.3.169.104, state active, timestamps from 2020-10-26 16:34-16:38 UTC; the query text is cut off by pg_stat_activity)

SELECT MAX(create_event)
FROM standard_buildroot
WHERE buildroot_id IN (
  SELECT buildroot_id
  FROM buildroot_listing
  WHERE rpm_id IN ((10859025), (10859026), (10859027), (10859028), ... [truncated]
postgres 1263360 91.0  8.8 34670152 11394624 ?   Rs   16:38  61:08 postgres: parallel worker for PID 1245993   
postgres 1263361 90.9  8.7 34670152 11269944 ?   Rs   16:38  61:01 postgres: parallel worker for PID 1245993   
postgres 1263362 91.2  8.8 34670152 11353920 ?   Rs   16:38  61:13 postgres: parallel worker for PID 1245993   
postgres 1263363 90.9  8.7 34670152 11296744 ?   Rs   16:38  61:04 postgres: parallel worker for PID 1245993   
postgres 1263364 91.2  8.8 34670152 11390764 ?   Rs   16:38  61:14 postgres: parallel worker for PID 1245993   
postgres 1263365 91.0  8.7 34670152 11202612 ?   Rs   16:38  61:08 postgres: parallel worker for PID 1245993   
postgres 1263366 91.0  8.7 34670152 11259484 ?   Rs   16:38  61:05 postgres: parallel worker for PID 1245993  

It looks like more or less the same issue as #2383; I expect you have that patch running. Can you check what EXPLAIN says for that query?

So yeah, after a long time:

5th pass: deletion
Deleting build: libreoffice-5.4.0.3-4.fc27
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/koji/__init__.py", line 2735, in _callMethod                              
    return self._sendCall(handler, headers, request)
  File "/usr/lib/python3.8/site-packages/koji/__init__.py", line 2653, in _sendCall                                
    return self._sendOneCall(handler, headers, request)
  File "/usr/lib/python3.8/site-packages/koji/__init__.py", line 2699, in _sendOneCall                             
    ret = self._read_xmlrpc_response(r)
  File "/usr/lib/python3.8/site-packages/koji/__init__.py", line 2711, in _read_xmlrpc_response                    
    result = u.close()
  File "/usr/lib64/python3.8/xmlrpc/client.py", line 655, in close                                                 
    raise Fault(**self._stack[0])
xmlrpc.client.Fault: <Fault 1009: 'unable to retry call 7 (method multiCall) for session 115885228'>               

During handling of the above exception, another exception occurred:                                                

Traceback (most recent call last):
  File "/home/fedora/kevin/koji-gc", line 1006, in <module>                                                        
    rv = main(args)
  File "/home/fedora/kevin/koji-gc", line 439, in main
    globals()['handle_' + x]()
  File "/home/fedora/kevin/koji-gc", line 710, in handle_delete                                                    
    for binfo, result in zip(continuing, mcall.call_all()):                                                        
  File "/usr/lib/python3.8/site-packages/koji/__init__.py", line 3168, in call_all                                 
    _results = self._session._callMethod('multiCall', args, {})                                                    
  File "/usr/lib/python3.8/site-packages/koji/__init__.py", line 2754, in _callMethod                              
    raise err
koji.RetryError: unable to retry call 7 (method multiCall) for session 115885228

It did finally mark it deleted.

Yeah, I am using koji-gc from git HEAD.

How can I get the entire query? It seems pg_stat_activity cuts it off?
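(For the record: pg_stat_activity does truncate query text. The `query` column is limited to `track_activity_query_size` bytes, 1024 by default, and raising it requires a server restart. The value below is a suggestion, not from the thread.)

```
# postgresql.conf -- requires a PostgreSQL restart to take effect
track_activity_query_size = 32768   # default is 1024 bytes
```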

I've reconstructed query here: https://paste.centos.org/view/164cd263

bad-query.out

...
 Finalize Aggregate  (cost=8855923.53..8855923.54 rows=1 width=4)
   ->  Gather  (cost=8855922.80..8855923.51 rows=7 width=4)
         Workers Planned: 7
         ->  Partial Aggregate  (cost=8854922.80..8854922.81 rows=1 width=4)
               ->  Nested Loop Semi Join  (cost=1.02..8854905.17 rows=7052 width=4)
                     ->  Parallel Index Scan using standard_buildroot_pkey on standard_buildroot  (cost=0.44..606990.47 rows=3282972 width=8)
                     ->  Index Only Scan using buildroot_listing_buildroot_id_key on buildroot_listing  (cost=0.58..4394.93 rows=2282 width=4)
                           Index Cond: 
...

cte.sql

Can you test the speed of this query? It scales better in brew; if it improves your situation, it will become PR #2567.
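One plausible shape of such a CTE rewrite (table and column names taken from the plan above; the attached cte.sql and PR #2567 may differ) is to materialize the buildroot set first, so the planner stops driving a semi-join off a full index scan of standard_buildroot:

```sql
-- Hedged sketch, not necessarily the attached cte.sql.
-- MATERIALIZED needs PostgreSQL 12+; on older versions CTEs are
-- always materialized.
WITH brs AS MATERIALIZED (
    SELECT DISTINCT buildroot_id
    FROM buildroot_listing
    WHERE rpm_id IN (...)          -- the long rpm_id list, elided
)
SELECT MAX(create_event)
FROM standard_buildroot
WHERE buildroot_id IN (SELECT buildroot_id FROM brs);
```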

Metadata Update from @tkopecek:
- Issue set to the milestone: 1.23.1
- Issue tagged with: usability

2 years ago

Well, it's been running 3 hours and hasn't finished yet, so if it's faster, it's not too much faster. ;(

I suppose we could look again at partitioning now that we have a modern postgresql version...

5 hours and 36 min... and it finished! So I guess that's a plus, since the other one never finished.

That's ugly, considering it is a pretty simple query :-(

@mikem any better idea than that CTE?

Metadata Update from @tkopecek:
- Issue tagged with: testing-ready

2 years ago

I can confirm with PR 2567 we are processing again... so it's definitely an improvement. :)

Sure, CTEs seem like a reasonable trade-off here.

:thumbsup:

Metadata Update from @mfilip:
- Issue tagged with: testing-done

2 years ago


Metadata
Attachments 2
Related Pull Requests
  • #2567 Merged 2 years ago
  • #2564 Merged 2 years ago