#2851 kojira 1.24.x and rmtree/repo deletes
Closed: Fixed 2 years ago by tkopecek. Opened 2 years ago by kevin.

We seem to have hit another repo deletes problem with kojira in 1.24 and 1.24.1. Moving back to the version in 1.23.1 seems to avoid it.

We just noticed the problem today when it was noted that we were again almost out of space on our koji volume. ;(
On looking, there was 47TB or so of repos still around in repos/

Logs seemed to show some tracebacks:

May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]: Exception in thread rmtreeLoop:
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]: Traceback (most recent call last):
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]:   File "/usr/sbin/kojira", line 600, in rmt
reeLoop
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]:     self.checkQueue()
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]:   File "/usr/sbin/kojira", line 352, in che
ckQueue
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]:     pid = self._rmtree(path)
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]:   File "/usr/sbin/kojira", line 375, in _rm
tree
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]:     pid = os.fork()
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]:   File "/usr/sbin/kojira", line 278, in __e
xit__
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]:     self.release_locks()
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]:   File "/usr/sbin/kojira", line 300, in rel
ease_locks
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]:     logging._releaseLock()
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]:   File "/usr/lib64/python3.9/logging/__init
__.py", line 232, in _releaseLock
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]:     _lock.release()
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]: RuntimeError: cannot release un-acquired lo
ck
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]: During handling of the above exception, ano
ther exception occurred:
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]: Traceback (most recent call last):
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]:   File "/usr/lib64/python3.9/threading.py",
 line 954, in _bootstrap_inner
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]:     self.run()
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]:   File "/usr/lib64/python3.9/threading.py",
 line 892, in run
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]:     self._target(*self._args, **self._kwarg
s)
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]:   File "/usr/sbin/kojira", line 606, in rmt
reeLoop
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]:     session.logout()
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]:   File "/usr/lib/python3.9/site-packages/ko
ji/__init__.py", line 2560, in logout
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]:     self._sendCall(handler, headers, reques
t)
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]:   File "/usr/lib/python3.9/site-packages/koji/__init__.py", line 2636, in _sendCall
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]:     return self._sendOneCall(handler, headers, request)
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]:   File "/usr/lib/python3.9/site-packages/koji/__init__.py", line 2685, in _sendOneCall
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]:     ret = self._read_xmlrpc_response(r)
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]:   File "/usr/lib/python3.9/site-packages/koji/__init__.py", line 2697, in _read_xmlrpc_response
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]:     result = u.close()
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]:   File "/usr/lib64/python3.9/xmlrpc/client.py", line 662, in close
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]:     raise Fault(**self._stack[0])
May 10 21:56:34 koji02.iad2.fedoraproject.org kojira[1314442]: xmlrpc.client.Fault: <Fault 1007: 'session "135545223" has expired'>

and

May 08 23:59:20 koji02.iad2.fedoraproject.org kojira[805]: 2021-05-08 23:59:20,265 [INFO] {805} koji.repo.
regen:340 Queuing rmtree job for /mnt/koji/repos/f35-build-side-40484/3472506
May 08 23:59:20 koji02.iad2.fedoraproject.org kojira[805]: 2021-05-08 23:59:20,269 [ERROR] {805} koji:1091
 Traceback (most recent call last):
May 08 23:59:20 koji02.iad2.fedoraproject.org kojira[805]:   File "/usr/lib/python3.9/site-packages/koji/_
_init__.py", line 2721, in _callMethod
May 08 23:59:20 koji02.iad2.fedoraproject.org kojira[805]:     return self._sendCall(handler, headers, req
uest)
May 08 23:59:20 koji02.iad2.fedoraproject.org kojira[805]:   File "/usr/lib/python3.9/site-packages/koji/_
_init__.py", line 2636, in _sendCall
May 08 23:59:20 koji02.iad2.fedoraproject.org kojira[805]:     return self._sendOneCall(handler, headers, 
request)
May 08 23:59:20 koji02.iad2.fedoraproject.org kojira[805]:   File "/usr/lib/python3.9/site-packages/koji/_
_init__.py", line 2685, in _sendOneCall
May 08 23:59:20 koji02.iad2.fedoraproject.org kojira[805]:     ret = self._read_xmlrpc_response(r)
May 08 23:59:20 koji02.iad2.fedoraproject.org kojira[805]:   File "/usr/lib/python3.9/site-packages/koji/_
_init__.py", line 2697, in _read_xmlrpc_response
May 08 23:59:20 koji02.iad2.fedoraproject.org kojira[805]:     result = u.close()
May 08 23:59:20 koji02.iad2.fedoraproject.org kojira[805]:   File "/usr/lib64/python3.9/xmlrpc/client.py",
 line 662, in close
May 08 23:59:20 koji02.iad2.fedoraproject.org kojira[805]:     raise Fault(**self._stack[0])
May 08 23:59:20 koji02.iad2.fedoraproject.org kojira[805]: xmlrpc.client.Fault: <Fault 1008: '17693841 > 1
7693840 (session 134841685)'>
May 08 23:59:20 koji02.iad2.fedoraproject.org kojira[805]: During handling of the above exception, another
 exception occurred:
May 08 23:59:20 koji02.iad2.fedoraproject.org kojira[805]: Traceback (most recent call last):
May 08 23:59:20 koji02.iad2.fedoraproject.org kojira[805]:   File "/usr/sbin/kojira", line 1062, in main
May 08 23:59:20 koji02.iad2.fedoraproject.org kojira[805]:     repomgr.updateRepos()
May 08 23:59:20 koji02.iad2.fedoraproject.org kojira[805]:   File "/usr/sbin/kojira", line 752, in updateR
epos
May 08 23:59:20 koji02.iad2.fedoraproject.org kojira[805]:     self.readCurrentRepos()
May 08 23:59:20 koji02.iad2.fedoraproject.org kojira[805]:   File "/usr/sbin/kojira", line 421, in readCur
rentRepos
May 08 23:59:20 koji02.iad2.fedoraproject.org kojira[805]:     if not getTag(self.session, repo.tag_id) an
d not repo.expired():
May 08 23:59:20 koji02.iad2.fedoraproject.org kojira[805]:   File "/usr/sbin/kojira", line 54, in getTag
May 08 23:59:20 koji02.iad2.fedoraproject.org kojira[805]:     info = session.getTag(tag, event=event)
May 08 23:59:20 koji02.iad2.fedoraproject.org kojira[805]:   File "/usr/lib/python3.9/site-packages/koji/_
_init__.py", line 2310, in __call__
May 08 23:59:20 koji02.iad2.fedoraproject.org kojira[805]:     return self.__func(self.__name, args, opts)
May 08 23:59:20 koji02.iad2.fedoraproject.org kojira[805]:   File "/usr/lib/python3.9/site-packages/koji/_
_init__.py", line 2740, in _callMethod
May 08 23:59:20 koji02.iad2.fedoraproject.org kojira[805]:     raise err
May 08 23:59:20 koji02.iad2.fedoraproject.org kojira[805]: koji.SequenceError: 17693841 > 17693840 (sessio
n 134841685)

It's unclear if these have anything to do with things though.

Another one seems to clearly show:

May 10 17:55:53 koji02.iad2.fedoraproject.org kojira[1293341]: 2021-05-10 17:55:53,732 [INFO] {1293341} ko
ji.repo.regen:353 Started rmtree (pid 1293352) for /mnt/koji/repos/epel7-infra-mailman/3449468
May 10 17:55:53 koji02.iad2.fedoraproject.org kojira[1293352]: 2021-05-10 17:55:53,733 [ERROR] {1293352} k
oji.repo.rmtree:603 Error in delete thread
May 10 17:55:53 koji02.iad2.fedoraproject.org kojira[1293352]: Traceback (most recent call last):
May 10 17:55:53 koji02.iad2.fedoraproject.org kojira[1293352]:   File "/usr/sbin/kojira", line 600, in rmt
reeLoop
May 10 17:55:53 koji02.iad2.fedoraproject.org kojira[1293352]:     self.checkQueue()
May 10 17:55:53 koji02.iad2.fedoraproject.org kojira[1293352]:   File "/usr/sbin/kojira", line 352, in che
ckQueue
May 10 17:55:53 koji02.iad2.fedoraproject.org kojira[1293352]:     pid = self._rmtree(path)
May 10 17:55:53 koji02.iad2.fedoraproject.org kojira[1293352]:   File "/usr/sbin/kojira", line 375, in _rm
tree
May 10 17:55:53 koji02.iad2.fedoraproject.org kojira[1293352]:     pid = os.fork()
May 10 17:55:53 koji02.iad2.fedoraproject.org kojira[1293352]:   File "/usr/sbin/kojira", line 278, in __e
xit__
May 10 17:55:53 koji02.iad2.fedoraproject.org kojira[1293352]:     self.release_locks()
May 10 17:55:53 koji02.iad2.fedoraproject.org kojira[1293352]:   File "/usr/sbin/kojira", line 300, in rel
ease_locks
May 10 17:55:53 koji02.iad2.fedoraproject.org kojira[1293352]:     logging._releaseLock()
May 10 17:55:53 koji02.iad2.fedoraproject.org kojira[1293352]:   File "/usr/lib64/python3.9/logging/__init
__.py", line 232, in _releaseLock
May 10 17:55:53 koji02.iad2.fedoraproject.org kojira[1293352]:     _lock.release()
May 10 17:55:53 koji02.iad2.fedoraproject.org kojira[1293352]: RuntimeError: cannot release un-acquired lo
ck
May 10 17:55:53 koji02.iad2.fedoraproject.org kojira[1293353]: 2021-05-10 17:55:53,743 [ERROR] {1293353} k
oji.repo.rmtree:603 Error in delete thread
May 10 17:55:53 koji02.iad2.fedoraproject.org kojira[1293353]: Traceback (most recent call last):
May 10 17:55:53 koji02.iad2.fedoraproject.org kojira[1293353]:   File "/usr/sbin/kojira", line 600, in rmt
reeLoop
May 10 17:55:53 koji02.iad2.fedoraproject.org kojira[1293353]:     self.checkQueue()
May 10 17:55:53 koji02.iad2.fedoraproject.org kojira[1293353]:   File "/usr/sbin/kojira", line 352, in che
ckQueue
May 10 17:55:53 koji02.iad2.fedoraproject.org kojira[1293353]:     pid = self._rmtree(path)
May 10 17:55:53 koji02.iad2.fedoraproject.org kojira[1293353]:   File "/usr/sbin/kojira", line 375, in _rm
tree
May 10 17:55:53 koji02.iad2.fedoraproject.org kojira[1293353]:     pid = os.fork()
May 10 17:55:53 koji02.iad2.fedoraproject.org kojira[1293353]:   File "/usr/sbin/kojira", line 278, in __e
xit__
May 10 17:55:53 koji02.iad2.fedoraproject.org kojira[1293353]:     self.release_locks()
May 10 17:55:53 koji02.iad2.fedoraproject.org kojira[1293353]:   File "/usr/sbin/kojira", line 300, in rel
ease_locks
May 10 17:55:53 koji02.iad2.fedoraproject.org kojira[1293353]:     logging._releaseLock()
May 10 17:55:53 koji02.iad2.fedoraproject.org kojira[1293353]:   File "/usr/lib64/python3.9/logging/__init
__.py", line 232, in _releaseLock
May 10 17:55:53 koji02.iad2.fedoraproject.org kojira[1293353]:     _lock.release()
May 10 17:55:53 koji02.iad2.fedoraproject.org kojira[1293353]: RuntimeError: cannot release un-acquired lo
ck

The 1.23.1 kojira sees tons of:

May 10 22:42:18 koji02.iad2.fedoraproject.org kojira[1316276]: 2021-05-10 22:42:18,181 [INFO] {1316276} koji:650 Removing stray repo (state=DELETED): /mnt/koji/repos/module-nginx-mainline-3220210331001209-43bbeeef-build/3452529

So, it was marked deleted, added to rmqueue, but never actually deleted?

Happy to provide more info / kojira --debug, but we need to get the space usage down before we run out, so I figured I'd file this sooner rather than later.


Another one with 1.23.1:

May 10 23:58:10 koji02.iad2.fedoraproject.org kojira[1316276]: 2021-05-10 23:58:10,263 [INFO] {1316276} ko
ji.repo.regen:307 Queuing rmtree job for /mnt/koji/repos/eln-build/3485072                                
May 10 23:58:10 koji02.iad2.fedoraproject.org kojira[1316276]: 2021-05-10 23:58:10,263 [ERROR] {1316276} k
oji.repo.delete:554 Error in delete thread                                                                
May 10 23:58:10 koji02.iad2.fedoraproject.org kojira[1316276]: Traceback (most recent call last):         
May 10 23:58:10 koji02.iad2.fedoraproject.org kojira[1316276]:   File "/usr/sbin/kojira", line 551, in del
eteLoop                                                                                                   
May 10 23:58:10 koji02.iad2.fedoraproject.org kojira[1316276]:     self.deleteRepos()                     
May 10 23:58:10 koji02.iad2.fedoraproject.org kojira[1316276]:   File "/usr/sbin/kojira", line 961, in del
eteRepos                                                                                                  
May 10 23:58:10 koji02.iad2.fedoraproject.org kojira[1316276]:     del self.repos[repo.repo_id]           
May 10 23:58:10 koji02.iad2.fedoraproject.org kojira[1316276]: KeyError: 3485072                          
May 10 23:58:10 koji02.iad2.fedoraproject.org kojira[1316276]: Exception in thread deleteLoop:            
May 10 23:58:10 koji02.iad2.fedoraproject.org kojira[1316276]: Traceback (most recent call last):         
May 10 23:58:10 koji02.iad2.fedoraproject.org kojira[1316276]:   File "/usr/lib64/python3.9/threading.py",
 line 954, in _bootstrap_inner                                                                            
May 10 23:58:10 koji02.iad2.fedoraproject.org kojira[1316276]:     self.run()                             
May 10 23:58:10 koji02.iad2.fedoraproject.org kojira[1316276]:   File "/usr/lib64/python3.9/threading.py",
 line 892, in run                                                                                         
May 10 23:58:10 koji02.iad2.fedoraproject.org kojira[1316276]:     self._target(*self._args, **self._kwarg
s)                                                                                                        
May 10 23:58:10 koji02.iad2.fedoraproject.org kojira[1316276]:   File "/usr/sbin/kojira", line 551, in del
eteLoop                                                                                                   
May 10 23:58:10 koji02.iad2.fedoraproject.org kojira[1316276]:     self.deleteRepos()                     
May 10 23:58:10 koji02.iad2.fedoraproject.org kojira[1316276]:   File "/usr/sbin/kojira", line 961, in del
eteRepos                                                                                                  
May 10 23:58:10 koji02.iad2.fedoraproject.org kojira[1316276]:     del self.repos[repo.repo_id]           
May 10 23:58:10 koji02.iad2.fedoraproject.org kojira[1316276]: KeyError: 3485072     

Metadata Update from @kevin:
- Custom field Size adjusted to None

2 years ago

It seems to be two issues:
1) logging locking (side-effect of #2714) - what is your python version? + @mikem https://pagure.io/koji/pull-request/2715#_1__51 maybe this should encompass whole forked code until the end of _rmtree?

2) race with del self.repos - #2853 should fix this

Initial traceback, cleaned up

Traceback (most recent call last):
  File "/usr/sbin/kojira", line 600, in rmtreeLoop
    self.checkQueue()
  File "/usr/sbin/kojira", line 352, in checkQueue
    pid = self._rmtree(path)
  File "/usr/sbin/kojira", line 375, in _rmtree
    pid = os.fork()
  File "/usr/sbin/kojira", line 278, in __exit__
    self.release_locks()
  File "/usr/sbin/kojira", line 300, in release_locks
    logging._releaseLock()
  File "/usr/lib64/python3.9/logging/__init__.py", line 232, in _releaseLock
    _lock.release()
RuntimeError: cannot release un-acquired lock

Hmm, we only try to do this is the earlier logging._acquireLock() succeeds. That's odd

The SequenceError is its own issue. I noticed it myself the other week and I've just filed #2854 for track it.

https://pagure.io/koji/issue/2854

maybe this should encompass whole forked code until the end of _rmtree?

I don't see how this will help. Once we've forked, we're the only thread.

@kevin If you can give me details about the system where kojira is running, I can try to replicate. OS and python version.

Also, were you running stock koji 1.24?

It's Fedora 34, so python-3.9.4.

We were running koji 1.24.1... but I think the issue started with 1.24, we just didn't notice.

The 1.23.1 kojira version has now cleaned things all out, so I can re-upgrade and/or run with debug if you like.

I've a simple reproducer. https://paste.centos.org/view/c5719b07
Child process will always have this problem. (F33/python-3.9.4)

After the fork, child's lock is not acquired. It is just reinitialized. So, all locks in the child process are free. So, release_locks would always work only in parent. We could just wrap all releases with try/except. Anything better (detecting that we're child somehow?)?

https://github.com/python/cpython/pull/19195/files#diff-23071700ee893f365a4c117210b9129d4ac6afe5cc6e52bb5cd4cabbc6816e07R498

Metadata Update from @tkopecek:
- Issue set to the milestone: 1.25

2 years ago

Ah, the reproducer does not reproduce on f32, but errors on f33 and 34. This may be a fix the underlying problem in newer python that is breaking the workaround.

f32 is the last one with py 3.8 as default. so, _acquireLock will not be called there.

Metadata Update from @jcupova:
- Issue tagged with: testing-ready

2 years ago

Metadata Update from @mfilip:
- Issue tagged with: testing-done

2 years ago

Commit 3849e07 relates to this ticket

Commit 11e71e8 relates to this ticket

Login to comment on this ticket.

Metadata
Related Pull Requests
  • #2856 Merged 2 years ago
  • #2853 Merged 2 years ago