mdapi seems to crashloop under load. An easy reproducer is to run the bugzilla-sync script on pkgs02. Because of the poor way that script works, it does an mdapi query for every single package that exists. This causes mdapi to stop processing; it then fails its liveness probe, and OpenShift starts returning a 503 until the pod is restarted.
I can't see any particular traceback or anything, it just stops processing.
This means we are not syncing bugzilla components, which in turn broke new hotness because of missing components. :(
Perhaps this makes the 'fix bugzilla sync' team a higher priority?
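As a stopgap until the sync script is reworked, the per-package queries could be spaced out so mdapi is not hit back to back. This is only a rough sketch, not the actual bugzilla-sync code: `query_packages`, `fetch`, and `min_interval` are hypothetical names, and `fetch` stands in for whatever function does the real mdapi request.

```python
import time
from typing import Any, Callable, Dict, Iterable, Optional


def query_packages(
    names: Iterable[str],
    fetch: Callable[[str], Any],
    min_interval: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Optional[Any]]:
    """Call fetch(name) for each name, at least min_interval seconds apart.

    `fetch` is a hypothetical stand-in for the real mdapi query. A failed
    query (any OSError, e.g. a connection reset) is recorded as None rather
    than aborting the whole run.
    """
    results: Dict[str, Optional[Any]] = {}
    last_start: Optional[float] = None
    for name in names:
        if last_start is not None:
            # Wait out whatever is left of the minimum gap between requests.
            remaining = min_interval - (clock() - last_start)
            if remaining > 0:
                sleep(remaining)
        last_start = clock()
        try:
            results[name] = fetch(name)
        except OSError:
            results[name] = None
    return results
```

The `sleep` and `clock` parameters are there only so the pacing can be exercised without real waiting; a real run would leave them at their defaults and tune `min_interval` to whatever mdapi can sustain.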
Found one traceback:
2019-10-01 17:42:41,833 [ERROR] aiohttp.server: Unhandled exception
Traceback (most recent call last):
  File "/usr/lib64/python3.7/site-packages/aiohttp/web_protocol.py", line 411, in start
    await resp.write_eof()
  File "/usr/lib64/python3.7/site-packages/aiohttp/web_response.py", line 596, in write_eof
    await super().write_eof(body)
  File "/usr/lib64/python3.7/site-packages/aiohttp/web_response.py", line 401, in write_eof
    await self._payload_writer.write_eof(data)
  File "/usr/lib64/python3.7/site-packages/aiohttp/http_writer.py", line 136, in write_eof
    self._write(chunk)
  File "/usr/lib64/python3.7/site-packages/aiohttp/http_writer.py", line 67, in _write
    raise ConnectionResetError('Cannot write to closing transport')
ConnectionResetError: Cannot write to closing transport
So I have increased the interval of the liveness and readiness probes, so the pod is now checked every minute instead of every 10s. That should make the pod less likely to restart.
https://infrastructure.fedoraproject.org/cgit/ansible.git/commit/?id=707a7ab
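To illustrate why a longer probe period helps: a pod is only restarted after several consecutive liveness failures, so the period sets roughly how long mdapi can stay unresponsive before that happens. The sketch below assumes the Kubernetes default `failureThreshold` of 3; the value actually used for mdapi is not stated in this ticket, and `seconds_until_restart` is just an illustrative helper.

```python
def seconds_until_restart(period_seconds: int, failure_threshold: int = 3) -> int:
    """Approximate time a pod can stay unresponsive before it is restarted.

    Assumes probes fire every `period_seconds` and the pod is restarted
    after `failure_threshold` consecutive failures (3 is the Kubernetes
    default; the real mdapi setting is an assumption here). Probe timeouts
    and the initial delay are ignored.
    """
    return period_seconds * failure_threshold


for period in (10, 60):
    print(f"period={period}s -> restart after about {seconds_until_restart(period)}s")
```

Under that assumption, a stall of around half a minute was enough to trigger a restart before the change, while afterwards mdapi has about three minutes to recover. The trade-off is that a genuinely dead pod is also noticed later.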
I have also deployed https://pagure.io/mdapi/pull-request/90 which should increase mdapi performance.
Before closing this ticket I would like to add nagios monitoring to mdapi so that next time this happens we have some alerts :e-mail:
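A check along these lines might be enough to start with. It is a minimal sketch rather than the check that will actually be deployed: `check_mdapi` and `main` are hypothetical names and the URL to probe is left as an argument. It only relies on the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import urllib.error
import urllib.request
from typing import Callable, List, Tuple

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_mdapi(
    url: str,
    timeout: float = 10.0,
    opener: Callable = urllib.request.urlopen,
) -> Tuple[int, str]:
    """Fetch `url` and map the outcome to a Nagios (exit code, message) pair."""
    try:
        with opener(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses, e.g. the 503 seen while the pod is down.
        return CRITICAL, f"CRITICAL: {url} returned HTTP {err.code}"
    except OSError as err:
        # Timeouts, refused or reset connections, DNS failures.
        return CRITICAL, f"CRITICAL: {url} unreachable ({err})"
    if status == 200:
        return OK, f"OK: {url} returned HTTP 200"
    return WARNING, f"WARNING: {url} returned HTTP {status}"


def main(argv: List[str]) -> int:
    """Plugin entry point: expects the URL to check as the only argument."""
    if len(argv) != 2:
        print("UNKNOWN: usage: check_mdapi URL")
        return UNKNOWN
    code, message = check_mdapi(argv[1])
    print(message)
    return code
```

Nagios would run this as a plugin by ending the script with `sys.exit(main(sys.argv))`. The `opener` parameter exists only so the logic can be tested without a network connection.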
Metadata Update from @cverna: - Issue assigned to cverna
Could we get alerts from nagios when a pod in OpenShift crashes?
I don't know if we can but it would be nice :-)
@cverna Absolutely. I would use it for both Anitya and the-new-hotness.
Monitoring of pods is currently being investigated by @asaleh, possibly using prometheus.
Closing this ticket.
Metadata Update from @cverna: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)