#8264 mdapi crashing under load
Closed: Fixed a month ago by cverna. Opened 3 months ago by kevin.

mdapi seems to crashloop under load. An easy reproducer is to run the bugzilla-sync script on pkgs02. Because of the inefficient way that script works, it issues an mdapi query for every single package that exists. This causes mdapi to stop processing requests; it then fails its liveness probe and OpenShift returns 503s until the pod is restarted.
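For context, the problematic access pattern looks roughly like the sketch below. The function names are hypothetical and the endpoint layout (`/{branch}/pkg/{name}`) is an assumption about mdapi's routes; the point is simply that the script fires one synchronous request per package, back to back, for every package in the distro.

```python
# Hypothetical sketch of the per-package query pattern that overloads mdapi.
# The /{branch}/pkg/{name} URL layout is an assumption, not taken from the
# bugzilla-sync source.
import urllib.request

MDAPI_URL = "https://mdapi.fedoraproject.org"


def pkg_url(branch, name):
    """Build the mdapi query URL for a single package."""
    return f"{MDAPI_URL}/{branch}/pkg/{name}"


def sync_components(packages, branch="rawhide"):
    # One blocking HTTP request per package: tens of thousands of
    # back-to-back queries, which is what drives mdapi into the ground.
    for name in packages:
        with urllib.request.urlopen(pkg_url(branch, name)) as resp:
            yield name, resp.status
```

With tens of thousands of packages, even a modest per-request cost adds up to a sustained hammering of the service, which matches the "stops processing, fails the liveness probe" behaviour described above.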

I can't see any particular traceback or anything, it just stops processing.

This means we are not syncing bugzilla components, which in turn broke new hotness because of missing components. :(

Perhaps this makes the 'fix bugzilla sync' team a higher priority?


Found one traceback:

2019-10-01 17:42:41,833 [ERROR] aiohttp.server: Unhandled exception
Traceback (most recent call last):
  File "/usr/lib64/python3.7/site-packages/aiohttp/web_protocol.py", line 411, in start
    await resp.write_eof()
  File "/usr/lib64/python3.7/site-packages/aiohttp/web_response.py", line 596, in write_eof
    await super().write_eof(body)
  File "/usr/lib64/python3.7/site-packages/aiohttp/web_response.py", line 401, in write_eof
    await self._payload_writer.write_eof(data)
  File "/usr/lib64/python3.7/site-packages/aiohttp/http_writer.py", line 136, in write_eof
    self._write(chunk)
  File "/usr/lib64/python3.7/site-packages/aiohttp/http_writer.py", line 67, in _write
    raise ConnectionResetError('Cannot write to closing transport')
ConnectionResetError: Cannot write to closing transport

So I have decreased the frequency of the liveness and readiness probes to check the pod every minute instead of every 10 seconds. That should make the pod less likely to be restarted.

https://infrastructure.fedoraproject.org/cgit/ansible.git/commit/?id=707a7ab
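For illustration, the relaxed probe settings would look something like the fragment below. This is a sketch, not the actual deployment config from the ansible repo; the probe path and port are assumptions.

```yaml
# Hypothetical excerpt of the mdapi pod spec with relaxed probes:
# periodSeconds raised from 10 to 60 so a temporarily unresponsive
# pod is not restarted so aggressively.
livenessProbe:
  httpGet:
    path: /          # assumed health endpoint
    port: 8080       # assumed container port
  periodSeconds: 60  # was 10
  timeoutSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /
    port: 8080
  periodSeconds: 60  # was 10
```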

I have also deployed https://pagure.io/mdapi/pull-request/90, which should improve mdapi's performance.

Before closing this ticket I would like to add nagios monitoring to mdapi so that next time this happens we have some alerts :e-mail:

Metadata Update from @cverna:
- Issue assigned to cverna

3 months ago

Could we get alerts from nagios when a pod in OpenShift crashes?

I don't know if we can, but it would be nice :-)

@cverna
Absolutely.
I would use it for both Anitya and the-new-hotness.

Monitoring of pods is currently being investigated by @asaleh, including the possible use of Prometheus.

Closing this ticket.

Metadata Update from @cverna:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

a month ago
