This has now happened for the second time (that I know of), so it's no coincidence. On dev, stg and prod alike, fedmsg-hub stopped triggering new jobs at roughly the same time. When you inspect the memory usage, the process has consumed all available memory (~3GB) and all swap space (~2GB). Restarting the process "fixes" the problem.
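To spot the growth before the machine runs out of memory and swap, the resident and swapped size of the process can be sampled from `/proc`. This is only a rough sketch, assuming a Linux host; the helper names `parse_status` and `read_usage` are made up for illustration and are not part of fedmsg or taskotron-trigger.

```python
def parse_status(text):
    """Extract VmRSS and VmSwap (in kB) from the text of /proc/<pid>/status."""
    usage = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("VmRSS", "VmSwap"):
            # The value part looks like "\t 3145728 kB"; keep the number only.
            usage[key] = int(rest.split()[0])
    return usage


def read_usage(pid):
    """Read the current memory usage of a running process (Linux only)."""
    with open("/proc/%d/status" % pid) as status_file:
        return parse_status(status_file.read())
```

Calling `read_usage(pid)` periodically with the PID of the fedmsg-hub process and logging the result would show whether memory grows steadily or jumps after particular events.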
There's clearly a memory leak somewhere. But it's not trivial to find out where it is, whether in fedmsg-hub itself or in our adjustments of it (taskotron-trigger). It could be happening for every message and growing over time, or only for some specific messages; we don't know yet.
On taskotron-stg01, I saved a core file using gdb into /root/fedmsg-hub-leak.core. Since the issue doesn't seem to be going away, we'll need to figure out what leaks in there.
Metadata Update from @kparal: - Issue priority set to: High - Issue tagged with: infra
A few pointers to start with:
https://stackoverflow.com/questions/1435415/python-memory-leaks
http://blog.korbakov.com/2013/09/22/hunting-memory-leaks-in-running-python-process.html
It seems pyopenssl is to blame here:
<bowlofeggs> jcline recently fixed some kind of memory leak in pyopenssl that had some relationship to fedmsg - i wonder if that was the cause here?
<jcline> kparal, bowlofeggs, that is indeed the pyopenssl fixes. Both leaks should be fixed in pyOpenSSL 17.3.0, and I backported one of them to older releases. I can get the other one in now it's merged upstream.
<jcline> Although it looks like the maintainer actually just closed all my PRs without accepting them so...
<jbowen> Good <time of day>
<jbowen> Is there any chance I can get membership approval for the QA group (FAS username is the same as my IRC username)
<jcline> kparal, is this consumer on f25?
<tflink> jcline: yeah, it is
<jcline> tflink, ah okay, thanks
* jcline backports patches further :(
Meeting outcome:
* to fix the fedmsg memory issue, the plan is to use side-builds in the short term if needed, upgrade the master to f26 in the medium term and upgrade everything to f26 in the longer term (tflink, 14:36:52)
https://meetbot.fedoraproject.org/fedora-meeting-1/2017-09-18/fedora-qadevel.2017-09-18-14.01.log.html
These updates fixed the problem on dev:
https://bodhi.fedoraproject.org/updates/FEDORA-2017-235298fa58
https://bodhi.fedoraproject.org/updates/pyOpenSSL-16.2.0-2.fc25
We need to deploy the same to stg and prod.
The fix was deployed to prod; stg is not updated yet (due to login issues).
Deployed to stg as well, closing.
Metadata Update from @kparal: - Issue assigned to kparal - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)