Since the new mailman deployment, we have been seeing frequent memory spikes in gunicorn. A gunicorn worker for mailman usually consumes around 400 MB, but during a spike its memory usage can climb as high as 30 GB. This sometimes causes services to be killed by the OOM killer.
We should try to find what is causing this and fix it.
Not urgent, but it will help make mailman more stable.
I already searched for similar reports, but the issue people usually describe is sustained high memory consumption, not short spikes.
I'm looking at strace output to see what the process is doing during a spike.
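To make that easier, here is a minimal sketch (assuming psutil is installed on the host and the workers show up with "gunicorn" in their name or command line) that lists gunicorn processes sorted by resident memory, so the spiking worker can be identified and strace attached to the right PID with strace -f -p <pid>:

```python
import psutil

# List gunicorn processes sorted by resident memory (largest first), so the
# spiking worker can be picked out and strace attached: strace -f -p <pid>
procs = []
for p in psutil.process_iter(["pid", "name", "cmdline", "memory_info"]):
    cmdline = " ".join(p.info["cmdline"] or [])
    if "gunicorn" in (p.info["name"] or "") or "gunicorn" in cmdline:
        if p.info["memory_info"] is not None:
            procs.append(p)

for p in sorted(procs, key=lambda p: p.info["memory_info"].rss, reverse=True):
    rss_mb = p.info["memory_info"].rss / (1024 * 1024)
    print(f"{p.info['pid']}  {rss_mb:.0f} MB  {' '.join(p.info['cmdline'] or [])}")
```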
This looks similar to what I've observed in Bodhi for the past year or so; I never found the cause, though. Do the spikes in mailman always happen at the same time of day? (If I remember correctly, in Bodhi I see this happening every 3 hours.)
Not sure if that is related, but it seems MM is completely down ATM, returning 503
@smooge created a PR that should hopefully fix the situation. I will merge it today, deploy it, and monitor whether that helped.
I was trying to figure out a way to apply this to all systems behind the proxies, but I realized I don't know enough about the current infrastructure to do that correctly. I don't know whether it needs to go into each group, or whether putting it in roles/httpd/domainrewrite/templates/domainrewrite.conf and/or roles/haproxy/rewrite/templates/rewrite.conf is what is needed.
I'm not sure about that either, @kevin will probably know more.
The memory spikes are still happening even after merging the PR. I can see this error that corresponds to the spikes:
ERROR 2024-07-09 06:59:49,566 1219375 django.request Internal Server Error: /archives/list/scm-commits@lists.fedoraproject.org/2021/12/
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
  File "/usr/lib/python3.9/site-packages/django/core/handlers/base.py", line 197, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib/python3.9/site-packages/hyperkitty/lib/view_helpers.py", line 137, in inner
    return func(request, *args, **kwargs)
  File "/usr/lib/python3.9/site-packages/hyperkitty/views/mlist.py", line 127, in archives
    return _thread_list(request, mlist, threads, extra_context=extra_context)
  File "/usr/lib/python3.9/site-packages/hyperkitty/views/mlist.py", line 194, in _thread_list
    return render(request, template_name, context)
  File "/usr/lib/python3.9/site-packages/django/shortcuts.py", line 24, in render
    content = loader.render_to_string(template_name, context, request, using=using)
  File "/usr/lib/python3.9/site-packages/django/template/loader.py", line 62, in render_to_string
    return template.render(context, request)
  File "/usr/lib/python3.9/site-packages/django/template/backends/django.py", line 61, in render
    return self.template.render(context)
  File "/usr/lib/python3.9/site-packages/django/template/base.py", line 175, in render
    return self._render(context)
  File "/usr/lib/python3.9/site-packages/django/template/base.py", line 167, in _render
    return self.nodelist.render(context)
  File "/usr/lib/python3.9/site-packages/django/template/base.py", line 1005, in render
    return SafeString("".join([node.render_annotated(context) for node in self]))
  File "/usr/lib/python3.9/site-packages/django/template/base.py", line 1005, in <listcomp>
    return SafeString("".join([node.render_annotated(context) for node in self]))
  File "/usr/lib/python3.9/site-packages/django/template/base.py", line 966, in render_annotated
    return self.render(context)
  File "/usr/lib/python3.9/site-packages/django/template/loader_tags.py", line 157, in render
    return compiled_parent._render(context)
  File "/usr/lib/python3.9/site-packages/django/template/base.py", line 167, in _render
    return self.nodelist.render(context)
  File "/usr/lib/python3.9/site-packages/django/template/base.py", line 1005, in render
    return SafeString("".join([node.render_annotated(context) for node in self]))
  File "/usr/lib/python3.9/site-packages/django/template/base.py", line 1005, in <listcomp>
    return SafeString("".join([node.render_annotated(context) for node in self]))
  File "/usr/lib/python3.9/site-packages/django/template/base.py", line 966, in render_annotated
    return self.render(context)
  File "/usr/lib/python3.9/site-packages/django/template/loader_tags.py", line 63, in render
    result = block.nodelist.render(context)
  File "/usr/lib/python3.9/site-packages/django/template/base.py", line 1005, in render
    return SafeString("".join([node.render_annotated(context) for node in self]))
  File "/usr/lib/python3.9/site-packages/django/template/base.py", line 1005, in <listcomp>
    return SafeString("".join([node.render_annotated(context) for node in self]))
  File "/usr/lib/python3.9/site-packages/django/template/base.py", line 966, in render_annotated
    return self.render(context)
  File "/usr/lib/python3.9/site-packages/django/template/loader_tags.py", line 63, in render
    result = block.nodelist.render(context)
  File "/usr/lib/python3.9/site-packages/django/template/base.py", line 1005, in render
    return SafeString("".join([node.render_annotated(context) for node in self]))
  File "/usr/lib/python3.9/site-packages/django/template/base.py", line 1005, in <listcomp>
    return SafeString("".join([node.render_annotated(context) for node in self]))
  File "/usr/lib/python3.9/site-packages/django/template/base.py", line 966, in render_annotated
    return self.render(context)
  File "/usr/lib/python3.9/site-packages/django/templatetags/cache.py", line 54, in render
    fragment_cache.set(cache_key, value, expire_time)
  File "/usr/lib/python3.9/site-packages/django/core/cache/backends/memcached.py", line 79, in set
    if not self._cache.set(key, value, self.get_backend_timeout(timeout)):
pylibmc.TooBig
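The pylibmc.TooBig at the end means memcached refused to store the cached template fragment: the rendered thread list for that archive page was bigger than memcached's item size limit (1 MB by default, raised with the -I option). A minimal reproduction sketch, assuming a memcached on localhost with default settings:

```python
import pylibmc

# memcached rejects items above its item size limit (1 MB unless started with
# a larger -I); pylibmc surfaces that as TooBig, which is what Django's
# {% cache %} tag hit while caching the rendered thread list.
mc = pylibmc.Client(["127.0.0.1"], binary=True)
big_fragment = "x" * (2 * 1024 * 1024)  # pretend this is 2 MB of rendered HTML

try:
    mc.set("thread-list-fragment-test", big_fragment)
except pylibmc.TooBig:
    print("fragment is larger than memcached's item size limit")
```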
Looking at it more closely, I'm not sure the error is even related, as the spikes also happen when the error doesn't. :-/
Is memcached reporting any problems on the system? (I am WILDLY guessing here on the crash.) Or is it 'embedded' in the application? Also, are all the crashes on similar directories, like /archives/list/scm-commits@lists.fedoraproject.org? That is a list which should have a LOT of messages in it, so if the system is doing some sort of 'I got this query of an old archive.. I need to search, index, and pull a string or two to show on screen as live data', that is going to blow up memory.
I don't see any error reported by memcached at all.
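For reference, a quick way to pull the memcached counters from Python (a sketch assuming pylibmc and memcached listening on localhost); evictions or unusual miss counts would point at the cache itself rather than gunicorn:

```python
import pylibmc

mc = pylibmc.Client(["127.0.0.1"])
# get_stats() returns one (server, stats-dict) pair per memcached server;
# the key names follow the memcached "stats" protocol command.
for server, stats in mc.get_stats():
    print(server)
    for key in ("evictions", "get_misses", "bytes", "limit_maxbytes", "curr_items"):
        print(f"  {key}: {stats.get(key)}")
```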
The errors are happening on similar archives that have a LOT of messages, but they don't happen that often.
I made a few attempts at getting it working better:
The spikes usually happen in gunicorn workers. I'm trying to find anything in the logs that corresponds to the spikes, but I haven't noticed anything so far.
After @kevin enabled debug logs, I seem to have found the culprit, but I'm not sure what to do with it. Each time I checked a worker with a memory spike, it was stuck on /archives/search and the request was then cut off by EPIPE.
Here is the entry from the logs:
Jul 10 07:41:21 mailman01.iad2.fedoraproject.org gunicorn[1469853]: [2024-07-10 02:41:21 -0500] [1469853] [DEBUG] GET /archives/search
Jul 10 07:42:27 mailman01.iad2.fedoraproject.org gunicorn[1469853]: [2024-07-10 02:42:27 -0500] [1469853] [DEBUG] Ignoring EPIPE
It was stuck there for about a minute with memory consumption increasing (around 20 GB before the EPIPE); the EPIPE most likely just means the client or proxy gave up and closed the connection while the worker was still building the response.
I'm wondering if that could be related to the search index still being built.
I created a PR to disable the /archives/search endpoint. Hopefully this will prevent the memory spikes until the search index is rebuilt.
I'm not sure this is the best solution, but it should work for now. I will wait for reviews before merging the PR, but I tested the change on https://lists.stg.fedoraproject.org/ and I can confirm that it works. You can try it there.
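For illustration, the simplest way to do this on the Django side would be a tiny middleware that short-circuits search requests (a hypothetical sketch with made-up names, not necessarily how the PR implements it; it could just as well be done at the proxy level):

```python
# Hypothetical sketch: return 503 for HyperKitty's search endpoint while the
# search index is being rebuilt. Add the class to MIDDLEWARE in the Django
# settings and remove it once search is re-enabled.
from django.http import HttpResponse


class DisableArchiveSearchMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        if request.path.startswith("/archives/search"):
            return HttpResponse(
                "Archive search is temporarily disabled.", status=503
            )
        return self.get_response(request)
```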
Commit cdd78d14 fixes this issue
The PR is now merged and deployed; I will monitor mailman for any spikes. Right now memory consumption looks stable at around 3-5 GB. Most of the memory is consumed by the job that is regenerating the search index.
We should remove the changes once the search index is regenerated.
Metadata Update from @zlopez: - Issue status updated to: Open (was: Closed)
The situation seems stable now: the hourly job is running fine and all the gunicorn workers are stable, each consuming a few hundred MB.
Shall we close this now?
I would like to keep this open until we enable search again.
Bump, is it OK to turn search back on yet? Last comment on here was over 20 days ago.
@bderry71 Not yet, I'm regenerating the search cache right now. Unfortunately the automatic mailman job didn't do what was expected, so I wrote a script and am regenerating it from scratch. You can follow the progress in https://pagure.io/fedora-infrastructure/issue/12027
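For context: HyperKitty's full-text search is built on django-haystack, so regenerating the index from scratch essentially means clearing and rebuilding the Haystack index. A minimal sketch of such a script (not the actual one; command options can differ between haystack versions), run inside the mailman web environment:

```python
# Sketch only, not the actual regeneration script. Assumes it is run inside
# the HyperKitty virtualenv with DJANGO_SETTINGS_MODULE pointing at the
# mailman web UI settings.
import django
from django.core.management import call_command

django.setup()

# rebuild_index = clear_index + update_index; interactive=False skips the
# confirmation prompt (the option name may vary between haystack versions).
call_command("rebuild_index", interactive=False)
```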
Hey folks, is this sorted? Can we have it in prod?
There were some issues with generating the fulltext index on staging, so I ended up switching the backend from whoosh to xapian. It's now almost finished on staging and I will enable search there to see if everything is OK. After that I will continue with production.
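For reference, the backend switch itself is just a Haystack settings change, roughly like this (a sketch with placeholder index paths, assuming the xapian-haystack package is installed):

```python
# settings.py sketch: switch HyperKitty's Haystack backend from Whoosh to
# Xapian. The PATH values are placeholders, not the real deployment paths.
HAYSTACK_CONNECTIONS = {
    "default": {
        # Previously:
        #   "ENGINE": "haystack.backends.whoosh_backend.WhooshEngine",
        #   "PATH": "/var/lib/hyperkitty/whoosh_index",
        "ENGINE": "xapian_backend.XapianSearchEngine",  # from xapian-haystack
        "PATH": "/var/lib/hyperkitty/xapian_index",
    }
}
```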
Metadata Update from @zlopez: - Issue assigned to zlopez
Metadata Update from @zlopez: - Issue untagged with: Needs investigation - Issue tagged with: high-trouble
The search is enabled again and there are no spikes when trying to search for something. Closing this as fixed.
Metadata Update from @zlopez: - Issue close_status updated to: Fixed with Explanation - Issue status updated to: Closed (was: Open)