#12043 Mailman deployment memory spikes
Closed: Fixed with Explanation 2 months ago by zlopez. Opened 6 months ago by zlopez.

Describe what you would like us to do:


Since the new mailman deployment, there have been frequent memory spikes in gunicorn. The gunicorn worker for mailman usually consumes around 400 MB, but during a spike its memory usage can climb as high as 30 GB. This sometimes causes services to be killed by the OOM killer.

We should try to find what is causing this and fix it.
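
For anyone trying to catch a spike in the act, here is a minimal sketch of how one could watch gunicorn worker RSS (assumes psutil is installed; the process-name match and the threshold are assumptions, not something we run in production):

# Sketch: poll gunicorn worker RSS and print anything above a threshold.
# The "gunicorn" name match and the 2 GB threshold are illustrative only.
import time

import psutil

THRESHOLD_MB = 2048  # hypothetical alert threshold

while True:
    for proc in psutil.process_iter(["pid", "name", "memory_info"]):
        try:
            if "gunicorn" not in (proc.info["name"] or ""):
                continue
            mem = proc.info["memory_info"]
            if mem is None:  # access denied
                continue
            rss_mb = mem.rss / (1024 * 1024)
            if rss_mb > THRESHOLD_MB:
                print(f"pid {proc.info['pid']}: RSS {rss_mb:.0f} MB")
        except psutil.NoSuchProcess:
            continue
    time.sleep(5)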

When do you need this to be done by? (YYYY/MM/DD)


Not urgent, but it will help make mailman more stable.


I already tried searching for the problem, but the issue people usually report is sustained high memory consumption, not spikes in memory consumption.

I'm trying to use strace to see what the process is doing during a spike.

This looks similar to what I've observed in Bodhi for the past year or so. I never found the cause, though.
Do the spikes in mailman always happen at the same time of day? (If I remember correctly, in Bodhi I see this happening every 3 hours.)

Not sure if that is related, but it seems MM is completely down ATM, returning 503

@smooge created a PR that should hopefully fix the situation. I will merge it today, deploy it, and monitor whether that helped.

I was trying to figure out a way to apply this to all systems behind the proxies, but I realized I don't know enough about the current infrastructure to do it correctly. I don't know whether it needs to go into each group, or whether putting it in roles/httpd/domainrewrite/templates/domainrewrite.conf and/or roles/haproxy/rewrite/templates/rewrite.conf is what's needed.

I'm not sure about that either, @kevin will probably know more.

The memory spikes are still happening even after merging the PR. I can see an error that corresponds to the spikes:

ERROR 2024-07-09 06:59:49,566 1219375 django.request Internal Server Error: /archives/list/scm-commits@lists.fedoraproject.org/2021/12/
Traceback (most recent call last):                                                                                                                                           
  File "/usr/lib/python3.9/site-packages/django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)                                                                                                                                         
  File "/usr/lib/python3.9/site-packages/django/core/handlers/base.py", line 197, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)                                                                                                  
  File "/usr/lib/python3.9/site-packages/hyperkitty/lib/view_helpers.py", line 137, in inner
    return func(request, *args, **kwargs)                                                                                                                                    
  File "/usr/lib/python3.9/site-packages/hyperkitty/views/mlist.py", line 127, in archives
    return _thread_list(request, mlist, threads, extra_context=extra_context)                                                                                                
  File "/usr/lib/python3.9/site-packages/hyperkitty/views/mlist.py", line 194, in _thread_list
    return render(request, template_name, context)
  File "/usr/lib/python3.9/site-packages/django/shortcuts.py", line 24, in render
    content = loader.render_to_string(template_name, context, request, using=using)
  File "/usr/lib/python3.9/site-packages/django/template/loader.py", line 62, in render_to_string
    return template.render(context, request)
  File "/usr/lib/python3.9/site-packages/django/template/backends/django.py", line 61, in render
    return self.template.render(context)
  File "/usr/lib/python3.9/site-packages/django/template/base.py", line 175, in render 
    return self._render(context)
  File "/usr/lib/python3.9/site-packages/django/template/base.py", line 167, in _render
    return self.nodelist.render(context)
  File "/usr/lib/python3.9/site-packages/django/template/base.py", line 1005, in render
    return SafeString("".join([node.render_annotated(context) for node in self]))
  File "/usr/lib/python3.9/site-packages/django/template/base.py", line 1005, in <listcomp>
    return SafeString("".join([node.render_annotated(context) for node in self]))
  File "/usr/lib/python3.9/site-packages/django/template/base.py", line 966, in render_annotated
    return self.render(context)
  File "/usr/lib/python3.9/site-packages/django/template/loader_tags.py", line 157, in render
    return compiled_parent._render(context) 
  File "/usr/lib/python3.9/site-packages/django/template/base.py", line 167, in _render
    return self.nodelist.render(context)
  File "/usr/lib/python3.9/site-packages/django/template/base.py", line 1005, in render
    return SafeString("".join([node.render_annotated(context) for node in self]))
  File "/usr/lib/python3.9/site-packages/django/template/base.py", line 1005, in <listcomp>
    return SafeString("".join([node.render_annotated(context) for node in self]))
  File "/usr/lib/python3.9/site-packages/django/template/base.py", line 966, in render_annotated
    return self.render(context)
  File "/usr/lib/python3.9/site-packages/django/template/loader_tags.py", line 63, in render
    result = block.nodelist.render(context) 
  File "/usr/lib/python3.9/site-packages/django/template/base.py", line 1005, in render
    return SafeString("".join([node.render_annotated(context) for node in self]))
  File "/usr/lib/python3.9/site-packages/django/template/base.py", line 1005, in <listcomp>
    return SafeString("".join([node.render_annotated(context) for node in self]))
  File "/usr/lib/python3.9/site-packages/django/template/base.py", line 966, in render_annotated
    return self.render(context)
  File "/usr/lib/python3.9/site-packages/django/template/loader_tags.py", line 63, in render
    result = block.nodelist.render(context) 
  File "/usr/lib/python3.9/site-packages/django/template/base.py", line 1005, in render
    return SafeString("".join([node.render_annotated(context) for node in self]))
  File "/usr/lib/python3.9/site-packages/django/template/base.py", line 1005, in <listcomp>
    return SafeString("".join([node.render_annotated(context) for node in self]))
  File "/usr/lib/python3.9/site-packages/django/template/base.py", line 966, in render_annotated
    return self.render(context)
  File "/usr/lib/python3.9/site-packages/django/templatetags/cache.py", line 54, in render
    fragment_cache.set(cache_key, value, expire_time)
  File "/usr/lib/python3.9/site-packages/django/core/cache/backends/memcached.py", line 79, in set
    if not self._cache.set(key, value, self.get_backend_timeout(timeout)):
pylibmc.TooBig
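
For context, pylibmc raises TooBig when a value exceeds memcached's item size limit (1 MB by default; memcached's -I option can raise it). A sketch of a defensive wrapper that skips oversized fragments instead of erroring (the helper is hypothetical, not part of Django or HyperKitty):

# Sketch: skip caching oversized rendered fragments instead of raising.
# Assumes the pylibmc-backed Django cache; the helper name is illustrative.
import logging

import pylibmc
from django.core.cache import cache

log = logging.getLogger(__name__)


def set_fragment_safely(key, value, timeout=300):
    """Cache a rendered fragment, but skip values memcached rejects as too big."""
    try:
        cache.set(key, value, timeout)
    except pylibmc.TooBig:
        log.warning("fragment %s exceeds memcached's item size limit; not caching", key)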

Looking at it more closely, I'm not sure the error is even related, since the spikes happen even when the error isn't occurring. :-/

Is memcached reporting any problems on the system? [I am WILDLY guessing here on the crash.] Or is it 'embedded' in the application? Also, are all the crashes on similar directories, like
/archives/list/scm-commits@lists.fedoraproject.org
or similar? That is a list which should have a LOT of messages in it, so if the system is doing some sort of 'I got this query for an old archive... I need to search, index, and pull a string or two to show on screen as live data', that is going to blow up memory.

I don't see any errors reported by memcached at all.

The errors are happening on similar archives that have a LOT of messages, but they don't happen that often.

I made a few attempts at getting it working better:

  • Moved to using memcached01 instead of a local one. That should free up some memory and avoid some thrashing.
  • Changed the number of django workers to 4; I think the default is normally 1. Perhaps we want to try more here (a gunicorn config sketch follows this list).
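
For reference, gunicorn config files are plain Python, so the worker change amounts to something like the sketch below (the timeout and max_requests values are illustrative assumptions, not our production settings):

# Sketch of a gunicorn config file (run with: gunicorn -c gunicorn_conf.py ...).
workers = 4               # gunicorn's default is 1
timeout = 60              # sync workers stuck longer than this get killed (seconds)
max_requests = 1000       # recycle each worker after N requests to cap slow leaks
max_requests_jitter = 50  # stagger the recycling so workers don't all restart together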

The spikes usually happen in gunicorn workers. I'm trying to find anything in the logs that corresponds to the spikes, but I haven't noticed anything so far.

After @kevin enabled debug logs, I seem to have found the culprit, but I'm not sure what to do with it.
Each time I checked a worker with a memory spike, it was stuck on /archives/search until it was cut off by EPIPE.

Here are the entries from the logs:

Jul 10 07:41:21 mailman01.iad2.fedoraproject.org gunicorn[1469853]: [2024-07-10 02:41:21 -0500] [1469853] [DEBUG] GET /archives/search
Jul 10 07:42:27 mailman01.iad2.fedoraproject.org gunicorn[1469853]: [2024-07-10 02:42:27 -0500] [1469853] [DEBUG] Ignoring EPIPE

It was stuck on that, with memory consumption climbing (to around 20 GB before the EPIPE), for about a minute.

I'm wondering if that could be related to the search index still being built.

I created a PR to disable the /archives/search endpoint. Hopefully this will prevent the memory spikes until the search index is rebuilt.
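
Conceptually, a minimal sketch of one way to short-circuit such an endpoint at the Django layer (illustrative only; the actual PR may block it elsewhere, e.g. at the proxy):

# Sketch: middleware returning 503 for the search endpoint.
from django.http import HttpResponse


class DisableSearchMiddleware:
    """Hypothetical stopgap: refuse archive search requests with a 503."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        if request.path.endswith("/archives/search"):
            return HttpResponse("Search is temporarily disabled.", status=503)
        return self.get_response(request)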

I'm not sure this is the best solution, but it should work for now. I will wait for reviews before merging the PR, but I tested the solution on https://lists.stg.fedoraproject.org/ and can confirm that it works. You can try it there.

The PR is now merged and deployed; I will monitor mailman for any spikes. Right now memory consumption looks stable at around 3-5 GB. Most of the memory is consumed by the job that is regenerating the search index.

We should remove the changes once the search index is regenerated.

Metadata Update from @zlopez:
- Issue status updated to: Open (was: Closed)

6 months ago

The situation seems stable now: the hourly job is running fine and all the gunicorn workers are stable, each consuming a few hundred MB.

Shall we close this now?

I would like to keep this open until we enable search again.

Bump, is it OK to turn search back on yet? The last comment here was over 20 days ago.

@bderry71 Not yet, I'm regenerating the search cache right now. Unfortunately the automatic mailman job didn't do what was expected, so I wrote a script and am regenerating it from scratch. You can see the progress in https://pagure.io/fedora-infrastructure/issue/12027
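
Since HyperKitty's search goes through django-haystack, regenerating from scratch boils down to rebuilding the haystack index. A minimal sketch of such a script (assuming haystack's standard management commands; not necessarily the exact script I used):

# Sketch: rebuild the haystack full-text index from scratch.
# Assumes DJANGO_SETTINGS_MODULE points at the mailman-web settings.
import django

django.setup()

from django.core.management import call_command

# rebuild_index = clear_index + update_index; interactive=False skips the prompt
call_command("rebuild_index", interactive=False)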

Hey folks, is this sorted? Can we have it in prod?

There were some issues with generating the fulltext index on staging, so I ended up changing the backend from whoosh to xapian. It's now almost finished on staging, and I will enable search there to see if everything is OK. After that I will continue with production.
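
For the record, the backend switch is a HAYSTACK_CONNECTIONS change in the Django settings; a sketch (the index paths are illustrative, not the deployed ones):

# Sketch of the settings change: whoosh -> xapian haystack backend.
HAYSTACK_CONNECTIONS = {
    "default": {
        # previously the whoosh backend:
        # "ENGINE": "haystack.backends.whoosh_backend.WhooshEngine",
        "ENGINE": "xapian_backend.XapianEngine",  # from the xapian-haystack package
        "PATH": "/var/lib/mailman3/web/fulltext_index",  # illustrative path
    }
}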

Metadata Update from @zlopez:
- Issue assigned to zlopez

3 months ago

Metadata Update from @zlopez:
- Issue untagged with: Needs investigation
- Issue tagged with: high-trouble

3 months ago

The search is enabled again and there are no spikes when trying to search for something. Closing this as fixed.

Metadata Update from @zlopez:
- Issue close_status updated to: Fixed with Explanation
- Issue status updated to: Closed (was: Open)

2 months ago


Metadata
Boards 1
ops Status: Backlog
Related Pull Requests
  • #2147 Merged 6 months ago