#8233 [Release-monitoring] Pod is being restarted with exit code 137 on staging
Closed: Fixed 4 years ago by kevin. Opened 4 years ago by zlopez.

I have an issue with one pod, release-monitoring-check-service-21-l9tmq, in staging OpenShift.

It keeps getting restarted with exit code 137, which according to the Docker support page is caused by OOM; I don't see any error in the pod's log that would explain the restarts.

I can't verify this myself because I can't see the resources available to or used by the pod.

Could somebody look at this? It's possible that the pod only has a small amount of memory available. I want to rule out a possible issue in Anitya before I deploy it to production.


Metadata Update from @bowlofeggs:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: OpenShift, release-monitoring

4 years ago

Yeah, it is hitting the OOM killer.

[Mon Sep 23 12:10:38 2019] Killed process 36989 (python3), UID 1000200000, total-vm:4575592kB, anon-rss:3008776kB, file-rss:0kB, shmem-rss:0kB
[Mon Sep 23 12:22:20 2019] Killed process 48214 (python3), UID 1000200000, total-vm:4567692kB, anon-rss:2976480kB, file-rss:0kB, shmem-rss:0kB
[Mon Sep 23 12:36:13 2019] Killed process 58875 (python3), UID 1000200000, total-vm:4571784kB, anon-rss:2965092kB, file-rss:0kB, shmem-rss:0kB
[Mon Sep 23 12:49:18 2019] Killed process 72555 (python3), UID 1000200000, total-vm:4592188kB, anon-rss:3040100kB, file-rss:0kB, shmem-rss:0kB
[Mon Sep 23 13:04:22 2019] Killed process 84628 (python3), UID 1000200000, total-vm:4579560kB, anon-rss:2949364kB, file-rss:4kB, shmem-rss:0kB
[Mon Sep 23 13:18:27 2019] Killed process 98227 (python3), UID 1000200000, total-vm:4578248kB, anon-rss:3012592kB, file-rss:0kB, shmem-rss:0kB
[Mon Sep 23 13:30:48 2019] Killed process 110965 (python3), UID 1000200000, total-vm:4567960kB, anon-rss:2961144kB, file-rss:0kB, shmem-rss:0kB
[Mon Sep 23 13:45:53 2019] Killed process 122177 (python3), UID 1000200000, total-vm:4571788kB, anon-rss:2968424kB, file-rss:0kB, shmem-rss:0kB
[Mon Sep 23 13:58:58 2019] Killed process 5121 (python3), UID 1000200000, total-vm:4592188kB, anon-rss:2915000kB, file-rss:0kB, shmem-rss:0kB
[Mon Sep 23 14:14:06 2019] Killed process 16864 (python3), UID 1000200000, total-vm:4579300kB, anon-rss:2858760kB, file-rss:0kB, shmem-rss:0kB
[Mon Sep 23 14:28:07 2019] Killed process 30982 (python3), UID 1000200000, total-vm:4578784kB, anon-rss:2814680kB, file-rss:4kB, shmem-rss:0kB
[Mon Sep 23 14:40:18 2019] Killed process 44074 (python3), UID 1000200000, total-vm:4567972kB, anon-rss:2832304kB, file-rss:0kB, shmem-rss:0kB
[Mon Sep 23 14:53:46 2019] Killed process 55175 (python3), UID 1000200000, total-vm:4637320kB, anon-rss:2768488kB, file-rss:0kB, shmem-rss:0kB
[Mon Sep 23 15:07:55 2019] Killed process 67176 (python3), UID 1000200000, total-vm:4657724kB, anon-rss:3525556kB, file-rss:0kB, shmem-rss:0kB
[Mon Sep 23 15:23:03 2019] Killed process 79967 (python3), UID 1000200000, total-vm:4579556kB, anon-rss:3290708kB, file-rss:0kB, shmem-rss:0kB
[Mon Sep 23 15:37:39 2019] Killed process 93649 (python3), UID 1000200000, total-vm:4578768kB, anon-rss:3286512kB, file-rss:0kB, shmem-rss:0kB
[Mon Sep 23 15:49:45 2019] Killed process 106881 (python3), UID 1000200000, total-vm:4567468kB, anon-rss:3270800kB, file-rss:0kB, shmem-rss:0kB
[Mon Sep 23 16:03:32 2019] Killed process 117748 (python3), UID 1000200000, total-vm:4572056kB, anon-rss:3686756kB, file-rss:0kB, shmem-rss:0kB
[Mon Sep 23 16:03:33 2019] Killed process 117919 (python3), UID 1000200000, total-vm:4572056kB, anon-rss:3687668kB, file-rss:0kB, shmem-rss:0kB
[Mon Sep 23 16:16:46 2019] Killed process 130422 (python3), UID 1000200000, total-vm:4658256kB, anon-rss:3641860kB, file-rss:4kB, shmem-rss:0kB
[Mon Sep 23 16:31:41 2019] Killed process 11690 (python3), UID 1000200000, total-vm:4580084kB, anon-rss:3594100kB, file-rss:4kB, shmem-rss:0kB
[Mon Sep 23 16:45:14 2019] Killed process 25353 (python3), UID 1000200000, total-vm:4578772kB, anon-rss:3555952kB, file-rss:0kB, shmem-rss:0kB
[Mon Sep 23 16:57:06 2019] Killed process 37933 (python3), UID 1000200000, total-vm:4707508kB, anon-rss:3585092kB, file-rss:0kB, shmem-rss:0kB
[Mon Sep 23 17:10:54 2019] Killed process 48754 (python3), UID 1000200000, total-vm:4572984kB, anon-rss:3526220kB, file-rss:0kB, shmem-rss:0kB
[Mon Sep 23 17:10:56 2019] Killed process 48840 (python3), UID 1000200000, total-vm:4572984kB, anon-rss:3527416kB, file-rss:0kB, shmem-rss:0kB

That script is taking up rather a lot of memory... around 8-9GB or so. Our OpenShift nodes only have 16GB.

1000200+ 61266 54.4 1.2 873844 199204 ? Ss 17:11 18:46 python3 /usr/local/bin/check_service.py

So, possible options:

  • The current pod has been running since the 12th, perhaps we should restart it in case there's a leak going on?

  • Figure out why the script is taking so much memory and reduce it via code

  • Increase memory of nodes

  • Add another node or two.

Thoughts?

The current pod has been running since the 12th, perhaps we should restart it in case there's a leak going on?

The restart count is currently 316, but I'm not sure whether that counts restarts of the script or of the whole pod. We could try restarting the pod itself and see if the memory consumption is lower.

Figure out why the script is taking so much memory and reduce it via code

This would be the preferred option, but I'm not sure what is causing such high memory consumption. I'm using 10 threads running in parallel, and each of them goes through the projects one by one (roughly like the sketch below).
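
For reference, a minimal sketch of that threading pattern, assuming a plain `ThreadPoolExecutor` and a hypothetical `check_project()` helper; this is not Anitya's actual code, just an illustration of 10 workers walking the project list in parallel:

```python
from concurrent.futures import ThreadPoolExecutor

def check_project(project):
    # placeholder: fetch the project's page and run the version regex on it
    ...

def check_all(projects, workers=10):
    # 10 threads, each picking up the next project as it becomes free;
    # memory grows if any single check holds on to a large page or result
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for result in pool.map(check_project, projects):
            pass  # results would normally be written to the database
```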

Increase memory of nodes

If the memory usage is rising over time, this will not help.

Add another node or two.

I currently don't have any mechanism that could share the load between multiple nodes. I could try adding Celery with a RabbitMQ queue to trigger jobs, but I would still need some service that periodically goes through the projects and sends a message for every project that is ready to be checked (see the sketch below).
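
A minimal sketch of that idea, assuming Celery with a RabbitMQ broker; the task name, broker URL and `projects_ready_for_check()` helper are hypothetical and only illustrate the dispatch pattern, not an actual Anitya design:

```python
from celery import Celery

# hypothetical broker URL; in OpenShift this would point at the RabbitMQ service
app = Celery("anitya_checks", broker="amqp://guest@rabbitmq//")

@app.task
def check_project(project_id):
    # placeholder: fetch the project's page and look for new versions
    ...

def dispatch_checks():
    # a small periodic service (cron job or celery beat) walks the projects
    # and enqueues one message per project that is due for a check
    for project in projects_ready_for_check():  # hypothetical query helper
        check_project.delay(project.id)
```

The workers consuming the queue could then be scaled across nodes independently of the dispatcher.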

Thoughts?

I will try to analyze the memory issue on my local instance of Anitya.

I see that on production the check-service pod is taking around 310 MB of RAM running version 0.16.1 of Anitya, so the excessive memory consumption must be an issue with the latest version of Anitya.

Not sure why, but on production both Anitya pods are only 4 days old. Is this related to the new certificate?

I did some analysis with memory-profiler and made a few changes. I will let it run for a few hours on my local machine, and if the memory looks fine, I will deploy the change to staging.

Hopefully this will prevent OOM in the future.
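
For anyone reproducing this kind of analysis, a minimal sketch of how memory-profiler is typically used; the decorated function is a stand-in, not the actual check-service code:

```python
from memory_profiler import profile

@profile  # prints line-by-line memory usage when the function runs
def check_all_projects():
    # stand-in for the check-service loop being profiled
    pages = [b"x" * 10_000_000 for _ in range(10)]  # simulate large fetched pages
    return sum(len(p) for p in pages)

if __name__ == "__main__":
    check_all_projects()
```

Running the script normally then prints per-line memory increments, which makes it easy to spot the line that holds the large allocation.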

I deployed the fix, but the initial memory consumption is still too high, so I need to try another optimization.

After playing with it locally for a while, I found out that this issue is caused by a single project on staging. That project points to a site that causes a MemoryError or OOM after some time when we try to parse it with our default regex.

We should probably introduce some timeout when parsing the page, so we don't end up with a memory issue (one possible approach is sketched below).
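
A minimal sketch of one way such a timeout could work, by running the regex scan in a child process that can be killed; the function names and the 30-second limit are hypothetical, and this is not the fix that ended up in Anitya:

```python
import multiprocessing
import queue
import re

def _scan(pattern, text, result_queue):
    # re.findall can take a very long time (and a lot of memory) on a
    # pathological combination of pattern and page content
    result_queue.put(re.findall(pattern, text))

def findall_with_timeout(pattern, text, timeout=30):
    # Run the scan in a child process so a runaway match can simply be
    # killed instead of taking the whole check service down with it.
    result_queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_scan, args=(pattern, text, result_queue))
    proc.start()
    try:
        return result_queue.get(timeout=timeout)
    except queue.Empty:
        raise TimeoutError(f"regex scan exceeded {timeout} seconds")
    finally:
        if proc.is_alive():
            proc.terminate()
        proc.join()
```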

However, when I tried to change the project on staging I got a 504 GATEWAY TIMEOUT. I looked in the log and didn't even see any POST request. :-(

OK, I was able to change the project; I hope this helps with the OOM error.

I also created an issue for the OOM error: https://github.com/release-monitoring/anitya/issues/843

Cool. Feel free to re-open or file new if there's anything further to do on our end.

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

4 years ago

@kevin Thanks for your help, it is now running without restarts, with average memory usage around 350 MB.

Awesome! 🕺🕺🕺
