#9278 koschei frontend pods crashing from time to time
Closed: Fixed 3 years ago by mizdebsk. Opened 3 years ago by kevin.

About 4 times now, the koschei frontend pods have just started crashing (resulting in one alert for every proxy and one for each OpenShift node, so ~30 pages).

logs from the pods:

[Sun Aug 30 00:22:10.758996 2020] [lbmethod_heartbeat:notice] [pid 1:tid 140182298075456] AH02282: No slotmem from mod_heartmonitor
[Sun Aug 30 00:22:10.759698 2020] [http2:warn] [pid 1:tid 140182298075456] AH02951: mod_ssl does not seem to be enabled
[Sun Aug 30 00:22:10.759997 2020] [cgid:error] [pid 8:tid 140182298075456] (13)Permission denied: AH01243: Couldn't bind unix domain socket /etc/httpd/run/cgisock.1
[Sun Aug 30 00:22:10.767070 2020] [mpm_event:notice] [pid 1:tid 140182298075456] AH00489: Apache/2.4.41 (Fedora) mod_wsgi/4.6.6 Python/3.7 configured -- resuming normal operations
[Sun Aug 30 00:22:10.767115 2020] [core:notice] [pid 1:tid 140182298075456] AH00094: Command line: 'httpd -D NO_DETACH -D FOREGROUND -f /usr/share/koschei/httpd.conf'
[Sun Aug 30 00:22:10.767242 2020] [cgid:error] [pid 1:tid 140182298075456] (13)Permission denied: AH01240: Couldn't unlink unix domain socket /etc/httpd/run/cgisock.1
[Sun Aug 30 00:22:10.767282 2020] [cgid:crit] [pid 1:tid 140182298075456] AH01238: cgid daemon failed to initialize
[Sun Aug 30 00:23:08.236387 2020] [core:warn] [pid 1:tid 140182298075456] AH00045: child process 13 still did not exit, sending a SIGTERM
[Sun Aug 30 00:23:08.236427 2020] [core:warn] [pid 1:tid 140182298075456] AH00045: child process 231 still did not exit, sending a SIGTERM
[Sun Aug 30 00:23:10.238493 2020] [core:warn] [pid 1:tid 140182298075456] AH00045: child process 13 still did not exit, sending a SIGTERM
[Sun Aug 30 00:23:10.238577 2020] [core:warn] [pid 1:tid 140182298075456] AH00045: child process 231 still did not exit, sending a SIGTERM
[Sun Aug 30 00:23:12.240639 2020] [core:warn] [pid 1:tid 140182298075456] AH00045: child process 231 still did not exit, sending a SIGTERM
[Sun Aug 30 00:23:13.241857 2020] [mpm_event:notice] [pid 1:tid 140182298075456] AH00491: caught SIGTERM, shutting down

The permissions on /run/httpd are weird, but as far as I can see nothing has changed in months.

We don't build the images for koschei; it pulls them from elsewhere, but our OpenShift says the image hasn't changed.

I have no idea what's going on here.

@mizdebsk any thoughts?


Metadata Update from @mobrien:
- Issue priority set to: Waiting on External (was: Needs Review)

3 years ago

Is this the issue that's causing Koschei to show:

Application is not available

The application is currently not serving requests at this endpoint. It may not have been started or is still starting.

?

Yes :(

I don't know what may be causing this, but I will look into the issue tomorrow.

Metadata Update from @mizdebsk:
- Issue assigned to mizdebsk
- Issue priority set to: Waiting on Assignee (was: Waiting on External)

3 years ago

Metadata Update from @mizdebsk:
- Issue tagged with: ops

3 years ago

I am working on the issue.

It looks like this may be related to poor DB performance.
The OpenShift liveness and readiness probes have a 1-second timeout. They try to load the /stats page, which runs the following slow query, and when the query takes longer than the timeout the probes fail.

koschei=# explain analyze SELECT resource_consumption_stats.name AS resource_consumption_stats_name, resource_consumption_stats.arch AS resource_consumption_stats_arch, resource_consumption_stats.time AS resource_consumption_stats_time, resource_consumption_stats.time_percentage AS resource_consumption_stats_time_percentage FROM resource_consumption_stats ORDER BY resource_consumption_stats.time DESC NULLS LAST LIMIT 20 OFFSET 0;
                                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=936146.77..936149.11 rows=20 width=44) (actual time=3550.219..3552.325 rows=20 loops=1)
   ->  Gather Merge  (cost=936146.77..3592774.85 rows=22769508 width=44) (actual time=3550.218..3552.320 rows=20 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=935146.75..963608.63 rows=11384754 width=44) (actual time=3545.225..3545.227 rows=16 loops=3)
               Sort Key: "time" DESC NULLS LAST
               Sort Method: top-N heapsort  Memory: 27kB
               Worker 0:  Sort Method: top-N heapsort  Memory: 27kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 27kB
               ->  Parallel Seq Scan on resource_consumption_stats  (cost=0.00..632202.54 rows=11384754 width=44) (actual time=1077.342..3541.631 rows=31055 loops=3)
 Planning Time: 10.913 ms
 Execution Time: 3552.396 ms
(12 rows)
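
That plan is telling: the planner estimated roughly 22.7 million rows for a table that actually returned only about 93 thousand, which points at severe bloat (and stale statistics). As a sketch of how one could confirm that from the dead-tuple counters in pg_stat_user_tables -- this exact check wasn't captured in the ticket:

-- Hypothetical check, not recorded as part of this ticket: how many dead
-- tuples the table has accumulated and when (auto)vacuum last processed it.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_vacuum,
       last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'resource_consumption_stats';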

I've run VACUUM FULL resource_consumption_stats;.

Before vacuum:

koschei=# select pg_table_size('resource_consumption_stats');
 pg_table_size 
---------------
    4247568384
(1 row)

After vacuum:

koschei=# select pg_table_size('resource_consumption_stats');
 pg_table_size 
---------------
       7192576
(1 row)

Query performance after vacuum:

koschei=# explain analyze SELECT resource_consumption_stats.name AS resource_consumption_stats_name, resource_consumption_stats.arch AS resource_consumption_stats_arch, resource_consumption_stats.time AS resource_consumption_stats_time, resource_consumption_stats.time_percentage AS resource_consumption_stats_time_percentage FROM resource_consumption_stats ORDER BY resource_consumption_stats.time DESC NULLS LAST LIMIT 20 OFFSET 0;
                                                                 QUERY PLAN                                                                 
--------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=4287.74..4287.79 rows=20 width=44) (actual time=18.061..18.066 rows=20 loops=1)
   ->  Sort  (cost=4287.74..4520.65 rows=93165 width=44) (actual time=18.060..18.061 rows=20 loops=1)
         Sort Key: "time" DESC NULLS LAST
         Sort Method: top-N heapsort  Memory: 27kB
         ->  Seq Scan on resource_consumption_stats  (cost=0.00..1808.65 rows=93165 width=44) (actual time=0.011..9.240 rows=93165 loops=1)
 Planning Time: 0.721 ms
 Execution Time: 18.085 ms
(7 rows)

This seems to have fixed the issue.
I've scaled all deployment configs back to their desired replica counts:

oc -n koschei scale --replicas 2 dc/frontend
oc -n koschei scale --replicas 1 dc/polling
oc -n koschei scale --replicas 1 dc/repo-resolver
oc -n koschei scale --replicas 1 dc/build-resolver
oc -n koschei scale --replicas 1 dc/scheduler
oc -n koschei scale --replicas 1 dc/watcher

The issue is resolved for now, but it may reappear in the future unless autovacuum is fixed.
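
For future reference, one way to keep autovacuum on top of a table like this is per-table tuning of its autovacuum thresholds. This is only a sketch (it was not applied as part of this ticket, and the numbers would need tuning):

-- Hypothetical per-table tuning, not applied as part of this ticket.
-- Trigger autovacuum after ~1000 dead tuples plus 1% of the table,
-- instead of the default 20% scale factor.
ALTER TABLE resource_consumption_stats SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_threshold = 1000
);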

Metadata Update from @mizdebsk:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago

In commit 150b9c3 I changed the OpenShift probes so that they no longer rely on the slow query.
