#9591 python builds cause kojid OOM kill on 32 bit arm builders
Closed: Upstream 3 years ago by kevin. Opened 3 years ago by kevin.

It seems to be something in the tests; the builds always get as far as the test suite.

So it gets to the tests, kojid is OOM killed, and the build restarts. ;(

Last part of the log:

0:04:04 load avg: 9.09 [286/422] test_multiprocessing_fork passed (1 min 31 sec) -- running: test_concurrent_futures (3 min 15 sec), test_multiprocessing_forkserver (1 min 28 sec), test_pickle (1 min 1 sec), test_multiprocessing_spawn (1 min 24 sec)
/builddir/build/BUILD/Python-3.8.7/Lib/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/builddir/build/BUILD/Python-3.8.7/Lib/multiprocessing/resource_tracker.py:229: UserWarning: resource_tracker: '/psm_b44f88d8': [Errno 2] No such file or directory: '/psm_b44f88d8'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
0:04:06 load avg: 9.09 [287/422] test_sched passed -- running: test_concurrent_futures (3 min 16 sec), test_multiprocessing_forkserver (1 min 29 sec), test_pickle (1 min 3 sec), test_multiprocessing_spawn (1 min 25 sec)
0:04:06 load avg: 9.09 [288/422] test_sax passed -- running: test_concurrent_futures (3 min 17 sec), test_multiprocessing_forkserver (1 min 30 sec), test_pickle (1 min 3 sec), test_multiprocessing_spawn (1 min 26 sec)
0:04:07 load avg: 9.09 [289/422] test_scope passed -- running: test_concurrent_futures (3 min 17 sec), test_multiprocessing_forkserver (1 min 30 sec), test_pickle (1 min 4 sec), test_multiprocessing_spawn (1 min 26 sec)
0:04:07 load avg: 9.09 [290/422] test_secrets passed -- running: test_concurrent_futures (3 min 18 sec), test_multiprocessing_forkserver (1 min 31 sec), test_pickle (1 min 4 sec), test_multiprocessing_spawn (1 min 27 sec)
0:04:08 load avg: 9.09 [291/422] test_script_helper passed -- running: test_concurrent_futures (3 min 18 sec), test_multiprocessing_forkserver (1 min 31 sec), test_pickle (1 min 5 sec), test_multiprocessing_spawn (1 min 27 sec)
0:04:08 load avg: 9.09 [292/422] test_pickle passed (1 min 4 sec) -- running: test_concurrent_futures (3 min 18 sec), test_multiprocessing_forkserver (1 min 31 sec), test_multiprocessing_spawn (1 min 27 sec)
0:04:13 load avg: 10.37 [293/422] test_set passed -- running: test_concurrent_futures (3 min 24 sec), test_multiprocessing_forkserver (1 min 37 sec), test_multiprocessing_spawn (1 min 33 sec)
0:04:14 load avg: 9.94 [294/422] test_setcomps passed -- running: test_concurrent_futures (3 min 24 sec), test_multiprocessing_forkserver (1 min 38 sec), test_multiprocessing_spawn (1 min 34 sec)
0:04:17 load avg: 9.94 [295/422] test_shelve passed -- running: test_concurrent_futures (3 min 27 sec), test_multiprocessing_forkserver (1 min 41 sec), test_multiprocessing_spawn (1 min 37 sec)
0:04:18 load avg: 9.94 [296/422] test_shlex passed -- running: test_concurrent_futures (3 min 28 sec), test_multiprocessing_forkserver (1 min 42 sec), test_multiprocessing_spawn (1 min 38 sec)
0:04:19 load avg: 9.62 [297/422] test_select passed -- running: test_concurrent_futures (3 min 29 sec), test_multiprocessing_forkserver (1 min 43 sec), test_multiprocessing_spawn (1 min 39 sec)

https://koji.fedoraproject.org/koji/taskinfo?taskID=60105956

The builders are now f33 with 40GB of memory. Before, they were f32 with an older kernel and 24GB of memory.

Not sure how to debug this... but it's been a very long day.

I'd suggest perhaps disabling tests on 32-bit ARM for now to get a build out? (Rough sketch below.)
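Something like this, assuming the python3.8 spec gates its test suite behind a "tests" bcond (as Fedora Python packages generally do); the actual spec change may look different:

# local/scratch rebuild of the SRPM with the test-suite bcond turned off
rpmbuild --rebuild --without tests python3.8-3.8.7-1.fc33.src.rpm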

cc @churchyard


I'll see if running the tests in -j1 works.
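For reference, a minimal sketch of a single-process run of CPython's regression suite (the exact invocation in the spec's %check is more involved):

# run the test suite with one worker process and a generous per-test timeout
./python -m test -j1 --timeout=1800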

Note that this is happening with Python 3.8. I'll check the other versions as well. The Python 3.9 build passed (submitted more or less together).

Metadata Update from @humaton:
- Issue priority set to: Waiting on External (was: Needs Review)

3 years ago
  1. this also affects other Python versions (at least 3.7, 3.8, 3.9)
  2. -j1 does not help :(

Even with tests disabled, the build now takes longer than it used to. Not sure if that's relevant or just random variation due to a different builder, etc.

  • Before: python3-3.8.7-1.fc32, armv7hl with tests, task time 0:54:15
  • Now: python3-3.8.7-2.fc32, armv7hl without tests, task time 1:54:47

Thanks for the info. I will try some things to adjust them...

Metadata Update from @kevin:
- Issue assigned to kevin
- Issue priority set to: Waiting on Assignee (was: Waiting on External)
- Issue tagged with: high-gain, high-trouble, ops

3 years ago

So, I tried a bunch of things today without much luck.

I reduced the memory from 40GB to 32GB, then to 24GB (which is what it was before).

I am noticing an oops now:

[ 2352.075906] cat: page allocation failure: order:6, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,
mems_allowed=0
[ 2352.077887] CPU: 4 PID: 1782 Comm: cat Not tainted 5.10.9-201.fc33.armv7hl+lpae #1
[ 2352.079267] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[ 2352.080264] [<c040f7a0>] (unwind_backtrace) from [<c040a18c>] (show_stack+0x18/0x1c)
[ 2352.081563] [<c040a18c>] (show_stack) from [<c0db5c30>] (dump_stack+0xb4/0xd0)
[ 2352.082557] [<c0db5c30>] (dump_stack) from [<c05d25c4>] (warn_alloc+0xa0/0x138)
[ 2352.083720] [<c05d25c4>] (warn_alloc) from [<c05d2c88>] (__alloc_pages_nodemask+0x610/0xbe8)
[ 2352.085044] [<c05d2c88>] (__alloc_pages_nodemask) from [<c05af120>] (kmalloc_order+0x3c/0x70)
[ 2352.086538] [<c05af120>] (kmalloc_order) from [<c05af174>] (kmalloc_order_trace+0x20/0xb4)
[ 2352.087629] [<c05af174>] (kmalloc_order_trace) from [<c0695178>] (proc_sys_call_handler+0xa4/0x21c)
[ 2352.089116] [<c0695178>] (proc_sys_call_handler) from [<c0608180>] (vfs_read+0x174/0x1bc)
[ 2352.090291] [<c0608180>] (vfs_read) from [<c0608404>] (ksys_read+0x68/0x9c)
[ 2352.091477] [<c0608404>] (ksys_read) from [<c04001a0>] (ret_fast_syscall+0x0/0x28)
[ 2352.092627] Exception stack(0xe35d1fa8 to 0xe35d1ff0)
[ 2352.093464] 1fa0:                   b6a5b000 00020000 00000003 b6a5b000 00020000 00000000
[ 2352.094841] 1fc0: b6a5b000 00020000 b6f01160 00000003 00455d40 00000000 00020000 00000002
[ 2352.096260] 1fe0: 00455f08 befa3978 004426b0 b6e1a06c
[ 2352.097032] Mem-Info:
[ 2352.097568] active_anon:163 inactive_anon:47071 isolated_anon:0
                active_file:46458 inactive_file:414537 isolated_file:0
                unevictable:0 dirty:54 writeback:0
                slab_reclaimable:25432 slab_unreclaimable:33312
                mapped:29508 shmem:168 pagetables:860 bounce:0
                free:5582712 free_pcp:130 free_cma:16320
[ 2352.103320] Node 0 active_anon:652kB inactive_anon:188724kB active_file:185832kB inactive_file:1658808kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:118032kB dirty:216kB writeback:0kB shmem:672kB writeback_tmp:0kB kernel_stack:1368kB all_unreclaimable? no
[ 2352.107134] DMA free:11216kB min:2148kB low:2684kB high:3220kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:1040kB inactive_file:964kB unevictable:0kB writepending:0kB present:786432kB managed:307300kB mlocked:0kB pagetables:$440kB bounce:0kB free_pcp:156kB local_pcp:80kB free_cma:0kB
[ 2352.111390] lowmem_reserve[]: 0 0 31996 0
[ 2352.111906] HighMem free:22317100kB min:512kB low:61512kB high:122512kB reserved_highatomic:0KB active_anon:656kB inactive_anon:189632kB active_file:184896kB inactive_file:1658656kB unevictable:0kB writepending:208kB present:32766976kB managed:243768$8kB mlocked:0kB pagetables:0kB bounce:0kB free_pcp:356kB local_pcp:0kB free_cma:65280kB
[ 2352.116616] lowmem_reserve[]: 0 0 0 0
[ 2352.117527] DMA: 716*4kB (UE) 961*8kB (UE) 36*16kB (UE) 7*32kB (UE) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 11352kB
[ 2352.119383] HighMem: 16*4kB (UM) 27*8kB (UM) 2801*16kB (UM) 3077*32kB (UM) 1648*64kB (UM) 726*128kB (UM) 215*256kB (MC) 91*512kB (MC) 50*1024kB (UMC) 31*2048kB (UMC) 5312*4096kB (UMC) = 22316232kB
[ 2352.121930] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 2352.123443] 461846 total pagecache pages
[ 2352.123936] 0 pages in swap cache
[ 2352.124453] Swap cache stats: add 0, delete 0, find 0/0
[ 2352.125488] Free swap  = 37748728kB
[ 2352.126068] Total swap = 37748728kB
[ 2352.126543] 8388352 pages RAM
[ 2352.126946] 8191744 pages HighMem/MovableOnly
[ 2352.127769] 2217325 pages reserved
[ 2352.128407] 16384 pages cma reserved

But that might be related to the ansible 'reboot' module poking:

# cat /proc/sys/kernel/random/boot_id
cat: /proc/sys/kernel/random/boot_id: Cannot allocate memory

I guess tomorrow, if we can't think of anything else, I could move them back to f32 for the mass rebuild and see if that works?

Any further ideas?

cc @pwhalen @pbrobinson @jforbes

Additionally... my f33 python3.8 builds are failing oddly: https://koji.fedoraproject.org/koji/taskinfo?taskID=60180821

But I just had a python3.9 f33 build finish in a little over an hour. So, perhaps it's somewhat back to normal?

xargs: invalid number "%{_smp_build_ncpus}" for -P option

This is caused by building on armhfp. That architecture does not have %{_smp_build_ncpus} defined, so the /usr/lib/rpm/redhat/brp-strip-lto script fails.
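For the curious, a hypothetical one-liner (not the actual brp-strip-lto contents) showing the failure mode: with the macro left unexpanded, xargs receives the literal text as its -P argument and bails out:

$ echo /dev/null | xargs -P '%{_smp_build_ncpus}' -n1 true
xargs: invalid number "%{_smp_build_ncpus}" for -P option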

Ah, oops.

So, do you still see the slowness now?

Not such significant slowness, no.

koji build --scratch f33 python3.8-3.8.7-2.fc33.src.rpm --arch-override 9.40s user 0.48s system 0% cpu 45:15.14 total

That's with the tests back on...

3.8 still gets OOM killed, but not the others. ¯\_(ツ)_/¯

So, I moved all builders to 16GB of memory and removed the highmem=off qemu command-line option, and hopefully they seem more stable.
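(Purely for context, and illustrative only since the builders are actually defined through libvirt rather than launched by hand: highmem is a property of qemu's "virt" machine type that controls whether guest RAM and devices may be placed above the 32-bit boundary in guest physical address space.)

qemu-system-aarch64 -M virt,highmem=off -enable-kvm ...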

I think what we might be hitting is lowmem getting exhausted, so kojid gets OOM killed.

highmem_is_dirtyable doesn't seem to help, so perhaps we need to change lowmem_reserve_ratio, but the docs are... confusing to me.

Looking on a 40GB instance in /proc/meminfo:

LowTotal: 208608 kB
LowFree: 62688 kB

On a 16GB instance:

LowTotal: 503524 kB
LowFree: 368516 kB
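(Those values come straight from /proc/meminfo on a 32-bit highmem/LPAE kernel; a quick way to watch them on a builder is:)

grep -E '^(LowTotal|LowFree|HighTotal|HighFree)' /proc/meminfo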

I tried taking some of them down to 8GB, and I think they are a little more stable, but not sure.

I tried setting highmem_is_dirtyable to 1, and I tried setting lowmem_reserve_ratio to 1.
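(Roughly the equivalent from the command line; the values here are only illustrative, and lowmem_reserve_ratio actually takes a list with one entry per zone pair rather than a single number:)

sysctl -w vm.highmem_is_dirtyable=1
sysctl -w vm.lowmem_reserve_ratio='1 1 1 1'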

I also tried installing them with direct kernel boot rather than using edk2 / uefi, but that doesn't seem to matter.

https://paste.centos.org/view/b4ff9beb is an OOM kill dmesg.

I guess I can try putting them back to f32 later today. ;(

The current builder building gcc right now is an 8GB one. We'll see if it restarts again. So far it's gotten much further...

I can't use any kernels between 5.7.0 and 5.10.9 since they all have the option that will cause the pauses. ;(

I'll look tomorrow and see if gcc finished. If so, we can decide whether to just lower them all to 8GB for now and stay on f33 with the latest kernel, or redo them as f32 with 5.6.6 and try to track down the issue further while the mass rebuild goes ahead.

gcc did finish, but only after 20 hours...

Filed kernel bug:

https://bugzilla.redhat.com/show_bug.cgi?id=1920183

I'm now testing a setup as close as I can get to the builders as they were before: a Fedora 32 install with 5.6.x kernels. Hopefully that will be OK; if so, I will reinstall all the builders with it, we can fire off the mass rebuild, and we'll revisit when we have a proposed fix.

I'm going to close this in favor of the upstream kernel bug above... hopefully that will get sorted and when it does we will move things back. :)

Metadata Update from @kevin:
- Issue close_status updated to: Upstream
- Issue status updated to: Closed (was: Open)

3 years ago

Thanks for all the investigation. I'll get the skip-test-on-arm commit reverted.

