#155 aarch64 builders run out of memory
Closed: Fixed 3 years ago by arrfab. Opened 3 years ago by kkeithle.

https://cbs.centos.org/koji/buildinfo?buildID=30094

https://cbs.centos.org/koji/taskinfo?taskID=1643138

https://cbs.centos.org/kojifiles/work/tasks/3138/1643138/build.log

...
[ 47%] Building CXX object src/mds/CMakeFiles/mds.dir/CDir.cc.o
cd /builddir/build/BUILD/ceph-14.2.13/build/src/mds && /usr/bin/c++ -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D__linux__ -isystem /builddir/build/BUILD/ceph-14.2.13/build/boost/include -I/builddir/build/BUILD/ceph-14.2.13/build/src/include -I/builddir/build/BUILD/ceph-14.2.13/src -isystem /builddir/build/BUILD/ceph-14.2.13/build/include -I/usr/include/nss3 -I/usr/include/nspr4 -isystem /builddir/build/BUILD/ceph-14.2.13/src/xxHash -isystem /builddir/build/BUILD/ceph-14.2.13/src/rapidjson/include -I/builddir/build/BUILD/ceph-14.2.13/src/lua/src -I/builddir/build/BUILD/ceph-14.2.13/build/src/lua -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -fasynchronous-unwind-tables -fstack-clash-protection -Wall -Wtype-limits -Wignored-qualifiers -Winit-self -Wpointer-arith -Werror=format-security -fno-strict-aliasing -fsigned-char -Wno-unknown-pragmas -rdynamic -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -fasynchronous-unwind-tables -fstack-clash-protection -ftemplate-depth-1024 -Wnon-virtual-dtor -Wno-unknown-pragmas -Wno-ignored-qualifiers -Wstrict-null-sentinel -Woverloaded-virtual -fno-new-ttp-matching -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -fstack-protector-strong -fdiagnostics-color=auto -fno-builtin-malloc -fno-builtin-calloc -fno-builtin-realloc -fno-builtin-free -O2 -g -DNDEBUG -fPIC -DHAVE_CONFIG_H -D__CEPH__ -D_REENTRANT -D_THREAD_SAFE -D__STDC_FORMAT_MACROS -std=c++1z -o CMakeFiles/mds.dir/CDir.cc.o -c /builddir/build/BUILD/ceph-14.2.13/src/mds/CDir.cc
virtual memory exhausted: Cannot allocate memory
as: out of memory allocating 4064 bytes after a total of 82509824 bytes
{standard input}: Assembler messages:
{standard input}:1594579: Fatal error: can't close CMakeFiles/mds.dir/MDBalancer.cc.o: Memory exhausted


Seems to fail consistently.

https://cbs.centos.org/koji/buildinfo?buildID=31427

https://cbs.centos.org/koji/taskinfo?taskID=1644013

...
as: out of memory allocating 4064 bytes after a total of 465960960 bytes
{standard input}: Assembler messages:
{standard input}:5626257: Fatal error: can't close CMakeFiles/osd.dir/PG.cc.o: Memory exhausted
{standard input}: Assembler messages:
{standard input}:1239469: Fatal error: bfd_make_empty_symbol: Memory exhausted
{standard input}:1239469: Fatal error: can't close CMakeFiles/osd.dir/OSD.cc.o: Memory exhausted
...

Metadata Update from @arrfab:
- Issue assigned to arrfab

3 years ago

Metadata Update from @arrfab:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: cbs, centos-common-infra, low-gain, low-trouble

3 years ago

Having a look right now, but it seems you then launched some tasks in serial (instead of in parallel) and they both finished?

Worth noting that both aarch64 builders are just VMs on the same hypervisor, so I'm wondering whether we were pushed to the limit by such IO/CPU/memory-intensive builds.
I can have a look at a memory bump, but my theory here (as they built OK in serial) is that we reached some limit at the single-hypervisor level.

Both aarch64 koji builders were bumped with new specs:

  • 20 GB of RAM (instead of 12)
  • 12 vCPUs (instead of 8)

Closing now as that should solve the issue, but the advice is still to "chain" builds like ceph instead of running them in parallel, due to the limits of the current aarch64 infra (same hypervisor). We can revisit later if we move some infra and rebalance the aarch64 nodes' workload.
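One way to read "chaining" here is simply submitting the builds one after another instead of all at once. A minimal sketch, assuming the submit command blocks until the task finishes and exits non-zero on failure (my understanding is that `koji build` behaves this way unless `--nowait` is given); the function name `build_serially` is made up for illustration:

```shell
# build_serially "<submit command>" <source> [<source> ...]
# Runs the submit command once per source, strictly in order, and stops at
# the first failure so a broken build does not queue the rest.
# Hypothetical helper; the submit command and sources are supplied by the caller.
build_serially() {
    submit_cmd=$1
    shift
    for src in "$@"; do
        echo "Submitting $src"
        # $submit_cmd is intentionally unquoted so "cmd subcmd target" splits into words.
        $submit_cmd "$src" || {
            echo "Build of $src failed, stopping" >&2
            return 1
        }
    done
}
```

It could be called as `build_serially "cbs build <target>" first.src.rpm second.src.rpm`, where the target and SRPM names are placeholders.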

Metadata Update from @arrfab:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago

Metadata Update from @kkeithle:
- Issue status updated to: Open (was: Closed)

3 years ago

Are you sure that it's the same issue? From build.log:

make[2]: Leaving directory '/builddir/build/BUILD/ceph-15.2.6/build'
[ 96%] Built target radosgw
make[1]: Leaving directory '/builddir/build/BUILD/ceph-15.2.6/build'
make: *** [Makefile:144: all] Error 2

There's no mention of an OOM error here. Can you reproduce it? Eventually (as the infra and releng teams are focused on other things) we can try to see how to balance the aarch64 workload differently.

Yes, I'm sure it's a repeat of the out-of-memory failure, and I reproduced it multiple times before the holiday.

You can't just look at the tail of the build.log.

virtual memory exhausted: Cannot allocate memory
virtual memory exhausted: Cannot allocate memory
as: out of memory allocating 11 bytes after a total of 143327232 bytes
{standard input}: Assembler messages:
{standard input}:646172: Fatal error: can't close CMakeFiles/unittest_librbd.dir/exclusive_lock/test_mock_PreAcquireRequest.cc.o: Memory exhausted
make[2]: *** [src/test/librbd/CMakeFiles/unittest_librbd.dir/build.make:300: src/test/librbd/CMakeFiles/unittest_librbd.dir/exclusive_lock/test_mock_PreAcquireRequest.cc.o] Error 2
make[2]: *** Waiting for unfinished jobs....
make -f src/rgw/CMakeFiles/ceph_rgw_multiparser.dir/build.make src/rgw/CMakeFiles/ceph_rgw_multiparser.dir/depend
make[2]: Entering directory '/builddir/build/BUILD/ceph-15.2.6/build'
cd /builddir/build/BUILD/ceph-15.2.6/build && /usr/bin/cmake -E cmake_depends "Unix Makefiles" /builddir/build/BUILD/ceph-15.2.6 /builddir/build/BUILD/ceph-15.2.6/src/rgw /builddir/build/BUILD/ceph-15.2.6/build /builddir/build/BUILD/ceph-15.2.6/build/src/rgw /builddir/build/BUILD/ceph-15.2.6/build/src/rgw/CMakeFiles/ceph_rgw_multiparser.dir/DependInfo.cmake --color=
Dependee "/builddir/build/BUILD/ceph-15.2.6/build/src/rgw/CMakeFiles/ceph_rgw_multiparser.dir/DependInfo.cmake" is newer than depender "/builddir/build/BUILD/ceph-15.2.6/build/src/rgw/CMakeFiles/ceph_rgw_multiparser.dir/depend.internal".
Dependee "/builddir/build/BUILD/ceph-15.2.6/build/src/rgw/CMakeFiles/CMakeDirectoryInformation.cmake" is newer than depender "/builddir/build/BUILD/ceph-15.2.6/build/src/rgw/CMakeFiles/ceph_rgw_multiparser.dir/depend.internal".
make[2]: *** [src/test/librbd/CMakeFiles/unittest_librbd.dir/build.make:326: src/test/librbd/CMakeFiles/unittest_librbd.dir/exclusive_lock/test_mock_PreReleaseRequest.cc.o] Error 1
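Since the failure shows up in the middle of the log rather than at its tail, searching the whole file for the error strings quoted in this ticket is quicker than reading the end. A minimal sketch, assuming the log has been downloaded locally; the function name `find_oom_lines` is made up, and the patterns are only the messages seen in the excerpts above:

```shell
# find_oom_lines <build.log>
# Prints every line (prefixed with its line number) that matches one of the
# out-of-memory messages quoted in this ticket. Returns non-zero if none match.
find_oom_lines() {
    grep -n -E 'virtual memory exhausted|out of memory allocating|Memory exhausted' "$1"
}
```

Running `find_oom_lines build.log` on a failed build should list the compiler and assembler errors even when the last lines only show `Error 2`.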

Issue status updated to: Closed (was: Open)
Issue close_status updated to: Fixed

3 years ago

Hi,
I'm trying to help with Ceph builds here, and from what I can see the described issue is still happening (memory exhausted at some point) [1].
I'm not sure how I can "chain" some operations and I still need to investigate, but why is the status Fixed?
I ran the same build several times, but it always fails on aarch64 with the same issue.

[1] https://cbs.centos.org/kojifiles/work/tasks/2893/1712893/build.log

Metadata Update from @arrfab:
- Issue status updated to: Open (was: Closed)

3 years ago

That means that Ceph builds are really memory hungry :(
I have reopened the ticket and I'll try to see how to rebalance some aarch64 VMs to increase the memory available to the aarch64 cbs builders. Subscribe to this thread to get updates (but I have other tasks with higher priority on my plate right now).

Yeah, thanks @arrfab for your help.
I've opened #164 to keep track of this issue, but you'll probably want to close it and keep this one.

As a work-around I've changed the spec file to reduce the number of -j jobs on aarch64. E.g.

@@ -1177,6 +1177,9 @@ echo "Available memory:"
free -h
echo "System limits:"
ulimit -a
+%ifarch aarch64
+CEPH_SMP_NCPUS="4"
+%else
if test -n "$CEPH_SMP_NCPUS" -a "$CEPH_SMP_NCPUS" -gt 1 ; then
mem_per_process=2700
max_mem=$(LANG=C free -m | sed -n "s|^Mem: *\([0-9]*\).*$|\1|p")
@@ -1184,6 +1187,7 @@ if test -n "$CEPH_SMP_NCPUS" -a "$CEPH_SMP_NCPUS" -gt 1 ;
test "$CEPH_SMP_NCPUS" -gt "$max_jobs" && CEPH_SMP_NCPUS="$max_jobs" && ech
test "$CEPH_SMP_NCPUS" -le 0 && CEPH_SMP_NCPUS="1" && echo "Warning: Not us
fi
+%endif
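For context, the non-aarch64 branch kept by this patch caps the job count by available memory. A minimal standalone sketch of that arithmetic, assuming `max_jobs` is total memory divided by `mem_per_process` (the excerpt does not show that line); the function name `cap_jobs` and the memory figures in the usage note are illustrative only:

```shell
# cap_jobs <requested_jobs> <total_mem_mb> <mem_per_job_mb>
# Prints the number of parallel jobs to use: the requested count, limited to
# how many jobs fit in memory, and never less than 1.
# Hypothetical helper mirroring the spec-file logic quoted above.
cap_jobs() {
    requested=$1
    max_jobs=$(( $2 / $3 ))   # integer division, as in shell arithmetic
    if [ "$requested" -gt "$max_jobs" ]; then
        requested=$max_jobs
    fi
    if [ "$requested" -le 0 ]; then
        requested=1
    fi
    echo "$requested"
}
```

With the 2700 MB per-job figure from the spec file, `cap_jobs 12 20000 2700` prints 7 and `cap_jobs 8 12000 2700` prints 4; the totals here are round numbers, not what `free -m` reports on the builders.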

I can't submit builds to cbs.centos.org at the moment due to network issues in the Westford offices.

Thank you Kaleb, I've built the package reducing the number of jobs as you suggested for now.

Just to let you know that it's still my plan to chase after some aarch64 machines so that I can bump the specs of the cbs aarch64 builders, but if you already have a local workaround, that's also good. Let's keep this one open to track status on both sides.

Issue status updated to: Closed (was: Open)
Issue close_status updated to: Fixed

3 years ago

Metadata
Boards 2
CBS Status: Done