#12299 builds failing due to full disks
Closed: Fixed 16 days ago by zlopez. Opened 3 months ago by kkeithle.

Describe what you would like us to do:


Back on or around 14 Nov (2024) I had several builds fail on ppc64le and x86_64 because the disk was filling up.

Then the problem seemed to go away.

Now it's back again. (20 Nov 2024)

E.g. https://koji.fedoraproject.org/koji/taskinfo?taskID=126060500, both the ppc64le and x86_64 builds failed when disk space was exhausted.

Several builds, both scratch and regular, failed for me today prior to that one.

When do you need this to be done by? (YYYY/MM/DD)


Well, today would be nice. 2024/11/20 ;-)


This does not look like disk space.

The x86 build has:

/usr/bin/g++ -DARROW_HAVE_RUNTIME_AVX2 -DARROW_HAVE_RUNTIME_AVX512 -DARROW_HAVE_RUNTIME_BMI2 -DARROW_HAVE_RUNTIME_SSE4_2 -DARROW_WITH_TIMING_TESTS -DGTEST_LINKED_AS_SHARED_LIBRARY=1 -I/builddir/build/BUILD/libarrow-16.1.0-build/apache-arrow-16.1.0/cpp/redhat-linux-build/src -I/builddir/build/BUILD/libarrow-16.1.0-build/apache-arrow-16.1.0/cpp/src -I/builddir/build/BUILD/libarrow-16.1.0-build/apache-arrow-16.1.0/cpp/src/generated -isystem /builddir/build/BUILD/libarrow-16.1.0-build/apache-arrow-16.1.0/cpp/thirdparty/flatbuffers/include -Wno-noexcept-type -Wno-self-move -Wno-subobject-linkage -O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -m64 -march=x86-64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -mtls-dialect=gnu2 -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -fdiagnostics-color=always  -Wall -Wno-conversion -Wno-sign-conversion -Wunused-result -Wdate-time -fno-semantic-interposition  -O2 -g -DNDEBUG -ftree-vectorize -ggdb  -std=c++17 -fPIE -MD -MT src/arrow/CMakeFiles/arrow-array-test.dir/Unity/unity_0_cxx.cxx.o -MF src/arrow/CMakeFiles/arrow-array-test.dir/Unity/unity_0_cxx.cxx.o.d -o src/arrow/CMakeFiles/arrow-array-test.dir/Unity/unity_0_cxx.cxx.o -c /builddir/build/BUILD/libarrow-16.1.0-build/apache-arrow-16.1.0/cpp/redhat-linux-build/src/arrow/CMakeFiles/arrow-array-test.dir/Unity/unity_0_cxx.cxx
{standard input}: Assembler messages:
{standard input}:34772514: Warning: end of file not at end of a line; newline inserted
{standard input}:34773878: Error: unknown pseudo-op: `.'
g++: fatal error: Killed signal terminated program cc1plus
compilation terminated.

The ppc64le one hit OOM:

[Wed Nov 20 17:04:03 2024] Out of memory: Killed process 1380291 (cc1plus) total-vm:3929088kB, anon-rss:3346304kB, file-rss:26624kB, shmem-rss:0kB, UID:1000 pgtables:497kB oom_score_adj:0

This can be helped by using fewer compile threads or by limiting the build in various ways.
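For example, something like this in the spec caps build parallelism (a minimal sketch; the value 4 is an illustrative assumption, not a tested setting, and trades build time for lower peak memory):

# Sketch only: cap the number of parallel compile jobs so that several
# multi-GB cc1plus processes do not run at once. _smp_build_ncpus is the
# redhat-rpm-config macro behind %{?_smp_mflags}; the value here is an
# assumption that would need tuning for this package.
%global _smp_build_ncpus 4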

Metadata Update from @kevin:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: low-gain, low-trouble, ops

3 months ago

So, I didn't mean to dismiss this... I just don't think it's disk related, and I am not sure how to further debug it, but perhaps more information will help?

  • Do they always fail the same way, or does it vary?

  • Is there any pattern to which versions or arches fail, or is it pretty much everything?

  • Can you see the problem in a local mock build, or does that always work?

  • Have builds been stable until the last few days, or were there issues before, say, Wednesday?
    (We upgraded the builders to F41 on Wednesday, so that might be related if this started happening right after that.)

So, I didn't mean to dismiss this... I just don't think it's disk related, and I am not sure how to further debug it, but perhaps more information will help?

I was pretty sure we had established that it was an OOM kill, not a full disk. The symptoms are similar, and I don't have access (AFAIK) to the builder syslog to see what really happened.
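For what it's worth, on a machine where the kernel log is readable, OOM kills are easy to confirm; a minimal sketch, using only standard dmesg/journalctl and nothing Koji-specific:

# Sketch: look for compiler OOM kills in the kernel ring buffer / journal.
dmesg | grep -iE 'out of memory|oom'
journalctl -k -g 'Out of memory'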

  • Do they always fail the same way, or does it vary?

Ceph, and now libarrow, generally don't fail. So when they start failing with OOM kills (or full disks), I start asking questions. On at least one occasion that I recall, someone had changed the builder configs.

  • Is there any pattern to which versions or arches fail, or is it pretty much everything?

I don't really recall. Historically it's been mainly x86_64, I think.

  • Can you see the problem in a local mock build, or does that always work?

I generally have enough disk space and memory, so no, it never fails in a local mock build.
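One way to approximate builder memory pressure locally would be to run mock inside a memory-capped systemd scope; this is just a sketch with assumed values (the 8G ceiling, the F41 x86_64 chroot, and the SRPM name are placeholders):

# Sketch: rebuild under a transient scope with a hard memory ceiling so an
# over-parallel compile OOMs locally the way it does on the builders.
# Assumes the user slice has the memory controller delegated (the usual
# case on current Fedora); otherwise run the scope as root without --user.
systemd-run --user --scope -p MemoryMax=8G \
    mock -r fedora-41-x86_64 --rebuild libarrow-16.1.0-*.src.rpm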

  • Have builds been stable until the last few days, or were there issues before, say, Wednesday?
    (We upgraded the builders to F41 on Wednesday, so that might be related if this started happening right after that.)

Yes, builds have been reliable. Until they weren't. ;-)

I see the mass rebuild of libarrow was fine. Have you seen this issue recently?

I see the mass rebuild of libarrow was fine. Have you seen this issue recently?

nope

Closing as fixed. Feel free to reopen if the issue happens again.

Metadata Update from @zlopez:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

16 days ago

