#9624 s390x builders run out of memory or disk while building ceph
Closed: Fixed 2 years ago by kevin. Opened 3 years ago by kkeithle.

Describe what you would like us to do:


Last week during the mass rebuild for ELN, the s390x builds failed when the s390x builder ran out of memory and OOM-killed the compiler, or ran out of disk.

I cut the number of make -j jobs in half and was able to get one successful build (task 61243057), but since then the s390x builds are failing again.
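
(For reference, a minimal sketch of the kind of per-arch cap this involves in the spec file; the -j4 value is illustrative, not ceph's actual setting:)

```
# illustrative spec-file fragment: cap parallel make jobs on s390x only
%ifarch s390x
%global _smp_mflags -j4
%endif

%build
# %%make_build expands to "make %{_smp_mflags}", so the cap applies here
%make_build
```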

To make matters worse, the failing builds are now somehow being restarted, and the task runs forever until I cancel it.

E.g., tasks 61310643, 61329315, and 61349680 all had to be cancelled by me after the s390x build aborted.

When do you need this to be done by? (YYYY/MM/DD) 2021/02/05?



Thank you for the report. I am not sure there is anything I can do at this time to make this better, but we can use this as a 'report of problems causing issues for engineers'.

The VMs each have ~8 to 9 GB of RAM, and there is no extra RAM that can be added to them. They also have ~96 GB of disk space, with no extra that can be given to them. Some builders have only 25 GB of free space, while others have 60 GB free.

The servers are also currently overloaded due to maintenance at Westford, which will go on until March.

Metadata Update from @smooge:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: high-trouble, medium-gain, ops, s390x

3 years ago

Is there some way you could let us know what the largest disk footprint of a ceph build is?
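
(One rough way to answer that locally, sketched here assuming a standard mock chroot; the config name is illustrative:)

```
# sample the mock chroot's disk usage every 30 seconds during a local build;
# the largest value printed approximates the build's peak disk footprint
while sleep 30; do
    du -sh /var/lib/mock/fedora-rawhide-s390x/root
done
```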

Unfortunately we don't have more disk to throw at things; all we can play with is the cached repos/failed builds, and your builds came right after the mass rebuild and the mass ELN rebuild, so all the failed builds that happened on those builders were likely still cached. ;(

A lot depends on how much bundled software, e.g. boost, is built as part of the build, rather than using the system's version.

And I am trying to use as many system dependencies, including boost, as possible.
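
(Concretely, ceph's CMake exposes a toggle for boost; a sketch of the relevant %build invocation — the option name is taken from ceph's CMakeLists and may vary between releases:)

```
%build
# link against the distro boost rather than building the bundled copy
%cmake -DWITH_SYSTEM_BOOST=ON
%cmake_build
```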

On one VM where I occasionally build, I had to bump the disk up to 85 GB to avoid running out.

FWIW, I've learned that over on the other side of the house they have '-beefy' build machines, with rules in their build system for building big packages like ceph on the -beefy machines. Maybe Fedora could implement something like that?

So, our s390x builders have either 96 GB or 103 GB of disk.

The problems with another channel/set of builders here are (a rough sketch of how such routing would look follows the list):
- We would have to destroy one of our existing builders to have the disk to make a 'bigger' one.
- Fewer builders would slow everyone else down.
- If you, or CI, or ELN, or whatever does a ceph build and the 'heavy' builder is busy, your build will wait until it is free.
- If the heavy builders are still in the normal mix, you will have to wait until they are free for your build to even start.
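
(For reference, koji can do this kind of routing with a hub-side channel policy; a very rough sketch, with the channel name and policy tests written from memory and best treated as assumptions:)

```
# hypothetical /etc/koji-hub/hub.conf fragment routing big packages
# to a dedicated 'heavybuilder' channel (names and tests assumed)
[policy]
channel =
    method build && source *ceph* :: use heavybuilder
    all :: use default
```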

I suppose waiting might be better for you than failing?

There's talk of some new hardware appearing sometime... at that point we could look at trying to get more disk per builder.

Add cross-gcc to the list; it needs >90 GB of free disk ...

Sure, but that makes it worse... if we have one builder with ~200 GB of disk and tell koji to build ceph and cross-gcc and... they will then each have to wait for the builder if any of those other things are building. :(

I don't do ceph builds very often, so I really don't see that as an issue.

And if I know about it, I can check whether cross-gcc is building and wait to start the ceph build. IMO that'd be better than having builds fail randomly and having to retry the build multiple times until one magically works.

I agree that having to wait isn't a problem, cross-gcc builds are also only occasional.

I had an additional thought. We have one instance that is a varnish cache... it could have a smaller disk and I could move that space over to another one. Will try and see if I can do that without too much outage.

Yup, that could work too. cross-gcc was tight in the F-33 rebuild, so I guess it's just a few GB we're missing now (on the z/VM builder).

Something.... Anything...

I spent all weekend canceling failing builds and restarting without ever once getting lucky.

I'm now down to make -j1... on s390x to see if maybe that will work. It's very slow. :-(

Next will be excluding s390x, I guess. I don't want to, but as a short-term workaround I will if I have to.
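
(In spec terms that workaround is a one-liner, shown here only for completeness:)

```
# short-term workaround: drop the failing architecture entirely
ExcludeArch: s390x
```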

I think manually assigning the s390x build tasks to the z/VM builders with the slightly bigger disks could help. I will look into it in the morning.

I mean as a temporary workaround ...
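
(For reference, koji's CLI has an admin command for pinning a task to a specific builder; a sketch — the hostname is illustrative and admin rights are required:)

```
# hypothetical: assign a queued build task to a bigger z/VM builder
# (requires koji admin; hostname is illustrative)
koji assign-task 61349680 buildvm-s390x-05.s390.fedoraproject.org
```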

FYI, we have more resources added to our KVM LPAR now.

I just need to find time to rebalance the builders so they all have more space, etc.

As soon as we get the s390x builders moved over to the new z15 mainframe, I can rebalance them (at least the KVM ones).

So, the rebalance is done and I don't think this is happening anymore.

However, to double-check: @kkeithle, can you let us know whether you are still seeing out-of-disk failures on s390x builds?

If you see this happening again, please file a new issue or reopen this one. Thanks!

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 years ago
