#8407 tracking issue: ppc64le network failures during f29AH composes
Closed: Fixed 12 months ago by kevin. Opened a year ago by dustymabe.

We are seeing network issues when pulling the ostree content during the image builds for f29AH (run during bodhi updates composes). This seems to be getting more consistent lately. I'm opening this bug to track the issue.

Here are current cases where it has happened:


cc @sinnykumari - do you think you can run a test image build on a ppc64le machine to see if you can reproduce the issue?

I did AH image builds on ppc64le from the Fedora-29-updates-testing-20190602.0 and Fedora-29-updates-20190601.0 configs. I didn't observe any issues; the builds succeeded.

Something is wrong on the ppc64le builders: it took ~10 hours for both the Fedora-29-updates-20190624.0 and Fedora-29-updates-testing-20190624.0 composes. Of that, ppc64le ISO creation alone took around 7 hours: https://koji.fedoraproject.org/koji/taskinfo?taskID=35780975

@kevin @sharkcz thoughts on why we are having these ppc64le-related failures and delays?

uff, I wonder what made it so slow ... Looks like the IO to the nested VM's disk was super slow.

ouch :(

Do you think you could investigate? We'd like to do a Fedora Atomic Host release this week and we'd like to ship the ppc64le artifacts as part of it.

I think a first thing to do would be to update/reboot the power9 hosts. They are Fedora 30 and now on an older kernel. I'll try and do that today.

Thanks.. after you do that can you re-run a koji task (for example 35785502 from Fedora-29-updates-20190624.0) and see if it completes successfully?

Resubmitting failed image compose tasks isn't a good idea: there's already a record of the task failing, so if it worked it could cause confusion, and the inputs it uses could change.

Can I just fire off another f29-updates-testing compose? or f29-updates?

yeah - that works

I suspect it's not only the composes, the last kernel builds are also taking way too many hours on ppc64le.

Compose Fedora-29-updates-20190625.0 finished within the expected time, so it looks like rebooting the builders helped. Let's observe for a few more days and see if the problem comes back.

So, the power8 boxes have 10,000 rpm SAS drives.
The power9 hosts have 7,200 rpm SATA drives.

I think seek times may be causing problems when there's a lot of builds going on. I can try reducing density of builders perhaps, or perhaps we can replace storage with ssds or the like.
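
As a rough illustration of the drive-speed gap mentioned above, here is a minimal sketch that computes average rotational latency (the time for half a platter revolution) from spindle speed. It covers only the rotational component; head seek time, interface and caching depend on the specific drive models, which this ticket does not name. The function name is made up for this example.

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency in milliseconds: half of one revolution."""
    seconds_per_revolution = 60.0 / rpm
    return 0.5 * seconds_per_revolution * 1000.0


if __name__ == "__main__":
    # Spindle speeds are the ones quoted in this ticket.
    for label, rpm in (("power8 SAS", 10_000), ("power9 SATA", 7_200)):
        print(f"{label}: {avg_rotational_latency_ms(rpm):.2f} ms")
```

This gives about 3.00 ms for the 10,000 rpm drives and about 4.17 ms for the 7,200 rpm drives, so each random access waits roughly 39% longer on rotation alone before any seek time is counted.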

Observed AH composes from the last two days from both F29 updates and updates-testing; all of them succeeded within the expected time.

Yeah, seems like the ppc64le issues are showing up again. Cloud images are failing and ISO creation is also taking longer: it took ~5 hrs for the ISO from Fedora-29-{updates,updates-testing}-20190701.0

I can try reducing density of builders perhaps, or perhaps we can replace storage with ssds or the like.

Did either of those two things happen? Or are we still experiencing slow IO on ppc64le builders across various tasks? One task Sinny brought up seems to suggest it's still happening (https://koji.fedoraproject.org/koji/taskinfo?taskID=36279258) but it's hard to gain visibility into what's really going on.

It is still an issue afaik.

@nirik any plan to move the ppc64le image builder to F30 soon? Image builds are failing frequently on ppc64le and I am not able to reproduce the failures locally (running F30, though) after several attempts. My local ppc64le VM also runs very slowly and takes a lot of time, but it doesn't fail with a timeout error during the ostree repo pull.

F29 AH image builds on ppc64le have been running successfully in composes since 20190724 for both updates and updates-testing.

From an IRC conversation with Kevin: we now have cache mode unsafe enabled on the ppc64le builder, which made things about 10x faster.
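
For readers unfamiliar with the setting: assuming the builders are libvirt/QEMU guests (the ticket only says "nested VM"), "cache mode unsafe" would correspond to `cache='unsafe'` on the disk's `<driver>` element. In that mode the host may cache all disk I/O and ignores flush requests from the guest, which is fast but can lose data if the host crashes. The sketch below sets that attribute on a disk definition; the sample XML, the image path and the helper name are hypothetical and not taken from the Fedora infrastructure.

```python
import xml.etree.ElementTree as ET

# Hypothetical disk definition for illustration; not the real builder config.
SAMPLE_DISK_XML = """
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/builder.qcow2'/>
  <target dev='vda' bus='virtio'/>
</disk>
"""


def set_disk_cache_mode(disk_xml: str, mode: str = "unsafe") -> str:
    """Return the disk XML with the driver's cache attribute set to `mode`."""
    disk = ET.fromstring(disk_xml.strip())
    driver = disk.find("driver")
    if driver is None:
        # No <driver> element yet: add one (assumes a QEMU-backed disk).
        driver = ET.SubElement(disk, "driver", {"name": "qemu"})
    driver.set("cache", mode)
    return ET.tostring(disk, encoding="unicode")


if __name__ == "__main__":
    print(set_disk_cache_mode(SAMPLE_DISK_XML))
```

The trade-off is usually acceptable for builders whose disks hold only scratch data that can be regenerated, and much less so for anything that must survive a host failure.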

Thanks Kevin and everyone for helping resolve this issue.

We can close this ticket for now!

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

12 months ago
