#1592 Redefinition of what constitutes a secondary/alternate architecture in Fedora
Closed None Opened 3 years ago by pbrobinson.

With Fedora 20 we promoted our first new architecture ever, and in Fedora 21 we entirely redefined what was delivered with the introduction of the Workstation/Server/Cloud Editions. With FESCo's decision to make all of 32 bit x86 non release blocking, it has effectively become a secondary architecture without any real defined process for how to do so. Yet there are still a lot of reasons to keep i686 around as an almost first class citizen (it is currently at least 20% of the installed userbase). For the last few releases there have also been a lot of queries and requests about promoting components of AArch64 such as Server and Docker. For those two there's currently no cut and dried means of doing this because, unlike i686, the builds aren't in the main koji instance to easily consume for the Server edition or a Cloud component such as Docker. As the Fedora ARM lead there's no way I'd currently recommend an AArch64 Workstation edition, but in the context of server/docker an AArch64 promotion makes a lot of sense. Similarly, it has made a lot of sense for a couple of cycles to be able to demote some parts of i686 while still keeping it around and consuming other parts as "multilib" to support some applications in the x86_64 world.

There's currently no easy means of promoting/demoting a "Fedora artifact deliverable" between primary and secondary within an architecture, because our current definition of what constitutes a primary/secondary architecture revolves around the koji build system. While that might have been suitable in the "Everything as one" world before the modern Fedora Editions, it now means we need to redefine "primary" and "secondary" architectures as artifact deliverables rather than koji instances, and define how that promotion/demotion works.

It's clear we need to redefine what constitutes a secondary/alternate architecture and how we deal with architectures and artifacts of them as a project. So how best do we redefine alternate architectures and make it easier to promote artefacts?

The proposal here is to remove koji as the definition of what constitutes a primary and secondary release.

With the changes to i686 for all intents and purposes we basically have already done this. It has already been the case for some time that any major toolchain issue with a non x86 "secondary" release has impacted primary [1] so "primary" is not isolated from issues of other alternate architectures like some believe. This would eventually result in all architectures running in the same instance of koji (like the Red Hat internal koji instance "brew") and the distinction of what makes a "primary" or "secondary" / "alternate" release of Fedora decided at compose time and the release artefacts being delivered depending on their status.
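As a rough illustration of deciding status at compose time rather than by koji instance, the sketch below attaches "primary"/"alternate" to an (edition, architecture) deliverable. Everything in it is hypothetical: the `Deliverable` type, the `RELEASE_BLOCKING` set and its contents are assumptions made for the example, not an agreed Fedora policy or an existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deliverable:
    """One release artifact: an edition built for one architecture."""
    edition: str  # e.g. "Server", "Workstation"
    arch: str     # e.g. "x86_64", "aarch64"


# ASSUMPTION: illustrative contents only. The point is that status hangs
# off the deliverable, not off the koji instance that produced the builds.
RELEASE_BLOCKING = {
    Deliverable("Workstation", "x86_64"),
    Deliverable("Server", "x86_64"),
}


def status(deliverable: Deliverable) -> str:
    """Return "primary" for release-blocking deliverables, else "alternate"."""
    return "primary" if deliverable in RELEASE_BLOCKING else "alternate"
```

With a model like this, promoting something such as an AArch64 Server image would mean adding one entry to the set, with no change to where the packages are built.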

The proposal would be to initially import the AArch64 builds in the Fedora 26 cycle and complete the transition with Power64 and s390x in the Fedora 27 cycle. In both cases this would require a mass rebuild of all packages to be guaranteed for the cycle. This has already been requested by the toolchain team for the Fedora 26 cycle.

'''Questions and Answers:'''

'''Q: Will my builds be slower?'''

A: No. The PowerPC builders, which were previously the slowest builders in the primary koji instance (used for EPEL arch builds and any noarch), have been replaced with POWER8 hardware, resolving the PPC/noarch issues, and we have new 64 bit ARM hardware soon to go into production to provide new ARMv7 virtual builders of much higher spec than is currently in production. It's likely that overall, in the near future, your builds will actually be faster.

'''Q: When will the new ARMv7 builders be in place?'''

A: Soon! The current plan is mid to late July. This proposal isn't impacted by this as ARMv7 is already a primary architecture.

'''Q: Will a single arch failure affect the overall build failure?'''

A: Yes. An architecture failure will always cause a build failure and be dealt with, as it is now. In the case of x86_64/i686/ARMv7 it's instant; in the case of AArch64/Power64/s390x it's currently slightly delayed. Any toolchain or other issue in the current secondary architecture set already affects the primary builds, as was seen with a toolchain issue in the F-21/22 cycle [1]. Also, most of the issues discovered during the gcc6 mass rebuild on s390(x) were general compiler issues possibly affecting all arches.

The fact is though that the actual packages that are ever affected by arch specific build issues (and I'm taking this from all non x86 arches) are a small subset of the 18K packages we ship.

Of those that are ever affected, the vast majority are maintained by RH people that are paid to care about them across all arches, and they deal with the issues already (gcc toolchain stack, glibc, python, golang, etc.). The vast majority of those teams want it all to happen as a single build they can deal with.

We have an enhancement for koji planned (it should be relatively minor) where all arch builds will run to completion (whether pass or fail) rather than cancelling all the rest when one fails, to enable quicker debugging of which arches have issues (one, all, etc.) and comparison. The reason we can't let a single arch fail and still tag the build is the question of how to deal with any builds that depend on that package: koji doesn't have the means to deal with that. E.g. there's a new major version of library X with a soname bump, and an arch fails. How do we deal with the soname bump across all the arches? If a dependent package then tries to build, it'll either get disparate sonames depending on the arch, or a missing library on the arch that failed. Either way it currently ends in a big mess. How do you suggest we deal with that? It's actually less problematic to fail them all and ensure there are arch people there to help out. It doesn't currently cause us a big issue with the 3 arches already in place (or even taking into account secondary arches), and for 95%+ of the packages I doubt it'll ever cause any issue.
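The soname problem above can be sketched with a toy model. This is not koji code: `sonames_after_build`, its parameters and the library names are all hypothetical, and the model assumes a failed arch simply keeps the old library (the "disparate sonames" case rather than the "missing library" case).

```python
def sonames_after_build(old_soname, new_soname, results, tag_partial):
    """Return the soname each arch's buildroot would see after one build.

    results maps arch -> True (built) or False (failed).
    tag_partial=True models tagging the build even though an arch failed;
    tag_partial=False models failing the whole build if any arch fails.
    """
    if not all(results.values()) and not tag_partial:
        # Whole build fails: nothing is tagged, every arch keeps the old library.
        return {arch: old_soname for arch in results}
    # Otherwise each arch that built gets the new soname, failures keep the old one.
    return {arch: (new_soname if ok else old_soname) for arch, ok in results.items()}


def is_consistent(sonames):
    """True when every arch would link dependents against the same soname."""
    return len(set(sonames.values())) == 1


# Hypothetical example: library X bumps its soname and s390x fails to build.
results = {"x86_64": True, "aarch64": True, "s390x": False}
partial = sonames_after_build("libX.so.1", "libX.so.2", results, tag_partial=True)
fail_all = sonames_after_build("libX.so.1", "libX.so.2", results, tag_partial=False)
```

In this model `partial` leaves s390x on `libX.so.1` while the other arches move to `libX.so.2`, so a dependent package would link differently per arch, whereas `fail_all` keeps every arch on the same soname until the failure is fixed.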

Basically from experience with all the architectures, all packages fall into one of 5 basic categories:

1) no issues ever - probably 97+% of the 18K source packages

2) toolchains and core bits (gcc/glibc/python/golang etc) - Red Hat already pays people to care about all these core toolchains across all architectures no matter where the package resides. Currently they have to build, test, etc. in 4 different koji instances. This change will be a reduction in work for these people and enable the same process internally/externally.

3) kernel/grub2/shim - we already deal with this in a similar manner now as we would with it merged. The Fedora kernel maintainers only actively care about x86_64, i686 is a token effort with the maintainers actively seeking others to assume maintenance, ARMv7/AArch64/Power64/s390x are actively maintained by others.

4) arch specific packages for HW enablement, our toolchain only supports some arches - already Exclude/Exclusive arch rpm options are configured in the vast majority of cases. There might be a few that need minor adjustments, but this is already in process as part of the secondary architectures teams.

5) a handful of packages that occasionally cause issues (the only ones that actually come to mind ATM are firefox/xulrunner/thunderbird and friends, possibly libreoffice). This is where we need to put process in place, and we'll have more resources to assist with these issues because those people will no longer be doing mindless "shadow" work.
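Category 4 above relies on the `ExcludeArch` and `ExclusiveArch` spec file tags. As a minimal sketch of how those two tags narrow the set of architectures a package is built for, the function below models the selection in plain Python. The function name and the architecture list are assumptions for illustration; the real decision is made by rpm and koji, not by code like this.

```python
def arches_to_build(target_arches, exclude_arch=(), exclusive_arch=()):
    """Model which target arches a package is built for.

    exclusive_arch: if non-empty, only these arches are built.
    exclude_arch:   these arches are never built.
    """
    selected = []
    for arch in target_arches:
        if exclusive_arch and arch not in exclusive_arch:
            continue  # ExclusiveArch is set and does not list this arch
        if arch in exclude_arch:
            continue  # ExcludeArch lists this arch
        selected.append(arch)
    return selected


# ASSUMPTION: an illustrative target list, not the actual koji configuration.
TARGETS = ["x86_64", "i686", "armv7hl", "aarch64", "ppc64le", "s390x"]
```

For example, a hardware enablement package that only makes sense on x86 would pass `exclusive_arch=["x86_64", "i686"]`, and a package whose toolchain lacks s390x support would pass `exclude_arch=["s390x"]`.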

'''Q: How does this affect the kernel?'''

A: It doesn't. All architectures are already supported in the current kernel src.rpm. Even on primary architectures, the kernels for i686 and ARMv7 aren't actively maintained by the kernel leads. The AArch64/Power64/s390x kernel issues are already dealt with by other teams as well.

'''Q: Koji build failure debugs?'''

A: We're working on an enhancement to koji [2] to enable all builds to run to completion if a single architecture fails. The plan is for all sub tasks to run to completion no matter if one sub task fails. This will enable a maintainer to see if a build failure is due to the architecture or is a general failure across all architectures to aid quicker debugging and turn around. The primary task will still fail on any sub task failures.
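The intended behaviour can be sketched as follows. This is only a model of the idea and not koji's implementation or API: `run_build`, `classify` and the callables standing in for per-arch sub tasks are all hypothetical.

```python
def run_build(subtasks):
    """Run every per-arch sub task to completion and report all results.

    subtasks maps arch -> zero-argument callable returning True on success.
    Returns (parent_ok, results); the parent fails if any sub task failed.
    """
    results = {}
    for arch, task in subtasks.items():
        # Do not stop at the first failure: record every arch's outcome.
        try:
            results[arch] = bool(task())
        except Exception:
            results[arch] = False
    parent_ok = all(results.values())
    return parent_ok, results


def classify(results):
    """Summarise whether a failure is general or limited to some arches."""
    failed = sorted(arch for arch, ok in results.items() if not ok)
    if not failed:
        return "ok"
    if len(failed) == len(results):
        return "general failure"
    return "arch-specific failure: " + ", ".join(failed)


# Hypothetical build where only s390x fails.
parent_ok, results = run_build({
    "x86_64": lambda: True,
    "aarch64": lambda: True,
    "s390x": lambda: False,
})
```

Here `parent_ok` is false, matching the statement that the primary task still fails, but `results` holds an outcome for every arch, so the maintainer can tell at a glance that only s390x is affected.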

'''Q: What is the benefit of this?'''

A: There’s consistency of a single koji instance for all architectures, less confusion on how to deal with architectures for package maintainers and how to deal with failures, escalations and architecture issues. From core teams such as toolchain, release-engineering, and alternate architecture support there’s numerous time savings with a massive reduction of effort where they don’t need to build across multiple koji instances, multiple composes for release, koji-shadow, pushing multiple sets of updates, maintenance of four separate sets of infrastructure allowing people to focus more on core Fedora improvements rather than mirroring existing efforts.

'''Q: What is the cost/budget required for this?'''

A: There's not expected to be any material change in cost. Initially there may be a slight increase in storage usage on the primary koji instance due to the new architectures but this will be offset against the decommissioning of the arm/ppc/s390 koji instances where a lot of files are already duplicated (src/noarch rpms). There will be no need for new builders as the existing infrastructure will be reused and there's already enough capacity for the transition.

[1] https://fedoraproject.org/wiki/Fedora_21_22_Mass_Rebuild

[2] https://pagure.io/koji/issue/48


With my infrastructure hat on, I'm happy to assist/work on making this happen, it will be nice for infrastructure also to have everything in one koji, etc.

I think this should be drafted up as a f26 change so it has a standard place to point people with faq and tracker bugs and all that. I can assist with that if you like too.

The aarch64 mass rebuild would be for f26, but what will be the scheduling there? If we wait for all changes for f26 to be in so we can determine if a mass rebuild is needed it would make it be a few months into next year. So, perhaps we should do one shortly after f25 branches off in order to land aarch64 and then if we need another one in a few months, do that.

Anyhow, I am for this proposal. :)

It will have a lot of nice benefits, from using less disk on the mirrors and for koji, to freeing up resources to allow us to work on new deliverables and tooling improvements and being able to look at doing something like promoting aarch64 server to primary. I agree we will need to have an f26 change for it; ideally we will start implementing as soon as we branch f25. I think when the mass rebuild happens is a little less important as the shadow process makes things identical, but it is a big plus to have one to ensure that everything is built in primary infra.

I am very willing to assist with the work to make this happen, and am a big +1 for it.

I'm fine with the planned koji changes. They make sense. A few comments otherwise though.

1) The proposal starts out by saying we're changing the definition of what primary/secondary means, but then only focuses on the technicality of koji. That still leaves open the entire mechanism and process that need to be determined for primary/secondary. I'd actually like to handle that in an entirely separate ticket and focus on the koji changes with this ticket if that is acceptable to everyone. The koji changes are worth it and I don't want to tangle them up in a bigger issue.

2) The FAQ section is a good start, but it misses a few things. While implied in the longer text, the first and foremost question package maintainers are going to have is: "Why do I have to worry about s390x/powerpc/aarch64 when I didn't before?" That will be followed shortly by "I don't have access to hardware to debug issues on those architectures", "I can just use ExcludeArch if things fail, right?", and "This is forcing more work on people that don't care and won't get any benefit."

If we're going to make a change in koji that impacts everyone, we need to consider everyone's point of view. The koji changes have a clear and direct benefit to infrastructure, rel-eng, and the current secondary arch maintainers. The benefits to everyone else are less clear. Can we work to highlight those, and perhaps proactively answer some of the above questions?

3) To help with clarity on the new stance, if we move forward with the koji changes I would very much like to stop calling the other architectures "secondary". As Peter mentioned somewhere in the initial writeup, "alternative" architectures would be a better fit. It also has the benefit of completely removing architecture from the nomenclature of what we decide to ship/block on. Builds are builds, images/artifacts are what we haggle about otherwise.

Replying to [comment:2 kevin]:

With my infrastructure hat on, I'm happy to assist/work on making this happen, it will be nice for infrastructure also to have everything in one koji, etc.

Now that I have everything for non x86 in ansible, I expect this to be relatively straightforward. I have a few ideas regarding this and am looking forward to working with the team on it.

I think this should be drafted up as a f26 change so it has a standard place to point people with faq and tracker bugs and all that. I can assist with that if you like too.

Yes, that was the intention for each of the koji instance merges. This is for the overall discussion of the redefinition, I didn't see much point in doing the change before this.

The aarch64 mass rebuild would be for f26, but what will be the scheduling there? If we wait for all changes for f26 to be in so we can determine if a mass rebuild is needed it would make it be a few months into next year. So, perhaps we should do one shortly after f25 branches off in order to land aarch64 and then if we need another one in a few months, do that.

So the aarch64 packages are very close to mainline ATM. With no mass rebuild scheduled for F-25 this shouldn't change. The plan would be to import them in soon after branch (with Flock right away I suspect 2nd week of August) and the toolchain team already have indicated they need a mass rebuild in F-26 so a mass rebuild would happen once as part of that.

Replying to [comment:5 pbrobinson]:

So the aarch64 packages are very close to mainline ATM. With no mass rebuild scheduled for F-25 this shouldn't change. The plan would be to import them in soon after branch (with Flock right away I suspect 2nd week of August) and the toolchain team already have indicated they need a mass rebuild in F-26 so a mass rebuild would happen once as part of that.

I've explained myself poorly I fear, let me try again with more detail...

The current schedule has f25 branching from rawhide on 2016-07-26. Typically we wait until the change submission deadline or close to it before we schedule the mass rebuild for a cycle. This is so we can try and get everything in place so we can just do one mass rebuild and catch all the changes that need one. If the f26 cycle is like the f24 cycle, the mass rebuild would be sometime around 2017-02. If we need a mass rebuild in order to import aarch64, this is a long time to wait, so I am just asking if we should do two for the f26 cycle: one after f25 branching whenever the aarch64 bits are set up, and another one at the normal time for gcc and other changes that require it.

So the builds on arm.koji are very very close to primary. As of this morning we're 56 behind (FTBFS or similar but prev built) and 295 "missing" ie not built at all (some are arch specific x86, some like mono not supported/bootstrapped yet of which I hope to clean up some more before this), and 17803 the "same".
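As a quick check on those figures, the arithmetic can be worked through in a few lines. The numbers come from the comment above; the variable names are just for illustration.

```python
# Counts quoted above for arm.koji compared with primary.
behind = 56      # previously built, now FTBFS or similar
missing = 295    # never built on this arch
same = 17803     # identical to primary

total = behind + missing + same
pct_same = round(100 * same / total, 1)
pct_not_same = round(100 * (behind + missing) / total, 1)
```

That gives a total of 18154 packages, of which about 98.1% match primary and about 1.9% are behind or missing.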

We would import these builds and use them. Then do a single mass rebuild at the usual time at the end of the cycle before branching. This is what happened for ARMv7 [1] so I don't see a need for a second mass rebuild.

[1] https://fedoraproject.org/wiki/Fedora_20_Mass_Rebuild

1) The proposal starts out by saying we're changing the definition of what primary/secondary means, but then only focuses on the technicality of koji. That still leaves open the entire mechanism and process that need to be determined for primary/secondary. I'd actually like to handle that in an entirely separate ticket and focus on the koji changes with this ticket if that is acceptable to everyone. The koji changes are worth it and I don't want to tangle them up in a bigger issue.

Yes, the intention was to keep the two issues separate. I think also there will be different criteria for Server/Workstation/Cloud/Spins/Atomic promotion and I'm not sure how FESCo wishes to handle those. Maybe a general overall "promotion/demotion" process with specifics for each.

2) The FAQ section is a good start, but it misses a few things. While implied in the longer text, the first and foremost question package maintainers are going to have is: "Why do I have to worry about s390x/powerpc/aarch64 when I didn't before?" That will be followed shortly by "I don't have access to hardware to debug issues on those architectures", "I can just use ExcludeArch if things fail, right?", and "This is forcing more work on people that don't care and won't get any benefit."

I agree with your points and the longer text in that section was intended to try and address these issues. I'm very aware of them and would love assistance in clarification of FAQ. Maybe we need to move the core text to a wiki doc for easy editing/review single place for reference.

To answer the explicit questions above:

Packagers already have to deal with aarch64/Power64/s390. They get bug reports when they're FTBFS or there's issues. The only change here is that rather than having to deal with it post initial primary build they'll need to deal with it then and there. There will be the secondary arch teams available to assist as before, in fact they'll have more time to assist as they're not dealing with "tail chasing" that is koji-shadow and associated processes.

In the vast majority of cases this will be no more work and they'll never have an issue with these other architectures. I would say around 98% of packagers will barely notice. For a noarch package (insert %) they already deal with ppc64/ppc64le builders due to EPEL.

In the case of no access to hardware. In all of the aarch64/Power64/s390x there is means to get access to this HW to fix issues, or there's the secondary teams that can assist in the fixing process, or in the case where the package just doesn't work, no major need for it on that arch there is already the Exclude/Exclusive option, which in the vast majority of the packages that fit this description has already been actively put in place by the secondary arch teams.

If we're going to make a change in koji that impacts everyone, we need to consider everyone's point of view. The koji changes have a clear and direct benefit to infrastructure, rel-eng, and the current secondary arch maintainers. The benefits to everyone else are less clear. Can we work to highlight those, and perhaps proactively answer some of the above questions?

Yes, I thought I had covered the majority of them with the initial FAQ and the intention is to clarify any that further come up.

3) To help with clarity on the new stance, if we move forward with the koji changes I would very much like to stop calling the other architectures "secondary". As Peter mentioned somewhere in the initial writeup, "alternative" architectures would be a better fit. It also has the benefit of completely removing architecture from the nomenclature of what we decide to ship/block on. Builds are builds, images/artifacts are what we haggle about otherwise.

YES! You might have noticed in the Fedora 24 release announcement that they were referred to as "Alternate architectures" and while there was reference to "secondary" it was primarily there as a linking reference to the past.

In the vast majority of cases this will be no more work and they'll never have an issue with these other architectures. I would say around 98% of packagers will barely notice. For a noarch package (insert %) they already deal with ppc64/ppc64le builders due to EPEL.

Oops, forgot to update the noarch percentage!

There are 9733 noarch packages out of 18154, so over 50% of the source packages in the distro are pure noarch.
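The percentage behind that statement, worked out explicitly from the two figures given (the variable names are illustrative only):

```python
noarch = 9733
total_packages = 18154

pct_noarch = round(100 * noarch / total_packages, 1)
arch_specific = total_packages - noarch
```

So roughly 53.6% of source packages are noarch, leaving 8421 that are built per architecture.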

Replying to [comment:8 pbrobinson]:

1) The proposal starts out by saying we're changing the definition of what primary/secondary means, but then only focuses on the technicality of koji. That still leaves open the entire mechanism and process that need to be determined for primary/secondary. I'd actually like to handle that in an entirely separate ticket and focus on the koji changes with this ticket if that is acceptable to everyone. The koji changes are worth it and I don't want to tangle them up in a bigger issue.

Yes, the intention was to keep the two issues separate. I think also there will be different criteria for Server/Workstation/Cloud/Spins/Atomic promotion and I'm not sure how FESCo wishes to handle those. Maybe a general overall "promotion/demotion" process with specifics for each.

OK, good. Then I think the title of this ticket needs to be more narrowly scoped, because as it is right now it suggests we're tackling the broader issue and not the koji changes.

2) The FAQ section is a good start, but it misses a few things. While implied in the longer text, the first and foremost question package maintainers are going to have is: "Why do I have to worry about s390x/powerpc/aarch64 when I didn't before?" That will be followed shortly by "I don't have access to hardware to debug issues on those architectures", "I can just use ExcludeArch if things fail, right?", and "This is forcing more work on people that don't care and won't get any benefit."

I agree with your points and the longer text in that section was intended to try and address these issues. I'm very aware of them and would love assistance in clarification of FAQ. Maybe we need to move the core text to a wiki doc for easy editing/review single place for reference.

To answer the explicit questions above:

Packagers already have to deal with aarch64/Power64/s390. They get bug reports when they're FTBFS or there's issues. The only change here is that rather than having to deal with it post initial primary build they'll need to deal with it then and there. There will be the secondary arch teams available to assist as before, in fact they'll have more time to assist as they're not dealing with "tail chasing" that is koji-shadow and associated processes.

From their point of view, they don't care and they don't have to deal with anything. Bug reports are easily ignored, the things they care about march on and they don't have to do anything. Worst case, they add ExcludeArch and go back to the x86 bubble. This fundamentally changes that, and it will be a big deal to them. I do not think trying to hide this behind "you already have to.." will work at winning anyone over.

(I realize not everyone lives in the x86 bubble today, but the people that do will be the vocal ones.)

In the vast majority of cases this will be no more work and they'll never have an issue with these other architectures. I would say around 98% of packagers will barely notice. For a noarch package (insert %) they already deal with ppc64/ppc64le builders due to EPEL.

That's probably a better lead.

In the case of no access to hardware. In all of the aarch64/Power64/s390x there is means to get access to this HW to fix issues, or there's the secondary teams that can assist in the fixing process, or in the case where the package just doesn't work, no major need for it on that arch there is already the Exclude/Exclusive option, which in the vast majority of the packages that fit this description has already been actively put in place by the secondary arch teams.

Upfront information on machine access would be helpful.

If we're going to make a change in koji that impacts everyone, we need to consider everyone's point of view. The koji changes have a clear and direct benefit to infrastructure, rel-eng, and the current secondary arch maintainers. The benefits to everyone else are less clear. Can we work to highlight those, and perhaps proactively answer some of the above questions?

Yes, I thought I had covered the majority of them with the initial FAQ and the intention is to clarify any that further come up.

Speaking of coming up... this proposal needs to go to devel list. I am very much in support of it, but it still needs to be sent there and discussed before I would be willing to vote on it.

I'm +1 to this, thanks for all the work pbrobinson as well as any others involved.

I will be away for today's meeting.
I am +1 to this proposal.

I would ask FESCo to express their position for or against in the ticket, but I am explicitly against officially voting for this until the proposal has been sent to devel list. Not doing so for a fundamental change like this that impacts everyone would be a drastic overstepping of our bounds.

IMHO, it is entirely unacceptable to let toolchain bugs on obscure architectures (bugs that, in my experience, are much more frequent than the OP is claiming) hold our builds hostage (through the proposed "fail on one = fail on all" principle). It is already painful enough with ARM (e.g., this showstopper: https://bugzilla.redhat.com/show_bug.cgi?id=1342095 has been breaking builds of several Qt/KDE packages for months and is still not fixed; the only workaround that makes the affected packages build on ARM makes the output not Fedora-compliant, since it is not allowed to require NEON). I have seen even worse architecture-specific bugs and limitations (e.g. on the number of relocations) from targets such as ppc64 (the obscure "number of relocations" thing is a real ppc64 example) that this proposal would also make blocking for builds.

IMHO, only '''one''' architecture (probably x86_64) should block builds. A failure on any other architecture (including ARM) should affect only the failing architecture.

Removing the meeting keyword until this is posted to devel.

I think as part of this we should step back and demote 32 bit x86 to alternate status. Server asked that we drop all 32 bit x86 images. I suspect that it is more to do with not wanting to promote it than not wanting it, period. I personally think there is benefit to demoting it to alternate status at least for a period of time before removing it entirely. But given that nothing is release blocking on 32 bit x86, I think that makes it clearly alternate and not primary.

APPROVED

  • AGREED: FESCo approves the new alternative architectures plan (+7, ~ 0, -0) (sgallagh, 16:11:29)

Meeting Minutes from 2016-08-12 https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/EIIOK5OB5G637AIQ3T6IOJWB5WLSUQEN/

Is there anything remaining here to discuss?

Replying to [comment:21 pnemade]:

Is there anything remaining here to discuss?

Not from my PoV
