#8037 robosignatory broken in prod?
Closed: Fixed 4 years ago by kevin. Opened 4 years ago by pingou.

I have a few builds which seem to be stuck where robosignatory should pick them up:

fedora-gather-easyfix-0.1.1-29.fc31
  Finished: Fri, 26 Jul 2019 08:16:06 CEST
  Tags: f31-updates-candidate
fedora-gather-easyfix-0.1.1-28.fc31
  Finished: Fri, 26 Jul 2019 07:45:19 CEST
  Tags: f31-updates-candidate
fedora-gather-easyfix-0.1.1-30.fc31
  Finished: Fri, 26 Jul 2019 08:57:24 CEST
  Tags: f31-updates-candidate

Could someone see what's going on?


Metadata Update from @pingou:
- Issue priority set to: None (was: Needs Review)

4 years ago

A mass rebuild causes all packages to go through robosignatory. Things are just heavily queued up waiting to talk to the sigul server.

I just checked, and it is now working on nodejs-npm-license-0.3.3-7.fc31, with build date/time of: Fri, 26 Jul 2019 06:44:36 UTC.
So it's about two hours of builds out from your builds.
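
For anyone repeating this comparison: the build times in the report are in CEST while the signer position above is in UTC, so they need to be put on one clock first. A minimal sketch of that normalization, using only the timestamps quoted in this ticket (the helper names `to_utc` and `minutes_after` are made up for illustration, and CEST is assumed to be UTC+2):

```python
from datetime import datetime, timedelta, timezone

# Assumption: CEST (Central European Summer Time) is UTC+2.
CEST = timezone(timedelta(hours=2), "CEST")


def to_utc(stamp: str, tz: timezone) -> datetime:
    """Parse a stamp like 'Fri, 26 Jul 2019 08:16:06' in zone tz, return it in UTC."""
    # Drop the weekday prefix; the date itself is unambiguous without it.
    _, rest = stamp.split(", ", 1)
    local = datetime.strptime(rest, "%d %b %Y %H:%M:%S").replace(tzinfo=tz)
    return local.astimezone(timezone.utc)


def minutes_after(finished: datetime, position: datetime) -> float:
    """Minutes by which a build finished after the signer's current position.

    A negative value means the build finished before that position.
    """
    return (finished - position).total_seconds() / 60


# Build currently being signed (given in UTC in this ticket).
signer_position = to_utc("Fri, 26 Jul 2019 06:44:36", timezone.utc)

# Builds from the original report (given in CEST).
builds = {
    "fedora-gather-easyfix-0.1.1-28.fc31": to_utc("Fri, 26 Jul 2019 07:45:19", CEST),
    "fedora-gather-easyfix-0.1.1-29.fc31": to_utc("Fri, 26 Jul 2019 08:16:06", CEST),
    "fedora-gather-easyfix-0.1.1-30.fc31": to_utc("Fri, 26 Jul 2019 08:57:24", CEST),
}

for nvr, finished in builds.items():
    gap = minutes_after(finished, signer_position)
    print(f"{nvr}: finished {finished:%H:%M:%S} UTC, {gap:+.1f} min relative to signer")
```

Signing order need not follow build finish time, so treat the differences as a rough gauge only.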

So we may need to figure out how to change this, as it is effectively stopping rawhide entirely: packages are not signed or pushed to the buildroot.

Do we have an estimate (for previous mass-rebuild) of how long it'll take to recover?

Can we increase the speed of robosignatory? Increase the number of workers?

In the future, would it be doable to have a dedicated robosignatory instance for the mass-rebuild?

So we may need to figure out how to change this, as it is effectively stopping rawhide entirely: packages are not signed or pushed to the buildroot.
Do we have an estimate (for previous mass-rebuild) of how long it'll take to recover?

There are multiple things which make estimating hard.
1. There is the mass rebuild itself, which is a one-and-done attempt to rebuild all the packages in the tree.
2. There are developers seeing that their packages were FTBFS and submitting new builds.
3. There are developers who send large builds of modules through daily.
4. There are other groups doing builds for EPEL and F29/F30.

Mass rebuilds depend on the slowest architecture, times the number of packages, times the amount of other work going on at the same time. At this moment the slowest architecture seems to be PPC9. The number of packages is about the same as before. I believe a mass rebuild attempt took 3-4 days in the past. However, the workload from other sources has changed, and we have added other parts to the build system since the last rebuild, which will affect timing.

Past mass rebuilds also did not have the large amounts of module rebuilds that this one will have in a couple of days.

Can we increase the speed of robosignatory? Increase the number of workers?

Robosignatory has to wait for other parts of the system to say they are ready for things to be done. The autosigner looks to be signing one set of packages every 3-4 seconds. The sigul backend has to do its part. In front of robosignatory, koji has to digest what it is getting and talk to other things to confirm it can move on to the next item.
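
To put the 3-4 second rate in perspective, here is a back-of-the-envelope sketch of how long a backlog would take to drain if that rate held and nothing else was waiting. The queue size of 20,000 is a placeholder, not a figure from this ticket, and `drain_time_hours` is just a made-up helper name:

```python
def drain_time_hours(queued_builds: int, seconds_per_build: float) -> float:
    """Hours needed to sign a backlog at a constant per-build signing time."""
    return queued_builds * seconds_per_build / 3600


# Hypothetical backlog size; substitute the real queue depth.
QUEUED_BUILDS = 20_000

# Observed range from above: one set of packages every 3-4 seconds.
for seconds_per_build in (3, 4):
    hours = drain_time_hours(QUEUED_BUILDS, seconds_per_build)
    print(f"{QUEUED_BUILDS} builds at {seconds_per_build}s each: about {hours:.1f} hours")
```

This ignores new builds arriving while the queue drains and any time spent waiting on koji or sigul, so it is a lower bound at best.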

Adding more robots without actually clearing up the workflow is just going to lead to more delays or, worse, the collapse of parts of the system.

In the future, would it be doable to have a dedicated robosignatory instance for the mass-rebuild?

Again, this needs a mapped understanding of what all the parts are doing before and after robosignatory does its thing, and of where things are really waiting. [Of course, it could just be that someone who already has all this in their head says "nah, you just need to pull this plug out of the wall and the lake will drain much faster", but I do not know where that plug is, or whether pulling it might also flood an orphanage beneath it.]

Past mass rebuilds also did not have the large amounts of module rebuilds that this one will have in a couple of days.

Just to be fair, in case you are pointing to the Rust modules: those modules contain a large number of packages (from 20 to 200), but they are built within 5 seconds after buildroot population, and that is done on only one architecture (noarch). So if we have problems with this, we really need to look at why the build takes so much time / resources.

Note: robosignatory did indeed get backed up with the mass rebuild, but it worked fine and caught up last night.

All the buildsys.rpm.sign messages stopped when we switched over to fedora-messaging in koji. So there's some bug there.

I'd like to get that fixed, but work on a new signing system might be better done outside this ticket?

FYI, the mass rebuild tag (f31-rebuild) was tagged into f31-pending, so robosign is checking every single build to make sure it's signed and written out... and this is taking a very long time. ;(

I did what I could to speed it up, but it will be a while longer. ;(

Metadata Update from @kevin:
- Issue priority set to: Waiting on External

4 years ago

Everything should be fully caught up now.

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

4 years ago
