#10899 Enhancement: priority signing in Robosignatory
Closed: Fixed 10 months ago by abompard. Opened 2 years ago by abompard.

The FCOS project would like to avoid their builds being signed after other packages in the queue when a mass rebuild is going on.

After some discussion with Kevin and Dusty, we could leverage RabbitMQ's support for priority.
This will require some configuration and a migration of the queue. This ticket also aims to discuss how we want to use this new feature in the infra: should we attach a lower priority to mass rebuilds, a normal priority to regular builds and a higher priority to security updates?

Message priority is set by the message sender. Not setting the priority means that it will be lowest (zero).
Kevin suggests using Koji's priority value to set the priority of the message.
The number of possible priorities affects resource usage in RabbitMQ, and the RabbitMQ project recommends against having more than 10. I'm not sure how many priorities are possible in Koji, but we may need some sort of mapping.

Queues can only be set to support priority at creation time, so we'll need to destroy and recreate the robosig queue. To avoid losing messages, we can follow this process:
- create robosignatory-tmp queue
- bind robosignatory-tmp queue to the same topics as original
- unbind the original
- wait for the queue to empty
- connect the process to the robosignatory-tmp queue
- destroy the original queue and re-create it with x-max-priority = 5
- bind it as it used to be bound
- unbind the robosignatory-tmp queue
- wait for it to empty
- connect the process to the robosignatory queue
- destroy the robosignatory-tmp queue
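
For illustration, the broker-side steps above could be scripted roughly as follows. This is only a sketch, not the script that was eventually used. It assumes a pika-style channel object (`queue_declare`, `queue_bind`, `queue_unbind`, `queue_delete`), and the exchange name, the durable flag, and the use of the bare topic names as routing keys are all assumptions (real bindings may use prefixed or wildcard keys). Waiting for the queue to drain and reconnecting the consumer remain manual steps.

```python
TOPICS = [
    "pungi.compose.ostree",
    "coreos.build.request.artifacts-sign",
    "coreos.build.request.ostree-sign",
    "buildsys.tag",
]
EXCHANGE = "amq.topic"  # assumption: the exchange name is not given in this ticket


def divert_to_tmp(channel, queue="robosignatory", tmp="robosignatory-tmp"):
    """Create the temporary queue, bind it to the same topics, then unbind the original."""
    channel.queue_declare(queue=tmp, durable=True)  # durable is an assumption
    for topic in TOPICS:
        channel.queue_bind(queue=tmp, exchange=EXCHANGE, routing_key=topic)
    for topic in TOPICS:
        channel.queue_unbind(queue=queue, exchange=EXCHANGE, routing_key=topic)


def is_empty(channel, queue):
    """A passive declare reports the message count without modifying the queue."""
    frame = channel.queue_declare(queue=queue, passive=True)
    return frame.method.message_count == 0


def recreate_with_priority(channel, queue="robosignatory", max_priority=5):
    """Delete the drained queue, re-create it as a priority queue, and rebind it."""
    channel.queue_delete(queue=queue, if_empty=True)
    channel.queue_declare(
        queue=queue, durable=True, arguments={"x-max-priority": max_priority}
    )
    for topic in TOPICS:
        channel.queue_bind(queue=queue, exchange=EXCHANGE, routing_key=topic)
```

The second half of the process mirrors the first, with the roles of the two queues swapped, and ends with deleting robosignatory-tmp.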

Alternatively, Kevin thinks we could schedule an outage during which we turn off Koji. Currently Robosig is subscribed to the following topics:
- pungi.compose.ostree
- coreos.build.request.artifacts-sign
- coreos.build.request.ostree-sign
- buildsys.tag
We'll need to turn off all of those emitters.

Ideas?


Metadata Update from @phsmoura:
- Issue priority set to: Waiting on Assignee (was: Needs Review)

2 years ago

So, the koji priority is an integer, lower is more important. Users can only call koji build with --background (which sets priority to 5).
Scratch builds are pri 20, mass rebuilds run with --background (ie, 5).
However, I don't think this is going to work for us here, because all that info is in the task, not in the buildsys.tag message.
So, I don't think we can actually use it.

So, I guess we should perhaps just stick to the 4 types you listed?
make the coreos ones highest, ostree next and then buildsys.tag.

Related: could we implement this delay at the same time? https://github.com/fedora-infra/fedora-messaging/issues/254
(ie, a delay for tag build messages. I don't think it needs to be much... like 2 seconds?)

So, the koji priority is an integer, lower is more important. Users can only call koji build with --background (which sets priority to 5).
Scratch builds are pri 20, mass rebuilds run with --background (ie, 5).
However, I don't think this is going to work for us here, because all that info is in the task, not in the buildsys.tag message.

Could we add that information to the buildsys.tag message?

For clarity.. This proposal is just changing it so that higher priority messages get acted on first, correct? It does not actually change robosignatory operating a single lane (i.e. not signing more than one thing at a time)?

I can turn off the coreos* emitters when we're ready to try to implement this.

Yes, Robosignatory would still be a single lane, but RabbitMQ would reorder the queue server-side so that higher priority messages arrive first in robosig's consumer. There would be no change to robosig's code at all.
We will however need to change the code of the coreos, ostree and koji emitters so that they set the priority.
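
To make the reordering concrete, here is a toy in-memory model of a priority queue feeding a single consumer. It is not RabbitMQ itself and the class name is made up for this example. It only mirrors the behaviour described above: higher priority first, arrival order within a priority level, an unset priority treated as zero, and values above the queue's maximum capped at the maximum.

```python
import heapq
from itertools import count


class PriorityQueueModel:
    """Toy model: highest priority first, FIFO within the same priority."""

    def __init__(self, max_priority=5):
        self.max_priority = max_priority
        self._heap = []
        self._seq = count()  # arrival order, used as a tie-breaker

    def publish(self, body, priority=None):
        # An unset priority counts as 0; values above the maximum are capped.
        level = min(priority or 0, self.max_priority)
        heapq.heappush(self._heap, (-level, next(self._seq), body))

    def get(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


queue = PriorityQueueModel()
for n in range(3):
    queue.publish(f"mass-rebuild-{n}", priority=0)
queue.publish("coreos-ostree-sign", priority=4)
queue.publish("normal-build", priority=2)

# The single consumer still takes one message at a time, in this order.
order = [queue.get() for _ in range(len(queue))]
print(order)
```

The CoreOS request is handled first even though it was published after the mass rebuild messages, and the consumer code is the same as it would be with a plain FIFO queue.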

Mass rebuild happening now.. which is my periodic cue to check back in on this. Thoughts on next steps here? Anything I can do to help?

I think we should look at trying to do this after f38 release, but long before any mass rebuilding.

Metadata Update from @abompard:
- Issue assigned to abompard

a year ago

I'm going to start testing the queue switchover in the next few days. I've written a script that should help me with the queue create & rebind process.

@dustymabe , you can start sending messages with a priority (you don't need to wait for me, it'll be ignored if the destination queue does not support it yet).
To do that, just set the priority attribute of the Message object to an integer before sending it (it's not possible to do that in the constructor directly).

You'll need at least version 3.2.0 of fedora-messaging to have that attribute available.
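
A minimal sketch of what that could look like on the sender side. The helper name `build_sign_request` and the body contents are made up for this example, and the fallback class only exists so the snippet runs where fedora-messaging is not installed; it is not part of the library.

```python
try:
    from fedora_messaging.api import Message  # needs fedora-messaging >= 3.2.0
except ImportError:
    # Stand-in for illustration only, so the sketch runs without the library.
    class Message:
        def __init__(self, topic=None, body=None):
            self.topic = topic
            self.body = body or {}
            self.priority = 0


def build_sign_request(topic, body, priority):
    """Build a message and set its priority after construction."""
    msg = Message(topic=topic, body=body)
    # The priority is set as an attribute, not passed to the constructor.
    msg.priority = priority
    return msg


msg = build_sign_request(
    "coreos.build.request.ostree-sign",
    {"example": "placeholder body"},  # hypothetical payload
    priority=4,
)
# With the real library the message would then be sent with
# fedora_messaging.api.publish(msg).
```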

These are the priorities we'll be using:
- coreos.build.request.artifacts-sign: 3
- coreos.build.request.ostree-sign: 3
- pungi.compose.ostree: 2
- buildsys.tag on normal package builds: 1
- buildsys.tag on mass rebuilds: 0
(bigger number means higher priority)

Did I get that right?
I can help with setting the priority in your emitter, if needed.

Thanks.

I'd actually put normal builds higher... 2?

and... can we separate normal builds out some? For example, I know eln folks often rebuild a ton of packages and they land on the signing queue at the same time... so perhaps eln could be 0 and normal builds could be 2? Or can we separate out there any?

Sure, what do you want at priority 1 then? Do you want pungi at the same level as normal builds, or should we shift all above normal builds as well?

BTW the distinction between mass rebuilds, normal builds, and eln builds (and anything else that uses the buildsys.tag topic) will have to be done in the fedora-messaging plugin running in Koji.

Sure, what do you want at priority 1 then? Do you want pungi at the same level as normal builds, or should we shift all above normal builds as well?

Good question. I guess I'd put it above. I am not sure how easy it will be to tell tho...

BTW the distinction between mass rebuilds, normal builds, and eln builds (and anything else that uses the buildsys.tag topic) will have to be done in the fedora-messaging plugin running in Koji.

yeah. mass-rebuilds we can tell from the tag (FN-rebuild) as well as eln. But pungi/composes are just mixed in with normal stuff... although I guess they are all owned by a 'releng' or 'bodhi' user?

Anyhow, we could start this simple and improve it over time. The mass signing/rebuild is the big case in my mind, followed by eln... ie, we need coreos over those for sure.

OK, here's the revised plan then:
- coreos.build.request.artifacts-sign: 4
- coreos.build.request.ostree-sign: 4
- pungi.compose.ostree: 3
- buildsys.tag on normal package builds: 2
- buildsys.tag on eln package builds: 1
- buildsys.tag on mass rebuilds: 0

Does this seem correct?
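
As a sketch, the revised plan could be expressed as a lookup like the one below. The function name is hypothetical, and the tag patterns (an `fNN-rebuild` tag for mass rebuilds, a tag starting with `eln` for ELN) are assumptions based on the discussion above; the actual logic lives in the emitters.

```python
import re

TOPIC_PRIORITY = {
    "coreos.build.request.artifacts-sign": 4,
    "coreos.build.request.ostree-sign": 4,
    "pungi.compose.ostree": 3,
}

# Assumption: mass rebuild tags look like "f39-rebuild".
MASS_REBUILD_TAG = re.compile(r"^f\d+-rebuild$")


def message_priority(topic, tag=None):
    """Return the planned priority for a message (bigger means more urgent)."""
    if topic in TOPIC_PRIORITY:
        return TOPIC_PRIORITY[topic]
    if topic == "buildsys.tag":
        if tag and MASS_REBUILD_TAG.match(tag):
            return 0
        if tag and tag.startswith("eln"):  # assumption about ELN tag names
            return 1
        return 2
    return 0  # same as a message sent without a priority


print(message_priority("buildsys.tag", tag="f39-rebuild"))
```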

I have done testing on staging and it seems fine, I plan on making the switch to the priority queue in rabbitmq/robosig early next week. Then the emitters will need to be priority-enabled, I'll have a look at the koji one first.

I think so? It would be good if we could easily adjust or extend... :)

We can easily adjust, as the message producers set the priority. We can't however easily extend, as it requires the queue to be recreated. However I've written a script that should make it easier this time and the next time(s) we want to do it, so it's not entirely out of the question :-)

I'll set the max priority to 5 because I have a not-so-secret plan to set security builds to the highest priority so that they get processed as fast as possible.

ok. I'm not sure how you might identify security related builds, but ok. ;)

Happy to see movement on this! I threw up https://github.com/coreos/coreos-assembler/pull/3444 which should set the priority level on our signing messages to 4.

This PR on the Koji message schemas should have it set priorities for mass rebuild, ELN, and normal builds. When it's merged we'll still have to deploy it to the Koji hub and restart it.

And this is the PR on pungi for the pungi.compose.ostree messages.

Alright, the robosignatory queue has been priority-enabled in prod! For the record, the script I use is this one: https://gist.github.com/abompard/4675431a7d2cf236f39f95f0369c7273

Nice! thanks for working on this

Awesome! Should we keep this open until there's a mass rebuild going on and we can verify that it works as intended?

Let's keep it until we finish deploying the koji/pungi parts... then close. :)

If it doesn't work right at mass rebuild time, we should open a new ticket and fix it there.

all IMHO

That's fine with me. :) Was trying to figure out if there was an easy way to verify the actual prioritization bit worked, but I didn't come up with anything. (Relatedly, datagrepper doesn't currently show the priority of messages. That could be a nice addition if it's easy to do.)

The Koji schemas update is now deployed to prod, Koji should start sending messages with a priority set according to the list above.

Should this be working as intended now? Today, our pipeline hit a timeout on an ostree-sign request. This is the request:

https://apps.fedoraproject.org/datagrepper/v2/id?id=f06e68c3-0494-4084-88c6-a85de7041246&is_raw=true&size=extra-large

And this is the Robosignatory response:

https://apps.fedoraproject.org/datagrepper/v2/id?id=4c0ed23f-1f09-4115-b0e5-1f3587758d41&is_raw=true&size=extra-large

As you can see from the timestamps, it came more than one hour after the request. That is what we used to see when the queue was backed up, before any priority signing. Should we have expected a faster turnaround here with the new priority stuff, or is something still wrong? (Or was there something else up with the infra, e.g. was Robosignatory down for some time?)

Hmm, I guess we could test whether the priority stuff is working as intended ourselves by e.g. temporarily lowering the priority level for artifact-sign to 3, and then seeing if sending multiple artifact-sign requests followed by one ostree-sign request (still level 4) will result in the ostree-sign request being handled before at least some of the artifact-sign requests. Would be happy to get together too with someone to test this in a more systematic way.

I can look into that with you @jlebon , but I'm only going to be available next week. Would that be OK?

I'm offline Monday, but otherwise next week should be fine. Feel free to throw something on my calendar!

Met with @abompard today about this. In testing, we couldn't get the queue to exhibit priority ordering, but we think it's because of the consumer prefetching 25 messages at a time. We also dug a bit into the logs of the event mentioned in https://pagure.io/fedora-infrastructure/issue/10899#comment-856602, but the actual priority of messages isn't captured. @abompard will work on surfacing the priority into the message header so that it e.g. ends up in logs. That way the next time this happens, we'll have more data to work with. We'll also consider lowering the prefetch amount to decrease signing latency of higher priority items.
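
The suspected prefetch effect can be shown with a small simulation. This is a toy model under stated assumptions, not a measurement: the function name and parameters are made up, the broker orders only the messages it has not delivered yet, and the consumer processes its already-delivered messages in the order they were received.

```python
import heapq
from collections import deque
from itertools import count


def wait_for_urgent(prefetch, backlog=100, arrives_after=3):
    """Count the messages signed between the urgent message's arrival and its turn.

    `backlog` low-priority messages are queued, the consumer holds up to
    `prefetch` delivered-but-unprocessed messages, and one urgent message is
    published once `arrives_after` messages have been processed.
    """
    seq = count()
    # Heap entries are (-priority, arrival order, body).
    broker = [(0, next(seq), f"rebuild-{n}") for n in range(backlog)]
    heapq.heapify(broker)
    buffer = deque()
    processed = 0
    while True:
        # The broker tops up the consumer; delivered messages are not reordered.
        while broker and len(buffer) < prefetch:
            buffer.append(heapq.heappop(broker)[2])
        if processed == arrives_after:
            heapq.heappush(broker, (-4, next(seq), "urgent"))
        body = buffer.popleft()
        if body == "urgent":
            return processed - arrives_after
        processed += 1


waits = {prefetch: wait_for_urgent(prefetch) for prefetch in (1, 5, 25)}
print(waits)
```

In this model the urgent message always waits behind a full prefetch buffer, so a lower prefetch count directly shortens the delay for high priority items, at the cost of more round trips between the consumer and the broker.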

Awesome work. Thanks to both of you. :)

Hey CoreOS folks! There was a mass rebuild yesterday, did you notice an improvement in your package throughput?

hey @abompard - I notice the queue has been high for many hours now but our builds and artifacts are getting signed without much delay. Looks like we're happy with the work that you put in place.

Metadata Update from @abompard:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

10 months ago
