#3 Expose information about current jobs being handled.
Opened 7 years ago by ralph. Modified 3 years ago

For Fedora, we typically respond to a build.tag event in which case there is only one rpm to be signed.

For RH, we're going to respond to an errata tool state change event. Errata advisories are like bodhi updates, and they can contain multiple RPMs, so signing one might take a long time.

The fedmsg-hub process exposes some metrics about what it is doing to /var/run/fedmsg/monitoring-fedmsg-hub.socket. This is nice for tools like nagios and collectd to consume.

We should see whether we can extend the information exposed by that socket to cover what the robosignatory consumers are doing.

In particular:

  • The number of RPMs being signed at the moment. For Fedora, this will almost always be just 1. For other consumers, it may be more.
  • For Fedora, it might be interesting to expose the size of the rpms (in bytes). This way we can detect when texlive is being signed and is backing up the whole system.
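
A rough sketch of how a consumer could keep track of the two numbers listed above, so that they can be serialized alongside the existing monitoring data. The class name `SigningStats`, its methods, and the key names in the snapshot are all hypothetical; nothing here is existing robosignatory or fedmsg API.

```python
import json
import threading


class SigningStats:
    """Thread-safe record of the RPMs a consumer is signing right now."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = {}  # RPM name -> size in bytes

    def start(self, rpm_name, size_bytes):
        """Call just before an RPM is handed to the signer."""
        with self._lock:
            self._in_flight[rpm_name] = size_bytes

    def finish(self, rpm_name):
        """Call once the RPM is signed (or signing has failed)."""
        with self._lock:
            self._in_flight.pop(rpm_name, None)

    def snapshot(self):
        """Return a JSON-serializable summary of the current work."""
        with self._lock:
            sizes = list(self._in_flight.values())
        return {
            "rpms_in_flight": len(sizes),
            "bytes_in_flight": sum(sizes),
            "largest_rpm_bytes": max(sizes, default=0),
        }


if __name__ == "__main__":
    stats = SigningStats()
    stats.start("example-1.0-1.noarch.rpm", 2048)  # illustrative values
    print(json.dumps(stats.snapshot()))
```

A large `largest_rpm_bytes` value would be the hint that something the size of texlive is currently holding up the queue.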

Hmm, this is related to what we're hitting currently I think.

The Fedora CoreOS pipeline currently sends signing requests to robosignatory and just dumbly waits 1h for it to reply (see https://github.com/coreos/coreos-assembler/pull/995). Today, we hit an even worse case due to the mass rebuild side tag getting merged.

Now, we could just make the timeout something like 24h, but what I'd really like is a way to know if robosignatory is running but just busy, or if it's dead entirely. In the latter case, I'd rather we error out faster so we can notify Fedora admins of the situation. Whereas in the former case, it's no big deal to just keep waiting.

Is this exposed anywhere? E.g. some /ping API we can use to tell if robosignatory is running? Or maybe at the AMQP level, e.g. some way to tell that messages are being processed and we're just in the queue?

@abompard Any thoughts about this? This would be really useful to us! It would also reduce the load on the infra team, since we wouldn't have to keep asking whether RoboSignatory is down or just working through thousands of requests. :)

Hey! Robosignatory itself does not know the size of its work queue, that's the job of the AMQP server. Fortunately we now have graphs to monitor it!

Will this give you the information you need? Those graphs were recreated just yesterday (the monitoring system was down due to the colo move) so there's not much history yet.

Oh that's neat!

Will this give you the information you need?

I think that helps humans trying to diagnose the issue, but what I'd really like is some programmatic way to get this information.

Hmm, I think what we really want here is the ability to query RabbitMQ itself to know what the state of the robosignatory queue is, e.g. how many messages are queued, when the last message was consumed, and where we are in the queue. See https://www.rabbitmq.com/management.html -- AFAIK, the RabbitMQ instance doesn't have this turned on, right?
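
If the management plugin were enabled and reachable, querying the queue could look roughly like the sketch below. It uses the management HTTP API's `/api/queues/<vhost>/<queue>` endpoint and its `messages`, `messages_unacknowledged` and `consumers` fields. The base URL, vhost, queue name and credentials are placeholders, and the helper names are made up for illustration. It does not try to work out where a particular message sits in the queue.

```python
import base64
import json
import urllib.parse
import urllib.request


def queue_url(base_url, vhost, queue):
    """Build the management API URL for one queue.

    vhost and queue are percent-encoded; the default vhost "/" becomes "%2F".
    """
    return "{}/api/queues/{}/{}".format(
        base_url.rstrip("/"),
        urllib.parse.quote(vhost, safe=""),
        urllib.parse.quote(queue, safe=""),
    )


def fetch_queue(base_url, vhost, queue, user, password, timeout=10):
    """GET the queue description as a dict, using HTTP basic auth."""
    request = urllib.request.Request(queue_url(base_url, vhost, queue))
    token = base64.b64encode("{}:{}".format(user, password).encode()).decode()
    request.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def summarize(queue_info):
    """Pick out the fields that answer 'busy or dead?'."""
    return {
        "queued": queue_info.get("messages", 0),
        "unacked": queue_info.get("messages_unacknowledged", 0),
        "consumers": queue_info.get("consumers", 0),
    }
```

With this, `consumers == 0` would suggest nothing is attached to the queue at all, while a non-zero `queued` with at least one consumer would suggest RoboSignatory is up and working through a backlog.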

I guess another, or supplementary, approach is a new robosignatory-status queue that is handled separately from the main thread, which pops messages off the robosignatory queue. The idea is we'd basically send a ping message, and RoboSignatory would reply with a pong along with information about what payload/message it was handling right now.
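
To make the ping/pong idea concrete, here is a minimal in-process sketch. Plain `queue.Queue` objects stand in for the AMQP queues, so this only illustrates the threading and the shape of the reply; the class name, the `reply`/`handling` keys and the message id are all hypothetical.

```python
import queue
import threading


class StatusResponder:
    """Answers pings from its own thread, independently of the signing thread."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current = None
        # Stand-in for the proposed robosignatory-status queue.
        self.requests = queue.Queue()

    def set_current(self, message_id):
        """Called by the signing thread when it starts or finishes a message."""
        with self._lock:
            self._current = message_id

    def serve(self):
        """Reply to each ping until a None sentinel is received."""
        while True:
            reply_to = self.requests.get()
            if reply_to is None:
                break
            with self._lock:
                handling = self._current
            reply_to.put({"reply": "pong", "handling": handling})


def ping(responder, timeout=5.0):
    """Send a ping and wait for the pong; raises queue.Empty on timeout."""
    reply_to = queue.Queue()
    responder.requests.put(reply_to)
    return reply_to.get(timeout=timeout)


if __name__ == "__main__":
    responder = StatusResponder()
    threading.Thread(target=responder.serve, daemon=True).start()
    responder.set_current("example-message-id")
    print(ping(responder))
    responder.requests.put(None)
```

Because the responder runs in its own thread, a pong still comes back while the main thread is stuck signing something large, which is exactly the "running but busy" case.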

That'd work too, because then we could just safely wait for our messages to be signed as long as we know that RoboSignatory is up and signing things. Though ideally we'd have both this and access to the RabbitMQ API so that we have better visibility of what's going on.
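
On the pipeline side, the waiting logic could then look something like this sketch: keep waiting while liveness checks succeed, and fail fast after a few consecutive misses. `is_signed` and `is_alive` are hypothetical callables supplied by the caller (for example, wrapping a ping or a RabbitMQ API query), and the default interval and miss count are arbitrary.

```python
import time


class SignerDown(Exception):
    """Raised when the signer fails several liveness checks in a row."""


def wait_for_signature(is_signed, is_alive, poll_seconds=60.0, max_missed=3,
                       sleep=time.sleep):
    """Block until is_signed() is true.

    Waits indefinitely while is_alive() keeps succeeding, and raises
    SignerDown after max_missed consecutive failed liveness checks.
    """
    missed = 0
    while not is_signed():
        if is_alive():
            missed = 0  # busy but healthy: keep waiting
        else:
            missed += 1
            if missed >= max_missed:
                raise SignerDown(
                    "no liveness response after {} checks".format(missed)
                )
        sleep(poll_seconds)
```

This replaces a single fixed timeout with two separate questions: "is my artifact signed yet?" and "is anything still processing the queue?".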
