#2 Investigate possible dropped build messages on fedbus
Closed: Fixed None Opened 10 years ago by tflink.

While zmq has no guarantees that messages will be delivered, that doesn't mean that it isn't reliable enough to use for our purposes.

Monitor build messages and compare them with actual builds from koji. Note any discrepancies.


Ralph Bean wrote a script to do this monitoring, dumping results to a file. It needs to be deployed somewhere where it can run for a while.

https://gist.github.com/ralphbean/7204706

Started Ralph's script on a machine in the Fedora infra; will post results after a while.

Here's a transcript from IRC with some information on false positives.

 tflink │ threebean: just saw a missing message from koji: * For 2013-11-01 16:17:22.601115 koji did not have https://koji.fedoraproject.org/koji/buildinfo?buildID=475205

threebean │ tflink: cool. did that message show up only once?
tflink │ yeah
threebean │ ok
threebean │ i'm 99% sure that's just a race condition in the script.
tflink │ does that mean that it was eventually sent?
threebean │ yeah, if it really was missing.. that message would show up over and over again for an hour.
tflink │ ah, OK. just figured I would mention it
threebean │ no, that's super helpful. thank you
threebean │ the logic in the script goes, "wake up every minute or five minutes or something... when I wake up, ask koji for the builds from the last hour. ask fedmsg for the builds from the last hour. compare."
threebean │ the thing is "the last hour" means different things to the two systems (by a matter of seconds)
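
The comparison described in the transcript can be sketched roughly as follows. This is not Ralph's script, only a minimal illustration of the idea that a build missing in a single interval is probably a window-skew artifact, while one that stays missing across intervals is a real problem. The names `find_missing`, `MissingTracker`, and the `dropped_after` threshold are hypothetical, and the build ID sets are assumed to have been fetched from koji and fedmsg already.

```python
from collections import Counter


def find_missing(koji_build_ids, fedmsg_build_ids):
    """Return the builds koji knows about that have no matching bus message."""
    return set(koji_build_ids) - set(fedmsg_build_ids)


class MissingTracker:
    """Count consecutive polling intervals in which each build was missing."""

    def __init__(self, dropped_after=3):
        # dropped_after is an assumed threshold, not a value from the real script.
        self.dropped_after = dropped_after
        self.streaks = Counter()

    def update(self, missing):
        # Forget builds whose message has arrived since the last interval.
        for build_id in list(self.streaks):
            if build_id not in missing:
                del self.streaks[build_id]
        # Extend the streak of every build that is still missing.
        for build_id in missing:
            self.streaks[build_id] += 1
        return {
            build_id: "dropped" if count >= self.dropped_after else "delayed"
            for build_id, count in self.streaks.items()
        }


tracker = MissingTracker(dropped_after=3)
# First interval: koji has 475205 but the bus side does not (yet).
print(tracker.update(find_missing({475205, 475206}, {475206})))
# Second interval: the message has shown up, so nothing is reported.
print(tracker.update(find_missing({475205, 475206}, {475205, 475206})))
```

A build reported once and then cleared matches the race-condition case discussed above; only a streak that keeps growing would point at a message that was never sent.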

I've been running Ralph's script for about a week now and am seeing the following:
- 61 instances of delayed messages
- 0 instances of dropped messages

In each case, the "delayed" message only showed up for one interval of the script (15 seconds), so I'm not very worried at this point. I'm still going to keep running the script, though.

So, I'm completely ignorant in this area. But I remember some of Ralph's blog posts about querying fedmsg history:
http://threebean.org/blog/querying-datagrepper-example/
https://apps.fedoraproject.org/datagrepper/

Can't we just listen on the bus, and query the history once an hour (or so) to find potentially dropped messages?
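
For illustration, an hourly history query of that kind might be built like this. It is only a sketch: it assumes datagrepper's `/raw` endpoint with the `delta`, `topic`, `rows_per_page` and `page` parameters, and it assumes koji build messages are published under the topic shown. The helper name `history_query_url` is made up for this example, and no request is actually sent.

```python
from urllib.parse import urlencode

# Assumed endpoint and topic; check them against the datagrepper docs before use.
DATAGREPPER_RAW = "https://apps.fedoraproject.org/datagrepper/raw"
BUILD_TOPIC = "org.fedoraproject.prod.buildsys.build.state.change"


def history_query_url(delta_seconds=3600, rows_per_page=100, page=1):
    """Build a URL asking for build messages from the last delta_seconds."""
    params = {
        "delta": delta_seconds,        # how far back to look, in seconds
        "topic": BUILD_TOPIC,          # only koji build state changes
        "rows_per_page": rows_per_page,
        "page": page,
    }
    return DATAGREPPER_RAW + "?" + urlencode(params)


print(history_query_url())
```

The messages returned for that window could then be compared against what the listener actually received during the same hour.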

The worry is that a message never makes it from koji to data{nommer,grepper} in the first place.

The script here is a more robust check that checks koji's own logs against datagrepper. If those two are out of sync, then we have a problem.

The good news is, it looks to be running just fine.

I'm not so worried about this anymore, closing the issue. I'd like to see us make better use of datagrepper to backfill any holes on startup but that's a whole new ticket :)
