#92 `taskotron.result.new` fedmsg does not contain important information expected by Greenwave. Change it? Have a new one? Do something else?
Closed: Fixed 6 years ago Opened 6 years ago by adamwill.

While poking into greenwave decision fedmsgs, we - @dcallagh , @ralph , @tflink and I - have found enough problems with use of the taskotron.result.new fedmsgs for Greenwave that we think it probably makes sense to create a new message, something like resultsdb.result.new, with a different approach to when messages should be sent and what their content should be.

For the other end of this problem, see https://pagure.io/greenwave/issue/122 . The way greenwave decision fedmsgs work is, greenwave looks for a task dict in the fedmsg, and has a list of iterables of key names that may appear in that task dict, called ANNOUNCEMENT_SUBJECT_KEYS. For each iterable of key names, if all those keys appear in the task dict, greenwave will use those keys as the subject for a decision query, and emit a fedmsg with the result of the decision (if the decision changed from the previous time this whole thing happened).

I've spent the last couple of days looking into how this works in practice in a couple of cases, and the answer is 'not well', and there are some interesting associated issues.

The way this task dict is actually constructed during resultsdb fedmsg emission is fairly simple: resultsdb takes the testcase name and uses it as the name key in the dict, then it adds the item and type from the result to that dict if they exist. And that's it: that's all you ever get.

Greenwave, however, is kinda expecting to get other things, as we can see from its ANNOUNCEMENT_SUBJECT_KEYS dict:

ANNOUNCEMENT_SUBJECT_KEYS = [
    ('item', 'type',),
    ('original_spec_nvr',),
    ('productmd.compose.id',),
]

the 'intent' behind those original_spec_nvr and productmd.compose.id entries is clear if you know what kinds of results include them. Results from the CI pipleline include the original_spec_nvr key; it is a consistent identifier for each of the results from a single execution of the pipeline, for each of the various tests that run in it. So someone is clearly expecting greenwave to emit decision fedmsgs for each 'thing' (the original_spec_nvr is basically the 'thing', AIUI) that goes through the pipeline.

Compose test results from openQA and Autocloud include the productmd.compose.id; it identifies the compose the test relates to. (Notably the item for openQA tests is not always the compose ID; sometimes it is the filename of the image tested. So productmd.compose.id is important because it's the key that should always be present in any result for a given compose). So clearly the intent here is for greenwave to emit a decision fedmsg for each compose that is tested.

As we can hopefully see quite easily, though, neither of these paths actually works, because the task dict in the fedmsg will never contain an original_spec_nvr or productmd.compose.id key. The result extradata does, but the fedmsg task dict does not. In practice, the only one of greenwave's ANNOUNCEMENT_SUBJECT_KEYS that is ever used is the first one, ('item', 'type'); this causes it to emit fedmsgs with correct decisions for updates (since using item and type as the decision subject does in fact find the correct results for update decisions), and incorrect decisions for composes (since compose result fedmsgs do have item and type in their task dicts, but the item is not the same for all compose tests, so whatever the item is in the message that triggers the decision, some of the expected results won't be found as they'd have a different item).

It will never emit any decision fedmsgs for builds tested in the CI pipeline, as the fedmsgs for compose CI results will never have item and type in their task dicts, because the ResultsDB results do not have those keys in their data - see a sample ResultsDB result from the pipeline.

An interesting related issue here is that ResultsDB almost never emits fedmsgs for CI pipeline results. @dcallagh figured out why. It's because of the previous message detection. ResultsDB tries to figure out if it's already emitted a message for 'the same' result, and not emit a message again if so; @ralph says this was basically because of the old rpmdepcheck test which would run over and over and over and over on the same packages. We didn't want to emit a new fedmsg every time it ran again on the same package and got the same result. But because that detection relies on the item, type and arch keys, none of which are present in CI pipeline results, the effect is that we will only ever emit a fedmsg for the very first CI pipeline result for a given testcase name and outcome; all CI pipeline results with the same testcase name and outcome are considered to be duplicates by this function.

All of us caused Dan, Ralph, Tim and I to have a bit of a chat about this whole system. Considering the 'expected data not in the message' problem, we thought that just stuffing more keys into the task dict of these messages is probably not the way to go. Other options include just putting more information outside the task dict but still in the message (maybe the whole extradata dict, whatever). Both @ralph and I kinda do a mental substitution from taskotron.result.new to resultsdb.result.new when thinking about these messages, and we're fine with that, but the conceptual problems bug @dcallagh more, and he has a point. This really is a fairly 'Taskotron-y' message, in the old-school sense: the task concept itself seems kinda specific to Taskotron concepts. And the 'previous result detection' can also be seen as a kinda Taskotron-y idea, if not in concept then at least in execution.

So we came up with the tentative idea that we could keep the existing taskotron.result.new messages unchanged for compatibility with anything currently using them, but also emit a new message, say call it resultsdb.result.new, which would be a 'dumber' message; perhaps one that just contains the entire result (I guess an alternative would be to include nothing but the result ID, and expect consumers to query resultsdb for the rest of the data?) In any case, it wouldn't include this task concept.


tl;dr; just do whatever is necessary/right/useful for you, guys. The Fedmsg emitter is something @ralph wrote in the first place, so I'm not the one to hold you back.

AFAIK (but ask @tflink ), we are not actually reacting to the taskotron.result.new messages anywhere in the Taskotron stack (yet).
The only comment I can offer is this - I'm not sure that stuffing the whole result into the fedmessage is the "right" solution - seems like duplicating all the data from resultsdb in datagrepper/message-hub is kind of odd.
The current messages absolutely are taskotron-y (hence the taskotron prefix in the message topic) by design, in what they do, and I believe that adding another "less taskotrony" message is a good idea.
Maybe more 'consumer-targeted' messages (aka resultsdb.greenwave..., resultsdb.taskotron...) are a better solution than just pure "hey, we have news, check it out here!" or "hey, this is all the data we have, right here!". I kind of think that since we have control over the messages, they should be useful for the consumer, and since resultsdb is generalist by design I don't see why emitting multiple "targeted" messages would be a bad thing. And we can always have something like resultsdb.announce with just "hey, look at this url, to get the data" semantics in there too (not that I think it is necessary, at least now).

But once again, I'm not the messaging/greenwave expert here, just sharing my $0.02

@dcallagh says he's working on a PR for this. AIUI it'll basically just emit a new message with all the extradata included, so at least consumers have all the data to work on, even if they still have to have special knowledge of what format it's in for now.

@adamwill - I merged a PR that does precisely that couple of days ago (on Saturday, I think). IIRC @ralph submitted it.

Metadata Update from @kparal:
- Issue assigned to dcallagh

6 years ago

Zoiks, too many moving things here :)

So, that commit basically just stuffed all of extradata in the task dict. It didn't invent a new message topic, though. @dcallagh , WDYT of that?

Yeah, my suggestion would be to revert that, and to refactor the messaging stuff so that you have the taskotron topic with the same behaviour as it is today (with the questionable get_prev_result stuff). This would be optional and we would not turn it on on the Red Hat internal deployment. And then a separate message producer which produces a generic resultsdb message on a generic resultsdb topic, with all the extra data and no special message suppression logic.

I am still trying to finish a patch for that, but I keep falling down rabbit holes :-) so hopefully tomorrow...

@dcallagh hehe, I feel your pain :) But thanks for keeping an eye on this, and working on the solution!

As I said earlier, I'm firm believer in having multiple messages/configurable messaging, based on the way the messages are to be consumed, and the current state really was pretty taskotron-y (since it was meant for that stuff from the beginning).

Looking forward to seeing the PR, and thanks again!

So far as messages go, there's some work going on within RH on a standardized spec for CI system messages. I'm collaborating on that, and already have openQA staging instance emitting messages in that format. So, we should definitely consider that...

@adamwill those messages will be produced by the various testing tools though, right? Indicating the state of their test jobs/runs and so forth. So they serve a slightly different purpose than the ones produced by Resultsdb to indicate a new result has appeared.

So I think it makes sense to do this improvement in Resultsdb, in addition to the messages from the various testing tools.

Yes...but the wrinkle is that resultsdb is sort of acting as the thing emitting the 'test is complete now!' message for the testing tool called Taskotron in this case, right?

If some other bit of taskotron actually emits 'test complete!' messages, then yeah, this would be something different.

Oh I see. You are suggesting, instead of continuing to have Resultsdb emit these messages on Taskotron's behalf (which I was proposing), we have Taskotron adopt the new message format for announcing the status of its tasks, and emit those itself. And have Resultsdb just emit the generic "a new result has appeared" message. Is that right?

I was more sort of pointing out the issue, I guess: that we need to be clear about when/whether ResultsDB is acting in its capacity as "a generic results store", and when it's acting in its capacity as "a part of the Taskotron test system". If it's emitting a message in the latter capacity, arguably, we should adopt the common format for those messages. If it's emitting a message in the former capacity - i.e. it's basically saying "Hi, I'm a result storage system, and I just got a new result!" - then that probably doesn't fall under the common format thing.

And of course, if resultsdb isn't going to do the job of emitting messages on behalf of "the test system called Taskotron", something should...

Yeah it's a fair point. I don't particularly want to tackle changing the Taskotron message format though, since that would be a backwards incompatible change... I really just want some Resultsdb messages with all extra data in them :-)

Login to comment on this ticket.

Metadata