#126 results/latest endpoint doesn't consider scenario
Closed: Fixed 5 years ago by jskladan. Opened 5 years ago by gnaponie.

We would like to change the results/latest API endpoint to consider 2 tests with different scenarios as 2 different tests...
For example, if you pass in an item and a type, this API already gives you several results with different testcases... it should also return the results with the same testcase but a different scenario.

Example:
curl -sL 'https://taskotron.fedoraproject.org/resultsdb_api/api/v2.0/results/latest?item=Fedora-Server-dvd-x86_64-Rawhide-20190213.n.0.iso&limit=1000' | jq '.data[] | select(.testcase.name == "compose.install_blivet_no_swap") | [.id, .data.scenario[]]'
[
27344845,
"fedora.universal.x86_64.uefi"
]

curl -sL 'https://taskotron.fedoraproject.org/resultsdb_api/api/v2.0/results?item=Fedora-Server-dvd-x86_64-Rawhide-20190213.n.0.iso&limit=1000' | jq '.data[] | select(.testcase.name == "compose.install_blivet_no_swap") | [.id, .data.scenario[]]'
[
27344845,
"fedora.universal.x86_64.uefi"
]
[
27344838,
"fedora.universal.x86_64.64bit"
]


While I don't have anything against it, you should know that scenario is (in the context of resultsdb) one of the non-required fields, stored in the ResultData "key-value store", so any solution you propose must be:
1) functional on any "key" (like arch or type found in most of the dist.* testcases; branch, repo or topic in https://taskotron.fedoraproject.org/resultsdb/results/27387207; and stuff like firmware, productmd.compose.short, and all the rest of the key-values stored for https://taskotron.fedoraproject.org/resultsdb/results/27385791) - i.e. it must be generic, not tailored to the one specific usecase of scenario
2) fast. The current implementation is pretty fast, partly because we can group-and-join by the testcase_name (see the sketch below for what /latest effectively computes). The /latest endpoint is the most frequently used one, and must be able to return data within milliseconds on the current dataset.
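
For illustration only, here is a minimal client-side sketch of what /latest effectively computes today - the real work happens in SQL, and the result shape here is simplified (scalar fields, lexically sortable submit_time strings):

def latest_per_testcase(results):
    # Keep only the newest result for each testcase name.
    # `results` is assumed to be a list of dicts shaped roughly like the
    # API output above -- a simplification of the real schema.
    latest = {}
    for result in results:
        name = result["testcase"]["name"]
        current = latest.get(name)
        if current is None or result["submit_time"] > current["submit_time"]:
            latest[name] = result
    return list(latest.values())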

The basic, and most important, idea behind resultsdb is that it does not really care about the "details", and leaves the "logic" to the consumers. The design principle is that resultsdb should not care/know about the specifics more than necessary, and should not be designed around any specific testcases, but rather provide reasonable functionality in general.
Thus something like hard-coding "just one other key to take into consideration for the /latest endpoint" would not be acceptable if that key is not shared/required by all the stored results.

@gnaponie I don't know the details, but I think this is required because Greenwave has to handle the scenario field. Honestly, it seems like an unnecessary complication.

I couldn't find the original issue, but the commit from @ralph [1][2] says:

One thing we ran into was that the requirements for a rawhide compose
can't be expressed only in terms of testcase names. They must be
specified in terms of both the testcase name and the scenario.

Why not use a more specific test case name instead? E.g. replace the rule:

!PassingTestCaseRule {test_case_name: compose.install_no_user, scenario: scenario1}

with:

!PassingTestCaseRule {test_case_name: compose.install_no_user.scenario1}

Though this would mean that the test result producers need to be changed.

CC: @adamwill

[1] https://pagure.io/greenwave/c/0bdaae679bb994ff15c6978e1ef8004e4895c5fe?branch=0bdaae679bb994ff15c6978e1ef8004e4895c5fe
[2] https://pagure.io/greenwave/pull-request/108

I honestly lack the proper understanding of what the scenario field really represents, but looking at the values of the scenario field, the testcase-name route seems... not ideal.

Also, the scenario field has a smell of implementation details bleeding out. From the limited number of scenarios I saw, it feels like it encompasses not only the "vital" differences - like the uefi vs bios firmware field in the case of the compose.install_blivet_no_swap testcase, which is weirdly encoded as uefi and 64bit respectively in the scenario field - but also all kinds of other information, some of which is encoded in the other fields, and some of which is not.

I think that understanding the composition and meaning of the scenario field (in the scope of openqa-generated results; note that rpmdeplint also has a 'scenario' field that is always identical to its 'arch' value) is the first necessary step. A second step then might be making sure the scenario can not be produced in a different way, which would make the values less "spread around", so greenwave could be aware of the possible (meaningful) combinations, just like it knows what arch is for koji builds, for example.

I'm also not really sure how the gating rules are set, but I suppose that gating checks against a known set of testcases (in order to decide whether the compose 'is ok' or not), rather than grabbing all the results and just checking for the presence of random FAILED results. While not ideal, the knowledge of the testcase names could be extended with the 'differentiating fields', like the firmware in the case of compose.install_blivet_no_swap.

None of these sound ideal, but as I said, knowing what the scenario really means, how it's constructed, and whether it could be done in a different way is IMO vital base knowledge, on top of which we can then build the solution.

We had a long discussion about what 'scenario' means on resultsdb-users, remember...

Broadly speaking it's the entire context in which the test was run. The term is borrowed from openQA, which has an internal definition of it as encompassing:
- DISTRI: a very high-level concept that sort of maps to distribution; in Fedora we have only one, called 'fedora', right now
- VERSION: the release number, or 'Rawhide', for Fedora
- FLAVOR: approximately the type of image under test, or the special 'universal' flavor
- ARCH: ...it's the arch
- TEST: the test name
- MACHINE (optional): something like the configuration of the system under test; it's very significant for Fedora, as this is where the distinction between an x86_64 BIOS and x86_64 UEFI run of the same test is made

This is basically openQA's canonical definition of what an entire unique 'test scenario' looks like: if two tests have the same 'scenario' they are considered to be multiple runs of "the same test".

For resultsdb reporting of openQA results we construct a scenario value from that set of keys, including MACHINE but omitting TEST (as it would effectively duplicate the test name) and VERSION. Omitting VERSION makes sense to me for various reasons; gating is a good example - we want to be able to specify "we gate on this test in this context" without having to do it differently for every release just because the "context" definition changes per release. It also allows us to compare "the same test" across releases quite easily when we want to do that. And the VERSION value is always available in the result anyway, so when you want to query for results with the same "complete" scenario, you just find ones that match on test + version + scenario.

Practically speaking this means the 'important' parts of the scenario at present for Fedora openQA tests are the ARCH, FLAVOR and MACHINE; to give concrete examples, "the same" test run on two different images shouldn't be considered to be the same, and "the same" test run on UEFI or BIOS shouldn't be considered to be the same.

To take a step back, though, really the idea of the scenario concept was that things that consume it only need to know the basic idea; it's not really important to know what its contents are, just that "if you want to find results that are for 'the same test', find ones where test + scenario match". The idea was that different test systems can construct the scenario value in any way that makes sense to them. Nothing should be caring too much about what the string is, nothing is supposed to parse it - only match values of it from different results for the purpose of deciding if they're results for "the same test" or not.
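
As a minimal sketch of that contract (client-side only, with result fields simplified to scalars - resultsdb actually stores data values as lists):

def is_same_test(a, b):
    # Two results are runs of "the same test" iff the testcase name and
    # the opaque scenario string both match. Nothing parses the scenario.
    return (
        a["testcase"]["name"] == b["testcase"]["name"]
        and a["data"].get("scenario") == b["data"].get("scenario")
    )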

Given this, I'm not sure it's really important to debate whether this or that value ought to be included in the 'scenario' string for a given test system or not, so long as it actually achieves the purpose?

I guess this does break down a bit in the gating case, because to do gating we have to know the possible scenarios and write them down in the policy. I'll have to think if it makes any sense to tweak the concept at all there...

There are other ways to achieve the basic goal here, of course. We thought about several of them. For instance, we could have each result define a set of 'scenario keys', so if you want to decide if two results are for "the same test" you have to ask each result what its 'scenario keys' are and compare the values of those keys in each test, or something like that. But IIRC, the 'it's just a magic string that the test system decides the details of' approach seemed like the sanest one, at the time.

If this was a wonderfully well-organized universe we could of course envision all possible 'scenario' concepts in all test systems and come up with standard names for all of them and ensure that all results in resultsdb used those standard names. But given that we can't even get two test systems (the pipeline and taskotron) to agree on how to represent package NVRs in results, I don't think we live in that universe...:)

Does this help at all? Does anyone still have questions?

It may help in thinking about the 'scenario' idea to just look at one of openQA's result summary pages:

https://openqa.fedoraproject.org/tests/overview?distri=fedora&version=Rawhide&build=Fedora-Rawhide-20190215.n.0&groupid=1

the install_default test is a good one to consider. Note how many times it appears: under different flavors, on different arches, and in UEFI runs and BIOS runs. Each of those is considered a 'different' test, but they're all using the same test case ('test suite' in openQA terminology). It's the context that differs. This is the concept we're trying to cover with the 'scenario'.

@adamwill I'll try to reply in detail once I'm not on the phone, but while what you say certainly makes sense from your point of view, in practice the "gating system" kind of really needs to know what to look for. Say that you have gating for koji builds: you probably need to know what the arch is in order to decide whether to let a build go or block it (e.g. you could care only about some arches, but not others).

The thing is, the "gating system" needs to be, by definition, aware of what kinds of results it is looking for, and while what you say makes sense in one particular usecase - when the system is consuming all the results, knowing that for some of them there is not only testcase, but also a scenario, and that they make up a 'compound index', and then keeps an internal 'cache' of the 'compound indexed testcase results' - in other usecases, it might make more sense to know "what to look for" a bit more precisely than "nobody should really care about what the value is" like with the arch.

I don't remember the discussion, but I'd guess I was not really fond of it even then :D

I think we were generally agreed that it was a difficult problem with no nice answers :D

It is a squishy concept and difficult to think about, but I can't really think of a use case where the current implementation doesn't actually work. And I do think there's a big problem with expecting systems to know "what to look for", when "what to look for" could differ so much between test systems and we're very unlikely to get them all to agree on one way to describe similar things. (The BIOS vs. UEFI distinction is a good example: this is one that may well come up in multiple test systems, but are we going to be able to get them all to use the same name for it? Uptake of resultsdb_conventions so far suggests "no" :D)

BTW, to answer a question I forgot to answer: "why don't we just stuff all the scenario details in the test name?"

Well, we did actually consider that, IIRC. I don't recall who proposed it and who opposed it - I think @jskladan didn't like it, but IMBW. There is one obvious problem with it: sometimes you do just want to look up results by test name. It's perfectly reasonable that sometimes you might want to find, for instance, all the install_default results for a given release or compose, regardless of the 'scenario'. If the test name and scenario details are smushed up together in one value, this is very icky and difficult; I think we all agreed strongly we didn't want to get into teaching consumers to parse out the individual elements of the scenario value.

There's also kind of a historical precedent here. When I showed up and started thinking about this problem, the way I saw it, the 'scenario' concept kinda already existed, but no-one had thought about it properly. There were various bits of code which basically implemented the logic: "if the test name and arch are the same, then do X". That's basically the scenario concept! It's just a very very incomplete and not-thought-through implementation of it, because at the time only Taskotron was filing results in resultsdb, and for the tests that existed in Taskotron at the time, the only 'scenario' concept that actually mattered was the arch. So at the time for the results we had, 'scenario' basically equalled 'arch'. But note that Taskotron didn't just stuff the arch into the test case name, it stored it separately, and used this matching logic. So to me, the 'scenario' concept is really just extending what we were already doing with Taskotron and ResultsDB with a somewhat more robust definition of what a 'scenario' or 'context' actually is (i.e. more than just "well isn't it always just the arch?" No, no it isn't.)

I was actually hoping that we'd update Taskotron to include a 'scenario' value in all results it filed - its value could simply be equal to the arch, that's perfectly fine. The way it works in my head, if I were actually writing something that had to deal with this, I'd write it like this:

# prefer an explicit scenario, fall back to the arch, else an empty string
scenario = result.get('scenario', result.get('arch', ''))

i.e. we basically expect all results to have scenarios; for results that don't (which in my ideal world would be just old, pre-'scenario' concept results) we have a working assumption that the arch is the scenario, and if the result doesn't even have an arch, we just define the scenario as an empty string. That's how the concepts sort of look in my head.
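
Taking that one step further, the fallback slots naturally into a "same test" grouping key - a hypothetical helper, again with simplified scalar fields:

def same_test_key(result):
    # Grouping key for "the same test": testcase name plus scenario,
    # with the fallback above (scenario, else arch, else empty string).
    data = result.get("data", {})
    scenario = data.get("scenario", data.get("arch", ""))
    return (result["testcase"]["name"], scenario)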

@adamwill And I absolutely understand how you got to the whole 'scenario' thing, and what benefits it has in the way of thinking you base the ideas on, but I still believe that the core idea of "gating/validation system knowing what kinds of results it deals with" and "tests producing sensible results" supersedes it in my way of thinking :)

Anyway, this is rather a philosophical debate, and us disagreeing on "what would have been a better design principle", and "what level of knowledge should/could be required for each and every bit in the pipeline" won't have much impact on the matter at hand.


Getting back to the issue - I just want to repeat that I have, in principle, nothing against "compound unique keys" for the resultsdb queries, based on any combination of testcase_name and result_data/extra_data, as long as the functionality is not hardcoded specifically for scenario alone, and it performs well.

A non-naive implementation is beyond my SQL skills, though, so I won't be able to provide the patch here. And a naive implementation is a no-go, as this really needs to be lightning fast in order to work properly.

Sorry that I've created all this big discussion :D
I started it because I think 2 tests with the same testcase, type etc., but 2 different scenarios, should be treated as different. It shouldn't be a hard-coded thing in the code, and it shouldn't impact performance much.
If the scenario is "null", it won't be taken into consideration.

Does that make sense?

We had a long discussion about what 'scenario' means on resultsdb-users, remember...

For reference (and long winter nights):
https://lists.fedoraproject.org/archives/list/resultsdb-users@lists.fedoraproject.org/thread/DJSIOLA3YEH6HNSJEXVPLIZT5TW4CDED/

A non-naive implementation is beyond my SQL skills, though, so I won't be able to provide the patch here. And a naive implementation is a no-go, as this really needs to be lightning fast in order to work properly.

Another option is to make scenario a top-level field in the database, on par with item, testcase, etc. That would make the queries fast, right? But for that we should be convinced that the scenario concept is really the right way to go. And that probably means sitting down at the design board and talking to all stakeholders about their use cases/requirements (because nowadays there are many more stakeholders in this than before).

Another option is to make scenario a top-level field in the database, on par with item, testcase, etc.

Not really:
1) item is not a "top level field in database"
2) the whole latest concept is based around testcase being the one unique thing. It does not matter whether scenario is in the Testcase table or in the key-value ResultData: the SQL query would have to be changed from a group by over testcase.name to something that takes result.testcase_name + result.data.scenario as a compound unique value. And it does not really matter whether it is result.data.scenario or result.scenario - you need to do a group by over those two values as a compound key.

As I said before, I could most probably put together a naive solution, but a proper one really escapes me, especially since I really do not wish/want/like to just make scenario a first-class citizen in the code - who's to say somebody else won't come and say "what about this field foobar that I invented for myself"? The solution really needs to work for any value/combination of values, and must be able to handle that "some" results don't have the (say) scenario value and still work properly... Once you really start thinking about it, it becomes rather complicated.
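
To pin down the expected semantics only (plain Python over already-fetched results; the real endpoint would have to express this in SQL to stay fast), the generic behaviour being asked for looks roughly like this - the function name and field shapes are illustrative:

def latest_grouped(results, extra_keys=("scenario",)):
    # Newest result per (testcase_name, *extra_keys) compound key.
    # extra_keys can be any combination of result-data keys; results
    # missing a key simply get None in that slot and still group sensibly.
    latest = {}
    for result in results:
        data = result.get("data", {})
        key = (result["testcase"]["name"],) + tuple(data.get(k) for k in extra_keys)
        current = latest.get(key)
        if current is None or result["submit_time"] > current["submit_time"]:
            latest[key] = result
    return list(latest.values())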

Please correct me if I'm wrong, but it seems that we all agree and understand that scenario, when used, should be taken into account when grouping unique test results. However, while this is true for this particular use case, there are (potentially) other fields that also fall into this category. Thus, hard-coding scenario in the /results/latest API endpoint is not ideal.

To get around this, we could enhance the /results/latest API endpoint to take an optional argument with additional keys to be used during aggregation. The downside is that this may have a performance impact because of the underlying SQL schema.

If that's the case, I'd like to proceed with trying out this change to see what kind of hit this has on performance, and the possible options to solve that.

How does that sound? We'd be happy to make the changes and submit the PR for review.

A bit more background: We've recently changed Greenwave to use the /results/latest API endpoint. Previously, it was just fetching un-aggregated results and doing the grouping itself. The change was needed to address performance issues in some cases, e.g. artifacts with thousands of results.

@lucarval awesome! As I said, I'd be able to "put something together", but if there's somebody more skilled in all the things SQL, I'm happy to take PRs.

Thanks a lot, I think your understanding is absolutely right!

I would like to work on this change. I'll try to submit a PR in the next few days.
Thank you to everyone for the feedback and to Lui for the summary :)

I think we're all agreed on all practical considerations here, yeah. Just because this is an awkward little corner, it tends to provoke more philosophical discussions when it comes up :)

On the specific topic of the results/latest endpoint - personally, I just don't use it, due to this whole problem. Anything I've written that gets results from resultsdb just does this work itself, getting all potentially relevant results (by item or whatever) and then doing the 'latest' filtering itself.

Hello.
We discussed a bit and did some analysis.
We wrote a new query that can filter on multiple ResultData fields of your choice, and it seems to be really efficient. The issue is that it is not so efficient if there is no filter at all (no testcase or item... nothing).
Our idea is to start using the new query when there is some filtering, and the old one when there are no filters at all.

In case that's fine, we would submit a PR in the coming days.
But during the analysis we noticed that the /results/latest endpoint without filtering doesn't make much sense to us... What is the purpose of /results/latest without filtering?
It would return the results aggregated by testcase, even if they refer to different items? Is that really useful and used?

@gnaponie the /results/latest without any additional filters now returns "latest results for each testcase", just as you say.

I don't think that it gets used (but I'm not any kind of authority on that), but I'd like to keep the behaviour stable - since that can be done, and it does not necessitate upping the api-version number, as it stays compatible - i.e.:
- no filters (aka /results/latest directly) produces "last results for each testcase, disregarding anything else", exactly as it behaves now
- non-grouping filters (aka /results/latest?item=foobar, or anything you can do now) produce "last results for each testcase, in accordance with the filter" (aka exactly what it does now)
- grouping filters (e.g. /results/latest?group_by=testcase,scenario) change the way grouping is performed, and can of course still use additional filters (/results/latest?group_by=scenario&item=foobar)

Which should be fine, as far as I understand your comment. Does this make sense?
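
For concreteness, a hypothetical call against the proposed shape (the group_by parameter name is taken from the comment above and may differ in the final implementation):

import requests

resp = requests.get(
    "https://taskotron.fedoraproject.org/resultsdb_api/api/v2.0/results/latest",
    params={"item": "foobar", "group_by": "testcase,scenario"},
)
for result in resp.json()["data"]:
    # one "latest" result per (testcase, scenario) pair; note that
    # resultsdb returns data values as lists
    print(result["id"], result["testcase"]["name"], result["data"].get("scenario"))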

It sounds reasonable. I'm on it.
Thank you!

Metadata Update from @jskladan (5 years ago):
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)
