#1872 Disable Test Gating requirements until more UI is enabled
Closed 7 months ago by mooninite. Opened 8 months ago by mooninite.

Bodhi has enabled Test Gating recently and it requires certain tests to pass in order to allow an update to be pushed to stable.

These tests can sometimes return false positives (multi-lib on Wine for one example). The failures can be overridden with a "waiver" but this requires using a CLI tool with very limited documentation. The waivers also do not "take" on the first try. The update has to be unpushed and pushed.

I propose we temporarily suspend Test Gating in Bodhi until a more refined UI and user experience is available.


I am quite sympathetic to the suggestion that gating should be disabled in Bodhi until things are better. The user experience is bad: developers frequently seem to have problems submitting waivers, doing so is painful, and you can't tell if it worked until 6 hours later.

I think it would be good to temporarily suspend the gating, and to identify some criteria that must be met before it is re-enabled. I'm not sure what the full criteria should be, but here is a list of ideas we could consider from my soft, human brain:

  1. Bodhi could be given a WaiverDB auth token - this will give it a web interface with a "waive" button that is much easier than the waiverdb CLI.
  2. The Bodhi CLI could be given knowledge of gating - it should be able to display the current gating status and explain why the update isn't allowed through.
  3. The Bodhi CLI could be given the ability to waive updates - basically another option to using the web UI for this, and it would be much easier than the waiverdb CLI (it'd also require Bodhi to have the same auth token as described above and would use the same mechanism).
  4. The tests could be loaded in Bodhi's web UI much more quickly, by storing them in Bodhi instead of retrieving them with JavaScript via resultsdb on every page load. Sometimes the tests never load on updates with lots of builds because the JS times out, which makes it impossible to even know which tests failed or need to be waived.
  5. The CLI could be given the ability to display test results too, while we're in there.
  6. Bodhi could know about gating decisions from Greenwave synchronously instead of polling it every 6 hours. The polling introduces a lot of confusion, and was done first because fedmsgs do get lost so you need polling as a fallback. The problem is that we made the polling and did not make the fedmsg listener.
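
To make idea 6 a bit more concrete, here is a minimal sketch of a listener with a polling fallback. It is not Bodhi code: `GatingCache`, `fetch_decision`, and the `on_message` hook are hypothetical names, and a plain dictionary stands in for Greenwave. The only point is that a message refreshes one update immediately, while the periodic poll catches anything whose message was lost.

```python
import time


class GatingCache:
    """Tracks the last known gating decision per update (hypothetical helper)."""

    def __init__(self, fetch_decision, poll_interval=3600):
        self.fetch_decision = fetch_decision  # callable: update_id -> bool
        self.poll_interval = poll_interval    # seconds between fallback polls
        self.status = {}                      # update_id -> policies satisfied?
        self.refreshed_at = {}                # update_id -> time of last refresh

    def on_message(self, update_id, now=None):
        """Fast path: a 'decision changed' message arrived, so refresh right away."""
        self._refresh(update_id, time.time() if now is None else now)

    def poll(self, update_ids, now=None):
        """Fallback path: refresh only updates whose data is older than the interval."""
        now = time.time() if now is None else now
        for update_id in update_ids:
            last = self.refreshed_at.get(update_id)
            if last is None or now - last >= self.poll_interval:
                self._refresh(update_id, now)

    def _refresh(self, update_id, now):
        self.status[update_id] = self.fetch_decision(update_id)
        self.refreshed_at[update_id] = now


# A dictionary stands in for Greenwave in this sketch.
greenwave = {"FEDORA-2018-a13691074b": False}
cache = GatingCache(greenwave.get, poll_interval=3600)

cache.poll(greenwave, now=0)                 # first poll records "not satisfied"
greenwave["FEDORA-2018-a13691074b"] = True   # a waiver lands upstream
cache.poll(greenwave, now=60)                # too early for the fallback poll
stale = cache.status["FEDORA-2018-a13691074b"]
cache.on_message("FEDORA-2018-a13691074b", now=61)  # a message refreshes at once
fresh = cache.status["FEDORA-2018-a13691074b"]
print(stale, fresh)
```

With polling alone the developer keeps seeing the stale value until the next interval; with the listener the status flips as soon as the message is handled.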

I'm not suggesting that that is the final list, or even that they are all necessary. Just some ideas to think about to improve things. I believe there are already issues filed for all of the above with Bodhi, but I didn't verify that and don't have time to look them all up right now.

Thoughts?

and you can't tell if it worked until 6 hours later.

We used to have a hotfix that allowed us to run the cron job much more frequently

I wasn't aware that the UI here was that bad. Let's get on fixing that!

That said, I think there needs to be some clarification on what "disable test gating" means. Removing test gating completely seems to me like a disproportionate action. If certain tests aren't reliable, can we relax the greenwave policy on them, as needed?

I'm strongly against "disable everything for now" followed by a fuzzy period of "let's figure something out".

Suggestions from @pingou and @bowlofeggs seem like strategies we could implement fairly quickly: sync more often for now, move away from JavaScript loading of results, and add a web UI for waiving. Do we have tickets for these issues?

This will be discussed at the 2018-04-06 FESCo meeting.

Metadata Update from @sgallagh:
- Issue tagged with: meeting

8 months ago

I'm all for fixing the developer experience on this, and am +1 to disabling test gating until the developer experience is reasonable, because I don't think it's reasonable in its current state.

In addition to a "waive" button, could bodhi also add a 'Retest' button? It seems like the vast majority of the time people want to waive results is when there aren't any; if we could retest and pass, that would actually be better than waiving it (IMHO). The only way currently to get retested is to unpush and push again, but then you have to wait or get new karma to push to stable.

We used to have a hotfix that allowed us to run the cron job much more frequently

I finally took the time to add some tests to the corresponding PR: https://github.com/fedora-infra/bodhi/pull/2125. I'm not sure if that's enough, but I know for a fact (since we used to have this hotfixed on the server) that it allows running the script in under 30 minutes (IIRC it ran in something like 10 to 15 minutes), which would allow us to run the cron job every hour.

In addition to a "waive" button, could bodhi also add a 'Retest' button?

This is/was the idea behind RATS but the current implementation requires a new release of flask-oidc (although I guess we could update the package with a git snapshot in the meantime) and there is a disagreement about how to go about the project in general, so we're a little stuck there :(

I'm all for fixing the developer experience on this, and am +1 to disabling test gating until the developer experience is reasonable, because I don't think it's reasonable in its current state.

I feel like this conflates two issues: A flaky test and a delay for waiving.
There are workarounds for each that don't involve disabling gating overall: Disabling gating on the flaky test and implementing @pingou's workaround to sync more often.

In addition to a "waive" button, could bodhi also add a 'Retest' button?

I can speak for Taskotron+dist.rpmdeplint implementation and unfortunately this feature will require substantial code changes in both Taskotron and the task. It's definitely not something we can do quickly :/

For information, it is my understanding that @bowlofeggs is off today, so I do not know if he will be present at the meeting.

I thought this may influence the discussion since most of the improvements listed are concerning bodhi itself.

In addition to a "waive" button, could bodhi also add a 'Retest' button?

I can speak for Taskotron+dist.rpmdeplint implementation and unfortunately this feature will require substantial code changes in both Taskotron and the task. It's definitely not something we can do quickly :/

My understanding is that unpushing and re-pushing an update triggers another test run, so at worst case can we not just have bodhi untag-retag or emit whatever message that tells taskotron to test again?

My understanding is that unpushing and re-pushing an update triggers another test run, so at worst case can we not just have bodhi untag-retag or emit whatever message that tells taskotron to test again?

Yes, but if your update is ready to push to stable, and tests are preventing the push, you have to wait another 7 days to push to stable if you do that. That is one of my main drives behind this ticket.

My understanding is that unpushing and re-pushing an update triggers another test run, so at worst case can we not just have bodhi untag-retag or emit whatever message that tells taskotron to test again?

Yes, but if your update is ready to push to stable, and tests are preventing the push, you have to wait another 7 days to push to stable if you do that. That is one of my main drives behind this ticket.

I understand that. I am saying that if this can re-test an update, bodhi should be able to request that without resetting karma or unpushing the update, and have a 'retest' button by telling taskotron the same thing it does on an unpush/push.

FESCo Meeting (2018-04-06):

FESCo would prefer to discuss this when Randy Barlow is present. Also there is a hackfest next week that will aim to improve the situation.

I understand that. I am saying that if this can re-test an update, bodhi should be able to request that without resetting karma or unpushing the update, and have a 'retest' button by telling taskotron the same thing it does on an unpush/push.

Bodhi would have to tag the package back into the updates-testing-pending (or updates-pending) tag. But that is implementation-specific to dist.rpmdeplint. Retriggering other tasks would require different approaches (different tasks trigger on different events). For most tasks, we could fake the event, though. The most problematic is dist.rpmdeplint, because it ignores what triggered it and tests everything in the corresponding -pending tag.

This was discussed in the FESCo meeting today:
AGREED: Wait another week to see how the Bodhi changes shake out before deciding whether to disable gating (+5, 0, 0)

This was discussed in the FESCo meeting on 2018-04-20:
AGREED: Hold off on making further decisions about gating until Bodhi 3.6.2 is released (+1:8, -1:0, +0:0)

bodhi-3.7.0 (which is planned to have some fixes for this) will not be deployed in time for this week's meeting.

Metadata Update from @bowlofeggs:
- Issue untagged with: meeting

7 months ago

bodhi-3.7.0 was deployed to production today. It improves some things a bit, but does not solve all the problems. Here are the improvements:

  • Developers can now see a list of missing tests in the test results tab - this will help them to know which tests to waive. [Edit: this does not work because the Greenwave API changed.]
  • Updates will display links to waivers under the "Test Gating Status" area on the main tab. IMO, the links are not very obvious or clear (they are little cloud icons that you can hover over to see some info about the waivers, and they link to JSON, so not ideal UX wise, but better than nothing.)
  • Bodhi now runs the greenwave polling cron job every hour instead of every six hours. This is a bit better, but still not ideal (we really should have a fedmsg listener so we can react synchronously to changes).
  • Test gating is now actually enforced (I discovered while working on this release that Bodhi had actually not been enforcing test gating.)

Unfortunately, not everything that 3.7.0 intended to fix was fixed successfully:

  • Though Bodhi does now use the WaiverDB API correctly as the release notes state, there are a number of significant usability issues with the way the UI works in Bodhi[0-3]. Yesterday was the first time I ever was able to see it in action (it took a while to get a token for Bodhi to use WaiverDB in our staging environment) so until now I accepted patches for it trusting that the author had tested it against their own WaiverDB deployment and only focused on code quality.
  • Due to the usability issues, it was decided that we should continue to leave the WaiverDB UI disabled in Bodhi as it will cause more confusion than it will solve.
  • The developers still don't have an easy way to find out which tests are missing.

Due to not having the WaiverDB UI working, I suggest we discuss this at this week's meeting.

[0] https://github.com/fedora-infra/bodhi/issues/2361
[1] https://github.com/fedora-infra/bodhi/issues/2363
[2] https://github.com/fedora-infra/bodhi/issues/2364
[3] https://github.com/fedora-infra/bodhi/issues/2365

Metadata Update from @bowlofeggs:
- Issue tagged with: meeting

7 months ago

I've thought about this a good bit over the past day, and I believe it is against Fedora's best interests for test gating to remain enabled as-is. We are burdening our developers with this constant need for waivers, and now that we are enforcing it the problem is more severe than it was before. Several updates got rejected from today's composes due to the newly enforced gating.

One of the goals of test gating is to increase the quality of the distribution by making sure our builds pass certain tests before reaching our users. I argue that the constant need to waive tests is reinforcing the idea that a failed test should be waived, as opposed to the idea that a failed test should be fixed. The false-negative rate is quite high, and that is going to train our contributors to ignore the tests over time. If our contributors reach a place of ignoring test results, then the stated goal is not achieved.

One of the other goals of test gating is to make our contributors' jobs easier, but as it is I believe we are making their jobs harder. Many of our contributors are volunteers, and I think it is important that we not enforce policies that are overly burdensome on them. I think this is especially important if the policies aren't a means to a positive end, and if you buy my argument in the previous paragraph you might conclude with me that we are burdening our contributors without justification.

Developers can now see a list of missing tests in the test results tab - this will help them to know which tests to waive.

Sometime in the past 24 hours, Greenwave stopped returning the data that Bodhi expects to see which means that this quote is not actually true. Bodhi does not show developers which tests are missing in the UI.

Thank you for the detailed and frank assessment. I agree that we should indeed consider disabling gating for now.

What would be the best way to achieve temporary disabling without a) preventing ongoing work to improve the tooling, and b) without spending too much effort on the disabling itself?

With the release of Bodhi 3.7.0 gating is actually enforced now so I've had to go through the waiving process. The update in question is https://bodhi.fedoraproject.org/updates/FEDORA-2018-a13691074b.

Here's my experience:

  1. Once you get the email that simply states "kernel-4.16.7-200.fc27 ejected from the push because 'Required tests did not pass on this update'", there's no indication of what to do next.

  2. Assuming you know what to do somehow, you have to go through this process.

  3. Jump directly to troubleshooting because the package ships without the resultsdb_api_url and you will hit that error.

  4. Now jump back to the Python script you have to hand edit (why isn't this in the CLI, in fedpkg rather than some new CLI I have to use?), run it, and wait 5 minutes for a response.

  5. Now translate some of the Python dict representation that got printed out to JSON by hand (or import json and json.dumps in your script). The output doesn't make it obvious what you need to put into the waiverdb-cli. Look at the example in the wiki and just guess.

  6. Run your waiverdb-cli command. At the end, some waiver seems to have been made.

  7. Nothing changes in the bodhi web UI, but try to push your update anyway.

  8. Discover you created a waiver for the wrong NVR/"product" combination, make a new waiver.

  9. Note that the test gating status icon is still red and unhappy, but now there's a mysterious cloud icon underneath. Try to push your update again. Receive "Requirement not met Required tests did not pass on this update" pop-up.

At this point I've been messing with it for more than an hour and I still can't push the update and there's no indication of why or what to do.

This feature is in no way usable, in my opinion.

Hello @zbyszek,

What would be the best way to achieve temporary disabling without a) preventing ongoing work to improve the tooling

We have a staging deployment of Bodhi that has test gating enabled - this can be used for testing. Bodhi's development environment can also enable the gating easily.

and b) without spending too much effort on the disabling itself?

Test gating is a boolean setting in Bodhi's config file, so it's as simple as s/true/false/ and a playbook run to push it out to production.
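
For anyone curious what that flip amounts to, here is a small sketch using Python's `configparser` on an in-memory sample. The section name `app:main` and the key `test_gating.required` are assumptions made for illustration; check the deployed config for the real names. In production the change would go out through the playbook, not a script like this.

```python
import configparser

# Hypothetical excerpt of a Bodhi-style ini file. The section and key names
# are assumptions for this sketch, not copied from the production config.
SAMPLE_INI = """\
[app:main]
test_gating.required = true
"""


def set_gating(ini_text, enabled):
    """Return a parser for ini_text with the gating flag set to `enabled`."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    parser["app:main"]["test_gating.required"] = "true" if enabled else "false"
    return parser


before = configparser.ConfigParser()
before.read_string(SAMPLE_INI)
after = set_gating(SAMPLE_INI, enabled=False)

print(before.getboolean("app:main", "test_gating.required"))
print(after.getboolean("app:main", "test_gating.required"))
```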

I took some time to hack out a little script to get some numbers. I ran this script against Bodhi's development environment, which has the most recent database backup from production (probably taken in the last 24-48 hours or so).

As of the snapshot, there were 1871 updates in all active releases. If we filter out EPEL and modularity updates (as they don't do any test gating), that lowers to 1140 updates that are subject to gating. Of these, 100 of them are failing gating. Of these, 80 are because there are missing tests, and the other 20 are due to failed tests. It isn't easy for me to automatically determine whether those failed tests are false negatives or not - it would need a human to look at each.

Thus, 80/1140 (7%) of all updates from ~1 day ago put pretty tedious work on our package maintainers, many of whom are volunteers, assuming that the 20 were all real failures and not false negatives.

Here's the script and its output:

#!/usr/bin/python3

import json

from bodhi.server import config, initialize_db, models


initialize_db(config.config)

total = 0
fedora_total = 0
num_failed = 0
num_missing_tests = 0


for u in models.Update.query.filter(
        models.Update.status.in_([models.UpdateStatus.pending, models.UpdateStatus.testing])).\
        filter(models.Update.release_id == models.Release.id).\
        filter(models.Release.state.in_(
            [models.ReleaseState.current, models.ReleaseState.pending])):

    total = total + 1

    if 'f' not in u.release.name.lower() or 'm' in u.release.name.lower():
        # This is EPEL or modularity, so skip.
        continue

    fedora_total = fedora_total + 1

    if u.test_gating_status == models.TestGatingStatus.failed:
        num_failed = num_failed + 1
        rs = json.loads(u.greenwave_unsatisfied_requirements)
        for r in rs:
            if r['type'] == 'test-result-missing':
                num_missing_tests = num_missing_tests + 1
            else:
                print(u.greenwave_summary_string)
                print(u.greenwave_unsatisfied_requirements)


print(total)
print(fedora_total)
print(num_failed)
print(num_missing_tests)
1 of 2 required tests failed
[{"item": {"item": "atomic-reactor-1.6.25.1-1.fc26", "type": "koji_build"}, "result_id": 17227841, "scenario": null, "testcase": "dist.rpmdeplint", "type": "test-result-failed"}]
1 of 2 required tests failed
[{"item": {"item": "speedometer-2.8-2.fc27", "type": "koji_build"}, "result_id": 17782954, "scenario": null, "testcase": "dist.rpmdeplint", "type": "test-result-failed"}]
1 of 2 required tests failed
[{"item": {"item": "chromium-native_client-59.0.3071.86-6.20170607gitaac1de2.fc27", "type": "koji_build"}, "result_id": 18812332, "scenario": null, "testcase": "dist.rpmdeplint", "type": "test-result-failed"}]
1 of 2 required tests failed
[{"item": {"original_spec_nvr": "atomic-1.21.1-1.fc26"}, "result_id": 19145592, "scenario": null, "testcase": "org.centos.prod.ci.pipeline.complete", "type": "test-result-failed"}]
1 of 2 required tests failed
[{"item": {"item": "ghc-wai-cors-0.2.6-1.fc27", "type": "koji_build"}, "result_id": 19697397, "scenario": null, "testcase": "dist.rpmdeplint", "type": "test-result-failed"}]
1 of 2 required tests failed
[{"item": {"original_spec_nvr": "iproute-4.14.1-5.fc26"}, "result_id": 19700064, "scenario": null, "testcase": "org.centos.prod.ci.pipeline.complete", "type": "test-result-failed"}]
1 of 2 required tests failed
[{"item": {"item": "duplicity-0.7.17-2.fc26", "type": "koji_build"}, "result_id": 19919793, "scenario": null, "testcase": "dist.rpmdeplint", "type": "test-result-failed"}]
1 of 1 required tests failed
[{"item": {"item": "mongo-tools-3.6.3-0.2.20180319git2b10d84.fc28", "type": "koji_build"}, "result_id": 20313683, "scenario": null, "testcase": "dist.rpmdeplint", "type": "test-result-failed"}]
1 of 1 required tests failed
[{"item": {"item": "python-gensim-0.10.0-13.fc28", "type": "koji_build"}, "result_id": 20339151, "scenario": null, "testcase": "dist.rpmdeplint", "type": "test-result-failed"}]
1 of 2 required tests failed
[{"item": {"item": "timeline-1.17.0-1.fc27", "type": "koji_build"}, "result_id": 20472397, "scenario": null, "testcase": "dist.rpmdeplint", "type": "test-result-failed"}]
1 of 2 required tests failed
[{"item": {"item": "deja-dup-37.1-4.fc26", "type": "koji_build"}, "result_id": 20695791, "scenario": null, "testcase": "dist.rpmdeplint", "type": "test-result-failed"}]
1 of 2 required tests failed
[{"item": {"item": "wine-3.6-1.fc26", "type": "koji_build"}, "result_id": 20801714, "scenario": null, "testcase": "dist.rpmdeplint", "type": "test-result-failed"}]
1 of 1 required tests failed
[{"item": {"item": "odcs-0.2.1-2.fc28", "type": "koji_build"}, "result_id": 20819258, "scenario": null, "testcase": "dist.rpmdeplint", "type": "test-result-failed"}]
1 of 1 required tests failed
[{"item": {"item": "prelude-correlator-4.1.1-1.fc28", "type": "koji_build"}, "result_id": 20929291, "scenario": null, "testcase": "dist.rpmdeplint", "type": "test-result-failed"}]
1 of 1 required tests failed
[{"item": {"item": "R-orcutt-2.2-1.fc28", "type": "koji_build"}, "result_id": 20940848, "scenario": null, "testcase": "dist.rpmdeplint", "type": "test-result-failed"}]
1 of 1 required tests failed
[{"item": {"item": "R-httpuv-1.4.1-1.fc28", "type": "koji_build"}, "result_id": 21002690, "scenario": null, "testcase": "dist.rpmdeplint", "type": "test-result-failed"}]
1 of 2 required tests failed
[{"item": {"item": "ubertooth-2017.03.R2-1.fc27", "type": "koji_build"}, "result_id": 21055715, "scenario": null, "testcase": "dist.rpmdeplint", "type": "test-result-failed"}]
1 of 2 required tests failed
[{"item": {"item": "libefp-1.5.0-2.fc27", "type": "koji_build"}, "result_id": 21090089, "scenario": null, "testcase": "dist.rpmdeplint", "type": "test-result-failed"}]
1 of 2 required tests failed
[{"item": {"item": "libefp-1.5.0-2.fc26", "type": "koji_build"}, "result_id": 21089465, "scenario": null, "testcase": "dist.rpmdeplint", "type": "test-result-failed"}]
1 of 1 required tests failed
[{"item": {"item": "libefp-1.5.0-2.fc28", "type": "koji_build"}, "result_id": 21090455, "scenario": null, "testcase": "dist.rpmdeplint", "type": "test-result-failed"}]
1871
1140
100
80
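
For reference, the four totals printed above can be turned into the rates quoted in this thread. This is just a sketch of the arithmetic with the numbers hard-coded from the output; it treats each gated update as having exactly one unsatisfied requirement, which is an assumption the totals are consistent with (80 missing plus 20 failed equals 100).

```python
# Totals copied from the script output above.
total, fedora_total, num_failed, num_missing_tests = 1871, 1140, 100, 80

# Gated updates that are failing on an actual test failure rather than a missing result.
num_failed_tests = num_failed - num_missing_tests


def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


missing_rate = pct(num_missing_tests, fedora_total)  # gated only because results are missing
gated_rate = pct(num_failed, fedora_total)           # gated for any reason
missing_share = pct(num_missing_tests, num_failed)   # share of gated updates with missing results

print(num_failed_tests, missing_rate, gated_rate, missing_share)
```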

Note that this number also does not factor in any updates that the developer has already waived, and I would expect that to be a non-0 value as well. I need to get on to other work now, but it would be possible to calculate this number using Greenwave's API if we really need it. IMO, the 7% number is bad even not knowing the additional numbers, because that means that 7% of the time the developer could lose hours trying to figure out how to waive an update. I know that @jcline has lost a lot of time just today because of this (and his update still isn't successfully waived). I've also been trying to assist him and it is not easy to figure out.

OK I can't seem to pull myself away from this, so I modified the script to answer the question of how many current updates have waivers. It turns out that 8 updates have waivers, and that 4 updates have waivers but are still being marked as failed by greenwave (likely meaning the developer did not waive all the failures).

Here is the new script:

#!/usr/bin/python3

import json

import backoff

from bodhi.server import config, initialize_db, models, util


initialize_db(config.config)

total = 0
fedora_total = 0
num_failed = 0
num_missing_tests = 0
waived_updates = 0
waived_failing_updates = 0


# Greenwave frequently 500's, so this will help our script to automatically retry with an
# exponential backoff when that happens.
@backoff.on_exception(backoff.expo, Exception, max_tries=8)
def gw(data):
    api_url = '{}/decision'.format(config.config.get('greenwave_api_url'))

    return util.greenwave_api_post(api_url, data)


for u in models.Update.query.filter(
        models.Update.status.in_([models.UpdateStatus.pending, models.UpdateStatus.testing])).\
        filter(models.Update.release_id == models.Release.id).\
        filter(models.Release.state.in_(
            [models.ReleaseState.current, models.ReleaseState.pending])):

    total = total + 1

    if not total % 100:
        print(total)

    if 'f' not in u.release.name.lower() or 'm' in u.release.name.lower():
        # This is EPEL or modularity, so skip.
        continue

    fedora_total = fedora_total + 1

    if u.test_gating_status == models.TestGatingStatus.failed:
        num_failed = num_failed + 1
        rs = json.loads(u.greenwave_unsatisfied_requirements)
        for r in rs:
            if r['type'] == 'test-result-missing':
                # This is an update that is being gated because its tests haven't run.
                num_missing_tests = num_missing_tests + 1

    decision_context = u'bodhi_update_push_testing'
    if u.status == models.UpdateStatus.testing:
        decision_context = u'bodhi_update_push_stable'

    data = {
        'product_version': u.product_version,
        'decision_context': decision_context,
        'subject': u.greenwave_subject,
        'verbose': True}
    decision = gw(data)

    if decision['waivers']:
        # This update has waivers.
        waived_updates = waived_updates + 1
        if u.test_gating_status == models.TestGatingStatus.failed:
            # This update has waivers, and is still being gated (i.e., the developer did not waive
            # all the failures).
            waived_failing_updates = waived_failing_updates + 1


print(total)
print(fedora_total)
print(num_failed)
print(num_missing_tests)
print(waived_updates)
print(waived_failing_updates)
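
In case the retry decorator in that script is unfamiliar, here is a rough standalone sketch of what retrying with exponential backoff means. It does not use the backoff library and leaves out the random jitter that retry libraries commonly add; `retry_with_backoff` and `flaky_decision` are made-up names, and the flaky function only simulates a service that returns errors twice before answering.

```python
import time


def retry_with_backoff(func, max_tries=8, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on any exception with delays of 1, 2, 4, ... seconds."""
    for attempt in range(max_tries):
        try:
            return func()
        except Exception:
            if attempt == max_tries - 1:
                raise  # out of attempts, let the caller see the error
            sleep(base_delay * 2 ** attempt)


# Simulated flaky service: fails twice, then answers.
calls = {"count": 0}


def flaky_decision():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("500 Internal Server Error")
    return {"policies_satisfied": False, "waivers": []}


# Record the delays instead of really sleeping so the sketch runs instantly.
delays = []
result = retry_with_backoff(flaky_decision, sleep=delays.append)
print(calls["count"], delays, result)
```

The delays double after each failure, so a service that is briefly overloaded gets progressively more breathing room before the next attempt.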

Just to update my current predicament, I worked with @bowlofeggs all afternoon trying to waive the kernel updates, but it seems to be impossible. At the moment both F27 and F26 kernel updates are gated and cannot (it seems) be waived through. The F26 kernel contains the fix for an important CVE.

One other metric worth highlighting is that 80 of the 100 gated updates are false negatives. This means that we really are showing developers that 80% of the time, failed tests are just noise that you now have to do a super complicated dance to get around. This supports my earlier argument that we are training our contributors to ignore failed tests.

Due to @jcline's CVE, is it possible for FESCo to vote on this issue in ticket, before Friday's meeting?

Proposal: Disable test gating in Bodhi between now and the next FESCo meeting. During Friday's meeting, discuss this issue and make a decision about where to go from there based on that discussion.

+1's?

@bowlofeggs the tests will still be run and reported, right? If so, +1. What is the test that is failing?

@ausil The individual tests results will still be displayed in the test results tab, but the test gating status will no longer be displayed if we disable test gating.

The kernel update cannot be pushed due to this unsatisfied requirement:

  "unsatisfied_requirements": [
    {
      "item": {
        "original_spec_nvr": "kernel-4.16.7-200.fc27"
      }, 
      "scenario": null, 
      "testcase": "org.centos.prod.ci.pipeline.complete", 
      "type": "test-result-missing"
    }
  ]

We have tried many incantations with waiverdb CLI to clear that one, but it remains and we are unsure what to do. If I can't figure out how to waive this failed test, it highlights how difficult it is for developers who are less familiar than I am with the Greenwave/WaiverDB API to do so.
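
As an aside, this is roughly the kind of explanation a CLI could print from that response. It is only a sketch: `explain` is a hypothetical helper, not part of Bodhi or Greenwave, and the input is the snippet above wrapped in braces so that it parses as JSON.

```python
import json

# The unsatisfied requirement from above, wrapped in an object so it is valid JSON.
DECISION_JSON = """
{
  "unsatisfied_requirements": [
    {
      "item": {"original_spec_nvr": "kernel-4.16.7-200.fc27"},
      "scenario": null,
      "testcase": "org.centos.prod.ci.pipeline.complete",
      "type": "test-result-missing"
    }
  ]
}
"""


def explain(decision_json):
    """Return one human-readable line per unsatisfied requirement."""
    decision = json.loads(decision_json)
    lines = []
    for req in decision["unsatisfied_requirements"]:
        item = req["item"]
        # Most results use an "item" key; the CI pipeline uses "original_spec_nvr".
        subject = item.get("item") or item.get("original_spec_nvr")
        if req["type"] == "test-result-missing":
            reason = "has no result yet"
        elif req["type"] == "test-result-failed":
            reason = "failed"
        else:
            reason = req["type"]
        lines.append("{}: {} {}".format(subject, req["testcase"], reason))
    return lines


explanation = explain(DECISION_JSON)
for line in explanation:
    print(line)
```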

Given the current circumstances, I'm +1 to disabling gating until we make the situation better. Thanks @bowlofeggs for clearly communicating the update and the current status of gating.

+1 to disable for now.

I received an IRC vote from @till in #fedora-devel:

May 09 16:35:43 <bowlofeggs>    nirik, maxamillion, dgilmore, jsmith, jwboyer, zbyszek, tyll: if you have a moment, would you mind voting in-ticket on https://pagure.io/fesco/issue/1872 ? we are unable to push a kernel CVE, so i am proposing to disable test gating in bodhi until our next meeting
May 09 16:35:56 <bowlofeggs>    (and to discuss in our next meeting what to do next)
May 09 16:36:11 <bowlofeggs>    (i will advocate in that meeting to keep it disabled until some criteria are met)
May 09 16:56:29 <tyll>  bowlofeggs: I trust you to do the right thing
May 09 16:58:18 <bowlofeggs>    tyll: thanks ☺ is that a +1? if so, can you record that in the ticket for the record? https://pagure.io/fesco/issue/1872
May 09 16:59:33 <tyll>  bowlofeggs: yes +1, will do it tomorrow it is late here and I do not have my fas credentials on the phone

We are at +6 now. I will disable the gating once the current set of composes finish. Composes seem to take unusually long lately, so it may be many hours before I can disable it.

It's unclear to me if the problem is that we have no mechanism to override gating, or if that mechanism is currently not working. I want to say it's the latter, but it would be helpful if someone could confirm.

@bowlofeggs, I wish we could've linked up yesterday to help you get the right waiverdb syntax. We need an escalation path for the future.

Confirmed I can waive this this morning:

λ python ask-greenwave.py
('kernel-4.16.7-200.fc27', <Response [200]>, 200)
{u'applicable_policies': [u'taskotron_release_critical_tasks_for_stable',
              u'atomic_ci_pipeline_results_stable'],
 u'policies_satisfied': False,
 u'summary': u'1 of 2 required tests not found',
 u'unsatisfied_requirements': [{u'item': {u'original_spec_nvr': u'kernel-4.16.7-200.fc27'},
                u'scenario': None,
                u'testcase': u'org.centos.prod.ci.pipeline.complete',
                u'type': u'test-result-missing'}]}

Above, you see the response Randy and Jeremy were getting.

Here's the waiverdb command to use. It references the strange original_spec_nvr key that the atomic CI pipeline produces; it would be nice if it produced item and type keys like most other systems.

λ waiverdb-cli -p fedora-27 -s '{"original_spec_nvr": "kernel-4.16.7-200.fc27"}' -t org.centos.prod.ci.pipeline.complete -c "A waiver for jcline."
Created waiver 147 for result with subject {"original_spec_nvr": "kernel-4.16.7-200.fc27"} and testcase org.centos.prod.ci.pipeline.complete
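Since the hard part was getting the subject JSON exactly right, one way to avoid hand-writing it is to derive the command from the unsatisfied requirement itself. The sketch below is only an illustration: waiver_command is a hypothetical helper, and it uses only the -p, -s, -t and -c options that appear in the command above. It assumes the waiver subject should be the requirement's item, serialized as JSON.

```python
import json
import shlex


def waiver_command(requirement, product_version, comment):
    """Build a waiverdb-cli argument list for one unsatisfied requirement."""
    # Reuse the item from the Greenwave response verbatim as the subject,
    # so the waiver matches what the policy is actually looking for.
    subject = json.dumps(requirement["item"], sort_keys=True)
    return [
        "waiverdb-cli",
        "-p", product_version,
        "-s", subject,
        "-t", requirement["testcase"],
        "-c", comment,
    ]


# One entry from unsatisfied_requirements in the response quoted above.
requirement = {
    "item": {"original_spec_nvr": "kernel-4.16.7-200.fc27"},
    "scenario": None,
    "testcase": "org.centos.prod.ci.pipeline.complete",
    "type": "test-result-missing",
}

argv = waiver_command(requirement, "fedora-27", "A waiver for jcline.")
# Print a shell-quoted form that can be reviewed before running it.
print(" ".join(shlex.quote(arg) for arg in argv))
```

Printing the command instead of executing it keeps a human in the loop; passing argv to subprocess.run would be the obvious next step once the output looks right.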

And this is what the query to greenwave looks like afterwards:

λ python ask-greenwave.py
('kernel-4.16.7-200.fc27', <Response [200]>, 200)
{u'applicable_policies': [u'taskotron_release_critical_tasks_for_stable',
              u'atomic_ci_pipeline_results_stable'],
 u'policies_satisfied': True,
 u'summary': u'all required tests passed',
 u'unsatisfied_requirements': []}

For the difference, compare the subject of @jcline's waiver from yesterday with the one I just submitted:

Another thought: we are on the verge of enabling "opt-in" gating, where packagers can add a gating.yaml file to their repo to tell greenwave that tests X and Y should be required for their package above and beyond any global distro requirements.

For disabling gating (which FESCo approved here), we should re-enable the check from Bodhi to greenwave and disable the global distro requirements in greenwave, leaving the door open to opt-in experimentation on a per-package basis.
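As a rough mental model of that proposal (not Greenwave's actual implementation), the required tests for a package can be thought of as the union of the global distro requirements and whatever the package opts into. With the global set emptied, only packages that opt in are ever blocked. The function names, the package name and the test names below are all hypothetical.

```python
def required_tests(package, global_requirements, opt_in_requirements):
    """Tests an update of `package` must pass under this simplified model."""
    return set(global_requirements) | set(opt_in_requirements.get(package, ()))


def is_blocked(package, passed, global_requirements, opt_in_requirements):
    """True if some required test for `package` has not passed."""
    missing = required_tests(package, global_requirements, opt_in_requirements) - set(passed)
    return bool(missing)


# Global requirements disabled; one hypothetical package has opted in.
global_requirements = set()
opt_in_requirements = {"hypothetical-package": {"hypothetical.test.x", "hypothetical.test.y"}}

# A package that has not opted in is never blocked, even with no results.
print(is_blocked("kernel", [], global_requirements, opt_in_requirements))

# The opted-in package is blocked until all of its chosen tests pass.
print(is_blocked("hypothetical-package", ["hypothetical.test.x"],
                 global_requirements, opt_in_requirements))
```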

@ralph Even if we could have linked up, I don't think you and Randy can afford to walk every developer through this process by hand. I wasted half a day fiddling with this, even though I have some knowledge of Bodhi and Greenwave. @bowlofeggs has significantly more knowledge of both, and together we still couldn't get it to work.

While it may technically be possible to waive things, it's such a convoluted, undocumented process that most people will not figure it out.

Indeed. We missed an opportunity to iron out the docs.

@jwboyer I would describe it as "the current mechanism is difficult to use and can cost hours of lost productivity for unlucky developers like myself and jcline". If we had been able to figure out the exact right JSON to pass the CLI we would have been successful yesterday, but our soft, human brains couldn't quite figure it out and the documentation is pretty thin. I talked with @ralph yesterday about fixing Bodhi's UI so it can have a working waiver button - I would advocate that we should have that working before enforcing gating across all packages again.

I think @ralph's proposal to switch Bodhi's config back on but have Greenwave only enforce gating for packages that opt in is a reasonable alternative to completely disabling it. Let's discuss that option during the next meeting.

Thanks @bowlofeggs. I agree opt-in would be better than a blanket disable. I'd be +1 for that for sure.

Dropping meeting tag as this was resolved on the list.

Metadata Update from @sgallagh:
- Issue untagged with: meeting

7 months ago

I'd like us to discuss today two things:

0) Re-enabling Bodhi's test gating while simultaneously configuring Greenwave to only enforce it for packages that opt in. This was @ralph's proposal above, and I think it is sensible.

1) What criteria we want to set before re-enabling global enforcement.

If we can answer these two questions today, I think we can close the ticket.

Metadata Update from @bowlofeggs:
- Issue tagged with: meeting

7 months ago

AGREED: When devs think it's ready, opt-in some important non-FESCo-member-maintained packages and get feedback for at least 50 updates before turning it back on globally. (+6, 0, -0) (sgallagh, 15:47:47)

Metadata Update from @sgallagh:
- Issue untagged with: meeting

7 months ago

Gating is disabled temporarily, and we voted on a set of reqs for reenabling. Nothing left to do here.

Metadata Update from @zbyszek:
- Issue status updated to: Closed (was: Open)

7 months ago

Bodhi 3.8.0 was deployed to production today, and includes a patch to fix the critical compose issue. As a result, Bodhi has again been configured to enforce Greenwave's decisions on updates. Thus, the remaining pieces to get in place are:

  • Greenwave's opt-in feature.
  • Bodhi still needs UX improvements around gating.
  • Opt-in user feedback should be gathered to verify that the UX improvements are satisfactory.
  • FESCo will re-vote to turn gating back on, based on user feedback.


Metadata