We have not seen any fedmsgs from greenwave in prod for the last 16 days:
https://apps.fedoraproject.org/datagrepper/raw?category=greenwave
This is roughly the time it was last upgraded.
@giulia @dcallagh and/or @lholecek can you look and see if you can see any reason here?
Maybe redeploying "greenwave-fedmsg-consumers" pod will help.
The pod uses policy files (under "/etc/greenwave") to know when to emit fedmsg messages -- looks like it was not restarted in 18 days, though the "web" pod was restarted 2 days ago.
BTW, we have yet to figure out how to redeploy pods automatically when files in any ConfigMap they use were modified.
CC @gnaponie
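For reference, one common pattern for the open question above is to stamp a checksum of the ConfigMap contents onto the pod template as an annotation, so that any content change alters the template and causes a new rollout wherever a config-change trigger is enabled. The sketch below is only an illustration of that idea, not something this deployment is confirmed to use; the function names and the `checksum/config` annotation key are assumptions.

```python
import hashlib
import json


def configmap_checksum(data):
    """Return a stable SHA-256 hex digest of a ConfigMap's `data` mapping."""
    # Sorting keys makes the digest independent of key order.
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def checksum_patch(data, annotation="checksum/config"):
    """Build a patch body that stamps the digest onto the pod template.

    The annotation key is hypothetical; any key works as long as it is
    applied consistently.
    """
    return {
        "spec": {"template": {"metadata": {"annotations": {
            annotation: configmap_checksum(data),
        }}}}
    }
```

Applying such a patch whenever the policy files change would give the digest a new value, which is what makes the pod template differ from the previously deployed one.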
I redeployed it, but it didn't seem to help. Looking at the logs I don't see anything too dire looking:
[2018-07-23 16:36:24][greenwave.consumers.resultsdb DEBUG] Considering subject bodhi_update: 'FEDORA-2018-47d2ad9eaf'
[2018-07-23 16:36:24][greenwave.consumers.resultsdb DEBUG] No cache value found for 'greenwave.resources:retrieve_results|bodhi_update FEDORA-2018-47d2ad9eaf'
[2018-07-23 16:36:24][greenwave.consumers.resultsdb DEBUG] messaging: found 0 applicable policies of 7 for testcase 'update.server_cockpit_default'
[2018-07-23 16:36:24][greenwave.consumers.resultsdb DEBUG] messaging: found 0 decision contexts
[2018-07-23 16:36:24][moksha.hub DEBUG] 'ResultsDBHandler' thread 140212573837056 | Going back to waiting on the incoming queue. Message handled: True
Looks like the problem was that Greenwave expected a different message format for OpenQA compose test results.
Fix: https://pagure.io/greenwave/pull-request/262
So, I am confused... we just got some greenwave fedmsgs:
greenwave.decision.update greenwave says NO-GO on "something" for "rawhide_compose_sync_to_mirrors" (fedora-rawhide) JSON 32 minutes ago - 2018-07-25 19:42:13 UTC
greenwave.decision.update greenwave says NO-GO on "something" for "rawhide_compose_sync_to_mirrors" (fedora-rawhide) JSON 35 minutes ago - 2018-07-25 19:38:57 UTC
greenwave.decision.update greenwave says NO-GO on "something" for "rawhide_compose_sync_to_mirrors" (fedora-rawhide) JSON 36 minutes ago - 2018-07-25 19:38:32 UTC
greenwave.decision.update greenwave says NO-GO on Fedora-AtomicHost-Vagrant-28_Update-20180703.1403.x86_64.vagrant-libvirt.box for "rawhide_compose_sync_to_mirrors" (fedora-rawhide) JSON 22 days ago - 2018-07-03 17:53:50 UTC
But nothing restarted that I can see or rebuilt. How did it start working?
Oh, were those manually sent shell ones?
We released a new version of Greenwave yesterday. I didn't notice any logs for "compose" test results so I couldn't check whether it's working.
Sorry for not notifying you earlier (yesterday was too busy).
Apart from datagrepper showing "something" instead of ID, does everything work correctly?
Yeah, I guess it's working... will reopen if I see anything further.
Thanks!
Metadata Update from @kevin: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)
We got some alerts today:
Service: Check datanommer for recent greenwave messages
Host: busgateway01.phx2.fedoraproject.org
Info: CRIT: datanommer has not seen a greenwave message in 2 days, 28 minutes, 54 seconds
Source: noc01
Date: Sun Nov 11 11:16:53 UTC 2018
Perhaps the new greenwave version has an issue with fedmsg sending?
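For context, a freshness check of this kind boils down to comparing the age of the newest message against thresholds. The following is a minimal sketch of that logic, not the actual nagios check; the function name and the default thresholds are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def check_freshness(last_seen, now,
                    warn=timedelta(hours=4), crit=timedelta(days=1)):
    """Classify how stale the newest message is.

    `warn` and `crit` are hypothetical thresholds, not the values
    configured in the real check.
    """
    age = now - last_seen
    if age >= crit:
        return "CRIT", age
    if age >= warn:
        return "WARN", age
    return "OK", age
```

If Greenwave legitimately has nothing to announce for long stretches, the thresholds (rather than Greenwave) would be the thing to adjust.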
Metadata Update from @kevin: - Issue status updated to: Open (was: Closed)
From the part of the logs I can access in OpenShift, I don't see that greenwave-fedmsg-consumers pod tries to send any messages. It cannot find any applicable policies for the received test result changes.
Metadata Update from @ralph: - Issue assigned to ralph
Hm. Is greenwave doing the right thing here (finally) but we need to adjust nagios' expectations?
Hypothesis: we disabled all gating requirements back in June... so no greenwave decisions should actually be changing, therefore it has nothing to announce on the bus.
Metadata Update from @ralph: - Assignee reset
That hypothesis seems off, to me. It should still be making decisions for the policies with empty rules, just they should always be 'go'...and it should have new decisions any time a new package / update shows up.
Also, the openqa_important_stuff_for_rawhide policy is enabled and does have rules, in production as well as stable, so it should definitely be spitting out results for that. (Staging is happily sending out decisions for that policy - utterly wrong ones, admittedly, I'm working on fixing that, but it's doing it. :>)
/cc @lucarval can someone in your crew have a look here?
Not sure if I understood the issue here: Greenwave is not sending messages. Is that right? I've checked the logs for the greenwave-fedmsg-consumers pod, and I see two strange things in my opinion:
(1) Greenwave never finds applicable policies; every time there's "messaging: found 0 applicable policies of 9 for testcase $TESTCASE", so it will never emit a new message. That is strange... and I would try to update Greenwave to the latest version, since we recently (after the 0.9.9 - that is the current version in fedora prod) re-factored a bit the code that finds matches for the policies. Stage is updated to the latest version I've noticed, but I don't have access to the logs there I'm afraid.
(2) I see also this in the logs and I'm not sure if it is important: No routing policy defined for "org.fedoraproject.prod.taskotron.result.new" but routing_nitpicky is False so the message is being treated as authorized.
If you would like to update Greenwave in prod I could do that today or on Monday (if anyone doesn't want to do it before Monday).
"Not sure if I understood the issue here: Greenwave is not sending messages. Is that right?"
Yup. It's not emitting fedmsgs when decisions are updated. It works fine in other ways (you can query it for a decision), but it's not sending out fedmsgs.
"I would try to update Greenwave to the latest version, since we recently (after the 0.9.9 - that is the current version in fedora prod) re-factored a bit the code that finds matches for the policies. Stage is updated to the latest version I've noticed, but I don't have access to the logs there I'm afraid."
Staging is sending out messages just fine. I think it's working OK in other ways too, but we should probably check @bowlofeggs is OK with updating prod to 0.9.11 before doing it, as Bodhi's use of greenwave is currently the most important use. We're not using the fedmsgs for anything really important yet, so it's less important to fix the fedmsgs than it is to make sure Bodhi keeps working. I think Bodhi should be fine with 0.9.11, though.
"(2) I see also this in the logs and I'm not sure if it is important: No routing policy defined for "org.fedoraproject.prod.taskotron.result.new" but routing_nitpicky is False so the message is being treated as authorized."
It shouldn't matter. It's just informational. As it says, it's treating the message as authorized (i.e. it's using it).
"If you would like to update Greenwave in prod I could do that today or on Monday (if anyone doesn't want to do it before Monday)."
Sounds good to me if @bowlofeggs is OK with it :)
Oh, also, if possible I'd like to have my two recent PRs included if we're updating prod - https://pagure.io/greenwave/pull-request/349 and https://pagure.io/greenwave/pull-request/348 . I'd like to switch the Fedora deployments over to using new-style resultsdb.result.new fedmsgs instead of old-style taskotron.result.new fedmsgs, as that should allow them to emit accurate compose decision fedmsgs - but that'll go better with those two changes.
On Thu, 2018-12-06 at 20:20 +0000, Adam Williamson wrote:
I think it's working OK in other ways too, but we should probably check @bowlofeggs is OK with updating prod to 0.9.11 before doing it, as Bodhi's use of greenwave is currently the most important use. We're not using the fedmsgs for anything really important yet, so it's less important to fix the fedmsgs than it is to make sure Bodhi keeps working. I think Bodhi should be fine with 0.9.11, though.
I'm OK with updating it as long as it doesn't have any known backwards incompatible changes.
@adamwill your PRs got merged, so they will be included :) @bowlofeggs Ok. I'll check just to be sure that nothing will break.
I think I might have time today to release it. I'll send an email to infrastructure@lists.fedoraproject.org before starting anyhow.
@bowlofeggs The only thing that changed in a "backwards incompatible" way is this: * https://pagure.io/greenwave/pull-request/337
Beside that, we just did some re-factoring and a new feature (new subject_type redhat-module) that won't change greenwave's behaviour. The other change is Adam's, but it appears to me that it is what we expect: https://pagure.io/greenwave/pull-request/348#
I'll write the release notes and then start the release, if no one is against that.
On Mon, 2018-12-10 at 13:25 +0000, Giulia Naponiello wrote:
The patches that the Factory 2 team wrote for Bodhi use the summary. Please do not deploy this change to production. We need to either revert that change or adjust Bodhi, or better yet, provide a formal way for Bodhi to get the information it's getting via the summary.
@bowlofeggs ack. I didn't start and I won't. Let's consider how to do that in the best way.
Wait.
Does Bodhi depend on that letter not being capitalized?
Looks like: https://github.com/fedora-infra/bodhi/blob/develop/bodhi/server/models.py#L2037
I see. It's a different string though. That one wasn't changed in the greenwave PR.
Agreed, it's a different string. We noticed that one, but we didn't find the changed one in Bodhi. Is that there? If yes, let's revert the change, or find a smarter way to check that.
@bowlofeggs could you please confirm that that string is not used in Bodhi? I don't want to release something that will break it. If it is used, let's revert the change or find another solution, so that we can update Greenwave.
Confirmed - that string is not used in Bodhi.
So where are we here? Can a new greenwave that emits fedmsgs be deployed now? (or alternately the check we have in nagios adjusted)
Greenwave release is scheduled for Monday 14th at 3pm UTC.
The new Greenwave version was released successfully. Please let me know if issues are still present. Thank you
I don't see any decisions more recent than two months old at https://apps.fedoraproject.org/datagrepper/raw?category=greenwave , but there have been Rawhide composes the last two days for which decisions should have been emitted...
Ok, I've checked the logs and for Rawhide composes, one message that I see (for example) is this one: https://apps.fedoraproject.org/datagrepper/id?id=2019-5e28f91f-e5e9-4cea-8c21-51b345163cec&is_raw=true&size=extra-large with testcase name == "org.centos.prod.ci.pipeline.allpackages-build.package.test.functional.complete"
but checking the policies configured here: https://greenwave-web-greenwave.app.os.fedoraproject.org/api/v1.0/policies I don't see any policy with that testcase. So my guess is that the policies should be updated...
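The manual check described here (does any configured policy mention a given testcase?) can be sketched as below. This assumes the policies endpoint returns a JSON object with a `policies` list whose entries have an `id` and a `rules` list, and that rules name their testcase in a `test_case_name` field; those field names are assumptions to verify against the real response, and the helper name is hypothetical.

```python
def policies_for_testcase(payload, testcase):
    """Return the ids of policies with at least one rule naming `testcase`.

    `payload` is the decoded JSON from the policies endpoint; the
    "policies", "rules", "id" and "test_case_name" keys are assumed.
    """
    matches = []
    for policy in payload.get("policies", []):
        for rule in policy.get("rules", []):
            if rule.get("test_case_name") == testcase:
                matches.append(policy.get("id"))
                break  # one matching rule is enough for this policy
    return matches
```

An empty result for a testcase would be consistent with the "found 0 applicable policies" lines in the consumer log.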
Anyhow, I see another issue: that message carries no "type" information (rawhide or something similar), nor "original_spec_nvr" or any of the other fields handled by Greenwave. So Greenwave just doesn't detect the type and goes back to listening for other messages. If we ever had another check there, it must have been changed quite a long time ago. Maybe @ralph remembers this use case?
That's at least for the rawhide messages I see in the logs.
I see some applicable policies found, but there wasn't a decision change message because the decision didn't change. An example:
[2019-01-17 07:59:17][greenwave.policies DEBUG] found 2 applicable policies of 9 for: {'subject_type': 'koji_build', 'subject_identifier': 'adobe-source-han-sans-jp-fonts-2.000-1.fc29', 'testcase': 'dist.python-versions.unversioned_shebangs', 'product_version': 'fedora-29'}
[2019-01-17 07:59:17][greenwave.policies DEBUG] found 2 decision contexts
[2019-01-17 07:59:17][greenwave.consumers.resultsdb DEBUG] Skipped emitting fedmsg, decision did not change: {'applicable_policies': ['taskotron_release_critical_tasks_for_stable', 'atomic_ci_pipeline_results_stable'], 'policies_satisfied': True, 'satisfied_requirements': [], 'summary': 'no tests are required', 'unsatisfied_requirements': []}
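The "decision did not change" behaviour in that log can be illustrated with a small sketch: a message is only published when the newly computed decision differs from the previous one. This is not Greenwave's actual implementation; the cache, the key, and the `publish` callback are all hypothetical names used to show the idea.

```python
def publish_if_changed(cache, key, new_decision, publish):
    """Publish `new_decision` only if it differs from the cached one.

    `cache` is any dict-like store of previous decisions, `key`
    identifies the subject/context, and `publish` is a caller-supplied
    callback; all three are illustrative assumptions.
    """
    old_decision = cache.get(key)
    cache[key] = new_decision
    if old_decision == new_decision:
        return False  # nothing new to announce on the bus
    publish(new_decision)
    return True
```

With policies whose rules are empty, every new result yields the same "no tests are required" decision, so a gate like this stays silent after the first decision.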
I've been focusing on compose decisions rather than package decisions. I was expecting to see compose decision fedmsgs when Rawhide composes complete. Let me dig back through this stuff and double check that we don't need to change anything in ansible, though. The Pipeline does nothing interesting for composes, compose decisions are based on openQA test results. I've verified that Greenwave does actually make compose decisions properly if you ask it for one via the API.
Ah, so, this earlier comment is what we need to do:
"I'd like to switch the Fedora deployments over to using new-style resultsdb.result.new fedmsgs instead of old-style taskotron.result.new fedmsgs"
However I actually don't know exactly how we can do that in the infra deployment, because it's an openshift app and the setting is in the fedmsg config file and I don't know how to do a fedmsg config file for an openshift app!
Basically greenwave ships a stock fedmsg config file (resultsdb.py) with this setting:
# Topic on which greenwave should listen for new resultsdb results.
resultsdb_topic_suffix='taskotron.result.new',
we need to change that to 'resultsdb.result.new' for the infra deployment. Anyone know the best way to go about doing that?
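To make the effect of that setting concrete: the suffix is only the tail of the full topic the consumer listens on. The sketch below assumes the full topic is built by joining the fedmsg `topic_prefix` and `environment` settings with the suffix, which matches the `org.fedoraproject.prod.taskotron.result.new` topic seen in the logs; the helper name and the example config values are assumptions.

```python
def full_topic(config):
    """Join prefix, environment and suffix into the subscribed topic."""
    return ".".join([
        config["topic_prefix"],
        config["environment"],
        config["resultsdb_topic_suffix"],
    ])


# Values inferred from the topic in the logs, not read from the real config.
stock = {
    "topic_prefix": "org.fedoraproject",
    "environment": "prod",
    "resultsdb_topic_suffix": "taskotron.result.new",
}
# Overriding only the suffix switches the consumer to new-style messages.
overridden = dict(stock, resultsdb_topic_suffix="resultsdb.result.new")
```

Under these assumptions, `full_topic(overridden)` gives `org.fedoraproject.prod.resultsdb.result.new`, so the consumer would stop reacting to the old-style taskotron topic.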
@adamwill I think the topic suffix can be overridden in configmap.yml template:
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: greenwave-fedmsg-configmap
data:
  greenwave.py: |-
    import socket
    config = dict(
        resultsdb_topic_suffix='resultsdb.result.new',
Agreed with @lholecek++ :)
Thanks a lot both of you! I'll try that.
@mohanboddu wanted me to tag him on this. It still seems that we are not getting any messages emitted ATM.
Here's the output from oc status:
In project greenwave on server https://os.fedoraproject.org:443

svc/fedmsg-consumers - 172.30.140.26:8081
  dc/greenwave-fedmsg-consumers deploys istag/greenwave:latest <-
    bc/greenwave-docker-build docker builds Dockerfile on istag/greenwave-upstream:latest (import scheduled)
    deployment #4 deployed 11 days ago - 1 pod
    deployment #3 deployed 2 months ago
    deployment #2 deployed 2 months ago

svc/greenwave-memcached - 172.30.202.28:11211
  dc/greenwave-memcached deploys registry.fedoraproject.org/f28/memcached:latest
    deployment #2 deployed 2 months ago - 1 pod
    deployment #1 deployed 4 months ago

https://greenwave.fedoraproject.org (redirects) to pod port web (svc/greenwave-web)
https://greenwave-web-greenwave.app.os.fedoraproject.org (redirects) to pod port web
  dc/greenwave-web deploys istag/greenwave:latest <-
    bc/greenwave-docker-build docker builds Dockerfile on istag/greenwave-upstream:latest (import scheduled)
    deployment #13 deployed 8 days ago - 2 pods
    deployment #12 deployed 11 days ago
    deployment #11 deployed 7 weeks ago

1 warning, 4 infos identified, use 'oc status --suggest' to see details.
Notice how the web pod was redeployed 8 days ago (relative times are annoying for pasting into tickets like this). That redeployment corresponds with this commit from you @adamwill:
https://infrastructure.fedoraproject.org/cgit/ansible.git/commit/?id=417020477db1568d3e403162228b5d9cb432597a
But notice that the fedmsg consumer pod was redeployed 11 days ago, before you made that change. I don't know why it didn't auto deploy, but I think a new rollout will pick up your config change. I'll try that.
OK, fresh deployment rolled out with oc rollout latest dc/greenwave-fedmsg-consumers. Now we have to wait for a fresh set of compose results to make their way into resultsdb to trigger a new event.
Looks like a Fedora-Rawhide-20190126.n.1 is in progress right now at https://apps.fedoraproject.org/datagrepper/raw?topic=org.fedoraproject.prod.pungi.compose.status.change&topic=org.fedoraproject.prod.openqa.job.done&rows_per_page=1&delta=127800
OK - looks like this is working again: https://apps.fedoraproject.org/datagrepper/raw?category=greenwave&delta=127800&rows_per_page=1
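The datagrepper query used for that check can be built programmatically, which is handy for repeating it after future deployments. This is a small sketch using only the parameters visible in the URL above; the helper names are hypothetical, and the `raw_messages` key in the response is an assumption about datagrepper's JSON layout.

```python
from urllib.parse import urlencode

DATAGREPPER_RAW = "https://apps.fedoraproject.org/datagrepper/raw"


def greenwave_query_url(delta=127800, rows_per_page=1):
    """Build the query for greenwave messages from the last `delta` seconds."""
    params = {
        "category": "greenwave",
        "delta": delta,
        "rows_per_page": rows_per_page,
    }
    return DATAGREPPER_RAW + "?" + urlencode(params)


def has_recent_messages(payload):
    """True if the decoded response lists at least one message.

    Assumes messages are returned under a "raw_messages" key.
    """
    return bool(payload.get("raw_messages"))
```

Fetching the URL and passing the decoded JSON to `has_recent_messages` would answer the same question as opening the link by hand.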
Optimistically closing this. :)
Metadata Update from @ralph: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)
yeah, it's definitely working. The "something" is a bug that can be fixed by upgrading whatever bit of infra is generating those messages to the fedmsg-meta-fedora-infrastructure updates I pushed to stable a few days ago...