#7111 greenwave not emitting fedmsgs in prod
Closed: Fixed 5 years ago by ralph. Opened 5 years ago by kevin.

We have not seen any fedmsgs from greenwave in prod for the last 16 days:

https://apps.fedoraproject.org/datagrepper/raw?category=greenwave

This is roughly the time it was last upgraded.

@giulia @dcallagh and/or @lholecek can you look and see if you can see any reason here?


Maybe redeploying "greenwave-fedmsg-consumers" pod will help.

The pod uses policy files (under "/etc/greenwave") to know when to emit fedmsg messages -- looks like it has not been restarted in 18 days, though the "web" pod was restarted 2 days ago.

BTW, we have yet to figure out how to redeploy pods automatically when files in any ConfigMap they use are modified.
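
One common way to get that behaviour is to store a checksum of the ConfigMap contents as an annotation on the pod template, so that any config change also changes the template and can trigger a new rollout. The following is only a minimal sketch of that idea, not what the Fedora playbooks do; the annotation key `checksum/config` and the helper names are hypothetical, and it assumes the deployment is available as a plain dict.

```python
import copy
import hashlib
import json


def configmap_checksum(data):
    """Return a stable SHA-256 hex digest of a ConfigMap's data mapping."""
    # sort_keys makes the digest independent of key order.
    canonical = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def with_config_checksum(deployment, configmap_data):
    """Return a copy of the deployment with the checksum on the pod template."""
    updated = copy.deepcopy(deployment)
    template_meta = updated["spec"]["template"].setdefault("metadata", {})
    annotations = template_meta.setdefault("annotations", {})
    # Hypothetical annotation key; any key works as long as its value
    # changes whenever the ConfigMap contents change.
    annotations["checksum/config"] = configmap_checksum(configmap_data)
    return updated
```

Because the annotation lives in the pod template, a changed policy file would produce a different template, which a config-change trigger can act on.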

CC @gnaponie

I redeployed it, but it didn't seem to help. Looking at the logs I don't see anything too dire looking:

[2018-07-23 16:36:24][greenwave.consumers.resultsdb   DEBUG] Considering subject bodhi_update: 'FEDORA-2018-47d2ad9eaf'
[2018-07-23 16:36:24][greenwave.consumers.resultsdb   DEBUG] No cache value found for 'greenwave.resources:retrieve_results|bodhi_update FEDORA-2018-47d2ad9eaf'
[2018-07-23 16:36:24][greenwave.consumers.resultsdb   DEBUG] messaging: found 0 applicable policies of 7 for testcase 'update.server_cockpit_default'
[2018-07-23 16:36:24][greenwave.consumers.resultsdb   DEBUG] messaging: found 0 decision contexts
[2018-07-23 16:36:24][moksha.hub   DEBUG] 'ResultsDBHandler' thread 140212573837056 | Going back to waiting on the incoming queue.  Message handled: True

Looks like the problem was that Greenwave expected a different message format for OpenQA compose test results.

Fix: https://pagure.io/greenwave/pull-request/262

So, I am confused... we just got some greenwave fedmsgs:

greenwave.decision.update greenwave says NO-GO on "something" for "rawhide_compose_sync_to_mirrors" (fedora-rawhide) JSON   32 minutes ago - 2018-07-25 19:42:13 UTC
greenwave.decision.update greenwave says NO-GO on "something" for "rawhide_compose_sync_to_mirrors" (fedora-rawhide) JSON   35 minutes ago - 2018-07-25 19:38:57 UTC
greenwave.decision.update greenwave says NO-GO on "something" for "rawhide_compose_sync_to_mirrors" (fedora-rawhide) JSON   36 minutes ago - 2018-07-25 19:38:32 UTC
greenwave.decision.update greenwave says NO-GO on Fedora-AtomicHost-Vagrant-28_Update-20180703.1403.x86_64.vagrant-libvirt.box for "rawhide_compose_sync_to_mirrors" (fedora-rawhide) JSON  22 days ago - 2018-07-03 17:53:50 UTC

But nothing restarted that I can see or rebuilt. How did it start working?

Oh, were those manually sent shell ones?

We released a new version of Greenwave yesterday. I didn't notice any logs for "compose" test results so I couldn't check whether it's working.

Sorry for not notifying you earlier (yesterday was too busy).

Apart from datagrepper showing "something" instead of ID, does everything work correctly?

Yeah, I guess it's working... will reopen if I see anything further.

Thanks!

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

5 years ago

We got some alerts today:

Service: Check datanommer for recent greenwave messages
Host: busgateway01.phx2.fedoraproject.org
Info: CRIT: datanommer has not seen a greenwave message in 2 days, 28 minutes, 54 seconds
Source: noc01
Date: Sun Nov 11 11:16:53 UTC 2018

Perhaps the new greenwave version has an issue with fedmsg sending?
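
For reference, the kind of freshness check behind this alert can be sketched as below. This is not the actual nagios plugin; it assumes a parsed datagrepper-style response with a `raw_messages` list whose entries carry an epoch `timestamp`, and the two-day threshold is a guess based on the alert text.

```python
from datetime import timedelta

OK, CRITICAL = 0, 2  # conventional Nagios plugin exit codes


def check_freshness(response, now, max_age=timedelta(days=2)):
    """Return (status, message) based on the newest message timestamp.

    `response` is assumed to look like {"raw_messages": [{"timestamp": ...}]},
    and `now` is the current time as seconds since the epoch.
    """
    timestamps = [m["timestamp"] for m in response.get("raw_messages", [])]
    if not timestamps:
        return CRITICAL, "CRIT: no greenwave messages found"
    age = timedelta(seconds=now - max(timestamps))
    if age > max_age:
        return CRITICAL, f"CRIT: last greenwave message seen {age} ago"
    return OK, f"OK: last greenwave message seen {age} ago"
```

A check like this cannot tell "Greenwave is broken" apart from "no decision has changed", which matters for the discussion that follows.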

Metadata Update from @kevin:
- Issue status updated to: Open (was: Closed)

5 years ago

From the part of the logs I can access in OpenShift, I don't see that greenwave-fedmsg-consumers pod tries to send any messages. It cannot find any applicable policies for the received test result changes.

Metadata Update from @ralph:
- Issue assigned to ralph

5 years ago

Hm. Is greenwave doing the right thing here (finally) but we need to adjust nagios' expectations?

Hypothesis: we disabled all gating requirements back in June... so no greenwave decisions should actually be changing, therefore it has nothing to announce on the bus.

Metadata Update from @ralph:
- Assignee reset

5 years ago

That hypothesis seems off, to me. It should still be making decisions for the policies with empty rules, just they should always be 'go'...and it should have new decisions any time a new package / update shows up.

Also, the openqa_important_stuff_for_rawhide policy is enabled and does have rules, in production as well as stable, so it should definitely be spitting out results for that. (Staging is happily sending out decisions for that policy - utterly wrong ones, admittedly, I'm working on fixing that, but it's doing it. :>)

/cc @lucarval can someone in your crew have a look here?

Not sure if I understood the issue here: Greenwave is not sending messages. Is that right?
I've checked the logs for the greenwave-fedmsg-consumers pod. And I see two strange things in my opinion:
(1) Greenwave never finds applicable policies; every time there's "messaging: found 0 applicable policies of 9 for testcase $TESTCASE", so it will never emit a new message. That is strange... I would try updating Greenwave to the latest version, since we recently (after 0.9.9, which is the current version in Fedora prod) refactored the code that finds matching policies a bit.
I've noticed that stage is already updated to the latest version, but I'm afraid I don't have access to the logs there.
(2) I see also this in the logs and I'm not sure if it is important:
No routing policy defined for "org.fedoraproject.prod.taskotron.result.new" but routing_nitpicky is False so the message is being treated as authorized.

If you would like to update Greenwave in prod, I could do that today or on Monday (unless someone wants to do it before Monday).

"Not sure if I understood the issue here: Greenwave is not sending messages. Is that right?"

Yup. It's not emitting fedmsgs when decisions are updated. It works fine in other ways (you can query it for a decision), but it's not sending out fedmsgs.

"I would try updating Greenwave to the latest version, since we recently (after 0.9.9, which is the current version in Fedora prod) refactored the code that finds matching policies a bit.
I've noticed that stage is already updated to the latest version, but I'm afraid I don't have access to the logs there."

Staging is sending out messages just fine. I think it's working OK in other ways too, but we should probably check @bowlofeggs is OK with updating prod to 0.9.11 before doing it, as Bodhi's use of greenwave is currently the most important use. We're not using the fedmsgs for anything really important yet, so it's less important to fix the fedmsgs than it is to make sure Bodhi keeps working. I think Bodhi should be fine with 0.9.11, though.

"(2) I see also this in the logs and I'm not sure if it is important:
No routing policy defined for "org.fedoraproject.prod.taskotron.result.new" but routing_nitpicky is False so the message is being treated as authorized."

It shouldn't matter. It's just informational. As it says, it's treating the message as authorized (i.e. it's using it).
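
To spell out what that log line means, here is a simplified model of the decision it describes. It is a sketch of the behaviour implied by the message, not fedmsg's actual implementation; the function name and the signer names in the test data are made up.

```python
def is_authorized(topic, signer, routing_policy, routing_nitpicky=False):
    """Decide whether a message on `topic` signed by `signer` is accepted.

    `routing_policy` maps a topic to the list of signers allowed to send it.
    """
    if topic in routing_policy:
        # A policy exists for this topic: only listed signers are accepted.
        return signer in routing_policy[topic]
    # No policy for this topic: accept unless nitpicky mode is enabled.
    return not routing_nitpicky
```

With no entry for "org.fedoraproject.prod.taskotron.result.new" and routing_nitpicky set to False, the message is accepted, which is exactly what the log reports.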

"If you would like to update Greenwave in prod, I could do that today or on Monday (unless someone wants to do it before Monday)."

Sounds good to me if @bowlofeggs is OK with it :)

Oh, also, if possible I'd like to have my two recent PRs included if we're updating prod - https://pagure.io/greenwave/pull-request/349 and https://pagure.io/greenwave/pull-request/348 . I'd like to switch the Fedora deployments over to using new-style resultsdb.result.new fedmsgs instead of old-style taskotron.result.new fedmsgs, as that should allow them to emit accurate compose decision fedmsgs - but that'll go better with those two changes.

On Thu, 2018-12-06 at 20:20 +0000, Adam Williamson wrote:

I think it's working OK in other ways too, but we should probably
check @bowlofeggs is OK with updating prod to 0.9.11 before doing it,
as Bodhi's use of greenwave is currently the most important use.
We're not using the fedmsgs for anything really important yet, so
it's less important to fix the fedmsgs than it is to make sure Bodhi
keeps working. I think Bodhi should be fine with 0.9.11, though.

I'm OK with updating it as long as it doesn't have any known backwards
incompatible changes.

@adamwill your PRs got merged, so they will be included :)
@bowlofeggs Ok. I'll check just to be sure that nothing will break.

I think I might have time today to release it. I'll send an email to infrastructure@lists.fedoraproject.org before starting anyhow.

@bowlofeggs The only thing that changed in a "backwards incompatible" way is this:
* https://pagure.io/greenwave/pull-request/337

Besides that, we just did some refactoring and added a new feature (new subject_type redhat-module) that won't change Greenwave's behaviour.
The other change is Adam's, but it appears to me that it is what we expect: https://pagure.io/greenwave/pull-request/348#

I'll write the release notes and then start the release, if no one is against that.

On Mon, 2018-12-10 at 13:25 +0000, Giulia Naponiello wrote:

@bowlofeggs The only thing that changed in a "backwards incompatible"
way is this:
* https://pagure.io/greenwave/pull-request/337

The patches that the Factory 2 team wrote for Bodhi use the summary.
Please do not deploy this change to production. We need to either
revert that change or adjust Bodhi, or better yet, provide a formal way
for Bodhi to get the information it's getting via the summary.

@bowlofeggs ack. I didn't start and I won't.
Let's consider how to do that in the best way.

Wait.

Does Bodhi depend on that letter not being capitalized?

I see. It's a different string though. That one wasn't changed in the greenwave PR.

Agreed, it's a different string. We noticed that one, but we didn't find the changed one in Bodhi. Is that there? If yes, let's revert the change, or find a smarter way to check that.

@bowlofeggs could you please confirm that that string is not used in Bodhi? I don't want to release something that will break it. If it is used, let's revert the change or find another solution, so that we can update Greenwave.

Confirmed - that string is not used in Bodhi.

So where are we here? Can a new greenwave that emits fedmsgs be deployed now? (or alternately the check we have in nagios adjusted)

Greenwave release is scheduled for Monday 14th at 3pm UTC.

The new Greenwave version was released successfully. Please let me know if issues are still present. Thank you

I don't see any decisions more recent than two months old at https://apps.fedoraproject.org/datagrepper/raw?category=greenwave , but there have been Rawhide composes the last two days for which decisions should have been emitted...

Ok, I've checked the logs and for Rawhide composes, one message that I see (for example) is this one: https://apps.fedoraproject.org/datagrepper/id?id=2019-5e28f91f-e5e9-4cea-8c21-51b345163cec&is_raw=true&size=extra-large
with testcase name == "org.centos.prod.ci.pipeline.allpackages-build.package.test.functional.complete"

but checking the policies configured here: https://greenwave-web-greenwave.app.os.fedoraproject.org/api/v1.0/policies
I don't see any policy with that testcase. So my guess is that the policies should be updated...
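
The manual check described above could be scripted roughly as follows. This is a sketch only: it assumes the policies endpoint returns a JSON object with a `policies` list in which each policy has an `id` and a `rules` list with `test_case_name` entries, and the helper name is hypothetical.

```python
def policies_mentioning(policies_response, testcase):
    """Return the ids of policies that have a rule naming `testcase`."""
    matches = []
    for policy in policies_response.get("policies", []):
        # Collect every test case name referenced by this policy's rules.
        names = {rule.get("test_case_name") for rule in policy.get("rules", [])}
        if testcase in names:
            matches.append(policy.get("id"))
    return matches
```

An empty result for a given testcase would be consistent with the "found 0 applicable policies" lines in the consumer log.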

Anyhow, I see another issue: in that message I don't see any "type" information (rawhide or something similar), nor "original_spec_nvr" or any of the other fields handled by Greenwave.
So Greenwave just doesn't detect the type and goes back to listening for other messages. It seems that if we had another check there it was quite an old change. Maybe @ralph remembers this use case?

That's at least for the rawhide messages I see in the logs.

I see some applicable policies found, but there wasn't a decision change message because the decision didn't change. An example:


[2019-01-17 07:59:17][greenwave.policies DEBUG] found 2 applicable policies of 9 for: {'subject_type': 'koji_build', 'subject_identifier': 'adobe-source-han-sans-jp-fonts-2.000-1.fc29', 'testcase': 'dist.python-versions.unversioned_shebangs', 'product_version': 'fedora-29'}
[2019-01-17 07:59:17][greenwave.policies DEBUG] found 2 decision contexts
[2019-01-17 07:59:17][greenwave.consumers.resultsdb DEBUG] Skipped emitting fedmsg, decision did not change: {'applicable_policies': ['taskotron_release_critical_tasks_for_stable', 'atomic_ci_pipeline_results_stable'], 'policies_satisfied': True, 'satisfied_requirements': [], 'summary': 'no tests are required', 'unsatisfied_requirements': []}
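
The behaviour visible in that last log line, publishing only when the decision differs from the previous one, can be sketched like this. It is an illustration of the pattern rather than Greenwave's code; `maybe_publish` and the `publish` callback are hypothetical names.

```python
def maybe_publish(old_decision, new_decision, publish):
    """Call `publish` with the new decision only if it differs from the old one.

    Returns True when a message was published, False when it was skipped.
    """
    if old_decision == new_decision:
        # Nothing changed, so there is nothing to announce on the bus.
        return False
    publish(new_decision)
    return True


# Example using the decision from the log above.
decision = {
    "policies_satisfied": True,
    "satisfied_requirements": [],
    "summary": "no tests are required",
    "unsatisfied_requirements": [],
}
sent = []
maybe_publish(decision, dict(decision), sent.append)  # skipped, `sent` stays empty
```

This is why policies with empty rules produce no decision-change messages for results on subjects that were already decided: the decision is 'go' both before and after.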

I've been focusing on compose decisions rather than package decisions. I was expecting to see compose decision fedmsgs when Rawhide composes complete. Let me dig back through this stuff and double check that we don't need to change anything in ansible, though. The Pipeline does nothing interesting for composes, compose decisions are based on openQA test results. I've verified that Greenwave does actually make compose decisions properly if you ask it for one via the API.

Ah, so, this earlier comment is what we need to do:

"I'd like to switch the Fedora deployments over to using new-style resultsdb.result.new fedmsgs instead of old-style taskotron.result.new fedmsgs"

However I actually don't know exactly how we can do that in the infra deployment, because it's an openshift app and the setting is in the fedmsg config file and I don't know how to do a fedmsg config file for an openshift app!

Basically greenwave ships a stock fedmsg config file (resultsdb.py) with this setting:

# Topic on which greenwave should listen for new resultsdb results.
resultsdb_topic_suffix='taskotron.result.new',

we need to change that to 'resultsdb.result.new' for the infra deployment. Anyone know the best way to go about doing that?

@adamwill I think the topic suffix can be overridden in configmap.yml template:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: greenwave-fedmsg-configmap
data:
  greenwave.py: |-
    import socket
    config = dict(
        resultsdb_topic_suffix='resultsdb.result.new',
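
To illustrate why the suffix matters: the full topic a consumer listens on is the suffix joined onto a prefix and an environment, as in the "org.fedoraproject.prod.taskotron.result.new" topic quoted earlier. A minimal sketch, with a hypothetical helper name and the prefix and environment values taken from that topic:

```python
def full_topic(topic_prefix, environment, suffix):
    """Join prefix, environment and suffix into a full fedmsg topic."""
    return ".".join([topic_prefix, environment, suffix])


# Old-style topic the stock config listens on.
old = full_topic("org.fedoraproject", "prod", "taskotron.result.new")
# New-style topic after overriding resultsdb_topic_suffix.
new = full_topic("org.fedoraproject", "prod", "resultsdb.result.new")
```

A consumer subscribed to the old topic never sees messages published on the new one, so the override has to be in place (and the pod redeployed) before new-style results can trigger decisions.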

Thanks a lot both of you! I'll try that.

@mohanboddu wanted me to tag him on this. It still seems that we are not getting any messages emitted ATM.

Here's the output from oc status:

In project greenwave on server https://os.fedoraproject.org:443

svc/fedmsg-consumers - 172.30.140.26:8081
  dc/greenwave-fedmsg-consumers deploys istag/greenwave:latest <-
    bc/greenwave-docker-build docker builds Dockerfile on istag/greenwave-upstream:latest (import scheduled) 
    deployment #4 deployed 11 days ago - 1 pod
    deployment #3 deployed 2 months ago
    deployment #2 deployed 2 months ago

svc/greenwave-memcached - 172.30.202.28:11211
  dc/greenwave-memcached deploys registry.fedoraproject.org/f28/memcached:latest 
    deployment #2 deployed 2 months ago - 1 pod
    deployment #1 deployed 4 months ago

https://greenwave.fedoraproject.org (redirects) to pod port web (svc/greenwave-web)
https://greenwave-web-greenwave.app.os.fedoraproject.org (redirects) to pod port web
  dc/greenwave-web deploys istag/greenwave:latest <-
    bc/greenwave-docker-build docker builds Dockerfile on istag/greenwave-upstream:latest (import scheduled) 
    deployment #13 deployed 8 days ago - 2 pods
    deployment #12 deployed 11 days ago
    deployment #11 deployed 7 weeks ago


1 warning, 4 infos identified, use 'oc status --suggest' to see details.

Notice how the web pod was redeployed 8 days ago (relative times are annoying for pasting into tickets like this). That redeployment corresponds with this commit from you @adamwill:

https://infrastructure.fedoraproject.org/cgit/ansible.git/commit/?id=417020477db1568d3e403162228b5d9cb432597a

But notice that the fedmsg consumer pod was redeployed 11 days ago, before you made that change. I don't know why it didn't auto deploy, but I think a new rollout will pick up your config change. I'll try that.

OK, fresh deployment rolled out with oc rollout latest dc/greenwave-fedmsg-consumers. Now we have to wait for a fresh set of compose results to make their way into resultsdb to trigger a new event.

Metadata Update from @ralph:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

5 years ago

yeah, it's definitely working. The "something" is a bug that can be fixed by upgrading whatever bit of infra is generating those messages to the fedmsg-meta-fedora-infrastructure updates I pushed to stable a few days ago...
