#8011 Expected resultsdb notification not found
Closed: Fixed 4 years ago by pingou. Opened 4 years ago by pingou.

Good Morning,

I've added a test.yml and a gating.yaml to fedora-gather-easyfix: https://src.fedoraproject.org/rpms/fedora-gather-easyfix/tree/master

I've built a new version with them and created the corresponding bodhi update: https://bodhi.fedoraproject.org/updates/FEDORA-2019-4ffedfe629
Bodhi does see two failed results, one of which is required.

However, looking at datagrepper: https://apps.fedoraproject.org/datagrepper/raw?category=greenwave the last message is 4 days old.

Do we have any info on whether the message was actually sent?
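
For anyone trying to reproduce this, the datagrepper raw feed takes `category` and `delta` query parameters, as in the links above. A minimal sketch for building such a query (the helper name is mine, not part of any Fedora tooling):

```python
# Build a datagrepper raw-feed query for one message category over a
# recent time window (delta is in seconds).
from urllib.parse import urlencode

DATAGREPPER = "https://apps.fedoraproject.org/datagrepper/raw"

def datagrepper_url(category, delta_seconds):
    """Return a raw-feed URL filtered by category and time window."""
    params = {"category": category, "delta": delta_seconds, "rows_per_page": 1}
    return f"{DATAGREPPER}?{urlencode(params)}"

# e.g. the most recent greenwave message from the last hour:
print(datagrepper_url("greenwave", 3600))
```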


I'm reporting this here in case more people can help debug this.
It sounds like a good test case for figuring out how to track "lost" messages in fedora-messaging :)

It seems there was something wrong with resultsdb: there were no messages yesterday from 09:22:56 until 18:02:38:
https://apps.fedoraproject.org/datagrepper/raw?category=resultsdb&delta=127800
The results for that element were all created around 15:20:
https://taskotron.fedoraproject.org/resultsdb_api/api/v2.0/results?item=fedora-gather-easyfix-0.1.1-17.fc30

Checking the Greenwave logs, I also see that Greenwave didn't get any messages from resultsdb during that period of time.
I would suggest checking whether there was a resultsdb outage or a UMB outage.
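
To make the timeline explicit: the results were created around 15:20, squarely inside the 09:22:56–18:02:38 window with no resultsdb messages. A trivial sketch of that containment check (naive, same-day times; the function name is mine):

```python
# Does a result's creation time fall inside the silent window?
from datetime import time

def created_during_gap(created, gap_start, gap_end):
    """True if the result was created while no messages were flowing."""
    return gap_start <= created <= gap_end

# Times taken from the datagrepper and resultsdb links above.
print(created_during_gap(time(15, 20), time(9, 22, 56), time(18, 2, 38)))
```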

So, I've done some more debugging of this.

I've created another:
- bodhi update: https://bodhi.fedoraproject.org/updates/FEDORA-2019-13b9ffe38a
- CI run: https://jenkins-continuous-infra.apps.ci.centos.org/blue/organizations/jenkins/fedora-f30-build-pipeline/detail/fedora-f30-build-pipeline/649/pipeline/
- our resultsdb instance in rdu got the info: http://resultsdb.ci.centos.org/resultsdb/results/416228
- our main resultsdb got the info as well: https://taskotron.fedoraproject.org/resultsdb/results/30940202 (via rdbsync)
- However, checking datagrepper, we do not have the corresponding message from resultsdb: https://apps.fedoraproject.org/datagrepper/raw?category=resultsdb&rows_per_page=1&delta=127800

So it seems something is off with resultsdb.
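
The cross-check above can be scripted. This is a sketch, not the actual debug script used here: the two endpoints are the ones linked in this ticket, but the JSON shapes (`"data"`, `"raw_messages"`, the message id fields) are assumptions based on the public APIs, and all function names are mine.

```python
# List result ids in resultsdb for an NVR, then see which of them
# never produced a message on datagrepper.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

RESULTSDB = "https://taskotron.fedoraproject.org/resultsdb_api/api/v2.0"
DATAGREPPER = "https://apps.fedoraproject.org/datagrepper/raw"

def fetch_json(base, params):
    """GET a JSON document from base?params."""
    with urlopen(f"{base}?{urlencode(params)}") as resp:
        return json.load(resp)

def missing_messages(result_ids, published_ids):
    """Result ids stored in resultsdb but never seen on the bus."""
    published = set(published_ids)
    return [i for i in result_ids if i not in published]

def check_item(item, delta=127800):
    """Cross-check resultsdb results against datagrepper messages."""
    results = fetch_json(f"{RESULTSDB}/results", {"item": item})
    ids = [r["id"] for r in results["data"]]
    msgs = fetch_json(DATAGREPPER, {"category": "resultsdb",
                                    "delta": delta, "rows_per_page": 100})
    published = [m.get("msg", {}).get("id") for m in msgs["raw_messages"]]
    return missing_messages(ids, published)
```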

Rephrasing the title now that we've narrowed down the issue.

Metadata Update from @pingou:
- Assignee reset

4 years ago

I have restarted apache on resultsdb01, as well as rdbsync, and I already see a message from CI in datagrepper: https://apps.fedoraproject.org/datagrepper/id?id=2019-54bb2f78-f2fa-41e9-8816-2f9a3d289b1a&is_raw=true&size=extra-large

Metadata Update from @pingou:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

4 years ago

Metadata Update from @cverna:
- Issue status updated to: Open (was: Closed)

4 years ago

I think that we should not close this until we have added proper monitoring for resultsdb. If we don't do it now, we will never do it :smile:

Ran my debug script today and the issue happened again (no messages from resultsdb). I've tried to add some more debugging to figure out what is going on, but something is definitely off there.

OK, finally found the reason for the failure in the logs: `ERROR - reasons: Channel is closed.`

We'll need some help to fix this
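
Until the root cause is understood, one mitigation worth considering is retrying the publish when a transient failure like this occurs. A generic sketch of such a wrapper — the exception class here is a placeholder standing in for fedora-messaging's `ConnectionException`, so that the snippet is self-contained, and the function names are mine:

```python
# Retry a publish callable on transient failures, with linear backoff.
import time

class TransientPublishError(Exception):
    """Placeholder for fedora_messaging.exceptions.ConnectionException."""

def publish_with_retry(publish, message, attempts=3, backoff=0.5):
    """Call publish(message); retry transient errors, re-raise the last one."""
    for attempt in range(attempts):
        try:
            return publish(message)
        except TransientPublishError:
            if attempt == attempts - 1:
                raise
            time.sleep(backoff * (attempt + 1))
```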

Ok, so there is something weird going on here:

This is what I see in the logs:

```
[Fri Jul 19 11:59:55.198875 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:58428] [2019-07-19 11:59:55,198] DEBUG in api_v2: Preparing to publish message for result id 30961751
[Fri Jul 19 11:59:55.223947 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:58428] [2019-07-19 11:59:55,223] DEBUG in api_v2: Preparing to publish Taskotron message for result id 30961751
[Fri Jul 19 12:00:25.511575 2019] [wsgi:error] [pid 764:tid 139635107587840] [remote ::1:58438] [2019-07-19 12:00:25,511] DEBUG in api_v2: Created new result for testcase org.centos.prod.ci.pipeline.allpackages-build.complete with outcome FAILED
[Fri Jul 19 12:00:25.515414 2019] [wsgi:error] [pid 764:tid 139635107587840] [remote ::1:58438] [2019-07-19 12:00:25,515] DEBUG in api_v2: Preparing to publish message for result id 30961752
[Fri Jul 19 12:00:25.531425 2019] [wsgi:error] [pid 764:tid 139635107587840] [remote ::1:58438] messaging    2019-07-19 12:00:25 ERROR   Error sending message daece1ce-8900-4828-b41e-be5af445fd78:
[Fri Jul 19 12:00:25.531869 2019] [wsgi:error] [pid 764:tid 139635107587840] [remote ::1:58438] messaging    2019-07-19 12:00:25 ERROR     ERROR:
[Fri Jul 19 12:00:25.532264 2019] [wsgi:error] [pid 764:tid 139635107587840] [remote ::1:58438] messaging    2019-07-19 12:00:25 ERROR     ERROR - reasons: Channel is closed.
```

However, there is in datagrepper a message about that result: https://apps.fedoraproject.org/datagrepper/id?id=2019-5109396d-0196-4b3a-9230-7363df36b4ac&is_raw=true&size=extra-large

@jcline if you have some time to help us figure out what's going on here, it would be nice :)

And for completeness, that result id is only listed twice in the logs:

```
# grep 30961751 /var/log/httpd/error_log
[Fri Jul 19 11:59:55.198875 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:58428] [2019-07-19 11:59:55,198] DEBUG in api_v2: Preparing to publish message for result id 30961751
[Fri Jul 19 11:59:55.223947 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:58428] [2019-07-19 11:59:55,223] DEBUG in api_v2: Preparing to publish Taskotron message for result id 30961751
```

It's ID 30961752 that failed; I cannot find any datagrepper entry for it.

Oh, shoot you're right!

So the error doesn't trigger every time. Could it be that it's sending too many messages, so that some go through and others don't? :(

Metadata Update from @pingou:
- Issue assigned to gnaponie

4 years ago

Just saw a new stacktrace in the logs:

```
[Fri Jul 19 12:50:05.123171 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544] [2019-07-19 12:50:05,122] DEBUG in api_v2: Preparing to publish message for result id 30962369
[Fri Jul 19 12:50:05.144310 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544] io_services_utils 2019-07-19 12:50:05 ERROR   _AsyncBaseTransport._consume() failed, aborting connection: error=SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2488)'); sock=<ssl.SSLSocket fd=26, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('10.5.124.207', 45620)>; Caller's stack:
[Fri Jul 19 12:50:05.144380 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544] Traceback (most recent call last):
[Fri Jul 19 12:50:05.144398 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]   File "/usr/lib/python3.7/site-packages/pika/adapters/utils/io_services_utils.py", line 1222, in _consume
[Fri Jul 19 12:50:05.144454 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]     super(_AsyncSSLTransport, self)._consume()
[Fri Jul 19 12:50:05.144466 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]   File "/usr/lib/python3.7/site-packages/pika/adapters/utils/io_services_utils.py", line 791, in _consume
[Fri Jul 19 12:50:05.144477 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]     data = self._sigint_safe_recv(self._sock, self._MAX_RECV_BYTES)
[Fri Jul 19 12:50:05.144488 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]   File "/usr/lib/python3.7/site-packages/pika/adapters/utils/io_services_utils.py", line 79, in retry_sigint_wrap
[Fri Jul 19 12:50:05.144552 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]     return func(*args, **kwargs)
[Fri Jul 19 12:50:05.144567 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]   File "/usr/lib/python3.7/site-packages/pika/adapters/utils/io_services_utils.py", line 846, in _sigint_safe_recv
[Fri Jul 19 12:50:05.144578 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]     return sock.recv(max_bytes)
[Fri Jul 19 12:50:05.144588 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]   File "/usr/lib64/python3.7/ssl.py", line 1037, in recv
[Fri Jul 19 12:50:05.144599 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]     return self.read(buflen)
[Fri Jul 19 12:50:05.144609 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]   File "/usr/lib64/python3.7/ssl.py", line 913, in read
[Fri Jul 19 12:50:05.144619 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]     return self._sslobj.read(len)
[Fri Jul 19 12:50:05.144629 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544] ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:2488)
[Fri Jul 19 12:50:05.144640 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544] Traceback (most recent call last):
[Fri Jul 19 12:50:05.144650 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]   File "/usr/lib/python3.7/site-packages/pika/adapters/utils/io_services_utils.py", line 1222, in _consume
[Fri Jul 19 12:50:05.144660 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]     super(_AsyncSSLTransport, self)._consume()
[Fri Jul 19 12:50:05.144669 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]   File "/usr/lib/python3.7/site-packages/pika/adapters/utils/io_services_utils.py", line 791, in _consume
[Fri Jul 19 12:50:05.144681 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]     data = self._sigint_safe_recv(self._sock, self._MAX_RECV_BYTES)
[Fri Jul 19 12:50:05.144691 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]   File "/usr/lib/python3.7/site-packages/pika/adapters/utils/io_services_utils.py", line 79, in retry_sigint_wrap
[Fri Jul 19 12:50:05.144721 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]     return func(*args, **kwargs)
[Fri Jul 19 12:50:05.144747 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]   File "/usr/lib/python3.7/site-packages/pika/adapters/utils/io_services_utils.py", line 846, in _sigint_safe_recv
[Fri Jul 19 12:50:05.144759 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]     return sock.recv(max_bytes)
[Fri Jul 19 12:50:05.144769 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]   File "/usr/lib64/python3.7/ssl.py", line 1037, in recv
[Fri Jul 19 12:50:05.144779 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]     return self.read(buflen)
[Fri Jul 19 12:50:05.144789 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]   File "/usr/lib64/python3.7/ssl.py", line 913, in read
[Fri Jul 19 12:50:05.144799 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]     return self._sslobj.read(len)
[Fri Jul 19 12:50:05.144822 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544] ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:2488)
[Fri Jul 19 12:50:05.145722 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544] diagnostic_utils 2019-07-19 12:50:05 ERROR   Wrapped func exited with exception. Caller's stack:
[Fri Jul 19 12:50:05.145804 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544] Traceback (most recent call last):
[Fri Jul 19 12:50:05.145822 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]   File "/usr/lib/python3.7/site-packages/pika/diagnostic_utils.py", line 53, in log_exception_func_wrap
[Fri Jul 19 12:50:05.145833 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]     return func(*args, **kwargs)
[Fri Jul 19 12:50:05.145843 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]   File "/usr/lib/python3.7/site-packages/pika/adapters/utils/io_services_utils.py", line 1222, in _consume
[Fri Jul 19 12:50:05.145854 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]     super(_AsyncSSLTransport, self)._consume()
[Fri Jul 19 12:50:05.145863 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]   File "/usr/lib/python3.7/site-packages/pika/adapters/utils/io_services_utils.py", line 791, in _consume
[Fri Jul 19 12:50:05.145874 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]     data = self._sigint_safe_recv(self._sock, self._MAX_RECV_BYTES)
[Fri Jul 19 12:50:05.145885 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]   File "/usr/lib/python3.7/site-packages/pika/adapters/utils/io_services_utils.py", line 79, in retry_sigint_wrap
[Fri Jul 19 12:50:05.145895 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]     return func(*args, **kwargs)
[Fri Jul 19 12:50:05.145905 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]   File "/usr/lib/python3.7/site-packages/pika/adapters/utils/io_services_utils.py", line 846, in _sigint_safe_recv
[Fri Jul 19 12:50:05.145915 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]     return sock.recv(max_bytes)
[Fri Jul 19 12:50:05.145924 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]   File "/usr/lib64/python3.7/ssl.py", line 1037, in recv
[Fri Jul 19 12:50:05.145935 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]     return self.read(buflen)
[Fri Jul 19 12:50:05.145944 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]   File "/usr/lib64/python3.7/ssl.py", line 913, in read
[Fri Jul 19 12:50:05.145954 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]     return self._sslobj.read(len)
[Fri Jul 19 12:50:05.145964 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544] ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:2488)
[Fri Jul 19 12:50:05.145975 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544] Traceback (most recent call last):
[Fri Jul 19 12:50:05.145985 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]   File "/usr/lib/python3.7/site-packages/pika/diagnostic_utils.py", line 53, in log_exception_func_wrap
[Fri Jul 19 12:50:05.145995 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]     return func(*args, **kwargs)
[Fri Jul 19 12:50:05.146005 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]   File "/usr/lib/python3.7/site-packages/pika/adapters/utils/io_services_utils.py", line 1222, in _consume
[Fri Jul 19 12:50:05.146016 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]     super(_AsyncSSLTransport, self)._consume()
[Fri Jul 19 12:50:05.146026 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]   File "/usr/lib/python3.7/site-packages/pika/adapters/utils/io_services_utils.py", line 791, in _consume
[Fri Jul 19 12:50:05.146069 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]     data = self._sigint_safe_recv(self._sock, self._MAX_RECV_BYTES)
[Fri Jul 19 12:50:05.146185 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]   File "/usr/lib/python3.7/site-packages/pika/adapters/utils/io_services_utils.py", line 79, in retry_sigint_wrap
[Fri Jul 19 12:50:05.146202 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]     return func(*args, **kwargs)
[Fri Jul 19 12:50:05.146212 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]   File "/usr/lib/python3.7/site-packages/pika/adapters/utils/io_services_utils.py", line 846, in _sigint_safe_recv
[Fri Jul 19 12:50:05.146222 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]     return sock.recv(max_bytes)
[Fri Jul 19 12:50:05.146232 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]   File "/usr/lib64/python3.7/ssl.py", line 1037, in recv
[Fri Jul 19 12:50:05.146242 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]     return self.read(buflen)
[Fri Jul 19 12:50:05.146251 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]   File "/usr/lib64/python3.7/ssl.py", line 913, in read
[Fri Jul 19 12:50:05.146261 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544]     return self._sslobj.read(len)
[Fri Jul 19 12:50:05.146270 2019] [wsgi:error] [pid 765:tid 139635065624320] [remote ::1:59544] ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:2488)
```

@abompard @jcline does this ring a bell for either of you?

Metadata Update from @pingou:
- Assignee reset

4 years ago

Note that this stacktrace doesn't show up every time; most often, all that is in the logs is `Channel is closed`.

Metadata Update from @pingou:
- Issue assigned to gnaponie

4 years ago

Metadata Update from @pingou:
- Assignee reset

4 years ago

Yes, this does ring a bell. Which version of fedora-messaging are you running?

```
# rpm -q python3-fedora-messaging
python3-fedora-messaging-1.6.0-1.fc29.noarch
```

I've upgraded to 1.7.1, let's see if this happens again :)

Thanks for the explanation. Server side, I no longer see the `Channel is closed` error.

Let's consider this fixed then, thanks! :)

Metadata Update from @pingou:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

4 years ago
