#12829 [Pushing to Dist-git] Sending Fedora messages sometimes times out
Closed: Fixed 22 days ago by kevin. Opened 2 months ago by farchord.

Sometimes, when pushing to dist-git, the script that sends the fedora-messaging payload (Note: I am assuming here) times out.

~/FedoraWork/fedora/kde-updates-work/plasma/libksysguard ~/FedoraWork/fedora/kde-updates-work/plasma
[rawhide 45d84fc] 6.4.91
 3 files changed, 10 insertions(+), 3 deletions(-)
Énumération des objets: 9, fait.
Décompte des objets: 100% (9/9), fait.
Compression par delta en utilisant jusqu'à 24 fils d'exécution
Compression des objets: 100% (5/5), fait.
Écriture des objets: 100% (5/5), 746 octets | 746.00 Kio/s, fait.
Total 5 (delta 2), réutilisés 0 (delta 0), réutilisés du paquet 0 (depuis 0)
remote: Emitting a message to the fedora-messaging message bus.
remote: * Publishing information for 1 commits
remote: Traceback (most recent call last):
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 2262, in _wrap_pool_connect
remote:     return fn()
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 354, in connect
remote:     return _ConnectionFairy._checkout(self)
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 751, in _checkout
remote:     fairy = _ConnectionRecord.checkout(pool)
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 483, in checkout
remote:     rec = pool._do_get()
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/pool/impl.py", line 138, in _do_get
remote:     self._dec_overflow()
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
remote:     compat.reraise(exc_type, exc_value, exc_tb)
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/util/compat.py", line 129, in reraise
remote:     raise value
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/pool/impl.py", line 135, in _do_get
remote:     return self._create_connection()
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 299, in _create_connection
remote:     return _ConnectionRecord(self)
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 428, in __init__
remote:     self.__connect(first_connect_check=True)
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 630, in __connect
remote:     connection = pool._invoke_creator(self)
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect
remote:     return dialect.connect(*cargs, **cparams)
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 453, in connect
remote:     return self.dbapi.connect(*cargs, **cparams)
remote:   File "/usr/lib64/python3.6/site-packages/psycopg2/__init__.py", line 130, in connect
remote:     conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
remote: psycopg2.OperationalError: could not connect to server: Connection timed out
remote:         Is the server running on host "db01.rdu3.fedoraproject.org" (10.16.163.41) and accepting
remote:         TCP/IP connections on port 5432?
remote:
remote:
remote: The above exception was the direct cause of the following exception:
remote:
remote: Traceback (most recent call last):
remote:   File "/usr/lib/python3.6/site-packages/pagure/hooks/files/post-receive", line 48, in <module>
remote:     run_hook_file(hooktype)
remote:   File "/usr/lib/python3.6/site-packages/pagure/hooks/__init__.py", line 551, in run_hook_file
remote:     session, repo, user=username, namespace=namespace
remote:   File "/usr/lib/python3.6/site-packages/pagure/lib/query.py", line 2874, in _get_project
remote:     return query.one()
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3275, in one
remote:     ret = self.one_or_none()
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3244, in one_or_none
remote:     ret = list(self)
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3317, in __iter__
remote:     return self._execute_and_instances(context)
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3339, in _execute_and_instances
remote:     querycontext, self._connection_from_session, close_with_result=True
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3354, in _get_bind_args
remote:     mapper=self._bind_mapper(), clause=querycontext.statement, **kw
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3332, in _connection_from_session
remote:     conn = self.session.connection(**kw)
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 1123, in connection
remote:     execution_options=execution_options,
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 1129, in _connection_for_bind
remote:     engine, execution_options
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 430, in _connection_for_bind
remote:     conn = bind._contextual_connect()
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 2226, in _contextual_connect
remote:     self._wrap_pool_connect(self.pool.connect, None),
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 2266, in _wrap_pool_connect
remote:     e, dialect, self
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1536, in _handle_dbapi_exception_noconnection
remote:     util.raise_from_cause(sqlalchemy_exception, exc_info)
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/util/compat.py", line 383, in raise_from_cause
remote:     reraise(type(exception), exception, tb=exc_tb, cause=cause)
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/util/compat.py", line 128, in reraise
remote:     raise value.with_traceback(tb)
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 2262, in _wrap_pool_connect
remote:     return fn()
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 354, in connect
remote:     return _ConnectionFairy._checkout(self)
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 751, in _checkout
remote:     fairy = _ConnectionRecord.checkout(pool)
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 483, in checkout
remote:     rec = pool._do_get()
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/pool/impl.py", line 138, in _do_get
remote:     self._dec_overflow()
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
remote:     compat.reraise(exc_type, exc_value, exc_tb)
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/util/compat.py", line 129, in reraise
remote:     raise value
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/pool/impl.py", line 135, in _do_get
remote:     return self._create_connection()
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 299, in _create_connection
remote:     return _ConnectionRecord(self)
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 428, in __init__
remote:     self.__connect(first_connect_check=True)
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 630, in __connect
remote:     connection = pool._invoke_creator(self)
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect
remote:     return dialect.connect(*cargs, **cparams)
remote:   File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 453, in connect
remote:     return self.dbapi.connect(*cargs, **cparams)
remote:   File "/usr/lib64/python3.6/site-packages/psycopg2/__init__.py", line 130, in connect
remote:     conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
remote: sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not connect to server: Connection timed out
remote:         Is the server running on host "db01.rdu3.fedoraproject.org" (10.16.163.41) and accepting
remote:         TCP/IP connections on port 5432?
remote:
remote: (Background on this error at: http://sqlalche.me/e/e3q8)
To ssh://pkgs.fedoraproject.org/rpms/libksysguard
   836d9d0..45d84fc  rawhide -> rawhide

@abompard you have any ideas here? why does fedora-messaging even need to connect to the db? or is this something after the message?

db01 is getting some weird load spikes... but it's never hitting max connections (2000) so it should not ever be timing out.

Metadata Update from @james:
- Issue tagged with: medium-gain, medium-trouble

2 months ago

Metadata Update from @james:
- Issue assigned to abompard

2 months ago

Metadata Update from @james:
- Issue priority set to: Waiting on Assignee (was: Needs Review)

2 months ago

@abompard you have any ideas here? why does fedora-messaging even need to connect to the db? or is this something after the message?

No, this is before sending the message. Pagure gathers information about the project in pagure/hooks/__init__.py line 551 to build the data that will be sent with the message.

db01 is getting some weird load spikes... but it's never hitting max connections (2000) so it should not ever be timing out.

Yeah and it's probably unrelated to the connection pool timeout on the client side (that we changed when moving datacenters) because the connection pool has just been created by the hook on startup. Could this be a network thing ? (there was a MTU issue at some point in the new DC no?)

Ah ha.

Yep.

I wonder if this is related to the timeouts we are seeing from kojipkgs and pkgs01 from proxies... thats the same vlan crossing (prod vs build).

I had a similar issue in https://paste.sr.ht/~gotmax23/ae3f42c566407fb1420a3aa31c7d8165b50fbfcb, although in that case, the push seems to have been rejected completely. When I pushed it a second time, it worked.

I also got a different variation today. Git pushing would fail with the error that not all sources are uploaded, when they were. Trying again would work.

I also got a different variation today. Git pushing would fail with the error that not all sources are uploaded, when they were. Trying again would work.

That's probably an issue with the fedpkg git push hook and not an infra problem.

So, unfortunately I don't see this error logged on the server side, so it's hard to try and match it up to anything on the db server at the same time. ;(

Perhaps we can enable more debugging logs for hooks?

Or if any of you see this again, can you record the exact time it happened at?

Sure, I'm about to do quite a bit of git pushing in the next days so I can let you know!

Hey @farchord , any news on this?
Thanks! :-)

It hasn't happened in a while since you had those firewall changes done.

ok, lets go ahead and close this now... if anyone sees it feel free to re-open or file a new issue.

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

22 days ago

Log in to comment on this ticket.

Metadata