#10230 koji and src.fedoraproject.org are slow today
Closed: Fixed 2 years ago by kevin. Opened 2 years ago by mvadkert.

Something seems to be going on. Our monitoring is reporting very slow responses today on:

[TFT] Blackbox probe of tft-blackbox https://koji.fedoraproject.org failed   is firing (critical) 
[TFT] Blackbox probe of tft-blackbox https://kojipkgs.fedoraproject.org failed   is firing (critical) 
[TFT] Blackbox probe of tft-blackbox https://src.fedoraproject.org/rpms/setup/

Could we be on the verge of an outage?
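
For anyone who wants to reproduce the slowness outside of our monitoring, here is a minimal sketch of a timing probe against the three URLs from the alerts. It is not the TFT blackbox check itself; the `probe` and `main` names, the 10-second timeout, and the output format are assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request

# URLs copied from the alerts above.
URLS = [
    "https://koji.fedoraproject.org",
    "https://kojipkgs.fedoraproject.org",
    "https://src.fedoraproject.org/rpms/setup/",
]


def probe(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return (ok, seconds, detail) for a single GET of url.

    The timeout value is an assumption; `opener` is injectable so the
    function can be exercised without network access.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            resp.read(1)  # make sure at least one byte of the body arrives
            ok, detail = True, "HTTP %d" % resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        ok, detail = False, "HTTP %d" % exc.code
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, or timeout.
        ok, detail = False, repr(exc)
    return ok, time.monotonic() - start, detail


def main():
    """Probe each URL once and print one line per result."""
    for url in URLS:
        ok, seconds, detail = probe(url)
        status = "ok" if ok else "FAIL"
        print("%-45s %-4s %6.2fs  %s" % (url, status, seconds, detail))
```

Calling `main()` a few times in a row should show whether the failures are intermittent or constant, which is useful to note in the ticket.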


Even pulling from dist git fails with:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/urllib3/connection.py", line 162, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw)
  File "/usr/lib/python3.6/site-packages/urllib3/util/connection.py", line 80, in create_connection
    raise err
  File "/usr/lib/python3.6/site-packages/urllib3/util/connection.py", line 70, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 343, in _make_request
    self._validate_conn(conn)
  File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 839, in _validate_conn
    conn.connect()
  File "/usr/lib/python3.6/site-packages/urllib3/connection.py", line 315, in connect
    conn = self._new_conn()
  File "/usr/lib/python3.6/site-packages/urllib3/connection.py", line 171, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7f258fe72828>: Failed to establish a new connection: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python3.6/site-packages/urllib3/util/retry.py", line 399, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='src.fedoraproject.org', port=443): Max retries exceeded with url: /pv/ssh/checkaccess/ (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f258fe72828>: Failed to establish a new connection: [Errno 110] Connection timed out',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/libexec/pagure/aclchecker.py", line 73, in <module>
    resp = requests.post(url, data=data, headers=headers)
  File "/usr/lib/python3.6/site-packages/requests/api.py", line 116, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/usr/lib/python3.6/site-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python3.6/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3.6/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3.6/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='src.fedoraproject.org', port=443): Max retries exceeded with url: /pv/ssh/checkaccess/ (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f258fe72828>: Failed to establish a new connection: [Errno 110] Connection timed out',))
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

Note: That appears to be a server-side traceback as I am not running anything on Python 3.6 locally.

yep, this is getting more pressing now ...

So, this was largely caused by proxy01 and proxy10 (our main iad2 proxies). They would simply stop getting any connections. A restart of httpd seemed to help for a short time, but did not really fix anything.

I updated/rebooted them now and they seem much more stable.

I am thinking this was an odd kernel networking bug somehow stalling incoming connections. ;(

I'm going to leave this ticket open for a while longer to make sure things are stable, and I'm going to try to dig through the logs some more for any root cause hints.

Metadata Update from @mohanboddu:
- Issue assigned to kevin
- Issue tagged with: high-gain, high-trouble, ops

2 years ago

Visiting koji.fedoraproject.org and "fedpkg clone"ing from pkgs.fedoraproject.org are again timing out for me :(

Interestingly enough, traceroute doesn't seem to be able to get a response at all, but if I turn on ICMP mode (traceroute -I koji.fedoraproject.org), I'm getting responses (way fast and reliable) from proxy-iad02. Maybe this helps pinpoint the issue? (Or maybe ICMP messages just get routed to a different proxy that's still working...)
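
Since the failure in the traceback above is a TCP connect timeout on port 443 rather than anything ICMP-related, a TCP handshake test may be closer to what git and browsers actually experience. A rough sketch, assuming port 443 and a 5-second timeout (the `tcp_connect_time` helper is made up for this comment):

```python
import socket
import time


def tcp_connect_time(host, port=443, timeout=5.0):
    """Return the seconds taken to complete a TCP handshake, or None on failure.

    The default port and timeout are assumptions, not values from the ticket.
    """
    start = time.monotonic()
    try:
        # create_connection resolves the host and performs the handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers timeouts, refused connections, and DNS failures.
        return None
```

For example, `tcp_connect_time("koji.fedoraproject.org")` returning `None` while the ICMP traceroute answers quickly would suggest the host is reachable but not accepting connections, which would fit the stalled-proxy theory.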

OK, I hope it's stable now...

Can everyone retry and confirm if it's back to normal?

I think it's good now. Though I'm still sometimes getting HTTP 500 errors from PDC when pushing things to git.

LGTM, no issues spotted on our CI systems consuming all koji builds

I hope this is solved. Let's keep our fingers crossed. :)

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 years ago

Metadata
Boards 1
ops Status: Done