#259 texlive builds failing in Fedora koji with Genericerror
Closed: Dropped 4 years ago by dgregor. Opened 7 years ago by kevin.

Recent builds of texlive have completed, but then fail with:

GenericError: cannot update build 826095, state: COMPLETE

Examples:

texlive-2016-26.20160520.fc26
https://koji.fedoraproject.org/koji/taskinfo?taskID=17040296
GenericError: cannot update build 829112, state: COMPLETE

texlive-2016-25.20160520.fc26
https://koji.fedoraproject.org/koji/taskinfo?taskID=16915099
GenericError: cannot update build 826095, state: COMPLETE

texlive-2016-21.20160520.fc25
https://koji.fedoraproject.org/koji/taskinfo?taskID=17032753
GenericError: cannot update build 820861, state: COMPLETE

This might be due to koji-1.11.0 as all of these are after our update to that version.

Happy to gather more info or the like.


And the traceback from error_log:

'''
[Sat Dec 24 11:05:46.222510 2016] [:error] [pid 32399] 2016-12-24 11:05:46,222 [WARNING] m=host.completeBuild u=None p=3
2399 r=10.5.126.52:56444 koji.xmlrpc: Traceback (most recent call last):
[Sat Dec 24 11:05:46.222537 2016] [:error] [pid 32399] File "/usr/share/koji-hub/kojixmlrpc.py", line 234, in _wrap_ha
ndler
[Sat Dec 24 11:05:46.222541 2016] [:error] [pid 32399] response = handler(environ)
[Sat Dec 24 11:05:46.222544 2016] [:error] [pid 32399] File "/usr/share/koji-hub/kojixmlrpc.py", line 277, in handle_r
pc
[Sat Dec 24 11:05:46.222546 2016] [:error] [pid 32399] return self._dispatch(method, params)
[Sat Dec 24 11:05:46.222548 2016] [:error] [pid 32399] File "/usr/share/koji-hub/kojixmlrpc.py", line 301, in _dispatc
h
[Sat Dec 24 11:05:46.222551 2016] [:error] [pid 32399] self.check_session()
[Sat Dec 24 11:05:46.222553 2016] [:error] [pid 32399] File "/usr/share/koji-hub/kojixmlrpc.py", line 283, in check_se
ssion
[Sat Dec 24 11:05:46.222555 2016] [:error] [pid 32399] context.session = koji.auth.Session()
[Sat Dec 24 11:05:46.222557 2016] [:error] [pid 32399] File "/usr/lib/python2.7/site-packages/koji/auth.py", line 139,
in init
[Sat Dec 24 11:05:46.222560 2016] [:error] [pid 32399] % (callnum, method, id)
[Sat Dec 24 11:05:46.222562 2016] [:error] [pid 32399] RetryError: unable to retry call 27 (method host.completeBuild) f
or session 26092994
[Sat Dec 24 11:05:46.222564 2016] [:error] [pid 32399]
[Sat Dec 24 11:05:46.286967 2016] [:error] [pid 32399] 2016-12-24 11:05:46,286 [WARNING] m=host.completeBuild u=None p=32399 r=10.5.126.52:56444 koji.xmlrpc: Memory usage of process 32399 grew from 29768 KiB to 48432 KiB (+18664 KiB) processing request host.completeBuild with args (17040296, 829112, 'tasks/577/17040577/texlive-2016-26.20160520.fc26.src.rpm', ['tasks/577/17040577/texlive-ntheorem-...
[Sat Dec 24 11:05:46.333776 2016] [:error] [pid 31706] 2016-12-24 11:05:46,333 [WARNING] m=host.failBuild u=buildvm-05.phx2.fedoraproject.org p=31706 r=10.5.126.52:59200 koji.xmlrpc: Traceback (most recent call last):
[Sat Dec 24 11:05:46.333799 2016] [:error] [pid 31706] File "/usr/share/koji-hub/kojixmlrpc.py", line 234, in _wrap_handler
[Sat Dec 24 11:05:46.333802 2016] [:error] [pid 31706] response = handler(environ)
[Sat Dec 24 11:05:46.333828 2016] [:error] [pid 31706] File "/usr/share/koji-hub/kojixmlrpc.py", line 277, in handle_rpc
[Sat Dec 24 11:05:46.333831 2016] [:error] [pid 31706] return self._dispatch(method, params)
[Sat Dec 24 11:05:46.333833 2016] [:error] [pid 31706] File "/usr/share/koji-hub/kojixmlrpc.py", line 314, in _dispatch
[Sat Dec 24 11:05:46.333836 2016] [:error] [pid 31706] ret = koji.util.call_with_argcheck(func, params, opts)
[Sat Dec 24 11:05:46.333838 2016] [:error] [pid 31706] File "/usr/lib/python2.7/site-packages/koji/util.py", line 156, in call_with_argcheck
[Sat Dec 24 11:05:46.333840 2016] [:error] [pid 31706] return func(args, *kwargs)
[Sat Dec 24 11:05:46.333843 2016] [:error] [pid 31706] File "/usr/share/koji-hub/kojihub.py", line 11765, in failBuild
[Sat Dec 24 11:05:46.333845 2016] [:error] [pid 31706] (build_id, koji.BUILD_STATES[result['state']])
[Sat Dec 24 11:05:46.333847 2016] [:error] [pid 31706] GenericError: cannot update build 829112, state: COMPLETE
[Sat Dec 24 11:05:46.333849 2016] [:error] [pid 31706]
'''

Here are some relevant log entries from kojid.log on the builder.

2016-12-23 22:27:18,174 [INFO] koji: Try #1 for call 28 (host.completeBuild) failed: ResponseError()
2016-12-23 22:35:39,494 [INFO] koji: Try #2 for call 28 (host.completeBuild) failed: ResponseError()
2016-12-23 22:45:00,840 [INFO] koji: Try #3 for call 28 (host.completeBuild) failed: ResponseError()
2016-12-23 22:54:22,228 [INFO] koji: Try #4 for call 28 (host.completeBuild) failed: ResponseError()
2016-12-23 23:03:43,953 [INFO] koji: Try #5 for call 28 (host.completeBuild) failed: ResponseError()
2016-12-24 11:04:44,423 [INFO] koji: Try #6 for call 28 (host.completeBuild) failed: HTTPSConnectionPool(host='koji.fedoraproject.org', port=443): Read timed out. (read timeout=43200)

That last try is about a minute before the final error. Note that kojid was initially calling completeBuild at first. Also, that last try went for a very long time.

It looks to me like that final completeBuild try is succeeding on the server side, but the client is not getting the response. After 12 hours the connection times out, and kojid goes forward with try number 7. That one fails with the RetryError we see in your paste (because the call finished, it can no longer be retried). From the perspective of kojid, the completeBuild call failed, so it attempts to fail the build. That fails with the error that we see.

This looks like something is broken between kojid and koji-hub. Maybe something with the proxies you guys have set up.

Thanks for looking at this!

There's no proxy at all between that builder and the hub. It's on the same net in the same datacenter. ;(

I see those errors are around 10min apart. Could we be hitting some 10min timeout when it's trying to upload? Keep in mind the texlive src.rpm is about 2.1GB.

I see those errors are around 10min apart. Could we be hitting some 10min timeout when it's trying to upload? Keep in mind the texlive src.rpm is about 2.1GB.

At the point this is failing, the upload has already happened (subtask did that). The big hit here (with the completeBuild call) is the sheer number of db inserts from all those rpms.

I'm not sure what's causing those initial ResponseError exceptions.

@puiterwijk correctly points out that we do now have a proxy between builders and hub. They use a pair of proxies via apache->haproxy->hub... so perhaps the problem is indeed there.

:light_rail:

@kevin, @mikem: there is a proxy between the builders and the hub nowadays.

So, looking at the source of completeBuild, it checks all kinds of RPMs (of which there are lots with texlive) and the SRPM (which is massive with texlive) and imports them into the database.
That, together with the fact that it does import everything, leads me to believe that the hub is hitting the haproxy timeout.
Haproxy will wait for up to 30 or 60 seconds, but if the server has not yet sent a single byte back, deems the backend as broken and thus aborts the request.

There are three fixes here:
1. Have Koji do this import asynchronously and the task owner checks back every so often.
2. The hub sends back the HTTP status code ("HTTP/1.1 200 OK") so haproxy knows it got it and is working.
3. Increase the timeout on haproxy to some new arbitrary timeout.

For now I have increased the haproxy response timeout to two hours (http://infrastructure.fedoraproject.org/cgit/ansible.git/commit/?id=842af31).
Our default is 500 seconds (slightly over 8 minutes.. sounds about right together with the backoff with the times from before).

I still think that Koji should probably fix this with method 1 or 2 in the long term, as I do agree with haproxy that a request that doesn't even get an acknowledgement for more than 8 minutes is... quite abnormal for the HTTP protocol.

2 The hub sends back the HTTP status code ("HTTP/1.1 200 OK") so haproxy knows it got it and is working.

After the hub has sent this status code back, it can then go ahead and process, as the haproxy timeout will no longer be triggered.
This timeout is purely a check to see if the server is alive, and if the server has acknowledged the request it will no longer timeout, and just wait.

So, the haproxy timeout was the first to be hit. After that, there's ServerTimeout in Apache's mod_proxy.
After we bumped that too to two hours, it worked, though there were just two minutes to spare before it would've been killed as well.
Which means that the completeBuild call takes two hours to complete, which in my opinion is a definite reason to do that call asynchronously.

Linking related PR: #261 "Add IMPORTING state"

Metadata Update from @dgregor:
- Custom field Size adjusted to None
- Issue close_status updated to: Dropped
- Issue status updated to: Closed (was: Open)

4 years ago

Login to comment on this ticket.

Metadata