I observed this a couple of times, and this is my best guess (if that's not the cause, feel free to edit the issue title). Example: https://download.copr.fedorainfracloud.org/results/iucar/cran/fedora-31-x86_64/01492434-R-CRAN-stars/backend.log. At the time of writing, the task is stalled here:
...
[2020-06-20 12:10:41,326][ INFO][PID:11236] Sign done
[2020-06-20 12:10:41,326][ INFO][PID:11236] Incremental createrepo run, adding 01492434-R-CRAN-stars into https://download.copr.fedorainfracloud.org/results/iucar/cran/fedora-31-x86_64, (auto-create-repo=True)
It happened before with another small batch of around 15 packages: 10 of them got stalled, and sending another build unblocked them. I'm sending another build to unblock this one.
Thank you for the report, but from the logs I don't see anything suspicious that could cause this.
TBH, we don't use anything extraordinary for locking -- only one single lock -- and as long as there's no bug in the locking library, I don't think it's possible to deadlock there. The only problem is that oslo.concurrency doesn't support "fair" locking for external locks (multi-process, not multi-thread).
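To make the discussion concrete, here is a minimal sketch of what that single-lock serialization can look like with oslo.concurrency's lockutils; the helper name, directory, and lock path are illustrative assumptions, not taken from the Copr sources:

```python
# Minimal sketch of one shared inter-process lock per result directory
# (illustrative names and paths, not the actual copr-repo code).
from oslo_concurrency import lockutils

def regenerate_repo(repodir):
    # external=True requests a file-based lock that works across
    # processes; as noted above, "fair" ordering is not available for
    # external locks, so waiters are not served in FIFO order.
    with lockutils.lock(repodir, external=True,
                        lock_path="/var/lock/copr-backend"):
        run_createrepo(repodir)  # hypothetical helper
```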
Do you remember which build eventually unblocked the batch? Were there any other F31 builds (or createrepo requests) in the queue that could have locked the F31 chroot?
I will make the logging a bit more verbose, because this is all we got:
[2020-06-20 12:11:13,116][ INFO][PID:17034][/usr/bin/copr-repo.pid-17034][copr-repo:run_createrepo:179] searching for rpms: cd /var/lib/copr/public_html/results/iucar/cran/fedora-32-x86_64 && find 01492434-R-CRAN-stars -name "*.rpm"
[2020-06-20 12:11:13,121][ INFO][PID:17034][/usr/bin/copr-repo.pid-17034][copr-repo:run_createrepo:181] rpms: ['01492434-R-CRAN-stars/R-CRAN-stars-0.4.1-2.fc32.src.rpm', '01492434-R-CRAN-stars/R-CRAN-stars-0.4.1-2.fc32.noarch.rpm']
[2020-06-20 12:11:13,122][ INFO][PID:17034][/usr/bin/copr-repo.pid-17034][copr-repo:run_cmd:35] running: /usr/bin/createrepo_c --database --ignore-lock --local-sqlite --cachedir /tmp/ --workers 8 /var/lib/copr/public_html/results/iucar/cran/fedora-32-x86_64 --recycle-pkglist --update --skip-stat --pkglist /var/lib/copr/public_html/results/iucar/cran/fedora-32-x86_64/.copr-createrepo-pkglist
[2020-06-20 12:11:36,478][ INFO][PID:17034][/usr/bin/copr-repo.pid-17034][copr-repo:add_appdata:223] appstream-builder skipped, .disable-appstream file
[2020-06-20 12:11:36,479][ INFO][PID:17034][/usr/bin/copr-repo.pid-17034][copr-repo:main_locked:348] /usr/bin/copr-repo run successful
[2020-06-20 12:11:50,198][ INFO][PID:17845][/usr/bin/copr-repo.pid-17845][copr-repo:run_createrepo:179] searching for rpms: cd /var/lib/copr/public_html/results/iucar/cran/fedora-rawhide-x86_64 && find 01492434-R-CRAN-stars -name "*.rpm"
[2020-06-20 12:11:50,203][ INFO][PID:17845][/usr/bin/copr-repo.pid-17845][copr-repo:run_createrepo:181] rpms: ['01492434-R-CRAN-stars/R-CRAN-stars-0.4.1-2.fc33.noarch.rpm', '01492434-R-CRAN-stars/R-CRAN-stars-0.4.1-2.fc33.src.rpm']
[2020-06-20 12:11:50,204][ INFO][PID:17845][/usr/bin/copr-repo.pid-17845][copr-repo:run_cmd:35] running: /usr/bin/createrepo_c --database --ignore-lock --local-sqlite --cachedir /tmp/ --workers 8 /var/lib/copr/public_html/results/iucar/cran/fedora-rawhide-x86_64 --recycle-pkglist --update --skip-stat --pkglist /var/lib/copr/public_html/results/iucar/cran/fedora-rawhide-x86_64/.copr-createrepo-pkglist
[2020-06-20 12:12:18,016][ INFO][PID:17845][/usr/bin/copr-repo.pid-17845][copr-repo:add_appdata:223] appstream-builder skipped, .disable-appstream file
[2020-06-20 12:12:18,016][ INFO][PID:17845][/usr/bin/copr-repo.pid-17845][copr-repo:main_locked:348] /usr/bin/copr-repo run successful
[2020-06-20 12:27:31,515][ INFO][PID:15806][/usr/bin/copr-repo.pid-15806][copr-repo:run_createrepo:179] searching for rpms: cd /var/lib/copr/public_html/results/iucar/cran/fedora-31-x86_64 && find 01492434-R-CRAN-stars -name "*.rpm"
[2020-06-20 12:27:31,676][ INFO][PID:15806][/usr/bin/copr-repo.pid-15806][copr-repo:run_createrepo:181] rpms: ['01492434-R-CRAN-stars/R-CRAN-stars-0.4.1-2.fc31.noarch.rpm', '01492434-R-CRAN-stars/R-CRAN-stars-0.4.1-2.fc31.src.rpm']
[2020-06-20 12:27:31,677][ INFO][PID:15806][/usr/bin/copr-repo.pid-15806][copr-repo:run_cmd:35] running: /usr/bin/createrepo_c --database --ignore-lock --local-sqlite --cachedir /tmp/ --workers 8 /var/lib/copr/public_html/results/iucar/cran/fedora-31-x86_64 --recycle-pkglist --update --skip-stat --pkglist /var/lib/copr/public_html/results/iucar/cran/fedora-31-x86_64/.copr-createrepo-pkglist
[2020-06-20 12:27:58,071][ INFO][PID:15806][/usr/bin/copr-repo.pid-15806][copr-repo:add_appdata:223] appstream-builder skipped, .disable-appstream file
[2020-06-20 12:27:58,072][ INFO][PID:15806][/usr/bin/copr-repo.pid-15806][copr-repo:main_locked:348] /usr/bin/copr-repo run successful
[2020-06-20 12:27:31,969][ INFO][PID:56374][/usr/bin/copr-repo.pid-56374][copr-repo:run_cmd:35] running: /usr/bin/createrepo_c --database --ignore-lock --local-sqlite --cachedir /tmp/ --workers 8 /var/lib/copr/public_html/results/iucar/cran/fedora-32-x86_64
[2020-06-20 12:53:39,222][ INFO][PID:56374][/usr/bin/copr-repo.pid-56374][copr-repo:add_appdata:223] appstream-builder skipped, .disable-appstream file
[2020-06-20 12:53:39,222][ INFO][PID:56374][/usr/bin/copr-repo.pid-56374][copr-repo:main_locked:348] /usr/bin/copr-repo run successful
[2020-06-20 12:28:20,457][ INFO][PID:59099][/usr/bin/copr-repo.pid-59099][copr-repo:run_createrepo:179] searching for rpms: cd /var/lib/copr/public_html/results/iucar/cran/fedora-31-x86_64 && find 01492437-R-CRAN-stars -name "*.rpm"
[2020-06-20 12:28:20,463][ INFO][PID:59099][/usr/bin/copr-repo.pid-59099][copr-repo:run_createrepo:181] rpms: ['01492437-R-CRAN-stars/R-CRAN-stars-0.4.1-2.fc31.noarch.rpm', '01492437-R-CRAN-stars/R-CRAN-stars-0.4.1-2.fc31.src.rpm']
[2020-06-20 12:28:20,463][ INFO][PID:59099][/usr/bin/copr-repo.pid-59099][copr-repo:run_cmd:35] running: /usr/bin/createrepo_c --database --ignore-lock --local-sqlite --cachedir /tmp/ --workers 8 /var/lib/copr/public_html/results/iucar/cran/fedora-31-x86_64 --recycle-pkglist --update --skip-stat --pkglist /var/lib/copr/public_html/results/iucar/cran/fedora-31-x86_64/.copr-createrepo-pkglist
[2020-06-20 12:28:45,343][ INFO][PID:59099][/usr/bin/copr-repo.pid-59099][copr-repo:add_appdata:223] appstream-builder skipped, .disable-appstream file
[2020-06-20 12:28:45,344][ INFO][PID:59099][/usr/bin/copr-repo.pid-59099][copr-repo:main_locked:348] /usr/bin/copr-repo run successful
The weird thing is what happened at 2020-06-20 12:27:31,969. It seems that we did a full createrepo run (not an incremental one) at that time. Is that possible?
Hmmm, indeed, there was a full createrepo action starting at ~11:16:35:
# cat /var/log/copr-backend/actions.log-20200621.gz | gzip -d | grep PID:3969653
[2020-06-20 11:16:35,418][ INFO][PID:3969653][/usr/bin/copr-backend-process-action.managed.pid-3969653][copr-backend-process-action:handle_action:24] Handling action 500819
[2020-06-20 11:16:35,448][ INFO][PID:3969653][/usr/bin/copr-backend-process-action.managed.pid-3969653][copr-backend-process-action:handle_action:34] Executing: <Createrepo(Action): {'action_type': 3, 'created_on': 1592651779, 'data': '{"ownername": "iucar", "projectname": "cran", "project_dirnames": ["cran"], "chroots": ["fedora-rawhide-x86_64", "fedora-31-x86_64", "fedora-32-x86_64"]}', 'ended_on': None, 'id': 500819, 'message': None, 'new_value': None, 'object_id': 0, 'object_type': 'repository', 'old_value': None, 'priority': 0, 'result': 0}>
[2020-06-20 11:16:35,448][ INFO][PID:3969653][/usr/bin/copr-backend-process-action.managed.pid-3969653][actions.py:run:114] Action createrepo
[2020-06-20 11:16:35,449][ INFO][PID:3969653][/usr/bin/copr-backend-process-action.managed.pid-3969653][actions.py:run:127] Creating repo for: iucar/cran/fedora-rawhide-x86_64
[2020-06-20 12:06:27,630][ INFO][PID:3969653][/usr/bin/copr-backend-process-action.managed.pid-3969653][actions.py:run:127] Creating repo for: iucar/cran/fedora-31-x86_64
[2020-06-20 12:27:31,547][ INFO][PID:3969653][/usr/bin/copr-backend-process-action.managed.pid-3969653][actions.py:run:127] Creating repo for: iucar/cran/fedora-32-x86_64
[2020-06-20 12:53:39,285][ INFO][PID:3969653][/usr/bin/copr-backend-process-action.managed.pid-3969653][copr-backend-process-action:handle_task:47] Action 500819 ended with status=SUCCESS
The progress of a full createrepo is impossible to track from the user's perspective, but if the directory is under heavy build load (a lot of concurrent createrepo lock requests), it's likely that the action only acquires the lock late.
The F31 chroot seems to have acquired the lock at 12:06:27,630, which generated this:
[2020-06-20 12:06:28,026][ INFO][PID:4193308][/usr/bin/copr-repo.pid-4193308][copr-repo:run_cmd:35] running: /usr/bin/createrepo_c --database --ignore-lock --local-sqlite --cachedir /tmp/ --workers 8 /var/lib/copr/public_html/results/iucar/cran/fedora-31-x86_64
[2020-06-20 12:27:31,478][ INFO][PID:4193308][/usr/bin/copr-repo.pid-4193308][copr-repo:add_appdata:223] appstream-builder skipped, .disable-appstream file
[2020-06-20 12:27:31,479][ INFO][PID:4193308][/usr/bin/copr-repo.pid-4193308][copr-repo:main_locked:348] /usr/bin/copr-repo run successful
Note that the full (non-incremental) run takes 21 minutes, which is expected considering how large the directory is.
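For context, the two kinds of runs visible in the logs differ only in their createrepo_c flags; here is a small sketch of that difference (the flags are copied from the logs above, while the function names and the subprocess wrapping are illustrative, not the actual copr-repo code):

```python
# Sketch only: the two createrepo_c invocations seen in the logs above.
import subprocess

BASE_CMD = ["/usr/bin/createrepo_c", "--database", "--ignore-lock",
            "--local-sqlite", "--cachedir", "/tmp/", "--workers", "8"]

def full_createrepo(repodir):
    # Full run: re-reads every package in the result directory; on a
    # directory this large it took ~21 minutes.
    subprocess.check_call(BASE_CMD + [repodir])

def incremental_createrepo(repodir, pkglist):
    # Incremental run: recycles the existing metadata and skips stat()
    # calls, so only the newly added build is processed.
    subprocess.check_call(BASE_CMD + [repodir, "--recycle-pkglist",
                                      "--update", "--skip-stat",
                                      "--pkglist", pkglist])
```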
So, after a lot of detective work with the logs, the behavior now looks mostly expected (even though it is not very convenient, you have to take into account that a full createrepo is a bit expensive).
There are a few things that could be done better:
1. fair locking (those who request the lock first are processed first),
2. we could parallelize the "full createrepo" actions across chroots; now we do rawhide -> 31 -> 32 sequentially,
3. we should better log when we are "asking" for a lock and when we "get" it (see the sketch below),
4. I noticed we do batch.commit() twice (when in a lock), but this is rather a cosmetic issue,
5. several situations are worth logging (when we don't do batch createrepo - in case this ever happens; when we are entering the "process" phase - when we detect that no other process already did our task).
... but I don't think there's a concrete bug I could fix quickly.
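As a rough illustration of point 3, a sketch of logging both the lock request and the acquisition, so long waits become visible in the logs (the helper name and lock path are assumptions, not the Copr code):

```python
# Hedged sketch: log when we ask for the lock and when we get it.
import contextlib
import logging
import time

from oslo_concurrency import lockutils

LOG = logging.getLogger(__name__)

@contextlib.contextmanager
def logged_lock(repodir):
    LOG.info("asking for createrepo lock: %s", repodir)
    started = time.time()
    with lockutils.lock(repodir, external=True,
                        lock_path="/var/lock/copr-backend"):
        LOG.info("got createrepo lock: %s (waited %.1fs)",
                 repodir, time.time() - started)
        yield
```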
Metadata Update from @praiskup: - Issue tagged with: RFE
Sorry, we don't have that one anymore, because yesterday I cleaned up a lot of stuff and rebuilt other stuff.
Yes, I ran a full createrepo because the repository was in an inconsistent state. But between 11:15 and 12:00, even though I had no way to know its status, I assumed it had already completed. But then how do you explain that another build immediately released that one?
Anyway, I didn't experience this issue again during the evening.
> Yes, I ran a full createrepo because the repository was in an inconsistent state.
It would be awesome to have more data about such a situation next time. It should be possible to find out what happened and fix the bug causing it.
> But then how do you explain that another build immediately released that one?
I don't know about the first situation, but the second situation you mention here is just a coincidence. The full createrepo finished at 2020-06-20 12:27:31,479 (and released the lock), and the createrepo for the build you thought was stuck started at 2020-06-20 12:27:31,515 (acquired the lock). See the logs I cited above.
I'm glad to hear that. I'll keep this issue open since there's still some room for enhancements.
I spent half of the day investigating the fair-locking issue; the problem is that every single copr-repo run now goes through the locked part of the code (even when nothing needs to be done, because other createrepo processes already did it). And such processes may stay in the queue for a very long time ... because ... the locking is not fair.
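For illustration, the batching behavior described above looks roughly like this (a hypothetical sketch with an in-process lock standing in for the external file lock; this is not the actual copr-repo code):

```python
# Hypothetical sketch of batched createrepo requests: whoever obtains
# the lock processes everything pending, so later lock holders may
# find their task already done and return immediately.
import threading

PENDING = set()                # build dirs waiting for createrepo
PENDING_LOCK = threading.Lock()
REPO_LOCK = threading.Lock()   # stand-in for the external file lock

def request_createrepo(builddir):
    with PENDING_LOCK:
        PENDING.add(builddir)
    with REPO_LOCK:            # possibly a long, unfair wait
        with PENDING_LOCK:
            if builddir not in PENDING:
                # Someone else already processed our request while we
                # were waiting -- yet we still had to queue for the lock.
                return
            batch = set(PENDING)
            PENDING.clear()
        run_createrepo_for(batch)  # hypothetical helper
```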
Naive solution -- the oslo.concurrency library has a lock(..., fair=True) argument, but according to my tests it doesn't have the intended effect here (so perhaps a bug there).
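For completeness, a classic way to get FIFO ("fair") ordering across processes is a ticket lock. Here is a hypothetical sketch of the idea (not from the Copr sources; it omits crash recovery, timeouts, and cleanup):

```python
# Hypothetical ticket-lock sketch: whoever asks first is served first.
import os
import time

import fasteners  # file locks; also used underneath oslo.concurrency

def _read_counter(fname):
    # Counter files start at 0 when missing.
    if not os.path.exists(fname):
        return 0
    with open(fname) as f:
        return int(f.read())

class FairFileLock:
    def __init__(self, path):
        self.path = path
        # Small auxiliary lock protecting the counter files only.
        self.meta = fasteners.InterProcessLock(path + ".meta")

    def _bump(self, suffix):
        # Atomically increment a counter file, returning the old value.
        fname = self.path + suffix
        with self.meta:
            value = _read_counter(fname)
            with open(fname, "w") as f:
                f.write(str(value + 1))
        return value

    def acquire(self):
        ticket = self._bump(".ticket")        # take a number
        while True:                           # poll until it is served
            with self.meta:
                serving = _read_counter(self.path + ".serving")
            if serving == ticket:
                return
            time.sleep(0.1)

    def release(self):
        self._bump(".serving")                # admit the next waiter
```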
Hmpf, I hope this is a rather extreme example (finished createrepo runs still waiting for the lock for the given amount of time):
02724174-rubygem-fluent-plugin-tcp-socket - 0:37:52.220389
02724182-rubygem-fluent-plugin-throttle - 0:32:51.045287
02724189-rubygem-fluent-plugin-to-flat-json-parser - 0:27:46.696398
02724201-rubygem-fluent-plugin-uri-parser - 0:22:23.964268
02724202-rubygem-fluent-plugin-uri_decoder - 0:22:23.964269
02724210-rubygem-fluent-plugin-vmware-loginsight - 0:17:14.044029
02724217-rubygem-fluent-plugin-wendelin - 0:14:28.712867
02724219-rubygem-fluent-plugin-winevtlog - 0:14:28.712865
02724221-rubygem-fluent-plugin-xml-parser - 0:14:28.712889
02724223-rubygem-fluent-plugin-xml-simple-parser - 0:11:26.890327
02724225-rubygem-fluent-plugin-zabbix - 0:08:39.364831
02724229-rubygem-fluent-plugin-zmq-pub - 0:08:39.364828
02724230-rubygem-fluent-plugin-zookeeper - 0:08:39.364830
02724232-rubygem-fluent-plugin-zulip - 0:08:39.364838
02724237-rubygem-fluent_0.12-plugin-syslog-tls - 0:05:47.730548
Eventually it went as far as:
02724201-rubygem-fluent-plugin-uri-parser - 0:42:10.571702
02724219-rubygem-fluent-plugin-winevtlog - 0:34:15.320299
02724221-rubygem-fluent-plugin-xml-parser - 0:34:15.320323
02724225-rubygem-fluent-plugin-zabbix - 0:28:25.972265
02724243-rubygem-fluent_validation - 0:19:40.454876
02724277-rubygem-flumtter - 0:09:17.061234
02724279-rubygem-flunkey - 0:09:17.061248
02724299-rubygem-flutterwave - 0:02:08.113155
02724304-rubygem-fluxx - 0:02:08.113163
Commit 801e985 relates to this ticket
![Screenshot_20211001_135422.png](/copr/copr/issue/raw/files/47d18ae2032addd9591771c3f56251f131cf579fd7cfec2b51aa266343b80219-Screenshot_20211001_135422.png)
After the update today, the throughput was roughly doubled.
Wow, that's great news!
Ad the list of improvements above:
1. fair locking is not trivial to get, but we don't need it now;
2. is not directly related to this request, I would say, but I created #1939 for it;
3. we have fine-grained debug logging now;
4. fixed, I don't see that problem anymore;
5. we now properly log an info message when another createrepo run did our task, and also when no lock was needed to decide.
Metadata Update from @praiskup: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)