#1423 Createrepo locking is not "fair"
Closed: Fixed 3 years ago by praiskup. Opened 4 years ago by iucar.

I observed this a couple of times, and this is my best guess (if that's not the cause, feel free to edit the issue title). Example: https://download.copr.fedorainfracloud.org/results/iucar/cran/fedora-31-x86_64/01492434-R-CRAN-stars/backend.log. At the time of writing, this task is stalled here:

...
[2020-06-20 12:10:41,326][  INFO][PID:11236] Sign done
[2020-06-20 12:10:41,326][  INFO][PID:11236] Incremental createrepo run, adding 01492434-R-CRAN-stars into https://download.copr.fedorainfracloud.org/results/iucar/cran/fedora-31-x86_64, (auto-create-repo=True)

It happened before with another small batch of around 15 packages: 10 of them got stalled, and sending another build unblocked them. I'm sending another build to unblock this one.


Thank you for the report, but from the logs I don't see anything suspicious
that could cause this.

TBH, we don't use anything extraordinary for locking -- only one single
lock -- and as long as there's no bug in the locking library, I don't
think it's possible to deadlock there. The only problem is that
oslo.concurrency doesn't support "fair" locking for external locks
(multi-process, not multi-thread).
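
For context, the external locking is conceptually just this -- a minimal
sketch using oslo.concurrency, where the lock name, the lock_path and the
callable are illustrative rather than the actual copr-repo code:

from oslo_concurrency import lockutils

def run_locked(repodir, do_createrepo):
    """Run the given createrepo callable under a per-directory external lock."""
    # external=True means a file-based lock shared by every copr-repo
    # process touching the same chroot directory
    lockname = repodir.replace("/", "_")
    with lockutils.lock(lockname, external=True,
                        lock_path="/var/lock/copr-backend"):
        do_createrepo(repodir)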

Do you remember what build unblocked the batch eventually? Were there any
other F31 builds (or createrepo requests) in queue that could lock the f31
chroot?

I will make the logging a bit more verbose, because this is all we got:

[2020-06-20 12:11:13,116][  INFO][PID:17034][/usr/bin/copr-repo.pid-17034][copr-repo:run_createrepo:179] searching for rpms: cd /var/lib/copr/public_html/results/iucar/cran/fedora-32-x86_64 && find 01492434-R-CRAN-stars -name "*.rpm"
[2020-06-20 12:11:13,121][  INFO][PID:17034][/usr/bin/copr-repo.pid-17034][copr-repo:run_createrepo:181] rpms: ['01492434-R-CRAN-stars/R-CRAN-stars-0.4.1-2.fc32.src.rpm', '01492434-R-CRAN-stars/R-CRAN-stars-0.4.1-2.fc32.noarch.rpm']
[2020-06-20 12:11:13,122][  INFO][PID:17034][/usr/bin/copr-repo.pid-17034][copr-repo:run_cmd:35] running: /usr/bin/createrepo_c --database --ignore-lock --local-sqlite --cachedir /tmp/ --workers 8 /var/lib/copr/public_html/results/iucar/cran/fedora-32-x86_64 --recycle-pkglist --update --skip-stat --pkglist /var/lib/copr/public_html/results/iucar/cran/fedora-32-x86_64/.copr-createrepo-pkglist
[2020-06-20 12:11:36,478][  INFO][PID:17034][/usr/bin/copr-repo.pid-17034][copr-repo:add_appdata:223] appstream-builder skipped, .disable-appstream file
[2020-06-20 12:11:36,479][  INFO][PID:17034][/usr/bin/copr-repo.pid-17034][copr-repo:main_locked:348] /usr/bin/copr-repo run successful

[2020-06-20 12:11:50,198][  INFO][PID:17845][/usr/bin/copr-repo.pid-17845][copr-repo:run_createrepo:179] searching for rpms: cd /var/lib/copr/public_html/results/iucar/cran/fedora-rawhide-x86_64 && find 01492434-R-CRAN-stars -name "*.rpm"
[2020-06-20 12:11:50,203][  INFO][PID:17845][/usr/bin/copr-repo.pid-17845][copr-repo:run_createrepo:181] rpms: ['01492434-R-CRAN-stars/R-CRAN-stars-0.4.1-2.fc33.noarch.rpm', '01492434-R-CRAN-stars/R-CRAN-stars-0.4.1-2.fc33.src.rpm']
[2020-06-20 12:11:50,204][  INFO][PID:17845][/usr/bin/copr-repo.pid-17845][copr-repo:run_cmd:35] running: /usr/bin/createrepo_c --database --ignore-lock --local-sqlite --cachedir /tmp/ --workers 8 /var/lib/copr/public_html/results/iucar/cran/fedora-rawhide-x86_64 --recycle-pkglist --update --skip-stat --pkglist /var/lib/copr/public_html/results/iucar/cran/fedora-rawhide-x86_64/.copr-createrepo-pkglist 
[2020-06-20 12:12:18,016][  INFO][PID:17845][/usr/bin/copr-repo.pid-17845][copr-repo:add_appdata:223] appstream-builder skipped, .disable-appstream file
[2020-06-20 12:12:18,016][  INFO][PID:17845][/usr/bin/copr-repo.pid-17845][copr-repo:main_locked:348] /usr/bin/copr-repo run successful

[2020-06-20 12:27:31,515][  INFO][PID:15806][/usr/bin/copr-repo.pid-15806][copr-repo:run_createrepo:179] searching for rpms: cd /var/lib/copr/public_html/results/iucar/cran/fedora-31-x86_64 && find 01492434-R-CRAN-stars -name "*.rpm"
[2020-06-20 12:27:31,676][  INFO][PID:15806][/usr/bin/copr-repo.pid-15806][copr-repo:run_createrepo:181] rpms: ['01492434-R-CRAN-stars/R-CRAN-stars-0.4.1-2.fc31.noarch.rpm', '01492434-R-CRAN-stars/R-CRAN-stars-0.4.1-2.fc31.src.rpm']
[2020-06-20 12:27:31,677][  INFO][PID:15806][/usr/bin/copr-repo.pid-15806][copr-repo:run_cmd:35] running: /usr/bin/createrepo_c --database --ignore-lock --local-sqlite --cachedir /tmp/ --workers 8 /var/lib/copr/public_html/results/iucar/cran/fedora-31-x86_64 --recycle-pkglist --update --skip-stat --pkglist /var/lib/copr/public_html/results/iucar/cran/fedora-31-x86_64/.copr-createrepo-pkglist
[2020-06-20 12:27:58,071][  INFO][PID:15806][/usr/bin/copr-repo.pid-15806][copr-repo:add_appdata:223] appstream-builder skipped, .disable-appstream file
[2020-06-20 12:27:58,072][  INFO][PID:15806][/usr/bin/copr-repo.pid-15806][copr-repo:main_locked:348] /usr/bin/copr-repo run successful

[2020-06-20 12:27:31,969][  INFO][PID:56374][/usr/bin/copr-repo.pid-56374][copr-repo:run_cmd:35] running: /usr/bin/createrepo_c --database --ignore-lock --local-sqlite --cachedir /tmp/ --workers 8 /var/lib/copr/public_html/results/iucar/cran/fedora-32-x86_64
[2020-06-20 12:53:39,222][  INFO][PID:56374][/usr/bin/copr-repo.pid-56374][copr-repo:add_appdata:223] appstream-builder skipped, .disable-appstream file
[2020-06-20 12:53:39,222][  INFO][PID:56374][/usr/bin/copr-repo.pid-56374][copr-repo:main_locked:348] /usr/bin/copr-repo run successful

[2020-06-20 12:28:20,457][  INFO][PID:59099][/usr/bin/copr-repo.pid-59099][copr-repo:run_createrepo:179] searching for rpms: cd /var/lib/copr/public_html/results/iucar/cran/fedora-31-x86_64 && find 01492437-R-CRAN-stars -name "*.rpm"
[2020-06-20 12:28:20,463][  INFO][PID:59099][/usr/bin/copr-repo.pid-59099][copr-repo:run_createrepo:181] rpms: ['01492437-R-CRAN-stars/R-CRAN-stars-0.4.1-2.fc31.noarch.rpm', '01492437-R-CRAN-stars/R-CRAN-stars-0.4.1-2.fc31.src.rpm']
[2020-06-20 12:28:20,463][  INFO][PID:59099][/usr/bin/copr-repo.pid-59099][copr-repo:run_cmd:35] running: /usr/bin/createrepo_c --database --ignore-lock --local-sqlite --cachedir /tmp/ --workers 8 /var/lib/copr/public_html/results/iucar/cran/fedora-31-x86_64 --recycle-pkglist --update --skip-stat --pkglist /var/lib/copr/public_html/results/iucar/cran/fedora-31-x86_64/.copr-createrepo-pkglist
[2020-06-20 12:28:45,343][  INFO][PID:59099][/usr/bin/copr-repo.pid-59099][copr-repo:add_appdata:223] appstream-builder skipped, .disable-appstream file
[2020-06-20 12:28:45,344][  INFO][PID:59099][/usr/bin/copr-repo.pid-59099][copr-repo:main_locked:348] /usr/bin/copr-repo run successful

The weird thing is what happened at 2020-06-20 12:27:31,969. It seems we did a
full createrepo run there (not an incremental one). Is that possible?
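
For reference, the difference between the incremental runs and the full run
above is only in the pkglist-related options -- roughly this, assembled from
the flags visible in the logs (the helper function itself is illustrative):

def createrepo_cmd(repodir, pkglist=None):
    # base options exactly as seen in the copr-repo logs above
    cmd = ["/usr/bin/createrepo_c", "--database", "--ignore-lock",
           "--local-sqlite", "--cachedir", "/tmp/", "--workers", "8", repodir]
    if pkglist:
        # incremental mode: keep the existing metadata and package list,
        # and add only the newly listed rpms
        cmd += ["--recycle-pkglist", "--update", "--skip-stat",
                "--pkglist", pkglist]
    return cmd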

Hmmm, indeed, there was a full createrepo action starting at ~11:16:35:

# cat /var/log/copr-backend/actions.log-20200621.gz | gzip -d | grep PID:3969653
[2020-06-20 11:16:35,418][  INFO][PID:3969653][/usr/bin/copr-backend-process-action.managed.pid-3969653][copr-backend-process-action:handle_action:24] Handling action 500819
[2020-06-20 11:16:35,448][  INFO][PID:3969653][/usr/bin/copr-backend-process-action.managed.pid-3969653][copr-backend-process-action:handle_action:34] Executing: <Createrepo(Action): {'action_type': 3, 'created_on': 1592651779, 'data': '{"ownername": "iucar", "projectname": "cran", "project_dirnames": ["cran"], "chroots": ["fedora-rawhide-x86_64", "fedora-31-x86_64", "fedora-32-x86_64"]}', 'ended_on': None, 'id': 500819, 'message': None, 'new_value': None, 'object_id': 0, 'object_type': 'repository', 'old_value': None, 'priority': 0, 'result': 0}>
[2020-06-20 11:16:35,448][  INFO][PID:3969653][/usr/bin/copr-backend-process-action.managed.pid-3969653][actions.py:run:114] Action createrepo
[2020-06-20 11:16:35,449][  INFO][PID:3969653][/usr/bin/copr-backend-process-action.managed.pid-3969653][actions.py:run:127] Creating repo for: iucar/cran/fedora-rawhide-x86_64
[2020-06-20 12:06:27,630][  INFO][PID:3969653][/usr/bin/copr-backend-process-action.managed.pid-3969653][actions.py:run:127] Creating repo for: iucar/cran/fedora-31-x86_64
[2020-06-20 12:27:31,547][  INFO][PID:3969653][/usr/bin/copr-backend-process-action.managed.pid-3969653][actions.py:run:127] Creating repo for: iucar/cran/fedora-32-x86_64
[2020-06-20 12:53:39,285][  INFO][PID:3969653][/usr/bin/copr-backend-process-action.managed.pid-3969653][copr-backend-process-action:handle_task:47] Action 500819 ended with status=SUCCESS

The progress of a full createrepo is impossible to track from the user's
perspective, but if the directory is under heavy build load (a lot of concurrent
createrepo lock requests) it's likely that the action acquires the lock only later.

The F31 chroot seems to have acquired the lock at 12:06:27,630, which generated this:

[2020-06-20 12:06:28,026][  INFO][PID:4193308][/usr/bin/copr-repo.pid-4193308][copr-repo:run_cmd:35] running: /usr/bin/createrepo_c --database --ignore-lock --local-sqlite --cachedir /tmp/ --workers 8 /var/lib/copr/public_html/results/iucar/cran/fedora-31-x86_64
[2020-06-20 12:27:31,478][  INFO][PID:4193308][/usr/bin/copr-repo.pid-4193308][copr-repo:add_appdata:223] appstream-builder skipped, .disable-appstream file
[2020-06-20 12:27:31,479][  INFO][PID:4193308][/usr/bin/copr-repo.pid-4193308][copr-repo:main_locked:348] /usr/bin/copr-repo run successful

Note that the full (non-incremental) run takes 21 minutes, which is expected
considering how large the directory is.

So, after a lot of detective work with the logs, the behavior is now mostly as
expected (even though it is not very convenient; you have to account for the
fact that a full createrepo is a bit expensive).

There are a few things that could be done better:

1. fair locking (those who request the lock first are processed first); one
   possible approach is sketched after this list,
2. we could parallelize the "full createrepo" actions across chroots; currently
   we do rawhide -> 31 -> 32,
3. we should better log when we are "asking" for a lock and when we "get" it,
4. I noticed we call batch.commit() twice (while holding the lock), but this is
   rather a cosmetic issue,
5. several situations are worth logging (when we skip the batch createrepo --
   in case that ever happens -- and when we enter the "process" phase after
   detecting that no other process did our task for us).
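
Ad 1), for illustration, one way to get FIFO ordering across processes is a
simple ticket scheme on top of a normal file lock -- a minimal sketch only,
not the copr-backend implementation (the paths are made up, and it ignores
cleanup of tickets left behind by crashed processes):

import contextlib
import fcntl
import os
import time
import uuid

TICKET_DIR = "/var/lock/copr-backend/tickets"   # illustrative location

@contextlib.contextmanager
def fair_lock(lockfile):
    """FIFO lock: every waiter registers a ticket file, the oldest ticket wins."""
    os.makedirs(TICKET_DIR, exist_ok=True)
    # zero-padded timestamp + uuid gives a unique, time-ordered ticket name
    ticket = "%020.6f-%s" % (time.time(), uuid.uuid4())
    open(os.path.join(TICKET_DIR, ticket), "w").close()
    try:
        # wait until we are the oldest waiter in the queue
        while min(os.listdir(TICKET_DIR)) != ticket:
            time.sleep(0.1)
        with open(lockfile, "w") as fd:
            fcntl.flock(fd, fcntl.LOCK_EX)  # still serialize via a real lock
            try:
                yield
            finally:
                fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.unlink(os.path.join(TICKET_DIR, ticket))

# usage:
#   with fair_lock("/var/lock/copr-backend/fedora-31-x86_64.lock"):
#       run_createrepo(...)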

... but I don't think there's a concrete bug here that I could respond to quickly.

Metadata Update from @praiskup:
- Issue tagged with: RFE

4 years ago

Do you remember what build unblocked the batch eventually? Were there any
other F31 builds (or createrepo requests) in queue that could lock the f31
chroot?

Sorry, we don't have that one anymore, because yesterday I cleaned up a lot of stuff and rebuilt other things.

Yes, I ran a full createrepo because the repo was in an inconsistent state. But between 11:15 and 12:00, even though I had no way to know the status, I thought it had already completed. But then how do you explain that another build immediately released the stuck one?

Anyway, I didn't experience this issue again during the evening.

Yes, I ran a full createrepo because the repo was in an inconsistent state.

It would be awesome to have more data about such a situation next time.
It should be possible to find out what happened and fix the bug causing it.

But then how do you explain that another build immediately released the stuck one?

I don't know about the first situation, but the second one you mention here is
just a coincidence. The full createrepo finished at 2020-06-20 12:27:31,479
(and released the lock), and the createrepo for the build you thought was stuck
started at 2020-06-20 12:27:31,515 (it acquired the lock). See the logs I cited above.

Anyway, I didn't experience this issue again during the evening.

I'm glad to hear that. I'll keep this issue open since there's still some room
for enhancements.

I spent half a day analyzing the fair-locking issue; the problem is that
every single copr-repo run now goes through the locked part of the code
(even when nothing needs to be done because other createrepo processes already
did it). And such processes may stay in the queue for a very long time ...
because ... the locking is not fair.

Naive solution -- the oslo.concurrency lib has a lock(..., fair=True) argument,
but according to my tests it has no effect here (so perhaps there's a bug there).
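
A multi-process test of the fair=True flag can look roughly like this (a sketch,
not the actual reproducer; the lock name and lock_path are made up, and whether
fair=True is honored together with external=True depends on the oslo.concurrency
version). With truly fair external locking, the workers would acquire the lock
in the order they requested it:

import multiprocessing
import time
from oslo_concurrency import lockutils

def worker(i):
    time.sleep(i * 0.2)                    # stagger the lock requests
    with lockutils.lock("fair-demo", external=True, fair=True,
                        lock_path="/tmp/fair-lock-test"):
        print("worker %d got the lock" % i)
        time.sleep(1)                      # hold it for a while

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=worker, args=(i,))
             for i in range(5)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()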

Hmpf, I hope this is a rather extreme example (finished builds whose createrepo
is still waiting for the lock for the given amount of time):

02724174-rubygem-fluent-plugin-tcp-socket - 0:37:52.220389
02724182-rubygem-fluent-plugin-throttle - 0:32:51.045287
02724189-rubygem-fluent-plugin-to-flat-json-parser - 0:27:46.696398
02724201-rubygem-fluent-plugin-uri-parser - 0:22:23.964268
02724202-rubygem-fluent-plugin-uri_decoder - 0:22:23.964269
02724210-rubygem-fluent-plugin-vmware-loginsight - 0:17:14.044029
02724217-rubygem-fluent-plugin-wendelin - 0:14:28.712867
02724219-rubygem-fluent-plugin-winevtlog - 0:14:28.712865
02724221-rubygem-fluent-plugin-xml-parser - 0:14:28.712889
02724223-rubygem-fluent-plugin-xml-simple-parser - 0:11:26.890327
02724225-rubygem-fluent-plugin-zabbix - 0:08:39.364831
02724229-rubygem-fluent-plugin-zmq-pub - 0:08:39.364828
02724230-rubygem-fluent-plugin-zookeeper - 0:08:39.364830
02724232-rubygem-fluent-plugin-zulip - 0:08:39.364838
02724237-rubygem-fluent_0.12-plugin-syslog-tls - 0:05:47.730548

Eventually it went as far as:

02724201-rubygem-fluent-plugin-uri-parser - 0:42:10.571702
02724219-rubygem-fluent-plugin-winevtlog - 0:34:15.320299
02724221-rubygem-fluent-plugin-xml-parser - 0:34:15.320323
02724225-rubygem-fluent-plugin-zabbix - 0:28:25.972265
02724243-rubygem-fluent_validation - 0:19:40.454876
02724277-rubygem-flumtter - 0:09:17.061234
02724279-rubygem-flunkey - 0:09:17.061248
02724299-rubygem-flutterwave - 0:02:08.113155
02724304-rubygem-fluxx - 0:02:08.113163

Commit 801e985 relates to this ticket

Wow, that's great news!

1. fair locking (those who request the lock first are processed first),
2. we could parallelize the "full createrepo" actions across chroots; currently
   we do rawhide -> 31 -> 32,
3. we should better log when we are "asking" for a lock and when we "get" it,
4. I noticed we call batch.commit() twice (while holding the lock), but this is
   rather a cosmetic issue,
5. several situations are worth logging (when we skip the batch createrepo --
   in case that ever happens -- and when we enter the "process" phase after
   detecting that no other process did our task for us).

Ad 1) fair locking is not trivial to implement, but we don't need it now.
2) this is not directly related to this request, I would say, but I created
#1939 for it.
3) we have fine-grained debug logging now.
4) fixed, I don't see that problem anymore.
5) we now log a proper info message when another createrepo did our task, and
also when no lock was needed to decide -- roughly the pattern sketched below.
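
Ad 5), for illustration, the pattern is roughly the following; this is a sketch
with made-up helper names and a hypothetical pkglist check, not the actual
copr-repo code:

import logging
import os
from oslo_concurrency import lockutils

log = logging.getLogger("copr-repo-sketch")

def already_processed(directory, subdir):
    # hypothetical check: did another copr-repo run already add this build
    # directory into the repo metadata (approximated here via the pkglist file)?
    pkglist = os.path.join(directory, ".copr-createrepo-pkglist")
    return os.path.exists(pkglist) and subdir in open(pkglist).read()

def add_build(directory, subdir, run_createrepo):
    log.debug("asking for lock on %s", directory)
    with lockutils.lock(os.path.basename(directory), external=True,
                        lock_path="/var/lock/copr-backend"):
        log.debug("got the lock")
        if already_processed(directory, subdir):
            log.info("%s already processed by another copr-repo run", subdir)
            return
        run_createrepo(directory, subdir)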

Metadata Update from @praiskup:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago


Related Pull Requests
  • #1927 Merged 3 years ago