#3976 Performance issues with new fork task implementation
Closed: Fixed 5 years ago · Opened 5 years ago by bkabrda.

I'm seeing a huge degradation in the performance of the fork task because of the way it's implemented now. For example, I created a testing repo at https://stg.pagure.io/myowntest - I first tried forking it with just one branch that had one commit, and it took about 6 seconds on average. I then added three more branches, each with one commit. Forking that takes almost 20 seconds on average. The Pagure instance that I operate has repos with dozens of branches and hundreds of commits (these are the rule, not the exception).

I'm not very familiar with the new code and why it works the way it does, but forking some of the repositories in the dev instance of my setup takes several minutes (and these are still much smaller than what my prod instance has). If someone has a quick idea of how to fix this, please comment; otherwise I'll start digging in myself.


Do you know which part takes the most time? The forking itself or the refresh of the ACLs?

Metadata Update from @pingou:
- Issue set to the milestone: 5.2
- Issue tagged with: bug

5 years ago

I think it all adds up - for each branch, the commits are actually pushed and the hooks run (including sending a message to the message bus, which also takes some time).

What stands out on my setup is the pushing of individual branches, since I've tried this with a repo that has a couple hundred commits on each branch; but the experiment with https://stg.pagure.io/myowntest shows that this also happens with branches that have just one commit, so it's not just that.

Another interesting side effect of the new forking approach is that all commits from the original project get logged to pagure_logs as if they had just been committed. When large projects are forked by lots of users, this could put extreme strain on the DB (both in the number of transactions and in the size of this table) - my instance has projects with years of history and hundreds of branches, each having thousands of commits. A single fork would effectively add ~100k rows to the pagure_logs table. I think "internal" pushes (specifically those done while forking) shouldn't be logged at all.
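The ~100k figure follows directly from the repo shape described above; a quick sanity check (the exact branch and commit counts below are illustrative, not measured):

```python
# Rough estimate of pagure_logs growth per fork, using the numbers above
# ("hundreds of branches, each having thousands of commits").
branches = 100             # illustrative: "hundreds of branches"
commits_per_branch = 1000  # illustrative: "thousands of commits"

rows_per_fork = branches * commits_per_branch
print(rows_per_fork)  # → 100000 rows added to pagure_logs for a single fork
```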

FWIW, commits logged while merging a PR (which reach the DB through the same mechanism of an "internal" push) should indeed be stored, since the fork can get deleted and then we'd have no information about the commit authors.

So here's some more detailed analysis:

It seems that the pushing branch-by-branch itself is not really an issue. The biggest problems are:

  • hookrunner, which runs 3 times for every branch (pre-receive, update, post-receive); just invoking hookrunner once takes about 1.5 seconds in my local setting (YMMV) because of all the imports, and connections to message broker, relational db etc that have to be made
  • the actual hooks - mostly the default hook that would need to connect to message broker after every pushed branch to send message about pushed commits
  • the fact that TemporaryClone.push seems to ignore the internal="yes" argument, which makes hookrunner check dynamic acls on both pre-receive and update hooks runs
  • as mentioned above, tasks are sent from default hook to log all the commits (even though sending tasks is pretty fast, it's certainly slowing things down)

All in all, if there are no hooks on the fork, the pushes are lightning fast (<10ms per branch); with hooks, they're slow (>5s even for branches with single commits). The number of commits doesn't have a noticeable impact on fork performance.
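Plugging the observed per-branch timings into a back-of-the-envelope model shows how the hook overhead dominates (the timings come from this thread; the 100-branch repo size is an assumption based on the instances described above):

```python
# Back-of-the-envelope totals using the per-branch timings observed above:
# <10 ms per branch without hooks vs >5 s per branch with hooks.
PUSH_NO_HOOKS_MS = 10      # observed: pushes are "lightning fast"
PUSH_WITH_HOOKS_MS = 5000  # observed: >5 s even for single-commit branches

def fork_push_seconds(n_branches, per_branch_ms):
    """Total time spent pushing all branches of a fork, in seconds."""
    return n_branches * per_branch_ms / 1000

# A hypothetical 100-branch repo:
print(fork_push_seconds(100, PUSH_NO_HOOKS_MS))    # → 1.0 second
print(fork_push_seconds(100, PUSH_WITH_HOOKS_MS))  # → 500.0 seconds (~8 min)
```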

Therefore it seems that the best course of action would be to somehow completely disable hooks while the fork is in progress and only enable them afterwards. I'll try to figure out a non-hackish way to do that.
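One possible shape for that idea is sketched below: park the target bare repo's hooks/ directory while the fork's branches are being pushed, and restore it afterwards. This is only a sketch of the approach discussed here, not the fix that actually landed in Pagure:

```python
import os
from contextlib import contextmanager

@contextmanager
def hooks_disabled(bare_repo_path):
    """Temporarily move a bare repo's hooks/ directory aside so that
    pushes made while the fork is being populated trigger no hooks.

    Sketch only: the real fix in Pagure may be implemented differently.
    """
    hooks = os.path.join(bare_repo_path, "hooks")
    parked = hooks + ".disabled"
    moved = os.path.isdir(hooks)
    if moved:
        os.rename(hooks, parked)
    try:
        yield
    finally:
        if moved:
            os.rename(parked, hooks)
```

With git >= 2.9, an alternative would be to point the target repository's core.hooksPath config at an empty directory for the duration of the fork and unset it afterwards.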

Metadata Update from @pingou:
- Issue assigned to bkabrda

5 years ago

> hookrunner, which runs 3 times for every branch (pre-receive, update, post-receive); just invoking hookrunner once takes about 1.5 seconds in my local setting (YMMV) because of all the imports, and connections to message broker, relational db etc that have to be made

This is not quite accurate:
it runs once per push for pre-receive, once per push for post-receive, and only for "update" does it run once per branch.
That's the way Git hooks work, since pre- and post-receive get all the changes in one go.
Also, I timed the start at some point, and the vast majority of the delay is purely the "import libgit2".
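The invocation pattern described here is standard Git server-side hook behavior, which can be sketched as follows (this illustrates the Git interface, not Pagure's hookrunner code): pre-receive and post-receive each run once per push and read one line per updated ref from stdin, while update runs once per ref and receives the details as command-line arguments.

```python
def pre_receive(stdin_lines):
    """Runs once per push: every updated ref arrives as a line on stdin,
    formatted "<old-sha> <new-sha> <refname>"."""
    updates = []
    for line in stdin_lines:
        old_sha, new_sha, refname = line.strip().split(" ", 2)
        updates.append((old_sha, new_sha, refname))
    return updates

def update(args):
    """Runs once per ref: Git passes <refname> <old-sha> <new-sha> as
    command-line arguments instead of on stdin."""
    refname, old_sha, new_sha = args
    return (old_sha, new_sha, refname)

# Pushing four branches at once: a single pre-receive call sees all four...
push = ["%040d %040d refs/heads/branch%d" % (0, i + 1, i) for i in range(4)]
print(len(pre_receive(push)))  # → 4
# ...but update would be invoked four separate times, once per branch.
```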

> the fact that TemporaryClone.push seems to ignore the internal="yes" argument, which makes hookrunner check dynamic acls on both pre-receive and update hooks runs

It should not be ignoring it; it passes it as an environment variable to the actual push: https://pagure.io/pagure/blob/master/f/pagure/lib/git.py#_1035.
The hookrunner is then responsible for taking this into account.
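A minimal sketch of what honoring that flag in the hookrunner could look like. Note that the environment variable name "internal" is an assumption taken from the push argument quoted above; the real code may spell both the variable and the check differently:

```python
import os

def should_check_dynamic_acls(environ=None):
    """Skip the dynamic ACL check when the push was made internally by
    Pagure itself (e.g. while forking or merging a PR).

    NOTE: the "internal" variable name is assumed from the push call
    linked above; the real hookrunner may use a different spelling.
    """
    if environ is None:
        environ = os.environ
    return environ.get("internal") != "yes"

print(should_check_dynamic_acls({"internal": "yes"}))  # → False
print(should_check_dynamic_acls({}))                   # → True
```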


Metadata
Related Pull Requests
  • #4043 Merged 5 years ago