This was encountered with a very large (~27 million entry) database. The logs from an LDIF import indicated that gathering non-leaf IDs for creating the ancestorid index took an enormous amount of time, over 10 hours:
[06/Jul/2018:21:13:40.793092791 +0000] - INFO - ldbm_get_nonleaf_ids - import userRoot: Gathering ancestorid non-leaf IDs: processed 0% (ID count 100000)
...
[07/Jul/2018:07:27:52.515252224 +0000] - INFO - ldbm_get_nonleaf_ids - import userRoot: Gathering ancestorid non-leaf IDs: processed 100% (ID count 13069735)
The root cause is that the parentid index btree ordering is lexical, but the IDList being built up from it is sorted numerically. In the existing code, the IDList is maintained in constantly sorted order by idl_insert().
This imposes enormous sorting and array-copying overhead on large databases with many non-leaf entries.
The attached patch switches to idl_append_extend() for building up the IDList and then sorts the result only once, using qsort with idl_sort_cmp, after the entire list has been gathered.
On identical hardware, the operation now takes about 10 seconds rather than 10 hours:
[12/Jul/2018:16:52:55.620166570 +0000] - INFO - ldbm_get_nonleaf_ids - import userRoot: Gathering ancestorid non-leaf IDs: processed 0% (ID count 100000)
...
[12/Jul/2018:16:53:04.885823828 +0000] - INFO - ldbm_get_nonleaf_ids - import userRoot: Gathering ancestorid non-leaf IDs: processed 100% (ID count 13069735)
[12/Jul/2018:16:53:04.886990799 +0000] - INFO - ldbm_get_nonleaf_ids - import userRoot: Finished gathering ancestorid non-leaf IDs.
[12/Jul/2018:16:53:04.888042215 +0000] - INFO - ldbm_get_nonleaf_ids - import userRoot: Starting sort of ancestorid non-leaf IDs.
[12/Jul/2018:16:53:05.578625740 +0000] - INFO - ldbm_get_nonleaf_ids - import userRoot: Finished sort of ancestorid non-leaf IDs.
1.3.6.14 on CentOS 7
It is very, very slow.
It should be quick.
Attachment: ancestorid.sort.patch (/389-ds-base/issue/raw/files/061f06e7fd57b9194eecc3f62edbf89e0a29d2dfcafdf1db42a8ac49cd42fcc3-ancestorid.sort.patch)
Your analysis seems valid; I am not even sure we need the sorting at all. In the create-index code, the ID list is processed, and for each ID the parentid is determined and the IDList set. I don't see where the list needs to be sorted.
Metadata Update from @lkrispen: - Custom field component adjusted to None - Custom field origin adjusted to None - Custom field reviewstatus adjusted to None - Custom field type adjusted to None - Custom field version adjusted to None
I was of a similar view.
Lacking certainty about whether the sorting served some purpose (though it appeared to me it did not), I went ahead and sorted the list anyway so that I could guarantee identical behavior before and after the change.
Hi @telackey, my name is Simon. I am part of the DS team. Your contribution is very welcome! :)
One thing: could you please create a pull request with the commit? It is part of our process now, and it will be easier to test and review there.
You can find the instructions here - http://www.port389.org/docs/389ds/contributing.html#creating-a-patch-for-review
Thanks!
Metadata Update from @mreynolds: - Issue set to the milestone: 1.3.8
I did my own tests with master, with the provided patch, and with the patch without sorting. The LDIF contained 12 million entries with 2.4 million inner nodes; it was basically a tree where each node had 5 children, 10 levels deep.
With master, the ancestorid preparation in ldbm_get_nonleaf_ids() took 40 minutes; with the patch it took under 1 minute, and the sorting itself took under 1 second.
For all three tests the generated ancestorid index was identical, so sorting is not needed, but it also does not hurt. A small difference between the sorted and unsorted patch was the size of the ancestorid.db file: with master and the patch (sorted) it was 2.468 GB, with the patch (unsorted) it was 2.486 GB, so there is a slight difference in fragmentation caused by the order of insertion.
I suggest using the provided patch as-is; it speeds up the import considerably and creates an identical db file.
Thank you!
@telackey - do you want to create PR for this, or do you want me to do it?
If you do file the PR, please use C-style /* ... */ comments, not Java-style // comments.
If you are happy with the patch as-is (modulo the comment style, which is easy to change), I can open up the PR.
Should it be against master?
I am also happy if you do it. Whichever is easier.
Yes, it would be the master branch, and the commit message would look like:

Ticket 49850 - <DESCRIPTION OF PROBLEM>

Description: <your description of the issue and fix>

https://pagure.io/389-ds-base/issue/49850

Reviewed by: lkrispen
Then I will cherry-pick it to 1.3.8
Metadata Update from @mreynolds: - Issue set to the milestone: 1.4 backlog (was: 1.3.8)
Metadata Update from @tbordaz: - Custom field rhbz adjusted to https://bugzilla.redhat.com/show_bug.cgi?id=1749595
https://pagure.io/389-ds-base/pull-request/50630
Sorry this took so long @telackey
My testing with a "heavily nested" 1 million entry LDIF shows a 35% performance improvement. Most likely, the more nested the LDIF, the greater the improvement.
I did not see any impact on a flat ldif file with 1 million entries.
Commit fc47620 relates to this ticket
225f4e1..fc47620 master -> master
3b47e3f..fb4ea1d 389-ds-base-1.4.1 -> 389-ds-base-1.4.1
13340ad..9eba2ce 389-ds-base-1.4.0 -> 389-ds-base-1.4.0
3fbff08..3be2a20 389-ds-base-1.3.10 -> 389-ds-base-1.3.10
Metadata Update from @mreynolds: - Issue close_status updated to: fixed - Issue status updated to: Closed (was: Open)
Metadata Update from @lkrispen: - Issue status updated to: Open (was: Closed)
There is a crash if an LDIF containing only the root entry is imported: there are no non-leaf IDs to gather, so nodes is NULL, and

qsort((void *)&nodes->b_ids[0], nodes->b_nids, (size_t)sizeof(ID), idl_sort_cmp);

crashes.
Just do a quick NULL test before the qsort?
Or possibly take the sort out entirely, as mentioned above.
@lkrispen fixed it in this PR: https://pagure.io/389-ds-base/pull-request/50672
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/2909
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
Thank you for understanding, and we apologize for any inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: fixed)