We write to our directory via the Java API. When modifying our groups, the ModificationItems contain lists of distinguished names of users.
There is no problem when we add new uniqueMembers to our groups, but deleting them takes far too long on the second (replicated) master. For example, deleting 850 members from a group of 86,000 takes 159 seconds on dldap02.
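For context, here is a minimal sketch of the kind of modify operation involved, rendered as LDIF by a small Python helper. The helper name and example DNs are illustrative, not from this ticket; the reporters issue the equivalent operation through the Java API's ModificationItem.

```python
def build_delete_ldif(group_dn, member_dns):
    """Render an LDIF modify record that deletes specific uniqueMember
    values from a group entry (illustrative helper, not ticket code)."""
    lines = [f"dn: {group_dn}", "changetype: modify", "delete: uniqueMember"]
    lines += [f"uniqueMember: {dn}" for dn in member_dns]
    lines.append("-")  # terminates the change within the modify record
    return "\n".join(lines) + "\n"

ldif = build_delete_ldif(
    "cn=TSNMAIL-VerteilerZugriff,ou=Groups,dc=example,dc=at",
    ["gvGid=AT:L7:TSN:user1,ou=People,dc=example,dc=at",
     "gvGid=AT:L7:TSN:user2,ou=People,dc=example,dc=at"],
)
print(ldif)
```

On the wire this is a single MOD operation with one delete change listing the member DNs to remove, matching the single MOD/RESULT pairs seen in the access logs below.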
dldap01:
[01/Jun/2015:13:44:47 +0200] conn=141407 op=3 MOD dn="cn=TSNMAIL-VerteilerZugriff,cn=TSNMAIL,cn=Applications,ou=Groups,ou=TSN-AD,dc=tirol+gvOuId=AT:L7:LVN:000001,dc=gv,dc=at"
[01/Jun/2015:13:44:48 +0200] conn=141407 op=3 RESULT err=0 tag=103 nentries=0 etime=0.858000 csn=556c45b0000000010000
dldap02:
[01/Jun/2015:13:44:48 +0200] conn=141354 op=4 MOD dn="cn=TSNMAIL-VerteilerZugriff,cn=TSNMAIL,cn=Applications,ou=Groups,ou=TSN-AD,dc=tirol+gvOuId=AT:L7:LVN:000001,dc=gv,dc=at"
[01/Jun/2015:13:47:28 +0200] conn=141354 op=4 RESULT err=0 tag=103 nentries=0 etime=159.670000 csn=556c45b0000000010000
I found a closed ticket (#346) with some similarities; maybe the problem still exists in a multi-master replication environment when deleting attributes. Logging shows millions of calls to plugin_call_syntax_filter_ava with uniqueMember=gvGid=AT:L7:TSN:i.weitlaner,ou=People,ou=TSN-AD,dc=tirol+gvOuId=AT:L7:LVN:000001,dc=gv,dc=at.
1) What is the version of 389-ds-base you are running? Could you run "rpm -q 389-ds-base"?
2) I cannot find the bug 01453107 in bugzilla.redhat.com. Where did you file it?
Thank you for reviewing this ..
[root@dldap01 ~]# rpm -qa | grep 389-ds
389-ds-base-libs-1.2.11.15-52.el6.x86_64
389-ds-console-doc-1.2.7-1.el6.noarch
389-ds-base-1.2.11.15-52.el6.x86_64
389-ds-console-1.2.7-1.el6.noarch
My URL to the Red Hat support case:
https://access.redhat.com/support/cases/#/case/01453107
Sorry.
Replying to [comment:3 eweitlaner]:
My url to the Redhat support case: https://access.redhat.com/support/cases/#/case/01453107 Sorry.
Sorry. Thank you for the info. Could you please give me an email at nhosoi@redhat.com? Thanks, --noriko
I did further investigations, maybe it helps ...
When we delete uniqueMembers, the error log on the replica (search filter processing) shows thousands of entries: plugin_call_syntax_filter_ava uniqueMember=XYZ.
XYZ are the distinguished names of the remaining members. So if we delete 1 of 100k members, we see approx. 100k of these log lines; if we delete 500 of 100k, we see approx. 500 × 100k of these lines, causing one CPU to run at 100% for minutes.
But it only happens when we delete uniqueMembers, and it only happens on the replicated master.
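The numbers above are consistent with a cost that scales as deleted × remaining values. A toy model (not the actual server code) comparing a linear scan per deleted value, in the spirit of valuearray_find, against a set-based delete:

```python
def delete_linear(values, to_delete):
    # Linear scan for every deleted value, mimicking a per-value
    # valuearray_find: O(len(to_delete) * len(values)) comparisons.
    comparisons = 0
    for victim in to_delete:
        for v in values:
            comparisons += 1
            if v == victim:
                values.remove(victim)
                break
    return comparisons

def delete_hashed(values, to_delete):
    # One pass with a hash set: O(len(values) + len(to_delete)).
    doomed = set(to_delete)
    return [v for v in values if v not in doomed]

members = [f"uid=user{i},ou=People,dc=example" for i in range(10_000)]
victims = members[7_000:7_100]  # delete 100 of 10k

print(delete_linear(list(members), victims))   # 700100 comparisons
print(len(delete_hashed(members, victims)))    # 9900 members remain
```

Scaled up to deleting 500 members from a 100k group, the linear variant needs tens of millions of equality checks, which matches the flood of plugin_call_syntax_filter_ava log lines and the minutes of 100% CPU described above.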
Replying to [comment:6 eweitlaner]:
I did further investigations, maybe it helps ... When we delete uniqueMembers the error log on the replica (search filter processing) shows thousands of entries: plugin_call_syntax_filter_ava uniqueMember=XYZ. XYZ are distinguishedNames of remaining members. So if we delete 1 of 100k we see approx. 100k of these log lines, if we delete 500 of 100k we see approx. 500*100k of these lines causing one cpu to be on 100% for minutes. But it only happens when we delete the uniqueMembers and it only happens on the replicated master.
What error log level are you using?
nsslapd-errorlog-level: 8192
Log Level: Search filter processing
Replying to [comment:8 eweitlaner]:
http://www.port389.org/docs/389ds/FAQ/faq.html#troubleshooting
nsslapd-errorlog-level: 8192 is Replication debugging
What do you mean by "Log Level: Search filter processing"?
Just to confirm - when you did the initial performance testing, you had the default log levels? That is, no replication and no "Search filter processing"?
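For reference, the error log level is an additive bitmask attribute on cn=config; replication debugging (8192) can be switched on with an ldapmodify such as the following (a standard sketch, not taken from this ticket). The console's "Search filter processing" corresponds to a different bit (32) of the same attribute, which may explain the mismatch above.

```ldif
dn: cn=config
changetype: modify
replace: nsslapd-errorlog-level
nsslapd-errorlog-level: 8192
```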
attachment log_level.png
This measurement was done with default settings; I will re-run the test tomorrow.
.. deleting them takes much too long on the second (replicated) master. (example deleting 850 members from 86000 takes 159 seconds on dldap02).
The attached file shows the Log Level setting in our console. I changed it to find out what happens during the 100% CPU time.
Would it be possible to enable the audit log during this "group member delete", so we can see the exact mod operation?
Current settings: initial group size 83,114; deleting 831 takes 6.5 s (OK) on dldap02 (master) but 169 s on the replica.
master (dldap02):
[12/Jun/2015:07:23:23 +0200] conn=492509 op=3 MOD dn="cn=TSNMAIL-VerteilerZugriff,cn=TSNMAIL,cn=Applications,ou=Groups,ou=TSN-AD,dc=tirol+gvOuId=AT:L7:LVN:000001,dc=gv,dc=at"
[12/Jun/2015:07:23:30 +0200] conn=492509 op=3 RESULT err=0 tag=103 nentries=0 etime=6.355000 csn=557a6cd0000000020000
replica (dldap01):
[12/Jun/2015:07:23:30 +0200] conn=492628 op=4 MOD dn="cn=TSNMAIL-VerteilerZugriff,cn=TSNMAIL,cn=Applications,ou=Groups,ou=TSN-AD,dc=tirol+gvOuId=AT:L7:LVN:000001,dc=gv,dc=at"
[12/Jun/2015:07:26:19 +0200] conn=492628 op=4 RESULT err=0 tag=103 nentries=0 etime=169.008000 csn=557a6cd0000000020000
The attachment contains the audit log on dldap01 (test data). The next step is to add the members again (replication of adds is fast).
attachment audit
The add test:
Current settings: initial group size 82,283; adding 831 takes 1.8 s (OK) on dldap02 (master) and 1 s (OK) on the replica. If I swap the master, the effect is the same: deletion is slower than adding (I thought deletion would be cheaper), and deletion on the replica is much too slow (deleting 10k members would take 20 minutes or more).
..-> I need help ;-)
master (dldap02):
[12/Jun/2015:07:43:43 +0200] conn=492509 op=7 MOD dn="cn=TSNMAIL-VerteilerZugriff,cn=TSNMAIL,cn=Applications,ou=Groups,ou=TSN-AD,dc=tirol+gvOuId=AT:L7:LVN:000001,dc=gv,dc=at"
[12/Jun/2015:07:43:44 +0200] conn=492509 op=7 RESULT err=0 tag=103 nentries=0 etime=1.781000 csn=557a7190000000020000
replica (dldap01):
[12/Jun/2015:07:43:44 +0200] conn=493092 op=4 MOD dn="cn=TSNMAIL-VerteilerZugriff,cn=TSNMAIL,cn=Applications,ou=Groups,ou=TSN-AD,dc=tirol+gvOuId=AT:L7:LVN:000001,dc=gv,dc=at"
[12/Jun/2015:07:43:45 +0200] conn=493092 op=4 RESULT err=0 tag=103 nentries=0 etime=1.065000 csn=557a7190000000020000
attachment 0001-do-not-use-valuearray_find.patch
Sounds great, we will run our tests on Monday (no sysadmin is available to apply the patch today).
We are sorry, but we are not able to apply the patch because we are missing the source code; we have only the binary RPMs.
Please have a look at: https://access.redhat.com/support/cases/#/case/01370918
389-ds-base-libs-1.2.11.15-52.el6.x86_64.rpm 389-ds-base-1.2.11.15-52.el6.x86_64.rpm
Can you do the build for us, or should I contact Red Hat Support?
Thank you.
Replying to [comment:17 eweitlaner]:
We are sorry, we are not able to apply the patch, because we are missing the source code. We got only the binary rpms. Please have a look at: https://access.redhat.com/support/cases/#/case/01370918 389-ds-base-libs-1.2.11.15-52.el6.x86_64.rpm 389-ds-base-1.2.11.15-52.el6.x86_64.rpm Can you do the build for us, or shall I contact RedHat Support?
Yes, please contact Red Hat Support about a hot fix.
It seems the fix has not been committed to the 1.2.11.15 branch yet. As soon as that is done, we can provide the hotfix; code that is not present in the branch cannot be delivered through a hotfix.
Ludwig, is this patch ready to be put out for review? German cannot generate a hotfix unless it has already been committed to the source. I've looked at it, and it has my ack.
attachment 0001-Ticket-48195-Slow-replication-when-deleting-large-qu.patch
I tested this fix with the following scenario: two masters, a group with 90,000 members, deleting 200 members (members 70001-70200). Without the fix it takes 1 second on the master and 40 seconds on the replica; with the fix it takes 1 second on both.
Seems successful; the timing without the patch matches what we see on our system. For our test with the patch we need RPMs from Red Hat. I'm not sure, but I think it is on the way (needs a commit?).
See https://access.redhat.com/support/cases/#/case/01453107
Thank you, Erwin
master: commit 09ab8c7
1.2.11: commit 993e37a
1.3.3: commit d9b77f5
Ticket has been cloned to Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1235387
Pushed to 389-ds-base-1.3.4: c413bb0..a980b79 (commit a980b79)
The fix is deployed; problem solved.
Group size: 83,114; delete size: 822
Log entry on the replica:
[25/Jun/2015:15:37:06 +0200] conn=1648 op=4 RESULT err=0 tag=103 nentries=0 etime=0.885000 csn=558c0402000000020000
Was 159 s; now under 1 s.
Ticket has been cloned to Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1236148
Metadata Update from @eweitlaner: - Issue assigned to mreynolds - Issue set to the milestone: 1.2.11.33
389-ds-base is moving from Pagure to GitHub. This means that new issues and pull requests will be accepted only in 389-ds-base's GitHub repository.
This issue has been cloned to GitHub and is available here: https://github.com/389ds/389-ds-base/issues/1526
If you want to receive further updates on the issue, please navigate to the GitHub issue and click on the subscribe button.
Thank you for understanding. We apologize for any inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: Fixed)