#48195 Slow replication when deleting large quantities of multi-valued attributes
Closed: wontfix None Opened 9 years ago by eweitlaner.

We write to our directory via Java API. When changing our groups the ModificationItems are lists of distinguished names of users.

There is no problem when we add new uniqueMembers to our groups, but deleting them takes much too long on the second (replicated) master. (example deleting 850 members from 86000 takes 159 seconds on dldap02).

dldap01
[01/Jun/2015:13:44:47 +0200] conn=141407 op=3 MOD dn="cn=TSNMAIL-VerteilerZugriff,cn=TSNMAIL,cn=Applications,ou=Groups,ou=TSN-AD,dc=tirol+gvOuId=AT:L7:LVN:000001,dc=gv,dc=at"
[01/Jun/2015:13:44:48 +0200] conn=141407 op=3 RESULT err=0 tag=103 nentries=0 etime=0.858000 csn=556c45b0000000010000

dldap02
[01/Jun/2015:13:44:48 +0200] conn=141354 op=4 MOD dn="cn=TSNMAIL-VerteilerZugriff,cn=TSNMAIL,cn=Applications,ou=Groups,ou=TSN-AD,dc=tirol+gvOuId=AT:L7:LVN:000001,dc=gv,dc=at"
[01/Jun/2015:13:47:28 +0200] conn=141354 op=4 RESULT err=0 tag=103 nentries=0 etime=159.670000 csn=556c45b0000000010000

I found a closed ticket (#346) with some simmilarities, maybe the problem still exists in a multimaster replication environment when deleting attributes. Logging shows millions of calls to plugin_call_syntax_filter_ava uniqueMember=gvGid=AT:L7:TSN:i.weitlaner,ou=People,ou=TSN-AD,dc=tirol+gvOuId=AT:L7:LVN:000001,dc=gv,dc=at ?


1) What is the version of 389-ds-base you are running? Could you run "rpm -q 389-ds-base"?
2) I cannot find the bug 01453107 in bugzilla.redhat.com. Where did you file it?

Thank you for reviewing this ..

root@dldap01 ~]# rpm -qa | grep 389-ds
389-ds-base-libs-1.2.11.15-52.el6.x86_64
389-ds-console-doc-1.2.7-1.el6.noarch
389-ds-base-1.2.11.15-52.el6.x86_64
389-ds-console-1.2.7-1.el6.noarch

Replying to [comment:3 eweitlaner]:

My url to the Redhat support case:

https://access.redhat.com/support/cases/#/case/01453107

Sorry.
Thank you for the info. Could you please give me an email at nhosoi@redhat.com?
Thanks,
--noriko

I did further investigations, maybe it helps ...

When we delete uniqueMembers the error log on the replica (search filter processing) shows thousands of entries: plugin_call_syntax_filter_ava uniqueMember=XYZ.

XYZ are distinguishedNames of remaining members. So if we delete 1 of 100k we see approx. 100k of these log lines, if we delete 500 of 100k we see approx. 500*100k of these lines causing one cpu to be on 100% for minutes.

But it only happens when we delete the uniqueMembers and it only happens on the replicated master.

Replying to [comment:6 eweitlaner]:

I did further investigations, maybe it helps ...

When we delete uniqueMembers the error log on the replica (search filter processing) shows thousands of entries: plugin_call_syntax_filter_ava uniqueMember=XYZ.

XYZ are distinguishedNames of remaining members. So if we delete 1 of 100k we see approx. 100k of these log lines, if we delete 500 of 100k we see approx. 500*100k of these lines causing one cpu to be on 100% for minutes.

But it only happens when we delete the uniqueMembers and it only happens on the replicated master.

What error log level are you using?

nssldapd-errorlog-level: 8192
Log Level: Search filter processing

Replying to [comment:8 eweitlaner]:

nssldapd-errorlog-level: 8192
Log Level: Search filter processing

http://www.port389.org/docs/389ds/FAQ/faq.html#troubleshooting

nsslapd-errorlog-level: 8192 is Replication debugging

What do you mean by "Log Level: Search filter processing"?

Just to confirm - when you did the initial performance testing, you had the default log levels? That is, no replication and no "Search filter processing"?

This measure was done with default settings, I will re-run this test tomorrow.

.. deleting them takes much too long on the second (replicated) master. (example deleting 850 members from 86000 takes 159 seconds on dldap02).

The attached file shows the Log Level setting in our console. I changed this to find out what happens during the 100% cpu time.

Would it be possible to enable the audit log during this "group member delete", so we can see the exact mod operation?

current settings:
Initial size of group: 83114, deleting 831 takes 6,5s (ok) on dldap02 (master) but 169s on replica.

master: dldap02
[12/Jun/2015:07:23:23 +0200] conn=492509 op=3 MOD dn="cn=TSNMAIL-VerteilerZugriff,cn=TSNMAIL,cn=Applications,ou=Groups,ou=TSN-AD,dc=tirol+gvOuId=AT:L7:LVN:000001,dc=gv,dc=at"
194277 [12/Jun/2015:07:23:30 +0200] conn=492509 op=3 RESULT err=0 tag=103 nentries=0 etime=6.355000 csn=557a6cd0000000020000

replica: dldap01
[12/Jun/2015:07:23:30 +0200] conn=492628 op=4 MOD dn="cn=TSNMAIL-VerteilerZugriff,cn=TSNMAIL,cn=Applications,ou=Groups,ou=TSN-AD,dc=tirol+gvOuId=AT:L7:LVN:000001,dc=gv,dc=at"
120983 [12/Jun/2015:07:26:19 +0200] conn=492628 op=4 RESULT err=0 tag=103 nentries=0 etime=169.008000 csn=557a6cd0000000020000

The attachment contains the audit log on dldap01 (testdata). Next step is to add the members again (there the replication is fast)

The add test:

current settings:
Initial size of group: 82283, adding 831 takes 1,8s (ok) on dldap02 (master) and 1s (ok) on replica. If I change the master, the effect is the same: Deletion is slower than adding (I thought deletion could be cheaper), and deletion on the replica is much too slow (deleting 10k members will take 20 minutes or more).

..-> I need help ;-)

master: dldap02
[12/Jun/2015:07:43:43 +0200] conn=492509 op=7 MOD dn="cn=TSNMAIL-VerteilerZugriff,cn=TSNMAIL,cn=Applications,ou=Groups,ou=TSN-AD,dc=tirol+gvOuId=AT:L7:LVN:000001,dc=gv,dc=at"
[12/Jun/2015:07:43:44 +0200] conn=492509 op=7 RESULT err=0 tag=103 nentries=0 etime=1.781000 csn=557a7190000000020000

replica: dldap01
[12/Jun/2015:07:43:44 +0200] conn=493092 op=4 MOD dn="cn=TSNMAIL-VerteilerZugriff,cn=TSNMAIL,cn=Applications,ou=Groups,ou=TSN-AD,dc=tirol+gvOuId=AT:L7:LVN:000001,dc=gv,dc=at"
123818 [12/Jun/2015:07:43:45 +0200] conn=493092 op=4 RESULT err=0 tag=103 nentries=0 etime=1.065000 csn=557a7190000000020000

Already wrote this on ldap-support. In the case attachment I found for ldap01 this stack trace: Thread 16 (Thread 0x7f1b807f3700 (LWP 4562)): #0 0x00007f1d444522da in slapi_utf8ToLower () from /usr/lib64/dirsrv/libslapd.so.0 No symbol table info available. #1 0x00007f1d443e5c53 in dn_ignore_case_to_end () from /usr/lib64/dirsrv/libslapd.so.0 No symbol table info available. #2 0x00007f1d443e8c44 in slapi_dn_normalize_case_ext () from /usr/lib64/dirsrv/libslapd.so.0 No symbol table info available. #3 0x00007f1d4033f843 in value_normalize_ext () from /usr/lib64/dirsrv/plugins/libsyntax-plugin.so No symbol table info available. #4 0x00007f1d4033cffc in string_filter_ava () from /usr/lib64/dirsrv/plugins/libsyntax-plugin.so No symbol table info available. #5 0x00007f1d4033a6ad in ?? () from /usr/lib64/dirsrv/plugins/libsyntax-plugin.so No symbol table info available. #6 0x00007f1d444315d9 in plugin_call_syntax_filter_ava_sv () from /usr/lib64/dirsrv/libslapd.so.0 No symbol table info available. #7 0x00007f1d443daa40 in slapi_attr_value_cmp_ext () from /usr/lib64/dirsrv/libslapd.so.0 No symbol table info available. #8 0x00007f1d44457c2d in valuearray_find () from /usr/lib64/dirsrv/libslapd.so.0 No symbol table info available. #9 0x00007f1d44457d7b in valueset_update_csn_for_valuearray () from /usr/lib64/dirsrv/libslapd.so.0 No symbol table info available. #10 0x00007f1d443f55c5 in ?? () from /usr/lib64/dirsrv/libslapd.so.0 No symbol table info available. #11 0x00007f1d443f61bb in entry_apply_mod_wsi () from /usr/lib64/dirsrv/libslapd.so.0 No symbol table info available. #12 0x00007f1d443f62a6 in entry_apply_mods_wsi () from /usr/lib64/dirsrv/libslapd.so.0 No symbol table info available. #13 0x00007f1d3ae8edc6 in ?? () from /usr/lib64/dirsrv/plugins/libback-ldbm.so No symbol table info available. #14 0x00007f1d3ae8fab7 in ldbm_back_modify () from /usr/lib64/dirsrv/plugins/libback-ldbm.so No symbol table info available. #15 0x00007f1d4441aa71 in ?? () from /usr/lib64/dirsrv/libslapd.so.0 No symbol table info available. #16 0x00007f1d4441bd2e in do_modify () from /usr/lib64/dirsrv/libslapd.so.0 No symbol table info available. #17 0x0000000000414564 in ?? () No symbol table info available. #18 0x00007f1d4284da83 in ?? () from /lib64/libnspr4.so No symbol table info available. #19 0x00007f1d421f09d1 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #20 0x00007f1d41f3d8fd in clone () from /lib64/libc.so.6 No symbol table info available. it looks like ldap01 is spending time in valuearray_find(), called by valueset_update_csn_for_valuearray(), which is only called in urp. valuearray_find is slow, sequential rteading the values and normalizing and .... valueset_update_csn_for_valuearray is the only place it calls it directly, it was replaced in the fix for #346 by slapi_valueset_find(), but this seems to have survived. I'll attach quick fix for testing

Sounds great, we will run our tests on monday. (no sysadmin to apply the patch today)

We are sorry, we are not able to apply the patch, because we are missing the source code. We got only the binary rpms.

Please have a look at: https://access.redhat.com/support/cases/#/case/01370918

389-ds-base-libs-1.2.11.15-52.el6.x86_64.rpm
389-ds-base-1.2.11.15-52.el6.x86_64.rpm

Can you do the build for us, or shall I contact RedHat Support?

Thank you.

Replying to [comment:17 eweitlaner]:

We are sorry, we are not able to apply the patch, because we are missing the source code. We got only the binary rpms.

Please have a look at: https://access.redhat.com/support/cases/#/case/01370918

389-ds-base-libs-1.2.11.15-52.el6.x86_64.rpm
389-ds-base-1.2.11.15-52.el6.x86_64.rpm

Can you do the build for us, or shall I contact RedHat Support?

Yes, please contact Red Hat Support about a hot fix.

Thank you.

Seems it has not been commit'd to the 1.2.11.15 branch yet. As soon as it is done, we can provide the hotfix. No piece of code which is not present in the branch can be delivered through hotfix.

Ludwig, is this patch ready to be put out for review? German can not generate a hotfix without it already being commited to the source. I've looked at it, and it has my ack.

I tested this fix with the following scenario:
have 2 masters, have a group with 90000 members, delete 200 members (members 70001-70200).
without the fix, on the master it takes 1 sec, on the replica 40 sec
with the fix it takes 1 sec on both

Seems successful. Timing without patch is like on our system. For our test with the patch we need rpms from RedHat.
Im not sure, I think it is on the way (needs a commit?).

See https://access.redhat.com/support/cases/#/case/01453107

Thank you, Erwin

Pushed to 389-ds-base-1.3.4
c413bb0..a980b79 389-ds-base-1.3.4 -> 389-ds-base-1.3.4
commit a980b79

Fix is deployed .. problem solved.

GroupSize: 83114
DeleteSize: 822

LogEntry on Replica
[25/Jun/2015:15:37:06 +0200] conn=1648 op=4 RESULT err=0 tag=103 nentries=0 etime=0.885000 csn=558c0402000000020000

was 159 sec., is now 1 sec.

Metadata Update from @eweitlaner:
- Issue assigned to mreynolds
- Issue set to the milestone: 1.2.11.33

7 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/1526

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Fixed)

4 years ago

Log in to comment on this ticket.

Metadata
Attachments 1
Attached 9 years ago View Comment