When performing a total (initial) update of a replica, the update randomly hangs during data transmission (https://lists.fedoraproject.org/pipermail/389-users/2014-November/017602.html).
Factors that make the problem worse: using SSL/TLS in the replication agreement, a single CPU in the VM, or a reduced nsslapd-db-checkpoint-interval.
In my tests it hangs every time (I've checked branches 1.3.1, 1.3.2 and 1.3.3, all on CentOS 7). We have about 30000 entries. The hang is random and probably time-dependent, not entry-dependent.
I used 389-ds-base-1.3.2.24.tar.bz2 extracted from git and compiled it on CentOS 7. The two servers are ldap-edev and ldap-model; nsslapd-db-checkpoint-interval is 5.
The replica entries:

&lt;ldap-model&gt;
{{{
dn: cn=replica,cn=dc\3Did\2Cdc\3Dpolytechnique\2Cdc\3Dedu,cn=mapping tree,cn=config
objectClass: top
objectClass: nsDS5Replica
cn: replica
nsDS5ReplicaId: 1
nsDS5ReplicaRoot: dc=id,dc=polytechnique,dc=edu
nsDS5Flags: 1
nsDS5ReplicaBindDN: cn=RepliX,cn=config
nsds5ReplicaPurgeDelay: 604800
nsds5ReplicaTombstonePurgeInterval: 86400
nsds5ReplicaLegacyConsumer: False
nsDS5ReplicaType: 3
nsState:: AQAAAAAAAAB3X1tUAAAAAAAAAAAAAAAAAAAAAAAAAAABAAAAAAAAAA==
nsDS5ReplicaName: 59006004-65aa11e4-8781c5ac-8ec7ed31
nsds5ReplicaChangeCount: 0
nsds5replicareapactive: 0
}}}

&lt;ldap-edev&gt;
{{{
dn: cn=replica,cn=dc\3Did\2Cdc\3Dpolytechnique\2Cdc\3Dedu,cn=mapping tree,cn=config
objectClass: top
objectClass: nsDS5Replica
cn: replica
nsDS5ReplicaId: 2
nsDS5ReplicaRoot: dc=id,dc=polytechnique,dc=edu
nsDS5Flags: 1
nsDS5ReplicaBindDN: cn=RepliX,cn=config
nsds5ReplicaPurgeDelay: 604800
nsds5ReplicaTombstonePurgeInterval: 86400
nsds5ReplicaLegacyConsumer: False
nsDS5ReplicaType: 3
nsState:: AgAAAAAAAAB3X1tUAAAAAAAAAAAAAAAAAAAAAAAAAAABAAAAAAAAAA==
nsDS5ReplicaName: 5998f684-65aa11e4-89888cf6-5e3b66a8
nsds5ReplicaChangeCount: 0
nsds5replicareapactive: 0
}}}
The replication agreements:

&lt;ldap-model&gt;
{{{
dn: cn=Replication from ldap-model.polytechnique.fr to ldap-edev.polytechnique.fr,cn=replica,cn=dc\3Did\2Cdc\3Dpolytechnique\2Cdc\3Dedu,cn=mapping tree,cn=config
objectClass: top
objectClass: nsDS5ReplicationAgreement
cn: Replication from ldap-model.polytechnique.fr to ldap-edev.polytechnique.fr
description: Replication agreement from server ldap-model.polytechnique.fr to server ldap-edev.polytechnique.fr
nsDS5ReplicaHost: ldap-edev.polytechnique.fr
nsDS5ReplicaRoot: dc=id,dc=polytechnique,dc=edu
nsDS5ReplicaPort: 636
nsDS5ReplicaTransportInfo: SSL
nsDS5ReplicaBindDN: cn=RepliX,cn=config
nsDS5ReplicaBindMethod: simple
nsds5replicaBusyWaitTime: 3
nsds5replicaTimeout: 180
nsDS5ReplicaCredentials: {DES}YVUAJ+/R1UvHTyogaLDeUHPg7lS627P9
nsds5replicareapactive: 0
nsds5replicaLastUpdateStart: 0
nsds5replicaLastUpdateEnd: 0
nsds5replicaChangesSentSinceStartup:
nsds5replicaLastUpdateStatus: 402 Replication error acquiring replica: unknown error - Replica has different database generation ID, remote replica may need to be initialized
nsds5replicaUpdateInProgress: FALSE
nsds5replicaLastInitStart: 0
nsds5replicaLastInitEnd: 0
}}}

&lt;ldap-edev&gt;
{{{
dn: cn=Replication from ldap-edev.polytechnique.fr to ldap-model.polytechnique.fr,cn=replica,cn=dc\3Did\2Cdc\3Dpolytechnique\2Cdc\3Dedu,cn=mapping tree,cn=config
objectClass: top
objectClass: nsDS5ReplicationAgreement
cn: Replication from ldap-edev.polytechnique.fr to ldap-model.polytechnique.fr
description: Replication agreement from server ldap-edev.polytechnique.fr to server ldap-model.polytechnique.fr
nsDS5ReplicaHost: ldap-model.polytechnique.fr
nsDS5ReplicaRoot: dc=id,dc=polytechnique,dc=edu
nsDS5ReplicaPort: 636
nsDS5ReplicaTransportInfo: SSL
nsDS5ReplicaBindDN: cn=RepliX,cn=config
nsDS5ReplicaBindMethod: simple
nsds5replicaBusyWaitTime: 3
nsds5replicaTimeout: 180
nsDS5ReplicaCredentials: {DES}dG/Y05iIw4ZwldMtj+xjDGW89SHsDhQo
nsds5replicareapactive: 0
nsds5replicaLastUpdateStart: 0
nsds5replicaLastUpdateEnd: 0
nsds5replicaChangesSentSinceStartup:
nsds5replicaLastUpdateStatus: 402 Replication error acquiring replica: unknown error - Replica has different database generation ID, remote replica may need to be initialized
nsds5replicaUpdateInProgress: FALSE
nsds5replicaLastInitStart: 20141106114951Z
nsds5replicaLastInitEnd: 20141106115308Z
nsds5replicaLastInitStatus: 10 Total update aborted. LDAP error: Referral
}}}
The error corresponds to the timeout error (see https://fedorahosted.org/389/ticket/47901)
The total update was started in the direction ldap-edev --&gt; ldap-model. Error logs:

&lt;ldap-model&gt;
{{{
[06/Nov/2014:12:49:51 +0100] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=id,dc=polytechnique,dc=edu is going offline; disabling replication
[06/Nov/2014:12:49:51 +0100] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[06/Nov/2014:12:50:11 +0100] - import userRoot: Processed 4239 entries -- average rate 211.9/sec, recent rate 211.9/sec, hit ratio 0%
[06/Nov/2014:12:50:31 +0100] - import userRoot: Processed 4239 entries -- average rate 106.0/sec, recent rate 106.0/sec, hit ratio 0%
[06/Nov/2014:12:50:51 +0100] - import userRoot: Processed 4239 entries -- average rate 70.7/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:12:51:11 +0100] - import userRoot: Processed 4239 entries -- average rate 53.0/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:12:51:33 +0100] - import userRoot: Processed 4239 entries -- average rate 42.0/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:12:51:54 +0100] - import userRoot: Processed 4239 entries -- average rate 34.5/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:12:52:15 +0100] - import userRoot: Processed 4239 entries -- average rate 29.4/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:12:52:35 +0100] - import userRoot: Processed 4239 entries -- average rate 25.8/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:12:52:55 +0100] - import userRoot: Processed 4239 entries -- average rate 23.0/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:12:53:07 +0100] - import userRoot: Workers finished; cleaning up...
[06/Nov/2014:12:53:07 +0100] - import userRoot: Workers cleaned up.
[06/Nov/2014:12:53:07 +0100] - import userRoot: Indexing complete. Post-processing...
[06/Nov/2014:12:53:07 +0100] - import userRoot: Generating numSubordinates complete.
[06/Nov/2014:12:53:07 +0100] - import userRoot: Flushing caches...
[06/Nov/2014:12:53:07 +0100] - import userRoot: Closing files...
[06/Nov/2014:12:53:08 +0100] - import userRoot: Import complete. Processed 4551 entries in 197 seconds. (23.10 entries/sec)
[06/Nov/2014:12:53:08 +0100] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=id,dc=polytechnique,dc=edu is coming online; enabling replication
}}}
&lt;ldap-edev&gt;
{{{
[06/Nov/2014:12:49:52 +0100] NSMMReplicationPlugin - Beginning total update of replica "agmt="cn=Replication from ldap-edev.polytechnique.fr to ldap-model.polytechnique.fr" (ldap-model:636)".
}}}
The stack traces are attached:
The same initial total update using a replication agreement over plain LDAP (port 389) instead of LDAPS/TLS, and with nsslapd-db-checkpoint-interval=20, runs to the end correctly:
{{{
[06/Nov/2014:13:17:07 +0100] NSMMReplicationPlugin - agmt="cn=Replication from ldap-model.polytechnique.fr to ldap-edev.polytechnique.fr" (ldap-edev:636): The remote replica has a different database generation ID than the local database. You may have to reinitialize the remote replica, or the local replica.
[06/Nov/2014:13:17:19 +0100] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=id,dc=polytechnique,dc=edu is going offline; disabling replication
[06/Nov/2014:13:17:19 +0100] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[06/Nov/2014:13:17:39 +0100] - import userRoot: Processed 10022 entries -- average rate 501.1/sec, recent rate 501.1/sec, hit ratio 0%
[06/Nov/2014:13:17:59 +0100] - import userRoot: Processed 19850 entries -- average rate 496.1/sec, recent rate 496.0/sec, hit ratio 100%
[06/Nov/2014:13:18:19 +0100] - import userRoot: Workers finished; cleaning up...
[06/Nov/2014:13:18:19 +0100] - import userRoot: Workers cleaned up.
[06/Nov/2014:13:18:19 +0100] - import userRoot: Indexing complete. Post-processing...
[06/Nov/2014:13:18:19 +0100] - import userRoot: Generating numSubordinates complete.
[06/Nov/2014:13:18:19 +0100] - import userRoot: Flushing caches...
[06/Nov/2014:13:18:19 +0100] - import userRoot: Closing files...
[06/Nov/2014:13:18:20 +0100] - import userRoot: Import complete. Processed 29525 entries in 61 seconds. (484.02 entries/sec)
[06/Nov/2014:13:18:20 +0100] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=id,dc=polytechnique,dc=edu is coming online; enabling replication
}}}
{{{
[06/Nov/2014:13:17:20 +0100] NSMMReplicationPlugin - Beginning total update of replica "agmt="cn=Replication from ldap-edev.polytechnique.fr to ldap-model.polytechnique.fr" (ldap-model:636)".
[06/Nov/2014:13:18:21 +0100] NSMMReplicationPlugin - Finished total update of replica "agmt="cn=Replication from ldap-edev.polytechnique.fr to ldap-model.polytechnique.fr" (ldap-model:636)". Sent 29525 entries.
}}}
attachment stacktraces-ldap-edev.tar
attachment stacktraces-ldap-model.tar
Both servers appear to be waiting to write to the other server:

model:
{{{
Thread 75
  poll
  poll
  poll
  _pr_poll_with_poll
  slapd_poll
  write_function
  openldap_write_function
  ber_int_sb_write
  ber_flush2
  flush_ber
  send_ldap_result_ext
  send_ldap_result
  do_extended
  connection_dispatch_operation
  connection_threadmain
  _pt_root
  start_thread
  clone
}}}

edev:
{{{
Thread 3
  poll
  poll
  poll
  _pr_poll_with_poll
  see_if_write_available
  perform_operation
  conn_send_extended_operation
  send_entry
  internal_srch_entry_callback
  send_ldap_search_entry_ext
  send_ldap_search_entry
  iterate
  send_results_ext
  op_shared_search
  search_internal_callback_pb
  slapi_search_internal_callback_pb
  repl5_tot_run
  prot_thread_main
  _pt_root
  start_thread
  clone
}}}
In neither case is any SSL code involved, so it is really strange that SSL affects this.
Does increasing the entropy help? You can also create the VM with an RNG device (I use virt-install, so I pass --rng /dev/random - not sure how that translates to the XML files); then inside your VM, early before dirsrv starts, run yum install rng-tools and then rngd -r /dev/hwrng.
It does not seem to be entropy-related. I've run watch -n.1 cat /proc/sys/kernel/random/entropy_avail on both servers before and during the replication. It varies from 130 to 180 and never drops lower than that, regardless of whether the agreement uses SSL or not.
I'm not sure how to increase entropy in VMware guests; we are not using KVM/RHEV...
Ok. Now I am absolutely sure it is not entropy-related. I have installed the haveged daemon on both VMs and started it; the entropy now stays between 2000 and 4000 the whole time. The problem is still the same (once again, only when using SSL in the replication agreement):
{{{
[06/Nov/2014:15:50:14 +0100] NSMMReplicationPlugin - agmt="cn=Replication from ldap-model.polytechnique.fr to ldap-edev.polytechnique.fr" (ldap-edev:636): The remote replica has a different database generation ID than the local database. You may have to reinitialize the remote replica, or the local replica.
[06/Nov/2014:15:51:15 +0100] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=id,dc=polytechnique,dc=edu is going offline; disabling replication
[06/Nov/2014:15:51:15 +0100] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[06/Nov/2014:15:51:35 +0100] - import userRoot: Processed 1376 entries -- average rate 68.8/sec, recent rate 68.8/sec, hit ratio 0%
[06/Nov/2014:15:51:55 +0100] - import userRoot: Processed 1376 entries -- average rate 34.4/sec, recent rate 34.4/sec, hit ratio 0%
[06/Nov/2014:15:52:15 +0100] - import userRoot: Processed 1376 entries -- average rate 22.9/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:15:52:35 +0100] - import userRoot: Processed 1376 entries -- average rate 17.2/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:15:52:55 +0100] - import userRoot: Processed 1376 entries -- average rate 13.8/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:15:53:15 +0100] - import userRoot: Processed 1376 entries -- average rate 11.5/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:15:53:35 +0100] - import userRoot: Processed 1376 entries -- average rate 9.8/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:15:53:55 +0100] - import userRoot: Processed 1376 entries -- average rate 8.6/sec, recent rate 0.0/sec, hit ratio 0%
}}}
Looking into the stack traces from edev (except the first one, stacktrace.06-Nov-2014_12h47m36s.txt):
- thread 2 (repl5_tot_result_threadmain) is trying to read the result from model with conn_read_result_ex, but it cannot acquire the lock (conn->lock) and sits there;
- thread 3 (repl5_tot_run) holds the lock in perform_operation and is waiting in PR_Poll via see_if_write_available. It times out after nsds5replicaTimeout == 180 seconds.
On model, as Rich mentioned in comment 1, it is trying to write the result back to edev.
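The two comments above describe a classic distributed write-write deadlock. Here is a toy model of it (an editor's sketch in Python, not DS code, under the assumption that two small bounded queues can stand in for the full TCP buffers in each direction, and one lock for conn->lock):

```python
import threading, queue, time

entries_to_consumer = queue.Queue(maxsize=4)   # supplier -> consumer "socket buffer"
results_to_supplier = queue.Queue(maxsize=4)   # consumer -> supplier "socket buffer"
conn_lock = threading.Lock()                   # plays the role of conn->lock
progress = {"sent": 0, "results_read": 0}

def tot_sender():
    # like repl5_tot_run: holds conn->lock for the whole send loop
    with conn_lock:
        for i in range(100):
            entries_to_consumer.put(i)         # blocks once the buffer fills
            progress["sent"] += 1

def result_reader():
    # like repl5_tot_result_threadmain: needs conn->lock for every read
    while progress["results_read"] < 100:
        with conn_lock:                        # never acquired: the sender holds it
            results_to_supplier.get()
            progress["results_read"] += 1

def consumer():
    # the remote replica: sends one result back per entry received
    for _ in range(100):
        results_to_supplier.put(entries_to_consumer.get())  # blocks when results fill

sender = threading.Thread(target=tot_sender, daemon=True)
sender.start()
time.sleep(0.1)                                # let the sender take conn->lock first
for f in (result_reader, consumer):
    threading.Thread(target=f, daemon=True).start()
time.sleep(1.0)                                # let the system wedge
print(progress)                                # a few entries sent, zero results read
```

With buffers of size 4, the sender wedges after 9 successful "writes": the consumer blocks writing results back (its buffer is full because the reader can never take the lock), so it stops draining entries, and the sender then blocks mid-write while still holding the lock. Nothing about the simulation uses SSL; it only needs both directions' buffers to fill, which slower SSL/TLS transmission would make much more likely.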
Andrey, could you try running the total update/replica initialization with nsds5replicaTimeout set to 1 second and see how it goes? Thanks!
If you mean setting the agreement timeout smaller for both replicas:
With 60s for both agreements:
{{{
[06/Nov/2014:20:42:20 +0100] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=id,dc=polytechnique,dc=edu is going offline; disabling replication
[06/Nov/2014:20:42:20 +0100] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[06/Nov/2014:20:42:40 +0100] - import userRoot: Processed 1404 entries -- average rate 70.2/sec, recent rate 70.2/sec, hit ratio 0%
[06/Nov/2014:20:43:00 +0100] - import userRoot: Processed 1404 entries -- average rate 35.1/sec, recent rate 35.1/sec, hit ratio 0%
[06/Nov/2014:20:43:20 +0100] - import userRoot: Processed 1404 entries -- average rate 23.4/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:20:43:27 +0100] - import userRoot: Workers finished; cleaning up...
[06/Nov/2014:20:43:27 +0100] - import userRoot: Workers cleaned up.
[06/Nov/2014:20:43:27 +0100] - import userRoot: Indexing complete. Post-processing...
[06/Nov/2014:20:43:27 +0100] - import userRoot: Generating numSubordinates complete.
[06/Nov/2014:20:43:27 +0100] - import userRoot: Flushing caches...
[06/Nov/2014:20:43:27 +0100] - import userRoot: Closing files...
[06/Nov/2014:20:43:27 +0100] - import userRoot: Import complete. Processed 1659 entries in 67 seconds. (24.76 entries/sec)
[06/Nov/2014:20:43:27 +0100] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=id,dc=polytechnique,dc=edu is coming online; enabling replication
}}}
with 1s for both agreements:
{{{
[06/Nov/2014:20:56:08 +0100] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=id,dc=polytechnique,dc=edu is going offline; disabling replication
[06/Nov/2014:20:56:09 +0100] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[06/Nov/2014:20:56:28 +0100] - import userRoot: Workers finished; cleaning up...
[06/Nov/2014:20:56:28 +0100] - import userRoot: Workers cleaned up.
[06/Nov/2014:20:56:28 +0100] - import userRoot: Indexing complete. Post-processing...
[06/Nov/2014:20:56:28 +0100] - import userRoot: Generating numSubordinates complete.
[06/Nov/2014:20:56:28 +0100] - import userRoot: Flushing caches...
[06/Nov/2014:20:56:28 +0100] - import userRoot: Closing files...
[06/Nov/2014:20:56:28 +0100] - import userRoot: Import complete. Processed 5572 entries in 19 seconds. (293.26 entries/sec)
[06/Nov/2014:20:56:28 +0100] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=id,dc=polytechnique,dc=edu is coming online; enabling replication
}}}
with 5s for both agreements:
{{{
[06/Nov/2014:20:59:55 +0100] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=id,dc=polytechnique,dc=edu is going offline; disabling replication
[06/Nov/2014:20:59:55 +0100] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[06/Nov/2014:21:00:15 +0100] - import userRoot: Processed 4439 entries -- average rate 221.9/sec, recent rate 221.9/sec, hit ratio 0%
[06/Nov/2014:21:00:18 +0100] - import userRoot: Workers finished; cleaning up...
[06/Nov/2014:21:00:18 +0100] - import userRoot: Workers cleaned up.
[06/Nov/2014:21:00:18 +0100] - import userRoot: Indexing complete. Post-processing...
[06/Nov/2014:21:00:18 +0100] - import userRoot: Generating numSubordinates complete.
[06/Nov/2014:21:00:18 +0100] - import userRoot: Flushing caches...
[06/Nov/2014:21:00:18 +0100] - import userRoot: Closing files...
[06/Nov/2014:21:00:18 +0100] - import userRoot: Import complete. Processed 4439 entries in 23 seconds. (193.00 entries/sec)
[06/Nov/2014:21:00:18 +0100] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=id,dc=polytechnique,dc=edu is coming online; enabling replication
}}}
If you mean changing the timeout of the agreement ldap-model -> ldap-edev to 1 sec while leaving ldap-edev -> ldap-model at 180s:
{{{
[06/Nov/2014:20:48:13 +0100] - import userRoot: Processed 3615 entries -- average rate 180.8/sec, recent rate 180.7/sec, hit ratio 0%
[06/Nov/2014:20:48:33 +0100] - import userRoot: Processed 3615 entries -- average rate 90.4/sec, recent rate 90.3/sec, hit ratio 0%
[06/Nov/2014:20:48:53 +0100] - import userRoot: Processed 3615 entries -- average rate 60.2/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:20:49:13 +0100] - import userRoot: Processed 3615 entries -- average rate 45.2/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:20:49:33 +0100] - import userRoot: Processed 3615 entries -- average rate 36.1/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:20:49:53 +0100] - import userRoot: Processed 3615 entries -- average rate 30.1/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:20:50:14 +0100] - import userRoot: Processed 3615 entries -- average rate 25.8/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:20:50:34 +0100] - import userRoot: Processed 3615 entries -- average rate 22.6/sec, recent rate 0.0/sec, hit ratio 0%
...
}}}
So it's not getting any better - still a freeze at some random place.
Thanks for your quick testing. I meant "with 1s for both agreements", expecting it to give repl5_tot_result_threadmain more chances to read the total update results from model. But it does not help... We need to reproduce the problem in house for more investigation.
So, it requires replication over SSL/TLS and 30000 entries to duplicate the total update hang?
Not necessarily. I think it could be more time-dependent - the total update should be long enough for the problem to appear (at least 20-30 seconds, I think).
If you look at the original mail (https://lists.fedoraproject.org/pipermail/389-users/2014-November/017601.html), 陳含林 (laneovcc at gmail.com) had only about 5000 entries and it got stuck at about 1500. Disabling SSL in the replication agreement also helped him. It would also be interesting to see what server configuration he was using.
The total update using 1.2.10 on CentOS 7 always takes only 8-9 seconds for the same data (instead of ~60s for 1.3.x); maybe that's why I've never seen any problem with 1.2.10 either.
Interesting... So, what you suggest is to have a huge DB and run the total update, right?
Please note that Mr. 陳含林's case was caused by some entries larger than the maxbersize.
Anyway, we'd start from there...
Thanks for your input!
No, it was not caused by large entries. This is what he wrote (https://lists.fedoraproject.org/pipermail/389-users/2014-November/017604.html):
''While nsDS5ReplicaTransportInfo set to TLS, i have tried
set nsslapd-maxbersize to 2147483647
set nsslapd-cachememsize to 2147483648
set nsslapd-pluginbetxn off
set export LDAPTLS_REQCERT=never in /etc/sysconfig/dirsrv
all of above not work.

After change nsDS5ReplicaTransportInfo from TLS to LDAP, total update finished in 5 sec and succeeded. This should be a workaround for my case.''
So he had exactly the same problem as me. Anyway, I'm going to migrate our production soon to the latest 1.3.2 and use the workaround of disabling SSL for total online initialization.
If I can be of any use for this ticket, please don't hesitate to ask me for more tests or information.
And one more thing - our DB is not huge; it's about 30000 entries and about 33 MB in size.
Thanks for the clarification, Andrey. So, TLS (or SSL?) is the issue. Still, we have had no luck duplicating the problem with SSL/TLS enabled on the latest version from master... We may need to go back to older versions.
I've disabled all plugin modifications during the installation other than activating TLS/SSL. The problem did not disappear.
Just in case, here is some more information about the test servers (VMs):
- both VMs (vSphere 5.5U2) have 1 vCPU, 4 GB RAM, CentOS 7 with the latest "yum update"
- the vmtools used are the open-vm-tools supplied with the OS (yum install open-vm-tools)
- SELinux is disabled
- ipv6 is disabled from sysctl.conf:
{{{
echo "net.ipv6.conf.all.disable_ipv6 = 1" >> /etc/sysctl.conf
echo "net.ipv6.conf.default.disable_ipv6 = 1" >> /etc/sysctl.conf
sysctl -p
}}}
- all the server certs are signed by a single but self-signed authority
- autobind is activated:
{{{
nsslapd-ldapilisten: on
nsslapd-ldapiautobind: on
}}}
- 389ds was compiled without SELinux:
{{{
./configure --prefix=/Local/dirsrv --enable-autobind --with-openldap --with-tmpfiles-d=/etc/tmpfiles.d --with-systemdsystemunitdir=/usr/lib/systemd/system --with-systemdsystemconfdir=/etc/systemd/system --with-systemdgroupname=dirsrv.target
}}}
{{{
dn: cn=encryption,cn=config
objectClass: top
objectClass: nsEncryptionConfig
cn: encryption
nsSSLSessionTimeout: 0
nsSSLClientAuth: allowed
nsSSL2: off
nsSSL3: on
nsSSL3Ciphers: -rsa_null_md5,+rsa_rc4_128_md5,-rsa_rc4_40_md5,-rsa_rc2_40_md5,+rsa_des_sha,+rsa_fips_des_sha,+rsa_3des_sha,+rsa_fips_3des_sha,+fortezza,+fortezza_rc4_128_sha,+fortezza_null,+tls_rsa_export1024_with_rc4_56_sha,+tls_rsa_export1024_with_des_cbc_sha,-rc4,-rc4export,-rc2,-rc2export,-des,-desede3
nsSSLSupportedCiphers: SSL3::rc4::RC4::MD5::128
nsSSLSupportedCiphers: SSL3::rc4export::RC4::MD5::128
nsSSLSupportedCiphers: SSL3::rc2::RC2::MD5::128
nsSSLSupportedCiphers: SSL3::rc2export::RC2::MD5::128
nsSSLSupportedCiphers: SSL3::des::DES::MD5::64
nsSSLSupportedCiphers: SSL3::desede3::3DES::MD5::192
nsSSLSupportedCiphers: SSL3::rsa_rc4_128_md5::RC4::MD5::128
nsSSLSupportedCiphers: SSL3::rsa_rc4_128_sha::RC4::SHA1::128
nsSSLSupportedCiphers: SSL3::rsa_3des_sha::3DES::SHA1::192
nsSSLSupportedCiphers: SSL3::rsa_des_sha::DES::SHA1::64
nsSSLSupportedCiphers: SSL3::rsa_fips_3des_sha::3DES::SHA1::192
nsSSLSupportedCiphers: SSL3::fips_3des_sha::3DES::SHA1::192
nsSSLSupportedCiphers: SSL3::rsa_fips_des_sha::DES::SHA1::64
nsSSLSupportedCiphers: SSL3::fips_des_sha::DES::SHA1::64
nsSSLSupportedCiphers: SSL3::rsa_rc4_40_md5::RC4::MD5::128
nsSSLSupportedCiphers: SSL3::rsa_rc2_40_md5::RC2::MD5::128
nsSSLSupportedCiphers: SSL3::rsa_null_md5::NULL::MD5::0
nsSSLSupportedCiphers: SSL3::rsa_null_sha::NULL::SHA1::0
nsSSLSupportedCiphers: TLS::tls_rsa_export1024_with_rc4_56_sha::RC4::SHA1::128
nsSSLSupportedCiphers: TLS::rsa_rc4_56_sha::RC4::SHA1::128
nsSSLSupportedCiphers: TLS::tls_rsa_export1024_with_des_cbc_sha::DES::SHA1::64
nsSSLSupportedCiphers: TLS::rsa_des_56_sha::DES::SHA1::64
nsSSLSupportedCiphers: SSL3::fortezza::DES::SHA1::64
nsSSLSupportedCiphers: SSL3::fortezza_rc4_128_sha::DES::SHA1::64
nsSSLSupportedCiphers: SSL3::fortezza_null::DES::SHA1::64
nsSSLSupportedCiphers: SSL3::dhe_dss_des_sha::DES::SHA1::64
nsSSLSupportedCiphers: SSL3::dhe_dss_3des_sha::3DES::SHA1::192
nsSSLSupportedCiphers: SSL3::dhe_rsa_des_sha::DES::SHA1::64
nsSSLSupportedCiphers: SSL3::dhe_rsa_3des_sha::3DES::SHA1::192
nsSSLSupportedCiphers: TLS::tls_rsa_aes_128_sha::AES::SHA1::128
nsSSLSupportedCiphers: TLS::rsa_aes_128_sha::AES::SHA1::128
nsSSLSupportedCiphers: TLS::tls_dhe_dss_aes_128_sha::AES::SHA1::128
nsSSLSupportedCiphers: TLS::tls_dhe_rsa_aes_128_sha::AES::SHA1::128
nsSSLSupportedCiphers: TLS::tls_rsa_aes_256_sha::AES::SHA1::256
nsSSLSupportedCiphers: TLS::rsa_aes_256_sha::AES::SHA1::256
nsSSLSupportedCiphers: TLS::tls_dhe_dss_aes_256_sha::AES::SHA1::256
nsSSLSupportedCiphers: TLS::tls_dhe_rsa_aes_256_sha::AES::SHA1::256
nsSSLSupportedCiphers: TLS::tls_dhe_dss_1024_rc4_sha::RC4::SHA1::128
nsSSLSupportedCiphers: TLS::tls_dhe_dss_rc4_128_sha::RC4::SHA1::128
}}}
attachment stacktrace.1415534491.txt
hi all,
I tried to duplicate this problem today, but this time the total update with TLS succeeded. Another replication agreement, using plain LDAP and simple bind, went wrong.
Stacktrace is in attachment stacktrace.1415534491.txt
Here's the mapping tree; meToidm1.ra.cn.idc, using TLS, succeeded, while cloneAgreement1-idm4.jh.cn.idc-pki-tomcat, using LDAP, failed.
{{{
dn: cn=mapping tree,cn=config
objectClass: top
objectClass: extensibleObject
cn: mapping tree

dn: cn=dc\3Didc,cn=mapping tree,cn=config
objectClass: top
objectClass: extensibleObject
objectClass: nsMappingTree
cn: dc=idc
cn: "dc=idc"
nsslapd-state: backend
nsslapd-backend: userRoot

dn: cn=o\3Dipaca,cn=mapping tree,cn=config
objectClass: top
objectClass: extensibleObject
objectClass: nsMappingTree
cn: o=ipaca
nsslapd-backend: ipaca
nsslapd-state: Backend

dn: cn=replica,cn=dc\3Didc,cn=mapping tree,cn=config
cn: replica
nsDS5Flags: 1
objectClass: top
objectClass: nsds5replica
objectClass: extensibleobject
nsDS5ReplicaType: 3
nsDS5ReplicaRoot: dc=idc
nsds5ReplicaLegacyConsumer: off
nsDS5ReplicaId: 37
nsDS5ReplicaBindDN: cn=replication manager,cn=config
nsState:: JQAAAAAAAAAZVl9UAAAAAAAAAAAAAAAAQ3AAAAAAAAACAAAAAAAAAA==
nsDS5ReplicaName: 2bfcb989-680711e4-9e77d38f-3fc8cf37
nsds5ReplicaChangeCount: 1
nsds5replicareapactive: 0

dn: cn=replica,cn=o\3Dipaca,cn=mapping tree,cn=config
objectClass: top
objectClass: nsDS5Replica
objectClass: extensibleobject
nsDS5ReplicaRoot: o=ipaca
nsDS5ReplicaType: 3
nsDS5ReplicaBindDN: cn=Replication Manager cloneAgreement1-idm4.jh.cn.idc-pki-tomcat,ou=csusers,cn=config
cn: replica
nsDS5ReplicaId: 1185
nsDS5Flags: 1
nsState:: oQQAAAAAAAByVl9UAAAAAAAAAAAAAAAAAAAAAAAAAAABAAAAAAAAAA==
nsDS5ReplicaName: 5ea6b185-680711e4-9e77d38f-3fc8cf37
nsds5ReplicaChangeCount: 0
nsds5replicareapactive: 0

dn: cn=meToidm1.ra.cn.idc,cn=replica,cn=dc\3Didc,cn=mapping tree,cn=config
cn: meToidm1.ra.cn.idc
objectClass: nsds5replicationagreement
objectClass: top
nsDS5ReplicaTransportInfo: TLS
description: me to idm1.ra.cn.idc
nsDS5ReplicaRoot: dc=idc
nsDS5ReplicaHost: idm1.ra.cn.idc
nsds5replicaTimeout: 120
nsDS5ReplicaBindDN: cn=replication manager,cn=config
nsDS5ReplicaPort: 389
nsDS5ReplicaBindMethod: simple
nsDS5ReplicatedAttributeList: (objectclass=*) $ EXCLUDE memberof idnssoaserial entryusn krblastsuccessfulauth krblastfailedauth krbloginfailedcount
nsds5ReplicaStripAttrs: modifiersName modifyTimestamp internalModifiersName internalModifyTimestamp
nsDS5ReplicaCredentials: {DES}
nsDS5ReplicatedAttributeListTotal: (objectclass=*) $ EXCLUDE entryusn krblastsuccessfulauth krblastfailedauth krbloginfailedcount
nsds5replicareapactive: 0
nsds5replicaLastUpdateStart: 0
nsds5replicaLastUpdateEnd: 0
nsds5replicaChangesSentSinceStartup:
nsds5replicaLastUpdateStatus: 1 Can't acquire busy replica
nsds5replicaUpdateInProgress: FALSE
nsds5replicaLastInitStart: 0
nsds5replicaLastInitEnd: 0

dn: cn=cloneAgreement1-idm4.jh.cn.idc-pki-tomcat,cn=replica,cn=o\3Dipaca,cn=mapping tree,cn=config
objectClass: top
objectClass: nsds5replicationagreement
cn: cloneAgreement1-idm4.jh.cn.idc-pki-tomcat
nsDS5ReplicaRoot: o=ipaca
nsDS5ReplicaHost: idm1.ra.cn.idc
nsDS5ReplicaPort: 389
nsDS5ReplicaBindDN: cn=Replication Manager masterAgreement1-idm4.jh.cn.idc-pki-tomcat,ou=csusers,cn=config
nsDS5ReplicaBindMethod: Simple
description: cloneAgreement1-idm4.jh.cn.idc-pki-tomcat
nsDS5ReplicaCredentials: {DES}
nsds5replicareapactive: 0
nsds5replicaLastUpdateStart: 0
nsds5replicaLastUpdateEnd: 0
nsds5replicaChangesSentSinceStartup:
nsds5replicaLastUpdateStatus: 402 Replication error acquiring replica: unknown error - Replica has different database generation ID, remote replica may need to be initialized
nsds5replicaUpdateInProgress: FALSE
nsds5replicaLastInitStart: 0
nsds5replicaLastInitEnd: 0

search: 2
result: 0 Success
}}}
I've made some more tests - I am able to reproduce the problem with a dbgen-generated database and without any schema modification, so it's not a particular schema/entry issue. The dbgen command I used:
{{{
dbgen.pl -o /tmp/db-test.ldif -s "dc=id,dc=polytechnique,dc=edu" -m uid -n 30000
}}}
This bug is likely related to https://bugzilla.redhat.com/show_bug.cgi?id=1166265
Looks like a bug for Red Hat internal use only - I am unable to see it.
attachment 0001-Ticket-47942-DS-hangs-during-online-total-update.patch
The fix is also expected for FreeIPA 4.1.3 (https://fedorahosted.org/freeipa/ticket/4756) in RHEL 7.0/7.1 (DS 1.3.3).
I'll test this fix soon and report the results.
Is it possible that this problem (i.e. the sending thread holding the lock and pushing a great deal of entries, thus overflowing the consumer) may also happen during a regular incremental sync? I had similar replica timeouts when I made a lot of modifications in a short time on the source server (like adding an attribute to each LDAP entry, for example).
The original code has a comment saying that the sending thread is locked in order to "prevent the result reader thread from closing the connection out from under us." This patch does the opposite - it gives the result reader thread a time window (100ms every 1s) in which it could close the connection. Is this possibility taken into account when you subsequently re-acquire the connection lock?
{{{
668 PR_Unlock(conn->lock);
669 DS_Sleep(PR_MillisecondsToInterval(yeild_delay_msec));
670 PR_Lock(conn->lock);
}}}
---&gt; does the conn still exist at line 670, or was it closed by the result reader thread?
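For illustration, here is a toy model of the yield pattern the patch introduces (an editor's sketch in Python with hypothetical names, not the DS code): the sender releases the lock when a write would block, sleeps, re-acquires, and re-checks a "closed" flag after re-acquiring, which is exactly the lifetime question raised above:

```python
import threading, queue, time

entries = queue.Queue(maxsize=4)     # supplier -> consumer buffer
results = queue.Queue(maxsize=4)     # consumer -> supplier buffer
conn_lock = threading.Lock()         # plays the role of conn->lock
state = {"sent": 0, "read": 0, "closed": False}
N = 50

def sender_with_yield():
    # patched repl5_tot_run-like loop: drop the lock when the write would block
    conn_lock.acquire()
    for i in range(N):
        while True:
            try:
                entries.put(i, timeout=0.05)   # would block: back off instead
                break
            except queue.Full:
                conn_lock.release()            # the yield window from the patch
                time.sleep(0.01)
                conn_lock.acquire()
                if state["closed"]:            # conn may be gone after re-acquiring!
                    conn_lock.release()
                    return
        state["sent"] += 1
    conn_lock.release()

def result_reader():
    # repl5_tot_result_threadmain-like loop: can now run during the yield windows
    while state["read"] < N:
        with conn_lock:
            while True:
                try:
                    results.get_nowait()
                    state["read"] += 1
                except queue.Empty:
                    break
        time.sleep(0.005)

def consumer():
    # the remote replica: one result back per entry received
    for _ in range(N):
        results.put(entries.get())

ts = [threading.Thread(target=f) for f in (sender_with_yield, result_reader, consumer)]
for t in ts:
    t.start()
for t in ts:
    t.join(timeout=5)
print(state["sent"], state["read"])  # both reach N: no wedge in the toy model
```

In this simplified model the periodic release is enough to unwedge everything, but note the `closed` check: once the lock is dropped, the reader really can close the connection during the window, so the re-lock must be followed by a validity check (or the connection must be kept alive by a refcount) - which is precisely the concern about line 670.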
I've patched repl5_connection.c of 1.3.2.24; the tests show that the problem is still present :(
{{{
[28/Nov/2014:09:09:43 +0100] NSMMReplicationPlugin - agmt="cn=Replication from ldap-model.polytechnique.fr to ldap-edev.polytechnique.fr" (ldap-edev:636): The remote replica has a different database generation ID than the local database. You may have to reinitialize the remote replica, or the local replica.
[28/Nov/2014:09:10:21 +0100] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=id,dc=polytechnique,dc=edu is going offline; disabling replication
[28/Nov/2014:09:10:21 +0100] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[28/Nov/2014:09:10:41 +0100] - import userRoot: Processed 2294 entries -- average rate 114.7/sec, recent rate 114.7/sec, hit ratio 0%
[28/Nov/2014:09:11:01 +0100] - import userRoot: Processed 3639 entries -- average rate 91.0/sec, recent rate 91.0/sec, hit ratio 100%
[28/Nov/2014:09:11:22 +0100] - import userRoot: Processed 3639 entries -- average rate 60.6/sec, recent rate 33.6/sec, hit ratio 0%
[28/Nov/2014:09:11:42 +0100] - import userRoot: Processed 3639 entries -- average rate 45.5/sec, recent rate 0.0/sec, hit ratio 0%
[28/Nov/2014:09:12:02 +0100] - import userRoot: Processed 3639 entries -- average rate 36.4/sec, recent rate 0.0/sec, hit ratio 0%
[28/Nov/2014:09:12:22 +0100] - import userRoot: Processed 3639 entries -- average rate 30.3/sec, recent rate 0.0/sec, hit ratio 0%
...
[28/Nov/2014:09:15:22 +0100] - import userRoot: Decided to end this pass because the progress rate has dropped below the 50% threshold.
[28/Nov/2014:09:15:22 +0100] - import userRoot: Ending pass number 1 ...
[28/Nov/2014:09:15:22 +0100] - import userRoot: Foreman is done; waiting for workers to finish...
[28/Nov/2014:09:15:22 +0100] - import userRoot: Workers finished; cleaning up...
[28/Nov/2014:09:15:22 +0100] - import userRoot: Workers cleaned up.
[28/Nov/2014:09:15:22 +0100] - import userRoot: Sweeping files for merging later...
[28/Nov/2014:09:15:23 +0100] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[28/Nov/2014:09:15:23 +0100] - import userRoot: Sweep done.
[28/Nov/2014:09:15:23 +0100] - import userRoot: Beginning pass number 2
[28/Nov/2014:09:15:43 +0100] - import userRoot: Processed 3639 entries (pass 2) -- average rate 214748182.8/sec, recent rate 0.0/sec, hit ratio 0%
[28/Nov/2014:09:16:03 +0100] - import userRoot: Processed 3639 entries (pass 2) -- average rate 107374091.4/sec, recent rate 0.0/sec, hit ratio 0%
[28/Nov/2014:09:16:23 +0100] - import userRoot: Processed 3639 entries (pass 2) -- average rate 71582727.6/sec, recent rate 0.0/sec, hit ratio 0%
...
}}}
The total update of the RA without SSL still goes fine using the same initial data:
{{{
[28/Nov/2014:09:25:16 +0100] NSMMReplicationPlugin - agmt="cn=Replication from ldap-model.polytechnique.fr to ldap-edev.polytechnique.fr" (ldap-edev:636): The remote replica has a different database generation ID than the local database. You may have to reinitialize the remote replica, or the local replica.
[28/Nov/2014:09:25:39 +0100] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=id,dc=polytechnique,dc=edu is going offline; disabling replication
[28/Nov/2014:09:25:39 +0100] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[28/Nov/2014:09:25:59 +0100] - import userRoot: Processed 10363 entries -- average rate 518.1/sec, recent rate 518.1/sec, hit ratio 0%
[28/Nov/2014:09:26:19 +0100] - import userRoot: Processed 22080 entries -- average rate 552.0/sec, recent rate 552.0/sec, hit ratio 100%
[28/Nov/2014:09:26:36 +0100] - import userRoot: Workers finished; cleaning up...
[28/Nov/2014:09:26:36 +0100] - import userRoot: Workers cleaned up.
[28/Nov/2014:09:26:36 +0100] - import userRoot: Indexing complete. Post-processing...
[28/Nov/2014:09:26:36 +0100] - import userRoot: Generating numSubordinates complete.
[28/Nov/2014:09:26:37 +0100] - import userRoot: Flushing caches...
[28/Nov/2014:09:26:37 +0100] - import userRoot: Closing files...
[28/Nov/2014:09:26:37 +0100] - import userRoot: Import complete. Processed 29730 entries in 58 seconds. (512.59 entries/sec)
[28/Nov/2014:09:26:37 +0100] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=id,dc=polytechnique,dc=edu is coming online; enabling replication
}}}
The strange thing about your patch is that the line numbers do not correspond to the HEAD file (https://git.fedorahosted.org/cgit/389/ds.git/tree/ldap/servers/plugins/replication/repl5_connection.c). Maybe there are some other changes I need to apply to repl5_connection.c in order to test the patch correctly?
Replying to [comment:19 pj101]:
I'll test this fix soon and report the results. Is it possible that this problem (i.e. the sending thread holding the lock and pushing a great deal of entries, thus overflowing the consumer) may also happen during a regular incremental sync? I had similar replica timeouts when I made a lot of modifications in a short time on the source server (adding an attribute to each LDAP entry, for example).
Good point. I need to check whether it can happen during incremental replication. Incremental replication is different: it sends updates, not entries, so the amount of work on the consumer side is lower. Also, the replication agreement should wait for the ack of the previous update before sending the next one. But again, it is worth confirming.
Replying to [comment:20 pj101]:
The original code has a comment saying that the sending thread is locked in order to "prevent the result reader thread from closing the connection out from under us." This patch does the opposite: it gives a time window (100ms every 1s) for the result reader thread to close the connection. Is this possibility taken into account when you make the subsequent connection re-lock?
{{{
668 PR_Unlock(conn->lock);
669 DS_Sleep(PR_MillisecondsToInterval(yeild_delay_msec)); ---> does the conn still exist here, or was it closed by the read result thread?
670 PR_Lock(conn->lock);
}}}
Good point again !.. Thanks for the review :) The connection is not freed until all entries have been sent (or error condition) and a 'end of total update' is sent to the consumer. So I think there is no problem here.
Replying to [comment:22 pj101]:
The patch I attached was done on 1.3.1. Maybe the lines differ, but I would expect it to match.
Yes, it matches OK for HEAD or 1.3.2.24. If it was a diff against the 1.3.1 sources, that explains the discrepancy in line numbers.
Do you confirm the fix fails for your test case? I could provide an instrumented version that logs debug info into the access log, so as not to change the dynamics of the import. Logging into the error log slows down the supplier and may hide the issue.
Yes, it fails our test case - I've tested several times.
It's OK to send me an instrumented version. However, we compile from sources and install into a particular folder via our installation/setup/import scripts, so an RPM would be useless.
Concerning the dynamics: in our case we use "nsslapd-accesslog-logbuffering: off" on our DEV servers. The bug shows up regardless of whether access log buffering is off or on...
attachment 0001-Ticket-47942-debug-DS-hangs-during-online-total-upda.patch
Hello,
I prepared a debug version of the patch (https://fedorahosted.org/389/attachment/ticket/47942/0001-Ticket-47942-debug-DS-hangs-during-online-total-upda.patch). Because the flow-control problem between the RA reader/writer is dynamic, the debug logs are written to the access log (buffered) so that they do not slow down the supplier.
To get valid info with that debug patch, the supplier needs 'nsslapd-accesslog-logbuffering: on'.
thanks
In fact the bug persists whether I enable or disable access log buffering (with the debug version of the patch). Though it is clear that when log buffering is disabled, the problem appears later during the initialization (7403 entries with log buffering and 26798 without). I've attached the error, access and audit logs for both cases, for the supplier (ldap-edev) and the consumer (ldap-model).
supplier logs with buffering on edev-logs-buffering-on.tar.gz
consumer logs with buffering on model-logs-buffering-on.tar.gz
supplier logs with buffering off edev-logs-buffering-off.tar.gz
consumer logs with buffering off model-logs-buffering-off.tar.gz
Monitoring the sent entries and the received acks, the consumer is always ~2000 entries behind the supplier.
The supplier keeps sending data without having to wait, because the connection reports the ability to write. The last log shows no delay before sending the next entry:
{{{ [29/Nov/2014:14:34:22 +0100] agmt="cn=Replication from xxx" (ldap-supplier:636): perform_operation sends operation (5) }}}
{{{
[29/Nov/2014:14:34:24 +0100] - import userRoot: Processed 7403 entries -- average rate 92.5/sec, recent rate 70.2/sec, hit ratio 100%
[29/Nov/2014:14:34:44 +0100] - import userRoot: Processed 7403 entries -- average rate 74.0/sec, recent rate 26.6/sec, hit ratio 0%
...
[29/Nov/2014:14:45:26 +0100] - import userRoot: Processed 7403 entries (pass 3) -- average rate 35791332.4/sec, recent rate 0.0/sec, hit ratio 0%
[29/Nov/2014:14:45:46 +0100] - import userRoot: Processed 7403 entries (pass 3) -- average rate 30678284.9/sec, recent rate 0.0/sec, hit ratio 0%
[29/Nov/2014:14:46:01 +0100] - slapd_poll(256) timed out
}}}
I believe the provided fix fails because the supplier, testing the connection, detects that it can write, but it does not have room to write a full entry. This can be confirmed with several pstacks (consumer/supplier) taken while the full init is not progressing.
I am implementing a new patch. Would it be possible for you to test it in your environment? (I am not able to reproduce your exact failure.)
ioblocktimeout in our case is 700000 (~12-13 minutes). The initial stack traces (without the patch) are attached at the very beginning of the ticket.
Sure, no problem with testing it in our environment - I want the problem to be resolved :)
attachment 0001-Ticket-47942-debug-2nd-fix-DS-hangs-during-online-to.patch
Thanks for the stack traces, I will look at them. They are from the original failure, so they will not show the problem with the previous tentative fix, but they will give a good indication.
I tested the debug patch I just attached. It succeeds, but in my test env I am not able to reproduce your exact problem. Would it be possible for you to backport the patch (done on top of 1.3.1) and test it (with access log buffering 'on')? In case of failure, please provide the access/error logs and pstacks of the supplier/consumer.
Thanks again for your help
You were right - the last patch works like a charm. I've attached the "successful" logs to the ticket. The throttle ("check_flow_control_tot_init give time to the consumer before sending more entries") kicks in approximately every two or three seconds.
I still have several questions/remarks:
* The bug never appeared without SSL; I was unable to reproduce it without encryption (be it SSL/TLS or Kerberos). What could be the reason? Maybe these encryption/decryption protocols have their own buffers, thus changing the flow dynamics?
* The choice of a "lag" of 1000 entries is rather arbitrary/empirical (as is the 2 seconds of sleep). It is OK for the entries in our LDAP (average size ~5-6 kB), but what happens if the entries are larger (smaller should be fine)? Maybe it should be based on the volume of sent data, if the limitation is the TCP socket buffer sizes?
* I've already asked about incremental replication. If we make a lot of modifications in a short time period (or if the consumer goes down and then comes back up, needing to catch up), the situation will be exactly the same: the supplier will flood the consumer. So the question is: does incremental replication use the same code path (perform_operation() in repl5_connection.c)? And if so, maybe it is worth implementing the same sort of "dike" for incremental replication?
supplier logs - succesful edev-logs-buffering-on-success.tar.gz
consumer logs - succesful model-logs-buffering-on-success.tar.gz
Replying to [comment:34 pj101]:
You were right - the last patch works like a charm. I've attached the "successful" logs to the ticket. The throttle ("check_flow_control_tot_init give time to the consumer before sending more entries") happens approx every two or three seconds.
I still have several questions/remarks: the bug never appeared without SSL. I was unable to reproduce it without any encryption (be it SSL/TLS or Kerberos). What could be the reason? Maybe these encryption/decryption protocols have their own buffers, thus changing the flow dynamics?
Yes. NSS does a lot of buffering and locking which will definitely throw off the timing. Same with gssapi/kerberos.
Maybe it should be based on the volume of the sent data, if the limitation is the TCP socket buffer sizes?
Probably. We'll need some more testing with different data sets to determine that.
I've already asked about incremental replication. If we make a lot of modifications during a short time period (or if the consumer is down and then comes back up, needing to catch up), the situation will be exactly the same: the supplier will flood the consumer. So the question is: does incremental replication use the same code path (perform_operation() in repl5_connection.c)?
Yes.
And if it is the case then maybe it is worth implementing the same sort of "dike" for incremental replication?
Yes. But because of the way that incremental replication works, it will probably be much more difficult to reproduce the problem.
attachment 0001-Ticket-47942-DS-hangs-during-online-total-update.2.patch
Thanks Mark for the review. I will revisit the fix with the indentation problem.
You raise the same point that pj101 reported and Rich confirmed. Yes, looking at the code I think you are right: the same flow-control issue can occur in incremental replication.
In old versions (5.0?) flow control happened implicitly for incremental replication, because the RA sender waited for the sent update to be acknowledged before sending the next one.
In the current version this flow control is not done. The RA sender builds a list of expected acknowledgements but continues to send until there are no more updates to send. This new version offers better replication latency, at the risk of flow-control congestion.
The risk in incremental update should be much more limited than with total update, as the amount of sent data is smaller (likely fewer updates to send than entries, and the size of an update < the size of an entry). But the risk exists, and the backtrace from reliab15 shows it can occur.
I will look at how difficult it would be to do a common flow control (total and incremental).
Incremental and total update use different threads and functions/callbacks to send/receive updates/acks. Still, a similar fix on the incremental side looks quite easy to implement. I will test a new fix that includes flow control on both total and incremental updates.
For total update, the window/pause are configurable. I am not sure it is worth making them configurable for incremental update.
attachment 0002-Ticket-47942-DS-hangs-during-online-total-update.patch
attachment 0003-Ticket-47942-DS-hangs-during-online-total-update-inc.patch
Looks like there are still indentation issues in:
ldap/servers/plugins/replication/repl5_agmtlist.c
{{{
agmtlist_modify_callback(Slapi_PBlock *pb, Slapi_Entry *entryBefore, Slapi_Entry
326             rc = SLAPI_DSE_CALLBACK_ERROR;
327         }
328     }
329     else if (slapi_attr_types_equivalent(mods[i]->mod_type,
330                 type_nsds5ReplicaTotalWindow))
331     {
332         /* New replica timeout */
333         if (agmt_set_totalwindow_from_entry(agmt, e) != 0)
334         {
335             slapi_log_error(SLAPI_LOG_REPL, repl_plugin_name, "agmtlist_modify_callback: "
336                             "failed to update total update window for agreement %s\n",
337                             agmt_get_long_name(agmt));
338             returncode = LDAP_OPERATIONS_ERROR;
339             rc = SLAPI_DSE_CALLBACK_ERROR;
340         }
341     }
342     else if (slapi_attr_types_equivalent(mods[i]->mod_type,
343                 type_nsds5ReplicaTotalPause))
344     {
345         /* New replica timeout */
346         if (agmt_set_totalpause_from_entry(agmt, e) != 0)
347         {
348             slapi_log_error(SLAPI_LOG_REPL, repl_plugin_name, "agmtlist_modify_callback: "
349                             "failed to update total update pause for agreement %s\n",
350                             agmt_get_long_name(agmt));
351             returncode = LDAP_OPERATIONS_ERROR;
352             rc = SLAPI_DSE_CALLBACK_ERROR;
353         }
354     }
}}}
and ldap/servers/plugins/replication/repl5_agmt.c
{{{
354
355     /* Total update window. */
356     ra->totalUpdateWindow = DEFAULT_TOTALWINDOW;
357     if (slapi_entry_attr_find(e, type_nsds5ReplicaTotalWindow, &sattr) == 0)
358     {
359         Slapi_Value *sval;
360         if (slapi_attr_first_value(sattr, &sval) == 0)
361         {
362             ra->totalUpdateWindow = slapi_value_get_long(sval);
363         }
364     }
365
366     /* Total update pause. */
367     ra->totalUpdatePause = DEFAULT_TOTALPAUSE;
368     if (slapi_entry_attr_find(e, type_nsds5ReplicaTotalPause, &sattr) == 0)
369     {
370         Slapi_Value *sval;
371         if (slapi_attr_first_value(sattr, &sval) == 0)
372         {
373             ra->totalUpdatePause = slapi_value_get_long(sval);
374         }
375     }
}}}
and here:
ldap/servers/plugins/replication/repl5_inc_protocol.c
{{{
send_updates(Private_Repl_Protocol *prp, RUV *remote_update_vector, PRUint32 nu
1705     int finished = 0;
1706     ConnResult replay_crc;
1707     char csn_str[CSN_STRSIZE];
1708     int inc_update_window = 100;  /* no more than 100 unacked updates before pause */
1709     int inc_update_pause = 2000;  /* pause of 2sec if the consumer is too late */
}}}
I find it much easier to spot the indentation issues looking at the patch file, than looking at the code in an IDE.
The rest looks good to me.
Since the fix is essentially the same in both cases, wouldn't it be logical to reuse the same two attributes for the same purpose (renaming them, of course, to something like nsds5ReplicaMaximumLag and nsds5ReplicaLagPause)?
Thanks Mark and pj101 for the review. I will rework the indentation.
Regarding the use of the same attributes, I think it is a good idea. Total and incremental updates do not send the same data, nor likely the same amount, so the optimal settings could differ. But if the tuning needs adapting, it can be done online. So I do not see any drawback to sharing the same attributes. I will reimplement it and merge the two fixes.
Thanks again
The problem I have with this is that it is going to be difficult to determine the optimal numbers for the window size and the sleep period. How can we determine what the optimal values are?
For the pause period, is it necessary to sleep for some period of time, or is just doing a thread yield sufficient? The purpose of the sleep is to allow the other threads to run, correct? A yield should do that - essentially, a sleep with a time of 0.
For the window size, is there some way we could determine the optimal size? I just don't like giving people knobs to turn without some way to determine how to set them, and we also need to set a reasonable default value.
Replying to [comment:43 rmeggins]:
I fully agree. The optimal values could be continuously computed by the RA, by monitoring how fast the consumer consumes the sent entries/updates. Should it be implemented in this ticket or another one?
I think a yield should work as well (needs to be tested), but the test on sent/received messages should likely be done before sending the next entry. The main difference between yield and sleep is whether we want to give the consumer time to catch up.
Right. I do not know how to compute the amount of data that can go on the connection before taking the risk of being blocked. It also depends on how much encryption expands the data.
Replying to [comment:44 tbordaz]:
Replying to [comment:43 rmeggins]:
The problem I have with this is that it is going to be difficult to determine the optimal numbers for the window size and the sleep period. How can we determine what the optimal values are?
I fully agree. The optimal values could be continuously computed by the RA, by monitoring how fast the consumer consumes the sent entries/updates. Should it be implemented in this ticket or another one?
I guess we have the immediate problem of the hang, and we know that this patch fixes. I think we should go ahead with the patch as it is.
Then we should open another ticket as a follow-up, to investigate how to find the optimal values of sleep time and window size, with these goals: 1) find the best general-purpose default values for sleep time and window size; 2) document a procedure for monitoring and tuning replication so that these values can be tuned in production; 3) see if there is a way for the code to automatically adjust these values, e.g. had to sleep 100 ms last time and it wasn't long enough, try 200 ms, etc.
For the pause period, is it necessary to sleep for some period of time, or is just doing a thread yield sufficient? The purpose of the sleep is to allow the other threads to run, correct? A yield should do that - essentially, a sleep with a time of 0.
I think a yield should work as well (needs to be tested), but the test on sent/received messages should likely be done before sending the next entry. The main difference between yield and sleep is whether we want to give the consumer time to catch up.
Ok.
For the window size, is there some way we could determine the optimal size? I just don't like giving people knobs to turn without some way to determine how to set them, and we also need to set a reasonable default value.
Right. I do not know how to compute the amount of data that can go on the connection before taking the risk of being blocked. It also depends on how much encryption expands the data.
There are a few new log messages that are FATAL - I'm not sure these should be FATAL - we don't want to spam the customer's error logs unnecessarily. However, if we tell them that the only way to determine if this hang situation is happening is by turning on the REPL log level, there are some customers who will not want to do this in their production environments because it will slow down the server too much. Perhaps we could add some sort of monitor entry/attribute for this?
attachment 0004-Ticket-47942-DS-hangs-during-online-total-update.patch
In conn_write_available:
* "yeild" is spelled "yield"
* I don't think the log message should be at FATAL level; I suggest REPL
* return_value should be initialized, otherwise you will get compiler and/or coverity warnings
* should check for other stop states in the while loop - for example, what if someone wants to break the loop without shutting down slapd, e.g. by deleting the agreement, stopping the agreement, or disabling the agreement?
attachment 0005-Ticket-47942-DS-hangs-during-online-total-update.patch
Hi Thierry, I have 2 questions...
1) Is the value '0' or a negative value treated specially for these 2 parameters: flowControlWindow and flowControlPause?
Reading the code, the value '0' does not do much harm. (In see_if_write_available, if flowControlWindow is 0, it goes to the pause every time. It should be okay, right?) But a negative number accidentally set in flowControlPause might be bad, since PR_MillisecondsToInterval takes an unsigned 32-bit integer PRUint32. Do we want to check (totalUpdatePause > 0)?
{{{
691     } else if ((sent_msgid - rcv_msgid) > agmt_get_flowcontrolwindow(conn->agmt)) {
692         int totalUpdatePause;
693
694         totalUpdatePause = agmt_get_flowcontrolpause(conn->agmt);
695         if (totalUpdatePause) {
...
717             DS_Sleep(PR_MillisecondsToInterval(totalUpdatePause));
}}}
Same for repl5_inc_flow_control_results:
{{{
476     DS_Sleep(PR_MillisecondsToInterval(agmt_get_flowcontrolpause(agmt)));
}}}
2) Maybe it is a corner case :), but in case flowControlWindow is 0, total update does not pause if sent and rcv are identical, but incremental does, due to '>='. I'm curious why total update uses '>' and incremental '>='?
{{{
474     ((rd->last_message_id_sent - rd->last_message_id_received) >= agmt_get_flowcontrolwindow(agmt))) {
}}}
Replying to [comment:48 nhosoi]:
Hi Thierry, I have 2 questions... 1) Is the value '0' or a negative value treated specially for these 2 parameters? flowControlWindow
0 means "do flow control check for every message"
flowControlPause
0 means do a thread yield instead of sleep for N milliseconds
Reading the code, the value '0' does not do much harm. (In see_if_write_available, if flowControlWindow is 0, it goes to the pause every time. It should be okay, right?) Maybe, if a negative number accidentally set to flowControlPause might be bad since PR_MillisecondsToInterval takes unsigned 32bit integer PRUint32. Do we want to check (totalUpdatePause > 0)?
Note that there are several replication parameters that do not check for negative or otherwise invalid values. We should fix all of them.
{{{
691     } else if ((sent_msgid - rcv_msgid) > agmt_get_flowcontrolwindow(conn->agmt)) {
692         int totalUpdatePause;
693
694         totalUpdatePause = agmt_get_flowcontrolpause(conn->agmt);
695         if (totalUpdatePause) {
...
717             DS_Sleep(PR_MillisecondsToInterval(totalUpdatePause));
}}}
Same for repl5_inc_flow_control_results:
{{{
476     DS_Sleep(PR_MillisecondsToInterval(agmt_get_flowcontrolpause(agmt)));
}}}
2) Maybe it is a corner case :), but in case flowControlWindow is 0, total update does not pause if sent and rcv are identical, but incremental does, due to '>='. I'm curious why total update uses '>' and incremental '>='?
{{{
474     ((rd->last_message_id_sent - rd->last_message_id_received) >= agmt_get_flowcontrolwindow(agmt))) {
}}}
{{{
git merge ticket47942
Updating 96c130b..fbafee5
Fast-forward
 ldap/schema/01core389.ldif                            |   4 +-
 ldap/servers/plugins/replication/repl5.h              |  10 +++++
 ldap/servers/plugins/replication/repl5_agmt.c         | 160 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 ldap/servers/plugins/replication/repl5_agmtlist.c     |  26 +++++++++++
 ldap/servers/plugins/replication/repl5_connection.c   | 163 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 ldap/servers/plugins/replication/repl5_inc_protocol.c |  32 ++++++++++++-
 ldap/servers/plugins/replication/repl5_prot_private.h |   2 +
 ldap/servers/plugins/replication/repl5_tot_protocol.c |  53 +++++++++++++++++++++-
 ldap/servers/plugins/replication/repl_globals.c       |   2 +
 9 files changed, 446 insertions(+), 6 deletions(-)
}}}
{{{
git push origin master
Counting objects: 31, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (16/16), done.
Writing objects: 100% (16/16), 6.74 KiB, done.
Total 16 (delta 14), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   96c130b..fbafee5  master -> master

commit fbafee5
Author: Thierry bordaz (tbordaz) <tbordaz@redhat.com>
Date: Mon Dec 15 15:12:35 2014 +0100
}}}
{{{
git push origin 389-ds-base-1.3.3
Counting objects: 31, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (16/16), done.
Writing objects: 100% (16/16), 6.75 KiB, done.
Total 16 (delta 14), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   61a4e70..69df018  389-ds-base-1.3.3 -> 389-ds-base-1.3.3

commit 69df018
Author: Thierry bordaz (tbordaz) <tbordaz@redhat.com>
Date: Mon Dec 15 15:12:35 2014 +0100
}}}
{{{
git push origin 389-ds-base-1.3.2
Counting objects: 31, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (16/16), done.
Writing objects: 100% (16/16), 6.71 KiB, done.
Total 16 (delta 14), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   09eff3d..9851929  389-ds-base-1.3.2 -> 389-ds-base-1.3.2

commit 9851929
Author: Thierry bordaz (tbordaz) <tbordaz@redhat.com>
Date: Mon Dec 15 15:12:35 2014 +0100
}}}
Backported to 389-ds-base-1.3.1:
{{{
f5458c4..3f78755  389-ds-base-1.3.1 -> 389-ds-base-1.3.1
commit 3f787554300db2ce901d244ff728aaa910e73f7a
}}}
git patch file (1.2.11 branch) -- backported by Jatin Nansi 0001-Ticket-47942-DS-hangs-during-online-total-update.3.patch
Metadata Update from @pj101:
- Issue assigned to tbordaz
- Issue set to the milestone: 1.2.11.33
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/1273
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
Thank you for understanding. We apologize for any inconvenience.
Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Fixed)