#47942 DS hangs during online total update
Closed: wontfix. Opened 9 years ago by pj101.

When trying to perform a total (initial) update of a replica, the transfer randomly hangs during data transmission (https://lists.fedoraproject.org/pipermail/389-users/2014-November/017602.html).

Factors that make the problem worse: using SSL/TLS in the replication agreement, a single CPU in a VM, or a reduced nsslapd-db-checkpoint-interval.

In my tests it hangs every time (I've checked all the branches 1.3.1, 1.3.2 and 1.3.3, on CentOS 7). We have about 30000 entries. The hang is random and probably time-dependent, not entry-dependent.

I used 389-ds-base-1.3.2.24.tar.bz2 extracted from git and compiled it on CentOS 7.
The two servers are ldap-edev and ldap-model; nsslapd-db-checkpoint-interval is 5.

The replicas:
<ldap-model>
cn=replica,cn=dc\3Did\2Cdc\3Dpolytechnique\2Cdc\3Dedu,cn=mapping tree,cn=config
objectClass: top
objectClass: nsDS5Replica
cn: replica
nsDS5ReplicaId: 1
nsDS5ReplicaRoot: dc=id,dc=polytechnique,dc=edu
nsDS5Flags: 1
nsDS5ReplicaBindDN: cn=RepliX,cn=config
nsds5ReplicaPurgeDelay: 604800
nsds5ReplicaTombstonePurgeInterval: 86400
nsds5ReplicaLegacyConsumer: False
nsDS5ReplicaType: 3
nsState:: AQAAAAAAAAB3X1tUAAAAAAAAAAAAAAAAAAAAAAAAAAABAAAAAAAAAA==
nsDS5ReplicaName: 59006004-65aa11e4-8781c5ac-8ec7ed31
nsds5ReplicaChangeCount: 0
nsds5replicareapactive: 0

<ldap-edev>
cn=replica,cn=dc\3Did\2Cdc\3Dpolytechnique\2Cdc\3Dedu,cn=mapping tree,cn=config
objectClass: top
objectClass: nsDS5Replica
cn: replica
nsDS5ReplicaId: 2
nsDS5ReplicaRoot: dc=id,dc=polytechnique,dc=edu
nsDS5Flags: 1
nsDS5ReplicaBindDN: cn=RepliX,cn=config
nsds5ReplicaPurgeDelay: 604800
nsds5ReplicaTombstonePurgeInterval: 86400
nsds5ReplicaLegacyConsumer: False
nsDS5ReplicaType: 3
nsState:: AgAAAAAAAAB3X1tUAAAAAAAAAAAAAAAAAAAAAAAAAAABAAAAAAAAAA==
nsDS5ReplicaName: 5998f684-65aa11e4-89888cf6-5e3b66a8
nsds5ReplicaChangeCount: 0
nsds5replicareapactive: 0

The replication agreements:
<ldap-model>
0 cn=Replication from ldap-model.polytechnique.fr to ldap-edev.polytechnique.fr,cn=replica,cn=dc\3Did\2Cdc\3Dpolytechnique\2Cdc\3Dedu,cn=mapping tree,cn=config
objectClass: top
objectClass: nsDS5ReplicationAgreement
cn: Replication from ldap-model.polytechnique.fr to ldap-edev.polytechnique.fr
description: Replication agreement from server ldap-model.polytechnique.fr to server ldap-edev.polytechnique.fr
nsDS5ReplicaHost: ldap-edev.polytechnique.fr
nsDS5ReplicaRoot: dc=id,dc=polytechnique,dc=edu
nsDS5ReplicaPort: 636
nsDS5ReplicaTransportInfo: SSL
nsDS5ReplicaBindDN: cn=RepliX,cn=config
nsDS5ReplicaBindMethod: simple
nsds5replicaBusyWaitTime: 3
nsds5replicaTimeout: 180
nsDS5ReplicaCredentials: {DES}YVUAJ+/R1UvHTyogaLDeUHPg7lS627P9
nsds5replicareapactive: 0
nsds5replicaLastUpdateStart: 0
nsds5replicaLastUpdateEnd: 0
nsds5replicaChangesSentSinceStartup:
nsds5replicaLastUpdateStatus: 402 Replication error acquiring replica: unknown error - Replica has different database generation ID, remote replica may need to be initialized
nsds5replicaUpdateInProgress: FALSE
nsds5replicaLastInitStart: 0
nsds5replicaLastInitEnd: 0

<ldap-edev>
0 cn=Replication from ldap-edev.polytechnique.fr to ldap-model.polytechnique.fr,cn=replica,cn=dc\3Did\2Cdc\3Dpolytechnique\2Cdc\3Dedu,cn=mapping tree,cn=config
objectClass: top
objectClass: nsDS5ReplicationAgreement
cn: Replication from ldap-edev.polytechnique.fr to ldap-model.polytechnique.fr
description: Replication agreement from server ldap-edev.polytechnique.fr to server ldap-model.polytechnique.fr
nsDS5ReplicaHost: ldap-model.polytechnique.fr
nsDS5ReplicaRoot: dc=id,dc=polytechnique,dc=edu
nsDS5ReplicaPort: 636
nsDS5ReplicaTransportInfo: SSL
nsDS5ReplicaBindDN: cn=RepliX,cn=config
nsDS5ReplicaBindMethod: simple
nsds5replicaBusyWaitTime: 3
nsds5replicaTimeout: 180
nsDS5ReplicaCredentials: {DES}dG/Y05iIw4ZwldMtj+xjDGW89SHsDhQo
nsds5replicareapactive: 0
nsds5replicaLastUpdateStart: 0
nsds5replicaLastUpdateEnd: 0
nsds5replicaChangesSentSinceStartup:
nsds5replicaLastUpdateStatus: 402 Replication error acquiring replica: unknown error - Replica has different database generation ID, remote replica may need to be initialized
nsds5replicaUpdateInProgress: FALSE
nsds5replicaLastInitStart: 20141106114951Z
nsds5replicaLastInitEnd: 20141106115308Z
nsds5replicaLastInitStatus: 10 Total update abortedLDAP error: Referral

The error corresponds to the timeout error (see https://fedorahosted.org/389/ticket/47901)

The total update was started in the direction ldap-edev --> ldap-model.
Error logs:
<ldap-model>
[06/Nov/2014:12:49:51 +0100] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=id,dc=polytechnique,dc=edu is going offline; disabling replication
[06/Nov/2014:12:49:51 +0100] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[06/Nov/2014:12:50:11 +0100] - import userRoot: Processed 4239 entries -- average rate 211.9/sec, recent rate 211.9/sec, hit ratio 0%
[06/Nov/2014:12:50:31 +0100] - import userRoot: Processed 4239 entries -- average rate 106.0/sec, recent rate 106.0/sec, hit ratio 0%
[06/Nov/2014:12:50:51 +0100] - import userRoot: Processed 4239 entries -- average rate 70.7/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:12:51:11 +0100] - import userRoot: Processed 4239 entries -- average rate 53.0/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:12:51:33 +0100] - import userRoot: Processed 4239 entries -- average rate 42.0/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:12:51:54 +0100] - import userRoot: Processed 4239 entries -- average rate 34.5/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:12:52:15 +0100] - import userRoot: Processed 4239 entries -- average rate 29.4/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:12:52:35 +0100] - import userRoot: Processed 4239 entries -- average rate 25.8/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:12:52:55 +0100] - import userRoot: Processed 4239 entries -- average rate 23.0/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:12:53:07 +0100] - import userRoot: Workers finished; cleaning up...
[06/Nov/2014:12:53:07 +0100] - import userRoot: Workers cleaned up.
[06/Nov/2014:12:53:07 +0100] - import userRoot: Indexing complete. Post-processing...
[06/Nov/2014:12:53:07 +0100] - import userRoot: Generating numSubordinates complete.
[06/Nov/2014:12:53:07 +0100] - import userRoot: Flushing caches...
[06/Nov/2014:12:53:07 +0100] - import userRoot: Closing files...
[06/Nov/2014:12:53:08 +0100] - import userRoot: Import complete. Processed 4551 entries in 197 seconds. (23.10 entries/sec)
[06/Nov/2014:12:53:08 +0100] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=id,dc=polytechnique,dc=edu is coming online; enabling replication

<ldap-edev>
[06/Nov/2014:12:49:52 +0100] NSMMReplicationPlugin - Beginning total update of replica "agmt="cn=Replication from ldap-edev.polytechnique.fr to ldap-model.polytechnique.fr" (ldap-model:636)".

The stack traces are attached:

  • before the replication total update started - stacktrace.06-Nov-2014_12h47*.txt
  • after the replication is stuck (all the other traces)

The same initial total update, using a replication agreement over plain LDAP (port 389) instead of LDAP over SSL/TLS and with nsslapd-db-checkpoint-interval=20, completes correctly:
[06/Nov/2014:13:17:07 +0100] NSMMReplicationPlugin - agmt="cn=Replication from ldap-model.polytechnique.fr to ldap-edev.polytechnique.fr" (ldap-edev:636): The remote replica has a different database generation ID than the local database. You may have to reinitialize the remote replica, or the local replica.
[06/Nov/2014:13:17:19 +0100] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=id,dc=polytechnique,dc=edu is going offline; disabling replication
[06/Nov/2014:13:17:19 +0100] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[06/Nov/2014:13:17:39 +0100] - import userRoot: Processed 10022 entries -- average rate 501.1/sec, recent rate 501.1/sec, hit ratio 0%
[06/Nov/2014:13:17:59 +0100] - import userRoot: Processed 19850 entries -- average rate 496.1/sec, recent rate 496.0/sec, hit ratio 100%
[06/Nov/2014:13:18:19 +0100] - import userRoot: Workers finished; cleaning up...
[06/Nov/2014:13:18:19 +0100] - import userRoot: Workers cleaned up.
[06/Nov/2014:13:18:19 +0100] - import userRoot: Indexing complete. Post-processing...
[06/Nov/2014:13:18:19 +0100] - import userRoot: Generating numSubordinates complete.
[06/Nov/2014:13:18:19 +0100] - import userRoot: Flushing caches...
[06/Nov/2014:13:18:19 +0100] - import userRoot: Closing files...
[06/Nov/2014:13:18:20 +0100] - import userRoot: Import complete. Processed 29525 entries in 61 seconds. (484.02 entries/sec)
[06/Nov/2014:13:18:20 +0100] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=id,dc=polytechnique,dc=edu is coming online; enabling replication

[06/Nov/2014:13:17:20 +0100] NSMMReplicationPlugin - Beginning total update of replica "agmt="cn=Replication from ldap-edev.polytechnique.fr to ldap-model.polytechnique.fr" (ldap-model:636)".
[06/Nov/2014:13:18:21 +0100] NSMMReplicationPlugin - Finished total update of replica "agmt="cn=Replication from ldap-edev.polytechnique.fr to ldap-model.polytechnique.fr" (ldap-model:636)". Sent 29525 entries.


Both servers appear to be waiting to write to the other server:
model
{{{

################### stacktrace.06-Nov-2014_12h51m57s.txt

Thread 75
poll poll poll _pr_poll_with_poll slapd_poll write_function openldap_write_function ber_int_sb_write ber_flush2 flush_ber send_ldap_result_ext send_ldap_result do_extended connection_dispatch_operation connection_threadmain _pt_root start_thread clone
}}}
edev
{{{

################### stacktrace.06-Nov-2014_12h52m00s.txt

Thread 3
poll poll poll _pr_poll_with_poll see_if_write_available perform_operation conn_send_extended_operation send_entry internal_srch_entry_callback send_ldap_search_entry_ext send_ldap_search_entry iterate send_results_ext op_shared_search search_internal_callback_pb slapi_search_internal_callback_pb repl5_tot_run prot_thread_main _pt_root start_thread clone
}}}

In neither case is any SSL code involved, so it is really strange that SSL affects this.

Does increasing the entropy help? You could also create the VM with an RNG device (I use virt-install, so I pass --rng /dev/random; not sure how that translates to the XML files), and then inside your VM, somewhere early before dirsrv starts, run yum install rng-tools followed by rngd -r /dev/hwrng.

It does not seem to be entropy-related. I ran

watch -n.1 cat /proc/sys/kernel/random/entropy_avail

on both servers before and during the replication. The available entropy varies from 130 to 180 and never drops below that, regardless of whether the agreement uses SSL or not.

I'm not sure how to increase entropy in VMware guests; we are not using KVM/RHEV...

OK, now I am absolutely sure it is not entropy-related. I have installed and started the haveged daemon on both VMs. Now the entropy stays between 2000 and 4000 the whole time. The problem is still the same (once again, only when using SSL in the replication agreement):

[06/Nov/2014:15:50:14 +0100] NSMMReplicationPlugin - agmt="cn=Replication from ldap-model.polytechnique.fr to ldap-edev.polytechnique.fr" (ldap-edev:636): The remote replica has a different database generation ID than the local database. You may have to reinitialize the remote replica, or the local replica.
[06/Nov/2014:15:51:15 +0100] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=id,dc=polytechnique,dc=edu is going offline; disabling replication
[06/Nov/2014:15:51:15 +0100] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[06/Nov/2014:15:51:35 +0100] - import userRoot: Processed 1376 entries -- average rate 68.8/sec, recent rate 68.8/sec, hit ratio 0%
[06/Nov/2014:15:51:55 +0100] - import userRoot: Processed 1376 entries -- average rate 34.4/sec, recent rate 34.4/sec, hit ratio 0%
[06/Nov/2014:15:52:15 +0100] - import userRoot: Processed 1376 entries -- average rate 22.9/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:15:52:35 +0100] - import userRoot: Processed 1376 entries -- average rate 17.2/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:15:52:55 +0100] - import userRoot: Processed 1376 entries -- average rate 13.8/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:15:53:15 +0100] - import userRoot: Processed 1376 entries -- average rate 11.5/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:15:53:35 +0100] - import userRoot: Processed 1376 entries -- average rate 9.8/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:15:53:55 +0100] - import userRoot: Processed 1376 entries -- average rate 8.6/sec, recent rate 0.0/sec, hit ratio 0%

Looking into the stack traces from edev (except the first one, stacktrace.06-Nov-2014_12h47m36s.txt):

  • thread 2 (repl5_tot_result_threadmain) is trying to read the result from model with conn_read_result_ex, but it cannot acquire the lock (conn->lock) and sits there;
  • thread 3 (repl5_tot_run) holds the lock in perform_operation and waits in PR_Poll via see_if_write_available. It times out after nsds5replicaTimeout == 180 seconds.

On model, as Rich mentioned in comment 1, it is trying to write the result back to edev.
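
To make the pattern concrete, here is a stand-alone sketch of the hang (hypothetical demo code, not the DS source; the pipe stands in for the replication connection): the "sender" holds the connection lock across a blocking write, so the "result reader" can never take the lock and drain the data.

{{{
/* deadlock_demo.c -- build with: gcc deadlock_demo.c -pthread
 * Hypothetical stand-alone demo (NOT the DS source): the sender holds the
 * connection lock across a blocking write into a full pipe, so the result
 * reader can never take the lock to drain it. The program hangs forever. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static pthread_mutex_t conn_lock = PTHREAD_MUTEX_INITIALIZER;
static int pipefd[2];                 /* stands in for the replication connection */

static void *sender(void *arg)        /* plays repl5_tot_run/perform_operation */
{
    char entry[4096];
    memset(entry, 'x', sizeof(entry));
    pthread_mutex_lock(&conn_lock);   /* lock held for the whole send loop */
    for (;;) {
        /* once the pipe (socket buffer) is full, write() blocks forever,
         * because the only reader is itself stuck waiting for conn_lock */
        if (write(pipefd[1], entry, sizeof(entry)) > 0)
            fputs("sent one entry\n", stdout);
    }
}

static void *result_reader(void *arg) /* plays repl5_tot_result_threadmain */
{
    char buf[4096];
    pthread_mutex_lock(&conn_lock);   /* never acquired: the sender holds it */
    while (read(pipefd[0], buf, sizeof(buf)) > 0)
        ;                             /* would drain the connection */
    pthread_mutex_unlock(&conn_lock);
    return NULL;
}

int main(void)
{
    pthread_t s, r;
    pipe(pipefd);
    pthread_create(&s, NULL, sender, NULL);
    pthread_create(&r, NULL, result_reader, NULL);
    pthread_join(s, NULL);            /* never returns: the classic hang */
    return 0;
}
}}}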

Andrey, would it be possible for you to try running the total update/replica initialization with nsds5replicaTimeout set to 1 second and see how it goes? Thanks!

If you mean setting a smaller agreement timeout for each replica:

with 60s for both agreements:
[06/Nov/2014:20:42:20 +0100] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=id,dc=polytechnique,dc=edu is going offline; disabling replication
[06/Nov/2014:20:42:20 +0100] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[06/Nov/2014:20:42:40 +0100] - import userRoot: Processed 1404 entries -- average rate 70.2/sec, recent rate 70.2/sec, hit ratio 0%
[06/Nov/2014:20:43:00 +0100] - import userRoot: Processed 1404 entries -- average rate 35.1/sec, recent rate 35.1/sec, hit ratio 0%
[06/Nov/2014:20:43:20 +0100] - import userRoot: Processed 1404 entries -- average rate 23.4/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:20:43:27 +0100] - import userRoot: Workers finished; cleaning up...
[06/Nov/2014:20:43:27 +0100] - import userRoot: Workers cleaned up.
[06/Nov/2014:20:43:27 +0100] - import userRoot: Indexing complete. Post-processing...
[06/Nov/2014:20:43:27 +0100] - import userRoot: Generating numSubordinates complete.
[06/Nov/2014:20:43:27 +0100] - import userRoot: Flushing caches...
[06/Nov/2014:20:43:27 +0100] - import userRoot: Closing files...
[06/Nov/2014:20:43:27 +0100] - import userRoot: Import complete. Processed 1659 entries in 67 seconds. (24.76 entries/sec)
[06/Nov/2014:20:43:27 +0100] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=id,dc=polytechnique,dc=edu is coming online; enabling replication

with 1s for both agreements:

[06/Nov/2014:20:56:08 +0100] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=id,dc=polytechnique,dc=edu is going offline; disabling replication
[06/Nov/2014:20:56:09 +0100] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[06/Nov/2014:20:56:28 +0100] - import userRoot: Workers finished; cleaning up...
[06/Nov/2014:20:56:28 +0100] - import userRoot: Workers cleaned up.
[06/Nov/2014:20:56:28 +0100] - import userRoot: Indexing complete. Post-processing...
[06/Nov/2014:20:56:28 +0100] - import userRoot: Generating numSubordinates complete.
[06/Nov/2014:20:56:28 +0100] - import userRoot: Flushing caches...
[06/Nov/2014:20:56:28 +0100] - import userRoot: Closing files...
[06/Nov/2014:20:56:28 +0100] - import userRoot: Import complete. Processed 5572 entries in 19 seconds. (293.26 entries/sec)
[06/Nov/2014:20:56:28 +0100] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=id,dc=polytechnique,dc=edu is coming online; enabling replication

with 5s for both agreements:

[06/Nov/2014:20:59:55 +0100] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=id,dc=polytechnique,dc=edu is going offline; disabling replication
[06/Nov/2014:20:59:55 +0100] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[06/Nov/2014:21:00:15 +0100] - import userRoot: Processed 4439 entries -- average rate 221.9/sec, recent rate 221.9/sec, hit ratio 0%
[06/Nov/2014:21:00:18 +0100] - import userRoot: Workers finished; cleaning up...
[06/Nov/2014:21:00:18 +0100] - import userRoot: Workers cleaned up.
[06/Nov/2014:21:00:18 +0100] - import userRoot: Indexing complete. Post-processing...
[06/Nov/2014:21:00:18 +0100] - import userRoot: Generating numSubordinates complete.
[06/Nov/2014:21:00:18 +0100] - import userRoot: Flushing caches...
[06/Nov/2014:21:00:18 +0100] - import userRoot: Closing files...
[06/Nov/2014:21:00:18 +0100] - import userRoot: Import complete. Processed 4439 entries in 23 seconds. (193.00 entries/sec)
[06/Nov/2014:21:00:18 +0100] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=id,dc=polytechnique,dc=edu is coming online; enabling replication

If you mean changing the timeout of the agreement ldap-model -> ldap-edev to 1 sec while leaving ldap-edev -> ldap-model at 180s:

[06/Nov/2014:20:48:13 +0100] - import userRoot: Processed 3615 entries -- average rate 180.8/sec, recent rate 180.7/sec, hit ratio 0%
[06/Nov/2014:20:48:33 +0100] - import userRoot: Processed 3615 entries -- average rate 90.4/sec, recent rate 90.3/sec, hit ratio 0%
[06/Nov/2014:20:48:53 +0100] - import userRoot: Processed 3615 entries -- average rate 60.2/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:20:49:13 +0100] - import userRoot: Processed 3615 entries -- average rate 45.2/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:20:49:33 +0100] - import userRoot: Processed 3615 entries -- average rate 36.1/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:20:49:53 +0100] - import userRoot: Processed 3615 entries -- average rate 30.1/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:20:50:14 +0100] - import userRoot: Processed 3615 entries -- average rate 25.8/sec, recent rate 0.0/sec, hit ratio 0%
[06/Nov/2014:20:50:34 +0100] - import userRoot: Processed 3615 entries -- average rate 22.6/sec, recent rate 0.0/sec, hit ratio 0%
...

So it's not getting any better: still a random freeze at some random point.

Thanks for your quick testing. I meant "with 1s for both agreements", expecting to give repl5_tot_result_threadmain more chances to read the total update results from model. But it does not help... We need to reproduce the problem in house for further investigation.

So, duplicating the total update hang requires replication over SSL/TLS and 30000 entries?

Not necessarily. I think it is more time-dependent: the total update has to run long enough for the problem to appear (at least 20-30 seconds, I think).

If you look at the original mail (https://lists.fedoraproject.org/pipermail/389-users/2014-November/017601.html), 陳含林 (laneovcc at gmail.com) had only about 5000 entries and it got stuck at about entry 1500. Disabling SSL in the replication agreement also helped him/her. It would also be interesting to see what server configuration he/she was using.

The total update using 1.2.10 on CentOS 7 always takes only 8-9 seconds for the same data (instead of ~60s for 1.3.x); maybe that's why I've never seen any problem with 1.2.10 either.

Interesting... So, what you suggest is to have a huge DB and run the total update, right?

Please note that Mr. 陳含林's case was caused by some entries larger than the maxbersize.

Anyway, we'd start from there...

Thanks for your input!

No, it was not caused by large entries. This is what he wrote (https://lists.fedoraproject.org/pipermail/389-users/2014-November/017604.html):

While nsDS5ReplicaTransportInfo set to TLS, i have tried
set nsslapd-maxbersize to 2147483647
set nsslapd-cachememsize to 2147483648
set nsslapd-pluginbetxn off
set export LDAPTLS_REQCERT=never in /etc/sysconfig/dirsrv
all of above not work.

After change nsDS5ReplicaTransportInfo from TLS to LDAP, total update finished in 5 sec and succeeded. This should be a workaround for my case.

So he had exactly the same problem as me. Anyway, I'm going to migrate our production to the latest 1.3.2 soon and use the workaround of disabling SSL for total online initialization.

If I can be of any use for this ticket, please don't hesitate to ask me for more tests or information.

And one more thing: our DB is not huge, it's about 30000 entries and about 33 MB in size.

Thanks for the clarification, Andrey. So, TLS (or SSL?) is the issue. Still, we have had no luck duplicating the problem with SSL/TLS enabled on the latest version from master... We may need to go back to older versions.

I've disabled all other plugin modifications during the installation, keeping only the TLS/SSL activation. The problem did not disappear.

Just in case, here is some more information about the test servers (VMs):

  • both VMs (vSphere 5.5U2) have 1 vCPU, 4 GB RAM, CentOS 7 with the latest "yum update"
  • the VM tools used are the OS-supplied open-vm-tools (yum install open-vm-tools)
  • SELinux is disabled
  • IPv6 is disabled via sysctl.conf:
    echo "net.ipv6.conf.all.disable_ipv6 = 1" >> /etc/sysctl.conf
    echo "net.ipv6.conf.default.disable_ipv6 = 1" >> /etc/sysctl.conf
    sysctl -p
  • all the server certs are signed by a single, self-signed authority
  • autobind is activated:
    nsslapd-ldapilisten: on
    nsslapd-ldapiautobind: on
  • 389-ds was compiled without SELinux:
    ./configure --prefix=/Local/dirsrv --enable-autobind --with-openldap --with-tmpfiles-d=/etc/tmpfiles.d --with-systemdsystemunitdir=/usr/lib/systemd/system --with-systemdsystemconfdir=/etc/systemd/system --with-systemdgroupname=dirsrv.target

  • SSL configuration:
    0 cn=RSA,cn=encryption,cn=config
    objectClass: top
    objectClass: nsEncryptionModule
    cn: RSA
    nsSSLPersonalitySSL: ldap-server-cert
    nsSSLToken: internal (software)
    nsSSLActivation: on

0 cn=encryption,cn=config
objectClass: top
objectClass: nsEncryptionConfig
cn: encryption
nsSSLSessionTimeout: 0
nsSSLClientAuth: allowed
nsSSL2: off
nsSSL3: on
nsSSL3Ciphers: -rsa_null_md5,+rsa_rc4_128_md5,-rsa_rc4_40_md5,-rsa_rc2_40_md5,+rsa_des_sha,+rsa_fips_des_sha,+rsa_3des_sha,+rsa_fips_3des_sha,+fortezza,+fortezza_rc4_128_sha,+fortezza_null,+tls_rsa_export1024_with_rc4_56_sha,+tls_rsa_export1024_with_des_cbc_sha,-rc4,-rc4export,-rc2,-rc2export,-des,-desede3
nsSSLSupportedCiphers: SSL3::rc4::RC4::MD5::128
nsSSLSupportedCiphers: SSL3::rc4export::RC4::MD5::128
nsSSLSupportedCiphers: SSL3::rc2::RC2::MD5::128
nsSSLSupportedCiphers: SSL3::rc2export::RC2::MD5::128
nsSSLSupportedCiphers: SSL3::des::DES::MD5::64
nsSSLSupportedCiphers: SSL3::desede3::3DES::MD5::192
nsSSLSupportedCiphers: SSL3::rsa_rc4_128_md5::RC4::MD5::128
nsSSLSupportedCiphers: SSL3::rsa_rc4_128_sha::RC4::SHA1::128
nsSSLSupportedCiphers: SSL3::rsa_3des_sha::3DES::SHA1::192
nsSSLSupportedCiphers: SSL3::rsa_des_sha::DES::SHA1::64
nsSSLSupportedCiphers: SSL3::rsa_fips_3des_sha::3DES::SHA1::192
nsSSLSupportedCiphers: SSL3::fips_3des_sha::3DES::SHA1::192
nsSSLSupportedCiphers: SSL3::rsa_fips_des_sha::DES::SHA1::64
nsSSLSupportedCiphers: SSL3::fips_des_sha::DES::SHA1::64
nsSSLSupportedCiphers: SSL3::rsa_rc4_40_md5::RC4::MD5::128
nsSSLSupportedCiphers: SSL3::rsa_rc2_40_md5::RC2::MD5::128
nsSSLSupportedCiphers: SSL3::rsa_null_md5::NULL::MD5::0
nsSSLSupportedCiphers: SSL3::rsa_null_sha::NULL::SHA1::0
nsSSLSupportedCiphers: TLS::tls_rsa_export1024_with_rc4_56_sha::RC4::SHA1::128
nsSSLSupportedCiphers: TLS::rsa_rc4_56_sha::RC4::SHA1::128
nsSSLSupportedCiphers: TLS::tls_rsa_export1024_with_des_cbc_sha::DES::SHA1::64
nsSSLSupportedCiphers: TLS::rsa_des_56_sha::DES::SHA1::64
nsSSLSupportedCiphers: SSL3::fortezza::DES::SHA1::64
nsSSLSupportedCiphers: SSL3::fortezza_rc4_128_sha::DES::SHA1::64
nsSSLSupportedCiphers: SSL3::fortezza_null::DES::SHA1::64
nsSSLSupportedCiphers: SSL3::dhe_dss_des_sha::DES::SHA1::64
nsSSLSupportedCiphers: SSL3::dhe_dss_3des_sha::3DES::SHA1::192
nsSSLSupportedCiphers: SSL3::dhe_rsa_des_sha::DES::SHA1::64
nsSSLSupportedCiphers: SSL3::dhe_rsa_3des_sha::3DES::SHA1::192
nsSSLSupportedCiphers: TLS::tls_rsa_aes_128_sha::AES::SHA1::128
nsSSLSupportedCiphers: TLS::rsa_aes_128_sha::AES::SHA1::128
nsSSLSupportedCiphers: TLS::tls_dhe_dss_aes_128_sha::AES::SHA1::128
nsSSLSupportedCiphers: TLS::tls_dhe_rsa_aes_128_sha::AES::SHA1::128
nsSSLSupportedCiphers: TLS::tls_rsa_aes_256_sha::AES::SHA1::256
nsSSLSupportedCiphers: TLS::rsa_aes_256_sha::AES::SHA1::256
nsSSLSupportedCiphers: TLS::tls_dhe_dss_aes_256_sha::AES::SHA1::256
nsSSLSupportedCiphers: TLS::tls_dhe_rsa_aes_256_sha::AES::SHA1::256
nsSSLSupportedCiphers: TLS::tls_dhe_dss_1024_rc4_sha::RC4::SHA1::128
nsSSLSupportedCiphers: TLS::tls_dhe_dss_rc4_128_sha::RC4::SHA1::128

hi all,

I tried to duplicate this problem today, but this time the total update with TLS succeeded.
Another replication agreement, which uses LDAP and a simple bind, went wrong.

Stacktrace is in attachment stacktrace.1415534491.txt

Here's the mapping tree; the agreement meToidm1.ra.cn.idc using TLS succeeded, while cloneAgreement1-idm4.jh.cn.idc-pki-tomcat using LDAP failed.

# extended LDIF
#
# LDAPv3
# base <cn=mapping tree,cn=config> with scope subtree
# filter: (objectclass=*)
# requesting: ALL
#
# mapping tree, config

dn: cn=mapping tree,cn=config
objectClass: top
objectClass: extensibleObject
cn: mapping tree

# dc\3Didc, mapping tree, config

dn: cn=dc\3Didc,cn=mapping tree,cn=config
objectClass: top
objectClass: extensibleObject
objectClass: nsMappingTree
cn: dc=idc
cn: "dc=idc"
nsslapd-state: backend
nsslapd-backend: userRoot

# o\3Dipaca, mapping tree, config

dn: cn=o\3Dipaca,cn=mapping tree,cn=config
objectClass: top
objectClass: extensibleObject
objectClass: nsMappingTree
cn: o=ipaca
nsslapd-backend: ipaca
nsslapd-state: Backend

# replica, dc\3Didc, mapping tree, config

dn: cn=replica,cn=dc\3Didc,cn=mapping tree,cn=config
cn: replica
nsDS5Flags: 1
objectClass: top
objectClass: nsds5replica
objectClass: extensibleobject
nsDS5ReplicaType: 3
nsDS5ReplicaRoot: dc=idc
nsds5ReplicaLegacyConsumer: off
nsDS5ReplicaId: 37
nsDS5ReplicaBindDN: cn=replication manager,cn=config
nsState:: JQAAAAAAAAAZVl9UAAAAAAAAAAAAAAAAQ3AAAAAAAAACAAAAAAAAAA==
nsDS5ReplicaName: 2bfcb989-680711e4-9e77d38f-3fc8cf37
nsds5ReplicaChangeCount: 1
nsds5replicareapactive: 0

# replica, o\3Dipaca, mapping tree, config

dn: cn=replica,cn=o\3Dipaca,cn=mapping tree,cn=config
objectClass: top
objectClass: nsDS5Replica
objectClass: extensibleobject
nsDS5ReplicaRoot: o=ipaca
nsDS5ReplicaType: 3
nsDS5ReplicaBindDN: cn=Replication Manager cloneAgreement1-idm4.jh.cn.idc-pki-tomcat,ou=csusers,cn=config
cn: replica
nsDS5ReplicaId: 1185
nsDS5Flags: 1
nsState:: oQQAAAAAAAByVl9UAAAAAAAAAAAAAAAAAAAAAAAAAAABAAAAAAAAAA==
nsDS5ReplicaName: 5ea6b185-680711e4-9e77d38f-3fc8cf37
nsds5ReplicaChangeCount: 0
nsds5replicareapactive: 0

# meToidm1.ra.cn.idc, replica, dc\3Didc, mapping tree, config

dn: cn=meToidm1.ra.cn.idc,cn=replica,cn=dc\3Didc,cn=mapping tree,cn=config
cn: meToidm1.ra.cn.idc
objectClass: nsds5replicationagreement
objectClass: top
nsDS5ReplicaTransportInfo: TLS
description: me to idm1.ra.cn.idc
nsDS5ReplicaRoot: dc=idc
nsDS5ReplicaHost: idm1.ra.cn.idc
nsds5replicaTimeout: 120
nsDS5ReplicaBindDN: cn=replication manager,cn=config
nsDS5ReplicaPort: 389
nsDS5ReplicaBindMethod: simple
nsDS5ReplicatedAttributeList: (objectclass=*) $ EXCLUDE memberof idnssoaserial entryusn krblastsuccessfulauth krblastfailedauth krbloginfailedcount
nsds5ReplicaStripAttrs: modifiersName modifyTimestamp internalModifiersName internalModifyTimestamp
nsDS5ReplicaCredentials: {DES}
nsDS5ReplicatedAttributeListTotal: (objectclass=*) $ EXCLUDE entryusn krblastsuccessfulauth krblastfailedauth krbloginfailedcount
nsds5replicareapactive: 0
nsds5replicaLastUpdateStart: 0
nsds5replicaLastUpdateEnd: 0
nsds5replicaChangesSentSinceStartup:
nsds5replicaLastUpdateStatus: 1 Can't acquire busy replica
nsds5replicaUpdateInProgress: FALSE
nsds5replicaLastInitStart: 0
nsds5replicaLastInitEnd: 0

# cloneAgreement1-idm4.jh.cn.idc-pki-tomcat, replica, o\3Dipaca, mapping tree, config
dn: cn=cloneAgreement1-idm4.jh.cn.idc-pki-tomcat,cn=replica,cn=o\3Dipaca,cn=mapping tree,cn=config
objectClass: top
objectClass: nsds5replicationagreement
cn: cloneAgreement1-idm4.jh.cn.idc-pki-tomcat
nsDS5ReplicaRoot: o=ipaca
nsDS5ReplicaHost: idm1.ra.cn.idc
nsDS5ReplicaPort: 389
nsDS5ReplicaBindDN: cn=Replication Manager masterAgreement1-idm4.jh.cn.idc-pki-tomcat,ou=csusers,cn=config
nsDS5ReplicaBindMethod: Simple
description: cloneAgreement1-idm4.jh.cn.idc-pki-tomcat
nsDS5ReplicaCredentials: {DES}
nsds5replicareapactive: 0
nsds5replicaLastUpdateStart: 0
nsds5replicaLastUpdateEnd: 0
nsds5replicaChangesSentSinceStartup:
nsds5replicaLastUpdateStatus: 402 Replication error acquiring replica: unknown error - Replica has different database generation ID, remote replica may need to be initialized
nsds5replicaUpdateInProgress: FALSE
nsds5replicaLastInitStart: 0
nsds5replicaLastInitEnd: 0

# search result
search: 2
result: 0 Success

# numResponses: 8
# numEntries: 7

I've run some more tests: I am able to reproduce the problem with a dbgen-generated database and without any schema modification, so it's not a particular schema/entry issue. The dbgen command I used:
dbgen.pl -o /tmp/db-test.ldif -s "dc=id,dc=polytechnique,dc=edu" -m uid -n 30000

Looks like that bug is for Red Hat internal use only; I am unable to see it.

The fix is also expected for FreeIPA 4.1.3 (https://fedorahosted.org/freeipa/ticket/4756) in RHEL 7.0/7.1 (DS 1.3.3).

I'll test this fix soon and I'll report the results.

Is it possible that this problem (i.e. the sending thread holding the lock and pushing a great number of entries, thus overflowing the consumer) may also happen during a regular incremental sync? I saw similar replica timeouts when I made a lot of modifications within a short time on the source server (adding an attribute to every LDAP entry, for example).

The original code has a comment saying that the sending thread holds the lock in order to "prevent the result reader thread from closing the connection out from under us." This patch does the opposite: it gives a time window (100ms every 1s) for the result reader thread to close the connection. Is this possibility taken into account when you re-take the connection lock afterwards?

{{{
668                 PR_Unlock(conn->lock);
669                 DS_Sleep(PR_MillisecondsToInterval(yeild_delay_msec));
--->                does the conn still exist here, or was it closed by the result reader thread?
670                 PR_Lock(conn->lock);
}}}

I've patched repl5_connection.c of 1.3.2.24; the tests show that the problem is still present :(

{{{
[28/Nov/2014:09:09:43 +0100] NSMMReplicationPlugin - agmt="cn=Replication from ldap-model.polytechnique.fr to ldap-edev.polytechnique.fr" (ldap-edev:636): The remote replica has a different database generation ID than the local database. You may have to reinitialize the remote replica, or the local replica.
[28/Nov/2014:09:10:21 +0100] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=id,dc=polytechnique,dc=edu is going offline; disabling replication
[28/Nov/2014:09:10:21 +0100] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[28/Nov/2014:09:10:41 +0100] - import userRoot: Processed 2294 entries -- average rate 114.7/sec, recent rate 114.7/sec, hit ratio 0%
[28/Nov/2014:09:11:01 +0100] - import userRoot: Processed 3639 entries -- average rate 91.0/sec, recent rate 91.0/sec, hit ratio 100%
[28/Nov/2014:09:11:22 +0100] - import userRoot: Processed 3639 entries -- average rate 60.6/sec, recent rate 33.6/sec, hit ratio 0%
[28/Nov/2014:09:11:42 +0100] - import userRoot: Processed 3639 entries -- average rate 45.5/sec, recent rate 0.0/sec, hit ratio 0%
[28/Nov/2014:09:12:02 +0100] - import userRoot: Processed 3639 entries -- average rate 36.4/sec, recent rate 0.0/sec, hit ratio 0%
[28/Nov/2014:09:12:22 +0100] - import userRoot: Processed 3639 entries -- average rate 30.3/sec, recent rate 0.0/sec, hit ratio 0%
...
[28/Nov/2014:09:15:22 +0100] - import userRoot: Decided to end this pass because the progress rate has dropped below the 50% threshold.
[28/Nov/2014:09:15:22 +0100] - import userRoot: Ending pass number 1 ...
[28/Nov/2014:09:15:22 +0100] - import userRoot: Foreman is done; waiting for workers to finish...
[28/Nov/2014:09:15:22 +0100] - import userRoot: Workers finished; cleaning up...
[28/Nov/2014:09:15:22 +0100] - import userRoot: Workers cleaned up.
[28/Nov/2014:09:15:22 +0100] - import userRoot: Sweeping files for merging later...
[28/Nov/2014:09:15:23 +0100] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[28/Nov/2014:09:15:23 +0100] - import userRoot: Sweep done.
[28/Nov/2014:09:15:23 +0100] - import userRoot: Beginning pass number 2
[28/Nov/2014:09:15:43 +0100] - import userRoot: Processed 3639 entries (pass 2) -- average rate 214748182.8/sec, recent rate 0.0/sec, hit ratio 0%
[28/Nov/2014:09:16:03 +0100] - import userRoot: Processed 3639 entries (pass 2) -- average rate 107374091.4/sec, recent rate 0.0/sec, hit ratio 0%
[28/Nov/2014:09:16:23 +0100] - import userRoot: Processed 3639 entries (pass 2) -- average rate 71582727.6/sec, recent rate 0.0/sec, hit ratio 0%
...
}}}

The total update over an agreement without SSL still completes fine using the same initial data:
{{{
[28/Nov/2014:09:25:16 +0100] NSMMReplicationPlugin - agmt="cn=Replication from ldap-model.polytechnique.fr to ldap-edev.polytechnique.fr" (ldap-edev:636): The remote replica has a different database generation ID than the local database. You may have to reinitialize the remote replica, or the local replica.
[28/Nov/2014:09:25:39 +0100] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=id,dc=polytechnique,dc=edu is going offline; disabling replication
[28/Nov/2014:09:25:39 +0100] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[28/Nov/2014:09:25:59 +0100] - import userRoot: Processed 10363 entries -- average rate 518.1/sec, recent rate 518.1/sec, hit ratio 0%
[28/Nov/2014:09:26:19 +0100] - import userRoot: Processed 22080 entries -- average rate 552.0/sec, recent rate 552.0/sec, hit ratio 100%
[28/Nov/2014:09:26:36 +0100] - import userRoot: Workers finished; cleaning up...
[28/Nov/2014:09:26:36 +0100] - import userRoot: Workers cleaned up.
[28/Nov/2014:09:26:36 +0100] - import userRoot: Indexing complete. Post-processing...
[28/Nov/2014:09:26:36 +0100] - import userRoot: Generating numSubordinates complete.
[28/Nov/2014:09:26:37 +0100] - import userRoot: Flushing caches...
[28/Nov/2014:09:26:37 +0100] - import userRoot: Closing files...
[28/Nov/2014:09:26:37 +0100] - import userRoot: Import complete. Processed 29730 entries in 58 seconds. (512.59 entries/sec)
[28/Nov/2014:09:26:37 +0100] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=id,dc=polytechnique,dc=edu is coming online; enabling replication
}}}

The strange thing about your patch is that the line numbers do not correspond to the HEAD file (https://git.fedorahosted.org/cgit/389/ds.git/tree/ldap/servers/plugins/replication/repl5_connection.c). Maybe there are some other changes I need to apply to repl5_connection.c in order to test the patch correctly?

Replying to [comment:19 pj101]:

I'll test this fix soon and I'll report the results.

Is it possible that this problem (i.e. the sending thread holding the lock and pushing a great number of entries, thus overflowing the consumer) may also happen during a regular incremental sync? I saw similar replica timeouts when I made a lot of modifications within a short time on the source server (adding an attribute to every LDAP entry, for example).

Good point. I need to check whether it can happen during incremental replication.
Now, incremental replication is different: it sends updates, not entries, so the amount of work on the consumer side is lower.
Also, the replication agreement should wait for the ack of the previous update before sending the next one. But again, this is worth confirming.

Replying to [comment:20 pj101]:

The original code has a comment saying that the sending thread holds the lock in order to "prevent the result reader thread from closing the connection out from under us." This patch does the opposite: it gives a time window (100ms every 1s) for the result reader thread to close the connection. Is this possibility taken into account when you re-take the connection lock afterwards?

{{{
668                 PR_Unlock(conn->lock);
669                 DS_Sleep(PR_MillisecondsToInterval(yeild_delay_msec));
--->                does the conn still exist here, or was it closed by the result reader thread?
670                 PR_Lock(conn->lock);
}}}

Good point again!.. Thanks for the review :)
The connection is not freed until all entries have been sent (or an error occurs) and an 'end of total update' is sent to the consumer.
So I think there is no problem here.

Replying to [comment:22 pj101]:

The strange thing about your patch is that the line numbers do not correspond to the HEAD file (https://git.fedorahosted.org/cgit/389/ds.git/tree/ldap/servers/plugins/replication/repl5_connection.c). Maybe there are some other changes I need to apply to repl5_connection.c in order to test the patch correctly?

The patch I attached was done on 1.3.1. Maybe the line numbers differ, but I would expect it to apply.

Yes, it applies OK to HEAD or 1.3.2.24. If it was a diff against the 1.3.1 sources, that explains the discrepancy in line numbers.

Do you confirm the fix fails for your test case? I could provide an instrumented version that logs debug info in the access log so as not to change the dynamics of the import. Logging in the error log slows down the supplier and may hide the issue.

Yes, it fails our test case; I've tested several times.

It's OK if you send me an instrumented version. However, we compile from sources and install into a particular folder in our installation/setup/import scripts, so an RPM would be useless.

Concerning the dynamics, in our case we use "nsslapd-accesslog-logbuffering: off" on our DEV servers. The bug appears independently of whether access log buffering is off or on...

Hello,

I prepared a debug version of the patch (https://fedorahosted.org/389/attachment/ticket/47942/0001-Ticket-47942-debug-DS-hangs-during-online-total-upda.patch). Because the flow-control problem between the RA reader/writer is dynamic, the debug logs are written to the (buffered) access log so that they do not slow down the supplier.

To get valid info with that debug patch, the supplier needs 'nsslapd-accesslog-logbuffering: on'.

thanks

In fact, the bug persists whether I enable or disable access log buffering (with the debug version of the patch), though it is clear that when log buffering is disabled the problem appears later during the initialization (7403 entries with log buffering and 26798 without).
I've attached the error, access and audit logs for both cases, for the supplier (ldap-edev) and the consumer (ldap-model).

  • Monitoring the sent entries and the received acks, the consumer is always ~2000 entries behind the supplier.

  • The supplier is sending data without ever having to wait for the ability to write.
    The last log shows no delay before sending the next entry:

{{{
[29/Nov/2014:14:34:22 +0100] agmt="cn=Replication from xxx" (ldap-supplier:636): perform_operation sends operation (5)
}}}

  • The consumer times out the full init because it is unable to send the next ack (for the timeout duration):

{{{
[29/Nov/2014:14:34:24 +0100] - import userRoot: Processed 7403 entries -- average rate 92.5/sec, recent rate 70.2/sec, hit ratio 100%
[29/Nov/2014:14:34:44 +0100] - import userRoot: Processed 7403 entries -- average rate 74.0/sec, recent rate 26.6/sec, hit ratio 0%
...
[29/Nov/2014:14:45:26 +0100] - import userRoot: Processed 7403 entries (pass 3) -- average rate 35791332.4/sec, recent rate 0.0/sec, hit ratio 0%
[29/Nov/2014:14:45:46 +0100] - import userRoot: Processed 7403 entries (pass 3) -- average rate 30678284.9/sec, recent rate 0.0/sec, hit ratio 0%
[29/Nov/2014:14:46:01 +0100] - slapd_poll(256) timed out
}}}

  • I believe that what happened is:

    • The supplier tested the connection to check that it could write. The test succeeded, BUT the connection was almost full, so it was not able to send a full entry. It then hangs on the write. The problem here is that during the write it holds the connection lock.
    • The consumer is also sending acks, but those acks are not read by the RA reader because it does not hold the connection lock. After a consumer timeout (ioblocktimeout), the consumer aborts the total init.

    This can be confirmed with several pstacks (consumer/supplier) taken while the full init is not progressing.

    • I believe the provided fix fails because the supplier, when testing the connection, detects that it can write but does not have the ability to write a full entry.

    • I am implementing a new patch along the flow-control lines sketched below. Would it be possible for you to test it in your environment? (I am not able to reproduce your exact failure.)
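
To make the intended fix concrete, here is a minimal sketch of the flow-control idea (hypothetical pthread code, not the actual patch; all names and values are illustrative): the sender counts entries sent, the result reader counts acks, and the sender pauses, with the lock released, whenever the consumer falls more than a window behind.

{{{
/* Hypothetical sketch of the flow-control idea (not the actual patch).
 * The sender calls flow_control_wait() before each entry; the result
 * reader calls flow_control_ack() for each ack it reads. */
#include <pthread.h>
#include <time.h>

typedef struct {
    pthread_mutex_t lock;
    long sent;      /* entries sent by the supplier thread       */
    long acked;     /* acks read by the result-reader thread     */
    long window;    /* max unacked entries, e.g. 1000            */
    long pause_ms;  /* pause when over the window, e.g. 2000 ms  */
} FlowControl;

/* Called by the sender before each entry is sent. */
static void flow_control_wait(FlowControl *fc)
{
    pthread_mutex_lock(&fc->lock);
    while (fc->sent - fc->acked > fc->window) {
        /* Drop the lock while pausing so the result reader can run and
         * drain acks -- exactly what the hanging code never allowed. */
        pthread_mutex_unlock(&fc->lock);
        struct timespec ts = { fc->pause_ms / 1000,
                               (fc->pause_ms % 1000) * 1000000L };
        nanosleep(&ts, NULL);
        pthread_mutex_lock(&fc->lock);
    }
    fc->sent++;
    pthread_mutex_unlock(&fc->lock);
}

/* Called by the result reader for each ack it reads. */
static void flow_control_ack(FlowControl *fc)
{
    pthread_mutex_lock(&fc->lock);
    fc->acked++;
    pthread_mutex_unlock(&fc->lock);
}
}}}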

ioblocktimeout in our case is 700000 (~12-13 minutes). The initial stack traces (without the patch) are attached at the very beginning of the ticket.

Sure, no problem testing in our environment; I want the problem to be resolved :)

Thanks for the backtraces. I will look at them. They are from the original failure, so they will not show the problem with the previous tentative fix, but they will give a good indication.

I tested the debug patch I just attached. It succeeds, but in my test env I am not able to reproduce your exact problem.
Would it be possible for you to backport the patch (done on top of 1.3.1) and test it (with access log buffering 'on')? In case of failure, would you please provide the access/error logs and pstacks of the supplier/consumer.

Thanks again for your help

You were right. The last patch works like a charm. I've attached the "successful" logs to the ticket.
The throttle ("check_flow_control_tot_init give time to the consumer before sending more entries") happens approximately every two or three seconds.

I still have several questions/remarks:
  • the bug never appeared without SSL. I was unable to reproduce it without encryption (be it SSL/TLS or Kerberos). What could be the reason? Maybe these encryption/decryption protocols have their own buffers, thus changing the flow dynamics?

  • the choice of the "lag" of 1000 entries is rather arbitrary/empirical (as is the 2 seconds of sleep). It's OK for the entries in our LDAP (average size ~5-6 kB), but what happens if the entries are larger (smaller should be OK)? Maybe it should be based on the volume of sent data, if the limitation is the TCP socket buffer sizes (see the byte-based sketch after this list)?

  • I've already asked about incremental replication. If we make a lot of modifications within a short time period (or if the consumer is down and then comes back up, needing to catch up), the situation will be exactly the same: the supplier will flood the consumer. So the question is: does incremental replication use the same code path (perform_operation() in repl5_connection.c)? And if that is the case, maybe it is worth implementing the same sort of "dike" for incremental replication?
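
For illustration only (nothing like this exists in the patch), a byte-based variant of the throttle would budget outstanding bytes instead of entries, so entry size stops mattering; all names here are hypothetical.

{{{
/* Illustrative byte-based variant (not in the patch): budget outstanding
 * bytes instead of entries, so large entries cannot blow past a fixed
 * socket-buffer budget. */
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    long bytes_sent;       /* bytes handed to the connection by the sender */
    long bytes_acked;      /* bytes covered by the acks read so far        */
    long max_outstanding;  /* budget, e.g. a small multiple of SO_SNDBUF   */
} ByteFlowControl;

/* Returns nonzero if sending next_entry_size more bytes would exceed the
 * budget; the caller then pauses/yields and re-checks before sending. */
static int over_budget(ByteFlowControl *fc, long next_entry_size)
{
    pthread_mutex_lock(&fc->lock);
    int over = (fc->bytes_sent - fc->bytes_acked + next_entry_size)
                   > fc->max_outstanding;
    pthread_mutex_unlock(&fc->lock);
    return over;
}
}}}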

Replying to [comment:34 pj101]:

You were right. The last patch works like a charm. I've attached the "successful" logs to the ticket.
The throttle ("check_flow_control_tot_init give time to the consumer before sending more entries") happens approximately every two or three seconds.

I still have several questions/remarks:
  • the bug never appeared without SSL. I was unable to reproduce it without encryption (be it SSL/TLS or Kerberos). What could be the reason? Maybe these encryption/decryption protocols have their own buffers, thus changing the flow dynamics?

Yes. NSS does a lot of buffering and locking which will definitely throw off the timing. Same with gssapi/kerberos.

  • the choice of the "lag" of 1000 entries is rather arbitrary/empirical (as is the 2 seconds of sleep). It's OK for the entries in our LDAP (average size ~5-6 kB), but what happens if the entries are larger (smaller should be OK)? Maybe it should be based on the volume of sent data, if the limitation is the TCP socket buffer sizes?

Probably. We'll need some more testing with different data sets to determine that.

  • I've already asked about incremental replication. If we make a lot of modifications within a short time period (or if the consumer is down and then comes back up, needing to catch up), the situation will be exactly the same: the supplier will flood the consumer. So the question is: does incremental replication use the same code path (perform_operation() in repl5_connection.c)?

Yes.

And if that is the case, maybe it is worth implementing the same sort of "dike" for incremental replication?

Yes. But because of the way that incremental replication works, it will probably be much more difficult to reproduce the problem.

Hi Thierry,

One minor issue: the indentation needs fixing throughout your patch. Two, this issue looks very similar to some issues we saw with the reliability 15 testing. I know your patch only applies to total updates, but is it possible to apply this to incremental updates? Here are some stack traces from the rel15 hangs we saw; do they look similar to your issue with total updates?

{{{
Thread 3 (Thread 0x7f7473fff700 (LWP 29182)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f754f0e6d77 in _L_lock_974 () from /lib64/libpthread.so.0
#2  0x00007f754f0e6d20 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x7f7498053620) at pthread_mutex_lock.c:104
#3  0x00007f754f73de29 in PR_Lock (lock=0x7f7498053620) at ../../../nspr/pr/src/pthreads/ptsynch.c:174
#4  0x00007f75471eac1c in conn_read_result_ex (conn=conn@entry=0x7f7498053570, retoidp=retoidp@entry=0x0, retdatap=retdatap@entry=0x0, returned_controls=returned_controls@entry=0x0, send_msgid=send_msgid@entry=-1, resp_msgid=resp_msgid@entry=0x7f7473ffed78, block=block@entry=0) at ldap/servers/plugins/replication/repl5_connection.c:337
#5  0x00007f75471ed306 in repl5_inc_result_threadmain (param=0x7f748d1bc090) at ldap/servers/plugins/replication/repl5_inc_protocol.c:295
#6  0x00007f754f743740 in _pt_root (arg=0x7f748c6624d0) at ../../../nspr/pr/src/pthreads/ptthread.c:204
#7  0x00007f754f0e4df3 in start_thread (arg=0x7f7473fff700) at pthread_create.c:308
#8  0x00007f754ee123dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Waiting for:

Thread 14 (Thread 0x7f746dff3700 (LWP 32703)):
#0  0x00007f754ee07e0d in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f754f73f8f7 in poll (__timeout=600000, __nfds=1, __fds=0x7f746dff27d0) at /usr/include/bits/poll2.h:46
#2  _pr_poll_with_poll (pds=pds@entry=0x7f746dff2a80, npds=npds@entry=1, timeout=timeout@entry=600000) at ../../../nspr/pr/src/pthreads/ptio.c:3920
#3  0x00007f754f742435 in PR_Poll (pds=pds@entry=0x7f746dff2a80, npds=npds@entry=1, timeout=timeout@entry=600000) at ../../../nspr/pr/src/pthreads/ptio.c:4322
#4  0x00007f75471ea56f in see_if_write_available (timeout=600000, conn=0x7f7498053570) at ldap/servers/plugins/replication/repl5_connection.c:544
#5  perform_operation (conn=0x7f7498053570, optype=optype@entry=3, dn=0x7f748c386270 "dc=domain_8440,l=my_l_2,ou=my_ou_1,o=my_suffix.com", attrs=attrs@entry=0x7f748cfa3ab0, newrdn=newrdn@entry=0x0, newparent=newparent@entry=0x0, deleteoldrdn=deleteoldrdn@entry=0, update_control=update_control@entry=0x7f748c3a75a0, extop_oid=extop_oid@entry=0x0, extop_payload=extop_payload@entry=0x0, message_id=message_id@entry=0x7f746dff2c6c) at ldap/servers/plugins/replication/repl5_connection.c:681
#6  0x00007f75471eb0bd in conn_send_modify (conn=<optimized out>, dn=<optimized out>, mods=mods@entry=0x7f748cfa3ab0, update_control=update_control@entry=0x7f748c3a75a0, message_id=message_id@entry=0x7f746dff2c6c) at ldap/servers/plugins/replication/repl5_connection.c:797
#7  0x00007f75471ed9b3 in replay_update (prp=prp@entry=0x7f74980536d0, op=0x7f746dff2cd0, message_id=message_id@entry=0x7f746dff2c6c) at ldap/servers/plugins/replication/repl5_inc_protocol.c:1387
#8  0x00007f75471eee56 in send_updates (num_changes_sent=0x7f746dff2c60, remote_update_vector=<optimized out>, prp=0x7f74980536d0) at ldap/servers/plugins/replication/repl5_inc_protocol.c:1719
#9  repl5_inc_run (prp=<optimized out>) at ldap/servers/plugins/replication/repl5_inc_protocol.c:1056
#10 0x00007f75471f3a5c in prot_thread_main (arg=0x7f7498052930) at ldap/servers/plugins/replication/repl5_protocol.c:296
#11 0x00007f754f743740 in _pt_root (arg=0x7f7498053990) at ../../../nspr/pr/src/pthreads/ptthread.c:204
#12 0x00007f754f0e4df3 in start_thread (arg=0x7f746dff3700) at pthread_create.c:308
#13 0x00007f754ee123dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

This thread was the most commonly seen:

#0  0x00007fbb8ad2a1fd in write () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007fbb8bf2f338 in sb_debug_write (sbiod=0x7fbac22e09b0, buf=0x7fbac3439a3c, len=222) at sockbuf.c:854
#2  0x00007fbb8c172ce7 in tlsm_PR_Send (fd=<optimized out>, buf=0x7fbac3439a3c, len=222, timeout=<optimized out>, flags=<optimized out>) at tls_m.c:3040
#3  0x00007fbb8bd05e7c in ssl_DefSend (ss=ss@entry=0x7fbac2786390, buf=0x7fbac3439a10 "\027\003\001", len=266, flags=flags@entry=0) at ssldef.c:94
#4  0x00007fbb8bcf73fd in ssl3_SendRecord (ss=ss@entry=0x7fbac2786390, epoch=epoch@entry=0, type=type@entry=content_application_data, pIn=0x7fbac212feb2 '0' <repeats 11 times>, pIn@entry=0x7fbac212fdf0 "0\201\277\002\002\005\223J8uid=emailperson_5429,l=my_l_0,ou=my_ou_0,o=my_suffix.com\240\177\060\033\004\027\062.16.840.1.113730.3.4.2\004", nIn=0, nIn@entry=194, flags=flags@entry=0) at ssl3con.c:2908
#5  0x00007fbb8bcf7590 in ssl3_SendApplicationData (ss=0x7fbac2786390, in=0x7fbac212fdf0 "0\201\277\002\002\005\223J8uid=emailperson_5429,l=my_l_0,ou=my_ou_0,o=my_suffix.com\240\177\060\033\004\027\062.16.840.1.113730.3.4.2\004", len=194, flags=0) at ssl3con.c:2997
#6  0x00007fbb8bd0b62e in ssl_SecureSend (ss=0x7fbac2786390, buf=0x7fbac212fdf0 "0\201\277\002\002\005\223J8uid=emailperson_5429,l=my_l_0,ou=my_ou_0,o=my_suffix.com\240\177\060\033\004\027\062.16.840.1.113730.3.4.2\004", len=194, flags=<optimized out>) at sslsecur.c:1323
#7  0x00007fbb8bd0f0b1 in ssl_Send (fd=<optimized out>, buf=0x7fbac212fdf0, len=194, flags=0, timeout=4294967295) at sslsock.c:2078
#8  0x00007fbb8c172920 in tlsm_sb_write (sbiod=0x7fbac024cc20, buf=<optimized out>, len=<optimized out>) at tls_m.c:3335
#9  0x00007fbb8bf2f338 in sb_debug_write (sbiod=0x7fbac2460280, buf=0x7fbac212fdf0, len=194) at sockbuf.c:854
#10 0x00007fbb8bf3094e in ber_int_sb_write (sb=sb@entry=0x7fbac0230d70, buf=0x7fbac212fdf0, len=len@entry=194) at sockbuf.c:445
#11 0x00007fbb8bf2cc5b in ber_flush2 (sb=0x7fbac0230d70, ber=0x7fbac20c2650, freeit=freeit@entry=0) at io.c:246
#12 0x00007fbb8c15e221 in ldap_int_flush_request (ld=ld@entry=0x7fbaa0159160, lr=lr@entry=0x7fbac0054530) at request.c:183
#13 0x00007fbb8c15e602 in ldap_send_server_request (ld=ld@entry=0x7fbaa0159160, ber=ber@entry=0x7fbac20c2650, msgid=msgid@entry=1427, parentreq=parentreq@entry=0x0, srvlist=srvlist@entry=0x0, lc=0x7fbac0303be0, lc@entry=0x0, bind=bind@entry=0x0, m_noconn=m_noconn@entry=0, m_res=m_res@entry=0) at request.c:405
#14 0x00007fbb8c15e988 in ldap_send_initial_request (ld=ld@entry=0x7fbaa0159160, msgtype=msgtype@entry=74, dn=dn@entry=0x7fbac24073e0 "uid=emailperson_5429,l=my_l_0,ou=my_ou_0,o=my_suffix.com", ber=ber@entry=0x7fbac20c2650, msgid=msgid@entry=1427) at request.c:166
#15 0x00007fbb8c15222e in ldap_delete_ext (ld=0x7fbaa0159160, dn=dn@entry=0x7fbac24073e0 "uid=emailperson_5429,l=my_l_0,ou=my_ou_0,o=my_suffix.com", sctrls=sctrls@entry=0x7fbb426f6a90, cctrls=cctrls@entry=0x0, msgidp=msgidp@entry=0x7fbb426f6a88) at delete.c:99
#16 0x00007fbb82e2854c in perform_operation (conn=0x7fbad0058640, optype=optype@entry=2, dn=0x7fbac24073e0 "uid=emailperson_5429,l=my_l_0,ou=my_ou_0,o=my_suffix.com", attrs=attrs@entry=0x0, newrdn=newrdn@entry=0x0, newparent=newparent@entry=0x0, deleteoldrdn=deleteoldrdn@entry=0, update_control=update_control@entry=0x7fbac27e1bb0, extop_oid=extop_oid@entry=0x0, extop_payload=extop_payload@entry=0x0, message_id=message_id@entry=0x7fbb426f6c6c) at ldap/servers/plugins/replication/repl5_connection.c:707
#17 0x00007fbb82e28dbc in conn_send_delete (conn=<optimized out>, dn=<optimized out>, update_control=update_control@entry=0x7fbac27e1bb0, message_id=message_id@entry=0x7fbb426f6c6c) at ldap/servers/plugins/replication/repl5_connection.c:786
#18 0x00007fbb82e2b621 in replay_update (prp=prp@entry=0x7fbad0043350, op=0x7fbb426f6cd0, message_id=message_id@entry=0x7fbb426f6c6c) at ldap/servers/plugins/replication/repl5_inc_protocol.c:1392
#19 0x00007fbb82e2cba6 in send_updates (num_changes_sent=0x7fbb426f6c60, remote_update_vector=<optimized out>, prp=0x7fbad0043350) at ldap/servers/plugins/replication/repl5_inc_protocol.c:1719
#20 repl5_inc_run (prp=<optimized out>) at ldap/servers/plugins/replication/repl5_inc_protocol.c:1056
#21 0x00007fbb82e317ac in prot_thread_main (arg=0x7fbad002ae60) at ldap/servers/plugins/replication/repl5_protocol.c:296
#22 0x00007fbb8b382740 in _pt_root (arg=0x7fbad003e900) at ../../../nspr/pr/src/pthreads/ptthread.c:204
#23 0x00007fbb8ad23df3 in start_thread (arg=0x7fbb426f7700) at pthread_create.c:308
#24 0x00007fbb8aa513dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
}}}

Thanks Mark for the review. I will revisit the fix to address the indentation problem.

You raise the same point as pj101 reported and Rich confirmed.
Yes looking at the code I think you are right, the same flow control issue can occur in incremental replication.

In old versions (5.0?) flow control was effectively done for incremental updates, because the RA sender waited for the sent update to be acknowledged before sending the next one.

In the current version, this flow control is not done. The RA sender keeps a list of expected acknowledgements but continues to send until there are no more updates to send. This new version offers better replication latency, at the risk of flow-control congestion.

The risk with incremental updates should be much more limited than with total updates, as the amount of sent data is smaller (likely fewer updates to send than entries, and the size of an update < the size of an entry).
Still, the risk exists, and the backtraces from reliab15 show it can occur.

I will look at how difficult it would be to do common flow control (total and incremental).

Incremental and total updates use different threads and functions/callbacks to send/receive updates/acks.
That said, a similar fix for incremental updates looks quite easy to implement. I will test a new fix that includes flow control for both total and incremental updates.

For total updates, the window/pause are configurable.
I am not sure it is worth making them configurable for incremental updates.

Looks like there are still indentation issues in:

ldap/servers/plugins/replication/repl5_agmtlist.c

{{{
agmtlist_modify_callback(Slapi_PBlock *pb, Slapi_Entry *entryBefore, Slapi_Entry ...
326 326             rc = SLAPI_DSE_CALLBACK_ERROR;
327 327         }
328 328     }
329         else if (slapi_attr_types_equivalent(mods[i]->mod_type,
330                  type_nsds5ReplicaTotalWindow))
331         {
332             /* New replica timeout */
333             if (agmt_set_totalwindow_from_entry(agmt, e) != 0)
334             {
335                 slapi_log_error(SLAPI_LOG_REPL, repl_plugin_name, "agmtlist_modify_callback: "
336                                 "failed to update total update window for agreement %s\n",
337                                 agmt_get_long_name(agmt));
338                 returncode = LDAP_OPERATIONS_ERROR;
339                 rc = SLAPI_DSE_CALLBACK_ERROR;
340             }
341         }
342         else if (slapi_attr_types_equivalent(mods[i]->mod_type,
343                  type_nsds5ReplicaTotalPause))
344         {
345             /* New replica timeout */
346             if (agmt_set_totalpause_from_entry(agmt, e) != 0)
347             {
348                 slapi_log_error(SLAPI_LOG_REPL, repl_plugin_name, "agmtlist_modify_callback: "
349                                 "failed to update total update pause for agreement %s\n",
350                                 agmt_get_long_name(agmt));
351                 returncode = LDAP_OPERATIONS_ERROR;
352                 rc = SLAPI_DSE_CALLBACK_ERROR;
353             }
354         }
}}}

and ldap/servers/plugins/replication/repl5_agmt.c

{{{
354
355     /* Total update window. */
356     ra->totalUpdateWindow = DEFAULT_TOTALWINDOW;
357     if (slapi_entry_attr_find(e, type_nsds5ReplicaTotalWindow, &sattr) == 0)
358     {
359         Slapi_Value *sval;
360         if (slapi_attr_first_value(sattr, &sval) == 0)
361         {
362             ra->totalUpdateWindow = slapi_value_get_long(sval);
363         }
364     }
365
366     /* Total update pause. */
367     ra->totalUpdatePause = DEFAULT_TOTALPAUSE;
368     if (slapi_entry_attr_find(e, type_nsds5ReplicaTotalPause, &sattr) == 0)
369     {
370         Slapi_Value *sval;
371         if (slapi_attr_first_value(sattr, &sval) == 0)
372         {
373             ra->totalUpdatePause = slapi_value_get_long(sval);
374         }
375     }
}}}

and here:

ldap/servers/plugins/replication/repl5_inc_protocol.c

{{{
send_updates(Private_Repl_Protocol *prp, RUV *remote_update_vector, PRUint32 *num_changes_sent)
    int finished = 0;
    ConnResult replay_crc;
    char csn_str[CSN_STRSIZE];
    int inc_update_window = 100;   /* no more than 100 unacked updates before pause */
    int inc_update_pause = 2000;   /* pause of 2 sec if the consumer is too late */
}}}

I find it much easier to spot the indentation issues by looking at the patch file than by looking at the code in an IDE.

The rest looks good to me.

Since the fix is in essence the same for both cases, wouldn't it be logical to reuse the same two attributes for the same purpose (renaming them, of course, to something like nsds5ReplicaMaximumLag and nsds5ReplicaLagPause)?

Thanks Mark and pj101 for the review. I will rework the indentation.

Regarding the use of the same attributes, I think it is a good idea.
Total and incremental update do not send the same data, and likely not the same amount, so the optimal settings could differ. But if the tuning needs to be adapted, it can be done online.
So I do not see any drawback to sharing the same attributes.
I will reimplement it and merge the two fixes.

Thanks again.

The problem I have with this is that it is going to be difficult to determine the optimal numbers for the window size and the sleep period. How can we determine what the optimal values are?

For the pause period, is it necessary to sleep for some period of time, or is just doing a thread yield sufficient? The purpose of the sleep is to allow the other threads to run, correct? A yield should do that - essentially, a sleep with a time of 0.
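
For reference, both options go through NSPR's PR_Sleep; passing PR_INTERVAL_NO_WAIT is the documented way to yield. A minimal sketch of the two behaviors (not the actual patch):

{{{
#include "nspr.h"

/* Sketch contrasting sleep and yield (illustrative only). */
static void
pause_or_yield(PRUint32 pause_ms)
{
    if (pause_ms > 0) {
        /* Sleep: gives the lagging consumer a fixed amount of time. */
        PR_Sleep(PR_MillisecondsToInterval(pause_ms));
    } else {
        /* Yield: lets other threads run, with no fixed delay. */
        PR_Sleep(PR_INTERVAL_NO_WAIT);
    }
}
}}}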

For the window size, is there some way we could determine the optimal size? I just don't like giving people knobs to turn without some way to determine how to set them, and we also need to set a reasonable default value.

Replying to [comment:43 rmeggins]:

> The problem I have with this is that it is going to be difficult to determine the optimal numbers for the window size and the sleep period. How can we determine what the optimal values are?

I fully agree. The optimal values could be continuously computed by the RA by monitoring how fast the consumer consumes the sent entries/updates. Should that be implemented in this ticket or in another one?

> For the pause period, is it necessary to sleep for some period of time, or is just doing a thread yield sufficient? The purpose of the sleep is to allow the other threads to run, correct? A yield should do that - essentially, a sleep with a time of 0.

I think yield should work as well (it needs to be tested), but the check on sent/received messages should likely be done before sending the next entry. The main difference between yield and sleep is whether we want to give the consumer time to catch up.

> For the window size, is there some way we could determine the optimal size? I just don't like giving people knobs to turn without some way to determine how to set them, and we also need to set a reasonable default value.

Right. I do not know how to compute the amount of data that can go on the connection before risking getting blocked.
It also depends on how much encryption will expand the data.

Replying to [comment:44 tbordaz]:

> Replying to [comment:43 rmeggins]:
>
> > The problem I have with this is that it is going to be difficult to determine the optimal numbers for the window size and the sleep period. How can we determine what the optimal values are?
>
> I fully agree. The optimal values could be continuously computed by the RA by monitoring how fast the consumer consumes the sent entries/updates. Should that be implemented in this ticket or in another one?

I guess we have the immediate problem of the hang, and we know that this patch fixes it. I think we should go ahead with the patch as it is.

Then we should open another ticket as a follow-up to this one, to investigate how to determine the optimal values for the sleep time and window size, with these goals: 1) find the best general-purpose default values for sleep time and window size; 2) document a procedure for monitoring and tuning replication so that these values can be tuned in production; 3) see if there is a way for the code to adjust these values automatically, e.g. if a 100 ms sleep was not long enough last time, try 200 ms, etc.
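
A rough sketch of what the automatic adjustment in goal 3 could look like (purely hypothetical, nothing like this is implemented):

{{{
/* Hypothetical adaptive pause illustrating goal 3 above: grow the pause
 * while the consumer stays behind, shrink it once the backlog clears. */
static int
adapt_pause(int pause_ms, int consumer_still_behind)
{
    if (consumer_still_behind) {
        pause_ms *= 2;           /* e.g. 100 ms was not enough, try 200 ms */
        if (pause_ms > 10000) {
            pause_ms = 10000;    /* cap the backoff at 10 seconds */
        }
    } else if (pause_ms > 100) {
        pause_ms /= 2;           /* consumer caught up, relax the pause */
    }
    return pause_ms;
}
}}}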

> > For the pause period, is it necessary to sleep for some period of time, or is just doing a thread yield sufficient? The purpose of the sleep is to allow the other threads to run, correct? A yield should do that - essentially, a sleep with a time of 0.
>
> I think yield should work as well (it needs to be tested), but the check on sent/received messages should likely be done before sending the next entry. The main difference between yield and sleep is whether we want to give the consumer time to catch up.

Ok.

> > For the window size, is there some way we could determine the optimal size? I just don't like giving people knobs to turn without some way to determine how to set them, and we also need to set a reasonable default value.
>
> Right. I do not know how to compute the amount of data that can go on the connection before risking getting blocked.
> It also depends on how much encryption will expand the data.

Ok.

There are a few new log messages that are FATAL - I'm not sure these should be FATAL - we don't want to spam the customer's error logs unnecessarily. However, if we tell them that the only way to determine whether this hang situation is happening is by turning on the REPL log level, some customers will not want to do that in their production environments because it will slow down the server too much. Perhaps we could add some sort of monitor entry/attribute for this?

In conn_write_available:
yeild should be spelled yield
I don't think the log message should be at FATAL level; I suggest REPL
return_value should be initialized, otherwise you will get compiler and/or coverity warnings
should check for other stop states in the while loop - for example, what if someone wants to break the loop without shutting down slapd, e.g. by deleting, stopping, or disabling the agreement?

Hi Thierry, I have 2 questions...

1) Is the value '0' or a negative value treated specially for these 2 parameters?
flowControlWindow
flowControlPause

Reading the code, the value '0' does not do much harm. (In see_if_write_available, if flowControlWindow is 0, it goes to the pause every time. That should be okay, right?) But a negative number accidentally set on flowControlPause might be bad, since PR_MillisecondsToInterval takes an unsigned 32-bit integer (PRUint32). Do we want to check (totalUpdatePause > 0)?
{{{
    } else if ((sent_msgid - rcv_msgid) > agmt_get_flowcontrolwindow(conn->agmt)) {
        int totalUpdatePause;

        totalUpdatePause = agmt_get_flowcontrolpause(conn->agmt);
        if (totalUpdatePause) {
            ...
            DS_Sleep(PR_MillisecondsToInterval(totalUpdatePause));
}}}

Same for repl5_inc_flow_control_results:

{{{
    DS_Sleep(PR_MillisecondsToInterval(agmt_get_flowcontrolpause(agmt)));
}}}

2) Maybe it is a corner case :), but if flowControlWindow is 0, total update does not pause when sent and received are identical, while incremental does, due to '>='. I'm curious why total update uses '>' but incremental uses '>='?

{{{
    ((rd->last_message_id_sent - rd->last_message_id_received) >= agmt_get_flowcontrolwindow(agmt))) {
}}}

Replying to [comment:48 nhosoi]:

> Hi Thierry, I have 2 questions...
>
> 1) Is the value '0' or a negative value treated specially for these 2 parameters?
> flowControlWindow

0 means "do the flow control check for every message".

> flowControlPause

0 means "do a thread yield instead of sleeping for N milliseconds".

> Reading the code, the value '0' does not do much harm. (In see_if_write_available, if flowControlWindow is 0, it goes to the pause every time. That should be okay, right?) But a negative number accidentally set on flowControlPause might be bad, since PR_MillisecondsToInterval takes an unsigned 32-bit integer (PRUint32). Do we want to check (totalUpdatePause > 0)?

Note that there are several replication parameters that do not check for negative or otherwise invalid values. We should fix all of them.
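
For example, a guard along these lines would keep a negative value out of PR_MillisecondsToInterval, which takes a PRUint32 (a sketch only, reusing the existing agmt_get_flowcontrolpause getter):

{{{
/* Sketch of a defensive check, not the actual fix: a negative configured
 * pause would wrap around to a huge PRUint32 interval, so clamp it first. */
int pause_ms = agmt_get_flowcontrolpause(conn->agmt);
if (pause_ms > 0) {
    DS_Sleep(PR_MillisecondsToInterval((PRUint32)pause_ms));
} else {
    PR_Sleep(PR_INTERVAL_NO_WAIT); /* treat 0 or negative as a thread yield */
}
}}}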

> {{{
>     } else if ((sent_msgid - rcv_msgid) > agmt_get_flowcontrolwindow(conn->agmt)) {
>         int totalUpdatePause;
>
>         totalUpdatePause = agmt_get_flowcontrolpause(conn->agmt);
>         if (totalUpdatePause) {
>             ...
>             DS_Sleep(PR_MillisecondsToInterval(totalUpdatePause));
> }}}

> Same for repl5_inc_flow_control_results:
> {{{
>     DS_Sleep(PR_MillisecondsToInterval(agmt_get_flowcontrolpause(agmt)));
> }}}

> 2) Maybe it is a corner case :), but if flowControlWindow is 0, total update does not pause when sent and received are identical, while incremental does, due to '>='. I'm curious why total update uses '>' but incremental uses '>='?
> {{{
>     ((rd->last_message_id_sent - rd->last_message_id_received) >= agmt_get_flowcontrolwindow(agmt))) {
> }}}

{{{
git merge ticket47942
Updating 96c130b..fbafee5
Fast-forward
 ldap/schema/01core389.ldif                             |   4 +-
 ldap/servers/plugins/replication/repl5.h               |  10 +++++
 ldap/servers/plugins/replication/repl5_agmt.c          | 160 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 ldap/servers/plugins/replication/repl5_agmtlist.c      |  26 +++++++++++
 ldap/servers/plugins/replication/repl5_connection.c    | 163 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 ldap/servers/plugins/replication/repl5_inc_protocol.c  |  32 ++++++++++++-
 ldap/servers/plugins/replication/repl5_prot_private.h  |   2 +
 ldap/servers/plugins/replication/repl5_tot_protocol.c  |  53 +++++++++++++++++++++-
 ldap/servers/plugins/replication/repl_globals.c        |   2 +
 9 files changed, 446 insertions(+), 6 deletions(-)
}}}

{{{
git push origin master
Counting objects: 31, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (16/16), done.
Writing objects: 100% (16/16), 6.74 KiB, done.
Total 16 (delta 14), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   96c130b..fbafee5  master -> master
}}}

commit fbafee5
Author: Thierry bordaz (tbordaz) <tbordaz@redhat.com>
Date: Mon Dec 15 15:12:35 2014 +0100

{{{
git push origin 389-ds-base-1.3.3
Counting objects: 31, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (16/16), done.
Writing objects: 100% (16/16), 6.75 KiB, done.
Total 16 (delta 14), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   61a4e70..69df018  389-ds-base-1.3.3 -> 389-ds-base-1.3.3
}}}

commit 69df018
Author: Thierry bordaz (tbordaz) <tbordaz@redhat.com>
Date: Mon Dec 15 15:12:35 2014 +0100

{{{
git push origin 389-ds-base-1.3.2
Counting objects: 31, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (16/16), done.
Writing objects: 100% (16/16), 6.71 KiB, done.
Total 16 (delta 14), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   09eff3d..9851929  389-ds-base-1.3.2 -> 389-ds-base-1.3.2
}}}

commit 9851929
Author: Thierry bordaz (tbordaz) <tbordaz@redhat.com>
Date: Mon Dec 15 15:12:35 2014 +0100

Backported to 389-ds-base-1.3.1:

{{{
f5458c4..3f78755  389-ds-base-1.3.1 -> 389-ds-base-1.3.1
commit 3f787554300db2ce901d244ff728aaa910e73f7a
}}}

Metadata Update from @pj101:
- Issue assigned to tbordaz
- Issue set to the milestone: 1.2.11.33

7 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/1273

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for any inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Fixed)

3 years ago
