#47788 Supplier can skip a failing update, although it should retry.
Closed: Fixed. Opened 5 years ago by tbordaz.

Although the consumer returns a failure, and this failure should stop (break) replication (ignore_error_and_keep_going() does not ignore "unwilling to perform"), the operations following the failing one are replicated and they update the RUV. In the end, it is as if the failing operation had been skipped.

Here is the test case scenario (the failing CSN is 53592baa000000020000, and the CSN following it is 53592baa000100020000):
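For readability, each CSN in the scenario decomposes into a timestamp, a change sequence number, the originating replica id, and a sub-sequence number. A small sketch of the decoding, assuming the standard 8+4+4+4 hex-digit CSN layout:

```python
from datetime import datetime, timezone

def parse_csn(csn: str) -> dict:
    """Split a 20-hex-digit CSN into its four fields.

    Layout (8+4+4+4 hex digits): timestamp, sequence number,
    replica id, sub-sequence number.
    """
    assert len(csn) == 20
    ts, seq, rid, subseq = csn[:8], csn[8:12], csn[12:16], csn[16:20]
    return {
        "time": datetime.fromtimestamp(int(ts, 16), tz=timezone.utc),
        "seqnum": int(seq, 16),    # increments for changes within one second
        "rid": int(rid, 16),       # which master originated the change
        "subseq": int(subseq, 16),
    }

# The failing CSN below originated on replica id 2:
info = parse_csn("53592baa000000020000")
print(info["rid"])  # -> 2
```

This makes it easy to see, for example, that 53592ba7000000010000 and 53592ba7000100010000 are two changes made on M1 (rid 1) within the same second.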

M1:
MOD entry_18: csn=53592ba7000000010000
DEL entry_1: csn=53592ba7000100010000
MOD entry_18: csn=53592ba8000000010000

            <-- M2
            MOD entry_19: csn=53592ba9000000020000
            MOD  entry_1: csn=53592baa000000020000 (error)
            MOD entry_19: csn=53592baa000100020000
    MOD entry_19: csn=53592bb6000000010000
    MOD entry_19: csn=53592bbb000000010000
    MOD entry_19: csn=53592bc1000000010000
    ...
    MOD entry_19: csn=53592beb000000010000
            <-- M2
            MOD entry_19: csn=53592bf2000000020000
            MOD entry_19: csn=53592bf8000000020000
            ...

M2:
MOD entry_19: csn=53592ba9000000020000
MOD entry_1: csn=53592baa000000020000
MOD entry_19: csn=53592baa000100020000
M1 -->
MOD entry_18: csn=53592ba7000000010000
DEL entry_1: csn=53592ba7000100010000
MOD entry_18: csn=53592ba8000000010000
MOD entry_19: csn=53592bb6000000010000
MOD entry_19: csn=53592bbb000000010000
MOD entry_19: csn=53592bc1000000010000
...
...

On M1, we can see that the MOD on entry_1 failed, but replication of M2's updates was not stopped:

[24/Apr/2014:17:20:13 +0200] conn=5 op=5 MOD dn="cn=new_account19,cn=staged user,dc=example,dc=com"
[24/Apr/2014:17:20:14 +0200] conn=5 op=5 RESULT err=0 tag=103 nentries=0 etime=1 csn=53592ba9000000020000
[24/Apr/2014:17:20:14 +0200] conn=5 op=6 MOD dn="cn=new_account1,cn=staged user,dc=example,dc=com"
[24/Apr/2014:17:20:14 +0200] conn=5 op=6 RESULT err=53 tag=103 nentries=0 etime=0 csn=53592baa000000020000
[24/Apr/2014:17:20:14 +0200] conn=5 op=7 MOD dn="cn=new_account19,cn=staged user,dc=example,dc=com"
[24/Apr/2014:17:20:16 +0200] conn=5 op=7 RESULT err=0 tag=103 nentries=0 etime=2 csn=53592baa000100020000
[24/Apr/2014:17:20:19 +0200] conn=3 op=40 SRCH base="dc=example,dc=com" scope=2 filter="(cn=new_account19)" attrs=ALL
[24/Apr/2014:17:20:19 +0200] conn=3 op=40 RESULT err=0 tag=101 nentries=1 etime=0
[24/Apr/2014:17:20:20 +0200] conn=5 op=8 EXT oid="2.16.840.1.113730.3.5.5" name="Netscape Replication End Session"
[24/Apr/2014:17:20:20 +0200] conn=5 op=8 RESULT err=0 tag=120 nentries=0 etime=0
[24/Apr/2014:17:20:21 +0200] conn=5 op=9 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
[24/Apr/2014:17:20:21 +0200] conn=3 op=42 MOD dn="cn=new_account19,cn=staged user,dc=example,dc=com"
[24/Apr/2014:17:20:21 +0200] conn=5 op=9 RESULT err=0 tag=120 nentries=0 etime=1
[24/Apr/2014:17:20:22 +0200] conn=3 op=42 RESULT err=0 tag=103 nentries=0 etime=1 csn=53592bb6000000010000
...
[24/Apr/2014:17:21:25 +0200] conn=5 op=44 MOD dn="cn=new_account19,cn=staged user,dc=example,dc=com"
[24/Apr/2014:17:21:25 +0200] conn=5 op=44 RESULT err=0 tag=103 nentries=0 etime=0 csn=53592bf2000000020000
...
[24/Apr/2014:17:21:30 +0200] conn=5 op=48 MOD dn="cn=new_account19,cn=staged user,dc=example,dc=com"
[24/Apr/2014:17:21:30 +0200] conn=5 op=48 RESULT err=0 tag=103 nentries=0 etime=0 csn=53592bf8000000020000

On M1, we can see that the failing MOD is added to the pending list, but when the next update (53592baa000100020000) is committed, the failing update is no longer in the pending list:

[24/Apr/2014:17:20:14 +0200] NSMMReplicationPlugin - 53592baa000000020000, not committed
[24/Apr/2014:17:20:14 +0200] NSMMReplicationPlugin - ruv_add_csn_inprogress: successfully inserted csn 53592baa000000020000 into pending list
[24/Apr/2014:17:20:14 +0200] ldbm_back_modify - Attempt to modify a tombstone entry nsuniqueid=d131a60c-cbc311e3-847ba7c2-9c316dbb,cn=new_account1,cn=staged user,dc=example,dc=com
[24/Apr/2014:17:20:14 +0200] NSMMReplicationPlugin - conn=5 op=6 csn=53592baa000000020000 process postop: canceling operation csn
[24/Apr/2014:17:20:15 +0200] NSMMReplicationPlugin - csnplInsert: CSN Pending list content:
[24/Apr/2014:17:20:15 +0200] NSMMReplicationPlugin - 53592baa000100020000, not committed
[24/Apr/2014:17:20:15 +0200] NSMMReplicationPlugin - ruv_add_csn_inprogress: successfully inserted csn 53592baa000100020000 into pending list
[24/Apr/2014:17:20:16 +0200] NSMMReplicationPlugin - replica_update_ruv: csn 53592baa000100020000
[24/Apr/2014:17:20:16 +0200] NSMMReplicationPlugin - csnplCommit: CSN Pending list content:
[24/Apr/2014:17:20:16 +0200] NSMMReplicationPlugin - 53592baa000100020000, not committed
[24/Apr/2014:17:20:16 +0200] NSMMReplicationPlugin - ruv_update_ruv: successfully committed csn 53592baa000100020000
[24/Apr/2014:17:20:16 +0200] NSMMReplicationPlugin - ruv_update_ruv: rolled up to csn 53592baa000100020000

On M2, we can see that the second replication session finds 53592baa000100020000 in the RUV:

[24/Apr/2014:17:20:12 +0200] Consumer RUV:
[24/Apr/2014:17:20:12 +0200] {replicageneration} 53590540000000010000
[24/Apr/2014:17:20:12 +0200] {replica 1 ldap://localhost.localdomain:44389} 53590546000000010000 53592ba8000000010000 00000000
[24/Apr/2014:17:20:12 +0200] {replica 2 ldap://localhost.localdomain:45389}
...
[24/Apr/2014:17:20:13 +0200] agmt="cn=meTo_localhost.localdomain:44389" (localhost:44389) - load=1 rec=2 csn=53592baa000000020000
[24/Apr/2014:17:20:13 +0200] NSMMReplicationPlugin - agmt="cn=meTo_localhost.localdomain:44389" (localhost:44389): replay_update: Sending modify operation (dn="cn=new_account1,cn=staged user,dc=example,dc=com" csn=53592baa000000020000)
[24/Apr/2014:17:20:13 +0200] NSMMReplicationPlugin - agmt="cn=meTo_localhost.localdomain:44389" (localhost:44389): replay_update: Consumer successfully sent operation with csn 53592baa000000020000
[24/Apr/2014:17:20:13 +0200] agmt="cn=meTo_localhost.localdomain:44389" (localhost:44389) - load=1 rec=3 csn=53592baa000100020000
[24/Apr/2014:17:20:13 +0200] NSMMReplicationPlugin - agmt="cn=meTo_localhost.localdomain:44389" (localhost:44389): replay_update: Sending modify operation (dn="cn=new_account19,cn=staged user,dc=example,dc=com" csn=53592baa000100020000)
[24/Apr/2014:17:20:13 +0200] NSMMReplicationPlugin - agmt="cn=meTo_localhost.localdomain:44389" (localhost:44389): replay_update: Consumer successfully sent operation with csn 53592baa000100020000
[24/Apr/2014:17:20:14 +0200] NSMMReplicationPlugin - agmt="cn=meTo_localhost.localdomain:44389" (localhost:44389): Consumer failed to replay change (uniqueid d131a60c-cbc311e3-847ba7c2-9c316dbb, CSN 53592baa000000020000): Server is unwilling to perform (53). Will retry later.
...
[24/Apr/2014:17:20:21 +0200] Consumer RUV:
[24/Apr/2014:17:20:21 +0200] {replicageneration} 53590540000000010000
[24/Apr/2014:17:20:21 +0200] {replica 1 ldap://localhost.localdomain:44389} 53590546000000010000 53592ba8000000010000 00000000
[24/Apr/2014:17:20:21 +0200] {replica 2 ldap://localhost.localdomain:45389} 53592ba9000000020000 53592baa000100020000 00000000

So replication did not stop on 53592baa000000020000, and the update was skipped although the supplier intended to retry it. This is likely related to the failing update being removed from the pending list: once it is gone, the next committed update, 53592baa000100020000, can go into the RUV.
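The pending-list behaviour described above can be sketched as follows. This is a toy model, not the server's actual csnpl code: it shows how cancelling a failed CSN lets the next committed CSN roll the RUV forward past it.

```python
class PendingList:
    """Toy model of the per-replica CSN pending list.

    The RUV maxcsn may only advance to a committed CSN once every
    earlier in-progress CSN has been committed or cancelled.
    """
    def __init__(self):
        self.pending = {}          # csn -> committed?
        self.ruv_maxcsn = None

    def insert(self, csn):
        self.pending[csn] = False  # in progress, not yet committed

    def cancel(self, csn):
        # what the postop code does on a failing replicated op
        self.pending.pop(csn, None)

    def commit(self, csn):
        self.pending[csn] = True
        # roll up: remove committed CSNs from the head of the list
        for c in sorted(self.pending):
            if not self.pending[c]:
                break              # an earlier CSN is still in progress
            del self.pending[c]
            self.ruv_maxcsn = c
        return self.ruv_maxcsn

pl = PendingList()
pl.insert("53592baa000000020000")   # the MOD that will fail
pl.insert("53592baa000100020000")   # the following MOD
pl.cancel("53592baa000000020000")   # err=53: CSN cancelled, not retried
print(pl.commit("53592baa000100020000"))  # RUV rolls past the failed CSN
```

Had the failing CSN stayed in the pending list instead of being cancelled, the commit of the later CSN could not have advanced the RUV past it, and the supplier would have resent it.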


No cloning; upstream tests are already written.

A new occurrence of this bug happened (testing with IPA).

'''Problem statement:'''
An update was skipped during replication after a transient failure.

'''Short description'''

A master receives a DEL. The DEL is successful and is replicated to a replica (replica2).
The replicated DEL fails on the replica.
Other updates following the DEL are then replicated, and the master skips resending the DEL.

The reason why the DEL failed on the replica is unknown. Likely it is a transient deadlock
issue where the replica attempted 50 times to write the DEL but failed with DB_DEADLOCK or DB_EAGAIN.
This transient error is not the problem investigated here; only the fact that the
DEL was skipped is of interest.

In the following commands (run hours after the problem happened),
we can see that the user entry 'onmaster' was deleted on the master (as was the group 'onmaster'),
but the user entry 'onmaster' was still present on replica2 (while the group delete was successfully applied).

On Master

{{{
[root@f22master slapd-BAGAM-NET]# ldapsearch -D "cn=directory manager" -w 'xxxx' -b "uid=onmaster,cn=users,cn=accounts,dc=bagam,dc=net" -LLL dn
No such object (32)
Matched DN: cn=users,cn=accounts,dc=bagam,dc=net

[root@f22master slapd-BAGAM-NET]# ldapsearch -D "cn=directory manager" -w 'xxxx' -b "cn=onmaster,cn=groups,cn=accounts,dc=bagam,dc=net" -LLL dn
No such object (32)
Matched DN: cn=groups,cn=accounts,dc=bagam,dc=net

}}}

On replica2

{{{
[root@f22replica2 slapd-BAGAM-NET]# ldapsearch -D "cn=directory manager" -w 'xxxx' -b "uid=onmaster,cn=users,cn=accounts,dc=bagam,dc=net" -LLL dn
dn: uid=onmaster,cn=users,cn=accounts,dc=bagam,dc=net

[root@f22replica2 slapd-BAGAM-NET]# ldapsearch -D "cn=directory manager" -w 'xxxx' -b "cn=onmaster,cn=groups,cn=accounts,dc=bagam,dc=net" -LLL dn
No such object (32)
Matched DN: cn=groups,cn=accounts,dc=bagam,dc=net

}}}

'''Detailed description:'''

Master (replicaId=4)

{{{

< direct OP (not filtered) >

[16/Jun/2015:10:13:19 -0400] conn=388 op=12 ADD dn="uid=onmaster,cn=users,cn=accounts,dc=bagam,dc=net"
[16/Jun/2015:10:13:20 -0400] conn=388 op=12 RESULT err=0 tag=105 nentries=0 etime=1 csn=55802f01000700040000
...
[16/Jun/2015:10:13:20 -0400] conn=388 op=20 MOD dn="cn=ipausers,cn=groups,cn=accounts,dc=bagam,dc=net"
[16/Jun/2015:10:13:21 -0400] conn=388 op=20 RESULT err=0 tag=103 nentries=0 etime=1 csn=55802f04000100040000
...
[16/Jun/2015:10:16:45 -0400] conn=402 op=20 DEL dn="uid=onmaster,cn=users,cn=accounts,dc=bagam,dc=net"
[16/Jun/2015:10:16:45 -0400] conn=402 op=21 UNBIND
[16/Jun/2015:10:16:45 -0400] conn=402 op=21 fd=120 closed - U1
[16/Jun/2015:10:16:45 -0400] conn=402 op=20 RESULT err=0 tag=107 nentries=0 etime=0 csn=55802fcf000300040000
<< error detected
[16/Jun/2015:10:18:36 -0400] NSMMReplicationPlugin - agmt="cn=f22master-to-f22replica2" (f22replica2:389): Consumer failed to replay change (uniqueid b8242e18-143111e5-b1d0d0c3-ae5854ff, CSN 55802fcf000300040000): Operations error (1). Will retry later.
>>
...

<restart after crash/poweroff -- 17/Jun/2015:03:37:10 >

< direct OP (not filtered) are replicated>

[17/Jun/2015:05:20:25 -0400] conn=104 op=9 ADD dn="ipauniqueid=autogenerate,cn=sudocmds,cn=sudo,dc=bagam,dc=net"
[17/Jun/2015:05:20:26 -0400] conn=104 op=9 RESULT err=0 tag=105 nentries=0 etime=1 csn=55813bdd000200040000
...
[17/Jun/2015:05:20:43 -0400] conn=109 op=9 ADD dn="ipauniqueid=autogenerate,cn=sudocmds,cn=sudo,dc=bagam,dc=net"
[17/Jun/2015:05:20:43 -0400] conn=109 op=9 RESULT err=0 tag=105 nentries=0 etime=0 csn=55813bed000400040000
...
< direct OP received by master and last received by replica2 >
[17/Jun/2015:08:51:47 -0400] conn=324 op=20 MOD dn="cn=ipausers,cn=groups,cn=accounts,dc=bagam,dc=net"
[17/Jun/2015:08:51:48 -0400] conn=324 op=20 RESULT err=0 tag=103 nentries=0 etime=1 csn=55816d65000d00040000

}}}

Master RUV:

{{{
dn: cn=replica,cn=dc\3Dbagam\2Cdc\3Dnet,cn=mapping tree,cn=config
cn: replica
nsDS5Flags: 1
objectClass: nsds5replica
objectClass: top
objectClass: extensibleobject
nsDS5ReplicaType: 3
nsDS5ReplicaRoot: dc=bagam,dc=net
nsds5ReplicaLegacyConsumer: off
nsDS5ReplicaId: 4
nsds5replicabinddngroupcheckinterval: 60
nsds5replicabinddngroup: cn=replication managers,cn=sysaccounts,cn=etc,dc=baga
m,dc=net
nsDS5ReplicaBindDN: cn=replication manager,cn=config
nsDS5ReplicaBindDN: krbprincipalname=ldap/f22replica1@BAGAM.NET,cn=s
ervices,cn=accounts,dc=bagam,dc=net
nsDS5ReplicaBindDN: krbprincipalname=ldap/f22replica2@BAGAM.NET,cn=s
ervices,cn=accounts,dc=bagam,dc=net
nsDS5ReplicaBindDN: krbprincipalname=ldap/f22replica3@BAGAM.NET,cn=s
ervices,cn=accounts,dc=bagam,dc=net
nsState:: BAAAAAAAAACCcoFVAAAAAAAAAAAAAAAAAgAAAAAAAAAEAAAAAAAAAA==
nsDS5ReplicaName: a4150806-140211e5-b1d0d0c3-ae5854ff
nsds50ruv: {replicageneration} 557fdff1000000040000
nsds50ruv: {replica 4 ldap://f22master:389} 557fdffc000100040000 55817283000100040000
nsds50ruv: {replica 6 ldap://f22replica3:389} 55800e1b000000060000 55816cce000200060000
nsds50ruv: {replica 5 ldap://f22replica2:389} 557fed70000000050000 55814857000100050000
nsds50ruv: {replica 3 ldap://f22replica1:389} 557fdffa000000030000 55815c72000400030000
nsds5agmtmaxcsn: dc=bagam,dc=net;meTof22replica3;f22replica3;389;6;55816d67000200040000
nsruvReplicaLastModified: {replica 4 ldap://f22master:389} 55817282
nsruvReplicaLastModified: {replica 6 ldap://f22replica3:389} 55816ccc
nsruvReplicaLastModified: {replica 5 ldap://f22replica2:389} 55814855
nsruvReplicaLastModified: {replica 3 ldap://f22replica1:389} 55815c76
nsds5ReplicaChangeCount: 1406
nsds5replicareapactive: 0
}}}

Replica2 (replicaId=5)

{{{
< direct OP and internal OP replicated>
[16/Jun/2015:10:13:21 -0400] conn=43 op=144 ADD dn="uid=onmaster,cn=users,cn=accounts,dc=bagam,dc=net"
[16/Jun/2015:10:13:21 -0400] conn=43 op=144 RESULT err=0 tag=105 nentries=0 etime=1 csn=55802f01000700040000
[16/Jun/2015:10:13:21 -0400] conn=43 op=145 MOD dn="dnaHostname=f22master+dnaPortNum=389,cn=posix-ids,cn=dna,cn=ipa,cn=etc,dc=bagam,dc=net"
[16/Jun/2015:10:13:21 -0400] conn=43 op=145 RESULT err=0 tag=103 nentries=0 etime=0 csn=55802f01000900040000
[16/Jun/2015:10:13:21 -0400] conn=43 op=146 MOD dn="uid=onmaster,cn=users,cn=accounts,dc=bagam,dc=net"
[16/Jun/2015:10:13:22 -0400] conn=43 op=146 RESULT err=0 tag=103 nentries=0 etime=1 csn=55802f02000000040000
[16/Jun/2015:10:13:22 -0400] conn=43 op=147 MOD dn="uid=onmaster,cn=users,cn=accounts,dc=bagam,dc=net"
[16/Jun/2015:10:13:23 -0400] conn=43 op=147 RESULT err=0 tag=103 nentries=0 etime=1 csn=55802f02000100040000
[16/Jun/2015:10:13:23 -0400] conn=43 op=148 ADD dn="cn=onmaster,cn=groups,cn=accounts,dc=bagam,dc=net"
[16/Jun/2015:10:13:23 -0400] conn=43 op=148 RESULT err=0 tag=105 nentries=0 etime=0 csn=55802f02000300040000
[16/Jun/2015:10:13:23 -0400] conn=43 op=149 MOD dn="uid=onmaster,cn=users,cn=accounts,dc=bagam,dc=net"
[16/Jun/2015:10:13:24 -0400] conn=43 op=149 RESULT err=0 tag=103 nentries=0 etime=1 csn=55802f02000400040000
[16/Jun/2015:10:13:24 -0400] conn=43 op=150 MOD dn="uid=onmaster,cn=users,cn=accounts,dc=bagam,dc=net"
[16/Jun/2015:10:13:24 -0400] conn=43 op=150 RESULT err=0 tag=103 nentries=0 etime=0 csn=55802f02000500040000
[16/Jun/2015:10:13:24 -0400] conn=43 op=153 MOD dn="uid=onmaster,cn=users,cn=accounts,dc=bagam,dc=net"
[16/Jun/2015:10:13:25 -0400] conn=43 op=153 RESULT err=0 tag=103 nentries=0 etime=1 csn=55802f02000600040000
[16/Jun/2015:10:13:25 -0400] conn=43 op=154 MOD dn="cn=ipausers,cn=groups,cn=accounts,dc=bagam,dc=net"
[16/Jun/2015:10:13:25 -0400] conn=43 op=154 RESULT err=0 tag=103 nentries=0 etime=0 csn=55802f04000100040000
[16/Jun/2015:10:13:25 -0400] conn=43 op=155 MOD dn="uid=onmaster,cn=users,cn=accounts,dc=bagam,dc=net"
[16/Jun/2015:10:13:25 -0400] conn=43 op=155 RESULT err=0 tag=103 nentries=0 etime=0 csn=55802f04000300040000
...
< restart of replica2: 10:15:05 -> 10:18:03 >
...
< The direct DEL fails, but the following internal OPs done by the master are replicated >
[16/Jun/2015:10:18:27 -0400] conn=8 op=6 DEL dn="uid=onmaster,cn=users,cn=accounts,dc=bagam,dc=net"
[16/Jun/2015:10:18:35 -0400] conn=8 op=6 RESULT err=1 tag=107 nentries=0 etime=8 csn=55802fcf000300040000
< internal OP related to DEL done on the master>
[16/Jun/2015:10:18:35 -0400] conn=8 op=7 MOD dn="cn=ipausers,cn=groups,cn=accounts,dc=bagam,dc=net"
[16/Jun/2015:10:18:36 -0400] conn=8 op=7 RESULT err=0 tag=103 nentries=0 etime=1 csn=55802fcf000400040000
[16/Jun/2015:10:18:36 -0400] conn=8 op=8 DEL dn="cn=onmaster,cn=groups,cn=accounts,dc=bagam,dc=net"
[16/Jun/2015:10:18:37 -0400] conn=8 op=8 RESULT err=0 tag=107 nentries=0 etime=1 csn=55802fcf000700040000
...

< direct OP (not filtered) are replicated>

[17/Jun/2015:05:20:26 -0400] conn=109 op=8 ADD dn="ipaUniqueID=16b493f6-14d2-11e5-ba0b-52540032eb33,cn=sudocmds,cn=sudo,dc=bagam,dc=net"
[17/Jun/2015:05:20:27 -0400] conn=109 op=8 RESULT err=0 tag=105 nentries=0 etime=1 csn=55813bdd000200040000
...
[17/Jun/2015:05:20:44 -0400] conn=109 op=27 ADD dn="ipaUniqueID=217c21c8-14d2-11e5-8070-52540032eb33,cn=sudocmds,cn=sudo,dc=bagam,dc=net"
[17/Jun/2015:05:20:44 -0400] conn=109 op=27 RESULT err=0 tag=105 nentries=0 etime=0 csn=55813bed000400040000

...
< last direct OP (not filtered) received >
[17/Jun/2015:08:51:52 -0400] conn=147 op=31 MOD dn="cn=ipausers,cn=groups,cn=accounts,dc=bagam,dc=net"
[17/Jun/2015:08:51:52 -0400] conn=147 op=31 RESULT err=0 tag=103 nentries=0 etime=0 csn=55816d65000d00040000
< last internal OP received >
[17/Jun/2015:08:51:52 -0400] conn=147 op=32 MOD dn="uid=newonmaster,cn=users,cn=accounts,dc=bagam,dc=net"
[17/Jun/2015:08:51:52 -0400] conn=147 op=32 RESULT err=0 tag=103 nentries=0 etime=0 csn=55816d67000200040000

}}}

Replica RUV

{{{

dn: cn=replica,cn=dc\3Dbagam\2Cdc\3Dnet,cn=mapping tree,cn=config
cn: replica
nsDS5Flags: 1
nsDS5ReplicaBindDN: cn=replication manager,cn=config
nsDS5ReplicaBindDN: krbprincipalname=ldap/f22master@BAGAM.NET,cn=ser
vices,cn=accounts,dc=bagam,dc=net
nsDS5ReplicaId: 5
nsDS5ReplicaName: bb57f414-140a11e5-bb008590-0402f91a
nsDS5ReplicaRoot: dc=bagam,dc=net
nsDS5ReplicaType: 3
nsState:: BQAAAAAAAABLcoFVAAAAAAAAAAAAAAAAAgAAAAAAAAAEAAAAAAAAAA==
nsds5ReplicaLegacyConsumer: off
nsds5replicabinddngroup: cn=replication managers,cn=sysaccounts,cn=etc,dc=baga
m,dc=net
nsds5replicabinddngroupcheckinterval: 60
objectClass: nsds5replica
objectClass: top
objectClass: extensibleobject
nsds50ruv: {replicageneration} 557fdff1000000040000
nsds50ruv: {replica 5 ldap://f22replica2:389} 557fed70000000050000 5581724d000100050000
nsds50ruv: {replica 4 ldap://f22master:389} 557fdffc000100040000 55816d67000200040000
nsds50ruv: {replica 3 ldap://f22replica1:389} 557fdffa000000030000 55816c33000200030000
nsds50ruv: {replica 6 ldap://f22replica3:389} 55800e1b000000060000 55816cce000200060000
nsds5agmtmaxcsn: dc=bagam,dc=net;f22replica2-to-f22replica1;f22replica1;389;3;55814857000100050000
nsruvReplicaLastModified: {replica 5 ldap://f22replica2:389} 5581724b
nsruvReplicaLastModified: {replica 4 ldap://f22master:389} 55816d68
nsruvReplicaLastModified: {replica 3 ldap://f22replica1:389} 55816c31
nsruvReplicaLastModified: {replica 6 ldap://f22replica3:389} 55816cce
nsds5ReplicaChangeCount: 1261
nsds5replicareapactive: 0

}}}

I ran the reproducer script non-stop for over a week, but I could not reproduce the issue. Pushing the ticket back to 1.3.5 until a better test case is developed.

We have a customer case where this issue is reproduced all the time. I can help reproduce this if needed.

Replying to [comment:9 gparente]:

We have a customer case where this issue is reproduced all the time. I can help reproduce this if needed.

What are the steps to reproduce the problem?

Sf case is 01340256

after 1 million adds and dels, customer is always hitting this:

errors:[14/Aug/2015:09:03:41 -0400] - Retry count exceeded in delete

I know you have already tried this testcase but we have more info. Apparently, changing the swap space, we could reproduce or not this issue. I will contact you tomorrow with more details.

Hi,

This ticket is about having regular updates skipped during a regular replication session.
If a supplier sends a lot of updates, and one update CSN_1_RID1 fails but another update CSN_2_RID1 is successful (CSN_2_RID1 > CSN_1_RID1), then replication continues without resending CSN_1_RID1.
(A python test case is provided.)

My understanding is that 'Retry count exceeded in delete' triggered a failure of the replicated DEL, and that DEL is then no longer resent. Is that correct?

So yes, it is likely the same bug, but a 1M add-del run is more complex than the provided test case.

Hi,

Yes, it's more complex, but from comment 8 I understood that the provided testcase was not helping to reproduce the issue. Our customer reproduces it practically always.

And as he is in a test environment, he is willing to test a candidate fix or a debug build.

Please take this into account because it could be helpful.

Bugzilla 1294770 is a "flavor" of this bug, specific to DELs. See the error log extracts in that bugzilla.

I need to test the patch in anger, but the approach seems sound. If there is a fault, stop as soon as possible for the admin to take action and investigate. That way we don't keep two masters with diverging datasets getting further and further out of sync.

Hi Mark,

The fix looks very good. I have just one question: why catch the error condition in postop and not in betxnpostop (write_changelog_and_ruv)?

thanks
thierry

Replying to [comment:21 tbordaz]:

Hi Mark,

The fix looks very good. I have just one question: why catch the error condition in postop and not in betxnpostop (write_changelog_and_ruv)?

thanks
thierry

I'd guess the point is whether the update on a master should be considered successful or not when the replicated op fails on a consumer. I think it should be successful, since when multiple consumers exist and some are successful, we cannot redo it.

Mark's approach gracefully covers the case, doesn't it? For instance, masterA has 2 consumers -- replicaX and replicaY. If replicating to replicaX is successful but replicaY fails, the connection to replicaY is closed and reconnected, and the same op is resent only to replicaY. Once it succeeds, masterA, replicaX and replicaY are in sync.

Mark, please correct me if I'm wrong... :) If everyone agrees, I'd like to ack this ticket...
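The per-agreement retry behaviour described above can be sketched with a toy model (not the actual server code): each replication agreement keeps its own position in the shared changelog, so a failure on one consumer only rewinds that consumer.

```python
class Agreement:
    """Toy model of a replication agreement: each agreement tracks
    its own replay position in the shared changelog."""
    def __init__(self, name):
        self.name = name
        self.replayed_upto = -1    # index of the last replayed change

def replicate(changelog, agreements, failures):
    """Replay outstanding changes to each consumer; stop an agreement
    at the first failing change so it is retried next session."""
    for agmt in agreements:
        for i in range(agmt.replayed_upto + 1, len(changelog)):
            if (agmt.name, changelog[i]) in failures:
                break              # close this session; retry from here later
            agmt.replayed_upto = i

changelog = ["csn1", "csn2", "csn3"]
x, y = Agreement("replicaX"), Agreement("replicaY")

# csn2 fails on replicaY only: replicaX gets everything,
# replicaY stops just before csn2
replicate(changelog, [x, y], failures={("replicaY", "csn2")})
print(x.replayed_upto, y.replayed_upto)   # -> 2 0

# after reconnect, csn2 onward is resent only to replicaY
replicate(changelog, [x, y], failures=set())
print(y.replayed_upto)                    # -> 2
```

The bug in this ticket is precisely that the failing agreement's position advanced past the failed change instead of stopping at it.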

Replying to [comment:22 nhosoi]:

I'd guess the point is whether the update on a master should be considered successful or not when the replicated op fails on a consumer. I think it should be successful, since when multiple consumers exist and some are successful, we cannot redo it.

Mark's approach gracefully covers the case, doesn't it? For instance, masterA has 2 consumers -- replicaX and replicaY. If replicating to replicaX is successful but replicaY fails, the connection to replicaY is closed and reconnected, and the same op is resent only to replicaY. Once it succeeds, masterA, replicaX and replicaY are in sync.

Mark, please correct me if I'm wrong... :) If everyone agrees, I'd like to ack this ticket...

Sorry, this is not the final patch. There have been several improvements made to it. It is also currently being tested by the customer. I was going to wait for their results before resending the review request. I can attach the patch they are testing for a "pre-review" if you want.

Looks good. Thanks, Mark! A couple of questions...

1) why this setting "return_value = CONN_BUSY" is not needed?
{{{
477 477 /* Got a result */
478     if ((rc == LDAP_SUCCESS) && (err == LDAP_BUSY))
479         return_value = CONN_BUSY;
480     else if (retoidp)
}}}

2) I'm wondering you may want to compare final_result instead of return_value at the line 1954? (I could be wrong, tho... :)
{{{
1951 int final_result = 0;
1940 1952 /* if connection was lost/closed, there will be nothing to read */
1941 repl5_inc_waitfor_async_results(rd);
1953 final_result = repl5_inc_waitfor_async_results(rd);
1954 if (return_value == UPDATE_NO_MORE_UPDATES){
}}}

3) An indentation mismatch? ;)
{{{
1259     /*
1260      * Release this replica so new sessions can begin
1261      */
}}}

Replying to [comment:24 nhosoi]:

Looks good. Thanks, Mark! A couple of questions...

1) why this setting "return_value = CONN_BUSY" is not needed?
{{{
477 477 /* Got a result */
478     if ((rc == LDAP_SUCCESS) && (err == LDAP_BUSY))
479         return_value = CONN_BUSY;
480     else if (retoidp)
}}}

It's only used for total updates (when a retoid is used). If it's not a total update, we need to treat LDAP_BUSY (CONN_BUSY) as an error. It's unfortunate that CONN_BUSY can have two meanings in two different contexts. Maybe in the future this can be redesigned, but it's out of scope for this fix. So for now we have to do it this way.

2) I'm wondering you may want to compare final_result instead of return_value at the line 1954? (I could be wrong, tho... :)
{{{
1951 int final_result = 0;
1940 1952 /* if connection was lost/closed, there will be nothing to read */
1941 repl5_inc_waitfor_async_results(rd);
1953 final_result = repl5_inc_waitfor_async_results(rd);
1954 if (return_value == UPDATE_NO_MORE_UPDATES){
}}}

final_result is the last result found by the async result thread. return_value can be outdated, so we double-check the result before proceeding - but we only need to check for a new error if the previous result was successful (UPDATE_NO_MORE_UPDATES). Otherwise it's already an error and there's no need to reset it.
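The precedence rule described above - a later async error may replace a success status, but must never clobber an existing error - can be sketched as follows. The status names come from the patch discussion; the numeric values are placeholders for the sketch.

```python
# Placeholder status codes for the sketch; the real values live in the
# replication plugin headers.
UPDATE_NO_MORE_UPDATES = 0   # success: nothing left to send
UPDATE_TRANSIENT_ERROR = 1   # some error reported asynchronously

def merge_result(return_value, final_result):
    """Only replace a success status with the async thread's result;
    an earlier error always wins and is never reset."""
    if return_value == UPDATE_NO_MORE_UPDATES:
        return final_result
    return return_value

# A late async error surfaces...
assert merge_result(UPDATE_NO_MORE_UPDATES, UPDATE_TRANSIENT_ERROR) == UPDATE_TRANSIENT_ERROR
# ...but an existing error is not overwritten by a late success.
assert merge_result(UPDATE_TRANSIENT_ERROR, UPDATE_NO_MORE_UPDATES) == UPDATE_TRANSIENT_ERROR
```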

3) An indentation mismatch? ;)
{{{
1259     /*
1260      * Release this replica so new sessions can begin
1261      */
}}}

I know, like I said this is what I sent the customer, it's not the official patch for master branch. This is pre-review :-)

Thanks, Mark! Regarding question (2), Coverity may not like it, since final_result is assigned but not used... If it's not needed, we don't have to declare final_result and can just call repl5_inc_waitfor_async_results with (void *) or something appropriate?

Replying to [comment:26 nhosoi]:

Thanks, Mark! Regarding question (2), Coverity may not like it, since final_result is assigned but not used... If it's not needed, we don't have to declare final_result and can just call repl5_inc_waitfor_async_results with (void *) or something appropriate?

Oh, it's supposed to be:

{{{
return_result = final_result
}}}

not

{{{
return_result = rd->result
}}}

The end result is the same, though. This is why I didn't put the patch out for review yet: it's not fully polished. Hence, pre-review.

Attached new patch for review...

30c4852..ab6501a master -> master
commit ab6501a
Author: Mark Reynolds mreynolds@redhat.com
Date: Sun Jan 17 18:25:43 2016 -0500

a1635fc..407c545 389-ds-base-1.3.4 -> 389-ds-base-1.3.4
commit 407c545

ad08d1f..50dd77f 389-ds-base-1.3.3 -> 389-ds-base-1.3.3
commit 50dd77f

8e49d6d..1643bbd 389-ds-base-1.3.2 -> 389-ds-base-1.3.2
commit 1643bbd

Fix cherry pick
1643bbd..33f8698 389-ds-base-1.3.2 -> 389-ds-base-1.3.2
commit 33f8698

c24ecab..80c68e2 389-ds-base-1.3.1 -> 389-ds-base-1.3.1
commit 80c68e202ff2b50a60f35b6683ef26b41609bc56

f69d19a..96b3a5b 389-ds-base-1.2.11 -> 389-ds-base-1.2.11
commit 96b3a5b

Running the freeipa 4.3 acceptance tests, some failures turned out to be related to this bug fix.
The following fix successfully prevented the failures:

{{{
diff --git a/ldap/servers/plugins/replication/repl5_plugins.c b/ldap/servers/plugins/replication/repl5_plugins.c
index c2fa214..17b0008 100644
--- a/ldap/servers/plugins/replication/repl5_plugins.c
+++ b/ldap/servers/plugins/replication/repl5_plugins.c
@@ -1307,7 +1307,7 @@ process_postop (Slapi_PBlock *pb)
             slapi_ch_free((void **) &op_params->p.p_modrdn.modrdn_newsuperior_address.uniqueid);
         }
     }
-    if (!ignore_error_and_keep_going(retval)){
+    if (is_replicated_operation && !ignore_error_and_keep_going(retval)){
        /*
         * We have an error we can't ignore. Release the replica and close
         * the connection to stop the replication session.
}}}

But I think there is still something weird in the results. In fact, some tests were failing with 'OPERATION_ERROR' (and so the connection was closed), but in the access logs none of the reported failed operations was logged with an error; they were logged with 'err=0'.

Moved the postop error checking so it's only called for replicated operations.
0001-Ticket-47788-Only-check-postop-result-if-its-a-repli.patch

Looks good!

The patch 0001-Ticket-47788-Only-check-postop-result-if-its-a-repli.patch has the same logic as the fix in this comment: https://fedorahosted.org/389/ticket/47788#comment:31. That is, your fix is already tested!

82abaf1..d7b598d master -> master
commit d7b598d
Author: Mark Reynolds mreynolds@redhat.com
Date: Thu Feb 11 16:18:00 2016 -0500

7fb593f..bd254a2 389-ds-base-1.3.4 -> 389-ds-base-1.3.4
commit bd254a2

50dd77f..1cfd332 389-ds-base-1.3.3 -> 389-ds-base-1.3.3
commit 1cfd332

33f8698..8817a13 389-ds-base-1.3.2 -> 389-ds-base-1.3.2
commit 8817a13

80c68e2..ba0358b 389-ds-base-1.3.1 -> 389-ds-base-1.3.1
commit ba0358b5b76cbcd7ec6435e631c2eb313a5738bd

d1cf785..e122758 389-ds-base-1.2.11 -> 389-ds-base-1.2.11
commit e122758

Metadata Update from @tbordaz:
- Issue assigned to mreynolds
- Issue set to the milestone: 1.2.11.33

2 years ago
