389-ds-base

#47386 389 DS Replication failures due to Fractional updates

Closed: wontfix. Opened 10 years ago by jraquino.

389 DS throws an incremental replication error that stops the replication process when a consumer attempts to replicate with a supplier whose max or min CSN is equal to a modification that is excluded from replication.

For example, in FreeIPA the following attributes are excluded from incremental and total replications:
nsDS5ReplicatedAttributeList: (objectclass=*) $ EXCLUDE memberof entryusn krblastsuccessfulauth krblastfailedauth krbloginfailedcount idnssoaserial

nsDS5ReplicatedAttributeListTotal: (objectclass=*) $ EXCLUDE entryusn krblastsuccessfulauth krblastfailedauth krbloginfailedcount

These excludes are meant to provide performance optimizations. In the case of MemberOf, there is a MemberOf plugin that is supposed to perform the modifications locally rather than relying on the replication process to reconcile the difference.
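The two EXCLUDE lists are not identical, and the difference is easy to miss when reading them by eye. The following is a minimal Python sketch that compares them; `parse_excluded_attrs` is a hypothetical helper written for illustration (it is not part of 389 DS or FreeIPA) and it assumes the simple `(filter) $ EXCLUDE attr ...` shape of the values quoted in this ticket.

```python
def parse_excluded_attrs(value):
    """Return the lower-cased attribute names listed after EXCLUDE.

    Assumes the value looks like "(objectclass=*) $ EXCLUDE attr1 attr2 ...".
    """
    _filter, sep, rest = value.partition("$")
    if not sep:
        return []
    tokens = rest.split()
    if not tokens or tokens[0].upper() != "EXCLUDE":
        return []
    return [t.lower() for t in tokens[1:]]


# Values as quoted in this ticket.
incremental = parse_excluded_attrs(
    "(objectclass=*) $ EXCLUDE memberof entryusn krblastsuccessfulauth "
    "krblastfailedauth krbloginfailedcount idnssoaserial"
)
total = parse_excluded_attrs(
    "(objectclass=*) $ EXCLUDE entryusn krblastsuccessfulauth "
    "krblastfailedauth krbloginfailedcount"
)

# Attributes left out of incremental updates but not out of total updates.
only_incremental = sorted(set(incremental) - set(total))
print(only_incremental)
```

With the values from this ticket, the attributes that appear only in the incremental list are `idnssoaserial` and `memberof`.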

This problem occurs for us fairly frequently since we have a great deal of change in our environment and should be considered a major issue.

rmeggins commented 10 years ago

Ticket has been cloned to Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=971966

wortmanb commented 10 years ago

attachment
agmt.ldif

wortmanb commented 10 years ago

attachment
ruv.ldif

wortmanb commented 10 years ago

attachment
cldb.txt

rmeggins commented 10 years ago

Thanks.

The error message is this:
{{{
NSMMReplicationPlugin - changelog program - agmt="cn=meTogood1.foo.com" (good1:389): CSN 520a49640000001d0000 not found, we aren't as up to date, or we purged
agmt="cn=meTogood1.foo.com" (good1:389) - Can't locate CSN 520a49640000001d0000 in the changelog (DB rc=-30988). The consumer may need to be reinitialized.
}}}

Which server in the scrubbed files corresponds to good1.foo.com? What is the hostname of the supplier?

Is ruv.ldif the output from the supplier? Can you also provide the RUV data from the consumer (e.g. the consumer good1.foo.com)?

rmeggins commented 10 years ago

The problematic CSN is 520a49640000001d0000. This change originated on replica 29 ldap://replica4.foo.net:389 on Tue Aug 13 08:57:40 2013 (0x520a4964 == 1376405860 time_t).

The latest change from replica 29 ldap://replica4.foo.net:389 is 521258130000001d0000 which is from Mon Aug 19 11:38:27 2013

One possibility is that 520a49640000001d0000 really was purged from the supplier's changelog. But looking at the changelog the first "real" change is 51659d3b0000001c0000 which is from Wed Apr 10 11:11:23 2013 so it doesn't seem likely that the problematic CSN was purged.

It is likely that 520a49640000001d0000 is not a replicated change: it is an update to memberof or some krb attribute or some other change that was not replicated. You can verify this by going to replica4.foo.net and running dbscan -f /var/lib/dirsrv/slapd-FOO-COM/cldb/*.db4 -k 520a49640000001d0000. If so, this change will be in that changelog but in no other changelog.
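The CSN decoding done by hand in this comment can be sketched in Python. The field layout used here (8 hex digits of timestamp, 4 of sequence number, 4 of replica ID, 4 of sub-sequence number) is an assumption inferred from the CSNs and replica IDs quoted in this ticket; `parse_csn` and the `CSN` tuple are hypothetical names, not a 389 DS API.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class CSN(NamedTuple):
    timestamp: int   # seconds since the epoch (time_t)
    seqnum: int
    replica_id: int
    subseqnum: int


def parse_csn(text):
    """Split a 20-hex-digit CSN string into its four fields."""
    if len(text) != 20:
        raise ValueError("expected 20 hex digits, got %r" % text)
    return CSN(
        timestamp=int(text[0:8], 16),
        seqnum=int(text[8:12], 16),
        replica_id=int(text[12:16], 16),
        subseqnum=int(text[16:20], 16),
    )


csn = parse_csn("520a49640000001d0000")
when = datetime.fromtimestamp(csn.timestamp, tz=timezone.utc)
print(csn.replica_id, csn.timestamp, when.isoformat())
```

The timestamp is printed in UTC here, so it reads 14:57:40 rather than the 08:57:40 local time quoted above.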

wortmanb commented 10 years ago

Replying to [comment:4 rmeggins]:
> Thanks.
>
> The error message is this:
> {{{
> NSMMReplicationPlugin - changelog program - agmt="cn=meTogood1.foo.com" (good1:389): CSN 520a49640000001d0000 not found, we aren't as up to date, or we purged
> agmt="cn=meTogood1.foo.com" (good1:389) - Can't locate CSN 520a49640000001d0000 in the changelog (DB rc=-30988). The consumer may need to be reinitialized.
> }}}
>
> Which server in the scrubbed files corresponds to good1.foo.com? What is the hostname of the supplier?

The supplier is ipamaster.foo.com. "good1" corresponds to "replica3". Sorry I didn't keep my anonymization consistent, though among the three attached files, it is consistent. And I have a chart now so it will be from this point forward.

> Is ruv.ldif the output from the supplier? Can you also provide the RUV data from the consumer (e.g. the consumer good1.foo.com)?

Yes. All three of these files are from ipamaster, the server from which all the replicas...replicate. I think that makes it the supplier.

I'll see what I can pull this morning. Is there a specific command you'd like me to run to gather relevant data or use the same one as on the supplier? I'll start assuming you want the same command and if it's different, I'll adjust as needed.

wortmanb commented 10 years ago

attachment
replica3-ruv.ldif

wortmanb commented 10 years ago

The ruv.ldif from replica3 has been added. I retyped it by hand, so there may be a typo here or there, but I think I got it right.

jraquino commented 10 years ago

The #1 symptom at this point is a failure to replicate due to a missing CSN, resulting in a server refusing to retry and finally giving the message:
" -1 Incremental update has failed and requires administrator actionLDAP error: Can't contact LDAP server"

The CSN error in the logs looks like this:
[16/Oct/2013:07:33:55 -0700] NSMMReplicationPlugin - changelog program - agmt="cn=meToipa.example.com" (authmgr1:389): CSN 525e037b0000006e0000 not found, we aren't as up to date, or we purged

My method for troubleshooting these is:
Find which server generated the CSN and restart its daemon, possibly also going to its subscribers and forcing a sync.

jraquino commented 10 years ago

The CSNs that I am currently trying to rectify appear as though they may have been triggered by the yum update I performed last night. So far, I have had to restart the daemon of several servers whose stuck CSN upstream looked like this:

dbid: 525e07ae000700530000
replgen: 1381892367 Tue Oct 15 19:59:27 2013
csn: 525e07ae000700530000
uniqueid: e4f5c187-360e11e3-ad8ff7cb-7e360ef1
parentuniqueid: a4463b22-1abe11e1-9fdabec4-908cc93a
dn: dnaHostname=auth2.example.com+dnaPortNum=389,cn=posix-ids,cn=dna,cn=ipa,cn=etc,dc=example,dc=com
operation: add
objectClass: dnaSharedConfig
objectClass: top
dnaHostname: auth2.example.com
dnaPortNum: 389
dnaSecurePortNum: 636
dnaRemainingValues: 0
creatorsName: cn=Distributed Numeric Assignment Plugin,cn=plugins,cn=config
modifiersName: cn=Distributed Numeric Assignment Plugin,cn=plugins,cn=config
createTimestamp: 20131016025924Z
modifyTimestamp: 20131016025924Z
nsUniqueId: e4f5c187-360e11e3-ad8ff7cb-7e360ef1
parentid: 29
entryid: 16579
entrydn: dnahostname=auth2.example.com+dnaportnum=389,cn=posix-ids,cn=dna,cn=ipa,cn=etc,dc=expertcity,dc=com

jraquino commented 10 years ago

I have found a CSN that -isn't- associated with DNA... This seems more in line with the original ticket as it is a memberof modification.

Subscriber servers show the following CSN in their change log:

dbid: 525df4fc001000660000
replgen: 1381904516 Tue Oct 15 23:21:56 2013
csn: 525df4fc001000660000
uniqueid: b9210e81-360311e3-b543f0eb-6027c45c
dn: fqdn=pvm22.example.com,cn=computers,cn=accounts,dc=example.com,dc=com
operation: modify
modifiersName: cn=MemberOf Plugin,cn=plugins,cn=config
modifyTimestamp: 20131016013931Z

The Supplier who actually generated the original CSN has this in its change log:

dbid: 525df4fc001000660000
replgen: 1381887571 Tue Oct 15 18:39:31 2013
csn: 525df4fc001000660000
uniqueid: b9210e81-360311e3-b543f0eb-6027c45c
dn: fqdn=pvm22.example.com,cn=computers,cn=accounts,dc=example,dc=com
operation: modify
memberOf: cn=cloudstack-qai,cn=hostgroups,cn=accounts,dc=example,dc=com
memberOf: ipauniqueid=518aa7a2-1ac8-11e1-bcf6-9c8e9927cab0,cn=hbac,dc=example,dc=com
memberOf: ipauniqueid=746e8b26-1ac8-11e1-8c2a-9c8e9927cab0,cn=sudorules,cn=sudo,dc=example,dc=com
memberOf: ipauniqueid=7bb1f7ce-1ac8-11e1-a9ff-9c8e9927cab0,cn=sudorules,cn=sudo,dc=example,dc=com
memberOf: ipauniqueid=af86055e-1ac8-11e1-94c1-9c8e9927cab0,cn=sudorules,cn=sudo,dc=example,dc=com
memberOf: ipauniqueid=5fd3d1e4-30e9-11e1-b10e-9c8e9927cab0,cn=sudorules,cn=sudo,dc=example,dc=com
memberOf: ipauniqueid=1535e8f6-3315-11e1-b77a-9c8e9927cab0,cn=hbac,dc=example,dc=com
memberOf: ipauniqueid=d3be4e14-f04b-11e1-b5f7-9c8e9927cab0,cn=hbac,dc=example,dc=com
memberOf: cn=cloudstack-qai,cn=ng,cn=alt,dc=example,dc=com
modifiersname: cn=MemberOf Plugin,cn=plugins,cn=config
modifytimestamp: 20131016013931Z
entryusn: 19645051

So in this instance, it seems that a memberof plugin operation caused the increment of the CSN and a minor replication of the modify timestamp... Should we be updating the replicated modify timestamp for a local modification that doesn't get replicated?

rmeggins commented 10 years ago

Replying to [comment:12 jraquino]:
> I have found a CSN that -isn't- associated with DNA... This seems more in line with the original ticket as it is a memberof modification.
>
> Subscriber servers show the following CSN in their change log:
>
> dbid: 525df4fc001000660000
> replgen: 1381904516 Tue Oct 15 23:21:56 2013
> csn: 525df4fc001000660000
> uniqueid: b9210e81-360311e3-b543f0eb-6027c45c
> dn: fqdn=pvm22.example.com,cn=computers,cn=accounts,dc=example.com,dc=com
> operation: modify
> modifiersName: cn=MemberOf Plugin,cn=plugins,cn=config
> modifyTimestamp: 20131016013931Z

That's it? Nothing else? If so, then . . .

> The Supplier who actually generated the original CSN has this in its change log:
>
> dbid: 525df4fc001000660000
> replgen: 1381887571 Tue Oct 15 18:39:31 2013
> csn: 525df4fc001000660000
> uniqueid: b9210e81-360311e3-b543f0eb-6027c45c
> dn: fqdn=pvm22.example.com,cn=computers,cn=accounts,dc=example,dc=com
> operation: modify
> memberOf: cn=cloudstack-qai,cn=hostgroups,cn=accounts,dc=example,dc=com
> memberOf: ipauniqueid=518aa7a2-1ac8-11e1-bcf6-9c8e9927cab0,cn=hbac,dc=example,dc=com
> memberOf: ipauniqueid=746e8b26-1ac8-11e1-8c2a-9c8e9927cab0,cn=sudorules,cn=sudo,dc=example,dc=com
> memberOf: ipauniqueid=7bb1f7ce-1ac8-11e1-a9ff-9c8e9927cab0,cn=sudorules,cn=sudo,dc=example,dc=com
> memberOf: ipauniqueid=af86055e-1ac8-11e1-94c1-9c8e9927cab0,cn=sudorules,cn=sudo,dc=example,dc=com
> memberOf: ipauniqueid=5fd3d1e4-30e9-11e1-b10e-9c8e9927cab0,cn=sudorules,cn=sudo,dc=example,dc=com
> memberOf: ipauniqueid=1535e8f6-3315-11e1-b77a-9c8e9927cab0,cn=hbac,dc=example,dc=com
> memberOf: ipauniqueid=d3be4e14-f04b-11e1-b5f7-9c8e9927cab0,cn=hbac,dc=example,dc=com
> memberOf: cn=cloudstack-qai,cn=ng,cn=alt,dc=example,dc=com
> modifiersname: cn=MemberOf Plugin,cn=plugins,cn=config
> modifytimestamp: 20131016013931Z
> entryusn: 19645051
>
> So in this instance, it seems that a memberof plugin operation caused the increment of the CSN and a minor replication of the modify timestamp... Should we be updating the replicated modify timestamp for a local modification that doesn't get replicated?

. . . no, we should not. If we are not replicating an operation because it was only a memberof change, we should certainly not be replicating the modifiersname and modifytimestamp of that operation.

rmeggins commented 10 years ago

Do this:

grep -i nsds5ReplicaStripAttrs /etc/dirsrv/slapd-*/dse.ldif

jraquino commented 10 years ago

Ah ha! I think this condition -might- be reproducible with IPA servers or at least the nsDS5ReplicatedAttributeList and nsDS5ReplicatedAttributeListTotal from IPA....

nsDS5ReplicatedAttributeList: (objectclass=*) $ EXCLUDE memberof idnssoaserial entryusn krblastsuccessfulauth krblastfailedauth krbloginfailedcount

nsDS5ReplicatedAttributeListTotal: (objectclass=*) $ EXCLUDE entryusn krblastsuccessfulauth krblastfailedauth krbloginfailedcount

Set up a replica ring with 5 participants.

1 & 2 should have agreements with each other and 3
4 & 5 should have agreements with each other and 3

Populate them with some fake entries.

Shut down the daemon on 3

Add an object to a group on server 1 so as to trigger a memberof update

Then perform a full re-initialization of 2 from 1.

Then start the daemon on 3

I believe this should be enough to cause the condition described in this ticket

rmeggins commented 10 years ago

What does CSN 525df4fc001000660000 look like in the changelog of other servers? Does it have only modifiersname and modifytimestamp? Does it exist at all?

jraquino commented 10 years ago

Some systems don't have it in their change log at all.
Other systems have an entry containing only the modification to the timestamp, which looks like this:

dbid: 525df4fc001000660000
replgen: 1381901830 Tue Oct 15 22:37:10 2013
csn: 525df4fc001000660000
uniqueid: b9210e81-360311e3-b543f0eb-6027c45c
dn: fqdn=pvm22.example,cn=computers,cn=accounts,dc=example,dc=com
operation: modify
modifiersName: cn=MemberOf Plugin,cn=plugins,cn=config
modifyTimestamp: 20131016013931Z

rmeggins commented 10 years ago

Can you confirm your 389-ds-base version? Run: rpm -q 389-ds-base

jraquino commented 10 years ago

389-ds-base-1.2.11.15-22.el6_4.x86_64
389-ds-base-libs-1.2.11.15-22.el6_4.x86_64

rmeggins commented 10 years ago

Replying to [comment:19 jraquino]:
> 389-ds-base-1.2.11.15-22.el6_4.x86_64
> 389-ds-base-libs-1.2.11.15-22.el6_4.x86_64

Ok. Those are the latest available for RHEL 6.4 and it looks like the changes were generated with those packages.

Referring to https://fedorahosted.org/389/ticket/47386?replyto=19#comment:12

The originating server for csn: 525df4fc001000660000 - was the service restarted at or around Tue Oct 15 18:39:31 2013? Or Tue Oct 15 23:21:56 2013? Or Tue Oct 15 22:37:10 2013?

The fact that replicas have mods with only modifiersName and modifyTimestamp is a big problem. Looking at the code that removes the fractional and strip attrs - https://git.fedorahosted.org/cgit/389/ds.git/tree/ldap/servers/plugins/replication/repl5_protocol_util.c?h=389-ds-base-1.2.11#n690 - it looks pretty solid. The only way this can happen is if the incoming mods list is somehow corrupted or the strip attrs list is corrupted. Maybe, if the server is attempting replication during startup, the lists are not fully initialized yet?
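To make the expected behaviour concrete, here is a simplified Python model of the filtering discussed in this thread. It is not the C code linked above; `filter_mods` is a hypothetical function, and the rule it encodes (an update that contains only strip attributes after fractional exclusion should not be sent at all) is an assumption taken from this discussion.

```python
def filter_mods(mods, excluded, strip):
    """Return the mods that would be sent, or [] if nothing should be sent.

    mods     -- list of (attribute, value) pairs from one modify operation
    excluded -- attribute names from the fractional EXCLUDE list
    strip    -- attribute names from the strip-attrs list
    """
    excluded = {a.lower() for a in excluded}
    strip = {a.lower() for a in strip}
    # Step 1: drop attributes excluded by fractional replication.
    kept = [(a, v) for a, v in mods if a.lower() not in excluded]
    # Step 2: if only bookkeeping attributes remain, send nothing.
    if all(a.lower() in strip for a, _ in kept):
        return []
    return kept


EXCLUDED = ["memberof", "entryusn"]
STRIP = ["modifyTimestamp", "modifiersName"]

# Shape of the memberof plugin operation shown earlier in this ticket.
memberof_op = [
    ("memberOf", "cn=cloudstack-qai,cn=hostgroups,cn=accounts,dc=example,dc=com"),
    ("modifiersname", "cn=MemberOf Plugin,cn=plugins,cn=config"),
    ("modifytimestamp", "20131016013931Z"),
    ("entryusn", "19645051"),
]

with_strip = filter_mods(memberof_op, EXCLUDED, STRIP)
without_strip = filter_mods(memberof_op, EXCLUDED, [])
print(with_strip)
print([a for a, _ in without_strip])
```

In this model, an empty strip list leaves exactly the two bookkeeping attributes in the update, which matches the shape of the changelog entries reported on the consumers.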

rmeggins commented 10 years ago

Ok. Let's find out which supplier sent the change
{{{
dbid: 525df4fc001000660000
replgen: 1381904516 Tue Oct 15 23:21:56 2013
csn: 525df4fc001000660000
uniqueid: b9210e81-360311e3-b543f0eb-6027c45c
dn: fqdn=pvm22.example.com,cn=computers,cn=accounts,dc=example.com,dc=com
operation: modify
modifiersName: cn=MemberOf Plugin,cn=plugins,cn=config
modifyTimestamp: 20131016013931Z
}}}

First - on this server
{{{
grep 525df4fc001000660000 /var/log/dirsrv/slapd-INST/access*
}}}

You will see a line like this:
{{{
[17/Oct/2013:16:06:13 -0600] conn=N op=5 RESULT err=0 tag=105 nentries=0 etime=1 csn=52605f55000000010000
}}}
We need the conn=N from this result, and the date/time stamp. Not a literal "N", but the integer connection number. Next
{{{
grep "conn=N fd=" /var/log/dirsrv/slapd-INST/access*
}}}
You should see a line like this:
{{{
[17/Oct/2013:16:06:12 -0600] conn=N fd=66 slot=66 connection from 127.0.0.1 to 127.0.0.1
}}}

The "from IP address" will tell you the IP address of the supplier.
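The two grep steps above can also be sketched as one Python function. `find_supplier` and the two regular expressions are hypothetical, written against the access log line shapes quoted in this ticket. Connection numbers restart when the server restarts, so on real logs the match should be checked against the date/time stamp as well.

```python
import re

# "conn=N op=M RESULT ... csn=<csn>" lines, as in the RESULT example above.
RESULT_RE = re.compile(r"conn=(\d+) op=\d+ RESULT .*csn=([0-9a-fA-F]+)")
# "conn=N fd=F slot=S connection from <ip> to <ip>" lines.
CONNECT_RE = re.compile(
    r"conn=(\d+) fd=\d+ slot=\d+ .*connection from (\S+) to (\S+)"
)


def find_supplier(lines, csn):
    """Return the 'from' address of the connection that delivered csn."""
    conn = None
    for line in lines:
        m = RESULT_RE.search(line)
        if m and m.group(2).lower() == csn.lower():
            conn = m.group(1)
            break
    if conn is None:
        return None
    for line in lines:
        m = CONNECT_RE.search(line)
        if m and m.group(1) == conn:
            return m.group(2)
    return None


# Sample lines in the same shape as the access log excerpts in this ticket.
sample = [
    "[15/Oct/2013:23:21:45 -0700] conn=99 fd=118 slot=118 "
    "connection from 127.0.0.1 to 127.0.0.2",
    "[15/Oct/2013:23:21:56 -0700] conn=99 op=205 RESULT err=0 tag=103 "
    "nentries=0 etime=0 csn=525df4fc001000660000",
]
supplier = find_supplier(sample, "525df4fc001000660000")
print(supplier)
```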

Next, on the supplier, is there anything in the supplier errors log around the date/time of when the consumer received the update?

jraquino commented 10 years ago

1st:
/var/log/dirsrv/slapd-EXAMPLE-COM/access.20131015-195014:[15/Oct/2013:23:21:56 -0700] conn=99 op=205 RESULT err=0 tag=103 nentries=0 etime=0 csn=525df4fc001000660000

2d:
/var/log/dirsrv/slapd-EXAMPLE-COM/access.20131015-195014:[15/Oct/2013:23:21:45 -0700] conn=99 fd=118 slot=118 connection from 127.0.0.1 to 127.0.0.2

3d:
[15/Oct/2013:23:21:45 -0700] NSMMReplicationPlugin - agmt="cn=meToauth1.example.com" (auth1:389): Replication bind with GSSAPI auth resumed

I don't see any errors or odd activity in the error log of the supplying server. Though I ALSO don't see the CSN in the supplying server's change log...

However, if memory serves, at the time of the issue the supplying server had a tombstone entry for one of the other upstream replica servers whose max CSN == 525df4fc001000660000

rmeggins commented 10 years ago

Replying to [comment:22 jraquino]:
> 1st:
> /var/log/dirsrv/slapd-EXAMPLE-COM/access.20131015-195014:[15/Oct/2013:23:21:56 -0700] conn=99 op=205 RESULT err=0 tag=103 nentries=0 etime=0 csn=525df4fc001000660000
>
> 2d:
> /var/log/dirsrv/slapd-EXAMPLE-COM/access.20131015-195014:[15/Oct/2013:23:21:45 -0700] conn=99 fd=118 slot=118 connection from 127.0.0.1 to 127.0.0.2
>
> 3d:
> [15/Oct/2013:23:21:45 -0700] NSMMReplicationPlugin - agmt="cn=meToauth1.example.com" (auth1:389): Replication bind with GSSAPI auth resumed

This is the errors log from the machine 127.0.0.1?
Replication bind resumed at that time? What was going on before that? Had that server recently been restarted?

> I don't see any errors or odd activity in the error log of the supplying server. Though I ALSO don't see the CSN in the supplying server's change log...

That's very strange - how could it send a CSN that it didn't have?

> However, if memory serves, at the time of the issue the supplying server had a tombstone entry

RUV tombstone? Or deleted entry tombstone?

> for one of the other upstream replica servers whose max CSN == 525df4fc001000660000

0x66 == Replica ID 102 - which supplier is 102? Is it the same supplier as machine 127.0.0.1?

jraquino commented 10 years ago

Replying to [comment:23 rmeggins]:
> Replying to [comment:22 jraquino]:
> > 1st:
> > /var/log/dirsrv/slapd-EXAMPLE-COM/access.20131015-195014:[15/Oct/2013:23:21:56 -0700] conn=99 op=205 RESULT err=0 tag=103 nentries=0 etime=0 csn=525df4fc001000660000
> >
> > 2d:
> > /var/log/dirsrv/slapd-EXAMPLE-COM/access.20131015-195014:[15/Oct/2013:23:21:45 -0700] conn=99 fd=118 slot=118 connection from 127.0.0.1 to 127.0.0.2
> >
> > 3d:
> > [15/Oct/2013:23:21:45 -0700] NSMMReplicationPlugin - agmt="cn=meToauth1.example.com" (auth1:389): Replication bind with GSSAPI auth resumed
>
> This is the errors log from the machine 127.0.0.1?

Sorry, this is me doing IP address sanitization.

> Replication bind resumed at that time?

Yes

> What was going on before that? Had that server recently been restarted?

The server had just been yum updated, so it restarted the daemon and what you see in the logs is the replication resuming.

> > I don't see any errors or odd activity in the error log of the supplying server. Though I ALSO don't see the CSN in the supplying server's change log...
>
> That's very strange - how could it send a CSN that it didn't have?

I'm not sure that these errors are related to servers sending CSNs that they don't have. I think it has to do with them having RUV tombstone entries that don't match change logs.

> > However, if memory serves, at the time of the issue the supplying server had a tombstone entry
>
> RUV tombstone?

Yes

> Or deleted entry tombstone?

No. I don't have any deleted entry tombstones at this time.

> > for one of the other upstream replica servers whose max CSN == 525df4fc001000660000
>
> 0x66 == Replica ID 102 - which supplier is 102? Is it the same supplier as machine 127.0.0.1?

Negative. 102 is a subscriber that has an agreement with machine 127.0.0.1, but not 127.0.0.2...

rmeggins commented 10 years ago

Replying to [comment:24 jraquino]:
> Replying to [comment:23 rmeggins]:
> > Replying to [comment:22 jraquino]:
> > > 1st:
> > > /var/log/dirsrv/slapd-EXAMPLE-COM/access.20131015-195014:[15/Oct/2013:23:21:56 -0700] conn=99 op=205 RESULT err=0 tag=103 nentries=0 etime=0 csn=525df4fc001000660000
> > >
> > > 2d:
> > > /var/log/dirsrv/slapd-EXAMPLE-COM/access.20131015-195014:[15/Oct/2013:23:21:45 -0700] conn=99 fd=118 slot=118 connection from 127.0.0.1 to 127.0.0.2
> > >
> > > 3d:
> > > [15/Oct/2013:23:21:45 -0700] NSMMReplicationPlugin - agmt="cn=meToauth1.example.com" (auth1:389): Replication bind with GSSAPI auth resumed
> >
> > This is the errors log from the machine 127.0.0.1?
>
> Sorry, this is me doing IP address sanitization.

Understood. I just wanted to confirm that 3d) was the errors log from the supplier, because . . .

> > Replication bind resumed at that time?
>
> Yes
>
> > What was going on before that? Had that server recently been restarted?
>
> The server had just been yum updated, so it restarted the daemon and what you see in the logs is the replication resuming.

. . . perhaps it is related to sending an update at the same time as server startup, and it would appear to be so.

> > > I don't see any errors or odd activity in the error log of the supplying server. Though I ALSO don't see the CSN in the supplying server's change log...
> >
> > That's very strange - how could it send a CSN that it didn't have?
>
> I'm not sure that these errors are related to servers sending CSNs that they don't have. I think it has to do with them having RUV tombstone entries that don't match change logs.

Yes, I think that has something to do with it, but still - how can a server send a CSN that it doesn't have?

> > > However, if memory serves, at the time of the issue the supplying server had a tombstone entry
> >
> > RUV tombstone?
>
> Yes
>
> > Or deleted entry tombstone?
>
> No. I don't have any deleted entry tombstones at this time.
>
> > > for one of the other upstream replica servers whose max CSN == 525df4fc001000660000
> >
> > 0x66 == Replica ID 102 - which supplier is 102? Is it the same supplier as machine 127.0.0.1?
>
> Negative. 102 is a subscriber that has an agreement with machine 127.0.0.1, but not 127.0.0.2...

rmeggins commented 10 years ago

I am able to reproduce the problem where you have a replicated operation with only modifyTimestamp and modifiersName. I followed the steps outlined in https://fedorahosted.org/389/ticket/47386#comment:15

0) all 5 servers are the latest RHEL 6.4.z (ipa-server 3.0.0-26, 389-ds-base 1.2.11.15-22)
1) ipa-replica-install for server3
2) ipa-replica-prepare for server1,server2,server4,server5
3) ipa-replica-install on server1,server2,server4,server5
4) ipa-replica-manage connect server1,server2
5) ipa-replica-manage connect server4,server5
6) on server3 - for ii in $(seq 1 10) ; do ipa user-add user$ii ; done - ipa user-find to verify users on all servers
7) service dirsrv stop on server3
8) on server1 - ipa group-add-member admins --user=user1

dbscan -f /var/lib/dirsrv/slapd-server2/cldb/*.db4 - you will see a replicated op with only modifyTimestamp and modifiersName
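Spotting such operations in a long changelog dump is tedious by eye. A minimal Python sketch of the check follows, assuming the dump has been saved as text and looks like the excerpts pasted in this ticket (each record starts with a `dbid:` line and the changed attributes follow the `operation:` line). `bookkeeping_only_csns` is a hypothetical helper, and real dbscan output may be laid out differently.

```python
BOOKKEEPING = {"modifiersname", "modifytimestamp"}


def split_records(text):
    """Group 'key: value' lines into records, one per 'dbid:' line."""
    records, current = [], None
    for line in text.splitlines():
        if line.startswith("dbid:"):
            current = []
            records.append(current)
        if current is not None and ":" in line:
            key, _, value = line.partition(":")
            current.append((key.strip().lower(), value.strip()))
    return records


def bookkeeping_only_csns(text):
    """Return CSNs of modify ops that change only bookkeeping attributes."""
    found = []
    for rec in split_records(text):
        keys = [k for k, _ in rec]
        if "operation" not in keys:
            continue
        op_index = keys.index("operation")
        if rec[op_index][1] != "modify":
            continue
        changed = set(keys[op_index + 1:])
        if changed and changed.issubset(BOOKKEEPING):
            found.append(dict(rec).get("csn"))
    return found


# Record copied from an excerpt in this ticket.
consumer_dump = """dbid: 525df4fc001000660000
replgen: 1381901830 Tue Oct 15 22:37:10 2013
csn: 525df4fc001000660000
uniqueid: b9210e81-360311e3-b543f0eb-6027c45c
dn: fqdn=pvm22.example,cn=computers,cn=accounts,dc=example,dc=com
operation: modify
modifiersName: cn=MemberOf Plugin,cn=plugins,cn=config
modifyTimestamp: 20131016013931Z
"""
flagged = bookkeeping_only_csns(consumer_dump)
print(flagged)
```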

I checked server1 and server2 /etc/dirsrv/slapd-DOMAIN/dse.ldif - the replication agreement to server3 has this:
{{{
nsds5ReplicaStripAttrs: modifyTimestamp modifiersName internalmodifyTimestamp internalmodifiersName
}}}
BUT NOT THE AGREEMENT to server1 from server2, or vice versa - so there is a problem here - some replication agreements have the strip attrs, some do not

After doing the reinit of server2 from server1, and restarting server3, I see the following in the errors log on server2:
{{{
agmt="meToServer3" (server3:389) - Can't locate CSN XXX in the changelog (DB rc=-30988). The consumer may need to be reinitialized
}}}

rmeggins commented 10 years ago

JR, can you please confirm that all of your replication agreements on all of your servers have nsds5ReplicaStripAttrs? Something like this:

{{{
for srv in hostname1 hostname2 .... hostnameN ; do
ldapsearch -xLLL -h $srv -D "cn=directory manager" -w 'yourpassword' -b cn=config 'objectclass=nsds5replicationagreement' nsds5ReplicaStripAttrs
done
}}}

rmeggins commented 10 years ago

Also, in my case, I see the Can't locate CSN XXX in the changelog error, but replication still continues. I still can't get it to reproduce the case where you see this error message and replication halts.

jraquino commented 10 years ago

Replying to [comment:27 rmeggins]:
> JR, can you please confirm that all of your replication agreements on all of your servers have nsds5ReplicaStripAttrs? Something like this:
>
> {{{
> for srv in hostname1 hostname2 .... hostnameN ; do
> ldapsearch -xLLL -h $srv -D "cn=directory manager" -w 'yourpassword' -b cn=config 'objectclass=nsds5replicationagreement' nsds5ReplicaStripAttrs
> done
> }}}

It appears as though there ARE agreements that are missing the nsds5ReplicaStripAttrs... The suggested ldapsearch doesn't quite make them stand out though... I'm having to go through and manually review all my agreements, but the second one that I looked at does appear to be missing nsds5ReplicaStripAttrs between one of its central replica suppliers

rmeggins commented 10 years ago

Sorry, try this:
{{{
for srv in hostname1 hostname2 .... hostnameN ; do
ldapsearch -xLLL -h $srv -D "cn=directory manager" -w 'yourpassword' -b cn=config '(&(objectclass=nsds5replicationagreement)(!(nsds5ReplicaStripAttrs=*)))' nsds5ReplicaStripAttrs
done
}}}
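The same check can be done offline against a copy of dse.ldif (the file grepped earlier in this thread). The Python sketch below is a deliberately small LDIF reader, not a full parser: it handles folded lines but not base64 values, and `agreements_missing_strip_attrs` is a hypothetical helper. The sample entries are made up for illustration and are not taken from the attachments.

```python
def unfold(text):
    """Join LDIF continuation lines (lines starting with one space)."""
    lines = []
    for raw in text.splitlines():
        if raw.startswith(" ") and lines:
            lines[-1] += raw[1:]
        else:
            lines.append(raw)
    return lines


def agreements_missing_strip_attrs(ldif_text):
    """Return DNs of replication agreements without nsds5ReplicaStripAttrs."""
    missing, entry = [], []
    # The extra "" makes sure the last entry is flushed.
    for line in unfold(ldif_text) + [""]:
        if line.strip():
            entry.append(line)
            continue
        attrs = {}
        for item in entry:
            name, _, value = item.partition(":")
            attrs.setdefault(name.strip().lower(), []).append(value.strip())
        classes = {v.lower() for v in attrs.get("objectclass", [])}
        is_agreement = "nsds5replicationagreement" in classes
        if is_agreement and "nsds5replicastripattrs" not in attrs:
            missing.append(attrs.get("dn", ["?"])[0])
        entry = []
    return missing


# Made-up sample: one agreement with the attribute, one without.
sample_ldif = "\n".join([
    "dn: cn=meToServer3,cn=replica,cn=example,cn=mapping tree,cn=config",
    "objectClass: top",
    "objectClass: nsds5replicationagreement",
    "nsds5ReplicaStripAttrs: modifyTimestamp modifiersName",
    "",
    "dn: cn=meToServer2,cn=replica,cn=example,cn=mapping tree,cn=conf",
    " ig",
    "objectClass: top",
    "objectClass: nsds5replicationagreement",
])
missing = agreements_missing_strip_attrs(sample_ldif)
print(missing)
```

Unlike the ldapsearch loop, this lists only the agreements that lack the attribute, so an empty result means every agreement in that file has it.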

rmeggins commented 10 years ago

JR, are you still seeing the errors like this?
{{{
Can't locate CSN NNNN in the changelog (DB rc=-30988). The consumer may need to be reinitialized
NSMMReplicationPlugin - changelog program - agmt="cn=xxxx" (xxxx:389): CSN NNNN not found, we aren't as up to date, or we purged
agmt="cn=xxxx" (xxxx:389) - Can't locate CSN NNNN in the changelog (DB rc=-30988). The consumer may need to be reinitialized.
}}}
?

rmeggins commented 10 years ago

Looking at https://fedorahosted.org/389/attachment/ticket/47386/cldb.txt

This also contains changes with just modifyTimestamp modifiersName:
{{{
dbid: 5165dc280000001c0000
replgen: 1376661409 Fri Aug 16 09:56:49 2013
csn: 5165dc280000001c0000
uniqueid: d133d981-923e11e2-8136f444-e388078e
dn: fqdn=asdf3.foo.net,cn=computers,cn=accounts,dc=foo,dc=net
operation: modify
modifiersName: cn=Directory Manager
modifyTimestamp: 20130410213853Z
}}}

This means that one of the replication agreements on the server with Replica ID 28 (0x1c) is missing nsds5ReplicaStripAttrs.

The workaround for this is to run the ipa-ldap-updater script on all masters/replicas.

This is ipa ticket https://fedorahosted.org/freeipa/ticket/3989

I don't know if this will fix the general replication problems. But if you run the ipa-ldap-updater on all of your servers, and then are still able to reproduce the replication problems, please update this ticket.

jraquino commented 10 years ago

The majority of my systems had a number of replicas that were joined via the ipa-replica-manage connect command. I have a feeling that they stayed this way until the yum update (which was in two passes), so that the first half updated and corrected the strip, whereas the second half received a bunch of updates for stuff they didn't know to strip. I have a feeling that this may have been at the core of the issue. Now that I've updated all my hosts, there seem to be no agreements missing the strip entry.

rmeggins commented 10 years ago

Replying to [comment:33 jraquino]:
> The majority of my systems had a number of replicas that were joined via the ipa-replica-manage connect command. I have a feeling that they stayed this way until the yum update (which was in two passes), so that the first half updated and corrected the strip, whereas the second half received a bunch of updates for stuff they didn't know to strip. I have a feeling that this may have been at the core of the issue. Now that I've updated all my hosts, there seem to be no agreements missing the strip entry.

Are you still seeing replication errors? If so, can you provide excerpts from the errors logs from the suppliers and consumers with the errors?

mkosek commented 10 years ago

It seems to me this thread is stuck. As far as FreeIPA is concerned, we believe that https://fedorahosted.org/freeipa/ticket/3989 should have solved the problem.

rmeggins commented 10 years ago

Replying to [comment:35 mkosek]:
> It seems to me this thread is stuck. As far as FreeIPA is concerned, we believe that https://fedorahosted.org/freeipa/ticket/3989 should have solved the problem.

We don't know yet if https://fedorahosted.org/freeipa/ticket/3989 solves this problem. We are still trying to get confirmation.

nkinder commented 10 years ago

Is this issue still occurring?

nkinder commented 10 years ago

I'm going to close this. If this issue is still occurring, please feel free to reopen it.

jraquino commented 10 years ago

Sorry. Confirmed. This issue is corrected for fractional updates.

The only replication issue that remains has to do with full replica re-inits inheriting attributes that they normally should never know about.

This ticket can be closed.

Metadata Update from @rmeggins:
- Issue set to the milestone: N/A

7 years ago

spichugi commented 3 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/723

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Invalid)

3 years ago

Metadata

Assignee: None
Tags: None
Blocking: None
Depending on: None
Priority: major
Milestone: N/A
reviewstatus: None
rhbz: https://bugzilla.redhat.com/show_bug.cgi?id=971966
origin: RHCust
