#47368 IPA server dirsrv RUV entry data excluded from replication
Closed: Fixed None Opened 6 years ago by rmeggins.

Ticket was cloned from Red Hat Bugzilla (product Fedora): Bug 966562

Description of problem:

This may affect 389 Directory Server in general but, I specifically saw this
behavior in my IPA test env.

After a good bit of troubleshooting, and getting dev involved, Rich found that
the RUV data was excluded from replication (at least in some cases):

From Rich:

However, I do see it in the changelog on f18-2:

dbid: 519b96d3000000030000
     replgen: 1369151185 Tue May 21 10:46:25 2013
     csn: 519b96d3000000030000
     uniqueid: c4a8008c-c15d11e2-81caa921-5c226755
     dn: fqdn=f18-2.testrelm.com,cn=computers,cn=accounts,dc=testrelm,dc=com
     operation: modify
         krbLastSuccessfulAuth: 20130521154625Z
         modifiersname: cn=Directory Manager
         modifytimestamp: 20130521154625Z
         entryusn: 1068

All of these attributes are excluded from replication, which means it
will be in the local RUV but not in any other RUVs.


Version-Release number of selected component (if applicable):
389-ds-base-1.3.0.6-1.fc18.x86_64


How reproducible:
very

Steps to Reproduce:
1.  Setup a few IPA servers
2.  use ldapsearch (like below) to compare the RUV data for each server on each
server

for RUV in $MASTER $REPLICA1 $REPLICA2 $REPLICA3 $REPLICA4; do
    for SERVER in $MASTER $REPLICA1 $REPLICA2 $REPLICA3 $REPLICA4; do
        RUVCHK=$(ldapsearch -o ldif-wrap=no -h $SERVER \
        -xLLL -D "$ROOTDN" -w $ROOTDNPWD -b $BASEDN \
'(&(objectclass=nstombstone)(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff))'
\
        nsds50ruv|grep $RUV);   echo "$(echo $SERVER|cut -f1 -d.): $RUVCHK"
    done
    echo
done


Actual results:

You see some differences in the last field (MaxCSN):

f18-1: nsds50ruv: {replica 4 ldap://f18-1.testrelm.com:389}
519a3a2b000000040000 519a8ec5000000040000
f18-2: nsds50ruv: {replica 4 ldap://f18-1.testrelm.com:389}
519a3a2b000000040000 519a8ec5000000040000
f18-3: nsds50ruv: {replica 4 ldap://f18-1.testrelm.com:389}
519a3a2b000000040000 519a8ec5000000040000
f18-4: nsds50ruv: {replica 4 ldap://f18-1.testrelm.com:389}
519a3a2b000000040000 519a8ec5000000040000
f18-5: nsds50ruv: {replica 4 ldap://f18-1.testrelm.com:389}
519a3a2b000000040000 519a8ec5000000040000

f18-1: nsds50ruv: {replica 3 ldap://f18-2.testrelm.com:389}
519a39ea000000030000 519a7c32000400030000
f18-2: nsds50ruv: {replica 3 ldap://f18-2.testrelm.com:389}
519a39ea000000030000 519a81b2000200030000
f18-3: nsds50ruv: {replica 3 ldap://f18-2.testrelm.com:389}
519a39ea000000030000 519a7c32000400030000
f18-4: nsds50ruv: {replica 3 ldap://f18-2.testrelm.com:389}
519a39ea000000030000 519a7c32000400030000
f18-5: nsds50ruv: {replica 3 ldap://f18-2.testrelm.com:389}
519a39ea000000030000 519a7c32000400030000

f18-1: nsds50ruv: {replica 5 ldap://f18-3.testrelm.com:389}
519a3cfc000000050000 519a447b000500050000
f18-2: nsds50ruv: {replica 5 ldap://f18-3.testrelm.com:389}
519a3cfc000000050000 519a447b000500050000
f18-3: nsds50ruv: {replica 5 ldap://f18-3.testrelm.com:389}
519a3cfc000000050000 519a8138000100050000
f18-4: nsds50ruv: {replica 5 ldap://f18-3.testrelm.com:389}
519a3cfc000000050000 519a8138000100050000
f18-5: nsds50ruv: {replica 5 ldap://f18-3.testrelm.com:389}
519a3cfc000000050000 519a8138000100050000

f18-1: nsds50ruv: {replica 6 ldap://f18-4.testrelm.com:389}
519a40fc000000060000 519a4477000600060000
f18-2: nsds50ruv: {replica 6 ldap://f18-4.testrelm.com:389}
519a40fc000000060000 519a4477000600060000
f18-3: nsds50ruv: {replica 6 ldap://f18-4.testrelm.com:389}
519a40fc000000060000 519a4477000600060000
f18-4: nsds50ruv: {replica 6 ldap://f18-4.testrelm.com:389}
519a40fc000000060000 519a813b000300060000
f18-5: nsds50ruv: {replica 6 ldap://f18-4.testrelm.com:389}
519a40fc000000060000 519a813b000300060000

f18-1: nsds50ruv: {replica 7 ldap://f18-5.testrelm.com:389}
519a43a1000000070000 519a4483000000070000
f18-2: nsds50ruv: {replica 7 ldap://f18-5.testrelm.com:389}
519a43a1000000070000 519a4483000000070000
f18-3: nsds50ruv: {replica 7 ldap://f18-5.testrelm.com:389}
519a43a1000000070000 519a4483000000070000
f18-4: nsds50ruv: {replica 7 ldap://f18-5.testrelm.com:389}
519a43a1000000070000 519a4483000000070000
f18-5: nsds50ruv: {replica 7 ldap://f18-5.testrelm.com:389}
519a43a1000000070000 519a813e000000070000


Expected results:

Entries match or some other definitive way to confirm that the directory
servers are in sync.

Additional info:

Simplified steps to reproduce the problem:

[1] Setup a master and a replica
[2] Setup replication
[3] Add these attributes to the replication agreement:

nsDS5ReplicatedAttributeList: (objectclass=*) $ EXCLUDE sn
nsds5ReplicaStripAttrs: modifiersname modifytimestamp

[4] Create a new user.
[5] modify the "sn" attribute in the user.
[6] Look at the database ruvs: the consumer is missing the last update

master:
nsds50ruv: {replica 1 ldap://localhost.localdomain:389} 52583d80000000010000 525c4644000000010000

replica:
nsds50ruv: {replica 1 ldap://localhost.localdomain:389} 52583d80000000010000 525c4532000000010000

[7] However, if you look at the ruv in the replication agreement it is the same as the replica:

nsds50ruv: {replica 1 ldap://localhost.localdomain:389} 52583d80000000010000 525c4532000000010000

So maybe, the solution is to not look at the database ruv's, but at the agreement ruv's? This of course requires more work, as you would have to check each agreement on each server, instead of the server as a whole. Maybe revise repl-monitor.pl, or create a new task?

Still investigating...

Basic overview:

  • Update arrives
  • Write update to change log and ruv

    • Loop over the agreements
    • Update is to the replicated subtree for agmt
      • If update is a modify, strip out restricted attribute updates.
      • If there are modifications to replicate(or this is an add or delete, modrdn), update the new in-memory agmt maxcsn with the opcsn.
  • Event thread will write the agreements maxcsn to the local ruv:
    nsds5AgmtMaxCSN: <agmt rdn="">;consumer-host;consumer-port;supplier-rid;maxcsn

Replica 1
dn: cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
nsds50ruv: {replica 111 ldap://localhost.localdomain:389} 5267db120000006f0000 52683d810000006f0000
nsds50ruv: {replica 222 ldap://localhost.localdomain:22222} 5267e90e000000de0000 52683d89000100de0000
nsds5AgmtMaxCSN: cn=to replica2;localhost.localdomain;22222;111;52683d810000006f0000

Replica 2
dn: cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
nsds50ruv: {replica 222 ldap://localhost.localdomain:22222} 5267e90e000000de0000 52683d89000100de0000
nsds50ruv: {replica 111 ldap://localhost.localdomain:389} 5267db120000006f0000 52683d810000006f0000
nsds5AgmtMaxCSN: cn=to replica1;localhost.localdomain;389;222;52683d89000100de0000

Check the agmt maxcsn against the replica's RUV(supplier element). So on Replica 1, look at the agmt maxcsn for Replica 2(maxcsn is 52683d810000006f0000)> Compare this on Replica 2 with its local ruv element for replica 111(replica 1 rid).

Attaching patch next...

It looks good to me, just one concern.

Checking all replication agreements for any update could be some overhead and might not always be needed (eg no fractional agreements) or wanted 8no requirement to fine grained repl monitoring) - so could this additional ruv be optional ?

Replying to [comment:6 lkrispen]:

It looks good to me, just one concern.

Checking all replication agreements for any update could be some overhead and might not always be needed (eg no fractional agreements) or wanted 8no requirement to fine grained repl monitoring) - so could this additional ruv be optional ?

Originally I just did this for fractional agreements, but then I thought it would also be useful for non-fractional agreements. I could make it configurable: all agmts or just fractional agmts.

Looping over all the agreements I don't think is that bad - it's basically just a linked list so I wouldn't expect too much overhead. But, yes, it was also a concern of mine. So, today I plan on doing performance testing to see how much of an impact it has when there are many agreements.

1st round of performance testing (precise etime):

11 fractional agmts(4 -6 stripped attributes per agmt). Simply performing a mod on the same entry.

Master: average: 0.056
Patched server: average: 0.055

There is no significant change between the two versions.

Going to test with "callgrind" next...

callgrind results

{{{
write_changelog_and_ruv() - sysTime 0.018

-> agmt_update_maxcsn() - sysTime 0.0491

-> cl5WriteOperationTxn() - sysTime 24.272

}}}

0.0491 is insignificant.

Since there is basically no impact, I will not make it configurable for fractional vs. non-fractional agmts. So it will be global, and this will also make rewriting repl-monitor.pl easier and more consistent.

{{{

2750                    sprintf(buf, "%s;%s;%d;unavailable", slapi_rdn_get_rdn(agmt->rdn), 
2751                            agmt->hostname, agmt->port); 
2752                    agmt->maxcsn = slapi_ch_strdup(buf);

}}}
Just do this instead:
{{{
agmt->maxcsn = slapi_ch_smprintf("%s;%s;%d;unavailable", slapi_rdn_get_rdn(agmt->rdn),
agmt->hostname, agmt->port);
}}}
same here
{{{

2756                    sprintf(buf, "%s;%s;%d;%d;%s",slapi_rdn_get_rdn(agmt->rdn), 
2757                            agmt->hostname, agmt->port, rid, maxcsn); 
2758                    agmt->maxcsn = slapi_ch_strdup(buf);

}}}
Then you can get rid of char buf[BUFSIZ];

The other places in the code where you use sprintf e.g.
{{{
sprintf(buf,"%s;%s;%d;%d;",slapi_rdn_get_rdn(ra->rdn),ra->hostname, ra->port, rid);
}}}
Please use PR_snprintf instead, which gives you protection against buffer overruns (yes, it is highly unlikely that the rdn+hostname+port+rid will be > 8192 bytes long, but . . .)

{{{

2827            val.bv_val = agmt->maxcsn; 
2828            PR_Unlock(agmt->lock); 
2829            val.bv_len = strlen(val.bv_val); 
2830            slapi_mod_add_value(smod, &val);

}}}
If the lock is to protect agmt->maxcsn from being freed or overwritten, then you will need to unlock after the value has been copied in slapi_mod_add_value.

{{{

2911                        if(strstr(maxcsns[i], buf) || strstr(maxcsns[i], unavail_buf)){ 
2912                            ra->maxcsn = slapi_ch_strdup(maxcsns[i]); 
2913                        }

}}}
at this point in the code, is it possible that ra->maxcsn will already have been set?

Replying to [comment:10 rmeggins]:

{{{

2750 sprintf(buf, "%s;%s;%d;unavailable", slapi_rdn_get_rdn(agmt->rdn),
2751 agmt->hostname, agmt->port);
2752 agmt->maxcsn = slapi_ch_strdup(buf);
}}}
Just do this instead:
{{{
agmt->maxcsn = slapi_ch_smprintf("%s;%s;%d;unavailable", slapi_rdn_get_rdn(agmt->rdn),
agmt->hostname, agmt->port);
}}}
same here
{{{

2756 sprintf(buf, "%s;%s;%d;%d;%s",slapi_rdn_get_rdn(agmt->rdn),
2757 agmt->hostname, agmt->port, rid, maxcsn);
2758 agmt->maxcsn = slapi_ch_strdup(buf);
}}}
Then you can get rid of char buf[BUFSIZ];

The other places in the code where you use sprintf e.g.
{{{
sprintf(buf,"%s;%s;%d;%d;",slapi_rdn_get_rdn(ra->rdn),ra->hostname, ra->port, rid);
}}}
Please use PR_snprintf instead, which gives you protection against buffer overruns (yes, it is highly unlikely that the rdn+hostname+port+rid will be > 8192 bytes long, but . . .)

{{{

2827 val.bv_val = agmt->maxcsn;
2828 PR_Unlock(agmt->lock);
2829 val.bv_len = strlen(val.bv_val);
2830 slapi_mod_add_value(smod, &val);
}}}
If the lock is to protect agmt->maxcsn from being freed or overwritten, then you will need to unlock after the value has been copied in slapi_mod_add_value.

{{{

2911 if(strstr(maxcsns[i], buf) || strstr(maxcsns[i], unavail_buf)){
2912 ra->maxcsn = slapi_ch_strdup(maxcsns[i]);
2913 }
}}}
at this point in the code, is it possible that ra->maxcsn will already have been set?

Thanks Rich, I'll look into this. As I was working on repl-monitor.pl I realized I had to change part of the server fix too. So, I'm working on a new patch for the server and repl-monitor now...

I would prefer to leave repl-monitor.pl unbranded, not designated as "389 Directory Server Replication Monitor".

Will repl-monitor.pl work as before, if the server doesn't have nsds5AgmtMaxCSN? It has to work in a mixed topology where nsds5AgmtMaxCSN isn't present. Of course, it won't "work" in that it will give results that are not entirely accurate, but it should at least give a best effort, no worse than the way it works now.

Replying to [comment:12 rmeggins]:

I would prefer to leave repl-monitor.pl unbranded, not designated as "389 Directory Server Replication Monitor".

No problem.

Will repl-monitor.pl work as before, if the server doesn't have nsds5AgmtMaxCSN? It has to work in a mixed topology where nsds5AgmtMaxCSN isn't present. Of course, it won't "work" in that it will give results that are not entirely accurate, but it should at least give a best effort, no worse than the way it works now.

It will not work with older versions, but it should not take much to make it work across the board. The only real change in the script is how the supplier maxcsn is located.

New patch attached. Removed "389" branding, and script is now backwards compatible with older versions of DS that do not use "nsds5AgmtMaxCSN".

git merge ticket47368
Updating 4c1dfaf..c30c897
Fast-forward
ldap/admin/src/scripts/repl-monitor.pl.in | 271 ++++++++----
ldap/servers/plugins/replication/repl5.h | 13 +-
ldap/servers/plugins/replication/repl5_agmt.c | 472 +++++++++++++++++++-
ldap/servers/plugins/replication/repl5_agmtlist.c | 8 +
ldap/servers/plugins/replication/repl5_plugins.c | 7 +
ldap/servers/plugins/replication/repl5_protocol.c | 2 +-
ldap/servers/plugins/replication/repl5_replica.c | 43 ++-
.../plugins/replication/repl5_replica_config.c | 58 ++--
ldap/servers/plugins/replication/repl5_ruv.c | 1 +
ldap/servers/plugins/replication/repl_globals.c | 1 +
ldap/servers/slapd/entry.c | 6 +-
ldap/servers/slapd/rdn.c | 10 +
ldap/servers/slapd/slapi-plugin.h | 10 +-
13 files changed, 752 insertions(+), 150 deletions(-)

git push origin master
Counting objects: 45, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (22/22), done.
Writing objects: 100% (23/23), 9.12 KiB, done.
Total 23 (delta 20), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
4c1dfaf..c30c897 master -> master

commit c30c897
Author: Mark Reynolds mreynolds@redhat.com
Date: Thu Oct 31 12:36:14 2013 -0400

Summary:

To manually check if a consumer is up to date, you compare the agreements max csn, located in the local ruv tombstone entry, to the consumers RUV element for the suppliers rid.

Here is the format of the new attribute/value:

{{{
nsds5AgmtMaxCSN: <repl area="">:<agmt name="">:<consumer host="">:<port>:<consumer rid="">:<maxcsn>

nsds5AgmtMaxCSN: dc=example,dc=com;to replica 1;localhost.localdomain;389;111;5270095c0000014d0000
}}}

Example: Check if Replica B is caught up with Replica A

{{{
Replica A (rid 222):

ldapsearch -h localhost -D "cn=directory manager" -W -b "dc=example,dc=com" '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))' -xLLL nsds50ruv nsds5AgmtMaxCSN
dn: cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
nsds50ruv: {replica 222 ldap://localhost.localdomain:389} 527283720000006f0000 52729c8b0000006f0000
nsds5AgmtMaxCSN: dc=example,dc=com;to replica B;localhost.localdomain;22222;65535;52729c8b0000006f0000
nsds5AgmtMaxCSN: dc=example,dc=com;to replica C;localhost.localdomain;22222;444;52729c999900006f0000
}}}

Look at the nsds5AgmtMaxCSN attributes, and locate the agreement for replica B, and get its max csn value. Then compare this max csn to the max csn on Replica B(nsds50ruv) for replica 222(the supplier):

{{{
Replica B (rid 65535):

ldapsearch -h localhost -D "cn=directory manager" -W -b "dc=example,dc=com" '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))' -p 22222 -xLLL nsds50ruv
dn: cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
nsds50ruv: {replica 222 ldap://localhost.localdomain:389} 527283720000006f0000 52729c8b0000006f0000
nsds50ruv: {replica 444 ldap://localhost.localdomain:389} 527283af0000006f0000 527299990000006f0000
}}}

So now each agreement can be monitored, and this is also the only way to correctly tell if fractional replication agreements are in sync.

Fixed Jenkins errors:

git merge ticket47368
Updating c30c897..eb6a462
Fast-forward
ldap/servers/plugins/replication/repl5_agmt.c | 13 +++++--------
1 files changed, 5 insertions(+), 8 deletions(-)

git push origin master
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 887 bytes, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
c30c897..eb6a462 master -> master

commit eb6a462
Author: Mark Reynolds mreynolds@redhat.com
Date: Thu Oct 31 14:49:07 2013 -0400

git merge ticket47368
Updating 8db3a1a..8d398a5
Fast-forward
ldap/servers/plugins/replication/repl5_agmt.c | 6 +++---
ldap/servers/plugins/replication/repl5_replica.c | 8 ++++++--
2 files changed, 9 insertions(+), 5 deletions(-)

git push origin master
Counting objects: 15, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (8/8), done.
Writing objects: 100% (8/8), 929 bytes, done.
Total 8 (delta 6), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
8db3a1a..8d398a5 master -> master

commit 8d398a5
Author: Mark Reynolds mreynolds@redhat.com
Date: Mon Nov 4 09:52:11 2013 -0500

git merge ticket47368
Updating bae797c..f67e638
Fast-forward
ldap/servers/plugins/replication/repl5_agmt.c | 15 ++++++++++-----
ldap/servers/plugins/replication/repl5_replica.c | 4 +---
2 files changed, 11 insertions(+), 8 deletions(-)

git push origin master
Counting objects: 15, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (8/8), done.
Writing objects: 100% (8/8), 1016 bytes, done.
Total 8 (delta 6), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
bae797c..f67e638 master -> master

commit f67e638
Author: Mark Reynolds mreynolds@redhat.com
Date: Thu Dec 5 12:28:23 2013 -0500

Metadata Update from @mreynolds:
- Issue assigned to mreynolds
- Issue set to the milestone: 1.3.3 - 10/13 (October)

2 years ago

Login to comment on this ticket.

Metadata