Ticket was cloned from Red Hat Bugzilla (product Fedora): Bug 966562
Description of problem:

This may affect 389 Directory Server in general, but I specifically saw this behavior in my IPA test env. After a good bit of troubleshooting, and getting dev involved, Rich found that the RUV data was excluded from replication (at least in some cases).

From Rich: However, I do see it in the changelog on f18-2:

{{{
dbid: 519b96d3000000030000
replgen: 1369151185 Tue May 21 10:46:25 2013
csn: 519b96d3000000030000
uniqueid: c4a8008c-c15d11e2-81caa921-5c226755
dn: fqdn=f18-2.testrelm.com,cn=computers,cn=accounts,dc=testrelm,dc=com
operation: modify
krbLastSuccessfulAuth: 20130521154625Z
modifiersname: cn=Directory Manager
modifytimestamp: 20130521154625Z
entryusn: 1068
}}}

All of these attributes are excluded from replication, which means it will be in the local RUV but not in any other RUVs.

Version-Release number of selected component (if applicable):
389-ds-base-1.3.0.6-1.fc18.x86_64

How reproducible: very

Steps to Reproduce:
1. Set up a few IPA servers
2. Use ldapsearch (like below) to compare the RUV data for each server on each server:

{{{
for RUV in $MASTER $REPLICA1 $REPLICA2 $REPLICA3 $REPLICA4; do
    for SERVER in $MASTER $REPLICA1 $REPLICA2 $REPLICA3 $REPLICA4; do
        RUVCHK=$(ldapsearch -o ldif-wrap=no -h $SERVER \
            -xLLL -D "$ROOTDN" -w $ROOTDNPWD -b $BASEDN \
            '(&(objectclass=nstombstone)(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff))' \
            nsds50ruv | grep $RUV)
        echo "$(echo $SERVER | cut -f1 -d.): $RUVCHK"
    done
    echo
done
}}}

Actual results:

You see some differences in the last field (MaxCSN):

{{{
f18-1: nsds50ruv: {replica 4 ldap://f18-1.testrelm.com:389} 519a3a2b000000040000 519a8ec5000000040000
f18-2: nsds50ruv: {replica 4 ldap://f18-1.testrelm.com:389} 519a3a2b000000040000 519a8ec5000000040000
f18-3: nsds50ruv: {replica 4 ldap://f18-1.testrelm.com:389} 519a3a2b000000040000 519a8ec5000000040000
f18-4: nsds50ruv: {replica 4 ldap://f18-1.testrelm.com:389} 519a3a2b000000040000 519a8ec5000000040000
f18-5: nsds50ruv: {replica 4 ldap://f18-1.testrelm.com:389} 519a3a2b000000040000 519a8ec5000000040000

f18-1: nsds50ruv: {replica 3 ldap://f18-2.testrelm.com:389} 519a39ea000000030000 519a7c32000400030000
f18-2: nsds50ruv: {replica 3 ldap://f18-2.testrelm.com:389} 519a39ea000000030000 519a81b2000200030000
f18-3: nsds50ruv: {replica 3 ldap://f18-2.testrelm.com:389} 519a39ea000000030000 519a7c32000400030000
f18-4: nsds50ruv: {replica 3 ldap://f18-2.testrelm.com:389} 519a39ea000000030000 519a7c32000400030000
f18-5: nsds50ruv: {replica 3 ldap://f18-2.testrelm.com:389} 519a39ea000000030000 519a7c32000400030000

f18-1: nsds50ruv: {replica 5 ldap://f18-3.testrelm.com:389} 519a3cfc000000050000 519a447b000500050000
f18-2: nsds50ruv: {replica 5 ldap://f18-3.testrelm.com:389} 519a3cfc000000050000 519a447b000500050000
f18-3: nsds50ruv: {replica 5 ldap://f18-3.testrelm.com:389} 519a3cfc000000050000 519a8138000100050000
f18-4: nsds50ruv: {replica 5 ldap://f18-3.testrelm.com:389} 519a3cfc000000050000 519a8138000100050000
f18-5: nsds50ruv: {replica 5 ldap://f18-3.testrelm.com:389} 519a3cfc000000050000 519a8138000100050000

f18-1: nsds50ruv: {replica 6 ldap://f18-4.testrelm.com:389} 519a40fc000000060000 519a4477000600060000
f18-2: nsds50ruv: {replica 6 ldap://f18-4.testrelm.com:389} 519a40fc000000060000 519a4477000600060000
f18-3: nsds50ruv: {replica 6 ldap://f18-4.testrelm.com:389} 519a40fc000000060000 519a4477000600060000
f18-4: nsds50ruv: {replica 6 ldap://f18-4.testrelm.com:389} 519a40fc000000060000 519a813b000300060000
f18-5: nsds50ruv: {replica 6 ldap://f18-4.testrelm.com:389} 519a40fc000000060000 519a813b000300060000

f18-1: nsds50ruv: {replica 7 ldap://f18-5.testrelm.com:389} 519a43a1000000070000 519a4483000000070000
f18-2: nsds50ruv: {replica 7 ldap://f18-5.testrelm.com:389} 519a43a1000000070000 519a4483000000070000
f18-3: nsds50ruv: {replica 7 ldap://f18-5.testrelm.com:389} 519a43a1000000070000 519a4483000000070000
f18-4: nsds50ruv: {replica 7 ldap://f18-5.testrelm.com:389} 519a43a1000000070000 519a4483000000070000
f18-5: nsds50ruv: {replica 7 ldap://f18-5.testrelm.com:389} 519a43a1000000070000 519a813e000000070000
}}}

Expected results:

Entries match, or there is some other definitive way to confirm that the directory servers are in sync.

Additional info:
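As an aside, each MaxCSN above can be decoded to see which replica generated the change and when, which makes it easier to spot where the stuck updates came from. A minimal sketch (the 8/4/4/4 hex field split - timestamp, sequence number, replica id, sub-sequence - is the standard 389-DS CSN layout; the helper name is mine):

```python
def parse_csn(csn):
    """Split a 20-hex-char 389-DS CSN into its four fields."""
    return {
        "timestamp": int(csn[0:8], 16),   # seconds since the epoch
        "seq": int(csn[8:12], 16),        # sequence number within that second
        "rid": int(csn[12:16], 16),       # replica id that generated the change
        "subseq": int(csn[16:20], 16),    # sub-sequence for multi-mod operations
    }

# The two differing MaxCSNs for replica 3 from the output above:
# they differ in timestamp and seq, i.e. some servers are missing later updates.
print(parse_csn("519a7c32000400030000"))
print(parse_csn("519a81b2000200030000"))
```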
Simplified steps to reproduce the problem:
[1] Set up a master and a replica
[2] Set up replication
[3] Add these attributes to the replication agreement:
{{{
nsDS5ReplicatedAttributeList: (objectclass=*) $ EXCLUDE sn
nsds5ReplicaStripAttrs: modifiersname modifytimestamp
}}}
[4] Create a new user.
[5] Modify the "sn" attribute of the user.
[6] Look at the database RUVs: the consumer is missing the last update:
{{{
master:  nsds50ruv: {replica 1 ldap://localhost.localdomain:389} 52583d80000000010000 525c4644000000010000
replica: nsds50ruv: {replica 1 ldap://localhost.localdomain:389} 52583d80000000010000 525c4532000000010000
}}}
[7] However, if you look at the RUV in the replication agreement, it is the same as the replica's:
{{{
nsds50ruv: {replica 1 ldap://localhost.localdomain:389} 52583d80000000010000 525c4532000000010000
}}}
So maybe the solution is not to look at the database RUVs, but at the agreement RUVs? This of course requires more work, as you would have to check each agreement on each server, instead of the server as a whole. Maybe revise repl-monitor.pl, or create a new task?
Still investigating...
Basic overview:
Write the update to the changelog and RUV. An event thread will then write the agreement's maxcsn to the local RUV:

nsds5AgmtMaxCSN: <agmt rdn>;consumer-host;consumer-port;supplier-rid;maxcsn
Replica 1:
{{{
dn: cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
nsds50ruv: {replica 111 ldap://localhost.localdomain:389} 5267db120000006f0000 52683d810000006f0000
nsds50ruv: {replica 222 ldap://localhost.localdomain:22222} 5267e90e000000de0000 52683d89000100de0000
nsds5AgmtMaxCSN: cn=to replica2;localhost.localdomain;22222;111;52683d810000006f0000
}}}
Replica 2:
{{{
dn: cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
nsds50ruv: {replica 222 ldap://localhost.localdomain:22222} 5267e90e000000de0000 52683d89000100de0000
nsds50ruv: {replica 111 ldap://localhost.localdomain:389} 5267db120000006f0000 52683d810000006f0000
nsds5AgmtMaxCSN: cn=to replica1;localhost.localdomain;389;222;52683d89000100de0000
}}}
Check the agmt maxcsn against the replica's RUV (supplier element). So on Replica 1, look at the agmt maxcsn for Replica 2 (maxcsn is 52683d810000006f0000), then compare it on Replica 2 against its local RUV element for replica 111 (Replica 1's rid).
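The check described above can be sketched as follows. Since CSNs are fixed-width hex strings, plain string comparison orders them correctly (function and variable names here are illustrative, not part of the server code):

```python
def consumer_caught_up(agmt_maxcsn, consumer_ruv, supplier_rid):
    """Compare the supplier's per-agreement maxcsn against the consumer's
    RUV element for the supplier's rid. consumer_ruv maps rid -> maxcsn.
    Fixed-width hex CSNs compare correctly as plain strings."""
    consumer_maxcsn = consumer_ruv.get(supplier_rid)
    if consumer_maxcsn is None:
        return False  # consumer has seen nothing from this supplier yet
    return consumer_maxcsn >= agmt_maxcsn

# Replica 1's agreement maxcsn for Replica 2, checked against Replica 2's
# RUV element for rid 111 (values from the example above):
ruv_replica2 = {111: "52683d810000006f0000", 222: "52683d89000100de0000"}
print(consumer_caught_up("52683d810000006f0000", ruv_replica2, 111))  # True: in sync
```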
Attaching patch next...
It looks good to me, just one concern.
Checking all replication agreements for any update could add some overhead and might not always be needed (e.g. no fractional agreements) or wanted (no requirement for fine-grained repl monitoring) - so could this additional ruv be optional?
Replying to [comment:6 lkrispen]:
It looks good to me, just one concern. Checking all replication agreements for any update could add some overhead and might not always be needed (e.g. no fractional agreements) or wanted (no requirement for fine-grained repl monitoring) - so could this additional ruv be optional?
Originally I just did this for fractional agreements, but then I thought it would also be useful for non-fractional agreements. I could make it configurable: all agmts or just fractional agmts.
I don't think looping over all the agreements is that bad - it's basically just a linked list, so I wouldn't expect too much overhead. But yes, it was also a concern of mine, so today I plan on doing performance testing to see how much of an impact it has when there are many agreements.
1st round of performance testing (precise etime):
11 fractional agmts (4-6 stripped attributes per agmt), simply performing a mod on the same entry:

Master: average etime 0.056
Patched server: average etime 0.055

There is no significant change between the two versions.
Going to test with "callgrind" next...
callgrind results
{{{
write_changelog_and_ruv()      - sysTime  0.018
 -> agmt_update_maxcsn()       - sysTime  0.0491
 -> cl5WriteOperationTxn()     - sysTime 24.272
}}}
0.0491 is insignificant.
Since there is basically no impact, I will not make it configurable for fractional vs. non-fractional agmts. So it will be global, and this will also make rewriting repl-monitor.pl easier and more consistent.
{{{
2750     sprintf(buf, "%s;%s;%d;unavailable", slapi_rdn_get_rdn(agmt->rdn),
2751         agmt->hostname, agmt->port);
2752     agmt->maxcsn = slapi_ch_strdup(buf);
}}}
Just do this instead:
{{{
agmt->maxcsn = slapi_ch_smprintf("%s;%s;%d;unavailable",
    slapi_rdn_get_rdn(agmt->rdn), agmt->hostname, agmt->port);
}}}
Same here:
{{{
2756     sprintf(buf, "%s;%s;%d;%d;%s", slapi_rdn_get_rdn(agmt->rdn),
2757         agmt->hostname, agmt->port, rid, maxcsn);
2758     agmt->maxcsn = slapi_ch_strdup(buf);
}}}
Then you can get rid of char buf[BUFSIZ];

The other places in the code where you use sprintf, e.g.
{{{
sprintf(buf, "%s;%s;%d;%d;", slapi_rdn_get_rdn(ra->rdn), ra->hostname, ra->port, rid);
}}}
please use PR_snprintf instead, which gives you protection against buffer overruns (yes, it is highly unlikely that the rdn+hostname+port+rid will be > 8192 bytes long, but . . .)
{{{
2827     val.bv_val = agmt->maxcsn;
2828     PR_Unlock(agmt->lock);
2829     val.bv_len = strlen(val.bv_val);
2830     slapi_mod_add_value(smod, &val);
}}}
If the lock is to protect agmt->maxcsn from being freed or overwritten, then you will need to unlock after the value has been copied in slapi_mod_add_value.
{{{
2911     if(strstr(maxcsns[i], buf) || strstr(maxcsns[i], unavail_buf)){
2912         ra->maxcsn = slapi_ch_strdup(maxcsns[i]);
2913     }
}}}
At this point in the code, is it possible that ra->maxcsn will already have been set?
Replying to [comment:10 rmeggins]:
{{{
2750     sprintf(buf, "%s;%s;%d;unavailable", slapi_rdn_get_rdn(agmt->rdn),
2751         agmt->hostname, agmt->port);
2752     agmt->maxcsn = slapi_ch_strdup(buf);
}}}
Just do this instead:
{{{
agmt->maxcsn = slapi_ch_smprintf("%s;%s;%d;unavailable",
    slapi_rdn_get_rdn(agmt->rdn), agmt->hostname, agmt->port);
}}}
Same here:
{{{
2756     sprintf(buf, "%s;%s;%d;%d;%s", slapi_rdn_get_rdn(agmt->rdn),
2757         agmt->hostname, agmt->port, rid, maxcsn);
2758     agmt->maxcsn = slapi_ch_strdup(buf);
}}}
Then you can get rid of char buf[BUFSIZ];

The other places in the code where you use sprintf, e.g.
{{{
sprintf(buf, "%s;%s;%d;%d;", slapi_rdn_get_rdn(ra->rdn), ra->hostname, ra->port, rid);
}}}
please use PR_snprintf instead, which gives you protection against buffer overruns (yes, it is highly unlikely that the rdn+hostname+port+rid will be > 8192 bytes long, but . . .)
{{{
2827     val.bv_val = agmt->maxcsn;
2828     PR_Unlock(agmt->lock);
2829     val.bv_len = strlen(val.bv_val);
2830     slapi_mod_add_value(smod, &val);
}}}
If the lock is to protect agmt->maxcsn from being freed or overwritten, then you will need to unlock after the value has been copied in slapi_mod_add_value.
{{{
2911     if(strstr(maxcsns[i], buf) || strstr(maxcsns[i], unavail_buf)){
2912         ra->maxcsn = slapi_ch_strdup(maxcsns[i]);
2913     }
}}}
At this point in the code, is it possible that ra->maxcsn will already have been set?
Thanks Rich, I'll look into this. As I was working on repl-monitor.pl I realized I had to change part of the server fix too. So, I'm working on a new patch for the server and repl-monitor now...
text report output text-report.txt
html report report.html
I would prefer to leave repl-monitor.pl unbranded, not designated as "389 Directory Server Replication Monitor".
Will repl-monitor.pl work as before if the server doesn't have nsds5AgmtMaxCSN? It has to work in a mixed topology where nsds5AgmtMaxCSN isn't present. Of course, it won't "work" in the sense that its results will not be entirely accurate, but it should at least give a best effort, no worse than the way it works now.
Replying to [comment:12 rmeggins]:
No problem.
It will not work with older versions, but it should not take much to make it work across the board. The only real change in the script is how the supplier maxcsn is located.
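A rough sketch of that kind of backwards-compatible lookup (illustrative only - the actual repl-monitor.pl code is Perl, and the entry/attribute plumbing here is assumed):

```python
def supplier_maxcsn(tombstone_entry, agmt_name, supplier_rid):
    """Prefer the per-agreement nsds5AgmtMaxCSN value; fall back to the
    supplier's own database RUV element when the attribute is absent
    (i.e. on older servers). tombstone_entry maps attribute -> value list."""
    for value in tombstone_entry.get("nsds5AgmtMaxCSN", []):
        fields = value.split(";")  # <repl area>;<agmt name>;<host>;<port>;<rid>;<maxcsn>
        if fields[1] == agmt_name:
            return fields[-1]
    # Older server: no per-agreement maxcsn, use the database RUV instead.
    for value in tombstone_entry.get("nsds50ruv", []):
        if value.startswith("{replica %d " % supplier_rid):
            return value.split()[-1]
    return None
```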
Revision #2 0001-Ticket-47368-IPA-server-dirsrv-RUV-entry-data-exclud.patch
New patch attached. Removed "389" branding, and script is now backwards compatible with older versions of DS that do not use "nsds5AgmtMaxCSN".
{{{
git merge ticket47368
Updating 4c1dfaf..c30c897
Fast-forward
 ldap/admin/src/scripts/repl-monitor.pl.in          |  271 ++++++++----
 ldap/servers/plugins/replication/repl5.h           |   13 +-
 ldap/servers/plugins/replication/repl5_agmt.c      |  472 +++++++++++++++++++-
 ldap/servers/plugins/replication/repl5_agmtlist.c  |    8 +
 ldap/servers/plugins/replication/repl5_plugins.c   |    7 +
 ldap/servers/plugins/replication/repl5_protocol.c  |    2 +-
 ldap/servers/plugins/replication/repl5_replica.c   |   43 ++-
 .../plugins/replication/repl5_replica_config.c     |   58 ++--
 ldap/servers/plugins/replication/repl5_ruv.c       |    1 +
 ldap/servers/plugins/replication/repl_globals.c    |    1 +
 ldap/servers/slapd/entry.c                         |    6 +-
 ldap/servers/slapd/rdn.c                           |   10 +
 ldap/servers/slapd/slapi-plugin.h                  |   10 +-
 13 files changed, 752 insertions(+), 150 deletions(-)

git push origin master
Counting objects: 45, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (22/22), done.
Writing objects: 100% (23/23), 9.12 KiB, done.
Total 23 (delta 20), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   4c1dfaf..c30c897  master -> master

commit c30c897
Author: Mark Reynolds <mreynolds@redhat.com>
Date:   Thu Oct 31 12:36:14 2013 -0400
}}}
Summary:
To manually check whether a consumer is up to date, you compare the agreement's max csn, located in the supplier's local RUV tombstone entry, to the consumer's RUV element for the supplier's rid.
Here is the format of the new attribute/value:
{{{
nsds5AgmtMaxCSN: <repl area>;<agmt name>;<consumer host>;<port>;<consumer rid>;<maxcsn>

nsds5AgmtMaxCSN: dc=example,dc=com;to replica 1;localhost.localdomain;389;111;5270095c0000014d0000
}}}
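The value is ';'-separated, so a quick parse looks like this (a sketch with an illustrative helper name; a plain split works because the fields themselves - including the replication area, which contains commas - do not contain semicolons in typical suffixes):

```python
def parse_agmt_maxcsn(value):
    """Split an nsds5AgmtMaxCSN value into its six ';'-separated fields."""
    repl_area, agmt_name, host, port, rid, maxcsn = value.split(";")
    return {"repl_area": repl_area, "agmt": agmt_name, "host": host,
            "port": int(port), "rid": int(rid), "maxcsn": maxcsn}

print(parse_agmt_maxcsn(
    "dc=example,dc=com;to replica 1;localhost.localdomain;389;111;5270095c0000014d0000"))
```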
Example: Check if Replica B is caught up with Replica A
{{{
Replica A (rid 222):

ldapsearch -h localhost -D "cn=directory manager" -W -b "dc=example,dc=com" -xLLL \
    '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))' \
    nsds50ruv nsds5AgmtMaxCSN

dn: cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
nsds50ruv: {replica 222 ldap://localhost.localdomain:389} 527283720000006f0000 52729c8b0000006f0000
nsds5AgmtMaxCSN: dc=example,dc=com;to replica B;localhost.localdomain;22222;65535;52729c8b0000006f0000
nsds5AgmtMaxCSN: dc=example,dc=com;to replica C;localhost.localdomain;22222;444;52729c999900006f0000
}}}
Look at the nsds5AgmtMaxCSN attributes, locate the agreement for Replica B, and get its max csn value. Then compare this max csn to the max csn in Replica B's nsds50ruv for replica 222 (the supplier):
{{{
Replica B (rid 65535):

ldapsearch -h localhost -p 22222 -D "cn=directory manager" -W -b "dc=example,dc=com" -xLLL \
    '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))' \
    nsds50ruv

dn: cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
nsds50ruv: {replica 222 ldap://localhost.localdomain:389} 527283720000006f0000 52729c8b0000006f0000
nsds50ruv: {replica 444 ldap://localhost.localdomain:389} 527283af0000006f0000 527299990000006f0000
}}}
So now each agreement can be monitored, and this is also the only way to correctly tell if fractional replication agreements are in sync.
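The manual check above can be scripted end to end. This sketch hard-codes the example values (in practice you would pull both attributes with ldapsearch as shown); the dictionaries and names are illustrative:

```python
# Supplier side (Replica A, rid 222): its agreement maxcsns, keyed by consumer rid.
agmt_maxcsns = {
    65535: "52729c8b0000006f0000",  # agreement "to replica B"
    444:   "52729c999900006f0000",  # agreement "to replica C"
}

# Consumer side (Replica B): its RUV, keyed by rid; element 222 is the supplier's.
replica_b_ruv = {222: "52729c8b0000006f0000", 444: "527299990000006f0000"}

# Fixed-width hex CSNs compare correctly as strings: B is caught up when its
# maxcsn for the supplier's rid has reached the supplier's agreement maxcsn.
supplier_rid = 222
in_sync = replica_b_ruv[supplier_rid] >= agmt_maxcsns[65535]
print("Replica B caught up with Replica A:", in_sync)
```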
Fixed jenkins errors 0001-Ticket-47368-Fix-Jenkins-errors.patch
Fixed Jenkins errors:
{{{
git merge ticket47368
Updating c30c897..eb6a462
Fast-forward
 ldap/servers/plugins/replication/repl5_agmt.c | 13 +++++--------
 1 files changed, 5 insertions(+), 8 deletions(-)

git push origin master
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 887 bytes, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   c30c897..eb6a462  master -> master

commit eb6a462
Author: Mark Reynolds <mreynolds@redhat.com>
Date:   Thu Oct 31 14:49:07 2013 -0400
}}}
Fix coverity errors 0001-Ticket-47368-Fix-coverity-issues.patch
ack
{{{
git merge ticket47368
Updating 8db3a1a..8d398a5
Fast-forward
 ldap/servers/plugins/replication/repl5_agmt.c    | 6 +++---
 ldap/servers/plugins/replication/repl5_replica.c | 8 ++++++--
 2 files changed, 9 insertions(+), 5 deletions(-)

git push origin master
Counting objects: 15, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (8/8), done.
Writing objects: 100% (8/8), 929 bytes, done.
Total 8 (delta 6), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   8db3a1a..8d398a5  master -> master

commit 8d398a5
Author: Mark Reynolds <mreynolds@redhat.com>
Date:   Mon Nov 4 09:52:11 2013 -0500
}}}
Fix memory leaks 0001-Ticket-47368-fix-memory-leaks.patch
{{{
git merge ticket47368
Updating bae797c..f67e638
Fast-forward
 ldap/servers/plugins/replication/repl5_agmt.c    | 15 ++++++++++-----
 ldap/servers/plugins/replication/repl5_replica.c |  4 +---
 2 files changed, 11 insertions(+), 8 deletions(-)

git push origin master
Counting objects: 15, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (8/8), done.
Writing objects: 100% (8/8), 1016 bytes, done.
Total 8 (delta 6), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   bae797c..f67e638  master -> master

commit f67e638
Author: Mark Reynolds <mreynolds@redhat.com>
Date:   Thu Dec 5 12:28:23 2013 -0500
}}}
Metadata Update from @mreynolds:
- Issue assigned to mreynolds
- Issue set to the milestone: 1.3.3 - 10/13 (October)
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/705
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
Thank you for understanding. We apologize for all inconvenience.
Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Fixed)