#337 RFE - improve cleanruv functionality
Closed: wontfix None Opened 9 years ago by mreynolds.

If you have many masters, and you need to disable one then you suppose to run the CLEANRUV task afterwards. The problem is that other masters could still be sending updates for the old ruv, which then adds the RUV back to the "just cleaned" replica.

The current solution is to try and run the CLEANRUV task simultaneously on all the replicas, which is practically impossible for very large deployments.

Previously the steps to remove a replica and its RUV was problematic. I created two new "tasks" to take care of the entire replication environment.

[1] The new task "CLEANALLRUV<rid>" - run it once on any master

 -  This marks the rid as invalid.  Used to reject updates to the changelog, and the database RUV
 -  It then sends a "CLEANRUV" extended operation to each agreement.
 -  Then it cleans its own RUV.

 -  The CLEANRUV extended op then triggers that replica to send the the same CLEANRUV extop        to its replicas, then it cleans its own RID.  Basically this operation cascades through  the entire replication environment.

[2] The "RELEASERUV<rid>" task - run it once on any master

 -  Once the RUV's have been cleaned on all the replicas, you need to "release" the rid so  that it can be reused.  This operation also cascades through the entire replication environment.  This also triggers changelog trimming.

For all of this to work correctly, there is a list of steps that needs to be followed. This procedure is attached to the ticket.

1) I see these new OIDs in the patch:
97 / cleanruv/releaseruv extended ops /
98 #define REPL_CLEANRUV_OID "2.16.840.1.113730.3.6.3"
99 #define REPL_RELEASERUV_OID "2.16.840.1.113730.3.6.4"

My repl5.h already has these defines.
82 #define REPL_NSDS71_INCREMENTAL_PROTOCOL_OID "2.16.840.1.113730.3.6.4"
83 #define REPL_NSDS71_TOTAL_PROTOCOL_OID "2.16.840.1.113730.3.6.3"
and they are referred in repl_extop.c.
repl_extop.c: else if (strcmp(protocol_oid, REPL_NSDS71_INCREMENTAL_PROTOCOL_OID) == 0)
repl_extop.c: else if (strcmp(protocol_oid, REPL_NSDS71_TOTAL_PROTOCOL_OID) == 0)
You may want to choose other OIDs which have no conflicts?

2) We should get rid of possibilities to cause a static buffer overflow.
1204 char ridstr[BUFSIZ];
1213 sprintf(ridstr, "%d:%s", rid, slapi_sdn_get_dn(replica_get_root(replica)));

1297 sprintf(ridstr, "%d:%s", rid, slapi_sdn_get_dn(replica_get_root(replica)));
You may want to use slapi_ch_smprintf (the string should be released later) or use PR_snprintf and specify the buffer size.

3) payload is not initialized. The garbage value could be passed to slapi_ch_free_string in decode_cleanruv_payload.
1389 multimaster_extop_cleanruv(Slapi_PBlock pb){
1400 char
1418 decode_cleanruv_payload(extop_value, &payload);

1341 static int
1342 decode_cleanruv_payload(struct berval extop_value, char *payload)
1365 free_and_return:
1366 if (-1 == rc){
1367 slapi_ch_free_string(payload);
1368 }

4) The return value/returned payload from decode_cleanruv_payload are not checked. If payload is NULL, ldap_utf8strtok_r returns NULL and atoi crashes with segfault.
1418 decode_cleanruv_payload(extop_value, &payload);
1419 rid = atoi(ldap_utf8strtok_r(payload, ":", &iter));

1555 decode_cleanruv_payload(extop_value, &payload);
1556 rid = atoi(ldap_utf8strtok_r(payload, ":", &iter));

5) And I don't see payload is released if it's successfully allocated and returned (for instance in multimaster_extop_releaseruv). Is it okay?

[1] I grabbed the next free OIDs from https://home.corp.redhat.com/wiki/ldapschemaoids So I guess someone used them and didn't update the page. I'll correct the wiki, and use a new range of OIDs

[2,3,4, 5] I'll work on it :-)



Is it possible for there to be more than one cleaned or released rid at a time? If so, or just to be future-proof, you might want to change
rid = get_released_rid();
rid = get_cleaned_rid();
to something like this
if (is_released_rid(rid)) {
if (is_cleaned_rid(rid)) {
Looks like you'll need a del_released_rid(rid) instead of set_released_rid(0) (and possibly the same for cleaned_rid) too.

I had considered doing more than 1 rid at a time and started writing code to do it, but I removed this code as it became problematic, and I was afraid of affecting performance. We already check if the rid is "cleaned" on every op. I didn't want to loop through an array per replicated operation.

I'll start working on your proposed changes.


Forgot to check the decode functions return value... revising...

Nevermind, I did take care of decode function return value.

You have my "ack". ;)

Thanks for the reviews Noriko & Rich!

[mareynol@localhost memberof]$ git merge replica
Updating 118e483..0f50544
ldap/servers/plugins/replication/cl5_api.c | 104 +++++-
ldap/servers/plugins/replication/cl5_api.h | 4 +
ldap/servers/plugins/replication/repl5.h | 18 +-
ldap/servers/plugins/replication/repl5_agmt.c | 9 +
ldap/servers/plugins/replication/repl5_init.c | 63 ++++
ldap/servers/plugins/replication/repl5_plugins.c | 10 +
.../plugins/replication/repl5_replica_config.c | 344 +++++++++++++++++++-
ldap/servers/plugins/replication/repl5_ruv.c | 33 ++-
ldap/servers/plugins/replication/repl_extop.c | 341 +++++++++++++++++++-
9 files changed, 897 insertions(+), 29 deletions(-)

[mareynol@localhost memberof]$ git push origin master
Counting objects: 29, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (15/15), done.
Writing objects: 100% (15/15), 9.04 KiB, done.
Total 15 (delta 13), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
118e483..0f50544 master -> master

While debugging #359, I've run into an RUV problem. If I remove this patch, the problem goes away. Could you please take a look?

Steps to reproduce the problem.

1. Set up Master 1 and Master 2 and Hub
Topology: Master 1 <---> Master 2 ---> Hub
RID: 10 RID: 1
2. Add a couple of entries on Master 1
3. Shutdown Master 2
dbscan -f /path/to/master2/changelog
The max ruv does not show up.
max ruv:
{replicageneration} 4fa9c99c000000010000
{replica 10} 4fa9ca070000000a0000 4fa9ca140000000a0000
max ruv:
{replicageneration} 4fa9c99c000000010000
4. restart Master 2
The error log logs the ruv_compare_ruv error:
[..] NSMMReplicationPlugin - ruv_compare_ruv: RUV [changelog max RUV] does not contain element [{replica 10 ldap://<HOST>:<PORT>} 4fa9c2150000000a0000 4fa9c2230000000a0000] which is present in RUV [database RUV]
[..] NSMMReplicationPlugin - replica_check_for_data_reload: Warning: for replica dc=example,dc=com there were some differences between the changelog max RUV and the database RUV. If there are obsolete elements in the database RUV, you should remove them using CLEANRUV task. If they are not obsolete, you should check their status to see why there are no changes from those servers in the changelog.
[..] - slapd started. Listening on All Interfaces port 390 for LDAP requests

Forgot some logical operator nots(!) when checking if a rid was cleaned. Attached new patch that corrects this.

I tested your patch and got the good result.

You have my ack.

[mareynol@localhost servers]$ git merge cleanruv
Updating b51abe5..16e9242
ldap/servers/plugins/replication/cl5_api.c | 6 +++---
ldap/servers/plugins/replication/repl_extop.c | 2 +-
2 files changed, 4 insertions(+), 4 deletions(-)

[mareynol@localhost servers]$ git push origin master
Counting objects: 15, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (8/8), done.
Writing objects: 100% (8/8), 857 bytes, done.
Total 8 (delta 6), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
b51abe5..16e9242 master -> master

Added initial screened field value.

Redesigned communication and data gathering to use extended ops, and improved synchronization between servers. Attaching new patch(11/20/2012)

git merge cleanruv64
Updating f102eb7..ecbd8b7
ldap/servers/plugins/replication/cl5_api.c | 21 +-
ldap/servers/plugins/replication/repl5.h | 29 +-
ldap/servers/plugins/replication/repl5_agmt.c | 99 --
ldap/servers/plugins/replication/repl5_agmtlist.c | 11 +-
.../servers/plugins/replication/repl5_connection.c | 6 +
ldap/servers/plugins/replication/repl5_init.c | 69 ++-
ldap/servers/plugins/replication/repl5_replica.c | 132 +--
.../plugins/replication/repl5_replica_config.c | 1055 +++++++++++++-------
ldap/servers/plugins/replication/repl_extop.c | 266 +++++-
ldap/servers/plugins/replication/replutil.c | 49 +-
ldap/servers/slapd/back-ldbm/ldbm_modify.c | 7 +-
ldap/servers/slapd/back-ldbm/misc.c | 2 +-
ldap/servers/slapd/back-ldbm/proto-back-ldbm.h | 1 +
ldap/servers/slapd/entry.c | 10 +-
ldap/servers/slapd/slapi-private.h | 2 +-
15 files changed, 1105 insertions(+), 654 deletions(-)

[mareynol@localhost ds]$ git push origin master
Counting objects: 45, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (23/23), done.
Writing objects: 100% (23/23), 10.80 KiB, done.
Total 23 (delta 21), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
f102eb7..ecbd8b7 master -> master

pushed to 1.2.10

git merge cleanallruv
Updating 16088c1..f9abb94
ldap/servers/plugins/replication/cl5_api.c | 17 +-
ldap/servers/plugins/replication/repl5.h | 29 +-
ldap/servers/plugins/replication/repl5_agmt.c | 110 +--
ldap/servers/plugins/replication/repl5_agmtlist.c | 12 +-
.../servers/plugins/replication/repl5_connection.c | 7 +-
ldap/servers/plugins/replication/repl5_init.c | 70 ++-
ldap/servers/plugins/replication/repl5_replica.c | 131 +--
.../plugins/replication/repl5_replica_config.c | 960 ++++++++++++--------
ldap/servers/plugins/replication/repl_extop.c | 265 +++++-
ldap/servers/plugins/replication/replutil.c | 2 +-
ldap/servers/slapd/back-ldbm/ldbm_modify.c | 8 +-
ldap/servers/slapd/back-ldbm/misc.c | 6 +-
ldap/servers/slapd/back-ldbm/proto-back-ldbm.h | 1 +
ldap/servers/slapd/entry.c | 13 +-
ldap/servers/slapd/slapi-private.h | 1 +
15 files changed, 1002 insertions(+), 630 deletions(-)

[mareynol@localhost ds]$ git push origin 389-ds-base-1.2.10
Counting objects: 45, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (23/23), done.
Writing objects: 100% (23/23), 9.75 KiB, done.
Total 23 (delta 21), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
16088c1..f9abb94 389-ds-base-1.2.10 -> 389-ds-base-1.2.10

Metadata Update from @mreynolds:
- Issue assigned to mreynolds
- Issue set to the milestone: 1.2.11.rc1

4 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/337

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Fixed)

a year ago

Login to comment on this ticket.