#403 CLEANALLRUV must deal with offline replicas and older replicas
Closed: Fixed None Opened 7 years ago by rmeggins.

One of the problems with CLEANALLRUV is that if a replica is down and has pending changes to send to other replicas, once that replica comes online, it will start sending updates to the other replicas, which includes the RUV containing any replicas deleted/cleaned while it was offline. Replicas that were online and were cleaned properly need a way to ignore or otherwise gracefully deal with changes from other replicas that have the deleted replica in the CSN and the RUV.

One of the suggestions was to change the replication protocol metadata in such a way that this information could be stored and replicated, but if the data were replicated to an older server that did not understand the metadata changes, it would not break replication.

Ideally we could somehow mark RUV elements as having been deleted - delete the CSNs from the RUV, or set the CSNs to all zeros, or use some "magic" CSN value that tells newer replicas that the RUV element has been deleted, but that doesn't break older replicas.

In order to allow us to extend the replication protocol RUV element in the future, we need to add a version field to the RUV element. Replicas will then "advertise" that they support RUV version N, and suppliers of those replicas will craft the RUV so that it conforms to what the consumer expects in terms of the RUV.


Another proposal...(without doing a complete rewrite)

There were actually two versions of CLEANALLRUV. The first version was two steps, and the second version did it all at once. When dealing with the possibility of a replica being down during the task, the first version(two steps) is the best option.

The one advantage to the first version is that you need to manually release the RID, which would "semi" resolve this issue. Until you released the rid, it would reject updates(csns) that contained the rid. This ensures the cleaned replica would not get polluted with the dirty rid again, but we also silently rejected the updates. Once we are sure every replica has been clean, then we would issue the "releaseRID" task. However, if a replica was down, and then comes back online, it doesn't know that it missed this cleaning. So you would have run CLEANALLRUV on that replica again.

I think we can improve the current process. Instead of issuing "CLEANALLRUV" extended ops to each agreement, we replicate it with some kind of "special" update. This way when the replica comes online it will eventually receive the "cleanallruv" update, and clean itself. So, no replica left behind.

There are some issues with the current model. Once we set the rid to be cleaned, we stopped applying updates that came from that replica(dirty rid), and we ignore these updates in the changelog as well. This means possibly lost changes!

We should still apply and replicate these updates, but not update the RUV with the dirty rid. This is where it gets tricky, and I'm not sure what the best approach might be. Maybe change the dirty rid in the csn to that replica's rid? But I'm not sure what side effects this might have.

This the current plan of attack...

  • slapi task is created
  • the rid is marked as "cleaned" in the cn=replica. I'm marking it in cn=replica in case of a server restart...
  • send extended ops to all the replicas. If the replica is up and doesn't reject the extended op, we mark in that repl agmt that it was notified.
  • clean the ruv
  • create a "monitoring" thread

monitoring thread
- Loop over repl agmts that have NOT been notified, and send the extended op again.
- Once all replicas have been notified, then move on to check their tombstones(RUV's)
- Once all the remote RUV's have been cleaned, then release the RID.
- Releasing the rid will allow that rid to be reused, and the cn=replica entry & repl agmts will have the special attributes removed.

On the remote replicas, that receive the extended op, this process is repeated.

The only things different I'm doing is using an actual slapi task, writing a new attribute to cn=replica & repl agmts, and modifying the existing monitoring thread to continue to attempt the extended op on those replicas that have not been notified. Since we are marking the repl config and repl agmts, we can handle servers restarts without interruption of the cleaning process.

Using extended ops still seems to be the best approach. Older servers are safe(they will just reject the extended op), and we don't have to change the replication protocol.

So basically the task will never end until all the the replicas have been notified. So if there are "older" replicas the process will continue forever until that agreement is removed or that server is upgraded. This makes monitoring the task more difficult, as it could run for months until a replica is upgraded.

So the main issue is that the task can run for a long time - until all the servers are up, notified, upgraded, and cleaned. I could "end" the task once we have done the initial round of notifications, but that doesn't mean all the replicas are cleaned yet. Now, I could simply log a message saying that some replicas have not been cleaned because they do not support the CLEANALLRUV task at that time, but they will be cleaned once they do support it. I'm open to suggestions here on how to best do the task logging/monitoring.

A good thing about this approach is that once a replica is upgraded, it will automatically be cleaned.

  • Mark

See "cleanruv-proceedure" attachment for design details. Sending fix out for review...

1.i < 64 at the line 435: why 64? Could it be CLEANRIDSIZ?
{{{
… … agmt_new_from_entry(Slapi_Entry e)
425 428 ra->last_init_start_time = 0UL;
426 429 ra->last_init_status[0] = '\0';
427 430
431 /
cleanruv notification */
432 clean_vals = slapi_entry_attr_get_charray(e, type_nsds5ReplicaCleanRUVnotified);
433 if(clean_vals){
434 int i;
435 for (i = 0; i < 64 && clean_vals[i]; i++){
436 ra->cleanruv_notified[i] = atoi(clean_vals[i]);
437 }
438 ra->cleanruv_notified[i + 1] = 0;
439 slapi_ch_array_free(clean_vals);
440 } else {
441 ra->cleanruv_notified[0] = 0;
442 }
443
}}}

2.I believe cleanruv_notified and abort_notified never run over the specified size since you reject the requests if too many of them come in, but it'll be nice if it's checked that less than (CLEANRIDSIZ + 1) at all the places where cleanruv_notified or abort_notified arrays are assigned (as above in agmt_new_from_entry), which would prevent the unexpected crash...
{{{
a b typedef struct repl5agmt {
141 141 char attrs_to_strip; / for fractional replication, if a "mod" is empty, strip out these attributes:
142 142 * modifiersname, modifytimestamp, internalModifiersname, internalModifyTimestamp, etc
/
143 143 int agreement_type;
144 int cleanruv_notified[CLEANRIDSIZ + 1]; / specifies if the replica has been notified of a CLEANALLRUV task /
145 int abort_notified[CLEANRIDSIZ + 1]; / specifies if the replica has been notified of a CLEANALLRUV ABORT task /
144 146 } repl5agmt;
}}}

  1. I noticed "atoi" is used at many places. As you see in "man atoi", atoi does not detect errors, but it returns 0 if some garbage value is passed. Do we want to check it?
    2736 ra->cleanruv_notified[i] = atoi(attr_vals[i]);

    The atoi() function converts the initial portion of the string pointed
    to by nptr to int. The behavior is the same as
    strtol(nptr, (char **) NULL, 10);
    except that atoi() does not detect errors.

  2. I might be missing something, but these 3 functions look sharing the same algorithm (the mod_type and mod_op are different). Could they be merged into one function by controlling the behaviour with arg/flag? (I don't see agmt_set_abort_released. It's not needed?)
    {{{
    2619 int
    2620 agmt_set_cleanruv_notified(Repl_Agmt ra, ReplicaId rid){
    mod.mod_op = LDAP_MOD_ADD|LDAP_MOD_BVALUES;
    2636 mod.mod_type = (char
    )type_nsds5ReplicaCleanRUVnotified;
    2658 int
    2659 agmt_set_abort_notified(Repl_Agmt ra, ReplicaId rid){
    mod.mod_op = LDAP_MOD_ADD|LDAP_MOD_BVALUES;
    2675 mod.mod_type = (char
    )type_nsds5ReplicaCleanRUVnotified;
    2693 int
    2694 agmt_set_cleanruv_released(Repl_Agmt ra, ReplicaId rid){
    mod.mod_op = LDAP_MOD_DELETE|LDAP_MOD_BVALUES;
    2709 mod.mod_type = (char
    )type_nsds5ReplicaCleanRUVnotified;
    }}}

  3. No more "goto done". ;)
    {{{
    … … write_changelog_and_ruv (Slapi_PBlock pb)
    1083 1074 op_params->target_address.uniqueid = slapi_ch_strdup (uniqueid);
    1084 1075 }
    1085 1076
    1077 if( is_cleaned_rid(csn_get_replicaid(op_params->csn))){
    1078 /
    this RID has been cleaned, just goto done */ <==
    1079 object_release (repl_obj);
    1080 return 0;
    1081 }
    }}}

  4. It's good cleanall_ruv task rejects more tasks than max number. Could there be any way how many tasks are already running?
    {{{
    1303 static int
    1304 replica_execute_cleanall_ruv_task (Object r, ReplicaId rid, Slapi_Task task, char returntext)
    1318 if(get_cleanruv_task_count() >= CLEANRIDSIZ){
    1319 /
    we are already running the maximum number of tasks */
    1320 cleanruv_log(pre_task, CLEANALLRUV_ID,
    1321 "Exceeded maximum number of active CLEANALLRUV tasks(%d)",CLEANRIDSIZ);
    1322 returntext = PR_smprintf("Exceeded maximum number of active CLEANALLRUV tasks(%d), "
    1323 "you must wait for one to finish.", CLEANRIDSIZ);
    1324 return LDAP_UNWILLING_TO_PERFORM;
    1325 }
    }}}

  5. Reading the doc "http://directory.fedoraproject.org/wiki/Howto:CLEANRUV", you put the steps for CLEANALLRUV/CLEANRUV. Would it be nice to provide a script to follow the steps?

Noriko,

[1] nice catch, fixed
[2] this is already checked
[3] If there is a garbage value, zero would be the correct value to put into the array. So we are safe here.
[4] done
[5] done
[6] What do you want the script to do? Set the read only mode, delete agmts, etc? If so, I don't think that should be scripted. It's up to the admin when they can do each step, I don't think it should be forced. Now if you just want a script that fires off the task then that's fine.

Mark

in agmt_set_cleanruv_notified:
{{{
data = PR_smprintf("%d", (int)rid);
}}}
you could just use a static buffer and avoid the malloc/free. same in agmt_set_abort_notified and agmt_set_cleanruv_released. If you use snprintf, you could use the return value of snprintf as the number of characters written to the static buffer and avoid the strlen call too. Similarly in other places in the code where smprintf is used when the input data is of fixed size (e.g. a csn string).

also in those functions, the calls to the internal modify - should at least log an error message if the result was not 0

To expand upon Noriko's comment - this use of atoi in replica_check_for_tasks is problematic:
{{{
rid = atoi(ldap_utf8strtok_r(clean_vals[i], ":", &iter));
}}}
If someone were to pass in bogus data in type_replicaCleanRUV, ldap_utf8strtok_r() could return NULL, and passing a NULL to atoi() will cause a crash. There should be some sort of input validation on the type_replicaCleanRUV data to ensure no crashes, and better, to inform the user that the format of the data is incorrect.

Does replica_add_cleanruv_data need to check to make sure the new value index does not exceed CLEANRIDSIZ?

What is replica_remove_cleanruv_data supposed to do? It looks like it will free more than just the given val. Maybe the slapi_ch_free_string should come before the for loop?

Replying to [comment:9 rmeggins]:

in agmt_set_cleanruv_notified:
{{{
data = PR_smprintf("%d", (int)rid);
}}}
you could just use a static buffer and avoid the malloc/free. same in agmt_set_abort_notified and agmt_set_cleanruv_released. If you use snprintf, you could use the return value of snprintf as the number of characters written to the static buffer and avoid the strlen call too. Similarly in other places in the code where smprintf is used when the input data is of fixed size (e.g. a csn string).

Done.

also in those functions, the calls to the internal modify - should at least log an error message if the result was not 0

Done.

To expand upon Noriko's comment - this use of atoi in replica_check_for_tasks is problematic:
{{{
rid = atoi(ldap_utf8strtok_r(clean_vals[i], ":", &iter));
}}}
If someone were to pass in bogus data in type_replicaCleanRUV, ldap_utf8strtok_r() could return NULL, and passing a NULL to atoi() will cause a crash. There should be some sort of input validation on the type_replicaCleanRUV data to ensure no crashes, and better, to inform the user that the format of the data is incorrect.

Done.

Does replica_add_cleanruv_data need to check to make sure the new value index does not exceed CLEANRIDSIZ?

Probably, done.

What is replica_remove_cleanruv_data supposed to do? It looks like it will free more than just the given val. Maybe the slapi_ch_free_string should come before the for loop?

The function is just to remove the in-memory cleanallruv data from that replica entry. Yes, the free was in the wrong spot.

The new patch is attached.

Mark

git merge ticket403
Updating e860135..657efc6
Fast-forward
Makefile.am | 1 +
Makefile.in | 1 +
ldap/admin/src/scripts/template-cleanallruv.pl.in | 186 ++
ldap/schema/01core389.ldif | 17 +-
ldap/servers/plugins/replication/cl5_api.c | 71 +-
ldap/servers/plugins/replication/cl5_api.h | 9 +-
ldap/servers/plugins/replication/repl5.h | 57 +-
ldap/servers/plugins/replication/repl5_agmt.c | 110 +
ldap/servers/plugins/replication/repl5_agmtlist.c | 4 +
.../servers/plugins/replication/repl5_connection.c | 13 +-
.../plugins/replication/repl5_inc_protocol.c | 21 +-
ldap/servers/plugins/replication/repl5_init.c | 46 +-
ldap/servers/plugins/replication/repl5_plugins.c | 18 +-
ldap/servers/plugins/replication/repl5_replica.c | 477 ++++-
.../plugins/replication/repl5_replica_config.c | 2252 +++++++++++++++-----
ldap/servers/plugins/replication/repl5_ruv.c | 58 +-
ldap/servers/plugins/replication/repl5_ruv.h | 1 +
ldap/servers/plugins/replication/repl_extop.c | 446 ++--
ldap/servers/plugins/replication/repl_globals.c | 5 +-
ldap/servers/slapd/log.c | 27 +
ldap/servers/slapd/slapi-plugin.h | 3 +
ldap/servers/slapd/task.c | 56 +
22 files changed, 2839 insertions(+), 1040 deletions(-)
create mode 100644 ldap/admin/src/scripts/template-cleanallruv.pl.in

git push origin master
Counting objects: 64, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (32/32), done.
Writing objects: 100% (33/33), 24.90 KiB, done.
Total 33 (delta 28), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
e860135..657efc6 master -> master

[mareynol@localhost servers]$ git merge coverity
Updating 9229db2..75e5d3d
Fast-forward
ldap/servers/plugins/replication/repl5_agmt.c | 9 ++-
ldap/servers/plugins/replication/repl5_replica.c | 33 +++++++++---
.../plugins/replication/repl5_replica_config.c | 53 +++++++++----------
ldap/servers/plugins/replication/repl_extop.c | 4 +-
4 files changed, 58 insertions(+), 41 deletions(-)

[mareynol@localhost servers]$ git push origin master
Counting objects: 19, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (10/10), done.
Writing objects: 100% (10/10), 1.67 KiB, done.
Total 10 (delta 8), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
9229db2..75e5d3d master -> master

two new issues were found by the IPA team:

[1] If you run the task on a replica that doesn't have any agmts its doesn't finish correctly.
[2] You can get stuck in a loop, if the replica was already cleaned

Patch is being attached...

Attachment 0001-Ticket-403-CLEANALLRUV-amendment.patch​ added
address two new bugs
ack.

[mareynol@localhost ldap]$ git merge ticket403
Updating 3e02250..df3a0a2
Fast-forward
.../plugins/replication/repl5_replica_config.c | 8 +++++---
ldap/servers/plugins/replication/repl_extop.c | 2 +-
2 files changed, 6 insertions(+), 4 deletions(-)

[mareynol@localhost ldap]$ git push origin master
Counting objects: 15, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (8/8), done.
Writing objects: 100% (8/8), 1.07 KiB, done.
Total 8 (delta 6), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
3e02250..df3a0a2 master -> master

Mark, go ahead and cherry-pick to 389-ds-base-1.2.11 and push - best to keep the branches in sync as much as possible. We'll do the 1.2.11.10 release soon.

Rich, sounds good but I have one more change I need to commit. Patch coming soon...

Replying to [comment:19 mreynolds]:

Rich, sounds good but I have one more change I need to commit. Patch coming soon...

ack

git merge ticket403
Updating df3a0a2..82d2c51
Fast-forward
.../plugins/replication/repl5_replica_config.c | 9 ++++++---
1 files changed, 6 insertions(+), 3 deletions(-)

[mareynol@localhost ldap]$ git push origin master
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 834 bytes, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
df3a0a2..82d2c51 master -> master

Attachment 0001-Ticket-403-fix-CLEANALLRUV-regression-from-last-comm.patch​ added

ack

git merge ticket403
Updating 6237131..819910d
Fast-forward
.../plugins/replication/repl5_replica_config.c | 31 ++++++++++++++-----
ldap/servers/plugins/replication/repl5_ruv.c | 29 ++++++++++++++++++
ldap/servers/plugins/replication/repl5_ruv.h | 1 +
ldap/servers/plugins/replication/repl_extop.c | 4 +--
4 files changed, 54 insertions(+), 11 deletions(-)

[mareynol@localhost ds]$ git push origin master
Counting objects: 19, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (10/10), done.
Writing objects: 100% (10/10), 1.75 KiB, done.
Total 10 (delta 8), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
6237131..819910d master -> master

[mareynol@localhost servers]$ git merge coverity
Updating 34eea5d..37e0121
Fast-forward
ldap/servers/plugins/replication/repl5_agmt.c | 4 ++--
ldap/servers/plugins/replication/repl5_replica.c | 7 ++++---
2 files changed, 6 insertions(+), 5 deletions(-)

[mareynol@localhost servers]$ git push origin master
Counting objects: 15, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (8/8), done.
Writing objects: 100% (8/8), 915 bytes, done.
Total 8 (delta 6), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
34eea5d..37e0121 master -> master

Added initial screened field value.

10/17 - some maxcsn compare fixes, and added a new option (replica-force-cleaning)
0001-Ticket-403-CLEANALLRUV-minor-fixes-and-add-support-f.patch

Metadata Update from @mreynolds:
- Issue assigned to mreynolds
- Issue set to the milestone: 1.2.11.8

2 years ago

Login to comment on this ticket.

Metadata