#631 Replication: "Incremental update started" status message without consumer initialized
Closed: wontfix. Opened 11 years ago by nkinder.

Description of problem:
Assume single-master, single-consumer replication.

Before replication can work, the consumer must be initialized. When creating a
replication agreement, the user can select "Do not initialize consumer", which
leaves the consumer uninitialized and replication not working after the
agreement is created. However, the master still queries the consumer to find
out whether it has been initialized yet.

In this setup, the three attributes tracking the status of the replication
agreement (nsds5replicaLastUpdateStart, nsds5replicaLastUpdateEnd,
nsds5replicaLastUpdateStatus) show misleading data:

  • nsds5replicaLastUpdateStart/nsds5replicaLastUpdateEnd show when the master
last checked the consumer's status (NOT when the last update happened)

  • nsds5replicaLastUpdateStatus shows "0 Replica acquired successfully:
Incremental update started", which is not true, since replication is not
working (no update has started)

Steps to Reproduce:

1) Configure single-master, single-consumer replication and do not initialize
the consumer. Because the consumer is not initialized, replication is not
working and no updates are sent.

2) Check the replication agreement entry:

dn: cn=test3,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
objectClass: top
objectClass: nsDS5ReplicationAgreement
description: test3
cn: test3
nsDS5ReplicaRoot: dc=example,dc=com
nsDS5ReplicaHost: dstet.example.com
nsDS5ReplicaPort: 2389
nsDS5ReplicaBindDN: cn=replication manager,cn=config
nsDS5ReplicaTransportInfo: LDAP
nsDS5ReplicaBindMethod: SIMPLE
nsDS5ReplicaCredentials: {DES}3iMsHyML5/QgiI3nMoqaYw==
nsds5replicareapactive: 0
nsds5replicaLastUpdateStart: 20130319124248Z
nsds5replicaLastUpdateEnd: 0
nsds5replicaChangesSentSinceStartup:
nsds5replicaLastUpdateStatus: 0 Replica acquired successfully: Incremental update started
nsds5replicaUpdateInProgress: TRUE
nsds5replicaLastInitStart: 0
nsds5replicaLastInitEnd: 0

Note that this search was performed at the exact moment the master checked the
consumer, so there is no LastUpdateEnd and an update is in progress
(nsds5replicaUpdateInProgress: TRUE). Usually, however, the master checks every
3 seconds, and nsds5replicaLastUpdateStart and nsds5replicaLastUpdateEnd change
to reflect this.

Actual results:
nsds5replicaLastUpdateStart and nsds5replicaLastUpdateEnd track the time when
the master checked the consumer, not the last time an update happened.

Expected results:
nsds5replicaLastUpdateStart should be 0
nsds5replicaLastUpdateEnd should be 0
nsds5replicaLastUpdateStatus should not say "Incremental update started" when
no update has started


Initial analysis: none of the messages are "wrong", but they are not what you might expect based on the attribute names. In the code, these attributes really describe the protocol:

nsds5replicaLastUpdateStart -> when the update protocol started
nsds5replicaLastUpdateEnd -> when the update protocol ended
nsds5replicaLastUpdateStatus -> the current status of the agreement's protocol. In this case it is "started": even though no updates are being sent, the protocol is running.

I think we can update the status with more appropriate messages, though. As for the update start and stop, I need to investigate that a little more.
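
To make that distinction concrete, here is a minimal, self-contained sketch of the semantics described above; the types and function names (agmt_status_t, incremental_protocol_run) are illustrative stand-ins, not the actual repl5_inc_protocol.c code. The point is that the timestamps bracket each run of the incremental protocol, so they move every time the master polls the consumer, even when no update is ever sent.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-ins for the agreement status attributes. */
typedef struct {
    time_t last_update_start;        /* nsds5replicaLastUpdateStart */
    time_t last_update_end;          /* nsds5replicaLastUpdateEnd */
    const char *last_update_status;  /* nsds5replicaLastUpdateStatus */
} agmt_status_t;

/* One simplified run of the incremental protocol against the consumer. */
static void incremental_protocol_run(agmt_status_t *st, int consumer_initialized)
{
    st->last_update_start = time(NULL);   /* protocol run started */
    st->last_update_status =
        "0 Replica acquired successfully: Incremental update started";

    if (consumer_initialized) {
        /* ... changes would be evaluated and sent here ... */
    }

    /* The run ends and the timestamps are stamped whether or not any
     * update was actually sent - which is exactly what the reporter saw. */
    st->last_update_end = time(NULL);     /* protocol run ended */
}

int main(void)
{
    agmt_status_t st = {0};
    incremental_protocol_run(&st, 0 /* consumer not initialized */);
    printf("start=%ld end=%ld status=%s\n",
           (long)st.last_update_start, (long)st.last_update_end,
           st.last_update_status);
    return 0;
}
```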

Replying to [comment:3 mreynolds]:

> Initial analysis: none of the messages are "wrong", but they are not what you might expect based on the attribute names. In the code, these attributes really describe the protocol:
>
> nsds5replicaLastUpdateStart -> when the update protocol started
> nsds5replicaLastUpdateEnd -> when the update protocol ended
> nsds5replicaLastUpdateStatus -> the current status of the agreement's protocol. In this case it is "started": even though no updates are being sent, the protocol is running.
>
> I think we can update the status with more appropriate messages, though. As for the update start and stop, I need to investigate that a little more.

Did this change in 1.2.10? 1.2.11? 1.3.0? One of the reasons this bug was filed is that some of the QE tests now return different results than they did in earlier releases.

Replying to [comment:4 rmeggins]:

> Replying to [comment:3 mreynolds]:
>
> > Initial analysis: none of the messages are "wrong", but they are not what you might expect based on the attribute names. In the code, these attributes really describe the protocol:
> >
> > nsds5replicaLastUpdateStart -> when the update protocol started
> > nsds5replicaLastUpdateEnd -> when the update protocol ended
> > nsds5replicaLastUpdateStatus -> the current status of the agreement's protocol. In this case it is "started": even though no updates are being sent, the protocol is running.
> >
> > I think we can update the status with more appropriate messages, though. As for the update start and stop, I need to investigate that a little more.
>
> Did this change in 1.2.10? 1.2.11? 1.3.0? One of the reasons this bug was filed is that some of the QE tests now return different results than they did in earlier releases.

These status/start/stop messages have not changed since 1.2.9. I'm waiting to hear back from Jan; I don't recall him stating that any tests were failing, only that the status attributes didn't give the results he would expect. I do have a patch that addresses all of his concerns mentioned above, but I'm going to wait for Jan to get back to me about the TET case (Repl68) before continuing my work.

I think the intention here is that the lastupdate* attributes should reflect only successful sessions, correct? In that case, you'll have to reset the lastupdatestart time to 0 upon error, and you'll have to call agmt_set_update_in_progress(prp->agmt, PR_FALSE); in all cases, success or failure.

Replying to [comment:7 rmeggins]:

> I think the intention here is that the lastupdate* attributes should reflect only successful sessions, correct?

That's what QE was saying: these attributes don't reflect useful/expected values. They don't seem to have anything to do with sending updates, which is what the attribute names imply.

> In that case, you'll have to reset the lastupdatestart time to 0 upon error, and you'll have to call agmt_set_update_in_progress(prp->agmt, PR_FALSE); in all cases, success or failure.

I was thinking that we set the start time when we actually try to send updates. I don't think we should reset the start time for BACKOFF states. For fatal errors, yes, I agree we should reset the start time, etc.

But if we reset the start time every time send_updates() doesn't return success, then the start and end times will always be the same (or off by a second or two) when the updates finally go through. This doesn't seem very useful. I would think we want to know when we first tried to send the updates and when they finally went through (unless there is a fatal error).

This is all open for discussion; it is just what I was originally thinking, and I'm not set on any of it. QE just wanted more precise messages.
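
A rough, self-contained sketch of that policy, using stub types and hypothetical names (agmt_status_t, record_send_attempt, send_result_t) rather than the real 389-ds-base internals: stamp the start time on the first real send attempt, keep it across BACKOFF retries, stamp the end time only when the updates go through, clear both on a fatal error, and clear the in-progress flag in every case, as suggested above.

```c
#include <stdio.h>
#include <time.h>

typedef enum { SEND_OK, SEND_BACKOFF, SEND_FATAL } send_result_t;

/* Hypothetical stand-in for the agreement status, for illustration only. */
typedef struct {
    time_t last_update_start;   /* nsds5replicaLastUpdateStart */
    time_t last_update_end;     /* nsds5replicaLastUpdateEnd */
    int    update_in_progress;  /* nsds5replicaUpdateInProgress */
} agmt_status_t;

/* Record one attempt to send updates under the proposed policy. */
static void record_send_attempt(agmt_status_t *st, send_result_t result)
{
    st->update_in_progress = 1;
    if (st->last_update_start == 0) {
        st->last_update_start = time(NULL);  /* first real send attempt */
    }

    switch (result) {
    case SEND_OK:
        st->last_update_end = time(NULL);    /* updates finally went through */
        break;
    case SEND_BACKOFF:
        /* Transient failure: keep the original start time so start/end show
         * how long the master kept trying before the updates were accepted. */
        break;
    case SEND_FATAL:
        st->last_update_start = 0;           /* reset on fatal errors */
        st->last_update_end = 0;
        break;
    }

    st->update_in_progress = 0;  /* cleared in all cases, success or failure */
}

int main(void)
{
    agmt_status_t st = {0};
    record_send_attempt(&st, SEND_BACKOFF);  /* consumer not ready, back off */
    record_send_attempt(&st, SEND_OK);       /* retry succeeds */
    printf("start=%ld end=%ld in_progress=%d\n",
           (long)st.last_update_start, (long)st.last_update_end,
           st.update_in_progress);
    return 0;
}
```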

If the intention is to know when you started trying to send the updates, wouldn't you also want to know when you finished trying to send the updates? That is, if lastupdatestarttime is X, wouldn't you always want lastupdateendtime to be X+N, where N >= 0?

Otherwise, we need two additional status timestamps, lastupdateattemptstarttime and lastupdateattemptendtime, and should only update lastupdatestarttime and lastupdateendtime when successful.

A client attempting to check replication status should look at the lastupdatestatus attribute. This attribute should be in the following form:

Number Space String

where Number is 0 if successful, Space is a single space character, and String is a message.

To report an error, agmt_set_last_update_status() should be called with either ldaprc set to some LDAP error code, or replrc set to one of the NSDS50_REPL_ error codes. ldaprc takes precedence: if ldaprc is set, the status will report an LDAP code and error message, regardless of whether replrc is set. All the code needs to do is make sure that the Number is non-zero, and that there is a reasonable error message, when replication cannot take place.
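
As an illustration only, the helpers below (format_update_status, parse_update_status are made-up names, not the real agmt_set_last_update_status() implementation) sketch the "Number Space String" convention: the server-side formatting with ldaprc taking precedence over replrc, and how a monitoring client could split the value back into a code and a message.

```c
#include <stdio.h>
#include <stdlib.h>

/* Build a "Number Space String" status value; an LDAP result code, if set,
 * wins over a replication result code, mirroring the rule described above. */
static void format_update_status(char *buf, size_t len, int ldaprc, int replrc,
                                 const char *ldap_msg, const char *repl_msg)
{
    if (ldaprc != 0) {
        snprintf(buf, len, "%d %s", ldaprc, ldap_msg);
    } else if (replrc != 0) {
        snprintf(buf, len, "%d %s", replrc, repl_msg);
    } else {
        snprintf(buf, len, "0 Replica acquired successfully");
    }
}

/* Client side: split the value into the numeric code and the message text. */
static int parse_update_status(const char *status, const char **msg)
{
    char *end = NULL;
    long code = strtol(status, &end, 10);

    if (end != NULL && *end == ' ') {
        end++;                       /* skip the single separating space */
    }
    *msg = end;
    return (int)code;
}

int main(void)
{
    char status[256];
    const char *msg = NULL;

    /* e.g. an LDAP error takes precedence over any replication code */
    format_update_status(status, sizeof(status), 81, 0,
                         "Can't contact LDAP server", NULL);
    int code = parse_update_status(status, &msg);
    printf("code=%d msg=%s\n", code, msg);  /* non-zero => replication broken */
    return 0;
}
```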

Replying to [comment:9 rmeggins]:

> If the intention is to know when you started trying to send the updates, wouldn't you also want to know when you finished trying to send the updates? That is, if lastupdatestarttime is X, wouldn't you always want lastupdateendtime to be X+N, where N >= 0?
>
> Otherwise, we need two additional status timestamps, lastupdateattemptstarttime and lastupdateattemptendtime, and should only update lastupdatestarttime and lastupdateendtime when successful.
>
> A client attempting to check replication status should look at the lastupdatestatus attribute. This attribute should be in the following form:
>
> Number Space String
>
> where Number is 0 if successful, Space is a single space character, and String is a message.
>
> To report an error, agmt_set_last_update_status() should be called with either ldaprc set to some LDAP error code, or replrc set to one of the NSDS50_REPL_ error codes. ldaprc takes precedence: if ldaprc is set, the status will report an LDAP code and error message, regardless of whether replrc is set. All the code needs to do is make sure that the Number is non-zero, and that there is a reasonable error message, when replication cannot take place.

OK, I revised the fix based on your comments.

git merge ticket631
Updating f9c9269..67233f5
Fast-forward
.../plugins/replication/repl5_inc_protocol.c | 57 +++++++++++++-------
1 files changed, 37 insertions(+), 20 deletions(-)

git push origin master
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 1.77 KiB, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
f9c9269..67233f5 master -> master

commit 67233f5

Metadata Update from @mreynolds:
- Issue assigned to mreynolds
- Issue set to the milestone: 1.3.1

7 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/631

If you want to receive further updates on the issue, please navigate to the GitHub issue
and click on the Subscribe button.

Thank you for understanding. We apologize for any inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Fixed)

3 years ago
