Description of problem: Assume a single-master, single-consumer replication setup.
Before replication can work, the consumer must be initialized. When creating a replication agreement, the user can select "Do not initialize consumer", which should leave the consumer uninitialized and replication not working after the agreement is created. However, the master still needs to query the consumer to find out whether it has been initialized yet.
In this setup, the 3 attributes tracking the status of the replication agreement (nsds5replicaLastUpdateStart, nsds5replicaLastUpdateEnd, nsds5replicaLastUpdateStatus) show incorrect data:
- nsds5replicaLastUpdateStart/nsds5replicaLastUpdateEnd show the last time the master checked the consumer's status (NOT the last time an update happened)
- nsds5replicaLastUpdateStatus says "Incremental update started", which is not true, since replication is not working (no update has started)
Steps to Reproduce:
1) Configure single master-single consumer replication and do not initialize consumer. Because consumer is not initialized, replication is not working and no updates are sent.
2) Check replication agreement entry:
dn: cn=test3,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
objectClass: top
objectClass: nsDS5ReplicationAgreement
description: test3
cn: test3
nsDS5ReplicaRoot: dc=example,dc=com
nsDS5ReplicaHost: dstet.example.com
nsDS5ReplicaPort: 2389
nsDS5ReplicaBindDN: cn=replication manager,cn=config
nsDS5ReplicaTransportInfo: LDAP
nsDS5ReplicaBindMethod: SIMPLE
nsDS5ReplicaCredentials: {DES}3iMsHyML5/QgiI3nMoqaYw==
nsds5replicareapactive: 0
nsds5replicaLastUpdateStart: 20130319124248Z
nsds5replicaLastUpdateEnd: 0
nsds5replicaChangesSentSinceStartup:
nsds5replicaLastUpdateStatus: 0 Replica acquired successfully: Incremental update started
nsds5replicaUpdateInProgress: TRUE
nsds5replicaLastInitStart: 0
nsds5replicaLastInitEnd: 0
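The three status attributes can be pulled out of such LDIF output programmatically. Below is a minimal sketch for illustration — the helper name and sample snippet are made up for this example; only the attribute names are the real ones from the entry above:

```python
# Illustrative sketch: extract the replication-status attributes from
# LDIF text (e.g. the output of an ldapsearch against the agreement entry).
# The attribute names are the real ones; everything else is an example.

STATUS_ATTRS = (
    "nsds5replicaLastUpdateStart",
    "nsds5replicaLastUpdateEnd",
    "nsds5replicaLastUpdateStatus",
)

def extract_status(ldif_text):
    """Return a dict of the three last-update attributes found in an LDIF entry."""
    result = {}
    for line in ldif_text.splitlines():
        if ":" not in line:
            continue
        # Split on the first colon only; the status value itself contains colons.
        attr, _, value = line.partition(":")
        if attr in STATUS_ATTRS:
            result[attr] = value.strip()
    return result

sample = """\
nsds5replicaLastUpdateStart: 20130319124248Z
nsds5replicaLastUpdateEnd: 0
nsds5replicaLastUpdateStatus: 0 Replica acquired successfully: Incremental update started
"""
print(extract_status(sample))
```

Run against the entry above, this surfaces exactly the mismatch being reported: a real-looking start timestamp, an end time of 0, and a "started" status while no updates are actually flowing.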
Note that this search was performed exactly when the master checked the consumer, so there is no LastUpdateEnd yet and an update is in progress (nsds5replicaUpdateInProgress: TRUE). However, the master usually checks every 3 seconds, and nsds5replicaLastUpdateStart and nsds5replicaLastUpdateEnd change to reflect this.
Actual results: nsds5replicaLastUpdateStart and nsds5replicaLastUpdateEnd track the time when the master checked the consumer, not the last time an update happened.
Expected results:
nsds5replicaLastUpdateStart should be 0
nsds5replicaLastUpdateEnd should be 0
nsds5replicaLastUpdateStatus should not say "Incremental update started" when it hasn't
Initial analysis: none of the messages are "wrong", but they are not what you might expect based on the attribute names. In the code these attributes really describe the protocol:
nsds5replicaLastUpdateStart -> when the update protocol started
nsds5replicaLastUpdateEnd -> when the update protocol ended
nsds5replicaLastUpdateStatus -> the current status of the agmt protocol. In this case it is "started"; even though it's not sending any updates, the protocol is running.
I think we can update the status with more appropriate messages though. As for the update start and stop I need to investigate that a little more.
Replying to [comment:3 mreynolds]:
Initial analysis: none of the messages are "wrong", but they are not what you might expect based on the attribute names. In the code these attributes really describe the protocol:
nsds5replicaLastUpdateStart -> when the update protocol started
nsds5replicaLastUpdateEnd -> when the update protocol ended
nsds5replicaLastUpdateStatus -> the current status of the agmt protocol. In this case it is "started"; even though it's not sending any updates, the protocol is running.
I think we can update the status with more appropriate messages though. As for the update start and stop I need to investigate that a little more.
Did this change in 1.2.10? 1.2.11? 1.3.0? One of the reasons why this bug was filed is because some of the QE tests now return different results than they used to in earlier releases.
Replying to [comment:4 rmeggins]:
Replying to [comment:3 mreynolds]:
Initial analysis: none of the messages are "wrong", but they are not what you might expect based on the attribute names. In the code these attributes really describe the protocol:
nsds5replicaLastUpdateStart -> when the update protocol started
nsds5replicaLastUpdateEnd -> when the update protocol ended
nsds5replicaLastUpdateStatus -> the current status of the agmt protocol. In this case it is "started"; even though it's not sending any updates, the protocol is running.
I think we can update the status with more appropriate messages though. As for the update start and stop I need to investigate that a little more.

Did this change in 1.2.10? 1.2.11? 1.3.0? One of the reasons why this bug was filed is because some of the QE tests now return different results than they used to in earlier releases.
These status/start/stop messages have not changed since 1.2.9. I'm waiting to hear back from Jan; I don't recall him stating that any tests were failing, only that the status attributes didn't give the results he would expect. I do have a patch that addresses all of his concerns mentioned above, but I'm going to wait for Jan to get back to me about the TET case (Repl68) before continuing my work.
I think the intention here is that the lastupdate* attributes should reflect only successful sessions, correct? In that case, you'll have to reset the lastupdatestart time to 0 upon error, and you'll have to call agmt_set_update_in_progress(prp->agmt, PR_FALSE) in all cases, success or failure.
Replying to [comment:7 rmeggins]:
I think the intention here is that the lastupdate* attributes should reflect only successful sessions, correct?
That's what QE was saying: these attributes don't reflect useful/expected values. They don't seem to have anything to do with sending updates, which is what the attribute name implies.
In that case, you'll have to reset lastupdatestart time to 0 upon error, and you'll have to reset agmt_set_update_in_progress(prp->agmt, PR_FALSE); in all cases, success or failure.
I was thinking that we set the start time when we actually try to send updates. I don't think we should reset the start time for BACKOFF states. Now for fatal errors, yes, I agree we should reset the start time, etc.
But, if we reset the start time every time send_updates() doesn't return success, then the start and end times will always be the same (or off by a second or two) when it finally goes through. This doesn't seem very useful. I would think we want to know when we first tried to send the updates, and when they finally went through (unless there is a fatal error).
This is all open for discussion, but this is just what I was originally thinking. I'm not hard set on any of this. QE just wanted more precise messages.
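The policy being proposed above — keep the original start time across BACKOFF retries, reset it only on a fatal error, and stamp the end time when updates finally go through — could be modeled roughly as follows. This is an illustrative Python sketch of the discussed behavior, not the actual C code in repl5_inc_protocol.c; the class, method, and state names are made up:

```python
# Illustrative model of the proposed timestamp policy (not the real
# implementation): preserve last_update_start across BACKOFF retries,
# reset it on fatal errors, and set last_update_end only on success.

import time

BACKOFF, FATAL, SUCCESS = range(3)  # hypothetical result codes for this sketch

class AgreementStatus:
    def __init__(self):
        self.last_update_start = 0
        self.last_update_end = 0

    def on_send_updates(self, result, now=None):
        now = now if now is not None else int(time.time())
        if self.last_update_start == 0:
            self.last_update_start = now   # first attempt to send updates
        if result == SUCCESS:
            self.last_update_end = now     # updates finally went through
        elif result == FATAL:
            self.last_update_start = 0     # fatal error: reset both timestamps
            self.last_update_end = 0
        # BACKOFF: keep the original start time, so start/end end up showing
        # when we first tried and when the updates finally succeeded
```

With this model, two BACKOFF retries followed by a success leave the start time at the first attempt and the end time at the success, which matches the "when we first tried / when it finally went through" reading above.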
If the intention is to know when you started trying to send the updates, wouldn't you also want to know when you finished trying to send the updates? That is, if lastupdatestarttime is X, wouldn't you always want lastupdateendtime to be X+N, where N >= 0?
Otherwise, we need two additional status timestamps: lastupdateattemptstarttime and lastupdateattemptendtime - and only update lastupdatestarttime and lastupdateendtime if successful.
A client attempting to check replication status should look at the lastupdatestatus attribute. This attribute should be in the following form:
    Number Space String

where Number is 0 if successful, Space is a single space character, and String is a message.
To report some sort of error, agmt_set_last_update_status() should be called with either ldaprc set to some LDAP error code, or replrc set to one of the NSDS50_REPL_ error codes. ldaprc takes precedence - if ldaprc is set, the status will report an LDAP code and error message, regardless of whether replrc is set. All the code should need to do is make sure that the Number is non-zero and that there is a reasonable error message when replication cannot take place.
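A client check of that "Number Space String" convention can be very small. The following sketch is illustrative — parse_update_status is a made-up helper, not part of the server; only the sample status string is taken from the reproduction above:

```python
def parse_update_status(status):
    """Split an nsds5replicaLastUpdateStatus value into (code, message).

    Per the convention described above, code 0 means success; a non-zero
    code is an LDAP error code or an NSDS50_REPL_ error code.
    """
    # Split on the first space only; the message itself may contain spaces.
    number, _, message = status.partition(" ")
    return int(number), message

code, message = parse_update_status(
    "0 Replica acquired successfully: Incremental update started")
# code == 0: the protocol reported success, even though no updates were sent
```

A client should branch on the numeric code rather than string-matching the message text, since the messages themselves may be reworded (as this ticket proposes).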
Revision 0001-Ticket-631-Replication-Incremental-update-started-st.patch
Replying to [comment:9 rmeggins]:
If the intention is to know when you started trying to send the updates, wouldn't you also want to know when you finished trying to send the updates? That is, if lastupdatestarttime is X, wouldn't you always want lastupdateendtime to be X+N, where N >= 0?
Otherwise, we need two additional status timestamps: lastupdateattemptstarttime and lastupdateattemptendtime - and only update lastupdatestarttime and lastupdateendtime if successful.
A client attempting to check replication status should look at the lastupdatestatus attribute. This attribute should be in the following form:

    Number Space String

where Number is 0 if successful, Space is a single space character, and String is a message.
To report some sort of error, agmt_set_last_update_status() should be called with either ldaprc set to some LDAP error code, or replrc set to one of the NSDS50_REPL_ error codes. ldaprc takes precedence - if ldaprc is set, the status will report an LDAP code and error message, regardless of whether replrc is set. All the code should need to do is make sure that the Number is non-zero and that there is a reasonable error message when replication cannot take place.
Ok, I revised the fix based on your comments.
git merge ticket631
Updating f9c9269..67233f5
Fast-forward
 .../plugins/replication/repl5_inc_protocol.c | 57 +++++++++++++-------
 1 files changed, 37 insertions(+), 20 deletions(-)
git push origin master
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 1.77 KiB, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   f9c9269..67233f5  master -> master
commit 67233f5
Metadata Update from @mreynolds: - Issue assigned to mreynolds - Issue set to the milestone: 1.3.1
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/631
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
Thank you for understanding. We apologize for any inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: Fixed)