#49020 do not treat missing csn as fatal
Closed: wontfix 7 years ago Opened 7 years ago by lkrispen.

There have been many tickets and fixes about managing csns in a replication session, and it is not yet fully settled.
There are situations where replication should back off instead of going into a fatal state.
A summary of the problem and status was discussed on a mailing list and is cited here:

Recently I have been haunted by the problem of if and when we should pick an alternative start csn in a replication session if the calculated anchor csn cannot be found in the changelog.
This was investigated during the work on ticket #48766 (skipped-changes) and the resulting failures in reliab15, and picked up again with the recent customer problems on 7.2 where part of that fix was missing.
I was trying to analyze the scenarios that can arise and to finally be able to answer these two core questions:

1] If an anchor csn cannot be found, should we choose an alternative starting csn (and under which conditions), or should the repl session be aborted?
2] If the answer to 1] is abort, should this be a fatal error or a transient one?

I have been moving in circles, but hope to have a good understanding of the problem now and would really like to get this finally settled, so please read the following, even if it is a bit lengthy, and challenge my arguments.

Let's start by looking back at the state before the "skipped-changes" fix (a sketch of the old fallback follows the list):
- if an anchor csn was not found, and the purge RUV in the changelog did not contain csns, an alternative start csn was chosen
- the alternative was the minCSN of the supplier RUV
- after an online init, a start-iteration record was written to the changelog, which corresponded to the minCSN in the ruv after init.
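
For clarity, here is a minimal sketch of that old fallback in Python, assuming RUVs are modeled as (rid, csn) pairs and csns are the usual fixed-width hex strings that sort lexicographically; the function name and data layout are placeholders, not the server's C code.

{{{
def old_alternative_start_csn(supplier_ruv, purge_ruv):
    # RUVs modeled as lists of (rid, csn) pairs; csn may be None (purl only)
    if any(csn for _rid, csn in purge_ruv):
        return None  # the purge RUV has csns: the old code did not fall back
    csns = [csn for _rid, csn in supplier_ruv if csn]
    # fixed-width hex csn strings sort lexicographically, so min() is the minCSN
    return min(csns) if csns else None
}}}

The patch48954-2 variant described below instead chose a start csn as close as possible to the missing one.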

This worked, but when working on the "skipped-changes" problem I noticed that the selection of the alternative csn could lead to a loss of many changes (I have a test case to demonstrate this - TEST #1).
So, under the assumption that we should find an anchor or break, and that the existing mechanism to select an alternative was incorrect, this fallback was removed in the original fix patch48766-0.

Unfortunately reliab15 failed with this patch, tracked in ticket #48954.
A first analysis showed that the failure was after init, when an anchor csn was looked up for a replicaID which did not have a start-iteration record. A first attempt to fix it was to log start-iteration records for all RIDs with csns contained in the ruv after init, patch48954-1.
This did not completely solve the initial phase of reliab15, and we decided to go back to the method of selecting an alternative start csn, but choosing a better one, as close as possible to the missing csn: patch48954-2.

This resolved the reliab15 problem and is the current state.

In between, Thierry detected that this change also changed behaviour if a replica accepted updates too early after initialization (#48976).
And I found the test case TEST #1, where with the old selection of the alternative many changes can be lost, while with the new method one change is still lost.

So I looked more closely at the failures we had in reliab15 and noticed that one reason that patch48954-1 did not work was that, in the chain of initializations M1->M2->..., the ruv for most replicaIDs only contained a purl, but no csn. This could be improved by fixing tickets #48995 and #48999.
With a fix for these tickets the reliab15 scenario worked without the need of choosing an alternative anchor csn.

So I am back to the question of what could be a valid scenario where an anchor csn cannot be found.

From the following reasoning it should not happen (the sketch after these two points restates it):
If the ruv used for total init contains a csn for rid N, a start-iteration record for csn-N will be created; the server will only receive updates with csn-X > csn-N, so it will have all updates, and if the consumer csn csn-C >= csn-N, the anchor csn csn-C should always be found.
If the ruv used for total init does not contain a csn for rid N, this means the server has not seen any changes for this rid and ALL changes will be received via replication, so no csn for rid N should ever be missing.
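
Restating that argument as a tiny predicate (the names and representation are mine, not server code):

{{{
def anchor_should_be_found(start_csn, anchor_csn):
    # start_csn:  csn-N from the start-iteration record written at total init,
    #             or None if the ruv used for the init had no csn for rid N
    # anchor_csn: the consumer's maxCSN for rid N (csn-C above), as a
    #             fixed-width hex string so string comparison is csn order
    if start_csn is None:
        # every change for rid N arrives via replication and is logged,
        # so whatever anchor the consumer asks for must be in the changelog
        return True
    # only updates with csn > start_csn were received and logged, so the
    # anchor is only guaranteed when the consumer is at least at start_csn
    return anchor_csn >= start_csn
}}}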

But it does happen. A potential scenario is in TEST #2, and the creation of a keep-alive entry is a special case of this scenario, where the ADD is internal, not external.
I have opened ticket #49000 for this.

I think with the fixes for #48995, #48999 and #49000 we should be safe not to use a fallback to an alternative anchor csn.
If an anchor csn can no longer be found, it is because it was purged, or the server was initialized with a RUV > the missing CSN (TEST #1), or it accepted an update before it was in sync after an init (ticket #48976).

But in these cases the error is not permanent: if the consumer is updated by another server, the required anchor csn can increase and be found in the changelog. So a missing anchor csn should, in my opinion, not switch the agreement to FATAL state but to BACKOFF.


So, to summarize, here is my suggestion (a sketch of the intended handling follows the list):
- Implement fixes for #48995, #48999 and #49000
- treat a missing anchor csn as an error
- switch to backoff for a missing anchor csn
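
A minimal sketch of what the suggested handling would look like, with placeholder names; the real change lives in the replication plugin's C code:

{{{
from enum import Enum

class AgmtState(Enum):
    READY = "ready"
    BACKOFF = "backoff"   # transient: retry the session after a delay
    FATAL = "fatal"       # permanent: stopped until an admin intervenes

def on_missing_anchor_csn(log=print):
    # a missing anchor csn is reported as an error, but the agreement only
    # goes to backoff; another supplier may advance the consumer past the gap
    log("Can't locate anchor CSN in the changelog; going to backoff")
    return AgmtState.BACKOFF   # previously this path ended in FATAL
}}}

If the consumer is later updated by another supplier, the next session computes a newer anchor csn and can proceed without a reinit.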

There might be one reason to keep the choice of an alternative anchor csn: I have seen customer requests to keep replication going even if there are some inconsistencies; they want to schedule a reinit or restore or whatever at their convenience and not have an environment where replication is completely broken.


For completeness, here are my test scenarios:

TEST #1:

(the disabling and enabling of replication agreements is to enforce a specific timing and replication flow; it could happen like this by itself as well).
Have 3 masters A, B, C in triangle replication.
Total init A-->B
Disable agreement A-->B
Disable agreement C-->B
Add an entry cn=X on A
Total init A-->C
Add entries cn=Y1, ..., cn=Y10 on A
Enable agreement C-->B again

Result: with patch 48766, entry cn=X is missing on B
with the version before 48766, cn=X, cn=Y1, ..., cn=Y9 are missing
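
For reference, TEST #1 written out as a driver sketch; total_init(), set_agreement(), add_entry() and entry_exists() are hypothetical helpers standing in for whatever tooling (lib389, ldapmodify, ...) is used to drive the topology:

{{{
def total_init(src, dst): ...              # hypothetical: online-init dst from src
def set_agreement(src, dst, enabled): ...  # hypothetical: toggle the agreement src -> dst
def add_entry(master, rdn): ...            # hypothetical: add an entry on master
def entry_exists(master, rdn): ...         # hypothetical: check the entry on master

def test_1(A, B, C):
    # 3 masters A, B, C in triangle replication
    total_init(A, B)                       # Total init A --> B
    set_agreement(A, B, enabled=False)     # Disable agreement A --> B
    set_agreement(C, B, enabled=False)     # Disable agreement C --> B
    add_entry(A, "cn=X")                   # Add an entry cn=X on A
    total_init(A, C)                       # Total init A --> C
    for i in range(1, 11):
        add_entry(A, f"cn=Y{i}")           # Add entries cn=Y1 ... cn=Y10 on A
    set_agreement(C, B, enabled=True)      # Enable agreement C --> B again

    # with patch 48766 only cn=X is missing on B;
    # with the version before 48766, cn=X and cn=Y1 ... cn=Y9 are missing
    return [rdn for rdn in ["cn=X"] + [f"cn=Y{i}" for i in range(1, 11)]
            if not entry_exists(B, rdn)]
}}}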

TEST #2:

masters A, B
start a sequence of adds to A, adding cn=E1, cn=E2, ...
while this is running start a total init A-->B
when the total init is done, the incremental update starts and we see a couple of messages on B:
urp_add: csn=..... entry with cn=Ex already exists

The csns reported in these messages are then NOT in the changelog of B.
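
TEST #2 as a sketch along the same lines, again with hypothetical helpers; the point is only that the adds on A overlap with the total init of B:

{{{
import itertools
import threading

def add_entry(master, rdn): ...   # hypothetical: add an entry on master
def total_init(src, dst): ...     # hypothetical: online-init dst from src

def test_2(A, B, max_adds=500):
    stop = threading.Event()

    def add_loop():
        # keep adding cn=E1, cn=E2, ... to A while the total init runs
        for i in itertools.count(1):
            if stop.is_set() or i > max_adds:
                return
            add_entry(A, f"cn=E{i}")

    writer = threading.Thread(target=add_loop)
    writer.start()
    total_init(A, B)              # start a total init A --> B while adds run
    stop.set()
    writer.join()
    # once the init is done the incremental update starts; B then logs
    #   urp_add: csn=... entry with cn=Ex already exists
    # and those csns are NOT found in B's changelog
}}}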

Just sharing some recent information:

Redhat IT just ran into this in a testing environment with DS 10.0. A csn failed to be committed to the changelog (deadlock retry errors). But this sent several agreements into a stop-fatal state (from which there is no return). The CSN was committed to the changelog one second later, but the agreements were already halted. Restarting the server fixed the issue (I'm assuming disabling/enabling the agreements would have worked too).

The attached part 1 removes the automatic selection of an alternative csn and goes into backoff instead of fatal.
It also defines and manages the creation of the keep-alive entries.

The second part would have to check the "enforce" attr and then use the next best csn.
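
Since the final name of the parameter is not stated in this ticket (it was renamed during review, see the next comment), the python-ldap sketch below uses an assumed attribute name and value; only the shape of the change, a modify on the agreement entry, reflects what the comment describes:

{{{
import ldap  # python-ldap

# ASSUMPTION: the attribute name and value are placeholders; the ticket only
# says the parameter lives in the agreement and was renamed during review.
def allow_alternate_start_csn(uri, agmt_dn, bind_dn, bind_pw,
                              attr="nsds5ReplicaIgnoreMissingChange",
                              value=b"once"):
    conn = ldap.initialize(uri)
    conn.simple_bind_s(bind_dn, bind_pw)
    # tell this agreement to fall back to the next best start csn instead of
    # staying in backoff when the anchor csn cannot be found
    conn.modify_s(agmt_dn, [(ldap.MOD_REPLACE, attr, [value])])
    conn.unbind_s()
}}}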

There are two new attached patches.
The first one is "part 2", implementing the missing parts: allow the admin to enforce usage of an alternate start csn. It also includes requests from review to choose a better name for the parameter in the agreement.

The second one is a consolidated version of part 1 + part 2 and should be used in reviews.

Thanks Ludwig, I'm happy with the change you made for my suggestion!

We have a regression with cleanallruv. This was found by running ds/dirsrvtests/tests/suites/replication/cleanallruv_test.py

After running a cleanallruv task just once we get missing CSN errors and things break:

{{{
[22/Dec/2016:09:17:50.319854151 -0500] - ERR - agmt="cn=meTo_localhost.localdomain:38942" (localhost:38942) - clcache_load_buffer - Can't locate CSN 585be002000000010000 in the changelog (DB rc=-30988). If replication stops, the consumer may need to be reinitialized.
[22/Dec/2016:09:17:50.321863791 -0500] - ERR - NSMMReplicationPlugin - changelog program - repl_plugin_name_cl - agmt="cn=meTo_localhost.localdomain:38942" (localhost:38942): CSN 585be002000000010000 not found, we aren't as up to date, or we purged
[22/Dec/2016:09:17:50.324031635 -0500] - ERR - NSMMReplicationPlugin - send_updates - agmt="cn=meTo_localhost.localdomain:38942" (localhost:38942): Data required to update replica has been purged from the changelog. If the error persists the replica must be reinitialized.
}}}

Going to gather more info, and add it to the ticket.

Update: So when the cleanallruv task is run on Master A to remove Master D, it purges the changelog of all the changes from Master A (but not Master B, C, or D). D is the only one that's supposed to be cleaned, but only the local changes (Master A) are purged.

False alarm. This fix exposed a regression from ticket 48964.

Do we need to backport to the 1.2.11 branch?

attached backport to 1.2.11

Attachment 0001-Ticket-49020-v1.2.11-do-not-treat-missing-csn-as-fat.patch added

Ack. Thanks, Ludwig!

committed to 1.2.11 branch:

commit 55aa091

Metadata Update from @firstyear:
- Issue set to the milestone: 1.2.11.33

7 years ago

Metadata Update from @vashirov:
- Custom field component reset
- Issue close_status updated to: None

7 years ago

Metadata Update from @vashirov:
- Custom field reviewstatus adjusted to review (was: ack)

7 years ago

Metadata Update from @mreynolds:
- Custom field reviewstatus adjusted to ack (was: review)

7 years ago

commit 73f221a
To ssh://pagure.io/389-ds-base.git
ed5b925..73f221a master -> master

Metadata Update from @vashirov:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

7 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/2079

If you want to receive further updates on the issue, please navigate to the github issue
and click on the subscribe button.

Thank you for understanding. We apologize for any inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: fixed)

3 years ago

