#49020 do not treat missing csn as fatal
Closed: wontfix 7 years ago Opened 7 years ago by lkrispen.

There have been many tickets and fixes about managing csns in a replication session, and it is not yet fully settled.
There are situations where replication should back off instead of going into a fatal state.
A summary of the problem and status was discussed on a mailing list and is cited here:

Recently I have been haunted by the problem of if and when we should pick an alternative start csn in a replication session if the calculated anchor csn cannot be found in the changelog.
This was investigated during the work on ticket #48766 (skipped-changes) and the resulting failures in reliab15, and picked up again with the recent customer problems on 7.2 where part of that fix was missing.
I was trying to analyze the scenarios that can arise and to finally be able to answer these two core questions:

1] If an anchor csn cannot be found, should we choose an alternative starting csn (and under which conditions), or should the repl session be aborted?
2] If the answer to 1] is abort, should this be a fatal error or a transient one?

I have been moving in circles, but hope to have a good understanding of the problem now and would really like to get this finally settled, so please read the following, even if it is a bit lengthy, and challenge my arguments.

Let's start by looking back at the state before the "skipped-changes" fix (a sketch of the old fallback follows the list):
- if an anchor csn was not found, and the purge RUV in the changelog did not contain csns, an alternative start csn was chosen
- the alternative was the minCSN of the supplier RUV
- after an online init, a start-iteration record was written to the changelog, which corresponded to the minCSN in the ruv after init.
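
For clarity, here is a minimal sketch of that old fallback in Python, assuming RUVs are modeled as (rid, csn) pairs and csns are the usual fixed-width hex strings that sort lexicographically; the function name and data layout are placeholders, not the server's C code.

{{{
def old_alternative_start_csn(supplier_ruv, purge_ruv):
    # RUVs modeled as lists of (rid, csn) pairs; csn may be None (purl only)
    if any(csn for _rid, csn in purge_ruv):
        return None  # the purge RUV has csns: the old code did not fall back
    csns = [csn for _rid, csn in supplier_ruv if csn]
    # fixed-width hex csn strings sort lexicographically, so min() is the minCSN
    return min(csns) if csns else None
}}}

The patch48954-2 variant described below instead chose a start csn as close as possible to the missing one.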

This worked, but when working on the "skipped-changes" problem I noticed that the selection of the alternative csn could lead to a loss of many changes (I have a test case to demonstrate this - TEST #1).
So, under the assumption that we should find an anchor or break, and that the existing mechanism to select an alternative was incorrect, this fallback was removed in the original fix patch48766-0.

Unfortunately reliab15 failed with this patch, tracked in ticket #48954.
A first analysis showed that the failure was after init, when an anchor csn was looked up for a replicaID which did not have a start-iteration record. A first attempt to fix it was to log start-iteration records for all RIDs with csns contained in the ruv after init, patch48954-1.
This did not completely solve the initial phase of reliab15, and we decided to go back to the method of selecting an alternative start csn, but choosing a better one, as close as possible to the missing csn: patch48954-2.

This resolved the reliab15 problem and is the current state.

In between, Thierry detected that this change also changed behaviour if a replica accepted updates too early after initialization (#48976).
And I found the test case TEST #1, where with the old selection of the alternative many changes can be lost, while with the new method one change is still lost.

So I looked more closely at the failures we had in reliab15 and noticed that one reason that patch48954-1 did not work was that, in the chain of initializations M1->M2->..., the ruv for most replicaIDs only contained a purl, but no csn. This could be improved by fixing tickets #48995 and #48999.
With a fix for these tickets the reliab15 scenario worked without the need of choosing an alternative anchor csn.

So I am back to the question of what could be a valid scenario where an anchor csn cannot be found.

From the following reasoning it should not happen (the sketch after these two points restates it):
If the ruv used for total init contains a csn for rid N, a start-iteration record for csn-N will be created; the server will only receive updates with csn-X > csn-N, so it will have all updates, and if the consumer csn csn-C >= csn-N, the anchor csn csn-C should always be found.
If the ruv used for total init does not contain a csn for rid N, this means the server has not seen any changes for this rid and ALL changes will be received via replication, so no csn for rid N should ever be missing.
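
Restating that argument as a tiny predicate (the names and representation are mine, not server code):

{{{
def anchor_should_be_found(start_csn, anchor_csn):
    # start_csn:  csn-N from the start-iteration record written at total init,
    #             or None if the ruv used for the init had no csn for rid N
    # anchor_csn: the consumer's maxCSN for rid N (csn-C above), as a
    #             fixed-width hex string so string comparison is csn order
    if start_csn is None:
        # every change for rid N arrives via replication and is logged,
        # so whatever anchor the consumer asks for must be in the changelog
        return True
    # only updates with csn > start_csn were received and logged, so the
    # anchor is only guaranteed when the consumer is at least at start_csn
    return anchor_csn >= start_csn
}}}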

But it does happen. A potential scenario is in TEST #2, and the creation of a keep-alive entry is a special case of this scenario, where the ADD is internal, not external.
I have opened ticket #49000 for this.

I think with the fixes for #48995, #48999 and #49000 we should be safe not to use a fallback to an alternative anchor csn.
If an anchor csn can no longer be found, it is because it was purged, or the server was initialized with a RUV > the missing CSN (TEST #1), or it accepted an update before it was in sync after an init (ticket #48976).

But in these cases the error is not permanent: if the consumer is updated by another server, the required anchor csn can increase and be found in the changelog. So a missing anchor csn should, in my opinion, not switch the agreement to FATAL state but to BACKOFF.


So, to summarize, here is my suggestion (a sketch of the intended handling follows the list):
- Implement fixes for #48995, #48999 and #49000
- treat a missing anchor csn as an error
- switch to backoff for a missing anchor csn
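
A minimal sketch of what the suggested handling would look like, with placeholder names; the real change lives in the replication plugin's C code:

{{{
from enum import Enum

class AgmtState(Enum):
    READY = "ready"
    BACKOFF = "backoff"   # transient: retry the session after a delay
    FATAL = "fatal"       # permanent: stopped until an admin intervenes

def on_missing_anchor_csn(log=print):
    # a missing anchor csn is reported as an error, but the agreement only
    # goes to backoff; another supplier may advance the consumer past the gap
    log("Can't locate anchor CSN in the changelog; going to backoff")
    return AgmtState.BACKOFF   # previously this path ended in FATAL
}}}

If the consumer is later updated by another supplier, the next session computes a newer anchor csn and can proceed without a reinit.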

There might be one reason to keep the choice of an alternative anchor csn: I have seen customer requests to keep replication going even if there are some inconsistencies; they want to schedule a reinit or restore or whatever at their convenience and not have an environment where replication is completely broken.


For completeness, here are my test scenarios:

TEST #1:

(the disabling and enabling of replication agreements is to enforce a specific timing and replication flow; it could happen like this by itself as well).
Have 3 masters A, B, C in triangle replication.
Total init A-->B
Disable agreement A-->B
Disable agreement C-->B
Add an entry cn=X on A
Total init A-->C
Add entries cn=Y1, ..., cn=Y10 on A
Enable agreement C-->B again

Result: with patch 48766, entry cn=X is missing on B
with the version before 48766, cn=X, cn=Y1, ..., cn=Y9 are missing
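
For reference, TEST #1 written out as a driver sketch; total_init(), set_agreement(), add_entry() and entry_exists() are hypothetical helpers standing in for whatever tooling (lib389, ldapmodify, ...) is used to drive the topology:

{{{
def total_init(src, dst): ...              # hypothetical: online-init dst from src
def set_agreement(src, dst, enabled): ...  # hypothetical: toggle the agreement src -> dst
def add_entry(master, rdn): ...            # hypothetical: add an entry on master
def entry_exists(master, rdn): ...         # hypothetical: check the entry on master

def test_1(A, B, C):
    # 3 masters A, B, C in triangle replication
    total_init(A, B)                       # Total init A --> B
    set_agreement(A, B, enabled=False)     # Disable agreement A --> B
    set_agreement(C, B, enabled=False)     # Disable agreement C --> B
    add_entry(A, "cn=X")                   # Add an entry cn=X on A
    total_init(A, C)                       # Total init A --> C
    for i in range(1, 11):
        add_entry(A, f"cn=Y{i}")           # Add entries cn=Y1 ... cn=Y10 on A
    set_agreement(C, B, enabled=True)      # Enable agreement C --> B again

    # with patch 48766 only cn=X is missing on B;
    # with the version before 48766, cn=X and cn=Y1 ... cn=Y9 are missing
    return [rdn for rdn in ["cn=X"] + [f"cn=Y{i}" for i in range(1, 11)]
            if not entry_exists(B, rdn)]
}}}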

TEST #2:

masters A, B
start a sequence of adds to A, adding cn=E1, cn=E2, ...
while this is running start a total init A-->B
when the total init is done, the incremental update starts and we see a couple of messages on B:
urp_add: csn=..... entry with cn=Ex already exists

The csns reported in these messages are then NOT in the changelog of B.
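
TEST #2 as a sketch along the same lines, again with hypothetical helpers; the point is only that the adds on A overlap with the total init of B:

{{{
import itertools
import threading

def add_entry(master, rdn): ...   # hypothetical: add an entry on master
def total_init(src, dst): ...     # hypothetical: online-init dst from src

def test_2(A, B, max_adds=500):
    stop = threading.Event()

    def add_loop():
        # keep adding cn=E1, cn=E2, ... to A while the total init runs
        for i in itertools.count(1):
            if stop.is_set() or i > max_adds:
                return
            add_entry(A, f"cn=E{i}")

    writer = threading.Thread(target=add_loop)
    writer.start()
    total_init(A, B)              # start a total init A --> B while adds run
    stop.set()
    writer.join()
    # once the init is done the incremental update starts; B then logs
    #   urp_add: csn=... entry with cn=Ex already exists
    # and those csns are NOT found in B's changelog
}}}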

Just sharing some recent information:

Redhat IT just ran into this in a testing environment with DS 10.0. A csn failed to be committed to the changelog (deadlock retry errors). But this sent several agreements into a stop-fatal state (from which there is no return). The CSN was committed to the changelog one second later, but the agreements were already halted. Restarting the server fixed the issue (I'm assuming disabling/enabling the agreements would have worked too).

The attached part 1 removes the automatic selection of an alternative csn and goes into backoff instead of fatal.
It also defines and manages the creation of the keep-alive entries.

The second part would have to check the "enforce" attr and then use the next best csn.
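
Since the final name of the parameter is not stated in this ticket (it was renamed during review, see the next comment), the python-ldap sketch below uses an assumed attribute name and value; only the shape of the change, a modify on the agreement entry, reflects what the comment describes:

{{{
import ldap  # python-ldap

# ASSUMPTION: the attribute name and value are placeholders; the ticket only
# says the parameter lives in the agreement and was renamed during review.
def allow_alternate_start_csn(uri, agmt_dn, bind_dn, bind_pw,
                              attr="nsds5ReplicaIgnoreMissingChange",
                              value=b"once"):
    conn = ldap.initialize(uri)
    conn.simple_bind_s(bind_dn, bind_pw)
    # tell this agreement to fall back to the next best start csn instead of
    # staying in backoff when the anchor csn cannot be found
    conn.modify_s(agmt_dn, [(ldap.MOD_REPLACE, attr, [value])])
    conn.unbind_s()
}}}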

There are two new attached patches.
The first one is "part 2", implementing the missing parts: allow the admin to enforce usage of an alternate start csn. It also includes requests from review to choose a better name for the parameter in the agreement.

The second one is a consolidated version of part 1 + part 2 and should be used in reviews.

Thanks Ludwig, I'm happy with the change you made for my suggestion!

We have a regression with cleanallruv. This was found by running ds/dirsrvtests/tests/suites/replication/cleanallruv_test.py

After running a cleanallruv task just once we get missing CSN errors and things break:

{{{
[22/Dec/2016:09:17:50.319854151 -0500] - ERR - agmt="cn=meTo_localhost.localdomain:38942" (localhost:38942) - clcache_load_buffer - Can't locate CSN 585be002000000010000 in the changelog (DB rc=-30988). If replication stops, the consumer may need to be reinitialized.
[22/Dec/2016:09:17:50.321863791 -0500] - ERR - NSMMReplicationPlugin - changelog program - repl_plugin_name_cl - agmt="cn=meTo_localhost.localdomain:38942" (localhost:38942): CSN 585be002000000010000 not found, we aren't as up to date, or we purged
[22/Dec/2016:09:17:50.324031635 -0500] - ERR - NSMMReplicationPlugin - send_updates - agmt="cn=meTo_localhost.localdomain:38942" (localhost:38942): Data required to update replica has been purged from the changelog. If the error persists the replica must be reinitialized.
}}}

Going to gather more info, and add it to the ticket.

Update: So when the cleanallruv task is run on Master A to remove Master D, it purges the changelog of all the changes from Master A (but not Master B, C, or D). D is the only one that's supposed to be cleaned, but only the local changes (Master A) are purged.

False alarm. This fix exposed a regression from ticket 48964.

Do we need to backport to the 1.2.11 branch?

attached backport to 1.2.11

Attachment 0001-Ticket-49020-v1.2.11-do-not-treat-missing-csn-as-fat.patch added

Ack. Thanks, Ludwig!

committed to 1.2.11 branch:

commit 55aa091

Metadata Update from @firstyear:
- Issue set to the milestone: 1.2.11.33

7 years ago

Metadata Update from @vashirov:
- Custom field component reset
- Issue close_status updated to: None

7 years ago

Metadata Update from @vashirov:
- Custom field reviewstatus adjusted to review (was: ack)

7 years ago

Metadata Update from @mreynolds:
- Custom field reviewstatus adjusted to ack (was: review)

7 years ago

commit 73f221a
To ssh://pagure.io/389-ds-base.git
ed5b925..73f221a master -> master

Metadata Update from @vashirov:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

7 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/2079

If you want to receive further updates on the issue, please navigate to the github issue
and click on the subscribe button.

Thank you for understanding. We apologize for any inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: fixed)

3 years ago

