#51260 Syncrepl may send changelog entries twice causing client data corruption
Closed: wontfix 4 years ago by spichugi. Opened 4 years ago by firstyear.

Issue Description

While investigating #51171 and expanding the test cases, I noticed that it was possible for certain syncrepl events to be skipped causing client data corruption. This is when the following order of operations occurs

       server modrdn out of scope
client sync to state X
       server modrdn into scope
client sync state to X + 1

Further investigation showed that we would send previously seen changelog entries, and due to how these are processed while optimising out to remove entries that were added/deleted in succession, this would cause the sync repl events to not be sent. This leads to a state where clients would recieve the delete, but not the subsequent re-add as the server deems the delete followed by add as a no-op.


Metadata Update from @firstyear:
- Issue assigned to firstyear

4 years ago

@firstyear, do you have a reproducible test case ?
I am surprise that sync_repl can miss to send the ADD. The first MODRDN should be translated into a DEL update queue to the sync_repl thrad, then the next into the ADD update. Looking at the code there seems to be no clever thing done by sync_repl thread like "merging" DEL and ADD. Would expect that it sends a DEL op of nsuniqueid entry then a ADD op, along with the added (modrdn into) entry with the same nsuniqueid.

Metadata Update from @tbordaz:
- Custom field origin adjusted to None
- Custom field reviewstatus adjusted to None

4 years ago

@tbordaz Yes, the test case improvements in the PR are able to reproduce this. https://pagure.io/389-ds-base/pull-request/51261

What occurs is that the entry is moved out of scope by the modrdn. This causes a "delete" to be sent to the client to remove the entry. This is recorded in my tests as clnum 5. The client issues a sync and the cookie records that the client has seen up to and including change 5.

Then the next operation is to modrdn the entry back into scope. This should cause an entry present (aka add) to be issued. However it is not. This is because of the incorrect filter where it requests changenum greater than AND equal to the client cookie value of 5. This causes the change log of the modrdn out of scope and the modrdn into scope to be in the same window for consideration. Remember, the client has already seen clnum 5, so there is no reason we should be re-processing it. We only need greater than 5 not greater than or equal to 5, which alone makes this a bug let alone the impact on clients :)

At this point we are now ready for the "corruption" to occur. sync_read_entry_from_changelog is now called. https://pagure.io/389-ds-base/blob/master/f/ldap/servers/plugins/sync/sync_refresh.c#_483 The item in clnum 5 is examined and matches LDAP_REQ_MODRDN. No previous items exist, so prev -1 and the value is added to cb->cb_updates[index] (where index == 0). Next the item in clnum 6 is examined. This is also an LDAP_REQ_MODRDN. Since the old/new scope isn't changing (the entry appears to stay in the same place), then if (old_scope && new_scope) is taken. prev = sync_find_ref_by_uuid now sets the value prev to 0 since the entry is already in the cb->cb_updates. Because the old_scope and new scope appear to be the same the change is nullified via the else condition on 578.

This means that at this point, no values are sent. The cookie is update to clnum 6.

At this point the client has seen from the previous request a delete of the uuid in question, but due to the incorrect changelog processing when going from cookie 5 -> 6, no entry present is added or sent, meaning that the entry now does not exist in the clients view.

In the test case you can see this here:

https://pagure.io/389-ds-base/pull-request/51261#_1__162

Without the change to the changenumber selection, sync present is len 0.

I am surprise that sync_repl can miss to send the ADD.

I am not surprised that an issue like this exists. The major consumer of this feature is FreeIPA with the bind dyndb module, and that re-syncs at every startup, so "glitches" can be dismissed on a restart, and it would be extremely hard to reproduce without a very specific set of client and server interactions. As well, this is a reasonable complex feature with what appears to be very little testing, and from my own work I can already see that most sync repl clients are not correctly implemented due to the complexity of the RFC. So it's also possible this has been missed as clients are not compliant to the RFC and did not know to check this condition. After reading the RFC and this code, I'm already concerned our own implementation may not be RFC compliant but I'm still not 100% confident to make that assessment.

Ohh also. Reading sync_read_entry_from_changelog it's also easy to create another scenario similar.

If clnum = 1 was a LDAP_REQ_ADD, this would be sent to the client. Then clnum = 2 is a REQ_DELETE. Because of the code in https://pagure.io/389-ds-base/blob/master/f/ldap/servers/plugins/sync/sync_refresh.c#_615 the delete would not be sent to the client, because the change process would have incorrectly re-processed clnum = 1, which would then cause the prev lookup to find the add from clnum 1, then the "delete" is nullified since the sync module things the add/del happened between the two cookies, when in reality only the delete occured.

So that's another example of a trivial corruption.

Remember also this can happen over much larger windows too. Consider we add say ... 100 entries, and the client syncs. We then have say some deletes, some adds. The client syncs again. If the item in clnum 100 was "an add" and the entry was deleted at any future point, we'd not send the delete. which again, would leave the client inconsistent.

Metadata Update from @mreynolds:
- Issue priority set to: major
- Issue set to the milestone: 1.4.2

4 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/4313

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix
- Issue status updated to: Closed (was: Open)

4 years ago

Log in to comment on this ticket.

Metadata
Related Pull Requests