This is a clone of bz 1460070, which contains more data.
The issue seems to be:
There are multiple backends, and a change in one backend triggers a change by a plugin (memberof) in another backend. The CSN pending list mechanism currently does not work for changes affecting multiple backends.
Metadata Update from @tbordaz: - Custom field origin adjusted to RHCust - Custom field rhbz adjusted to https://bugzilla.redhat.com/show_bug.cgi?id=1460070 - Custom field type adjusted to defect - Custom field version adjusted to 1.3.5
Metadata Update from @mreynolds: - Issue set to the milestone: 1.3.6.0
<img alt="0001-Ticket-49287-extend-csnpl-handling-to-multiple-backe.patch" src="/389-ds-base/issue/raw/7accb768b1e06a3623c2674daf8b0108daf594b72887014bf4c4831c06b59eeb-0001-Ticket-49287-extend-csnpl-handling-to-multiple-backe.patch" />
<img alt="0001-test-for-pending-list-handling-across-multiple-backe.patch" src="/389-ds-base/issue/raw/7077a3becff06e5936b34ea4a6bf0b83127d25134fd3a2a8ba06ba1b4c7cfe45-0001-test-for-pending-list-handling-across-multiple-backe.patch" />
added patch and test design at: http://www.port389.org/docs/389ds/design/csn-pending-lists-and-ruv-update.html
Metadata Update from @mreynolds: - Issue set to the milestone: 1.3.7.0 (was: 1.3.6.0)
My comments:
I like the design. How do we transmit knowledge of which csns are prim and which are op to a replica?
ack from me, this looks reasonable.
176 +    int repl_alloc; /* max number of replicas */
177 +    int repl_cnt;   /* number of replicas affected by operation */
Please don't use bare int - use int32_t or int64_t (preferred).
250
251 +void
252 +add_replica_to_primcsn(CSNPL_CTX *csnpl_ctx, Replica *repl)
253 +{
254 +    int found = 0;
255 +    int it = 0;
256 +    while (it < csnpl_ctx->repl_cnt) {
it should be size_t, not int. Same with int found: it should be int64_t or similar.
263 +    if (found) return;
Should have braces for clarity.
469 +int ruv_cancel_csn_inprogress (void *repl, RUV *ruv, const CSN *csn, ReplicaId local_rid)
...
511 +    for (it=0; it<prim_csn->repl_cnt; it++) {
Should be for (size_t it = 0; ...).
178 +    Replica *def_repl[CSNPL_CTX_REPLCNT]; /* prim array of affected replicas */
179 +    Replica **replicas;                   /* additional replicas affected */
Rather than having two lists of the replicas, shouldn't we just have **replicas, and default-initialize it to CSNPL_CTX_REPLCNT (i.e. 4) entries? This way replica handling is easier, as it's always just the one array.
I'm sure that other people will have more. I think that it's a nice design, thanks for all your work on this!
@firstyear thanks for your feedback, here are some answers
design: I'm not sure I fully understand your question (so the design needs to be clearer), but I'll try to answer. The thread context always contains the bare prim_csn and the replica it was first generated for. In the pending list, each added csn carries the prim_csn and prim_replica in its data, so for each csn in a pending list we always know whether it is prim or not. Outside the pending list, if we need to decide whether a csn is prim, we get the prim ctx from the thread data and check.
implementation: yes, I am sloppy with the usage of types, and I will change it. About the replica array: only in exceptional cases do we have more than one replica for a prim_csn, so I wanted to avoid allocating an array just to almost always add a single replica. But your remark makes me think of a better solution. I noticed at some point that we need to distinguish and track the prim_replica, and I took advantage of the fact that the prim_replica is always replicas[0]. I think we can change this to
Replica *prim_replica;
Replica **sec_replicas;
so we have one distinguished replica always available, and in the rare cases where we need more we will allocate the secondary replicas.
<img alt="0001-Ticket-49287-v2-extend-csnpl-handling-to-multiple-ba.patch" src="/389-ds-base/issue/raw/b91a275cc3176a5da39166f791678cdb837933a6471eaf57226d1e6ac3787d82-0001-Ticket-49287-v2-extend-csnpl-handling-to-multiple-ba.patch" />
The patch, design and explanations are very nice! A few comments/questions:
A csn is only valid in the topology where it was generated. To be sure a csn belongs to the right topology you are using 'prim_replica' and 'prim_repl'. IMHO the prim_repl name is a bit confusing: we do not care which replica generated the csn, but we do care about the topology this replica belongs to. I am fine keeping those names, but do you think a macro could help clarify this?
In ruv_cancel_csn_inprogress, we remove secondary csns (children of prim_csn) on the condition that the replica is not a consumer. Why this restriction? Is an internal update on a consumer rejected?
In csnpldata_free, you may want to comment that prim_replica and prim_csn should not be freed.
We are only operating in one server, and with multiple backends we need to know on which backend the operation originated (this indirectly identifies the topology). When I use replica or repl, it is a reference to the Replica object, which corresponds to the backend. So I think it is correct to use the replica, and I don't know a better name. I don't see what you want to replace by a macro, and macros are not always well received, so if you make a suggestion I will think about it.
This condition existed before my fix. I think the reason is that the read-only replica is not part of the RUV, and there are a few checks for this. But it is worth looking into whether it could affect the csnpl handling, although I think not.
will do
About the read-only replica: an operation on a read-only replica doesn't get a csn assigned, so there is no csn to insert or remove. The check seems to be useless. It might be a leftover of legacy replication, but I am not sure about this. If you want, I can remove this check.
I agree with you on the use of the replica pointer to identify which backend (aka topology) the csn belongs to. I was just saying that the pending list code is interested in backends, and talking of replicas in that place looked a bit confusing to me.
I made a comment patch isolating the access to the replica info. It is not a macro but a wrapper, just to document that we want to check which backend/topology the csn belongs to. <img alt="0001-49287-isolation-of-prim_repl-prim_replica.patch" src="/389-ds-base/issue/raw/c001b81fc8ae1cddf73a65526db78812b28603b5c259feb71cb6f859a43af3d3-0001-49287-isolation-of-prim_repl-prim_replica.patch" />
thanks, that's looking good
I just made some changes to get it to compile, but I agree that your fix is clearer.
<img alt="0001-Ticket-49287-v3-extend-csnpl-handling-to-multiple-ba.patch" src="/389-ds-base/issue/raw/6d855f63e2e6968692eb06332c322aedbdc3e528232e63881344077864be75ec-0001-Ticket-49287-v3-extend-csnpl-handling-to-multiple-ba.patch" />
I'm happy with the type changes you made here, so my side is ack, but I think tbordaz should check as well, as his knowledge of this is better than mine.
The patch looks good to me as well --> ack
Metadata Update from @lkrispen: - Issue close_status updated to: fixed - Issue status updated to: Closed (was: Open)
Tentative patch to handle failures in read-only replicas:
<img alt="0001-add-fix-for-49287-handle-readonly-replica.patch" src="/389-ds-base/issue/raw/fa334184f87ff0e773a9ddde43a28c4fee0dc98cb577e88aa0c2e4e959dd5dd6-0001-add-fix-for-49287-handle-readonly-replica.patch" />
Hi,
The patch looks good. Just to be sure I understand it correctly: you switched the order of the csnplRemoveAll calls, from local_rid first / prim_rid second. Is the order important? I believe it is not.
Now the fix prevents a call to ruvGetReplica with a read-only replica. I guess this is to avoid getting a NULL repl_ruv and then dereferencing it. Am I correct?
Metadata Update from @tbordaz: - Custom field component adjusted to None - Custom field reviewstatus adjusted to None
The order is not important; we need to prevent a call to remove the read-only replica id. I changed the order because we should always have the primary rid to be removed, and only if a replicated op generates a csn is there a different local rid. We should probably add an extra check that the ruv element is not NULL; coverity will probably request this anyway.
Thanks for the explanations. Indeed, checking the repl_ruv returned from ruvGetReplica appears quite systematically in the rest of the code. It would be great to add this in the committed patch. ACK
Metadata Update from @tbordaz: - Custom field reviewstatus adjusted to ack (was: None)
Thanks, here is a new version:
<img alt="0001-additional-fix-for-49287-handle-readonly-replica.patch" src="/389-ds-base/issue/raw/files/4056a8185763660bf0e5ca987f44edf05b2629a4f9dc10998c666386caeffbdf-0001-additional-fix-for-49287-handle-readonly-replica.patch" />
This is looking great! ACK (already set)
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/2346
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
Thank you for understanding. We apologize for any inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: fixed)