#47750 Creating a glue fails if one above level is a conflict or missing
Closed: Fixed None Opened 5 years ago by nhosoi.

Example:
dn: ou=test,ou=projects,nsuniqueid=bd44c40a-afc711e3-9b7bd5e4-7027b69c
+o=org246,ou=ou0,dc=example,dc=com

Trying to resurrect ou=test and adding a glue entry ou=projects. Then, the add fails due to its parent not found since o=org246 is a conflict.

210 create_glue_entry ( Slapi_PBlock pb, char sessionid, Slapi_DN dn, const char uniqueid, CSN opcsn )
239 op_result= do_create_glue_entry(rdn, superiordn, uniqueid, "missingEntry", opcsn);
254 case LDAP_NO_SUCH_OBJECT:
255 /
The parent is missing /
256 {
257 /
JCMREPL - Create the parent ... recursion?... but what's the uniqueid? /
258 PR_ASSERT(0); /
JCMREPL */ <== This assertion fails.
259 }


{{{
380 if (buf->buf_prev_anchorcsn && csn_compare (buf->buf_prev_anchorcsn, anchorcsn) == 0 ) {
381 slapi_log_error(SLAPI_LOG_FATAL, buf->buf_agmt_name,
382 "Can't locate CSN %s in the changelog (DB rc=%d). "
383 "Unless the operation is delete, the consumer may need to be reinitialized.\n",
384 (char*)buf->buf_key.data, rc);
}}}

I don't know how easy it will be for the admin to tell if this is a delete operation. We should try to only show this message if the operation is not a delete operation.

{{{

225         else if (LDAP_ALREADY_EXISTS == op_result) 
226         { 
227                 slapi_log_error (SLAPI_LOG_FATAL, repl_plugin_name, 
228                         "%s: No need to turn tombstone %s to glue; it was already resurrected.\n", sessionid, addingdn); 
229                 op_result = LDAP_SUCCESS; 
230         }

}}}
Does this need to be FATAL?

{{{
1250 if (e && e_in_cache) {
1251 CACHE_REMOVE(&inst->inst_cache, e);
1212 1252 cache_unlock_entry( &inst->inst_cache, e );
1213 1253 CACHE_RETURN( &inst->inst_cache, &e );
}}}
Do you need to add a CACHE_REMOVE here?

{{{

206                 if (is_tombstone && 
207                     !PL_strncasecmp(dns[0], SLAPI_ATTR_UNIQUEID, SLAPI_ATTR_UNIQUEID_LENGTH) && 
208                     (*(dns[0] + SLAPI_ATTR_UNIQUEID_LENGTH) == '=') && 
209                     (*(dns[0] + SLAPI_ATTR_UNIQUEID_LENGTH + slapi_uniqueIDSize() + 1/*=*/) == '\0')) {

}}}

Are you guaranteed that the length of dns[0] is greater than or equal to SLAPI_ATTR_UNIQUEID_LENGTH + slapi_uniqueIDSize() + 1? Same for slapi_rdn_set_dn_ext() and slapi_rdn_is_conflict()

git patch file (389-ds-base-1.2.11) -- Additional bug fixes... (Eventually, merged into the main patch)
0001-Ticket-47750-Creating-a-glue-fails-if-one-above-leve.2.patch

Still cache issues exist...
{{{
==30936== Thread 26:
==30936== Invalid read of size 8
==30936== at 0x10466365: backentry_get_sdn (backentry.c:133)
==30936== by 0x104664D9: entry_same_dn (cache.c:167)
==30936== by 0x10466793: find_hash (cache.c:241)
==30936== by 0x10468A18: cache_find_dn (cache.c:1186)
==30936== by 0x1047D22E: dn2entry_ext (dn2entry.c:85)
==30936== by 0x104809CF: find_entry_internal_dn (findentry.c:130)
==30936== by 0x10480EAC: find_entry_internal (findentry.c:293)
==30936== by 0x10480F44: find_entry (findentry.c:313)
==30936== by 0x104C6CA9: ldbm_back_search (ldbm_search.c:545)
==30936== by 0x4CAE46A: op_shared_search (opshared.c:803)
==30936== by 0x13AE50: do_search (search.c:357)
==30936== by 0x11FBF6: connection_dispatch_operation (connection.c:622)
==30936== by 0x121584: connection_threadmain (connection.c:2339)
==30936== by 0x66D7C85: _pt_root (ptthread.c:204)
==30936== by 0x6D14D14: start_thread (pthread_create.c:308)
==30936== by 0x723253C: clone (clone.S:114)
==30936== Address 0x23075900 is 48 bytes inside a block of size 96 free'd
==30936== at 0x4A077E6: free (vg_replace_malloc.c:446)
==30936== by 0x4C5D8CF: slapi_ch_free (ch_malloc.c:363)
==30936== by 0x10466215: backentry_free (backentry.c:62)
==30936== by 0x104688B1: entrycache_return (cache.c:1157)
==30936== by 0x10468754: cache_return (cache.c:1130)
==30936== by 0x104BE176: modify_term (ldbm_modify.c:101)
==30936== by 0x104A1AC5: ldbm_back_add (ldbm_add.c:1209)
==30936== by 0x4C52D57: op_shared_add (add.c:681)
==30936== by 0x4C51D23: do_add (add.c:258)
==30936== by 0x11FAC4: connection_dispatch_operation (connection.c:579)
==30936== by 0x121584: connection_threadmain (connection.c:2339)
==30936== by 0x66D7C85: _pt_root (ptthread.c:204)
==30936== by 0x6D14D14: start_thread (pthread_create.c:308)
==30936== by 0x723253C: clone (clone.S:114)
}}}

git patch file (389-ds-base-1.2.11) -- test passed with no leaks
0001-Ticket-47750-Creating-a-glue-fails-if-one-above-leve.3.patch

Need to fix this comment:
{{{
/ The entry is already a Tombstone, ignore this delete. /
}}}
We are not ignoring the delete now.

{{{
217 op_result = urp_fixup_rename_entry (addingentry, conflictrdn, parentuniqueid, OP_FLAG_RESURECT_ENTRY|OP_FLAG_TOMBSTONE_ENTRY);
}}}
The previous place you call this with slapi_ch_strdup(parentuniqueid) - do you need to use strdup here too?

{{{
220 dn = slapi_sdn_get_dn(sdn);
221 ldap_result_code = slapi_dn_syntax_check(pb, dn, 1);
220 ldap_result_code = slapi_dn_syntax_check(pb, slapi_sdn_get_dn(sdn), 1);
}}}
Now dn will not be set - is there code that expects dn to be set?

{{{
317 LDAPDebug2Args(LDAP_DEBUG_BACKLDBM, "find_entry2modify_only returned NULL parententry pdn: %s, uniqueid: %s\n",
318 slapi_sdn_get_dn(&parentsdn), addr.uniqueid);
}}}
is it possible for addr.uniqueid to be NULL here?

{{{
457 if (!PL_strncasecmp(edn, SLAPI_ATTR_UNIQUEID, SLAPI_ATTR_UNIQUEID_LENGTH) &&
458 ((edn + SLAPI_ATTR_UNIQUEID_LENGTH) == '=') &&
459 (
(edn + SLAPI_ATTR_UNIQUEID_LENGTH + slapi_uniqueIDSize() + 1/=/) == ',')) {
}}}

Are you sure that edn is long enough? same here
{{{
206 PRUint32 uniqueidlen = SLAPI_ATTR_UNIQUEID_LENGTH + slapi_uniqueIDSize() + 1/=/;
207 if (is_tombstone && (strlen(dns[0]) == uniqueidlen) &&
208 !PL_strncasecmp(dns[0], SLAPI_ATTR_UNIQUEID, SLAPI_ATTR_UNIQUEID_LENGTH) &&
209 ((dns[0] + SLAPI_ATTR_UNIQUEID_LENGTH) == '=')) {
}}}
and here
{{{
336 if (!PL_strncasecmp(dn, SLAPI_ATTR_UNIQUEID, SLAPI_ATTR_UNIQUEID_LENGTH) &&
337 (
(mydn = (dn + SLAPI_ATTR_UNIQUEID_LENGTH)) == '=') &&
338 ((mydn = (mydn + slapi_uniqueIDSize() + 1/=*/)) == ',')) {
}}}
and in slapi_rdn_is_conflict

Probably should make this a function so that you can reuse it in several places.

Replying to [comment:9 rmeggins]:

Need to fix this comment:
{{{
/ The entry is already a Tombstone, ignore this delete. /
}}}
We are not ignoring the delete now.
{{{
/ The entry is already a Tombstone, ignore this delete. /
op_result= LDAP_ALREADY_EXISTS;
}}}
We ignore the delete once here in ldbm_back_delete...
{{{
228 rc = plugin_call_plugins(pb, SLAPI_PLUGIN_BE_PRE_DELETE_FN);
229 if (rc == -1)
230 {
231 /
232 * Plugin indicated some kind of failure,
233 * or that this Operation became a No-Op.
234
/
235 slapi_pblock_get(pb, SLAPI_RESULT_CODE, &ldap_result_code);
236 if (!ldap_result_code) {
237 if (LDAP_ALREADY_EXISTS == ldap_result_code) {
238 /
239 * The target entry is already a tombstone.
240 * We need to treat this as a success,
241 * but we need to remove the entry e from the entry cache.
242
/
243 remove_e_from_cache = 1;
244 ldap_result_code = LDAP_SUCCESS;
245 }
246 slapi_pblock_get(pb, SLAPI_RESULT_CODE, &ldap_result_code);
247 }
248 / restore original entry so the front-end delete code can free it /
249 slapi_pblock_get( pb, SLAPI_PLUGIN_OPRETURN, &opreturn );
250 if (!opreturn) {
251 slapi_pblock_set( pb, SLAPI_PLUGIN_OPRETURN, ldap_result_code ? &ldap_result_code : &rc );
252 }
253 / retval is -1 /
254 goto error_return;
}}}

{{{
217 op_result = urp_fixup_rename_entry (addingentry, conflictrdn, parentuniqueid, OP_FLAG_RESURECT_ENTRY|OP_FLAG_TOMBSTONE_ENTRY);
}}}
The previous place you call this with slapi_ch_strdup(parentuniqueid) - do you need to use strdup here too?
It's a bit ugly, but as commented, parentuniqueid is consumed once passed to urp_fixup_XXX_entry.
194 / addingentry and parentuniqueid are consumed, regardless of the result /
So, the first time, I duplicated it and passed it to urp_fixup_add_entry. Next time, it's okay to be consumed. So, I passed parentuniqueid to urp_fixup_rename_entry directly. And when urp_fixup_rename_entry is no need to be called, parentuniqueid is freed.

{{{
220 dn = slapi_sdn_get_dn(sdn);
221 ldap_result_code = slapi_dn_syntax_check(pb, dn, 1);
220 ldap_result_code = slapi_dn_syntax_check(pb, slapi_sdn_get_dn(sdn), 1);
}}}
Now dn will not be set - is there code that expects dn to be set?
Oh, I just go rid of the local variable "dn"... :)
$ egrep "\<dn>" ldbm_add.c
This returns no "char *dn".

{{{
317 LDAPDebug2Args(LDAP_DEBUG_BACKLDBM, "find_entry2modify_only returned NULL parententry pdn: %s, uniqueid: %s\n",
318 slapi_sdn_get_dn(&parentsdn), addr.uniqueid);
}}}
is it possible for addr.uniqueid to be NULL here?
Ah, it might be... I'm going to fix it...

{{{
457 if (!PL_strncasecmp(edn, SLAPI_ATTR_UNIQUEID, SLAPI_ATTR_UNIQUEID_LENGTH) &&
458 ((edn + SLAPI_ATTR_UNIQUEID_LENGTH) == '=') &&
459 (
(edn + SLAPI_ATTR_UNIQUEID_LENGTH + slapi_uniqueIDSize() + 1/=/) == ',')) {
}}}

Are you sure that edn is long enough? same here
{{{
206 PRUint32 uniqueidlen = SLAPI_ATTR_UNIQUEID_LENGTH + slapi_uniqueIDSize() + 1/=/;
207 if (is_tombstone && (strlen(dns[0]) == uniqueidlen) &&
208 !PL_strncasecmp(dns[0], SLAPI_ATTR_UNIQUEID, SLAPI_ATTR_UNIQUEID_LENGTH) &&
209 ((dns[0] + SLAPI_ATTR_UNIQUEID_LENGTH) == '=')) {
}}}
and here
{{{
336 if (!PL_strncasecmp(dn, SLAPI_ATTR_UNIQUEID, SLAPI_ATTR_UNIQUEID_LENGTH) &&
337 (
(mydn = (dn + SLAPI_ATTR_UNIQUEID_LENGTH)) == '=') &&
338 ((mydn = (mydn + slapi_uniqueIDSize() + 1/=*/)) == ',')) {
}}}
and in slapi_rdn_is_conflict

Probably should make this a function so that you can reuse it in several places.
Yes, I should have. I'm turning it to a function and reuse it (with checking the length)...

Thank you, Rich!

diff from the previous patch (389-ds-base-1.2.11 branch)
diffs.txt.1.2.11

git patch file (389-ds-base-1.2.11) -- merged patch 3 and the diff
0001-Ticket-47750-Creating-a-glue-fails-if-one-above-leve.4.patch

Looks good to me, nice work!

Reviewed by Mark (Thank you!!)

Pushed to 389-ds-base-1.2.11 branch
8f39dd5..9e5b333 389-ds-base-1.2.11 -> 389-ds-base-1.2.11
commit 9e5b333

Pushed to master
804d923..160cb3f master -> master
commit 160cb3f

Pushed to 389-ds-base-1.3.2
b12c22b..3f933fe 389-ds-base-1.3.2 -> 389-ds-base-1.3.2
commit 3f933fe

Pushed to 389-ds-base-1.3.1
85cc40c..68b52b7 389-ds-base-1.3.1 -> 389-ds-base-1.3.1
commit 68b52b72a4a682e5778938a3c780f4d5ff08ca0f

Looks like there may still be a problem - with 1.3.2.17:
[27/Jun/2014:16:43:03 +0200] - Turning a tombstone into a tombstone! "fqdn=sauron.pipebreaker.pl+nsuniqueid=9453e301-793511e3-a4d9a4f9-faae4520,cn=computers,cn=accounts,dc=pipebreaker,dc=pl"

What if the entry that is to be deleted is a repl conflict entry of the form fqdn=sauron.pipebreaker.pl+nsuniqueid=9453e301-793511e3-a4d9a4f9-faae4520, but is not a tombstone entry?

I think slapi_is_special_rdn(rdn, RDN_IS_TOMBSTONE) doesn't work correctly in this case. It is saying that the above rdn is a tombstone, but it is not, it is a repl conflict.

Thank you, Rich. It is definitely a very bad bug.

Fix description: The logic in slapi_is_special_rdn to check if the
RDN is a tombstone or not had a problem. Even if the first part of
RDN is not nsuniqueid=<unique_id>, it was returning true. In this
patch, code to check the case is added and it returns false if the
first part of RDN is not nsuniqueid=<unique_id>.

We need to fix this bug on all the versions.

If you have access to the entry, you should not call slapi_is_special_rdn to determine if an entry is a tombstone or a repl conflict. However, if all you have is the RDN, and cannot look at the objectclass or attributes of the entry, then slapi_is_special_rdn() is ok. Are there any places in the code where slapi_is_special_rdn is used, but we could use the actual entry instead?

find . -name "*.c" | xargs egrep -l slapi_is_special_rdn
./servers/slapd/rdn.c <== RDN only
./servers/slapd/back-ldbm/ldbm_delete.c <== entry is available.
{{{
But I'd like to make sure the RDN is not starting with nsuniqueid=<unique_id> here...
587 if (slapi_is_special_rdn(edn, RDN_IS_TOMBSTONE)) {
588 slapi_log_error(SLAPI_LOG_FATAL, "ldbm_back_delete",
589 "conn=%lu op=%d Turning a tombstone into a tombstone! \"%s\"; e: 0x %p, cache_state: 0x%x, refcnt: %d\n",
590 conn_id, op_id, edn, e, e->ep_state, e->ep_refcnt);
591 ldap_result_code= LDAP_OPERATIONS_ERROR;
592 retval = -1;
593 goto error_return;
594 }
595 tombstone_dn = compute_entry_tombstone_dn(edn, childuniqueid);
}}}
./servers/slapd/back-ldbm/ldbm_entryrdn.c <== RDN only

Thanks for reviewing the patch, Mark and Rich!

Pushed to master:
e4b4419..286559d master -> master
commit 286559d

Pushed to 389-ds-base-1.3.2:
7d19149..36b8830 389-ds-base-1.3.2 -> 389-ds-base-1.3.2
commit 36b8830

Pushed to 389-ds-base-1.3.1:
18bc3e8..2f3852f 389-ds-base-1.3.1 -> 389-ds-base-1.3.1
commit 2f3852fea43efde5994b3805d65cf2fc80210d73

Pushed to 389-ds-base-1.2.11:
ed48761..fe0b795 389-ds-base-1.2.11 -> 389-ds-base-1.2.11
commit fe0b795

Issue in ldbm_delete.c

You still need to decrement the inst_ref_count:

{{{
@@ -1267,17 +1385,6 @@ common_return:
plugin_call_plugins (pb, SLAPI_PLUGIN_BE_POST_DELETE_FN);
}

  • / Need to return to cache after post op plugins are called /
  • if (inst) {
  • if (retval && e) { / error case /
  • cache_unlock_entry( &inst->inst_cache, e );
  • CACHE_RETURN( &inst->inst_cache, &e );
  • }
  • if (inst->inst_ref_count) {
  • slapi_counter_decrement(inst->inst_ref_count);
  • }
  • }
    }}}

Need to add this back as it causes any kind of database task to fail(it always thinks its busy because the ref count was never decremented):

{{{
if (inst->inst_ref_count) {
slapi_counter_decrement(inst->inst_ref_count);
}
}}}

Thanks for your finding, Mark.

I understood this must be done regardless of the result.
{{{
if (inst->inst_ref_count) {
slapi_counter_decrement(inst->inst_ref_count);
}
}}}

How about this part? We should not unlock and return in the error case?
{{{
- if (retval && e) { / error case /
- cache_unlock_entry( &inst->inst_cache, e );
- CACHE_RETURN( &inst->inst_cache, &e );
- }
}}}

Replying to [comment:26 nhosoi]:

Thanks for your finding, Mark.

I understood this must be done regardless of the result.
{{{
if (inst->inst_ref_count) {
slapi_counter_decrement(inst->inst_ref_count);
}
}}}

How about this part? We should not unlock and return in the error case?
{{{
- if (retval && e) { / error case /
- cache_unlock_entry( &inst->inst_cache, e );
- CACHE_RETURN( &inst->inst_cache, &e );
- }
}}}

I'm not sure. This code was removed in patch5.

In my case there was no error, just a regular successful delete. I would assume that you removed the cache functions intentionally, so I would just add the "decrement ref count" back and nothing else.

Thanks for clarification, Mark! It seems we have to resurrect this code only...
{{{
if (inst->inst_ref_count) {
slapi_counter_decrement(inst->inst_ref_count);
}
}}}

git patch file (master) -- adding back accidentally removed slapi_counter_decrement.
0001-Ticket-47750-Creating-a-glue-fails-if-one-above-leve.7.patch

Thanks for the fix, Mark!

Pushed to master:
24a4103..788a48f master -> master
commit 788a48f

Pushed to 389-ds-base-1.3.2:
36b8830..dfbc46a 389-ds-base-1.3.2 -> 389-ds-base-1.3.2
commit dfbc46a

Pushed to 389-ds-base-1.3.1:
2f3852f..f18411d 389-ds-base-1.3.1 -> 389-ds-base-1.3.1
commit f18411d98a441f26f1ef2246b62e84b72ce39ec4

Pushed to 389-ds-base-1.2.11:
fe0b795..30f1ba8 389-ds-base-1.2.11 -> 389-ds-base-1.2.11
commit 30f1ba8

This bug "Ticket #47830 - usn tombstone entry not properly created" was introduced by the patch: 0001-Ticket-47750-Creating-a-glue-fails-if-one-above-leve.5.patch.

Reopening...

Pushed to master:
43c6ff2..98cf424 master -> master
commit 98cf424

Pushed to 389-ds-base-1.3.2:
7666910..8c496b1 389-ds-base-1.3.2 -> 389-ds-base-1.3.2
commit 8c496b1

Pushed to 389-ds-base-1.3.1:
0dd3192..51a76a4 389-ds-base-1.3.1 -> 389-ds-base-1.3.1
commit 51a76a4d565f0e93092e589ee7e3064d5371c54a

Pushed to 389-ds-base-1.2.11:
30f1ba8..5b76c17 389-ds-base-1.2.11 -> 389-ds-base-1.2.11
commit 5b76c17

git patch file (master) -- adding back the code to add entry back to cache if it was replaced.
0001-Ticket-47750-Creating-a-glue-fails-if-one-above-leve.8.patch

Attached patch to fix memory leak introduced by incomplete backport.

3be0931..0b5f0d6 389-ds-base-1.3.1 -> 389-ds-base-1.3.1
commit 0b5f0d68233cec976d57597d1c810b79c62528a5

67e4eed..649e726 389-ds-base-1.2.11 -> 389-ds-base-1.2.11
commit 649e726

The previous patch that fixed the leak caused a crash, the new patch (part 2) correctly addresses the leak.

the VERSION.sh change should be in a separate commit, preferably the one that is tagged for the release.

Accessing bep without locking it is a red flag to me. Is there anywhere else in the code where bep is accessed without locking? If so, how is it ensured that it is safe to do so?

Replying to [comment:39 rmeggins]:

the VERSION.sh change should be in a separate commit, preferably the one that is tagged for the release.

Yup sorry forgot that was in there.

Accessing bep without locking it is a red flag to me. Is there anywhere else in the code where bep is accessed without locking? If so, how is it ensured that it is safe to do so?

This was just a partial backport of Noriko's fix from 47834, I'll look into this bep stuff shortly...

Replying to [comment:40 mreynolds]:

Replying to [comment:39 rmeggins]:

the VERSION.sh change should be in a separate commit, preferably the one that is tagged for the release.

Yup sorry forgot that was in there.

Accessing bep without locking it is a red flag to me. Is there anywhere else in the code where bep is accessed without locking? If so, how is it ensured that it is safe to do so?

This was just a partial backport of Noriko's fix from 47834, I'll look into this bep stuff shortly...

I'm not sure why Noriko added and then commented out the "PR_Lock" in the cache_is_in_cache() functions. I'm not sure why the "cache" structure is even passed back to those functions as it's not used. The entry that is passed to these functions is usually a fresh duplicate of the backentry(or it's cache entry has already been locked) - either way I see no need to lock (unless I'm missing something).

Replying to [comment:41 mreynolds]:

Replying to [comment:40 mreynolds]:

Replying to [comment:39 rmeggins]:

the VERSION.sh change should be in a separate commit, preferably the one that is tagged for the release.

Yup sorry forgot that was in there.

Accessing bep without locking it is a red flag to me. Is there anywhere else in the code where bep is accessed without locking? If so, how is it ensured that it is safe to do so?

This was just a partial backport of Noriko's fix from 47834, I'll look into this bep stuff shortly...

I'm not sure why Noriko added and then commented out the "PR_Lock" in the cache_is_in_cache() functions. I'm not sure why the "cache" structure is even passed back to those functions as it's not used. The entry that is passed to these functions is usually a fresh duplicate of the backentry(or it's cache entry has already been locked) - either way I see no need to lock (unless I'm missing something).

Ok. If it is not in the cache, and there is no way it could be referenced by another thread, then it is not necessary to lock it - but then, in that case, the ref count should never be greater than 1.

Replying to [comment:42 rmeggins]:

Replying to [comment:41 mreynolds]:

Replying to [comment:40 mreynolds]:

Replying to [comment:39 rmeggins]:

the VERSION.sh change should be in a separate commit, preferably the one that is tagged for the release.

Yup sorry forgot that was in there.

Accessing bep without locking it is a red flag to me. Is there anywhere else in the code where bep is accessed without locking? If so, how is it ensured that it is safe to do so?

This was just a partial backport of Noriko's fix from 47834, I'll look into this bep stuff shortly...

I'm not sure why Noriko added and then commented out the "PR_Lock" in the cache_is_in_cache() functions. I'm not sure why the "cache" structure is even passed back to those functions as it's not used. The entry that is passed to these functions is usually a fresh duplicate of the backentry(or it's cache entry has already been locked) - either way I see no need to lock (unless I'm missing something).

Ok. If it is not in the cache, and there is no way it could be referenced by another thread, then it is not necessary to lock it - but then, in that case, the ref count should never be greater than 1.

The tombstones are added tentatively to the cache, but in ldbm_add() if we resurrect a tombstone then it could be in the cache (and potentially removed by another thread?) Obviously the ref count can be greater than 1 in the resurrection case. I'm going to backport more of Noriko's caching fix(she's uses a monitor, instead of a mutex for the cache lock), and I'll add the locking back to the new cache_is_in_cache/refcnt functions.

test patch wiht partial backport from ticket 47834
0001-Ticket-47750-Entry-cache-part-2.patch

Looks good! I found one difference with the master branch.
This is not an error case, so instead of -1, we return 1.
{{{
… … entrycache_add_int(struct cache cache, struct backentry e, int state,
1350 1351 (*alt)->ep_refcnt++;
1351 1352 LOG("the entry %s already exists. returning existing entry %s (state: 0x%x)\n",
1352 1353 ndn, backentry_get_ndn(my_alt), state);
1354 cache_unlock(cache);
1355 return -1;
1353 1356 }
}}}

git patch file (389-ds-base-1.2.11) -- Additional patch for the previous patch 0001-Ticket-47750-Entry-cache-part-2.patch
0002-Ticket-47750-Entry-cache-part-2.patch

git patch file (master) -- Following the 1.2.11 patch by Mark, enabiling cache lock in cache_is_in_cache and cache_has_otherref
0001-Ticket-47750-Creating-a-glue-fails-if-one-above-leve.9.patch

Pushed Mark's patch 000[12]-Ticket-47750-Entry-cache-part-2.patch​ to 389-ds-base-1.2.11 on behalf of him.
649e726..189eb8a 389-ds-base-1.2.11 -> 389-ds-base-1.2.11
commit 2c16d7d
commit 189eb8a

Pushed 0001-Ticket-47750-Creating-a-glue-fails-if-one-above-leve.9.patch​ to master:
66e43ae..5ae1307 master -> master
commit 5ae1307

Pushed to 389-ds-base-1.3.3:
9fe0713..ef76678 389-ds-base-1.3.3 -> 389-ds-base-1.3.3
commit ef76678

Pushed the equivalent patches to 1.3.1, as well:
7b7d092..154abc9 389-ds-base-1.3.1 -> 389-ds-base-1.3.1
commit 112203f49844bdc47649a6305b64d00485193671
commit 891d03a3d36fc46d3851eeaf2582bd3575fd5823
commit 44b66b4ceabc78c21ba48a7fe153915663dbc749
commit 154abc975af643ebecc70750eddf5921e2befc12

Refresh cache entry after calling the delete betxn postop plugins
0001-Ticket-47750-Need-to-refresh-cache-entry-after-calle.patch

Why are we seeing this problem now? Seems like this should have been a problem for a long time.

Replying to [comment:51 rmeggins]:

Why are we seeing this problem now? Seems like this should have been a problem for a long time.

The entry cache redesign that Noriko has done is quite new. So it is a regression from her recent work.

fbafee5..4a5eee6 master -> master
commit 4a5eee6
Author: Mark Reynolds mreynolds@redhat.com
Date: Tue Dec 16 10:37:02 2014 -0500

69df018..6164553 389-ds-base-1.3.3 -> 389-ds-base-1.3.3
commit 6164553

9851929..9295966 389-ds-base-1.3.2 -> 389-ds-base-1.3.2
commit 9295966

250fc63..153e574 389-ds-base-1.3.1 -> 389-ds-base-1.3.1
commit 153e574ceeefec5af05a529b550c69441994ba2a

912deee..5ef3183 389-ds-base-1.2.11 -> 389-ds-base-1.2.11
commit 5ef3183

Testing copr build (2014_12_17-1.fc20 mreynolds-389-ds-base) with ipa 4.1.2 branch. There is a crash that could be related to this last fix: {{{ (gdb) where #0 __GI___pthread_mutex_lock (mutex=mutex@entry=0x8) at ../nptl/pthread_mutex_lock.c:66 #1 0x00007fc581206991 in PR_ExitMonitor (mon=0x0) at ../../../nspr/pr/src/pthreads/ptsynch.c:615 #2 0x00007fc5778db8f9 in cache_unlock_entry (cache=cache@entry=0x7fc584b87968, e=<optimized out="">) at ldap/servers/slapd/back-ldbm/cache.c:1568 #3 0x00007fc577912d36 in ldbm_back_delete (pb=0x7fc55b7f5ae0) at ldap/servers/slapd/back-ldbm/ldbm_delete.c:1275 #4 0x00007fc582dd30b0 in op_shared_delete (pb=pb@entry=0x7fc55b7f5ae0) at ldap/servers/slapd/delete.c:364 #5 0x00007fc582dd3373 in do_delete (pb=pb@entry=0x7fc55b7f5ae0) at ldap/servers/slapd/delete.c:128 #6 0x00007fc5832eb38e in connection_dispatch_operation (pb=0x7fc55b7f5ae0, op=0x7fc584f2bba0, conn=0x7fc570386130) at ldap/servers/slapd/connection.c:650 #7 connection_threadmain () at ldap/servers/slapd/connection.c:2534 #8 0x00007fc58120be3b in _pt_root (arg=0x7fc584b3ea40) at ../../../nspr/pr/src/pthreads/ptthread.c:212 #9 0x00007fc580babee5 in start_thread (arg=0x7fc55b7f6700) at pthread_create.c:309 #10 0x00007fc5808dab8d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 }}} The crash is systematic: * install 389-ds * install ipa 4.1.2 (I was testing master branch from today) * run ipa-server-install -a Secret123 -p Secret123 --hostname=xxx -r <realm> -n <domain> --setup-dns --no-forwarders --no-reverse -U {{{ Configuring Kerberos KDC (krb5kdc): Estimated time 30 seconds [1/10]: adding sasl mappings to the directory [2/10]: adding kerberos container to the directory [3/10]: configuring KDC [4/10]: initialize kerberos container [5/10]: adding default ACIs [6/10]: creating a keytab for the directory [error] NetworkError: cannot connect to 'ldapi://<socket>': Unexpected error - see /var/log/ipaserver-install.log for details: NetworkError: cannot connect to 'ldapi://<socket>': }}} * After install a sigsev is reported in /var/log/audit

Attachment 0001-Ticket-47750-During-delete-operation-do-not-refresh-.patch​ added
Do not refresh cache entry if it is a tomstone

Your fix looks good. Acked.

c3389a4..4ae6794 master -> master
commit 4ae6794
Author: Mark Reynolds mreynolds@redhat.com
Date: Thu Dec 18 09:08:52 2014 -0500

6e4e21f..f634732 389-ds-base-1.3.3 -> 389-ds-base-1.3.3
commit f634732

9793fc4..6c4ad25 389-ds-base-1.3.2 -> 389-ds-base-1.3.2
commit 6c4ad25

153e574..f5458c4 389-ds-base-1.3.1 -> 389-ds-base-1.3.1
commit f5458c4a03cc7a3a369a80999cf6643d34bb620b

3d69f15..83ecd45 389-ds-base-1.2.11 -> 389-ds-base-1.2.11
commit 83ecd45

A crash is possibly related to that fix/ticket I was using an old M1 instance containing entries. I updated with master build, restarted it. Lookup all entries, then issued a DEL. So the entry was in the entry cache. {{{ ldapsearch -h localhost ... -b "dc=example,dc=com" ... dn: cn=new_account19,cn=staged user,dc=example,dc=com telephoneNumber: 27185 objectClass: top objectClass: person sn: new_account19 cn: new_account19 [12/Jan/2015:17:50:22 +0100] conn=3 op=1 DEL dn="cn=new_account19,cn=staged user,dc=example,dc=com" }}} Then it triggers this crash {{{ Program terminated with signal 11, Segmentation fault. #0 PR_ExitMonitor (mon=0x0) at ../../../nspr/pr/src/pthreads/ptsynch.c:547 547 if (!pthread_equal(mon->owner, self)) Missing separate debuginfos, use: debuginfo-install audit-libs-2.2.2-2.fc17.x86_64 keyutils-libs-1.5.5-2.fc17.x86_64 libselinux-2.1.10-3.fc17.x86_64 nss-softokn-3.14.3-1.fc17.x86_64 nss-softokn-freebl-3.14.3-1.fc17.x86_64 nss-util-3.14.3-1.fc17.x86_64 sqlite-3.7.11-3.fc17.x86_64 zlib-1.2.5-7.fc17.x86_64 (gdb) where #0 PR_ExitMonitor (mon=0x0) at ../../../nspr/pr/src/pthreads/ptsynch.c:547 #1 0x00007f6c259cb9c9 in cache_unlock_entry (cache=0x2903868, e=0x7f6bd8005330) at ../workspaces/389-master-branch/ds/ldap/servers/slapd/back-ldbm/cache.c:1568 #2 0x00007f6c25a1786d in ldbm_back_delete (pb=0x7f6c0a7fbae0) at ../workspaces/389-master-branch/ds/ldap/servers/slapd/back-ldbm/ldbm_delete.c:1275 #3 0x00007f6c290f45fa in op_shared_delete (pb=0x7f6c0a7fbae0) at ../workspaces/389-master-branch/ds/ldap/servers/slapd/delete.c:364 #4 0x00007f6c290f3dcb in do_delete (pb=0x7f6c0a7fbae0) at ../workspaces/389-master-branch/ds/ldap/servers/slapd/delete.c:128 #5 0x00000000004160ba in connection_dispatch_operation (conn=0x7f6c2012b410, op=0x2aa8740, pb=0x7f6c0a7fbae0) at ../workspaces/389-master-branch/ds/ldap/servers/slapd/connection.c:650 #6 0x00000000004180c9 in connection_threadmain () at ../workspaces/389-master-branch/ds/ldap/servers/slapd/connection.c:2534 #7 0x0000003c53028b46 in _pt_root (arg=0x2a9ef20) at ../../../nspr/pr/src/pthreads/ptthread.c:204 #8 0x0000003bee407d14 in start_thread (arg=0x7f6c0a7fc700) at pthread_create.c:309 #9 0x0000003bee0f168d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 (gdb) frame 1 #1 0x00007f6c259cb9c9 in cache_unlock_entry (cache=0x2903868, e=0x7f6bd8005330) at ../workspaces/389-master-branch/ds/ldap/servers/slapd/back-ldbm/cache.c:1568 1568 if (PR_ExitMonitor(e->ep_mutexp)) { (gdb) print *e $1 = {ep_type = 0, ep_lrunext = 0x0, ep_lruprev = 0x0, ep_id = 35, ep_state = 1 '\001', ep_refcnt = 2, ep_size = 4646, ep_entry = 0x7f6bd80053a0, ep_vlventry = 0x0, ep_dn_link = 0x0, ep_id_link = 0x0, ep_uuid_link = 0x0, ep_mutexp = 0x0} (gdb) print *e->ep_entry $2 = {e_sdn = {flag = 14 '\016', udn = 0x7f6bd8005280 "nsuniqueid=b891c520-601811e4-a12de4f8-88f6fc48,cn=new_account19,cn=staged user,dc=example,dc=com", dn = 0x7f6bd8008130 "nsuniqueid=b891c520-601811e4-a12de4f8-88f6fc48,cn=new_account19,cn=staged user,dc=example,dc=com", ndn = 0x7f6bd8008300 "nsuniqueid=b891c520-601811e4-a12de4f8-88f6fc48,cn=new_account19,cn=staged user,dc=example,dc=com", ndn_len = 96}, e_srdn = {flag = 0 '\000', rdn = 0x7f6bd8003190 "nsuniqueid=b891c520-601811e4-a12de4f8-88f6fc48,cn=new_account19", rdns = 0x0, butcheredupto = -1, nrdn = 0x7f6bd8003000 "nsuniqueid=b891c520-601811e4-a12de4f8-88f6fc48,cn=new_account19", all_rdns = 0x7f6bd80054a0, all_nrdns = 0x7f6bd800e080}, e_uniqueid = 0x7f6bd8005580 "b891c520-601811e4-a12de4f8-88f6fc48", e_dncsnset = 0x7f6bd8005530, e_maxcsn = 0x7f6bd8005560, e_attrs = 0x7f6bd80055b0, e_deleted_attrs = 0x0, e_virtual_attrs = 0x0, e_virtual_watermark = 0, e_virtual_lock = 0x7f6bd8004500, e_extension = 0x7f6bd8002aa0, e_flags = 1 '\001', e_aux_attrs = 0x0} }}}

Replying to [comment:58 tbordaz]:

A crash is possibly related to that fix/ticket

I was using an old M1 instance containing entries. I updated with master build, restarted it. Lookup all entries, then issued a DEL. So the entry was in the entry cache.
{{{
Program terminated with signal 11, Segmentation fault.

0 PR_ExitMonitor (mon=0x0) at ../../../nspr/pr/src/pthreads/ptsynch.c:547

547 if (!pthread_equal(mon->owner, self))
(gdb) where

0 PR_ExitMonitor (mon=0x0) at ../../../nspr/pr/src/pthreads/ptsynch.c:547

1 0x00007f6c259cb9c9 in cache_unlock_entry (cache=0x2903868, e=0x7f6bd8005330)

at ../workspaces/389-master-branch/ds/ldap/servers/slapd/back-ldbm/cache.c:1568

2 0x00007f6c25a1786d in ldbm_back_delete (pb=0x7f6c0a7fbae0)

at ../workspaces/389-master-branch/ds/ldap/servers/slapd/back-ldbm/ldbm_delete.c:1275

Line 1275 in ldbm_delete.c lines up with cache_unlock_entry() from a previous patch. However, the latest patch fixes this exact crash. I believe your workspace needs a "git pull".

Metadata Update from @nhosoi:
- Issue assigned to nhosoi
- Issue set to the milestone: 1.2.11.30

2 years ago

Login to comment on this ticket.