#335 transaction retries need to be cache aware

Created 4 years ago by rmeggins
Modified 13 days ago

The code that resets the original entries/mods/etc. when retrying deadlocked transactions in ldbm_back_delete et. al. need to be cache aware when deleting entries that are cached - for example, ldbm_delete.c:485

            backentry_free(&e);

The problem is that e is cached, so cannot be freed like this. Freeing cached entries seems to cause problems in the server, assertion failures, other crashes.

Agh.. You are right.

We have to do something like this...
cache_find_dn/id
if exists, cache_return; then cache_remove
otherwise backentry_free

Could this be the cause of the other crash bug, e.g., bz808770?

Replying to [comment:1 nhosoi]:

Agh.. You are right.

We have to do something like this...
cache_find_dn/id
if exists, cache_return; then cache_remove
otherwise backentry_free

Could this be the cause of the other crash bug, e.g., bz808770?

No, because the txn retry code is only in 1.2.11 - I found it trying to reproduce 808770 with 1.2.11 . . .

Rich, I'm wondering if you are already working on this ticket?
This ticket is assigned to me. Is it okay for me to start looking into this?

Replying to [comment:3 nhosoi]:

Rich, I'm wondering if you are already working on this ticket?
This ticket is assigned to me. Is it okay for me to start looking into this?

I'm not working on this. Feel free to take it.

Replying to [comment:4 rmeggins]:

I'm not working on this. Feel free to take it.
Thanks!

Rich, do you happen to have a reproducer (for the assertion failure / crashes) handy?

Replying to [comment:6 nhosoi]:

Rich, do you happen to have a reproducer (for the assertion failure / crashes) handy?

not easily - what we really need is a way to "inject" deadlocks - that is, we need to be able to trigger a deadlock programatically rather than having to put the directory server under a heavy load with many updates + searches at the same time

Replying to [comment:7 rmeggins]:

Replying to [comment:6 nhosoi]:

Rich, do you happen to have a reproducer (for the assertion failure / crashes) handy?

not easily - what we really need is a way to "inject" deadlocks - that is, we need to be able to trigger a deadlock programatically rather than having to put the directory server under a heavy load with many updates + searches at the same time

I see. Thanks... We may be able to make it happen easier once we disable the backend SERIALLOCK. BTW, the SERIALLOCK is not applied to the replicated operation. Are you getting the assertion failure/crash in the MMR topology?

Replying to [comment:8 nhosoi]:

Replying to [comment:7 rmeggins]:

Replying to [comment:6 nhosoi]:

Rich, do you happen to have a reproducer (for the assertion failure / crashes) handy?

not easily - what we really need is a way to "inject" deadlocks - that is, we need to be able to trigger a deadlock programatically rather than having to put the directory server under a heavy load with many updates + searches at the same time

I see. Thanks... We may be able to make it happen easier once we disable the backend SERIALLOCK. BTW, the SERIALLOCK is not applied to the replicated operation. Are you getting the assertion failure/crash in the MMR topology?

My test replicates the same setup as the reported problem, which does use MMR, but I don't think you need to use MMR to reproduce the problem - we just need some way to make calls to db->put() db->get() dbc->c_put() dbc->c_get() and dbc->c_close() return DB_LOCK_DEADLOCK errors without having to generate large loads.

Thanks! I could reproduce the problem quite easily on the standalone server. I'm seeing the assertion failures and crashes. Continue debugging...

Fix description:
When libdb returns DEADLOCK and backend update function retries
the operation, the target entry is reset to the original shape.
The target entry could be or could not be in the entry cache.
Regardless of the status, the original code just released the
entry with backentry_free before going into the next loop, which
causes the cache error.

This patch checks the status of the entry. If it is in the entry
cache, remove it from the entry cache and add a new entry back to
the cache if necessary. To get the accurate cache status of each
entry, the output argument cache_res to id2entry_add_ext is added.

Additinally, error checking for the conflict value in index_add_mods
was week (curr_attr). This patch is adding the check.

Reviewed by Rich (Thank you!!)

Pushed to master.

$ git merge ticket335
Updating a50f8c0. bddb5a4
Fast-forward
ldap/servers/slapd/back-ldbm/cache.c | 2 +-
ldap/servers/slapd/back-ldbm/id2entry.c | 16 ++++++---
ldap/servers/slapd/back-ldbm/import-threads.c | 2 +-
ldap/servers/slapd/back-ldbm/index.c | 35 +++++++++++---------
ldap/servers/slapd/back-ldbm/ldbm_add.c | 20 ++++++++++-
ldap/servers/slapd/back-ldbm/ldbm_delete.c | 33 ++++++++++++++-----
ldap/servers/slapd/back-ldbm/ldbm_modify.c | 41 +++++++++++++++++++-----
ldap/servers/slapd/back-ldbm/ldbm_modrdn.c | 23 +++++++++++--
ldap/servers/slapd/back-ldbm/proto-back-ldbm.h | 2 +-
9 files changed, 129 insertions(+), 45 deletions(-)
$ git push
Counting objects: 88, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (34/34), done.
Writing objects: 100% (34/34), 6.17 KiB, done.
Total 34 (delta 30), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
a50f8c0. bddb5a4 master -> master

Fix description:
Commit bddb5a4 includes this fix:

Additinally, error checking for the conflict value in index_add_mods
was week (curr_attr). This patch is adding the check.

The fix was incomplete. If an add-attempted attribute type itself
does not exist in the entry (not only the attribute value) after
mods applied, the attribute type/value should not have been indexed.
This patch fixes it.

It's been reviewed by Rich on bz832560. (Thanks, Rich!!)

Pushed to master.

$ git push
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 992 bytes, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
79799c9. 420c6c2 master -> master

Pushed to 389-ds-base-1.2.11, as well.

$ git push origin ds1211-local:389-ds-base-1.2.11
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 1.02 KiB, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
ff11ccc. 7d6801f ds1211-local -> 389-ds-base-1.2.11

master:
commit changeset 420c6c2/389-ds-base
Author: Noriko Hosoi nhosoi@totoro.usersys.redhat.com
Date: Fri Jun 15 16:32:09 2012 -0700

1.2.11:
commit changset 7d6801f/389-ds-base
Author: Noriko Hosoi nhosoi@totoro.usersys.redhat.com
Date: Fri Jun 15 16:32:09 2012 -0700

Added initial screened field value.

13 days ago

Metadata Update from @nhosoi:
- Issue assigned to nhosoi
- Issue set to the milestone: 1.2.11.rc1

Login to comment on this ticket.

ack

Database - General

defect

cancel