#50490 objects and memory leaks
Opened 5 months ago by lkrispen. Modified 4 months ago

When I investigated memory leaks, I came across one type of leak with objects, which also raises the question of the safety of the object implementation.

The leak is (part of it; there are many more allocations inside replica_new):

 Indirect leak of 272 byte(s) in 1 object(s) allocated from:
     #0 0x7f6780d19210 in calloc (/lib64/libasan.so.5+0xef210)
     #1 0x7f678050d8d7 in slapi_ch_calloc /home/lkrispen/TEST/caa/ws/389-ds-base/ldap/servers/slapd/ch_malloc.c:175
     #2 0x7f67713b106e in replica_new_from_entry /home/lkrispen/TEST/caa/ws/389-ds-base/ldap/servers/plugins/replication/repl5_replica.c:159
     #3 0x7f67713b0e0c in replica_new /home/lkrispen/TEST/caa/ws/389-ds-base/ldap/servers/plugins/replication/repl5_replica.c:131
     #4 0x7f67713a16be in multimaster_mtnode_construct_replicas /home/lkrispen/TEST/caa/ws/389-ds-base/ldap/servers/plugins/replication/repl5_mtnode_ext.c:77
     #5 0x7f67713a097a in multimaster_start /home/lkrispen/TEST/caa/ws/389-ds-base/ldap/servers/plugins/replication/repl5_init.c:798
     #6 0x7f6780622b22 in plugin_call_func /home/lkrispen/TEST/caa/ws/389-ds-base/ldap/servers/slapd/plugin.c:2034
     #7 0x7f6780622787 in plugin_call_one /home/lkrispen/TEST/caa/ws/389-ds-base/ldap/servers/slapd/plugin.c:1983
     #8 0x7f67806214c4 in plugin_dependency_startall /home/lkrispen/TEST/caa/ws/389-ds-base/ldap/servers/slapd/plugin.c:1737
     #9 0x7f6780622716 in plugin_startall /home/lkrispen/TEST/caa/ws/389-ds-base/ldap/servers/slapd/plugin.c:1955
     #10 0x4483e1 in main /home/lkrispen/TEST/caa/ws/389-ds-base/ldap/servers/slapd/main.c:1153
     #11 0x7f677cf8211a in __libc_start_main (/lib64/libc.so.6+0x2311a)

So there is a replica object created by a call to replica_new() and then added as an object to the mapping tree:

 ext->replica = object_new(r, replica_destroy);

It provides a destructor, replica_destroy(), but this is only called when object_release() is invoked with refcount==1 (i.e. the release that drops the count to 0).
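For readers unfamiliar with this wrapper, here is a minimal, self-contained sketch of the refcounting pattern described in this ticket, using plain C11 atomics instead of the slapi wrappers. All names, fields, and return values are simplified assumptions for illustration, not the actual slapd implementation:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical, simplified version of the generic object wrapper. */
typedef void (*obj_destructor_fn)(void *data);

typedef struct object
{
    atomic_long refcnt;          /* starts at 1 on object_new() */
    void *data;                  /* the wrapped payload, e.g. a Replica */
    obj_destructor_fn destructor;
} Object;

Object *
object_new(void *data, obj_destructor_fn destructor)
{
    Object *o = calloc(1, sizeof(Object));
    atomic_store(&o->refcnt, 1);
    o->data = data;
    o->destructor = destructor;
    return o;
}

void
object_acquire(Object *o)
{
    assert(o != NULL);
    atomic_fetch_add(&o->refcnt, 1);
}

/* Returns the refcount after the release; 0 means the object was destroyed. */
long
object_release(Object *o)
{
    assert(o != NULL);
    long after = atomic_fetch_sub(&o->refcnt, 1) - 1;
    if (after == 0) {
        if (o->destructor) {
            o->destructor(o->data);
        }
        o->data = NULL;          /* make it harder to reuse a dangling pointer */
        o->destructor = NULL;
        free(o);
    }
    return after;
}
```

With this pattern the object's lifetime depends entirely on every acquire being matched by exactly one release, which is the balance this ticket shows is broken.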

It is hard to find where the balance of object_acquire() and object_release() gets broken.
I did some investigation and tentative fixes, adding calls to object_release() in places where I thought they were missing - with mixed success. In some test scenarios the leak disappeared; in others the refcount at the last call of object_release() was only reduced to 2 (one release still missing); in others I got a crash because the object had already been freed before the last call.

So the balance of our acquire and release calls is definitely a mess, but while investigating this I also began to doubt how safe our object implementation is.
The only protection is a refcount that prevents freeing an object while it is in use, but there are conditions where this can go wrong.

First look at object_acquire:

 object_acquire(Object *o)
     PR_ASSERT(NULL != o);
     slapi_atomic_incr_64(&(o->refcnt), __ATOMIC_RELEASE);

There is no guarantee that the object is still valid when this is called. We have, e.g., references in the prp replication protocol struct, which are then used again and again by the replication protocol.

And looking at object_release() we see race conditions that the refcount does not protect against:

 object_release(Object *o)
     PRInt32 refcnt_after_release;

     PR_ASSERT(NULL != o);
     refcnt_after_release = slapi_atomic_decr_64(&(o->refcnt), __ATOMIC_ACQ_REL);
     if (refcnt_after_release == 0) {
         /* Object can be destroyed */
         if (o->destructor)
             o->destructor(o->data);
         /* Make it harder to reuse a dangling pointer */
         o->data = NULL;
         o->destructor = NULL;
         slapi_ch_free((void **)&o);
     }

If acquire and release are called in parallel for an object with a refcount of 1, release could decrement it to 0 before acquire has incremented it. The single operations are atomic, but the order in which they run is not defined: one thread can decrement it and free the object, while the other thread increments it and is happy, but the object can be freed before it is used.
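One common way to close this window (a sketch under assumed names, not existing slapd code) is to make acquire a fallible operation that refuses to resurrect an object whose refcount has already reached 0, using a compare-and-swap loop:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch: acquire succeeds only if the refcount is still positive.
 * If another thread has already released the count to 0 (and may be
 * freeing the object), the caller gets false and must not touch it. */
static bool
object_try_acquire(atomic_long *refcnt)
{
    long cur = atomic_load(refcnt);
    while (cur > 0) {
        /* CAS: increment only if the count is unchanged and positive.
         * On failure, cur is reloaded with the current value and we
         * retry (or give up if it has dropped to 0). */
        if (atomic_compare_exchange_weak(refcnt, &cur, cur + 1)) {
            return true;     /* reference taken */
        }
    }
    return false;            /* object is being destroyed */
}
```

With this shape, a thread that loses the race gets false back and must look the object up again instead of using freed memory; the plain unconditional increment in object_acquire() cannot distinguish these cases.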

Yeah, I've had plenty of issues with this object acquiring and releasing when I worked on CleanAllRUV. It is definitely a mess, but at the same time we don't seem to have any known crashes related to this - at least I'm not aware of any. I guess the crash threat is more around removing a replica while the replica is processing updates?

Adding a lock might work, but I worry about the perf impact.

Metadata Update from @mreynolds:
- Custom field origin adjusted to None
- Custom field reviewstatus adjusted to None

5 months ago

Yes, we should not put any effort into the object mechanism. I am thinking about an approach in the other direction, at least for the replica objects.
A replica object is created and the refcount is set to 1; then there are (or should be) only balanced calls to acquire/release, and at shutdown we do one more release, which should free the replica. So in short, we create a replica and it lives until shutdown - we could just forget about the acquire/release calls and use it until shutdown. At shutdown we have stopped the worker threads, closed the changelog, and stopped the agreements; there is nothing left that can access the replica.
The case of removing a replica is slightly different: we have to remove the agreements first (they are child entries; otherwise we could not delete the replica), close the changelog, and set the reference to the replica object to NULL. This lets any further update operation detect that there is no replica. We probably cannot free it, since some operations might still hold a reference to it, but the reference in the mapping tree is lost and we will leak it anyway.
I will investigate further in this direction, comments welcome
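The direction described above could look roughly like the following sketch (all names are hypothetical; this illustrates the proposed lifetime, not real code): the replica lives from startup to shutdown without refcounting, and removal only detaches the pointer.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical replica struct; the payload is just for illustration. */
typedef struct replica
{
    const char *root;
} Replica;

/* Lives from startup to shutdown; no acquire/release bookkeeping. */
static Replica *g_replica = NULL;

void
replica_startup(const char *root)
{
    g_replica = calloc(1, sizeof(Replica));
    g_replica->root = root;
}

/* Removing a replica: detach the pointer first, so any further update
 * operation sees "no replica". The old struct is not freed here, since
 * in-flight operations may still hold a reference to it (as the ticket
 * notes, that detached struct would then leak). */
Replica *
replica_detach(void)
{
    Replica *old = g_replica;
    g_replica = NULL;
    return old;
}

/* Called only after worker threads are stopped, the changelog is closed,
 * and the agreements are stopped - nothing can access the replica now. */
void
replica_shutdown(Replica *r)
{
    free(r);
}
```

The safety argument rests entirely on the shutdown ordering: once all consumers are stopped, a single free is safe without any refcount.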

How hard would it be to remove the "object" mechanism from the replica struct, and then re-approach how we do refcounting?

Just a comment

I agree with @mreynolds and @lkrispen: the objects code would benefit from a lock, but considering the possible impact on performance, and that we have not seen any crash, we may postpone adding one.

I think objects are not designed for long-lived structures. We continuously access (acquire/release) an object that is guaranteed to be there until the end. It is a waste of energy and, like you said @lkrispen, it is buggy anyway because acquire/release are not balanced.
I like the idea of moving long-lived structures out of the object framework. We can allocate a replica and keep it until shutdown. IMHO, if we stop all workers/tasks, it is safe to free the replica at shutdown.

Metadata Update from @mreynolds:
- Issue set to the milestone: 1.4.1

4 months ago
