#47489 Under specific values of nsDS5ReplicaName, replication may get broken or updates missing
Closed: wontfix None Opened 10 years ago by jyotidas81.

In a replication environment, if the changelog db file name contains the extension string more than once, the changelog file is recreated when db2ldif and ldif2db are run on the master/hub instance.

Ex: d736e482-198111e1-8d7bedb4-8c53b85f_502ce263000000020000.db4
In this file name, "db4" appears twice: once as the extension and once inside the replica name string ("8d7bedb4").

There is a logic problem in the function below, which checks whether the filename ends with the extension. It calls strstr() to search for "ext", and strstr() returns the first occurrence of "ext" in the filename. If "ext" occurs more than once in the file name, the first occurrence is not the one at the end, so the function always returns false, which results in multiple changelog db files being created.

====
filename: cl5_api.c

/*
 * return 1: true (the "filename" ends with "ext")
 * return 0: false
 */
static int _cl5FileEndsWith(const char *filename, const char *ext)
{
    char *p = NULL;
    int flen = strlen(filename);
    int elen = strlen(ext);
    if (0 == flen || 0 == elen)
    {
        return 0;
    }
    p = strstr(filename, ext);
    if (NULL == p)
    {
        return 0;
    }
    if (p - filename + elen == flen)
    {
        return 1;
    }
    return 0;
}
=====
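
The behaviour can be checked in isolation. The following is a minimal sketch, not code from the server: ends_with_first_match() is a renamed copy of the check quoted above, demo_first_match() is a hypothetical helper, and "replica_0001.db4" is a made-up file name. It assumes the extension is passed without the leading dot ("db4"), as the description above suggests.

```c
#include <assert.h>
#include <string.h>

/* Renamed copy of the first-occurrence check quoted above. */
static int ends_with_first_match(const char *filename, const char *ext)
{
    size_t flen = strlen(filename);
    size_t elen = strlen(ext);
    const char *p;

    if (flen == 0 || elen == 0) {
        return 0;
    }
    p = strstr(filename, ext);  /* first occurrence only */
    if (p == NULL) {
        return 0;
    }
    return ((size_t)(p - filename) + elen == flen) ? 1 : 0;
}

/* Hypothetical helper: exercises one plain name and the name from this ticket. */
static void demo_first_match(void)
{
    /* Made-up name: "db4" appears only as the extension. */
    const char *plain = "replica_0001.db4";
    /* Name from the ticket: "db4" also appears inside "8d7bedb4". */
    const char *clash =
        "d736e482-198111e1-8d7bedb4-8c53b85f_502ce263000000020000.db4";

    assert(ends_with_first_match(plain, "db4") == 1);
    /* Rejected although the name does end with "db4". */
    assert(ends_with_first_match(clash, "db4") == 0);
}
```

Calling demo_first_match() from a small main() runs both checks. The second assertion documents the misbehaviour: the match inside the replica name is found first, so the offset test fails.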

I have modified this function to fix this issue. Could you please verify the same and include the fix in the master branch?

/*
 * return 1: true (the "filename" ends with "ext")
 * return 0: false
 */
static int _cl5FileEndsWith(const char *filename, const char *ext)
{
    char *p = NULL;
    int flen = strlen(filename);
    int elen = strlen(ext);
    if (0 == flen || 0 == elen)
    {
        return 0;
    }
    p = strstr(filename, ext);
    if (NULL == p)
    {
        return 0;
    }

    do {
        if (p - filename + elen == flen)
        {
            return 1;
        }
        p = strstr(p+elen, ext);
    } while ( p != NULL );

    return 0;
}
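
For comparison, the same question can be answered without searching at all, by comparing only the tail of the name. This is a sketch under stated assumptions and not the change that was committed for this ticket, which is not shown here: ends_with_suffix() and check_suffix_cases() are hypothetical names, and "x.db4.bak" is a made-up file name.

```c
#include <assert.h>
#include <string.h>

/*
 * return 1: true (the "filename" ends with "ext")
 * return 0: false
 * Compares the last strlen(ext) bytes directly, so occurrences of "ext"
 * earlier in the name cannot influence the result.
 */
static int ends_with_suffix(const char *filename, const char *ext)
{
    size_t flen = strlen(filename);
    size_t elen = strlen(ext);

    if (flen == 0 || elen == 0 || elen > flen) {
        return 0;
    }
    return strcmp(filename + (flen - elen), ext) == 0;
}

/* Hypothetical helper: a few cases, including the name from this ticket. */
static void check_suffix_cases(void)
{
    assert(ends_with_suffix(
        "d736e482-198111e1-8d7bedb4-8c53b85f_502ce263000000020000.db4",
        "db4") == 1);
    assert(ends_with_suffix("x.db4.bak", "db4") == 0); /* made-up name */
    assert(ends_with_suffix("b4", "db4") == 0);        /* ext longer than name */
}
```

One reason to consider this form is that it does not depend on how many times the extension occurs, or on whether occurrences overlap.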

Thanks and Regards,
Jyoti


Hi,

Can anyone please verify this fix?

Thanks in advance.

Regards,
Jyoti

Here is the current status

  • Thanks for nailing down the problematic routine. I was able to reproduce the failure of _cl5DBOpen.
    To reproduce it, I created a single Master. Then, before doing any update, I edited dse.ldif and
    changed the 'nsDS5ReplicaName' of the replica.

{{{
from: c7c6377c-196e11e3-831c8895-1f2ce016
to: c7c6377c-196e11e3-831c88db-1f2ce016 (change '95' -> 'db' in the 3rd component)
}}}

After starting DS, I see the following in the logs:

{{{
[09/Sep/2013:16:37:34 +0200] NSMMReplicationPlugin - changelog program - _cl5AppInit: fetched backend dbEnv (1efff10)
[09/Sep/2013:16:37:34 +0200] NSMMReplicationPlugin - changelog program - _cl5DBOpen: opened 0 existing databases in /var/lib/dirsrv/slapd-master/changelogdb
[09/Sep/2013:16:37:51 +0200] NSMMReplicationPlugin - replica_add_by_dn: added dn (dc=com)
[09/Sep/2013:16:37:51 +0200] NSMMReplicationPlugin - _replica_configure_ruv: No ruv tombstone found for replica dc=com. Created a new one
[09/Sep/2013:16:37:51 +0200] NSMMReplicationPlugin - replica_delete_by_dn: removed dn (dc=com)
[09/Sep/2013:16:37:51 +0200] NSMMReplicationPlugin - changelog program - _cl5GetDBFile: no DB object found for database /var/lib/dirsrv/slapd-master/changelogdb/4ade9183-195d11e3-831cdb94-1f2ce016_522ddd3f000000010000.db
[09/Sep/2013:16:37:51 +0200] NSMMReplicationPlugin - changelog program - cl5GetOperationCount: could not get DB object for replica
[09/Sep/2013:16:37:51 +0200] NSMMReplicationPlugin - changelog program - _cl5GetDBFile: no DB object found for database /var/lib/dirsrv/slapd-master/changelogdb/4ade9183-195d11e3-831cdb94-1f2ce016_522ddd3f000000010000.db
[09/Sep/2013:16:37:51 +0200] NSMMReplicationPlugin - changelog program - cl5GetOperationCount: could not get DB object for replica
[09/Sep/2013:16:37:51 +0200] NSMMReplicationPlugin - changelog program - _cl5GetDBFile: no DB object found for database /var/lib/dirsrv/slapd-master/changelogdb/4ade9183-195d11e3-831cdb94-1f2ce016_522ddd3f000000010000.db
}}}

  • I was unsure about the reported test case. Apart from those errors, db2ldif (master) followed
    by ldif2db (hub) worked, and after restart replication was also running well.

  • I created a test case where replication skips updates.
    I do not know if it is the reported issue, but it is the one I will use as a test case.

{{{
Create Master, C1, C2
Update nsDS5ReplicaName on Master, so that it contains 'db' (my database suffix. It can be db3 or db4).
Create user t1
Create user t2
<check replication is working>
Stop C2
Create user t3
<check t3 is replicated on C1>
Stop Master, C1
export Master (-r)
import C1 (this step can likely be skipped)
Start Master, C1, C2
Create user t4

    -> On Master: t1, t2, t3, t4
    -> On Cons.1: t1, t2, t3, t4
    -> On Cons.2: t1, t2,     t4

}}}

  • The dump of the changelog shows an incomplete record for 'user t3'

{{{
dbid: 0000006f000000000000
entry count: 7

dbid: 000000de000000000000
    purge ruv:
        {replicageneration} 522dfa79000000010000
        {replica 1 ldap://pctbordaz.redhat.com:47489}

dbid: 0000014d000000000000
    max ruv:
        {replicageneration} 522dfa79000000010000
        {replica 1} 522dfb19000000010000 522dfd1b000000010000

dbid: 522dfb19000000010000
    uniqueid: 31464581-196f11e3-831cdb94-1f2ce016
    dn: uid=t1,dc=com
    operation: add

dbid: 522dfb38000000010000
    uniqueid: 31464582-196f11e3-831cdb94-1f2ce016
    dn: uid=t2,dc=com
    operation: add

dbid: 522dfb6c000000010000  <<<<<< broken entry
    uniqueid: 00000000-00000000-00000000-00000000
    dn: cn=start iteration
    operation: delete

dbid: 522dfcc6000000010000
    uniqueid: 2809a881-197011e3-831cdb94-1f2ce016
    dn: uid=t4,dc=com
    operation: add

}}}

Here are the next steps

- I will verify the fix

Can we get this fix into RHEL 6.5? Does this affect 389-ds-base-1.2.11?

Thanks Rich for the review.

At the source level, it applies on 1.2.11.
I will test and confirm if I can reproduce on 1.2.11

I confirm the same bug applies to 389-ds-base-1.2.11.
I can reproduce the skipped updates with the same test case. The only difference is that in 1.2.11 the database suffix is 'db4', so 'nsDS5ReplicaName' must contain 'db4' to reproduce the issue.
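
To see whether a given replica would hit the problem on a particular version, the parts of the changelog file name can be checked for the database suffix. The sketch below is only an illustration. It assumes the file name has the form <replica name>_<replica generation>.<suffix>, which is inferred from the file names quoted in this ticket. changelog_name_repeats_ext() and check_replica_names() are hypothetical names, and the generation value paired with the two test replica names is borrowed from the log excerpt purely for illustration.

```c
#include <assert.h>
#include <string.h>

/*
 * return 1: the suffix also occurs in the replica name or generation,
 *           so the first-occurrence check would reject the file name
 * return 0: the suffix occurs only as the extension
 * Assumes the <replica name>_<replica generation>.<suffix> layout.
 */
static int changelog_name_repeats_ext(const char *replica_name,
                                      const char *replica_gen,
                                      const char *ext)
{
    if (ext[0] == '\0') {
        return 0;  /* strstr() matches an empty needle everywhere */
    }
    return (strstr(replica_name, ext) != NULL ||
            strstr(replica_gen, ext) != NULL) ? 1 : 0;
}

/* Hypothetical helper: the replica names mentioned in this ticket. */
static void check_replica_names(void)
{
    const char *gen = "522ddd3f000000010000";  /* illustrative pairing */

    /* Original and modified names from the single-master test, suffix 'db'. */
    assert(changelog_name_repeats_ext(
        "c7c6377c-196e11e3-831c8895-1f2ce016", gen, "db") == 0);
    assert(changelog_name_repeats_ext(
        "c7c6377c-196e11e3-831c88db-1f2ce016", gen, "db") == 1);
    /* The same modified name does not contain 'db4'. */
    assert(changelog_name_repeats_ext(
        "c7c6377c-196e11e3-831c88db-1f2ce016", gen, "db4") == 0);
    /* Replica name and generation from the original report, suffix 'db4'. */
    assert(changelog_name_repeats_ext(
        "d736e482-198111e1-8d7bedb4-8c53b85f",
        "502ce263000000020000", "db4") == 1);
}
```

The third case matches the observation above: a name that triggers the problem with suffix 'db' does not necessarily trigger it with 'db4'.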

Push to master:

git merge ticket47489

Updating b73f1e8..7a7609d
Fast-forward
ldap/servers/plugins/replication/cl5_api.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

git push origin master

Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 1.05 KiB, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
b73f1e8..7a7609d master -> master

commit 7a7609d
Author: Thierry bordaz (tbordaz) tbordaz@redhat.com
Date: Wed Sep 11 11:08:58 2013 +0200

389-ds-base-1.3.1 branch: commit ac8aad8
389-ds-base-1.2.11 branch: commit f944cd0

Metadata Update from @tbordaz:
- Issue assigned to tbordaz
- Issue set to the milestone: 1.3.2 - 09/13 (September)

7 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/826

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Fixed)

3 years ago
