#47655 Multi master replication initialization incomplete after restore of one master
Closed: Invalid None Opened 6 years ago by setcheverry.

Initial configuration:

2 directories DS1 and DS2 with multimaster replication properly configured and working.
Around 3 millions of entries in these directories.

Steps to follow to reproduce :

  1. Backup DS2 with db2bak.pl script
    See log "1.Backup_DS2.log"
    [30/Dec/2013:11:12:06 +0100] - Beginning backup of 'ldbm database'

  2. Restore this previous backup with bak2db command on DS2
    See log "1.Restore_DS2.log"
    [30/Dec/2013:13:03:51 +0100] - slapd shutting down - signaling operation threads

  3. Initialize DS1 with DS2 data restored => initialization ends properly but only 400 000 entries are present in DS1
    See logs "3.Init replication_DS*_KO.log"
    [30/Dec/2013:13:18:38 +0100] NSMMReplicationPlugin - conn=11 op=3 repl="o=iah_extranet_msa": Begin total protocol

  4. Wait for approximately 24 hours
    This log appears in DS2 error log :
    [31/Dec/2013:13:24:04 +0100] - Trimmed 1 changes from the changelog

Initialize consumer DS1 again with DS2 data => initialization ends properly and all the entries are present in DS1
See logs "4.Init replication_DS*_KO.log"


  1. DS2 error log during initial backup
    1.Backup_DS2.log
  1. DS1 error log during initialization of replication (not complete)
    3.Init replication_DS1_KO.log

3b. DS2 error log during initialization of replication (not complete)
3.Init replication_DS2_KO.log

  1. DS1 error log during initialization of replication (complete)
    4.Init replication_DS1_OK.log
  1. DS2 error log during initialization of replication (complete)
    4.Init replication_DS2_OK.log

Problem description

In a multi-master replication context, after the restore via bak2db (or ldif2db) of one master, the initialization of the second master ends with no error but is incomplete.
24 hours later, without any modification of the directories, the initialization is restarted, ends properly and is complete.

It looks in the first failed case (3.Init replication_DS*_KO.log), there was some problem and the total update (consumer initialization) was timed out.
{{{
3.Init\ replication_DS2_KO.log
[30/Dec/2013:13:40:15 +0100] - repl5_tot_waitfor_async_results: 185314 399492
[30/Dec/2013:13:40:16 +0100] - repl5_tot_waitfor_async_results timed out waiting for responses: 185340 399492
}}}
Timeout cam be caused by multiple reasons. Network problem could be one and a huge entry could be another. Looking at the DS1 log 3.Init\ replication_DS1_KO.log, importing the 396102nd entry took very long time. Do you happen to have any idea how this entry looks?
{{{
3.Init\ replication_DS1_KO.log
[30/Dec/2013:13:25:21 +0100] - import base_msa: Processed 396101 entries -- average rate 985.3/sec, recent rate 188.2/sec, hit ratio 100%
}}}
To obtain the info, you could dump the contents of the backed up database with this command.
dbscan -f /DATASOFT/msa/bak/hotbackup_3M/base_msa/id2entry.db4
The entry with "id 396102" may tell us what entry is slowing down the import. I'm wondering if it contains lots of state information.

If you find a huge entry, you may want to increase this configuration parameter.
https://access.redhat.com/site/documentation/en-US/Red_Hat_Directory_Server/9.0/html/Configuration_Command_and_File_Reference/Core_Server_Configuration_Reference.html#cnconfig-nsslapd_maxbersize_Maximum_Message_Size

If such state info is an issue, you could purge them quicker than the default settings.

https://access.redhat.com/site/documentation/en-US/Red_Hat_Directory_Server/9.0/html/Configuration_Command_and_File_Reference/Core_Server_Configuration_Reference.html#Replication_Attributes_under_cnreplica_cnsuffixName_cnmapping_tree_cnconfig-nsDS5ReplicaPurgeDelay
https://access.redhat.com/site/documentation/en-US/Red_Hat_Directory_Server/9.0/html/Configuration_Command_and_File_Reference/Core_Server_Configuration_Reference.html#Replication_Attributes_under_cnreplica_cnsuffixName_cnmapping_tree_cnconfig-nsDS5ReplicaTombstonePurgeInterval

That's said, I think backup/restore is not the direct cause of the problem, but some entries in the database might be. If you did consumer initialization when you took the backup, you ran into the time out, I guess...

The directory has been populated for testing purpose, all the entries are similar.
I checked with the command you gave, and the entry "id 396102" is very small...actually just as small as any other entry.

But it's true: we have a huge directory with 3 to 9 millions of entries.
Maybe I should try to increase this timeout value ?

I've tried increasing nsslapd_maxbersize to 50 Mb and nsDS5ReplicaTimeout to 1200s, but no luck...

I think that the initialization problem disappears by itself after this type of event which is recurrent every 24 hour, after the restore.
What is this ? Is it the purge of the replication log ?

[09/Feb/2014:19:30:15 +0100] NSMMReplicationPlugin - Info: Beginning tombstone reap for replica f741df82-5db511e3-ba80af9b-9fd73342.
[09/Feb/2014:19:30:15 +0100] NSMMReplicationPlugin - Info: No purge CSN for tombstone reap for replica f741df82-5db511e3-ba80af9b-9fd73342.
[09/Feb/2014:19:30:15 +0100] NSMMReplicationPlugin - Info: Finished tombstone reap for replica f741df82-5db511e3-ba80af9b-9fd73342.

I've learned that this log is generated by the replica tombstone purge.
I've tried to:
- restore the directory
- decrease this tombstone purge interval to 10 minutes (24H by default)
- restart the directory
- wait for the purge log to be written
- increase the tombstone purge interval to its default value (24H)
- restart the directory
- initialize the consumer

=> Problem still there !

setcheverry,

Is there any chance you can provide the LDIF file of the database you used for this testing? Are you using generic entries for testing, or are you using "real/production" user data?

I've been testing a with 3 million entry database on 1.2.11, and it does work for me.

Can you reproduce this every time?

Typically when dealing with larger databases, online reinitializations are not recommended. Instead an offline initialization is preferred(this does require copying a large LDIF file to the remote server):

https://access.redhat.com/site/documentation/en-US/Red_Hat_Directory_Server/9.0/html/Administration_Guide/Managing_Replication-Initializing_Consumers.html#Initializing_Consumers-Manual_Consumer_Initialization_Using_the_Command_Line

Can you test the offline initialization to see if it also is incomplete?

Thanks,
Mark

I'm waiting for my client approval before providing the LDIF file.

We've tried using filesystem initialization and it works ok.
https://access.redhat.com/site/documentation/en-US/Red_Hat_Directory_Server/9.0/html/Administration_Guide/Managing_Replication-Initializing_Consumers.html#Initializing_Consumers-Filesystem_Replica_Initialization

Can you confirm that there is no problem with this initialization for multi master replication ?

Replying to [comment:11 setcheverry]:

I'm waiting for my client approval before providing the LDIF file.

We've tried using filesystem initialization and it works ok.
https://access.redhat.com/site/documentation/en-US/Red_Hat_Directory_Server/9.0/html/Administration_Guide/Managing_Replication-Initializing_Consumers.html#Initializing_Consumers-Filesystem_Replica_Initialization

Can you confirm that there is no problem with this initialization for multi master replication ?

Actually, this is probably the best way to do it. You are literally just copying the database over - no server interaction at all.

Aside from getting you up and running, there might be another issue with the import process, that's why we want to reproduce it. So did using that LDIF always fail? Or was it random?

If you can provide the ldif, could you also send along your dse.ldif? Just curious what the caches(db cache, import cache, etc) are set to.

Thanks

I totally agree !
The initialization always fail after a restore.
My client tried a LDIF import via ldif2db with his own data and it failed too.

I'll find a way to send you the data file an dse.ldif.
The file will be pretty big (around 50MBytes). Will it be ok to post it ?

Replying to [comment:13 setcheverry]:

I totally agree !
The initialization always fail after a restore.
My client tried a LDIF import via ldif2db with his own data and it failed too.

I'll find a way to send you the data file an dse.ldif.
The file will be pretty big (around 50MBytes). Will it be ok to post it ?

Can you email it to me? mreynolds@redhat.com

Any update on the ldif? Like I said you could email it to me, or we could use something else(maybe google drive?). Let me know.

Still waiting for my client approval but I think it'll be ok soon.
Sorry for the delay.

Replying to [comment:16 setcheverry]:

Still waiting for my client approval but I think it'll be ok soon.
Sorry for the delay.

It's not a problem, just didn't want this ticket getting stale. Just keep me posted, thanks!

Thanks again for providing the LDIF file, but unfortunately I can not reproduce your problem(using the exact same config). The one thing I noticed is that your import performance is VERY slow. I had to turn on all kinds of verbose logging to slow my import down(to trigger multiple passes like what you saw). Are you testing this on a VM?

Since I can not reproduce it I was hoping you could gather a stack trace when the import starts to loop on the same ID(recent rate is 0.0/sec)

[30/Dec/2013:13:34:34 +0100] - import base_msa: Processed 396101 entries (pass 3) -- average rate 21472856.0/sec, recent rate 0.0/sec, hit ratio 0%
...
[30/Dec/2013:13:34:54 +0100] - import base_msa: Processed 396101 entries (pass 3) -- average rate 19520778.2/sec, recent rate 0.0/sec, hit ratio 0%

You will need to install 389-ds-base-debuginfo & gdb packages, then follow this document on how to get the stack trace:

http://port389.org/wiki/FAQ#Debugging_Hangs

Thanks!

I forgot to mention, I would like you to get several stack traces, 20 seconds apart. That will help narrow down where it is getting stuck.

Yes we are testing this on a small VM.
1 CPU
RAM: 2GB
We don't have any verbose mode activated, except replication logs.

Our version: 1.2.11.15 B2013.312.1642

I'm trying to produce the stacktraces.

I've uploaded the stacktraces.
Tell me if it's ok for you.

I've generated the stacktraces after it started to loop.
Maybe I should try to stacktrace before ?

Replying to [comment:23 setcheverry]:

I've uploaded the stacktraces.
Tell me if it's ok for you.

Something is wrong. All the stack traces are from an idle server with NO import occurring. Maybe the wrong pid was used? Can you please try it again and make sure the correct pid is being used in that gdb command.

I've generated the stacktraces after it started to loop.

Yes this is what I wanted.

Maybe I should try to stacktrace before ?

Nope, not needed. I need to see what is happening once the import stalls.

New stack traces added.
I hope it's ok now, indeed a PID problem because we have two instances of ns-slapd running.

Replying to [comment:27 setcheverry]:

New stack traces added.
I hope it's ok now, indeed a PID problem because we have two instances of ns-slapd running.

Ok, this was correct pid - thanks! Okay, so there was some interesting results. I'd like you now to actually attach a debugger to the process once it starts to loop, and gather some specific information.
I am attaching a step by step set of instructions I would like you to perform when you get a chance. I would like you to do this twice. So, attach it to the process, gather the info, and exit the debugger. Wait 30 seconds and do it all over again. Then please send me the output requested. Thanks again!

Instructions to attach gdb, and gather specific information
gdb-import-debug-instructions.txt

Please find the debug session as asked.

Okay, I'd like you to run a different test now.

Instead of doing an online initialization, I want you to do an LDIF initialization:

Using console:

[1] right click on the agreement and select "Export Replica", instead of "Initialize Consumer". This will create a LDIF file.

[2] Copy this new LDIF file to the consumer/replica system

[3] Import this LDIF. Going the "data' node, and find your suffix, then right click and select "Initialize database", and select your new LDIF file you just copied over.

Does this import hang or partially complete? Please provide error log showing the import. Thanks!

Replying to [comment:30 mreynolds]:

Okay, I'd like you to run a different test now.

Instead of doing an online initialization, I want you to do an LDIF initialization:

Using console:

[1] right click on the agreement and select "Export Replica", instead of "Initialize Consumer". This will create a LDIF file.

[2] Copy this new LDIF file to the consumer/replica system

Can you also put this LDIF on your ftp site? This LDIF will contain replication state information, that was not present in the original LDIF you provided me. This might help me to reproduce the issue on our side. Thanks!

[3] Import this LDIF. Going the "data' node, and find your suffix, then right click and select "Initialize database", and select your new LDIF file you just copied over.

Does this import hang or partially complete? Please provide error log showing the import. Thanks!

Update:

I have not been able to reproduce the problem, but I have found the reason for the import stopping. In this case an "online initialization" is being performed. The master replica sends each entry over the network to the consumer replica. After a while performance drops, and the master replica times out waiting for confirmation that the last entry was added. Once this timeout occurs, the import simply stops - thinking that it has received all the entries. This explains the "successful" but partial import.

As for the poor import performance that triggers the timeout, I do not know if there is a bug, or simply that the consumer system can not handle the initialization load it is being given.

Still investigating...

Hi Mark,

Great analysis! It looks your research shows this ticket needed more reports... (Ticket #47606 - replica init/bulk import errors should be more verbose)

So, could you open a new ticket for the error report?
Thanks!
--noriko

Summary of issue:

The problem here is that as we are sending entries over the network to the consumer(online total initialization), and we run into an issue writing to the network socket. This stops sending new entries to the consumer, and basically aborts the online import:

Import is started:

{{{
[30/Dec/2013:13:18:40 +0100] NSMMReplicationPlugin - Beginning total update of replica "agmt="cn=2 vers 1" (172:389)".
}}}

Failed to write to the socket buffer, this stops the sending of new entries:

{{{
[30/Dec/2013:13:35:15 +0100] NSMMReplicationPlugin - agmt="cn=2 vers 1" (172:389): poll timed out - poll interval [600000]
}}}

Now we start waiting for all the entries to be seen by the consumer, but we have only sent 399492 entries:

{{{

[30/Dec/2013:13:35:15 +0100] - repl5_tot_waitfor_async_results: 177774 399492
...
[30/Dec/2013:13:40:15 +0100] - repl5_tot_waitfor_async_results: 185314 399492
[30/Dec/2013:13:40:16 +0100] - repl5_tot_waitfor_async_results timed out waiting for responses: 185340 399492
}}}

There there is a 5 minute timeout waiting for all the sent entries to be seen by the consumer. It is making progress so I feel there is a bug(we should not timeout when progress is being made), however, it will still stop at entry 399492. - the import was still aborted.

So, once we run into a connection issue, the online initialization will simply stop. This is all by design - there is no bug.

I'm not sure if the VM is the root cause of the connections issues, but it is not the Directory Server's fault.

Workaround: Do binary or LDIF initializations for large databases, or in environments where connectivity can become unstable.

I will be improving the error logging, so that it is clear that the import was not complete.

git merge ticket47655
Updating 9940ca2..48cd960
Fast-forward
ldap/servers/plugins/replication/repl5.h | 1 +
.../plugins/replication/repl5_tot_protocol.c | 31 +++++++++++++------
2 files changed, 22 insertions(+), 10 deletions(-)

git push origin master
9940ca2..48cd960 master -> master

commit 48cd960
Author: Mark Reynolds mreynolds@redhat.com
Date: Tue Apr 1 13:25:40 2014 -0400

Thanks Mark for your analysis and correction !

What is surprising is that , on the same network and same VM, the initialization finally works if we try it 24 hours after the restore!

Indeed, we chosed the binary initialization but something is strange about this network behaviour.

Replying to [comment:38 setcheverry]:

Thanks Mark for your analysis and correction !

What is surprising is that , on the same network and same VM, the initialization finally works if we try it 24 hours after the restore!

Indeed, we chosed the binary initialization but something is strange about this network behaviour.

Yeah it is strange, but all it takes is one connection issue to abort the entire initialization. Maybe there is some network/OS tuning that might help. Hard to say, but doing the binary/LDIF initialization will always be significantly faster, and less error prone, than doing the online init.

Metadata Update from @mreynolds:
- Issue assigned to mreynolds
- Issue set to the milestone: N/A

3 years ago

Login to comment on this ticket.

Metadata