#5235 Random synchronization issue after ipa user-add
Closed: Duplicate. Opened 8 years ago by trana.

Hi,

We have set up an environment with two IPA 3.0.0 servers (the second created with ipa-replica-prepare) on CentOS 6.4.

Symptom:
When adding users through the IPA web UI of server 1 (in Firefox), sometimes (around 1 case in 10) the user is not added on server 2 (tested with ipa user-show). Users are always added correctly on server 1. The logs in /var/log/dirsrv-*/errors do not show any relevant information; in particular, there is no synchronization error between the two servers.
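Even when nothing is flagged as an outright error, it can help to grep the dirsrv errors files for replication and changelog keywords. The snippet below is only a sketch: it builds a throwaway sample file (the log line is fabricated for illustration) to show the kind of pattern worth searching for; in practice you would point the grep at /var/log/dirsrv/slapd-*/errors.

```shell
# Build a throwaway sample errors file just to demonstrate the pattern.
sample=$(mktemp)
cat > "$sample" <<'EOF'
[22/Sep/2015:10:21:48 +0000] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: the transaction failed (rc=-30994 (DB_LOCK_DEADLOCK))
[22/Sep/2015:10:21:49 +0000] connection - some unrelated message
EOF

# Replication trouble often surfaces under these keywords even when
# the line is not labelled as an error.
grep -iE 'NSMMReplicationPlugin|changelog|DEADLOCK' "$sample"

rm -f "$sample"
```

Running this prints only the NSMMReplicationPlugin line; the unrelated connection line is filtered out.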

Workaround:
On server 2, we run ipa-replica-manage re-initialize --from ... to force a full re-replication.

Are there any other logs that would be useful? Thanks.


This sounds like a replication issue, and the aforementioned /var/log/dirsrv-*/errors log is the right place to check.

Does it occur only for specific records while replication of other records works, or after an occurrence is no other new record/modification replicated at all?

Hi,

I tested again. I ran two tails on /var/log/dirsrv/.../errors (there are two files: one in SLAPD-PKI-... and one in SLAPD-<name of domain or realm>-...).

I created 14 users, one by one, in controller-1 with:

ipa user-add --first=test --last=test todelete$i

In controller-2, checking with ipa user-show, only the first user, todelete1, was not created. Nothing (no new lines) appeared in the errors files. Nothing in the iptables logs (no DROP).

I then ran a for loop, creating users until I had 50 todelete$i users. I could not reproduce the problem. I deleted and recreated all 50 users sequentially. No error.

And was todelete1 created on controller-1?

So far it's not clear whether there is a bug, and where.

If you manage to reproduce it, also check httpd/error_log and the dirsrv access logs on both machines around that time.

Replying to [comment:3 pvoborni]:

And was todelete1 created on controller-1?
Yes. Each ipa user-add done on controller-1 is visible immediately on controller-1, but not always on controller-2. It seems that, sometimes, a synchronization operation from controller-1 to controller-2 does not complete correctly. Maybe a faulty network? Does controller-1 do anything to get an acknowledgment from the other IPA server?

I will try to reproduce it, but it takes time and luck.

maybe it is related to https://fedorahosted.org/389/ticket/48266

Thierry, what could be done to verify that?

I doubt 5235 is a consequence of 48266. In fact, 48266 can prevent/slow down replication of updates with 3 servers, whereas here we have only 2 servers. The only case I can think of (being triggered by 48266) is that the replication connection gets closed because there are too many skipped updates, and so idletimeout closes the replication connection.
When the problem occurs, I would need the access/errors logs (DS) with the replication log level enabled to confirm that.
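For reference, the replication log level mentioned here is enabled by raising nsslapd-errorlog-level on cn=config; in 389-ds, 8192 is the replication-debugging bit. A minimal LDIF sketch (bind credentials and file name are placeholders):

```ldif
dn: cn=config
changetype: modify
replace: nsslapd-errorlog-level
nsslapd-errorlog-level: 8192
```

This could be applied with something like `ldapmodify -x -D "cn=Directory Manager" -W -f loglevel.ldif`, and set back to 0 afterwards, since this level is very verbose.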

I prepared a 389-ds test case for 48266; I have no FreeIPA test case to reproduce it.

I'm closing this bug for now. Without logs there is nothing we can do. Please reopen when logs are available.

Hi, sorry, I was busy with other tasks. I tried to reproduce the issue and found some logs that I hadn't found before. This is what I have done.

I have a first master, controller-1, and a replicated master, controller-2. To begin, I ran ipa-replica-manage re-initialize --from controller-1 on controller-2 today, 15/09/2015, at around 17:04 UTC (you will see that in the logs).

Then, on controller-1, I created 13 users, one per second:
for i in $(seq 1 13) ; do ipa user-add todelete$i --first=todelete --last=todelete && sleep 1 ; done

Then in controller-2:
for i in $(seq 1 13) ; do ipa user-show todelete$i >/dev/null 2>/dev/null && echo "User todelete$i exists" || echo "WARNING User todelete$i not found" ; done

Result: todelete1 does not exist, but the other users exist.

Log attached.

- The entry todelete1 was likely not created on controller-1 either, because _cl5WriteOperationTxn was in DB_DEADLOCK

- The reason for the DB_DEADLOCK is strange, because the total init completed beforehand, so I see no cause for it

- Would you retry the same test with a 1 s sleep between the end of the 're-initialize' and the creation of entries on controller-1?

- Before doing that, could you provide the access logs from both controllers?

thanks
thierry

I reproduced the failure as well as a DS crash. It is not systematic. I did not enable core dumps, so I have to reproduce it again.

[21/Sep/2015:16:54:06 +0200] conn=229 op=6 ADD dn="uid=user_427_0,cn=users,cn=accounts,SUFFIX"
<< crash>>


[21/Sep/2015:16:54:04 +0200] NSMMReplicationPlugin - Finished total update of replica "agmt="cn=meTovm-xxx.abc.idm.lab.eng.brq.redhat.com" (vm-xxx:389)". Sent 734 entries.
[21/Sep/2015:16:54:11 +0200] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: retry (49) the transaction (csn=56001a0f000100040000) failed (rc=-30994 (DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
[21/Sep/2015:16:54:11 +0200] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: failed to write entry with csn (56001a0f000100040000); db error - -30994 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
[21/Sep/2015:16:54:11 +0200] NSMMReplicationPlugin - write_changelog_and_ruv: can't add a change for uid=user_427_0,cn=users,cn=accounts,SUFFIX (uniqid: 8940d809-607011e5-b603f794-98322a69, optype: 16) to changelog csn 56001a0f000100040000
[21/Sep/2015:16:54:12 +0200] managed-entries-plugin - mep_add_managed_entry: Failed to add managed entry objectclass in origin entry "uid=user_427_1,cn=users,cn=accounts,SUFFIX", error (No such object)
[21/Sep/2015:16:54:12 +0200] managed-entries-plugin - mep_add_managed_entry: Unable to add pointer to managed entry "cn=user_427_1,cn=groups,cn=accounts,SUFFIX" in origin entry "uid=user_427_1,cn=users,cn=accounts,SUFFIX"

Replying to [comment:10 tbordaz]:

I reproduced the failure as well as a DS crash. It is not systematic. I did not enable core dumps, so I have to reproduce it again.

Hi,

I did the test again. Attached are the logs from both controller-1 and controller-2, for both the access and errors log files. I took some time to scrub irrelevant information (FQDN, IP, etc.). Also attached are the commands and

I managed to reproduce the issue.

Test details:

I cleaned my environment: I deleted all todelete${i} users on both controller-1 and controller-2, then ran an ipa-replica-manage re-initialize on controller-2.

On controller-1, I created 100 users with a 1 s sleep between each.

# Begin at 10:20:43 UTC.
for user in $(seq 1 100) ; do ipa user-add todelete${user} --first=todelete --last=todelete > /dev/null 2>&1 && echo User todelete${user} created && sleep 1 || echo User todelete${user} NOT DONEEEEEEEEE && sleep 1; done
# Ended at 10:24:52 UTC.

Test results:

# I ran this check on both controllers.
for user in $(seq 1 100) ; do ipa user-show todelete${user} >/dev/null 2>&1 && echo User todelete${user} created || echo User todelete${user} NOT DONEEEEEEEEE ; done

As the logs show, only todelete29 does not exist on controller-2. All other users exist on both controllers, and todelete29 does exist on controller-1. There seems to be a deadlock somewhere, with a race condition that prevents replication. The log lines below (full logs attached) show that we reproduced the same issue.

[22/Sep/2015:10:21:48 +0000] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: retry (49) the transaction (csn=56012bbc000000040000) failed (rc=-30994 (DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
[22/Sep/2015:10:21:48 +0000] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: failed to write entry with csn (56012bbc000000040000); db error - -30994 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
[22/Sep/2015:10:21:48 +0000] NSMMReplicationPlugin - write_changelog_and_ruv: can't add a change for uid=todelete29,cn=users,cn=accounts,dc=example,dc=com (uniqid: b72b2883-611311e5-b5e6e6bf-0a8e862e, optype: 16) to changelog csn 56012bbc000000040000
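As an aside, the first eight hex digits of a 389-ds CSN encode the Unix timestamp of the operation, which makes it easy to line a changelog CSN up with the access-log timestamps. A minimal sketch, using the CSN from the log above (GNU date assumed):

```shell
csn=56012bbc000000040000

# The first 8 hex digits of the CSN are the operation's Unix timestamp.
ts=$(printf '%d' "0x$(echo "$csn" | cut -c1-8)")
echo "$ts"                                 # 1442917308

# Render it in the dirsrv log's timestamp style (UTC).
date -u -d "@$ts" '+%d/%b/%Y:%H:%M:%S'    # 22/Sep/2015:10:21:48
```

The decoded time matches the [22/Sep/2015:10:21:48 +0000] stamp on the failing _cl5WriteOperationTxn lines.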

The status of this ticket is complex. There are 3 issues:
- DS crash (https://fedorahosted.org/389/ticket/48287)
- update in changelog returning DB_DEADLOCK
- An entry existing on the supplier but not on the consumer

It is possible that these 3 issues are linked.
I propose to investigate the crash 48287 and, once it is fixed, check whether the DB_DEADLOCK/replication issue still occurs.

Hi tbordaz,

I have seen that the first issue, 48287, was fixed 3 weeks ago in the DS component. Does this mean this issue is also solved? Thank you.

Hi trana,

The DS crash (48287) is fixed on the 1.2.11 branch and will be available in 1.2.11.32 once it is built.
There is a possibility that 48287 is the root cause of the DB_DEADLOCK and of the entry not being replicated on the consumer.
Now I have doubts about that.
I intend to test the two other symptoms with the 48287 fix.

Hi tbordaz,

If the 48287 fix corrects this issue, do you think it will be backported to the 3.x branch? We have a production system on Red Hat 6, which does not have IPA 4.x.

Thank you.

The usual way to backport a patch into RHEL is to open a customer case with Red Hat.

But in this case it might not be needed, see https://bugzilla.redhat.com/show_bug.cgi?id=1265851

Thierry, do you know if the issue is fixed in Fedora? E.g. 389-ds-base-1.3.4.8-1.fc23? So that we can close this ticket?

Replying to [comment:20 pvoborni]:

Thierry, do you know if the issue is fixed in Fedora? E.g. 389-ds-base-1.3.4.8-1.fc23? So that we can close this ticket?

Hi Trana,

Yes, absolutely: this bug has been fixed since 1.3.4.0 as a side effect of tickets 47808 and 47815.
I am now closing this ticket as a duplicate of https://fedorahosted.org/389/ticket/48287

regards
thierry

Metadata Update from @trana:
- Issue assigned to tbordaz
- Issue set to the milestone: FreeIPA 4.3.1

7 years ago
