#4036 RHEL7 ipa-replica-manage hang waiting on CLEANALLRUV tasks
Closed as Duplicate. Opened 10 years ago by dpal.

Ticket was cloned from Red Hat Bugzilla (product Red Hat Enterprise Linux 7): Bug 1031852

Description of problem:

I'm seeing ipa-replica-manage hang in a few cases, most notably ipa-replica-manage del
in some environments.  It turned out to be hanging while waiting on some
CLEANALLRUV tasks:


[root@ipaqa64vmk ~]# ps -ef|grep ipa-replica-manage
root     18587 11341  0 19:47 pts/0    00:00:00 grep --color=auto
ipa-replica-manage
root     21526  4389  0 18:42 ?        00:00:09 /usr/bin/python -E
/usr/sbin/ipa-replica-manage -p Secret123 del qeblade6.testrelm.com -f

[root@ipaqa64vmk ~]# ipa-replica-manage list-ruv
ipaqa64vmk.testrelm.com:389: 6
ipaqa64vmb.testrelm.com:389: 5
ipaqavmd.testrelm.com:389: 4
qeblade6.testrelm.com:389: 8
ipaqa64vma.testrelm.com:389: 12

[root@ipaqa64vmk ~]# ipa-replica-manage list-clean-ruv
CLEANALLRUV tasks
RID 8: Not all replicas caught up, retrying in 2560 seconds

No abort CLEANALLRUV tasks running
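
(For reference, the same task status can also be read directly from the directory server. This is only a sketch: it assumes the standard 389-ds task location cn=cleanallruv,cn=tasks,cn=config and the usual task status attributes, and the host and bind credentials are placeholders.)

# Inspect any running CLEANALLRUV task entries and their status/log.
ldapsearch -LLL -H ldap://ipaqa64vmk.testrelm.com \
    -D "cn=Directory Manager" -W \
    -b "cn=cleanallruv,cn=tasks,cn=config" \
    "(objectClass=*)" nstaskstatus nstasklog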

Version-Release number of selected component (if applicable):
ipa-server-3.3.3-4.el7.x86_64
389-ds-base-1.3.1.6-10.el7.x86_64


How reproducible:
Happens often in automated testing.

Steps to Reproduce:
1.  Set up an IPA environment with 5 nodes in a line topology 1-2-3-4-5
2.  On node4, run: ipa-replica-manage -p $PASSWD del node5

It should be noted that I've seen hangs with other ipa-replica-manage commands
in other places, but those have been very infrequent compared to the del one.

Actual results:

ipa-replica-manage command hangs

ipa-replica-manage list-clean-ruv shows "Not all replicas caught up"


Expected results:
No hang, and the node is deleted from the replication agreement topology.

Additional info:


Env where I see the del hang:

   M
  / \
R1   R2
      \
       R3
        \
         R4

R3 dels R4.

Honza, please check this one.

I discussed this issue with Mark Reynolds; this was his advice:

It appears that replica 8 (cloud-qe-15) was not correctly decommissioned. cloud-qe-7 has an update from the "deleted" replica (cloud-qe-15) that the other replicas have not seen. This means cloud-qe-15 was removed before it sent all of its updates out.
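
(A hedged way to confirm this kind of RUV divergence is to compare the database RUV stored in each replica's replication tombstone entry; the hosts, suffix and credentials below are placeholders.)

# Run against each replica in turn and compare the RUV element for replica ID 8.
ldapsearch -LLL -H ldap://cloud-qe-7.testrelm.com \
    -D "cn=Directory Manager" -W \
    -b "dc=testrelm,dc=com" \
    "(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))" \
    nsds50ruv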

I'm not sure how IPA performs the cleanallruv task, but these steps should be followed:

[1] Put the replica to delete (cloud-qe-15 in this example) into read-only mode.
[2] Remove all the replication agreements that point to this replica (cloud-qe-15).
[3] Send the cleanallruv task and wait for it to finish.
[4] Remove the replica (cloud-qe-15).
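
(As an illustration of step [1], the backend of the replica being removed can be put into read-only mode with a plain ldapmodify. This is only a sketch: the backend name userRoot, the host and the credentials are assumptions.)

# Put the database backend into read-only mode; "userRoot" is the usual IPA
# backend name but may differ, and the credentials are placeholders.
ldapmodify -H ldap://cloud-qe-15.testrelm.com -D "cn=Directory Manager" -W <<EOF
dn: cn=userRoot,cn=ldbm database,cn=plugins,cn=config
changetype: modify
replace: nsslapd-readonly
nsslapd-readonly: on
EOF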

A workaround is to use the "force" option in the cleanallruv task - this does not check whether the replicas are caught up; it just cleans them. There is potential for lost changes, but it also won't "hang" while waiting for a change that will never come.
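
(For the record, such a forced clean can be requested by adding a task entry directly. A sketch, assuming the documented cleanallruv task attributes; the suffix, host and credentials are placeholders.)

# Create a CLEANALLRUV task for RID 8 with forced cleaning, i.e. without
# waiting for all replicas to catch up.
ldapadd -H ldap://ipaqa64vmk.testrelm.com -D "cn=Directory Manager" -W <<EOF
dn: cn=clean 8,cn=cleanallruv,cn=tasks,cn=config
objectClass: extensibleObject
cn: clean 8
replica-base-dn: dc=testrelm,dc=com
replica-id: 8
replica-force-cleaning: yes
EOF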

We need to check whether we follow the correct steps during replica removal. For example, I think we do not put the replica being removed into read-only mode for the whole agreement removal and decommission; we only put it into read-only mode when we are removing one link.

Moving to next month iteration.

Moving to next month iteration.

As I commented in the Bugzilla, I am closing this ticket. As discussed with Nathan, there was a problem with a wrong test procedure. When Scott fixed it, the problem went away. There was still the 389-ds-base freeze, but it is being investigated in Bug 1034832 and has nothing to do with FreeIPA so far - thus, closing this ticket.

Reopening the ticket. Mark had several good suggestions on how the current re-initialize command can be improved, for example:

(In reply to Martin Kosek from comment #22)
> Mark, thanks for the explanation. I am now thinking about which of the proposed
> improvements could be automated in ipa-replica-manage.
> 
> We could warn the user that he also has to re-initialize other IPA masters in
> case he reinitializes "C", as in your example.

I think there should always be some type of warning when doing an online reinit, stating something like: the remote database will be removed, its changelog will be invalidated, and the remote replica's peers might need to be reinitialized as well.

> But for that, we would first
> need to be able to get a full graph of the IPA network:
> https://fedorahosted.org/freeipa/ticket/3058
> 
> As for other enhancements, I am thinking about the following update to the
> process:
> 
> 1) enable the agreement from this host to the remote host (put
> nsds5ReplicaEnabled to ON)
> 2) enable the agreement from the remote host to local host (put
> nsds5ReplicaEnabled to ON)
> 
> FOR EACH replication peer:
>     a) Force synchronization from the remote host to the local host (play
> with nsDS5ReplicaUpdateSchedule)
>     b) Wait until replication is stale (nsds5replicaUpdateInProgress is
> false)

This won't guarantee that replication is idle when you actually do the reinit.
You would need to:

 a) Set this server to read-only mode.
 b) Force synchronization from the remote host to the local host (play with nsDS5ReplicaUpdateSchedule).
 c) Then wait for nsds5replicaUpdateInProgress to be false. 
 d) Do the reinit on the remote replica.
 e) Finally, disable read-only mode.

While this is disruptive to clients/replicas, it should not be a common task.  If it needs to be run, then there are probably already disruptive problems occurring, or nothing was even set up yet (in which case it doesn't really matter).
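
(A rough sketch of steps (b)-(d) against a single replication agreement follows; the agreement DN, host, suffix and the DM_PASSWORD variable are placeholders, and steps (a)/(e) would toggle nsslapd-readonly on the backend as shown earlier.)

# Agreement entry on the local supplier pointing at the remote replica.
AGMT='cn=meToqeblade6.testrelm.com,cn=replica,cn=dc\3Dtestrelm\2Cdc\3Dcom,cn=mapping tree,cn=config'

# (b)/(c) wait until the agreement reports that no update is in progress.
until ldapsearch -LLL -H ldap://ipaqa64vmk.testrelm.com \
        -D "cn=Directory Manager" -w "$DM_PASSWORD" \
        -b "$AGMT" -s base nsds5replicaUpdateInProgress \
        | grep -qi 'nsds5replicaUpdateInProgress: FALSE'
do
    sleep 5
done

# (d) trigger the online re-initialization of the remote replica.
ldapmodify -H ldap://ipaqa64vmk.testrelm.com -D "cn=Directory Manager" -w "$DM_PASSWORD" <<EOF
dn: $AGMT
changetype: modify
replace: nsds5BeginReplicaRefresh
nsds5BeginReplicaRefresh: start
EOF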

> 
> 3) Re-initialize the replication (change nsds5BeginReplicaRefresh to start)
> 
> Would that improve the process? I was not sure what exactly you mean by
> "This requires checking RUVs in each agreement against the consumer replica
> database RUV, etc.", i.e. how I should check/compare that.

I was referring to the "hard" way of determining if the replica was idle.  Checking nsds5replicaUpdateInProgress should be sufficient.

See linked Bugzilla for more information.

See https://bugzilla.redhat.com/show_bug.cgi?id=1031852#c29; this should all be inherently improved by the topology feature (#4302) and #5307.

Metadata Update from @dpal:
- Issue assigned to jcholast
- Issue set to the milestone: Future Releases
