#4036 RHEL7 ipa-replica-manage hang waiting on CLEANALLRUV tasks
Closed as Duplicate. Opened 10 years ago by dpal.

Ticket was cloned from Red Hat Bugzilla (product Red Hat Enterprise Linux 7): Bug 1031852

Description of problem:

I'm seeing ipa-replica-manage hang in a few cases, most notably ipa-replica-manage del
in some environments.  It turned out to be hanging while waiting on some
CLEANALLRUV tasks:


[root@ipaqa64vmk ~]# ps -ef|grep ipa-replica-manage
root     18587 11341  0 19:47 pts/0    00:00:00 grep --color=auto
ipa-replica-manage
root     21526  4389  0 18:42 ?        00:00:09 /usr/bin/python -E
/usr/sbin/ipa-replica-manage -p Secret123 del qeblade6.testrelm.com -f

[root@ipaqa64vmk ~]# ipa-replica-manage list-ruv
ipaqa64vmk.testrelm.com:389: 6
ipaqa64vmb.testrelm.com:389: 5
ipaqavmd.testrelm.com:389: 4
qeblade6.testrelm.com:389: 8
ipaqa64vma.testrelm.com:389: 12

[root@ipaqa64vmk ~]# ipa-replica-manage list-clean-ruv
CLEANALLRUV tasks
RID 8: Not all replicas caught up, retrying in 2560 seconds

No abort CLEANALLRUV tasks running
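
(For reference, the same task status can also be read directly from the directory server. This is only a sketch: it assumes the standard 389-ds task location cn=cleanallruv,cn=tasks,cn=config and the usual task status attributes, and the host and bind credentials are placeholders.)

# Inspect any running CLEANALLRUV task entries and their status/log.
ldapsearch -LLL -H ldap://ipaqa64vmk.testrelm.com \
    -D "cn=Directory Manager" -W \
    -b "cn=cleanallruv,cn=tasks,cn=config" \
    "(objectClass=*)" nstaskstatus nstasklog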

Version-Release number of selected component (if applicable):
ipa-server-3.3.3-4.el7.x86_64
389-ds-base-1.3.1.6-10.el7.x86_64


How reproducible:
Happens often in automated testing.

Steps to Reproduce:
1.  Set up an IPA environment with 5 nodes in a line topology 1-2-3-4-5
2.  On node4, run: ipa-replica-manage -p $PASSWD del node5

It should be noted that I've seen hangs with other ipa-replica-manage commands
in other places, but those have been very infrequent compared to the del one.

Actual results:

ipa-replica-manage command hangs

ipa-replica-manage list-clean-ruv shows "Not all replicas caught up"


Expected results:
No hang, and the node is deleted from the replication agreement topology.

Additional info:


Env where I see the del hang:

   M
  / \
R1   R2
      \
       R3
        \
         R4

R3 dels R4.

Honza, please check this one.

I discussed this issue with Mark Reynolds; this was his advice:

It appears that replica 8 (cloud-qe-15) was not correctly decommissioned. cloud-qe-7 has an update from the "deleted" replica (cloud-qe-15) that the other replicas have not seen. This means cloud-qe-15 was removed before it sent all of its updates out.
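
(A hedged way to confirm this kind of RUV divergence is to compare the database RUV stored in each replica's replication tombstone entry; the hosts, suffix and credentials below are placeholders.)

# Run against each replica in turn and compare the RUV element for replica ID 8.
ldapsearch -LLL -H ldap://cloud-qe-7.testrelm.com \
    -D "cn=Directory Manager" -W \
    -b "dc=testrelm,dc=com" \
    "(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))" \
    nsds50ruv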

I'm not sure how IPA performs the cleanallruv task, but these steps should be followed:

[1] Put the replica to delete (cloud-qe-15 in this example) into read-only mode.
[2] Remove all the replication agreements that point to this replica (cloud-qe-15).
[3] Send the cleanallruv task and wait for it to finish.
[4] Remove the replica (cloud-qe-15).
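
(As an illustration of step [1], the backend of the replica being removed can be put into read-only mode with a plain ldapmodify. This is only a sketch: the backend name userRoot, the host and the credentials are assumptions.)

# Put the database backend into read-only mode; "userRoot" is the usual IPA
# backend name but may differ, and the credentials are placeholders.
ldapmodify -H ldap://cloud-qe-15.testrelm.com -D "cn=Directory Manager" -W <<EOF
dn: cn=userRoot,cn=ldbm database,cn=plugins,cn=config
changetype: modify
replace: nsslapd-readonly
nsslapd-readonly: on
EOF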

A workaround is to use the "force" option in the cleanallruv task - this does not check whether the replicas are caught up; it just cleans them. There is potential for lost changes, but it also won't "hang" while waiting for a change that will never come.
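
(For the record, such a forced clean can be requested by adding a task entry directly. A sketch, assuming the documented cleanallruv task attributes; the suffix, host and credentials are placeholders.)

# Create a CLEANALLRUV task for RID 8 with forced cleaning, i.e. without
# waiting for all replicas to catch up.
ldapadd -H ldap://ipaqa64vmk.testrelm.com -D "cn=Directory Manager" -W <<EOF
dn: cn=clean 8,cn=cleanallruv,cn=tasks,cn=config
objectClass: extensibleObject
cn: clean 8
replica-base-dn: dc=testrelm,dc=com
replica-id: 8
replica-force-cleaning: yes
EOF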

We need to check whether we follow the correct steps during replica removal. For example, I think we do not put the replica being removed into read-only mode for the whole agreement removal and decommission; we only put it into read-only mode when we are removing one link.

Moving to next month iteration.

Moving to next month iteration.

As I commented in the Bugzilla, I am closing this ticket. As discussed with Nathan, there was a problem with a wrong test procedure. When Scott fixed it, the problem went away. There was still the 389-ds-base freeze, but it is being investigated in Bug 1034832 and has nothing to do with FreeIPA so far - thus, closing this ticket.

Reopening the ticket. Mark had several good suggestions on how the current re-initialize command can be improved, for example:

(In reply to Martin Kosek from comment #22)
> Mark, thanks for the explanation. I am now thinking about which of the proposed
> improvements could be automated in ipa-replica-manage.
> 
> We could warn the user that he also has to re-initialize other IPA masters in
> case he reinitializes "C", as in your example.

I think there should always be some type of warning when doing an online reinit, stating something like: the remote database will be removed, its changelog will be invalidated, and the remote replica's peers might need to be reinitialized as well.

> But for that, we would first
> need to be able to get a full graph of the IPA network:
> https://fedorahosted.org/freeipa/ticket/3058
> 
> As for other enhancements, I am thinking about the following update to the
> process:
> 
> 1) enable the agreement from this host to the remote host (put
> nsds5ReplicaEnabled to ON)
> 2) enable the agreement from the remote host to local host (put
> nsds5ReplicaEnabled to ON)
> 
> FOR EACH replication peer:
>     a) Force synchronization from the remote host to the local host (play
> with nsDS5ReplicaUpdateSchedule)
>     b) Wait until replication is stale (nsds5replicaUpdateInProgress is
> false)

This won't guarantee that replication is idle when you actually do the reinit.
You would need to:

 a) Set this server to read-only mode.
 b) Force synchronization from the remote host to the local host (play with nsDS5ReplicaUpdateSchedule).
 c) Then wait for nsds5replicaUpdateInProgress to be false. 
 d) Do the reinit on the remote replica.
 e) Finally, disable read-only mode.

While this is disruptive to clients/replicas, it should not be a common task.  If it needs to be run, then there are probably already disruptive problems occurring, or nothing was even set up yet (in which case it doesn't really matter).
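
(A rough sketch of steps (b)-(d) against a single replication agreement follows; the agreement DN, host, suffix and the DM_PASSWORD variable are placeholders, and steps (a)/(e) would toggle nsslapd-readonly on the backend as shown earlier.)

# Agreement entry on the local supplier pointing at the remote replica.
AGMT='cn=meToqeblade6.testrelm.com,cn=replica,cn=dc\3Dtestrelm\2Cdc\3Dcom,cn=mapping tree,cn=config'

# (b)/(c) wait until the agreement reports that no update is in progress.
until ldapsearch -LLL -H ldap://ipaqa64vmk.testrelm.com \
        -D "cn=Directory Manager" -w "$DM_PASSWORD" \
        -b "$AGMT" -s base nsds5replicaUpdateInProgress \
        | grep -qi 'nsds5replicaUpdateInProgress: FALSE'
do
    sleep 5
done

# (d) trigger the online re-initialization of the remote replica.
ldapmodify -H ldap://ipaqa64vmk.testrelm.com -D "cn=Directory Manager" -w "$DM_PASSWORD" <<EOF
dn: $AGMT
changetype: modify
replace: nsds5BeginReplicaRefresh
nsds5BeginReplicaRefresh: start
EOF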

> 
> 3) Re-initialize the replication (change nsds5BeginReplicaRefresh to start)
> 
> Would that improve the process? I was not sure what exactly you mean by
> "This requires checking RUVs in each agreement against the consumer replica
> database RUV, etc.", i.e. how I should check/compare that.

I was referring to the "hard" way of determining if the replica was idle.  Checking nsds5replicaUpdateInProgress should be sufficient.

See linked Bugzilla for more information.

See https://bugzilla.redhat.com/show_bug.cgi?id=1031852#c29; this should all be inherently improved by the topology feature (#4302) and #5307.

Metadata Update from @dpal:
- Issue assigned to jcholast
- Issue set to the milestone: Future Releases
