#2842 ipa-replica-manage re-initialize update failed due to named ldap timeout
Closed: Invalid. Opened 11 years ago by dpal.

https://bugzilla.redhat.com/show_bug.cgi?id=831661 (Red Hat Enterprise Linux 6)

Description of problem:

Occasionally I'm seeing ipa-replica-manage re-initialize fail with this error:

[<HOSTNAME>] reports: Update failed! Status: [-2  - System error]

With some digging, I found that named on the server being replicated from timed out talking to the LDAP server.

Version-Release number of selected component (if applicable):

ipa-server-2.2.0-16.el6.x86_64
389-ds-base-1.2.10.2-17.el6_3.x86_64
bind-9.8.2-0.10.rc1.el6.x86_64
bind-dyndb-ldap-1.1.0-0.9.b1.el6.x86_64


How reproducible:

Not predictable, but I've seen this a number of times now running automated tests.

Steps to Reproduce:
1.  <setup 3 server env with 1 master and 2 replicas of that in simple triangle
topology>
2.  ipa-replica-manage -p <PASSWORD> re-initialize --from=<replica_fqdn>

I'm not quite sure

Actual results:

On REPLICA1

ipa-replica-manage -p XXXXXXXXX re-initialize --from=replica2.testrelm.com
ipa: INFO: Setting agreement cn=meToreplica1.testrelm.com,cn=replica,cn=dc\3Dtestrelm\2Cdc\3Dcom,cn=mapping tree,cn=config schedule to 2358-2359 0 to force synch
ipa: INFO: Deleting schedule 2358-2359 0 from agreement cn=meToreplica1.testrelm.com,cn=replica,cn=dc\3Dtestrelm\2Cdc\3Dcom,cn=mapping tree,cn=config
Update in progress
Update in progress
Update in progress
Update in progress
Update in progress
Update in progress
Update in progress
Update in progress
Update in progress
Update in progress
Update in progress
Update in progress
Update in progress
Update in progress
Update in progress
Update in progress
Update in progress
Update in progress
Update in progress
Update in progress
Update in progress
Update in progress
Update in progress
[replica2.testrelm.com] reports: Update failed! Status: [-2  - System error]

Expected results:

Update succeeds without the named LDAP timeout error in the logs.

Additional info:

Found in replica2:/var/log/messages:
Jun 13 09:32:16 replica2 named[14811]: LDAP query timed out. Try to adjust "timeout" parameter
Jun 13 09:32:16 replica2 named[14811]: LDAP query timed out. Try to adjust "timeout" parameter
Jun 13 09:32:26 replica2 named[14811]: LDAP query timed out. Try to adjust "timeout" parameter
Jun 13 09:32:26 replica2 named[14811]: LDAP query timed out. Try to adjust "timeout" parameter
Jun 13 09:32:26 replica2 ns-slapd: GSSAPI Error: An invalid name was supplied (Hostname cannot be canonicalized)
Jun 13 09:32:26 replica2 named[14811]: client <master_ip>#51903: received notify for zone 'testrelm.com'
Jun 13 09:32:26 nec-em24-3 named[14811]: client <master_ip>#51903: received notify for zone '<replica_ptr_zone_reverse_ip>.in-addr.arpa'


Found in replica2:/var/log/dirsrv/slapd-TESTRELM-COM/errors:

[13/Jun/2012:09:32:26 -0400] slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error -2 (Local error) (SASL(-1): generic failure: GSSAPI Error: An invalid name was supplied (Hostname cannot be canonicalized)) errno 110 (Connection timed out)
[13/Jun/2012:09:32:26 -0400] slapi_ldap_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: error -2 (Local error)
[13/Jun/2012:09:32:26 -0400] NSMMReplicationPlugin - agmt="cn=meToreplica1.testrelm.com" (replica1:389): Replication bind with GSSAPI auth failed: LDAP error -2 (Local error) (SASL(-1): generic failure: GSSAPI Error: An invalid name was supplied (Hostname cannot be canonicalized))

This seems like an LDAP driver issue.

It seems like a DS problem more than a bind-dyndb-ldap problem.

bind-dyndb-ldap reports "LDAP query timed out" - for some reason DS is not able to answer an LDAP query.

Technically: ldap_search_ext_s() returned an error and LDAP_OPT_RESULT_CODE was LDAP_TIMEOUT.

Scott, can you confirm whether DS is really unable to reply to an LDAP query during the ipa-replica-manage process (you can query for anything in the cn=dns subtree)?
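For reference, such a check might look roughly like the ldapsearch below. This is only an illustration: the base DN comes from the logs in this ticket, and the bind identity and options depend on the environment.

    # Example only: run on the affected server while re-initialize is in progress
    ldapsearch -x -D "cn=Directory Manager" -W -H ldap://localhost \
        -b "cn=dns,dc=testrelm,dc=com" -s sub "(objectClass=idnsConfigObject)"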

Unfortunately there is no automatic way to recover from this situation. When DS is not functional, bind-dyndb-ldap can't do any name translation.

You can specify another DNS server in /etc/resolv.conf before changing the local DS configuration.
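For example, one way to do this is to add a fallback nameserver entry (192.0.2.53 below is only a placeholder for some other reachable DNS server):

    # /etc/resolv.conf - example with a fallback server
    search testrelm.com
    nameserver 127.0.0.1
    nameserver 192.0.2.53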

Or, hypothetically, the ipa-replica-manage script could do all DNS queries before taking any action in DS. In that case all necessary DNS data would be cached inside bind-dyndb-ldap for 120 seconds (by default), so it would not be necessary to contact DS for name translation.
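A minimal sketch of that idea follows. The pre_resolve() helper is hypothetical (nothing like it exists in ipa-replica-manage today); it just resolves every hostname once before DS is touched, so the answers sit in the bind-dyndb-ldap cache for the default 120 seconds.

    import socket

    def pre_resolve(hostnames):
        # Hypothetical helper: resolve each hostname once up front so no
        # LDAP-backed DNS lookup is needed while DS is being reconfigured.
        for name in hostnames:
            # getaddrinfo raises socket.gaierror if the name cannot be
            # resolved, i.e. DNS is already broken before anything changes
            socket.getaddrinfo(name, None)

    # e.g. pre_resolve([thishost, fromhost]) before force_sync / re-init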

Why does re-initialize do a force-sync? This is not necessary.

        repl = replication.ReplicationManager(realm, fromhost, dirman_passwd)
        agreement = repl.get_replication_agreement(thishost)
        repl.force_sync(repl.conn, thishost)

        repl.initialize_replication(agreement.dn, repl.conn)
        repl.wait_for_repl_init(repl.conn, agreement.dn)

force_sync is a waste of time in this case.
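Presumably the re-initialize path could then be reduced to something like this (just the snippet above with the force_sync call dropped; a sketch, not a tested patch):

        repl = replication.ReplicationManager(realm, fromhost, dirman_passwd)
        agreement = repl.get_replication_agreement(thishost)

        # no force_sync here: the full re-init below makes the forced
        # incremental sync redundant
        repl.initialize_replication(agreement.dn, repl.conn)
        repl.wait_for_repl_init(repl.conn, agreement.dn)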

With the DNS issue, I'm not sure what's going on. What DNS server does replica2 use? Nothing is happening to bind on replica2, correct? So replica2 should look up the hostname for replica1 in its DNS, which should be up and running. Or is bind configured such that the DNS on replica2 doesn't know anything about replica1 and has to contact DNS running on replica1 to get that information? I'm not sure why replica2 would have problems looking up replica1.

I don't have the same servers up at this time to reproduce readily. I am looking at trying to get something else set up where I can reproduce. I've uploaded the named.conf from replica1/2 to the BZ though.

Each replica though is pointing to itself for DNS and was installed with setup-dns. bind on replica2 should know about replica1 because the info is replicated in the directory...right? There's no custom DNS config outside of IPA here.

If I can reproduce on another environment, I will check with an ldapsearch to see if I can access cn=dns. In the mean time, let me know if you think of anything else I should look for there too.

Thanks,
Scott

I guess maybe the DNS information about replica1 was not in the DNS database for replica2? Was it ever in there?

In the case of missing information, bind-dyndb-ldap simply returns NXDOMAIN (name doesn't exist). "LDAP query timed out" points to a problem with DS, but I have no idea why DS refuses to reply to a query. I'm also trying to reproduce it.

I've done a small test: set up two masters, initialized one from the other, and on the one that is the supplier for the init performed some searches; the searches worked. So I'm not sure what's going on. If you can reproduce it, please provide the access and error logs from the 389 supplier so we can confirm that the search attempts are occurring during the init.

There is something really strange going on. Looks like replicaA is attempting to re-init replica B:

[13/Jun/2012:09:32:01 -0400] conn=47 op=4 MOD dn="cn=meToreplicaB.testrelm.com,cn=replica,cn=dc\3Dtestrelm\2Cdc\3Dcom,cn=mapping tree,cn=config"
[13/Jun/2012:09:32:01 -0400] conn=47 op=4 RESULT err=0 tag=103 nentries=0 etime=0
[13/Jun/2012:09:32:02 -0400] conn=47 op=5 MOD dn="cn=meToreplicaB,cn=replica,cn=dc\3Dtestrelm\2Cdc\3Dcom,cn=mapping tree,cn=config"
[13/Jun/2012:09:32:02 -0400] conn=47 op=5 RESULT err=0 tag=103 nentries=0 etime=0
[13/Jun/2012:09:32:02 -0400] conn=47 op=6 MOD dn="cn=meToreplicaB.testrelm.com,cn=replica,cn=dc\3Dtestrelm\2Cdc\3Dcom,cn=mapping tree,cn=config"
[13/Jun/2012:09:32:03 -0400] conn=47 op=6 RESULT err=0 tag=103 nentries=0 etime=1

This is the usual force sync followed by re-init that ipa-replica-manage re-initialize does. From the consumer:

[13/Jun/2012:09:32:03 -0400] conn=57 fd=72 slot=72 connection from replicaA to replicaB
[13/Jun/2012:09:32:03 -0400] conn=57 op=0 BIND dn="" method=sasl version=3 mech=GSSAPI
[13/Jun/2012:09:32:06 -0400] conn=57 op=0 RESULT err=14 tag=97 nentries=0 etime=3, SASL bind in progress
[13/Jun/2012:09:32:08 -0400] conn=14 op=7714 SRCH base="cn=dns,dc=testrelm,dc=com" scope=2 filter="(objectClass=idnsConfigObject)" attrs="idnsForwardPolicy idnsForwarders idnsAllowSyncPTR idnsZoneRefresh idnsPersistentSearch"
[13/Jun/2012:09:32:08 -0400] conn=14 op=7714 RESULT err=0 tag=101 nentries=1 etime=0
[13/Jun/2012:09:32:08 -0400] conn=14 op=7715 SRCH base="cn=dns,dc=testrelm,dc=com" scope=2 filter="(&(objectClass=idnsZone)(idnsZoneActive=TRUE))" attrs="idnsName idnsUpdatePolicy idnsAllowQuery idnsAllowTransfer idnsForwardPolicy idnsForwarders"
[13/Jun/2012:09:32:08 -0400] conn=14 op=7715 RESULT err=0 tag=101 nentries=4 etime=0
[13/Jun/2012:09:32:26 -0400] conn=57 op=1 UNBIND
[13/Jun/2012:09:32:26 -0400] conn=57 op=1 fd=72 closed - U1

conn=57 starts a SASL/GSSAPI bind, then nothing for 20 seconds, then it does an UNBIND and closes the connection. From replicaA, I can see what looks to be BIND traffic:

[13/Jun/2012:09:32:04 -0400] conn=15 op=7709 SRCH base="cn=dns,dc=testrelm,dc=com" scope=2 filter="(objectClass=idnsConfigObject)" attrs="idnsForwardPolicy idnsForwarders idnsAllowSyncPTR idnsZoneRefresh idnsPersistentSearch"
[13/Jun/2012:09:32:04 -0400] conn=15 op=7709 RESULT err=0 tag=101 nentries=1 etime=1
[13/Jun/2012:09:32:04 -0400] conn=15 op=7710 SRCH base="cn=dns,dc=testrelm,dc=com" scope=2 filter="(&(objectClass=idnsZone)(idnsZoneActive=TRUE))" attrs="idnsName idnsUpdatePolicy idnsAllowQuery idnsAllowTransfer idnsForwardPolicy idnsForwarders"
[13/Jun/2012:09:32:04 -0400] conn=15 op=7710 RESULT err=0 tag=101 nentries=4 etime=0

Then a lot of search requests from the ipa-replica-manage client to get the status, then this:

[13/Jun/2012:09:32:26 -0400] conn=15 op=7712 SRCH base="idnsname=testrelm.com,cn=dns,dc=testrelm,dc=com" scope=0 filter="(objectClass=idnsRecord)" attrs=ALL
[13/Jun/2012:09:32:26 -0400] conn=16 op=21 SRCH base="idnsname=testrelm.com,cn=dns,dc=testrelm,dc=com" scope=0 filter="(objectClass=idnsRecord)" attrs=ALL
[13/Jun/2012:09:32:26 -0400] conn=15 op=7712 RESULT err=0 tag=101 nentries=1 etime=20
[13/Jun/2012:09:32:26 -0400] conn=15 op=7713 SRCH base="idnsname=testrelm.com,cn=dns,dc=testrelm,dc=com" scope=0 filter="(objectClass=idnsRecord)" attrs=ALL
[13/Jun/2012:09:32:26 -0400] conn=16 op=21 RESULT err=0 tag=101 nentries=1 etime=20
[13/Jun/2012:09:32:26 -0400] conn=15 op=7713 RESULT err=0 tag=101 nentries=1 etime=0

Two strange things here:
1) there was no op=7711 - it goes from conn=15 op=7710 to op=7712 - what happened to op=7711???
2) etime=20 - the search took 20 seconds to complete - this is very bad

It looks as though the initial lookup of the consumer hostname by the supplier worked (the supplier was able to open a connection to the consumer), but something during the Kerberos handshake also attempted to look up the hostname, and that failed. I don't see anything from the krb5kdc log at this time.

Could be some sort of deadlock inside the server that gets resolved by the BIND connection timeout.

I will keep trying to reproduce.

If you want to prolong the BIND timeout, you can change the timeout value in /etc/named.conf to some very high value (in seconds):

Relevant part of /etc/named.conf follows:

dynamic-db "ipa" {
         library "ldap.so";
         arg "uri ldapi://%2fvar%2frun%2fslapd-E-ORG.socket";
         arg "timeout 86400";
};

Petr, the searches like

[13/Jun/2012:09:32:04 -0400] conn=15 op=7709 SRCH base="cn=dns,dc=testrelm,dc=com" scope=2 filter="(objectClass=idnsConfigObject)" attrs="idnsForwardPolicy idnsForwarders idnsAllowSyncPTR idnsZoneRefresh idnsPersistentSearch"
[13/Jun/2012:09:32:04 -0400] conn=15 op=7709 RESULT err=0 tag=101 nentries=1 etime=1
[13/Jun/2012:09:32:04 -0400] conn=15 op=7710 SRCH base="cn=dns,dc=testrelm,dc=com" scope=2 filter="(&(objectClass=idnsZone)(idnsZoneActive=TRUE))" attrs="idnsName idnsUpdatePolicy idnsAllowQuery idnsAllowTransfer idnsForwardPolicy idnsForwarders"
[13/Jun/2012:09:32:04 -0400] conn=15 op=7710 RESULT err=0 tag=101 nentries=4 etime=0

It looks like bind is doing these every 30 seconds?

What about the searches like this:

[13/Jun/2012:09:32:26 -0400] conn=15 op=7712 SRCH base="idnsname=testrelm.com,cn=dns,dc=testrelm,dc=com" scope=0 filter="(objectClass=idnsRecord)" attrs=ALL
[13/Jun/2012:09:32:26 -0400] conn=16 op=21 SRCH base="idnsname=testrelm.com,cn=dns,dc=testrelm,dc=com" scope=0 filter="(objectClass=idnsRecord)" attrs=ALL
[13/Jun/2012:09:32:26 -0400] conn=15 op=7712 RESULT err=0 tag=101 nentries=1 etime=20
[13/Jun/2012:09:32:26 -0400] conn=15 op=7713 SRCH base="idnsname=testrelm.com,cn=dns,dc=testrelm,dc=com" scope=0 filter="(objectClass=idnsRecord)" attrs=ALL
[13/Jun/2012:09:32:26 -0400] conn=16 op=21 RESULT err=0 tag=101 nentries=1 etime=20
[13/Jun/2012:09:32:26 -0400] conn=15 op=7713 RESULT err=0 tag=101 nentries=1 etime=0

1) Is bind doing these searches? Why? What I mean is, is bind doing these searches because some client of bind is asking for DNS information? Or is it some sort of heartbeat/cache refresh like the conn=15 op=7709 and conn=15 op=7710 searches above?
2) Why are there two separate connections doing these searches? Does bind open two connections? Is there some other client doing these idnsname searches?

The problem does not necessarily seem to be replication related. I have seen named LDAP timeouts when the only clients are sssd, kdc, and named - no replication at all. I'm trying to find out what causes this named traffic.

Rich,

1) Is bind doing these searches? Why? What I mean is, is bind doing these searches because some client of bind is asking for DNS information? Or is it some sort of heartbeat/cache refresh like the conn=15 op=7709 and conn=15 op=7710 searches above?

Searches such as

[13/Jun/2012:09:32:04 -0400] conn=15 op=7709 SRCH base="cn=dns,dc=testrelm,dc=com" scope=2 filter="(objectClass=idnsConfigObject)" attrs="idnsForwardPolicy idnsForwarders idnsAllowSyncPTR idnsZoneRefresh idnsPersistentSearch"

[13/Jun/2012:09:32:04 -0400] conn=15 op=7710 SRCH base="cn=dns,dc=testrelm,dc=com" scope=2 filter="(&(objectClass=idnsZone)(idnsZoneActive=TRUE))" attrs="idnsName idnsUpdatePolicy idnsAllowQuery idnsAllowTransfer idnsForwardPolicy idnsForwarders"

are done periodically to fetch new configuration (op 7709) and new DNS zones (op 7710). The period is configurable by the "zone_refresh" argument in /etc/named.conf; it is 30 seconds in a default IPA install.
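For illustration, this would sit in the same dynamic-db block as the "timeout" example earlier in this ticket, roughly like this (30 is the default mentioned above):

    dynamic-db "ipa" {
             /* ... other arg lines ... */
             arg "zone_refresh 30";
    };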

Another "kind" of search is

[13/Jun/2012:09:32:26 -0400] conn=15 op=7713 SRCH base="idnsname=testrelm.com,cn=dns,dc=testrelm,dc=com" scope=0 filter="(objectClass=idnsRecord)" attrs=ALL

Searches like op 7713 are done when a specific DNS record was requested. This happens in two basic situations:

a. A DNS client asked for a record.
b. A new DNS zone was found (during op 7710) and the plugin needs to access some basic records (SOA, NS, A/AAAA in the zone root record) from the new DNS zone.

The basic "search flow" after BIND startup is:
1. op 7709 (fetch the plugin configuration)
2. op 7710 (fetch the list of all active DNS zones)
3. One or more individual search operations for DNS record objects. These are the "op 7713"-style searches: the plugin looks for one or more objects with (objectClass=idnsRecord) belonging to each active DNS zone.

(An active DNS zone is an object which matches (&(objectClass=idnsZone)(idnsZoneActive=TRUE)) in the cn=dns subtree.)

2) Why are there two separate connections doing these searches? Does bind open two connections? Is there some other client doing these idnsname searches?

BIND opens multiple connections to DS (the count is configurable by the "connections" argument in /etc/named.conf). It was designed this way before I started, but I think it was mainly for performance reasons.

There are also structural/code reasons, but I have a ticket open for plugin structural improvements. The plugin will be able to work with a single connection in the future. (For now it is not a priority. Without persistent search the required minimum is 2 connections.)
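Likewise, the connection count is set with a line such as the following in the same dynamic-db block (the value 4 is only an example, not necessarily the shipped default):

         arg "connections 4";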

There should be no other client doing searches in the cn=dns subtree (as long as nobody accesses the IPA interfaces).

Is the principle of operation understandable from this text? Please ask if something is unclear.

Reassigning from me, because I really don't have anything to do with the issue.

This issue is not reproducible any more. Closing the ticket.

Metadata Update from @dpal:
- Issue assigned to rmeggins
- Issue set to the milestone: FreeIPA 3.0 RC1

7 years ago
