Issue #2702: RFE: Improve AD site discovery process - sssd

Closed: cloned-to-github 3 years ago by pbrezina. Opened 8 years ago by ondrejv2.

Currently, AD site discovery as it is implemented in SSSD follows this document:
https://fedorahosted.org/sssd/wiki/DesignDocs/ActiveDirectoryDNSSites

The problem with this process is that it assumes all domain controllers are reachable from the client. If some are not (for example, DCs in a different geographical region behind a VPN), discovery risks running into a timeout.

I propose sending the LDAP ping to all discovered domain controllers in parallel rather than sequentially. This would cover the scenario where only a couple of DCs are reachable from the client, and it would be more in line with the AD locator process described here:

https://technet.microsoft.com/en-us/library/cc978011.aspx

Quote:

The Net Logon service sends a datagram to the discovered domain controllers ("pings" the computers) that register the name. For NetBIOS domain names, the datagram is implemented as a mailslot message. For DNS domain names, the datagram is implemented as an LDAP UDP search.

Each available domain controller responds to the datagram to indicate that it is currently operational and then returns the information to DsGetDcName.

The Net Logon service returns the information to the client from the domain controller that responds first.

Ondrej
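
The "ping everyone at once, use whoever answers first" idea proposed above can be sketched roughly as follows. This is only an illustration in Python, not SSSD code: `first_responder`, `simulated_probe`, the host names and the delays are all made up for the example, and `probe` stands in for a real LDAP ping.

```python
import asyncio


async def first_responder(servers, probe, timeout=2.0):
    """Probe every server concurrently; return the first one that answers, else None.

    probe(server) is a hypothetical coroutine standing in for one LDAP ping:
    it returns when the server answers and raises if the server is unreachable.
    """
    tasks = {asyncio.ensure_future(probe(server)): server for server in servers}
    pending = set(tasks)
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    try:
        while pending:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            done, pending = await asyncio.wait(
                pending, timeout=remaining, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                if task.exception() is None:
                    return tasks[task]  # first successful answer wins
            # Only failures so far (or nothing yet): keep waiting for the rest.
        return None
    finally:
        for task in pending:
            task.cancel()  # stop waiting on servers that never answered


# Simulated reply delays in seconds; None models a silently dropped packet.
SIMULATED_DELAY = {
    "dc1.example.test": None,
    "dc2.example.test": 0.05,
    "dc3.example.test": 0.3,
}


async def simulated_probe(server):
    delay = SIMULATED_DELAY[server]
    if delay is None:
        await asyncio.sleep(3600)  # never answers within the timeout
        raise TimeoutError(server)
    await asyncio.sleep(delay)


if __name__ == "__main__":
    print(asyncio.run(first_responder(SIMULATED_DELAY, simulated_probe, timeout=1.0)))
```

With this shape the total wait is bounded by a single timeout no matter how many of the listed DCs are unreachable, whereas a sequential walk can spend one timeout on each unreachable DC it tries.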

krissn commented 8 years ago

Fields changed

cc: => krzysztof.a.nowicki+fedora@gmail.com

jhrozek commented 8 years ago

Should be done together with the failover refactoring in 1.14.

milestone: NEEDS_TRIAGE => SSSD 1.14 beta

ondrejv2 commented 8 years ago

Note that a quick fix is to enforce a REJECT rule on the firewall instead of DROP. This way the timeout is never reached and everything works well.
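
To illustrate why REJECT helps, here is a small Python sketch of how a client sees the two cases. It assumes a plain TCP connection attempt (the LDAP ping quoted earlier is UDP, where the mechanics differ), and `classify_port` is a name invented for this example, not anything from SSSD.

```python
import socket


def classify_port(host, port, timeout=3.0):
    """Try one TCP connection and report how the attempt ended."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"      # actively rejected: the error arrives almost immediately
    except socket.timeout:
        return "timeout"      # nothing came back (e.g. DROP): we waited the full timeout
    except OSError:
        return "unreachable"  # other immediate network errors
```

A rejected attempt comes back as an error right away, so the caller can move on to the next server; a dropped one gives no signal at all until the timeout expires.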

jhrozek commented 8 years ago

Thanks for testing.

jhrozek commented 8 years ago

Fields changed

rhbz: => todo

ondrejv2 commented 8 years ago

Actually, this bug now seems to be far more serious:
1. We start the AD site discovery process by connecting sequentially to ALL DCs registered in SRV. It has to be done this way because we do not yet know which site we belong to.
If the random DC we pick responds, we are lucky: we can finish the discovery process and learn which site we belong to.
If we are unlucky, the DC is unresponsive and we end up with the AD provider offline.

Once we have discovered the site, there is still the risk that some DCs defined for the site are down or unresponsive, so the scenario above can happen there as well.

Moreover, it seems to me that SSSD does the site discovery too often. It should be done when SSSD starts up or when we detect an IP address change; there is no reason to do it otherwise. Currently I see it in the logs on a regular basis:

(Thu Aug 13 13:11:13 2015) [sssd[be[default]]] [collapse_srv_lookup] (0x0100): Need to refresh SRV lookup for domain Prague._sites.dublin.ad.s3group.com
(Thu Aug 13 13:11:13 2015) [sssd[be[default]]] [ad_srv_plugin_send] (0x0400): About to find domain controllers
(Thu Aug 13 13:11:13 2015) [sssd[be[default]]] [ad_get_dc_servers_send] (0x0400): Looking up domain controllers in domain dublin.ad.s3group.com

-> what the heck? We know our SITE is Prague already. We do not need to enumerate the whole domain now.
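
The policy suggested above (discover the site at startup, then again only when the client's IP addresses change) could look something like this small Python sketch. `SiteCache` and the `discover` callable are hypothetical names for illustration; this is not how SSSD is implemented.

```python
class SiteCache:
    """Remember the discovered AD site until the client's addresses change."""

    def __init__(self, discover):
        self._discover = discover  # hypothetical callable: () -> site name
        self._addresses = None     # addresses seen at the last discovery
        self._site = None

    def site(self, current_addresses):
        addresses = frozenset(current_addresses)
        if addresses != self._addresses:
            # First call (startup) or the network changed: rediscover.
            self._site = self._discover()
            self._addresses = addresses
        return self._site
```

Repeated calls with the same address set return the cached site without touching the network; only a changed address set triggers a new discovery.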

jhrozek commented 8 years ago

I suspect this is no longer an issue since we fixed https://fedorahosted.org/sssd/ticket/2765

If you disagree, please reopen.

resolution: => fixed
status: new => closed

ondrejv2 commented 8 years ago

I think having #2765 fixed is a good improvement, but unfortunately it is not enough. In most cases the ad_site parameter is not specified, so we still need to discover the site first, and that is the problem.
If we keep connecting to DCs sequentially, there will always be a risk that we eventually time out and sssd goes offline.

Ideally we need to:
1. Send the request to all DCs in parallel and connect to the one that responds first
2. Discover our site
3. Continue as per #2765
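
The three steps above can be sketched end to end as below. This is a rough Python illustration under several assumptions: `resolve_srv` and `ping` are hypothetical stand-ins for a DNS SRV query and an LDAP ping, step 3 is assumed to mean "look up the servers registered for the discovered site", and the site-scoped record name follows the `<site>._sites.<domain>` pattern visible in the log excerpt earlier in this ticket.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FuturesTimeout


def srv_name(domain, site=None):
    """SRV record name for LDAP servers: domain-wide, or scoped to one AD site."""
    if site:
        return f"_ldap._tcp.{site}._sites.{domain}"
    return f"_ldap._tcp.{domain}"


def discover_site_servers(domain, resolve_srv, ping, timeout=2.0):
    """Return (site, servers in that site), or (None, []) if no DC answers in time.

    resolve_srv(name) -> list of host names  (hypothetical DNS SRV lookup)
    ping(host)        -> client's site name  (hypothetical LDAP ping; raises on failure)
    """
    candidates = resolve_srv(srv_name(domain))
    if not candidates:
        return None, []
    pool = ThreadPoolExecutor(max_workers=len(candidates))
    try:
        # Step 1: ping every DC at the same time.
        futures = [pool.submit(ping, host) for host in candidates]
        try:
            for future in as_completed(futures, timeout=timeout):
                if future.exception() is None:
                    site = future.result()  # Step 2: first good reply names our site
                    # Step 3: continue with the servers registered for that site.
                    return site, resolve_srv(srv_name(domain, site))
        except FuturesTimeout:
            pass  # nobody answered within the deadline
        return None, []
    finally:
        pool.shutdown(wait=False)  # do not block on probes that are still hanging
```

Threads are used here only to keep the sketch short; a real probe would also need its own socket timeout so that abandoned workers eventually finish.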

ondrejv2 commented 8 years ago

Fields changed

resolution: fixed =>
status: closed => reopened

cristi commented 7 years ago

Fields changed

cc: krzysztof.a.nowicki+fedora@gmail.com => krzysztof.a.nowicki+fedora@gmail.com, cristi.falcas@gmail.com

jhrozek commented 7 years ago

This ticket tracks the incremental improvement over #2765, but is not in the scope of 1.14 beta.

milestone: SSSD 1.14 beta => SSSD 1.15 beta

Metadata Update from @ondrejv2:
- Issue set to the milestone: SSSD Future releases (no date set yet)

7 years ago

Metadata Update from @thalman:
- Custom field design_review reset (from 0)
- Custom field mark reset (from 0)
- Custom field patch reset (from 0)
- Custom field review reset (from 0)
- Custom field sensitive reset (from 0)
- Custom field testsupdated reset (from 0)
- Issue close_status updated to: None
- Issue tagged with: Future milestone

4 years ago

hicksdc commented 4 years ago

Hi. Has this ticket fallen through the cracks? The behaviour appears unchanged in 1.16.4-21...

I also raised a support case around the same time (01756341) and another recently (02581536), because the ad_site workaround is about to get more awkward for us to live with (our new deployment system is for multiple sites, regions, networks and AD domains).

Any chance this one could be discussed again please?..

pbrezina commented 4 years ago

Hi, from the upstream point of view, we do plan to work on this ticket, but not in the near future due to capacity reasons.

Since you are a RHEL customer, please use the customer portal and communicate with support engineers to increase priority.

Metadata Update from @pbrezina:
- Custom field design_review reset (from false)
- Custom field mark reset (from false)
- Custom field patch reset (from false)
- Custom field review reset (from false)
- Custom field sensitive reset (from false)
- Custom field testsupdated reset (from false)

4 years ago

hicksdc commented 4 years ago

Thanks Pavel.

Recently I have been trying to push this through our RHEL support channels, but have only managed to get them to raise https://bugzilla.redhat.com/show_bug.cgi?id=1819012 on my behalf. Is there anything more I can ask them to do?

pbrezina commented 4 years ago

Thank you. This is enough for the moment. It will go through the downstream planning process and we'll let you know in the Bugzilla.

4 years ago

pbrezina commented 3 years ago

SSSD is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in SSSD's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/SSSD/sssd/issues/3743

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for any inconvenience.

Metadata Update from @pbrezina:
- Issue close_status updated to: cloned-to-github
- Issue status updated to: Closed (was: Open)

3 years ago

Metadata

Assignee: None
Tags: Future milestone
Blocking: None
Depending on: None
Priority: major
Milestone: SSSD Future releases (no date set yet)
type: enhancement
component: SSSD
version: 1.12.5
selected: None
testsupdated: false
patch: false
rhbz: todo
design_review: false
review: false
changelog: None
keywords: None
coverity: None
mark: false
blocking: None
design: None
sensitive: false
cc: krzysztof.a.nowicki+fedora@gmail.com, cristi.falcas@gmail.com
blockedby: None
feature_milestone: None
