#2936 Slow IPA login
Closed: Invalid. Opened 8 years ago by mogthesprog.

Hi,

I've done plenty of research on the internet looking for a solution to this problem, but I'm still pretty stuck.

I have a FreeIPA instance which contains 3 users and a similar number of groups. I've configured one small VM to authenticate against the IPA instance. However, logon takes 4-5 seconds. I've attached my config file to the ticket; any help in solving this would be appreciated.

I've tested kinit; it's instant. When I execute "time ipa user-find admin", the query takes about 2.5 seconds. I think this may be the source of the problem.

I've also ensured that multicasting is disabled and that resolv.conf has the right nameservers.

Cheers,

Morgan


Sounds like a DNS issue.
How long does it take to resolve the IPA server from the test machine?
Also, krb5.conf from the test machine might be helpful.

In general, looking at the delay in the logs might be worth the time; see https://fedorahosted.org/sssd/wiki/Troubleshooting

We're working on improving performance for 1.14, but a simple user lookup shouldn't take this long.

Hi, Thanks for the speedy response...

So I'm not sure of the best way to test the DNS response time, but I used dig and nslookup; the output is below.

[root@central-puppet morganj]# dig @10.0.0.4 central-ipa-001.globalscience.dh

; <<>> DiG 9.9.4-RedHat-9.9.4-29.el7_2.1 <<>> @10.0.0.4 central-ipa-001.globalscience.dh
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 57945
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;central-ipa-001.globalscience.dh. IN   A

;; ANSWER SECTION:
central-ipa-001.globalscience.dh. 1200 IN A     10.0.0.4

;; AUTHORITY SECTION:
globalscience.dh.       86400   IN      NS      central-ipa-001.globalscience.dh.

;; Query time: 1 msec
;; SERVER: 10.0.0.4#53(10.0.0.4)
;; WHEN: Mon Jan 25 08:30:12 UTC 2016
;; MSG SIZE  rcvd: 91

[root@central-puppet morganj]# time nslookup central-ipa-001.globalscience.dh
Server:         10.0.0.4
Address:        10.0.0.4#53

Name:   central-ipa-001.globalscience.dh
Address: 10.0.0.4


real    0m0.007s
user    0m0.002s
sys     0m0.004s
[root@central-puppet morganj]# time nslookup central-ipa-001
Server:         10.0.0.4
Address:        10.0.0.4#53

Name:   central-ipa-001.globalscience.dh
Address: 10.0.0.4


real    0m0.021s
user    0m0.005s
sys     0m0.001s

However, I then tried to use dig with the IP of the FreeIPA server and that took 170ms, which is an age. The second time I ran it the time came down, but it now fluctuates between 20ms and 50ms.

Also, I'll upload the krb5.conf now :) Let me know if you need anything else.

Cheers jhrozek, I took a log for a logon and I've kept the part where the delay takes place. I'll upload that too, just a sec.

I just wanted to clarify that a kinit returns instantaneously. Cheers

Can you attach the krb5_child.log as well? It looks like most of the time is spent in krb5_child.

Fields changed

cc: => sbose

Is there a way to turn up the verbosity on krb5_child.log? Attaching at the current verbosity anyway...

Just set debug_level=10 in the [domain/...] section of sssd.conf and restart SSSD; this will increase the debug_level of krb5_child as well.
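
For illustration, the relevant part of sssd.conf might then look like this (a minimal sketch; the domain section name is an assumption based on the DNS domain shown above):

[domain/globalscience.dh]
# maximum debug output; krb5_child inherits this level
debug_level = 10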

Hi guys sorry for the delay, uploading the log now.

Thank you for the logs. To me it looks like there is no specific issue making the authentication slow. But in contrast to a simple kinit, SSSD's krb5_child performs some extra operations to increase security, which result in some extra round trips between the client and the KDC.

First, SSSD uses FAST (see e.g. http://k5wiki.kerberos.org/wiki/Projects/FAST for details) to establish an encrypted channel that protects against certain brute-force attacks on captured network packets. For this, an additional authentication with a key from the host keytab is needed.

After the TGT for the user is returned, SSSD validates the ticket by requesting a service ticket for a service from the host keytab and checks whether the returned service ticket can be decrypted with the related key from the keytab. This prevents attacks where a malicious KDC tries to hijack the connection.

You can disable those with 'krb5_use_fast=never' and 'krb5_validate=false' respectively, but I would recommend against it.
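
For reference only, a sketch of what disabling both would look like in sssd.conf (the domain section name is an assumption, and as said above this weakens security, so use it for testing only):

[domain/globalscience.dh]
# testing only: skip the FAST-armored exchange with the KDC
krb5_use_fast = never
# testing only: skip validating the TGT against the host keytab
krb5_validate = false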

As a side note: when kinit shows the password prompt, all communication with the KDC is already done; the password is just needed to decrypt the response from the KDC.

Okay, no worries. So by disabling krb5_use_fast, what exactly changes compared to a standard kinit? I just want to understand the implications before doing so. As it stands, a login at 4-5 seconds is quite a long time, especially when a standard kinit is instantaneous.

Thanks again for the quick response, guys. We're running a PoC on FreeIPA at the moment and the only issue (at the moment) is the duration of the logins.

Cheers,

Morgan

Also, do these configs go under the domain/ section of sssd.conf?

I'm still wondering whether the two issues are related. I added the two Kerberos configs (just to see if they improved performance for the login) and they didn't seem to do much. Are we sure that the user lookup time is unrelated to the slow login?

The time for login also fluctuates quite a lot, on the order of a second or a second and a half of difference. Any idea why that may be? Bear in mind that there are only 3 users in IPA, and only me authenticating at any one time.


To get reproducible results it might be best to run the test against a freshly started SSSD with an empty cache. Typically I run something like

systemctl stop sssd.service ; rm -f /var/lib/sss/db/* ; rm -f /var/log/sssd/* ; systemctl start sssd

to achieve this. Before running the test, please add debug_level=10 to the [pam] and [nss] sections as well.
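
For example, a minimal sketch of those extra sections in sssd.conf:

[nss]
debug_level = 10

[pam]
debug_level = 10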

I agree that 5s is quite long. I run FreeIPA and an SSSD client in different VMs on a laptop and see the login happen within the same second. What does your PoC look like: are you running the test on the FreeIPA server or on a client?

Thanks for that, just on my way home from work now so I'll either try later this evening or tomorrow morning. And I'll post back the logs :)

The PoC currently consists of two VMs in Azure running CentOS 7.2. I don't think I can know where in the cloud they sit, or whether they're on the same machine. In a month or so we plan to do the same thing in our DC, where we'll know a lot more about the environment. Following one of the previous comments, I'm wondering whether it is a DNS-related thing. Do you run the DNS server on the same VM as the IPA server?

Cheers,

Morgan

The IPA server can have an integrated DNS server. We generally recommend it but do not require it.

Okay, so I've done the recommended thing; my IPA server also runs DNS. I'll post back in half an hour or so with those extra logs.

Okay, so as an aside, something strange is going on, maybe related to caching.

[root@central-puppet morganj]# time ssh -K morganj@central-ipa-001.globalscience.dh echo hello > /dev/null

real    0m2.309s
user    0m0.021s
sys     0m0.004s
[root@central-puppet morganj]# time ssh -K morganj@central-ipa-001.globalscience.dh echo hello > /dev/null

real    0m0.978s
user    0m0.032s
sys     0m0.009s
[root@central-puppet morganj]# time ssh -K morganj@central-ipa-001.globalscience.dh echo hello > /dev/null

real    0m2.709s
user    0m0.018s
sys     0m0.010s
[root@central-puppet morganj]# time ssh -K morganj@central-ipa-001.globalscience.dh echo hello > /dev/null

real    0m0.958s
user    0m0.022s
sys     0m0.005s


########
NOW I STARTED REPEATING THE TASK IMMEDIATELY AFTER IT FINISHED
########


[root@central-puppet morganj]# time ssh -K morganj@central-ipa-001.globalscience.dh echo hello > /dev/null

real    0m2.872s
user    0m0.015s
sys     0m0.005s
[root@central-puppet morganj]# time ssh -K morganj@central-ipa-001.globalscience.dh echo hello > /dev/null

real    0m0.912s
user    0m0.030s
sys     0m0.004s
[root@central-puppet morganj]# time ssh -K morganj@central-ipa-001.globalscience.dh echo hello > /dev/null

real    0m0.761s
user    0m0.011s
sys     0m0.004s
[root@central-puppet morganj]# time ssh -K morganj@central-ipa-001.globalscience.dh echo hello > /dev/null

real    0m1.253s
user    0m0.014s
sys     0m0.003s
[root@central-puppet morganj]# time ssh -K morganj@central-ipa-001.globalscience.dh echo hello > /dev/null

real    0m0.988s
user    0m0.029s
sys     0m0.007s
[root@central-puppet morganj]# time ssh -K morganj@central-ipa-001.globalscience.dh echo hello > /dev/null

real    0m0.713s
user    0m0.013s
sys     0m0.005s

Notice that as soon as I start repeating the task very quickly, the login happens at a normal speed. Is this an artifact of some other underlying thing, or could it be pointing us in the right direction? Just to note, this is with the two krb5 settings (validate and fast) DISABLED and selinux_provider=none, so the login should be much quicker than 3 seconds.

Just to clarify, I don't want to run with these settings disabled in production, I'm just trying to narrow down the choke point for authentication.


I've uploaded all of the logs from both server and client. The logs and cache were cleared prior to the login to the "central-puppet" client.

The logs are for one login, and a few seconds of inactivity either side.

Thanks again for all the help guys :)

With every new login SSSD tries to read all group memberships of the user from the server to make sure that e.g. access control is done with the most recent data. There is a small timeout of 5s (pam_id_timeout) during which no group lookups happen and SSSD will use the data from the cache. I think this is what you are seeing in your tests.
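
For illustration, that timeout lives in the [pam] section of sssd.conf (a sketch; 5 seconds is the default, and a larger value makes SSSD reuse cached group data for longer between logins):

[pam]
pam_id_timeout = 5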

Nevertheless, given the small number of users and groups, 2+ seconds seems quite long (I assume you already have a valid Kerberos service ticket, so that the ssh client does not need to connect to the KDC in the first test to get one). I'll check the logs and come back to you.

Yep, that's correct. I did a kinit before issuing ssh -K, and assuming the defaults are the usual Kerberos ticket expirations, the ticket shouldn't need renewing.

Thanks sbose

Hi Guys,

Sbose, any idea from the logs what's causing the delay?

Cheers,

Morgan

Unfortunately each step happens within a single second, so the timestamps do not really help to identify which operation is slow.

08:49:22: the user is looked up in the nss responder, i.e. sshd is checking if the user exists

... I guess you typed the password here ...

08:49:24: the authentication request enters the PAM responder and the request to look up the group memberships is sent to the backend

08:49:25: the group-membership lookup is done, the authentication request is sent to the backend

08:49:26: the authentication result is sent back to sshd and sshd sends the access control request

08:49:27: the access control result is sent back to sshd and sshd calls the remaining PAM steps, which are basically no-ops and happen in the same second.

There is an option debug_microseconds to get a better time resolution.
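
A sketch of enabling it together with the debug level (it can be set per section in sssd.conf):

[domain/globalscience.dh]
debug_level = 10
# add microsecond resolution to the debug timestamps
debug_microseconds = true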

Additionally, to rule out disk I/O effects, you might want to try putting the SSSD cache on tmpfs as described at https://jhrozek.wordpress.com/2015/08/19/performance-tuning-sssd-for-large-ipa-ad-trust-deployments/ in the section 'Mount the cache in tmpfs'.
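
A rough sketch of what that section describes, as an /etc/fstab entry (the size and SELinux context are assumptions to adapt to your system):

tmpfs /var/lib/sss/db tmpfs size=300M,mode=0700,rootcontext=system_u:object_r:sssd_var_lib_t:s0 0 0

Stop SSSD and clear the cache before mounting, e.g.:

systemctl stop sssd ; rm -f /var/lib/sss/db/* ; mount /var/lib/sss/db ; systemctl start sssd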

Hi Sbose,

Thanks for this link. I implemented the tmpfs cache and it's performing a lot more consistently. This caused me to test I/O with dd, and it seems Azure disk performance is a bit shocking (likely due to a noisy neighbour of some sort).

As a result of this, I'll implement the same thing on two VMs inside our own network and then take another look at authentication time.

We can close this for now if you like, unless you want those microsecond logs anyway? I can do those tests on Monday morning.

If, after I've implemented this internally, I still have the same problem, then I'll come back and reopen this ticket.

Thanks so much for your help; I've learnt a lot about SSSD. I appreciate it.

Cheers,

Morgan

Before Sumit closes the issue, I would just like to point out that cache performance is something we will be working on during the 1.14 development cycle.

Thank you for the feedback. With the results of the research you did I think the additional logs are not needed anymore.

resolution: => worksforme
status: new => closed

Metadata Update from @mogthesprog:
- Issue set to the milestone: NEEDS_TRIAGE

7 years ago

SSSD is moving from Pagure to GitHub. This means that new issues and pull requests
will be accepted only in SSSD's GitHub repository.

This issue has been cloned to Github and is available here:
- https://github.com/SSSD/sssd/issues/3977

If you want to receive further updates on the issue, please navigate to the GitHub issue
and click on the subscribe button.

Thank you for understanding. We apologize for any inconvenience.
