Hi,
I've done plenty of research on the internet looking for a solution to this problem, but I'm still pretty stuck.
I have a FreeIPA instance which contains 3 users and a similar number of groups. I've configured one small VM to authenticate against the IPA instance. However, logon takes 4-5 seconds. I've attached my config file to the ticket; any help in solving this would be appreciated.
I've tested kinit; it's instant. When I execute "time ipa user-find admin", the query takes about 2.5 seconds. I think this may be the source of the problem.
I've also ensured that multicasting is disabled and that resolv.conf has the right nameservers.
Cheers,
Morgan
attachment sssd.conf
Sounds like a DNS issue. How long does it take to resolve the IPA server from the test machine? The krb5.conf from the test machine might also be helpful.
In general looking at the delay in the logs might be worth the time, see https://fedorahosted.org/sssd/wiki/Troubleshooting
We're working at improving performance for 1.14, but a simple user lookup shouldn't take this long.
Hi, Thanks for the speedy response...
So I'm not sure of the best way to test the DNS response time, but I used dig and nslookup; note the output below.
{{{
[root@central-puppet morganj]# dig @10.0.0.4 central-ipa-001.globalscience.dh

; <<>> DiG 9.9.4-RedHat-9.9.4-29.el7_2.1 <<>> @10.0.0.4 central-ipa-001.globalscience.dh
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 57945
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;central-ipa-001.globalscience.dh.	IN	A

;; ANSWER SECTION:
central-ipa-001.globalscience.dh. 1200	IN	A	10.0.0.4

;; AUTHORITY SECTION:
globalscience.dh.	86400	IN	NS	central-ipa-001.globalscience.dh.

;; Query time: 1 msec
;; SERVER: 10.0.0.4#53(10.0.0.4)
;; WHEN: Mon Jan 25 08:30:12 UTC 2016
;; MSG SIZE  rcvd: 91

[root@central-puppet morganj]# time nslookup central-ipa-001.globalscience.dh
Server:		10.0.0.4
Address:	10.0.0.4#53

Name:	central-ipa-001.globalscience.dh
Address: 10.0.0.4

real	0m0.007s
user	0m0.002s
sys	0m0.004s

[root@central-puppet morganj]# time nslookup central-ipa-001
Server:		10.0.0.4
Address:	10.0.0.4#53

Name:	central-ipa-001.globalscience.dh
Address: 10.0.0.4

real	0m0.021s
user	0m0.005s
sys	0m0.001s
}}}
However, I then tried to use dig with the IP of the FreeIPA server, and that took 170 ms, which is an age. The second time I ran it the time came down, but it then fluctuates between 20 ms and 50 ms.
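To compare cold versus warm lookups more systematically, something along these lines can help (a quick sketch, not part of the ticket; it times the system resolver, not dig, so results may differ slightly from the numbers above):

```python
import socket
import time

def time_lookups(name, repeats=5):
    """Time successive name lookups (in milliseconds) via the system resolver."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        # Goes through nsswitch/resolv.conf, like most client applications.
        socket.gethostbyname(name)
        timings.append((time.perf_counter() - start) * 1000)
    return timings

# Example (substitute the IPA server from this ticket):
#   print(time_lookups("central-ipa-001.globalscience.dh"))
```

If the first lookup is consistently much slower than the rest, the delay is likely resolver cache misses rather than the server itself.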
Also, I'll upload the krb5.conf now :) Let me know if you need anything else.
Cheers jhrozek, I took a log for a logon and I've kept the part where the delay takes place. I'll upload that too, just a sec.
attachment sssd_domain.log
kerb config on the client krb5.conf
I just wanted to clarify that a kinit returns instantaneously. Cheers
Can you attach the krb5_child.log as well? It looks like most of the time is spent in the krb5_child.
Fields changed
cc: => sbose
Is there a way to turn up the verbosity of krb5_child.log? Attaching at the current verbosity anyway...
attachment krb5_child.log
Just set debug_level=10 in the [domain/...] section of sssd.conf and restart SSSD; this will increase the debug level of krb5_child as well.
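For illustration (the domain name below is taken from the log file names in this ticket; use whatever your [domain/...] section is actually called):

```ini
[domain/globalscience.dh]
debug_level = 10
```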
Hi guys sorry for the delay, uploading the log now.
krb5_child at log level 10 krb5_child.2.log
Thank you for the logs. To me it looks like there is no specific issue that makes the authentication slow. But in contrast to a simple kinit, SSSD's krb5_child does some extra operations to increase security, which result in some extra round trips between the client and the KDC.
First, SSSD uses FAST (see e.g. http://k5wiki.kerberos.org/wiki/Projects/FAST for details) to establish an encrypted channel to protect against certain brute-force attacks on captured network packets. For this, an additional authentication with a key from the host keytab is needed.
After the TGT for the user is returned, SSSD validates the ticket by requesting a service ticket for a service from the host keytab and checks if the returned service ticket can be decrypted with the related key from the keytab. This prevents attacks where a malicious KDC tries to hijack the connection.
You can disable those with 'krb5_use_fast=never' and 'krb5_validate=false' respectively, but I would recommend against it.
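For troubleshooting only, the two options go into the [domain/...] section of sssd.conf (domain name below taken from the log file names in this ticket):

```ini
[domain/globalscience.dh]
# Troubleshooting only: disables FAST and TGT validation,
# removing the protections described above.
krb5_use_fast = never
krb5_validate = false
```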
As a side note, when the kinit shows the password prompt all communication with the KDC is already done, the password is just needed to decrypt the response from the KDC.
Okay, no worries. So by disabling krb5_use_fast, what exactly changes compared to a standard kinit? I just want to understand the implications before doing so. As it stands, logging in takes 4-5 seconds, which is quite a long time, especially when a standard kinit is instantaneous.
Thanks again for the quick responses, guys. We're running a PoC on FreeIPA at the moment, and the only issue (at the moment) is the duration of the logins.
Also, do these configs go under the domain/ section of sssd.conf?
I'm still wondering whether the two issues are related. I added the two Kerberos options (just to see if they improved performance for the login) and they didn't seem to do much. Are we sure that the user lookup time is unrelated to the slow login?
The time for login also fluctuates quite a lot, on the order of a second or a second and a half of difference. Any idea why that may be? Bear in mind that there are only 3 users in IPA, and only me authenticating at any one time.
To get reproducible results it might be best to run sssd freshly started with an empty cache. Typically I run something like
systemctl stop sssd.service ; rm -f /var/lib/sss/db/* ; rm -f /var/log/sssd/* ; systemctl start sssd
to achieve this. Before running the test please add debug_level=10 to the [pam] and [nss] section as well.
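For reference, the extra debug sections would look like this in sssd.conf (just a sketch; add them alongside your existing sections):

```ini
[pam]
debug_level = 10

[nss]
debug_level = 10
```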
I agree that 5s is quite long. I run FreeIPA and an SSSD client in different VMs on a laptop and see the login happen within the same second. What does your PoC look like? Are you running the test on the FreeIPA server or on a client?
Thanks for that, just on my way home from work now so I'll either try later this evening or tomorrow morning. And I'll post back the logs :)
The PoC currently consists of two VMs in Azure running CentOS 7.2. I don't think I can know where in the cloud they sit, or whether they're on the same machine. In a month or so we plan to do the same thing in our DC, where we'll know a lot more about the environment. I'm wondering whether it is a DNS-related thing, from one of the previous comments. Do you run the DNS server on the same VM as the IPA server?
The IPA server can have an integrated DNS server. We generally recommend it, but it's not required.
Okay, so I've done the recommended thing; my IPA server also runs DNS. I'll post back in half an hour or so with those extra logs.
Okay, so as an aside, something strange is going on, maybe related to caching.
{{{
[root@central-puppet morganj]# time ssh -K morganj@central-ipa-001.globalscience.dh echo hello > /dev/null

real	0m2.309s
user	0m0.021s
sys	0m0.004s

[root@central-puppet morganj]# time ssh -K morganj@central-ipa-001.globalscience.dh echo hello > /dev/null

real	0m0.978s
user	0m0.032s
sys	0m0.009s

[root@central-puppet morganj]# time ssh -K morganj@central-ipa-001.globalscience.dh echo hello > /dev/null

real	0m2.709s
user	0m0.018s
sys	0m0.010s

[root@central-puppet morganj]# time ssh -K morganj@central-ipa-001.globalscience.dh echo hello > /dev/null

real	0m0.958s
user	0m0.022s
sys	0m0.005s

######## NOW I STARTED REPEATING THE TASK IMMEDIATELY AFTER IT FINISHED ########

[root@central-puppet morganj]# time ssh -K morganj@central-ipa-001.globalscience.dh echo hello > /dev/null

real	0m2.872s
user	0m0.015s
sys	0m0.005s

[root@central-puppet morganj]# time ssh -K morganj@central-ipa-001.globalscience.dh echo hello > /dev/null

real	0m0.912s
user	0m0.030s
sys	0m0.004s

[root@central-puppet morganj]# time ssh -K morganj@central-ipa-001.globalscience.dh echo hello > /dev/null

real	0m0.761s
user	0m0.011s
sys	0m0.004s

[root@central-puppet morganj]# time ssh -K morganj@central-ipa-001.globalscience.dh echo hello > /dev/null

real	0m1.253s
user	0m0.014s
sys	0m0.003s

[root@central-puppet morganj]# time ssh -K morganj@central-ipa-001.globalscience.dh echo hello > /dev/null

real	0m0.988s
user	0m0.029s
sys	0m0.007s

[root@central-puppet morganj]# time ssh -K morganj@central-ipa-001.globalscience.dh echo hello > /dev/null

real	0m0.713s
user	0m0.013s
sys	0m0.005s
}}}
Notice that as soon as I start repeating the task very quickly, the login happens at a normal speed. Is this an artifact of some other underlying thing, or could it be pointing us in the right direction? Just to note, this is with the two krb5 settings (validate and fast) DISABLED and selinux_provider=none, so the login should be much quicker than 3 seconds.
Just to clarify, I don't want to run with these settings disabled in production, I'm just trying to narrow down the choke point for authentication.
attachment krb5_child.log.ipaclient
attachment ldap_child.log.ipaclient
attachment sssd_globalscience.dh.log.ipaclient
attachment sssd.log.ipaclient
attachment sssd_nss.log.ipaclient
attachment sssd_pam.log.ipaclient
attachment ldap_child.log.ipaserver
attachment sssd.log.ipaserver
attachment sssd_globalscience.dh.log.ipaserver
attachment sssd_nss.log.ipaserver
attachment sssd_pam.log.ipaserver
attachment sssd_ssh.log.ipaserver
I've uploaded all of the logs from both server and client. The logs and cache were cleared prior to the login to the "central-puppet" client.
The logs are for one login, and a few seconds of inactivity either side.
Thanks again for all the help guys :)
With every new login, SSSD tries to read all group memberships of the user from the server to make sure that e.g. access control is done with the most recent data. There is a small timeout of 5s (pam_id_timeout) during which no group lookup happens and SSSD will use the data from the cache. I think this is what you're seeing in your tests.
Nevertheless, given the small number of users and groups, 2+ seconds seems quite long (I assume you already have a valid Kerberos service ticket, so that the ssh client does not need to connect to the KDC in the first test to get one). I'll check the logs and come back to you.
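For reference, pam_id_timeout lives in the [pam] section of sssd.conf; the value below is just the default, shown for illustration:

```ini
[pam]
# Window (in seconds) during which repeated logins reuse cached identity
# data instead of triggering a fresh group-membership lookup. Default: 5.
pam_id_timeout = 5
```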
Yep, that's correct. I did a kinit before issuing ssh -K, and assuming the defaults are the usual Kerberos ticket expirations, the ticket shouldn't need renewing.
Thanks sbose
Hi Guys,
Sbose, any idea from the logs what's causing the delay?
Unfortunately each step is happening in a single second so that the timestamps do not really help to identify which operation is slow.
08:49:22: the user is looked up in the nss responder, i.e. sshd is checking if the user exists
... I guess you typed the password here ...
08:49:24: authentication request enters the PAM responder and the request to look up the group memberships is sent to the backend
08:49:25: group-membership lookup is done, authentication request is sent to the backend
08:49:26: authentication result is sent back to sshd, and sshd sends the access control request
08:49:27: access control result is sent back to sshd, and sshd calls the remaining PAM steps, which are basically no-ops and happen in the same second.
There is an option debug_microseconds to get a better time resolution.
Additionally, to rule out disk I/O effects, you might want to try putting the SSSD cache on tmpfs as described in https://jhrozek.wordpress.com/2015/08/19/performance-tuning-sssd-for-large-ipa-ad-trust-deployments/, section 'Mount the cache in tmpfs'.
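Roughly, that amounts to an fstab entry along these lines (a sketch based on the linked post; the size and SELinux context are assumptions to adapt, and SSSD must be stopped with its cache cleared before mounting over the directory):

```ini
# /etc/fstab
tmpfs /var/lib/sss/db tmpfs size=300M,mode=0700,rootcontext=system_u:object_r:sssd_var_lib_t:s0 0 0
```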
Hi Sbose,
Thanks for this link. I implemented the tmpfs cache and it's performing a lot more consistently. This caused me to test I/O with dd, and it seems Azure disk performance is a bit shocking (likely due to a noisy neighbour of some sort).
As a result, I'll implement this on two VMs inside our own network and then take another look at authentication time.
We can close this for now if you like, unless you want those microsecond logs anyway? I can do those tests on Monday morning.
If, after I've implemented this internally, I still have the same problem, I'll come back and reopen this ticket.
Thanks so much for your help. I've learnt a lot about SSSD; I appreciate it.
Before Sumit closes the issue, I would just like to point out that the cache performance is something we will be working on implementing during the 1.14 development cycle.
Thank you for the feedback. With the results of the research you did I think the additional logs are not needed anymore.
resolution: => worksforme
status: new => closed
Metadata Update from @mogthesprog: - Issue set to the milestone: NEEDS_TRIAGE
SSSD is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in SSSD's github repository.
This issue has been cloned to Github and is available here: - https://github.com/SSSD/sssd/issues/3977
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
Thank you for understanding. We apologize for any inconvenience.