#4172 [DEFECT] segfault: error 4 in libc-2.23.so
Closed: worksforme 4 years ago by wbollock. Opened 4 years ago by wbollock.

Hello,

I've been running SSSD on ~25 Ubuntu 16.04 LTS hosts. Everything has worked great for a while, until now.

kernel: sssd[12892]: segfault at 1000 ip 00007f1fb7a68cc0 sp 00007ffc40e20a20 error 4 in libc-2.23.so[7f1fb7a1a000+1c0000]

Then SSSD crashes and stays down until it is restarted manually with sudo systemctl restart sssd
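For readers unfamiliar with the kernel segfault line above, here is a minimal sketch of how its fields can be decoded. It assumes the usual x86 layout of that message (all numbers in hex, error-code bit 0 = page present, bit 1 = write, bit 2 = user mode); the function name decode_segfault is made up for this illustration.

```python
import re

# Kernel line copied from the report above.
LINE = (
    "sssd[12892]: segfault at 1000 ip 00007f1fb7a68cc0 "
    "sp 00007ffc40e20a20 error 4 in libc-2.23.so[7f1fb7a1a000+1c0000]"
)

SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Split a kernel segfault line into its fields (hypothetical helper)."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a kernel segfault line")
    err = int(m.group("err"), 16)
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    return {
        "fault_addr": int(m.group("addr"), 16),
        "object": m.group("obj"),
        # Offset of the faulting instruction inside the mapped object.
        "offset": ip - base,
        # x86 page-fault error code bits (assumed layout, see lead-in).
        "page_present": bool(err & 0x1),  # False = page was not mapped
        "write": bool(err & 0x2),         # False = the access was a read
        "user_mode": bool(err & 0x4),     # True = fault came from user space
    }


info = decode_segfault(LINE)
print(info["object"], hex(info["offset"]), hex(info["fault_addr"]))
```

Under those assumptions, "error 4" reads as a user-mode read of an unmapped page (address 0x1000), and the instruction pointer sits at offset 0x4ecc0 inside libc-2.23.so. With matching debug symbols installed, that offset could be passed to a symbolizer such as addr2line, although a backtrace from the coredump would be more informative.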

Recently, three servers have experienced these SSSD crashes. Information below:

Hosts: Ubuntu 16.04 LTS

SSSD version: 1.13.4-1ubuntu1.15
(appears to be the LTM branch)

Network Topology: Active Directory

Steps to reproduce: SSSD setup for Unix and SSH login with AD usernames/passwords

SSSD logs (debug 3, just turned it to 6 on one host...):

Mar 23 13:38:39 <my-host> kernel: audit: type=1400 audit(1584985119.552:20713): apparmor="ALLOWED" operation="capable" profile="/usr/sbin/sssd" pid=13619 comm="sssd_be" capability=3 capname="fowner"
Mar 23 13:38:48 <my-host> kernel: sssd[12892]: segfault at 1000 ip 00007f1fb7a68cc0 sp 00007ffc40e20a20 error 4 in libc-2.23.so[7f1fb7a1a000+1c0000]
Mar 23 13:38:48 <my-host> sssd[12895]: Shutting down
Mar 23 13:38:48 <my-host> systemd[1]: sssd.service: Main process exited, code=dumped, status=11/SEGV
Mar 23 13:38:48 <my-host> sssd[be[13619]: Shutting down
Mar 23 13:38:48 <my-host> sssd[12896]: Shutting down
Mar 23 13:40:18 <my-host> systemd[1]: sssd.service: State 'stop-sigterm' timed out. Killing.
Mar 23 13:40:18 <my-host> systemd[1]: sssd.service: Unit entered failed state.
Mar 23 13:40:18 <my-host> systemd[1]: sssd.service: Failed with result 'timeout'.
Mar 23 13:43:56 <my-host> sshd[13752]: pam_sss(sshd:auth): Request to sssd failed. Connection refused
Mar 23 13:53:46 <my-host> sshd[14011]: pam_sss(sshd:auth): Request to sssd failed. Connection refused
Mar 23 14:03:51 <my-host> sshd[14217]: pam_sss(sshd:auth): Request to sssd failed. Connection refused
Mar 23 14:07:15 <my-host> sshd[14252]: pam_sss(sshd:session): Request to sssd failed. Connection refused

More logs:

Mar 21 03:19:33 <my-host> systemd[1]: sssd.service: Main process exited, code=dumped, status=11/SEGV
Mar 21 03:21:03 <my-host> systemd[1]: sssd.service: State 'stop-sigterm' timed out. Killing.
Mar 21 03:21:03 <my-host> systemd[1]: sssd.service: Unit entered failed state.
Mar 21 03:21:03 <my-host> systemd[1]: sssd.service: Failed with result 'timeout'.
Mar 23 13:25:28 <my-host> systemd[1]: Starting System Security Services Daemon...

SSSD config:


[sssd]
services = nss, pam
config_file_version = 2
domains = <redacted>
debug_level=3
reconnection_retries = 3

[domain/<redacted>]
id_provider = ad
access_provider = ad

# AD ACCESS FILTER RULES:
# only 1 at a time has been tested - either AND or OR
# use & to specify ALL must be required (logical AND)
# use | to specify ANY combination can suffice (logical OR)
ad_access_filter = <redacted>

default_shell = /bin/bash
debug_level=3
cache_credentials = true

# Use this if users are being logged in at /.
# This example specifies /home/DOMAIN-FQDN/user as $HOME. Use with pam_mkhomedir.so
# if you wanted the domain also to be in the path, do: override_homedir = /home/%d/%u
override_homedir = /home/%u

# Uncomment if the client machine hostname doesn't match the computer object on the DC.
ad_hostname = mymachine.myubuntu.example.com

# https://linux.die.net/man/5/sssd-ad
# Uncomment if DNS SRV resolution is not working
# connects in order of preference set by comma separated list
ad_server = <redacted>

# Uncomment if the AD domain is named differently than the Samba domain
ad_domain = <redacted>

# Enumeration is discouraged for performance reasons.
enumerate = true

[pam]
reconnection_retries = 3

# Filtered users generated from sssd.sh

[nss]
reconnection_retries = 3
filter_users = root,www-data,sshd,snmp,bin,clamav,daemon,ntp,postfix,Debian-exim,amavis,backup,bind,debian-spamd,dovecot,dovenull,games,gnats,irc,landscape,libuuid,list,lp,mail,man,messagebus,mysql,nagios,news,postgrey,proxy,smmsp,smmta,statd,sync,sys,syslog,uucp,vmail,whoopsie
filter_groups = root,www-data,sshd,snmp,bin,clamav,daemon,ntp,postfix,Debian-exim,amavis,backup,bind,debian-spamd,dovecot,dovenull,games,gnats,irc,landscape,libuuid,list,lp,mail,man,messagebus,mysql,nagios,news,postgrey,proxy,smmsp,smmta,statd,sync,sys,syslog,uucp,vmail,whoopsie

Proposed workaround for now:

Change the SSSD systemd unit so that it restarts more aggressively. Something like:

[Unit]
Description=System Security Services Daemon

# SSSD must be running before we permit user sessions

Before=systemd-user-sessions.service nss-user-lookup.target autofs.service
Wants=nss-user-lookup.target

[Service]
ExecStart=/usr/sbin/sssd -i -f
Type=notify
NotifyAccess=main
PIDFile=/var/run/sssd.pid
Restart=always
RestartSec=10
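# systemd 229 (shipped with Ubuntu 16.04) spells this StartLimitInterval=;
# StartLimitIntervalSec= in [Unit] is the systemd 230+ form.
StartLimitInterval=0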

[Install]
WantedBy=multi-user.target

Thank you! Let me know if any other information is needed.


Hi,

SSSD version: 1.13.4-1ubuntu1.15

This is a really outdated branch.

Could you please share the SSSD logs (/var/log/sssd/*) and a coredump?

Yes, it is outdated, but the SSSD Releases page marks it as the "LTM" branch, and it is the default package in the Ubuntu LTS repos.

I'm not familiar with coredumps, but I did a quick Google search. I have a 1.1 MB file in /var/crash:
sssd.crash

I shared relevant SSSD logs above, and I have turned on more verbose logging for the next time it happens.

Edit: Should I transition all my installs to the newer SSSD releases? I thought the LTM ones were okay.

I shared relevant SSSD logs above

You shared the system journal.

I am asking for the contents of /var/log/sssd/*.log (feel free to hide sensitive information).

You're right, my apologies. I'm trying to paste only the relevant logs from around the time of the failure.

It may be related to connectivity issues with "MY.AD.ORG". I assume that's what the ping failures mean.

sssd.log:

(Mon Mar 23 13:25:28 2020) [sssd] [client_registration] (0x0020): Failed to mark service [nss]!
(Mon Mar 23 13:38:28 2020) [sssd] [ping_check] (0x0020): A service PING timed out on [MY.AD.ORG]. Attempt [0]
(Mon Mar 23 13:38:38 2020) [sssd] [ping_check] (0x0020): A service PING timed out on [MY.AD.ORG]. Attempt [1]
(Mon Mar 23 13:38:39 2020) [sssd] [mt_svc_exit_handler] (0x0040): Child [MY.AD.ORG] exited with code [1]
(Mon Mar 23 14:28:03 2020) [sssd] [server_setup] (0x0400): CONFDB: /var/lib/sss/db/config.ldb
(Mon Mar 23 14:28:03 2020) [sssd] [sysdb_domain_init_internal] (0x0200): DB File for MY.AD.ORG: /var/lib/sss/db/cache_MY.AD.ORG.ldb
(Mon Mar 23 14:28:03 2020) [sssd] [ldb] (0x0400): asq: Unable to register control with rootdse!
(Mon Mar 23 14:28:03 2020) [sssd] [sbus_new_server] (0x0400): D-BUS Server listening on unix:path=/var/lib/sss/pipes/private/sbus-monitor,guid=5b1638820031d4ba8960c05b5e78ffb3
(Mon Mar 23 14:28:03 2020) [sssd] [get_ping_config] (0x0100): Time between service pings for [MY.AD.ORG]: [10]
(Mon Mar 23 14:28:03 2020) [sssd] [get_ping_config] (0x0100): Time between SIGTERM and SIGKILL for [MY.AD.ORG]: [60]
(Mon Mar 23 14:28:03 2020) [sssd] [start_service] (0x0100): Queueing service MY.AD.ORG for startup
(Mon Mar 23 14:28:03 2020) [sssd] [sbus_server_init_new_connection] (0x0200): Entering.
(Mon Mar 23 14:28:03 2020) [sssd] [sbus_server_init_new_connection] (0x0200): Adding connection 0x110bc20.
(Mon Mar 23 14:28:03 2020) [sssd] [sbus_init_connection] (0x0400): Adding connection 0x110bc20
(Mon Mar 23 14:28:03 2020) [sssd] [sbus_server_init_new_connection] (0x0200): Got a connection
(Mon Mar 23 14:28:03 2020) [sssd] [monitor_service_init] (0x0400): Initializing D-BUS Service
(Mon Mar 23 14:28:03 2020) [sssd] [sbus_opath_hash_add_iface] (0x0400): Registering interface org.freedesktop.sssd.monitor with path /org/freedesktop/sssd/monitor
(Mon Mar 23 14:28:03 2020) [sssd] [sbus_conn_register_path] (0x0400): Registering object path /org/freedesktop/sssd/monitor with D-Bus connection
(Mon Mar 23 14:28:03 2020) [sssd] [sbus_opath_hash_add_iface] (0x0400): Registering interface org.freedesktop.DBus.Properties with path /org/freedesktop/sssd/monitor
(Mon Mar 23 14:28:03 2020) [sssd] [sbus_opath_hash_add_iface] (0x0400): Registering interface org.freedesktop.DBus.Introspectable with path /org/freedesktop/sssd/monitor
(Mon Mar 23 14:28:03 2020) [sssd] [client_registration] (0x0100): Received ID registration: (%BE_MY.AD.ORG,1)
(Mon Mar 23 14:28:03 2020) [sssd] [mark_service_as_started] (0x0200): Marking MY.AD.ORG as started.
(Mon Mar 23 14:28:03 2020) [sssd] [mark_service_as_started] (0x0100): Now starting services!

sssd.log.1

(Mon Mar 16 02:59:59 2020) [sssd] [monitor_quit_signal] (0x0040): Monitor received Terminated: terminating children
(Mon Mar 16 02:59:59 2020) [sssd] [monitor_quit] (0x0040): Returned with: 0
(Mon Mar 16 02:59:59 2020) [sssd] [monitor_quit] (0x0020): Terminating [pam][1513]
(Mon Mar 16 02:59:59 2020) [sssd] [monitor_quit] (0x0020): Child [pam] exited gracefully
(Mon Mar 16 02:59:59 2020) [sssd] [monitor_quit] (0x0020): Terminating [nss][1512]
(Mon Mar 16 02:59:59 2020) [sssd] [monitor_quit] (0x0020): Child [nss] exited gracefully
(Mon Mar 16 02:59:59 2020) [sssd] [monitor_quit] (0x0020): Terminating [FSU.EDU][1499]
(Mon Mar 16 02:59:59 2020) [sssd] [monitor_quit] (0x0020): Child [FSU.EDU] terminated with a signal
(Mon Mar 16 03:00:43 2020) [sssd] [client_registration] (0x0020): Failed to mark service [nss]!
(Mon Mar 16 06:25:04 2020) [sssd] [monitor_hup] (0x0020): Received SIGHUP.
(Mon Mar 16 06:25:04 2020) [sssd] [te_server_hup] (0x0020): Received SIGHUP. Rotating logfiles.
(Mon Mar 16 06:25:04 2020) [sssd] [monitor_hup] (0x0020): Received SIGHUP.
(Mon Mar 16 06:25:04 2020) [sssd] [te_server_hup] (0x0020): Received SIGHUP. Rotating logfiles.
(Sat Mar 21 03:19:12 2020) [sssd] [ping_check] (0x0020): A service PING timed out on [FSU.EDU]. Attempt [0]
(Sat Mar 21 03:19:22 2020) [sssd] [ping_check] (0x0020): A service PING timed out on [FSU.EDU]. Attempt [1]
(Sat Mar 21 03:19:24 2020) [sssd] [mt_svc_exit_handler] (0x0040): Child [FSU.EDU] exited with code [1]

sssd_nss.log

(Mon Mar 23 13:25:31 2020) [sssd[nss]] [id_callback] (0x0010): The Monitor returned an error [org.freedesktop.DBus.Error.NoReply]

Mixing two logs together:

(Mon Mar 23 13:25:28 2020) [sssd] [client_registration] (0x0020): Failed to mark service [nss]!
(Mon Mar 23 13:38:28 2020) [sssd] [ping_check] (0x0020): A service PING timed out on [MY.AD.ORG]. Attempt [0]
(Mon Mar 23 13:38:38 2020) [sssd] [ping_check] (0x0020): A service PING timed out on [MY.AD.ORG]. Attempt [1]
(Mon Mar 23 13:38:39 2020) [sssd] [mt_svc_exit_handler] (0x0040): Child [MY.AD.ORG] exited with code [1]
Mar 23 13:38:48 <my-host> kernel: sssd[12892]: segfault at 1000 ip 00007f1fb7a68cc0 sp 00007ffc40e20a20 error 4 in libc-2.23.so[7f1fb7a1a000+1c0000]
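The interleaving above can be reproduced by sorting both sources on their timestamps. This is only a rough sketch: the helper names are invented for the example, and because journal lines carry no year, 2020 is assumed from the sssd.log entries.

```python
from datetime import datetime

# Sample lines copied from the logs in this ticket.
SSSD_LINES = [
    "(Mon Mar 23 13:38:28 2020) [sssd] [ping_check] (0x0020): A service PING timed out on [MY.AD.ORG]. Attempt [0]",
    "(Mon Mar 23 13:38:38 2020) [sssd] [ping_check] (0x0020): A service PING timed out on [MY.AD.ORG]. Attempt [1]",
    "(Mon Mar 23 13:38:39 2020) [sssd] [mt_svc_exit_handler] (0x0040): Child [MY.AD.ORG] exited with code [1]",
]
JOURNAL_LINES = [
    "Mar 23 13:38:48 <my-host> kernel: sssd[12892]: segfault at 1000 ip 00007f1fb7a68cc0 sp 00007ffc40e20a20 error 4 in libc-2.23.so[7f1fb7a1a000+1c0000]",
    "Mar 23 13:38:48 <my-host> systemd[1]: sssd.service: Main process exited, code=dumped, status=11/SEGV",
]


def parse_sssd(line):
    """Timestamp of an sssd.log line, e.g. '(Mon Mar 23 13:38:28 2020) ...'."""
    stamp = line[1:line.index(")")]
    return datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")


def parse_journal(line, year):
    """Timestamp of a syslog-style line; the year is supplied by the caller."""
    stamp = " ".join(line.split()[:3])  # e.g. 'Mar 23 13:38:48'
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def merge_logs(sssd_lines, journal_lines, year=2020):
    """Return all lines ordered by time; ties keep their input order."""
    tagged = [(parse_journal(l, year), l) for l in journal_lines]
    tagged += [(parse_sssd(l), l) for l in sssd_lines]
    return [line for _, line in sorted(tagged, key=lambda pair: pair[0])]


for line in merge_logs(SSSD_LINES, JOURNAL_LINES):
    print(line)
```

Sorting on the parsed timestamp only (not on the text) keeps lines with the same second in their original relative order, which matters because both sources have one-second resolution.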

It looks like the SSSD process serving [MY.AD.ORG] exited before the third ping, and this caused the SSSD monitor to crash while attempting that ping. This looks like a bug, perhaps the same as https://bugzilla.redhat.com/show_bug.cgi?id=1766470 (I didn't verify).
However, this bug will not be addressed, because 1.13.x is not maintained and those code paths are absent in 1.16.x and 2.x.
Please consider upgrading if possible.

You are right that the crash is a consequence of (is triggered by) an issue in the SSSD process serving [MY.AD.ORG]. To figure out what that issue is, we need to look into /var/log/sssd/sssd_MY.AD.ORG.log

The logs in that file were very specific to my org, I believe my org's LDAP server was down or unavailable at that time.

Do you recommend Ubuntu users not use the LTM package provided by the official repos? You're right in that it does seem quite outdated, but I thought it was a stable release.

To get around this issue, I plan to set up auth cache and possibly increase the value of "krb5_auth_timeout". I believe my org is just hammered by increased network traffic from the COVID situation.

Thank you for your help!

The logs in that file were very specific to my org, I believe my org's LDAP server was down or unavailable at that time.

Look in this log for the reason behind "Child [MY.AD.ORG] exited with code [1]".
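One way to narrow such a log down is to keep only failure-level lines shortly before the exit. The sketch below assumes the level bitmask meanings from the sssd.conf documentation (0x0010 fatal, 0x0020 critical, 0x0040 serious failures). Since the backend log was not posted, the sample input is taken from the sssd.log excerpt earlier in this ticket; the function name and the five-minute window are arbitrary choices for illustration.

```python
import re
from datetime import datetime, timedelta

LINE_RE = re.compile(
    r"^\((?P<ts>[^)]+)\) \[(?P<proc>.+?)\] \[(?P<func>[^\]]+)\] "
    r"\((?P<level>0x[0-9a-fA-F]+)\): (?P<msg>.*)$"
)

# Fatal, critical and serious failures (assumed from the sssd.conf man page).
FAILURE_MASK = 0x0010 | 0x0020 | 0x0040

# Lines copied from the sssd.log excerpt in this ticket.
SAMPLE = [
    "(Mon Mar 23 13:25:28 2020) [sssd] [client_registration] (0x0020): Failed to mark service [nss]!",
    "(Mon Mar 23 13:38:28 2020) [sssd] [ping_check] (0x0020): A service PING timed out on [MY.AD.ORG]. Attempt [0]",
    "(Mon Mar 23 13:38:38 2020) [sssd] [ping_check] (0x0020): A service PING timed out on [MY.AD.ORG]. Attempt [1]",
    "(Mon Mar 23 13:38:39 2020) [sssd] [mt_svc_exit_handler] (0x0040): Child [MY.AD.ORG] exited with code [1]",
    "(Mon Mar 23 14:28:03 2020) [sssd] [server_setup] (0x0400): CONFDB: /var/lib/sss/db/config.ldb",
]


def failures_before(lines, exit_time, window=timedelta(minutes=5)):
    """Return (time, function, message) for failure lines within the window."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip anything that is not in SSSD's log format
        ts = datetime.strptime(m.group("ts"), "%a %b %d %H:%M:%S %Y")
        level = int(m.group("level"), 16)
        if level & FAILURE_MASK and exit_time - window <= ts <= exit_time:
            hits.append((ts, m.group("func"), m.group("msg")))
    return hits


exit_time = datetime(2020, 3, 23, 13, 38, 39)  # when the child exited
for ts, func, msg in failures_before(SAMPLE, exit_time):
    print(ts.time(), func, msg)
```

Run against /var/log/sssd/sssd_MY.AD.ORG.log with a higher debug_level, the same filter would surface whatever the backend itself logged just before it exited.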

Do you recommend Ubuntu users not use the LTM package provided by the official repos? You're right in that it does seem quite outdated, but I thought it was a stable release.

I must admit the Releases page needs an update. In general, 1.13.x no longer receives updates or bugfixes, so it is not LTM anymore.

To get around this issue, I plan to set up auth cache and possibly increase the value of "krb5_auth_timeout".

It seems that to get around this issue you need to configure the backend so that it does not exit. But to do so, you first need to understand why it exits.

Hm... maybe I could contact the Ubuntu package maintainer to update SSSD on their end. I bet a lot of users have followed the guides that recommend "sudo apt-get install sssd" like me.

Edit: I clearly didn't understand how Ubuntu's packages work; I'm used to rolling releases. Newer SSSD versions are clearly available in the repos.

Closing issue. Thank you atikhonov.

Metadata Update from @wbollock:
- Issue close_status updated to: worksforme
- Issue status updated to: Closed (was: Open)

4 years ago

SSSD is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in SSSD's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/SSSD/sssd/issues/5127

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.
