#3372 crash in sssd-kcm due to a race-condition between two concurrent requests
Closed: Fixed 3 years ago Opened 3 years ago by lslebodn.

It seems to be use after free.

sh$ kinit lslebodn@EXAMPLE.COM
kinit: Credentials cache I/O operation failed while getting default ccache

valgrind error:

==15846== 1 errors in context 1 of 1:
==15846== Invalid read of size 4
==15846==    at 0x5BFB6B9: talloc_chunk_from_ptr (talloc.c:452)
==15846==    by 0x5BFB6B9: __talloc_get_name (talloc.c:1486)
==15846==    by 0x5BFB6B9: talloc_check_name (talloc.c:1509)
==15846==    by 0x120F58: kcm_op_queue_add (kcmsrv_op_queue.c:136)
==15846==    by 0x120F58: kcm_op_queue_send (kcmsrv_op_queue.c:223)
==15846==    by 0x12072F: kcm_cmd_send (kcmsrv_ops.c:162)
==15846==    by 0x1115B1: kcm_cmd_dispatch (kcmsrv_cmd.c:364)
==15846==    by 0x1115B1: kcm_recv (kcmsrv_cmd.c:512)
==15846==    by 0x1115B1: kcm_fd_handler (kcmsrv_cmd.c:600)
==15846==    by 0x59F1A4F: epoll_event_loop (tevent_epoll.c:728)
==15846==    by 0x59F1A4F: epoll_event_loop_once (tevent_epoll.c:930)
==15846==    by 0x59EFEC6: std_event_loop_once (tevent_standard.c:114)
==15846==    by 0x59EBCAC: _tevent_loop_once (tevent.c:721)
==15846==    by 0x59EBECA: tevent_common_loop_wait (tevent.c:844)
==15846==    by 0x59EFE66: std_event_loop_wait (tevent_standard.c:145)
==15846==    by 0x7C2D0F2: server_loop (server.c:718)
==15846==    by 0x110856: main (kcm.c:313)
==15846==  Address 0xdccbea0 is 544 bytes inside a block of size 773 free'd
==15846==    at 0x4C2FCC8: free (vg_replace_malloc.c:530)
==15846==    by 0x5C024B3: _tc_free_poolmem (talloc.c:1000)
==15846==    by 0x5C024B3: _tc_free_internal (talloc.c:1141)
==15846==    by 0x5BFAAA7: _tc_free_children_internal (talloc.c:1593)
==15846==    by 0x5BFAAA7: _tc_free_internal (talloc.c:1104)
==15846==    by 0x5BFAAA7: _talloc_free_internal (talloc.c:1174)
==15846==    by 0x5BFAAA7: _talloc_free (talloc.c:1716)
==15846==    by 0x59ECFC0: tevent_req_received (tevent_req.c:255)
==15846==    by 0x59ECFD8: tevent_req_destructor (tevent_req.c:107)
==15846==    by 0x5BFAF30: _tc_free_internal (talloc.c:1078)
==15846==    by 0x5BFAF30: _talloc_free_internal (talloc.c:1174)
==15846==    by 0x5BFAF30: _talloc_free (talloc.c:1716)
==15846==    by 0x111711: kcm_cmd_request_done (kcmsrv_cmd.c:391)
==15846==    by 0x11E846: kcm_op_get_cache_uuid_list_done (kcmsrv_ops.c:1303)
==15846==    by 0x11B1F7: ccdb_sec_list_done (kcmsrv_ccache_secrets.c:1147)
==15846==    by 0x119DB0: sec_list_done (kcmsrv_ccache_secrets.c:224)
==15846==    by 0x12495A: tcurl_request_done (tev_curl.c:746)
==15846==    by 0x12495A: handle_curlmsg_done (tev_curl.c:234)
==15846==    by 0x12495A: process_curl_activity.isra.0 (tev_curl.c:245)
==15846==    by 0x124E4B: tcurlsock_input_available (tev_curl.c:288)
==15846==  Block was alloc'd at
==15846==    at 0x4C2EB1B: malloc (vg_replace_malloc.c:299)
==15846==    by 0x5BFE03B: __talloc_with_prefix (talloc.c:698)
==15846==    by 0x5BFE03B: _talloc_pool (talloc.c:752)
==15846==    by 0x5BFE03B: _talloc_pooled_object (talloc.c:820)
==15846==    by 0x59ECCAA: _tevent_req_create (tevent_req.c:73)
==15846==    by 0x120C43: kcm_op_queue_send (kcmsrv_op_queue.c:206)
==15846==    by 0x12072F: kcm_cmd_send (kcmsrv_ops.c:162)
==15846==    by 0x1115B1: kcm_cmd_dispatch (kcmsrv_cmd.c:364)
==15846==    by 0x1115B1: kcm_recv (kcmsrv_cmd.c:512)
==15846==    by 0x1115B1: kcm_fd_handler (kcmsrv_cmd.c:600)
==15846==    by 0x59F1A4F: epoll_event_loop (tevent_epoll.c:728)
==15846==    by 0x59F1A4F: epoll_event_loop_once (tevent_epoll.c:930)
==15846==    by 0x59EFEC6: std_event_loop_once (tevent_standard.c:114)
==15846==    by 0x59EBCAC: _tevent_loop_once (tevent.c:721)
==15846==    by 0x59EBECA: tevent_common_loop_wait (tevent.c:844)
==15846==    by 0x59EFE66: std_event_loop_wait (tevent_standard.c:145)
==15846==    by 0x7C2D0F2: server_loop (server.c:718)

Part of sssd-kcm.log

(Tue Apr 18 08:30:48:243213 2017) [sssd[kcm]] [handle_socket] (0x2000): Activity on curl socket 14 socket data 0xdcd93f0
(Tue Apr 18 08:30:48:246461 2017) [sssd[kcm]] [handle_curlmsg_done] (0x0400): Handled http://localhost/kcm/persistent/1000/ccache/
(Tue Apr 18 08:30:48:247437 2017) [sssd[kcm]] [tcurl_request_done] (0x0400): TCURL request finished [0]: Success
(Tue Apr 18 08:30:48:249128 2017) [sssd[kcm]] [schedule_fd_processing] (0x2000): timeout_ms: -1
(Tue Apr 18 08:30:48:249732 2017) [sssd[kcm]] [check_curl_timeouts] (0x4000): Still tracking 0 outstanding requests
(Tue Apr 18 08:30:48:253592 2017) [sssd[kcm]] [sec_list_done] (0x2000): Found 0 items
(Tue Apr 18 08:30:48:262271 2017) [sssd[kcm]] [sec_list_done] (0x2000): list done
(Tue Apr 18 08:30:48:263827 2017) [sssd[kcm]] [ccdb_sec_list_done] (0x2000): Found 2 ccaches
(Tue Apr 18 08:30:48:266425 2017) [sssd[kcm]] [ccdb_sec_list_done] (0x2000): Listing all caches done
(Tue Apr 18 08:30:48:269666 2017) [sssd[kcm]] [kcm_debug_uuid] (0x4000): UUID: 23cb4886-fa25-41e1-b065-327cc6f565c4
(Tue Apr 18 08:30:48:270029 2017) [sssd[kcm]] [kcm_debug_uuid] (0x4000): UUID: 901c7634-98ee-4e56-bd6c-84d6548d0f9f
(Tue Apr 18 08:30:48:270963 2017) [sssd[kcm]] [kcm_cmd_done] (0x0400): KCM operation GET_CACHE_UUID_LIST returned [0]: Success
(Tue Apr 18 08:30:48:272724 2017) [sssd[kcm]] [kcm_op_get_default_ccache_send] (0x1000): Getting client's default ccache
(Tue Apr 18 08:30:48:273393 2017) [sssd[kcm]] [ccdb_sec_get_default_send] (0x2000): Getting the default ccache
(Tue Apr 18 08:30:48:273723 2017) [sssd[kcm]] [tcurl_request_send] (0x0400): Sending TCURL request for http://localhost/kcm/persistent/1000/default, at socket /var/run/secrets.socket
(Tue Apr 18 08:30:48:273828 2017) [sssd[kcm]] [schedule_fd_processing] (0x2000): timeout_ms: 0
(Tue Apr 18 08:30:48:274587 2017) [sssd[kcm]] [kcm_send_reply] (0x2000): Sending a reply
(Tue Apr 18 08:30:48:274802 2017) [sssd[kcm]] [kcm_output_construct] (0x1000): Sending a reply with 36 bytes of payload
(Tue Apr 18 08:30:48:280941 2017) [sssd[kcm]] [handle_socket] (0x2000): Activity on curl socket 14 socket data (nil)
(Tue Apr 18 08:30:48:281043 2017) [sssd[kcm]] [schedule_fd_processing] (0x2000): timeout_ms: 0
(Tue Apr 18 08:30:48:281121 2017) [sssd[kcm]] [check_curl_timeouts] (0x4000): Still tracking 1 outstanding requests
(Tue Apr 18 08:30:48:281227 2017) [sssd[kcm]] [schedule_fd_processing] (0x2000): timeout_ms: 4995
(Tue Apr 18 08:30:48:281319 2017) [sssd[kcm]] [check_curl_timeouts] (0x4000): Still tracking 1 outstanding requests
(Tue Apr 18 08:30:48:282299 2017) [sssd[kcm]] [kcm_send] (0x2000): All data sent!
(Tue Apr 18 08:30:48:282717 2017) [sssd[kcm]] [kcm_input_parse] (0x1000): Received message with length 4
(Tue Apr 18 08:30:48:282789 2017) [sssd[kcm]] [kcm_get_opt] (0x2000): The client requested operation 20
(Tue Apr 18 08:30:48:282870 2017) [sssd[kcm]] [kcm_cmd_send] (0x0400): KCM operation GET_DEFAULT_CACHE
(Tue Apr 18 08:30:48:282935 2017) [sssd[kcm]] [kcm_cmd_send] (0x1000): 0 bytes on KCM input
(Tue Apr 18 08:30:48:283032 2017) [sssd[kcm]] [kcm_op_queue_send] (0x0200): Adding request by 1000 to the wait queue
(Tue Apr 18 08:30:48:284040 2017) [sssd[kcm]] [talloc_log_fn] (0x0010): Bad talloc magic value - wrong talloc version used/mixed

I have a suspicion that it is partially related to expired keys.

sh]$ klist -l
Principal name                 Cache name
--------------                 ----------
lslebodn@EXAMPLE.COM           KCM:1000 (Expired)
lslebodn@ANOTHERREALM.ORG      KCM:1000:85269 (Expired)

sh$ kinit lslebodn@EXAMPLE.COM
kinit: Credentials cache I/O operation failed while getting default ccache

sh$ kinit lslebodn@EXAMPLE.COM
Password for lslebodn@EXAMPLE.COM:

I have a different theory. I think it's a race condition between the request that just finished, calls tevent_req_done which frees the queue entry and between the code that first fetches the queue entry from the hash table and then checks it. I think the entry is first returned from the hash table, then the request is freed and then the queue code checks the entry.

I doubt there was any other request. It is still possible.

But I haven't had a time to find reasonable reproducer.

You were probably right. Reasonable reproducer is to kinit from two different terminals with different realm at the same time.

1st terminal:

sh$ kinit lslebodn@EXAMPLE.COM
kinit: Credentials cache I/O operation failed while getting default ccache
sh$ kinit lslebodn@EXAMPLE.COM
kinit: Credentials cache I/O operation failed while getting default ccache
sh$ kinit lslebodn@EXAMPLE.COM
Password for lslebodn@EXAMPLE.COM: 
kinit: Pre-authentication failed: Password read interrupted while getting initial credentials
sh$ kinit lslebodn@EXAMPLE.COM
kinit: Credentials cache I/O operation failed while searching for ccache for lslebodn@EXAMPLE.COM
sh$ kinit lslebodn@EXAMPLE.COM
Password for lslebodn@EXAMPLE.COM:

2nd terminal

sh$ kinit lslebodn@ANOTHER.REALM
sh$ kinit lslebodn@ANOTHER.REALM
Password for lslebodn@ANOTHER.REALM: 
kinit: Pre-authentication failed: Password read interrupted while getting initial credentials
sh$ kinit lslebodn@ANOTHER.REALM
Password for lslebodn@ANOTHER.REALM: 
kinit: Pre-authentication failed: Password read interrupted while getting initial credentials
sh$ kinit lslebodn@ANOTHER.REALM
kinit: Credentials cache I/O operation failed while getting default ccache
sh$ kinit lslebodn@ANOTHER.REALM
kinit: Credentials cache I/O operation failed while getting default ccache

I might reproduce it more often because I run sssd-secrets and sssd-kcm with valgrind and therefore they are slower. But I am still not sure why it can occur even with single kinit operation; as it is described in ticket.

And it needn't be even a different realm. You can just renew the same ticket twice in parallel as a reasonable reproducer. Just ensure that ticker is not Expired :-)

e.g. for i in {1..2} ; do kinit -R & done

Metadata Update from @jhrozek:
- Issue priority set to: critical
- Issue set to the milestone: SSSD 1.15.3

3 years ago

Metadata Update from @jhrozek:
- Custom field rhbz adjusted to https://bugzilla.redhat.com/show_bug.cgi?id=1446302

3 years ago

Metadata Update from @jhrozek:
- Custom field rhbz adjusted to https://bugzilla.redhat.com/show_bug.cgi?id=1446302

3 years ago

Metadata Update from @jhrozek:
- Issue assigned to jhrozek

3 years ago

Metadata Update from @jhrozek:
- Custom field patch adjusted to on

3 years ago

master:

Metadata Update from @jhrozek:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago

Metadata Update from @lslebodn:
- Custom field version adjusted to 1.1.5.3

3 years ago

SSSD is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in SSSD's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/SSSD/sssd/issues/4402

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Login to comment on this ticket.

Metadata