#3693 test_kcm_mem_collection_init_list_destroy failures on Rawhide
Closed: Invalid 5 years ago Opened 6 years ago by fidencio.

I've seen the following issue on our internal CI for the second time Today, thus opening an issue for further investigation.

=================================== FAILURES ===================================
__________________ test_kcm_mem_collection_init_list_destroy ___________________
Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/ci/label/fedora_rawhide/src/tests/intg/test_kcm.py", line 309, in test_kcm_mem_collection_init_list_destroy
    collection_init_list_destroy(testenv)
  File "/var/lib/jenkins/workspace/ci/label/fedora_rawhide/src/tests/intg/test_kcm.py", line 260, in collection_init_list_destroy
    assert out == 0
AssertionError: assert 1 == 0
----------------------------- Captured stdout call -----------------------------
Authenticating as principal jenkins/admin@KCMTEST with password.
Principal "bob@KCMTEST" created.
Authenticating as principal jenkins/admin@KCMTEST with password.
Principal "carol@KCMTEST" created.
Authenticating as principal jenkins/admin@KCMTEST with password.
Principal "host/somehostname@KCMTEST" created.
----------------------------- Captured stderr call -----------------------------
kadmin.local: Connection reset by peer while opening default credentials cache
WARNING: no policy specified for bob@KCMTEST; defaulting to no policy
WARNING: no policy specified for carol@KCMTEST; defaulting to no policy
WARNING: no policy specified for host/somehostname@KCMTEST; defaulting to no policy
--------------------------- Captured stderr teardown ---------------------------
k5_mutex_lock: Received error 22 (Invalid argument)
sssd_kcm: ../../include/k5-thread.h:376: k5_mutex_lock: Assertion `r == 0' failed.
==================== 1 failed, 255 passed in 791.18 seconds ====================
make[2]: *** [Makefile:1052: intgcheck-installed] Error 1
make[2]: Leaving directory '/var/lib/jenkins/workspace/ci/label/fedora_rawhide/ci-build-debug/intg/bld/src/tests/intg'
make[1]: *** [Makefile:38201: intgcheck-run] Error 2
make[1]: Leaving directory '/var/lib/jenkins/workspace/ci/label/fedora_rawhide/ci-build-debug'
make: *** [Makefile:38214: intgcheck] Error 2
End: Thu Mar 29 11:14:13 CEST 2018

Metadata Update from @fidencio:
- Issue assigned to fidencio

6 years ago

Metadata Update from @fidencio:
- Issue tagged with: KCM

6 years ago

Is it possible there's clockskew involved here? If so, it may go away when that's corrected.

So, just for the record, this is the coredump for this issue:

(gdb) bt
#0  0x00007f4425ddaf6b in raise () from /lib64/libc.so.6
#1  0x00007f4425dc55c1 in abort () from /lib64/libc.so.6
#2  0x00007f4425dc5491 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
#3  0x00007f4425dd36e2 in __assert_fail () from /lib64/libc.so.6
#4  0x00007f442b566ca5 in k5_mutex_lock (m=0x7f442b7c6588 <krb5int_profile_shared_data+8>)
    at ../../include/k5-thread.h:376
#5  0x00007f442b567790 in k5_mutex_lock (m=0x7f442b7c6588 <krb5int_profile_shared_data+8>)
    at ../../include/k5-thread.h:376
#6  profile_dereference_data (data=0x9bbe50) at prof_file.c:534
#7  0x00007f442b5678f5 in profile_free_file (prf=0x9bbe30) at prof_file.c:558
#8  0x00007f442b567983 in profile_close_file (prf=0x9bbe30) at prof_file.c:599
#9  0x00007f442b56b249 in profile_release (profile=0x99a000) at prof_init.c:514
#10 0x00007f442b55c6c7 in k5_os_free_context (ctx=ctx@entry=0x9bbbf0) at init_os_ctx.c:506
#11 0x00007f442b531a9a in krb5_free_context (ctx=0x9bbbf0) at init_ctx.c:303
#12 0x0000000000407a5b in kcm_data_destructor (ptr=<optimized out>)
    at /root/sssd/x86_64/../src/responder/kcm/kcm.c:163
#13 0x00007f4426371e96 in _tc_free_internal () from /lib64/libtalloc.so.2
#14 0x00007f44263717b0 in _tc_free_internal () from /lib64/libtalloc.so.2
#15 0x00007f44263717b0 in _tc_free_internal () from /lib64/libtalloc.so.2
#16 0x00007f44263717b0 in _tc_free_internal () from /lib64/libtalloc.so.2
#17 0x00007f44263717b0 in _tc_free_internal () from /lib64/libtalloc.so.2
#18 0x00007f442636c648 in _talloc_free () from /lib64/libtalloc.so.2
#19 0x00007f442636a480 in talloc_lib_fini () from /lib64/libtalloc.so.2
#20 0x00007f442bffca96 in _dl_fini () from /lib64/ld-linux-x86-64.so.2
#21 0x00007f4425ddd6bc in __run_exit_handlers () from /lib64/libc.so.6
#22 0x00007f4425ddd7ec in exit () from /lib64/libc.so.6
#23 0x00007f442a7c4f5d in orderly_shutdown (status=0)
    at /root/sssd/x86_64/../src/util/server.c:258
#24 0x00007f4426583a46 in tevent_common_check_signal () from /lib64/libtevent.so.0
#25 0x00007f4426585975 in epoll_event_loop_once () from /lib64/libtevent.so.0
#26 0x00007f4426583dab in std_event_loop_once () from /lib64/libtevent.so.0
#27 0x00007f4426580098 in _tevent_loop_once () from /lib64/libtevent.so.0
#28 0x00007f44265802eb in tevent_common_loop_wait () from /lib64/libtevent.so.0
#29 0x00007f4426583d3b in std_event_loop_wait () from /lib64/libtevent.so.0
#30 0x00007f442a7c5e13 in server_loop (main_ctx=0x983210)
    at /root/sssd/x86_64/../src/util/server.c:721
#31 0x00000000004076c3 in main (argc=<optimized out>, argv=<optimized out>)
    at /root/sssd/x86_64/../src/responder/kcm/kcm.c:318

And we have hit a similar one as well ...

(gdb) bt
#0  0x00007fbb60047bdb in gethostname () from /usr/lib64/libnss_wrapper.so
#1  0x00007fbb5e0230e9 in ldap_pvt_get_fqdn () from /lib64/libldap-2.4.so.2
#2  0x00007fbb5e0212de in ldap_int_initialize () from /lib64/libldap-2.4.so.2
#3  0x00007fbb5e021a8a in ldap_get_option () from /lib64/libldap-2.4.so.2
#4  0x00007fbb556acd83 in remove_connection_callback (mem_ctx=<optimized out>)
    at /root/sssd/x86_64/../src/providers/ldap/sdap_fd_events.c:75
#5  0x00007fbb5b37ed2b in _talloc_free () from /lib64/libtalloc.so.2
#6  0x00007fbb556ace9c in remove_ldap_connection_callbacks (sh=sh@entry=0x1ba3a60)
    at /root/sssd/x86_64/../src/providers/ldap/sdap_fd_events.c:57
#7  0x00007fbb556860f2 in sdap_handle_release (sh=0x1ba3a60)
    at /root/sssd/x86_64/../src/providers/ldap/sdap_async.c:99
#8  sdap_handle_destructor (mem=<optimized out>)
    at /root/sssd/x86_64/../src/providers/ldap/sdap_async.c:79
#9  0x00007fbb5b383e96 in _tc_free_internal () from /lib64/libtalloc.so.2
#10 0x00007fbb5b3837b0 in _tc_free_internal () from /lib64/libtalloc.so.2
#11 0x00007fbb5b3837b0 in _tc_free_internal () from /lib64/libtalloc.so.2
#12 0x00007fbb5b3837b0 in _tc_free_internal () from /lib64/libtalloc.so.2
#13 0x00007fbb5b3837b0 in _tc_free_internal () from /lib64/libtalloc.so.2
#14 0x00007fbb5b3837b0 in _tc_free_internal () from /lib64/libtalloc.so.2
#15 0x00007fbb5b3837b0 in _tc_free_internal () from /lib64/libtalloc.so.2
#16 0x00007fbb5b3837b0 in _tc_free_internal () from /lib64/libtalloc.so.2
#17 0x00007fbb5b3837b0 in _tc_free_internal () from /lib64/libtalloc.so.2
#18 0x00007fbb5b3837b0 in _tc_free_internal () from /lib64/libtalloc.so.2
#19 0x00007fbb5b3837b0 in _tc_free_internal () from /lib64/libtalloc.so.2
#20 0x00007fbb5b3837b0 in _tc_free_internal () from /lib64/libtalloc.so.2
#21 0x00007fbb5b37e648 in _talloc_free () from /lib64/libtalloc.so.2
#22 0x00007fbb5b37c480 in talloc_lib_fini () from /lib64/libtalloc.so.2
#23 0x00007fbb6066ea96 in _dl_fini () from /lib64/ld-linux-x86-64.so.2
#24 0x00007fbb5adef6bc in __run_exit_handlers () from /lib64/libc.so.6
#25 0x00007fbb5adef7ec in exit () from /lib64/libc.so.6
#26 0x00007fbb5f7d6f5d in orderly_shutdown (status=0)
    at /root/sssd/x86_64/../src/util/server.c:258
#27 0x00007fbb5b595a46 in tevent_common_check_signal () from /lib64/libtevent.so.0
#28 0x00007fbb5b597975 in epoll_event_loop_once () from /lib64/libtevent.so.0
#29 0x00007fbb5b595dab in std_event_loop_once () from /lib64/libtevent.so.0
#30 0x00007fbb5b592098 in _tevent_loop_once () from /lib64/libtevent.so.0
#31 0x00007fbb5b5922eb in tevent_common_loop_wait () from /lib64/libtevent.so.0
#32 0x00007fbb5b595d3b in std_event_loop_wait () from /lib64/libtevent.so.0
#33 0x00007fbb5f7d7e13 in server_loop (main_ctx=0x1b99140)
    at /root/sssd/x86_64/../src/util/server.c:721
#34 0x0000000000407ff4 in main (argc=8, argv=<optimized out>)
    at /root/sssd/x86_64/../src/providers/data_provider_be.c:639

Both issues may possibly be a glibc issue. More investigation is needed on our side before we actually can open them a bug and give then a simple reproducer.

I think both are SSSD bugs revealed by a recent talloc update. For the first one, I don't know yet, for the second one, we need to get rid of the autofree context from server.c.

This was discussed on sambva-technical last week btw.

About the first one, I think we should start with running the testcase with valgrind, maybe that will reveal the issue. The second BT should imo be tracked with a different ticket.

@lslebodn has already started a discussion about this on samba mailing list: https://lists.samba.org/archive/samba-technical/2018-March/126552.html

@lslebodn, please, let us know whether you need something from us that could help.

As @lslebodn did, I also have tested the libtalloc patch and it did solve both issues.
We're still waiting for a libtalloc release on rawhide for talloc-2.1.13 (AFAICT the maintainer is @lslebodn, so it will be done soon).

I'm closing this ticket now but feel free to re-open it till libtalloc gets updated on Fedora (and then close it again).

Metadata Update from @fidencio:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

5 years ago

Metadata Update from @lslebodn:
- Issue untagged with: KCM
- Issue close_status updated to: Invalid (was: Fixed)

5 years ago

SSSD is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in SSSD's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/SSSD/sssd/issues/4710

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Login to comment on this ticket.

Metadata