#8877 Nightly test failure in test_nfs.py: runner VM runs out of disk space due to huge sssd log file
Opened 2 years ago by sorlov. Modified 2 years ago

Issue noticed in pull requests [testing_master_latest 960] and [testing_master_latest_selinux]

In runner.log (during artifacts collection):

2021-06-08 05:10:38,191    DEBUG  [ipatests.pytest_ipa.integration.host.Host.client1.IPAOpenSSHTransport] RUN ['rm', '-f', '/tmp/tmp.NcLumALhRJ']
2021-06-08 05:10:38,192    DEBUG  [ipatests.pytest_ipa.integration.host.Host.client1.cmd96] RUN ['rm', '-f', '/tmp/tmp.NcLumALhRJ']
2021-06-08 05:10:38,237    DEBUG  [ipatests.pytest_ipa.integration.host.Host.client1.cmd96] Exit code: 0
2021-06-08 07:00:46,126 CRITICAL  Terminating execution, runtime exceeded 9000s
2021-06-08 07:00:46,140 CRITICAL  No free disk space

One of the sssd log files (sssd_kcm.log) has grown to a multi-gigabyte size on the test runner machine (the log is not available in the PR artifacts):

# ls -lh jobs/2c404c56-c6ce-11eb-b200-fa163ef8383a/test_integration-test_nfs.py-TestNFS-install/client1.ipa.test/var/log/sssd
total 17G
-rw-r--r--. 1 root root 5.6K Jun  6 16:22 index.html
-rw-------. 1 root root 1.7K Jun  6 14:02 ldap_child.log
-rw-------. 1 root root  259 Jun  6 14:02 sssd_implicit_files.log.gz
-rw-------. 1 root root  36K Jun  6 14:17 sssd_ipa.test.log.gz
-rw-------. 1 root root  17G Jun  6 14:11 sssd_kcm.log
-rw-------. 1 root root  90M Jun  6 16:22 sssd_kcm.log.gz
-rw-------. 1 root root 8.3K Jun  6 14:21 sssd.log.gz
-rw-------. 1 root root  26K Jun  6 14:21 sssd_nss.log.gz
-rw-------. 1 root root  11K Jun  6 14:02 sssd_pac.log
-rw-------. 1 root root 1.5K Jun  6 14:02 sssd_pam.log.gz
-rw-------. 1 root root 1.4K Jun  6 14:02 sssd_ssh.log.gz
-rw-------. 1 root root 1.4K Jun  6 14:02 sssd_sudo.log.gz

# tail -30 sssd_kcm.log

(2021-06-06 14:11:37): [kcm] [kcm_recv] (0x0010): Failed to parse data (74, Bad message), aborting client
(2021-06-06 14:11:37): [kcm] [kcm_reply_error] (0x0040): KCM operation returns failure [74]: Bad message
********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE:
   *  (2021-06-06 14:11:37): [kcm] [kcm_recv] (0x0010): Failed to parse data (74, Bad message), aborting client
   *  (2021-06-06 14:11:37): [kcm] [kcm_reply_error] (0x0040): KCM operation returns failure [74]: Bad message
********************** BACKTRACE DUMP ENDS HERE *********************************

(2021-06-06 14:11:37): [kcm] [kcm_input_parse] (0x0020): Illegal zero-length message
********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE:
   *  (2021-06-06 14:11:37): [kcm] [kcm_failbuf_construct] (0x1000): Sent reply with error -1765328188
   *  (2021-06-06 14:11:37): [kcm] [kcm_input_parse] (0x1000): Received message with length 0
   *  (2021-06-06 14:11:37): [kcm] [kcm_input_parse] (0x0020): Illegal zero-length message
********************** BACKTRACE DUMP ENDS HERE *********************************

(2021-06-06 14:11:37): [kcm] [kcm_recv] (0x0010): Failed to parse data (74, Bad message), aborting client
(2021-06-06 14:11:37): [kcm] [kcm_reply_error] (0x0040): KCM operation returns failure [74]: Bad message
********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE:
   *  (2021-06-06 14:11:37): [kcm] [kcm_recv] (0x0010): Failed to parse data (74, Bad message), aborting client
   *  (2021-06-06 14:11:37): [kcm] [kcm_reply_error] (0x0040): KCM operation returns failure [74]: Bad message
********************** BACKTRACE DUMP ENDS HERE *********************************

(2021-06-06 14:11:37): [kcm] [kcm_input_parse] (0x0020): Illegal zero-length message
********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE:
   *  (2021-06-06 14:11:37): [kcm] [kcm_failbuf_construct] (0x1000): Sent reply with error -1765328188
   *  (2021-06-06 14:11:37): [kcm] [kcm_input_parse] (0x1000): Received message with length 0
   *  (2021-06-06 14:11:37): [kcm] [kcm_input_parse] (0x0020): Illegal zero-length message
********************** BACKTRACE DUMP ENDS HERE *********************************

(2021-06-06 14:11:37): [kcm] [kcm_recv] (0x0010): Failed to parse data (74, Bad message), aborting client

The log file starts to grow after the test mounts the NFS share using the command

mount -t nfs4 -o sec=krb5p,vers=4.0 master.testrelm.test:/exports/krbnfs /mnt/krb -v

Here is a minimal automated test demonstrating this behavior:
https://github.com/wladich/freeipa/blob/reproduce-sssd-huge-log/ipatests/test_integration/test_sssd_huge_log_reproducer.py
A test report for this test is also available.
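
For context, here is a rough sketch of the kind of steps such a reproducer goes through, written against the ipatests IntegrationTest framework. This is only an illustration, not a copy of the linked test: the class name, the exact NFS setup commands and the size threshold are assumptions; the linked test above is the authoritative reproducer.

# Illustrative sketch only; helper names and the NFS setup are simplified.
from ipatests.pytest_ipa.integration import tasks
from ipatests.test_integration.base import IntegrationTest


class TestSSSDHugeLogReproducer(IntegrationTest):
    num_clients = 1
    topology = 'line'  # install the master and enroll one client

    def test_krb5p_nfs_mount_does_not_flood_kcm_log(self):
        master, client = self.master, self.clients[0]

        # Create the NFS service principal and export a krb5p-protected share.
        tasks.kinit_admin(master)
        master.run_command(['ipa', 'service-add', 'nfs/%s' % master.hostname])
        master.run_command(['ipa-getkeytab', '-p', 'nfs/%s' % master.hostname,
                            '-k', '/etc/krb5.keytab'])
        master.run_command(['mkdir', '-p', '/exports/krbnfs'])
        master.put_file_contents('/etc/exports.d/krbnfs.exports',
                                 '/exports/krbnfs *(rw,sec=krb5p)\n')
        master.run_command(['systemctl', 'restart', 'nfs-server'])

        # Mount on the client; sssd_kcm.log starts growing right after this step.
        client.run_command(['systemctl', 'restart', 'rpc-gssd'])
        client.run_command(['mkdir', '-p', '/mnt/krb'])
        client.run_command(['mount', '-t', 'nfs4', '-o', 'sec=krb5p,vers=4.0',
                            master.hostname + ':/exports/krbnfs', '/mnt/krb', '-v'])

        # Fail if sssd_kcm.log exceeds a sane size (100 MiB) on either host.
        for host in (master, client):
            result = host.run_command(
                ['stat', '-c', '%s', '/var/log/sssd/sssd_kcm.log'],
                raiseonerr=False)
            assert result.returncode != 0 or int(result.stdout_text) < 100 * 2 ** 20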

Version information:

Fedora 34
sssd-common-2.5.0-2.fc34.x86_64
krb5-libs-1.19.1-8.fc34.x86_64

Other notes:

  • Status of the simplified test on Fedora 34 with different package versions:
    • sssd 2.5.0-2, krb5-libs 1.19.1-3 - Fail
    • sssd 2.5.0-2, krb5-libs 1.19.1-8 - Fail
    • sssd 2.5.1-1, krb5-libs 1.19.1-8 - Pass
  • This failure is not observed on:
    • Fedora 34 with sssd 2.4.2-3 and krb5-libs 1.19.1-3
    • Fedora 33 with sssd-client-2.4.2-2.fc33.x86_64 and krb5-libs-1.18.2-29.fc33.x86_64
  • This failure is also observed on

An important detail is that libkrb5 was also upgraded to krb5-libs-1.19.1-8.fc34.
This brings in a few KCM/SSSD-related patches in libkrb5: https://src.fedoraproject.org/rpms/krb5/commits/f34

CC: @rharwood

Not sure what you're looking for from me. These patches were requested by SSSD starting in https://bugzilla.redhat.com/show_bug.cgi?id=1956388 and also upstream.

Possibly related is that sssd still doesn't implement KCM_OP_RETRIEVE. But that should show up as an unknown operation, not a parse failure with a backtrace.

master:

  • bcb8d4d ipatests: temporary disable execution of test_nfs.py::TestNFS in nightly runs

Metadata Update from @frenaud:
- Issue tagged with: test-failure, tests


ipa-4-9:

  • 6ee14f5 ipatests: temporary disable execution of test_nfs.py::TestNFS in nightly runs

Issue updated:

  • the log starts to grow after the mount -t nfs ... command
  • added a link to a minimal automated test reproducing the issue; a test report is also provided

I tried to reproduce the issue with the same packages and steps mentioned in the test, but I didn't see this error or the big file. Can you provide a reproducer machine that I can join to debug the issue?

Issue updated: added test results with (1) krb5 downgraded and (2) sssd upgraded to 2.5.1-1

The issue is not observed with sssd-common-2.5.1-1.fc34

@pbrezina @sbose Please comment on whether investigation of the failure in the current version is still desired.

ipa-4-9:

  • c9f5acc ipatests: disable test_nfs.py::TestNFS in nightly runs on Fedora 33

master:

  • accc27e ipatests: disable test_nfs.py::TestNFS in nightly runs on Fedora 33

@atikhonov @sbose @pbrezina
I have executed the tests with 2.5.1-1 several times and found that the issue is still there. The point is that the failure is intermittent, which can explain why @pbrezina could not reproduce it manually (I also tried and did not succeed).

Note: the huge log appears on the NFS server side after the client calls mount.

It appears to be some sort of race condition; I don't know which component triggers it. The run that does not go into the loop clearly shows that KCM sends all data and then we get two client disconnects, whereas the loop case gets only one disconnect and only after that does KCM send its data. This makes KCM go into a loop where the file descriptor is always readable, but with no data (see the sketch after the log excerpts below).

It feels like there is some error in rpc.gssd or libkrb5 that does not disconnect correctly under some circumstances, and KCM can't cope with it.

Loop:
(2021-06-14 14:10:15): [kcm] [kcm_cmd_done] (0x0400): KCM operation GET_CRED_LIST returned [0]: Success
(2021-06-14 14:10:15): [kcm] [kcm_send_reply] (0x2000): Sending a reply
(2021-06-14 14:10:15): [kcm] [kcm_output_construct] (0x1000): Sending a reply with 4736 bytes of payload
(2021-06-14 14:10:15): [kcm] [queue_removal_cb] (0x0200): Removed queue for 0
(2021-06-14 14:10:15): [kcm] [kcm_recv] (0x4000): Client closed connection.
(2021-06-14 14:10:15): [kcm] [client_close_fn] (0x2000): Terminated client [0x55cb3758eab0][14]
(2021-06-14 14:10:15): [kcm] [kcm_send] (0x2000): All data sent!
(2021-06-14 14:10:15): [kcm] [kcm_input_parse] (0x1000): Received message with length 0
(2021-06-14 14:10:15): [kcm] [kcm_input_parse] (0x0020): Illegal zero-length message

No loop:
(2021-06-14 14:11:47): [kcm] [kcm_cmd_done] (0x0400): KCM operation GET_CRED_LIST returned [0]: Success
(2021-06-14 14:11:47): [kcm] [kcm_send_reply] (0x2000): Sending a reply
(2021-06-14 14:11:47): [kcm] [kcm_output_construct] (0x1000): Sending a reply with 4736 bytes of payload
(2021-06-14 14:11:47): [kcm] [queue_removal_cb] (0x0200): Removed queue for 0
(2021-06-14 14:11:47): [kcm] [kcm_send] (0x2000): All data sent!
(2021-06-14 14:11:47): [kcm] [kcm_send] (0x2000): All data sent!
(2021-06-14 14:11:47): [kcm] [kcm_recv] (0x4000): Client closed connection.
(2021-06-14 14:11:47): [kcm] [client_close_fn] (0x2000): Terminated client [0x55ad2d6b61a0][14]
(2021-06-14 14:11:47): [kcm] [kcm_recv] (0x4000): Client closed connection.
(2021-06-14 14:11:47): [kcm] [client_close_fn] (0x2000): Terminated client [0x55ad2d6b6ab0][15]
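
To make the failure mode easier to follow, here is a minimal Python sketch of the mechanism (an illustration only, not SSSD's actual C code; the 4-byte length prefix is taken from the strace analysis further down, and the big-endian byte order in the sketch is an assumption). If the read handler treats a zero-length message as a soft error and returns without consuming the data or closing the connection, a level-triggered event loop keeps re-invoking it, and every iteration writes the same error plus backtrace into the log:

import selectors
import socket
import struct

sel = selectors.DefaultSelector()


def on_readable(conn):
    """Read handler for a KCM-like length-prefixed protocol (buggy version)."""
    header = conn.recv(4, socket.MSG_PEEK)  # peek at the 4-byte length prefix
    if len(header) < 4:
        sel.unregister(conn)
        conn.close()           # peer hung up; drop the connection
        return
    (length,) = struct.unpack('>I', header)
    if length == 0:
        # Buggy behaviour: log the error and return without consuming the bytes
        # or closing the connection. The fd stays readable, so a level-triggered
        # event loop re-enters this handler forever -- this is what floods the log.
        print('Illegal zero-length message')
        return
    conn.recv(4 + length)      # consume header + payload and process it


# Simulate the misbehaving peer: a bogus all-zero length prefix and nothing else.
server, client = socket.socketpair()
client.sendall(b'\x00\x00\x00\x00')
sel.register(server, selectors.EVENT_READ, on_readable)

for _ in range(5):             # a real event loop would spin here indefinitely
    for key, _events in sel.select(timeout=1):
        key.data(key.fileobj)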

I also observed the issue with "Fedora 34 with sssd 2.4.2-3 and krb5-libs 1.19.1-3". Since it is a race condition it does not always appear.

@sorlov Can you please tell me what the latest nfs-utils/krb5 versions that worked are? We have not changed this part of the code in KCM for a long time; if there is a bug there, it has been there for a long time, so I think it must be a combination of KCM and either nfs-utils or krb5.

I can reproduce it locally, but even with quite prehistoric versions:

# rpm -q sssd-kcm krb5-libs nfs-utils
sssd-kcm-2.4.2-3.fc34.x86_64
krb5-libs-1.18.2-20.fc34.x86_64
nfs-utils-2.5.2-0.fc34.x86_64

Neither sssd-kcm (Wed Mar 31 2021) nor krb5-libs (Thu Aug 13 2020) contains recent changes; both were released quite some time ago. nfs-utils is also an older release (Mon Oct 26 2020).

So the problem has been there for a long time, and the reason it started to appear in your tests must be that Alexey implemented the debug backtrace that is printed to the log on errors.

I have no idea why this happens; it does not seem to be an issue in KCM itself, but KCM should be able to kill the connection instead of looping over it forever.

Also, the obtained ticket looks weird, but it is the same in both the OK and not-OK cases:

Ticket cache: KCM:0:34305
Default principal: nfs/master.ldap.vm@LDAP.VM

Valid starting       Expires              Service principal
01/01/1970 00:00:00  01/01/1970 00:00:00  Encrypted/Credentials/v1@X-GSSPROXY:

@pbrezina this is a perfectly correct ticket in a ccache -- for GSSProxy. ;)

When GSSProxy is active, it stores the encrypted ticket in the ccache in a specially crafted entry with that specific principal name. Only GSSAPI clients that use GSSProxy can see the actual ticket; the rest only see the encrypted entry.

I captured it in strace; it is clear that the connection from which we read the zero-length message sends 5 valid messages and reads 5 valid answers. After this, the connection is closed, twice. After the second close, SSSD reads the four bytes that indicate the message length and they are all zeros, and we get into the loop because there is some more data available on the fd but we never get to read it (we keep trying to read a 0-length message).

There is no sendmsg from rpc.gssd that would send this message. The only strange thing visible in strace (attached together with the sssd-kcm log in case anyone is interested) is the double close (one from libkrb5 and the other from gssproxy):

  • strace -e close,connect,sendmsg,read --timestamps=format:time,precision:us -k -fp $(pidof rpc.gssd)
[pid 1120436] 12:49:04.999577 close(12) = 0
 > /usr/lib64/libpthread-2.33.so(__close+0x3b) [0x129bb]
 > /usr/lib64/libkrb5.so.3.3(kcm_close.lto_priv.0+0x84) [0x37f14]
 > /usr/lib64/gssproxy/proxymech.so(gppint_retrieve_remote_creds+0x23d) [0x1318d]
 > /usr/lib64/gssproxy/proxymech.so(gssi_acquire_cred_from+0x1c8) [0x17378]
 > /usr/lib64/libgssapi_krb5.so.2.2(gss_add_cred_from+0x521) [0x18331]
 > /usr/lib64/libgssapi_krb5.so.2.2(gss_acquire_cred_from+0x20f) [0x1b00f]
 > /usr/lib64/libgssapi_krb5.so.2.2(gss_acquire_cred+0x29) [0x1b2a9]
 > /usr/sbin/rpc.gssd(gssd_acquire_krb5_cred+0x4f) [0x75ef]
 > /usr/sbin/rpc.gssd(create_auth_rpc_client.constprop.0+0x30c) [0xc77c]
 > /usr/sbin/rpc.gssd(process_krb5_upcall+0x21a) [0xd34a]
 > /usr/sbin/rpc.gssd(handle_gssd_upcall+0x4a6) [0xe7c6]
 > /usr/lib64/libpthread-2.33.so(start_thread+0xe8) [0x9298]
 > /usr/lib64/libc-2.33.so(__clone+0x42) [0x1006a2]
[pid 1120436] 12:49:05.000248 read(3, "\200\0\2\260", 4) = 4
 > /usr/lib64/libpthread-2.33.so(read+0x4c) [0x1292c]
 > /usr/lib64/gssproxy/proxymech.so(gpm_make_call+0x52f) [0xefcf]
 > /usr/lib64/gssproxy/proxymech.so(gpm_acquire_cred+0xec) [0x11f2c]
 > /usr/lib64/gssproxy/proxymech.so(gssi_acquire_cred_from+0x230) [0x173e0]
 > /usr/lib64/libgssapi_krb5.so.2.2(gss_add_cred_from+0x521) [0x18331]
 > /usr/lib64/libgssapi_krb5.so.2.2(gss_acquire_cred_from+0x20f) [0x1b00f]
 > /usr/lib64/libgssapi_krb5.so.2.2(gss_acquire_cred+0x29) [0x1b2a9]
 > /usr/sbin/rpc.gssd(gssd_acquire_krb5_cred+0x4f) [0x75ef]
 > /usr/sbin/rpc.gssd(create_auth_rpc_client.constprop.0+0x30c) [0xc77c]
 > /usr/sbin/rpc.gssd(process_krb5_upcall+0x21a) [0xd34a]
 > /usr/sbin/rpc.gssd(handle_gssd_upcall+0x4a6) [0xe7c6]
 > /usr/lib64/libpthread-2.33.so(start_thread+0xe8) [0x9298]
 > /usr/lib64/libc-2.33.so(__clone+0x42) [0x1006a2]
[pid 1120436] 12:49:05.000388 read(3, "\0\0\0\226\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 688) = 688
 > /usr/lib64/libpthread-2.33.so(read+0x4c) [0x1292c]
 > /usr/lib64/gssproxy/proxymech.so(gpm_make_call+0x5f7) [0xf097]
 > /usr/lib64/gssproxy/proxymech.so(gpm_acquire_cred+0xec) [0x11f2c]
 > /usr/lib64/gssproxy/proxymech.so(gssi_acquire_cred_from+0x230) [0x173e0]
 > /usr/lib64/libgssapi_krb5.so.2.2(gss_add_cred_from+0x521) [0x18331]
 > /usr/lib64/libgssapi_krb5.so.2.2(gss_acquire_cred_from+0x20f) [0x1b00f]
 > /usr/lib64/libgssapi_krb5.so.2.2(gss_acquire_cred+0x29) [0x1b2a9]
 > /usr/sbin/rpc.gssd(gssd_acquire_krb5_cred+0x4f) [0x75ef]
 > /usr/sbin/rpc.gssd(create_auth_rpc_client.constprop.0+0x30c) [0xc77c]
 > /usr/sbin/rpc.gssd(process_krb5_upcall+0x21a) [0xd34a]
 > /usr/sbin/rpc.gssd(handle_gssd_upcall+0x4a6) [0xe7c6]
 > /usr/lib64/libpthread-2.33.so(start_thread+0xe8) [0x9298]
 > /usr/lib64/libc-2.33.so(__clone+0x42) [0x1006a2]
[pid 1120436] 12:49:05.000463 close(12) = 0
 > /usr/lib64/libpthread-2.33.so(__close+0x3b) [0x129bb]
 > /usr/lib64/gssproxy/proxymech.so(gpm_make_call+0x62b) [0xf0cb]
 > /usr/lib64/gssproxy/proxymech.so(gpm_acquire_cred+0xec) [0x11f2c]
 > /usr/lib64/gssproxy/proxymech.so(gssi_acquire_cred_from+0x230) [0x173e0]
 > /usr/lib64/libgssapi_krb5.so.2.2(gss_add_cred_from+0x521) [0x18331]
 > /usr/lib64/libgssapi_krb5.so.2.2(gss_acquire_cred_from+0x20f) [0x1b00f]
 > /usr/lib64/libgssapi_krb5.so.2.2(gss_acquire_cred+0x29) [0x1b2a9]
 > /usr/sbin/rpc.gssd(gssd_acquire_krb5_cred+0x4f) [0x75ef]
 > /usr/sbin/rpc.gssd(create_auth_rpc_client.constprop.0+0x30c) [0xc77c]
 > /usr/sbin/rpc.gssd(process_krb5_upcall+0x21a) [0xd34a]
 > /usr/sbin/rpc.gssd(handle_gssd_upcall+0x4a6) [0xe7c6]
 > /usr/lib64/libpthread-2.33.so(start_thread+0xe8) [0x9298]
 > /usr/lib64/libc-2.33.so(__clone+0x42) [0x1006a2]
[pid 1120436] 12:49:05.000588 close(10) = 0

There is no connect in between, so it should be the same fd. The file descriptor becomes readable in sssd-kcm after the second close.

I opened https://github.com/SSSD/sssd/pull/5688 to avoid getting into the loop. But I still don't know what's causing the incoming invalid message.
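
The idea of that change, expressed in terms of the earlier Python sketch (again only an illustration of the approach, not the actual SSSD patch): treat a zero-length or otherwise unparsable message as fatal for that client and tear the connection down, so the file descriptor cannot keep the event loop, and the log, spinning:

def on_readable_fixed(conn):
    """Same handler as in the earlier sketch, but an invalid message terminates the client."""
    header = conn.recv(4, socket.MSG_PEEK)
    if len(header) < 4:
        sel.unregister(conn)
        conn.close()
        return
    (length,) = struct.unpack('>I', header)
    if length == 0:
        # Fixed behaviour: report the error once, then drop the client instead
        # of leaving the readable-but-unparsable fd registered in the loop.
        print('Illegal zero-length message, terminating client')
        sel.unregister(conn)
        conn.close()
        return
    conn.recv(4 + length)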

@pbrezina could you let us know here when a new SSSD package is available for rawhide and Fedora 34's updates-testing please?

It is sssd-2.5.1-2; it should already be available in f34 and rawhide.
