#8877 Nightly test failure in test_nfs.py: runner VM runs out of disk space due to huge sssd log file
Opened 2 years ago by sorlov. Modified 2 years ago

Issue noticed in pull requests [testing_master_latest 960] and [testing_master_latest_selinux]

In runner.log (during artifacts collection):

2021-06-08 05:10:38,191    DEBUG  [ipatests.pytest_ipa.integration.host.Host.client1.IPAOpenSSHTransport] RUN ['rm', '-f', '/tmp/tmp.NcLumALhRJ']
2021-06-08 05:10:38,192    DEBUG  [ipatests.pytest_ipa.integration.host.Host.client1.cmd96] RUN ['rm', '-f', '/tmp/tmp.NcLumALhRJ']
2021-06-08 05:10:38,237    DEBUG  [ipatests.pytest_ipa.integration.host.Host.client1.cmd96] Exit code: 0
2021-06-08 07:00:46,126 CRITICAL  Terminating execution, runtime exceeded 9000s
2021-06-08 07:00:46,140 CRITICAL  No free disk space

One of the sssd log files (sssd_kcm.log) has grown to a multi-gigabyte size on the test runner machine (the log is not available in the PR artifacts):

# ls -lh jobs/2c404c56-c6ce-11eb-b200-fa163ef8383a/test_integration-test_nfs.py-TestNFS-install/client1.ipa.test/var/log/sssd
total 17G
-rw-r--r--. 1 root root 5.6K Jun  6 16:22 index.html
-rw-------. 1 root root 1.7K Jun  6 14:02 ldap_child.log
-rw-------. 1 root root  259 Jun  6 14:02 sssd_implicit_files.log.gz
-rw-------. 1 root root  36K Jun  6 14:17 sssd_ipa.test.log.gz
-rw-------. 1 root root  17G Jun  6 14:11 sssd_kcm.log
-rw-------. 1 root root  90M Jun  6 16:22 sssd_kcm.log.gz
-rw-------. 1 root root 8.3K Jun  6 14:21 sssd.log.gz
-rw-------. 1 root root  26K Jun  6 14:21 sssd_nss.log.gz
-rw-------. 1 root root  11K Jun  6 14:02 sssd_pac.log
-rw-------. 1 root root 1.5K Jun  6 14:02 sssd_pam.log.gz
-rw-------. 1 root root 1.4K Jun  6 14:02 sssd_ssh.log.gz
-rw-------. 1 root root 1.4K Jun  6 14:02 sssd_sudo.log.gz

# tail -30 sssd_kcm.log

(2021-06-06 14:11:37): [kcm] [kcm_recv] (0x0010): Failed to parse data (74, Bad message), aborting client
(2021-06-06 14:11:37): [kcm] [kcm_reply_error] (0x0040): KCM operation returns failure [74]: Bad message
********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE:
   *  (2021-06-06 14:11:37): [kcm] [kcm_recv] (0x0010): Failed to parse data (74, Bad message), aborting client
   *  (2021-06-06 14:11:37): [kcm] [kcm_reply_error] (0x0040): KCM operation returns failure [74]: Bad message
********************** BACKTRACE DUMP ENDS HERE *********************************

(2021-06-06 14:11:37): [kcm] [kcm_input_parse] (0x0020): Illegal zero-length message
********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE:
   *  (2021-06-06 14:11:37): [kcm] [kcm_failbuf_construct] (0x1000): Sent reply with error -1765328188
   *  (2021-06-06 14:11:37): [kcm] [kcm_input_parse] (0x1000): Received message with length 0
   *  (2021-06-06 14:11:37): [kcm] [kcm_input_parse] (0x0020): Illegal zero-length message
********************** BACKTRACE DUMP ENDS HERE *********************************

(2021-06-06 14:11:37): [kcm] [kcm_recv] (0x0010): Failed to parse data (74, Bad message), aborting client
(2021-06-06 14:11:37): [kcm] [kcm_reply_error] (0x0040): KCM operation returns failure [74]: Bad message
********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE:
   *  (2021-06-06 14:11:37): [kcm] [kcm_recv] (0x0010): Failed to parse data (74, Bad message), aborting client
   *  (2021-06-06 14:11:37): [kcm] [kcm_reply_error] (0x0040): KCM operation returns failure [74]: Bad message
********************** BACKTRACE DUMP ENDS HERE *********************************

(2021-06-06 14:11:37): [kcm] [kcm_input_parse] (0x0020): Illegal zero-length message
********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE:
   *  (2021-06-06 14:11:37): [kcm] [kcm_failbuf_construct] (0x1000): Sent reply with error -1765328188
   *  (2021-06-06 14:11:37): [kcm] [kcm_input_parse] (0x1000): Received message with length 0
   *  (2021-06-06 14:11:37): [kcm] [kcm_input_parse] (0x0020): Illegal zero-length message
********************** BACKTRACE DUMP ENDS HERE *********************************

(2021-06-06 14:11:37): [kcm] [kcm_recv] (0x0010): Failed to parse data (74, Bad message), aborting client

The log file starts to grow after the test mounts the NFS share using the command

mount -t nfs4 -o sec=krb5p,vers=4.0 master.testrelm.test:/exports/krbnfs /mnt/krb -v

Here is a minimal automated test demonstrating this behavior:
https://github.com/wladich/freeipa/blob/reproduce-sssd-huge-log/ipatests/test_integration/test_sssd_huge_log_reproducer.py
A test report for this test is also available.
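
For context, here is a rough sketch of the kind of steps such a reproducer goes through, written against the ipatests IntegrationTest framework. This is only an illustration, not a copy of the linked test: the class name, the exact NFS setup commands and the size threshold are assumptions; the linked test above is the authoritative reproducer.

# Illustrative sketch only; helper names and the NFS setup are simplified.
from ipatests.pytest_ipa.integration import tasks
from ipatests.test_integration.base import IntegrationTest


class TestSSSDHugeLogReproducer(IntegrationTest):
    num_clients = 1
    topology = 'line'  # install the master and enroll one client

    def test_krb5p_nfs_mount_does_not_flood_kcm_log(self):
        master, client = self.master, self.clients[0]

        # Create the NFS service principal and export a krb5p-protected share.
        tasks.kinit_admin(master)
        master.run_command(['ipa', 'service-add', 'nfs/%s' % master.hostname])
        master.run_command(['ipa-getkeytab', '-p', 'nfs/%s' % master.hostname,
                            '-k', '/etc/krb5.keytab'])
        master.run_command(['mkdir', '-p', '/exports/krbnfs'])
        master.put_file_contents('/etc/exports.d/krbnfs.exports',
                                 '/exports/krbnfs *(rw,sec=krb5p)\n')
        master.run_command(['systemctl', 'restart', 'nfs-server'])

        # Mount on the client; sssd_kcm.log starts growing right after this step.
        client.run_command(['systemctl', 'restart', 'rpc-gssd'])
        client.run_command(['mkdir', '-p', '/mnt/krb'])
        client.run_command(['mount', '-t', 'nfs4', '-o', 'sec=krb5p,vers=4.0',
                            master.hostname + ':/exports/krbnfs', '/mnt/krb', '-v'])

        # Fail if sssd_kcm.log exceeds a sane size (100 MiB) on either host.
        for host in (master, client):
            result = host.run_command(
                ['stat', '-c', '%s', '/var/log/sssd/sssd_kcm.log'],
                raiseonerr=False)
            assert result.returncode != 0 or int(result.stdout_text) < 100 * 2 ** 20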

Version information:

Fedora 34
sssd-common-2.5.0-2.fc34.x86_64
krb5-libs-1.19.1-8.fc34.x86_64

Other notes:

  • Status of the simplified test on Fedora 34 with different package versions:
    • sssd 2.5.0-2, krb5-libs 1.19.1-3 - Fail
    • sssd 2.5.0-2, krb5-libs 1.19.1-8 - Fail
    • sssd 2.5.1-1, krb5-libs 1.19.1-8 - Pass
  • This failure is not observed on:
    • Fedora 34 with sssd 2.4.2-3 and krb5-libs 1.19.1-3
    • Fedora 33 with sssd-client-2.4.2-2.fc33.x86_64 and krb5-libs-1.18.2-29.fc33.x86_64
  • This failure is also observed on

An important detail is that libkrb5 was also upgraded to krb5-libs-1.19.1-8.fc34.
This brings in a few KCM/SSSD-related patches in libkrb5: https://src.fedoraproject.org/rpms/krb5/commits/f34

CC: @rharwood

Not sure what you're looking for from me. These patches were requested by SSSD starting in https://bugzilla.redhat.com/show_bug.cgi?id=1956388 and also upstream.

Possibly related is that sssd still doesn't implement KCM_OP_RETRIEVE. But that should show up as an unknown operation, not a parse failure with a backtrace.

master:

  • bcb8d4d ipatests: temporary disable execution of test_nfs.py::TestNFS in nightly runs

Metadata Update from @frenaud:
- Issue tagged with: test-failure, tests


ipa-4-9:

  • 6ee14f5 ipatests: temporary disable execution of test_nfs.py::TestNFS in nightly runs

Issue updated:

  • the log starts to grow after the mount -t nfs ... command
  • added a link to a minimal automated test reproducing the issue; a test report is also provided

I tried to reproduce the issue with the same packages and steps mentioned in the test, but I didn't see this error or the big file. Can you provide a reproducer machine that I can join to debug the issue?

Issue updated: added test results with (1) krb5 downgraded and (2) sssd upgraded to 2.5.1-1

The issue is not observed with sssd-common-2.5.1-1.fc34

@pbrezina @sbose Please comment on whether investigation of the failure in the current version is still desired.

ipa-4-9:

  • c9f5acc ipatests: disable test_nfs.py::TestNFS in nightly runs on Fedora 33

master:

  • accc27e ipatests: disable test_nfs.py::TestNFS in nightly runs on Fedora 33

@atikhonov @sbose @pbrezina
I have executed the tests with 2.5.1-1 several times and found that the issue is still there. The point is that the failure is intermittent, which can explain why @pbrezina could not reproduce it manually (I also tried and did not succeed).

Note: the huge log appears on the NFS server side after the client calls mount.

It appears to be some sort of race condition; I don't know which component triggers it. The run that does not go into the loop clearly shows that KCM sends all data and then we get two client disconnects, whereas the loop case gets only one disconnect and only after that does KCM send its data. This makes KCM go into a loop where the file descriptor is always readable, but with no data (see the sketch after the log excerpts below).

It feels like there is some error in rpc.gssd or libkrb5 that does not disconnect correctly under some circumstances, and KCM can't cope with it.

Loop:
(2021-06-14 14:10:15): [kcm] [kcm_cmd_done] (0x0400): KCM operation GET_CRED_LIST returned [0]: Success
(2021-06-14 14:10:15): [kcm] [kcm_send_reply] (0x2000): Sending a reply
(2021-06-14 14:10:15): [kcm] [kcm_output_construct] (0x1000): Sending a reply with 4736 bytes of payload
(2021-06-14 14:10:15): [kcm] [queue_removal_cb] (0x0200): Removed queue for 0
(2021-06-14 14:10:15): [kcm] [kcm_recv] (0x4000): Client closed connection.
(2021-06-14 14:10:15): [kcm] [client_close_fn] (0x2000): Terminated client [0x55cb3758eab0][14]
(2021-06-14 14:10:15): [kcm] [kcm_send] (0x2000): All data sent!
(2021-06-14 14:10:15): [kcm] [kcm_input_parse] (0x1000): Received message with length 0
(2021-06-14 14:10:15): [kcm] [kcm_input_parse] (0x0020): Illegal zero-length message

No loop:
(2021-06-14 14:11:47): [kcm] [kcm_cmd_done] (0x0400): KCM operation GET_CRED_LIST returned [0]: Success
(2021-06-14 14:11:47): [kcm] [kcm_send_reply] (0x2000): Sending a reply
(2021-06-14 14:11:47): [kcm] [kcm_output_construct] (0x1000): Sending a reply with 4736 bytes of payload
(2021-06-14 14:11:47): [kcm] [queue_removal_cb] (0x0200): Removed queue for 0
(2021-06-14 14:11:47): [kcm] [kcm_send] (0x2000): All data sent!
(2021-06-14 14:11:47): [kcm] [kcm_send] (0x2000): All data sent!
(2021-06-14 14:11:47): [kcm] [kcm_recv] (0x4000): Client closed connection.
(2021-06-14 14:11:47): [kcm] [client_close_fn] (0x2000): Terminated client [0x55ad2d6b61a0][14]
(2021-06-14 14:11:47): [kcm] [kcm_recv] (0x4000): Client closed connection.
(2021-06-14 14:11:47): [kcm] [client_close_fn] (0x2000): Terminated client [0x55ad2d6b6ab0][15]
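
To make the failure mode easier to follow, here is a minimal Python sketch of the mechanism (an illustration only, not SSSD's actual C code; the 4-byte length prefix is taken from the strace analysis further down, and the big-endian byte order in the sketch is an assumption). If the read handler treats a zero-length message as a soft error and returns without consuming the data or closing the connection, a level-triggered event loop keeps re-invoking it, and every iteration writes the same error plus backtrace into the log:

import selectors
import socket
import struct

sel = selectors.DefaultSelector()


def on_readable(conn):
    """Read handler for a KCM-like length-prefixed protocol (buggy version)."""
    header = conn.recv(4, socket.MSG_PEEK)  # peek at the 4-byte length prefix
    if len(header) < 4:
        sel.unregister(conn)
        conn.close()           # peer hung up; drop the connection
        return
    (length,) = struct.unpack('>I', header)
    if length == 0:
        # Buggy behaviour: log the error and return without consuming the bytes
        # or closing the connection. The fd stays readable, so a level-triggered
        # event loop re-enters this handler forever -- this is what floods the log.
        print('Illegal zero-length message')
        return
    conn.recv(4 + length)      # consume header + payload and process it


# Simulate the misbehaving peer: a bogus all-zero length prefix and nothing else.
server, client = socket.socketpair()
client.sendall(b'\x00\x00\x00\x00')
sel.register(server, selectors.EVENT_READ, on_readable)

for _ in range(5):             # a real event loop would spin here indefinitely
    for key, _events in sel.select(timeout=1):
        key.data(key.fileobj)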

I also observed the issue with "Fedora 34 with sssd 2.4.2-3 and krb5-libs 1.19.1-3". Since it is a race condition it does not always appear.

@sorlov Can you please tell me what the latest nfs-utils/krb5 versions that worked are? We have not changed this part of the code in KCM for a long time; if there is a bug there, it has been there for a long time, so I think it must be a combination of KCM and either nfs-utils or krb5.

I can reproduce it locally, but even with quite prehistoric versions:

# rpm -q sssd-kcm krb5-libs nfs-utils
sssd-kcm-2.4.2-3.fc34.x86_64
krb5-libs-1.18.2-20.fc34.x86_64
nfs-utils-2.5.2-0.fc34.x86_64

Neither sssd-kcm (Wed Mar 31 2021) nor krb5-libs (Thu Aug 13 2020) contains recent changes; both were released quite some time ago. nfs-utils is also an older release (Mon Oct 26 2020).

So the problem has been there for a long time, and the reason it started to appear in your tests must be that Alexey implemented the debug backtrace that is printed to the log on errors.

I have no idea why this happens; it does not seem to be an issue in KCM itself, but KCM should be able to kill the connection instead of looping over it forever.

Also, the obtained ticket looks weird, but it is the same in both the OK and not-OK cases:

Ticket cache: KCM:0:34305
Default principal: nfs/master.ldap.vm@LDAP.VM

Valid starting       Expires              Service principal
01/01/1970 00:00:00  01/01/1970 00:00:00  Encrypted/Credentials/v1@X-GSSPROXY:

@pbrezina this is a perfectly correct ticket in a ccache -- for GSSProxy. ;)

When GSSProxy is active, it stores the encrypted ticket in the ccache in a specially crafted entry with that specific principal name. Only GSSAPI clients that use GSSProxy can see the actual ticket; the rest only see the encrypted entry.

I captured it in strace; it is clear that the connection from which we read the zero-length message sends 5 valid messages and reads 5 valid answers. After this, the connection is closed, twice. After the second close, SSSD reads the four bytes that indicate the message length and they are all zeros, and we get into the loop because there is some more data available on the fd but we never get to read it (we keep trying to read a 0-length message).

There is no sendmsg from rpc.gssd that would send this message. The only strange thing visible in strace (attached together with the sssd-kcm log in case anyone is interested) is the double close (one from libkrb5 and the other from gssproxy):

  • strace -e close,connect,sendmsg,read --timestamps=format:time,precision:us -k -fp $(pidof rpc.gssd)
[pid 1120436] 12:49:04.999577 close(12) = 0
 > /usr/lib64/libpthread-2.33.so(__close+0x3b) [0x129bb]
 > /usr/lib64/libkrb5.so.3.3(kcm_close.lto_priv.0+0x84) [0x37f14]
 > /usr/lib64/gssproxy/proxymech.so(gppint_retrieve_remote_creds+0x23d) [0x1318d]
 > /usr/lib64/gssproxy/proxymech.so(gssi_acquire_cred_from+0x1c8) [0x17378]
 > /usr/lib64/libgssapi_krb5.so.2.2(gss_add_cred_from+0x521) [0x18331]
 > /usr/lib64/libgssapi_krb5.so.2.2(gss_acquire_cred_from+0x20f) [0x1b00f]
 > /usr/lib64/libgssapi_krb5.so.2.2(gss_acquire_cred+0x29) [0x1b2a9]
 > /usr/sbin/rpc.gssd(gssd_acquire_krb5_cred+0x4f) [0x75ef]
 > /usr/sbin/rpc.gssd(create_auth_rpc_client.constprop.0+0x30c) [0xc77c]
 > /usr/sbin/rpc.gssd(process_krb5_upcall+0x21a) [0xd34a]
 > /usr/sbin/rpc.gssd(handle_gssd_upcall+0x4a6) [0xe7c6]
 > /usr/lib64/libpthread-2.33.so(start_thread+0xe8) [0x9298]
 > /usr/lib64/libc-2.33.so(__clone+0x42) [0x1006a2]
[pid 1120436] 12:49:05.000248 read(3, "\200\0\2\260", 4) = 4
 > /usr/lib64/libpthread-2.33.so(read+0x4c) [0x1292c]
 > /usr/lib64/gssproxy/proxymech.so(gpm_make_call+0x52f) [0xefcf]
 > /usr/lib64/gssproxy/proxymech.so(gpm_acquire_cred+0xec) [0x11f2c]
 > /usr/lib64/gssproxy/proxymech.so(gssi_acquire_cred_from+0x230) [0x173e0]
 > /usr/lib64/libgssapi_krb5.so.2.2(gss_add_cred_from+0x521) [0x18331]
 > /usr/lib64/libgssapi_krb5.so.2.2(gss_acquire_cred_from+0x20f) [0x1b00f]
 > /usr/lib64/libgssapi_krb5.so.2.2(gss_acquire_cred+0x29) [0x1b2a9]
 > /usr/sbin/rpc.gssd(gssd_acquire_krb5_cred+0x4f) [0x75ef]
 > /usr/sbin/rpc.gssd(create_auth_rpc_client.constprop.0+0x30c) [0xc77c]
 > /usr/sbin/rpc.gssd(process_krb5_upcall+0x21a) [0xd34a]
 > /usr/sbin/rpc.gssd(handle_gssd_upcall+0x4a6) [0xe7c6]
 > /usr/lib64/libpthread-2.33.so(start_thread+0xe8) [0x9298]
 > /usr/lib64/libc-2.33.so(__clone+0x42) [0x1006a2]
[pid 1120436] 12:49:05.000388 read(3, "\0\0\0\226\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 688) = 688
 > /usr/lib64/libpthread-2.33.so(read+0x4c) [0x1292c]
 > /usr/lib64/gssproxy/proxymech.so(gpm_make_call+0x5f7) [0xf097]
 > /usr/lib64/gssproxy/proxymech.so(gpm_acquire_cred+0xec) [0x11f2c]
 > /usr/lib64/gssproxy/proxymech.so(gssi_acquire_cred_from+0x230) [0x173e0]
 > /usr/lib64/libgssapi_krb5.so.2.2(gss_add_cred_from+0x521) [0x18331]
 > /usr/lib64/libgssapi_krb5.so.2.2(gss_acquire_cred_from+0x20f) [0x1b00f]
 > /usr/lib64/libgssapi_krb5.so.2.2(gss_acquire_cred+0x29) [0x1b2a9]
 > /usr/sbin/rpc.gssd(gssd_acquire_krb5_cred+0x4f) [0x75ef]
 > /usr/sbin/rpc.gssd(create_auth_rpc_client.constprop.0+0x30c) [0xc77c]
 > /usr/sbin/rpc.gssd(process_krb5_upcall+0x21a) [0xd34a]
 > /usr/sbin/rpc.gssd(handle_gssd_upcall+0x4a6) [0xe7c6]
 > /usr/lib64/libpthread-2.33.so(start_thread+0xe8) [0x9298]
 > /usr/lib64/libc-2.33.so(__clone+0x42) [0x1006a2]
[pid 1120436] 12:49:05.000463 close(12) = 0
 > /usr/lib64/libpthread-2.33.so(__close+0x3b) [0x129bb]
 > /usr/lib64/gssproxy/proxymech.so(gpm_make_call+0x62b) [0xf0cb]
 > /usr/lib64/gssproxy/proxymech.so(gpm_acquire_cred+0xec) [0x11f2c]
 > /usr/lib64/gssproxy/proxymech.so(gssi_acquire_cred_from+0x230) [0x173e0]
 > /usr/lib64/libgssapi_krb5.so.2.2(gss_add_cred_from+0x521) [0x18331]
 > /usr/lib64/libgssapi_krb5.so.2.2(gss_acquire_cred_from+0x20f) [0x1b00f]
 > /usr/lib64/libgssapi_krb5.so.2.2(gss_acquire_cred+0x29) [0x1b2a9]
 > /usr/sbin/rpc.gssd(gssd_acquire_krb5_cred+0x4f) [0x75ef]
 > /usr/sbin/rpc.gssd(create_auth_rpc_client.constprop.0+0x30c) [0xc77c]
 > /usr/sbin/rpc.gssd(process_krb5_upcall+0x21a) [0xd34a]
 > /usr/sbin/rpc.gssd(handle_gssd_upcall+0x4a6) [0xe7c6]
 > /usr/lib64/libpthread-2.33.so(start_thread+0xe8) [0x9298]
 > /usr/lib64/libc-2.33.so(__clone+0x42) [0x1006a2]
[pid 1120436] 12:49:05.000588 close(10) = 0

There is no connect in between, so it should be the same fd. The file descriptor becomes readable in sssd-kcm after the second close.

I opened https://github.com/SSSD/sssd/pull/5688 to avoid getting into the loop. But I still don't know what's causing the incoming invalid message.
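
The idea of that change, expressed in terms of the earlier Python sketch (again only an illustration of the approach, not the actual SSSD patch): treat a zero-length or otherwise unparsable message as fatal for that client and tear the connection down, so the file descriptor cannot keep the event loop, and the log, spinning:

def on_readable_fixed(conn):
    """Same handler as in the earlier sketch, but an invalid message terminates the client."""
    header = conn.recv(4, socket.MSG_PEEK)
    if len(header) < 4:
        sel.unregister(conn)
        conn.close()
        return
    (length,) = struct.unpack('>I', header)
    if length == 0:
        # Fixed behaviour: report the error once, then drop the client instead
        # of leaving the readable-but-unparsable fd registered in the loop.
        print('Illegal zero-length message, terminating client')
        sel.unregister(conn)
        conn.close()
        return
    conn.recv(4 + length)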

@pbrezina could you let us know here when a new SSSD package is available for rawhide and Fedora 34's updates-testing please?

It is sssd-2.5.1-2; it should already be available in f34 and rawhide.
