#339 Rawhide update tier0.functional test fails in provisioning
Opened 2 years ago by adamwill. Modified 2 years ago

The tier0.functional CI test for a recent Rawhide update to Cockpit seem to be failing in provisioning:

/plans/upstream
    discover
        how: fmf
        url: https://github.com/cockpit-project/cockpit
        ref: 268.1
        hash: 4c6291857
        summary: 1 test selected
            /test/verify
    provision
        how: connect
        guest: 3.22.175.231
        user: root
        key: /etc/citool.d/id_rsa_artemis
        distro: Fedora Linux 37 (Cloud Edition Prerelease)
        kernel: 5.18.0-0.rc4.20220427git46cf2c613f4b10e.35.fc37.x86_64
        summary: 1 guest provisioned
    prepare
        how: install
        summary: Install required packages
        name: requires
        package: 19 packages requested
            NetworkManager-team
            createrepo_c
            cryptsetup
            dnf-automatic
            firewalld
            git
            glibc-all-langpacks
            libvirt-daemon-config-network
            libvirt-python3
            make
            nfs-utils
            nodejs
            python3
            python3-tracer
            rpm-build
            sssd-dbus
            targetcli
            tlog
            tuned

        summary: 1 preparation applied
    execute
        how: tmt
            00:43:07 /test/verify [1/1]
    finish
        summary: 0 tasks completed

stderr (1/1 lines)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ssh: connect to host 3.22.175.231 port 22: Connection refused

I'm not 100% sure, but that looks like a problem in the infra, not an error on the Cockpit team's side. Anyone know what's going on and how to fix it? This prevents the update going into Rawhide.


@adamw looking at more verbose logs, it seems to me the test breaks ssh:

https://artifacts.dev.testing-farm.io/68b9e592-fcc1-445c-8ba2-f334c778d128/work-upstreamWnFhHb/log.txt

@martinpitt @cockpit can somebody take a look pls? Namely this is weird, tests normally run, and then it breaks, maybe some reboot is involved? ...

.....
15:34:40                 out: 
15:34:40                 out: # Result testNoAdminGroup (__main__.TestLogin) failed
15:34:40                 out: # 1 TEST FAILED [81s on ip-172-31-22-241]
15:34:40                 out: not ok 41 test/verify/check-static-login TestLogin.testNoAdminGroup [ND@-1] # RETRY 2 (be robust against unstable tests)
15:36:03                 out: # ----------------------------------------------------------------------
15:36:03                 out: # testNoAdminGroup (__main__.TestLogin)
15:36:03                 out: Traceback (most recent call last):
15:36:03                 out:   File "/var/ARTIFACTS/work-upstreamWnFhHb/plans/upstream/discover/default/tests/test/common/testlib.py", line 1263, in setUp
15:36:03                 out:     machine.wait_boot()
15:36:03                 out:   File "/var/ARTIFACTS/work-upstreamWnFhHb/plans/upstream/discover/default/tests/bots/machine/machine_core/ssh_connection.py", line 126, in wait_boot
15:36:03                 out:     boot_id = self.wait_user_login()
15:36:03                 out:   File "/var/ARTIFACTS/work-upstreamWnFhHb/plans/upstream/discover/default/tests/bots/machine/machine_core/ssh_connection.py", line 118, in wait_user_login
15:36:03                 out:     raise exceptions.Failure("Timed out waiting for /run/nologin to disappear")
15:36:03                 out: machine_core.exceptions.Failure: Timed out waiting for /run/nologin to disappear
15:36:03                 out: 
15:36:03                 out: # Result testNoAdminGroup (__main__.TestLogin) failed
15:36:03                 out: # 1 TEST FAILED [82s on ip-172-31-22-241]
15:36:03                 out: not ok 41 test/verify/check-static-login TestLogin.testNoAdminGroup [ND@-1]
15:37:22             Command returned '255'.
15:37:22             Append to file '/var/ARTIFACTS/work-upstreamWnFhHb/plans/upstream/execute/data/test/verify/output.txt'.
15:37:22             00:43:11 /test/verify [1/1]
15:37:22             Copy '/var/ARTIFACTS/work-upstreamWnFhHb/plans/upstream/execute/data/test/verify' from the guest to '/'.
15:37:22             Run command 'rsync -Rrz --links --safe-links --protect-args -e ssh -oForwardX11=no -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -oIdentitiesOnly=yes -i/etc/citool.d/id_rsa_artemis -S/tmp/tmp1aec2ups root@13.59.149.153:/var/ARTIFACTS/work-upstreamWnFhHb/plans/upstream/execute/data/test/verify /'.
15:37:22             environment: None
15:37:22             err: Warning: Permanently added '13.59.149.153' (ECDSA) to the list of known hosts.
15:37:23             err: System is going down. Unprivileged users are not permitted to log in anymore. For technical details, see pam_nologin(8).
15:37:23             err: 
15:37:23             Read file '/var/ARTIFACTS/work-upstreamWnFhHb/plans/upstream/execute/data/test/verify/output.txt'.
15:37:23             Check result of '/test/verify'.
15:37:23             Remove our reboot script implementations.
15:37:23             Execute command 'rm "/usr/local/bin/rstrnt-reboot"' on guest '13.59.149.153'.
15:37:23             Run command 'ssh -oForwardX11=no -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -oIdentitiesOnly=yes -i/etc/citool.d/id_rsa_artemis -S/tmp/tmp1aec2ups root@13.59.149.153 export TMT_PLAN_DATA=/var/ARTIFACTS/work-upstreamWnFhHb/plans/upstream/data; rm "/usr/local/bin/rstrnt-reboot"'.
15:37:23             environment: None
15:37:23             err: ssh: connect to host 13.59.149.153 port 22: Connection refused
15:37:23             Command returned '255'.
15:37:23             Error reported is: ('Failed to run command \'ssh -oForwardX11=no -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -oIdentitiesOnly=yes -i/etc/citool.d/id_rsa_artemis -S/tmp/tmp1aec2ups root@13.59.149.153 export TMT_PLAN_DATA=/var/ARTIFACTS/work-upstreamWnFhHb/plans/upstream/data; rm "/usr/local/bin/rstrnt-reboot"\'. Reason: Command returned \'255\'.', 'ssh -oForwardX11=no -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -oIdentitiesOnly=yes -i/etc/citool.d/id_rsa_artemis -S/tmp/tmp1aec2ups root@13.59.149.153 export TMT_PLAN_DATA=/var/ARTIFACTS/work-upstreamWnFhHb/plans/upstream/data; rm "/usr/local/bin/rstrnt-reboot"', 255, '', 'ssh: connect to host 13.59.149.153 port 22: Connection refused\r\n')
15:37:23             Execute command 'rm "/usr/local/bin/rhts-reboot"' on guest '13.59.149.153'.
15:37:23             Run command 'ssh -oForwardX11=no -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -oIdentitiesOnly=yes -i/etc/citool.d/id_rsa_artemis -S/tmp/tmp1aec2ups root@13.59.149.153 export TMT_PLAN_DATA=/var/ARTIFACTS/work-upstreamWnFhHb/plans/upstream/data; rm "/usr/local/bin/rhts-reboot"'.
15:37:23             environment: None
15:37:24             err: ssh: connect to host 13.59.149.153 port 22: Connection refused
15:37:24             Command returned '255'.
.....

But the error could be a lot better ...

We run the exact same tests on every upstream PR, on the same infrastructure, just not on Rawhide. Unfortunately bodhi updates were not auto-created for the 268.1 release, I just did that manually for F36 and F35. I am fairly sure they will succeed, so this smells like some rawhide regression.

testNoAdminGroup() is fine, this fails very early in its setup when it waits for the nologin flag to go away (but it exists because of scheduled reboot). So it's the test that ran before that which triggered an unintended reboot. Unfortunately your referenced test log does not exist any more.

I retried the test over the weekend, with this result page. How do I map that to an artifacts.dev.... URL?

@martinpitt yes, because rawhide is not gated .... things happen :(

TF artifacts link ins the jenkins console:

https://osci-jenkins-1.ci.fedoraproject.org/job/fedora-ci/job/dist-git-pipeline/job/master/134762/console

Thanks. This log is useful. That looks like shutdown -c regressed and does not actually cancel a scheduled reboot any more. This needs to be validated on a Rawhide VM and reported. I can't do this today (PTO/in a train), but can have a look tomorrow.

@martinpitt I'll take a look at the rawhide VM and report the issue.

Tested on the following rawhide image: Fedora-Cloud-Base-Rawhide-20220501.n.0.x86_64.qcow2

shutdown -c and busctl call org.freedesktop.login1 /org/freedesktop/login1 org.freedesktop.login1.Manager CancelScheduledShutdown both seem to cancel the reboot successfully for me. But this as root on a console.

In the logs I spot the following warning which is probably the issue:

07:30:13 out: warning: Failed to cancel shutdown Interactive authentication required.

Thanks, let me know if that was the case. We need to improve the tooling to fix the visibility of the problem, as we cannot make Fedora Rawhide gating to mitigate regressions like this :(

@mvadkert That's the issue, cockpit uses dbus to cancel the shutdown but for some reason polkit interferes? which makes the test not cancel the poweroff and therefore the following tests fall.

‣ Type=method_return  Endian=l  Flags=1  Version=1 Cookie=-1  ReplyCookie=2009  Timestamp="Mon 2022-05-02 13:37:13.555384 UTC"
  Sender=org.freedesktop.DBus  Destination=:1.46
  MESSAGE "u" {
          UINT32 1782;
  };

‣ Type=method_return  Endian=l  Flags=1  Version=1 Cookie=2010  ReplyCookie=360  Timestamp="Mon 2022-05-02 13:37:13.556330 UTC"
  Sender=:1.46  Destination=:1.4
  UniqueName=:1.46
  MESSAGE "(bba{ss})" {
          STRUCT "bba{ss}" {
                  BOOLEAN false;
                  BOOLEAN true;
                  ARRAY "{ss}" {
                          DICT_ENTRY "ss" {
                                  STRING "polkit.retains_authorization_after_challenge";
                                  STRING "1";
                          };
                  };
          };
  };

‣ Type=error  Endian=l  Flags=1  Version=1 Cookie=361  ReplyCookie=1339  Timestamp="Mon 2022-05-02 13:37:13.557643 UTC"
  Sender=:1.4  Destination=:1.41
  ErrorName=org.freedesktop.DBus.Error.InteractiveAuthorizationRequired  ErrorMessage="Interactive authentication required."
  UniqueName=:1.4
  MESSAGE "s" {
          STRING "Interactive authentication required.";
  };

By downgrading figured out this is a systemd regression, so this issue can be closed I think.

https://bugzilla.redhat.com/show_bug.cgi?id=2081039

Thanks for figuring this out, folks.

Note that I think what @mvadkert means about Rawhide not being "gating" is that we do not have a distro-wide gating policy for Rawhide currently - but this update is gated on this failure, because cockpit has a gating.yml that requires this test to pass on Rawhide. The build won't go stable unless the regression is fixed or the failure is waived.

yeah, ideally it would be great to gate systemd on the great cockpit testsuite :(

Login to comment on this ticket.

Metadata