#373 podman: run full system tests
Opened 2 months ago by luap99. Modified 2 months ago

Currently openQA runs a subset of podman's upstream system tests that are shipped as part of the podman-tests rpm.
openQA only runs tests tagged as "distro-integration", which were added for this purpose in https://github.com/containers/podman/pull/19302

However, using this tag has problems. Upstream is responsible for deciding which tests are tagged, but so far we have basically never tagged new tests, and most likely nobody besides me and Ed (who is no longer on our team) knows what this tag does or that we should add it to new tests that are at risk of regressions in external dependencies. And while I could try to pay more attention to that, for most tests it is impossible to know in advance whether this is something that might break or not.

Looking at our git history, I see basically no new tests added with that tag, yet we have added many important tests since then that should be covered.
Consider this commit [1], which was added after this got broken by a kernel update. We could have found the breakage earlier but didn't, because the test was not run at the time.

As such, I think it would be best to run all system tests by default in openQA, so we don't need to deal with this extra complexity upstream and don't end up in situations where we regret that X was not caught just because openQA didn't run it as it didn't have the test tag.

Now there are obvious downsides:

  • running all tests will take much longer: the current upstream source shows 65 tests tagged with distro-integration, and we have 719 tests in total right now, so roughly an 11x increase.
    My assumption is that running the tests serially might take 30 mins or even more, though that totally depends on the hardware used of course. There is a trick to speed things up by running tests in parallel as we do in upstream CI [2]; with that they take around 15 mins on a 4 core system.
  • a lot more tests increases the risk of flakes. From my current observation the currently running tests are already flaky: I see a lot of setup failures, dnf install timeouts and quay.io pull errors. Some of the podman tests are flaky, but we try to fix them or skip them if it is too bad, so I don't think it would make the situation much worse.
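
For reference, the numbers quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the two counts are taken from the text, and the function and variable names are made up for illustration.

```python
# Counts quoted in the discussion above.
TAGGED_TESTS = 65    # tests tagged distro-integration
TOTAL_TESTS = 719    # all upstream system tests


def coverage_summary(tagged: int, total: int) -> dict:
    """Return the share of tagged/untagged tests and the growth factor."""
    return {
        "tagged_share": tagged / total,
        "untagged_share": 1 - tagged / total,
        "growth_factor": total / tagged,
    }


summary = coverage_summary(TAGGED_TESTS, TOTAL_TESTS)
print(f"tagged:   {summary['tagged_share']:.1%}")
print(f"untagged: {summary['untagged_share']:.1%}")
print(f"growth:   {summary['growth_factor']:.1f}x")
```

The growth factor only describes the number of tests, not the runtime, since individual tests differ a lot in how long they take.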

[1] https://github.com/containers/podman/commit/64516e1b8fea5612e9371b956415d559b36a615f
[2] https://github.com/containers/podman/blob/671b240236ad8924ca537b414c8a1ed424955a16/Makefile#L709-L710


Oh, and another thing: I think currently openQA runs the tests only as root? For full coverage the system tests need to be run as root and rootless. There are a lot of root/rootless specific code paths in podman.

So you want to run the entire test suite twice?

If the goal is to go for proper coverage, sure. In upstream CI and fedora gating we actually do 4 runs: podman (root), podman (rootless), podman-remote (root), podman-remote (rootless). That said, the "remote" tests are more about podman's own functionality and would not cover any meaningful new code path in terms of dependency testing, so they are definitely not needed in openQA.

I am not sure how openQA test runs work, but to make things faster we generally have root and rootless defined as individual tasks that run in parallel, so if done like that it should not really affect the overall time.
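
As a rough illustration of why two independent tasks started together cost about as much wall time as the slower one, here is a minimal Python sketch. It says nothing about how openQA would actually schedule the jobs (upstream uses separate CI tasks, not threads on one machine); the function names and the placeholder commands are hypothetical and simply stand in for the real root and rootless runs.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_suite(name: str, command: list[str]) -> tuple[str, int]:
    """Run one suite to completion and return its name and exit code."""
    result = subprocess.run(command, check=False)
    return name, result.returncode


def run_in_parallel(suites: dict[str, list[str]]) -> dict[str, int]:
    """Start every suite at once and collect the exit codes by name."""
    with ThreadPoolExecutor(max_workers=len(suites)) as pool:
        futures = [pool.submit(run_suite, name, cmd) for name, cmd in suites.items()]
        return dict(future.result() for future in futures)


# Placeholder command standing in for a real test run (assumption).
placeholder = [sys.executable, "-c", "import time; time.sleep(0.2)"]
results = run_in_parallel({"root": placeholder, "rootless": placeholder})
print(results)
```

Both placeholders sleep at the same time, so the total is close to one sleep rather than two.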

But I am also cool with going one thing at a time right now. To me the biggest issue is that the distro-integration test tag is missing over 90% of podman tests. Sure, not every single one is relevant, but I don't think the current approach of "we know what is important" is working, and it just adds workload on upstream.

I am not sure if it matters, but AFAIK the SUSE folks seem to be running everything:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/containers/bats/podman.pm

so I tested doing a full test suite run on staging, here. It took 46 minutes and 18 tests seem to have failed.

Ok, so

  1. slirp4netns is missing. I guess it should be a hard dep on podman-tests as we try to test it.

  2. podman-testing location is wrong, we must set an env var to the right location, i.e. PODMAN_TESTING: /usr/bin/podman-testing
    https://src.fedoraproject.org/rpms/podman/blob/rawhide/f/test/tmt/system.fmf#_11

  3. checkpoint restore is broken on rawhide and f40, which is one of the motivations for starting this discussion, i.e. https://bugzilla.redhat.com/show_bug.cgi?id=2357890 and https://bugzilla.redhat.com/show_bug.cgi?id=2355314.
    And historically criu (or kernel) updates break the checkpoint functionality a lot, which is why I wanted to add the distro-integration tag to them. However, then I thought: why should we have to put up with this upstream? We should just be testing everything in fedora.

As for speed, how many cores do these workers use? If they have more than one, then we can utilize the ci:parallel tag we use upstream for tests that work in parallel, i.e.
https://github.com/containers/podman/blob/671b240236ad8924ca537b414c8a1ed424955a16/Makefile#L709-L710
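
The split described above could be sketched as two bats invocations: one serial run for everything not tagged ci:parallel, and one parallel run for the tagged tests. The sketch below only builds the command lines and does not run them. The flags are an assumption based on bats' `--filter-tags`, `-T` and `-j` options, and the test directory is assumed to be where the podman-tests rpm installs the suite; check the linked Makefile lines for what upstream actually runs.

```python
import os
from typing import Optional

# Assumption: install location of the system tests from the podman-tests rpm.
TEST_DIR = "/usr/share/podman/test/system"


def build_bats_runs(test_dir: str = TEST_DIR, jobs: Optional[int] = None) -> list[list[str]]:
    """Return two bats command lines: the serial run, then the parallel run."""
    jobs = jobs or os.cpu_count() or 1
    # Tests without the ci:parallel tag must run one at a time.
    serial = ["bats", "-T", "--filter-tags", "!ci:parallel", test_dir]
    # Tests tagged ci:parallel can be spread across several jobs.
    parallel = ["bats", "-T", "--filter-tags", "ci:parallel", "-j", str(jobs), test_dir]
    return [serial, parallel]


for command in build_bats_runs(jobs=4):
    print(" ".join(command))
```

Note that bats relies on GNU parallel being installed for `-j`, so that would be one more dependency on the worker.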

for clarity, I'm assuming this ticket is now on you to deal with at least 1) and 2), and...something to happen to resolve 3). once we can get the tests to actually pass we can worry about speed.

1) https://github.com/containers/podman/commit/6c7179c652a346eda5448646ae2a2eb22890f947 that should go into the fedora package with the next release (v5.5.0-rc1, which happens later today actually)

2) Sure I will make a PR here to set the env in the test definition.

3) With updates-testing repo we should have working checkpoint tests right now, at least they pass upstream with packages I installed yesterday.


Metadata
Related Pull Requests
  • #389 Last updated 20 hours ago