@@ -115,15 +115,200 @@
path updates appear, and results should be reported to ResultsDB and
Wikitcms (when appropriate). Dynamically generated assets should be
regenerated regularly, including across release boundaries (see the
- section on createhdds below): no manual intervention should be required
- when a new Fedora release appears. If any of this does not happen,
- something is wrong, and manual inspection is needed.
+ section on createhdds below). If any of this does not happen, something
+ is wrong, and manual inspection is needed. However, at branching (when
+ a new Fedora release branches from Rawhide), some manual intervention
+ is usually required to ensure the smoothest possible transition. See
+ <<Branching procedure>> below.

Our usual practice is to upgrade the openQA systems to new Fedora
releases promptly as they appear, using `dnf system-upgrade`. This is
done manually. We usually upgrade the staging instance first and watch
for problems for a week or two before upgrading production.

+ == Updating 'needles'
+
+ Needles are the 'magic screenshots' openQA uses for testing. A needle
+ consists of two files - `somefile.png` (the screenshot itself), and
+ `somefile.json` (the metadata). The names must match. Needles are
+ usually created using the openQA web UI. This will create the two
+ files in the `/var/lib/openqa/share/tests/fedora/needles` directory
+ on the server. In Fedora openQA we do not leave it like this. We keep
+ the needles in a https://pagure.io/fedora-qa/os-autoinst-distri-fedora[git repository]. After creating a
+ needle, you should copy the files out to a local checkout of that
+ repository, place it in the appropriate subdirectory - we organize our
+ needles into subdirectories - commit it, push the commit, then update
+ the checkout back on the server, and remove the "working copy" of the
+ needle in the top-level `needles` directory. Also remember to update
+ the checkout on the other instance. If the lab instance is on a
+ different branch and you need it to have the new needle, rebase that
+ branch on the updated `main` branch and force-push it back (but of
+ course, make sure your local checkout of the feature branch is fully
+ up to date before rebasing and force pushing).
+
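Since the two files of a needle must share a name, a quick check of the working directory before copying files out can catch an incomplete pair. The following is only a minimal sketch: the function name `unpaired_needles` is hypothetical and is not part of openQA or of the Fedora tooling.

```shell
# List needle files whose .png/.json counterpart is missing.
# Prints nothing when every needle in the directory is a complete pair.
unpaired_needles() {
    dir=$1
    for f in "$dir"/*.png "$dir"/*.json; do
        [ -e "$f" ] || continue   # glob matched nothing; skip the literal pattern
        case $f in
            *.png)  other=${f%.png}.json ;;
            *.json) other=${f%.json}.png ;;
        esac
        [ -e "$other" ] || echo "$f"
    done
}

# Example: check the working copies left behind by the web UI needle editor.
unpaired_needles /var/lib/openqa/share/tests/fedora/needles
```

The function only inspects one directory level, which matches the top-level working copies described above; needles already sorted into subdirectories of the git checkout would need one call per subdirectory.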
+ == Branching procedure
+
+ Branching is a disruptive time for openQA operation. Since openQA
+ associates a release number with Rawhide, two things change from its
+ perspective during branching: the release number associated with
+ Rawhide changes, and the release number formerly associated with
+ Rawhide is now 'taken over' by the new branched release. As the
+ branching process takes some time, and there is no "perfect" point
+ at which the transition can be done smoothly, it is normal that
+ update tests for both Rawhide and the new branched release will fail
+ for some hours around the branching process. The best we can do is to
+ mitigate this as far as possible.
+
+ openQA's behavior around branching will depend on when fedfind's
+ https://fedorapeople.org/groups/qa/metadata/release.json[release metadata] is updated. Until that is updated, openQA will
+ continue to believe that Rawhide "owns" the "old" release number: the
+ RAWREL variable will be set to that number, and tests of updates for
+ that release number will behave as if it is Rawhide. If updates with
+ the new release number are created before this metadata is updated,
+ the openQA scheduler will be confused by them and ignore them.
+
+ Once the metadata is updated, openQA will act as if branching has
+ happened - tests will be scheduled for updates with the "new" number,
+ and tests for updates for the "old" number will act consistently with it
+ being Branched, not Rawhide.
+
+ The key tasks to make Branching go as smoothly as possible in openQA
+ are:
+
+ * Get the fedfind metadata updated as close as possible to 'the right'
+ time, which should be just before the first update for the new number
+ is created in Bodhi. This task is in the xref:release_guide:sop_mass_branching.adoc[Mass Branching SOP],
+ but releng may contact us to do the edit if they don't have permissions
+ * Build base disk images for the new release number as soon as the
+ metadata is updated and a post-branching Rawhide compose exists
+ * Rebuild base disk images for the old release number as soon as the
+ first post-branching Branched compose exists
+ * Trigger tests for any Rawhide updates for which they were missed
+ * Create version identification needles for the new release number
+ as soon as possible
+ * Disable the desktop_background test for the Branched release if a new
+ background image does not yet exist
+ * Aggressively monitor and retry failures
+
+ === Rebuilding base disk images
+
+ Base disk images can only be rebuilt successfully once a post-branch
+ compose for the release has completed and synced to https://dl.fedoraproject.org/pub/fedora/linux/development/[dl.fedoraproject.org].
+ Stay in touch with the release engineering team and monitor the chat
+ channel to keep up with this process. Once you have verified that a
+ post-branch Rawhide compose has synced to https://dl.fedoraproject.org/pub/fedora/linux/development/rawhide/[the rawhide repository],
+ build new base images for Rawhide. Once you have verified that the
+ first Branched compose has synced to the numbered directory under
+ the development directory, rebuild base images for that release (most
+ will already exist, but will be pre-branch Rawhide images, which will
+ likely cause issues in the tests).
+
+ To (re)build images, log into the openQA worker hosts tasked with disk
+ image builds for each arch on each openQA instance. These are listed
+ in the `openqa_hdds_workers` group in the ansible inventory. Become
+ root, then go to the correct directory, and run the command to rebuild
+ all images for the release, where `NN` is the *new* release number:
+ ```
+ cd /var/lib/openqa/share/factory/hdd/fixed
+ /root/createhdds/createhdds.py all -r NN -f
+ ```
+ So when branching Fedora 43 from Rawhide, we would pass `-r 44` to
+ build the new Rawhide base disk images, and `-r 43` to rebuild the
+ 43 base disk images with the new Branched compose.
+
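The mapping from the branched release to the two `-r` values can be sketched as follows. This is an illustration only: `createhdds_commands` is a hypothetical helper, it assumes Rawhide takes the number immediately after the branched release (as in the Fedora 43 example), and it prints the commands rather than running them.

```shell
# Print the createhdds invocations needed when a release branches.
# Assumption: Rawhide takes the number after the branched release.
createhdds_commands() {
    branched=$1
    rawhide=$((branched + 1))
    for rel in "$rawhide" "$branched"; do
        echo "/root/createhdds/createhdds.py all -r $rel -f"
    done
}

createhdds_commands 43
```

The two commands are usually not run at the same moment: each one still has to wait for its own post-branch compose to sync, and each must be run from the `fixed` directory shown above.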
+ === Triggering missed Rawhide tests
+
+ If any critical path Rawhide updates are created under the new release
+ number before the fedfind metadata is updated, openQA will fail to
+ schedule tests for them. Once the metadata is updated and new Rawhide
+ base images are built, check in the Bodhi web UI for any Rawhide
+ updates that have failed gating. Look on the automated tests page for
+ each update. If they show tests as missing (rather than failed),
+ check the openQA web UI and see if you can find any tests for the
+ update. If not, you will need to trigger the tests for that update by
+ running:
+ ```
+ fedora-openqa update -f (UPDATE ID)
+ ```
+ from the openQA server.
+
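When several updates were missed, it can help to prepare all the commands at once and review them before running anything. The sketch below assumes you have collected the update IDs by hand from Bodhi; `retrigger_commands` is a hypothetical helper, and the IDs in the example are placeholders, not real updates.

```shell
# Print a retrigger command for each update ID read from standard input.
# Blank lines are skipped. Nothing is executed; review the output first.
retrigger_commands() {
    while IFS= read -r update_id; do
        [ -n "$update_id" ] || continue
        echo "fedora-openqa update -f $update_id"
    done
}

# The IDs below are placeholders, not real Bodhi updates.
printf '%s\n' FEDORA-2025-aaaaaaaaaa FEDORA-2025-bbbbbbbbbb | retrigger_commands
```

Printing instead of executing keeps the step reviewable; once the list looks right, the printed lines can be run on the openQA server.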
+ === Creating new version identification needles
+
+ The installer tests have a check that the installer shows the correct
+ release number. Needles for the new Rawhide release number will not
+ yet exist at the time of branching. The first time install tests run
+ for the new Rawhide release number and reach the point where this
+ check happens, they will fail looking for a needle with the tag
+ `version_NN_ident`, where `NN` is the new Rawhide release number.
+ On one of the openQA instances, use the web UI needle editor to create
+ a new needle with the correct match area and tags (reference the
+ existing needles for the previous release https://pagure.io/fedora-qa/os-autoinst-distri-fedora/blob/main/f/needles/anaconda/identification[here]). You will
+ need to create two needles, one for the GTK UI and one for the web UI,
+ so long as we test images with both installer UIs. Creating a needle
+ adds a .json and a .png file in the
+ `/var/lib/openqa/share/tests/fedora/needles` directory. These files
+ need to be copied out and checked into the os-autoinst-distri-fedora
+ git repo, in the `anaconda/identification` folder, then pushed back
+ to both openQA instances, after which the 'working copy' in the
+ top-level `needles` directory can be removed.
+
+ === Disabling the desktop_background test
+
+ The desktop_background test will initially fail for all updates for
+ the newly-branched release. If the release already contains a new,
+ unique background (different from that of the current stable release)
+ you can create a new needle for it, following much the same process
+ as for the version identification needles. Otherwise, the test must be
+ disabled until the new background is ready.
+
+ Here is https://pagure.io/fedora-qa/os-autoinst-distri-fedora/c/f8810b67b4fc461d1e060c0c9449991c6b18b68d?branch=main[a sample commit] that disabled the test for
+ Fedora 42. You can just follow that example with a new commit, push it
+ out, and pull it to both openQA instances. Then re-run all failed
+ instances of the test.
+
+ === Restarting failures
+
+ The update gating configuration is updated during the main releng
+ branch SOP, so update gating will be active for the newly-branched
+ release and for Rawhide under its new release number almost
+ immediately. It is therefore critical that we ensure all failed update
+ tests are re-run until they pass or the failure is deemed 'genuine'
+ (i.e. not due to the branching process, but a real bug in the update).
+
+ Throughout the branching process, constantly keep an eye on the web UI
+ summary page for the Fedora Updates and Fedora AArch64 Updates groups.
+ Also keep the web UI detail pages for one new-Rawhide and one
+ new-Branched update open, and retry failures as you think they may be
+ addressed.
+
+ Once you are sure tests are generally working for both new-Rawhide and
+ new-Branched, systematically go through and restart all failed tests.
+ Some of the restarts may fail (due to normal flakiness, or the heavy
+ load of running so many tests at once) - keep an eye on these, and
+ keep up the restarts until they pass or the failure appears 'genuine'.
+
+ If tests are failing and the cause looks like something to do with the
+ branching process - classic symptoms are RPM signature errors or 404s
+ from the mirror system - contact the release engineering team to get
+ these rectified, and retry once they tell you the issue is addressed.
+
+ Beware of tests with the wrong `RAWREL` value. This variable records
+ the Rawhide release number. When running tests after branching it
+ should always be the new Rawhide release number; running tests after
+ branching with it set to the old number can cause various failures.
+ It's common for some tests scheduled around branching to fail and have
+ the old `RAWREL` value; you will find you cannot get these tests to
+ pass with regular restarts. Always check the `RAWREL` value of failed
+ tests, and if it's wrong, instead of just restarting the test through
+ the web UI, retrigger the tests for that update from the server:
+ ```
+ fedora-openqa update -f (UPDATE ID)
+ ```
+
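The `RAWREL` rule above reduces to a simple comparison. As a minimal sketch (the helper name `rerun_action` is hypothetical, and both values are read off by hand from the job's settings and the current release metadata):

```shell
# Decide how to re-run a failed test based on its RAWREL setting.
rerun_action() {
    job_rawrel=$1        # RAWREL value recorded on the failed job
    current_rawhide=$2   # release number Rawhide has after branching
    if [ "$job_rawrel" = "$current_rawhide" ]; then
        echo restart     # a normal web UI restart is enough
    else
        echo retrigger   # stale RAWREL: retrigger from the server instead
    fi
}

rerun_action 43 44
```

For the Fedora 43 branching example, a failed job still carrying `RAWREL` 43 gives `retrigger`, while one carrying 44 gives `restart`.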
+ Your end goal, as always, is for all outstanding failures to be
+ definitely identified as genuine bugs in the update, with a comment
+ linking to a Bodhi comment or bug report that identifies the issue.
+
== Rebooting / restarting

The optimal approach to rebooting an entire openQA deployment is as
@@ -468,12 +653,3 @@
to do it. As with all other message consumers, if making manual changes
or updates to the components, remember to restart the consumer service
afterwards.
-
- == Autocloud ResultsDB forwarder (autocloudreporter)
-
- An ansible role called `autocloudreporter` also runs on the openQA
- production server. This has nothing to do with openQA at all, but is run
- there for convenience. This role deploys a fedmsg consumer that listens
- for fedmsgs indicating that Autocloud (a separate automated test system
- which tests cloud images) has completed a test run, then forwards those
- results to ResultsDB.

This is a big update that's mainly about documenting all the tasks
related to openQA at branch time. We also add a couple of bits to
the releng SOP: one noting that co-ordination with quality and CI
folks is needed during branching, and one to get the fedfind
release metadata updated, which is a critical step that should
happen during branching. Currently I Just Know that this has to
happen, but we should write it down. This metadata file is not
templated in Ansible because that would cause it to be updated too
soon at release time (we need fedfind's stable release list to
change only when the new release is available in the mirror
system, for...reasons).
This also documents how we handle openQA needles, since I noticed
while writing the branching steps that it's not written down, and
drops the section about autocloudreporter as that's obsolete and
no longer used.
Signed-off-by: Adam Williamson <awilliam@redhat.com>