| |
@@ -0,0 +1,370 @@
|
| |
+ = OpenQA Infrastructure SOP
|
| |
+
|
| |
+ OpenQA is an automated test system used to run validation tests on
|
| |
+ nightly and candidate Fedora composes, and also to run a subset of these
|
| |
+ tests on critical path updates.
|
| |
+
|
| |
+ OpenQA production instance: https://openqa.fedoraproject.org
|
| |
+
|
| |
+ OpenQA staging instance: https://openqa.stg.fedoraproject.org
|
| |
+
|
| |
+ Wiki page on Fedora openQA deployment: https://fedoraproject.org/wiki/OpenQA
|
| |
+
|
| |
+ Upstream project page: http://open.qa/
|
| |
+
|
| |
+ Upstream repositories: https://github.com/os-autoinst
|
| |
+
|
| |
+ == Contact Information
|
| |
+
|
| |
+ Owner::
|
| |
+ Fedora QA devel
|
| |
+ Contact::
|
| |
+ #fedora-qa, #fedora-admin, qa-devel mailing list
|
| |
+ People::
|
| |
+ Adam Williamson (adamwill / adamw), Petr Schindler (pschindl)
|
| |
+ Machines::
|
| |
+ See ansible inventory groups with 'openqa' in name
|
| |
+ Purpose::
|
| |
+ Run automated tests on VMs via screen recognition and VNC input
|
| |
+
|
| |
+ == Architecture
|
| |
+
|
| |
+ Each openQA instance consists of a server (these are virtual machines)
|
| |
+ and one or more worker hosts (these are bare metal systems). The server
|
| |
+ schedules tests ("jobs", in openQA parlance) and stores results and
|
| |
+ associated data. The worker hosts run "jobs" and send the results back
|
| |
+ to the server. The server also runs some fedmsg consumers to handle
|
| |
+ automatic scheduling of jobs and reporting of results to external
|
| |
+ systems (ResultsDB and Wikitcms).
|
| |
+
|
| |
+ == Server
|
| |
+
|
| |
+ The server runs a web UI for viewing scheduled, running and completed
|
| |
+ tests and their data, with an admin interface where many aspects of the
|
| |
+ system can be configured (though we do not use the web UI for several
|
| |
+ aspects of configuration). There are several separate services that run
|
| |
+ on each server, and communicate with each other mainly via dbus. Each
|
| |
+ server requires its own PostgreSQL database. The web UI and websockets
|
| |
+ server are made externally available via reverse proxying through an
|
| |
+ Apache server.
|
| |
+
|
| |
+ It hosts an NFS share that contains the tests, the 'needles'
|
| |
+ (screenshots with metadata as JSON files that are used for screen
|
| |
+ matching), and test 'assets' like ISO files and disk images. The path is
|
| |
+ `/var/lib/openqa/share/factory`.
|
| |
+
|
| |
+ In our deployment, the PostgreSQL database for each instance is hosted
|
| |
+ by the QA database server. Also, some paths on the server are themselves
|
| |
+ mounted as NFS shares from the infra storage server. This is so that
|
| |
+ these are not lost if the server is re-deployed, and can easily be
|
| |
+ backed up. These locations contain the data from each executed job. As
|
| |
+ both the database and these key data files are not actually stored on
|
| |
+ the server, the server can be redeployed from scratch without loss of
|
| |
+ any data (at least, this is the intent).
|
| |
+
|
| |
+ Also in our deployment, an openQA plugin (which we wrote, but which is
|
| |
+ part of the upstream codebase) is enabled which emits fedmsgs on various
|
| |
+ events. This works by calling fedmsg-logger, so the appropriate fedmsg
|
| |
+ configuration must be in place for this to emit events correctly.
|
| |
+
|
| |
+ The server systems run a fedmsg consumer for the purpose of
|
| |
+ automatically scheduling jobs in response to the appearance of new
|
| |
+ composes and critical path updates, and one for the purpose of reporting
|
| |
+ the results of completed jobs to ResultsDB and Wikitcms. These use the
|
| |
+ `fedmsg-hub` system.
|
| |
+
|
| |
+ == Worker hosts
|
| |
+
|
| |
+ The worker hosts run several individual worker 'instances' (via
|
| |
+ systemd's 'instantiated service' mechanism), each of which registers
|
| |
+ with the server and accepts jobs from it, uploading the results of the
|
| |
+ job and some associated data to the server on completion. The worker
|
| |
+ instances and server communicate both via a conventional web API
|
| |
+ provided by the server and via websockets. When a worker runs a job, it
|
| |
+ starts a qemu virtual machine (directly - libvirt is not used) and
|
| |
+ interacts with it via VNC and the serial console, following a set of
|
| |
+ steps dictating what it should do and what response it should expect in
|
| |
+ terms of screen contents or serial console output. The server 'pushes'
|
| |
+ jobs to the worker instances over a websocket connection.
|
| |
+
|
| |
+ Each worker host must mount the `/var/lib/openqa/share/factory` NFS
|
| |
+ share provided by the server. If this share is not mounted, any jobs run
|
| |
+ will fail immediately due to expected asset and test files not being
|
| |
+ found.
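
A pre-flight check for the share on a worker host could be sketched as follows. This is only an illustration: the function name `share_is_mounted` is hypothetical, and the optional second argument (an alternative mount table in `/proc/mounts` format) exists purely so the function can be exercised without a real mount.

```shell
# Hypothetical helper: succeed if DIR is listed as a mount point in a
# /proc/mounts-style table (the second field of each line is the mount point).
share_is_mounted() {
    dir=$1
    table=${2:-/proc/mounts}
    awk -v d="$dir" '$2 == d { found = 1 } END { exit !found }' "$table"
}

# Example use on a worker host (path taken from the text above):
# share_is_mounted /var/lib/openqa/share/factory || echo "share not mounted" >&2
```

A check like this can be run before starting the worker services, so a missing share is noticed before any job fails on it.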
|
| |
+
|
| |
+ Some worker hosts for each instance are denominated 'tap workers',
|
| |
+ meaning they run some advanced jobs which use software-defined
|
| |
+ networking (openvswitch) to interact with each other. All the
|
| |
+ configuration for this should be handled by the ansible scripts, but
|
| |
+ it's useful to be aware that there is complex software-defined
|
| |
+ networking in use on these hosts, which could potentially be the
|
| |
+ source of problems.
|
| |
+
|
| |
+ == Deployment and regular operation
|
| |
+
|
| |
+ Deployment and normal update of the openQA systems should run entirely
|
| |
+ through Ansible. Just running the appropriate ansible plays for the
|
| |
+ systems should complete the entire deployment / update process, though
|
| |
+ it is best to check after running them that there are no failed services
|
| |
+ on any of the systems (restart any that failed), and that the web UI is
|
| |
+ properly accessible.
|
| |
+
|
| |
+ Regular operation of the openQA deployments is entirely automated. Jobs
|
| |
+ should be scheduled and run automatically when new composes and critical
|
| |
+ path updates appear, and results should be reported to ResultsDB and
|
| |
+ Wikitcms (when appropriate). Dynamically generated assets should be
|
| |
+ regenerated regularly, including across release boundaries (see the
|
| |
+ section on createhdds below): no manual intervention should be required
|
| |
+ when a new Fedora release appears. If any of this does not happen,
|
| |
+ something is wrong, and manual inspection is needed.
|
| |
+
|
| |
+ Our usual practice is to upgrade the openQA systems to new Fedora
|
| |
+ releases promptly as they appear, using `dnf system-upgrade`. This is
|
| |
+ done manually. We usually upgrade the staging instance first and watch
|
| |
+ for problems for a week or two before upgrading production.
|
| |
+
|
| |
+ == Rebooting / restarting
|
| |
+
|
| |
+ The optimal approach to rebooting an entire openQA deployment is as
|
| |
+ follows:
|
| |
+
|
| |
+ [arabic]
|
| |
+ . Wait until no jobs are running
|
| |
+ . Stop all `openqa-*` services on the server, so no more will be queued
|
| |
+ . Stop all `openqa-worker@` services on the worker hosts
|
| |
+ . Reboot the server
|
| |
+ . Check for failed services (`systemctl --failed`) and restart any that
|
| |
+ failed
|
| |
+ . Once the server is fully functional, reboot the worker hosts
|
| |
+ . Check for failed services and restart any that failed, particularly
|
| |
+ the NFS mount service
|
| |
+
|
| |
+ Rebooting the workers *after* the server is important due to the NFS
|
| |
+ share.
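
The 'check for failed services' steps in the list above can be sketched as a small filter. The function name `failed_unit_names` is hypothetical; the `--no-legend` and `--plain` options are standard `systemctl` flags that reduce the output to one unit per line.

```shell
# Hypothetical helper: read the output of
#   systemctl --failed --no-legend --plain
# on stdin and print only the unit names (the first column).
failed_unit_names() {
    awk 'NF > 0 { print $1 }'
}

# Example use on a freshly rebooted host; review the list before
# restarting anything:
# systemctl --failed --no-legend --plain | failed_unit_names
```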
|
| |
+
|
| |
+ If only the server needs restarting, the entire procedure above should
|
| |
+ ideally be followed in any case, to ensure there are no issues with the
|
| |
+ NFS mount breaking due to the server reboot, or the server and worker
|
| |
+ getting confused about running jobs due to the websockets connections
|
| |
+ being restarted.
|
| |
+
|
| |
+ If only a worker host needs restarting, there is no need to restart the
|
| |
+ server too, but it is best to wait until no jobs are running on that
|
| |
+ host, and stop all `openqa-worker@` services on the host before rebooting
|
| |
+ it.
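
Stopping the worker instances before a reboot can be sketched like this. The helper name `worker_units` is hypothetical, and the instance count has to be supplied by the operator, since it differs between hosts.

```shell
# Hypothetical helper: print the unit names for worker instances 1..N,
# one per line, using the openqa-worker@ template named above.
worker_units() {
    count=$1
    i=1
    while [ "$i" -le "$count" ]; do
        printf 'openqa-worker@%s.service\n' "$i"
        i=$((i + 1))
    done
}

# Example use on a worker host with ten instances (run as root):
# systemctl stop $(worker_units 10)
```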
|
| |
+
|
| |
+ There are two ways to check if jobs are running and if so where. You can
|
| |
+ go to the web UI for the server and click 'All Tests'. If any jobs are
|
| |
+ running, you can open each one individually (click the link in the
|
| |
+ 'Test' column) and look at the 'Assigned worker', which will tell you
|
| |
+ which host the job is running on. Or, if you have admin access, you can
|
| |
+ go to the admin menu (top right of the web UI, once you are logged in)
|
| |
+ and click on 'Workers', which will show the status of all known workers
|
| |
+ for that server, and select 'Working' in the state filter box. This will
|
| |
+ show all workers currently working on a job.
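
As a possible alternative to the web UI, running jobs can probably also be listed through the server's web API. The sketch below assumes the upstream `/api/v1/jobs` route with a `state` query parameter; the helper name `running_jobs_url` is hypothetical, and the response format should be checked against the deployed openQA version.

```shell
# Hypothetical helper: build the jobs API URL for a given server and job
# state (default: running). Assumes the upstream /api/v1/jobs route.
running_jobs_url() {
    server=${1%/}          # strip one trailing slash, if present
    state=${2:-running}
    printf '%s/api/v1/jobs?state=%s\n' "$server" "$state"
}

# Example use (needs network access; the response is JSON):
# curl -s "$(running_jobs_url https://openqa.fedoraproject.org)"
```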
|
| |
+
|
| |
+ Note that if something which would usually be tested (new compose, new
|
| |
+ critpath update...) appears during the reboot window, it likely will
|
| |
+ _not_ be scheduled for testing, as this is done by a fedmsg consumer
|
| |
+ running on the server. You will need to schedule it for testing manually
|
| |
+ in this case (see below).
|
| |
+
|
| |
+ == Scheduling jobs manually
|
| |
+
|
| |
+ While it is not normally necessary, you may sometimes need to run or
|
| |
+ re-run jobs manually.
|
| |
+
|
| |
+ The simplest cases can be handled by an admin from the web UI: for a
|
| |
+ logged-in admin, all scheduled and running tests can be cancelled (from
|
| |
+ various views), and all completed tests can be restarted. 'Restarting' a
|
| |
+ job effectively clones it and schedules the clone to be run: it
|
| |
+ creates a new job with a new job ID, and the previous job still exists.
|
| |
+ openQA attempts to handle complex cases of inter-dependent jobs
|
| |
+ correctly when restarting, but doesn't always manage to do it right;
|
| |
+ when it goes wrong, the best thing to do is usually to re-run all jobs
|
| |
+ for that medium.
|
| |
+
|
| |
+ To run or re-run the full set of tests for a compose or update, you can
|
| |
+ use the `fedora-openqa` CLI. To run or re-run tests for a compose, use:
|
| |
+
|
| |
+ ....
|
| |
+ fedora-openqa compose -f (COMPOSE LOCATION)
|
| |
+ ....
|
| |
+
|
| |
+ where `(COMPOSE LOCATION)` is the full URL of the `/compose`
|
| |
+ subdirectory of the compose. This will only work for Pungi-produced
|
| |
+ composes with the expected productmd-format metadata, and a couple of
|
| |
+ other quite special cases.
|
| |
+
|
| |
+ The `-f` argument means 'force', and is necessary to re-run tests:
|
| |
+ usually, the scheduler will refuse to re-schedule tests that have
|
| |
+ already run, and `-f` overrides this.
|
| |
+
|
| |
+ To run or re-run tests for an update, use:
|
| |
+
|
| |
+ ....
|
| |
+ fedora-openqa update -f (UPDATEID) (RELEASE)
|
| |
+ ....
|
| |
+
|
| |
+ where `(UPDATEID)` is the update's ID - something like
|
| |
+ `FEDORA-2018-blahblah` - and `(RELEASE)` is the release for which the
|
| |
+ update is intended (27, 28, etc).
|
| |
+
|
| |
+ To run or re-run only the tests for a specific medium (usually a single
|
| |
+ image file), you must use the lower-level web API client, with a more
|
| |
+ complex syntax. The command looks something like this:
|
| |
+
|
| |
+ ....
|
| |
+ /usr/share/openqa/script/client isos post \
|
| |
+ ISO=Fedora-Server-dvd-x86_64-Rawhide-20180108.n.0.iso DISTRI=fedora VERSION=Rawhide \
|
| |
+ FLAVOR=Server-dvd-iso ARCH=x86_64 BUILD=Fedora-Rawhide-20180108.n.0 CURRREL=27 PREVREL=26 \
|
| |
+ RAWREL=28 IMAGETYPE=dvd SUBVARIANT=Server \
|
| |
+ LOCATION=http://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20180108.n.0/compose
|
| |
+ ....
|
| |
+
|
| |
+ The `ISO` value is the filename of the image to test (it may not
|
| |
+ actually be an ISO), the `DISTRI` value is always 'fedora', the
|
| |
+ `VERSION` value should be the release number or 'Rawhide', the `FLAVOR`
|
| |
+ value depends on the image being tested (you can check the value from an
|
| |
+ existing test for the same or a similar ISO), the `ARCH` value is the
|
| |
+ arch of the image being tested, the `BUILD` value is the compose ID,
|
| |
+ `CURRREL` should be the release number of the current Fedora release at
|
| |
+ the time the test is run, `PREVREL` should be one lower than `CURRREL`,
|
| |
+ `RAWREL` should be the release number associated with Rawhide at the
|
| |
+ time the test is run, `IMAGETYPE` depends on the image being tested
|
| |
+ (again, check a similar test for the correct value), `LOCATION` is the
|
| |
+ URL to the /compose subdirectory of the compose location, and
|
| |
+ `SUBVARIANT` again depends on the image being tested. Please ask for
|
| |
+ help if this seems too daunting. To re-run the 'universal' tests on a
|
| |
+ given image, set the `FLAVOR` value to 'universal', then set all other
|
| |
+ values as appropriate to the chosen image. The 'universal' tests are
|
| |
+ only likely to work at all correctly with DVD or netinst images.
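
The relationship between the three release variables can be sketched in a few lines. The helper name `release_vars` is hypothetical; it derives only `PREVREL`, because the Rawhide release number cannot be computed reliably from the current release and must be given explicitly.

```shell
# Hypothetical helper: print the release variables for `isos post`.
# $1 = current stable release number, $2 = release number of Rawhide.
release_vars() {
    currrel=$1
    rawrel=$2
    printf 'CURRREL=%s PREVREL=%s RAWREL=%s\n' \
        "$currrel" "$((currrel - 1))" "$rawrel"
}

# Example use: append the output to the other settings of the command
# shown earlier, e.g. `release_vars 27 28`.
```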
|
| |
+
|
| |
+ openQA provides a special script for cloning an existing job but
|
| |
+ optionally changing one or more variable values, which can be useful in
|
| |
+ some situations. Using it looks like this:
|
| |
+
|
| |
+ ....
|
| |
+ /usr/share/openqa/script/clone_job.pl --skip-download --from localhost 123 RAWREL=28
|
| |
+ ....
|
| |
+
|
| |
+ to clone job 123 with the `RAWREL` variable set to '28', for instance.
|
| |
+ For interdependent jobs, you may or may not want to use the
|
| |
+ `--skip-deps` argument to avoid re-running the cloned job's parent
|
| |
+ job(s), depending on circumstances.
|
| |
+
|
| |
+ == Manual updates
|
| |
+
|
| |
+ In general updates to any of the components of the deployments should be
|
| |
+ handled via ansible: push the changes out in the appropriate way (git
|
| |
+ repo update, package update, etc.) and then run the ansible plays.
|
| |
+ However, sometimes we do want to update or test a change to something
|
| |
+ manually for some reason. Here are some notes on those cases.
|
| |
+
|
| |
+ For updating openQA and/or os-autoinst packages: ideally, ensure no jobs
|
| |
+ are running. Then, update all installed subpackages on the server. The
|
| |
+ server services should be automatically restarted as part of the package
|
| |
+ update. Then, update all installed subpackages on the worker hosts, and
|
| |
+ restart all worker services. A 'for' loop can help with that, for
|
| |
+ instance:
|
| |
+
|
| |
+ ....
|
| |
+ for i in {1..10}; do systemctl restart openqa-worker@$i.service; done
|
| |
+ ....
|
| |
+
|
| |
+ on a host with ten worker instances.
|
| |
+
|
| |
+ For updating the openQA tests:
|
| |
+
|
| |
+ ....
|
| |
+ cd /var/lib/openqa/share/tests/fedora
|
| |
+ git pull   # or: git checkout (branch), as appropriate
|
| |
+ ./templates --clean
|
| |
+ ./templates-updates --update
|
| |
+ ....
|
| |
+
|
| |
+ The templates steps are only necessary if there are any changes to the
|
| |
+ templates files.
|
| |
+
|
| |
+ For updating the scheduler code:
|
| |
+
|
| |
+ ....
|
| |
+ cd /root/fedora_openqa
|
| |
+ git pull   # or make whatever other changes are needed
|
| |
+ python setup.py install
|
| |
+ systemctl restart fedmsg-hub
|
| |
+ ....
|
| |
+
|
| |
+ Updating other components of the scheduling process follows the same
|
| |
+ pattern: update the code or package, then remember to restart
|
| |
+ fedmsg-hub, or the fedmsg consumers won't use the new code. It's
|
| |
+ relatively common for the openQA instances to need fedfind updates in
|
| |
+ advance of them being pushed to stable, for example when a new compose
|
| |
+ type is invented and fedfind doesn't understand it, openQA can end up
|
| |
+ trying to schedule tests for it, or the scheduler consumer can crash;
|
| |
+ when this happens we have to fix and update fedfind on the openQA
|
| |
+ instances ASAP.
|
| |
+
|
| |
+ == Logging
|
| |
+
|
| |
+ Just about all useful logging information for all aspects of openQA and
|
| |
+ the scheduling and report tools is logged to the journal, except that
|
| |
+ the Apache server logs may be of interest in debugging issues related to
|
| |
+ accessing the web UI or websockets server. To get more detailed logging
|
| |
+ from openQA components, change the logging level in
|
| |
+ `/etc/openqa/openqa.ini` from 'info' to 'debug' and restart the relevant
|
| |
+ services. Any run of the Ansible plays will reset this back to 'info'.
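
Reading the relevant journal entries might look like the sketch below. It relies on `journalctl -u` accepting a glob pattern; the default pattern `openqa-*` mirrors the service naming used in this document, and the helper name `journal_cmd` is hypothetical. The function only prints the command so it can be reviewed before it is run.

```shell
# Hypothetical helper: print a journalctl command line for a unit pattern
# (default: openqa-*) and a time window (default: today).
journal_cmd() {
    pattern=${1:-openqa-*}
    since=${2:-today}
    printf "journalctl --no-pager -u '%s' --since '%s'\n" "$pattern" "$since"
}

# Example use on the server:
# eval "$(journal_cmd 'openqa-*' '1 hour ago')"
```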
|
| |
+
|
| |
+ Occasionally the test execution logs may be useful in figuring out why
|
| |
+ all tests are failing very early, or some specific tests are failing due
|
| |
+ to an asset going missing, etc. Each job's execution logs can be
|
| |
+ accessed through the web UI, on the _Logs & Assets_ tab of the job page;
|
| |
+ the files are `autoinst-log.txt` and `worker-log.txt`.
|
| |
+
|
| |
+ == Dynamic asset generation (createhdds)
|
| |
+
|
| |
+ Some of the hard disk image file 'assets' used by the openQA tests are
|
| |
+ created by a tool called `createhdds`, which is checked out of a git
|
| |
+ repo to `/root/createhdds` on the servers and also on some guests. This
|
| |
+ tool uses `virt-install` and the Python bindings for `libguestfs` to
|
| |
+ create various hard disk images the tests need to run. It is usually run
|
| |
+ in two different ways. The ansible plays run it in a mode where it will
|
| |
+ only create expected images that are entirely missing: this is mainly
|
| |
+ meant to facilitate initial deployment. The plays also install a file to
|
| |
+ `/etc/cron.daily` causing it to be run daily in a mode where it will
|
| |
+ also recreate images that are 'too old' (the age-out conditions for
|
| |
+ images are part of the tool itself).
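
To get a rough idea of which images the daily run might consider stale, a sketch like the following could be used. It only approximates 'too old' by file modification time; the real age-out rules live inside `createhdds`. The helper name `stale_images`, the 30-day threshold, and the `hdd` subdirectory in the usage example are assumptions.

```shell
# Hypothetical helper: list regular files directly under DIR whose age,
# counted in whole days since last modification, exceeds DAYS.
stale_images() {
    dir=$1
    days=$2
    find "$dir" -maxdepth 1 -type f -mtime +"$days" -print
}

# Example use on the server (path and threshold are assumptions):
# stale_images /var/lib/openqa/share/factory/hdd 30
```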
|
| |
+
|
| |
+ This process isn't 100% reliable; `virt-install` can sometimes fail,
|
| |
+ either just quasi-randomly or every time, in which case the cause of the
|
| |
+ failure needs to be figured out and fixed so the affected image can be
|
| |
+ (re-)built.
|
| |
+
|
| |
+ The i686 and x86_64 images for each instance are built on the server, as
|
| |
+ its native arch is x86_64. The images for other arches are built on one
|
| |
+ worker host for each arch (nominated by inclusion in an ansible
|
| |
+ inventory group that exists for this purpose); those hosts have write
|
| |
+ access to the NFS share for this purpose.
|
| |
+
|
| |
+ == Compose check reports (check-compose)
|
| |
+
|
| |
+ An additional ansible role runs on each openQA server, called
|
| |
+ `check-compose`. This role installs a tool (also called `check-compose`)
|
| |
+ and an associated fedmsg consumer. The consumer kicks in when all openQA
|
| |
+ tests for any compose finish, and uses the `check-compose` tool to send
|
| |
+ out an email report summarizing the results of the tests (well, the
|
| |
+ production server sends out emails, the staging server just logs the
|
| |
+ contents of the report). This role isn't really a part of openQA proper,
|
| |
+ but is run on the openQA servers as it seems like as good a place as any
|
| |
+ to do it. As with all other fedmsg consumers, if making manual changes
|
| |
+ or updates to the components, remember to restart the `fedmsg-hub` service
|
| |
+ afterwards.
|
| |
+
|
| |
+ == Autocloud ResultsDB forwarder (autocloudreporter)
|
| |
+
|
| |
+ An ansible role called `autocloudreporter` also runs on the openQA
|
| |
+ production server. This has nothing to do with openQA at all, but is run
|
| |
+ there for convenience. This role deploys a fedmsg consumer that listens
|
| |
+ for fedmsgs indicating that Autocloud (a separate automated test system
|
| |
+ which tests cloud images) has completed a test run, then forwards those
|
| |
+ results to ResultsDB.
|
| |
There are many additions as this repo was mostly empty.
The bulk import was done from https://pagure.io/infra-docs/blob/main/f/docs/sysadmin-guide/sops with autoconversion from rst to adoc with pandoc.
The repo is currently rendered in https://gifted-engelbart-b02e8b.netlify.app/infra/
Any help in reviewing is appreciated :-)