#23 Make openQA publicly visible
Closed: Fixed. Opened 8 years ago by adamwill.

As we all know, the 'official' openQA deployment is behind the RH firewall, only RHers can access it and see results and so on.

We've been wanting to fix this all along, but I figured it could do with a ticket for discussion.

I have done a test deployment of containerized openQA on the infrastructure cloud, and it all seems to work except for one problem: nested virt isn't available, so we cannot use qemu-kvm and all tests run far, far too slow. I filed an infra ticket to enable nested virt: https://fedorahosted.org/fedora-infrastructure/ticket/4894

@tflink also says there is some possibility of us getting some bare metal systems in PHX for openQA use. Apparently getting a public IP address could be difficult for that, of course.

I think either approach might be able to get us more resources than we currently have, which would be good. We could do with a few more worker machines.


This ticket had some Differential requests assigned to it:
D610

<puiterwijk> adamw: okay, just tested, and nested virt works in openstack. Enabling it unfortunately requires restart of all nodes, but that will come on monday with the cloud restart, so I think that after that we can enable it.

Running openQA publicly in Fedora infrastructure would be great. I think that there are several problems we need to deal with before we can deploy it like so. Some notes:

  • Dockerized openQA uses a self-signed cert. I assume we will be given a proper SSL certificate. The best way to handle this would be to use the cert from the /data directory and load it from the mounted host system (or, better, put a self-signed cert into /data, symlink it to the correct dir, and hope it gets overwritten when the user puts their own cert into the data dir on the host system).
  • We need to update openQA - for example because of #620. I hope the install process hasn't changed much since we built our last version. I'm planning to do this (and D606 is part of it) now. And from what I understand, we should use openQA from the unstable repo rather than the stable one.
  • I'm not sure how GRU tasks will work in a Dockerized environment. I hope there won't be a problem.
  • The scheduler isn't Dockerized - it runs on the host system, outside of Docker. That isn't really a problem - I've been using it like this for weeks. You just have to put the correct values into /etc/openqa/client.conf or ~/.config/openqa/client.conf.
  • README.md should contain all the necessary instructions for deploying it. @adamwill, have you used it? Were there any problems? I think the "Update firewall rules" section shouldn't be necessary anymore.
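
For reference, a client.conf entry might look something like this (a hypothetical sketch - the section name has to match the host the client connects to, and the key/secret values here are placeholders for an API key pair generated in the web UI):

```ini
[openqa_webui]
key = 0123456789ABCDEF
secret = FEDCBA9876543210
```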

openQA install process hasn't changed much, no. Off the top of my head I can think of:

  • There's now a separate user account for workers - _openqa-worker, I think - and it owns /etc/openqa/client.conf.
  • In a default install, the factory and tests directories moved into /var/lib/openqa/share and are symlinked into /var/lib/openqa. I believe this is to allow it to be a remote share for the case where you want to have multiple worker hosts.
  • openqa-webui.service has split out an openqa-scheduler.service and an openqa-websockets.service. All three have to run on the same host, though, so I don't think it's a significant difference.

The way Gru works is basically this: the scheduler adds rows to a DB table, each row is one task. The Gru process is a very simple daemon which runs a constant loop of waking up, hitting the DB, and running the task defined in the first row, if there is one. If the task completes it deletes the row. As far as I understand docker (which is not very far!) it should work fine if we simply run openqa-gru.service in the same container as the webui and scheduler.
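
As a rough illustration of that loop (a toy sketch only - the table and column names here are invented, not openQA's actual schema):

```python
import sqlite3

# Toy model of the Gru loop described above. Table/column names are
# invented for illustration; openQA's real schema differs.
def run_task(name):
    print("running", name)
    return True  # pretend the task succeeded

def gru_iteration(conn):
    # Wake up, hit the DB, run the task in the first row (if any).
    row = conn.execute(
        "SELECT id, taskname FROM gru_tasks ORDER BY id LIMIT 1").fetchone()
    if row is None:
        return False          # nothing to do; the real daemon sleeps and retries
    task_id, name = row
    if run_task(name):        # only delete the row if the task completed
        conn.execute("DELETE FROM gru_tasks WHERE id = ?", (task_id,))
        conn.commit()
    return True

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gru_tasks (id INTEGER PRIMARY KEY, taskname TEXT)")
conn.execute("INSERT INTO gru_tasks (taskname) VALUES ('download_asset')")
conn.commit()
while gru_iteration(conn):
    pass
```

Since the loop only touches the database, nothing about it should care whether it runs in the same container as the web UI or not.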

README.md was great, yep - I literally had it up and running in 20 minutes. Indeed, the 'update firewall rules' section is no longer needed - Fedora has a much newer Docker now - and the SELinux fix could be improved. chcon is transient; if the affected files are relabelled for any reason, they will go back to the old labels. You should instead use semanage: semanage fcontext -a -t svirt_sandbox_file_t "/root/data(/.*)?" (I think it needs an absolute, not relative, path). Then you run restorecon -vr /root/data to apply the change.

One random-ish note - my 'download ISO' PR is very close to being merged (I think) so it'd be great to have that in when we deploy. Gives us flexibility to schedule jobs from elsewhere.

Good news, everyone: nested virt is now turned on in the infra cloud and works - https://209.132.184.108/tests/10 is a successful test run in my test cloud deployment.

Now we need to solve the issues @jsedlak is dealing with in updating the Docker containers to the latest openQA. Then, if we're going to take this path, we'd get a much bigger instance out of infra (the current one is a tiny 'medium' instance that can only handle one job at a time) and see how it compares to the BOS box in performance.

Dockerfiles and all other Docker-related files should be up to date, README.md is updated, and Docker Hub contains an image of the latest openQA, so we could probably try to deploy it into Ze Cloud. The only thing that remains, I think, is the SSL certificate - is there any way to obtain one? Or are we going to use a self-signed one?

Well, the other issue is that the infra folks don't think it makes much sense to give us one giant instance to run the scheduler and a bunch of workers on. If we're going to deploy in the cloud they'd rather we use one instance for the webui/scheduler and then one instance per worker - which means we'd also need to figure out the best way of sharing /var/lib/openqa/share across cloud instances.

I can fiddle with that a bit today I guess.

OK, so I pretty much got distributed workers going as a PoC. I now have two test instances in the infra cloud, one running the webUI and one worker, and another running another worker instance.

There are various ways you can come at it, but I figured on a fairly simple one. Basically to set up a new worker instance you mount the /share directory from the 'primary' instance somehow - I used sshfs - then you create the data container in the usual way. Then you create a worker container, only you give it a different hostname and instead of --connect openqa_webui you use --add-host="openqa_webui:(IP_of_primary)", which adds a line to /etc/hosts. Then the worker will try and connect via HTTP to openqa_webui, which will be the webUI container because of the /etc/hosts line.
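
For what it's worth, all --add-host does is append an entry like the following to the container's /etc/hosts (the IP here is a made-up example):

```
10.5.124.45    openqa_webui
```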

There are a few caveats with this approach. You obviously need a way to configure the shared mount; sshfs is probably a good choice because it's lightweight and lets us restrict access to the shared data to the openQA instances (so other stuff on the cloud can't fiddle with it and mess up openQA). So what I did was generate a key with no passphrase on the new worker instance and add that key to authorized_keys on the webUI host instance, so the worker instance could simply mount the /share directory right off the webUI host instance; then I linked those into the worker instance's data container just as normal. One snag is SELinux contexts: fuse has support for setting a single context for an entire sshfs mount, but it isn't in fuse 2.9.4, so I'm going to backport it and send out an update.

One change I had to make to the Dockerfiles is to stop making /var/lib/openqa/pool a volume provided by the webUI container. This doesn't make much sense, really: there's no reason for all the worker instances to share a single pool directory. Rather, each worker instance should have its own. The point of the pool directories is to isolate the files for each worker from each other - well, we're already isolating them in containers.

Oh, note, it seemed like UEFI tests would fail at boot, but I don't think this is down to being docker-ized or remote workers - I think it might just be an issue in SUSE's latest qemu-ovmf or something, the version you get in a freshly-deployed container is newer than the one we have on BOS and happyassassin. I'll poke a bit more tomorrow I guess.

! In #623#8190, @adamwill wrote:
One change I had to make to the Dockerfiles is to stop making /var/lib/openqa/pool a volume provided by the webUI container. This doesn't make much sense, really: there's no reason for all the worker instances to share a single pool directory. Rather, each worker instance should have its own. The point of the pool directories is to isolate the files for each worker from each other - well, we're already isolating them in containers.

The only reason we made it shared is that we thought the WebUI was using these directories somehow (for showing the stream from the installation or whatever), but we never really verified that this is the case, so if the WebUI doesn't need it, there is no reason to make it shared.

I'm about 99.99% sure it isn't, because otherwise remote workers just flat out wouldn't work, and upstream is explicitly supporting (and I think using) remote workers.

I think in Ye Olde Days it did, but they rejigged it to allow for remote workers - all the code that touches the /pool dir should be in the worker command codepath (or os-autoinst).

I'll try and remember to verify with upstream tomorrow, though.

! In #623#8190, @adamwill wrote:
Oh, note, it seemed like UEFI tests would fail at boot, but I don't think this is down to being docker-ized or remote workers - I think it might just be an issue in SUSE's latest qemu-ovmf or something, the version you get in a freshly-deployed container is newer than the one we have on BOS and happyassassin. I'll poke a bit more tomorrow I guess.

I've talked to sysrich and he said that it's a regression in openQA.

So, I've been working on making deployment via Ansible possible. The Python client/worker config script I wrote is part of that, also this PR:

https://github.com/os-autoinst/openQA/pull/444

to provide a way to create an initial admin user and API key that doesn't require using the web UI.

So with https://phab.qadevel.cloud.fedoraproject.org/D619, I can now do a completely ansible-ized deployment of openQA. Woot! Ansible files:

https://www.happyassassin.net/temp/common.yml
https://www.happyassassin.net/temp/server.yml
https://www.happyassassin.net/temp/worker.yml
https://www.happyassassin.net/temp/openqa.yml

openqa.yml is the playbook you should run; it defines all required variables and has a comment explaining usage. This example just uses local storage for the data container, so it could not be used for remote workers; you'd change the way /var/local/lib/openqa is set up in common.yml if you wanted to do it with shared storage. I could get all fancy and gin up little pluggable plays for different storage strategies, but let's not get too crazy yet.

You will run into one of two Docker/Ansible bugs, if you try this. If your host system winds up with the rather old Docker that's in F23 stable - 1.7.something - it'll fail when building the container images, consistently, with some error about ApplyLayer. This, near as I can tell, is simply a Docker bug: it goes away if you update to the Docker 1.8.something in updates-testing. However, when you do THAT, you'll run into https://github.com/ansible/ansible-modules-core/issues/2043 the first time you run the play. Just run it again and it'll be fine.

Oh, note: the plays currently build the webui and data containers in order to ensure the necessary scripts wind up in them. Once the images on Docker Hub are sufficiently new, the step that builds the images can be dropped and they'd just be pulled in.

Note that we currently DON'T have working Docker images. 4.1-3.12 from Docker Hub doesn't contain the required functionality and 4.1443951062.9953cb8-703.1 has the UEFI bug. I would build the latest :stable version, but I'm planning to first do all the required changes from https://github.com/os-autoinst/openQA/pull/442 and I also have to wait until this gets fixed. As soon as we're able to log in with a built image, I will build the latest stable version (should be 4.2.1), push it to Docker Hub, and we'll be able to continue :-).

It was working for me over the weekend, building the images fresh (so using whatever openQA package was in the upstream repo over the weekend). I could log in fine.

FWIW, I'm starting to feel like doing the public deployment with Docker would be just a big pain. I'm going to see if I can put together a working openQA-on-Fedora over the weekend; I'll set up a side repo (either just a happyassassin repo or a COPR) and put any deps which haven't passed package review there. Infra says it would be OK to run something that uses non-official packages in the cloud.

Great - we've thought of Docker-based openQA as being mainly for development, not for production, so I think running openQA directly on Fedora is a better idea.

! In #623#8253, @adamwill wrote:
It was working for me over the weekend, building the images fresh (so using whatever openQA package was in the upstream repo over the weekend). I could log in fine.

We've traced the problem to the Red Hat infrastructure. If I run openQA here from within our Brno office, none of us can log in. If I run it while connected to a public Wi-Fi, it works without a problem.

Alright, cool, we're all on the same page there. I actually think I should be pretty close to having something that works already(!), though I haven't actually, you know, successfully built or run it yet. We'll see :) I'll aim to have something up and running this weekend.

Funnily enough Fedora actually has more of the deps in its official repos than SUSE - SUSE is actually pulling quite a lot from that 'openQA-perl-repo' OBS side repo. Fedora's only missing like one package at present and that's in review.

So with https://copr.fedoraproject.org/coprs/adamwill/openQA I have openQA running on Fedora 23 - at least the web UI and a worker. Can't run any tests as the VM I deployed it into hasn't got enough storage, and I got stuff to do this evening; I'll work on it more tomorrow or Monday. openqa.happyassassin.net is down ATM because I had to cannibalize its PSU for another box; I'll re-deploy it using these packages once the replacement PSU arrives.

You need setsebool -P httpd_can_network_connect 1 for it to work at present, I'll see about shipping a more targeted rule with the package that lets Apache connect to the openQA process.

There are about 4-5 packages required that aren't in the official repos yet - a few currently under review, a few not. You can see them all in the COPR; I'll try to get them all reviewed soon.

Sadly, I have Fedora 22, so I'm unable to try it :-).

So, openqa.happyassassin.net is now running on Fedora (and using postgresql, for bonus points). I'll talk to infra about next steps tomorrow, I guess.

! In #623#8426, @adamwill wrote:
So, openqa.happyassassin.net is now running on Fedora (and using postgresql, for bonus points). I'll talk to infra about next steps tomorrow, I guess.

I assume you mean openqa?

I can get started on this today but I'm unclear on what is needed where - f23 everywhere?

! In #623#8430, @tflink wrote:
I can get started on this today but I'm unclear on what is needed where - f23 everywhere?

Yup, f23 should be everywhere. @adamwill has COPR repos for openQA for f23.

yeah, f23 ideally. we could do f22 but I'd have to build more stuff in the COPR. I guess EL7 might be nice for stability, but a lot of the deps will be missing there. The main thing that's needed, though, is the shared storage.

Ticket for deployment:

https://fedorahosted.org/fedora-infrastructure/ticket/4958

I'm not entirely sure whether we or infra 'own' this - the stuff needs to be in infra ansible git, apparently, but the servers are owned by QA. I dunno if it depends on whether we deploy to the .qa.fedoraproject.org subdomain or what.

Progress report!

We have the hosts assigned and deployed, many thanks to tflink. We have the ansible changes in place. We have the staging openQA deployed and, as of right now, more or less working. Issues:

  1. The firewall rules aren't in place, meaning you can't access the web UI from outside of the infra network - you should be able to see it at https://openqa.stg.fedoraproject.org , but you can't. Tests do run and you can examine the results by looking at the log files or hitting the API with http or curl, if you have access to the boxes.
  2. The worker host CPU is old enough that it suffers from https://bugzilla.redhat.com/show_bug.cgi?id=1278688 . I stuck a Rawhide nodebug kernel on it manually for now, but ideally we need the fix for that backported to f23 kernels.
  3. We don't have the UEFI firmware packages installed right now, so the UEFI tests fail (also I think openQA may be hardcoded to look for a specific filename for the UEFI firmware and in the kraxel.org packages it's called something else, I'll have to check that).

Those are pretty small issues, though, really! Once they're resolved I guess we'll run the stg deployment for a few days to prove it works, then ask if we can do the prod deployment.

Great work!

Looking at the code, when you specify UEFI, it tries to read the UEFI_BIOS variable, so we could probably use that to provide the correct firmware name. The problem is that it then concatenates it with a hardcoded path (/usr/share/qemu/). But we could set UEFI_BIOS to something like ../edk2.git/ovmf-x64/OVMF_CODE-pure-efi.fd, I guess.
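
A quick sanity check of that trick - if the hardcoded prefix really is /usr/share/qemu/ as described above, a relative UEFI_BIOS value escapes it cleanly (kraxel.org package layout assumed):

```python
import os.path

firmware_dir = "/usr/share/qemu/"  # the hardcoded prefix mentioned above
uefi_bios = "../edk2.git/ovmf-x64/OVMF_CODE-pure-efi.fd"

# os-autoinst just concatenates; normalizing shows where the path lands
full_path = os.path.normpath(os.path.join(firmware_dir, uefi_bios))
print(full_path)  # /usr/share/edk2.git/ovmf-x64/OVMF_CODE-pure-efi.fd
```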

I've tried it (setting UEFI_BIOS to the correct path) and it seems to work.

awesome, thanks for looking into that! we still need to arrange an appropriate way to make the package available in infra, I'll work on that.

Soooo:

https://openqa.fedoraproject.org/
https://openqa.stg.fedoraproject.org/

:)

There are still a few loose ends to tie up - not all the disk images are generated (I'm working on that), we need to do something about UEFI_BIOS, a few other bits and pieces - but it's basically up now.


Amazing :-).

Why is there a problem with UEFI_BIOS? I thought that we are going to use http://www.kraxel.org/repos/firmware.repo with UEFI_BIOS pointing to correct directory.

There's no problem with UEFI_BIOS exactly; we just need to do it (and I guess figure out how we want to handle the transition from BOS to fp.o: do we just forget about BOS now and set everything up for fp.o, or have two files for a bit, or what?)

I don't mind either way. I think we can forget about BOS and set everything up on fp.o. There are still problems in BOS - GRU is still failing (so I had to delete some ISOs today), etc.

So I came up with a different fix for the UEFI issue:

https://github.com/os-autoinst/os-autoinst/pull/357

because I was too anal to just go ahead and change the templates so they only work on Fedora. I started writing a little tool to make various edits to templates, and then I thought, hell, it's not actually that hard to make os-autoinst better, so let's just do that.

Might still need the silly little tool to deal with another problem, though: machines inside infra can't actually reach dl.fedoraproject.org. We stuck an entry in /etc/hosts to work around the problem, but with the way os-autoinst does networking, I think openQA VMs won't respect that setting; we're probably going to need a little thing that changes the REPOSITORY_GRAPHICAL values in templates for infra deployments.

So I just stuck a hack in the ansible playbook to deal with that issue:

https://infrastructure.fedoraproject.org/cgit/ansible.git/commit/?id=9d271749fb66a56eed20aec5a7f5cb50b157e7ba

I'm still having fun generating the missing disk images, but most tests should now be working in the fp.o deployments...except that freetype 2.6.2 landed in Rawhide today and completely changed font rendering again! So I'm currently doing another mass needle re-take, hopefully tomorrow we'll get some useful results.

So almost all tests now run and work correctly on the prod deployment! Which is a good thing, because I think I just killed BOS by rebooting it, or something.

I'm now just struggling to create the F23 'desktop' hard disk images; it seems like every time I try, it fails because of insufficient disk space, no matter how big I make the image (I tried bumping it to 25GB and 30GB but it seems to make no difference). Filed https://bugzilla.redhat.com/show_bug.cgi?id=1288689 for that.

rwmj pointed out the desktop image problem was happening because of a missing package, so I went ahead and fixed that up. I built the missing images manually on qa05 and transferred them over to the server boxes; the reason being that the server boxes are VMs and nested virt is not enabled, so they take forever to build any of the images that require actually launching a VM. I've got it on my todo list to improve the disk generation stuff in general, but for now, we have images in place and we can do the same 'generate on a worker box and copy across' dance if we need to build new ones or refresh them.

I think we can go ahead and close this ticket, now: the deployments are done, openqa.fp.o is sending out the compose check reports, and BOS still seems to be dead. We still need to work out some workflow stuff, but the main deployment work is done.

! In #623#8700, @adamwill wrote:
So I came up with a different fix for the UEFI issue:

Great, we finally have a nice and clean solution.

! In #623#8729, @adamwill wrote:
... because I think I just killed BOS by rebooting it, or something.

Have no mercy - BOS was horribly deployed, and I'm glad we finally don't have to take care of a system none of us actually knows how to use.

! In #623#8731, @adamwill wrote:
rwmj pointed out the desktop image problem was happening because of a missing package, so I went ahead and fixed that up.

I've encountered it before - it's the missing libguestfs-xfs package.
