#11377 Install AWX for testing proof of concept
Opened 10 months ago by kevin. Modified 2 months ago

We would like to install an AWX instance on our production OpenShift cluster.

Why do we want to do this?
* Allow users/groups to use a web interface to deploy things in Fedora Infrastructure instead of having to log in to batcave01 and run sudo
* Reduce / potentially eliminate the use of rbac-playbook
* Allow better permissions handling / let users easily see what permissions they have
* Allow much better access to logs
* Allow automation (scheduling playbook runs, checking things, etc.)
* Decouple ansible from batcave01, allowing us to run it in a more controlled/different execution env
* Help AWX upstream (and AAP downstream) with more use cases/bug reports/etc

Why not staging?
* Since we don't have a separate staging ansible setup currently, it would introduce confusion and not much gain to deploy to stg first.

Config options/possible issues/things we already discussed
* We may not be able to use our existing private repo; we need to investigate that more, so we may need to move private variables to vault or some other solution.
* Because of that we are going to make a new repo, at least for this proof of concept, under https://pagure.io/fedora-infra/
* To start with, this will use a different ansible ssh key and have different access, so we can add a few test machines to it to see how things work.
* We want to install the AWX operator and configure it via ansible just like we do all our other things.
* We want batcave01 to still be able to run playbooks/ansible and manage things, since AWX requires that our OpenShift cluster be up and running.
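For illustration, deploying via the operator mostly boils down to applying an AWX custom resource once the operator is running. A minimal sketch (namespace and spec values are assumptions, not our actual config):

```yaml
# Minimal AWX custom resource for the AWX operator (illustrative values).
# Applied with `oc apply -f awx.yml` after the operator is installed.
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  namespace: awx            # assumed namespace
spec:
  service_type: ClusterIP
  ingress_type: route       # expose via an OpenShift route
```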

Things we need to determine:
* Can we re-use ansible-private?
* Do we want to use the existing ansible repo, or migrate to a new one?
* What execution env are we going to deploy/use?

@darknao has been running AWX at home for years and is hopefully able to help us make this happen. ;)

Metadata Update from @phsmoura:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops

10 months ago

Can we re-use ansible-private?

The Ansible Way™ is usually to store secret files/vars in the same repo as the playbooks, encrypted with ansible-vault. Then you just need to set the vault key in AWX and/or ansible-core to decrypt them automatically on use.
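As a sketch of that workflow (file and variable names are hypothetical), a secret lives encrypted next to the playbooks and is referenced like any other variable:

```yaml
# group_vars/all/secrets.yml -- encrypted in place with:
#   ansible-vault encrypt group_vars/all/secrets.yml
# AWX decrypts it at runtime once a Vault credential holding the
# vault password is attached to the job template.
db_password: !vault |
  $ANSIBLE_VAULT;1.1;AES256
  62313365396662343036...   # ciphertext elided
```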

Alternatively, we can expose host paths from the execution node (host) to the execution environment (container).
If we use batcave01 as the execution node, the setup should be quite straightforward as everything is already available there.
If we set up a dedicated execution node (batcomputer01), then we'll need to replicate the /srv/private space.
Playbooks will still need to be modified to remove most of the hard-coded absolute paths that make them only usable on batcave01 at this moment.
Note: this only works with execution nodes; we can't do that if we run playbooks in the OCP cluster.
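For reference, exposing host paths is done through AWX's job settings ("Paths to expose to isolated jobs"), which bind-mount host directories into the EE container on the execution node. A sketch, with assumed paths:

```yaml
# AWX > Settings > Jobs > "Paths to expose to isolated jobs"
# Each entry is HOST-DIR:CONTAINER-DIR:OPTIONS; podman applies them as
# bind mounts when the EE container starts on the execution node.
AWX_ISOLATION_SHOW_PATHS:
  - "/srv/private:/srv/private:O"                      # O = podman overlay mount
  - "/srv/web/infra/ansible:/srv/web/infra/ansible:ro" # plain read-only mount
```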

Do we want to use existing ansible repo, or just migrate to a new one?

AWX can work with our current repo, but most playbooks will not work in their current state anywhere other than on batcave01.
I think using another repository could make sense, as we would need to migrate all our playbooks/roles for AWX anyway (remove absolute paths, use upstream collections - see next section, ...).
We can also take that opportunity to update our playbooks with the latest syntax recommendation (we have a lot of them using old and deprecated syntax) and maybe clean unused stuff.

What execution env are we going to deploy/use?

The standard AWX Execution Environment will most likely lack a few dependencies that we need here.
I've set up a customized EE as a proof of concept with everything I think we need, based on what we have on batcave01:
It's automatically built with GitLab CI, is available in the GitLab registry, and can be easily customized if needed.
A few notable differences from batcave01:
- This EE is based on fedora instead of RHEL
- Since it's not RHEL, the redhat.linux-system-roles collection is replaced with fedora.linux_system_roles
- I used the freeipa.ansible_freeipa collection instead of the ansible-freeipa rpm as I think it makes more sense to use ansible-galaxy for everything instead of mixing galaxy and rpms.
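An ansible-builder definition capturing those choices looks roughly like this (version 3 format; the package lists are illustrative, not the actual contents of the EE):

```yaml
# execution-environment.yml -- built with `ansible-builder build -t our-ee:latest`
version: 3
images:
  base_image:
    name: quay.io/fedora/fedora:latest   # Fedora base instead of RHEL
dependencies:
  galaxy:
    collections:
      - fedora.linux_system_roles        # replaces redhat.linux-system-roles
      - freeipa.ansible_freeipa          # from galaxy instead of the rpm
  python:
    - requests
  system:
    - openssh-clients [platform:rpm]     # bindep syntax for system packages
```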

We could set up a RHEL-based EE if we want to get closer to batcave01, but that requires a Red Hat subscription and makes the build a little bit more complicated.

Regarding authentication:

AWX can work with both OIDC & SAML2, but only SAML2 allows group mapping.
There is an issue with Ipsilon (https://pagure.io/ipsilon/issue/393): it uses the deprecated RSA-SHA1 algorithm to sign SAML messages, which are then rejected by AWX.

This was fixed in the underlying library, lasso (https://src.fedoraproject.org/rpms/lasso/pull-request/9), but the updated package is not available yet. So we'll have to wait for that in order to get working authentication with FAS accounts.

Can we re-use ansible-private?

I'd add a third option: make ansible-private a git submodule of the main ansible repository.
The content of the repo stays private, but AWX is able to automatically clone and update the submodule when needed.
All files are then accessible for consumption on the execution node. We still need to update playbooks with the new path, but that seems more portable to me and a bit less hacky than the other solution.
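Mechanically, the private repo would just appear in the main repo's .gitmodules, and any recursive checkout (which is what an AWX project sync does, given credentials) pulls it in. A sketch as an Ansible task, with assumed URLs and paths:

```yaml
# Checking out the main repo together with its submodules; an AWX project
# update does the equivalent automatically. Repo URL and dest are
# illustrative assumptions.
- name: Clone the ansible repo including the private submodule
  ansible.builtin.git:
    repo: https://pagure.io/fedora-infra/ansible.git
    dest: /srv/checkout/ansible
    recursive: true   # also clones/updates submodules (ansible-private)
```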

So I think there might be several stages here: an initial 'play around with it/poc' stage and a 'we are definitely moving to this' stage.

How about, for an initial proof of concept, we just use batcave01 as the execution node? Then everything should work with our existing setup and we can play around with it. Once we are happy with it/decide to move forward, we can make a new repo entirely, slowly migrate things over to it (cleaning them up/fixing them as we go), and move to a new clean execution node?

At that point it probably makes sense to just use vault and the 'ansible way' of doing things, although I would like to make it so we could use either ansible directly or AWX as the end state, just in case there's some problem with AWX so we aren't blocked.

One thing to note here is that the way ansible works now on batcave01, there's an ssh agent with the ssh key unlocked in it that only root can access. Would this need an entirely new ssh key set up, or can the EE use the existing agent and have access that way? I suppose a new key might be nicer, so it's completely separate and we only add it to targets we want to test with?

I'd really like auth to be working... we can ping simo to ask about plans for lasso updates.

Thanks again for working on this.

We can start with batcave01, sure. A few notes though:
Execution Nodes require a few things to communicate with the AWX control plane (AWX generates a playbook for this):
- Install AWX receptor from https://copr.fedorainfracloud.org/coprs/ansible-awx/receptor/.
- Install podman.
- Configure the receptor with the certificate/key pair generated by AWX.
- Install ansible-runner with pip.
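The steps above could be sketched as Ansible tasks roughly like this (the copr name comes from the link above; everything else is illustrative, not our actual playbook):

```yaml
# Rough provisioning tasks for an execution node (illustrative sketch).
- name: Enable the receptor copr
  community.general.copr:
    name: ansible-awx/receptor
    state: enabled

- name: Install receptor and podman
  ansible.builtin.dnf:
    name: [receptor, podman]
    state: present

- name: Install ansible-runner from PyPI
  ansible.builtin.pip:
    name: ansible-runner

- name: Deploy the AWX-generated receptor certificate/key pair
  ansible.builtin.copy:
    src: "{{ item }}"        # taken from the bundle AWX generates
    dest: /etc/receptor/
    owner: awx
    mode: "0600"
  loop: [receptor.crt, receptor.key]
```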

The receptor & Execution Environment are executed by the unprivileged awx user (that can be changed, but I can't really recommend using root here).
The receptor listens on port 27199/tcp by default (that can be changed too).
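For reference, the receptor side of this is a small YAML config; a sketch with the default port (the node id is an assumption, and TLS directives are omitted for brevity):

```yaml
# /etc/receptor/receptor.conf (illustrative; TLS config not shown)
- node:
    id: batcave01.example.org   # must match the node name registered in AWX

- tcp-listener:
    port: 27199                 # default, can be changed

- work-command:
    worktype: ansible-runner
    command: ansible-runner
    params: worker
    allowruntimeparams: true
```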

Since the Execution Environment is an unprivileged container running as awx, the root ssh-agent will not be available. I recommend creating a new key, with a small subset of hosts to begin with, especially for the POC phase.

For the same reason, /srv/private/ansible would need some additional acl to grant read access to the awx user.
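If we go that route, the extra ACL is a one-liner, e.g. via ansible.posix.acl (user and path as discussed above; the task itself is a sketch):

```yaml
# Grant the unprivileged awx user read access to the private checkout.
- name: Allow awx to read /srv/private/ansible
  ansible.posix.acl:
    path: /srv/private/ansible
    entity: awx
    etype: user
    permissions: rx
    recursive: true
    state: present
```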

So, I keep meaning to do the upgrade/replace of batcave01 with rhel9. I think I will push for doing that next week (Thursday?).
I'd like to finish that up before we do this.

Can we rebuild the ansible-runner package for rhel9?

new key sounds good.

we already have some facls on that, we could add more.

There is no ansible-runner rpm in epel9 (or epel8), but it's available in the ansible-automation-platform repository.

There was an effort to get it built in epel8 a long time ago, but was dropped by lack of dependencies (pexpect). I think that is still the case.

Yeah, my thought was that rhel9 is new enough to make it more possible... but that's a sidetrack I guess. ;)

Little status update:

  • AWX is deployed 🎉
  • FAS login set up with SAML2
  • Infra git connected and host inventory imported
  • Execution Node (batcave01) preconfigured, but not installed yet
  • Execution Environment set up with all our collections / roles / python modules

Next steps:

We still need to create and deploy a new dedicated ssh key so we can connect to hosts.
Then we should be able to start creating playbooks and/or job templates for simple things that don't require /srv/private/ansible. That could be system update tasks, or proxy resyncs, for instance.
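Creating and distributing that dedicated key could look like this (key path, target user, and host group are assumptions for illustration):

```yaml
# Generate a dedicated AWX ssh key and authorize it on a small test group.
- name: Generate the awx ssh keypair
  community.crypto.openssh_keypair:
    path: /etc/awx/awx_id_ed25519
    type: ed25519
    owner: awx
    mode: "0600"

- name: Authorize the awx key on test hosts
  ansible.posix.authorized_key:
    user: root
    key: "{{ lookup('file', '/etc/awx/awx_id_ed25519.pub') }}"
  # run this against a limited 'awx_test' inventory group first
```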

In the meantime, we are working on getting ansible-runner, ansible-receptor, and the awx.awx collection packaged for rhel9 so we can deploy the Execution Node on batcave01.

@darknao the awx.awx collection is packaged here: https://src.fedoraproject.org/rpms/ansible-collection-awx-awx. It will take a week to hit stable unless it gets 3 approvals beforehand. I can look at ansible-runner and ansible-receptor packaging as well if needed?

That's great, thanks for working on that!

Yes, we still need both ansible-runner for rhel9 and ansible-receptor.
The receptor is the most important one.

Yep, I have created a request for receptor here: https://bugzilla.redhat.com/show_bug.cgi?id=2232203 but I'm sure this package will take some time to review and get approved, as there are more moving parts.

receptor should be in now and available.

What are the next steps here? :)

Now, at this point, we should build a test system to make sure that AWX can install the receptor package and set up the config files to communicate. Once we have that, we should be able to run some test jobs before we start to scale and move to production.


Related Pull Requests
  • #1490 Merged 9 months ago