#284 Add an SOP for cloud-image-uploader
Opened 6 months ago by jcline. Modified 4 months ago
jcline/infra-docs-fpo cloud-image-uploader  into  master

@@ -0,0 +1,164 @@ 

+ = cloud-image-uploader SOP

+ 

+ Upload Cloud images to public clouds after they are built in Koji.

+ 

+ Source code: https://pagure.io/cloud-image-uploader

+ 

+ == Contact Information

+ 

+ Owner::

+   Cloud SIG, Jeremy Cline (jcline)

+ Contact::

+   #cloud:fedoraproject.org (Matrix)

+ Servers::

+   - https://console-openshift-console.apps.ocp.stg.fedoraproject.org/project-details/ns/cloud-image-uploader[Stage]

+   - https://console-openshift-console.apps.ocp.fedoraproject.org/project-details/ns/cloud-image-uploader[Production]

+ 

+ Purpose::

+   Upload Cloud images to public clouds.

+ 

+ == Description

+ 

+ cloud-image-uploader is an AMQP message consumer (run via `fedora-messaging

+ consume`) that processes Pungi compose messages published on the

+ `org.fedoraproject.*.pungi.compose.status.change` AMQP topic. When a compose

+ enters the `FINISHED` or `FINISHED_INCOMPLETE` state, the service downloads

+ any images in the compose and uploads them to the relevant cloud provider by

+ running an Ansible playbook. Consult the `playbooks` directory in the source

+ repository or Python package to see the playbooks.
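The gist of the consumer's filtering can be sketched as follows. This is a minimal illustration, not the actual consumer code; the message body fields shown here are assumptions based on the description above:

```python
# Sketch of the filtering the consumer performs on
# org.fedoraproject.*.pungi.compose.status.change messages.
# Field names are illustrative; the real consumer lives in the
# cloud-image-uploader repository.
FINISHED_STATES = {"FINISHED", "FINISHED_INCOMPLETE"}

def should_process(body: dict) -> bool:
    """Return True if a compose status change message warrants an upload run."""
    return body.get("status") in FINISHED_STATES

# A compose that just finished (possibly with some failed deliverables)
# triggers processing; anything else is ignored.
assert should_process({"status": "FINISHED_INCOMPLETE"})
assert not should_process({"status": "STARTED"})
```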

+ 

+ The service does not accept any incoming connections and only depends on the

+ RabbitMQ message broker and the relevant cloud provider's APIs.

+ 

+ It requires a few gigabytes of temporary space to download the images before

+ uploading them to the cloud provider. The service is heavily I/O-bound; its most

+ computationally expensive operation is decompressing the images.

+ 

+ == General Configuration

+ 

+ The Fedora Ansible repository contains the

+ https://pagure.io/fedora-infra/ansible/blob/main/f/roles/openshift-apps/cloud-image-uploader[OpenShift

+ application definition]. The playbook to create the OpenShift application is

+ located at `playbooks/openshift-apps/cloud-image-uploader.yml`.

+ 

+ Within the container image, configuration is provided via

+ `/etc/fedora-messaging/config.toml`. Additionally, secrets may be provided via

+ environment variables and are noted in the relevant cloud sections.
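For orientation, a fedora-messaging `config.toml` for this service would look roughly like the fragment below. This is illustrative only; the broker URL, certificate paths, queue name, and callback path are assumptions, not the deployed values:

```toml
# Illustrative fedora-messaging configuration -- the actual file shipped in
# the container may differ.
amqp_url = "amqps://cloud-image-uploader:@rabbitmq.fedoraproject.org/%2Fpubsub"
callback = "cloud_image_uploader:Consumer"

[tls]
ca_cert = "/etc/pki/rabbitmq/ca.crt"
certfile = "/etc/pki/rabbitmq/cloud-image-uploader.crt"
keyfile = "/etc/pki/rabbitmq/cloud-image-uploader.key"

[queues.cloud-image-uploader]
durable = true
auto_delete = false
exclusive = false
arguments = {}

[[bindings]]
queue = "cloud-image-uploader"
exchange = "amq.topic"
routing_keys = ["org.fedoraproject.*.pungi.compose.status.change"]
```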

+ 

+ == Deploying

+ 

+ The deployment configuration consists of a single container image running in one pod.

+ 

+ === Staging

+ 

+ The staging BuildConfig builds a container from

+ https://pagure.io/cloud-image-uploader/tree/main[the main branch]. You need to

+ trigger a build manually, either from the web UI or the CLI.

+ 

+ === Production

+ 

+ The production BuildConfig builds a container from

+ https://pagure.io/cloud-image-uploader/tree/prod[the prod branch]. Just like

+ staging, you need to trigger a build manually. After deploying to staging, the

+ main branch can be merged into the production branch to "promote" it:

+ 

+ ....

+ $ git checkout prod && git merge --ff-only main

+ ....

+ 

+ === Azure

+ 

+ Images are uploaded whenever a compose contains `vhd-compressed` images.

+ Images are first uploaded to a container in the storage account and then

+ imported into an Image Gallery.

+ 

+ Credentials for Azure are provided using environment variables. The credentials

+ are used by the

+ https://docs.ansible.com/ansible/latest/collections/azure/azcollection/index.html[Azure

+ Ansible collection].

+ 

+ ==== Image Cleanup

+ 

+ Image clean-up is automated.

+ 

+ The storage account is configured to delete any blob in the container older

+ than 1 week and should require no manual attention. Nothing in the container is

+ required after the VHD is imported to the Image Gallery.
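A blob lifecycle rule of this kind is expressed in Azure storage account lifecycle management policy JSON roughly as follows. The rule name and blob-type filter are assumptions; the actual policy is configured on the storage account itself:

```json
{
  "rules": [
    {
      "name": "delete-stale-vhds",
      "enabled": true,
      "type": "Lifecycle",
      "definition": {
        "filters": { "blobTypes": [ "blockBlob" ] },
        "actions": {
          "baseBlob": { "delete": { "daysAfterModificationGreaterThan": 7 } }
        }
      }
    }
  ]
}
```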

+ 

+ Images in the Gallery are cleaned up by the image uploader after a new image

+ has been uploaded. For complete details on the image cleanup policy refer to

+ the consumer code, but at the time of this writing the policy is as follows:

+ 

+ - Any image that has an end-of-life field that is in the past is removed.

+ 

+ - Only the latest 7 images that are marked as "exclude from latest = True"

+   within an image definition are retained. When an image is marked as "exclude

+   from latest = False", new virtual machines that don't reference an explicit

+   image version will boot using the newest such image (following semver). All images

+   are uploaded with "exclude from latest = True" and are only marked as

+   "exclude from latest = False" after testing.

+ 

+ - Only the latest 7 images in the Rawhide image definitions are retained,

+   regardless of whether they are marked "exclude from latest = False".
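The cleanup rules above can be sketched in Python. This is a simplified model, not the consumer's actual code; the image fields (`eol`, `version`, `exclude_from_latest`) are assumptions for illustration:

```python
from datetime import date

# Simplified model of the cleanup policy described above; the real
# implementation lives in the consumer code. Each image is a dict with an
# "eol" date (or None), a semver-like "version" tuple, and an
# "exclude_from_latest" flag.
KEEP = 7

def images_to_delete(images, today, rawhide=False):
    """Return the subset of *images* a cleanup pass would remove."""
    # Rule 1: anything whose end-of-life is in the past goes away.
    delete = [i for i in images if i["eol"] is not None and i["eol"] < today]
    keep = [i for i in images if i not in delete]
    if rawhide:
        # Rawhide rule: keep only the newest 7, regardless of the flag.
        newest_first = sorted(keep, key=lambda i: i["version"], reverse=True)
        delete += newest_first[KEEP:]
    else:
        # Stable rule: prune only images still marked
        # "exclude from latest = True"; promoted images are untouched.
        excluded = sorted(
            (i for i in keep if i["exclude_from_latest"]),
            key=lambda i: i["version"],
            reverse=True,
        )
        delete += excluded[KEEP:]
    return delete
```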

+ 

+ At the moment, testing and promotion to "exclude from latest = False" is a

+ manual process, but in the future it will be automated to happen regularly

+ (weekly, perhaps).

+ 

+ ==== Authentication

+ 

+ The following environment variables are used:

+ 

+ ....

+ AZURE_SUBSCRIPTION_ID - Identifies the subscription within an Azure tenant (our tenant only has 1)

+ AZURE_CLIENT_ID - The application ID used during authentication.

+ AZURE_SECRET - The application secret used during authentication.

+ AZURE_TENANT - Identifies the Azure tenant.

+ ....

+ 

+ If you have access to the Fedora Project tenant, these values are available in

+ the https://portal.azure.com[web portal] under the Microsoft Entra ID service

+ in the "App registrations" tab. To manage things via the CLI, install the

+ Azure CLI with `dnf install azure-cli`. All commands below assume you've logged in with `az login`.

+ 

+ There are two app registrations, `fedora-cloud-image-uploader` and

+ `fedora-cloud-image-uploader-staging`. These were created by running:

+ ....

+ $ az ad app create --display-name fedora-cloud-image-uploader

+ ....

+ 

+ ==== Authorization

+ 

+ Images are placed in two resource groups (containers for arbitrary resources).

+ `fedora-cloud-staging` is used for the staging deployment, and `fedora-cloud`

+ is used for the production deployment.

+ 

+ The app registrations are granted access to their respective resource group by

+ assigning them a role on the resource group. The role definition can be seen with:

+ 

+ ....

+ $ az role definition list --name "Image Uploader"

+ ....

+ 

+ This role is then assigned to the app registration with

+ 

+ ....

+ $ az role assignment create --assignee "fedora-cloud-image-uploader" \

+     --role "Image Uploader" \

+     --scope "/subscriptions/{subscription_id}/resourceGroups/fedora-cloud"

+ ....

+ 

+ In the event that additional permissions are required, the role definition

+ can be updated to include them.
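A custom role definition such as "Image Uploader" is expressed as JSON along these lines. The specific actions granted here are assumptions for illustration; check `az role definition list --name "Image Uploader"` for the real ones:

```json
{
  "Name": "Image Uploader",
  "IsCustom": true,
  "Description": "Upload VHDs and manage image versions in a resource group.",
  "Actions": [
    "Microsoft.Storage/storageAccounts/blobServices/containers/*",
    "Microsoft.Compute/galleries/images/versions/*"
  ],
  "AssignableScopes": [
    "/subscriptions/{subscription_id}/resourceGroups/fedora-cloud"
  ]
}
```

A definition in this shape can be applied with `az role definition update --role-definition @role.json`.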

+ 

+ 

+ ==== Credential rotation

+ 

+ At the moment, credentials are set to expire and will need to be periodically rotated. To do so via the CLI:

+ ....

+ $ az ad app list -o table  # Find the application to issue new secrets for and set CLIENT_ID to its "Id" field

+ $ touch azure_secret

+ $ chmod 600 azure_secret

+ $ SECRET_NAME="Some useful name for the secret"

+ $ az ad app credential reset --id $CLIENT_ID --append --display-name $SECRET_NAME --years 1 --query password --output tsv > azure_secret

+ ....

@@ -82,6 +82,7 @@ 

  * xref:bodhi.adoc[Bodhi Infrastructure - Releng]

  * xref:bodhi-deploy.adoc[Bodhi Infrastructure - Deployment]

  * xref:bugzilla2fedmsg.adoc[bugzilla2fedmsg]

+ * xref:cloud-image-uploader.adoc[Cloud Image Uploader]

  * xref:collectd.adoc[Collectd]

  * xref:compose-tracker.adoc[Compose Tracker]

  * xref:registry.adoc[Container registry]

This is just the basics for now. In addition to adding AWS and GCP
sections (once the image uploader supports those clouds), I plan on
adding details on common tasks (deploy a new version, deal with
failures, etc).

Looks like a good start to me.

A few things I wonder about:

  • Do we need to prune images ever? would this be something to add to this app? Or something separate?

  • We probably should sometime have a larger conversation on if we want to add a QE step in here. ie, do we want to test and only upload passing images? or do we want to upload everything, but somehow 'tag' images that pass? But I am not sure at all how people search for images on Azure...

But I'm fine merging this as is and expanding on it/discussing those things somewhere else.

> Looks like a good start to me.
>
> A few things I wonder about:
>
>   • Do we need to prune images ever? would this be something to add to this app? Or something separate?

Yes, that's next on my to-do list and I've got a few ideas, but nothing I've quite committed to just yet so I didn't document anything. I'm currently leaning towards having a Function run in Azure that just runs once a day/week/whatever and implements the pruning rules. We can write it as a Python function and maybe even ensure it exists via Ansible, or just set it up one-time manually if that's not easy/possible.

>   • We probably should sometime have a larger conversation on if we want to add a QE step in here. ie, do we want to test and only upload passing images? or do we want to upload everything, but somehow 'tag' images that pass? But I am not sure at all how people search for images on Azure...

Indeed. The way it currently works is all the images are marked with a flag to "exclude them from latest" so the only way to use them is to explicitly boot the image ID, which we don't link to. You can then flip that flag to expose them as the latest (assuming they sort higher, semver-wise, I think). For the super short term I'm going to handle promoting them manually, but I'm definitely motivated to automate that process.

rebased onto fe02817

4 months ago

The service now cleans up Azure images, so I've documented that in the SOP.
