#12142 Deploy updated fedora-cloud-image-uploader to replace the container sync scripts
Opened 11 months ago by adamwill. Modified 5 months ago

We should now try and deploy this to replace sync-latest-container-base-image.sh and sync-ostree-base-containers.sh. There are a few things to decide, though.

Currently cloud-image-uploader runs in a pod. The existing sync scripts run on the compose host, which has the credentials to publish to the registries. We would need to either add those credentials into the pod, or run a second instance of cloud-image-uploader on the compose host, configured only to publish container images (not cloud images). Or I guess we could drop the pod and do all the publishing for both cloud and container from an instance of the consumer running on the compose host? In that case, we could make it capable of detecting when the compose is present on the local filesystem and using the images from there, instead of re-downloading them, as it currently does.
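A minimal sketch of what that local-compose detection could look like, in case it helps the discussion. Everything named here is an assumption for illustration: `resolve_image_source`, the mount point, and the URL prefix are hypothetical and not taken from the uploader's code or config.

```python
from pathlib import Path

# Hypothetical defaults: the real mount point and URL prefix would come from
# the consumer's configuration, not from hard-coded constants like these.
LOCAL_COMPOSE_ROOT = Path("/mnt/koji/compose")
REMOTE_COMPOSE_PREFIX = "https://kojipkgs.fedoraproject.org/compose/"


def resolve_image_source(
    image_url: str,
    local_root: Path = LOCAL_COMPOSE_ROOT,
    remote_prefix: str = REMOTE_COMPOSE_PREFIX,
) -> tuple[str, str]:
    """Return ("local", path) if the image is on disk, else ("remote", url)."""
    if image_url.startswith(remote_prefix):
        # Map the URL onto the local tree by swapping the prefix for the root.
        relative = image_url[len(remote_prefix):]
        candidate = local_root / relative
        if candidate.is_file():
            return ("local", str(candidate))
    # No local copy (e.g. running in a pod): fall back to downloading.
    return ("remote", image_url)
```

With something like this, the same consumer could run either in the pod or on the compose host; in the pod the local check would simply never match and the current download behaviour would be kept.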

We would also need to synchronize disabling the current sync scripts, so they don't collide with each other.

  • When do you need this? (YYYY/MM/DD)
    No specific date.

  • When is this no longer needed or useful? (YYYY/MM/DD)
    When we have a whole new compose pipeline, I guess.

  • If we cannot complete your request, what is the impact?
    We'll continue using the existing fragile bash scripts to sync, which has various drawbacks: Koji task-based discovery can cause odd effects, the atomic one generates containers from ostrees on the fly instead of using the native ociarchive images we build in the composes, and the scripts have no tests, so they are risky to modify.

@jcline @siosm @walters @pbrobinson


IMHO it should be fine to do this in a pod.

The existing ansible roles/login-registry/tasks/main.yml does this on hosts, so we just need to make the login/pass there into openshift secrets and use them...

But perhaps someone else sees a problem doing that?
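For illustration, here is a rough sketch of how the app side could pick up registry credentials if the OpenShift secret were exposed to the pod as environment variables. The function name and the `<REGISTRY>_USERNAME` / `<REGISTRY>_PASSWORD` naming scheme are hypothetical, not something the uploader is known to use.

```python
import os
import re


def load_registry_credentials(registry: str, env=None) -> tuple[str, str]:
    """Look up (username, password) for a registry from environment variables.

    Assumes a hypothetical naming scheme: "quay.io" -> QUAY_IO_USERNAME and
    QUAY_IO_PASSWORD, populated from an OpenShift secret.
    """
    env = os.environ if env is None else env
    # Turn the registry host into a valid environment variable prefix.
    key = re.sub(r"[^A-Za-z0-9]", "_", registry).upper()
    try:
        return env[f"{key}_USERNAME"], env[f"{key}_PASSWORD"]
    except KeyError as exc:
        # Report which variable is missing without leaking any values.
        raise RuntimeError(
            f"missing credential variable {exc.args[0]} for {registry}"
        ) from None
```

Mounting the secret as files instead would work just as well; the point is only that the login/pass no longer need to live on the host.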

It is kinda nice that if we do this on the compose host they don't have to re-download the images. But I guess if it's all within infra the 'downloads' are really just local network transfers and happen fast anyway...

Metadata Update from @phsmoura:
- Issue tagged with: medium-gain, medium-trouble, ops

10 months ago

We've now merged my PR to remove various cloud-specific references from the project. This includes renaming the Python library to just fedora_image_uploader, so when deploying the next version (with this change in it) to infra, we'll need to update the consumer config's callback line for that change.

I don't think we need to change anything else. The name 'cloud-image-uploader' is all over the infra ansible stuff, but I think none of it refers to the stuff changed in my PR except the callback name. If we rename the Pagure project we'd need to change more bits in the ansible stuff. We could change them all to just 'image-uploader' but it's a bigger thing to do, maybe we can do that later.

Hey folks. Thanks for all the work already done on this front. Do we know what's remaining to complete this? Thanks

I need someone to help me with what's wrong with the IAM permissions that prevent the cloud-image-uploader from working with AWS. ;(

We do need this, but this ticket is about it uploading container images to quay.io and stuff, I believe.

The code is deployed so as far as I am aware the only thing left to do is get the credentials added and update the app configuration with the registries to push to. Happy to help with that, I can prep a PR for the ansible repo shortly.

yeah, sorry, too many ticket updates at once.

@kevin I am happy to help this week with the permissions for AWS.

we'd also need to co-ordinate disabling the old shell scripts so they don't fight each other, I guess.

oh, we should also co-ordinate with @pbrobinson / @pwhalen for the bootc/IoT angle. since we merged https://pagure.io/cloud-image-uploader/pull-request/18 , if we deploy this as-is, it will start publishing the bootc images from the IoT composes; we should make sure that's what they want (and arrange to decommission their bash script too).

Where do we stand on this one?

I started on the config change here: https://pagure.io/fedora-infra/ansible/pull-request/2200

I still need to inject the registry credentials, at which point we could deploy it to stage and make sure it works before rolling it out to prod and switching off the old scripts. It's on my to-do list for some time this week.

Anything I can help with to get this over the line? The beta is out and the images are not yet updated which leaves atomic desktops with DNF5 if people are using the registry images.

I think that's https://pagure.io/releng/issue/12314
I looked at it the other day. Our existing scripts are not actually erroring and seem to upload, but something is still preventing the images from showing up.

Help appreciated.

seems I am late to the party.

So...we're more or less ready for this, I think. (We have actually already activated it for ELN).

We can't really easily cut over to this gradually (it would be possible but finicky). I'm gonna suggest we do a big bang where we enable this for everything in the rawhide/branched nightlies and the nightly Container composes at once, on 39/40/41/Rawhide, and disable all the calls to the old bash scripts in every branch of pungi-fedora. Then we monitor it closely for a couple of days and make sure it's working as intended (if not, we revert the changes). For IoT we can wait a bit until https://pagure.io/fedora-iot/pungi-iot/pull-request/102 is sorted out.

@jcline @kevin @siosm wdyt? I can send pull requests to implement the above if you agree. to be clear, for atomic desktops, this would result in the 'native' ostree container images being published instead of the ones converted on-the-fly by https://pagure.io/releng/blob/main/f/scripts/sync-ostree-base-containers.sh .

I'm not sure how I feel about enabling this in final freeze, but I guess if everyone else is ok with it...

I would personally prefer to wait until the final freeze is over as well.

Since the old bash scripts are working (from my understanding) it doesn't feel like an emergency. I know this has been going on for a long time and I'd love to see it over the finish line, but maybe we'll hit the first release date and it'll just be a couple weeks?

I'm tempted to say that we should do it now, at least for the Atomic Desktops, as that's part of what we want to have for F41.

If we break the existing images for the Atomic Desktops it's not the end of the world as no one should be relying on those right now (notably Universal Blue does not).

If we don't do it now and switch a few weeks after the release, it's going to be super confusing for everybody.

But I also understand that we are now in freeze so this is putting the release at risk (the Atomic Desktops are not a blocker).

Can we deploy this only for the Atomic Desktops for F41, and then do it for bootc & application container images after the release?

We are able to deploy it for a single image type (turning off the bash script for a single image type, I don't know how risky that is). I admit I've not followed the details for the atomic stuff so my preference is based solely on "it's a freeze and a non-visible-to-user change isn't urgent".

However, if there is a visible-to-users change here that will cause headaches for users if we wait, I lean towards being in favor of doing it for those containers and rolling out the rest after the freeze lifts... Unless that's deemed as riskier than just doing them all at once.

For what it's worth, the ELN images seem to be doing just fine so that's some comfort. Ultimately, though, I'm okay with whatever the infra folks are comfortable doing.

we can do it just for atomic desktops, yeah, by only onboarding those pairings in the config dict, and only disabling the atomic desktop sync script (leaving the 'regular' container sync script enabled).
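A sketch of that idea, assuming the config maps image subvariants to repository names. The dict contents, the registry list, and `publish_targets` are illustrative guesses, not the uploader's actual config schema.

```python
# Hypothetical config: only the atomic desktop pairings are onboarded, so
# anything not listed here is left to the old 'regular' container sync script.
ATOMIC_DESKTOP_REPOS = {
    "Silverblue": "fedora-silverblue",
    "Kinoite": "fedora-kinoite",
}
REGISTRIES = ["quay.io/fedora"]


def publish_targets(subvariant, repos=None, registries=None):
    """Return the list of repositories to push to for an image subvariant.

    An empty list means the image type is not onboarded and is skipped.
    """
    repos = ATOMIC_DESKTOP_REPOS if repos is None else repos
    registries = REGISTRIES if registries is None else registries
    repo = repos.get(subvariant)
    if repo is None:
        return []
    return [f"{registry}/{repo}" for registry in registries]
```

Onboarding the remaining image types after the freeze would then just mean adding their pairings to the dict and disabling the matching bash script at the same time.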

Are there plans to do this now that F41 is released?
People are starting to expect images built with dnf and bootc in them and, as I understand it, that only gets delivered for the desktops via this path. https://discussion.fedoraproject.org/t/f41-change-proposal-dnf-and-bootc-in-image-mode-fedora-variants-system-wide/117664/14

Yeah, I think we do want to do this now that f41 is out. Just need to coordinate.

Okay, to get the ball rolling I've filed https://pagure.io/fedora-infra/ansible/pull-request/2337. It turns it on for everything except IoT. I'm not really familiar with where the bash scripts are set up/run in Ansible so that's not currently in the PR.

I seem to recall Adam mentioning he'd be on vacation so we either need to wait for him to come back or if someone else knows how to remove the bash scripts from the equation we can give it a go. Not sure if we want to do it before the weekend, but I'll be around tomorrow and the first half of next week to help out.

I can disable the scripts, but... let's do this early next week. I have a bunch of other things today. ;)

yeah, I was thinking the same, just hadn't got through my backlog to it yet.

We need to remove the scripts from all stable branches as well as main, though. And we'd also need to remove the calls to them from the relevant nightly.sh scripts. And there's similar stuff in https://pagure.io/fedora-iot/pungi-iot that we'd have to remove if onboarding IoT.

OK, ansible and pungi-fedora PRs are all merged now. We can monitor the next compose or two and see what happens. Not sure if we need to do anything to make the container pick up the config change. edit: ran the playbook for the app, it's re-deployed.

Looks like we're live, but the tags are not pushed under their compose version so I'll open a new issue for the cloud-image-uploader.

Ah, never mind, the images don't have bootc so something went wrong?

My bad, looks like we are not building the right thing in bodhi. Working on a PR.

Filed: https://pagure.io/cloud-image-uploader/issue/37

As it's not strictly speaking related to this issue, I'll copy/paste all that in the parent one: https://pagure.io/releng/issue/10399

It looks like the containers (https://quay.io/repository/fedora/fedora-kinoite?tab=tags) now include bootc and are thus the correct ones. So I think we can close this one. We can track the remaining steps in https://pagure.io/releng/issue/10399, notably https://pagure.io/cloud-image-uploader/issue/37.

Thanks!

well, IoT bootc images are still outstanding while we wait on https://pagure.io/fedora-iot/pungi-iot/pull-request/102 - I probably need to send a pungi patch there. I did mean to look at whether we can just publish them with the weird properties they currently have, while we work that out. Didn't get the roundtuits yet.
