#12237 Impossible to push/share new AMIs (CentOS Stream) due to reached quota
Closed: Fixed with Explanation 17 days ago by kevin. Opened 4 months ago by arrfab.

Describe what you would like us to do:

Ensure some clean-up scripts delete old/outdated AMIs from the various regions,
or at least change the permissions to unshare these, as we're now hitting a quota that prevents the CentOS Stream team from pushing CentOS Stream 10 images out (see https://issues.redhat.com/browse/CS-2561)

[+] 20241008-02:25 [scripts/images-push-ec2.sh] -> Image seems now available to modifying permissions on it ... 
An error occurred (ResourceLimitExceeded) when calling the ModifyImageAttribute operation: You have reached your quota of 1000 for the number of public images allowed in this Region. Deregister unused public images or make them private, or request an increase in your public AMIs quota.

When do you need this to be done by? (YYYY/MM/DD)

ASAP, but I'll also send mail on fedora-infra list to start the thread.
It seems we can ask AWS to increase our existing (default) quota of 1000 AMIs that can be shared, but ideally we'd just clean up old, unwanted (and insecure!) AMIs from EOL Fedora releases.

At first glance, in some AWS regions (like af-south-1) I see the following pattern
(current Fedora releases: 39 and 40):

  • fedora-coreos-{31,32,33,34,35,36,37,38}-<date> (multiple variants, nothing seems deleted/unshared)
  • Fedora-Cloud-Base-AmazonEC2.{arch}-{39,40}-<date> (multiple variants, nothing seems deleted/unshared)
  • Fedora-Cloud-Base-AmazonEC2.{arch}-{ELN,Rawhide}-<date> (multiple variants, nothing seems deleted/unshared)
  • Fedora-Cloud-Base-{33,36}

At least starting by unsharing (or deleting?) unmaintained Fedora CoreOS versions/AMIs would solve our current issue


Metadata Update from @zlopez:
- Issue priority set to: None (was: Needs Review)
- Issue tagged with: aws, high-gain, medium-trouble, ops

4 months ago

CC: @jcline @davdunc @adamwill @dustymabe

Short term, I think we should just request an increase. I can do that later this morning.

Longer term, we should indeed be better about lifecycle here. Since we are all sharing one account, we should coordinate.

I'm happy to implement a cleanup policy in the image uploader.

I've got a policy for Azure and (still incomplete) GCE images. It works a little differently on each cloud since there are some differences in features/design.

Azure

Images are organized into "definitions" which map to GCE's "family", and are for a release (fedora-cloud-40-arm64, for example). When a new image is added, a flag is set so it's excluded from being considered as the latest for the release. Every ~2 weeks, an image is promoted to be the latest by unsetting the flag that excludes it. Each image has an EOL field which we set (except for rawhide and ELN).

Images that are excluded from being the latest, along with Rawhide and ELN images regardless of setting, are removed after a week (might change this to 2 weeks). Images that aren't excluded from latest are removed when they reach their EOL date.
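As a rough sketch, the retention rules above could be expressed as a small decision function. The field names below are illustrative, not the actual fedora-image-uploader data model:

```python
from datetime import date

# Illustrative field names; not the real fedora-image-uploader data model.
def should_remove(image, today):
    """Apply the retention rules described above:
    - images excluded from 'latest', and all Rawhide/ELN images,
      are removed about a week after upload;
    - promoted (non-excluded) images are removed once past their EOL."""
    age_days = (today - image["uploaded"]).days
    if image["release"] in ("rawhide", "eln") or image["excluded_from_latest"]:
        return age_days > 7
    return image["eol"] is not None and today > image["eol"]

nightly = {"release": "rawhide", "excluded_from_latest": True,
           "uploaded": date(2024, 10, 1), "eol": None}
stable = {"release": "40", "excluded_from_latest": False,
          "uploaded": date(2024, 10, 1), "eol": date(2025, 5, 13)}
print(should_remove(nightly, date(2024, 10, 10)))  # True: Rawhide, >7 days old
print(should_remove(stable, date(2024, 10, 10)))   # False: promoted, not yet EOL
```

The same shape of function would cover GCE too, with "excluded from latest" replaced by the deprecated flag.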

GCE

Similar to Azure, except the image doesn't have a dedicated EOL field (that I've found, anyway). I add a tag with the EOL date to each image, along with a tag indicating it's managed by the fedora-image-uploader app so I don't touch other folks' images.

GCE also doesn't have an "exclude from latest" flag; instead, you can mark an image as deprecated to exclude it. New images (except for Rawhide and ELN) are uploaded marked deprecated, and then every two weeks one is promoted by marking it active. Deprecated images are removed after 14 days, active images after their EOL date passes.

AWS

I am not aware of a dedicated EOL field for AWS images, nor does AWS have a concept of an image family that I'm aware of, so we could use tags for everything. I've been applying the "end-of-life", "fedora-release", "fedora-version", "fedora-subvariant" and "fedora-compose-id" tags. I could also add an "image-owner": "fedora-image-uploader" tag so we only manage the lifecycle of those images.
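A hedged sketch of that tagging scheme: the tag keys come from the comment above, but the helper function and the sample values are hypothetical.

```python
# Tag keys match the scheme described above; the helper itself and the
# sample values below are hypothetical.
def build_image_tags(release, version, subvariant, compose_id, eol):
    tags = {
        "end-of-life": eol,
        "fedora-release": release,
        "fedora-version": version,
        "fedora-subvariant": subvariant,
        "fedora-compose-id": compose_id,
        # proposed marker so the uploader only manages its own images
        "image-owner": "fedora-image-uploader",
    }
    return [{"Key": k, "Value": v} for k, v in tags.items()]

tags = build_image_tags("40", "40", "Cloud_Base",
                        "Fedora-Cloud-40-20241008.0",  # made-up compose id
                        "2025-05-13")
# With boto3 these would then be applied as, e.g.:
#   boto3.client("ec2", region_name=region).create_tags(
#       Resources=[ami_id], Tags=tags)
print(len(tags))  # 6
```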

I'm open to any other schemes people want to implement, of course.

@jcline: for AWS, in the Fedora account there is an easy way to have resources automatically deleted after a specific time: just tag resources with key=Delete and value=<number_of_days>

It was announced by @mobrien in 2022 : https://lists.fedoraproject.org/archives/list/infrastructure@lists.fedoraproject.org/thread/HYWGJZBEFSO6OH37WQTUJYHMZDMKPSB7/

That's how CentOS Stream images are tagged with the proper FedoraGroup when they are created, and with --tags Key="Delete",Value="180" so that they'll be automatically deleted after ~6 months

Neat, I wasn't aware of that. I'll just tag the images with their EOLs then.

Is the FedoraGroup tag scheme documented anywhere?

Okay, I've adjusted the Cloud images to include the Delete key. I started with a very conservative policy (Rawhide and ELN images for 14 days, nightly stable images until EOL), but I think we'll want to adjust it, since we upload nightly, and between all the stable releases that adds up.

@davdunc what exact retention policy would you prefer? We can leave it as described above if you'd prefer to raise the limits, or we can do something else like delete most images after 14 days and every other week leave one around until EOL. I could put the time-to-live in the AMQP messages and the website could pick up the long-lived ones to update the AMIs or something like that.

Note that there are still lots of AMIs without a FedoraGroup tag. I have been writing about it on the fedora-infra mailing list for more than a year; regarding AMIs, see my emails with the subject "Deleting old AMIs in AWS".
In July I deleted (deregistered) 1014 AMIs, in June 797 AMIs, and before that 1240 AMIs...
I promised to continue in September with more recent ones. I just triggered my script and I am deleting AMIs older than 2023-06-01.
I am using this script:
https://github.com/xsuchy/fedora-infra-scripts/blob/main/delete-old-amis.py

Note that e.g. CoreOS has 742 tagged AMIs in each region. So besides cleaning up untagged AMIs, it is important that each group does its own cleanup.

One additional note: deleting the AMI is one step, but you should make sure that the associated snapshot is deleted, too.
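A minimal boto3-shaped sketch of that two-step cleanup, assuming the image dict has the shape returned by describe_images. The deregister-then-delete ordering matters, since (as noted later in this thread) AWS refuses to delete a snapshot backing a registered AMI:

```python
# The extraction helper is pure and works on the dict shape returned by
# boto3's describe_images; the ec2 calls below show the intended usage.
def backing_snapshot_ids(image):
    """Collect EBS snapshot IDs referenced by an image's block devices."""
    return [bdm["Ebs"]["SnapshotId"]
            for bdm in image.get("BlockDeviceMappings", [])
            if "SnapshotId" in bdm.get("Ebs", {})]

def deregister_with_snapshots(ec2, image):
    # Deregister first: the snapshots cannot be deleted while the AMI
    # is still registered.
    snap_ids = backing_snapshot_ids(image)
    ec2.deregister_image(ImageId=image["ImageId"])
    for snap_id in snap_ids:
        ec2.delete_snapshot(SnapshotId=snap_id)

sample = {"ImageId": "ami-0123456789abcdef0",
          "BlockDeviceMappings": [
              {"DeviceName": "/dev/sda1",
               "Ebs": {"SnapshotId": "snap-0123456789abcdef0"}},
              {"DeviceName": "/dev/sdb", "VirtualName": "ephemeral0"}]}
print(backing_snapshot_ids(sample))  # ['snap-0123456789abcdef0']
```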

BTW, to update this ticket:

Limits for account 125523088429 increased in all regions. Max number of images is now ~11600 in every region.

So, related to this, we have a 'clean-amis' cron job:

roles/releng/templates/clean-amis.j2

that runs a releng script:

releng/scripts/clean-amis.py

I am not sure if it's actually doing anything, but we should stop it/clean it up if we are handling this in the cloud-image uploader.

One additional note: deleting the AMI is one step, but you should make sure that the associated snapshot is deleted, too.

Oh, good point. I don't see how to tag these in the AWS APIs. I tried doing so during the import_snapshot call (I'm using boto3) and the API wouldn't let me apply tags to the 'snapshot' resource. There also doesn't seem to be an API to modify the snapshot with tags after the fact.

Is it safe to delete the snapshot after we create an AMI from it? If so, I could try to do that. Obviously we'll still leak snapshots when there's an outage between the snapshot import and when we're done creating the AMI.

As for how to tag a snapshot, see https://github.com/xsuchy/fedora-infra-scripts/blob/main/label-ami-id.py#L34
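For illustration, the approach the linked script takes could be sketched like this: once the AMI is registered, look up its backing snapshot IDs and pass them to create_tags, which (unlike the import_snapshot call) does accept snapshot resources. The FakeEC2 stand-in below exists only to make the sketch self-contained:

```python
class FakeEC2:
    """Minimal stand-in for a boto3 EC2 client, for illustration only."""
    def __init__(self, image):
        self.image, self.tagged = image, []
    def describe_images(self, ImageIds):
        return {"Images": [self.image]}
    def create_tags(self, Resources, Tags):
        self.tagged.extend(Resources)

def tag_image_and_snapshots(ec2, ami_id, tags):
    """Tag a registered AMI and its backing snapshots in one call."""
    image = ec2.describe_images(ImageIds=[ami_id])["Images"][0]
    snap_ids = [bdm["Ebs"]["SnapshotId"]
                for bdm in image.get("BlockDeviceMappings", [])
                if "SnapshotId" in bdm.get("Ebs", {})]
    ec2.create_tags(Resources=[ami_id] + snap_ids, Tags=tags)
    return snap_ids

ec2 = FakeEC2({"ImageId": "ami-1",
               "BlockDeviceMappings": [{"Ebs": {"SnapshotId": "snap-1"}}]})
print(tag_image_and_snapshots(ec2, "ami-1",
                              [{"Key": "Delete", "Value": "180"}]))
# ['snap-1']
```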

A snapshot cannot be deleted if there is a registered AMI based on it.

So if you label the AMI, no one will purge it, and because that AMI still exists, no one can delete the snapshot either; when I try to do that, AWS gives me an error.
I still recommend labeling it, though.

So, related to this, we have a 'clean-amis' cron job:

I checked the script. It deletes AMIs that have the LaunchPermissionRevoked tag set. I checked two regions and no AMI there has this tag. Not sure if that is the result of this script running, or whether no one uses this tag nowadays.

Yeah, this may be something that fedimg was doing. I think we can just retire/remove this script after the freeze and move to the new methods.

I have removed that script.

Was there anything further to do on tagging snapshots?

I think any further cleanup work here can be done in cloud-image-uploader.

Since the actual issue was solved long ago, closing now.

Metadata Update from @kevin:
- Issue close_status updated to: Fixed with Explanation
- Issue status updated to: Closed (was: Open)

17 days ago
