#11195 local delete_old_oci_images.py broken by new docker-distribution upgrade
Closed: Fixed a year ago by ryanlerch. Opened a year ago by kevin.

Upgrading docker-distribution on oci-candidate-registry01.iad2.fedoraproject.org (from 2.6.2 to 2.8.1) fixes a garbage collection issue, but then it breaks this module we have in ansible:

library/delete_old_oci_images.py

called from playbooks/manual/oci-registry-prune.yml

gives

<compose-x86-01.iad2.fedoraproject.org> (1, b'', b'Traceback (most recent call last):\n  File "<stdin>", line 107, in <module>\n  File "<stdin>", line 99, in _ansiballz_main\n  File "<stdin>", line 47, in invoke_module\n  File "<frozen runpy>", line 226, in run_module\n  File "<frozen runpy>", line 98, in _run_module_code\n  File "<frozen runpy>", line 88, in _run_code\n  File "/tmp/ansible_delete_old_oci_images_payload_4c6teiuz/ansible_delete_old_oci_images_payload.zip/ansible/modules/delete_old_oci_images.py", line 166, in <module>\n  File "/tmp/ansible_delete_old_oci_images_payload_4c6teiuz/ansible_delete_old_oci_images_payload.zip/ansible/modules/delete_old_oci_images.py", line 144, in main\nTypeError: \'NoneType\' object is not subscriptable\n')
<compose-x86-01.iad2.fedoraproject.org> Failed to connect to the host via ssh: Traceback (most recent call last):
  File "<stdin>", line 107, in <module>
  File "<stdin>", line 99, in _ansiballz_main
  File "<stdin>", line 47, in invoke_module
  File "<frozen runpy>", line 226, in run_module
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "/tmp/ansible_delete_old_oci_images_payload_4c6teiuz/ansible_delete_old_oci_images_payload.zip/ansible/modules/delete_old_oci_images.py", line 166, in <module>
  File "/tmp/ansible_delete_old_oci_images_payload_4c6teiuz/ansible_delete_old_oci_images_payload.zip/ansible/modules/delete_old_oci_images.py", line 144, in main
TypeError: 'NoneType' object is not subscriptable

So, we need to fix it up.

Can be duplicated with:

ansible-playbook -vvv /srv/web/infra/ansible/playbooks/manual/oci-registry-prune.yml -l oci-candidate-registry01.iad*

Perhaps @zlopez could take a look? or anyone else...


Metadata Update from @kevin:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops

a year ago

So I dug into this. The script expects the blob JSON it gets from "resp = s.get("{}/v2/{}/blobs/{}".format(registry, repo, digest))" to have a created key and value at the top level.

For some reason, on a handful of builds (fedora-kinoite and fedora-silverblue), the created key is not in the JSON.

I'll paste an example of the JSON we get back below. In this case, I think I am going to update the script to take the date from the first item in the history list and use that as the date to check age by.

{
    "architecture": "amd64",
    "config": {
        "Cmd": ["/usr/bin/bash"],
        "Labels": {
            "org.opencontainers.image.version": "Rawhide.20230426.n.0",
            "ostree.bootable": "true",
            "ostree.commit": "7b34259d07d276a0fbb4dbe8cf3578a2ae2ec7d6abf3c668c7bec1732b50a7bb",
            "ostree.final-diffid": "sha256:6e107dfe6a72a3825b8a7d784f007e8f361a7ba417b6e7b26c45c70491db1b74",
            "ostree.linux": "6.4.0-0.rc0.20230425git173ea743bf7a.3.fc39.x86_64",
            "rpmostree.inputhash": "c53e72f6a1df69e56ad5c2228ccd5083a3a153a9c93e2342ab9ecf4a2ce7f462",
            "version": "Rawhide.20230426.n.0",
        },
    },
    "history": [
        {
            "created": "2023-04-26T12:57:04Z",
            "created_by": "ostree export of commit 7b34259d07d276a0fbb4dbe8cf3578a2ae2ec7d6abf3c668c7bec1732b50a7bb",
        },
        {
            "created": "2023-04-26T12:57:04Z",
            "created_by": "firefox-112.0.1-1.fc39.x86_64",
        },
        {
            "created": "2023-04-26T12:57:04Z",
            "created_by": "glibc-all-langpacks-2.37.9000-7.fc39.x86_64",
        },

<<< SNIP >>>

        {"created": "2023-04-26T12:57:04Z", "created_by": "qt-1:4.8.7-71.fc38.x86_64"},
        {
            "created": "2023-04-26T12:57:04Z",
            "created_by": "mariadb-3:10.5.18-2.fc39.x86_64",
        },
        {"created": "2023-04-26T12:57:04Z", "created_by": "1435 components"},
    ],
    "os": "linux",
    "rootfs": {
        "diff_ids": [
            "sha256:9a0e1f0bdfdc9789c8af5eec8cfb2ddd3e0d7d783f18e273c66530369340c9f3",

<<< SNIP >>>

            "sha256:6e107dfe6a72a3825b8a7d784f007e8f361a7ba417b6e7b26c45c70491db1b74",
        ],
        "type": "layers",
    },
}
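A minimal sketch of the fallback described above (use the first history entry's date when the top-level created key is missing). The function name image_created is hypothetical, and this is not the change that actually went into delete_old_oci_images.py; it only assumes the blob has the shape shown in the sample.

```python
from typing import Optional


def image_created(config: dict) -> Optional[str]:
    """Return a creation timestamp for an image config blob, or None."""
    # The top-level "created" key is optional, so never index it directly.
    created = config.get("created")
    if created:
        return created
    # Fall back to the first history entry that carries a timestamp.
    for entry in config.get("history") or []:
        if entry.get("created"):
            return entry["created"]
    return None


# Trimmed-down version of the sample blob: no top-level "created".
sample = {
    "architecture": "amd64",
    "history": [
        {"created": "2023-04-26T12:57:04Z", "created_by": "1435 components"},
    ],
    "os": "linux",
}
print(image_created(sample))
```

Returning None instead of raising leaves the caller free to decide whether an undated image gets skipped or handled some other way.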

Ok, I have updated the script, and it now just skips the offending images (currently 4), printing the following in the debug output when the playbook runs.

"Could not get date for fedora-kinoite:sha256:105d887f6c01dc163b8bea6a5168e2dee1b851409eb172b461cafe3654a116e7 -- skipping",                                                                                                         
"Could not get date for fedora-kinoite:sha256:105d887f6c01dc163b8bea6a5168e2dee1b851409eb172b461cafe3654a116e7 -- skipping",                                                                                                         
"Could not get date for fedora-silverblue:sha256:78e559cc27d6ea6deab916ab497cf779a080fc230352c085b535ca656403c733 -- skipping",                                                                                                      
"Could not get date for fedora-silverblue:sha256:78e559cc27d6ea6deab916ab497cf779a080fc230352c085b535ca656403c733 -- skipping", 

I wasn't confident about using the alternative date to delete these images, since I wasn't really sure why the JSON for those images didn't have the date key, so I decided to just skip them for now.
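Roughly, the skip behaviour could look like the sketch below. This is an illustration only, not the code that was committed: split_by_date and the (repo, digest, config) tuple layout are made up for the example, and only the message format is taken from the debug output above.

```python
def split_by_date(images):
    """Separate images that have a creation date from those that do not.

    images: iterable of (repo, digest, config) tuples, where config is the
    parsed config blob (a dict) or None if nothing usable came back.
    """
    dated, messages = [], []
    for repo, digest, config in images:
        # Guard against both a missing blob and a missing "created" key.
        created = (config or {}).get("created")
        if created is None:
            messages.append(
                "Could not get date for {}:{} -- skipping".format(repo, digest)
            )
            continue
        dated.append((repo, digest, created))
    return dated, messages


images = [
    ("fedora", "sha256:aaaa", {"created": "2023-04-26T12:57:04Z"}),
    ("fedora-kinoite", "sha256:bbbb", {"history": []}),
    ("fedora-silverblue", "sha256:cccc", None),
]
dated, messages = split_by_date(images)
for line in messages:
    print(line)
```

Only the entries in dated would then be considered for deletion; everything else is left alone and reported.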

Also, I haven't actually run the playbook proper (I wasn't sure what the procedure is for running it); I just tested it out with --check, i.e.:

ansible-playbook -vvv /srv/web/infra/ansible/playbooks/manual/oci-registry-prune.yml --check -l oci-candidate-registry01.iad*

Metadata Update from @ryanlerch:
- Issue assigned to ryanlerch

a year ago

We should probably check why those images don't have a date.

Looking at this more closely, I agree that using the first item in the history makes sense.

These images are very likely from this: https://fedoraproject.org/wiki/Changes/OstreeNativeContainer

So, I am not sure if they should have a date here or if that's expected. ;)

@walters ?

> Also, I haven't actually run the playbook proper (I wasn't sure what the procedure is for running it); I just tested it out with --check, i.e.:
>
> ansible-playbook -vvv /srv/web/infra/ansible/playbooks/manual/oci-registry-prune.yml --check -l oci-candidate-registry01.iad*

Yes, that's it. I think if you want to run it that's fine, or if you want I can... ;)

At the current time (pun intentional) we don't set a creation date in the interest of reproducible builds for our ostree-bootable container images.

Conversely, AFAIK our scripts that build the non-ostree container images today (e.g. quay.io/fedora/fedora) will always generate a new tarball (because of timestamps written by RPM into the tarball), and even if that were fixed it would still appear to be a new image because of the manifest creation timestamp.

The date field is optional, so the script should indeed handle this.

(As an aside, this type of stuff is why I think languages without null/None, e.g. Rust, are better; podman had the same bug: https://github.com/containers/podman/pull/12936 )

Now, back up to a higher level, I think for a garbage collection policy...for OpenShift with imagestream objects we track the time that a tag was created, distinct from the image creation time. It might be that docker/distribution tracks the time an image was pushed? Needs investigation.

Alternatively, we could probably just GC anything that isn't tagged once a month.

Ok, so ran the playbook, and it cleaned up all but the 4 previously mentioned fedora-kinoite and fedora-silverblue images.

@walters I'm a little green on how all this works, but are you saying that we shouldn't compare on the first item in the history object (in the sample JSON I provided above) to see if the image is older than 30 days (i.e. when the image doesn't have a creation date)?
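For reference, the 30-day comparison being asked about could be sketched like this. The helper name is_older_than is hypothetical, and the sketch assumes the timestamps are UTC in the same "YYYY-MM-DDTHH:MM:SSZ" form as the sample JSON; a timestamp carrying a non-UTC offset would need real timezone parsing.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_older_than(created: str, days: int = 30,
                  now: Optional[datetime] = None) -> bool:
    """Return True if the timestamp is more than `days` days before `now`."""
    # Keep only "YYYY-MM-DDTHH:MM:SS" so a trailing "Z" or fractional
    # seconds do not break parsing. Assumption: the value is in UTC.
    stamp = datetime.strptime(created[:19], "%Y-%m-%dT%H:%M:%S")
    stamp = stamp.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return now - stamp > timedelta(days=days)


# Fixed reference time so the example is reproducible.
reference = datetime(2023, 6, 1, tzinfo=timezone.utc)
print(is_older_than("2023-04-26T12:57:04Z", 30, reference))
print(is_older_than("2023-05-20T00:00:00Z", 30, reference))
```

Passing now explicitly keeps the check testable; whether a history date is a safe stand-in for the image's age is exactly the open question here.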

Wait, reading https://pagure.io/fedora-infra/ansible/blob/main/f/library/delete_old_oci_images.py I'm confused...the code is talking about deleting blobs but I don't think it should be doing that, the GC handling for those must be docker-distribution's job. xref https://github.com/distribution/distribution/blob/main/docs/garbage-collection.md

okay! going to close this one as the original traceback is now fixed.

if we want to re-write this script to do it in a more correct way, we will create a new ticket.

Metadata Update from @ryanlerch:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

a year ago

Metadata
Boards 1
ops Status: Backlog