#11099 IoT composes consistently embedding older ostrees in the installer image
Opened a year ago by adamwill. Modified a year ago

We have a test in openQA which checks the ostree in an ostree installer image has the same date as the compose - i.e. it's checking to make sure the 'correct' ostree was embedded, not an old one.

This seems to be failing consistently on Fedora-IoT composes since 20221002.0. They always seem to get an 'old' ostree:

1002.0 installer image had 0929.0 ostree
1004.0 installer image had 1002.0 ostree
1005.0 installer image had 1004.0 ostree
1007.0 installer image had 1006.0 ostree
1011.0 installer image had 1007.0 ostree
1012.0 installer image had 1011.0 ostree
1014.0 installer image had 1013.0 ostree
1015.0 installer image had 1014.0 ostree
1018.0 installer image had 1015.0 ostree
1019.0 installer image had 1018.0 ostree

That's almost perfectly "successful compose N installer image has successful compose N-1 ostree", except a couple of cases where the embedded ostree was newer than the previous successful compose, which are probably explained by the intermediate compose's ostree build succeeding but some other failure making the compose fail.

I'm wondering if there's some kind of race here. What actually happens when we build an ostree installer image, AFAICT, is that this dracut template gets run, which shakes out something like this (from the 1019.0 log):

executing /mnt/koji/compose/iot/Fedora-IoT-37-20221019.0/work/x86_64/IoT/lorax_templates/ostree-based-installer/lorax-embed-repo.tmpl with variables={'ostree_install_repo': 'https://kojipkgs.fedoraproject.org/compose/iot/repo/', 'ostree_update_repo': 'https://ostree.fedoraproject.org/iot/', 'ostree_osname': 'fedora-iot', 'ostree_oskey': 'fedora-37-primary', 'ostree_contenturl': 'mirrorlist=https://ostree.fedoraproject.org/iot/mirrorlist', 'ostree_install_ref': 'fedora/devel/x86_64/iot', 'ostree_update_ref': 'fedora/devel/x86_64/iot', 'arch': {'buildarch': 'x86_64', 'basearch': 'x86_64', 'libdir': 'lib64', 'bcj': 'x86'}, 'product': {'name': 'fedora-iot', 'version': '37', 'release': '20221019.0', 'variant': 'IoT', 'bugurl': 'your distribution provided bug reporting tool', 'isfinal': True}, 'dbo': <dnf.base.Base object at 0x7f92224bfd90>, 'root': '/var/tmp/lorax/lorax.jduke_i7/installtree', 'basearch': 'x86_64', 'libdir': 'lib64', 'branding': {'release': 'fedora-release-iot', 'logos': 'fedora-logos'}, 'exists': <function LoraxTemplateRunner.__init__.<locals>.<lambda> at 0x7f92224b71a0>, 'glob': <function LoraxTemplateRunner.__init__.<locals>.<lambda> at 0x7f92224b6fc0>}
running /mnt/koji/compose/iot/Fedora-IoT-37-20221019.0/work/x86_64/IoT/lorax_templates/ostree-based-installer/lorax-embed-repo.tmpl
template line 1: runcmd mkdir /var/tmp/lorax/lorax.jduke_i7/installtree/ostree
running command: ('mkdir', '/var/tmp/lorax/lorax.jduke_i7/installtree/ostree')
command finished successfully
template line 2: runcmd ostree --repo=/var/tmp/lorax/lorax.jduke_i7/installtree/ostree/repo init --mode=bare
running command: ('ostree', '--repo=/var/tmp/lorax/lorax.jduke_i7/installtree/ostree/repo', 'init', '--mode=bare')
command finished successfully
template line 3: runcmd ostree --repo=/var/tmp/lorax/lorax.jduke_i7/installtree/ostree/repo remote add fedora-iot --set=gpg-verify=false https://kojipkgs.fedoraproject.org/compose/iot/repo/
running command: ('ostree', '--repo=/var/tmp/lorax/lorax.jduke_i7/installtree/ostree/repo', 'remote', 'add', 'fedora-iot', '--set=gpg-verify=false', 'https://kojipkgs.fedoraproject.org/compose/iot/repo/')
command finished successfully
template line 4: runcmd ostree --repo=/var/tmp/lorax/lorax.jduke_i7/installtree/ostree/repo pull --mirror fedora-iot fedora/devel/x86_64/iot
running command: ('ostree', '--repo=/var/tmp/lorax/lorax.jduke_i7/installtree/ostree/repo', 'pull', '--mirror', 'fedora-iot', 'fedora/devel/x86_64/iot')
command output:
2941 metadata, 19732 content objects fetched; 614395 KiB transferred in 24 seconds; 1.0?GB content written

so, more or less, we're embedding whichever ostree is currently tagged as fedora/devel/x86_64/iot, I think.

I don't know how an ostree gets marked as the "current" devel one, exactly. I'm wondering if there's a race there? The ostree build task runs and succeeds, then the ostree installer task runs, but it's running before whatever needs to happen for the just-built ostree to get tagged as the "current" devel one, so it embeds the previous one?

Oddly, this doesn't seem to be happening for Silverblue and Kinoite ostree installer images in the regular Fedora compose. Still, I guess that could possibly support the 'race' theory. For instance, for Silverblue in the 20221016.n.0 Rawhide compose, the ostree task completed at Sun, 16 Oct 2022 06:01:41 UTC and the ostree-installer task started at Sun, 16 Oct 2022 06:49:29 UTC , so there was a 48 minute gap. For the 20221019.0 IoT compose, the ostree task completed at Wed, 19 Oct 2022 07:46:05 UTC and the ostree installer task started at Wed, 19 Oct 2022 07:46:50 UTC , so there was only a 45 second gap...

sadly I can't easily check if the gap between the tasks was longer back when this was working OK, because I'm getting the task IDs out of the compose logs and we've garbage collected F37 IoT composes up to 1004.0 :| Finding the tasks directly in Koji is a nightmare because they're just runroot tasks, there are zillions of those, and the web UI doesn't provide enough query options to narrow it down.

Metadata Update from @phsmoura:
- Issue tagged with: high-gain, high-trouble, ops

a year ago

This is still an issue.

could someone just stick an artificial 15 minute wait in the compose process or something to at least confirm that's the issue?

It's not a race condition in that the tasks would be started in wrong order. Pungi starts and waits for all ostree tasks to finish, and only then starts submitting tasks for the installer.

In the IoT case, the installer task gets the ostree content from kojipkgs.fedoraproject.org via http://. Could it be some caching in httpd or something like that?

A dirty hack to inject sleep after the ostree repo is updated would be to put time.sleep() at the end of file /usr/lib/python3.*/site-packages/pungi/scripts/wait_for_signed_ostree_handler.py. That way it would only wait once the ostree repo is updated and signed.

the hack time.sleep(15*60) is in place on compose-iot01

thanks, we'll see how it goes in the next couple of composes.

Hmm, it doesn't seem like that solved it :/ I don't recall exactly how i found the tasks before, but I think https://koji.fedoraproject.org/koji/taskinfo?taskID=98664398 is the x86_64 ostree build task from the 20230313.1 compose, and https://koji.fedoraproject.org/koji/taskinfo?taskID=98666941 is the ostree-installer task...there's 30 minutes between the ostree task ending and the ostree-installer task starting, but the bug still happened, the installer has the 20230313.0 ostree embedded :(

Login to comment on this ticket.

Boards 1
Ops Status: Backlog