#11099 IoT composes consistently embedding older ostrees in the installer image
Opened 3 months ago by adamwill. Modified 3 months ago

We have a test in openQA which checks the ostree in an ostree installer image has the same date as the compose - i.e. it's checking to make sure the 'correct' ostree was embedded, not an old one.

This seems to be failing consistently on Fedora-IoT composes since 20221002.0. They always seem to get an 'old' ostree:

1002.0 installer image had 0929.0 ostree
1004.0 installer image had 1002.0 ostree
1005.0 installer image had 1004.0 ostree
1007.0 installer image had 1006.0 ostree
1011.0 installer image had 1007.0 ostree
1012.0 installer image had 1011.0 ostree
1014.0 installer image had 1013.0 ostree
1015.0 installer image had 1014.0 ostree
1018.0 installer image had 1015.0 ostree
1019.0 installer image had 1018.0 ostree

That's almost perfectly "successful compose N installer image has successful compose N-1 ostree", except a couple of cases where the embedded ostree was newer than the previous successful compose, which are probably explained by the intermediate compose's ostree build succeeding but some other failure making the compose fail.

I'm wondering if there's some kind of race here. What actually happens when we build an ostree installer image, AFAICT, is that this dracut template gets run, which shakes out something like this (from the 1019.0 log):

executing /mnt/koji/compose/iot/Fedora-IoT-37-20221019.0/work/x86_64/IoT/lorax_templates/ostree-based-installer/lorax-embed-repo.tmpl with variables={'ostree_install_repo': 'https://kojipkgs.fedoraproject.org/compose/iot/repo/', 'ostree_update_repo': 'https://ostree.fedoraproject.org/iot/', 'ostree_osname': 'fedora-iot', 'ostree_oskey': 'fedora-37-primary', 'ostree_contenturl': 'mirrorlist=https://ostree.fedoraproject.org/iot/mirrorlist', 'ostree_install_ref': 'fedora/devel/x86_64/iot', 'ostree_update_ref': 'fedora/devel/x86_64/iot', 'arch': {'buildarch': 'x86_64', 'basearch': 'x86_64', 'libdir': 'lib64', 'bcj': 'x86'}, 'product': {'name': 'fedora-iot', 'version': '37', 'release': '20221019.0', 'variant': 'IoT', 'bugurl': 'your distribution provided bug reporting tool', 'isfinal': True}, 'dbo': <dnf.base.Base object at 0x7f92224bfd90>, 'root': '/var/tmp/lorax/lorax.jduke_i7/installtree', 'basearch': 'x86_64', 'libdir': 'lib64', 'branding': {'release': 'fedora-release-iot', 'logos': 'fedora-logos'}, 'exists': <function LoraxTemplateRunner.__init__.<locals>.<lambda> at 0x7f92224b71a0>, 'glob': <function LoraxTemplateRunner.__init__.<locals>.<lambda> at 0x7f92224b6fc0>}
running /mnt/koji/compose/iot/Fedora-IoT-37-20221019.0/work/x86_64/IoT/lorax_templates/ostree-based-installer/lorax-embed-repo.tmpl
template line 1: runcmd mkdir /var/tmp/lorax/lorax.jduke_i7/installtree/ostree
running command: ('mkdir', '/var/tmp/lorax/lorax.jduke_i7/installtree/ostree')
command finished successfully
template line 2: runcmd ostree --repo=/var/tmp/lorax/lorax.jduke_i7/installtree/ostree/repo init --mode=bare
running command: ('ostree', '--repo=/var/tmp/lorax/lorax.jduke_i7/installtree/ostree/repo', 'init', '--mode=bare')
command finished successfully
template line 3: runcmd ostree --repo=/var/tmp/lorax/lorax.jduke_i7/installtree/ostree/repo remote add fedora-iot --set=gpg-verify=false https://kojipkgs.fedoraproject.org/compose/iot/repo/
running command: ('ostree', '--repo=/var/tmp/lorax/lorax.jduke_i7/installtree/ostree/repo', 'remote', 'add', 'fedora-iot', '--set=gpg-verify=false', 'https://kojipkgs.fedoraproject.org/compose/iot/repo/')
command finished successfully
template line 4: runcmd ostree --repo=/var/tmp/lorax/lorax.jduke_i7/installtree/ostree/repo pull --mirror fedora-iot fedora/devel/x86_64/iot
running command: ('ostree', '--repo=/var/tmp/lorax/lorax.jduke_i7/installtree/ostree/repo', 'pull', '--mirror', 'fedora-iot', 'fedora/devel/x86_64/iot')
command output:
2941 metadata, 19732 content objects fetched; 614395 KiB transferred in 24 seconds; 1.0?GB content written

so, more or less, we're embedding whichever ostree is currently tagged as fedora/devel/x86_64/iot, I think.

I don't know how an ostree gets marked as the "current" devel one, exactly. I'm wondering if there's a race there? The ostree build task runs and succeeds, then the ostree installer task runs, but it's running before whatever needs to happen for the just-built ostree to get tagged as the "current" devel one, so it embeds the previous one?

Oddly, this doesn't seem to be happening for Silverblue and Kinoite ostree installer images in the regular Fedora compose. Still, I guess that could possibly support the 'race' theory. For instance, for Silverblue in the 20221016.n.0 Rawhide compose, the ostree task completed at Sun, 16 Oct 2022 06:01:41 UTC and the ostree-installer task started at Sun, 16 Oct 2022 06:49:29 UTC , so there was a 48 minute gap. For the 20221019.0 IoT compose, the ostree task completed at Wed, 19 Oct 2022 07:46:05 UTC and the ostree installer task started at Wed, 19 Oct 2022 07:46:50 UTC , so there was only a 45 second gap...


sadly I can't easily check if the gap between the tasks was longer back when this was working OK, because I'm getting the task IDs out of the compose logs and we've garbage collected F37 IoT composes up to 1004.0 :| Finding the tasks directly in Koji is a nightmare because they're just runroot tasks, there are zillions of those, and the web UI doesn't provide enough query options to narrow it down.

Metadata Update from @phsmoura:
- Issue tagged with: high-gain, high-trouble, ops

3 months ago

Login to comment on this ticket.

Metadata
Boards 1
Ops Status: Backlog