#5636 Use a staging area for test composes and safe practices when naming them
Closed: Fixed. Opened 10 years ago by kparal.

Summary: There was a hiccup with the F19 TC3 compose. First it was created with an old anaconda, then it was deleted and created with a new anaconda. The announcement was sent only afterwards. Unfortunately the mirrors do not heed announcements; they periodically sync everything. In the Brno office, our mirror synced the old TC3 -> we came to work -> saw the TC3 announcement -> saw TC3 on our local mirror -> tested the (outdated) TC3 version for the whole day -> and only by pure chance learned at the end of the day that we had been using an outdated compose (and had wasted a day's work).

I don't mean to blame anyone. It was unfortunate, and I'd like to improve our workflow to minimize the chance of that happening again.

I have heard a solution "If you download images before announcement, make sure they match the upstream checksums after the announcement is out". First, a lot of people (e.g. us) don't download images manually, our mirror does. So this essentially means checking timestamps or checksums for every compose that we sync. That 1) requires some time spent for dozens of individuals 2) is error-prone, because nobody will do that every time, like a machine. Can we do better?
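
For reference, the manual check suggested above could be scripted roughly as follows. This is only a sketch: the function names `sha256_of` and `stale_files` are made up for illustration, SHA-256 is assumed as the hash, and the expected hashes are assumed to have been parsed already from the upstream checksum file (that parsing is not shown).

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stale_files(mirror_dir, upstream_checksums):
    """Return the names whose local copy is missing or differs from upstream.

    upstream_checksums maps a file name to its expected hex digest
    (hypothetical input, taken from the upstream checksum file).
    """
    stale = []
    for name, expected in sorted(upstream_checksums.items()):
        local = Path(mirror_dir) / name
        if not local.is_file() or sha256_of(local) != expected.lower():
            stale.append(name)
    return stale
```

Even with a script like this, someone still has to remember to run it after every announcement, which is the error-prone part.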

I see two very simple solutions and they can be combined:
1. Use a "staging area" for new composes. It's really simple - create a directory that is not world-readable (thus not mirrored) and create the compose there. Once you are sure everything looks OK, move the compose to the correct directory. The other benefit is that this is nearly atomic, thus making sure we don't have half-synced mirrors.
2. Once something is public, don't change its contents. It might happen that we publish something in error. But this is the same situation as with source tarballs for software projects. You can never change a published tarball, because you never know whether somebody downloaded it in the (albeit short) meantime. You can delete it, sure, but the fixed file/tarball/compose should always have a bumped version. There's no problem in that, really. If TC3 is published in error, delete it if you want, and create TC3.1 or TC4. It's just a safety practice and we don't mind, really, quite the opposite. This is risk prevention.
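
To illustrate the naming rule in point 2, here is a minimal sketch that picks the next label from the set of labels that were ever published, including ones deleted later. The function name `next_compose_label` and the idea of keeping such a history are assumptions for the example, not existing releng tooling.

```python
import re


def next_compose_label(ever_published, kind="TC"):
    """Return the next unused label of the given kind, e.g. 'TC4'.

    ever_published holds every label that was ever public, even if the
    compose was deleted afterwards. Point releases such as 'TC3.1' count
    as using number 3.
    """
    pattern = re.compile(rf"^{re.escape(kind)}(\d+)(?:\.\d+)?$")
    used = []
    for label in ever_published:
        match = pattern.match(label)
        if match:
            used.append(int(match.group(1)))
    return f"{kind}{max(used, default=0) + 1}"
```

With TC1 through TC3 in the history, this returns TC4 even if TC3 has since been deleted, so a name that was once public is never reused for different content.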

If we combine these approaches, I believe there is no actual increase in your workload (these are just work approaches, but no extra load) and we increase the reliability of our tools and processes.

Thanks.


If we go with 1), it would be nice if I could have shell access to the staging area on secondary01 so I can start making disos ASAP. If the contents change, I'll recreate them (I do that already).

A downside is that people with low bandwidth but no caps/metering might want to get a head start on the download and would not be able to do that anymore. Could the staging area be world-readable, but in a different location, so the final location only can be mirrored, and people downloading manually can go to the staging area (with warnings to compare checksum files before testing)? That would also address my concern in the last comment.

If we go with a public staging area, I'd like to create the disos there as well, and make them visible while building, so people can download earlier if they want (but know that the content isn't guaranteed correct until the announcement, when I'd move it to the usual place).

I think option 1 makes sense and should work fine.

I don't think there's much point in worrying about making disos until the content is in place.
If we tried to make them from the staging area we could just run into the same problem we are trying to solve, no?

content is being staged, you start making disos, some mistake is found, recompose is started, but people start downloading bad/invalid/not released disos?

So, I think we should follow option 1, use a staging area and disos can be done once that content is in place.
I would think waiting a few minutes really wouldn't be that much burden.

Replying to [comment:4 kevin]:

> content is being staged, you start making disos, some mistake is found, recompose is started, but people start downloading bad/invalid/not released disos?

People would know that anything in the staging area isn't guaranteed correct until moved to the final location. It could be visible on a trial basis. If too many people ignore the warnings about checking their downloads and waste time filing invalid bugs, we could then hide the staging area.

> So, I think we should follow option 1, use a staging area and disos can be done once that content is in place.
> I would think waiting a few minutes really wouldn't be that much burden.

It usually takes closer to an hour to both create and verify the disos, not a few minutes. For Alpha TC1 it's more like 2 hours.

We have this in place now. Once synced from the staging area, it should be live and ready to announce/etc.

Thanks!

By the way, once the sync is complete, could it also be announced using fedmsg? Or is it already?

Sometimes content is available before the ticket is closed; sometimes empty dirs are visible but nothing else. Things were more predictable before. Should this be reopened, at least until stable procedures are in place to fix it? (My suggestion would be to sync the content to a different hidden dir: for example, make a subdir called .tmp/ in stage/ and make sure it's not world-readable, so it's not mirrored. Then, when done, look the content over manually, and both mv it to the public location and close the ticket at the same time. There's virtually nothing that can go wrong with that procedure.)

For example, if creating 20-Beta-TC4, create 20-Beta-TC4/ and its contents in stage/.tmp/ instead of stage/, look it over, then "mv 20-Beta-TC4 .." and close the ticket.
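
The same procedure as a rough sketch in script form. The helper names `hidden_staging_dir` and `publish` are hypothetical, and it assumes stage/ and stage/.tmp/ are on the same filesystem, so that the final move is a single rename rather than a copy.

```python
import os
from pathlib import Path


def hidden_staging_dir(stage):
    """Create stage/.tmp/ readable only by its owner, so mirrors cannot see it."""
    tmp = Path(stage) / ".tmp"
    tmp.mkdir(mode=0o700, exist_ok=True)
    os.chmod(tmp, 0o700)  # enforce the mode even if the directory already existed
    return tmp


def publish(stage, name):
    """Move stage/.tmp/<name> to stage/<name> once it has been looked over."""
    src = Path(stage) / ".tmp" / name
    dst = Path(stage) / name
    if dst.exists():
        # Never replace something that is already public; bump the name instead.
        raise FileExistsError(f"{dst} is already published")
    os.rename(src, dst)  # one rename on the same filesystem, no half-copied state
    return dst
```

The compose is built and checked entirely under stage/.tmp/, and `publish(stage, "20-Beta-TC4")` is the only step that makes anything visible. The existence check and the rename are two separate steps, so this assumes only one person or script publishes at a time.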

This is as fixed as it's going to be.

Dennis, status shuffling is not helpful. Could you please clarify what the current (new) practice is and what is preventing you from improving it? Thanks.

I have to confirm I saw some empty dirs created in advance for some of the previous composes, so the problem is really not fully fixed yet. Some of our QA processes require manual intervention and it's very problematic and inefficient for us if we can't rely on compose consistency. Andre Robatino creates deltaisos, zsync files and wiki test matrices for us, and I'm very grateful for that. We would love to automate that, but for that we need some automation (or at least safe practices) on your side. We need staging areas, we need stable naming policy and we need fedmsg notifications. Or at least some of that.

I would be willing to devote some of my time to help you improve your releng scripts in these regards. As a consequence we would save a lot of time in our QA tasks. Where can I find the scripts, please?

As a side note, I have a couple of wishes. If a compose (say TC3) fails to compose:
a) could we not make the compose public? There's no reason for it, it just wastes the bandwidth for infra and for all people syncing it. Just delete it and start again.
b) could the next compose be again named TC3? (that implies a) of course)

Thanks for info.

The main issue for me is that bad composes are visible. They should never be, not even temporarily. What I described in https://fedorahosted.org/rel-eng/ticket/5636#comment:8 is one easy and reliable way to do this (I've been using it in making disos and have never made anything broken visible by mistake). So it's certainly possible. A lesser issue is not having consecutive numbers for composes. I can live with that, the only issue for me is that I now have to wait until the TC/RC ticket is closed before making the wiki test pages, but those only take a few minutes.

Also, if only good composes were visible, and the TC/RC ticket were closed at the same time, then a simple script for checking stage/ for changes would be a crude but workable substitute for fedmsg notifications. I already have a short script that does this using urlwatch. So fedmsg should be a low priority.
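
A crude change check of that kind could look roughly like this. Note that this is not the urlwatch script mentioned above: urlwatch watches the HTTP listing, whereas this sketch compares two listings of a local directory, and the function names are invented for the example.

```python
from pathlib import Path


def visible_composes(stage):
    """Return the names of the non-hidden directories directly under stage/."""
    return {
        entry.name
        for entry in Path(stage).iterdir()
        if entry.is_dir() and not entry.name.startswith(".")
    }


def newly_published(previous, current):
    """Return the compose names present now that were absent in the last snapshot."""
    return sorted(current - previous)
```

Run periodically, keeping the previous snapshot, this only works as a notification if every directory that becomes visible is already complete and correct.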

I changed the compose script to not open up permissions until after things have been synced. The reason things would show up empty was that I'd noticed things went bad and killed the running process with Ctrl-C, but not the script, and it would go on to the next step. There is a lot of work to do to improve things, but really what we have now is as good as it will get.
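
The failure mode described here (one step is killed and the script carries on with the next) can be guarded against by checking each step's exit status before continuing. The sketch below is not the releng compose script, which is not shown in this ticket; `run_steps` and the demo step names are hypothetical.

```python
import subprocess
import sys


def run_steps(steps):
    """Run (name, command) pairs in order and stop at the first failure.

    Returns (completed_names, failed_name); failed_name is None if every
    step succeeded. A step killed by a signal has a negative return code,
    so it also counts as a failure and later steps are never started.
    """
    completed = []
    for name, command in steps:
        result = subprocess.run(command)
        if result.returncode != 0:
            return completed, name
        completed.append(name)
    return completed, None


# Demo steps standing in for compose, sync and opening permissions (assumed names).
DEMO_STEPS = [
    ("compose", [sys.executable, "-c", "pass"]),
    ("sync", [sys.executable, "-c", "import sys; sys.exit(1)"]),
    ("open-permissions", [sys.executable, "-c", "pass"]),
]
```

With the demo steps, the sync step fails and the permissions step is never reached, so nothing half-finished becomes visible.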

We have a completely private staging area. I've been trying this cycle to automate many bits. It has not been entirely perfect, but it has been constantly improving. All the scripts used are in the releng git repo and always have been. Prior to F20 many of the steps were 100% manual: I would have to wait until things were done, then kick off the next step.

I believe that all composes should be public but Beta RC1 for instance did remain hidden from view.

This ticket is the wrong place to offer help, please use the mailing list or preferably #fedora-releng where we could have a real time discussion.

Metadata Update from @kparal:
- Issue set to the milestone: Fedora 19 Final

7 years ago
