#11358 Sync RCs to alt/stg dl.fp.o
Opened a year ago by sumantrom. Modified a year ago

  • Describe the issue

As we keep increasing QA activities, we are noticing that downloading RC images from [https://kojipkgs.fedoraproject.org/compose/38/latest-Fedora-38/compose/Workstation/x86_64/iso/] is very slow. In the last few releases, RCs have dropped one or two days before the Go/No-Go. In most cases, we have a parallel test day/week running as well. Even as an RHT employee, I find it hard to wait for a single RC image to download, let alone multiple (more if I have to download test day artifacts). I can only wonder how much a community member would be affected. I corresponded with @kparal to check whether this is just my internet connection, and he seems to have the same problem in Brno.

The actionable item would be to rsync RCs to alt/stage on dl.fp.o right after they are created (some automation would help).

RE : https://pagure.io/fedora-infrastructure/issue/11206#comment-848245

  • When do you need this? (YYYY/MM/DD)

ASAP

  • When is this no longer needed or useful? (YYYY/MM/DD)

As long as Fedora Linux will need testing and QA

  • If we cannot complete your request, what is the impact?

NA


To make this request clearer, we'd really like https://dl.fedoraproject.org/pub/alt/stage/ to be populated automatically after an RC is built. Currently it's a manual task that has a considerable delay. Koji download speeds are sometimes very slow (especially in Asia, where Sumantro is) and we can't easily automate syncing from it, whereas the stage/ dir can easily be rsynced.

This means both manual and automated testing have a long delay in some parts of the world, making it very hard to perform QA tasks shortly before the release, when timing is essential.

Please populate the stage/ dir automatically, thank you.

But the composes are synced right after they are finished.
We do that per our SOP after every RC:
https://docs.fedoraproject.org/en-US/infra/release_guide/beta_RC_compose/#_syncing_the_compose

Let me explain the process here.

We run RC composes manually because they require us to include specific builds to fix blockers. It can take multiple runs to get an RC that is considered good enough. In this release cycle, the beta was RC_1.3, meaning we did 2 composes that didn't make it.

At least one of them completed successfully but had wrong configs and/or artifacts in it.

If the process were fully automated, the first successful compose would have been synced.

I see. Can we at least have the composes synced to dl.fp.o at the same time they are manually created? I found that dl.fp.o still didn't have RC 1.3 hours after it became available on kojipkgs.fp.o/compose.
@kparal did you find the image instantly available to you on dl.fp.o?

So kojipkgs is basically showing you the content of the compose dir live as it is happening. Just because the dir and one image are there it does not mean the whole compose is finished.
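A minimal sketch of how a sync job might tell a finished compose from one that is still being written. It assumes the compose directory exposes a top-level STATUS file containing a single word such as STARTED, FINISHED or DOOMED, as Pungi-produced composes typically do; the function name and the default set of accepted states are illustrative and not taken from any releng script.

```python
# Hypothetical helper: decide whether a compose looks safe to sync.
# Assumption: STATUS holds one word (e.g. STARTED, FINISHED, DOOMED).
DEFAULT_SYNCABLE = frozenset({"FINISHED"})


def is_compose_finished(status_text, syncable=DEFAULT_SYNCABLE):
    """Return True when the STATUS content names a syncable state."""
    return status_text.strip().upper() in syncable
```

A caller would read the STATUS file of the compose and skip the sync while this returns False, rather than relying on the directory or a single image being visible.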

One partial solution might be to implement the RC sync in ansible and allow people within specific group to run the sync scripts. Sometimes the compose does finish at a time when no one from releng is around to do the sync.

In my experience, it's sometimes the case that an RC is fully finished, but it takes hours or even half a day before it is copied to /pub/alt/stage. It's a manual task, so if the releng person is asleep, they obviously have to wake up first. That's the delay we would like to remove by populating /pub/alt/stage automatically, not manually.

It can take multiple runs to get an RC that is considered good enough

Is that a problem? If this is about disk space, the intermediate broken composes can be deleted from /pub/alt/stage after a fully working one is produced, right? Is there some other disadvantage?

Our automation already runs even on those "broken" composes, it pulls them directly from Koji. It's in Fedora infra, so it's fast. But QA testers get very low download speeds. And because we announce the RCs immediately (here's the broken Beta 1.2 announcement), we announce the koji download links. If syncing to /pub/alt/stage was part of the compose job, we would be able to announce those dl.fp.o links right away, and it wouldn't be just secret knowledge for folks who know about it and can wait a random time interval before it gets there.

One partial solution might be to implement the RC sync in ansible and allow people within specific group to run the sync scripts.

That's still a manual task :-/ We'd like to have the sync fully automated, if possible.

If you do not mind having possibly broken things in place, let's make it part of the RC compose.

Moving things to dl.fp.o is not going to make downloads that much faster. The 5 dl.fedoraproject.org boxes are just virtual machines which are mounting a backing store netapp. They are mainly meant to be rsync servers for the mirror network which mainly mirrors just /pub/fedora and not alt. The mirrors may not get updates for hours because they are volunteer run and will sync when they can versus when we want to.

The httpd servers on the dl.fedoraproject.org are meant to be 'last-ditch' servers for systems which could not find a usable mirror elsewhere. As such the network traffic on these boxes is pretty high at all times.

kojipkgs on the other hand is a dedicated http server which also mounts disks from the same netapp. It is also resource bound for various reasons so it will also be 'slow'.

I think historically we had the gap there so releng could look and scrap the compose if it was bad, but I agree it's probably just less confusing to always sync it as part of the compose scripting so there's no delay. If it's bad, then we say that and delete it or close perms on the directory, etc.
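As a rough illustration of "always sync, then hide it if it turns out bad", here is a sketch of what such a compose step could look like. The paths, function names and the choice of rsync flags are assumptions made for the example; the real compose scripting and directory layout may differ.

```python
import os


def build_sync_command(source, dest, dry_run=True):
    """Build an rsync argv that copies a compose tree into the stage dir.

    Trailing slashes make rsync copy the contents of `source` into `dest`.
    """
    cmd = ["rsync", "-avh"]
    if dry_run:
        cmd.append("--dry-run")  # show what would be copied, change nothing
    cmd.append(source.rstrip("/") + "/")
    cmd.append(dest.rstrip("/") + "/")
    return cmd


def hide_compose(path):
    """Close permissions on a bad compose so it is no longer world-readable."""
    os.chmod(path, 0o700)
```

The command list can be passed to `subprocess.run(cmd, check=True)` once the dry run looks right. Calling `hide_compose` on the synced directory is one way to "close perms" on a compose that later proves broken, without having to delete it immediately.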

I am +1 to this approach and I would like to see that happen!
@kparal thoughts?

The 5 dl.fedoraproject.org boxes are just virtual machines which are mounting a backing store netapp.

This is good to know. In my personal experience downloading from dl.fp.o is usually faster than downloading from koji directly. However, I'm not hit as hard as Sumantro from India, where he often gets speeds in the 10-100kB/s range. Perhaps there's some unrelated issue with routing to Asia that could be examined.

I do believe that putting RCs in /pub/alt/stage immediately is still beneficial, though. Not many mirrors carry that directory, but some do (even close to Sumantro). So if the RC is placed there immediately, it might already be mirrored a few hours later, when Sumantro and other testers in that area want to access it. It should be a considerable improvement for them.

I originally also wanted to replace koji direct links with dl.fp.o links (or even better, download.fp.o links) in our test matrices, at least for RCs (Beta RC3 example). The problem with download.fp.o seems to be that it can easily return a 404 when the mirror it picks doesn't carry that directory at all. For example, in my case, https://download.fedoraproject.org/pub/alt/stage/ redirects to https://mirror.karneval.cz/pub/linux/fedora-alt/stage/ , which is a 404. So it seems we can't rely on that. And manually finding an available mirror which has it is a tedious process of inspecting mirrors one by one. We could still link to dl.fp.o, but it depends on whether infra actually wants that or not.
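The "inspect mirrors one by one" step could be scripted. The sketch below assumes the tester already has a list of candidate mirror URLs (it does not query mirror-manager) and simply returns the first one that answers a HEAD request with HTTP 200; the function names are made up for this example.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10.0):
    """Return the HTTP status for a HEAD request, or None if unreachable."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 when the mirror lacks the directory
    except OSError:
        return None  # DNS failure, timeout, refused connection, ...


def first_available(urls, probe=http_status):
    """Return the first URL whose probe reports 200, else None."""
    for url in urls:
        if probe(url) == 200:
            return url
    return None
```

The `probe` argument is injectable so the selection logic can be exercised without network access, for example with a dictionary lookup standing in for real HTTP requests.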

CC @adamwill just in case he wants to add something, e.g. related to automation.

So my guess is that dl.fedoraproject.org is faster than kojipkgs.fedoraproject.org because it is a round robin of 5 hosts versus 2 proxies at the same location. That said, if you are using dl.fedoraproject.org then you aren't using any mirrors at all. But as you found, using download.fedoraproject.org (which does give you mirrors) depends on mirrors actually having the content and being part of the system that mirror-manager can check.

Mirror manager is good about checking the regular repositories under /pub/fedora, /pub/fedora-secondary, /pub/epel and /pub/archive. The items under /pub/alt rarely have any rhyme or reason, so mirror-manager makes a best-effort check that some files are there, but not every file under /pub/alt.

The download speeds Sumantro is seeing are typical for trying to download from a US link that far away. There are too many overloaded network junctions between the US and Asia, which cause a 'straw' pipe problem. [It is also a reason why it is hard for mirrors in Asia to stay in sync with Fedora; they are also trying to keep up with a couple of hundred GB of daily changes at 10-100kB/s.] I don't have a straightforward answer to this.
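To put the 10-100kB/s range in perspective, here is a small back-of-the-envelope calculation. The 2 GB image size is an assumed round figure for illustration, not a measured ISO size.

```python
def download_hours(size_bytes, rate_bytes_per_s):
    """Hours needed to transfer size_bytes at a constant rate."""
    return size_bytes / rate_bytes_per_s / 3600


ASSUMED_ISO_BYTES = 2e9  # hypothetical 2 GB image

fast = download_hours(ASSUMED_ISO_BYTES, 100e3)  # 100 kB/s
slow = download_hours(ASSUMED_ISO_BYTES, 10e3)   # 10 kB/s
print(f"{fast:.1f} h at 100 kB/s, {slow:.1f} h at 10 kB/s")
```

Under that assumption a single image takes roughly 5.6 hours at 100 kB/s and roughly 56 hours at 10 kB/s, which is why an RC that drops a day or two before Go/No-Go is hard to test from such a connection.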

@kparal what are your final thoughts on this? Do we still want what we initially requested?

Yes, I believe putting the RCs in /pub/alt/stage automatically has lots of benefits. Not just for testers in Asia (if they can find a local mirror which contains it), but also for our automation outside of Fedora infra.

@smooge @kevin , what would be the best possible way to work toward the request?

Hi @sumantrom,

It will be done before the final freeze.

Metadata Update from @humaton:
- Issue assigned to patrikp

a year ago
