#12344 mirrorlist is returning too few mirrors
Closed: Fixed a month ago by schanzle. Opened a month ago by schanzle.

I wrote a script to ensure my private mirror is first in the returned list of mirrors and the list of mirrors returned is reasonably long. Over the last few days, it's been only returning my private mirror plus (per Adrian Reber) mirrors that are marked as being always up to date.

http://mirrors.fedoraproject.org/mirrorlist?repo=updates-released-f41&arch=x86_64

2025-01-02 00:02:02 URL:http://mirrors.fedoraproject.org/mirrorlist?repo=updates-released-f41&arch=x86_64 [327/327] -> "/dev/shm/mirrorlist.wHbIbkacqUbv" [1]
# repo = updates-released-f41 arch = x86_64 Using preferred netblock country = US country = global
http://darkstar.cam.nist.gov/fedora/updates/41/Everything/x86_64/
https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/updates/41/Everything/x86_64/
https://dl.fedoraproject.org/pub/fedora/linux/updates/41/Everything/x86_64/

Y2k+25 issue? :smile:
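
The check described above (private mirror first, list reasonably long) could be sketched roughly as follows. This is not the reporter's actual script, which is not shown in the ticket; the function names and the threshold of 10 mirrors are assumptions made for illustration, and the sample text is the response quoted above.

```python
# Hypothetical sketch of a mirrorlist sanity check; not the reporter's script.
SAMPLE = """\
# repo = updates-released-f41 arch = x86_64 Using preferred netblock country = US country = global
http://darkstar.cam.nist.gov/fedora/updates/41/Everything/x86_64/
https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/updates/41/Everything/x86_64/
https://dl.fedoraproject.org/pub/fedora/linux/updates/41/Everything/x86_64/
"""


def parse_mirrorlist(text):
    """Return the mirror URLs, skipping blank lines and '#' comment lines."""
    return [
        line.strip()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]


def check_mirrorlist(text, private_prefix, min_mirrors=10):
    """Return a list of problems; an empty list means the response looks fine.

    min_mirrors=10 is an arbitrary assumption for "reasonably long".
    """
    mirrors = parse_mirrorlist(text)
    problems = []
    if not mirrors or not mirrors[0].startswith(private_prefix):
        problems.append("private mirror is not first")
    if len(mirrors) < min_mirrors:
        problems.append(
            f"only {len(mirrors)} mirrors returned, expected at least {min_mirrors}"
        )
    return problems


print(check_mirrorlist(SAMPLE, "http://darkstar.cam.nist.gov/"))
```

With the sample response, the private mirror is first but the list is flagged as too short.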


I checked the metalink and the result seems correct, could this be something in mirrorlist-server @adrian ?

I think this shows the state we currently have in the database. A better way to view it might be:

$ curl "http://mirrors.fedoraproject.org/mirrorlist?path=pub/fedora/linux/updates/41"
# path = pub/fedora/linux/updates/41 country = global 
https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/updates/41/
https://dl.fedoraproject.org/pub/fedora/linux/updates/41/

Compare that to the output of:

$ curl "http://mirrors.fedoraproject.org/mirrorlist?path=pub/fedora/linux/updates/40"

Could it be that 41 was the first release to be scanned by the new crawler and that the new code marks up to date directories in the database differently?

Looking at my mirror I see for 40 that directories with only directories as content are listed as up to date in the database. For 41 these directories (with only directories as content) are not listed as up to date.

Looking at:

$ curl "http://mirrors.fedoraproject.org/mirrorlist?path=pub/fedora/linux/updates/41/Everything/x86_64/repodata"

Gives me a long list of mirrors again. The main reason why we should fix this is that download.fedoraproject.org also relies on this information, and currently users would only be redirected to the primary mirror and the CloudFront cache. This might be a good thing, but I am not sure we want this to happen by accident rather than by design.

At the same time we always had the problem that once a directory with only directories as content is marked as up to date it will never be removed as up to date from the database. That is something we have seen problems with in the past. Again related to download.fedoraproject.org.

Metadata Update from @zlopez:
- Issue tagged with: mirrorlists

a month ago

Metadata Update from @phsmoura:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: Needs investigation, low-gain, low-trouble

a month ago

Interesting!

> Could it be that 41 was the first release to be scanned by the new crawler and that the new code marks up to date directories in the database differently?

That's very possible.

> At the same time we always had the problem that once a directory with only directories as content is marked as up to date it will never be removed as up to date from the database. That is something we have seen problems with in the past. Again related to download.fedoraproject.org.

Ah, I see, so what do you think should be the proper behavior? Mark a directory as up-to-date only if all its sub-directories are up-to-date?

Also, I'm not sure I understand how you reached that conclusion, could you show me the SQL query you used?
I'm trying this:

select hcd.*, h.name as host_name from host_category_dir hcd join directory d on hcd.directory_id = d.id join host_category hc on hcd.host_category_id = hc.id join host h on hc.host_id = h.id where d.name = 'pub/fedora/linux/updates/41';

but I only see 12 answers, all of which are flagged as up2date.

Not 100% sure what your question is. I just looked at my mirror under up to date directories: https://mirrormanager.fedoraproject.org/host/218 and saw that for 40 and 41 different directories are listed as being up to date.

Your SQL command only returns private mirrors because they are updated in the database using report_mirror and not by the crawler.

Does this answer kind of make sense to your previous comment?

> Not 100% sure what your question is. I just looked at my mirror under up to date directories: https://mirrormanager.fedoraproject.org/host/218 and saw that for 40 and 41 different directories are listed as being up to date.

Ah, yeah. I may not be looking at the right thing in the database.
On your mirror, there is no host_category_dir entry for /pub/fedora/linux/updates/41, not even a not-up2date one. But there is one for /pub/fedora-secondary/updates/41, and it also only has subdirectories, so it may not be why the main one is not showing up in mirrorlist.

> Your SQL command only returns private mirrors because they are updated in the database using report_mirror and not by the crawler.

How did you notice that?

> Does this answer kind of make sense to your previous comment?

Yeah, I've forgotten quite a few mechanisms and model structures in mirrormanager since I worked on it last year, I'm trying to rebuild it in my mind but I'm a bit confused and I poke at random things :-)

> > Your SQL command only returns private mirrors because they are updated in the database using report_mirror and not by the crawler.
>
> How did you notice that?

The host_name field content sounded like private mirrors to me. I then changed your query to contain the private field:

select hcd.*, h.private, h.name as host_name from host_category_dir hcd join directory d on hcd.directory_id = d.id join host_category hc on hcd.host_category_id = hc.id join host h on hc.host_id = h.id where d.name = 'pub/fedora/linux/updates/41';

That showed that all mirrors are indeed private mirrors.

OK I think I know why the updates/41 directory is not in the database, but I don't understand why updates/40 was. In the old crawler there used to be a function add_parents() that would add to the gathered dirs the parents of the current dir with status unknown. I have not re-implemented this function because in my understanding, it did not lead to creating the directory in the database, only to increment the counter of unknown status dirs as shown in this block.
HostCategoryDir entries added by add_parents should never pass this point and be added to the database.
Or did I miss something?

I don't remember much of that code, not sure how it worked. I cannot really comment on it, sorry.

No worries. Here's where I'm at with the investigation: the directories for 40 and 41 don't have the same file listing in the database:

# select id, name, files from directory where name = 'pub/fedora/linux/updates/40' or name = 'pub/fedora/linux/updates/41';
   id   |            name             | files 
--------+-----------------------------+-------
 646266 | pub/fedora/linux/updates/40 | []
 714193 | pub/fedora/linux/updates/41 | 
(2 rows)

40 has an empty list and 41 has NULL. In the crawler code, the directories with NULL in files are explicitly skipped from being scanned, which is why no host_category_dir entry is created.
So, 2 ways to go from there:
- why are we excluding empty directories from crawls? Is there a semantic reason or is it just optimization? If it's just optimization then we should add them back as host_category_dir entries.
- should it be empty list or null in the DB? NULL feels more correct.
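
The difference between the two rows can be reproduced in a small, self-contained sketch. It uses an in-memory SQLite table as a stand-in for the real PostgreSQL database, and stores files as text for simplicity; both choices are assumptions made only so the example runs on its own. The ids, names, and values are the ones from the query output above.

```python
import sqlite3

# Stand-in for the real database: an in-memory SQLite table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE directory (id INTEGER PRIMARY KEY, name TEXT, files TEXT)")
conn.executemany(
    "INSERT INTO directory (id, name, files) VALUES (?, ?, ?)",
    [
        (646266, "pub/fedora/linux/updates/40", "[]"),  # empty JSON list
        (714193, "pub/fedora/linux/updates/41", None),  # SQL NULL
    ],
)

# "files IS NULL" makes the distinction explicit instead of relying on how
# the client happens to display an empty cell.
rows = conn.execute(
    "SELECT name, files IS NULL AS files_is_null, files FROM directory ORDER BY id"
).fetchall()

for name, files_is_null, files in rows:
    print(name, bool(files_is_null), repr(files))
```

A check that only looks for a truthy files value would treat both rows the same way in Python, while an explicit NULL test separates them.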

Could you please look at the cases when scan-primary-mirror sets the file list as NULL and when it sets an empty list? Thanks.

OK, so here's what I think we should do (and I'd love your opinion on this):
- don't scan empty directories, it makes sense because there's nothing to actually check there
- add them as host_category_dir entries after the crawl with an up2date status equivalent to an AND of the up2date status of their sub-directories
Do you think this is correct?
I'm also interested in whether scan-primary-mirror will set the files to [] or NULL on empty directories, I tried to read the Rust code but couldn't tell. It doesn't look like the main file has changed in a while.

> OK, so here's what I think we should do (and I'd love your opinion on this):
> - don't scan empty directories, it makes sense because there's nothing to actually check there
> - add them as host_category_dir entries after the crawl with an up2date status equivalent to an AND of the up2date status of their sub-directories
> Do you think this is correct?

It sounded correct at first. But then I started to think about partial mirrors. Partial mirrors are something we have supported from the beginning. Your suggestion might break this.

If I look at the example that started this ticket:

$ curl "http://mirrors.fedoraproject.org/mirrorlist?repo=updates-released-f41&arch=x86_64"
# repo = updates-released-f41 arch = x86_64 country = global 
https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/updates/41/Everything/x86_64/
https://dl.fedoraproject.org/pub/fedora/linux/updates/41/Everything/x86_64/

If somebody excluded debug/, that mirror would not be part of the result above. That is correct, but it seems to be different from what we used to do. I think a directory without files is probably always up to date, independent of its subdirectories. Also not sure.
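
The two candidate rules can be compared in a minimal sketch. The function names and the status dictionaries are hypothetical; only the debug/ and repodata directory names come from this ticket, and this is not the crawler's actual code.

```python
def up2date_and_rule(subdir_status):
    """Parent is up to date only if every subdirectory is (the AND proposal)."""
    return all(subdir_status.values())


def up2date_empty_dir_rule(subdir_status):
    """A directory without files is always up to date, whatever its subdirectories say."""
    return True


# Hypothetical per-subdirectory status for .../updates/41/Everything/x86_64/
full_mirror = {"repodata": True, "debug": True}
partial_mirror = {"repodata": True, "debug": False}  # mirror excludes debug/

for label, status in (("full", full_mirror), ("partial", partial_mirror)):
    print(label, up2date_and_rule(status), up2date_empty_dir_rule(status))
```

Under the AND rule the partial mirror drops out of the result for the parent directory, while the second rule keeps it listed.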

> I'm also interested in whether scan-primary-mirror will set the files to [] or NULL on empty directories, I tried to read the Rust code but couldn't tell. It doesn't look like the main file has changed in a while.

That code has not changed a lot, right. I think I remember some of the background again. MirrorManager used to have a pickle stored in the Files column in the Directory table. The Files column is a bytea. I reused it to store a JSON string instead of the pickle.

Looking at the code, I would say that if files are found it adds JSON-encoded content, and if nothing is found it does indeed add a []. I am not 100% sure, however, what happens during the conversion to bytea. If scan-primary-mirror adds a [], it is strange that we see a NULL value in the database. Not sure why it would be NULL. Ah, the main Fedora category is not being scanned by the Rust code, but by the Python code. That might be the reason why it works for ppc64le. The reason I didn't switch the Fedora main category to the scan-primary-mirror code was the wrong location of the fullfiletimelist. For the Fedora main category it is not in the same location as for all other categories.

Maybe short_filelist() in https://github.com/fedora-infra/mirrormanager2/blob/master/mirrormanager2/utility/update_master_directory_list.py needs to return [] and not None if the filelist is empty.
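
The suggested return convention can be illustrated with a small sketch. What short_filelist() actually does is not shown in this ticket, so the function below is a hypothetical stand-in with an invented name; it only shows the idea of always producing an encoded empty list instead of None for a directory without files.

```python
import json


def encode_file_listing(files):
    """Hypothetical helper: always return JSON bytes for the files column.

    An empty or missing listing is stored as b"[]" rather than None, so the
    database never ends up with NULL for a directory without files.
    """
    listing = sorted(files) if files else []
    return json.dumps(listing).encode("utf-8")


print(encode_file_listing(None))
print(encode_file_listing([]))
print(encode_file_listing(["b.rpm", "a.rpm"]))
```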

> It sounded correct at first. But then I started to think about partial mirrors. Partial mirrors are something we have supported from the beginning. Your suggestion might break this.
> If somebody excluded debug/, that mirror would not be part of the result above. That is correct, but it seems to be different from what we used to do. I think a directory without files is probably always up to date, independent of its subdirectories. Also not sure.

Oh yeah good point, I didn't think about partial mirrors. Alright, then I'll just mark empty directories as up-to-date.

> Ah, the main Fedora category is not being scanned by the Rust code, but by the Python code. That might be the reason why it works for ppc64le. The reason I didn't switch the Fedora main category to the scan-primary-mirror code was the wrong location of the fullfiletimelist. For the Fedora main category it is not in the same location as for all other categories.
> Maybe short_filelist() in https://github.com/fedora-infra/mirrormanager2/blob/master/mirrormanager2/utility/update_master_directory_list.py needs to return [] and not None if the filelist is empty.

Alright, thanks for investigating this. I probably need to have a look at umdl one of these days and refactor it.
In the meantime, if we just mark empty directories as up2date, the crawler can also interpret NULL values for files as an empty list.
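
On the crawler side, treating NULL as an empty list could look roughly like the sketch below. The helper names are hypothetical and this is not the code from the eventual PR; it assumes the files column holds either NULL or a JSON-encoded list, as described earlier in this ticket.

```python
import json


def decode_files(raw):
    """Decode the stored file listing; NULL (None) is treated like an empty list."""
    if raw is None:
        return []
    if isinstance(raw, (bytes, bytearray, memoryview)):
        # Binary columns may come back as bytes-like objects.
        raw = bytes(raw).decode("utf-8")
    return json.loads(raw)


def is_empty_directory(raw):
    """True when the directory has no files of its own (NULL or [])."""
    return len(decode_files(raw)) == 0


print(is_empty_directory(None), is_empty_directory(b"[]"), is_empty_directory(b'["a.rpm"]'))
```

With this normalization, the rows for updates/40 ([]) and updates/41 (NULL) are handled identically, and both can be marked as up2date without being scanned.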

I'll craft a PR.

> Alright, thanks for investigating this. I probably need to have a look at umdl one of these days and refactor it.

I was always hoping to be able to get rid of umdl and do everything with scan-primary-mirror. It should not be much work for me to get it finally working for everything.

Yeah, that said, there aren't a lot of performance constraints on umdl/scan-primary-mirror, are there?
I'm not sure which script requires less work, but if I understand you correctly it's probably scan-primary-mirror, as umdl would also need to move to the configuration file the part that scan-primary-mirror finds in the toml file.

Alright, the fix is now deployed to prod, let's see if the next crawl creates the directories correctly in the database.

Confirmed! Thanks to all for the deep-dive and effort.

curl -s  'http://mirrors.fedoraproject.org/mirrorlist?repo=updates-released-f41&arch=x86_64' | wc -l
91

curl 'http://mirrors.fedoraproject.org/mirrorlist?repo=updates-released-f41&arch=x86_64'
# repo = updates-released-f41 arch = x86_64 Using preferred netblock country = US country = DO country = CA 
http://darkstar.cam.nist.gov/fedora/updates/41/Everything/x86_64/
http://mirror.sfo12.us.leaseweb.net/fedora/linux/updates/41/Everything/x86_64/
https://coresite-atl.mm.fcix.net/fedora/linux/updates/41/Everything/x86_64/
http://mirror.math.princeton.edu/pub/fedora/linux/updates/41/Everything/x86_64/
http://mirror.cs.princeton.edu/pub/mirrors/fedora/linux/updates/41/Everything/x86_64/
https://ftp-osl.osuosl.org/pub/fedora/linux/updates/41/Everything/x86_64/
http://mirror.web-ster.com/fedora/updates/41/Everything/x86_64/
http://paducahix.mm.fcix.net/fedora/linux/updates/41/Everything/x86_64/
https://mirror.umd.edu/fedora/linux/updates/41/Everything/x86_64/
...

Metadata Update from @schanzle:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

a month ago
