#4837 site 105/host 455/(fr2.rpmfind.net) not crawled by mm after a 2 days long outage
Closed: Fixed None Opened 8 years ago by bellet.

Hi!

The site I maintain (site 105/host 455, fr2.rpmfind.net) seems to have been ignored by the mirror manager crawler ever since the site was down for 3 days between 07/22-07/24. The site has been back up and up to date since 07/24, but the mm crawler still ignores it. Previously, tx-102-181-132-209.redhat.com accessed it via rsync twice a day. I still run report_mirror successfully, and the site is still listed in the metalink.

The last crawl before the outage failed with a huge last_crawl_duration value. I noticed in the source code that the crawler sorts the mirrors to be checked by their last crawl duration. Maybe my site now sits so far down this list of hosts that the crawler never gets to it?


Thanks for the problem report. The problem you were seeing had nothing to do with your mirror; it affected all mirrors, or more precisely 50% of them. We use two crawlers to check whether the mirrors are up to date, and one of them (the one that is supposed to crawl your mirror, and mine by the way) was missing a required Python package and therefore crashed immediately. So for a few days only 50% of the mirrors were actually crawled. Thanks again for pointing this problem out.

Unrelated to the missing package, but related to the long crawl time: the crawler has a timeout of three hours that covers all the categories a mirror carries. The problem is that the content keeps growing, so it gets difficult to scan all categories in one crawl. Fixing this properly requires some larger changes. We are working on it, but it takes time.

My mirror, for example, carries all categories, and it is not possible to crawl a mirror with all categories within three hours. I created a second host in MirrorManager that points to the same mirror and added some category URLs to that second host so that the crawler can finish in time. As already mentioned, this is not optimal and we are working on it, but it takes some time.

A side note about the crawl duration: from what I can observe, the long duration seems to be related to some ''processing'' that occurs after the file list of each category has been requested from the rsync server. For example, last night's crawl of fr2.rpmfind.net (Last Crawled: 2015-07-28 03:16:52.922058) took 11797 seconds to complete, yet the time spent in the rsync commands for the three consecutive contents (epel, fedora-secondary, and fedora) is rather small:

{{{
2015/07/28 00:00:19 [17252] rsync on linux/epel/ from UNKNOWN (209.132.181.102)
2015/07/28 00:00:19 [17252] building file list
2015/07/28 00:01:11 [17252] sent 8789498 bytes received 2567 bytes total size 153507048096

2015/07/28 00:01:53 [18586] rsync on linux/fedora-secondary/ from UNKNOWN (209.132.181.102)
2015/07/28 00:01:53 [18586] building file list
2015/07/28 00:15:30 [18586] sent 197044786 bytes received 29782 bytes total size 5610476424962

2015/07/28 02:45:11 [15162] rsync on linux/fedora/linux/ from UNKNOWN (209.132.181.102)
2015/07/28 02:45:12 [15162] building file list
2015/07/28 02:53:00 [15162] sent 82724201 bytes received 17172 bytes total size 1514545173230
}}}

What causes the majority of the crawl duration is what happens ''between'' these rsync commands (and the duration reported on the mirrormanager page includes this extra delay), and it seems to be related to the rsync module size.
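
To put numbers on this, here is a rough sketch that parses the rsync daemon log excerpt quoted above and adds up the time spent inside the rsync sessions versus the gaps between them. It only assumes the line layout visible in the excerpt; the helper name `rsync_sessions` and the variable names are mine, not anything from MirrorManager.

```python
import re
from datetime import datetime

# The rsync daemon log excerpt from this ticket, copied verbatim.
LOG = """\
2015/07/28 00:00:19 [17252] rsync on linux/epel/ from UNKNOWN (209.132.181.102)
2015/07/28 00:00:19 [17252] building file list
2015/07/28 00:01:11 [17252] sent 8789498 bytes received 2567 bytes total size 153507048096

2015/07/28 00:01:53 [18586] rsync on linux/fedora-secondary/ from UNKNOWN (209.132.181.102)
2015/07/28 00:01:53 [18586] building file list
2015/07/28 00:15:30 [18586] sent 197044786 bytes received 29782 bytes total size 5610476424962

2015/07/28 02:45:11 [15162] rsync on linux/fedora/linux/ from UNKNOWN (209.132.181.102)
2015/07/28 02:45:12 [15162] building file list
2015/07/28 02:53:00 [15162] sent 82724201 bytes received 17172 bytes total size 1514545173230
"""

# Crawl duration reported on the mirrormanager page for that night.
TOTAL_CRAWL_SECONDS = 11797

# "<date> <time> [<pid>] <message>", as seen in the excerpt.
LINE_RE = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[(\d+)\] (.*)$")


def rsync_sessions(log_text):
    """Return one dict per completed rsync session, in log order."""
    sessions = {}  # keyed by daemon pid; dicts keep insertion order
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # blank or unrecognised line
        stamp = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S")
        pid, message = match.group(2), match.group(3)
        if message.startswith("rsync on "):
            # Third word of the message is the requested path.
            sessions[pid] = {"module": message.split()[2],
                             "start": stamp, "end": None}
        elif message.startswith("sent ") and pid in sessions:
            sessions[pid]["end"] = stamp
    return [s for s in sessions.values() if s["end"] is not None]


sessions = rsync_sessions(LOG)

# Seconds spent inside each rsync session, and between consecutive sessions.
in_rsync = sum((s["end"] - s["start"]).total_seconds() for s in sessions)
gaps = [(nxt["start"] - cur["end"]).total_seconds()
        for cur, nxt in zip(sessions, sessions[1:])]

for s in sessions:
    seconds = (s["end"] - s["start"]).total_seconds()
    print(f"{s['module']:<28} {seconds:6.0f} s in rsync")
print(f"total in rsync: {in_rsync:.0f} s "
      f"({100 * in_rsync / TOTAL_CRAWL_SECONDS:.1f}% of the crawl)")
print("gaps between sessions:", [int(g) for g in gaps])
```

On this excerpt the three sessions add up to 1338 seconds (52, 817 and 469), which is about 11% of the 11797-second crawl, and the gap between the end of fedora-secondary and the start of fedora alone is 8981 seconds.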
