Since ticket #4837 was resolved, I can see from my rsync logs that the crawler visits my site (site 105/host 455) twice a day: one crawl starts around midnight UTC and a second one starts around noon UTC. However, only the nightly crawl seems to be reported in the "Last crawled" field on the web page. This happened both today and yesterday. According to my logs, the daily crawl looks fine and no different from the nightly one (except perhaps for the daily rawhide content, which is not yet synchronized on my mirror when the daily crawl begins).
Thanks for taking such a close look at the crawler. In theory, the crawler always updates the "Last crawled" entry. This time, unfortunately, the crawler was terminated by the OOM killer, so the database could not be updated properly. It is quite difficult to predict how much memory the crawler requires. For the last few months it ran without problems, but it seems we now have a situation where crawling again requires more memory. I will decrease the number of mirrors crawled in parallel to accommodate the increased memory requirement. Thanks again.
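To illustrate the idea of lowering the number of mirrors crawled in parallel, here is a minimal sketch. It is not the crawler's actual code: `crawl_all`, `crawl_one`, and `max_parallel` are hypothetical names, and a thread pool merely stands in for however the real crawler runs its workers. The point is only that peak memory scales roughly with the number of crawls in flight, so capping that number caps memory use at the cost of a longer total run.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_all(hosts, crawl_one, max_parallel=10):
    """Crawl every host with at most `max_parallel` crawls in flight.

    `crawl_one` is a hypothetical callable that crawls a single host and
    returns its result. Lowering `max_parallel` reduces how many file
    lists are held in memory at the same time.
    """
    hosts = list(hosts)  # materialize so the hosts can be iterated twice
    results = {}
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map yields results in the same order as `hosts`
        for host, outcome in zip(hosts, pool.map(crawl_one, hosts)):
            results[host] = outcome
    return results
```

Calling `crawl_all(hosts, crawl_one, max_parallel=5)` instead of `max_parallel=10` would halve the number of simultaneous crawls in this sketch.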
Maybe that also explains the extra crawl duration. As I added at the end of ticket #4837, from what I can observe on the mirror side, most of the "crawl duration" is not spent in the rsync command itself, but in what happens on the crawler side after the file list has been received.
About the crawl duration: we are aware that, when crawling by rsync, the time required to process the data is rather long, but we have to limit the duration at some point, and currently the limit is three hours. One goal is to crawl each category separately so that mirrors carrying a lot of content are not penalized.
Crawling by HTTP or FTP would still take longer; it is just not visible there how much of the time is spent on network connections and how much on talking to the local database.
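As a rough illustration of the three-hour limit combined with per-category crawling, the sketch below gives each mirror a time budget and processes its categories one by one. This is an assumption about how such a scheme could look, not a description of the real crawler: `crawl_with_deadline`, `crawl_category`, and the `clock` parameter are hypothetical names.

```python
import time

CRAWL_TIMEOUT_SECONDS = 3 * 60 * 60  # the three-hour limit mentioned above


def crawl_with_deadline(categories, crawl_category,
                        timeout=CRAWL_TIMEOUT_SECONDS, clock=time.monotonic):
    """Crawl categories one at a time until the time budget is used up.

    `crawl_category` is a hypothetical callable that crawls one category.
    Returns (done, skipped): categories that were crawled, and categories
    left untouched because the deadline had already passed.
    """
    deadline = clock() + timeout
    done, skipped = [], []
    for category in categories:
        # The deadline is only checked between categories, so a single
        # slow category can still overrun the budget in this sketch.
        if clock() >= deadline:
            skipped.append(category)
            continue
        crawl_category(category)
        done.append(category)
    return done, skipped
```

With this shape, a mirror carrying many categories loses only the categories that did not fit in the budget, rather than having the whole crawl discarded.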