#23 Better handling of vanished source files
Closed: Fixed 5 years ago Opened 7 years ago by tibbs.

rsync returns 24 when source files have gone away. When this happens, we can assume that the best thing to do is completely retry the run. The second best thing to do is just stop.

Currently we just retry the rsync call, which will essentially always find the same set of files missing.


The only thing I can think to do is to grep the error log for the specific error message.

One problem is that at high debug levels, we don't have an error log handy because it isn't redirected.

There are crazy examples of using tee to direct stdout and stderr to different files while still sending them to the console, but we also want the exit status. It's a lot of work to go through for just the case where someone has run with debugging on, even though that's usually me. One unpleasant hack would be to tail -f the output files. I'd end up doing the same anyway if they didn't go to the terminal.
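For illustration only, here is a minimal Python sketch of the "log it, still show it, and keep the exit status" idea. It is not how the script in this ticket does it. The function name `run_logging_stderr` and its parameters are hypothetical, and it only handles stderr; stdout is left attached to the terminal.

```python
import subprocess
import sys


def run_logging_stderr(cmd, log_path, console=sys.stderr):
    """Run cmd, copy each stderr line to log_path and to console.

    Returns the command's exit status. stdout is inherited unchanged,
    so only one pipe is read and there is no risk of a pipe deadlock.
    """
    with open(log_path, "w") as log:
        proc = subprocess.Popen(cmd, stderr=subprocess.PIPE, text=True)
        for line in proc.stderr:
            log.write(line)      # keep a copy to grep later
            console.write(line)  # still show it to whoever is debugging
        return proc.wait()
```

Something like `run_logging_stderr(["rsync", ...], "rsync-errors.log")` would then give both the error log and rsync's exit status, with the actual rsync arguments depending on the setup.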

Copying info from #39....

We already handle the case where a file vanished during the transfer, because rsync returns exit code 24. But it's not uncommon for a file to vanish before we started rsync, or for the file list to simply be outdated, so that we asked rsync to transfer something which simply wasn't there at the beginning of the transfer. Sadly rsync returns the generic "23" exit code when this happens.

When getting 23 back from rsync, we should grep through the error log and if all of the errors are of the form:

rsync: link_stat "/fedora/linux/updates/24/SRPMS/a/awscli-1.11.63-1.fc24.src.rpm" (in fedora-buffet0) failed: No such file or directory  (2)

then we should pretend we got error 24 instead and not sleep/retry. Alternately we should re-fetch the file lists and if they've changed then start the whole process anew. (The latter might be a good thing to do in a number of cases.)

Note that we must ignore any line in the error output not beginning with "rsync: ". This will catch anything we write to the error log, as well as the generic "rsync error: some files/attrs were not transferred" line.
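The check described above could look roughly like the following Python sketch. The regular expression is an assumption built from the single example line in this ticket (the "(in module)" part is treated as optional), and the function names are hypothetical. As noted elsewhere in this ticket, matching on rsync's message text is fragile.

```python
import re

# Assumed shape of the "source file was never there" message, based on the
# one example line quoted in this ticket.
LINK_STAT_MISSING = re.compile(
    r'^rsync: link_stat ".*" (?:\(in .*\) )?'
    r'failed: No such file or directory\s+\(2\)\s*$'
)


def only_vanished_errors(log_text):
    """True if every 'rsync: ' line in the log is a missing-source-file error."""
    # Lines not starting with "rsync: " are ignored: our own log output and
    # the generic "rsync error: some files/attrs were not transferred" line.
    rsync_lines = [
        line for line in log_text.splitlines() if line.startswith("rsync: ")
    ]
    if not rsync_lines:
        return False  # nothing recognizable, so do not reinterpret the error
    return all(LINK_STAT_MISSING.match(line) for line in rsync_lines)


def effective_exit_code(exit_code, log_text):
    """Treat exit 23 as 24 when the log shows only vanished source files."""
    if exit_code == 23 and only_vanished_errors(log_text):
        return 24
    return exit_code
```

Returning `False` for a log with no "rsync: " lines is a deliberate choice in this sketch: with no evidence that the failure was a vanished file, the 23 is left alone and the normal retry path applies.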

Metadata Update from @tibbs:
- Issue assigned to tibbs

5 years ago

This is annoying me, so I'll work on it next.

Metadata Update from @tibbs:
- Issue untagged with: far future
- Issue tagged with: immediate

5 years ago

Commit 04dbdeb does this in a basic way; we grep the log looking for the exact problematic line.

This is fragile; rsync could change slightly and we would no longer notice the error. But the worst that happens is we have some retries which are guaranteed to fail.

Because not everyone is going to poll every ten minutes, it would really be better to have some way for an error like this to start the whole process from scratch, including a fresh file list download. That will take more thought and effort.

Worked some more on this and was surprised to find that even with --delay-updates, rsync will put the new content in place even if it's going to fail and exit 23. So even if you have a failed copy due to an outdated file list, your mirror is still updated with the content you wanted that the master did have.

I don't think this is a huge problem, really, but it makes the logic odd. If we detect an outdated file list, we're already updated. So we can re-run the whole main loop immediately, but this will probably just get the same file list that was just transferred and so we'll end with a successful transfer.

So there are two situations:
- If you're polling quickly, you probably just want to exit now. Let your next poll find the new file list.
- If you're polling slowly, you might want to sleep for a minute or so and then retry, and loop over that a few times.

Honestly it seems smarter to just recommend that people poll every ten minutes. But if we were polling based on a fedmsg trigger or something then we might not get another one for some time. So it looks like this will need a couple of settings that will need to be tuned for the particular situation.

I decided against doing anything complicated and instead the run will just fail immediately when rsync throws an error that we know we can't retry. Otherwise it will retry, just as it always has.
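As a rough sketch of that behavior (not the code from the commit), the loop below stops at once on an exit code known to be pointless to retry and otherwise retries as before. The set `NON_RETRYABLE`, the function name, and the retry count and delay are all assumptions for illustration; 24 is used because the discussion above treats "vanished source files" as that code.

```python
import time

# Assumption: exit codes for which another rsync run cannot succeed.
NON_RETRYABLE = {24}


def run_with_retries(run_once, max_retries=3, delay=60, sleep=time.sleep):
    """Call run_once() until it returns 0, a non-retryable code, or retries run out.

    run_once is any callable returning an rsync-style exit code.
    Returns the last exit code seen.
    """
    code = None
    for attempt in range(max_retries + 1):
        code = run_once()
        if code == 0 or code in NON_RETRYABLE:
            return code  # success, or a failure that retrying cannot fix
        if attempt < max_retries:
            sleep(delay)  # wait before the next attempt
    return code
```

With short poll intervals, giving up immediately costs little: the next poll picks up a fresh file list on its own.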

I will document that poll times can and generally should be short.

Metadata Update from @tibbs:
- Issue untagged with: immediate
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

5 years ago
