#12023 release-monitoring.org crawlers seem to not be running since ~June 26
Closed: Fixed 5 months ago by zlopez. Opened 6 months ago by decathorpe.

It appears that the crawlers for different sources in release-monitoring.org have not been checking for new versions since about June 26. Checking projects where I know there have been new upstream releases shows that release-monitoring.org doesn't know about them, so it's not a problem with filing bugs in bugzilla. I don't know if this affects all crawlers, but at least the crates.io crawler has been dead for about a week.

Forcing a refresh manually for those projects makes release-monitoring.org pick up the new releases and the-new-hotness file bugs for them, so from what I can tell, everything is working except the crawlers.


Metadata Update from @zlopez:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: Needs investigation, medium-gain, ops

6 months ago

Metadata Update from @zlopez:
- Issue assigned to zlopez

6 months ago

I checked the release-monitoring.org deployment and there were no new log entries since June 26th. The OpenShift namespace probably needs more resources. I redeployed the job; let's see if that helps.

I have now gotten notifications for almost 100 new bugs being filed, for projects across the whole [a-z] range, so I assume that means it's working again :)

The restart of the job did the trick, so I'm closing this one as fixed :-)

Metadata Update from @zlopez:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

6 months ago

Metadata Update from @zlopez:
- Issue untagged with: Needs investigation
- Issue tagged with: low-trouble

6 months ago

It looks like this is happening again. release-monitoring.org has no knowledge of any releases that happened in the last ~24 hours or so.

Metadata Update from @decathorpe:
- Issue status updated to: Open (was: Closed)

6 months ago

I'm trying to understand what is happening here, as the pod is still running in OpenShift, but the last log message is from 2024-07-11 02:25:27 and there isn't any error that looks like it could have caused this.

Let me restart the pod and enable debug output; maybe we will have more info next time it happens. It could just be stuck on some project that is set up incorrectly.

Thank you, restarting seems to have done the trick for now.

The check-service got stuck again at 2024-07-21 01:36:42. With the debug log enabled, I will try to investigate what caused it and hopefully fix the issue.

I found out I had the logging options set incorrectly and was receiving DEBUG messages for urllib3 instead of anitya itself. I changed the setting to the correct one this time and restarted the pod again. I will watch for anything strange in the future.
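
For reference, a minimal sketch of the kind of logging configuration this amounts to: DEBUG scoped to Anitya's own logger while urllib3 stays quiet. The logger names and handler setup here are assumptions for illustration, not the deployed config.

```python
import logging.config

# Hypothetical dictConfig sketch: enable DEBUG only for the "anitya"
# logger and keep urllib3 at WARNING so per-request noise doesn't
# drown out the messages we actually care about.
logging.config.dictConfig({
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "simple": {"format": "%(asctime)s %(name)s %(levelname)s %(message)s"},
    },
    "handlers": {
        "console": {"class": "logging.StreamHandler", "formatter": "simple"},
    },
    "loggers": {
        "anitya": {"level": "DEBUG"},      # verbose output for the service itself
        "urllib3": {"level": "WARNING"},   # silence HTTP request chatter
    },
    "root": {"level": "INFO", "handlers": ["console"]},
})
```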

The pod got stuck again, and I didn't find any obvious reason for it. It processes projects for a while and then gets stuck. The pod itself also behaves strangely: it doesn't react, and you can't even connect to it.

I will try to investigate further and see if I can dig up more info.

It seems to be stuck at 100% CPU usage, probably looping over something. I will add more debug messages to the check process to see what is happening there and restart it again.
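
One option for getting that kind of debug information without having to attach to an unresponsive pod (just a suggestion, not something Anitya already does) is to have the service periodically dump the tracebacks of all its threads to stderr, so the pod logs show exactly where it is looping when it wedges.

```python
import faulthandler
import sys

# Hypothetical diagnostic hook: every 10 minutes, write the current
# traceback of every thread to stderr (which ends up in the pod logs).
# If the service gets stuck at 100% CPU, the last dump shows the line
# each thread was executing.
faulthandler.dump_traceback_later(600, repeat=True, file=sys.stderr)

# ... start the regular check run here ...
```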

So the check was stuck again, but I still can't find anything even with DEBUG enabled.
I originally thought it was some badly configured project, but it seems to be more of a problem somewhere in the parallelization.

I need to think about how to go forward with this. I already have a timeout in place for the checks, but it obviously doesn't work as intended.
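
To illustrate why a timeout alone isn't enough (a generic sketch, not Anitya's actual check code): waiting on a thread-pool future with a timeout only stops the caller from waiting, while the worker thread itself keeps running, so a check stuck in an endless loop still occupies its slot and burns CPU.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def slow_check():
    # Stand-in for a version check that hangs far longer than the timeout.
    time.sleep(30)
    return "1.2.3"

pool = ThreadPoolExecutor(max_workers=1)
future = pool.submit(slow_check)
try:
    future.result(timeout=5)
except TimeoutError:
    # The caller stops waiting, but the worker thread keeps running and
    # cannot be killed from outside; if the check never returned, this
    # slot would stay occupied and the whole run would eventually wedge.
    print("check timed out, worker still running:", future.running())
pool.shutdown()  # still blocks until the hung check finally returns
```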

I decided to rewrite the parallelization to use processes instead of threads. That should help when the processing gets stuck again, because it's easier to kill a stuck process than a stuck thread.
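
The idea, roughly (a sketch of the approach, not the actual patch): with multiprocessing, a worker that exceeds its deadline can simply be terminated from the parent, which is not possible with a thread.

```python
import multiprocessing
import time

def check_project(name):
    # Stand-in for checking one project's upstream for new versions.
    time.sleep(1)
    print(f"checked {name}")

if __name__ == "__main__":
    proc = multiprocessing.Process(target=check_project, args=("example-project",))
    proc.start()
    proc.join(timeout=5)  # wait up to 5 seconds for the check to finish
    if proc.is_alive():
        # Unlike a thread, a stuck process can be forcibly terminated,
        # so one misbehaving check no longer wedges the whole service.
        proc.terminate()
        proc.join()
```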

I haven't seen any more issues like that for 14 days, and I didn't get to the rewrite. So I think it was just a hiccup in the service.

I'm closing this for now; if it happens again, I will try to rewrite the parallelization part.

Metadata Update from @zlopez:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

5 months ago

This seems to have been happening again for the past 2-3 days.

Metadata Update from @decathorpe:
- Issue status updated to: Open (was: Closed)

5 months ago

You are right, it has been stuck for a few days. I will need to rewrite the parallelization part then.

It seems that I finally found the culprit. There was a deadlock on transactions in the database; I killed all the blocking transactions after getting this advice from @nils. So this issue should be solved.
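
For the record, this is roughly the kind of query that can be used to spot the blocking sessions, assuming the database is PostgreSQL; the connection string and exact query are illustrative, not copied from the deployment.

```python
import psycopg2

# Hypothetical connection string; adjust for the real deployment.
conn = psycopg2.connect("dbname=anitya")
conn.autocommit = True
cur = conn.cursor()

# List sessions that are blocked, together with the pids blocking them.
cur.execute("""
    SELECT pid, pg_blocking_pids(pid) AS blocked_by, state, query
    FROM pg_stat_activity
    WHERE cardinality(pg_blocking_pids(pid)) > 0
""")
for pid, blocked_by, state, query in cur.fetchall():
    print(f"pid {pid} blocked by {blocked_by}: {state} -- {query}")

# Once the blocking pids are identified, they can be terminated, e.g.:
# cur.execute("SELECT pg_terminate_backend(%s)", (blocking_pid,))
```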

I still did the rewrite of the parallelization code and will deploy it, as it does improve things.

Closing this as solved; if it happens again, we can reopen it.

Metadata Update from @zlopez:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

5 months ago

Great news. Thank you for looking into it!
