#159 Starting a job takes more than 10 min and 8/10 fail on timeout
Closed: Fixed 2 years ago by dkirwan. Opened 3 years ago by liranmauda.


Metadata Update from @dkirwan:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: centos-ci-infra, medium-gain, medium-trouble

3 years ago

Issue status updated to: Closed (was: Open)
Issue close_status updated to: Fixed

3 years ago

Metadata Update from @pingou:
- Issue status updated to: Open (was: Closed)

3 years ago

Issue status updated to: Closed (was: Open)
Issue close_status updated to: Fixed

3 years ago

Issue status updated to: Open (was: Closed)

3 years ago

Metadata Update from @dkirwan:
- Issue assigned to dkirwan

3 years ago

Hi @liranmauda,

My apologies for not getting back to you sooner.

Just catching up: I was reading the #centos-ci chat logs, where others have discussed rate limiting, but the GitHub docs state that they don't rate-limit connections like cloning/pulling data: https://docs.github.com/en/free-pro-team@latest/github/getting-started-with-github/troubleshooting-connectivity-problems

They do, however, apply rate limiting when interacting with applications via the API: https://docs.github.com/en/free-pro-team@latest/developers/apps/rate-limits-for-github-apps

But I don't believe this is the workflow you are using on CentOS CI.
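
(For completeness, if API rate limiting were ever suspected, the remaining quota can be checked with a plain curl against the public rate-limit endpoint; the call below is unauthenticated and only meant as an illustration.)

    # Show the current GitHub API rate-limit status as JSON (unauthenticated).
    # Add -H "Authorization: token <PAT>" to see the limits for a specific account.
    curl -s https://api.github.com/rate_limit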

I did a quick Google search for the error messages: https://issues.jenkins.io/browse/JENKINS-36269

ERROR: Timeout after 10 minutes usually means one of the following:

- incorrect credentials were provided with the Jenkins job definition
- the repository is so large that it takes more than 10 minutes to clone
- bandwidth to the git server is slow enough that it takes more than 10 minutes to clone

Can you double-check that the credentials are correct? In the morning I'll investigate the state of the network and check how long it's taking to clone this repo.
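
To help separate those three causes, timing a plain anonymous clone from a host with good connectivity is a reasonable first check (the repository URL below is a placeholder, not taken from the job definition):

    # Time an anonymous HTTPS clone. If this finishes well under 10 minutes,
    # repository size and bandwidth are unlikely culprits, which points back
    # at the credentials in the job definition.
    time git clone https://github.com/<org>/<repo>.git /tmp/clone-test
    rm -rf /tmp/clone-test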

As far as we can see, the credentials are ok.

Adding @devos

Metadata Update from @liranmauda:
- Issue private status set to: False (was: True)

3 years ago

Hmm ok I've put a quick verification script together to test this out.

https://gist.github.com/davidkirwan/a05d6202ad3584e7b368b16dd6b686a8

So far I've run it on the OpenShift node itself:

real    0m4.140s
user    0m5.106s
sys     0m0.864s

Tomorrow I'll try launching a container with PV storage and run it there.
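
(The gist itself isn't reproduced here, but the rough idea of such a timing test, cloning into a directory on whichever storage is being exercised, is something like the sketch below; the repo URL and paths are placeholders.)

    #!/bin/bash
    # Clone a repository into a target directory and time the whole operation.
    # Point TARGET_DIR at the storage under test (local disk, NFS-backed PV, ...).
    TARGET_DIR=${1:-/tmp/clone-speed-test}
    rm -rf "$TARGET_DIR"
    time git clone https://github.com/<org>/<repo>.git "$TARGET_DIR"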

We're hitting this too in CoreOS CI:

ERROR: Timeout after 10 minutes
ERROR: Checkout failed
hudson.plugins.git.GitException: Command "git checkout -f e0145f81ef84d144e4bbda9f3874bad82c0df843" returned status code 143:
stdout:
stderr:
    at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2430)
    at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.access$1100(CliGitAPIImpl.java:81)
    at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$9.execute(CliGitAPIImpl.java:2743)
Caused: hudson.plugins.git.GitException: Could not checkout e0145f81ef84d144e4bbda9f3874bad82c0df843
    at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$9.execute(CliGitAPIImpl.java:2767)
    at jenkins.plugins.git.MergeWithGitSCMExtension.checkout(MergeWithGitSCMExtension.java:144)
    at jenkins.plugins.git.MergeWithGitSCMExtension.decorateRevisionToBuild(MergeWithGitSCMExtension.java:110)
    at hudson.plugins.git.GitSCM.determineRevisionToBuild(GitSCM.java:1063)
    at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1168)
    at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:124)
    at org.jenkinsci.plugins.workflow.cps.CpsScmFlowDefinition.create(CpsScmFlowDefinition.java:155)
    at org.jenkinsci.plugins.workflow.multibranch.SCMBinder.create(SCMBinder.java:120)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.run(WorkflowRun.java:293)
    at hudson.model.ResourceController.execute(ResourceController.java:97)
    at hudson.model.Executor.run(Executor.java:428)

It coincides with us moving back to using an NFS PVC for Jenkins. So this is likely related to slow NFS again?

I think it's likely, yes @jlebon. From what I've been able to figure out, that error is related to the operation timing out after 10 minutes.

I didn't get to try it out yet from within a pod writing to the NFS storage; I'll look at this now.
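
One simple way to check raw write throughput from inside a pod, assuming $WORKSPACE sits on the NFS-backed volume, is a dd write with an fdatasync at the end so the figure reflects the storage rather than the page cache (the sizes are arbitrary):

    # Write 512 MiB into the workspace and report throughput; clean up afterwards.
    dd if=/dev/zero of="$WORKSPACE/ddtest" bs=1M count=512 conv=fdatasync
    rm -f "$WORKSPACE/ddtest"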

Question: do we need pods writing to NFS? I remember when I used pod-based workers, I used ephemeral workspaces with no persistent storage (emptyDirWorkspaceVolume, if I remember the setting correctly).

@liranmauda we were checking the noobaa-core_unit pipeline build history and it seems to be much better nowadays. Are you still experiencing this issue?

> Question: do we need pods writing to NFS? I remember when I used pod-based workers, I used ephemeral workspaces with no persistent storage (emptyDirWorkspaceVolume, if I remember the setting correctly).

For us, all our tests do indeed run with emptyDirs. But it'd be nice if Jenkins itself could use NFS. Otherwise you basically lose all state every time the pod is rescheduled. (Apart from the annoyance of losing logs, a bigger issue is the thundering herd of all the PRs across all the repos being retested -- there are some ways to work around it, but it's hacky.)

@dkirwan I looked into it, and indeed it has stopped failing on timeout.

But it still takes a long time to start, and I think it is fetching for a very long time.

All the tests and stages together take ~30 min, but in this example we can see that the job took 50 min.

From testing today, I'm getting an average of 10 minutes or so to complete the test script from within a Jenkins job writing to our NFS storage. It is completing without timing out, but it's very slow.

The same job running on ephemeral storage takes ~5 seconds or so.

I'll try to bug @arrfab or @bstinson tomorrow to get some help looking at the storage.

Any updates on this? Would like to be able to switch back to NFS.

@jlebon it's currently working, but slowly. No change on that front; still investigating. I'm currently trying to test the speed of this storage mounted outside OpenShift.
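
The rough plan for that out-of-cluster test is to mount the same export directly on a plain host and repeat a write test there, to rule the OpenShift layer in or out (the server and export path below are placeholders):

    # Mount the NFS export on a host outside OpenShift and measure write throughput.
    sudo mkdir -p /mnt/nfs-test
    sudo mount -t nfs <nfs-server>:/<export> /mnt/nfs-test
    sudo dd if=/dev/zero of=/mnt/nfs-test/ddtest bs=1M count=512 conv=fdatasync
    sudo rm -f /mnt/nfs-test/ddtest
    sudo umount /mnt/nfs-test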

Any updates on this? We're still hoping to move back to NFS eventually once this is fixed.

Metadata Update from @dkirwan:
- Issue untagged with: medium-gain, medium-trouble
- Issue priority set to: None (was: Waiting on Assignee)
- Issue tagged with: high-gain, high-trouble

3 years ago

Hi @jlebon, no update since then; the storage is working, but slowly.

We've been looking at everything that could be causing the slowness and have identified a number of possible causes.

I'm hoping to get some time with arrfab and sidharthvipul to sit down and look into it further, but there are other priorities at the moment.

We've recently reshaped our storage array to RAID10. From initial testing it is much faster than it was previously, so I think I'll close this issue as fixed. If we continue to see slowness issues with the storage, we'll likely have to begin the investigation from scratch, given some of our recent changes.

Metadata Update from @dkirwan:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 years ago
