#159 Starting a job takes more than 10 min and 8/10 fail on timeout
Closed: Fixed 2 years ago by dkirwan. Opened 3 years ago by liranmauda.


Metadata Update from @dkirwan:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: centos-ci-infra, medium-gain, medium-trouble

3 years ago

Issue status updated to: Closed (was: Open)
Issue close_status updated to: Fixed

3 years ago

Metadata Update from @pingou:
- Issue status updated to: Open (was: Closed)

3 years ago

Issue status updated to: Closed (was: Open)
Issue close_status updated to: Fixed

3 years ago

Issue status updated to: Open (was: Closed)

3 years ago

Metadata Update from @dkirwan:
- Issue assigned to dkirwan

3 years ago

Hi @liranmauda,

My apologies for not getting back to you sooner.

Just catching up: I was reading the #centos-ci chat logs, where others have discussed rate limiting, but the GitHub docs state that they don't rate-limit connections like cloning/pulling data: https://docs.github.com/en/free-pro-team@latest/github/getting-started-with-github/troubleshooting-connectivity-problems

They do, however, apply rate limiting when interacting with applications via the API: https://docs.github.com/en/free-pro-team@latest/developers/apps/rate-limits-for-github-apps

But I don't believe this is the workflow you are using on CentOS CI.
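
(For completeness, if API rate limiting were ever suspected, the remaining quota can be checked with a plain curl against the public rate-limit endpoint; the call below is unauthenticated and only meant as an illustration.)

    # Show the current GitHub API rate-limit status as JSON (unauthenticated).
    # Add -H "Authorization: token <PAT>" to see the limits for a specific account.
    curl -s https://api.github.com/rate_limit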

I did a quick Google search for the error messages: https://issues.jenkins.io/browse/JENKINS-36269

ERROR: Timeout after 10 minutes usually means one of the following:

- incorrect credentials were provided with the Jenkins job definition
- the repository is so large that it takes more than 10 minutes to clone
- bandwidth to the git server is slow enough that it takes more than 10 minutes to clone

Can you double-check that the credentials are correct? In the morning I'll investigate the state of the network and check how long it's taking to clone this repo.
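
To help separate those three causes, timing a plain anonymous clone from a host with good connectivity is a reasonable first check (the repository URL below is a placeholder, not taken from the job definition):

    # Time an anonymous HTTPS clone. If this finishes well under 10 minutes,
    # repository size and bandwidth are unlikely culprits, which points back
    # at the credentials in the job definition.
    time git clone https://github.com/<org>/<repo>.git /tmp/clone-test
    rm -rf /tmp/clone-test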

As far as we can see, the credentials are ok.

Adding @devos

Metadata Update from @liranmauda:
- Issue private status set to: False (was: True)

3 years ago

Hmm ok I've put a quick verification script together to test this out.

https://gist.github.com/davidkirwan/a05d6202ad3584e7b368b16dd6b686a8

So far I've run it on the OpenShift node itself:

real    0m4.140s
user    0m5.106s
sys     0m0.864s

Tomorrow I'll try launching a container with PV storage and run it there.
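
(The gist itself isn't reproduced here, but the rough idea of such a timing test, cloning into a directory on whichever storage is being exercised, is something like the sketch below; the repo URL and paths are placeholders.)

    #!/bin/bash
    # Clone a repository into a target directory and time the whole operation.
    # Point TARGET_DIR at the storage under test (local disk, NFS-backed PV, ...).
    TARGET_DIR=${1:-/tmp/clone-speed-test}
    rm -rf "$TARGET_DIR"
    time git clone https://github.com/<org>/<repo>.git "$TARGET_DIR"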

We're hitting this too in CoreOS CI:

ERROR: Timeout after 10 minutes
ERROR: Checkout failed
hudson.plugins.git.GitException: Command "git checkout -f e0145f81ef84d144e4bbda9f3874bad82c0df843" returned status code 143:
stdout:
stderr:
    at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2430)
    at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.access$1100(CliGitAPIImpl.java:81)
    at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$9.execute(CliGitAPIImpl.java:2743)
Caused: hudson.plugins.git.GitException: Could not checkout e0145f81ef84d144e4bbda9f3874bad82c0df843
    at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$9.execute(CliGitAPIImpl.java:2767)
    at jenkins.plugins.git.MergeWithGitSCMExtension.checkout(MergeWithGitSCMExtension.java:144)
    at jenkins.plugins.git.MergeWithGitSCMExtension.decorateRevisionToBuild(MergeWithGitSCMExtension.java:110)
    at hudson.plugins.git.GitSCM.determineRevisionToBuild(GitSCM.java:1063)
    at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1168)
    at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:124)
    at org.jenkinsci.plugins.workflow.cps.CpsScmFlowDefinition.create(CpsScmFlowDefinition.java:155)
    at org.jenkinsci.plugins.workflow.multibranch.SCMBinder.create(SCMBinder.java:120)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.run(WorkflowRun.java:293)
    at hudson.model.ResourceController.execute(ResourceController.java:97)
    at hudson.model.Executor.run(Executor.java:428)

It coincides with us moving back to using an NFS PVC for Jenkins. So this is likely related to slow NFS again?

I think it's likely, yes @jlebon. From what I've been able to figure out, that error is related to the operation timing out after 10 minutes.

I didn't get to try it out yet from within a pod writing to the NFS storage; I'll look at this now.
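
One simple way to check raw write throughput from inside a pod, assuming $WORKSPACE sits on the NFS-backed volume, is a dd write with an fdatasync at the end so the figure reflects the storage rather than the page cache (the sizes are arbitrary):

    # Write 512 MiB into the workspace and report throughput; clean up afterwards.
    dd if=/dev/zero of="$WORKSPACE/ddtest" bs=1M count=512 conv=fdatasync
    rm -f "$WORKSPACE/ddtest"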

Question: do we need pods writing to NFS? I remember when I used pod-based workers, I used ephemeral workspaces with no persistent storage (emptyDirWorkspaceVolume, if I remember the setting correctly).

@liranmauda we were checking the noobaa-core_unit pipeline build history and it seems to be much better nowadays. Are you still experiencing this issue?

> Question: do we need pods writing to NFS? I remember when I used pod-based workers, I used ephemeral workspaces with no persistent storage (emptyDirWorkspaceVolume, if I remember the setting correctly).

For us, all our tests do indeed run with emptyDirs. But it'd be nice if Jenkins itself could use NFS. Otherwise you basically lose all state every time the pod is rescheduled. (Apart from the annoyance of losing logs, a bigger issue is the thundering herd of all the PRs across all the repos being retested -- there are some ways to work around it, but it's hacky.)

@dkirwan I looked into it, and indeed it has stopped failing on timeout.

But it still takes a long time to start, and I think it is fetching for a very long time.

All the tests and stages together take ~30 min, but in this example we can see that the job took 50 min.

From testing today, I'm getting an average of 10 minutes or so to complete the test script from within a Jenkins job writing to our NFS storage. It is completing without timing out, but it's very slow.

The same job running on ephemeral storage takes ~5 seconds or so.

I'll try to bug @arrfab or @bstinson tomorrow to get some help looking at the storage.

Any updates on this? Would like to be able to switch back to NFS.

@jlebon it's currently working, but slowly. No change on that front; still investigating. I'm currently trying to test the speed of this storage mounted outside OpenShift.
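
The rough plan for that out-of-cluster test is to mount the same export directly on a plain host and repeat a write test there, to rule the OpenShift layer in or out (the server and export path below are placeholders):

    # Mount the NFS export on a host outside OpenShift and measure write throughput.
    sudo mkdir -p /mnt/nfs-test
    sudo mount -t nfs <nfs-server>:/<export> /mnt/nfs-test
    sudo dd if=/dev/zero of=/mnt/nfs-test/ddtest bs=1M count=512 conv=fdatasync
    sudo rm -f /mnt/nfs-test/ddtest
    sudo umount /mnt/nfs-test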

Any updates on this? We're still hoping to move back to NFS eventually once this is fixed.

Metadata Update from @dkirwan:
- Issue untagged with: medium-gain, medium-trouble
- Issue priority set to: None (was: Waiting on Assignee)
- Issue tagged with: high-gain, high-trouble

3 years ago

Hi @jlebon, no update since then; the storage is working, but slowly.

We've been looking at everything that could be causing the slowness and have identified a number of possible causes.

I'm hoping to get some time with arrfab and sidharthvipul to sit down and look into it further, but there are other priorities at the moment.

We've recently reshaped our storage array to RAID10. From initial testing it is much faster than it was previously, so I think I'll close this issue as fixed. If we continue to see slowness issues with the storage, we'll likely have to begin the investigation from scratch, given some of our recent changes.

Metadata Update from @dkirwan:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 years ago
