#44 Improve Overall CI Stability (was: CI errors happen far too often)
Opened a year ago by churchyard. Modified 2 months ago

Having a CI (any CI) available is a magnificent improvement over having no CI at all. Seeing the green badge of success brings me endorphin boosts. However, I far too often see the red error badge without actually knowing what's wrong.

In my eyes, the current CI is unreliable and flaky. Let me know what data you need to back up this claim (if any).

I think this needs to be solved before we consider enabling rawhide gating.

There is no single CI which is going to be enabled in gating. Gating is a mechanism which allows different CI pipelines to vote on a change; gating may then use or ignore these votes when making decisions.

When you are referring to "current CI", I assume you mean the CI pipeline which runs tests configured via the STI interface https://docs.fedoraproject.org/en-US/ci/standard-test-roles/ ?

I am working on making a script/tool to report on test failures vs test errors of this CI pipeline. I'll post here once I have the data.

Metadata Update from @bookwar:
- Issue assigned to bookwar

a year ago

When you are referring to "current CI", I assume you mean the CI pipeline...

Correct. Sorry for not naming things right.

This is one of the most annoying issues with the CI. So far this was only a PR thing, so we just showered the PR with [citest] until it passed. However, given that there is no way to rerun a gating test, the unreliability of the test system will be much more painful. Is there any chance this issue can be prioritized? How can I help? Do you need more data?

@churchyard so this ticket predates my joining the team, and things seem to be a bit better than described here.

What are your thoughts on CI today? The same as above or improved?

Still quite unreliable, unfortunately. Jobs don't start, jobs error. Maybe not that often, but still not good, sorry.

@jimbair AFAIK, we have Prometheus metrics of the Jenkins master enabled now.

The monitoring infra is unfortunately still internal. But how about we start by adding a report on the last two weeks to our Fedora CI SIG meetings?

Things like the number of builds executed and the number of failures? Average and maximum time jobs spend waiting in the queue?
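The numbers suggested here could be pulled from a build log export; a minimal awk sketch over a hypothetical CSV of `status,queue_seconds` lines (the file name and format are assumptions for illustration, not the actual Jenkins/Prometheus export):

```shell
#!/bin/sh
# Hypothetical export: one "status,queue_seconds" line per build.
printf '%s\n' \
    "success,30" \
    "failure,120" \
    "success,45" \
    "failure,600" > builds.csv

# Summarize: total builds, failures, average and max queue wait.
awk -F, '
    { total++; sum += $2; if ($2 > max) max = $2 }
    $1 == "failure" { fail++ }
    END {
        printf "builds=%d failures=%d avg_wait=%.2f max_wait=%d\n",
               total, fail, sum / total, max
    }
' builds.csv
```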

FTR, with Zuul the situation is better now: we have two CIs for redundancy.

@churchyard would you be open to renaming this issue to "Improve Overall CI Stability"? We can keep this issue open to track the progress on moving our infra to a new cluster, and improvements to make it more stable.

This also feels like an extension of what was discussed in issue #43

More examples: https://jenkins-continuous-infra.apps.ci.centos.org/blue/organizations/jenkins/fedora-rawhide-pr-pipeline/detail/fedora-rawhide-pr-pipeline/3705/pipeline

In this case it seems we had a connection issue with Pagure and failed to fetch the PR; we should probably retry it in the pipeline.

10:33:36  + git fetch -fu origin refs/pull/25/head:pr
10:33:38  error: RPC failed; result=22, HTTP code = 404
10:33:38  fatal: The remote end hung up unexpectedly
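A retry wrapper in the pipeline's shell step could absorb transient Pagure hiccups like the 404 above; a minimal POSIX sh sketch (the `retry` helper, attempt count, and delay are my own illustration, not existing pipeline code):

```shell
#!/bin/sh
# retry CMD...: run CMD up to 3 times, pausing between attempts.
retry() {
    attempts=3
    delay=1  # seconds; a real pipeline would likely wait longer or back off
    i=1
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        echo "attempt $i/$attempts failed: $*" >&2
        [ "$i" -lt "$attempts" ] && sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# In the pipeline this would wrap the fetch from the log, e.g.:
#   retry git fetch -fu origin refs/pull/25/head:pr
retry true && echo "fetch ok"
```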

and https://jenkins-continuous-infra.apps.ci.centos.org/blue/organizations/jenkins/fedora-rawhide-pr-pipeline/detail/fedora-rawhide-pr-pipeline/3702/pipeline/

It appears to be another connection issue with Pagure:

failed: [/workDir/workspace/fedora-rawhide-pr-pipeline/images/test_subject.qcow2 -> test-runner] (item={'repo': 'https://src.fedoraproject.org/rpms/pyproject-rpm-macros.git', 'dest': 'pyproject-rpm-macros'}) => changed=false

@astepano do you think this is a case where STR should retry as well?
