#43 Make CI errors understandable when infrastructure failures occur
Opened a year ago by churchyard. Modified 2 months ago

When the CI errors out, like here:

https://jenkins-continuous-infra.apps.ci.centos.org/blue/organizations/jenkins/fedora-rawhide-pr-pipeline/detail/fedora-rawhide-pr-pipeline/1109/pipeline/

We have no idea what went wrong. The logs are unreadable and the problem is hidden. The only thing we can do is kick it with [citest] until it passes.


Metadata Update from @bookwar:
- Issue assigned to bookwar

a year ago

Error examples: we try to make it run, but we have no idea what's wrong:
https://src.fedoraproject.org/rpms/python3/pull-request/95
https://src.fedoraproject.org/rpms/python38/pull-request/8

It looks like yum has been removed from Rawhide. I've opened an issue to fix the pipeline code: https://github.com/CentOS-PaaS-SIG/upstream-fedora-pipeline/issues/140

https://jenkins-continuous-infra.apps.ci.centos.org/blue/organizations/jenkins/fedora-rawhide-pr-pipeline/detail/fedora-rawhide-pr-pipeline/1192/pipeline/

Could you please help me understand what's wrong here? I see:

java.net.SocketTimeoutException: sent ping but didn't receive pong within 1000ms (after 192 successful ping/pongs)

Is it a red herring?

https://jenkins-continuous-infra.apps.ci.centos.org/blue/organizations/jenkins/fedora-rawhide-pr-pipeline/detail/fedora-rawhide-pr-pipeline/1192/pipeline/
Could you please help me understand what's wrong here? I see:

java.net.SocketTimeoutException: sent ping but didn't receive pong within 1000ms (after 192 successful ping/pongs)

This error has been showing in the pipeline for a while. @jbieren do you know more details what is causing it?

Is it a red herring?

This is not what caused your test to fail, though. It seems the test made the VM run out of memory and it got stuck; the test playbook aborted after 4 hours due to the timeout.

The VM log is at https://jenkins-continuous-infra.apps.ci.centos.org/job/fedora-rawhide-pr-pipeline/1192/artifact/package-tests/logs/test_subject.qcow2.guest.log

Out of memory: Killed process 13838 (python3.7) total-vm:64616kB, anon-rss:46568kB, file-rss:640kB, shmem-rss:0kB

Do I read that correctly as 64 MiB total RAM?

"640K ought to be enough for anybody"

Out of memory: Killed process 13838 (python3.7) total-vm:64616kB, anon-rss:46568kB, file-rss:640kB, shmem-rss:0kB

Do I read that correctly as 64 MiB total RAM?
"640K ought to be enough for anybody"

By default the VM is provisioned with 1G; if I read it correctly, there is about 900 MiB free to use.

[ 0.259070] Memory: 910136K/1048052K available
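Reading those two kernel lines together answers the 64 MiB question: total-vm is the killed process's virtual size, not the machine's RAM. A quick sketch of the arithmetic (values copied from the log lines above; this is an illustration, not pipeline code):

```shell
# total-vm in the OOM line is the virtual size of the killed python3.7
# process, not the system's total RAM; the boot line reports the real total.
total_vm_kb=64616    # from "Out of memory: Killed process ... total-vm:64616kB"
system_kb=1048052    # from "[ 0.259070] Memory: 910136K/1048052K available"
echo "process virtual size: $((total_vm_kb / 1024)) MiB"   # ~63 MiB
echo "system RAM:           $((system_kb / 1024)) MiB"     # ~1023 MiB, i.e. 1 GiB
```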

To increase the default value it is necessary to use fmf to create the provision.fmf [1]

[1] https://pagure.io/standard-test-roles (Flexible Metadata Format for default provisioner(s))
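For illustration, a minimal tests/provision.fmf along these lines bumps the VM memory for the standard qcow2 provisioner (the 4G value is just an example; check the standard-test-roles documentation for the exact knobs your provisioner supports):

```yaml
---

standard-inventory-qcow2:
  qemu:
    m: 4G
```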

@bgoncalv I don't know why it is happening. It started showing up after a Kubernetes plugin upgrade. It shows all the time, but it is never the reason anything fails, and other people are having the issue too. Some suggest changing the Read or Connection timeout under Cloud 'openshift' here https://jenkins-continuous-infra.apps.ci.centos.org/configure (bottom of the page) from 0 to some higher number (one person said 300 made the errors go away). You could try that to see if it helps, but I don't have much time to commit to debugging it right now.

@jbieren thanks, I've set the Connection timeout to 300, so far it looks good.

That is good to know, thanks @bgoncalv !

By default the VM is provisioned with 1G; if I read it correctly, there is about 900 MiB free to use.
[ 0.259070] Memory: 910136K/1048052K available
To increase the default value it is necessary to use fmf to create the provision.fmf

I've tried https://src.fedoraproject.org/rpms/python3/pull-request/107 - the CI errors. I don't know why :(

Note that there are plenty of comments here but nothing about the actual problem: the errors are undecipherable. I always end up asking an expert to decipher them for me.

Another one where I have no idea: https://jenkins-continuous-infra.apps.ci.centos.org/blue/organizations/jenkins/fedora-rawhide-pr-pipeline/detail/fedora-rawhide-pr-pipeline/1607/pipeline/

The reason for the failure was "virt-customize: error: no operating systems were found in the guest image", but why virt-customize failed in this case is a mystery to me, since when you re-triggered the PR it passed and the qcow2 image used was the same. We even test the base qcow2 to make sure it is usable by the pipeline before we start using it to run the tests...

Another one where I have no idea: https://jenkins-continuous-infra.apps.ci.centos.org/blue/organizations/jenkins/fedora-rawhide-pr-pipeline/detail/fedora-rawhide-pr-pipeline/1607/pipeline/

The reason for the failure was "virt-customize: error: no operating systems were found in the guest image", but why virt-customize failed in this case is a mystery to me, since when you re-triggered the PR it passed and the qcow2 image used was the same. We even test the base qcow2 to make sure it is usable by the pipeline before we start using it to run the tests...

It looks like there is some network issue when fetching the base qcow2; it is something the pipeline needs to handle better...

14:28:21  + curl --connect-timeout 5 --retry 5 --retry-delay 0 --retry-max-time 60 -L -k -O https://jenkins-continuous-infra.apps.ci.centos.org/job/fedora-rawhide-image-test/lastSuccessfulBuild/artifact/Fedora-Rawhide.qcow2
14:28:21    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
14:28:21                                   Dload  Upload   Total   Spent    Left  Speed
14:28:21  
0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   394  100   394    0     0   2886      0 --:--:-- --:--:-- --:--:--  2897
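One way the pipeline could catch this earlier: a 394-byte "download" is almost certainly an error page, not an image. A hedged sketch (the function name is made up for illustration, not the pipeline's actual code) that fails fast by checking the qcow2 magic bytes:

```shell
# qcow2 images start with the 4-byte magic "QFI\xfb"; an HTML error page
# does not, so a cheap header check catches truncated or bogus downloads.
is_qcow2() {
    [ "$(head -c 3 "$1")" = "QFI" ]
}

# Hypothetical usage after the curl step shown above:
#   is_qcow2 Fedora-Rawhide.qcow2 || { echo "ERROR: bad qcow2 download" >&2; exit 1; }
```

Adding `--fail` to the curl invocation would also make HTTP errors return a non-zero exit status instead of silently saving the server's error page to disk.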

@churchyard reading over this ticket, this looks to mostly be intermittent infra issues. Ideally we would ping this over in fedora-infra for this type of error, though I know pinpointing the source can be fun. But the first problem was a VM dying, and the second was a networking issue, and both eventually resolved themselves.

That said, I feel like this issue was a flash in the pan, and we continue trying to make the infra better. I know we are moving our Jenkins instance and other jobs to new infra in the coming months, so hopefully that improves things.

I'll mark this closed, but if there's a targeted RFE or I missed something, feel free to call me out. I'm trying to clean up our queue so we know what we need to attack.

P.S. I enjoy your quotes while reading over the tickets. Rebellions built on hope is thus far my favorite.

Metadata Update from @jimbair:
- Issue status updated to: Closed (was: Open)

3 months ago

I respectfully disagree with closing this one. When an error happens, it should be self-explanatory. Current errors, when they happen, are still cryptic.

P.S. I enjoy your quotes while reading over the tickets. Rebellions built on hope is thus far my favorite.

:blush:

Metadata Update from @churchyard:
- Issue status updated to: Open (was: Closed)

3 months ago

Hmm - maybe we can add it into #44? I want to avoid having too many open tickets around general infra errors/stability. Or, if we do want to keep it open, maybe rename this one to "Improve CI Error Readability"?

I'm okay with either. =)

"CI errors are undecipherable" is a problem, "Improve CI Error Readability" is the solution to that problem.

I don't care whether this ticket is named after the problems or the solution.

But this and #44 are different problems.

But this and #44 are different problems.

Agreed; one says "make it better" and the other says "make it easy to understand when it's not better". =) I've updated the title to reflect that as well.

While we're on the subject, do you have a recent error that we could try to focus on improving? Or is the original error worth diving into?

The above failure is from the following script:
https://github.com/CentOS-PaaS-SIG/upstream-fedora-pipeline/blob/master/config/Dockerfiles/fedoraci-runner/virt-customize/virt-customize.sh#L5-L7

# A shell script that pulls the latest Fedora cloud build
# and uses virt-customize to inject rpms into it. It
# outputs a new qcow2 image for you to use.

While I don't know the exact cause of the failure, what would "good" look like if this script were to fail in the future vs what you're seeing today?

While I don't know the exact cause of the failure

That's my point. Thank You.

Example failure that I'd like here:

Cannot install required packages:

Error:
 Problem: cannot install the best candidate for the job
  - nothing provides python3.8dist(guizero) >= 1.1 needed by mu-1.0.3-1.fc32.noarch

Preferably in big red letters somewhere where I don't need to dig for it.
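A sketch of what that could look like in the pipeline's shell scripts: grep the captured log for the dnf dependency-problem lines and echo them in red as a final summary. The function name, patterns, and log handling here are illustrative assumptions, not the pipeline's actual code:

```shell
# Print a short, prominent summary of dependency errors from a captured
# virt-customize/dnf log, instead of leaving them buried mid-log.
summarize_failure() {
    printf '\033[1;31m==== CI FAILURE SUMMARY ====\033[0m\n'
    # Keep only the lines that explain the dependency problem; fall back
    # to the log tail if no known pattern matches.
    grep -E 'Problem:|nothing provides|Error:' "$1" || tail -n 20 "$1"
}
```

The wrapper running virt-customize would call this on failure, so the last thing in the console output is the actual reason, not a Java stack trace.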

@bgoncalv is there any way to add this kind of readability into the virt-customize script? Maybe we can sit down Thursday or Friday together to look at it?

@jimbair Sure, just ping me later on and we can see what we can do.
