#43 CI errors are undecipherable
Opened 10 months ago by churchyard. Modified 2 months ago

When the CI errors out, like here:

https://jenkins-continuous-infra.apps.ci.centos.org/blue/organizations/jenkins/fedora-rawhide-pr-pipeline/detail/fedora-rawhide-pr-pipeline/1109/pipeline/

We have no idea what went wrong. The logs are unreadable and the problem is hidden. The only thing we can do is kick it with [citest] until it runs.


Metadata Update from @bookwar:
- Issue assigned to bookwar

10 months ago

Error examples - we try to make it run, but we have no idea what's wrong:
https://src.fedoraproject.org/rpms/python3/pull-request/95
https://src.fedoraproject.org/rpms/python38/pull-request/8

It looks like yum has been removed from Rawhide. I've opened an issue to fix the pipeline code: https://github.com/CentOS-PaaS-SIG/upstream-fedora-pipeline/issues/140

https://jenkins-continuous-infra.apps.ci.centos.org/blue/organizations/jenkins/fedora-rawhide-pr-pipeline/detail/fedora-rawhide-pr-pipeline/1192/pipeline/

Could you please help me understand what's wrong here? I see:

java.net.SocketTimeoutException: sent ping but didn't receive pong within 1000ms (after 192 successful ping/pongs)

Is it a red herring?

This error has been showing up in the pipeline for a while. @jbieren do you know more about what is causing it?

This is not what caused your test to fail, though. It seems the test caused the VM to run out of memory and it got stuck. The test playbook was aborted after 4 hours due to a timeout.

The VM log is at https://jenkins-continuous-infra.apps.ci.centos.org/job/fedora-rawhide-pr-pipeline/1192/artifact/package-tests/logs/test_subject.qcow2.guest.log

Out of memory: Killed process 13838 (python3.7) total-vm:64616kB, anon-rss:46568kB, file-rss:640kB, shmem-rss:0kB

Do I read that correctly as 64 MiB total RAM?

"640K ought to be enough for anybody"

By default the VM is provisioned with 1G. If I read it correctly, about 900 MiB of that is free to use.

[ 0.259070] Memory: 910136K/1048052K available

To increase the default value, you need to use fmf to create a provision.fmf file [1]

[1] https://pagure.io/standard-test-roles (Flexible Metadata Format for default provisioner(s))
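
For anyone else reading these two log lines: the total-vm figure in the OOM-killer message is the virtual size of the killed process, not the RAM of the VM, while the boot-time "Memory:" line gives available/total for the whole guest. A minimal Python sketch of the arithmetic, assuming the kernel's "kB"/"K" units are multiples of 1024 bytes (the function names are made up for illustration):

```python
import re


def kib_to_mib(kib):
    """Convert a kernel-reported size in KiB to MiB."""
    return kib / 1024


def parse_oom_line(line):
    """Extract pid, process name and the *-vm / *-rss sizes (in kB) from an OOM-killer line."""
    proc = re.search(r"Killed process (\d+) \(([^)]+)\)", line)
    if proc is None:
        raise ValueError("not an OOM-killer line")
    sizes = {key: int(val) for key, val in re.findall(r"([\w-]+):(\d+)kB", line)}
    return {"pid": int(proc.group(1)), "name": proc.group(2), **sizes}


def parse_boot_memory(line):
    """Return (available_kib, total_kib) from the kernel's boot-time 'Memory:' line."""
    mem = re.search(r"Memory: (\d+)K/(\d+)K available", line)
    if mem is None:
        raise ValueError("not a boot-time Memory line")
    return int(mem.group(1)), int(mem.group(2))


# The two lines quoted in this ticket.
OOM = ("Out of memory: Killed process 13838 (python3.7) total-vm:64616kB, "
       "anon-rss:46568kB, file-rss:640kB, shmem-rss:0kB")
BOOT = "[ 0.259070] Memory: 910136K/1048052K available"

if __name__ == "__main__":
    killed = parse_oom_line(OOM)
    available, total = parse_boot_memory(BOOT)
    print(f"{killed['name']} virtual size: {kib_to_mib(killed['total-vm']):.1f} MiB")
    print(f"guest memory: {kib_to_mib(available):.1f} of {kib_to_mib(total):.1f} MiB available at boot")
```

So the 64616kB describes the one python3.7 process that was killed, and the guest as a whole had roughly 1 GiB.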

@bgoncalv I don't know why it is happening. It started showing up after a kubernetes plugin upgrade. I do know that it shows up all the time, but it is never the reason why anything fails. Other people are having the issue too. Some suggest changing the Read or Connection timeout under Cloud 'openshift' here https://jenkins-continuous-infra.apps.ci.centos.org/configure (bottom of the page) from 0 to some higher number (one person said 300 made the errors go away). You could try that to see if it helps, but I don't have much time to commit to debugging it right now.

@jbieren thanks, I've set the Connection timeout to 300, so far it looks good.

That is good to know, thanks @bgoncalv !

I've tried increasing the memory with a provision.fmf in https://src.fedoraproject.org/rpms/python3/pull-request/107, but the CI errors out. I don't know why :(

Note that there are plenty of comments here, but nothing about the actual problem: the errors are undecipherable. I always end up asking an expert to decipher them for me.

Another one where I have no idea: https://jenkins-continuous-infra.apps.ci.centos.org/blue/organizations/jenkins/fedora-rawhide-pr-pipeline/detail/fedora-rawhide-pr-pipeline/1607/pipeline/

The reason for the failure was "virt-customize: error: no operating systems were found in the guest image", but why virt-customize failed in this case is a mystery to me: I see you re-triggered the PR and it passed, and the qcow2 image used was the same. We even test the base qcow2 to make sure it is usable by the pipeline before we start using it to run the tests....
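
The three messages deciphered by hand in this ticket so far could be looked up mechanically. A rough Python sketch, assuming the full console or guest log is available as text; the signatures are the strings quoted above, while the labels and the triage() helper are hypothetical and not something the pipeline provides:

```python
import re

# Signatures are taken from the messages quoted in this ticket.
# The labels are illustrative wording, not pipeline output.
SIGNATURES = [
    (re.compile(r"Out of memory: Killed process"),
     "test VM ran out of memory"),
    (re.compile(r"virt-customize: error: no operating systems were found"),
     "base qcow2 image is unusable"),
    (re.compile(r"sent ping but didn't receive pong"),
     "likely noise: ping/pong timeout"),
]


def triage(log_text):
    """Return (label, line) pairs for every log line matching a known signature."""
    hits = []
    for line in log_text.splitlines():
        for pattern, label in SIGNATURES:
            if pattern.search(line):
                hits.append((label, line.strip()))
                break  # first matching signature wins for this line
    return hits


if __name__ == "__main__":
    sample = "\n".join([
        "java.net.SocketTimeoutException: sent ping but didn't receive pong "
        "within 1000ms (after 192 successful ping/pongs)",
        "some unrelated line",
        "Out of memory: Killed process 13838 (python3.7) total-vm:64616kB",
    ])
    for label, line in triage(sample):
        print(f"{label}: {line}")
```

A list like this only covers failures somebody has already diagnosed once, so it would shorten the repeat cases rather than replace the expert.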

It looks like there was a network issue when fetching the base qcow2. That is something the pipeline needs to handle better...

14:28:21  + curl --connect-timeout 5 --retry 5 --retry-delay 0 --retry-max-time 60 -L -k -O https://jenkins-continuous-infra.apps.ci.centos.org/job/fedora-rawhide-image-test/lastSuccessfulBuild/artifact/Fedora-Rawhide.qcow2
14:28:21    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
14:28:21                                   Dload  Upload   Total   Spent    Left  Speed
14:28:21  
0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   394  100   394    0     0   2886      0 --:--:-- --:--:-- --:--:--  2897
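
The transfer above ended after 394 bytes, which cannot be a complete Rawhide image whatever those bytes were. One way the pipeline could catch this right after the download is a cheap sanity check before handing the file to virt-customize. A minimal Python sketch, assuming a local file path; the looks_like_qcow2() helper and its 1 MiB size threshold are hypothetical choices, while the 4-byte magic is the one the qcow2 format defines:

```python
import os

# First four bytes of every qcow/qcow2 image: 'Q', 'F', 'I', 0xfb.
QCOW2_MAGIC = b"QFI\xfb"


def looks_like_qcow2(path, min_size=1024 * 1024):
    """Cheap sanity check for a downloaded image.

    Returns True only if the file is at least min_size bytes (an arbitrary
    threshold chosen for this sketch) and starts with the qcow magic bytes.
    It does not prove the image is bootable.
    """
    if os.path.getsize(path) < min_size:
        return False
    with open(path, "rb") as fh:
        return fh.read(4) == QCOW2_MAGIC


if __name__ == "__main__":
    import sys

    image = sys.argv[1]
    if not looks_like_qcow2(image):
        sys.exit(f"{image}: not a usable qcow2 image, the download probably failed")
    print(f"{image}: looks like a qcow2 image")
```

Failing at this point would at least turn "no operating systems were found in the guest image" into an error message that names the download as the culprit.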
