#829 Inactive nodes in jenkins-fedora-infra
Closed: Fixed 2 years ago by arrfab. Opened 2 years ago by onosek.

Hi,

I am running unit tests on Jenkins https://jenkins-fedora-infra.apps.ocp.ci.centos.org
About 2 days ago, they started to fail (or they just do not run).
Did I miss some announcement about a migration, or is it an unexpected issue? If it is unexpected, could you please help me start the nodes again?
https://jenkins-fedora-infra.apps.ocp.ci.centos.org/computer/
So far I have been running my jobs on the F33 node. Of course, a newer node for running docker containers would be better.

Example jobs with issues:
https://jenkins-fedora-infra.apps.ocp.ci.centos.org/job/pyrpkg/
https://jenkins-fedora-infra.apps.ocp.ci.centos.org/job/odcs/


Metadata Update from @arrfab:
- Issue assigned to arrfab

2 years ago

Metadata Update from @arrfab:
- Issue tagged with: centos-ci-infra, need-more-info

2 years ago

We had a quick look and, while you seem to have been migrated to the ocp4 cluster some time ago, you were still relying on "legacy" nodes that were initially deployed for ci.centos.org (all stopped now).
We powered the VMs attached to your jenkins in ocp back on (so your jobs resumed), but you'll have to migrate your jobs to the cico-workspace pod template in your jenkins (and request ephemeral nodes from duffy to run your jobs/tests on).

Can you also "opt in" to the upcoming migration? Camila sent a mail to the dedicated ci-users list to explain the migration plan (see https://lists.centos.org/pipermail/ci-users/2022-June/004547.html). Based on that, you'll be onboarded for the next steps of both the duffy and ocp migrations to ec2.

Metadata Update from @arrfab:
- Assignee reset

2 years ago

@onosek: can you confirm that fedora-infra has opted in to the new setup?
If not, it will all stop working in August, when we migrate to the new Duffy.
Waiting for feedback.

Metadata Update from @arrfab:
- Issue priority set to: Waiting on Reporter (was: Needs Review)

2 years ago

@arrfab yes, please opt us in. My jobs will need a new home.

@onosek: thanks for confirming. Can you start looking at migrating your jobs from the legacy setup to the new one? I don't understand why you were migrated to openshift 4 but with agents from the legacy setup still connected to it.

/cc: @cgranell @phsmoura

@onosek: reminder that you should migrate to the new workflow asap (it was supposed to be done ~2y ago), as your existing jenkins agents will disappear for real soon.

Hi @arrfab, can you point us to some docs about migrating to the new workflow?

You are already on the new openshift, so normally the goal was to just use the cico-workspace pod template.
See https://sigs.centos.org/guide/ci/
From that pod you can request an ephemeral node through duffy and run your tests on it (for example, this can be as simple as deploying podman on that host and pulling whichever container you'd like to test).

I tried to start a cico-workspace pod: https://jenkins-fedora-infra.apps.ocp.ci.centos.org/job/hlintest/2/console

But the following duffy command in the pod returned an Unauthorized error:

$ oc rsh cico-workspace-lbgl7
sh-4.4$ duffy client --url https://duffy.ci.centos.org/api/v1 --auth-name fedora-infra --auth-key $CICO_API_KEY list-sessions
{
  "error": {
    "detail": "Unauthorized"
  }
}
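
For anyone scripting this check inside a job, here is a minimal Python sketch that tells an error reply like the one above apart from a normal one. It assumes the client prints a single JSON object and that failures always have the `{"error": {"detail": ...}}` shape seen here; the helper name `duffy_error_detail` is made up for illustration and is not part of duffy.

```python
import json
from typing import Optional


def duffy_error_detail(output: str) -> Optional[str]:
    """Return the error detail from a duffy client reply, or None if there is none.

    Assumption: failures look like {"error": {"detail": "..."}}, as in the
    output pasted in this ticket. Other reply shapes are treated as "no error".
    """
    try:
        reply = json.loads(output)
    except json.JSONDecodeError:
        return None  # not JSON at all; the caller decides what to do with it
    if not isinstance(reply, dict):
        return None
    error = reply.get("error")
    if isinstance(error, dict):
        return str(error.get("detail", "unknown error"))
    return None
```

A job could call this on the client's stdout and fail early with a readable message instead of continuing without a node.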

Metadata Update from @arrfab:
- Issue assigned to arrfab

2 years ago

I just looked and it seems you were never really migrated, so fedora-infra has now been made a valid tenant and an updated api key is available as a new openshift secret.
Can you restart a pod and then try again?

It works now.

I have a question: must we request and release a machine through duffy in each job run? Can we add a machine requested through duffy as a dedicated jenkins node?

No, duffy nodes are ephemeral and automatically recycled after 6h.
You have a dedicated jenkins instance, so if you want to run your tests on it directly (assuming you don't need anything like a specific distro release), you can.
But if you need a node your job can ssh into (and that's how it's designed), then yes, each job should request a node, run its tests, and then return the node (which is automatically deleted after 6 hours anyway).
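
The request / run / return cycle described above could be wrapped roughly like this in Python. This is only a sketch under assumptions: the `request-session` and `retire-session` subcommands, the `pool=...,quantity=1` argument form, and the reply layout (`session.id`, `session.nodes[].ipaddr`) are how I understand the duffy client and should be checked against the guide at https://sigs.centos.org/guide/ci/; `with_ephemeral_node` and `run_duffy` are names invented for this example.

```python
import json
import subprocess
from typing import Callable, List

Runner = Callable[[List[str]], str]


def run_duffy(args: List[str]) -> str:
    """Run `duffy client <args>` and return stdout (assumes duffy is already configured in the pod)."""
    result = subprocess.run(["duffy", "client", *args], capture_output=True, text=True)
    return result.stdout


def with_ephemeral_node(pool: str, test: Callable[[str], int], duffy: Runner = run_duffy) -> int:
    """Request one node from `pool`, run `test(host)` against it, and always retire the session."""
    # Assumed argument form and reply layout; verify against the duffy documentation.
    reply = json.loads(duffy(["request-session", f"pool={pool},quantity=1"]))
    if "session" not in reply:
        detail = reply.get("error", {}).get("detail", "unknown error")
        raise RuntimeError(f"duffy did not hand out a node: {detail}")
    session = reply["session"]
    host = session["nodes"][0]["ipaddr"]
    try:
        return test(host)
    finally:
        # Return the node even if the test raised; it would be recycled later anyway.
        duffy(["retire-session", str(session["id"])])


# Hypothetical usage; the pool name is a placeholder, pick one available to your tenant.
# def run_tests(host: str) -> int:
#     return subprocess.call(["ssh", f"root@{host}", "true"])
# exit_code = with_ephemeral_node("<pool-name>", run_tests)
```

Passing the runner in as a parameter keeps the sketch testable without a real duffy endpoint, and the `finally` block means a failing test still hands the node back instead of holding it until it is recycled.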

Just gathering some feedback/update: is that now working for you, and have you modified your workflow so that we can shut down the legacy VMs that were used as jenkins agents?

We need more time to update all related jobs. Please keep the legacy VMs running longer if possible.

What's the deadline for shutting down the legacy VMs?

@hlin: we'll shut down that infra in October, but as it's running on out-of-warranty hardware, if there is a hardware issue we won't do anything to fix it (as it's going away anyway).
Let me close this request, as the initial request was worked on and you're now aware of the migration plan/schedule, so please adapt to the new workflow asap.

Metadata Update from @arrfab:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 years ago

Metadata
Boards 1
CentOS CI Infra Status: Backlog