#5466 Plan to reduce CI build and run time
Opened a year ago by wombelix. Modified 11 months ago

To give the discussion in the Pagure Matrix channel today a bit more structure and detail:

I want to bring the improved local-testing concept (https://pagure.io/pagure/pull-request/5384) into the Jenkins CI pipeline.

Currently the CI builds container images with packages and pagure code on every start.
Those images are then dropped when the job is finished.
Pipeline stages run serially, not in parallel.
Running the unit tests on F39 in four tox environments (py39, py310, py311, py312) takes between 1h 5min and 1h 30min.
The goal is to complete all tests in ~15min on average.

Steps to achieve this:


1/ Separate Base and Code Container
Instead of one monolith that needs to be rebuilt on every run, there will be a base and a code container.
The base container contains all required packages and pre-populated tox environments.
The code container contains the scripts to clone the pagure code and trigger the tests.
If something changes in the testing scripts, only the code container has to be rebuilt (see the sketch below).
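
To make the split concrete, here is a minimal sketch of a build driver in Python; podman, the Containerfile names, and the image tags are assumptions, not the final layout:

    import subprocess

    def build(tag, containerfile):
        # podman is an assumption here; docker would work the same way
        subprocess.run(
            ["podman", "build", "-t", tag, "-f", containerfile, "."],
            check=True,
        )

    # Base: packages + pre-populated tox envs, rebuilt rarely (e.g. weekly)
    build("quay.io/pagure/pagure-ci-base:latest", "Containerfile.base")
    # Code: built FROM the base image, only adds the clone/test scripts,
    # rebuilt whenever the testing scripts change
    build("quay.io/pagure/pagure-ci-code:latest", "Containerfile.code")

The point of the split is that the expensive layer (packages and tox environments) only changes rarely, so it can be cached and published, while the cheap code layer can be rebuilt quickly.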

2/ Publish pre-built base and code containers
The base and code containers will be published and updated on a regular basis, e.g. every week, in the pagure namespace on quay.io.
Those images can then be used for local testing and in the CI.

3/ CI uses pre-built images if no base or code container files have changed
A script checks whether the files that go into the base or code container differ from pagure.io/pagure:master.
If that's not the case, the pre-built container images from quay.io/pagure are used.
If files have changed, the code container, or both the base and code containers, are rebuilt before the tests start.
Changes, e.g. in the .spec file, are then taken into account and properly covered.
But in the majority of cases, pre-built images are used, which reduces the overall CI run time.
This logic can also be added to local testing to align both as much as possible (see the sketch below).
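
A minimal sketch of that decision; the file lists are hypothetical and the diff against upstream master is an assumption (the real script would have to map the actual inputs of each Containerfile):

    import subprocess

    # Hypothetical mapping of which files are baked into which image
    BASE_FILES = ["files/pagure.spec", "dev/containers/base"]
    CODE_FILES = ["dev/containers/code", "dev/run-tests-container.py"]

    def changed(paths):
        # A non-empty diff against upstream master means the image inputs changed
        diff = subprocess.run(
            ["git", "diff", "--name-only", "origin/master", "--"] + paths,
            capture_output=True, text=True, check=True,
        )
        return bool(diff.stdout.strip())

    if changed(BASE_FILES):
        rebuild = ["base", "code"]  # code builds on base, so rebuild both
    elif changed(CODE_FILES):
        rebuild = ["code"]
    else:
        rebuild = []  # use the pre-built images from quay.io/pagure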

4/ Run tests in parallel
Each test stage runs in parallel on its own Duffy session.
We can't just run tox in parallel, because the tests are compute intensive.
Running them in the same VM doesn't save any time; the tests just block each other.
That means that in the current setup we need four sessions / virtual machines from the pool.
Whether they run CentOS Stream 8 or 9 doesn't matter for us, because all tests run in containers.
The pools virt-ec2-t2-centos-8s-x86_64 and virt-ec2-t2-centos-9s-x86_64 would work for our use case and offer 18 sessions in total:

    {
      "name": "virt-ec2-t2-centos-8s-x86_64",
      "fill-level": 10
    },
    {
      "name": "virt-ec2-t2-centos-9s-x86_64",
      "fill-level": 8
    },

We can send a GET request to the Duffy endpoint /api/v1/pools/{name} to get the number of available slots,
and then request sessions from the pool with the most availability, or balance between the pools.
That way we avoid taking away all the resources of a single pool (see the sketch below).
Whether that approach is fine needs to be clarified with the Infrastructure Team,
as does whether we already have, or can get, the necessary resource quota for Pagure.
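
A sketch of the balancing idea; the /api/v1/pools/{name} endpoint is the one named above, but the Duffy base URL and the response field names here are guesses:

    import requests

    DUFFY_URL = "https://duffy.example.org"  # placeholder, not the real URL
    POOLS = [
        "virt-ec2-t2-centos-8s-x86_64",
        "virt-ec2-t2-centos-9s-x86_64",
    ]

    def ready_nodes(pool):
        # GET /api/v1/pools/{name}; "pool"/"levels"/"ready" are guessed fields
        resp = requests.get(f"{DUFFY_URL}/api/v1/pools/{pool}", timeout=10)
        resp.raise_for_status()
        return resp.json()["pool"]["levels"]["ready"]

    # Request the next session from whichever pool currently has the most
    # free slots, so no single pool gets drained completely.
    best = max(POOLS, key=ready_nodes)
    print(f"requesting session from {best}")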


My intention is to start working on this as soon as possible.

I'm looking forward to feedback and suggestions.


We should use the 9s ones because 8s can be troublesome for newer container environments.

Interesting, how so? I was under the (naive?) impression that this wouldn't matter for our use-case?

Nevermind, I totally forgot about the upcoming EOL of 8s ...

https://pagure.io/centos-infra/issue/1377
https://lists.centos.org/hyperkitty/list/ci-users@lists.centos.org/thread/B7JU5S7H2EBKFID6IH4GUB3LU5XOAJ7U/

So, yes, virt-ec2-t2-centos-9s-x86_64 is the pool we're going to use then.
To be clarified whether the fill-level will be increased then; right now it's the smallest of the CentOS Stream pools.

Thinking out loud: There is a pool called metal-ec2-c5n-centos-9s-x86_64 that has three nodes in it. Each has 72 vCPUs and 192 GB RAM. Again, to be clarified with the Infrastructure Team, but if they are rarely used, we could also think about fallback logic that leverages one of them and runs our containers in parallel. We could probably give such an instance back in less than 15 minutes.

I have the necessary access to Jenkins now and plan to work on the CI in the next few days.
So this would be the perfect time to provide feedback on the planned approach.
Otherwise I will assume that there are no objections to re-designing the CI.

Update:

After getting access to the Jenkins installation, I realized it's a bit outdated. Core and plugins were affected by vulnerabilities, so I decided to work on the upgrade first (https://pagure.io/centos-infra/issue/1409).
To reduce the CI run time we need pre-built container images, and they should stay close to the code base in the master branch. So I will install the JMS Messaging plugin (https://plugins.jenkins.io/jms-messaging/) to react to pagure commit events on Fedora Messaging.
A new pipeline then triggers a container image build on quay.io.

When those prerequisites are completed, I will work on the actual refactoring of the CI pipeline logic.
