Notes on triaging monitor-gating failures
=========================================

You might have been tasked with triaging issues that pop up on
https://pagure.io/packager-workflow/issues as a result of a run of
https://pagure.io/fedora-ci/monitor-gating on
https://os.fedoraproject.org/console/project/monitor-gating/overview

The script goes through a basic gated rpm update workflow with our
canary packages:

* single-package:

  * https://src.fedoraproject.org/rpms/dummy-test-package-gloster

* multi-package, for two related packages in a side-tag:

  * https://src.fedoraproject.org/rpms/dummy-test-package-rubino
  * https://src.fedoraproject.org/rpms/dummy-test-package-crested

The issue is logged automatically when the monitor-gating script detects
a failure.

Triage
------

The triage of an issue consists of:

* going through the logs in the issue
* figuring out if the failure was caused by:

  * a problem in our test-script
  * service degradation

    * we expect the build to complete within at most ~60 mins
    * if the workflow takes longer, e.g. while waiting in the queue for
      CI, we mark it as failed
    * there might be no functional issues in the flow

  * a functional issue in the service

Sample logs
~~~~~~~~~~~

Here we provide sample logs for a successful build:

* single build:
  https://gist.github.com/AdamSaleh/258a7ac22c47baef3940fdfe80d3bb36
* multi build:
  https://gist.github.com/AdamSaleh/eb310b3f50be5d95d5f886c2782c1508

When looking for issues, the simplest approach is to search for the
string ``[FAILED]``, as every log line should end in ``[DONE]``.

**BEWARE**: a successful line could look like

::

   23:07:52 - resultsdb results in datagrepper returned FAILED - ran for: 0s   [DONE]

because the workflow we are testing is that the package with an
unsuccessful test will be gated and can be waived. If the line mentions
errors and failures, but ends in ``[DONE]``, it is an error we are
expecting and waiting for.
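
The search described above can be sketched with ``grep``. This is only an illustration: the function names and the log file name in the usage note are hypothetical, and the log is assumed to be saved as plain text with one log line per row.

```shell
# Print the lines that ended in [FAILED], prefixed with their line number.
# -F treats the pattern as a fixed string, so the brackets are not
# interpreted as a character class.
failed_lines() {
  grep -nF '[FAILED]' "$1"
}

# Print every line that does not end in [DONE] (trailing spaces allowed).
# A line mentioning FAILED in the middle but ending in [DONE] is skipped,
# which matches the expected-failure case described above.
not_done_lines() {
  grep -nv '\[DONE\][[:space:]]*$' "$1"
}
```

For example, ``failed_lines monitor-gating.log`` lists the failing steps of a saved log.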

Common issues
~~~~~~~~~~~~~

The most common issues we encounter are:

* the package was waiting too long in the CI queue, *leading to a
  timeout*
* Resultsdb was too slow to process the message, *leading to a timeout*
* Resultsdb experiencing other problems

These are best triaged by querying the API with `httpie`_ and `jq`_

Less common issues involve:

* pagure
* koji
* with multi-build, there was an issue of the two packages requiring
  synchronous version bumps, but there is a mechanism in the script to
  amend this on the fly

Triaging Resultsdb
~~~~~~~~~~~~~~~~~~

If you see a line like:

::

   23:40:11 - CI results did not show in resultsdb(phx) for dummy-test-package-gloster-0-1384.fc34 within 300 minutes since 2020-09-13 23:34:59.146252 - ran for: 312s [FAILED]

you can query the resultsdb API with

::

   for i in {0..10}; do
     curl "https://taskotron.fedoraproject.org/resultsdb_api/api/v2.0/results?testcases=fedora-ci.koji-build.tier0.functional&page=$i" \
       | jq '.data | map(select(.data.item[0] | startswith("dummy-test-package-gloster-0-1384")))'
   done

to find out whether the result eventually appeared in the database. It
is easier to search in a for-loop right away, especially if you are
triaging an issue that is several days old.

There are several outcomes we usually encounter:

* the run is there, and this seems to be a recurring issue in several
  past runs
* the run is there, but it seems to be an intermittent issue
* the run isn’t there
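
A minimal sketch of the same query wrapped in shell functions, so that only the NVR prefix changes between triages. The function names ``filter_by_nvr`` and ``find_result`` and the default of 10 pages are illustrative assumptions, not part of monitor-gating; the URL and the jq filter are the ones from the query above.

```shell
# Keep only the results whose first item starts with the given NVR prefix.
# Reads one page of resultsdb JSON on stdin.
filter_by_nvr() {
  jq --arg nvr "$1" '.data | map(select(.data.item[0] | startswith($nvr)))'
}

# Walk pages 0..N of the functional test case results and filter each page.
find_result() {
  nvr=$1
  pages=${2:-10}
  for i in $(seq 0 "$pages"); do
    curl -s "https://taskotron.fedoraproject.org/resultsdb_api/api/v2.0/results?testcases=fedora-ci.koji-build.tier0.functional&page=$i" \
      | filter_by_nvr "$nvr"
  done
}
```

Calling ``find_result dummy-test-package-gloster-0-1384`` prints one JSON array per page; an empty array (``[]``) means that page held no matching result.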

If **the run is there**, and this seems to be **a recurring issue**,
investigate further. At the very least, try to measure the time between
the ``org.centos.prod.ci.koji-build.test.error`` message announcing the
completion of the CI run and the result appearing in the resultsdb
database. You can see an example datagrepper query in
`Triaging-CI-through-datagrepper`_.

Create an issue about this in
https://pagure.io/fedora-infrastructure/issues
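
To put a number on that delay for the issue report, the two timestamps can be subtracted in the shell. This is a sketch assuming GNU ``date`` and timestamps copied by hand from the message and from the resultsdb result; the function name is hypothetical, and the example values are simply the two times that appear in the sample log line above.

```shell
# Print the number of seconds between two timestamps.
# Both are interpreted as UTC; requires GNU date for the -d option.
seconds_between() {
  start=$(date -u -d "$1" +%s) || return 1
  end=$(date -u -d "$2" +%s) || return 1
  echo $(( end - start ))
}

seconds_between "2020-09-13 23:34:59" "2020-09-13 23:40:11"   # prints 312
```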

If **the run is there**, and this seems to be just **an intermittent
issue**, you can close this with a note pointing to the resultsdb
result.

If you can’t find the run, look for other runs of the same package:

::

   for i in {0..10}; do
     curl "https://taskotron.fedoraproject.org/resultsdb_api/api/v2.0/results?testcases=fedora-ci.koji-build.tier0.functional&page=$i" \
       | jq '.data | map(select(.data.item[0] | startswith("dummy-test-package-gloster-0")))'
   done

If you only find a much older run, do check the older runs for recurring
issues. This can point to resultsdb having a problem on the receiving
end. File an issue in https://pagure.io/fedora-infrastructure/issues

If there are other runs and only this one is conspicuously missing, CI
might have a problem on the sending end. Ask people in #fedora-ci on
freenode IRC and file an issue at https://pagure.io/fedora-ci/general

Triaging CI through datagrepper
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. _Triaging-CI-through-datagrepper: #Triaging-CI-through-datagrepper

To triage problems with CI, the easiest option is to look into messages
stored in datagrepper. You want to search for the messages that signify
CI running and the completion of the CI run by searching through topics
``org.centos.prod.ci.koji-build.test.running`` and
``org.centos.prod.ci.koji-build.test.error`` for the specific nvr.

Beware, there will be several ``running`` messages, e.g. for
``rpminspect`` as well as for the functional test.

The queries can look like this:

::

   for i in {1..10}; do
     http get https://apps.fedoraproject.org/datagrepper/raw \
       delta==1728000 page==$i \
       topic==org.centos.prod.ci.koji-build.test.running \
       | jq '.raw_messages | map(select(.msg.artifact.nvr | startswith("dummy-test-package-gloster-0-1469")))'
   done

and

::

   for i in {1..10}; do
     http get https://apps.fedoraproject.org/datagrepper/raw \
       delta==1728000 page==$i \
       topic==org.centos.prod.ci.koji-build.test.error \
       | jq '.raw_messages | map(select(.msg.artifact.nvr | startswith("dummy-test-package-gloster-0-1469")))'
   done

You should be able to learn the duration of the build and the url of the
jenkins-job that was responsible.
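
The ``delta`` parameter in the queries above is a time window in seconds counted back from now, so ``1728000`` covers the last 20 days. When triaging an older or newer issue, the value can be derived from a number of days; the helper name below is only an illustration.

```shell
# Convert a number of days into the seconds value passed as delta.
days_to_delta() {
  echo $(( $1 * 24 * 60 * 60 ))
}

days_to_delta 20   # prints 1728000
```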

If the issue seems intermittent, and you managed to find the messages in
question, you may choose to close the issue with links to the relevant
messages.

If there are messages missing, or the issue seems to repeat, investigate
further and file an issue in either of:

* https://pagure.io/fedora-infrastructure/issues

  * if it seems related to the message-bus or datagrepper

* https://pagure.io/fedora-ci/general

  * if it seems to be a problem with Fedora CI itself
  * if you suspect Fedora CI, it is a good idea to ask in the #fedora-ci
    channel on freenode IRC

Fixing monitor-gating
---------------------

During your triage you might start suspecting the problem to be in the
monitor-gating script, as opposed to the various systems it touches.

So far we discovered and fixed:

* waiting for a message that had already appeared, due to a
  misconfigured time-offset
* looking for the wrong test-case in resultsdb
* various timeout-related issues

Other times you might just want to add a new quality-of-life function
for yourself. So far we added:

* automatic opening of issues
* more log-lines
* log-lines containing relevant URLs
* a check of whether the packages diverged in versions, with an
  attempted correction

To run monitor-gating and update the script, you need to:

* have access to https://pagure.io/fedora-ci/monitor-gating
* be a packager
* have access to:

  * https://src.fedoraproject.org/rpms/dummy-test-package-gloster
  * https://src.fedoraproject.org/rpms/dummy-test-package-rubino
  * https://src.fedoraproject.org/rpms/dummy-test-package-crested

* be a member of @sysadmin-monitor-gating

Debugging monitor-gating
~~~~~~~~~~~~~~~~~~~~~~~~

After you check out monitor-gating you should be able to run it with a
debugger attached.

`Sample configuration for Visual Studio Code <https://gist.github.com/AdamSaleh/8fa116a6c8e44fdaffd2a4763031faae>`__
assumes a ``monitor_gating_local.cfg`` copy of the
`monitor_gating.cfg <https://pagure.io/fedora-ci/monitor-gating/blob/master/f/monitor_gating.cfg>`__
sample configuration file.

The runner-config is not necessary for local debugging.

Updating to a new version
~~~~~~~~~~~~~~~~~~~~~~~~~

After there are code changes, make sure they find their way to the
``production`` branch on monitor-gating.

Before you attempt to run the ansible playbook that performs the update,
make sure you can see the pods, their events and logs in
https://os.fedoraproject.org/console/project/monitor-gating/

To update, it should be sufficient to log in to batcave and run

::

   sudo /bin/rbac-playbook openshift-apps/monitor_gating.yml