#11350 Planned Outage - koji database - 2023-06-01 14:30 UTC
Closed: Fixed 10 months ago by kevin. Opened 11 months ago by kevin.

Planned Outage - koji database - 2023-06-01 14:30 UTC

There will be an outage starting at 2023-06-01 14:30 UTC,
which will last approximately 8 hours.

To convert UTC to your local time, take a look at
http://fedoraproject.org/wiki/Infrastructure/UTCHowto
or run:

date -d '2023-06-01 14:30UTC'
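
The estimated end of the window can be worked out the same way. This is a minimal sketch, assuming GNU date and the approximate 8 hour duration quoted above; the real outage may run shorter or longer.

```shell
# Start of the outage as seconds since the epoch (GNU date assumed).
start=$(date -u -d '2023-06-01 14:30UTC' +%s)
# Add the estimated 8 hours; this is only the announced approximation.
end=$((start + 8 * 3600))
# Print the estimated end in UTC; drop -u to see it in local time.
date -u -d "@$end" '+%Y-%m-%d %H:%M UTC'
```

Going through epoch seconds avoids putting a relative offset next to the timezone name, which date can misread as part of the zone.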

Reason for outage:

We will be moving the koji buildsystem database (and the virthost it runs on) from RHEL8 and postgresql 12 to RHEL9 and postgresql 15. This outage is scheduled to coincide with the s390x builder outage in order to consolidate outages. During the outage window koji will be unavailable and builds will not be possible. The s390x builder outage may still be ongoing after this outage is over, so archful builds may not complete until that outage has ended as well.

Affected Services:

koji.fedoraproject.org

Ticket Link:

https://pagure.io/fedora-infrastructure/issue/11350

Please join #fedora-admin or #fedora-noc on irc.libera.chat
or add comments to the ticket for this outage above.


My plan for this outage:

  • on 2023-05-31, rsync /var/lib/pgsql from db-koji01 to a safe space somewhere else.
    Then at outage time:
  • Change the hub config to say koji is down and refer users to status
  • stop httpd on the hubs
  • stop postgresql and make sure there are 0 connections.
  • do a final rsync of /var/lib/pgsql for changed files/blocks
  • save db-koji01 virt xml
  • reinstall bvmhost-x86-01 with rhel9
  • install new rhel9 db-koji01
  • make sure postgresql 15 module is enabled.
  • rsync data back to /var/lib/pgsql (this step is likely to take a while... it's 1.2TB of data)
  • run postgresql-setup --upgrade
  • If everything works, bring up the db and hubs and revert the hub config so it no longer reports an outage
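
The outage-time steps above can be sketched as a dry-run script. This is only an illustration, not the commands that were actually run: BACKUP_DEST and XML_COPY are hypothetical placeholders, the package names and rsync options are assumptions, and the steps run on different machines (hubs, db-koji01, the vmhost), so the script prints each command instead of executing it unless DRY_RUN=0 is set.

```shell
#!/bin/bash
# Dry-run sketch of the outage-time steps. Commands are printed, not executed,
# unless DRY_RUN=0. BACKUP_DEST and XML_COPY are hypothetical placeholders.
DRY_RUN=${DRY_RUN:-1}
BACKUP_DEST=${BACKUP_DEST:-backup.example.com:/srv/db-koji01-pgsql/}
XML_COPY=${XML_COPY:-/root/db-koji01.xml}

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

outage_steps() {
    # On the hubs: stop serving requests.
    run systemctl stop httpd
    # On db-koji01: stop the database, then confirm that no established
    # TCP connections to port 5432 remain (the output should be empty).
    run systemctl stop postgresql
    run ss -Htn state established '( sport = :5432 )'
    # Final pass: rsync only transfers what changed since the pre-sync.
    run rsync -aHAX --numeric-ids --delete /var/lib/pgsql/ "$BACKUP_DEST"
    # On the vmhost: keep a copy of the guest definition (copy it off the host too).
    run sh -c "virsh dumpxml db-koji01 > $XML_COPY"
    # ... reinstall bvmhost-x86-01 and db-koji01 with rhel9 here ...
    # On the new db-koji01: enable the module stream and install the server.
    run dnf -y module enable postgresql:15
    run dnf -y install postgresql-server postgresql-upgrade
    # Copy the data back, then upgrade the on-disk cluster.
    run rsync -aHAX --numeric-ids "$BACKUP_DEST" /var/lib/pgsql/
    run postgresql-setup --upgrade
}

outage_steps
```

Printing first makes it easy to review the order of operations before the window starts; the hub config changes are left out because they depend on the local setup.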

Contingencies:
If something fails in the upgrade, undefine the rhel9 db-koji01, redefine the rhel8 one, and bring it back up.
If the rhel8 vm storage is wiped somehow, reinstall a rhel8 instance and rsync the data back to it.
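
The first contingency could look roughly like the following on the vmhost. This is a sketch under assumptions: the rhel8 domain XML saved before the reinstall is taken to be at the hypothetical path in SAVED_XML, both guests are assumed to use the name db-koji01, and the commands are printed rather than run.

```shell
#!/bin/bash
# Hypothetical path to the rhel8 domain XML saved before the reinstall.
SAVED_XML=${SAVED_XML:-/root/db-koji01-rhel8.xml}

rollback_commands() {
    # Print (not run) the libvirt commands for falling back to the rhel8 guest.
    echo "virsh destroy db-koji01"     # force off the rhel9 guest if it is running
    echo "virsh undefine db-koji01"    # drop the rhel9 definition
    echo "virsh define $SAVED_XML"     # restore the rhel8 definition
    echo "virsh start db-koji01"       # bring the rhel8 guest back up
}

rollback_commands
```

Without extra options, virsh undefine only removes the guest definition and leaves its storage alone, which is why the second contingency only matters if the storage is lost some other way.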

Metadata Update from @phsmoura:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, outage

11 months ago

The longest part of this was copying off the old vm storage in case of doom (~6 hours).

The new vmhost and db-koji01 are installed and the db is upgraded. I am just doing a reindex now... once that's done I can re-open things and hopefully all will be well.

Outage is finally over.

Unfortunately the s390x builders are still down...

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

10 months ago
