#837 [doc] document how to upgrade persistent instances
Merged 4 years ago by praiskup. Opened 4 years ago by frostyx.
copr/ frostyx/copr how-to-upgrade-instances  into  master

@@ -0,0 +1,224 @@ 

+ .. _how_to_upgrade_persistent_instances:

+ 

+ How to upgrade persistent instances

+ ===================================

+ 

+ This article describes how to upgrade persistent instances (e.g. copr-fe-dev) to a new Fedora version.

+ 

+ 

+ Requirements

+ ------------

+ 

+ * an account on `Fedora Infra OpenStack`_

+ * access to persistent tenant

+ * ssh access to batcave01

+ 

+ 

+ Find source image

+ -----------------

+ 

+ For OpenStack, there is an image registry on `OpenStack images dashboard`_.  By

+ default you see only the project images; to see all of them, click on the

+ ``Public`` button.

+ 

+ Search for the ``Fedora-Cloud-Base-*`` images of the particular Fedora. Please note

+ that if there is a timestamp in the image name suffix, then it is a beta version.

+ It is better to use images with a numbered minor version.

+ 

+ The goal in this step is just to find an image name.

+ 

+ 

+ Update the image in playbooks

+ -----------------------------

+ 

+ Once the new image name is known, make sure it is set in ``vars/global.yml``, e.g.::

+ 

+     fedora30_x86_64: Fedora-Cloud-Base-30-1.2.x86_64

+ 

+ Then edit the host vars for the instance::

+ 

+     vim inventory/host_vars/<instance>.fedorainfracloud.org

+     # e.g.

+     vim inventory/host_vars/copr-dist-git-dev.fedorainfracloud.org

+ 

+ And configure it to use the new image::

+ 

+     image: "{{ fedora30_x86_64 }}"

+ 

+ That is all that needs to be changed in the ansible repository. Commit and push the change.

+ 

+ 

+ Back up the old instance

+ ------------------------

+ 

+ This part is done via the ``openstack`` client on your computer. First, download an RC

+ file for the ``persistent`` tenant. Open the `Fedora Infra OpenStack`_ dashboard, switch

+ to the ``Access & Security`` section, then ``API Access`` and click on

+ ``Download OpenStack RC File``.

+ 

+ Load the openstack settings::

+ 

+     source ~/Downloads/persistent-openrc.sh

+ 

+ Detach the volume from the old instance::

+ 

+     openstack server remove volume "<instance_id>" "<volume_id>"

+     # e.g.

+     openstack server remove volume "52d97d72-5915-45c0-b223-xxxxxxxxxxxx" "9e2b4c55-9ec3-4508-af46-xxxxxxxxxxxx"

+ 

+ Back up the old instance by renaming it::

+ 

+     openstack server set --name <old_name>_backup "<id>"

+     # e.g.

+     openstack server set --name copr-dist-git-dev_backup "85260b5b-7f61-4398-8d05-xxxxxxxxxxxx"
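
The detach and rename steps above can be previewed before touching anything. The following is a minimal dry-run sketch, assuming you already know the instance name, instance ID and volume ID; ``print_backup_commands`` is a hypothetical helper (not part of the ``openstack`` client) that only prints the two commands from this section so they can be checked first.

```shell
#!/bin/sh
# Dry-run sketch: print the backup commands instead of running them.
# print_backup_commands is a hypothetical helper, not an openstack subcommand.
print_backup_commands() {
    name=$1          # current instance name, e.g. copr-dist-git-dev
    instance_id=$2   # ID of the old instance
    volume_id=$3     # ID of the volume attached to it

    # 1. detach the volume from the old instance
    printf 'openstack server remove volume "%s" "%s"\n' "$instance_id" "$volume_id"
    # 2. keep the old instance around under a *_backup name
    printf 'openstack server set --name %s_backup "%s"\n' "$name" "$instance_id"
}

print_backup_commands copr-dist-git-dev "<instance_id>" "<volume_id>"
```

Once the printed commands look right, they can be copied into the shell where the RC file was sourced.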

+ 

+ 

+ .. note:: You might also need to back up the Let's Encrypt certificates.

+           See `Letsencrypt renewal limits`_.

+ 

+ .. note:: You should terminate existing resalloc resources.

+           See `Terminate resalloc resources`_.

+ 

+ 

+ Finally, shut down the instance to avoid storage inconsistency and other possible problems::

+ 

+     $ ssh root@<old_name>.fedorainfracloud.org

+     [root@copr-dist-git-dev ~][STG]# shutdown -h now

+ 

+ 

+ Provision new instance from scratch

+ -----------------------------------

+ 

+ On batcave01, run the playbook to provision the instance. For dev, see

+ 

+ https://docs.pagure.org/copr.copr/how_to_release_copr.html#upgrade-dev-machines

+ 

+ and for production, see

+ 

+ https://docs.pagure.org/copr.copr/how_to_release_copr.html#upgrade-production-machines

+ 

+ .. note:: Please note that the playbook may be stuck longer than expected while waiting for a new

+           instance to boot. See `Initial boot hangs waiting for entropy`_.

+ 

+ 

+ Get it working

+ --------------

+ 

+ The playbook from the previous section will most likely **not** succeed. At this point,

+ you need to debug and fix the issues from running it. If required, adjust the playbook

+ and re-run it until it succeeds. Most likely you will also need to attach a volume to the

+ new instance in the `OpenStack instances dashboard`_.

+ 

+ .. note:: Copr backend requires an outdated version of python3-novaclient.

+           See `Downgrade python novaclient`_.

+ 

+ 

+ Terminate the old instance

+ --------------------------

+ 

+ Once the new instance is successfully provisioned and working as expected, terminate the

+ old backup instance.

+ 

+ Open the `OpenStack instances dashboard`_, switch the current project to ``persistent``,

+ and find the instance that you want to terminate. Make sure it is the right one! Don't

+ mistake e.g. a production instance for dev. Then look at the ``Actions`` column and click

+ the ``More`` button. In the dropdown menu, there is a ``Terminate instance`` button; use it.

+ 

+ 

+ Troubleshooting

+ ---------------

+ 

+ Initial boot hangs waiting for entropy

+ ......................................

+ 

+ Because of a known infrastructure issue (`Fedora infrastructure issue #7966`_), the initial boot

+ of an instance in OpenStack hangs and waits for entropy. It seems that it can't be fixed

+ properly, so we need to work around it by going to the `OpenStack instances dashboard`_, opening

+ the instance details, switching to the ``Console`` tab and typing random characters in it.

+ This resumes the boot process.

+ 

+ 

+ Letsencrypt renewal limits

+ ..........................

+ 

+ Currently, we renew our Let's Encrypt certificates on a daily basis through the ``certbot-renew.timer``

+ unit. However, Let's Encrypt issues at most five certificates a week (think of

+ a week as a 7-day floating window, instead of a calendar week) per domain. As a consequence,

+ it may happen that our new instance won't be able to obtain a certificate for two days,

+ with no way to bypass it. Don't let this happen on production instances!
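
The two-day wait mentioned above follows from the floating window. Here is a minimal sketch of the arithmetic, assuming that all five certificates counted in the window were issued on consecutive days and the oldest one is five days old. ``days_until_next_slot`` is a hypothetical helper used only for illustration.

```shell
#!/bin/sh
# Sketch of the floating-window arithmetic (assumption: the limit is freed as
# soon as the oldest counted certificate becomes older than the window).
days_until_next_slot() {
    window=$1       # length of the floating window in days (7 in the text above)
    oldest_age=$2   # age in days of the oldest certificate still in the window
    echo $(( window - oldest_age ))
}

days_until_next_slot 7 5   # prints 2
```

With a different issuance history the wait is different, which is why backing up the certificates is the safer option.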

+ 

+ There are two possible options for dealing with this situation at the moment. Either disable

+ ``certbot-renew.timer`` at least two days ahead of upgrading an instance or back up its

+ current certificates and copy them to the upgraded instance::

+ 

+     [root@copr-be-dev ~][STG]# tar zcvf /tmp/copr-be-dev-letsencrypt.tar.gz /etc/letsencrypt

+     $ scp root@copr-be-dev.cloud.fedoraproject.org:/tmp/copr-be-dev-letsencrypt.tar.gz /tmp/

+ 

+ Once a new instance is provisioned and unable to obtain certificates from the Let's Encrypt

+ site, copy them from the backup::

+ 

+     $ scp /tmp/copr-be-dev-letsencrypt.tar.gz root@copr-be-dev.cloud.fedoraproject.org:/tmp

+     [root@copr-be-dev ~][STG]# tar zxvf /tmp/copr-be-dev-letsencrypt.tar.gz -C /

+ 

+ Remove the backup from your computer, because it contains secret files::

+ 

+     $ rm /tmp/copr-be-dev-letsencrypt.tar.gz

+ 

+ 

+ Private IP addresses

+ ....................

+ 

+ Most of the communication within the Copr stack happens on public interfaces via hostnames,

+ with one exception. Communication between ``backend`` and ``keygen`` is done on a private

+ network behind a firewall through IP addresses that change when spawning a fresh instance.

+ 

+ After updating a ``copr-keygen`` (or dev) instance, change its IP address in

+ ``inventory/group_vars/copr_dev``::

+ 

+     keygen_host: "172.XX.XX.XX"

+ 

+ Conversely, after updating a ``copr-backend`` (or dev) instance, change the configuration in

+ ``inventory/group_vars/copr_keygen`` (or dev) and update the iptables rules::

+ 

+     custom_rules: [ ... ]

+ 

+ Please note that two addresses need to be updated; both belong to the backend.

+ 

+ 

+ Terminate resalloc resources

+ ............................

+ 

+ It is easier to close all resalloc tickets; otherwise, there will be dangling VMs

+ preventing the backend from starting new ones.

+ 

+ Edit the ``/etc/resallocserver/pools.yaml`` file and in every section, set::

+ 

+     max: 0

+ 

+ Then delete all current resources::

+ 

+     su - resalloc

+     resalloc-maint resource-delete $(resalloc-maint resource-list | cut -d' ' -f1)
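
To see what the command substitution above passes to ``resource-delete``, the extraction can be tried on sample text first. This is a minimal sketch: the listing below is a made-up sample (the real ``resalloc-maint resource-list`` output may look different), and the only assumption is that the resource ID is the first space-separated field. ``sample_listing`` and ``extract_ids`` are hypothetical helpers.

```shell
#!/bin/sh
# Hypothetical sample of a resource listing; only the position of the ID
# (first space-separated field) matters for this sketch.
sample_listing() {
    printf '%s\n' \
        '101 - builder_a status=UP' \
        '102 - builder_b status=UP'
}

# Same extraction as in the command above: keep the first field of each line.
extract_ids() {
    cut -d' ' -f1
}

# Print the command that would be run, without running it.
echo resalloc-maint resource-delete $(sample_listing | extract_ids)
```

Replacing ``sample_listing`` with the real ``resalloc-maint resource-list`` call and keeping the ``echo`` gives a dry run on an actual instance.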

+ 

+ 

+ Downgrade python novaclient

+ ...........................

+ 

+ The backend depends on ``python3-novaclient`` in the prehistoric version ``3.3.1``. This

+ version is no longer supported and the spec file needed to be customized to build and

+ install only the python3 package. Also, the epoch has been bumped so it doesn't get replaced

+ with a newer version. Please install this package from the Copr project (even on the production

+ instance)::

+ 

+     dnf copr enable @copr/novaclient

+     dnf install python3-novaclient-2:3.3.1

+ 

+ .. note:: Please do not automate this step in the playbook, so it forces us to deal

+           with the situation properly.

+ 

+ 

+ 

+ .. _`Fedora Infra OpenStack`: https://fedorainfracloud.org

+ .. _`OpenStack images dashboard`: https://fedorainfracloud.org/dashboard/project/images/

+ .. _`OpenStack instances dashboard`: https://fedorainfracloud.org/dashboard/project/instances/

+ .. _`Fedora infrastructure issue #7966`: https://pagure.io/fedora-infrastructure/issue/7966

@@ -11,6 +11,7 @@ 

  

     How to release copr RPM packages <how_to_release_copr>

     how_to_upgrade_builders

+    how_to_upgrade_persistent_instances

     How to manage active chroots <how_to_manage_chroots>

     Sending notifications and removing data from outdated chroots <how_to_delete_outdated_chroots>

     sanity_tests

This is the first version of the "How to upgrade persistent instances" document. It doesn't cover all topics, but I figured that it would be better to submit a PR describing the basics and then iteratively improve it.

I have several topics that should be either documented here or changed in playbooks, but I am saving them for a meeting.

What about `openstack server set --name <old_name>_backup <old_id>`?

Then one could just shut down the VM; ... and ansible-playbooks should (could :-)) just take the storage/floating IPs from the previous instance, and there would be some chance to go back...

Interesting. I will try it when upgrading some next instance. Thank you.

It makes the process much less scary :-)

3 new commits added

  • [doc] add todo to detach volume
  • [doc] backup the old section instead of terminating it
  • [doc] add troubleshooting section
4 years ago

Metadata Update from @msuchy:
- Pull-request tagged with: needs-work

4 years ago

I successfully did this today:
$ openstack server remove volume 52d97d72-5915-45c0-b223-d8a50ca73135 9e2b4c55-9ec3-4508-af46-a40f3a5bd982

While we are in the command-line window, I tried this one today, and it worked: `openstack server stop copr-keygen-dev_backup`

On backend, we should do `systemctl stop copr-backend ; su - resalloc ; resalloc-maint ticket-list ; resalloc ticket-close <ID> ; ...` first, because otherwise there will be dangling VMs on aarch64 hosts preventing the new backend from starting new VMs.

11 new commits added

  • [doc] add section about downgrading python novaclient
  • [doc] fix typos
  • [doc] add get it working section
  • [doc] describe how to close resalloc tickets
  • [doc] describe how to detach volumes
  • [doc] use shutdown command
  • [doc] add more troubleshooting sections
  • [doc] add todo to detach volume
  • [doc] backup the old section instead of terminating it
  • [doc] add troubleshooting section
  • [doc] document how to upgrade persistent instances
4 years ago

> On backend, we should do `systemctl stop copr-backend ; su - resalloc ; resalloc-maint ticket-list ; resalloc ticket-close <ID> ; ...` first, because otherwise there will be dangling VMs on aarch64 hosts preventing the new backend from starting new VMs.

Thank you, I've added it

I have dumped everything I know about the topic, so I am removing the needs-work tag and letting you guys review it.

Metadata Update from @frostyx:
- Pull-request untagged with: needs-work

4 years ago

please rename to "terminate resalloc resources"

I was wrong here; it is not necessary to stop copr-backend or the resalloc server... You need to set the `max:` config options in `/etc/resallocserver/pools.yaml` to zero, and then delete all resources:
`resalloc-maint resource-delete $(resalloc-maint resource-list | cut -d' ' -f1)`

11 new commits added

  • [doc] add section about downgrading python novaclient
  • [doc] fix typos
  • [doc] add get it working section
  • [doc] describe how to close resalloc tickets
  • [doc] describe how to detach volumes
  • [doc] use shutdown command
  • [doc] add more troubleshooting sections
  • [doc] add todo to detach volume
  • [doc] backup the old section instead of terminating it
  • [doc] add troubleshooting section
  • [doc] document how to upgrade persistent instances
4 years ago

rebased onto 75981cd

4 years ago

Pull-Request has been merged by praiskup

4 years ago