From 3908d237cd75dca48101bd987102ed43b6c002d8 Mon Sep 17 00:00:00 2001
From: David Kirwan
Date: Sep 29 2021 01:45:35 +0000
Subject: metrics-for-apps: Added new sops

- Cluster upgrades
- Creating etcd backups

Signed-off-by: David Kirwan
---

diff --git a/modules/ocp4/pages/sop_etcd_backup.adoc b/modules/ocp4/pages/sop_etcd_backup.adoc
new file mode 100644
index 0000000..fbfc07c
--- /dev/null
+++ b/modules/ocp4/pages/sop_etcd_backup.adoc
@@ -0,0 +1,50 @@
+== Create etcd backup
+This SOP should be followed in the following scenarios:
+
+- When the need exists to create an etcd backup.
+- When shutting a cluster down gracefully.
+
+=== Resources
+
+- [1] https://docs.openshift.com/container-platform/4.8/backup_and_restore/backing-up-etcd.html[Creating an etcd backup]
+
+=== Take etcd backup
+
+1. Connect to the `os-control01` node associated with the ENV.
+
+2. Use the `oc` tool to make a debug connection to a controlplane node:
+
+----
+oc debug node/<node_name>
+----
+
+3. Chroot to the `/host` directory on the container's filesystem:
+
+----
+sh-4.2# chroot /host
+----
+
+4. Run the `cluster-backup.sh` script and pass in the location to save the backup to:
+
+----
+sh-4.4# /usr/local/bin/cluster-backup.sh /home/core/assets/backup
+----
+
+5. Chown the backup files to be owned by user `core` and group `core`:
+
+----
+chown -R core:core /home/core/assets/backup
+----
+
+6. From the admin machine (see inventory group `ocp-ci-management`), become the Openshift service account: see the inventory hostvars for the host identified in the previous step and note the `ocp_service_account` variable.
+
+----
+ssh <host>
+sudo su - <ocp_service_account>
+----
+
+7. Copy the files down to the `os-control01` machine.
+
+----
+scp -i <ssh_key> core@<node_name>:/home/core/assets/backup/* ocp_backups/
+----
diff --git a/modules/ocp4/pages/sop_installation.adoc b/modules/ocp4/pages/sop_installation.adoc
index d2a6072..6e74301 100644
--- a/modules/ocp4/pages/sop_installation.adoc
+++ b/modules/ocp4/pages/sop_installation.adoc
@@ -207,3 +207,9 @@ Several other SOPs should be followed to perform the post installation configura
 - xref:sop_create_machineconfigs.adoc[SOP Create MachineConfigs to Configure RHCOS]
 - xref:sop_retrieve_ocp4_cacert.adoc[SOP Retrieve OCP4 CACERT]
 - xref:sop_configure_image_registry_operator.adoc[SOP Configure the Image Registry Operator]
+- xref:sop_disable_provisioners_role.adoc[SOP Disable the Provisioners Role]
+- xref:sop_configure_oauth_ipa.adoc[SOP Configure oauth Authentication via IPA/Noggin]
+- xref:sop_configure_local_storage_operator.adoc[SOP Configure the Local Storage Operator]
+- xref:sop_configure_openshift_container_storage.adoc[SOP Configure the Openshift Container Storage Operator]
+- xref:sop_configure_userworkload_monitoring_stack.adoc[SOP Configure the Userworkload Monitoring Stack]
+
diff --git a/modules/ocp4/pages/sop_upgrade.adoc b/modules/ocp4/pages/sop_upgrade.adoc
new file mode 100644
index 0000000..8cbef4f
--- /dev/null
+++ b/modules/ocp4/pages/sop_upgrade.adoc
@@ -0,0 +1,37 @@
+== Upgrade OCP4 Cluster
+Please see the official documentation for more information [1][3]; this SOP can be used as a rough guide.
+
+=== Resources
+
+- [1] https://docs.openshift.com/container-platform/4.8/updating/updating-cluster-between-minor.html[Upgrading OCP4 Cluster Between Minor Versions]
+- [2] xref:sop_etcd_backup.adoc[SOP Create etcd backup]
+- [3] https://docs.openshift.com/container-platform/4.8/operators/admin/olm-upgrading-operators.html
+- [4] https://docs.openshift.com/container-platform/4.8/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html#dr-restoring-cluster-state[Restore etcd backup]
+- [5] https://docs.openshift.com/container-platform/4.8/operators/admin/olm-upgrading-operators.html#olm-upgrading-operators[Upgrading Operators Prior to Cluster Update]
+
+=== Prerequisites
+
+- In case an upgrade fails, it is wise to first take an `etcd` backup. To do so, follow the SOP [2].
+- Ensure that all installed Operators are at the latest versions for their channel [5].
+
+=== Upgrade OCP
+At the time of writing the version installed on the cluster is `4.8.11` and the `upgrade channel` is set to `stable-4.8`. It is easiest to update the cluster via the web console. Go to:
+
+- Administration
+- Cluster Settings
+- To upgrade between `z` or `patch` versions (x.y.z), click the update button when one is available.
+- When moving between `y` or `minor` versions, you must first switch the `upgrade channel`, to `fast-4.9` as an example. You should also be on the very latest `z`/`patch` version before upgrading.
+- When the upgrade has finished, switch the `upgrade channel` back to stable.
+
+
+=== Upgrade failures
+In the worst-case scenario we may have to restore etcd from the backups taken at the start [4], or reinstall a node entirely.
+
+==== Troubleshooting
+There are many possible ways an upgrade can fail midway through.
+
+- Check the monitoring alerts currently firing; these can often hint at the problem.
+- Often individual nodes fail to take the new MachineConfig changes; this will show up when examining the `MachineConfigPool` status.
+- This might require a manual reboot of that particular node.
+- It might require killing pods on that particular node.
+
diff --git a/modules/ocp4/pages/sops.adoc b/modules/ocp4/pages/sops.adoc
index 7620939..cb2bd29 100644
--- a/modules/ocp4/pages/sops.adoc
+++ b/modules/ocp4/pages/sops.adoc
@@ -13,3 +13,5 @@
 - xref:sop_graceful_startup_ocp_cluster.adoc[SOP Graceful Cluster Startup]
 - xref:sop_installation.adoc[SOP Openshift 4 Installation on Fedora Infra]
 - xref:sop_retrieve_ocp4_cacert.adoc[SOP Retrieve OCP4 CACERT]
+- xref:sop_upgrade.adoc[SOP Upgrade OCP4 Cluster]
+- xref:sop_etcd_backup.adoc[SOP Create etcd backup]
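
The backup steps in `sop_etcd_backup.adoc` could be scripted roughly as below. This is an illustrative sketch only, not part of the patch: the function name `etcd_backup` is hypothetical, the node name is a parameter, and it assumes a logged-in `oc` session with cluster-admin rights plus SSH access to the node as user `core`.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the etcd backup flow; not part of this patch.
etcd_backup() {
  local node="$1"                         # controlplane node name, e.g. from 'oc get nodes'
  local backup_dir=/home/core/assets/backup

  # Run the backup inside a debug shell on the controlplane node, then
  # chown the files so they can be fetched as user 'core' (SOP steps 2-5).
  oc debug "node/${node}" -- chroot /host /bin/sh -c "
    /usr/local/bin/cluster-backup.sh ${backup_dir} &&
    chown -R core:core ${backup_dir}
  "

  # Copy the backup files down to the local ocp_backups/ directory (SOP step 7).
  mkdir -p ocp_backups
  scp "core@${node}:${backup_dir}/*" ocp_backups/
}
```

Invoked as e.g. `etcd_backup <node_name>`; the interactive `oc debug` plus `chroot` steps in the SOP remain the authoritative procedure.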
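
The console-driven upgrade steps in `sop_upgrade.adoc` also have a CLI equivalent. The sketch below is an assumption-laden illustration, not part of the patch: it swaps the web console for `oc adm upgrade` and a `clusterversion` patch, and the function name `upgrade_cluster` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical CLI sketch of the upgrade flow; not part of this patch.
# Assumes a logged-in 'oc' session with cluster-admin rights.
upgrade_cluster() {
  local channel="$1"   # e.g. "fast-4.9" when moving between minor versions

  # Switch the upgrade channel (the SOP does this via the web console).
  oc patch clusterversion version --type merge \
    -p "{\"spec\":{\"channel\":\"${channel}\"}}"

  # List the available updates, then move to the latest one in the channel.
  oc adm upgrade
  oc adm upgrade --to-latest=true
}
```

Remember to take the etcd backup first and, per the SOP, switch the channel back to the stable channel once the upgrade completes.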