From a77d6bb52b8928a944d5602c7c6295614b771d94 Mon Sep 17 00:00:00 2001 From: David Kirwan Date: Sep 29 2021 01:45:35 +0000 Subject: [PATCH 1/15] SOP for installation of OCP4 Signed-off-by: David Kirwan Signed-off-by: Akashdeep Dhar Signed-off-by: David Kirwan --- diff --git a/modules/ocp4/pages/sop_installation.adoc b/modules/ocp4/pages/sop_installation.adoc new file mode 100644 index 0000000..45f03d2 --- /dev/null +++ b/modules/ocp4/pages/sop_installation.adoc @@ -0,0 +1,213 @@ +== SOP Installation/Configuration of OCP4 on Fedora Infra + +=== Resources + +- [1]: https://docs.openshift.com/container-platform/4.8/installing/installing_bare_metal/[Official OCP4 Installation Documentation] + +=== Install +To install OCP4 on Fedora Infra, one must be apart of the following groups: + +- `sysadmin-openshift` +- `sysadmin-noc` + + +==== Prerequisites +Visit the https://console.redhat.com/openshift/install/metal/user-provisioned[OpenShift Console] and download the following OpenShift tools: + +* A Red Hat Access account is required +* OC client tools https://access.redhat.com/downloads/content/290/ver=4.8/rhel---8/4.8.10/x86_64/product-software[Here] +* OC installation tool https://access.redhat.com/downloads/content/290/ver=4.8/rhel---8/4.8.10/x86_64/product-software[Here] +* Ensure the downloaded tools are available on the `PATH` +* A valid OCP4 subscription is required to complete the installation configuration, by default you have a 60 day trial. +* Take a copy of your pull secret file you will need to put this in the `install-config.yaml` file in the next step. + + +==== Generate install-config.yaml file +We must create a `install-config.yaml` file, use the following example for inspiration, alternatively refer to the documentation[1] for more detailed information/explainations. + +---- +apiVersion: v1 +baseDomain: stg.fedoraproject.org +compute: +- hyperthreading: Enabled + name: worker + replicas: 0 +controlPlane: + hyperthreading: Enabled + name: master + replicas: 3 +metadata: + name: 'ocp' +networking: + clusterNetwork: + - cidr: 10.128.0.0/14 + hostPrefix: 23 + networkType: OpenShiftSDN + serviceNetwork: + - 172.30.0.0/16 +platform: + none: {} +fips: false +pullSecret: 'PUT PULL SECRET HERE' +sshKey: 'PUT SSH PUBLIC KEY HERE kubeadmin@core' +---- + +* Login to the `os-control01` corresponding with the environment +* Make a directory to hold the installation files: `mkdir ocp4-` +* Enter this newly created directory: `cd ocp4-` +* Generate a fresh SSH keypair: `ssh-keygen -f ./ocp4--ssh` +* Create a `ssh` directory and place this keypair into it. +* Put the contents of the public key in the `sshKey` value in the `install-config.yaml` file +* Put the contents of your Pull Secret in the `pullSecret` value in the `install-config.yaml` +* Take a backup of the `install-config.yaml` to `install-config.yaml.bak`, as running the next steps consumes this file, having a backup allows you to recover from mistakes quickly. + + +==== Create the Installation Files +Using the `openshift-install` tool we can generate the installation files. Make sure that the `install-config.yaml` file is in the `/path/to/ocp4-` location before attempting the next steps. + +===== Create the Manifest Files +The manifest files are human readable, at this stage you can put any customisations required before the installation begins. + +* Create the manifests: `openshift-install create manifests --dir=/path/to/ocp4-` +* All configuration for RHCOS must be done via MachineConfigs configuration. 
If there is known configuration which must be performed, such as NTP, you can copy the MachineConfigs into the `/path/to/ocp4-/openshift` directory now. +* The following step should be performed at this point, edit the `/path/to/ocp4-/manifests/cluster-scheduler-02-config.yml` change the `mastersSchedulable` value to `false`. + + +===== Create the Ignition Files +The ignition files have been generated from the manifests and MachineConfig files to generate the final installation files for the three roles: `bootstrap`, `master`, `worker`. In Fedora we prefer not to use the term `master` here, we have renamed this role to `controlplane`. + +* Create the ignition files: `openshift-install create ignition-configs --dir=/path/to/ocp4-` +* At this point you should have the following three files: `bootstrap.ign`, `master.ign` and `worker.ign`. +* Rename the `master.ign` to `controlplane.ign`. +* A directory has been created, `auth`. This contains two files: `kubeadmin-password` and `kubeconfig`. These allow `cluster-admin` access to the cluster. + + +==== Copy the Ignition files to the `batcave01` server +On the `batcave01` at the following location: `/srv/web/infra/bigfiles/openshiftboot/`: + +* Create a directory to match the environment: `mkdir /srv/web/infra/bigfiles/openshiftboot/ocp4-` +* Copy the ignition files, the ssh files and the auth files generated in previous steps, to this newly created directory. Users with `sysadmin-openshift` should have the necessary permissions to write to this location. +* when this is complete it should look like the following: +---- + ├── + │ ├── auth + │ │ ├── kubeadmin-password + │ │ └── kubeconfig + │ ├── bootstrap.ign + │ ├── controlplane.ign + │ ├── ssh + │ │ ├── id_rsa + │ │ └── id_rsa.pub + │ └── worker.ign +---- + + +==== Update the ansible inventory +The ansible inventory/hostvars/group vars should be updated with the new hosts information. + +For inspiration see the following https://pagure.io/fedora-infra/ansible/pull-request/765[PR] where we added the ocp4 production changes. + + +==== Update the DNS/DHCP configuration +The DNS and DHCP configuration must also be updated. This https://pagure.io/fedora-infra/ansible/pull-request/765[PR] contains the necessiary changes DHCP for prod and can be done in ansible. + +However the DNS changes may only be performed by `sysadmin-main`. For this reason any DNS changes must go via a patch snippet which is emailed to the `infrastructure@lists.fedoraproject.org` mailing list for review and approval. This process may take several days. + + +==== Generate the TLS Certs for the new environment +This is beyond the scope of this SOP, the best option is to create a ticket for Fedora Infra to request that these certs are created and available for use. The following certs should be available: + +- `*.apps..fedoraproject.org` +- `api..fedoraproject.org` +- `api-int..fedoraproject.org` + + +==== Run the Playbooks +There are a number of playbooks required to be run. Once all the previous steps have been reached, we can run these playbooks from the `batcave01` instance. + +- `sudo rbac-playbook groups/noc.yml -t 'tftp_server,dhcp_server'` +- `sudo rbac-playbook groups/proxies.yml -t 'haproxy,httpd,iptables'` + + +===== Baremetal / VMs +Depending on if some of the nodes are VMs or baremetal, different tags should be supplied to the following playbook. If the entire cluster is baremetal you can skip the `kvm_deploy` tag entirely. + +If there are VMs used for some of the roles, make sure to leave it in. 
+ +- `sudo rbac-playbook manual/ocp4-place-ignitionfiles.yml -t "ignition,repo,kvm_deploy"` + + +===== Baremetal +At this point we can switch on the baremetal nodes and begin the PXE/UEFI boot process. The baremetal nodes should via DHCP/DNS have the configuration necessary to reach out to the `noc01.iad2.fedoraproject.org` server and retrieve the UEFI boot configuration via PXE. + +Once booted up, you should visit the management console for this node, and manually choose the UEFI configuration appropriate for its role. + +The node will begin booting, and during the boot process it will reach out to the `os-control01` instance specific to the `` to retrieve the ignition file appropriate to its role. + +The system will then become autonomous, it will install and potentially reboot multiple times as updates are retrieved/applied etc. + +Eventually you will be presented with a SSH login prompt, where it should have the correct hostname eg: `ocp01` to match what is in the DNS configuration. + + +==== Bootstrapping completed +When the control plane is up, we should see all controlplane instances available in the appropriate haproxy dashboard. eg: https://admin.fedoraproject.org/haproxy/proxy01=ocp-masters-backend-kapi[haproxy]. + +At this time we should take the `bootstrap` instance out of the haproxy load balancer. + +- Make the necessiary changes to ansible at: `ansible/roles/haproxy/templates/haproxy.cfg` +- Once merged, run the following playbook once more: `sudo rbac-playbook groups/proxies.yml -t 'haproxy'` + + +==== Begin instllation of the worker nodes +Follow the same processes listed in the Baremetal section above to switch on the worker nodes and begin installation. + + +==== Configure the `os-control01` to authenticate with the new OCP4 cluster +Copy the `kubeconfig` to `~root/.kube/config` on the `os-control01` instance. +This will allow the `root` user to automatically be authenticated to the new OCP4 cluster with `cluster-admin` privileges. + + +==== Accept Node CSR Certs +To accept the worker/compute nodes into the cluster we need to accept their CSR certs. + +List the CSR certs. The ones we're interested in will show as pending: + +---- +oc get csr +---- + +To accept all the OCP4 node CSRs in a one liner do the following: + +---- +oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve +---- + +This should look something like this once completed: + +---- +[root@os-control01 ocp4][STG]= oc get nodes +NAME STATUS ROLES AGE VERSION +ocp01.ocp.stg.iad2.fedoraproject.org Ready master 34d v1.21.1+9807387 +ocp02.ocp.stg.iad2.fedoraproject.org Ready master 34d v1.21.1+9807387 +ocp03.ocp.stg.iad2.fedoraproject.org Ready master 34d v1.21.1+9807387 +worker01.ocp.stg.iad2.fedoraproject.org Ready worker 21d v1.21.1+9807387 +worker02.ocp.stg.iad2.fedoraproject.org Ready worker 20d v1.21.1+9807387 +worker03.ocp.stg.iad2.fedoraproject.org Ready worker 20d v1.21.1+9807387 +worker04.ocp.stg.iad2.fedoraproject.org Ready worker 34d v1.21.1+9807387 +worker05.ocp.stg.iad2.fedoraproject.org Ready worker 34d v1.21.1+9807387 +---- + +At this point the cluster is basically up and running. + + +=== Follow on SOPs +Several other SOPs should be followed to perform the post installation configuration on the cluster. 
+ +- http://linkmeh[Retrieve the OCP4 Cluster's CA Cert to configure haproxy] +- http://linkmeh[Configure the Image Registry Operator to use NFS Storage] +- http://linkmeh[Configure OIDC for Noggin/IPA in OCP4] +- http://linkmeh[Disable self provisioners role] +- http://linkmeh[Installation/Configuration of the Local Storage Operator] +- http://linkmeh[Installation/Configuration of the Openshift Container Storage Operator] +- http://linkmeh[Configure the OCP4 User Workload Monitoring Stack] + diff --git a/modules/ocp4/pages/sops.adoc b/modules/ocp4/pages/sops.adoc new file mode 100644 index 0000000..0da08bb --- /dev/null +++ b/modules/ocp4/pages/sops.adoc @@ -0,0 +1,30 @@ +== SOPs + +- xref:sop_installation.adoc[SOP Openshift 4 Installation on Fedora Infra] + +=== Configure the baremetal nodes to pxeboot with UEFI into RHCOS + +=== Create MachineConfigs to configure RHCOS + +=== Retrieve the OCP4 Cluster's CA Cert to configure haproxy + +=== Configure the Image Registry Operator to use NFS Storage + +=== Configure OIDC for Noggin/IPA in OCP4 + +=== Disable self provisioners role + +=== Installation/Configuration of the Local Storage Operator + +=== Installation/Configuration of the Openshift Container Storage Operator + +=== Configure the OCP4 User Workload Monitoring Stack + + + + + + + + + diff --git a/modules/sysadmin_guide/nav.adoc b/modules/sysadmin_guide/nav.adoc index 8d50a04..269824d 100644 --- a/modules/sysadmin_guide/nav.adoc +++ b/modules/sysadmin_guide/nav.adoc @@ -1,4 +1,5 @@ * xref:orientation.adoc[Orientation for Sysadmin Guide] +** xref:ocp4:sops.adoc[Openshift 4 SOPs] * xref:index.adoc[Sysadmin Guide] ** xref:2-factor.adoc[Two factor auth] ** xref:accountdeletion.adoc[Account Deletion SOP] From 746afc6e30984c34295c65f4b9803ffb0abe9e99 Mon Sep 17 00:00:00 2001 From: David Kirwan Date: Sep 29 2021 01:45:35 +0000 Subject: [PATCH 2/15] SOP OCP4 configure UEFI boot Signed-off-by: David Kirwan --- diff --git a/modules/ocp4/pages/sop_configure_baremetal_pxe_uefi_boot.adoc b/modules/ocp4/pages/sop_configure_baremetal_pxe_uefi_boot.adoc new file mode 100644 index 0000000..0356f9f --- /dev/null +++ b/modules/ocp4/pages/sop_configure_baremetal_pxe_uefi_boot.adoc @@ -0,0 +1,51 @@ +== Configure Baremetal PXE-UEFI Boot +A high level overview of how a baremetal node in the Fedora Infra gets booted via UEFI is as follows. + +- Server powered on +- Gets ip via dhcp +- DHCP server uses `next-server` command to point the Server to next contact the tftpboot server and retrieve `grub.cfg` +- tftpboot serves `grub.cfg` +- Sysadmin manually chooses the correct UEFI menu to boot +- tftpboot serves kernal and initramfs to the server +- Server boots with kernal and initramfs, and retrieves ingition file from `os-control01` + +=== Resources + +- [1] https://pagure.io/fedora-infra/ansible/blob/main/f/roles/dhcp_server[Ansible Role DHCP Server] +- [2] https://pagure.io/fedora-infra/ansible/blob/main/f/roles/tftp_server[Ansible Role tftpboot server] + +=== UEFI Configuration +The configuration for UEFI booting is contained in the `grub.cfg` config which is not currently under source control. It is located on the `batcave01` at: `/srv/web/infra/bigfiles/tftpboot2/uefi/grub.cfg`. + +The following is a sample configuration to install a baremetal OCP4 worker in the Staging cluster. 
+ +---- +menuentry 'RHCOS 4.8 worker staging' { + linuxefi images/RHCOS/4.8/x86_64/rhcos-4.8.2-x86_64-live-kernel-x86_64 ip=dhcp nameserver=10.3.163.33 coreos.inst.install_dev=/dev/sda coreos.live.rootfs_url=http://10.3.166.50/rhcos/rhcos-4.8.2-x86_64-live-rootfs.x86_64.img coreos.inst.ignition_url=http://10.3.166.50/rhcos/worker.ign + initrdefi images/RHCOS/4.8/x86_64/rhcos-4.8.2-x86_64-live-initramfs.x86_64.img +} +---- + +Any new changes must be made here. Writing to this file requires one to be a member of the `sysadmin-main` group, so best to instead create a ticket in the Fedora Infra issue tracker with patch request. See the following https://pagure.io/fedora-infrastructure/issue/10213[PR] for inspiration. + +=== Pushing new changes out to the tftpboot server +To push out changes made to the `grub.cfg` the following playbook should be run, which requires `sysadmin-noc` group permissions: + +---- +sudo rbac-playbook groups/noc.yml -t 'tftp_server,dhcp_server' +---- + +On the `noc01` instance the `grub.cfg` file is located at `/var/lib/tftpboot/uefi/grub.cfg` + +If particular changes to OS images for example, are required, they should be made on the `noc01` instance directly at `/var/lib/tftpboot/images/`. This will require users to be in the `sysadmin-noc` group. + + + + + + + + + + + diff --git a/modules/ocp4/pages/sops.adoc b/modules/ocp4/pages/sops.adoc index 0da08bb..38dabc8 100644 --- a/modules/ocp4/pages/sops.adoc +++ b/modules/ocp4/pages/sops.adoc @@ -1,30 +1,4 @@ == SOPs - xref:sop_installation.adoc[SOP Openshift 4 Installation on Fedora Infra] - -=== Configure the baremetal nodes to pxeboot with UEFI into RHCOS - -=== Create MachineConfigs to configure RHCOS - -=== Retrieve the OCP4 Cluster's CA Cert to configure haproxy - -=== Configure the Image Registry Operator to use NFS Storage - -=== Configure OIDC for Noggin/IPA in OCP4 - -=== Disable self provisioners role - -=== Installation/Configuration of the Local Storage Operator - -=== Installation/Configuration of the Openshift Container Storage Operator - -=== Configure the OCP4 User Workload Monitoring Stack - - - - - - - - - +- xref:sop_configure_baremetal_pxe_uefi_boot.adoc[SOP Configure Baremetal PXE-UEFI Boot] From c894cc230443db336ce3f1359e6615881c419d78 Mon Sep 17 00:00:00 2001 From: David Kirwan Date: Sep 29 2021 01:45:35 +0000 Subject: [PATCH 3/15] SOP Create RHCOS ignition files Signed-off-by: David Kirwan --- diff --git a/modules/ocp4/pages/sop_create_machineconfigs.adoc b/modules/ocp4/pages/sop_create_machineconfigs.adoc new file mode 100644 index 0000000..016b68e --- /dev/null +++ b/modules/ocp4/pages/sop_create_machineconfigs.adoc @@ -0,0 +1,38 @@ +== Create MachineConfigs to Configure RHCOS + +=== Resources + +- [1] https://coreos.github.io/butane/getting-started/[Butane Getting Started] +- [2] https://docs.openshift.com/container-platform/4.8/post_installation_configuration/machine-configuration-tasks.html#installation-special-config-chrony_post-install-machine-configuration-tasks[OCP4 Post Installation Configuration] + +=== Butane +"Butane (formerly the Fedora CoreOS Config Transpiler) is a tool that consumes a Butane Config and produces an Ignition Config, which is a JSON document that can be given to a Fedora CoreOS machine when it first boots." 
[1] + +Butane is available in a container image, we can pull the latest version locally like so: + +---- +# Pull the latest release +podman pull quay.io/coreos/butane:release + +# Run butane using standard in and standard out +podman run -i --rm quay.io/coreos/butane:release --pretty --strict < your_config.bu > transpiled_config.ign + +# Run butane using files. +podman run --rm -v /path/to/your_config.bu:/config.bu:z quay.io/coreos/butane:release --pretty --strict /config.bu > transpiled_config.ign +---- + +We can create a CLI alias to make running the Butane container much easier like so: + +---- +alias butane='podman run --rm --tty --interactive \ + --security-opt label=disable \ + --volume ${PWD}:/pwd --workdir /pwd \ + quay.io/coreos/butane:release' +---- + +For more detailed information on how to structure your Butane file see [1]. Once created you can convert the butane config to an igntion file like so: + +---- +butane master_chrony_machineconfig.bu -o master_chrony_machineconfig.yaml +butane worker_chrony_machineconfig.bu -o worker_chrony_machineconfig.yaml +---- diff --git a/modules/ocp4/pages/sops.adoc b/modules/ocp4/pages/sops.adoc index 38dabc8..cd64e68 100644 --- a/modules/ocp4/pages/sops.adoc +++ b/modules/ocp4/pages/sops.adoc @@ -2,3 +2,4 @@ - xref:sop_installation.adoc[SOP Openshift 4 Installation on Fedora Infra] - xref:sop_configure_baremetal_pxe_uefi_boot.adoc[SOP Configure Baremetal PXE-UEFI Boot] +- xref:sop_create_machineconfigs.adoc[SOP Create MachineConfigs to Configure RHCOS] From 6a195f85edff5abcfbc240ed8553ee128fa322aa Mon Sep 17 00:00:00 2001 From: David Kirwan Date: Sep 29 2021 01:45:35 +0000 Subject: [PATCH 4/15] SOP Retrieve ocp4 cacert Signed-off-by: David Kirwan --- diff --git a/modules/ocp4/pages/sop_configure_baremetal_pxe_uefi_boot.adoc b/modules/ocp4/pages/sop_configure_baremetal_pxe_uefi_boot.adoc index 0356f9f..17b584b 100644 --- a/modules/ocp4/pages/sop_configure_baremetal_pxe_uefi_boot.adoc +++ b/modules/ocp4/pages/sop_configure_baremetal_pxe_uefi_boot.adoc @@ -38,14 +38,3 @@ sudo rbac-playbook groups/noc.yml -t 'tftp_server,dhcp_server' On the `noc01` instance the `grub.cfg` file is located at `/var/lib/tftpboot/uefi/grub.cfg` If particular changes to OS images for example, are required, they should be made on the `noc01` instance directly at `/var/lib/tftpboot/images/`. This will require users to be in the `sysadmin-noc` group. - - - - - - - - - - - diff --git a/modules/ocp4/pages/sop_retrieve_ocp4_cacert.adoc b/modules/ocp4/pages/sop_retrieve_ocp4_cacert.adoc new file mode 100644 index 0000000..01b9136 --- /dev/null +++ b/modules/ocp4/pages/sop_retrieve_ocp4_cacert.adoc @@ -0,0 +1,22 @@ +== SOP Retrieve OCP4 Cluster CACERT + +=== Resources + +- [1] https://pagure.io/fedora-infra/ansible/blob/main/f/roles/dhcp_server[Ansible Role DHCP Server] + +=== Retrieve CACERT +In Fedora Infra, we have Apache terminating TLS for the cluster. Connections to the api and the machineconfig server are handled by haproxy. To prevent TLS errors we must configure haproxy with the OCP4 Cluster CA Cert. + +This can be retrieved once the cluster control plane has been installed, from the `os-control01` node like so: + +---- +oc get configmap kube-root-ca.crt -o yaml -n openshift-ingress +---- + +Extract this CACERT in full, and commit it to ansible at: `https://pagure.io/fedora-infra/ansible/blob/main/f/roles/haproxy/files/ocp.-iad2.pem` + +To deploy this cert, one must be apart of the `sysadmin-noc` group. 
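
If only the certificate body is needed (rather than the full ConfigMap YAML), a minimal sketch like the following can extract it directly; it assumes the standard `ca.crt` data key used by the `kube-root-ca.crt` ConfigMap:

----
# Extract just the PEM data; rename the output file to match the path used by the haproxy role
oc get configmap kube-root-ca.crt -n openshift-ingress \
  -o jsonpath='{.data.ca\.crt}' > cluster-ca.pem
----
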
Run the following playbook: + +---- +sudo rbac-playbook groups/proxies.yml -t 'haproxy' +---- diff --git a/modules/ocp4/pages/sops.adoc b/modules/ocp4/pages/sops.adoc index cd64e68..c496ba4 100644 --- a/modules/ocp4/pages/sops.adoc +++ b/modules/ocp4/pages/sops.adoc @@ -3,3 +3,4 @@ - xref:sop_installation.adoc[SOP Openshift 4 Installation on Fedora Infra] - xref:sop_configure_baremetal_pxe_uefi_boot.adoc[SOP Configure Baremetal PXE-UEFI Boot] - xref:sop_create_machineconfigs.adoc[SOP Create MachineConfigs to Configure RHCOS] +- xref:sop_retrieve_ocp4_cacert.adoc[SOP Retrieve OCP4 CACERT] From 2c7109e747767bec48033256682fc1f3cc25ee40 Mon Sep 17 00:00:00 2001 From: David Kirwan Date: Sep 29 2021 01:45:35 +0000 Subject: [PATCH 5/15] SOP Configure Image Registry Operator Signed-off-by: David Kirwan --- diff --git a/modules/ocp4/pages/sop_configure_image_registry_operator.adoc b/modules/ocp4/pages/sop_configure_image_registry_operator.adoc new file mode 100644 index 0000000..417a715 --- /dev/null +++ b/modules/ocp4/pages/sop_configure_image_registry_operator.adoc @@ -0,0 +1,59 @@ +== SOP Configure the Image Registry Operator + +=== Resources +- [1] https://docs.openshift.com/container-platform/4.8/registry/configuring_registry_storage/configuring-registry-storage-baremetal.html#configuring-registry-storage-baremetal[Configuring Registry Storage Baremetal] + + +=== Enable the image registry operator +For detailed instructions please refer to the official documentation for the particular version of Openshift [1]. + +From the `os-control01` node we can enable the Image Registry Operator set it to a `Managed` state like so via the CLI.: + +---- +oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"managementState":"Managed"}}' +---- + +Next edit the configuration for the Image Registry operator like so: + +---- +oc edit configs.imageregistry.operator.openshift.io +---- + +Add the following to replace the `storage: {}`: + +---- +... +storage: + pvc: + claim: +... +---- + +Save the config. + +The Image registry will automatically claim a 100G sized PV if available. It is best to open a ticket with Fedora Infra and have a 100G NFS share be created. + +Use the following template for inspiration, populate the particular values to match the newly created NFS Share. + +---- +kind: PersistentVolume +apiVersion: v1 +metadata: + name: ocp-image-registry-volume +spec: + capacity: + storage: 100Gi + nfs: + server: 10.3.162.11 + path: /ocp_prod_registry + accessModes: + - ReadWriteMany + persistentVolumeReclaimPolicy: Retain + volumeMode: Filesystem +---- + +To create this new PV, create a persisent volume template file like above and apply it using the Openshift client tool like so: + +---- +oc apply -f image-registry-pv.yaml +---- diff --git a/modules/ocp4/pages/sop_installation.adoc b/modules/ocp4/pages/sop_installation.adoc index 45f03d2..d2a6072 100644 --- a/modules/ocp4/pages/sop_installation.adoc +++ b/modules/ocp4/pages/sop_installation.adoc @@ -203,11 +203,7 @@ At this point the cluster is basically up and running. === Follow on SOPs Several other SOPs should be followed to perform the post installation configuration on the cluster. 
-- http://linkmeh[Retrieve the OCP4 Cluster's CA Cert to configure haproxy] -- http://linkmeh[Configure the Image Registry Operator to use NFS Storage] -- http://linkmeh[Configure OIDC for Noggin/IPA in OCP4] -- http://linkmeh[Disable self provisioners role] -- http://linkmeh[Installation/Configuration of the Local Storage Operator] -- http://linkmeh[Installation/Configuration of the Openshift Container Storage Operator] -- http://linkmeh[Configure the OCP4 User Workload Monitoring Stack] - +- xref:sop_configure_baremetal_pxe_uefi_boot.adoc[SOP Configure Baremetal PXE-UEFI Boot] +- xref:sop_create_machineconfigs.adoc[SOP Create MachineConfigs to Configure RHCOS] +- xref:sop_retrieve_ocp4_cacert.adoc[SOP Retrieve OCP4 CACERT] +- xref:sop_configure_image_registry_operator.adoc[SOP Configure the Image Registry Operator] diff --git a/modules/ocp4/pages/sops.adoc b/modules/ocp4/pages/sops.adoc index c496ba4..ee61b06 100644 --- a/modules/ocp4/pages/sops.adoc +++ b/modules/ocp4/pages/sops.adoc @@ -4,3 +4,4 @@ - xref:sop_configure_baremetal_pxe_uefi_boot.adoc[SOP Configure Baremetal PXE-UEFI Boot] - xref:sop_create_machineconfigs.adoc[SOP Create MachineConfigs to Configure RHCOS] - xref:sop_retrieve_ocp4_cacert.adoc[SOP Retrieve OCP4 CACERT] +- xref:sop_configure_image_registry_operator.adoc[SOP Configure the Image Registry Operator] From ecb1217069807e8d7d33b1540aad58e10539ac57 Mon Sep 17 00:00:00 2001 From: David Kirwan Date: Sep 29 2021 01:45:35 +0000 Subject: [PATCH 6/15] SOP configure oauth Signed-off-by: David Kirwan --- diff --git a/modules/ocp4/pages/sop_configure_oauth_ipa.adoc b/modules/ocp4/pages/sop_configure_oauth_ipa.adoc new file mode 100644 index 0000000..12a989e --- /dev/null +++ b/modules/ocp4/pages/sop_configure_oauth_ipa.adoc @@ -0,0 +1,48 @@ +== SOP Configure oauth Authentication via IPA/Noggin + + +=== Resources + +- [1] https://pagure.io/fedora-infra/ansible/blob/main/f/files/communishift/objects[Example Config from Communishift] + + +=== OIDC Setup +The first step is to request that a secret be created for this environment, please open a ticket with Fedora Infra. Once the secret has been made available we can add it to an Openshift Secret in the cluster like so: + +---- +oc create secret generic fedoraidp-clientsecret --from-literal=clientSecret= -n openshift-config +---- + +Next we can update the oauth configuration on the cluster and add the config for ipa/noggin/ipsilon. See the following snippet for inspiration: + +---- +apiVersion: config.openshift.io/v1 +kind: OAuth +metadata: + name: cluster +spec: + identityProviders: +... + - name: fedoraidp + login: true + challenge: false + mappingMethod: claim + type: OpenID + openID: + clientID: ocp + clientSecret: + name: fedoraidp-clientsecret + extraScopes: + - email + - profile + claims: + preferredUsername: + - nickname + name: + - name + email: + - email + issuer: https://id.fedoraproject.org +---- + +This config already exists in the cluster so you need to edit or patch it, you can't just `oc apply -f template.yaml`. 
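
For reference, a minimal sketch of making the same change non-interactively with a merge patch; note that a merge patch replaces the entire `identityProviders` list, so every provider you want to keep must be included in the patch:

----
oc patch oauth cluster --type merge -p '{
  "spec": {
    "identityProviders": [
      {
        "name": "fedoraidp",
        "mappingMethod": "claim",
        "type": "OpenID",
        "openID": {
          "clientID": "ocp",
          "clientSecret": {"name": "fedoraidp-clientsecret"},
          "extraScopes": ["email", "profile"],
          "claims": {
            "preferredUsername": ["nickname"],
            "name": ["name"],
            "email": ["email"]
          },
          "issuer": "https://id.fedoraproject.org"
        }
      }
    ]
  }
}'
----
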
diff --git a/modules/ocp4/pages/sops.adoc b/modules/ocp4/pages/sops.adoc index ee61b06..4c509ed 100644 --- a/modules/ocp4/pages/sops.adoc +++ b/modules/ocp4/pages/sops.adoc @@ -1,7 +1,8 @@ == SOPs - xref:sop_installation.adoc[SOP Openshift 4 Installation on Fedora Infra] -- xref:sop_configure_baremetal_pxe_uefi_boot.adoc[SOP Configure Baremetal PXE-UEFI Boot] - xref:sop_create_machineconfigs.adoc[SOP Create MachineConfigs to Configure RHCOS] +- xref:sop_configure_baremetal_pxe_uefi_boot.adoc[SOP Configure Baremetal PXE-UEFI Boot] - xref:sop_retrieve_ocp4_cacert.adoc[SOP Retrieve OCP4 CACERT] - xref:sop_configure_image_registry_operator.adoc[SOP Configure the Image Registry Operator] +- xref:sop_configure_oauth_ipa.adoc[SOP Configure oauth Authentication via IPA/Noggin] From bcf8bf187cc12e02e6ba3b376c291718e707ead3 Mon Sep 17 00:00:00 2001 From: David Kirwan Date: Sep 29 2021 01:45:35 +0000 Subject: [PATCH 7/15] SOP disable self-provisioners role --- diff --git a/modules/ocp4/pages/sop_disable_provisioners_role.adoc b/modules/ocp4/pages/sop_disable_provisioners_role.adoc new file mode 100644 index 0000000..0808696 --- /dev/null +++ b/modules/ocp4/pages/sop_disable_provisioners_role.adoc @@ -0,0 +1,70 @@ +== SOP Disable `self-provisioners` Role + +=== Resources + +- [1] https://docs.openshift.com/container-platform/4.4/applications/projects/configuring-project-creation.html#disabling-project-self-provisioning_configuring-project-creation + + +=== Disabling self-provisioners role +By default, when a user authenticates with Openshift via Oauth, it is part of the `self-provisioners` group. This group provides the ability to create new projects. On CentOS CI we do not want users to be able to create their own projects, as we have a system in place where we create a project and control the administrators of that project. + +To disable the self-provisioner role do the following as outlined in the documentation[1]. + +---- +oc describe clusterrolebinding.rbac self-provisioners + +Name: self-provisioners +Labels: +Annotations: rbac.authorization.kubernetes.io/autoupdate=true +Role: + Kind: ClusterRole + Name: self-provisioner +Subjects: + Kind Name Namespace + ---- ---- --------- + Group system:authenticated:oauth +---- + +Remove the subjects that the self-provisioners role applies to. + +---- +oc patch clusterrolebinding.rbac self-provisioners -p '{"subjects": null}' +---- + +Verify the change occurred successfully + +---- +oc describe clusterrolebinding.rbac self-provisioners +Name: self-provisioners +Labels: +Annotations: rbac.authorization.kubernetes.io/autoupdate: true +Role: + Kind: ClusterRole + Name: self-provisioner +Subjects: + Kind Name Namespace + ---- ---- --------- +---- + +When the cluster is updated to a new version, unless we mark the role appropriately, the permissions will be restored after the update is complete. + +Verify that the value is currently set to be restored after an update: + +---- +oc get clusterrolebinding.rbac self-provisioners -o yaml +---- + +---- +apiVersion: authorization.openshift.io/v1 +kind: ClusterRoleBinding +metadata: + annotations: + rbac.authorization.kubernetes.io/autoupdate: "true" + ... +---- + +We wish to set this `rbac.authorization.kubernetes.io/autoupdate` to `false`. To patch this do the following. 
+ +---- +oc patch clusterrolebinding.rbac self-provisioners -p '{ "metadata": { "annotations": { "rbac.authorization.kubernetes.io/autoupdate": "false" } } }' +---- diff --git a/modules/ocp4/pages/sops.adoc b/modules/ocp4/pages/sops.adoc index 4c509ed..998ca86 100644 --- a/modules/ocp4/pages/sops.adoc +++ b/modules/ocp4/pages/sops.adoc @@ -6,3 +6,4 @@ - xref:sop_retrieve_ocp4_cacert.adoc[SOP Retrieve OCP4 CACERT] - xref:sop_configure_image_registry_operator.adoc[SOP Configure the Image Registry Operator] - xref:sop_configure_oauth_ipa.adoc[SOP Configure oauth Authentication via IPA/Noggin] +- xref:sop_disable_provisioners_role.adoc[SOP Disable the Provisioners Role] From 0a1ee8e13cb14e95ecdba04ae1317d309454ddec Mon Sep 17 00:00:00 2001 From: David Kirwan Date: Sep 29 2021 01:45:35 +0000 Subject: [PATCH 8/15] SOP Configure local storage Signed-off-by: David Kirwan --- diff --git a/modules/ocp4/pages/sop_configure_local_storage_operator.adoc b/modules/ocp4/pages/sop_configure_local_storage_operator.adoc new file mode 100644 index 0000000..f467742 --- /dev/null +++ b/modules/ocp4/pages/sop_configure_local_storage_operator.adoc @@ -0,0 +1,27 @@ +== Configure the Local Storage Operator + +=== Resources +- [1] https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.7/html/deploying_openshift_container_storage_using_bare_metal_infrastructure/deploy-using-local-storage-devices-bm +- [2] https://github.com/centosci/ocp4-docs/blob/master/sops/localstorage/installation.md + + +=== Installation +For installation instructions visit the official docs at: [1]. The CentOS CI SOP at [2] also has more context but it is now slightly dated. + +- From the webconsole, click on the `Operators` option, then `OperatorHub` +- Search for `Local Storage` +- Click install +- Make sure the `Update Channel` matches the major.minor version of your OCP4 install +- Choose `A specific namespace on this cluster` +- Choose `Operator recommended namespace` +- Update approval set to automatic +- Click install + +=== Configuration +A prerequisite to this step is to have all volumes on the nodes already formatted and available prior to this step. This can be done via a machineconfig/ignition file during installation time, or alternatively SSH onto the boxes and manually create / format the volumes. + +- Create a `LocalVolumeDiscovery` and configured it to target the disks on all nodes +- When that process is complete, it creates `LocalVolumeDiscoveryResult` objects which you can search the type for, then examine to see if it has found the correct disks and if they are showing as available. +- Create a `LocalVolumeSet`: name `local-block` storage class `local-block` type all, devicetypes disk, part, filter disks by, choose the selected nodes worker01-03, volume mode block. Create. +- After a period of time check the newly created LocalVolumeSet `local-block` object's yaml definition, it should show the correct number of volumes listed in the `totalProvisionedDeviceCount` field. 
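
The checks above can also be performed from the CLI; a short sketch, assuming the operator is installed in its recommended `openshift-local-storage` namespace:

----
# Inspect the discovery results to confirm the expected disks were found on each node
oc get localvolumediscoveryresults -n openshift-local-storage

# The LocalVolumeSet status should eventually report the provisioned device count
oc get localvolumeset local-block -n openshift-local-storage -o yaml | grep totalProvisionedDeviceCount

# The resulting block-mode PVs should use the local-block storage class
oc get pv | grep local-block
----
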
+ diff --git a/modules/ocp4/pages/sops.adoc b/modules/ocp4/pages/sops.adoc index 998ca86..c82528b 100644 --- a/modules/ocp4/pages/sops.adoc +++ b/modules/ocp4/pages/sops.adoc @@ -7,3 +7,5 @@ - xref:sop_configure_image_registry_operator.adoc[SOP Configure the Image Registry Operator] - xref:sop_configure_oauth_ipa.adoc[SOP Configure oauth Authentication via IPA/Noggin] - xref:sop_disable_provisioners_role.adoc[SOP Disable the Provisioners Role] +- xref:sop_configure_local_storage_operator.adoc[SOP Configure the Local Storage Operator] + From 7308a04519a25ce7f3895a4e49303ba6147e432f Mon Sep 17 00:00:00 2001 From: David Kirwan Date: Sep 29 2021 01:45:35 +0000 Subject: [PATCH 9/15] SOP Enable User Workload Monitoring Stack SOP Configure Openshift Container Storage --- diff --git a/modules/ocp4/pages/sop_configure_openshift_container_storage.adoc b/modules/ocp4/pages/sop_configure_openshift_container_storage.adoc new file mode 100644 index 0000000..2178cdd --- /dev/null +++ b/modules/ocp4/pages/sop_configure_openshift_container_storage.adoc @@ -0,0 +1,29 @@ +== Configure the Openshift Container Storage Operator + + +=== Resources + +- [1] https://docs.openshift.com/container-platform/4.8/storage/persistent_storage/persistent-storage-ocs.html[Official Docs] +- [2] https://github.com/red-hat-storage/ocs-operator[Github] + +=== Installation +For full detailed instructions please refer to the official docs at: [1]. For general instructions see below: + +- In the webconsole, click the Operators menu +- Click the OperatorHub menu +- Search for `OpenShift Container Storage` +- Click install +- Choose the update channel to match the major.minor version of the cluster itself. +- Installation mode, A specified namespace on the cluster +- Installed namespace, Operator Recommended +- Update approval, automatic +- Click install + + +=== Configuration +When the operator is finished installing, we can continue, please ensure that a minimum of 3 nodes are available. + +- A `StorageCluster` is required to complete this installation, click the Create StorageCluster. +- + + diff --git a/modules/ocp4/pages/sop_configure_userworkload_monitoring_stack.adoc b/modules/ocp4/pages/sop_configure_userworkload_monitoring_stack.adoc new file mode 100644 index 0000000..3116300 --- /dev/null +++ b/modules/ocp4/pages/sop_configure_userworkload_monitoring_stack.adoc @@ -0,0 +1,40 @@ +== Enable User Workload Monitoring Stack + +=== Resources +- [1] https://docs.openshift.com/container-platform/4.8/monitoring/enabling-monitoring-for-user-defined-projects.html[Official Docs] +- [2] https://docs.openshift.com/container-platform/4.8/monitoring/enabling-monitoring-for-user-defined-projects.html#granting-users-permission-to-monitor-user-defined-projects_enabling-monitoring-for-user-defined-projects[Providing Access to the UWMS features] +- [3] https://docs.openshift.com/container-platform/4.8/monitoring/enabling-monitoring-for-user-defined-projects.html#granting-user-permissions-using-the-web-console_enabling-monitoring-for-user-defined-projects[Providing Access to the UWMS dashboard] + +=== Configuration +To enable the stack edit the `cluster-monitoring` ConfigMap like so: + +---- +oc -n openshift-monitoring edit configmap cluster-monitoring-config +---- + +Set the `enableUserWorkload` to `true` like so: + +---- +apiVersion: v1 +kind: ConfigMap +metadata: + name: cluster-monitoring-config + namespace: openshift-monitoring +data: + config.yaml: | + enableUserWorkload: true +---- + +Save the configmap changes. 
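
As a quick check that the change was stored before watching the rollout, the rendered config can be printed back out:

----
oc -n openshift-monitoring get configmap cluster-monitoring-config \
  -o jsonpath='{.data.config\.yaml}'
----
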
Monitor the rollout progress of the User Workload Monitoring Stack with the following: + +---- +oc -n openshift-user-workload-monitoring get pod +NAME READY STATUS RESTARTS AGE +prometheus-operator-6f7b748d5b-t7nbg 2/2 Running 0 3h +prometheus-user-workload-0 4/4 Running 1 3h +prometheus-user-workload-1 4/4 Running 1 3h +thanos-ruler-user-workload-0 3/3 Running 0 3h +thanos-ruler-user-workload-1 3/3 Running 0 3h +---- + +To provide access to users to create `PrometheusRule` and `ServiceMonitor` and `PodMonitor` objects see [2]. To allow access to the User Workload Monitoring Stack dashboard see [3]. diff --git a/modules/ocp4/pages/sops.adoc b/modules/ocp4/pages/sops.adoc index c82528b..0e74933 100644 --- a/modules/ocp4/pages/sops.adoc +++ b/modules/ocp4/pages/sops.adoc @@ -8,4 +8,5 @@ - xref:sop_configure_oauth_ipa.adoc[SOP Configure oauth Authentication via IPA/Noggin] - xref:sop_disable_provisioners_role.adoc[SOP Disable the Provisioners Role] - xref:sop_configure_local_storage_operator.adoc[SOP Configure the Local Storage Operator] - +- xref:sop_configure_openshift_container_storage.adoc[SOP Configure the Openshift Container Storage Operator] +- xref:sop_configure_userworkload_monitoring_stack.adoc[SOP Configure the Userworkload Monitoring Stack] From 460af177da00f459a1e0fc43bc252666d84ebabc Mon Sep 17 00:00:00 2001 From: David Kirwan Date: Sep 29 2021 01:45:35 +0000 Subject: [PATCH 10/15] SOP Updated configure monitoring stack - added retention policy - added PVC claim policy Signed-off-by: David Kirwan --- diff --git a/modules/ocp4/pages/sop_configure_userworkload_monitoring_stack.adoc b/modules/ocp4/pages/sop_configure_userworkload_monitoring_stack.adoc index 3116300..ac4a1cd 100644 --- a/modules/ocp4/pages/sop_configure_userworkload_monitoring_stack.adoc +++ b/modules/ocp4/pages/sop_configure_userworkload_monitoring_stack.adoc @@ -4,6 +4,7 @@ - [1] https://docs.openshift.com/container-platform/4.8/monitoring/enabling-monitoring-for-user-defined-projects.html[Official Docs] - [2] https://docs.openshift.com/container-platform/4.8/monitoring/enabling-monitoring-for-user-defined-projects.html#granting-users-permission-to-monitor-user-defined-projects_enabling-monitoring-for-user-defined-projects[Providing Access to the UWMS features] - [3] https://docs.openshift.com/container-platform/4.8/monitoring/enabling-monitoring-for-user-defined-projects.html#granting-user-permissions-using-the-web-console_enabling-monitoring-for-user-defined-projects[Providing Access to the UWMS dashboard] +- [4] https://docs.openshift.com/container-platform/4.8/monitoring/configuring-the-monitoring-stack.html#configuring-persistent-storage[Configure Monitoring Stack] === Configuration To enable the stack edit the `cluster-monitoring` ConfigMap like so: @@ -23,6 +24,21 @@ metadata: data: config.yaml: | enableUserWorkload: true + prometheusK8s: + retention: 30d + volumeClaimTemplate: + spec: + storageClassName: ocs-storagecluster-ceph-rbd + resources: + requests: + storage: 100Gi + alertmanagerMain: + volumeClaimTemplate: + spec: + storageClassName: ocs-storagecluster-ceph-rbd + resources: + requests: + storage: 50Gi ---- Save the configmap changes. Monitor the rollout progress of the User Workload Monitoring Stack with the following: @@ -37,4 +53,43 @@ thanos-ruler-user-workload-0 3/3 Running 0 3h thanos-ruler-user-workload-1 3/3 Running 0 3h ---- +At this point we can create a `ConfigMap` to configure the User Workload Monitoring stack in the `openshift-user-workload-monitoring` namespace. 
+ +---- +oc create configmap user-workload-monitoring-config -n openshift-user-workload-monitoring +---- + +Then edit this ConfigMap: + +---- +oc -n openshift-user-workload-monitoring edit configmap user-workload-monitoring-config +---- + +Save the following configuration + +---- +apiVersion: v1 +kind: ConfigMap +metadata: + name: user-workload-monitoring-config + namespace: openshift-user-workload-monitoring +data: + config.yaml: | + prometheus: + retention: 30d + volumeClaimTemplate: + spec: + storageClassName: ocs-storagecluster-ceph-rbd + resources: + requests: + storage: 100Gi + thanosRuler: + volumeClaimTemplate: + spec: + storageClassName: ocs-storagecluster-ceph-rbd + resources: + requests: + storage: 50Gi +---- + To provide access to users to create `PrometheusRule` and `ServiceMonitor` and `PodMonitor` objects see [2]. To allow access to the User Workload Monitoring Stack dashboard see [3]. From c64ef798cdef0fbdd17a6156f8e125f5f3f07b35 Mon Sep 17 00:00:00 2001 From: David Kirwan Date: Sep 29 2021 01:45:35 +0000 Subject: [PATCH 11/15] SOP Configuration/Installation of the OCS Operator Signed-off-by: David Kirwan --- diff --git a/modules/ocp4/pages/sop_configure_baremetal_pxe_uefi_boot.adoc b/modules/ocp4/pages/sop_configure_baremetal_pxe_uefi_boot.adoc index 17b584b..048aaad 100644 --- a/modules/ocp4/pages/sop_configure_baremetal_pxe_uefi_boot.adoc +++ b/modules/ocp4/pages/sop_configure_baremetal_pxe_uefi_boot.adoc @@ -6,8 +6,8 @@ A high level overview of how a baremetal node in the Fedora Infra gets booted vi - DHCP server uses `next-server` command to point the Server to next contact the tftpboot server and retrieve `grub.cfg` - tftpboot serves `grub.cfg` - Sysadmin manually chooses the correct UEFI menu to boot -- tftpboot serves kernal and initramfs to the server -- Server boots with kernal and initramfs, and retrieves ingition file from `os-control01` +- tftpboot serves kernel and initramfs to the server +- Server boots with kernel and initramfs, and retrieves ingition file from `os-control01` === Resources diff --git a/modules/ocp4/pages/sop_configure_openshift_container_storage.adoc b/modules/ocp4/pages/sop_configure_openshift_container_storage.adoc index 2178cdd..2ad137b 100644 --- a/modules/ocp4/pages/sop_configure_openshift_container_storage.adoc +++ b/modules/ocp4/pages/sop_configure_openshift_container_storage.adoc @@ -7,6 +7,8 @@ - [2] https://github.com/red-hat-storage/ocs-operator[Github] === Installation +Important: before following this SOP, please ensure that you have already followed the SOP to install the Local Storage Operator first, as this is a requirement for the OCS operator. + For full detailed instructions please refer to the official docs at: [1]. For general instructions see below: - In the webconsole, click the Operators menu @@ -24,6 +26,12 @@ For full detailed instructions please refer to the official docs at: [1]. For ge When the operator is finished installing, we can continue, please ensure that a minimum of 3 nodes are available. - A `StorageCluster` is required to complete this installation, click the Create StorageCluster. -- +- At the top, choose the `internal - attached devices` mode. +- In the storageclass choose the `local-block` from the list. +- The compute/worker nodes with available storage appear in the list +- It automatically calculates the possible storage amount +- Click next +- On the `Security and Network` section just click next. +- Click create. 
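
Once the wizard completes, a sketch of checks along these lines (assuming the operator's default `openshift-storage` namespace) can confirm the storage cluster becomes ready and the expected storage classes exist:

----
# The StorageCluster should eventually report a Ready phase
oc get storagecluster -n openshift-storage

# Ceph health as reported by the rook-ceph operator
oc get cephcluster -n openshift-storage

# The OCS storage classes, e.g. ocs-storagecluster-ceph-rbd and ocs-storagecluster-cephfs
oc get storageclass | grep ocs-storagecluster
----
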
From 23c84096dde1b467a7268f99cf0d9c5aa52cd8fd Mon Sep 17 00:00:00 2001 From: David Kirwan Date: Sep 29 2021 01:45:35 +0000 Subject: [PATCH 12/15] Metrics-for-apps: sorted sysadmin SOP list Signed-off-by: David Kirwan --- diff --git a/modules/sysadmin_guide/nav.adoc b/modules/sysadmin_guide/nav.adoc index 269824d..8cd2ae9 100644 --- a/modules/sysadmin_guide/nav.adoc +++ b/modules/sysadmin_guide/nav.adoc @@ -1,5 +1,4 @@ * xref:orientation.adoc[Orientation for Sysadmin Guide] -** xref:ocp4:sops.adoc[Openshift 4 SOPs] * xref:index.adoc[Sysadmin Guide] ** xref:2-factor.adoc[Two factor auth] ** xref:accountdeletion.adoc[Account Deletion SOP] @@ -42,8 +41,8 @@ ** xref:gdpr_delete.adoc[GDPR Delete - SOP] ** xref:gdpr_sar.adoc[GDPR SAR - SOP] ** xref:geoip-city-wsgi.adoc[geoip-city-wsgi - SOP] -** xref:github2fedmsg.adoc[github2fedmsg - SOP] ** xref:github.adoc[Using github for Infra Projects - SOP] +** xref:github2fedmsg.adoc[github2fedmsg - SOP] ** xref:greenwave.adoc[Greenwave - SOP] ** xref:guestdisk.adoc[Guest Disk Resize - SOP] ** xref:guestedit.adoc[Guest Editing - SOP] @@ -60,9 +59,9 @@ ** xref:jenkins-fedmsg.adoc[Jenkins Fedmsg - SOP] ** xref:kerneltest-harness.adoc[Kerneltest-harness - SOP] ** xref:kickstarts.adoc[Kickstart Infrastructure - SOP] -** xref:koji.adoc[Koji Infrastructure - SOP] ** xref:koji-archive.adoc[Koji Archive - SOP] ** xref:koji-builder-setup.adoc[Setup Koji Builder - SOP] +** xref:koji.adoc[Koji Infrastructure - SOP] ** xref:koschei.adoc[Koschei - SOP] ** xref:layered-image-buildsys.adoc[Layered Image Build System - SOP] ** xref:mailman.adoc[Mailman Infrastructure - SOP] @@ -73,14 +72,15 @@ ** xref:memcached.adoc[Memcached Infrastructure - SOP] ** xref:message-tagging-service.adoc[Message Tagging Service - SOP] ** xref:mirrorhiding.adoc[Mirror Hiding Infrastructure - SOP] -** xref:mirrormanager.adoc[MirrorManager Infrastructure - SOP] ** xref:mirrormanager-S3-EC2-netblocks.adoc[AWS Mirrors - SOP] +** xref:mirrormanager.adoc[MirrorManager Infrastructure - SOP] ** xref:mote.adoc[mote - SOP] ** xref:nagios.adoc[Fedora Infrastructure Nagios - SOP] ** xref:netapp.adoc[Netapp Infrastructure - SOP] ** xref:new-hosts.adoc[DNS Host Addition - SOP] ** xref:nonhumanaccounts.adoc[Non-human Accounts Infrastructure - SOP] ** xref:nuancier.adoc[Nuancier - SOP] +** xref:ocp4:sops.adoc[Openshift 4 SOPs] ** xref:odcs.adoc[On Demand Compose Service - SOP] ** xref:openqa.adoc[OpenQA Infrastructure - SOP] ** xref:openshift.adoc[OpenShift - SOP] @@ -111,8 +111,8 @@ ** xref:torrentrelease.adoc[Torrent Releases Infrastructure - SOP] ** xref:unbound.adoc[Fedora Infra Unbound Notes - SOP] ** xref:virt-image.adoc[Fedora Infrastructure Kpartx Notes - SOP] -** xref:virtio.adoc[Virtio Notes - SOP] ** xref:virt-notes.adoc[Fedora Infrastructure Libvirt Notes - SOP] +** xref:virtio.adoc[Virtio Notes - SOP] ** xref:voting.adoc[Voting Infrastructure - SOP] ** xref:waiverdb.adoc[WaiverDB - SOP] ** xref:wcidff.adoc[What Can I Do For Fedora - SOP] From 6c17d91dbbeec9e35c463c747d4d685732243fd8 Mon Sep 17 00:00:00 2001 From: David Kirwan Date: Sep 29 2021 01:45:35 +0000 Subject: [PATCH 13/15] Metrics-for-apps: Added SOPs - cordoning nodes - graceful shutdown - graceful startup Signed-off-by: David Kirwan --- diff --git a/modules/ocp4/pages/sop_cordoning_nodes_and_draining_pods.adoc b/modules/ocp4/pages/sop_cordoning_nodes_and_draining_pods.adoc new file mode 100644 index 0000000..004657a --- /dev/null +++ b/modules/ocp4/pages/sop_cordoning_nodes_and_draining_pods.adoc @@ -0,0 +1,56 @@ +== Cordoning Nodes 
and Draining Pods
+This SOP should be followed in the following scenarios:
+
+- If maintenance is scheduled to be carried out on an Openshift node.
+
+
+=== Steps
+
+1. Connect to the `os-control01` host associated with this ENV. Become root: `sudo su -`.
+
+2. Mark the node as unschedulable:
+
+----
+nodes=$(oc get nodes -o name | sed -E "s/node\///")
+echo $nodes
+
+for node in ${nodes[@]}; do oc adm cordon $node; done
+node/ cordoned
+----
+
+3. Check that the node status is `NotReady,SchedulingDisabled`:
+
+----
+oc get node
+NAME STATUS ROLES AGE VERSION
+ NotReady,SchedulingDisabled worker 1d v1.18.3
+----
+
+Note: it might not switch to `NotReady` immediately, as there may be many pods still running.
+
+
+4. Evacuate the pods from **worker nodes** as follows.
+This will drain the node, delete any local data, ignore daemonsets, and give pods a grace period of 15 seconds to drain gracefully.
+
+----
+oc adm drain --delete-emptydir-data=true --ignore-daemonsets=true --grace-period=15
+----
+
+5. Perform the scheduled maintenance on the node.
+Do whatever is required in the scheduled maintenance window.
+
+
+6. Once the node is ready to be added back into the cluster,
+we must uncordon the node. This allows it to be marked schedulable once more.
+
+----
+nodes=$(oc get nodes -o name | sed -E "s/node\///")
+echo $nodes
+
+for node in ${nodes[@]}; do oc adm uncordon $node; done
+----
+
+
+=== Resources
+
+- [1] https://docs.openshift.com/container-platform/4.8/nodes/nodes/nodes-nodes-working.html[Nodes - working with nodes]
diff --git a/modules/ocp4/pages/sop_graceful_shutdown_ocp_cluster.adoc b/modules/ocp4/pages/sop_graceful_shutdown_ocp_cluster.adoc
new file mode 100644
index 0000000..7de41f5
--- /dev/null
+++ b/modules/ocp4/pages/sop_graceful_shutdown_ocp_cluster.adoc
@@ -0,0 +1,30 @@
+== Graceful Shutdown of an Openshift 4 Cluster
+This SOP should be followed in the following scenarios:
+
+- A graceful full shutdown of the Openshift 4 cluster is required.
+
+=== Steps
+
+Prerequisite steps:
+- Follow the SOP for cordoning and draining the nodes.
+- Follow the SOP for creating an `etcd` backup.
+
+
+1. Connect to the `os-control01` host associated with this ENV. Become root: `sudo su -`.
+
+2. Get a list of the nodes:
+
+----
+nodes=$(oc get nodes -o name | sed -E "s/node\///")
+----
+
+3. Shut down the nodes from the administration box associated with the cluster `ENV`, eg production/staging.
+
+----
+for node in ${nodes[@]}; do ssh -i /root/ocp4/ocp-/ssh/id_rsa core@$node sudo shutdown -h now; done
+----
+
+
+==== Resources
+
+- [1] https://docs.openshift.com/container-platform/4.5/backup_and_restore/graceful-cluster-shutdown.html[Graceful Cluster Shutdown]
diff --git a/modules/ocp4/pages/sop_graceful_startup_ocp_cluster.adoc b/modules/ocp4/pages/sop_graceful_startup_ocp_cluster.adoc
new file mode 100644
index 0000000..4fe76ae
--- /dev/null
+++ b/modules/ocp4/pages/sop_graceful_startup_ocp_cluster.adoc
@@ -0,0 +1,88 @@
+== Graceful Startup of an Openshift 4 Cluster
+This SOP should be followed in the following scenarios:
+
+- A graceful start up of an Openshift 4 cluster is required.
+
+=== Steps
+Prerequisite steps:
+
+
+==== Start the VM Control Plane instances
+Ensure that the control plane instances start first.
+
+----
+# Virsh command to start the VMs
+----
+
+
+==== Start the physical nodes
+To connect to `idrac`, you must be connected to the Red Hat VPN. Next, find the management IP associated with each node.
+ +On the `batcave01` instance, in the dns configuration, the following bare metal machines make up the production/staging OCP4 worker nodes. + +---- +oshift-dell01 IN A 10.3.160.180 # worker01 prod +oshift-dell02 IN A 10.3.160.181 # worker02 prod +oshift-dell03 IN A 10.3.160.182 # worker03 prod +oshift-dell04 IN A 10.3.160.183 # worker01 staging +oshift-dell05 IN A 10.3.160.184 # worker02 staging +oshift-dell06 IN A 10.3.160.185 # worker03 staging +---- + +Login to the `idrac` interface that corresponds with each worker, one at a time. Ensure the node is booting via harddrive, then power it on. + +==== Once the nodes have been started they must be uncordoned if appropriate + +---- +oc get nodes +NAME STATUS ROLES AGE VERSION +dumpty-n1.ci.centos.org Ready,SchedulingDisabled worker 77d v1.18.3+6c42de8 +dumpty-n2.ci.centos.org Ready,SchedulingDisabled worker 77d v1.18.3+6c42de8 +dumpty-n3.ci.centos.org Ready,SchedulingDisabled worker 77d v1.18.3+6c42de8 +dumpty-n4.ci.centos.org Ready,SchedulingDisabled worker 77d v1.18.3+6c42de8 +dumpty-n5.ci.centos.org Ready,SchedulingDisabled worker 77d v1.18.3+6c42de8 +kempty-n10.ci.centos.org Ready,SchedulingDisabled worker 106d v1.18.3+6c42de8 +kempty-n11.ci.centos.org Ready,SchedulingDisabled worker 106d v1.18.3+6c42de8 +kempty-n12.ci.centos.org Ready,SchedulingDisabled worker 106d v1.18.3+6c42de8 +kempty-n6.ci.centos.org Ready,SchedulingDisabled master 106d v1.18.3+6c42de8 +kempty-n7.ci.centos.org Ready,SchedulingDisabled master 106d v1.18.3+6c42de8 +kempty-n8.ci.centos.org Ready,SchedulingDisabled master 106d v1.18.3+6c42de8 +kempty-n9.ci.centos.org Ready,SchedulingDisabled worker 106d v1.18.3+6c42de8 + +nodes=$(oc get nodes -o name | sed -E "s/node\///") + +for node in ${nodes[@]}; do oc adm uncordon $node; done +node/dumpty-n1.ci.centos.org uncordoned +node/dumpty-n2.ci.centos.org uncordoned +node/dumpty-n3.ci.centos.org uncordoned +node/dumpty-n4.ci.centos.org uncordoned +node/dumpty-n5.ci.centos.org uncordoned +node/kempty-n10.ci.centos.org uncordoned +node/kempty-n11.ci.centos.org uncordoned +node/kempty-n12.ci.centos.org uncordoned +node/kempty-n6.ci.centos.org uncordoned +node/kempty-n7.ci.centos.org uncordoned +node/kempty-n8.ci.centos.org uncordoned +node/kempty-n9.ci.centos.org uncordoned + +oc get nodes +NAME STATUS ROLES AGE VERSION +dumpty-n1.ci.centos.org Ready worker 77d v1.18.3+6c42de8 +dumpty-n2.ci.centos.org Ready worker 77d v1.18.3+6c42de8 +dumpty-n3.ci.centos.org Ready worker 77d v1.18.3+6c42de8 +dumpty-n4.ci.centos.org Ready worker 77d v1.18.3+6c42de8 +dumpty-n5.ci.centos.org Ready worker 77d v1.18.3+6c42de8 +kempty-n10.ci.centos.org Ready worker 106d v1.18.3+6c42de8 +kempty-n11.ci.centos.org Ready worker 106d v1.18.3+6c42de8 +kempty-n12.ci.centos.org Ready worker 106d v1.18.3+6c42de8 +kempty-n6.ci.centos.org Ready master 106d v1.18.3+6c42de8 +kempty-n7.ci.centos.org Ready master 106d v1.18.3+6c42de8 +kempty-n8.ci.centos.org Ready master 106d v1.18.3+6c42de8 +kempty-n9.ci.centos.org Ready worker 106d v1.18.3+6c42de8 +---- + + +=== Resources + +- [1] https://docs.openshift.com/container-platform/4.5/backup_and_restore/graceful-cluster-restart.html[Graceful Cluster Startup] +- [2] https://docs.openshift.com/container-platform/4.5/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html#dr-restoring-cluster-state[Cluster disaster recovery] diff --git a/modules/ocp4/pages/sops.adoc b/modules/ocp4/pages/sops.adoc index 0e74933..7620939 100644 --- a/modules/ocp4/pages/sops.adoc +++ b/modules/ocp4/pages/sops.adoc @@ 
-1,12 +1,15 @@ == SOPs -- xref:sop_installation.adoc[SOP Openshift 4 Installation on Fedora Infra] -- xref:sop_create_machineconfigs.adoc[SOP Create MachineConfigs to Configure RHCOS] - xref:sop_configure_baremetal_pxe_uefi_boot.adoc[SOP Configure Baremetal PXE-UEFI Boot] -- xref:sop_retrieve_ocp4_cacert.adoc[SOP Retrieve OCP4 CACERT] - xref:sop_configure_image_registry_operator.adoc[SOP Configure the Image Registry Operator] -- xref:sop_configure_oauth_ipa.adoc[SOP Configure oauth Authentication via IPA/Noggin] -- xref:sop_disable_provisioners_role.adoc[SOP Disable the Provisioners Role] - xref:sop_configure_local_storage_operator.adoc[SOP Configure the Local Storage Operator] +- xref:sop_configure_oauth_ipa.adoc[SOP Configure oauth Authentication via IPA/Noggin] - xref:sop_configure_openshift_container_storage.adoc[SOP Configure the Openshift Container Storage Operator] - xref:sop_configure_userworkload_monitoring_stack.adoc[SOP Configure the Userworkload Monitoring Stack] +- xref:sop_cordoning_nodes_and_draining_pods.adoc[SOP Cordoning and Draining Nodes] +- xref:sop_create_machineconfigs.adoc[SOP Create MachineConfigs to Configure RHCOS] +- xref:sop_disable_provisioners_role.adoc[SOP Disable the Provisioners Role] +- xref:sop_graceful_shutdown_ocp_cluster.adoc[SOP Graceful Cluster Shutdown] +- xref:sop_graceful_startup_ocp_cluster.adoc[SOP Graceful Cluster Startup] +- xref:sop_installation.adoc[SOP Openshift 4 Installation on Fedora Infra] +- xref:sop_retrieve_ocp4_cacert.adoc[SOP Retrieve OCP4 CACERT] From 3908d237cd75dca48101bd987102ed43b6c002d8 Mon Sep 17 00:00:00 2001 From: David Kirwan Date: Sep 29 2021 01:45:35 +0000 Subject: [PATCH 14/15] metrics-for-apps: Added new sops - Cluster upgrades - Creating etcd backups Signed-off-by: David Kirwan --- diff --git a/modules/ocp4/pages/sop_etcd_backup.adoc b/modules/ocp4/pages/sop_etcd_backup.adoc new file mode 100644 index 0000000..fbfc07c --- /dev/null +++ b/modules/ocp4/pages/sop_etcd_backup.adoc @@ -0,0 +1,50 @@ +== Create etcd backup +This SOP should be followed in the following scenarios: + +- When the need exists to create an etcd backup. +- When shutting a cluster down gracefully. + +=== Resources + +- [1] https://docs.openshift.com/container-platform/4.8/backup_and_restore/backing-up-etcd.html[Creating an etcd backup] + +=== Take etcd backup + +1. Connect to the `os-control01` node associated with the ENV. + +2. Use the `oc` tool to make a debug connection to a controlplane node + +---- +oc debug node/ +---- + +3. Chroot to the /host directory on the containers filesystem + +---- +sh-4.2# chroot /host +---- + +4. Run the cluster-backup.sh script and pass in the location to save the backup to + +---- +sh-4.4# /usr/local/bin/cluster-backup.sh /home/core/assets/backup +---- + +5. Chown the backup files to be owned by user `core` and group `core` + +---- +chown -R core:core /home/core/assets/backup +---- + +6. From the admin machine, see inventory group: `ocp-ci-management`, become the Openshift service account, see the inventory hostvars for the host identified in the previous step and note the `ocp_service_account` variable. + +---- +ssh +sudo su - +---- + +7. Copy the files down to the `os-control01` machine. 
Then run the copy itself (the SSH key and controlplane host are left unspecified here):

----
scp -i core@:/home/core/assets/backup/* ocp_backups/
----
diff --git a/modules/ocp4/pages/sop_installation.adoc b/modules/ocp4/pages/sop_installation.adoc
index d2a6072..6e74301 100644
--- a/modules/ocp4/pages/sop_installation.adoc
+++ b/modules/ocp4/pages/sop_installation.adoc
@@ -207,3 +207,9 @@ Several other SOPs should be followed to perform the post installation configura
 - xref:sop_create_machineconfigs.adoc[SOP Create MachineConfigs to Configure RHCOS]
 - xref:sop_retrieve_ocp4_cacert.adoc[SOP Retrieve OCP4 CACERT]
 - xref:sop_configure_image_registry_operator.adoc[SOP Configure the Image Registry Operator]
+- xref:sop_disable_provisioners_role.adoc[SOP Disable the Provisioners Role]
+- xref:sop_configure_oauth_ipa.adoc[SOP Configure oauth Authentication via IPA/Noggin]
+- xref:sop_configure_local_storage_operator.adoc[SOP Configure the Local Storage Operator]
+- xref:sop_configure_openshift_container_storage.adoc[SOP Configure the Openshift Container Storage Operator]
+- xref:sop_configure_userworkload_monitoring_stack.adoc[SOP Configure the Userworkload Monitoring Stack]
+
diff --git a/modules/ocp4/pages/sop_upgrade.adoc b/modules/ocp4/pages/sop_upgrade.adoc
new file mode 100644
index 0000000..8cbef4f
--- /dev/null
+++ b/modules/ocp4/pages/sop_upgrade.adoc
@@ -0,0 +1,37 @@
== Upgrade OCP4 Cluster
Please see the official documentation for more information [1][3]; this SOP can be used as a rough guide.

=== Resources

- [1] https://docs.openshift.com/container-platform/4.8/updating/updating-cluster-between-minor.html[Upgrading OCP4 Cluster Between Minor Versions]
- [2] xref:sop_etcd_backup.adoc[SOP Create etcd backup]
- [3] https://docs.openshift.com/container-platform/4.8/operators/admin/olm-upgrading-operators.html[Upgrading installed Operators]
- [4] https://docs.openshift.com/container-platform/4.8/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html#dr-restoring-cluster-state[Restore etcd backup]
- [5] https://docs.openshift.com/container-platform/4.8/operators/admin/olm-upgrading-operators.html#olm-upgrading-operators[Upgrading Operators Prior to Cluster Update]

=== Prerequisites

- In case an upgrade fails, it is wise to first take an `etcd` backup. To do so, follow the SOP [2].
- Ensure that all installed Operators are at the latest versions for their channel [5].

=== Upgrade OCP
At the time of writing, the version installed on the cluster is `4.8.11` and the `upgrade channel` is set to `stable-4.8`. It is easiest to update the cluster via the web console. Go to:

- Administration
- Cluster Settings
- To upgrade between `z` (patch) versions (x.y.z), click the update button when one is available.
- When moving between `y` (minor) versions, you must first switch the `upgrade channel`, for example to `fast-4.9`. You should also be on the very latest `z`/patch version before upgrading.
- When the upgrade has finished, switch the `upgrade channel` back to stable.


=== Upgrade failures
In the worst case scenario we may have to restore etcd from the backups taken at the start [4], or reinstall a node entirely.

==== Troubleshooting
There are many possible ways an upgrade can fail midway through.

- Check the monitoring alerts currently firing; these can often hint towards the problem.
- Often individual nodes fail to apply the new MachineConfig changes; these will show up when examining the `MachineConfigPool` status.
- This might require a manual reboot of that particular node.
- This might require killing pods on that particular node.

diff --git a/modules/ocp4/pages/sops.adoc b/modules/ocp4/pages/sops.adoc
index 7620939..cb2bd29 100644
--- a/modules/ocp4/pages/sops.adoc
+++ b/modules/ocp4/pages/sops.adoc
@@ -13,3 +13,5 @@
 - xref:sop_graceful_startup_ocp_cluster.adoc[SOP Graceful Cluster Startup]
 - xref:sop_installation.adoc[SOP Openshift 4 Installation on Fedora Infra]
 - xref:sop_retrieve_ocp4_cacert.adoc[SOP Retrieve OCP4 CACERT]
+- xref:sop_upgrade.adoc[SOP Upgrade OCP4 Cluster]
+- xref:sop_etcd_backup.adoc[SOP Create etcd backup]

From 50846738b6e77db59e549ce792e9d696e8889b82 Mon Sep 17 00:00:00 2001
From: David Kirwan
Date: Sep 29 2021 01:45:35 +0000
Subject: [PATCH 15/15] metrics-for-apps: SOP Configure Openshift Virtualization Operator

Signed-off-by: David Kirwan
---
diff --git a/modules/ocp4/pages/sop_configure_openshift_virtualization_operator.adoc b/modules/ocp4/pages/sop_configure_openshift_virtualization_operator.adoc
new file mode 100644
index 0000000..97d058f
--- /dev/null
+++ b/modules/ocp4/pages/sop_configure_openshift_virtualization_operator.adoc
@@ -0,0 +1,56 @@
== Installation of the Openshift Virtualization Operator

=== Resources
- [1] https://alt.fedoraproject.org/cloud/[Fedora Images]
- [2] https://github.com/kubevirt/kubevirt/blob/main/docs/container-register-disks.md[Kubevirt Importing Containers of VMI Images]


=== Installation
From the web console, choose the `Operators` menu, then choose `OperatorHub`.

Search for `Openshift Virtualization`.

Click `Install`.

When the installation of the Operator has completed, create a `HyperConverged` object and follow the wizard; the default options should be fine, so click next through the menus.

Next, create a `HostPathProvisioner` object; again, the default options should be fine, so click next through the menus.


=== Verification
To verify that the installation of the Operator was successful, we can attempt to create a VM.

From the Fedora images site [1], download the Fedora 34 `Cloud Base image for Openstack` image in `qcow2` format.

Create a `Dockerfile` with the following contents:

----
FROM scratch
ADD fedora34.qcow2 /disk/
----

Build the container:

----
podman build -t fedora34:latest .
----

Push the container to your username at quay.io:

----
podman push quay.io//fedora34:latest
----

In the web console, visit the `Workloads`, then `Virtualization` menu.

Create a VirtualMachine with the wizard.

Choose Fedora and click next.

From the boot source dropdown menu, select import via Registry.

In the container image field, add the one prepared earlier, e.g. `quay.io/dkirwan/fedora34`.

Click `Advanced Storage settings`, change the storageclass to `oc-storagecluster-ceph-rbd`, then click next and done.

Once the VM is created and booted, the console is available from the top right drop down menu.
diff --git a/modules/ocp4/pages/sops.adoc b/modules/ocp4/pages/sops.adoc
index cb2bd29..292d40a 100644
--- a/modules/ocp4/pages/sops.adoc
+++ b/modules/ocp4/pages/sops.adoc
@@ -15,3 +15,4 @@
 - xref:sop_retrieve_ocp4_cacert.adoc[SOP Retrieve OCP4 CACERT]
 - xref:sop_upgrade.adoc[SOP Upgrade OCP4 Cluster]
 - xref:sop_etcd_backup.adoc[SOP Create etcd backup]
+- xref:sop_configure_openshift_virtualization_operator.adoc[SOP Configure the Openshift Virtualization Operator]
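As an optional, CLI-based follow-up to the Openshift Virtualization verification steps above, the state of the example VM can also be checked with `oc` and `virtctl`. This is only a sketch: the VM name `fedora34` and the project `default` are assumptions carried over from the wizard example, and `virtctl` must be installed separately.

----
# List VirtualMachines and running VirtualMachineInstances (assumed name/namespace)
oc get vm,vmi -n default

# Attach to the serial console of the example VM (assumed name)
virtctl console fedora34 -n default
----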