From f03fde0f5ad537d422a2848e96243743bdc4fc3b Mon Sep 17 00:00:00 2001
From: Kevin Fenzi
Date: May 15 2023 09:46:13 +0000
Subject: massupgrade: lots of fixes and improvements

This SOP had a bunch of old stuff in it. This syncs it back up with
reality, mostly. Proofreading/formatting welcome. Questions also welcome. :)

Signed-off-by: Kevin Fenzi
---

diff --git a/modules/sysadmin_guide/pages/massupgrade.adoc b/modules/sysadmin_guide/pages/massupgrade.adoc
index 3ee53ea..003972b 100644
--- a/modules/sysadmin_guide/pages/massupgrade.adoc
+++ b/modules/sysadmin_guide/pages/massupgrade.adoc
@@ -38,39 +38,29 @@ Purpose:::
 
 == Preparation
 
-[arabic]
-. Determine which host group you are going to be doing updates/reboots
-on.
-+
-Group "A"::
-  servers that end users will see or note being down and anything that
-  depends on them.
-Group "B"::
-  servers that contributors will see or note being down and anything
-  that depends on them.
-Group "C"::
-  servers that infrastructure will notice are down, or are redundent
-  enough to reboot some with others taking the load.
-. Appoint an 'Update Leader' for the updates.
-. Follow the xref:outage.adoc[Outage Infrastructure SOP] and send advance notification
-to the appropriate lists. Try to schedule the update at a time when many
-admins are around to help/watch for problems and when impact for the
-group affected is less. Do NOT do multiple groups on the same day if
-possible.
-. Plan an order for rebooting the machines considering two factors:
-+
-____
-* Location of systems on the kvm or xen hosts. [You will normally reboot
-all systems on a host together]
-* Impact of systems going down on other services, operations and users.
-Thus since the database servers and nfs servers are the backbone of many
-other systems, they and systems that are on the same xen boxes would be
-rebooted before other boxes.
-____
-. To aid in organizing a mass upgrade/reboot with many people helping,
-it may help to create a checklist of machines in a gobby document.
-. Schedule downtime in nagios.
-. Make doubly sure that various app owners are aware of the reboots
+Mass updates are usually applied every few months, or sooner if there are
+critical bug fixes. Mass updates are done outside of freeze windows to avoid
+causing any problems for Fedora releases.
+
+The following items are all done before the actual mass update:
+
+* Plan an outage window or windows outside of a freeze.
+* File an outage ticket in the fedora-infrastructure tracker, using the outage
+template. This should describe the exact time/date and what is included.
+* Get the outage ticket reviewed by someone else to confirm there are no
+mistakes in it.
+* Send an outage announcement to the infrastructure and devel-announce lists
+(for outages that only affect contributors) or to the infrastructure,
+devel-announce and announce lists (for outages that affect all users).
+* Add a 'planned' outage to fedorastatus. This will show the planned outage
+there for higher visibility.
+* Set up a hackmd or other shared document that lists all the virthosts and
+bare metal hosts that need rebooting, organized per day. This is used
+to track which admin is handling which server(s).
+
+Typically updates/reboots are done on all staging hosts on Monday,
+then all non-outage-causing hosts on Tuesday, and finally the outages
+are on Wednesday.
 
 == Staging
 
@@ -78,43 +68,25 @@ ____
 Any updates that can be tested in staging or a pre-production
 environment should be tested there first. Including new kernels,
 updates to core database applications / libraries. Web applications, libraries,
-etc.
+etc. This is typically done a few days before the actual outage.
+If it's done too far in advance things may have changed again, so it's
+important to do this just before the production updates.
 ____
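+
+As a minimal sketch, updates can be pushed to staging from batcave01 with an
+ad-hoc ansible command (the 'staging' group name is an assumption; check the
+inventory for the exact group to target):
+
+....
+sudo ansible staging -m shell -a 'yum clean all; yum update -y; rkhunter --propupd'
+....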
+
+== Non-outage-causing hosts
+
+Some hosts can be safely updated/rebooted without an outage, either because
+they have multiple machines behind a load balancer, because they are not
+visible to end users, or for other reasons. These updates are typically
+done on the Tuesday of the outage week so they are finished before the outage
+on Wednesday. These hosts include the proxies and a number of virthosts whose
+VMs meet these criteria.
+
 == Special Considerations
 
 While this may not be a complete list, here are some special things
 that must be taken into account before rebooting certain systems:
 
-=== Disable builders
-
-Before the following machines are rebooted, all koji builders should be
-disabled and all running jobs allowed to complete:
-
-____
-* db04
-* nfs01
-* kojipkgs02
-____
-
-Builders can be removed from koji, updated and re-added. Use:
-
-....
-koji disable-host NAME
-
- and
-
-koji enable-host NAME
-....
-
-[NOTE]
-====
-you must be a koji admin
-====
-
-Additionally, rel-eng and builder boxes may need a special version
-of rpm. Make sure to check with rel-eng on any rpm upgrades for them.
-
 === Post reboot action
 
 The following machines require post-boot actions (mostly entering
@@ -122,50 +94,64 @@ passphrases). Make sure admins that have the
 passphrases are on hand for the reboot:
 
 ____
-* backup-2 (LUKS passphrase on boot)
-* sign-vault01 (NSS passphrase for sigul service)
-* sign-bridge01 (NSS passphrase for sigul bridge service)
-* serverbeach* (requires fixing firewall rules):
+* backup01 (ssh agent passphrase for backup ssh key)
+* sign-vault01 (NSS passphrase for sigul service and luks passphrase)
+* sign-bridge01 (run 'sigul_bridge -dvv' after it comes back up, no passphrase needed)
+* autosign01 (NSS passphrase for robosignatory service and luks passphrase)
+* buildvm-s390x-15/16/16 (need the sshfs mount of the koji volume redone)
+* batcave01 (ssh agent passphrase for ansible ssh key)
+* notifs-backend01 (after it comes back up, run:
+  rabbitmqctl eval 'application:set_env(rabbit, consumer_timeout, 36000000).'
+  systemctl restart fmn-backend@1; for i in `seq 1 24`; do echo $i; systemctl restart fmn-worker@$i | cat; done )
____
 
-Each serverbeach host needs 3 or 4 iptables rules added anytime it's
-rebooted or libvirt is upgraded:
+=== Bastion01 and Bastion02 and openvpn server
 
-....
-iptables -I FORWARD -o virbr0 -j ACCEPT
-iptables -I FORWARD -i virbr0 -j ACCEPT
-iptables -t nat -I POSTROUTING -s 192.168.122.3/32 -j SNAT --to-source 66.135.62.187
-....
+If a reboot of bastion01 is done during an outage, nothing needs to be changed
+here. However, if bastion01 will be down for an extended period of time,
+openvpn can be switched to bastion02 by stopping openvpn-server@openvpn
+on bastion01 and starting it on bastion02.
 
-[NOTE]
-====
-The source is the internal guest ips, the to-source is the external ips
-that map to that guest ip. If there are multiple guests, each one needs
-the above SNAT rule inserted.
-====
+On bastion01: 'systemctl stop openvpn-server@openvpn'
+On bastion02: 'systemctl start openvpn-server@openvpn'
 
-=== Schedule autoqa01 reboot
+The process can be reversed after the other host is back.
+Clients try bastion01 first, then bastion02 if it's down. It's important
+to make sure all the clients are using one machine or the other, because
+if they are split, routing between machines may be confused.
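+
+As a rough check (this is a sketch, not an exact procedure; the vpn host
+below is just an example) you can confirm which hub is serving the vpn
+after a switch:
+
+....
+# run on bastion01 and bastion02; exactly one should be active
+systemctl is-active openvpn-server@openvpn
+
+# spot-check that a vpn-only host is reachable again
+ping -c 3 download-rdu01.vpn.fedoraproject.org
+....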
 
-There is currently an autoqa01.c host on cnode01. Check with QA folks
-before rebooting this guest/host.
+=== batcave01
 
-=== Bastion01 and Bastion02 and openvpn server
+batcave01 is our ansible control host. It's where you run the playbooks
+mentioned in this SOP. However, it too needs updating and rebooting, and
+you cannot use the vhost_reboot playbook for it, since that would be
+rebooting its own virthost. For this host you should go to the virthost
+and 'virsh shutdown' all the other VMs, then 'virsh shutdown' batcave01,
+then reboot the virthost manually.
+
+=== noc01 / dhcp server
+
+noc01 is our dhcp server. Unfortunately, when rebooting the vmhost that
+contains the noc01 vm, that vmhost has no dhcp server to answer it when
+it boots and tries to configure the network to talk to the tang server.
+To work around this you can run a simple dhcpd on batcave01. Start it
+there, let the vmhost with noc01 come up, and then stop it. Ideally we
+would make another dhcp host to avoid this issue at some point.
 
-We need one of the bastion machines to be up to provide openvpn for all
-machines. Before rebooting bastion02, modify:
-`manifests/nodes/bastion0*.iad2.fedoraproject.org.pp` files to start
-openvpn server on bastion01, wait for all clients to re-connect, reboot
-bastion02 and then revert back to it as openvpn hub.
+On batcave01: 'systemctl start dhcpd'
 
-=== Special yum directives
+Remember to stop it after the host comes back up.
 
-Sometimes we will wish to exclude or otherwise modify the yum.conf on a
-machine. For this purpose, all machines have an include, making them
-read
-http://infrastructure.fedoraproject.org/infra/hosts/FQHN/yum.conf.include
-(TODO Fix link)
-from the infrastructure repo. If you need to make such changes, add them
-to the infrastructure repo before doing updates.
+=== Special package management directives
+
+Sometimes we need to exclude something from being updated.
+This can be done with the package_excludes variable. Set
+it and the playbooks doing updates will exclude the listed items.
+
+This variable is set in ansible/host_vars or ansible/group_vars
+for the host or group.
 
 == Update Leader
 
@@ -178,168 +164,29 @@ come back up from reboot
 or aren't working right after reboot. It's
 important to avoid multiple people operating on a single machine in a
 read-write manner and interfering with changes.
 
-== Group A reboots
-
-Group A machines are end user critical ones. Outages here should be
-planned at least a week in advance and announced to the announce list.
- -List of machines currently in A group (note: this is going to be -automated) - -These hosts are grouped based on the virt host they reside on: - -* torrent02.fedoraproject.org -* ibiblio02.fedoraproject.org -* people03.fedoraproject.org -* ibiblio03.fedoraproject.org -* collab01.fedoraproject.org -* serverbeach09.fedoraproject.org -* db05.iad2.fedoraproject.org -* virthost03.iad2.fedoraproject.org -* db01.iad2.fedoraproject.org -* virthost04.iad2.fedoraproject.org -* db-fas01.iad2.fedoraproject.org -* proxy01.iad2.fedoraproject.org -* virthost05.iad2.fedoraproject.org -* ask01.iad2.fedoraproject.org -* virthost06.iad2.fedoraproject.org - -These are the rest: - -* bapp02.iad2.fedoraproject.org -* bastion02.iad2.fedoraproject.org -* app05.fedoraproject.org -* backup02.fedoraproject.org -* bastion01.iad2.fedoraproject.org -* fas01.iad2.fedoraproject.org -* fas02.iad2.fedoraproject.org -* log02.iad2.fedoraproject.org -* memcached03.iad2.fedoraproject.org -* noc01.iad2.fedoraproject.org -* ns02.fedoraproject.org -* ns04.iad2.fedoraproject.org -* proxy04.fedoraproject.org -* smtp-mm03.fedoraproject.org -* batcave02.iad2.fedoraproject.org -* mm3test.fedoraproject.org -* packages02.iad2.fedoraproject.org - -=== Group B reboots - -This Group contains machines that contributors use. Announcements of -outages here should be at least a week in advance and sent to the -devel-announce list. - -These hosts are grouped based on the virt host they reside on: - -* db04.iad2.fedoraproject.org -* bvirthost01.iad2.fedoraproject.org -* nfs01.iad2.fedoraproject.org -* bvirthost02.iad2.fedoraproject.org -* pkgs01.iad2.fedoraproject.org -* bvirthost03.iad2.fedoraproject.org -* kojipkgs02.iad2.fedoraproject.org -* bvirthost04.iad2.fedoraproject.org - -These are the rest: - -* koji04.iad2.fedoraproject.org -* releng03.iad2.fedoraproject.org -* releng04.iad2.fedoraproject.org - -=== Group C reboots - -Group C are machines that infrastructure uses, or can be rebooted in -such a way as to continue to provide services to others via multiple -machines. Outages here should be announced on the infrastructure list. 
-
-Group C hosts that have proxy servers on them:
-
-* proxy02.fedoraproject.org
-* ns05.fedoraproject.org
-* hosted-lists01.fedoraproject.org
-* internetx01.fedoraproject.org
-* app01.dev.fedoraproject.org
-* darkserver01.dev.fedoraproject.org
-* fakefas01.fedoraproject.org
-* proxy06.fedoraproject.org
-* osuosl01.fedoraproject.org
-* proxy07.fedoraproject.org
-* bodhost01.fedoraproject.org
-* proxy03.fedoraproject.org
-* smtp-mm02.fedoraproject.org
-* tummy01.fedoraproject.org
-* app06.fedoraproject.org
-* noc02.fedoraproject.org
-* proxy05.fedoraproject.org
-* smtp-mm01.fedoraproject.org
-* telia01.fedoraproject.org
-* app08.fedoraproject.org
-* proxy08.fedoraproject.org
-* coloamer01.fedoraproject.org
-
-Other Group C hosts:
-
-* ask01.stg.iad2.fedoraproject.org
-* app02.stg.iad2.fedoraproject.org
-* proxy01.stg.iad2.fedoraproject.org
-* releng01.stg.iad2.fedoraproject.org
-* value01.stg.iad2.fedoraproject.org
-* virthost13.iad2.fedoraproject.org
-* db-fas01.stg.iad2.fedoraproject.org
-* pkgs01.stg.iad2.fedoraproject.org
-* packages01.stg.iad2.fedoraproject.org
-* virthost11.iad2.fedoraproject.org
-* app01.stg.iad2.fedoraproject.org
-* koji01.stg.iad2.fedoraproject.org
-* db02.stg.iad2.fedoraproject.org
-* fas01.stg.iad2.fedoraproject.org
-* virthost10.iad2.fedoraproject.org
-* autoqa01.qa.fedoraproject.org
-* autoqa-stg01.qa.fedoraproject.org
-* bastion-comm01.qa.fedoraproject.org
-* batcave-comm01.qa.fedoraproject.org
-* virthost-comm01.qa.fedoraproject.org
-* compose-x86-01.iad2.fedoraproject.org
-* compose-x86-02.iad2.fedoraproject.org
-* download01.iad2.fedoraproject.org
-* download02.iad2.fedoraproject.org
-* download03.iad2.fedoraproject.org
-* download04.iad2.fedoraproject.org
-* download05.iad2.fedoraproject.org
-* download-rdu01.vpn.fedoraproject.org
-* download-rdu02.vpn.fedoraproject.org
-* download-rdu03.vpn.fedoraproject.org
-* fas03.iad2.fedoraproject.org
-* secondary01.iad2.fedoraproject.org
-* memcached04.iad2.fedoraproject.org
-* virthost01.iad2.fedoraproject.org
-* app02.iad2.fedoraproject.org
-* value03.iad2.fedoraproject.org
-* virthost07.iad2.fedoraproject.org
-* app03.iad2.fedoraproject.org
-* value04.iad2.fedoraproject.org
-* ns03.iad2.fedoraproject.org
-* darkserver01.iad2.fedoraproject.org
-* virthost08.iad2.fedoraproject.org
-* app04.iad2.fedoraproject.org
-* packages02.iad2.fedoraproject.org
-* virthost09.iad2.fedoraproject.org
-* hosted03.fedoraproject.org
-* serverbeach06.fedoraproject.org
-* hosted04.fedoraproject.org
-* serverbeach07.fedoraproject.org
-* collab02.fedoraproject.org
-* serverbeach08.fedoraproject.org
-* dhcp01.iad2.fedoraproject.org
-* relepel01.iad2.fedoraproject.org
-* sign-bridge02.iad2.fedoraproject.org
-* koji03.iad2.fedoraproject.org
-* bvirthost05.iad2.fedoraproject.org
-* (disable each builder in turn, update and reenable).
-* ppc11.iad2.fedoraproject.org
-* ppc12.iad2.fedoraproject.org
-* backup03
+Usually for a mass update/reboot there will be a hackmd or similar
+document that tracks which machines have already been rebooted
+and who is working on which one. Please check with the leader
+for a link to this document.
+
+== Updates and Reboots via playbook
+
+There are several playbooks related to this task:
+
+* vhost_update.yml applies updates to a vmhost and all its guests
+* vhost_reboot.yml shuts down the VMs and reboots a vmhost
+* vhost_update_reboot.yml does both of the above
+
+For hosts that are outside of an outage you probably want to use these to
+make sure updates are applied before reboots. Once updates are applied
+globally before the outage, you will want to just use the reboot playbook,
+for example as sketched below.
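+
+As a rough sketch (the playbook path and the name of the target variable
+are assumptions here; check the playbook headers on batcave01 for the
+exact usage):
+
+....
+# update and reboot a single virthost and its guests
+sudo ansible-playbook playbooks/vhost_update_reboot.yml \
+    -e target=virthost01.iad2.fedoraproject.org
+....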
+
+Additionally there are two more playbooks to check things:
+
+* check-for-nonvirt-updates.yml
+* check-for-updates.yml
+
+See those playbooks for more information, but basically they allow
+you to see how many updates are pending on all the virthosts/bare
+metal machines and/or on all machines. This is good to run at the end
+of outages to confirm that everything was updated.
 
 == Doing the upgrade
 
@@ -348,55 +195,21 @@ If possible, system upgrades should be done in advance of the reboot
 make sure that the Infrastructure RHEL repo is updated as necessary to
 pull in the new packages (xref:infra-repo.adoc[Infrastructure Yum Repo
 SOP])
 
-On batcave01, as root run:
-
-....
-func-yum [--host=hostname] update
-....
-
-..note: --host can be specified multiple times and takes wildcards.
-
-pinging people as necessary if you are unsure about any packages.
-
-Additionally you can see which machines still need rebooted with:
-
-....
-sudo func-command --timeout=10 --oneline /usr/local/bin/needs-reboot.py | grep yes
-....
-
-You can also see which machines would need a reboot if updates were all
-applied with:
-
-....
-sudo func-command --timeout=10 --oneline /usr/local/bin/needs-reboot.py after-updates | grep yes
-....
-
-== Doing the reboot
-
-In the order determined above, reboots will usually be grouped by the
-virtualization hosts that the servers are on. You can see the guests per
-virt host on batcave01 in `/var/log/virthost-lists.out`
-
-To reboot sets of boxes based on which virthost they are we've written a
-special script which facilitates it:
-
-....
-func-vhost-reboot virthost-fqdn
-....
-
-ex:
-
-....
-sudo func-vhost-reboot virthost13.iad2.fedoraproject.org
-....
+Before the outage, ansible can be used to apply all updates to hosts, or
+to apply all updates to the staging hosts before those are done.
+Something like:
+
+....
+ansible hostlist -m shell -a 'yum clean all; yum update -y; rkhunter --propupd'
+....
 
 == Aftermath
 
 [arabic]
 . Make sure that everything's running fine
-. Reenable nagios notification as needed
+. Check nagios for alerts and clear them all.
+. Reenable nagios notifications after they are cleared.
 . Make sure to perform any manual post-boot setup (such as entering
 passphrases for encrypted volumes)
+. Consider running check-for-updates or check-for-nonvirt-updates to confirm
+  that all hosts are updated.
+. Close the fedorastatus outage.
 . Close outage ticket.
 
 === Non virthost reboots
 
@@ -405,8 +218,5 @@ If you need to reboot specific hosts and make sure they recover -
 consider using:
 
 ....
-sudo func-host-reboot hostname hostname1 hostname2 ...
+sudo ansible -m reboot hostname
 ....
-
-If you want to reboot the hosts one at a time waiting for each to come
-back before rebooting the next pass a `-o` to `func-host-reboot`.
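+
+If you want to reboot hosts one at a time, waiting for each to come back
+before moving on to the next, limiting ansible to a single fork should do
+it (the host names here are placeholders):
+
+....
+# -f 1 runs one host at a time; the reboot module waits for each host
+# to come back up before ansible moves on to the next
+sudo ansible 'hostname1:hostname2' -f 1 -m reboot
+....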