The DRAC firmware on all Dell servers needs to be updated to the latest version. This task does not reboot the actual server but only its management controller.
Login to Management Interface:
- Access the management web interface of each server.

Download DRAC Firmware:
- Retrieve the service tag for each server.
- Visit https://dell.com/support/ and search for the DRAC firmware corresponding to each service tag.
- Download the firmware file.

Upload Firmware:
- Upload the downloaded firmware file to each server's management interface.

Schedule Downtime in Nagios:
- Schedule appropriate downtime to prevent alert notifications during the update.

Update Firmware:
- Initiate the firmware update, allowing the management controller to reboot and apply the update.

Centralize Firmware Files:
- Store all required firmware files on the central server (batcave) under /srv/web/infra/fw/dell/.

Automate with Ansible:
- Explore using the Ansible collection for Dell OpenManage to automate the firmware updates and other DRAC configurations (see the sketch after this list).
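If we go the Ansible route, a minimal sketch of the first steps, assuming the collection meant above is `dellemc.openmanage` from Ansible Galaxy (inventory, credentials and the exact module to use would still need to be worked out):

```
# Install the Dell OpenManage collection from Ansible Galaxy
# (assumption: this is the collection referred to above).
ansible-galaxy collection install dellemc.openmanage

# Read the docs for its iDRAC firmware module before wiring it into a playbook;
# it can apply firmware against an iDRAC from a catalog/share.
ansible-doc dellemc.openmanage.idrac_firmware
```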
Metadata Update from @zlopez: - Issue priority set to: Waiting on Assignee (was: Needs Review) - Issue tagged with: high-gain, medium-trouble, ops
Here are the hostnames of all the Dell servers we currently have (hopefully I didn't forget any):
There are also a few machines for COPR and I'm not sure if we should do it or just let them know about this. And a few machines that don't have a hostname, so I'm not sure if they are actually used anywhere.
P.S.: I didn't know that there is a web interface for managing those machines.
Additionally, we have a firmware update for the BMC in the older eMAG ARM devices that we should apply.
I am thinking of taking this up. I suspect these are all on the intranet and would necessitate a VPN.
Metadata Update from @t0xic0der: - Issue assigned to t0xic0der
So, we don't have a howto/SOP for this, but there are mentions in:
modules/sysadmin_guide/pages/failedharddrive.adoc
modules/howtos/pages/restart_datacenter_server.adoc
modules/sysadmin_guide/pages/hardware_troubleshooting_power.adoc
Basically, in IAD2 all our management interfaces are on the mgmt.iad2.fedoraproject.org network (10.3.160.0/24). They can be reached directly from the RH VPN, or via noc01/batcave01 if you want to tunnel through those.
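For the tunneling route, a minimal sketch (the management hostname below is a placeholder, and the jump host is whichever of noc01/batcave01 you already have SSH access to):

```
# Forward the DRAC web UI of one management interface to a local port via batcave01.
# "example-mgmt.mgmt.iad2.fedoraproject.org" is a placeholder hostname.
ssh -L 8443:example-mgmt.mgmt.iad2.fedoraproject.org:443 batcave01
# Then open https://localhost:8443/ in a local browser to reach that DRAC.
```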
In RDU they are on an internal mgmt network that's only reachable via noc-cc01.fedoraproject.org. You have to tunnel via this VM to get to them, and they are all on 172.21.2.0/24. There is no DNS zone for this, but we should probably standardize them and set one up someday.
Once you can reach a device, you need an admin password to log in to them. We have separate ones for staging and production.
Once you log in, the process is basically:
I would suggest starting with staging hosts to get the hang of things first.
Happy to walk through one as an example.
I tried to update the rest of the servers, but encountered some issues. Here is the list:
RED008 - Unable to extract payloads from Update Package.
Other machines on the list are up to date :-)
So, many of these are not in IAD2, so accessing them is different. Some are not correctly named on the list, and some really are not reachable.
ibiblio02.fedoraproject.org - This is ibiblio02-mgmt.fedoraproject.org in DNS. You cannot reach it normally; you have to tunnel https through ibiblio05 (below).
ibiblio05.fedoraproject.org - This one really is unreachable. I have been working with folks there to try and fix it, without much luck.
openqa-x86-worker02.mgmt.iad2.fedoraproject.org - This one is so old it's using the old SHA1 firmware and cannot read the new SHA256 ones. Probably there is one in the middle somewhere you could upgrade to that would then let you upgrade to the latest. However, I don't think it's worth it.
osuosl02.fedoraproject.org - This one requires access to the mgmt network there at OSUOSL, which requires an OpenVPN connection. I could try and add you, or I can just do the update there?
virthost-cc-rdu02.fedoraproject.org - This is the old name of vmhost-x86-cc02... it no longer exists under this name.
vmhost-x86-07.mgmt.iad2.fedoraproject.org - This host was retired/no longer exists.
vmhost-x86-cc01.rdu-cc.fedoraproject.org - Unreachable
vmhost-x86-cc02.rdu-cc.fedoraproject.org - Unreachable
vmhost-x86-cc03.rdu-cc.fedoraproject.org - Unreachable
vmhost-x86-cc05.rdu-cc.fedoraproject.org - Unreachable
vmhost-x86-cc06.rdu-cc.fedoraproject.org - Unreachable
All of these can be reached by tunneling through noc-cc01. There is a docs/rdu-networks.txt file that lists the 172.21.x.x IPs for each.
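As a sketch, tunneling to one of those looks like this (the IP below is a placeholder; take the real one for a given host from docs/rdu-networks.txt):

```
# Forward the management web UI of an RDU host to a local port via noc-cc01.
# 172.21.2.10 is a placeholder address; look up the real one in docs/rdu-networks.txt.
ssh -L 8443:172.21.2.10:443 noc-cc01.fedoraproject.org
# The DRAC/BMC web UI is then available locally at https://localhost:8443/.
```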
These don't have '.ocp' in the name. They should just be 'worker02.mgmt.iad2.fedoraproject.org', 'worker04-stg.mgmt.iad2.fedoraproject.org', etc.
worker02.ocp.mgmt.iad2.fedoraproject.org - Unreachable
worker04-stg.ocp.mgmt.iad2.fedoraproject.org - Unreachable
worker04.ocp.mgmt.iad2.fedoraproject.org - Unreachable
worker05.ocp.mgmt.iad2.fedoraproject.org - Unreachable
worker06.ocp.mgmt.iad2.fedoraproject.org - Unreachable
I will leave the osuosl02 to you.
I will try to update the workers and rdu-cc machines. Will see how that goes.
I finished another round of updates. I was able to update all the worker machines and a few of the rdu-cc machines. I still have issues with two of them:
osuosl02 is up to date (it's a new machine).
The vmhost-x86-cc05/06 machines are going to be replaced next year, so I'm not sure how much we should worry about them.
I fixed the password on 06, but it ran into the 'so old it's using SHA1' problem.
So, I think we can close this now?
Sounds good.
We can always reopen this one now that we have an SOP and a spreadsheet associated with this.
Metadata Update from @t0xic0der: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)