PR#296: Adds SOP for Replacing Failed Hard Drives from machines - infra-docs-fpo

Merged a month ago by jnsamyak. Opened a month ago by jnsamyak.

jnsamyak/infra-docs-fpo sop_failed_drive into master

Adds SOP for Replacing Failed Hard Drives from machines

Samyak Jain • a month ago

71d0af5

modules/ROOT/nav.adoc

file modified

		`@@ -33,6 +33,7 @@`
		`** xref:sysadmin_guide:index.adoc[Sysadmin Guide]`
		`*** xref:sysadmin_guide:orientation.adoc[Orientation for Sysadmin Guide]`
		`*** xref:sysadmin_guide:index.adoc#_standard_operating_procedures[Standard Operation Procedures]`
		`+ *** xref:sysadmin_guide:failedharddrive.adoc[Replacing Failed Hard Drives]`
		`*** xref:sysadmin_guide:index.adoc#_howtos[HOWTOs]`
		`* xref:release_guide:index.adoc[Release Engineering]`
		`** xref:release_guide:release_process.adoc[Release process]`

modules/sysadmin_guide/pages/failedharddrive.adoc

file added

+88

		`@@ -0,0 +1,88 @@`
		`+`
		`+ = Replacing a Failed Hard Drive`
		`+ :page-description: Steps for replacing a failed drive on a Fedora infrastructure server.`
		`+ :page-aliases: replacing-failed-drive.adoc`
		`+`
		`+ == Overview`
		`+`
		`+ This document provides a step-by-step procedure for replacing a failed hard drive on a Fedora infrastructure server. It includes access requirements, necessary tools, and the process for initiating and completing the drive replacement.`
		`+`
		`+ == Contact Information`
		`+`
		`+ Owner::`
		`+ Fedora Infrastructure Team`
		`+ Contact::`
		`+ #fedora-admin, sysadmin-main`
		`+ Purpose::`
		`+ Describe how to replace a failed hard drive on a Fedora infrastructure server`
		`+`
		`+ == Access Level`
		`+`
		`+ You may need sysadmin-main access to perform this procedure. In the future, access details might be shared with a dedicated assignee or stored in a smaller vault. For now, ask the sysadmin-main team for the information you need.`
		`+`
		`+ == Requirements`
		`+`
		`+ * Red Hat VPN Access - Needed for SSH access to the machine.`
		`+ * Bitwarden Vault Access - Access to the vault is under discussion. For now, consult the sysadmin-main team for the login credentials.`
		`+`
		`+ == Process`
		`+`
		`+ .First, access the management console:`
		`+ . Ensure you are connected to the official Red Hat VPN.`
		+ . Identify the server in question. For this SOP, we will use `bvmhost-x86-01.stg.iad2.fedoraproject.org` as an example.
		+ . The management console has its own hostname, which adds `.mgmt` to the server name. For this example it is `bvmhost-x86-01-stg.mgmt.iad2.fedoraproject.org` (note that `.stg` becomes `-stg`).
		+ . Obtain the IP address by pinging the server from `batcave01`:
		`+ +`
		`+ [source,bash]`
		`+ ----`
		`+ ssh batcave01.iad2.fedoraproject.org`
		`+ ping bvmhost-x86-01-stg.mgmt.iad2.fedoraproject.org`
		`+ ----`
		`+`
		`+ . Visit the IP address in a web browser. The management console uses HTTPS, so accept the self-signed certificate:`
		`+ +`
		`+ [source]`
		`+ ----`
		`+ https://<IP_ADDRESS>`
		`+ ----`
		`+`
		+ . Log in using the credentials found in the `admin-stg` entry in Bitwarden.
		`+`
		`+ . Navigate to the overview page and note the serial number/service tag of the machine.`
		`+`
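The console-access steps above can be sketched as a few small shell functions. This is only an illustration, not part of the SOP: the function names `mgmt_hostname`, `mgmt_ip`, and `console_url` are hypothetical, the naming rule is inferred from the single example hostname in this document and may not hold for every machine, and `getent hosts` is swapped in for `ping` as one way to look up the address.

```shell
# Hypothetical helpers, inferred from the one example in this SOP:
#   bvmhost-x86-01.stg.iad2.fedoraproject.org
#   -> bvmhost-x86-01-stg.mgmt.iad2.fedoraproject.org
# ASSUMPTION: other hosts follow the same pattern. Verify before relying on it.

# Derive the management hostname from a fully qualified server hostname.
# The ( ... ) body runs in a subshell, so the variables stay local.
mgmt_hostname() (
    host="$1"
    short="${host%%.*}"   # part before the first dot, e.g. bvmhost-x86-01
    rest="${host#*.}"     # part after the first dot, e.g. stg.iad2.fedoraproject.org
    case "$rest" in
        stg.*)
            # Staging hosts: ".stg" becomes a "-stg" suffix on the short name.
            short="${short}-stg"
            rest="${rest#stg.}"
            ;;
    esac
    printf '%s.mgmt.%s\n' "$short" "$rest"
)

# Look up the first address for a hostname (an alternative to reading it
# from ping output). Intended to be run on batcave01.
mgmt_ip() {
    getent hosts "$1" | awk '{ print $1; exit }'
}

# Build the URL to open in a browser from an IP address.
console_url() {
    printf 'https://%s\n' "$1"
}
```

On `batcave01`, something like `console_url "$(mgmt_ip "$(mgmt_hostname bvmhost-x86-01.stg.iad2.fedoraproject.org)")"` would then print the address to visit, assuming the management name resolves from that host.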
		`+ === Identify the Failed Drive`
		`+`
		`+ . Navigate to the storage menu to identify the failed drive. Warnings about failing/failed drives will be indicated here.`
		`+ . Note the failed drive's details (e.g., drive 4).`
		`+ . Create a failed-drive report by exporting the failed drive's information.`
		`+`
		`+ === Create a Support Ticket`
		`+`
		`+ . In the management console, click on the support link in the top right corner.`
		`+ . Follow these steps to contact technical support:`
		`+ .. Go to the top left search bar and select "Support > Contact Technical Support".`
		`+ .. Search for the device using the service tag from the overview page.`
		`+ .. Select "HardDrive and RAID Controller" from the drop-down menu.`
		`+ .. Choose one of the support options:`
		`+ ... Call: 24/7`
		`+ ... Live Chat: 7 am - 9 pm CDT, Monday - Friday`
		`+ ... Social Connect`
		`+`
		`+ . In the live chat, provide the failed-drive report. Once support verifies and confirms the failure, they will send an email with the replacement details.`
		`+ . If live chat is unsuccessful, call support at 1-866-362-5350 (available 24/7).`
		`+`
		`+ === Follow-Up with the Support Ticket`
		`+`
		`+ . Once the support ticket is created, the assignee will receive a form via email.`
		`+ . Forward this form to Patrick Cole (pcole@redhat.com) along with the machine's serial number and location.`
		`+ +`
		`+ [NOTE]`
		`+ ====`
		`+ At this point, Patrick Cole will handle the coordination with Dell for the drive replacement. This avoids adding unnecessary intermediaries.`
		`+ ====`
		`+`
		`+ Patrick's coordination with Dell includes arranging access for the technician if needed.`
		`+`
		`+ == Conclusion`
		`+`
		`+ Following this SOP ensures a systematic approach to replacing failed drives, minimizing downtime and maintaining system integrity. Always reach out to the sysadmin-main team for any clarifications or additional support.`
		`\ No newline at end of file`