#296 Adds SOP for Replacing Failed Hard Drives from machines
Merged a month ago by jnsamyak. Opened a month ago by jnsamyak.
jnsamyak/infra-docs-fpo sop_failed_drive  into  master

file modified
+1
@@ -33,6 +33,7 @@ 

  ** xref:sysadmin_guide:index.adoc[Sysadmin Guide]

  *** xref:sysadmin_guide:orientation.adoc[Orientation for Sysadmin Guide]

  *** xref:sysadmin_guide:index.adoc#_standard_operating_procedures[Standard Operation Procedures]

+ *** xref:sysadmin_guide:failedharddrive.adoc[Replacing Failed Hard Drives]

  *** xref:sysadmin_guide:index.adoc#_howtos[HOWTOs]

  * xref:release_guide:index.adoc[Release Engineering]

  ** xref:release_guide:release_process.adoc[Release process]

@@ -0,0 +1,88 @@ 

+ 

+ = Replacing a Failed Hard Drive

+ :page-description: Steps for replacing a failed drive on a Fedora infrastructure server.

+ :page-aliases: replacing-failed-drive.adoc

+ 

+ == Overview

+ 

+ This document provides a step-by-step procedure for replacing a failed hard drive on a Fedora infrastructure server. It includes access requirements, necessary tools, and the process for initiating and completing the drive replacement.

+ 

+ == Contact Information

+ 

+ Owner::

+   Fedora Infrastructure Team

+ Contact::

+   #fedora-admin, sysadmin-main

+ Purpose::

+   Describe how to replace a failed hard drive on a Fedora infrastructure server

+ 

+ == Access Level

+ 

+ This procedure may require sysadmin-main access. In the future, access details might be shared with a dedicated assignee or stored in a smaller vault. For now, ask the sysadmin-main team for the information you need.

+ 

+ == Requirements

+ 

+ * Red Hat VPN Access - Needed for SSH access to the machine.

+ * Bitwarden Vault Access - Access to the vault is under discussion. For now, consult the sysadmin-main team for the login credentials.

+ 

+ == Process

+ 

+ .First, access the management console:

+ . Ensure you are connected to the official Red Hat VPN.

+ . Identify the server in question. For this SOP, we will use `bvmhost-x86-01.stg.iad2.fedoraproject.org` as an example.

+ . To reach the management console, use the server's management hostname. In this example, `.stg` becomes `-stg` and `.mgmt` is added before the datacenter domain: `bvmhost-x86-01-stg.mgmt.iad2.fedoraproject.org`.

+ . Obtain the IP address by pinging the server from `batcave01`:

+ +

+ [source,bash]

+ ----

+ ssh batcave01.iad2.fedoraproject.org

+ ping bvmhost-x86-01-stg.mgmt.iad2.fedoraproject.org

+ ----

+ 

+ . Open the IP address in a web browser. The management console uses HTTPS with a self-signed certificate, so accept the certificate warning:

+ +

+ [source]

+ ----

+ https://<IP_ADDRESS>

+ ----

+ 

+ . Log in using the credentials found in the `admin-stg` entry in Bitwarden.

+ 

+ . Navigate to the overview page and note the serial number/service tag of the machine.
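The hostname and IP lookup steps above can be sketched as a small shell helper. This is only a sketch: `mgmt_hostname` and `resolve_ip` are hypothetical names that are not part of the SOP, and the rewrite rule (`.stg.` becomes `-stg.mgmt.`) is inferred from the single example hostname, so it may not hold for non-staging machines.

```shell
# Hypothetical helper: derive the management hostname from a server hostname,
# following the pattern of the SOP's one example:
#   bvmhost-x86-01.stg.iad2.fedoraproject.org
#   -> bvmhost-x86-01-stg.mgmt.iad2.fedoraproject.org
# Assumption: only staging (".stg.") hosts are handled.
mgmt_hostname() {
    printf '%s\n' "$1" | sed 's/\.stg\./-stg.mgmt./'
}

# Hypothetical helper: resolve a hostname through the system resolver
# and print only the first IP address. Intended to be run on batcave01.
resolve_ip() {
    getent hosts "$1" | awk '{ print $1; exit }'
}

mgmt="$(mgmt_hostname "bvmhost-x86-01.stg.iad2.fedoraproject.org")"
echo "Management hostname: ${mgmt}"
# prints: Management hostname: bvmhost-x86-01-stg.mgmt.iad2.fedoraproject.org
```

On `batcave01`, `resolve_ip "$mgmt"` should print the address to open in the browser. It serves the same purpose as the `ping` step but sends no ICMP traffic.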

+ 

+ === Identify the Failed Drive

+ 

+ . Navigate to the storage menu to identify the failed drive. Warnings about failing or failed drives are shown there.

+ . Note the failed drive's details (e.g., drive 4).

+ . Create a failed drive report by exporting the failed drive's information.

+ 

+ === Create a Support Ticket

+ 

+ . In the management console, click on the support link in the top right corner.

+ . Follow these steps to contact technical support:

+ .. Go to the top left search bar and select "Support > Contact Technical Support".

+ .. Search for the device using the service tag from the overview page.

+ .. Select "HardDrive and RAID Controller" from the drop-down menu.

+ .. Choose one of the support options:

+ ... Call: 24/7

+ ... Live Chat: 7 am - 9 pm CDT, Monday - Friday

+ ... Social Connect

+ 

+ . In the live chat, provide the failed drive report. Once support verifies and confirms the failure, they will send an email with the replacement details.

+ . If live chat is unsuccessful, call support at 1-866-362-5350 (available 24/7).
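The support hours listed above can be checked with a short shell sketch before choosing a channel. The function name `support_channel` is hypothetical, and using the `America/Chicago` time zone to stand in for the "CDT" hours is an assumption.

```shell
# Hypothetical helper: pick a support channel from the weekday (1=Mon..7=Sun)
# and the hour (0-23), both in US Central time.
# Live chat is listed as 7 am - 9 pm, Monday - Friday; phone support is 24/7.
support_channel() {
    if [ "$1" -le 5 ] && [ "$2" -ge 7 ] && [ "$2" -lt 21 ]; then
        echo "chat"
    else
        echo "call"
    fi
}

# Assumption: America/Chicago approximates the "CDT" hours in the SOP.
dow="$(TZ=America/Chicago date +%u)"
hour="$(TZ=America/Chicago date +%H)"
hour="${hour#0}"   # strip one leading zero so "09" is compared as 9
support_channel "$dow" "$hour"
```

The script prints `chat` when live chat should be open and `call` otherwise.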

+ 

+ === Follow-Up with the Support Ticket

+ 

+ . Once the support ticket is created, the assignee will receive a form via email.

+ . Forward this form to Patrick Cole (pcole@redhat.com) along with the machine's serial number and location.

+ +

+ [NOTE]

+ ====

+ At this point, Patrick Cole will handle the coordination with Dell for the drive replacement. This avoids adding unnecessary intermediaries.

+ ====

+ 

+ Patrick's coordination with Dell includes arranging access for the technician if needed.

+ 

+ == Conclusion

+ 

+ Following this SOP ensures a systematic approach to replacing failed drives, minimizing downtime and maintaining system integrity. Always reach out to the sysadmin-main team for any clarifications or additional support. 

\ No newline at end of file

rebased onto 32f3ccdcc62f71468dc89445ae585c917a173e02

a month ago

rebased onto 71d0af5

a month ago

:thumbsup: looks good!

Thanks, merging it now, we can always enhance the SOP later :D

Pull-Request has been merged by jnsamyak

a month ago