#818 Investigate hardware/software issue on critical infra/mirror node
Closed: Fixed 2 years ago by arrfab. Opened 2 years ago by arrfab.

Related to #812 and #814
The server is now removed from the pool, but we have to investigate the underlying issue (a recurring [230893.029726] blk_update_request: I/O error, dev vdb, sector 0) on that virtual machine before we can put it back in action.
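For reference, a minimal sketch of the kind of checks that usually help narrow down such an I/O error, assuming the guest is KVM-backed with a qcow2 volume behind vdb (the domain name and image path below are illustrative, not the real ones):

```bash
# Inside the guest: confirm the error pattern and which device reports it
dmesg -T | grep -iE 'blk_update_request|I/O error'

# On the hypervisor: map vdb to its backing image and check it for corruption
# (domain name and image path are illustrative)
virsh domblklist mirror-node01
qemu-img check /var/lib/libvirt/images/mirror-node01-vdb.qcow2

# Also look for host-side disk/controller errors that would explain guest I/O errors
journalctl -k | grep -iE 'i/o error|ata|blk'
```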
Following on from this, we should sync with Fedora Infra so that they use a DNS name instead of an IP address for their rsync operations (for internal EPEL mirrors and also mirrormanager); that way we can redirect them elsewhere with a simple DNS push, as is already the case for the rest of the infra.
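As a purely hypothetical illustration of the idea (the names below are made up, not actual CentOS or Fedora infra records): the Fedora rsync jobs would target a stable alias owned by centos-infra, so moving the traffic later is a one-record change plus a zone push:

```bash
# Hypothetical zone data: the alias Fedora's rsync jobs point at...
#   msync.centos.example.   IN CNAME  mirror-node01.centos.example.
# ...and redirecting them later is just swapping the target and pushing the zone:
#   msync.centos.example.   IN CNAME  mirror-node02.centos.example.

# Fedora side always uses the name, never an IP (module/path are illustrative):
rsync -avH --delete rsync://msync.centos.example/centos/ /srv/mirror/centos/
```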


Metadata Update from @arrfab:
- Issue assigned to arrfab
- Issue priority set to: 🔥 Urgent 🔥 (was: Needs Review)
- Issue tagged with: centos-common-infra, high-gain, high-trouble

2 years ago

Found a strange combination of multiple factors, but fixing the issue would have taken longer than just recreating the VM (as we have full ansible automation), and since the node was already removed from the pool (nobody was using it), I decided to simply delete the VM and start from a fresh qcow2 image/filesystem.
Current status: the node is reinstalled, under ansible control, and getting mirror content back.
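For context, a rough sketch of what "delete the VM and start from a fresh qcow2 image" can look like on a KVM host (VM name, image paths, and sizes are illustrative; in practice the rebuild is driven by the existing ansible automation):

```bash
# Drop the broken guest together with its storage (VM name is illustrative)
virsh destroy mirror-node01 || true
virsh undefine mirror-node01 --remove-all-storage

# Create a fresh disk backed by a pristine cloud image and import the guest
qemu-img create -f qcow2 -F qcow2 \
  -b /var/lib/libvirt/images/CentOS-Stream-base.qcow2 \
  /var/lib/libvirt/images/mirror-node01.qcow2 200G

virt-install --name mirror-node01 --memory 8192 --vcpus 4 \
  --disk /var/lib/libvirt/images/mirror-node01.qcow2 \
  --import --os-variant centos-stream9 --noautoconsole

# From here, ansible takes over configuration and the mirror content syncs back
```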
Once fully validated, it will be added back, and I'll reach out to @adrian and @kevin to ensure that, on the Fedora infra side, it points to a CNAME for a record we (centos-infra) can quickly update to transparently redirect traffic elsewhere, should such an issue appear again.

Status update: the node was fully reinstalled and is under ansible control, but it just takes time to rsync the data down. So I'm keeping this ticket open as a reminder to add it back once it has the full content (the node is temporarily removed from ansible automation through /etc/no-ansible so a local rsync job can run in parallel to speed things up).
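A minimal sketch of that workflow, assuming /etc/no-ansible is the flag file the playbooks check before touching a host (the rsync source module and destination path are illustrative):

```bash
# Pause configuration management on this node while the manual sync runs
touch /etc/no-ansible

# Pull the mirror content down in a local job running in parallel
rsync -avSH --delete --partial \
  rsync://msync.centos.example/centos/ /home/mirror/centos/ &

# Once the tree is complete, hand the node back to ansible
# rm /etc/no-ansible
```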

The node is fully back online with all data and in sync with the primary reference mirror. The DNS record has been updated, so we again have redundancy/workload balancing for the mirror CDN (both mirror.centos.org and mirror.stream.centos.org).
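A quick way to sanity-check that redundancy, for example confirming that both public names resolve to the full set of backends again (just a hedged example, not the actual validation that was run):

```bash
dig +short A mirror.centos.org
dig +short A mirror.stream.centos.org
```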
Closing

Metadata Update from @arrfab:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 years ago
