#53 Extend storage space on storage02 node with new HDD
Closed: Fixed 3 years ago by arrfab. Opened 3 years ago by arrfab.

(task imported from JIRA CPE-517)

storage02.rdu2.centos.org has some HDDs waiting to be installed, but it needs to be done while the node is powered off, due to the way the node was racked (short cables and missing sliding rails).
We need to orchestrate this because:
* we need someone onsite (and an internal RH ticket)
* we need to power off the server
* so we need to shut down all services using iSCSI/NFS volumes from that Storinator box, including (but not limited to) the okd3, ocp and mbox OpenShift clusters

Also, due to covid-19 restrictions and the lack of on-site engineers, this task was created in March but is still waiting/blocked.

Importing it here to track the request and close the original JIRA board.


Metadata Update from @arrfab:
- Issue tagged with: centos-ci-infra

3 years ago

Metadata Update from @arrfab:
- Issue tagged with: groomed

3 years ago

Metadata Update from @arrfab:
- Issue tagged with: centos-common-infra

3 years ago

Status update: still waiting for an ack from the internal team in charge of the DC; once we have it we'll announce the downtime/maintenance window, and at the same time we'll kick off the disk layout reorg.

Commenting here so that other issues linked here will be able to follow. I had a quick look at the storage node exporting the NFS volumes for OpenShift. As it's a single node (Supermicro board) without any dedicated HBA with read/write cache, performance would of course not be that great (but we knew that in advance; there is nothing better to use for this at this stage).
The goal of the maintenance is to add more SATA drives (no SSDs) to the Linux software RAID array (md device), to add more spindles, and then kick off an mdadm --grow operation to redistribute data among all physical volumes in that VG.

Once done, that should speed up read/write operations.
I confirmed with iostat on the nodes that read/write IOPS aren't optimized, but that maintenance window will help with that.
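
For reference, this is roughly the kind of check used (the interval and device names below are just examples for that box, not the exact command history):

    # extended per-device stats (await, %util, r/s, w/s), refreshed every 5 seconds
    iostat -x 5
    # or focus on a single member disk and the md device (names are examples)
    iostat -x sdh md127 5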

But I also had a look at some OpenShift docs, and for some examples they clearly state that the no_wdelay option should be used.
From https://docs.openshift.com/container-platform/4.5/registry/configuring_registry_storage/configuring-registry-storage-baremetal.html#registry-configuring-storage-baremetal_configuring-registry-storage-baremetal :

If the storage type is NFS, you must enable the no_wdelay and root_squash mount options. For example:

So the exports for OpenShift are currently set up with "sync", which means that all concurrent "write" IOPS go to the NFS server, which has a slow IO subsystem (remember: just cheap SATA disks and no read/write cache to speed IOPS up a lot), and the kernel waits for all writes to be fully acknowledged (top to bottom in the stack) before returning "ok" to the NFS client (OpenShift nodes and pods in our case).

The "no_wdelay" would solve and speed up this too . From man exports :

 no_wdelay
              This option has no effect if async is also set.  The NFS server will normally delay committing a write
              request  to  disc slightly if it suspects that another related write request may be in progress or may
              arrive soon.  This allows multiple write requests to be committed to disc with the one operation which
              can  improve  performance.   If an NFS server received mainly small unrelated requests, this behaviour
              could actually reduce performance, so no_wdelay is available to turn  it  off.   The  default  can  be
              explicitly requested with the wdelay option.
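
For the record, an export line with no_wdelay added would look roughly like this (the path and client network below are placeholders, not the real config, which lives in the ansible nfs-server role):

    # /etc/exports (placeholder path and client network)
    /exports/ocp-prod  172.22.0.0/24(rw,sync,no_wdelay,root_squash,no_subtree_check)
    # re-read the exports table without restarting nfs-server
    exportfs -ra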

So my recommendation for this storage node is (and that's what we'll implement once we have confirmation of the date on which the DC people can add the disks):

  • add the new disks (unfortunately this needs a shutdown, as there is no way to hot-swap HDDs in that box)
  • switch the NFS exports to no_wdelay in advance
  • kick off an mdadm --grow on a Friday evening, as it can take ~48h to reshape the array (rough sketch below)
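
For reference, the mdadm side of that plan would look roughly like the sketch below (the md device name, disk names and final --raid-devices count are placeholders):

    # add the new (empty) disks as spares to the existing array (names are placeholders)
    mdadm --add /dev/md127 /dev/sdi /dev/sdj /dev/sdk /dev/sdl
    # grow the array onto them; the reshape then runs in the background (~48h here)
    # (mdadm may ask for a --backup-file depending on the reshape)
    mdadm --grow /dev/md127 --raid-devices=12
    # follow the reshape progress
    cat /proc/mdstat
    # once the reshape is finished, let LVM see the extra space on the PV sitting on top
    pvresize /dev/md127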

From there, we can analyze kernel messages again (on an up-to-date kernel, as we'll also do a full update of that storage node) and test performance.

We got a date in agreement with the DC people: we'll do the maintenance on Sept 30th, 12:00 UTC.
We'll send a mail to announce this hardware maintenance.

Metadata Update from @arrfab:
- Issue assigned to arrfab

3 years ago

Creating some LVs in advance (through the ansible nfs-server role) so we can rsync from the old VG to the new VG/LVs (on top of the mdadm device). That will let us sync/import most of the data already, so that after the HDD expansion the OCP (including legacy) volumes will be served from the new (and reshaped) md/VG setup.
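
A minimal sketch of that pre-sync, with placeholder VG/LV names, size and mount points (the real ones are driven by the ansible nfs-server role):

    # create and format a new LV in the new VG (names/size are placeholders)
    lvcreate -L 500G -n pv-example vg_new
    mkfs.xfs /dev/vg_new/pv-example
    mkdir -p /mnt/new/pv-example
    mount /dev/vg_new/pv-example /mnt/new/pv-example
    # pre-sync the bulk of the data from the old VG now; the final rsync pass during
    # the maintenance window then only has to copy what changed in between
    rsync -aHAX --delete /srv/old/pv-example/ /mnt/new/pv-example/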

Just to give a status update here.
Yesterday we:

  • powered off all OpenShift clusters (legacy and OKD ones)
  • worked with the DC people to add the disks to the storage box (powered down)
  • due to XFS filesystem corruption on reboot (not affecting all LVs), we decided to reinstall the storage node with CentOS 8 (it was running 7) and restore the LVs/NFS exports that had no issue
  • xfs_repair was launched on the affected LVs (in the old VG) (rough sketch below)
  • once done and verified, we mounted and restored the NFS exports (through ansible)
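
The repair step on the affected LVs was along these lines (the LV name is a placeholder):

    # xfs_repair has to run against an unmounted filesystem (LV name is a placeholder)
    umount /dev/vg_old/pv-example 2>/dev/null || true
    xfs_repair /dev/vg_old/pv-example
    # then mount again and re-publish the NFS exports (normally done through ansible)
    mount /dev/vg_old/pv-example /srv/old/pv-example
    exportfs -ra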

Next on the list (to be able to close the ticket):

  • add the previous disks (cleared out) to the md device
  • kick off an md device reshape to expand the array across all available disks

Once done, we'll close this ticket, but we already heard on the ci-users list that some people have noticed a speed difference (fast!), and it should be even better next week (I'll kick off the md device reshape/expand over the weekend, as it also impacts IO during the block reorg/move).

We're seeing degraded NFS performance now in the other project which so far had been immune to this (coreos-ci). Export in question is nfs02.ci.centos.org:/exports/ocp-prod/pv-10gi-c7401bb0-4053-5307-9e3c-873580f0f23e.

I'm guessing this might be due to things changing in the backend though, just wanted to raise awareness of this.

@jlebon thanks for the report. We had the initial backup job ongoing before also going to step 2, the md device reshape (adding disks), and I'll kick that off later today; the goal was to have that operation run during the weekend, as there would normally be less activity on the disks, since an md device grow/reshape clearly has an IO impact .. that's the price to pay when using cheap SATA drives and no RAID controller ...
I'd like to see how that storage box reacts to normal workload next week, and whether that speeds things up or not, but I already see some kernel messages about disk errors:

[137450.865846] blk_update_request: I/O error, dev sdh, sector 9045517312 op 0x0:(READ) flags 0x80700 phys_seg 80 prio class 0
[137450.865852] sd 0:0:5:0: task abort: SUCCESS scmd(000000001afef560)
[137450.865860] sd 0:0:5:0: [sdh] tag#5761 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=12s
[137450.865873] sd 0:0:5:0: [sdh] tag#5815 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=12s
[137450.865878] sd 0:0:5:0: [sdh] tag#5815 CDB: Read(16) 88 00 00 00 00 02 1b 27 a2 80 00 00 01 80 00 00
[137450.865882] blk_update_request: I/O error, dev sdh, sector 9045516928 op 0x0:(READ) flags 0x80700 phys_seg 48 prio class 0
[137450.865954] sd 0:0:5:0: [sdh] tag#51 CDB: Read(16) 88 00 00 00 00 02 1e f6 34 c0 00 00 00 80 00 00
[137450.866044] sd 0:0:5:0: [sdh] tag#5761 CDB: Read(16) 88 00 00 00 00 02 1b 27 a6 80 00 00 01 80 00 00
[137450.866112] blk_update_request: I/O error, dev sdh, sector 9109386432 op 0x0:(READ) flags 0x80700 phys_seg 16 prio class 0
[137450.866257] blk_update_request: I/O error, dev sdh, sector 9045517952 op 0x0:(READ) flags 0x80700 phys_seg 48 prio class 0
[137451.294441] sd 0:0:5:0: Power-on or device reset occurred

Let's kick off the md reshape and then we can investigate .. I can also launch some smartctl reports too.
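
A quick way to pull those reports for the disk showing up in the kernel messages above:

    # full SMART health and attributes report for the disk throwing read errors
    smartctl -a /dev/sdh
    # optionally kick off a short self-test and check the result afterwards
    smartctl -t short /dev/sdh
    smartctl -l selftest /dev/sdh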

I guess these are the kind of errors we're hitting too: https://bugs.centos.org/view.php?id=17423 (and if you google it you'll find similar issues). Worth knowing that there are zillions of very small files in the PVs (on top of the exported NFS volumes), so if multiple projects are hitting the node at the same time, yes, that seems to match "heavy disk load". Depending on available time on my side, I'll be able (or not) to dive deeper into that.
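
If the reshape itself turns out to be the main offender during working hours, one knob that could be used (just an option, not something applied here yet) is the md resync/reshape speed limit:

    # current rebuild/reshape speed limits (KiB/s per device)
    sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max
    # temporarily cap the reshape so normal NFS traffic keeps more IO headroom
    sysctl -w dev.raid.speed_limit_max=50000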

Metadata Update from @arrfab:
- Issue untagged with: groomed

3 years ago

Metadata Update from @arrfab:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago
