#11476 Please power-cycle Copr p08 hypervisors (and allow us to do so)
Closed: Fixed a year ago by praiskup. Opened a year ago by praiskup.

This is a similar request to the #11094 ticket, but now we need access to
a different host:

vmhost-p08-copr01.rdu-cc.fedoraproject.org (host is down now)

Ideally also (for the future) for:

vmhost-p08-copr02.rdu-cc.fedoraproject.org
vmhost-x86-copr01.rdu-cc.fedoraproject.org
vmhost-x86-copr02.rdu-cc.fedoraproject.org
vmhost-x86-copr03.rdu-cc.fedoraproject.org
vmhost-x86-copr04.rdu-cc.fedoraproject.org

Since the SSH daemon doesn't reply, I'm curious what is going on (I'd expect
Nagios failures).


Metadata Update from @phsmoura:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops

a year ago

Metadata Update from @kevin:
- Issue assigned to kevin

a year ago

I think I can get you set up with access... it's gonna take some doing. :)

Note that we don't actually have a console on the power8 machines. We do have access to log in to mgmt, where you can power-cycle them, but no console.

The x86 ones all have normal DRACs. The power9 one has an OK mgmt interface with console, etc...

I'll try and get this set up this week.

Sorry, I forgot to mention here: the machine appeared online yesterday (uptime 4 days), so I just power-cycled them all over SSH to make the IPv6 stack work again (not sure what was happening, but some VMs were offline, and some were unable to get an RH subscription).

Yeah, I power cycled it the other day. Sorry for not updating here.

Ok, we have the same problem with vmhost-p08-copr01.rdu-cc.fedoraproject.org now. SSH to this vmhost fails, or rather times out with -o ConnectTimeout=something:

debug1: Reading configuration data /home/praiskup/.ssh/config
debug1: /home/praiskup/.ssh/config line 41: Applying options for *fedoraproject.org
debug1: /home/praiskup/.ssh/config line 50: Applying options for *
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: configuration requests final Match pass
debug1: re-parsing configuration
debug1: Reading configuration data /home/praiskup/.ssh/config
debug1: /home/praiskup/.ssh/config line 41: Applying options for *fedoraproject.org
debug1: /home/praiskup/.ssh/config line 50: Applying options for *
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: Connecting to vmhost-p08-copr01.rdu-cc.fedoraproject.org [8.43.85.55] port 22.

I'm curious if we can tell what is going on here before we reboot the machine; this is happening too often now to just ignore :-(

The worst thing here is that even though Copr uses:

   ServerAliveInterval 20
   ServerAliveCountMax 5

The SSH connection to a particular VM on this vmhost continues to send packets (so SSH is semi-working both on the vmhost and on its VMs). This confuses our logic a lot (if SSH failed outright, we would detect it and start the build over again on a different VM on a different vmhost).
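For context, ssh only declares the peer dead after ServerAliveCountMax consecutive unanswered keepalives, i.e. after ServerAliveInterval × ServerAliveCountMax seconds. Since the stuck VM keeps answering the keepalives, that threshold is never reached. A minimal sketch of the arithmetic:

```shell
# ssh gives up only after ServerAliveCountMax consecutive *unanswered*
# keepalives, each sent ServerAliveInterval seconds apart.
interval=20   # ServerAliveInterval
count=5       # ServerAliveCountMax
echo "dead-peer timeout: $((interval * count)) seconds"
```

So with these settings a truly dead peer is dropped after roughly 100 seconds, but a host that still answers keepalives (while the remote command is wedged on disk I/O) keeps the connection open forever.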

There are two indefinitely running connections to one of those "zombie" VMs:

resalloc  203516  0.0  0.0  11292  8576 ?        S    01:00   0:00 ssh root@2620:52:3:1:dead:beef:cafe:c20b subscription-manager unregister
copr     3282698  0.0  0.0  10752  7680 ?        S    Aug24   0:00 ssh -F /home/copr/.ssh/config mockbuilder@2620:52:3:1:dead:beef:cafe:c20b copr-rpmbuild-log

The first one is fixed by this commit; we can afford this because the subscriptions are garbage-collected later if the unregister goes wrong.
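The actual commit isn't quoted here, but a client-side guard of roughly this shape (the function name and 60-second deadline are illustrative) is enough for the unregister case: coreutils timeout kills the whole ssh invocation after the deadline regardless of keepalive traffic, and exits with status 124 when the limit is hit.

```shell
# Hypothetical wrapper: give the remote unregister at most 60 seconds;
# a failed unregister is garbage-collected on the server side later anyway.
run_unregister() {
    timeout 60 ssh "root@$1" subscription-manager unregister
}

# timeout signals an expired deadline with exit status 124:
timeout 1 sleep 5 || echo "timeout exit status: $?"
```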

But I'm really lost as to how to affect the second one. The remotely executed process
has (I'm almost sure, though it's hard to confirm without SSH) exited, but the controlling
SSH channel is still sending keepalive packets:

strace: Process 3282698 attached
pselect6(7, [3], [], NULL, {tv_sec=0, tv_nsec=396725938}, NULL) = 0 (Timeout)
getpid()                                = 3282698
getrandom("\xc9\x24\x9a\x9b", 4, 0)     = 4
pselect6(7, [3], [3], NULL, {tv_sec=20, tv_nsec=0}, NULL) = 1 (out [3], left {tv_sec=19, tv_nsec=999997930})
write(3, "dropped-for-security"..., 52) = 52
pselect6(7, [3], [], NULL, {tv_sec=20, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=19, tv_nsec=991985020})
read(3, "dropped-for-security", 8192) = 28
getpid()                                = 3282698
pselect6(7, [3], [], NULL, {tv_sec=20, tv_nsec=0}, NULL) = 0 (Timeout)
getpid()                                = 3282698
getrandom("\xbf\xc6\xef\xc1", 4, 0)     = 4
pselect6(7, [3], [3], NULL, {tv_sec=20, tv_nsec=0}, NULL) = 1 (out [3], left {tv_sec=19, tv_nsec=999997160})
write(3, "dropped-for-security"..., 52) = 52
pselect6(7, [3], [], NULL, {tv_sec=20, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=19, tv_nsec=991982040})
read(3, "\t\215\312\306\324\234pA\332\340\325o\330\2148%;@2\237Z\335i\271\227\214\335u", 8192) = 28
getpid()                                = 3282698
pselect6(7, [3], [], NULL, {tv_sec=20, tv_nsec=0}, NULL) = 0 (Timeout)
getpid()                                = 3282698
getrandom("\x7a\x8b\x48\x0e", 4, 0)     = 4
pselect6(7, [3], [3], NULL, {tv_sec=20, tv_nsec=0}, NULL) = 1 (out [3], left {tv_sec=19, tv_nsec=999997370})
write(3, "dropped-for-security"..., 52) = 52
pselect6(7, [3], [], NULL, {tv_sec=20, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=19, tv_nsec=991883548})
read(3, "dropped-for-security", 8192) = 28
getpid()                                = 3282698
....

OK, my feeling is that there's something wrong with the disks:

$ ssh -vv ...
...
debug1: pledge: filesystem

Seems like the disk is stuck, which kind of explains why the keepalive packets work fine.

I've filed a ticket against Copr explaining what is happening in the background.

Well, sadly as I noted in the other ticket we have no remote console on these power8 boxes.
We have access to a web interface that lets you power one off and power it back on, but you cannot see the console.

I've power cycled that box... perhaps there's something in logs?

I don't see anything obvious. There's a big hole in /var/log/messages:

Aug 24 17:17:11 vmhost-p08-copr01 dnsmasq-dhcp[5114]: DHCPREQUEST(virbr0) 192.168.122.111 52:54:00:b4:fa:97
Aug 24 17:17:11 vmhost-p08-copr01 dnsmasq-dhcp[5114]: DHCPACK(virbr0) 192.168.122.111 52:54:00:b4:fa:97
Aug 25 16:38:22 vmhost-p08-copr01 tag_audit_log: type=VIRT_RESOURCE msg=audit(1692890776.991:281779): pid=4770 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:virtd_t:s0-s0:c0.c1023 msg='virt=kvm resrc=rng reason=start vm="c
Aug 25 16:38:22 vmhost-p08-copr01 tag_audit_log: type=VIRT_RESOURCE msg=audit(1692890776.991:281780): pid=4770 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:virtd_t:s0-s0:c0.c1023 msg='virt=kvm resrc=mem reason=start vm="c

I suppose that writing to /var/log/messages (on /) was blocked, too.

We used to observe issues with RAID10 on the copr-be machine, caused by heavy load there plus the periodic RAID checks, so we had a work-around unblocking all the frozen processes. Unfortunately, that wouldn't help us here on the hypervisors, where the / partition sits together with swap and data (libvirt images) on the same software RAID device, md2. So when data/swap on md2 is under heavy load, / is too, which effectively blocks everything else, including SSH attempts and (presumably) writes to /var/log/messages.
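The old copr-be work-around itself isn't quoted here, but a minimal way to spot such wedged processes (assuming a stock procps ps) is to filter for the D state and the kernel function they are blocked in:

```shell
# Print the header plus every process in uninterruptible sleep (state D),
# including the kernel wait channel it is blocked in.
ps -eo pid,stat,wchan:32,cmd | awk 'NR == 1 || $2 ~ /^D/'
```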

How difficult would it be to reinstall those machines with a slightly different disk layout? Since we use the hypervisors purely for computational power (no data storage), I can imagine having just one RAID1 for / and /boot (or even non-RAID?), and the rest of those disks could just form a volume group (or RAID0). The RAID6 used currently seems suboptimal (it's known to be relatively slow, which is why we went with RAID10 on copr-be), especially if we have swap on it.
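If a reinstall ever happens, the proposed layout could be expressed in a kickstart along these lines (a rough sketch only; the disk names, sizes, and volume group name are made up and would need adjusting to the real hardware):

```kickstart
# Hypothetical layout: small RAID1 for /boot and /, everything else
# pooled into a plain (non-RAID) volume group for libvirt images.
part raid.01 --ondisk=sda --size=1024
part raid.02 --ondisk=sdb --size=1024
raid /boot --level=1 --device=md0 raid.01 raid.02
part raid.11 --ondisk=sda --size=51200
part raid.12 --ondisk=sdb --size=51200
raid / --level=1 --device=md1 --fstype=xfs raid.11 raid.12
part pv.01 --ondisk=sdc --grow
part pv.02 --ondisk=sdd --grow
volgroup vg_images pv.01 pv.02
logvol /libvirt-images --vgname=vg_images --name=images --fstype=xfs --grow
```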

As an experiment, I set dev.raid.speed_limit_max = 25000 in sysctl.conf and dropped the check frequency to just the first Sunday of the month (not every Sunday). Only for vmhost-p08-copr01; vmhost-p08-copr02 stays unchanged.
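For reference, that tuning boils down to something like the following (the file paths follow the usual RHEL 8 conventions; the exact files on the host may differ):

```shell
# /etc/sysctl.d/90-raid.conf -- cap md resync/check bandwidth (KiB/s):
#   dev.raid.speed_limit_max = 25000
# applied live with: sysctl -w dev.raid.speed_limit_max=25000

# /etc/cron.d/raid-check -- run the periodic check only on the first
# Sunday of the month (day-of-month 1-7 plus a weekday guard; % must be
# escaped in crontabs):
#   0 1 1-7 * * root [ "$(date +\%u)" = 7 ] && /usr/sbin/raid-check
```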

I power cycled 01 and 02 today. ;(

I have no idea why they are suddenly so unstable. Can we see anything that changed there package wise?

There was a new kernel:
kernel-4.18.0-477.21.1.el8_8.ppc64le Wed 16 Aug 2023 01:54:59 AM UTC

perhaps we should boot back to the previous one?

I power cycled 01 and 02 today. ;(

Hm, have you received any Nagios notification about this?

perhaps we should boot back to the previous one?

Not a bad idea, I'm not sure what is going on.

How difficult would it be to reorganize the disks? I mean, if disk performance is the issue and we had a separate / partition, the system on the hypervisor should stay generally usable (i.e., not halting processes in D state).

I suppose RHEL 9 is not supported on Power8, so we cannot upgrade from RHEL 8, right?

02 just hung again... it was reported in nagios:

[13:50:56] <zodbot> PROBLEM - vmhost-p08-copr02.rdu-cc.fedoraproject.org/Rsyslogd Process is CRITICAL: CRITICAL - Plugin timed out (noc01)

and I managed to get a serial console on it!

[198441.896036] systemd[1]: systemd-journald.service: State 'stop-final-sigterm' timed out. Killing.                                                                     
[198441.896194] systemd[1]: systemd-journald.service: Killing process 1173298 (systemd-journal) with signal SIGKILL.                                                     
[198441.896421] systemd[1]: systemd-journald.service: Killing process 3575 (systemd-journal) with signal SIGKILL.                                                        
[198441.896443] systemd[1]: systemd-journald.service: Killing process 1165312 (systemd-journal) with signal SIGKILL.                                                     
[198441.896465] systemd[1]: systemd-journald.service: Killing process 1166204 (systemd-journal) with signal SIGKILL.                                                     
[198441.896487] systemd[1]: systemd-journald.service: Killing process 1167105 (systemd-journal) with signal SIGKILL.                                                     
[198441.896509] systemd[1]: systemd-journald.service: Killing process 1167997 (systemd-journal) with signal SIGKILL.                                                     
[198441.896530] systemd[1]: systemd-journald.service: Killing process 1168891 (systemd-journal) with signal SIGKILL.                                                     
[198441.896551] systemd[1]: systemd-journald.service: Killing process 1169781 (systemd-journal) with signal SIGKILL.                                                     
[198441.896572] systemd[1]: systemd-journald.service: Killing process 1170702 (systemd-journal) with signal SIGKILL.                                                     
[198532.143326] systemd[1]: systemd-journald.service: Processes still around after final SIGKILL. Entering failed mode.                                                  
[198532.143357] systemd[1]: systemd-journald.service: Failed with result 'timeout'.                                                                                      
[198532.143502] systemd[1]: Failed to start Journal Service.                                                                                                             
[198532.144186] systemd[1]: systemd-journald.service: Service has no hold-off time (RestartSec=0), scheduling restart.                                                   
[198532.144234] systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 12.                                                                   
[198532.144359] systemd[1]: Stopped Journal Service.                                                                                                                     
[198532.144447] systemd[1]: systemd-journald.service: Found left-over process 3575 (systemd-journal) in control group while starting unit. Ignoring.                     
[198532.144458] systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.                                        
[198532.144482] systemd[1]: systemd-journald.service: Found left-over process 1165312 (systemd-journal) in control group while starting unit. Ignoring.                  
[198532.144492] systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.                                        
[198532.727752] audit: type=1130 audit(1693431309.415:170798): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-journald com
m="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'                                                                                      
[198532.727782] audit: type=1130 audit(1693431309.415:170799): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-journald com
m="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'                                                                                     
[198532.727796] audit: type=1131 audit(1693431309.415:170800): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-journald com
m="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'                                                                                     

No login prompt tho. ;(

How come the "Killing process ..." messages? Was the machine shutting itself down? :-)

No login prompt tho. ;(

That would explain the disk issue; the login shell definitely reads stuff from /.

Though ... is this happening with an x86 box as well now? copr@vmhost-x86-copr04.rdu-cc.fedoraproject.org

Yes, I saw the killing thing on vmhost-x86-copr04 today too, which is odd.

I rebooted everything, but I am not sure what's going on.

@msekleta would you mind taking a look at the serial console output above? I'm not sure what is happening; can systemd itself decide that the machine should be rebooted? Can this be related to the OOM killer?

In the meantime, vmhost-p08-copr02.rdu-cc.fedoraproject.org seems to be down again; it responds to ICMP, but I can't SSH there.

Some(!) of these issues are reported by Nagios, sorry. This is again just the sad switch from Zimbra to Gmail we did; I'm unable to create a valid Copr-related Gmail filter ... from:(nagios@fedoraproject.org) copr doesn't match the vmhost-p08-copr02 string, and I'm giving up after another half hour of searching and experimenting with non-working RE2 syntax.

That said, I still don't understand why vmhost-p08-copr02.rdu-cc.fedoraproject.org isn't in "CRITICAL" now.

ok. I cycled both vmhost-p08-copr01 and vmhost-p08-copr02.

I am not sure why 02 didn't alert. Possibly the nrpe daemon was still responding, even though SSH was not working?

I've been looking for a way to add a copr mgmt user to those machines, but it seems that's not possible. ;( So I guess I will just change the admin password for you there. I'll see about getting that set up soon.

Thank you @kevin!

I'm curious if you could allow me to reinstall these machines somehow (not sure I can actually do it; I've never installed anything other than x86/arm64). Or whether you could find some time to help us with this. I believe we should get rid of the RAID6 volumes and swaps, somehow, rather soon.

Thank you @kevin!

I'm curious if you could allow me to reinstall these machines somehow (not sure I can actually do it; I've never installed anything other than x86/arm64). Or whether you could find some time to help us with this. I believe we should get rid of the RAID6 volumes and swaps, somehow, rather soon.

Well, it's theoretically possible to reinstall them... but it could also easily result in them being unbootable and down for a long time until we got someone on-site to fix them.
I'd really prefer not to try, but if you really want to and don't mind possibly having them drop off, you could try. I could try and help, but I would sadly put this pretty low on my priority list.

ok, so I put the access info in ~praiskup/copr-mgmt on batcave01. It has the info on the IPs and passwords. I hope you can securely transfer it to the rest of the copr team?

Please let me know if you hit any issues with that info. If you want to try doing reinstalls, I guess we need a kickstart to use and can try, but perhaps open a new ticket for that?

Metadata Update from @kevin:
- Issue close_status updated to: Fixed with Explanation
- Issue status updated to: Closed (was: Open)

a year ago

@praiskup first of all, sorry for the late reply; I was on PTO...

From the provided snippet I can't really tell whether the machine was shutting down or not. Actually, to me it seems like the service (journald) was stopping because of a restart. Now, the question is why it needed to be restarted, and why there were multiple leftover processes in the cgroup.

One possible explanation is that the journald process got stuck (D scheduling state) due to the storage issues mentioned above; it then failed to ping systemd in time, the watchdog expired, and that in turn caused the service restart. The stop operation timed out as well, and the process couldn't be cleaned up even with SIGKILL. Then a new journald process was started, but it also got stuck, failed to reset the watchdog in time, and so on in a loop.

In any case, for a more in-depth root-cause analysis I'd need the full log from the machine.

@msekleta thank you for checking; the D state probably explains this then, and the box did not reboot itself. I assume systemd never decides to do that on its own, right?

@kevin I realized that we can likely fix the disk layout without reinstallation: it seems we can pull one disk out of the large RAID6 on the hypervisor (a RAID5-like setup stays behind, with / on it), add the disk to the same volume group as an additional physical volume, move the / partition onto the new physical volume, and then tear down the rest of the RAID6 and recycle it as a non-RAID volume group.
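A sketch of that procedure with hypothetical device and VG names (md2 being the big RAID6, /dev/sdf1 the member being pulled out, vg_root/lv_root the volume group and root LV); every command here is destructive and would have to be adapted to the real layout before use:

```shell
# 1. Drop one member from the RAID6; the array keeps running degraded:
mdadm /dev/md2 --fail /dev/sdf1 --remove /dev/sdf1
# 2. Turn the freed partition into a new PV in the same volume group:
pvcreate /dev/sdf1
vgextend vg_root /dev/sdf1
# 3. Move the extents of the root LV off the RAID6 PV onto the new disk:
pvmove --name lv_root /dev/md2 /dev/sdf1
# 4. Once nothing lives on md2, drop it from the VG and recycle the disks:
vgreduce vg_root /dev/md2
pvremove /dev/md2
mdadm --stop /dev/md2
```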

FTR - for early testing - we changed the disk layout with @msuchy today on vmhost-x86-copr01.rdu-cc.fedoraproject.org to the desired layout, where /libvirt-images (holding the guest roots and swaps) is on a non-RAID LV with 8 stripes, and / is on a different set of disks. This was the easier exercise because we did not have to move data between physical volumes. Do you want to review the layout? Unless you stop us, we'll experiment with vmhost-x86-copr04.rdu-cc.fedoraproject.org, which will be a bit harder because / has to be moved.

Also, would you mind rebooting vmhost-x86-copr02.rdu-cc.fedoraproject.org which appears down now? And vmhost-p08-copr02.rdu-cc.fedoraproject.org?

I missed the reply somehow.

Well, it's theoretically possible to reinstall them... but it could also easly
result in them being unbootable and down for a long time until we got someone
on-site to fix them. I'd really prefer not to try, but if you really want to
and don't mind possibly having them drop off, you could try. I could try and
help, but I would sadly put this pretty low on my priority list.

Understood; we can probably avoid the reinstallation after all.

ok, so I put access info in ~praiskup/copr-mgmt on batcave01. It has the
info on the ip's and passwords. I hope you can securely transfer it to
the rest of the copr team?

Sure, thank you! I'll try to reboot the stuck machines today.

Please let me know if you hit any issues with that info. If you want to
try doing reinstalls, I guess we need a kickstart to use and can try,
but perhaps open a new ticket for that?

Sure. Thank you again!

Sure, thank you! I'll try to reboot the stuck machines today.

@kevin I'm trying with a Firefox session tunneled over SSH now, and the password for the iDRAC copr user doesn't seem to log us in. I'm not sure if I'm doing something wrong. Could you please double-check that we have the correct login/password? (FTR, I tried SSH as well, and it doesn't work either.)

Metadata Update from @praiskup:
- Issue status updated to: Open (was: Closed)

a year ago

P08 access seems fine!

And I broke the x86 hypervisor 04 just now while trying to re-partition it (I thought I was successful, but the machine doesn't come up anymore). iDRAC access is needed :(

We sorted out the access on irc.

Should all be working. Can you confirm?

Metadata Update from @praiskup:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

a year ago

@kevin I realized that we can likely fix the disk layout without reinstallation

And FTR, already done. We
- kept / on a separate raid6 volume (it is under relatively low I/O pressure, and we want to save some storage for other purposes),
- enabled a swap partition on each disk (same priority, so the kernel should stripe across them),
- and moved the main I/O consumer, /libvirt-images, onto striped raid0.
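The equal-priority swap trick works because the kernel allocates swap pages round-robin across devices that share the same priority; an illustrative /etc/fstab fragment (the partition names are hypothetical):

```text
# /etc/fstab -- one swap partition per disk, all with the same priority,
# so the kernel stripes swap I/O across them:
/dev/sda3  none  swap  defaults,pri=10  0 0
/dev/sdb3  none  swap  defaults,pri=10  0 0
/dev/sdc3  none  swap  defaults,pri=10  0 0
```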

The disks on p08 seem quite worn out (tens of millions of ECC-corrected errors; not sure how to interpret that), but we can still write >= 1 GB/s, with 1500+ IOPS.

Hopefully, with the new layout, the original problem never comes back on us.

Very glad to hear it.

Yeah, we got those power8 boxes a long time ago... they were koji builders for a long time, then cloud nodes, now Copr build nodes, so I am not surprised about the disks. :)
