This is a similar request to the #11094 ticket, but now we need access to a different host:
vmhost-p08-copr01.rdu-cc.fedoraproject.org (host is down now)
Ideally also for these (for the future):
vmhost-p08-copr02.rdu-cc.fedoraproject.org
vmhost-x86-copr01.rdu-cc.fedoraproject.org
vmhost-x86-copr02.rdu-cc.fedoraproject.org
vmhost-x86-copr03.rdu-cc.fedoraproject.org
vmhost-x86-copr04.rdu-cc.fedoraproject.org
Since the SSH daemon doesn't reply, I'm curious what is going on (I'd expect Nagios failures).
Metadata Update from @phsmoura: - Issue priority set to: Waiting on Assignee (was: Needs Review) - Issue tagged with: medium-gain, medium-trouble, ops
Metadata Update from @kevin: - Issue assigned to kevin
I think I can get you setup with access... gonna take some doing. :)
Note that we don't actually have a console on the power8 machines. We do have access to log in to mgmt, where you can power cycle them, but there's no console.
The x86 ones all have normal dracs. The power9 one has an ok mgmt with console/etc...
I'll try and get this setup this week.
Sorry, I forgot to mention here; the machine appeared online yesterday (uptime 4 days) so I just power-cycled them all over ssh to make the ipv6 stack work again (not sure what was happening, but some VMs were offline, and some were unable to get RH subscription).
Yeah, I power cycled it the other day. Sorry for not updating here.
Ok, we have the same problem with vmhost-p08-copr01.rdu-cc.fedoraproject.org now. SSH to this vmhost fails, or rather times out with -o ConnectTimeout=something:
debug1: Reading configuration data /home/praiskup/.ssh/config
debug1: /home/praiskup/.ssh/config line 41: Applying options for *fedoraproject.org
debug1: /home/praiskup/.ssh/config line 50: Applying options for *
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: configuration requests final Match pass
debug1: re-parsing configuration
debug1: Reading configuration data /home/praiskup/.ssh/config
debug1: /home/praiskup/.ssh/config line 41: Applying options for *fedoraproject.org
debug1: /home/praiskup/.ssh/config line 50: Applying options for *
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: Connecting to vmhost-p08-copr01.rdu-cc.fedoraproject.org [8.43.85.55] port 22.
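For reference, a minimal bounded reachability check that reproduces the timeout without hanging a terminal looks like this (the ConnectTimeout value here is arbitrary, just for illustration):

# fail fast instead of hanging; 10s is an arbitrary example value
ssh -o ConnectTimeout=10 -o BatchMode=yes root@vmhost-p08-copr01.rdu-cc.fedoraproject.org true
echo "ssh exit status: $?"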
I'm curious if we can tell what is going on here before we reboot the machine; this is happening too often now to just ignore :-(
The worst thing here is that even though Copr uses:
ServerAliveInterval 20
ServerAliveCountMax 5
The ssh connection to a particular VM on this vmhost continues to send packets (so SSH is semi-working both on the vmhost and on its VMs). This confuses our logic a lot (if ssh failed, we would detect it and start the build over again on a different VM, i.e. a different vmhost).
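For context, this is roughly what those options mean in an ssh_config stanza (the Host pattern here is just an example): the client sends a probe every 20 seconds and drops the connection only after 5 consecutive unanswered probes, which is exactly why a half-dead host that still answers keepalives never trips it:

Host *.rdu-cc.fedoraproject.org
    ServerAliveInterval 20
    ServerAliveCountMax 5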
There are two indefinitely running connections to one of those "zombie" VMs:
resalloc 203516 0.0 0.0 11292 8576 ? S 01:00 0:00 ssh root@2620:52:3:1:dead:beef:cafe:c20b subscription-manager unregister
copr 3282698 0.0 0.0 10752 7680 ? S Aug24 0:00 ssh -F /home/copr/.ssh/config mockbuilder@2620:52:3:1:dead:beef:cafe:c20b copr-rpmbuild-log
The first one is fixed by this commit; we can afford this because the subscriptions are garbage-collected later if the unregister goes wrong.
But I'm really at a loss about how to affect the second one. The remotely executed process has (I'm almost sure, though it's hard to confirm without ssh) exited, but the controlling ssh channel is still sending keepalive packets:
strace: Process 3282698 attached
pselect6(7, [3], [], NULL, {tv_sec=0, tv_nsec=396725938}, NULL) = 0 (Timeout)
getpid() = 3282698
getrandom("\xc9\x24\x9a\x9b", 4, 0) = 4
pselect6(7, [3], [3], NULL, {tv_sec=20, tv_nsec=0}, NULL) = 1 (out [3], left {tv_sec=19, tv_nsec=999997930})
write(3, "somethign"..., 52) = 52
pselect6(7, [3], [], NULL, {tv_sec=20, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=19, tv_nsec=991985020})
read(3, "dropped-for-security", 8192) = 28
getpid() = 3282698
pselect6(7, [3], [], NULL, {tv_sec=20, tv_nsec=0}, NULL) = 0 (Timeout)
getpid() = 3282698
getrandom("\xbf\xc6\xef\xc1", 4, 0) = 4
pselect6(7, [3], [3], NULL, {tv_sec=20, tv_nsec=0}, NULL) = 1 (out [3], left {tv_sec=19, tv_nsec=999997160})
write(3, "dropped-for-security"..., 52) = 52
pselect6(7, [3], [], NULL, {tv_sec=20, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=19, tv_nsec=991982040})
read(3, "\t\215\312\306\324\234pA\332\340\325o\330\2148%;@2\237Z\335i\271\227\214\335u", 8192) = 28
getpid() = 3282698
pselect6(7, [3], [], NULL, {tv_sec=20, tv_nsec=0}, NULL) = 0 (Timeout)
getpid() = 3282698
getrandom("\x7a\x8b\x48\x0e", 4, 0) = 4
pselect6(7, [3], [3], NULL, {tv_sec=20, tv_nsec=0}, NULL) = 1 (out [3], left {tv_sec=19, tv_nsec=999997370})
write(3, "dropped-for-security"..., 52) = 52
pselect6(7, [3], [], NULL, {tv_sec=20, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=19, tv_nsec=991883548})
read(3, "dropped-for-security", 8192) = 28
getpid() = 3282698
....
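For the record, one blunt client-side way to bound the second case would be a hard deadline around the whole ssh invocation. This is only a sketch (the 6h/1m values and BUILDER_IP are placeholders), not what the commit above or Copr actually does:

# coreutils timeout: SIGTERM after 6h, SIGKILL one minute later if ssh ignores it
timeout --kill-after=1m 6h \
  ssh -F /home/copr/.ssh/config mockbuilder@BUILDER_IP copr-rpmbuild-log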
OK, my feeling is that there's something wrong with the disks:
$ ssh -vv ...
...
debug1: pledge: filesystem
Seems like the disk is stuck, which kind of explains why the keepalive packets still work fine.
Nagios doesn't complain: https://nagios.fedoraproject.org/nagios/cgi-bin//status.cgi?host=vmhost-p08-copr01.rdu-cc.fedoraproject.org&style=detail
Neither: https://nagios-external.fedoraproject.org/nagios/cgi-bin//status.cgi?host=vmhost-p08-copr01.rdu-cc.fedoraproject.org&style=detail
I've filed a ticket against Copr explaining what is happening in the background.
Well, sadly as I noted in the other ticket we have no remote console on these power8 boxes. We have access to a web interface that lets you power one off and power it back on, but you cannot see the console.
I've power cycled that box... perhaps there's something in logs?
I don't see anything obvious. There's a big hole in /var/log/messages:
Aug 24 17:17:11 vmhost-p08-copr01 dnsmasq-dhcp[5114]: DHCPREQUEST(virbr0) 192.168.122.111 52:54:00:b4:fa:97
Aug 24 17:17:11 vmhost-p08-copr01 dnsmasq-dhcp[5114]: DHCPACK(virbr0) 192.168.122.111 52:54:00:b4:fa:97
Aug 25 16:38:22 vmhost-p08-copr01 tag_audit_log: type=VIRT_RESOURCE msg=audit(1692890776.991:281779): pid=4770 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:virtd_t:s0-s0:c0.c1023 msg='virt=kvm resrc=rng reason=start vm="c
Aug 25 16:38:22 vmhost-p08-copr01 tag_audit_log: type=VIRT_RESOURCE msg=audit(1692890776.991:281780): pid=4770 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:virtd_t:s0-s0:c0.c1023 msg='virt=kvm resrc=mem reason=start vm="c
I suppose that writing to /var/log/messages (on /) was blocked, too.
We used to observe issues with RAID10 on the copr-be machine, caused by heavy load there plus periodic raid checks, so we had a work-around that unblocked all the frozen processes. Unfortunately, that wouldn't help us here on the hypervisors, where the / partition sits together with SWAP and the data (libvirt images) on the same software RAID device md2 (so when data/swap on md2 is under heavy load, / is too, which effectively blocks everything else, including ssh attempts and writing to /var/log/messages?).
How difficult would it be to reinstall those machines with a slightly different disk layout? Since we use the hypervisors as pure computational power (no data storage), I can imagine having just one RAID1 for / and /boot (or even non-raid?), and the rest of the disks could just form a volume group (or RAID0). The RAID6 used currently seems suboptimal (it's known to be relatively slow, which is why we went with RAID10 on copr-be), especially if we have SWAP on it.
As an experiment, I set dev.raid.speed_limit_max = 25000 in sysctl.conf and dropped the raid check frequency to "just the first Sunday" (not every Sunday). Only for vmhost-p08-copr01; vmhost-p08-copr02 stays unchanged.
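Roughly, such a tweak looks like the sketch below; the file names and the "first Sunday" cron trick are my assumptions about the stock RHEL 8 mdadm setup, not a literal copy of what was deployed:

# cap md resync/check throughput (KiB/s); persists across reboots via sysctl.d
echo 'dev.raid.speed_limit_max = 25000' > /etc/sysctl.d/90-raid-speed.conf
sysctl --system

# /etc/cron.d/raid-check: run only on the first Sunday of the month
# (note the escaped % required inside crontab entries)
0 1 1-7 * * root [ "$(date +\%a)" = "Sun" ] && /usr/sbin/raid-check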
I power cycled 01 and 02 today. ;(
I have no idea why they are suddenly so unstable. Can we see anything that changed there package-wise?
There was a new kernel: kernel-4.18.0-477.21.1.el8_8.ppc64le (Wed 16 Aug 2023 01:54:59 AM UTC)
Perhaps we should boot back to the previous one?
Hm, have you received any Nagios notification about this?
Not a bad idea, I'm not sure what is going on.
How difficult would it be to reorganize the disks? I mean, if disk performance is the issue and we had a separate / partition, the system on the hypervisor should stay generally usable (i.e., not halting processes in D state).
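For whoever debugs the next occurrence: spotting processes stuck in D state is straightforward with generic tooling, e.g.:

# list uninterruptible-sleep (D) processes and the kernel function they wait in
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'
# the kernel also reports long D-state hangs in the ring buffer
dmesg | grep -i "blocked for more than"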
I suppose RHEL 9 is not supported on Power8, so we cannot upgrade from RHEL 8, right?
02 just hung again... it was reported in nagios:
[13:50:56] <zodbot> PROBLEM - vmhost-p08-copr02.rdu-cc.fedoraproject.org/Rsyslogd Process is CRITICAL: CRITICAL - Plugin timed out (noc01)
and I managed to get a serial console on it!
[198441.896036] systemd[1]: systemd-journald.service: State 'stop-final-sigterm' timed out. Killing.
[198441.896194] systemd[1]: systemd-journald.service: Killing process 1173298 (systemd-journal) with signal SIGKILL.
[198441.896421] systemd[1]: systemd-journald.service: Killing process 3575 (systemd-journal) with signal SIGKILL.
[198441.896443] systemd[1]: systemd-journald.service: Killing process 1165312 (systemd-journal) with signal SIGKILL.
[198441.896465] systemd[1]: systemd-journald.service: Killing process 1166204 (systemd-journal) with signal SIGKILL.
[198441.896487] systemd[1]: systemd-journald.service: Killing process 1167105 (systemd-journal) with signal SIGKILL.
[198441.896509] systemd[1]: systemd-journald.service: Killing process 1167997 (systemd-journal) with signal SIGKILL.
[198441.896530] systemd[1]: systemd-journald.service: Killing process 1168891 (systemd-journal) with signal SIGKILL.
[198441.896551] systemd[1]: systemd-journald.service: Killing process 1169781 (systemd-journal) with signal SIGKILL.
[198441.896572] systemd[1]: systemd-journald.service: Killing process 1170702 (systemd-journal) with signal SIGKILL.
[198532.143326] systemd[1]: systemd-journald.service: Processes still around after final SIGKILL. Entering failed mode.
[198532.143357] systemd[1]: systemd-journald.service: Failed with result 'timeout'.
[198532.143502] systemd[1]: Failed to start Journal Service.
[198532.144186] systemd[1]: systemd-journald.service: Service has no hold-off time (RestartSec=0), scheduling restart.
[198532.144234] systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 12.
[198532.144359] systemd[1]: Stopped Journal Service.
[198532.144447] systemd[1]: systemd-journald.service: Found left-over process 3575 (systemd-journal) in control group while starting unit. Ignoring.
[198532.144458] systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
[198532.144482] systemd[1]: systemd-journald.service: Found left-over process 1165312 (systemd-journal) in control group while starting unit. Ignoring.
[198532.144492] systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
[198532.727752] audit: type=1130 audit(1693431309.415:170798): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-journald comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
[198532.727782] audit: type=1130 audit(1693431309.415:170799): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-journald comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[198532.727796] audit: type=1131 audit(1693431309.415:170800): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-journald comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
No login prompt tho. ;(
How come the "Killing process ..." messages? Was the machine shutting itself down? :-)
That would explain the disk issue; the login shell definitely reads stuff from /.
Though ... is this happening now with an x86 box as well? copr@vmhost-x86-copr04.rdu-cc.fedoraproject.org
Yes, I saw the killing thing on vmhost-x86-copr04 today too, which is odd.
I rebooted everything, but I am not sure what's going on.
@msekleta would you mind taking a look at the serial console output above? I'm not sure what is happening; can systemd itself decide that the machine should be rebooted? Could this be related to the OOM killer?
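To rule the OOM killer in or out after the next incident, something like this should be enough (generic commands; the -b -1 variant assumes a persistent journal):

# look for OOM-killer activity in the kernel log of the current and previous boot
dmesg -T | grep -iE 'out of memory|oom-killer'
journalctl -k -b -1 | grep -i oom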
In the meantime, vmhost-p08-copr02.rdu-cc.fedoraproject.org seems to be down again; ICMP is responding, but I can't ssh there.
Ah, also vmhost-p08-copr01.rdu-cc.fedoraproject.org. Nagios only reports for this host:
[Screenshot attached: Screenshot_20230908_074836.png]
Some (!) of these issues are reported by Nagios, sorry. This is again just the sad switch from Zimbra to Gmail we did; I'm unable to create a valid copr-related Gmail filter ... from:(nagios@fedoraproject.org) copr doesn't match the vmhost-p08-copr02 string. I'm giving up after another half hour of searching and experimenting with non-working RE2 syntax.
That said, I still don't understand why vmhost-p08-copr02.rdu-cc.fedoraproject.org isn't in "CRITICAL" now.
ok. I cycled both vmhost-p08-copr01 and vmhost-p08-copr02.
I am not sure why 02 didn't alert. Possibly, the nrpe daemon was still responding, even though ssh was not working?
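That theory is easy to verify during the next hang by querying nrpe directly from the Nagios server; assuming EPEL's usual plugin path, check_nrpe with no -c argument just reports the daemon version, which is enough to prove it still answers:

/usr/lib64/nagios/plugins/check_nrpe -H vmhost-p08-copr02.rdu-cc.fedoraproject.org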
I've been looking for a way to add a copr mgmt user to those machines, but it doesn't seem possible. ;( So, I guess I will just change the admin password for you there. I'll see about getting that set up soon.
Thank you @kevin!
I'm curious whether you could allow me to reinstall these machines somehow (not sure if I can actually do it, I have never installed anything other than x86/arm64), or whether you could find some time to help us with this. I believe we should get rid of the raid6 volumes and swaps, somehow, rather soon.
Well, it's theoretically possible to reinstall them... but it could also easily result in them being unbootable and down for a long time until we got someone on-site to fix them. I'd really prefer not to try, but if you really want to and don't mind possibly having them drop off, you could try. I could try and help, but I would sadly have to put this pretty low on my priority list.
ok, so I put the access info in ~praiskup/copr-mgmt on batcave01. It has the info on the IPs and passwords. I hope you can securely transfer it to the rest of the copr team?
Please let me know if you hit any issues with that info. If you want to try doing reinstalls, I guess we need a kickstart to use and can try, but perhaps open a new ticket for that?
Metadata Update from @kevin: - Issue close_status updated to: Fixed with Explanation - Issue status updated to: Closed (was: Open)
@praiskup first of all, sorry for the late reply, I was on PTO...
From the provided snippet I can't really tell whether the machine was shutting down or not. Actually, to me it seems like the service (journald) was stopping because of a restart. Now, the question is why the restart was needed and why there were multiple leftover processes in the cgroup.
One possible explanation is that the journald process got stuck (D scheduling state) due to the storage issues mentioned above, then failed to ping systemd in time and the watchdog expired, which in turn caused a service restart. The stop operation also timed out, and the process couldn't be cleaned up even using SIGKILL. Then a new journald process was started, but it also got stuck, failed to reset the watchdog in time, and so on in a loop.
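If it helps to confirm this on the affected hosts, the watchdog interval and the restart accounting can be read back with a plain read-only query:

# WatchdogUSec is the ping deadline, NRestarts the restart counter from the loop described above
systemctl show systemd-journald.service -p WatchdogUSec -p Restart -p NRestarts -p Result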
In any case, for a more in-depth root-cause analysis I'd need the full log from the machine.
@msekleta thank you for checking, the D state probably explains this then - and the box did not reboot itself. I assume systemd never decides to do that on its own, right?
@kevin I realized that we can likely fix the disk layout without reinstallation; it seems we can pull one disk out of the large raid6 on the hypervisor (a raid5-like setup stays behind with / on it), add that disk to the same volume group as an additional physical volume, move the / partition onto the new physical volume, and then remove the rest of the raid6 for recycling as a non-raid volume group.
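A rough sketch of that sequence, with hypothetical device/VG/LV names (the real ones on the hypervisors will differ), just to illustrate the idea:

# degrade the raid6 by one member and reuse that disk as a plain PV
mdadm /dev/md2 --fail /dev/sdf1 --remove /dev/sdf1
mdadm --zero-superblock /dev/sdf1
pvcreate /dev/sdf1
vgextend vg_copr /dev/sdf1
# move only the root LV's extents off the raid6 onto the freed disk
pvmove --name root /dev/md2 /dev/sdf1
# once nothing is left on md2, drop it from the VG and dismantle the array for recycling
vgreduce vg_copr /dev/md2
mdadm --stop /dev/md2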
FTR - for early testing - we changed the disk layout with @msuchy today on vmhost-x86-copr01.rdu-cc.fedoraproject.org to the desired layout, where /libvirt-images (holding the guest roots and SWAPs) is on a non-raid LV with 8 stripes, and / is on a different set of disks. This was the easier exercise because we did not have to move data between physical volumes. Do you want to review the layout? Unless you stop us, we'll experiment with vmhost-x86-copr04.rdu-cc.fedoraproject.org next, which will be a bit harder because / has to be moved.
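In LVM terms, the new layout is roughly the following (names are placeholders, not the actual VG/LV names used on vmhost-x86-copr01):

# one striped LV across 8 PVs for the I/O-heavy guest images
lvcreate --type striped --stripes 8 --name libvirt-images -l 100%FREE vg_copr
mkfs.xfs /dev/vg_copr/libvirt-images
echo '/dev/vg_copr/libvirt-images /libvirt-images xfs defaults 0 0' >> /etc/fstab
mount /libvirt-images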
Also, would you mind rebooting vmhost-x86-copr02.rdu-cc.fedoraproject.org which appears down now? And vmhost-p08-copr02.rdu-cc.fedoraproject.org?
I missed the reply somehow.
Understood, and we can avoid reinstallation eventually.
Sure, thank you! I'll try to reboot the stuck machines today.
Sure. Thank you again!
@kevin I'm trying with a Firefox session tunneled over ssh now, and the password for the idrac copr user doesn't seem to log us in. I'm not sure if I'm doing something wrong. Could you please double-check that we have the correct login/password? (FTR I tried SSH as well, and it doesn't work either.)
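FTR, the tunnelling itself is nothing special; something along these lines, where IDRAC_IP and the bastion host name are placeholders (the real values are in the copr-mgmt file), and Firefox then points at https://localhost:8443:

ssh -N -L 8443:IDRAC_IP:443 batcave01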
Metadata Update from @praiskup: - Issue status updated to: Open (was: Closed)
P08 access seems fine!
And I broke the x86 hypervisor 04 just now while trying to re-partition it (I thought I was successful, but the machine doesn't come up anymore). idrac access is needed :(
We sorted out the access on irc.
Should all be working. Can you confirm?
Yes, thank you @kevin and @darknao!
Metadata Update from @praiskup: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)
@kevin I realized that we can likely fix the disk layout without reinstallation
And FTR, already done. We
- kept / on a separate raid6 volume (this is under relatively low I/O pressure, and we want to save some storage for other purposes),
- enabled a SWAP partition on each disk (the same priority, so the kernel should stripe; see the fstab sketch below),
- and the main I/O-consuming place, /libvirt-images, is now on striped raid0.
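The swap striping relies purely on standard swap priorities; in /etc/fstab it looks roughly like this (partition names are placeholders):

# equal pri= values make the kernel allocate round-robin across the swap devices
/dev/sda3  none  swap  defaults,pri=10  0 0
/dev/sdb3  none  swap  defaults,pri=10  0 0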
The disks on p08 seem to be quite worn out (tens of millions of ECC-corrected errors; not sure how to interpret that), but we can still write >= 1 GB/s, with 1500+ IOPS.
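Those counters come from smartmontools; on SAS drives the ECC-corrected numbers usually appear in the SCSI error-counter log section of the output, e.g.:

# "Errors Corrected by ECC fast/delayed" shows up in this section for SAS disks
smartctl -a /dev/sda | grep -iA8 'error counter log'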
Hopefully, with the new layout, the original problem never comes back on us.
Very glad to hear it.
Yeah, those power8 boxes we got a long time ago... they were builders for koji for a long time, then cloud nodes, now copr build nodes, so I am not surprised about the disks. :)