#232 ssh issue on libvirt based images
Closed 2 years ago Opened 2 years ago by kushal.

During our Atomic tests, we reboot the VM after disabling the chronyd service. After the reboot, the ssh service does not come back in time. This VM has one vCPU.

Sometimes we saw the same behaviour on Vagrant libvirt (which boots with 2 vCPUs), and at least once with Vagrant VirtualBox before.

One such example of failure: https://apps.fedoraproject.org/autocloud/jobs/2188/output#30

https://lists.fedoraproject.org/archives/list/cloud@lists.fedoraproject.org/thread/6YLFENGHJY2CFKMYIFYE7NEGLDVZ6ZPM/ is the thread in the mailing list on the same topic.


hey @kushal, is there any way at all to reproduce this issue outside of the test suite? I'm thinking this probably falls into one of a few different categories:

  • This is easily reproducible with a specific VM libvirt configuration (i.e. the hardware presented to the VM) and thus should be easily reproducible anywhere.
  • This doesn't reproduce any other way than by running tunir.
  • This doesn't reproduce anywhere but the environment that is running the tunir tests.

Can you try to chase down to see if one of those statements is true?

I can say that with my ansible tests (run on an instance locally with testcloud) this issue does not happen.

I can reproduce it on my test hardware (lenovo servers) even on Vagrant libvirt images.

curl -O http://infrastructure.fedoraproject.org/infra/autocloud/tunirtests.tar.gz
tar -xzvf tunirtests.tar.gz
sudo python3 -m unittest tunirtests.cloudservice.TestServiceDisable -v
@@ sudo reboot
sudo python3 -m unittest tunirtests.testreboot.TestReboot -v

This seems to be producing the error for me.

It worked fine for me. Tested with 1 vcpu and 2 vcpu(s). Unable to reproduce the issue.

Kushal, were you able to get the system logs from a failed image?

Nope, but I managed to get the tests passing on the same boxes if we wait much longer for ssh to come back.

So I pushed tunir-0.17.1 to production after testing it on my servers. It has a POLL directive, and we now poll for up to 300 seconds for the ssh service to come back.
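
For anyone who wants to try the same wait outside tunir, here is a minimal sketch of polling for ssh after a reboot. It is not tunir's actual POLL implementation; the function name `wait_for_port` and the 5 second retry interval are assumptions made for illustration, and only the 300 second limit comes from the comment above. It checks only that the TCP port accepts connections, which does not guarantee that sshd is ready to authenticate.

```python
import socket
import time


def wait_for_port(host, port=22, timeout=300.0, interval=5.0):
    """Poll until a TCP connection to host:port succeeds or timeout expires.

    Returns True if the port accepted a connection, False otherwise.
    Hypothetical helper, not part of tunir.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused, unreachable, or timed out: VM not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_for_port("192.168.122.10")` (the address is a placeholder) right after `sudo reboot` would block for up to 300 seconds and report whether the port came back.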

I'd like to look at the logs still. :)

Metadata Update from @dustymabe:
- Issue tagged with: host

2 years ago

Finally managed to isolate the issue. If we boot the image with only one CPU, the error comes up. If we boot with 2 or more CPUs, there are no issues at all. Now the question is whether we should run local testing on Autocloud with 2 CPUs or get this issue fixed somehow. Kushal

Delayed success with 1 VCPU vs. "success" with 2 VCPUs supports the "entropy problem" theory from the mailing list.
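
One way to check the entropy theory on an affected image would be to look at the kernel's entropy estimate shortly after boot. The sketch below assumes a Linux guest that exposes `/proc/sys/kernel/random/entropy_avail`; the helper names and the threshold of 200 are hypothetical and only for illustration. Newer kernels may report a fixed value here, so a reading is only meaningful on kernels that still track the estimate.

```python
from pathlib import Path

# Standard Linux procfs location of the kernel's entropy estimate (in bits).
ENTROPY_AVAIL = Path("/proc/sys/kernel/random/entropy_avail")


def read_entropy_avail(path=ENTROPY_AVAIL):
    """Return the entropy estimate stored at path as an integer."""
    return int(Path(path).read_text().strip())


def entropy_is_low(path=ENTROPY_AVAIL, threshold=200):
    """Return True if the estimate is below threshold (arbitrary cutoff)."""
    return read_entropy_avail(path) < threshold
```

Running `read_entropy_avail()` on a 1 vCPU guest and on a 2 vCPU guest right after reboot would give two numbers to compare against the delay in ssh coming up.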

haven't seen this lately - let's close it and re-open if the issue pops back up

Metadata Update from @dustymabe:
- Issue status updated to: Closed (was: Open)

2 years ago
