#292 F28/F29 Networking breaks in GCE
Closed: fixed 6 years ago by cevich. Opened 6 years ago by cevich.

While following https://kyle.io/2018/06/fedora-cloud-on-gce/ using Fedora-Cloud-Base-28-1.1 I'm seeing problems after updating and rebooting the VM. It works fine as long as I don't update. I tried F29 previously, and it had a similar (same?) problem. Since there are a TON of updates, I'm having difficulty narrowing down which one is causing the problem. I will try to provide working and not-working serial-console output.


Ugh, it didn't break; I forgot to kill the cloud-init data...

[fedora@cevich-testing-fedora ~]$ sudo rm -rf /var/lib/cloud/instance?
[fedora@cevich-testing-fedora ~]$ sudo sys-unconfig
Connection to 35.211.52.220 closed by remote host.

Metadata Update from @cevich:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

6 years ago

@cevich so everything seems to be working fine?

@dustymabe ya, before I had a chance to figure out what the problem was :confused: Thanks for checking.

Metadata Update from @cevich:
- Issue status updated to: Open (was: Closed)

6 years ago

Re-opening this, as I seem to be hitting the same problem with Fedora-Cloud-Base-29_Beta-1.5.x86_64.raw.xz. Output from the serial console stops after

         Starting OpenSSH ecdsa Server Key Generation...
         Starting Login Service...
[  OK  ] Started Login Service.
[  OK  ] Started NTP client/server.

So there's basically no way to access the VM (cloud-init didn't even get to run).

Going to try the steps covered in the docs, since without networking the VM is entirely useless.

Figured out how to preserve the VM's image, then mount it up on another (working) VM. From there, I was able to access the failing VM's journal. Unfortunately, it's not helpful:

Oct 26 21:51:17 localhost.localdomain systemd[1]: Reached target Timers.
Oct 26 21:51:17 localhost.localdomain chronyd[436]: chronyd version 3.3 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SIGND +ASYNCDNS +SECHASH +IPV6 +DEBUG)
Oct 26 21:51:17 localhost.localdomain chronyd[436]: Frequency 0.000 +/- 1000000.000 ppm read from /var/lib/chrony/drift
Oct 26 21:51:17 localhost.localdomain chronyd[436]: Using right/UTC timezone to obtain leap second data
Oct 26 21:51:17 localhost.localdomain systemd-logind[433]: Watching system buttons on /dev/input/event0 (Power Button)
Oct 26 21:51:17 localhost.localdomain systemd-logind[433]: Watching system buttons on /dev/input/event1 (Sleep Button)
Oct 26 21:51:17 localhost.localdomain systemd-logind[433]: Watching system buttons on /dev/input/event2 (AT Translated Set 2 keyboard)
Oct 26 21:51:17 localhost.localdomain systemd-logind[433]: New seat seat0.
Oct 26 21:51:17 localhost.localdomain systemd-vconsole-setup[408]: Setting fonts failed with a "system error", ignoring.
Oct 26 21:51:17 localhost.localdomain dbus-daemon[434]: dbus[434]: Unknown username "systemd-timesync" in message bus configuration file
Oct 26 21:51:17 localhost.localdomain systemd[1]: Started Login Service.
Oct 26 21:51:17 localhost.localdomain audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-logind comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Oct 26 21:51:18 localhost.localdomain systemd[1]: Started NTP client/server.
Oct 26 21:51:18 localhost.localdomain audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=chronyd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Oct 26 21:54:00 localhost.localdomain systemd-logind[433]: Power key pressed.
...and shutdown as instructed...
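For reference, the preserve-and-inspect procedure above can be sketched roughly like this; the instance/disk names, zone, device, and mount paths are all made up for illustration, so adjust to your setup:

```shell
# Keep the failing VM's boot disk around when deleting the instance
# (hypothetical names/zone; verify flags against your gcloud version).
gcloud compute instances delete broken-vm --zone=us-central1-a --keep-disks=boot
gcloud compute instances attach-disk rescue-vm --disk=broken-vm --zone=us-central1-a

# On the rescue VM: mount the dead VM's root partition read-only
# and read its persistent journal directly from disk.
sudo mkdir -p /mnt/broken
sudo mount -o ro /dev/sdb1 /mnt/broken
sudo journalctl --directory=/mnt/broken/var/log/journal --no-pager | tail -n 50
```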

Using my access to the disk, I edited /boot/grub2/grub.cfg, adding console=ttyS0 console=ttyS0,115200n8 rd.break enforcing=0 to the kernel command-line. Detached the disk, booted it up as a new VM, and accessed the serial-console:

Generating "/run/initramfs/rdsosreport.txt"


Entering emergency mode. Exit the shell to continue.
Type "journalctl" to view system logs.
You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot
after mounting them and attach it to a bug report.


switch_root:/#

I'm unsure what to do next, since normally the VM simply "stops" after "Started NTP client/server." - it's literally unresponsive :confounded:

I can try disabling the NTP service I s'pose, maybe that will get it un-stuck. Any/all other suggestions are welcome :smile:
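Incidentally, the grub.cfg edit I did boils down to appending arguments to each kernel command line. A minimal sketch; the sed pattern is an assumption about what the 'linux' lines look like, so eyeball your actual grub.cfg first:

```shell
# Append serial-console/debug arguments to every 'linux' boot entry
# in a grub.cfg (illustrative; verify the pattern matches your file).
add_console_args() {
    sed -i 's|^\([[:space:]]*linux.*\)$|\1 console=ttyS0 console=ttyS0,115200n8 rd.break enforcing=0|' "$1"
}
```

Point it at the mounted disk's /boot/grub2/grub.cfg before detaching the disk.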

Disabled chronyd, now it gets stuck on boot with:

[  OK  ] Reached target Basic System.
         Starting Login Service...
         Starting Initial cloud-init job (pre-networking)...
[  OK  ] Started D-Bus System Message Bus.
[  OK  ] Reached target sshd-keygen.target.
[    8.774534] EDAC sbridge:  Ver: 1.1.2 
[  OK  ] Started Login Service.

Tested with a newer image https://kojipkgs.fedoraproject.org/compose/29/Fedora-29-20181024.1/compose/Cloud/x86_64/images/Fedora-Cloud-Base-29-1.2.x86_64.raw.xz but got a similar result. No networking, and boot seems to get stuck based on serial-port output:

...cut...
         Starting OpenSSH rsa Server Key Generation...
[  OK  ] Started daily update of the root trust anchor for DNSSEC.
[  OK  ] Reached target Timers.
[  OK  ] Started NTP client/server.
[  OK  ] Started System Security Services Daemon.
[  OK  ] Reached target User and Group Name Lookups.
         Starting Login Service...
[  OK  ] Started Login Service.
         Starting Cleanup of Temporary Directories...
[  OK  ] Started Cleanup of Temporary Directories.

I'll try accessing the boot disk and see what logs I can scrape out...

So I tried a boot on AWS EC2 (ami-0d7a9cc499e108f74 in us-east-1) and got a successful boot there, so it looks like we should concentrate the investigation on GCE.

@jdoss @King InuYasha - can you help with that? I haven't climbed the GCE mountain yet and am probably not going to have time to this week.

So here is something interesting I see in your log above from the GCE boot:

Oct 29 20:44:31 localhost.localdomain systemd[1]: Starting Login Service...
Oct 29 20:44:31 localhost.localdomain systemd-logind[461]: Watching system buttons on /dev/input/event0 (Power Button)
Oct 29 20:44:31 localhost.localdomain systemd-logind[461]: Watching system buttons on /dev/input/event1 (Sleep Button)
Oct 29 20:44:31 localhost.localdomain systemd-logind[461]: Watching system buttons on /dev/input/event2 (AT Translated Set 2 keyboard)
Oct 29 20:44:31 localhost.localdomain systemd-logind[461]: New seat seat0.
Oct 29 20:44:31 localhost.localdomain systemd[1]: Started Login Service.
Oct 29 20:44:31 localhost.localdomain audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-logind comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=
Oct 29 20:59:20 localhost.localdomain systemd[1]: Starting Cleanup of Temporary Directories...
Oct 29 20:59:20 localhost.localdomain systemd[1]: Started Cleanup of Temporary Directories.
Oct 29 20:59:20 localhost.localdomain audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-tmpfiles-clean comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? t
Oct 29 20:59:20 localhost.localdomain audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-tmpfiles-clean comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? te
Oct 29 21:42:13 localhost.localdomain systemd-logind[461]: Power key pressed.
Oct 29 21:42:13 localhost.localdomain systemd-logind[461]: Powering Off...
Oct 29 21:42:13 localhost.localdomain systemd-logind[461]: System is powering down.

So it looks like right after the login service gets started, a power-key-pressed event is detected and the thing shuts itself down.

Contrast this with a serial console log from a boot on AWS, where I see:

Oct 30 00:01:15 localhost.localdomain systemd[1]: Starting Login Service...
Oct 30 00:01:15 localhost.localdomain systemd-logind[479]: New seat seat0.
Oct 30 00:01:15 localhost.localdomain systemd-logind[479]: Watching system buttons on /dev/input/event0 (Power Button)
Oct 30 00:01:15 localhost.localdomain systemd-logind[479]: Watching system buttons on /dev/input/event1 (Sleep Button)
Oct 30 00:01:15 localhost.localdomain systemd-logind[479]: Watching system buttons on /dev/input/event2 (AT Translated Set 2 keyboard)
Oct 30 00:01:15 localhost.localdomain systemd[1]: Started Login Service.
Oct 30 00:01:15 localhost.localdomain audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-logind comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Oct 30 00:01:25 localhost.localdomain dhclient[523]: Internet Systems Consortium DHCP Client 4.3.6
Oct 30 00:01:25 localhost.localdomain dhclient[523]: Copyright 2004-2017 Internet Systems Consortium.
Oct 30 00:01:25 localhost.localdomain dhclient[523]: All rights reserved.
Oct 30 00:01:25 localhost.localdomain dhclient[523]: For info, please visit https://www.isc.org/software/dhcp/
Oct 30 00:01:25 localhost.localdomain dhclient[523]: [3B blob data]
Oct 30 00:01:25 localhost.localdomain dhclient[523]: Listening on LPF/eth0/12:52:ab:f5:01:de
Oct 30 00:01:25 localhost.localdomain dhclient[523]: Sending on   LPF/eth0/12:52:ab:f5:01:de
Oct 30 00:01:25 localhost.localdomain dhclient[523]: Sending on   Socket/fallback
Oct 30 00:01:25 localhost.localdomain dhclient[523]: Created duid "\000\004a*\320O\220\264M\177\205\335r6%\332\340\002".
Oct 30 00:01:25 localhost.localdomain dhclient[523]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 4 (xid=0x3e616474)
Oct 30 00:01:25 localhost.localdomain dhclient[523]: DHCPREQUEST on eth0 to 255.255.255.255 port 67 (xid=0x3e616474)
Oct 30 00:01:25 localhost.localdomain dhclient[523]: DHCPOFFER from 10.0.0.1
Oct 30 00:01:25 localhost.localdomain dhclient[523]: DHCPACK from 10.0.0.1 (xid=0x3e616474)
Oct 30 00:01:25 localhost.localdomain dhclient[523]: bound to 10.0.41.19 -- renewal in 1775 seconds.
Oct 30 00:01:25 localhost.localdomain dhclient[526]: receive_packet failed on eth0: Network is down
Oct 30 00:01:25 localhost.localdomain cloud-init[460]: Cloud-init v. 17.1 running 'init-local' at Tue, 30 Oct 2018 00:01:23 +0000. Up 46.91 seconds.
Oct 30 00:01:25 localhost.localdomain cloud-init[460]: 2018-10-30 00:01:25,842 - stages.py[ERROR]: Unable to render networking. Network config is likely broken: No available network renderers found. Searched through list: ['eni', 'sysconfig', 'netplan']
Oct 30 00:01:25 localhost.localdomain systemd[1]: Started Initial cloud-init job (pre-networking).
Oct 30 00:01:25 localhost.localdomain audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=cloud-init-local comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Oct 30 00:01:25 localhost.localdomain systemd[1]: Reached target Network (Pre).
Oct 30 00:01:25 localhost.localdomain systemd[1]: Starting Network Manager...

So it starts NetworkManager right after the login service. It looks like the instance on GCE is getting powered down. Does GCE have some "phone home" stuff that is required that isn't happening?

So it starts NetworkManager right after the login service. It looks like the instance on GCE is getting powered down.

That was me, pressing the 'stop' button and shutting down the VM. You can see from the timestamps that I waited almost an hour. It's stuck waiting for a networking device to come up.

Remember, GCE is using some qemu-kvm version (probably a really old one). Did we disable the virtio-net modules, or was there a change from F28 that made them incompatible with older qemu-kvm? Maybe this suggests a reproducer?

That was me, pressing the 'stop' button and shutting down the VM. You can see from the timestamps that I waited almost an hour. It's stuck waiting for a networking device to come up.

ahh. missed that

I'm going to try mounting up the F29 image in another VM, dnf update -y on it, then install the GCE tooling to check my assumption of it not being necessary.

No love. I installed the GCE stuff and enabled the services, but we're not getting far enough in boot for them to start :( Then I noticed a message, and found it also in the log from yesterday:

Oct 29 20:44:22 localhost systemd-udevd[271]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.

I do not see that message on my CentOS VM (in GCE); instead, NetworkManager simply starts up and we get the normal System eth0 and DHCP messages.

It's a PITA, but it is possible to get the Fedora VM to boot into single-user mode, then access it over the serial console. I'm going to poke around in there some more to see what might be blocking NetworkManager from starting.

We have Fedora 29 Cloud running on GCP when we use the below ansible tasks to add the Google Cloud Compute Repo so we can install the CentOS 7 google-compute-engine-oslogin RPM.

- name: Add Google Cloud repo
  yum_repository:
    name: google-cloud
    description: Google Cloud Compute Repo
    baseurl: https://packages.cloud.google.com/yum/repos/google-cloud-compute-el7-x86_64
    gpgcheck: yes
    gpgkey: https://packages.cloud.google.com/yum/repos/google-cloud-compute-el7-x86_64,https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
    state: present
    enabled: yes

- name: Copy custom google-compute-engine-oslogin package
  copy:
    src: files/google-compute-engine-oslogin-1.3.0-1.el7.x86_64.rpm
    dest: /root/

- name: Install custom google-compute-engine-oslogin package
  dnf:
    name: /root/google-compute-engine-oslogin-1.3.0-1.el7.x86_64.rpm
    state: present

- name: Let's versionlock this custom package
  command: dnf versionlock add google-compute-engine-oslogin

- name: Now install GCE packages
  dnf:
    name: "{{ item }}"
    state: present
  with_items:
    - python-google-compute-engine
    - google-compute-engine

We also did a vanilla Fedora 29 install without the above ansible tasks, and it fails to get on the network as @cevich is describing.

The google services are not required for basic-level networking. You can see this for yourself by looking in a system journal: NetworkManager comes up, enumerates eth0, gets an IP, and THEN (seconds later) the google services all start. I observed this behavior as well when I was setting up F28, so something more fundamental is going wrong here (and/or something specific to my GCE project).

Thinking maybe my F29 boot disk got goofed up after booting it so many times, I re-downloaded the cloud image, then mounted it directly to install Neil's COPR packages. Finally, I imported it into GCE and started a new VM.

Same deal :disappointed:

@dustymabe @jdoss I keep seeing this message in the serial output (across all attempts). Do you see this in your "working" GCE F29 VMs?

Oct 31 13:53:39 localhost systemd-udevd[422]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.

How are you building your images then? We are not seeing any issues getting them on the network with our image build process.

@jdoss I'm roughly following this howto. I also blogged about the steps I used, internally (assuming you can see it). I tried installing the tweaked CentOS packages, as well as using Neil's packages (same result either way).

I've not done anything special in my GCE project; it's pretty much bone-stock the way google gives it to you (as of July). Yesterday, I tried adding a custom network + all-open firewall and starting a VM with that... same deal :disappointed: Now I'm going to try rescue.target over a serial connection and try to suss out what is blocking networking.

Small progress... this is certainly a lower-level networking-device setup issue. I booted the F29 VM after systemctl set-default rescue.target (from a chroot into its boot image, mounted on another VM). From rescue mode:

...connected to serial port...
[root@localhost ~]# lsmod | grep virtio
virtio_net             53248  0
net_failover           20480  1 virtio_net
virtio_scsi            20480  1
[root@localhost ~]# systemctl start NetworkManager
[   71.158354] kauditd_printk_skb: 14 callbacks suppressed
[   71.158355] audit: type=1130 audit(1540999129.659:58): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=dbus comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[   71.731496] audit: type=1130 audit(1540999130.232:59): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=NetworkManager comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[root@localhost ~]# [   71.894559] audit: type=1130 audit(1540999130.395:60): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-hostnamed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[   72.174890] audit: type=1130 audit(1540999130.675:61): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=NetworkManager-dispatcher comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[   72.242262] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   73.013460] audit: type=1130 audit(1540999131.513:62): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=polkit comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[   83.039852] audit: type=1131 audit(1540999141.540:63): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=NetworkManager-dispatcher comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[  101.955868] audit: type=1131 audit(1540999160.455:64): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-hostnamed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'

[root@localhost ~]# 
[root@localhost ~]# nmcli 
eth0: connecting (getting IP configuration) to System eth0
        "Red Hat Virtio"
        ethernet (virtio_net), 42:01:0A:80:00:05, hw, mtu 1500

lo: unmanaged
        "lo"
        loopback (unknown), 00:00:00:00:00:00, sw, mtu 65536

Use "nmcli device show" to get complete information about known devices and
"nmcli connection show" to get an overview on active connection profiles.

Consult nmcli(1) and nmcli-examples(5) manual pages for complete usage details.
[root@localhost ~]# nmcli dev show eth0
GENERAL.DEVICE:                         eth0
GENERAL.TYPE:                           ethernet
GENERAL.HWADDR:                         42:01:0A:80:00:05
GENERAL.MTU:                            1500
GENERAL.STATE:                          70 (connecting (getting IP configuratio>
GENERAL.CONNECTION:                     System eth0
GENERAL.CON-PATH:                       /org/freedesktop/NetworkManager/ActiveC>
WIRED-PROPERTIES.CARRIER:               on
[root@localhost ~]# nmcli con
NAME         UUID                                  TYPE      DEVICE 
System eth0  5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03  ethernet  --     
[root@localhost ~]# nmcli con show 5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03
connection.id:                          System eth0
connection.uuid:                        5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03
connection.stable-id:                   --
connection.type:                        802-3-ethernet
connection.interface-name:              eth0
connection.autoconnect:                 yes
connection.autoconnect-priority:        0
connection.autoconnect-retries:         -1 (default)
connection.auth-retries:                -1
connection.timestamp:                   0
connection.read-only:                   no
connection.permissions:                 --
connection.zone:                        --
connection.master:                      --
connection.slave-type:                  --
connection.autoconnect-slaves:          -1 (default)
connection.secondaries:                 --
connection.gateway-ping-timeout:        0
connection.metered:                     unknown
connection.lldp:                        default
connection.mdns:                        -1 (default)
802-3-ethernet.port:                    --
802-3-ethernet.speed:                   0

Ohhh, I got it to connect! The bad news is, I have no idea how/why it worked:

[root@localhost ~]# nmcli dev con eth0
...waited a minute, then hit ctrl-c...
[  525.689703] random: crng init done
[  525.693303] random: 7 urandom warning(s) missed due to ratelimiting
Error: nmcli terminated by signal Interrupt (2)
[root@localhost ~]# [  525.876868] audit: type=1130 audit(1540999584.377:67): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=NetworkManager-dispatcher comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[  525.945024] audit: type=1130 audit(1540999584.445:68): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-hostnamed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[root@localhost ~]# nmcli dev show eth0
GENERAL.DEVICE:                         eth0
GENERAL.TYPE:                           ethernet
GENERAL.HWADDR:                         42:01:0A:80:00:05
GENERAL.MTU:                            1460
GENERAL.STATE:                          100 (connected)
GENERAL.CONNECTION:                     System eth0
GENERAL.CON-PATH:                       /org/freedesktop/NetworkManager/ActiveC>
WIRED-PROPERTIES.CARRIER:               on
IP4.ADDRESS[1]:                         10.128.0.5/32
IP4.GATEWAY:                            10.128.0.1
IP4.ROUTE[1]:                           dst = 10.128.0.1/32, nh = 0.0.0.0, mt =>
IP4.ROUTE[2]:                           dst = 0.0.0.0/0, nh = 10.128.0.1, mt = >
IP4.ROUTE[3]:                           dst = 10.128.0.5/32, nh = 0.0.0.0, mt =>
IP4.DNS[1]:                             169.254.169.254
...cut...

and for the ultimate test:

[cevich@cevich dockerfiles]$ ping <EXTERNAL_IP>
PING <EXTERNAL_IP> (<EXTERNAL_IP>) 56(84) bytes of data.
64 bytes from <EXTERNAL_IP>: icmp_seq=1 ttl=51 time=44.5 ms
64 bytes from <EXTERNAL_IP>: icmp_seq=2 ttl=51 time=43.9 ms
64 bytes from <EXTERNAL_IP>: icmp_seq=3 ttl=51 time=44.6 ms
^C
--- <EXTERNAL_IP> ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 43.932/44.381/44.612/0.317 ms

So WTF is going on, that my hitting ctrl-c unblocked NetworkManager and/or the ethernet device? @jdoss that's the $34B question :smile:

Immediately upon interrupting nmcli dev con eth0, I got those two crng messages. Could we perhaps have some kind of circular dependency here, where NetworkManager (or the DHCP client) needs random numbers, but the kernel's entropy pool is sparse because networking isn't up yet (and disk activity is predictable on boot)?

I'll poke around the hive-mind (google) to see what I can find...

Ah ha! This has got to be the smoking gun:

[root@localhost ~]# cat /proc/sys/kernel/random/entropy_avail
37
[root@localhost ~]# cat /sys/class/misc/hw_random/rng_current
none

Explanation is here (ignoring the unrelated higher-level topic): https://wiki.yoctoproject.org/wiki/Entropy_on_Autobuilders

It also explains why @jdoss doesn't see the problem: part of his build process must be introducing additional entropy, or perhaps a more active GCE project makes this happen faster(?). Our little GCE project has perhaps 3-4 VMs running at any given time :pensive:
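The "smoking gun" condition above can be written down as a tiny check. This is purely illustrative; the 256-bit threshold is my own cutoff for "comfortably initialized", not a kernel constant:

```shell
# Decide whether a VM looks entropy-starved, given the raw contents of
# /proc/sys/kernel/random/entropy_avail and /sys/class/misc/hw_random/rng_current.
# Starved = almost nothing in the pool AND no hardware RNG feeding it.
entropy_starved() {
    bits="$1"
    hw_rng="$2"
    if [ "$bits" -lt 256 ] && { [ -z "$hw_rng" ] || [ "$hw_rng" = "none" ]; }; then
        echo "starved"
    else
        echo "ok"
    fi
}

# The values observed on the hung GCE VM:
entropy_starved 37 none
# A VM with a virtio-rng device would report something like:
entropy_starved 37 virtio_rng.0
```

On a live system you would feed it `$(cat /proc/sys/kernel/random/entropy_avail)` and `$(cat /sys/class/misc/hw_random/rng_current)`.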

So, since I can't poke google into setting -device virtio-rng-pci in their KVM setup, what can I do from the VM side to allow it to get past this initial entropy starvation?

Ahhhh. This makes sense. We use Vagrant to make a bunch of changes to our base image, stop the KVM VM, convert the .img file to .raw with qemu-img convert, and upload/import it with gcloud. I assume this is enough to kick enough entropy into the VM to not produce this problem.

Short term I would maybe consider the above process to unblock you. Long term, I am not sure what the best path forward is right now. It sure would be nice to have official Fedora Cloud (Server) images on GCP so we didn't have to do these images manually.

I have some contacts at Google Cloud, I will reach out and see if I can get some better ideas from them.

@jdoss I actually found an easier solution and workaround. While you're in the image anyway (before converting):

  dnf -y copr enable ngompa/gce-oslogin
  dnf -y install rng-tools google-compute-engine google-compute-engine-oslogin
  dnf -y update
  systemctl enable rngd    # the critical part
  sync
  fstrim -v -a
  ...profit!

The entropy thing is going to be a problem for anybody who uses our cloud images anywhere (including on AWS). I'm wondering if maybe we should be installing/enabling rngd by default. Sure, it's not a perfect solution, but it's a hell of a lot better than spending three days troubleshooting why your VM won't boot :confounded: @dustymabe whatcha think?

I have some contacts at Google Cloud, I will reach out and see if I can get some better ideas from them.

Since they're using KVM, the absolute BEST thing they could do for everyone is add -device virtio-rng-pci (or equivalent) as a default command-line option for every instance. It doesn't cost them a thing, and VMs will automatically pick this up as their entropy source.
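For context, on a plain qemu-kvm command line that device is wired up something like this (a sketch; the image path and machine options are made up, but the rng object/device pairing is the standard form):

```shell
# Hypothetical qemu invocation fragment: expose a virtio RNG device backed by
# the host's /dev/urandom so the guest's crng initializes promptly at boot.
qemu-system-x86_64 \
    -machine q35,accel=kvm -m 2048 \
    -drive file=fedora29.raw,format=raw \
    -object rng-random,id=rng0,filename=/dev/urandom \
    -device virtio-rng-pci,rng=rng0
```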

@cevich my contacts at GCP got back to me on this issue, and it seems they were able to get google-compute-enable-virtio-rng=true added as a metadata setting or a project-wide setting.

GCE (Google Cloud Engine) now supports project/instance metadata 'google-compute-enable-virtio-rng=true' to get a virtio-rng device.

https://bugzilla.redhat.com/show_bug.cgi?id=1572944#c79

This should help with this issue moving forward.

Hoooo yaay! That's fantastic news, thanks @jdoss! I shall enable it project-wide immediately!
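For the record, enabling it presumably looks something like this (hypothetical invocations; the instance name and zone are made up, so verify against the current gcloud docs):

```shell
# Project-wide:
gcloud compute project-info add-metadata \
    --metadata google-compute-enable-virtio-rng=true

# Or per-instance (name/zone illustrative):
gcloud compute instances add-metadata my-vm --zone=us-central1-a \
    --metadata google-compute-enable-virtio-rng=true
```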

Metadata Update from @cevich:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

6 years ago
