#452 F28 Atomic Beta and Openshift Origin 3.9 results in nodes NotReady
Closed: Fixed 6 years ago Opened 6 years ago by jdoss.

I have been trying to get a local install of Openshift Origin 3.9 going on a fresh Fedora Atomic 28 server install and the openshift-ansible installer fails because none of the nodes get put into the Ready state.

# oc get nodes
NAME            STATUS     ROLES     AGE       VERSION
atomic28   NotReady   master    12h       v1.9.1+a0ce1bc657
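
For anyone re-checking this after a fix, a small filter over the same output can list the nodes that are still not Ready. This is only a sketch: the function name `not_ready_nodes` is made up for illustration, and it assumes the default column layout shown above (NAME first, STATUS second).

```shell
#!/bin/sh
# Read `oc get nodes` output on stdin and print the names of nodes whose
# STATUS column does not start with "Ready" (so "NotReady" is reported,
# while "Ready" and "Ready,SchedulingDisabled" are not).
not_ready_nodes() {
    awk 'NR > 1 && $2 !~ /^Ready/ { print $1 }'
}

# On a live cluster (assumes oc is installed and logged in):
# oc get nodes | not_ready_nodes
```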

Doing oc describe nodes shows this error:

# oc describe nodes
*snip*
Warning  KubeletSetupFailed       20m                kubelet, apex.example.net  Failed to start ContainerManager Delegation not available for unit type
*snip*

Which led me to this BZ https://bugzilla.redhat.com/show_bug.cgi?id=1558425 which seems to think this issue is due to the recent changes in systemd 238 that remove can_delegate from slices.

Is there an easy way to downgrade just systemd on Atomic to test to see if that fixes the issue?


Is there an easy way to downgrade just systemd on Atomic to test to see if that fixes the issue?

Grab an older version of the systemd RPMs then try rpm-ostree override replace <systemd1.rpm> <systemd2.rpm> ...
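
A rough sketch of that suggestion, written as a dry run so nothing is changed until the printed command has been reviewed. The helper name `print_override_cmd` and the download directory are assumptions for illustration; only `rpm-ostree override replace` itself comes from the comment above.

```shell
#!/bin/sh
# Dry-run helper: print (do not run) the rpm-ostree command that would
# replace the base systemd packages with the RPMs found in a directory.
# $1 = directory holding the downloaded RPMs (a hypothetical location)
print_override_cmd() {
    dir=$1
    set -- "$dir"/systemd*.rpm
    # If the glob matched nothing, "$1" is still the literal pattern.
    [ -e "$1" ] || { echo "no systemd RPMs in $dir" >&2; return 1; }
    echo "rpm-ostree override replace $*"
}

# Example (hypothetical directory):
# print_override_cmd "$HOME/systemd-237"
```

The printed command would then be run as root and followed by a reboot. If the new deployment does not boot, the previous deployment should still be selectable from the boot menu.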

I've also hit this issue, but haven't found an older version of systemd that I've been able to use override replace with and then successfully boot up the host. For example, I used this version of systemd-237

https://kojipkgs.fedoraproject.org//packages/systemd/237/6.git84c8da5.fc28/x86_64/systemd-container-237-6.git84c8da5.fc28.x86_64.rpm

...and it appears that systemd-tmpfiles-setup fails to start and that sends the whole system into a tizzy.

# journalctl -b -u systemd-tmpfiles-setup --no-pager 
-- Logs begin at Tue 2018-04-10 20:26:09 UTC, end at Tue 2018-04-10 20:30:50 UTC. --
Apr 10 20:26:09 localhost systemd-tmpfiles[200]: [/usr/lib/tmpfiles.d/rpcbind.conf:2] Unknown user 'rpc'.
Apr 10 20:26:09 localhost systemd-tmpfiles[200]: [/usr/lib/tmpfiles.d/systemd.conf:11] Unknown group 'utmp'.
Apr 10 20:26:09 localhost systemd-tmpfiles[200]: [/usr/lib/tmpfiles.d/systemd.conf:19] Unknown user 'systemd-network'.
Apr 10 20:26:09 localhost systemd-tmpfiles[200]: [/usr/lib/tmpfiles.d/systemd.conf:20] Unknown user 'systemd-network'.
Apr 10 20:26:09 localhost systemd-tmpfiles[200]: [/usr/lib/tmpfiles.d/systemd.conf:21] Unknown user 'systemd-network'.
Apr 10 20:26:09 localhost systemd-tmpfiles[200]: [/usr/lib/tmpfiles.d/systemd.conf:25] Unknown group 'systemd-journal'.
Apr 10 20:26:09 localhost systemd-tmpfiles[200]: [/usr/lib/tmpfiles.d/systemd.conf:26] Unknown group 'systemd-journal'.
Apr 10 20:26:09 localhost systemd-tmpfiles[200]: [/usr/lib/tmpfiles.d/systemd.conf:32] Unknown group 'systemd-journal'.
Apr 10 20:26:09 localhost systemd-tmpfiles[200]: [/usr/lib/tmpfiles.d/systemd.conf:33] Unknown group 'systemd-journal'.
Apr 10 20:26:09 localhost systemd-tmpfiles[200]: [/usr/lib/tmpfiles.d/systemd.conf:34] Unknown group 'systemd-journal'.
Apr 10 20:26:09 localhost systemd[1]: Started Create Volatile Files and Directories.
Apr 10 20:26:11 localhost systemd[1]: Stopped Create Volatile Files and Directories.
Apr 10 20:26:12 atomichost-by-dustymabe systemd[1]: Starting Create Volatile Files and Directories...
Apr 10 20:26:12 atomichost-by-dustymabe systemd-tmpfiles[710]: "/home" already exists and is not a directory.
Apr 10 20:26:12 atomichost-by-dustymabe systemd-tmpfiles[710]: "/srv" already exists and is not a directory.
Apr 10 20:26:12 atomichost-by-dustymabe systemd-tmpfiles[710]: "/tmp" already exists and is not a directory.
Apr 10 20:26:12 atomichost-by-dustymabe systemd-tmpfiles[710]: Unable to fix SELinux security context of /tmp/.X11-unix: Read-only file system
Apr 10 20:26:12 atomichost-by-dustymabe systemd-tmpfiles[710]: Unable to fix SELinux security context of /tmp/.ICE-unix: Read-only file system
Apr 10 20:26:12 atomichost-by-dustymabe systemd-tmpfiles[710]: Unable to fix SELinux security context of /tmp/.font-unix: Read-only file system
Apr 10 20:26:12 atomichost-by-dustymabe systemd[1]: systemd-tmpfiles-setup.service: Main process exited, code=exited, status=1/FAILURE
Apr 10 20:26:12 atomichost-by-dustymabe systemd[1]: systemd-tmpfiles-setup.service: Failed with result 'exit-code'.
Apr 10 20:26:12 atomichost-by-dustymabe systemd[1]: Failed to start Create Volatile Files and Directories.

Following the BZ to the various upstream issues, it looks like this might be fixed with a change to runc

See the following PR - https://github.com/opencontainers/runc/pull/1776

Specifically this comment - https://github.com/opencontainers/runc/pull/1776#issuecomment-380206972

I hit this w/ kube, and did this workaround to get past it for now:

# cp /usr/lib/systemd/system/docker.service /etc/systemd/system/

# sed -i 's/cgroupdriver=systemd/cgroupdriver=cgroupfs/' /etc/systemd/system/docker.service

# sed -i 's/cgroup-driver=systemd/cgroup-driver=cgroupfs/' /etc/systemd/system/kubelet.service.d/kubeadm.conf

# systemctl daemon-reload

# systemctl restart docker
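
The workaround above boils down to two sed substitutions. Here is a minimal sketch that applies them to scratch copies first, so the result can be checked before touching the real unit files. The sample file contents are assumptions for illustration, not the actual contents of the Fedora docker.service or kubeadm.conf, and `switch_to_cgroupfs` is a made-up helper name.

```shell
#!/bin/sh
# Rewrite the cgroup driver option in a file, in place.
# $1 = file, $2 = option name (cgroupdriver for docker, cgroup-driver for kubelet)
switch_to_cgroupfs() {
    sed -i "s/$2=systemd/$2=cgroupfs/" "$1"
}

# Scratch copies with assumed sample contents (not the real unit files).
workdir=$(mktemp -d)
printf 'ExecStart=/usr/bin/dockerd --exec-opt native.cgroupdriver=systemd\n' > "$workdir/docker.service"
printf 'Environment="KUBELET_CGROUP_ARGS=--cgroup-driver=systemd"\n' > "$workdir/kubeadm.conf"

switch_to_cgroupfs "$workdir/docker.service" cgroupdriver
switch_to_cgroupfs "$workdir/kubeadm.conf" cgroup-driver

# Show the rewritten lines so they can be inspected.
grep -H 'cgroupfs' "$workdir/docker.service" "$workdir/kubeadm.conf"
```

Both daemons have to agree on the driver, so the kubelet presumably needs a restart as well as docker once the real files have been edited.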

Metadata Update from @miabbott:
- Issue tagged with: bug, meeting

6 years ago

Metadata Update from @walters:
- Issue assigned to walters

6 years ago

I don't know the reasoning for the systemd vs cgroupfs driver issue, but here's some discussion from when CoreOS switched from systemd to cgroupfs: https://github.com/coreos/bugs/issues/1435 -- is the dependency on systemd here important to us?

is the dependency on systemd here important to us?

Mmm. I don't think we should change architecture in response to a bug, at least not immediately. I personally like the idea of integrating with systemd but it's a very complex topic.

Anyways so that patch doesn't apply to the runc vendored in docker, but it WFM if I bind mount it over, and restart docker. So we could update the runc vendored in our docker and try that?

So we could update the runc vendored in our docker and try that?

Yes please. I've asked @lsm5 to do so, unless someone else is able to

A scratch build of the docker package to work around this issue:

f28: https://koji.fedoraproject.org/koji/taskinfo?taskID=26334563

Please test it out and report if things work for you.

It looks like this needs to get solved in docker and its vendored version of runc, as well as in kubernetes

@runcom pointed at this PR that looks like it would be part of the kube fix - https://github.com/kubernetes/kubernetes/pull/61926

should we consider applying the patches from jason's earlier comment directly to the ostree for f28? It seems like the kube changes are going to take a while to apply

Agreed, @dustymabe. I think applying the workaround would make sense.

should we consider applying the patches from jason's earlier comment directly to the ostree for f28? It seems like the kube changes are going to take a while to apply

Let's consider it, but also realize that we don't have any testing/exposure to using the cgroupfs driver in the Fedora ecosystem (as far as I know).

There are plenty of places upstream where it is used (and preferred!) over the systemd driver, so it's not a huge risk, in my opinion. But we should be cognizant of the possibility of problems.

Also, if we include the workaround in F28 and we get the fixes we want, what is the process/impact of reverting back to the systemd driver?

There's also the option of reverting the systemd commit.

There's also the option of reverting the systemd commit.

Can you explore that option? I honestly don't know how big of a change this was or how much of a big deal it would be to revert it now. It would certainly help us support kube/openshift in the short term if we reverted and had this land at a later time.

This seems to be all based on a misunderstanding: systemd never allowed (*) other entities to muck around with parts of the cgroup hierarchy that it manages, including slice units. The difference with systemd-238 is that it's slightly clearer about this, e.g. setting Delegate would be silently ignored before and results in an error now. See https://github.com/systemd/systemd/issues/8645 for another discussion of this.

In particular, .slice units would be created "on demand", i.e. when another .service/.slice/whatever nested unit was requested, the parent .slice would be created, and destroyed when systemd thinks it's not needed any more.

(*) "allow" needs a clarification: there is no enforcement of this, because a user space process cannot prevent another privileged process from changing the cgroup hierarchy. So "allow"/"disallow" here is at the level of "please don't do this" or "you get to keep the pieces".
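
To make the version boundary described in this comment concrete, here is a small sketch that maps a systemd version string to the behaviour the comment describes for Delegate= on a slice (silently ignored before 238, an error from 238 on). The function name `slice_delegate_behaviour` is hypothetical, and the sketch assumes the first line of `systemctl --version` has the form "systemd NNN".

```shell
#!/bin/sh
# Print how systemd of a given version treats Delegate= on a slice,
# following the description in this thread: "ignored" before 238,
# "error" from 238 onwards.
# $1 = first line of `systemctl --version`, e.g. "systemd 238"
slice_delegate_behaviour() {
    ver=$(printf '%s\n' "$1" | awk '{ print $2; exit }')
    ver=${ver%%[!0-9]*}          # keep only the leading digits
    [ -n "$ver" ] || { echo unknown; return 1; }
    if [ "$ver" -ge 238 ]; then echo error; else echo ignored; fi
}

# On a live host (assumes systemctl is available):
# slice_delegate_behaviour "$(systemctl --version | head -n1)"
```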

There's also the option of reverting the systemd commit.

This could be discussed as an option to do this as a hack to get things to work temporarily, but it's not a long term solution.

This could be discussed as an option to do this as a hack to get things to work temporarily, but it's not a long term solution.

Yes! I think we all agree we don't want to "revert" the change for the long term. We just want to revert for now until other upstream projects (kube/runc/openshift) have had a chance to get in fixes to work with the new change.

So the options I see are:

  1. don't revert, flail around and try to get upstream PRs into f28
  2. don't revert, try to add some "glue" config like jason's comment to atomic host
  3. revert, change back in f28 once kube/runc/docker/openshift have merged in and properly tested/soaked all changes
  4. revert in f28, leave in place for f29+ (leave it out of f28)

@zbyszek WDYT?

Can somebody explain what runc does with the cgroup hierarchy of a slice on which it has set Delegate=yes?

@zbyszek there is an upstream fix for it https://github.com/opencontainers/runc/pull/1776

It looks like runc previously made no distinction between a scope and a slice: it used a transient scope to test for Delegate= support, but then relied on that result for slices as well. The fix is that it now tests for Delegate= support separately.

Can you please check the patch though? From what I understand a slice has never supported Delegate=, so the additional check should probably be dropped, and runc should simply not try Delegate= at all when a slice is used.
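
The alternative suggested here, deciding by unit type instead of probing, could look roughly like the sketch below. It is an illustration in shell only, not runc's actual code, and the helper name `delegate_property` is made up. It handles just the two unit types discussed in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print the Delegate= property to request for a unit,
# or nothing when delegation should not be requested.
# $1 = unit name, e.g. "docker-abc.scope" or "kubepods.slice"
delegate_property() {
    case "$1" in
        *.scope) echo "Delegate=yes" ;;  # scopes: request delegation
        *.slice) ;;                      # slices: never request Delegate=
        *) ;;                            # other unit types: left out of this sketch
    esac
}

# Example:
# delegate_property "docker-abc.scope"   # prints Delegate=yes
# delegate_property "kubepods.slice"     # prints nothing
```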

I saw the patch, and I see the check that it does. But it doesn't answer why. In particular, I'd like to understand if runc actually starts .slice units with Delegate=yes, and if it tries to touch the cgroup hierarchy that systemd sets up for this unit, and if it tries to create a sub-hierarchy underneath that unit.

Yes, runc makes changes to slices/scopes created by systemd as systemd doesn't support all the knobs that runc needs.

ok, talked with @mpatel and @zbyszek in #atomic earlier. We are going to try for option 4. I opened BZ1568594 for this and proposed it as an FE.

this fix has been pushed to stable. tomorrow's run of f28 should have this in it so please wait 24 hours and then test!

Metadata Update from @dustymabe:
- Issue untagged with: meeting
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

6 years ago

Metadata Update from @dustymabe:
- Issue tagged with: F28, host

6 years ago
