#452 F28 Atomic Beta and Openshift Origin 3.9 results in nodes NotReady
Closed: Fixed 6 years ago Opened 6 years ago by jdoss.

I have been trying to get a local install of Openshift Origin 3.9 going on a fresh Fedora Atomic 28 server install and the openshift-ansible installer fails because none of the nodes get put into the Ready state.

# oc get nodes
NAME            STATUS     ROLES     AGE       VERSION
atomic28   NotReady   master    12h       v1.9.1+a0ce1bc657

Doing oc describe nodes shows this error:

# oc describe nodes
*snip*
Warning  KubeletSetupFailed       20m                kubelet, apex.example.net  Failed to start ContainerManager Delegation not available for unit type
*snip*

Which led me to this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1558425, which suggests this issue is due to the recent changes in systemd 238 that remove can_delegate from slices.

Is there an easy way to downgrade just systemd on Atomic to test to see if that fixes the issue?


Is there an easy way to downgrade just systemd on Atomic to test to see if that fixes the issue?

Grab an older version of the systemd RPMs, then try rpm-ostree override replace <systemd1.rpm> <systemd2.rpm> ...
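
Something along these lines, for example (the package set and URLs here are only an illustration; match the versions to whatever your compose shipped, and you'll likely need the matching systemd-libs/systemd-pam/systemd-udev RPMs from the same build too):

# curl -LO https://kojipkgs.fedoraproject.org//packages/systemd/237/6.git84c8da5.fc28/x86_64/systemd-237-6.git84c8da5.fc28.x86_64.rpm
# curl -LO https://kojipkgs.fedoraproject.org//packages/systemd/237/6.git84c8da5.fc28/x86_64/systemd-libs-237-6.git84c8da5.fc28.x86_64.rpm
# rpm-ostree override replace ./systemd-237-*.rpm ./systemd-libs-237-*.rpm
# systemctl reboot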

I've also hit this issue, but I haven't found an older version of systemd that I can override replace and still boot the host successfully. For example, I used this version of systemd-237:

https://kojipkgs.fedoraproject.org//packages/systemd/237/6.git84c8da5.fc28/x86_64/systemd-container-237-6.git84c8da5.fc28.x86_64.rpm

...and it appears that systemd-tmpfiles-setup fails to start, which sends the whole system into a tizzy.

# journalctl -b -u systemd-tmpfiles-setup --no-pager 
-- Logs begin at Tue 2018-04-10 20:26:09 UTC, end at Tue 2018-04-10 20:30:50 UTC. --
Apr 10 20:26:09 localhost systemd-tmpfiles[200]: [/usr/lib/tmpfiles.d/rpcbind.conf:2] Unknown user 'rpc'.
Apr 10 20:26:09 localhost systemd-tmpfiles[200]: [/usr/lib/tmpfiles.d/systemd.conf:11] Unknown group 'utmp'.
Apr 10 20:26:09 localhost systemd-tmpfiles[200]: [/usr/lib/tmpfiles.d/systemd.conf:19] Unknown user 'systemd-network'.
Apr 10 20:26:09 localhost systemd-tmpfiles[200]: [/usr/lib/tmpfiles.d/systemd.conf:20] Unknown user 'systemd-network'.
Apr 10 20:26:09 localhost systemd-tmpfiles[200]: [/usr/lib/tmpfiles.d/systemd.conf:21] Unknown user 'systemd-network'.
Apr 10 20:26:09 localhost systemd-tmpfiles[200]: [/usr/lib/tmpfiles.d/systemd.conf:25] Unknown group 'systemd-journal'.
Apr 10 20:26:09 localhost systemd-tmpfiles[200]: [/usr/lib/tmpfiles.d/systemd.conf:26] Unknown group 'systemd-journal'.
Apr 10 20:26:09 localhost systemd-tmpfiles[200]: [/usr/lib/tmpfiles.d/systemd.conf:32] Unknown group 'systemd-journal'.
Apr 10 20:26:09 localhost systemd-tmpfiles[200]: [/usr/lib/tmpfiles.d/systemd.conf:33] Unknown group 'systemd-journal'.
Apr 10 20:26:09 localhost systemd-tmpfiles[200]: [/usr/lib/tmpfiles.d/systemd.conf:34] Unknown group 'systemd-journal'.
Apr 10 20:26:09 localhost systemd[1]: Started Create Volatile Files and Directories.
Apr 10 20:26:11 localhost systemd[1]: Stopped Create Volatile Files and Directories.
Apr 10 20:26:12 atomichost-by-dustymabe systemd[1]: Starting Create Volatile Files and Directories...
Apr 10 20:26:12 atomichost-by-dustymabe systemd-tmpfiles[710]: "/home" already exists and is not a directory.
Apr 10 20:26:12 atomichost-by-dustymabe systemd-tmpfiles[710]: "/srv" already exists and is not a directory.
Apr 10 20:26:12 atomichost-by-dustymabe systemd-tmpfiles[710]: "/tmp" already exists and is not a directory.
Apr 10 20:26:12 atomichost-by-dustymabe systemd-tmpfiles[710]: Unable to fix SELinux security context of /tmp/.X11-unix: Read-only file system
Apr 10 20:26:12 atomichost-by-dustymabe systemd-tmpfiles[710]: Unable to fix SELinux security context of /tmp/.ICE-unix: Read-only file system
Apr 10 20:26:12 atomichost-by-dustymabe systemd-tmpfiles[710]: Unable to fix SELinux security context of /tmp/.font-unix: Read-only file system
Apr 10 20:26:12 atomichost-by-dustymabe systemd[1]: systemd-tmpfiles-setup.service: Main process exited, code=exited, status=1/FAILURE
Apr 10 20:26:12 atomichost-by-dustymabe systemd[1]: systemd-tmpfiles-setup.service: Failed with result 'exit-code'.
Apr 10 20:26:12 atomichost-by-dustymabe systemd[1]: Failed to start Create Volatile Files and Directories.

Following the BZ to the various upstream issues, it looks like this might be fixed with a change to runc.

See the following PR - https://github.com/opencontainers/runc/pull/1776

Specifically this comment - https://github.com/opencontainers/runc/pull/1776#issuecomment-380206972

I hit this w/ kube, and did this workaround to get past it for now:

# cp /usr/lib/systemd/system/docker.service /etc/systemd/system/
# sed -i 's/cgroupdriver=systemd/cgroupdriver=cgroupfs/' /etc/systemd/system/docker.service
# sed -i 's/cgroup-driver=systemd/cgroup-driver=cgroupfs/' /etc/systemd/system/kubelet.service.d/kubeadm.conf
# systemctl daemon-reload
# systemctl restart docker
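
(One assumption on my part, not something from the workaround above: the kubeadm.conf edit presumably also needs the kubelet restarted before it takes effect, e.g.:)

# systemctl restart kubelet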

Metadata Update from @miabbott:
- Issue tagged with: bug, meeting

6 years ago

Metadata Update from @walters:
- Issue assigned to walters

6 years ago

I don't know the reasoning behind the systemd vs cgroupfs driver issue, but here's some discussion from when CoreOS switched from systemd to cgroupfs: https://github.com/coreos/bugs/issues/1435 -- is the dependency on systemd here important to us?

is the dependency on systemd here important to us?

Mmm. I don't think we should change the architecture in response to a bug, at least not immediately. I personally like the idea of integrating with systemd, but it's a very complex topic.

Anyway, that patch doesn't apply to the runc vendored into docker, but it WFM if I bind mount a patched runc over it and restart docker. So we could update the runc vendored into our docker and try that?
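
For reference, the bind mount I mean is roughly this (a sketch; /root/runc-patched is wherever you built the patched runc, and <vendored-runc-path> is whatever path rpm -ql reports on your host):

# rpm -ql docker | grep runc
# mount --bind /root/runc-patched <vendored-runc-path>
# systemctl restart docker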

So we could update the runc vendored into our docker and try that?

Yes please. I've asked @lsm5 to do so, unless someone else is able to.

A scratch build of the docker package to work around this issue:

f28: https://koji.fedoraproject.org/koji/taskinfo?taskID=26334563

Please test it out and report if things work for you.
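
A rough way to test it (commands are a sketch; the task ID is the one linked above, and the exact set of docker subpackages you need to replace may differ on your host):

# koji download-task --arch=x86_64 26334563
# rpm-ostree override replace ./docker-*.rpm
# systemctl reboot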

It looks like this needs to be solved in docker and its vendored copy of runc, as well as in kubernetes.

@runcom pointed at this PR that looks like it would be part of the kube fix - https://github.com/kubernetes/kubernetes/pull/61926

should we consider applying the patches from jason's earlier comment directly to the ostree for f28? It seems like the kube changes are going to take a while to apply

Agreed, @dustymabe. I think applying the workaround would make sense.

should we consider applying the patches from jason's earlier comment directly to the ostree for f28? It seems like the kube changes are going to take a while to apply

Let's consider it, but also realize that we don't have any testing/exposure to using the cgroupfs driver in the Fedora ecosystem (as far as I know).

There are plenty of places upstream where it is used (and preferred!) over the systemd driver, so it's not a huge risk, in my opinion. But we should be cognizant of the possibility of problems.

Also, if we include the workaround in F28 and we get the fixes we want, what is the process/impact of reverting back to the systemd driver?

There's also the option of reverting the systemd commit.

There's also the option of reverting the systemd commit.

Can you explore that option? I honestly don't know how big of a change this was or how much of a big deal it would be to revert it now. It would certainly help us support kube/openshift in the short term if we reverted and had this land at a later time.

This seems to be all based on a misunderstanding: systemd never allowed (*) other entities to muck around with parts of the cgroup hierarchy that it manages, including slice units. The difference with systemd 238 is that it's slightly clearer about this, e.g. setting Delegate= on a slice used to be silently ignored and now results in an error. See https://github.com/systemd/systemd/issues/8645 for another discussion of this.

In particular, .slice units would be created "on demand", i.e. when another nested .service/.slice/whatever unit was requested, the parent .slice would be created, and it would be destroyed when systemd thinks it's not needed any more.

(*) "allow" needs a clarification: there is no enforcement of this, because a user space process cannot prevent another privileged process from changing the cgroup hierarchy. So "allow"/"disallow" here is at the level of "please don't do this" or "you get to keep the pieces".

There's also the option of reverting the systemd commit.

This could be discussed as an option, as a hack to get things working temporarily, but it's not a long-term solution.

This could be discussed as an option, as a hack to get things working temporarily, but it's not a long-term solution.

Yes! I think we all agree we don't want to "revert" the change for the long term. We just want to revert for now, until other upstream projects (kube/runc/openshift) have had a chance to get fixes in to work with the new change.

So the options I see are:

  • don't revert; flail around and try to get the upstream PRs into f28
  • don't revert; try to add some "glue" config like jason's comment to atomic host
  • revert, then change back in f28 once kube/runc/docker/openshift have merged and properly tested/soaked all the fixes
  • revert in f28 only (i.e. leave the systemd change out of f28) and keep the change in place for f29+

@zbyszek WDYT?

Can somebody explain what runc does with the cgroup hierarchy of a slice on which it has set Delegate=yes?

@zbyszek there is an upstream fix for it https://github.com/opencontainers/runc/pull/1776

It looks like runc previously wasn't distinguishing between a scope and a slice: it used a transient scope to test for Delegate= support, but then used Delegate= with slices as well. The fix is that it now tests for Delegate= support separately.

Can you please check the patch though? From what I understand a slice has never supported Delegate=, so probably the additional check should be dropped and Delegate= simply not attempted at all when a slice is used.

I saw the patch, and I see the check that it does. But it doesn't answer why. In particular, I'd like to understand whether runc actually starts .slice units with Delegate=yes, whether it tries to touch the cgroup hierarchy that systemd sets up for this unit, and whether it tries to create a sub-hierarchy underneath that unit.

Yes, runc makes changes to slices/scopes created by systemd as systemd doesn't support all the knobs that runc needs.
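
For example (paths here are only illustrative of the cgroup v1 layout with the systemd driver; <container-id> is a placeholder), runc writes directly to per-container knobs like these under the scope that systemd created:

# cat /sys/fs/cgroup/cpuset/system.slice/docker-<container-id>.scope/cpuset.cpus
# cat /sys/fs/cgroup/devices/system.slice/docker-<container-id>.scope/devices.list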

ok, talked with @mpatel and @zbyszek in #atomic earlier. We are going to try for option 4. I opened BZ1568594 for this and proposed it as an FE.

This fix has been pushed to stable. Tomorrow's run of f28 should have it in, so please wait ~24 hours and then test!
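
Once the new compose is out, something like this should pull it in (standard commands, nothing specific to this fix):

# rpm-ostree upgrade
# rpm-ostree status
# systemctl reboot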

Metadata Update from @dustymabe:
- Issue untagged with: meeting
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

6 years ago

Metadata Update from @dustymabe:
- Issue tagged with: F28, host

6 years ago
