Issue #498: upgrading CAH `7.1711` -> `7.1803` docker daemon fails to start - atomic-wg

Opened 5 years ago by dustymabe. Modified 5 years ago

I started the CentOS Atomic Host vagrant box I have lying around on my system, which was at version 7.1711. It starts fine, but after I upgrade, the docker service doesn't come up. I see two issues:

docker-storage-setup fails

-bash-4.2# systemctl status docker-storage-setup -o cat | tee
● docker-storage-setup.service - Docker Storage Setup
   Loaded: loaded (/usr/lib/systemd/system/docker-storage-setup.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2018-05-16 01:30:26 UTC; 10min ago
  Process: 719 ExecStart=/usr/bin/container-storage-setup (code=exited, status=1/FAILURE)
 Main PID: 719 (code=exited, status=1/FAILURE)

Starting Docker Storage Setup...
ERROR: Storage is already configured with devicemapper driver. Can't configure it with overlay2 driver. To override, remove /etc/sysconfig/docker-storage and retry.
docker-storage-setup.service: main process exited, code=exited, status=1/FAILURE
Failed to start Docker Storage Setup.
Unit docker-storage-setup.service entered failed state.
docker-storage-setup.service failed.
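
The error above comes down to the recorded driver and the requested driver disagreeing. A minimal sketch of how to inspect that by hand, assuming /etc/sysconfig/docker-storage holds a single `DOCKER_STORAGE_OPTIONS="--storage-driver <name> ..."` line (the file contents are not shown in this issue). The function names are made up for illustration, and this is not container-storage-setup's actual code:

```shell
# Hypothetical helper: print the storage driver recorded in a docker-storage
# style file. The default path is the one named in the error message.
configured_driver() {
    local file="${1:-/etc/sysconfig/docker-storage}"
    [ -r "$file" ] || return 1
    # Assumed format: DOCKER_STORAGE_OPTIONS="--storage-driver <name> ..."
    sed -n 's/^DOCKER_STORAGE_OPTIONS=.*--storage-driver[ =]\([a-z0-9]*\).*/\1/p' "$file" | head -n 1
}

# Hypothetical check: succeed when nothing is configured yet or the drivers
# match, fail (and say why) when the recorded driver differs from the wanted one.
check_driver_conflict() {
    local wanted="$1"
    local file="${2:-/etc/sysconfig/docker-storage}"
    local current
    current="$(configured_driver "$file")" || return 0
    if [ -n "$current" ] && [ "$current" != "$wanted" ]; then
        echo "configured=$current wanted=$wanted"
        return 1
    fi
    return 0
}
```

On a host in the state shown above, `check_driver_conflict overlay2` would be expected to report the mismatch. Removing /etc/sysconfig/docker-storage, as the message suggests, switches drivers, so any images stored in the existing devicemapper pool would presumably no longer be visible to docker afterwards.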

docker service itself doesn't come up

-bash-4.2# systemctl status docker -o cat | tee
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/docker.service.d
           └─flannel.conf
   Active: failed (Result: exit-code) since Wed 2018-05-16 01:30:27 UTC; 9min ago
     Docs: http://docs.docker.com
  Process: 795 ExecStart=/usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --exec-opt native.cgroupdriver=systemd --userland-proxy-path=/usr/libexec/docker/docker-proxy-current --seccomp-profile=/etc/docker/seccomp.json $OPTIONS $DOCKER_STORAGE_OPTIONS $DOCKER_NETWORK_OPTIONS $ADD_REGISTRY $BLOCK_REGISTRY $INSECURE_REGISTRY $REGISTRIES (code=exited, status=1/FAILURE)
 Main PID: 795 (code=exited, status=1/FAILURE)

Starting Docker Application Container Engine...
time="2018-05-16T01:30:26.496375725Z" level=warning msg="could not change group /var/run/docker.sock to docker: group docker not found"
time="2018-05-16T01:30:26.499243981Z" level=info msg="libcontainerd: new containerd process, pid: 925"
Error starting daemon: error initializing graphdriver: devicemapper: Non existing device atomicos-docker--pool
docker.service: main process exited, code=exited, status=1/FAILURE
Failed to start Docker Application Container Engine.
Unit docker.service entered failed state.
docker.service failed.

We probably need to address these issues.

Here is the status information:

-bash-4.2# rpm-ostree status
State: idle
Deployments:
● centos-atomic-host:centos-atomic-host/7/x86_64/standard
                   Version: 7.1803 (2018-04-03 12:35:38)
                    Commit: cbb9dbf9c8697e9254f481fff8f399d6808cecbed0fa6cc24e659d2f50e05a3e
              GPGSignature: Valid signature by 64E3E7558572B59A319452AAF17E745691BA8335

  centos-atomic-host:centos-atomic-host/7/x86_64/standard
                   Version: 7.1711 (2017-11-28 11:43:40)
                    Commit: 86d991cbb122af96a96cf2c55ccf1bb778c2342dd9a444dfed4fe96f70bb0ef9
              GPGSignature: Valid signature by 64E3E7558572B59A319452AAF17E745691BA8335

dustymabe commented 5 years ago

@jasonbrooks or @miabbott - can you take a look?

Metadata Update from @dustymabe:
- Issue assigned to jasonbrooks
- Issue tagged with: CentOS7, host

5 years ago

dustymabe commented 5 years ago

walters commented 5 years ago

In general, I'm not sure about filing atomic-wg issues that relate to CentOS because we can't generally fix them without going through RHEL, right?

Anyways https://bodhi.fedoraproject.org/updates/FEDORA-2018-03bdc0733a is the Fedora version of the first one. This fix needs to go through the whole downstream thing.

For the pool issue...I think that deserves a bug against docker in RHEL? Though of course that sort of wants to have reproduced it against RHELAH first.

That said it's definitely possible with some of these things that it's CentOS-AH specific as some of the metadata gets important.

Metadata Update from @dustymabe:
- Issue tagged with: bug

5 years ago

dustymabe commented 5 years ago

> In general, I'm not sure about filing atomic-wg issues that relate to CentOS because we can't generally fix them without going through RHEL, right?

You are right, but I think it's mostly a philosophical question. Filing issues here helps bring awareness to the issues, I believe. For me things get too easily lost in bugzilla.

> Anyways https://bodhi.fedoraproject.org/updates/FEDORA-2018-03bdc0733a is the Fedora version of the first one. This fix needs to go through the whole downstream thing.

+1, thanks

> For the pool issue...I think that deserves a bug against docker in RHEL? Though of course that sort of wants to have reproduced it against RHELAH first.

probably

> That said it's definitely possible with some of these things that it's CentOS-AH specific as some of the metadata gets important.

Right. To me this is as good a place as any to "route" issues and find the most effective way to get them resolved, whether they are specific to Fedora, specific to CentOS, or specific to RHELAH and need their own BZ. Maybe I'm creating too much overhead. Maybe not..

vgoyal commented 5 years ago

I understand why the docker-storage-setup error message appears and how an upgrade will fix it. What I don't understand is why docker is complaining that "atomicos-docker--pool" does not exist.

If storage is already set up, then this thin pool should have come up automatically after reboot. If it's not there, then we are looking at a different issue.

What is the output of the "lvs", "vgs", "pvs" and "lsblk" commands on the system?

dustymabe commented 5 years ago

> I understand why the docker-storage-setup error message appears and how an upgrade will fix it. What I don't understand is why docker is complaining that "atomicos-docker--pool" does not exist.
> If storage is already set up, then this thin pool should have come up automatically after reboot. If it's not there, then we are looking at a different issue.
> What is the output of the "lvs", "vgs", "pvs" and "lsblk" commands on the system?

I just tried to reproduce, and this time I only see the first error; the docker service itself seems to come up OK. Maybe a race condition? I can play around and see if I can repro.
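
If this is a race between the thin pool device appearing and docker starting, one way to probe it is to wait for the device node before starting the service. A minimal sketch, with a made-up helper name; whether waiting actually avoids the failure has not been established in this thread:

```shell
# Hypothetical helper: poll once per second until a path exists, giving up
# after the timeout (in seconds, default 30). Returns 0 if the path showed
# up in time and 1 otherwise.
wait_for_path() {
    local path="$1"
    local timeout="${2:-30}"
    local waited=0
    while [ ! -e "$path" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

Usage on the affected host might look like `wait_for_path /dev/mapper/atomicos-docker--pool 30 && systemctl start docker`. If docker starts cleanly that way but fails on a plain boot, that would point toward an ordering problem rather than a missing pool.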

dustymabe commented 5 years ago

and just rebooting that same host allows me to reproduce:

-bash-4.2# reboot 
Connection to 192.168.121.193 closed by remote host.
Connection to 192.168.121.193 closed.
$
$ vagrant ssh
Last login: Mon May 21 15:22:08 2018 from 192.168.121.1
[vagrant@vanilla-c7atomic ~]$ 
[vagrant@vanilla-c7atomic ~]$ sudo su -
Last login: Mon May 21 15:22:10 UTC 2018 on pts/0
-bash-4.2# systemctl status docker -o cat | tee
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/docker.service.d
           └─flannel.conf
   Active: failed (Result: exit-code) since Mon 2018-05-21 15:25:53 UTC; 9min ago
     Docs: http://docs.docker.com
  Process: 806 ExecStart=/usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --exec-opt native.cgroupdriver=systemd --userland-proxy-path=/usr/libexec/docker/docker-proxy-current --seccomp-profile=/etc/docker/seccomp.json $OPTIONS $DOCKER_STORAGE_OPTIONS $DOCKER_NETWORK_OPTIONS $ADD_REGISTRY $BLOCK_REGISTRY $INSECURE_REGISTRY $REGISTRIES (code=exited, status=1/FAILURE)
 Main PID: 806 (code=exited, status=1/FAILURE)

Starting Docker Application Container Engine...
time="2018-05-21T15:25:52.960602241Z" level=warning msg="could not change group /var/run/docker.sock to docker: group docker not found"
time="2018-05-21T15:25:52.965747392Z" level=info msg="libcontainerd: new containerd process, pid: 834"
Error starting daemon: error initializing graphdriver: devicemapper: Non existing device atomicos-docker--pool
docker.service: main process exited, code=exited, status=1/FAILURE
Failed to start Docker Application Container Engine.
Unit docker.service entered failed state.
docker.service failed.
-bash-4.2# 
-bash-4.2# lvs
  LV          VG       Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  docker-pool atomicos twi-a-t---  2.68g             0.71   0.33                            
  root        atomicos -wi-ao---- <2.93g                                                    
-bash-4.2# vgs
  VG       #PV #LV #SN Attr   VSize VFree 
  atomicos   1   2   0 wz--n- 9.70g <4.07g
-bash-4.2# pvs
  PV         VG       Fmt  Attr PSize PFree 
  /dev/vda2  atomicos lvm2 a--  9.70g <4.07g
-bash-4.2# lsblk
NAME                            MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda                             252:0    0   11G  0 disk 
├─vda1                          252:1    0  300M  0 part /boot
└─vda2                          252:2    0  9.7G  0 part 
  ├─atomicos-root               253:0    0    3G  0 lvm  /sysroot
  ├─atomicos-docker--pool_tmeta 253:1    0   12M  0 lvm  
  │ └─atomicos-docker--pool     253:3    0  2.7G  0 lvm  
  └─atomicos-docker--pool_tdata 253:2    0  2.7G  0 lvm  
    └─atomicos-docker--pool     253:3    0  2.7G  0 lvm  
vdb                             252:16   0   20G  0 disk 
-bash-4.2#
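
The names in the lvs and lsblk output above line up once device-mapper naming is taken into account: LVM joins the VG and LV names with a single hyphen and doubles any hyphen inside either name, so VG atomicos plus LV docker-pool becomes atomicos-docker--pool, the device docker says does not exist. A small sketch of that mapping (the function name is made up for illustration):

```shell
# Hypothetical helper: build the device-mapper name LVM uses for a logical
# volume. Hyphens inside the VG or LV name are doubled, then the two parts
# are joined with a single hyphen.
dm_name() {
    local vg lv
    vg="$(printf '%s' "$1" | sed 's/-/--/g')"
    lv="$(printf '%s' "$2" | sed 's/-/--/g')"
    printf '%s-%s\n' "$vg" "$lv"
}
```

`dm_name atomicos docker-pool` prints `atomicos-docker--pool`, so the pool that lvs reports as active is the same one docker failed to find at startup.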

dustymabe commented 5 years ago

upstream discussion about this race condition: https://github.com/projectatomic/container-storage-setup/issues/267#issuecomment-390697519