#6766 Getting dogtag status stuck
Opened 2 years ago by mbasti. Modified a year ago

Sometimes installer cannot determine if dogtag is running, because there is no timeout on socket and if dogtag doesn't reply it stucks forever.

Log:

tail /var/log/ipaserver-kra-install.log 
2017-03-15T05:16:21Z DEBUG Starting external process
2017-03-15T05:16:21Z DEBUG args=/bin/systemctl is-active pki-tomcatd@pki-tomcat.service
2017-03-15T05:16:21Z DEBUG Process finished, return code=0
2017-03-15T05:16:21Z DEBUG stdout=active

2017-03-15T05:16:21Z DEBUG stderr=
2017-03-15T05:16:21Z DEBUG wait_for_open_ports: localhost [8080, 8443] timeout 300
2017-03-15T05:16:23Z DEBUG Waiting until the CA is running
2017-03-15T05:16:23Z DEBUG request POST http://vm-239.example.com:8080/ca/admin/ca/getStatus
2017-03-15T05:16:23Z DEBUG request body ''

Traceback:

Traceback (most recent call first):
  File "/usr/lib64/python2.7/socket.py", line 480, in readline
    data = self._sock.recv(self._rbufsize)
  File "/usr/lib64/python2.7/httplib.py", line 394, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File "/usr/lib64/python2.7/httplib.py", line 438, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python2.7/httplib.py", line 1121, in getresponse
    response.begin()
  File "/usr/lib/python2.7/site-packages/ipapython/dogtag.py", line 200, in _httplib_request
    res = conn.getresponse()
  File "/usr/lib/python2.7/site-packages/ipapython/dogtag.py", line 170, in http_request
    'http', host, port, url, httplib.HTTPConnection, body)
  File "/usr/lib/python2.7/site-packages/ipapython/dogtag.py", line 117, in ca_status
    ca_host, 8080, '/ca/admin/ca/getStatus')
  File "/usr/lib/python2.7/site-packages/ipaplatform/redhat/services.py", line 203, in wait_until_running
    status = dogtag.ca_status(api.env.host)
  File "/usr/lib/python2.7/site-packages/ipaplatform/redhat/services.py", line 224, in restart
    self.wait_until_running()
  File "/usr/lib/python2.7/site-packages/ipaserver/install/service.py", line 349, in restart
    self.service.restart(instance_name, capture_output=capture_output, wait=wait)
  File "/usr/lib/python2.7/site-packages/ipaserver/install/dogtaginstance.py", line 185, in restart_instance
    self.restart('pki-tomcat')
  File "/usr/lib/python2.7/site-packages/ipaserver/install/service.py", line 439, in run_step
    method()
  File "/usr/lib/python2.7/site-packages/ipaserver/install/service.py", line 449, in start_creation
    run_step(full_msg, method)
  File "/usr/lib/python2.7/site-packages/ipaserver/install/krainstance.py", line 126, in configure_instance
    self.start_creation(runtime=126)
  File "/usr/lib/python2.7/site-packages/ipaserver/install/kra.py", line 67, in install
    options.dm_password, subject_base=subject)
  File "/usr/lib/python2.7/site-packages/ipaserver/install/ipa_kra_install.py", line 225, in run
    kra.install(api, config, self.options)
  File "/usr/lib/python2.7/site-packages/ipapython/admintool.py", line 172, in execute
    return_value = self.run()
  File "/usr/lib/python2.7/site-packages/ipapython/admintool.py", line 149, in main
    return command.execute()
  File "/usr/lib/python2.7/site-packages/ipapython/admintool.py", line 125, in run_cli
    sys.exit(cls.main(sys.argv))
  File "/usr/sbin/ipa-kra-install", line 23, in <module>
    KRAInstall.run_cli()

NSSConnection did not support timeouts but since we're using httplib.HTTP[S]Connection now, the timeouts can be added.

Metadata Update from @pvoborni:
- Issue priority set to: critical
- Issue set to the milestone: FreeIPA 4.5.1

2 years ago

I have the same issue on replica-install, it hangs every time. So this is blocking me from creating a replica on 4.4

Metadata Update from @stlaz:
- Custom field external_tracker adjusted to https://pagure.io/dogtagpki/issue/2646

2 years ago

Seems to be this issue:

https://pagure.io/389-ds-base/issue/49209

It's hanging in ?

=====================================================================
[...]
Done configuring directory server (dirsrv).
Configuring certificate server (pki-tomcatd). Estimated time: 3 minutes 30 seconds
[1/27]: creating certificate server user
[2/27]: configuring certificate server instance
[3/27]: stopping certificate server instance to update CS.cfg
[4/27]: backing up CS.cfg
[5/27]: disabling nonces
[6/27]: set up CRL publishing
[7/27]: enable PKIX certificate path discovery and validation
[8/27]: starting certificate server instance
=================================================================

We should take a pstack of ns-slapd process and confirm we see these sort of stacktraces:

Thread 32 (Thread 0x7f3e70ff9700 (LWP 3717)):
#0  0x00007f3e9784e6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f3e97ea5473 in PR_EnterMonitor () from /lib64/libnspr4.so
#2  0x00007f3e8c7e0321 in replica_get_generation () from /usr/lib64/dirsrv/plugins/libreplication-plugin.so
#3  0x00007f3e8c7d9b90 in copy_operation_parameters () from /usr/lib64/dirsrv/plugins/libreplication-plugin.so
#4  0x00007f3e8c7db1ea in multimaster_preop_modify () from /usr/lib64/dirsrv/plugins/libreplication-plugin.so
#5  0x00007f3e99cdcdb8 in plugin_call_func () from /usr/lib64/dirsrv/libslapd.so.0
#6  0x00007f3e99cdd043 in plugin_call_plugins () from /usr/lib64/dirsrv/libslapd.so.0
#7  0x00007f3e99cca359 in op_shared_modify () from /usr/lib64/dirsrv/libslapd.so.0
#8  0x00007f3e99ccb91f in do_modify () from /usr/lib64/dirsrv/libslapd.so.0
#9  0x00007f3e9a1b49d0 in connection_threadmain ()
#10 0x00007f3e97eaa9bb in _pt_root () from /lib64/libnspr4.so
#11 0x00007f3e9784adc5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f3e9757973d in clone () from /lib64/libc.so.6
Thread 31 (Thread 0x7f3e6bfff700 (LWP 3718)):
#0  0x00007f3e9784e6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f3e97ea5473 in PR_EnterMonitor () from /lib64/libnspr4.so
#2  0x00007f3e8c7dfe50 in replica_is_updatedn () from /usr/lib64/dirsrv/plugins/libreplication-plugin.so
#3  0x00007f3e8c7c6e2a in multimaster_extop_StartNSDS50ReplicationRequest () from /usr/lib64/dirsrv/plugins/libreplication-plugin.so
#4  0x00007f3e9a1bbec4 in do_extended ()
#5  0x00007f3e9a1b4aca in connection_threadmain ()
#6  0x00007f3e97eaa9bb in _pt_root () from /lib64/libnspr4.so
#7  0x00007f3e9784adc5 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f3e9757973d in clone () from /lib64/libc.so.6

Thread 38 (Thread 0x7f3e80e9b700 (LWP 3710)):
#0  0x00007f3e9784e6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f3e97ea5473 in PR_EnterMonitor () from /lib64/libnspr4.so
#2  0x00007f3e8c7e2e58 in replica_update_state () from /usr/lib64/dirsrv/plugins/libreplication-plugin.so
#3  0x00007f3e99ca2e9a in eq_loop () from /usr/lib64/dirsrv/libslapd.so.0
#4  0x00007f3e97eaa9bb in _pt_root () from /lib64/libnspr4.so
#5  0x00007f3e9784adc5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f3e9757973d in clone () from /lib64/libc.so.6

If it confirms, we have already a patch and test packages to deliver (389-ds-base) to fix this.

Approximately when will the patch land in the code? This issue blocks large part of freeipa integration tests execution during tests with replicas.

Metadata Update from @mbasti:
- Custom field on_review adjusted to https://github.com/freeipa/freeipa/pull/753

2 years ago

@mkubik It might be better to ask at https://pagure.io/dogtagpki/issue/2646, we can't do anything to unblock you here.

edit: except for maybe failing sooner instead of endlessly

Metadata Update from @pvoborni:
- Issue assigned to mbasti

2 years ago

master:

  • 20f7689 http_request: add timeout option
  • 05984f1 ca_status: add HTTP timeout 30 seconds

ipa-4-5:

  • 48bb3cb http_request: add timeout option
  • 68ce9aa ca_status: add HTTP timeout 30 seconds

Metadata Update from @mbasti:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

2 years ago

Metadata Update from @stlaz:
- Issue status updated to: Open (was: Closed)

2 years ago

The installation will get stuck at a later step, during migration of cert profiles to LDAP. This is because we only log critical failure when the CA does not start in 5 minutes but don't actually fail. This should change.

Also, please, keep this ticket as a tracker to the Dogtag issue.

Metadata Update from @stlaz:
- Custom field on_review adjusted to https://github.com/freeipa/freeipa/pull/763 (was: https://github.com/freeipa/freeipa/pull/753)

2 years ago

master:

  • 1a7a1f9 Make CA/KRA fail when they don't start
  • aba384d Fix wrong message on Dogtag instances stop

ipa-4-5:

  • 81f97cb Make CA/KRA fail when they don't start
  • 1b44c4c Fix wrong message on Dogtag instances stop

Metadata Update from @mbasti:
- Issue tagged with: tracker

2 years ago

Is there a way to avoid this error to install the ca in the replica?

Metadata Update from @mbasti:
- Issue set to the milestone: FreeIPA 4.5.2 (was: FreeIPA 4.5.1)

2 years ago

FreeIPA 4.5.1 has been released, moving to FreeIPA 4.5.2 milestone

Metadata Update from @tkrizek:
- Issue set to the milestone: FreeIPA 4.5.3 (was: FreeIPA 4.5.2)

a year ago

Metadata Update from @mbasti:
- Assignee reset

a year ago

Metadata Update from @tkrizek:
- Issue set to the milestone: FreeIPA 4.5.4 (was: FreeIPA 4.5.3)

a year ago

Metadata Update from @tkrizek:
- Issue set to the milestone: FreeIPA 4.5.5 (was: FreeIPA 4.5.4)

a year ago

Login to comment on this ticket.

Metadata