#49099 Nunc stans thread workers
Closed: fixed 3 years ago Opened 3 years ago by firstyear.

Replace slapi's direct NSPR thread control and event system with nunc-stans. This includes timer tasks, async events, and the worker threads themselves.

http://www.port389.org/docs/389ds/design/nunc-stans-workers.html

This can be targeted for 1.3.7 or later.


Metadata Update from @firstyear:
- Issue assigned to firstyear
- Issue set to the milestone: 1.3.7 backlog

3 years ago

Metadata Update from @firstyear:
- Issue close_status updated to: None
- Issue tagged with: Complex, Performance

3 years ago

Metadata Update from @firstyear:
- Custom field reviewstatus adjusted to review

3 years ago

This patch serves a number of purposes:

  • Remove --enable-nunc-stans from configure. We always use it now.
  • Move the creation of the worker pool earlier so that all tasks can use it (e.g. ldif2db).
  • Fix detach handling to make the worker pool init easier.
  • Fix setuid calls (most of them were not needed).
  • Add test cases to show that timer events work as expected with NS. We need this for timer jobs like disk monitoring.

An important aspect of this patch is that it was tested with config.enable_nunc_stans both on and off. This means that if we have issues, we can still turn nunc-stans off and back out of the change.

Looks good! Nice and clean, no indentation/spacing issues :) Thanks!

It looks like a combination of things has a problem building though. :(

commit 19f676a
To ssh://git@pagure.io/389-ds-base.git
54e4fca..19f676a master -> master

Metadata Update from @firstyear:
- Custom field reviewstatus adjusted to review (was: ack)

3 years ago

Metadata Update from @mreynolds:
- Custom field reviewstatus adjusted to ack (was: review)

3 years ago

commit a05cf36
To ssh://git@pagure.io/389-ds-base.git
7b3e401..d7a4910 master -> master

This "fix" broke the server. The server fails to install or start or stop, etc.

Attaching gdb and running the startup shows:

...
...
[New Thread 0x7fff257fa700 (LWP 15335)]
new_ns_job acdda0 initial NS_JOB_WAITING
ns_add_io_job state 7 moving to NS_JOB_ARMED
internal_ns_rearm_job acdda0 state 4 moving to NS_JOB_ARMED
event_q_notify enqueuing acdda0 with state 5
sds_queue_dequeue: Queue 0x8e1ac0 - <== enqueuing
sds_queue_enqueue: Queue 0x8e1ac0 - Queueing ptr 0xacdda0 to 0xaaeab0
sds_queue_enqueue: Queue 0x8e1ac0 - empty, adding 0xaaeab0 to head and tail
sds_queue_enqueue: Queue 0x8e1ac0 - complete head: 0xaaeab0 tail: 0xaaeab0
event_q_wake attempting to wake event queue.
event_q_wake result. 0
event_cb 8e2460 state 5 non-threaded, execute right meow
work_job_execute 8e2460 state 5 moving to NS_JOB_RUNNING
wakeup_cb 8e2460 state 6 wakeup_cb
sds_queue_dequeue: Queue 0x8e1ac0 - ==> dequeuing
sds_queue_dequeue: Queue 0x8e1ac0 - complete head: (nil) tail: (nil)
get_new_event_requests Dequeuing acdda0 with state 5
update_event acdda0 state 5
sds_queue_dequeue: Queue 0x8e1ac0 - ==> dequeuing
sds_queue_dequeue: Queue 0x8e1ac0 - queue exhausted.
work_job_execute PERSIST and RUNNING, remarking 8e2460 as NS_JOB_NEEDS_ARM
work_job_execute 8e2460 state 4 job func complete, sending to rearm...
internal_ns_rearm_job 8e2460 state 4 moving to NS_JOB_ARMED
update_event 8e2460 state 5
event_loop_thread_func woke event queue. rc=1
sds_queue_dequeue: Queue 0x8e1ac0 - ==> dequeuing
sds_queue_dequeue: Queue 0x8e1ac0 - queue exhausted.
ns_thrpool_wait has begun
sds_queue_dequeue: Queue 0x7e9460 - ==> dequeuing
sds_queue_dequeue: Queue 0x7e9460 - complete head: 0x8e2a30 tail: 0x8e5cd0
^C
Thread 1 "ns-slapd" received signal SIGINT, Interrupt.
0x00007ffff524f96d in pthread_join () from /lib64/libpthread.so.0
(gdb) where
#0  0x00007ffff524f96d in pthread_join () from /lib64/libpthread.so.0
#1  0x00007ffff7bd2dba in ns_thrpool_wait (tp=0x8e19a0)
    at ../389-ds-base/src/nunc-stans/ns/ns_thrpool.c:1564
#2  0x000000000041cb79 in slapd_daemon (ports=0x7fffffffdc10, tp=0x8e19a0)
    at ../389-ds-base/ldap/servers/slapd/daemon.c:1107
#3  0x0000000000425f0b in main (argc=9, argv=0x7fffffffdd48)
    at ../389-ds-base/ldap/servers/slapd/main.c:1180

But the process never fully starts. Running start-dirsrv just hangs.
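For anyone reading frames #0 and #1 of the backtrace above: a pool "wait" call that joins its worker threads keeps the caller inside pthread_join until those workers exit. The sketch below shows that pattern with plain pthreads. It is not nunc-stans code; demo_worker, demo_stopper, demo_pool_wait and run_demo are hypothetical names made up for this illustration, and the shutdown flag is an assumption about how such workers might be told to stop.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

#define DEMO_WORKERS 4

/* Hypothetical shutdown flag: set to 1 to ask the workers to exit. */
static atomic_int demo_shutdown = 0;

/* Sleep for the given number of milliseconds. */
static void demo_sleep_ms(long ms)
{
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

/* Worker: idles until shutdown is requested, then returns. */
static void *demo_worker(void *arg)
{
    (void)arg;
    while (!atomic_load(&demo_shutdown)) {
        demo_sleep_ms(1);
    }
    return NULL;
}

/* Helper thread: requests shutdown after a short delay. */
static void *demo_stopper(void *arg)
{
    (void)arg;
    demo_sleep_ms(50);
    atomic_store(&demo_shutdown, 1);
    return NULL;
}

/* Joins every worker. The caller blocks in pthread_join until each
 * worker has exited. Returns the number of workers joined. */
static int demo_pool_wait(pthread_t *workers, int n)
{
    int joined = 0;
    for (int i = 0; i < n; i++) {
        if (pthread_join(workers[i], NULL) == 0) {
            joined++;
        }
    }
    return joined;
}

/* Starts the workers, waits on them, and returns how many were joined. */
int run_demo(void)
{
    pthread_t workers[DEMO_WORKERS];
    pthread_t stopper;
    int rc;

    atomic_store(&demo_shutdown, 0);
    for (int i = 0; i < DEMO_WORKERS; i++) {
        rc = pthread_create(&workers[i], NULL, demo_worker, NULL);
        assert(rc == 0);
    }
    rc = pthread_create(&stopper, NULL, demo_stopper, NULL);
    assert(rc == 0);

    /* Without demo_stopper, this call would never return. */
    int joined = demo_pool_wait(workers, DEMO_WORKERS);
    pthread_join(stopper, NULL);
    return joined;
}
```

Calling run_demo() from a main() and interrupting it under gdb before the stopper fires would show the calling thread in pthread_join, much like the trace above. So the frame by itself only says the main thread is waiting on the pool, not why startup never completes.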

What are you doing to produce this error? I can't reproduce it, and I ran the full ticket test suite with the patch to make sure it worked. I'm confused as to what's going on here...

If this is blocking you, revert it, and we can re-add later. Alternately, set nunc-stans OFF in libglobs.

    init_enable_nunc_stans = cfg->enable_nunc_stans = LDAP_ON;

^ Set that line in libglobs (line 1651) to LDAP_OFF.
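To make the suggested fallback concrete, here is a minimal sketch of a default that such a flag controls. It is not the libglobs code: struct demo_cfg, demo_init_defaults, demo_event_backend and the enum are hypothetical stand-ins, and LDAP_ON / LDAP_OFF are redefined locally (with assumed values) so the snippet compiles on its own. Only the names enable_nunc_stans and init_enable_nunc_stans come from the line quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins with assumed values; the server has its own definitions. */
#define LDAP_ON 1
#define LDAP_OFF 0

/* Hypothetical, cut-down config struct for illustration only. */
struct demo_cfg {
    int enable_nunc_stans;
};

/* Hypothetical labels for the two event paths. */
enum demo_backend {
    DEMO_BACKEND_LEGACY,
    DEMO_BACKEND_NUNC_STANS
};

/* Remembers the initial default, as in the quoted line. */
int init_enable_nunc_stans;

/* Same shape as the quoted line: store the default in the live config
 * and in the remembered initial value in one statement. */
void demo_init_defaults(struct demo_cfg *cfg, int default_value)
{
    assert(cfg != NULL);
    init_enable_nunc_stans = cfg->enable_nunc_stans = default_value;
}

/* Reports which event path a server following this flag would take. */
enum demo_backend demo_event_backend(const struct demo_cfg *cfg)
{
    assert(cfg != NULL);
    return cfg->enable_nunc_stans == LDAP_ON ? DEMO_BACKEND_NUNC_STANS
                                             : DEMO_BACKEND_LEGACY;
}
```

With the default passed as LDAP_OFF, demo_event_backend() reports the legacy path, which is the effect the one-line change is meant to have until the hang is understood.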

I do a "make install" as I always do. But now I cannot install servers, and I cannot stop or start them; they all hang here:

#0  0x00007ffff524f96d in pthread_join () from /lib64/libpthread.so.0
#1  0x00007ffff7bd2dba in ns_thrpool_wait (tp=0x8e19a0)
    at ../389-ds-base/src/nunc-stans/ns/ns_thrpool.c:1564
#2  0x000000000041cb79 in slapd_daemon (ports=0x7fffffffdc10, tp=0x8e19a0)
    at ../389-ds-base/ldap/servers/slapd/daemon.c:1107

And all the CI tests fail on the Jenkins server. If I go back to before this change everything works fine, which is what I did to work around the problem, but this needs to get fixed ASAP.

Mate, no matter what I do, I cannot reproduce this. Can you shoot me an email with the output of "thread apply all bt" from the "broken" server? Can you also give me the console output / error log?

I use this configure command (on F25):

CFLAGS='-g -pipe -Wall  -fexceptions -fstack-protector --param=ssp-buffer-size=4  -m64 -mtune=generic' CXXFLAGS='-g -pipe -Wall -O0 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic' ../389-ds-base/configure --enable-autobind --with-selinux --with-openldap --with-tmpfiles-d=/etc/tmpfiles.d --with-systemdsystemunitdir=/usr/lib/systemd/system --with-systemdsystemconfdir=/etc/systemd/system --enable-debug --with-systemdgroupname=dirsrv.target --with-fhs --libdir=/usr/lib64  --with-systemd

Then "make install"

Thread apply all shows all the worker threads, but main() is stuck trying to join a thread as shown above. It never detaches the process - you can only "attach" gdb if you use gdb to start the server: "gdb /usr/sbin/ns-slapd" --> set args --> run

There is also nothing in the logs. I even completely wiped my system of all things dirsrv, and started from scratch, but no luck. So for now I can only develop on the 1.3.6 branch.

I just want to chime in and confirm that the server doesn't start fully after 19f676a. I build with make -f rpm.mk srpms and then mock.

I've managed to reproduce this. It looks like an issue with systemd + this patch. I'm working on it now.

0001-Ticket-49099-resolve-systemd-startup-interaction-wit.patch

This patch resolves the issue :) Very sorry about this issue. My development environment does not use systemd :(

Metadata Update from @firstyear:
- Custom field reviewstatus adjusted to review

3 years ago

A quick test shows it passes all the basic tests on a systemd-enabled system.

Works for me. But I noticed that instance creation now takes 7 seconds instead of 2. I guess this is a side effect of e086b83?

Works for me, everything looks good, ack

Metadata Update from @mreynolds:
- Custom field reviewstatus adjusted to ack (was: review)

3 years ago

commit d3aa098
To ssh://git@pagure.io/389-ds-base.git
15f5f6a..d3aa098 master -> master

Metadata Update from @firstyear:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

3 years ago
