#48379 ns-slapd takes 100% CPU
Closed: wontfix None Opened 8 years ago by evli964.

Hi,

We are currently evaluating 389-ds to migrate/upgrade our current openldap installation.

We have a dotNet application that keeps the ldap directory in sync with the HR database.
On the old OpenLDAP server this application seems to work fine.

However, on the 389-ds server, typically after this app has run a couple of hours the ns-slapd server starts to take up 100% cpu.

It seems the app internally generates a lot of exceptions which are worked around, and on OpenLDAP never caused any serious problems.

I can simulate this behaviour with the included ldapsrch.c

By starting multiple async search operations over 1 connection,
and then crashing the client program before waiting for the result
there are multiple threads that start to consume 100% cpu ...

tested against '389-Directory/1.3.4.0 B2015.342.1139'
(rpm 389-ds-base-1.3.4.0-21.el7_2.x86_64)

regards,
E.Vanlaar


> It seems the app internally generates a lot of exceptions which are worked around, and on OpenLDAP never caused any serious problems.

Could you share what kind of errors the client is getting?

Also, could you provide us the server log files /var/log/dirsrv/slapd-<YOURID>/{errors,access}?

Regarding the multiple threads: there is only one connection, shared among the worker threads started by the asynchronous searches (up to SRCH_COUNT). When sending back the search results, only one thread is allowed to do so at a time, and the rest should wait to acquire the lock. This behaviour is normal.

Thanks.

Please provide a stacktrace of the server while it is hung/at 100% CPU. See http://www.port389.org/docs/389ds/FAQ/faq.html#debugging-hangs
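
For reference, one possible way to collect such a trace is to attach gdb in batch mode and dump a backtrace of every thread. This is only a sketch and is not taken from the FAQ: it assumes gdb is installed (ideally together with the 389-ds-base debuginfo package), and `capture_stacks` and the `GDB` override are hypothetical names used only for this illustration.

```shell
# Dump backtraces of all threads of a running process into a file.
# Usage: capture_stacks PID [OUTFILE]
# The GDB variable may be overridden (for example for a dry run);
# it defaults to "gdb". Both names are assumptions of this sketch.
capture_stacks() {
    pid=$1
    # Default file name follows the "stacktrace.<epoch>.txt" pattern.
    out=${2:-stacktrace.$(date +%s).txt}
    "${GDB:-gdb}" -p "$pid" -batch -ex 'thread apply all bt' > "$out" 2>&1
    # Print the name of the file that was written.
    echo "$out"
}

# Example (needs a live server and sufficient privileges):
#   capture_stacks "$(pidof ns-slapd)"
```

If several ns-slapd instances run on the same host, `pidof` returns more than one PID, so the PID of the affected instance would have to be passed explicitly.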

Hi,

I have attached the requested files.
After the client crash, the following thread IDs are taking all the CPU: 31469, 31472, 31464, 31470 and 31463.
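
Thread IDs like these can be picked out with a small filter over per-thread CPU figures. The sketch below is an assumption about tooling, not how the list above was actually produced; `hot_threads` is a hypothetical helper name. It reads "TID %CPU" pairs, as printed by procps `ps -L -o tid=,pcpu=`, and keeps the threads at or above a threshold.

```shell
# Print the thread IDs (LWPs) whose CPU usage is at or above a threshold.
# Reads "TID %CPU" pairs on stdin, one pair per line.
# Usage: hot_threads THRESHOLD_PERCENT
hot_threads() {
    awk -v min="$1" '$2 + 0 >= min + 0 { print $1 }'
}

# Example feed (needs a live process):
#   ps -L -o tid=,pcpu= -p "$(pidof ns-slapd)" | hot_threads 90
# Note: ps reports CPU usage averaged over the thread's lifetime;
# "top -H -p PID" shows the current usage instead.
```

The printed IDs correspond to the "LWP" numbers that gdb shows next to each thread, which makes it possible to find the spinning threads in a stack trace.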

Note that I now performed this on a fresh Fedora 23 install
(389-Directory/1.3.4.5 B2015.323.048; rpm 389-ds-base-1.3.4.5-1.fc23.x86_64)

The dotNet client Application is not connecting to this machine,
the only connected clients were the Admin console and the attached ldapsrch program.

Regarding the crashes of the dotNet application: these are all internal to the
Novell C# library that is apparently being used (threading/locking issues).

Regards

Before the crash, the server is basically idle. There are no worker threads doing any work on behalf of the client(s).

After the crash, the threads at 100% CPU are all doing a thread yield, in the sched_yield system call.

In the original report, you said "after this app has run a couple of hours the ns-slapd server starts to take up 100% cpu". But in the ldapsrch test program, you have to use -c to make it crash, in order to make the server hit 100% cpu?

For a workaround, you might try experimenting with the value of nsslapd-maxthreadsperconn https://access.redhat.com/documentation/en-US/Red_Hat_Directory_Server/10/html-single/Configuration_Command_and_File_Reference/index.html#cnconfig-nsslapd_maxthreadsperconn_Maximum_Threads_per_Connection

Try a value of 1, which will turn off concurrency for connections. If that causes your application not to work, try a high value like 10 or so.
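
As a rough illustration, the attribute lives on the `cn=config` entry and can be changed with an LDIF modify operation. The sketch below only generates the LDIF; `config_replace_ldif` is a hypothetical helper name, and the server URL and the `cn=Directory Manager` bind DN in the usage comment are assumptions that may differ on your installation.

```shell
# Emit LDIF that replaces a single attribute on cn=config.
# Usage: config_replace_ldif ATTRIBUTE VALUE
config_replace_ldif() {
    printf 'dn: cn=config\nchangetype: modify\nreplace: %s\n%s: %s\n' \
        "$1" "$1" "$2"
}

# Example (needs a live server; URL and bind DN are assumptions):
#   config_replace_ldif nsslapd-maxthreadsperconn 1 |
#       ldapmodify -x -H ldap://localhost -D 'cn=Directory Manager' -W
```

Whether the new value takes effect immediately or only after a restart should be checked in the configuration reference linked above.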

As stated, the dumps were made on a system with no client connections (other than the crashing ldapsrch program).

On the 'live' system it takes a couple of hours to get the high CPU load.
This is probably because of the rate the Exceptions/crashes occur in the application.

I will try to limit the concurrency per connection and let you know how it goes.

In the meantime I have found another way to trigger the high CPU load.
I have modified ldapsrch.c; the number of concurrent async searches is now set to 2.

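{{{
# Launch 60 crashing clients in parallel. The placeholders are quoted so
# the shell does not treat '<' and '>' as redirections.
for (( i=0; i<60; i++ ))
do
    ./ldapsrch -c -h localhost -b 'search base' -D '<bind dn>' -w '<password>' \
        '(some long running search)' &
done
}}}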

On my test fedora23 vm I can trigger the high cpu load in this way also.
(see the attached stacktrace.1450292617.txt)

Regards

multiple async search with crash
ldapsrch.c

Hi,

Yesterday evening I changed nsslapd-maxthreadsperconn to 1 on the 'live' test/evaluation system.
The cpu load has not yet gone up, but we are now evaluating if everything is still working correctly.

However, on my test system I can still cause some threads to lock up with high cpu load.
On this system nsslapd-maxthreadsperconn has also been set to 1

By executing the following I can now lock up 1 thread per execution of the crashing ldapsrch program.
Note that I launch 2 async searches, with nsslapd-maxthreadsperconn set to 1.
When I don't let the program crash everything works fine.

{{{
./ldapsrch -n 2 -c -h localhost -b <base dn> -D <bind dn> -w <password> \
'<some long running search filter>'
}}}

If I repeat this enough the server stops accepting connections.

It looks like the server is not detecting the client is gone.

Regards

If the connections are not released, you could try adjusting the values of
{{{nsslapd-timelimit}}} and {{{nsslapd-idletimeout}}}. The latter limits how long an inactive connection may stay open.
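
Before and after changing these, it may help to read the current values back from `cn=config` to confirm what the server is actually using. The sketch below is only an illustration: `ldif_value` is a hypothetical helper that extracts one attribute from LDIF output, and the URL and bind DN in the usage comment are assumptions.

```shell
# Extract the value(s) of one attribute from LDIF read on stdin.
# Attribute names are compared case-insensitively, as in LDAP.
# Base64-encoded values ("attr:: ...") are not handled by this sketch.
# Usage: ldif_value ATTRIBUTE
ldif_value() {
    awk -v attr="$1" '
        BEGIN { want = tolower(attr) ":" }
        tolower($1) == want { sub(/^[^:]*:[ ]*/, ""); print }
    '
}

# Example (needs a live server; URL and bind DN are assumptions):
#   ldapsearch -x -LLL -H ldap://localhost -D 'cn=Directory Manager' -W \
#       -b cn=config -s base nsslapd-idletimeout nsslapd-timelimit |
#       ldif_value nsslapd-idletimeout
```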

I set nsslapd-idletimeout to 10s on my test system.
This does not seem to make a difference.
The threads keep using 100% until the directory server is restarted.

But please note that '<some long running search filter>' in this case means just a couple of seconds (just long enough to cause some parallelism).

regards

Replying to [comment:6 evli964]:

{{{
./ldapsrch -n 2 -c -h localhost -b <base dn> -D <bind dn> -w <password> \
'<some long running search filter>'
}}}

> If I repeat this enough the server stops accepting connections.

How large would "enough" be? Tens, hundreds, thousands, or more?

> It looks like the server is not detecting the client is gone.

Could you tell us how you determined that the clients are really gone?
When this happens, could you please run "pstack <PID>" and provide us with the stack traces?
Thanks.

Replying to [comment:10 nhosoi]:

Replying to [comment:6 evli964]:

{{{
./ldapsrch -n 2 -c -h localhost -b <base dn> -D <bind dn> -w <password> \
'<some long running search filter>'
}}}

> > If I repeat this enough the server stops accepting connections.
>
> How large would "enough" be? Tens, hundreds, thousands, or more?

The 31st run of the ldapsrch program cannot connect any more,
I suppose because all worker threads are in a corrupt state ...

> > It looks like the server is not detecting the client is gone.
>
> Could you tell us how you determined that the clients are really gone?

Because the client terminates with the following message (check the attached ldapsrch.c):
{{{
Floating point exception (core dumped)
}}}

> When this happens, could you please run "pstack <PID>" and provide us with the stack traces?

I ran gdb stack traces between the first couple of crashes of ldapsrch.
These are included, together with a pstack trace with 30 hanging threads.

Thanks.

I experimented some more.

For the first thread to become locked/corrupt/... I sometimes need to run the crashing application up to 3 times. But after that every run of the crashing ldapsrch locks up 1 more thread.

Furthermore, 'not able to connect' is technically not completely correct.
The tcp connection works fine, it's the ldap_bind that does not seem to work anymore
(or maybe really slow).
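
To tell a hung bind apart from a merely slow or rejected one, a bind attempt can be wrapped in a timeout. This is a sketch under assumptions, not part of the original test: it relies on GNU `timeout` (which exits with status 124 when the time limit is hit) and on the OpenLDAP `ldapwhoami` client; `bind_verdict` is a hypothetical helper name.

```shell
# Map the exit status of "timeout N ldapwhoami ..." to a short verdict.
# GNU timeout exits with 124 when the command ran past the time limit.
# Usage: bind_verdict EXIT_STATUS
bind_verdict() {
    case $1 in
        0)   echo "bind ok" ;;
        124) echo "bind timed out" ;;
        *)   echo "bind failed (status $1)" ;;
    esac
}

# Example (needs a live server; the placeholders are left as in the thread):
#   timeout 10 ldapwhoami -x -H ldap://localhost -D '<bind dn>' -w '<password>'
#   bind_verdict $?
```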

Regards

Some more test cases:

  • nsslapd-maxthreadsperconn = 1
    {{{
    ./ldapsrch -n 2 -c -h localhost -b <base dn> -D <bind dn> -w <password> \
    '<some long running search filter>'
    }}}
    (note: 1 more async search than nsslapd-maxthreadsperconn)
    This will block 1 thread per execution after triggering lockup of the first thread
  • nsslapd-maxthreadsperconn = 5
    {{{
    ./ldapsrch -n 2 -c -h localhost -b <base dn> -D <bind dn> -w <password> \
    '<some long running search filter>'
    }}}
    As far as I can tell, this does not cause hangs/lockups if run one at a time
  • nsslapd-maxthreadsperconn = 5
    {{{
    for (( i=0; i<60; i++ ))
    do
    ./ldapsrch -n 2 -c -h localhost -b <base dn> -D <bind dn> -w <password> \
    '<some long running search filter>' &
    done
    }}}
    (note the '&')
    This almost always causes some threads to get stuck (sometimes none, sometimes all of them at once)
    After this (if there are non-stuck threads left) single executions with '-n 2' do not cause extra threads to get stuck:
    {{{
    ./ldapsrch -n 2 -c -h localhost -b <base dn> -D <bind dn> -w <password> \
    '<some long running search filter>'
    }}}
    however by specifying '-n 6' another 5 threads get stuck for every single execution of the above command.

  • nsslapd-maxthreadsperconn = 5
    {{{
    ./ldapsrch -n 6 -c -h localhost -b <base dn> -D <bind dn> -w <password> \
    '<some long running search filter>'
    }}}
    (Note: 1 more async search than nsslapd-maxthreadsperconn)
    This will almost always trigger 5 threads at a time to get stuck (again: to trigger the first 5 threads to get stuck more than 1 execution might be necessary; after that 5 threads per execution get stuck)

Regards

Ok. Looks like there is definitely a bug in the server that we need to investigate.

In the meantime, is there some tuning of nsslapd-maxthreadsperconn + nsslapd-threadnumber that can get you up and running without hanging/crashing?

Thanks !

We were planning to have at least a couple of months of testing before going live with the production environment.

So if we know you are looking into this we can stick to our schedule for now.

The issue is not (yet ?) blocking because it seems in our setup the threads get stuck 'slowly'.
A workaround for now would be restarting ns-slapd daily (before all or too many threads get stuck)

Also, by setting nsslapd-maxthreadsperconn=1 it seems the problem is not triggered by our sync application, but then performance seems to be a lot lower.
(maybe it would have been triggered if I kept nsslapd-maxthreadsperconn=1 for a bit longer)

Regards

Thank you for your reproducer.

Attachment ldapsrch.c added
multiple async search with crash

I could reproduce the issue and verified that the fix for ticket 48412 solves the problem.
https://fedorahosted.org/389/ticket/48412

We are going to respin the new bits as soon as possible.

I will keep this ticket open for now. Once the new bits are verified at your site, please update this ticket. Once again, thank you for your support.

Hi,

I just installed the new RPMS from the RHSA.
I cannot reproduce the problem with the attached ldapsrch.

So it's looking good ...

Regards

Thank you for your updates!

Let me close this ticket as a dup of #48412...

Metadata Update from @evli964:
- Issue assigned to nhosoi
- Issue set to the milestone: 0.0 NEEDS_TRIAGE

7 years ago

Metadata Update from @vashirov:
- Issue set to the milestone: None (was: 0.0 NEEDS_TRIAGE)

4 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/1710

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Duplicate)

3 years ago
