The directory server consumes large amounts of memory under a sustained stream of ldapmodify requests, in proportion to the configured cachememsize (regardless of whether the entries fill the cache or not).
Examples: 12GB cachememsize followed by a stream of ldapmodify requests quickly fills and crashes a 32GB machine.
3GB cachememsize followed by heavy modify activity levels off at a memory usage of 24 GB.
1GB cachememsize followed by heavy modify activity levels off at a memory usage of 11-12 GB (including only 2GB of DB cache).
0GB cachememsize, which is reset to the minimum of 512000, followed by heavy modify activity will result in zero noticeable memory growth.
The server seems to believe it can have an in-memory workspace of too many multiples of the cachememsize. The ratio appears to be somewhere around (7 * cachememsize).
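To make the ratio concrete, the figures reported above can be divided out. This is only a sketch of the arithmetic: the helper name `footprint_ratio` is hypothetical, and the 11.5 is simply the midpoint of the reported 11-12 GB range.

```shell
# footprint_ratio FOOTPRINT_GB CACHEMEMSIZE_GB
# Prints the process footprint divided by the configured cachememsize,
# rounded to one decimal. (Hypothetical helper; the inputs below are the
# numbers reported in this ticket.)
footprint_ratio() {
    awk -v f="$1" -v c="$2" 'BEGIN { printf "%.1f\n", f / c }'
}

footprint_ratio 24 3      # 3GB cachememsize leveling off at 24 GB
footprint_ratio 11.5 1    # 1GB cachememsize leveling off at 11-12 GB
```

Both results include the DB cache and other process overhead, so they are upper bounds on the multiple attributable to the entry cache alone.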
This behavior is seen both in the default RedHat 6.2 installation from the EPEL repository, which was 1.2.9, and in the latest 1.2.10 release from the rmeggins repo.
beall,
Do you have more info on the setup?
Which ldapmodify operations are you running? How many existing entries are there? It appears you set the entry cache (nsslapd-cachememsize) to a high value, but did you use the default db cache size (nsslapd-dbcachesize)?
I just want to make sure I'm following your exact steps.
Thanks, Mark
I have been able to reproduce the memory growth. In my initial setup, after priming the entry cache (250K entries) with a cachememsize of 3gigs, the process was around 4 gigs in size. After modifying each entry a few times, it got above 8 gigs.
I ran the same test case under valgrind, and as expected there is no leak. This is such a basic operation that if it did leak we would have seen it a long time ago.
I think what we are facing is simply memory fragmentation.
I'm going to run some more tests to see if I can manipulate the results.
Hi mreynolds,
It seems you were able to reproduce the issue, so I'd be glad to provide more info if you need it, but I think you have it.
My environment uses over 10 gigs of memory to hold the entire set of entries in cache prior to any growth. Once I start a stream of modifies, this grows beyond the boundaries of the machine I have available. We are in the process of purchasing hardware, but we are sizing the machines to handle more than 8 times the basic entry cache requirements. Hopefully with that size hardware, the usage will level off.
It would be much better if we don't have to buy that huge hardware.
Is there perhaps a linux system setting where we can pre-allocate a large chunk of memory to the process and then it will never fragment?
Thanks, Russ.
Hi Russ,
Looks like we identified a new malloc setting that might help. We are preparing a new test build, and I hope you will be willing to test this for us. The issue is that we don't have a perfect recommendation on what the setting should be.
You just set the env var SLAPD_MXFAST to a value between 0 and 128 (128 is the default value).
0 disables the "fastbin" feature. This seems to reduce the fragmentation the most (smallest memory footprint), but there is a small performance impact.
Setting it to 64 reduced the fragmentation, but did not impact the performance as much.
All of this depends on the size of the entries, how many entries, overall usage, etc.
So, we would like you to test this, try different values (0, 32, 64, etc.), and report your results. Is this something you would be willing to test for us?
If you will test it, I will let you know when/where to get the new 389 package.
I would be very glad to test this. We are at the critical stage of deciding what hardware to buy, and if we can reduce the overconsumption of memory, we can buy nodes considerably cheaper than otherwise needed.
Russ, what version of 389 should we create the patch for? And can you confirm the os and version?
Using RedHat 6. uname -a follows: Linux gds-dev5.usc.edu 2.6.32-279.2.1.el6.x86_64 #1 SMP Thu Jul 5 21:08:58 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
Oh, and the 389 version can be your latest. I am not tied to a specific version. The last version I was testing with is this: 389-Directory/1.2.10.12 B2012.180.1623
Russ,
You can get the patch at rmeggins testing repo: 389-ds-base-1.2.10.14-2
Keep me posted on your results.
Ok. I'm running a test but the memory is still growing. I think it is growing a bit slower than before. I'd like to make sure I made the setting correctly. I put the following line in my /etc/sysconfig/dirsrv file: export SLAPD_MXFAST=0
This is the version now showing in my errors logfile after enabling the testing repo and running a yum update on 389-ds-base: 389-Directory/1.2.10.14 B2012.250.1848
Am I correctly set up to test this?
That should work. I also set it from the command line. I saw the same initial growth (regardless of the setting), but in my tests the growth slowed down and stopped after a while.
In my tests I had a 3 gig entry cache, with 250k entries. I saw it grow up to 9-11 gigs, but with setting this to zero, it only grew to 6-7 gigs.
Correction, setting from the command line does NOT work. It must be in /etc/sysconfig/dirsrv.
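For reference, the setting discussed above would look like this in the file. This is a sketch, assuming the start-up script sources /etc/sysconfig/dirsrv as shell (which the `export` syntax used earlier in this ticket suggests); the value is one of those tried here, not a recommendation.

```shell
# /etc/sysconfig/dirsrv
# 0 disables the malloc "fastbin" feature; 32 and 64 are the other values
# suggested for testing in this ticket.
export SLAPD_MXFAST=0
```

The server has to be restarted after changing the value, since the environment of a running process cannot be changed from outside.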
Ok. Initial results are that my 15G cache grows to the 30G hard limit I set for virtual memory usage, but instead of dying for lack of memory, the system is slowly continuing to churn. It will be a while before I know how much slower it is. Definitely good that it is not dying.
I tried to research Linux memory fragmentation, and specifically I wanted a tool that would let me scrape a process and defragment it, but I don't see one. If such a tool existed, we could get by very well since we could just run that once a day or each time the update script completes.
Yeah, I don't think such a tool exists.
Also, don't forget to try other values for SLAPD_MXFAST. I saw good results with using 64. This might work for you, or maybe some other value.
My 2 cents: the other solution to avoid memory fragmentation would be to use hugepages for large memory sizes (like Oracle databases do, for example), but it would require a great deal of rewriting of the server code, I think...
If you want to make sure you are using the binary with the mxfast option, do {{{ strings /usr/sbin/ns-slapd | grep SLAPD_MXFAST }}}
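The `strings` check only shows that the binary knows about the option. To see whether a running process actually received the variable, its environment can be read from /proc on Linux. A minimal sketch; the helper name `proc_env_value` is hypothetical, and the `pidof` lookup in the usage comment assumes a single ns-slapd instance.

```shell
# proc_env_value ENVIRON_FILE NAME
# Prints the value of NAME from a NUL-separated environment dump such as
# /proc/<pid>/environ. Prints nothing if NAME is not set.
proc_env_value() {
    tr '\0' '\n' < "$1" | sed -n "s/^$2=//p"
}

# Usage (run as root or as the user that owns the server process):
#   proc_env_value "/proc/$(pidof ns-slapd)/environ" SLAPD_MXFAST
```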
Replying to [comment:21 pj101]:
Can you point me at more information about this?
Hi Rich,
Huge pages are memory pages of 2 MB instead of 4 KB (at least, in Linux). They are usually pinned in memory and not swappable. They can be used (I think) only through shared memory and must be preallocated - that's a lot of constraints.
The advantage is a much lower overhead for kernel memory management for large memory sizes (starting from ~8 GB). As for the fragmentation, since the number of huge pages is much smaller for the same memory size, the fragmentation overhead should be smaller.
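To put numbers on the page-count difference, here is the arithmetic for the 15G cache mentioned earlier in this ticket. A small sketch only; the helper name `pages_for` is hypothetical.

```shell
# pages_for GIB PAGE_KB
# Number of pages needed to map GIB gibibytes with a page size of
# PAGE_KB kibibytes (integer arithmetic).
pages_for() {
    echo $(( $1 * 1024 * 1024 / $2 ))
}

pages_for 15 4       # 15 GiB in 4 KB pages -> 3932160
pages_for 15 2048    # 15 GiB in 2 MB pages -> 7680
```

On a Linux host, `grep -i huge /proc/meminfo` shows the configured huge page size and how many huge pages are currently reserved.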
They are used in large-memory installations for Oracle databases (maybe some other vendors do it too; I do it regularly during Oracle installations) and give a large performance benefit...
Huge pages support in Linux: http://lwn.net/Articles/375098/
A recent presentation (from RedHat) and information about huge pages in RHEL6: http://www.slideshare.net/raghusiddarth/transparent-hugepages-in-rhel-6 https://access.redhat.com/knowledge/solutions/46111
Looks like the env var setting shows up, so I am using the setting. It seems to be changing the behavior to a very small degree.
My latest testing shows, though, that there is basically little difference in the behavior. The memory still fills up quickly and crashes the server regardless of the MXFAST setting, and I have some "startling" new test results from another angle that show something fishy going on.
I have been experimenting this morning with a new way to exercise the cache. Perhaps the server isn't meant to do this generally, but it shows up some surprising behavior.
The test is simple:
1. Start the server with a cachesize of 1 instead of -1 for unlimited.
2. Use ldapmodify on the running server to change that back to -1.
3. Run ldapsearch on objectclass=* returning the dn.
   All entries get loaded into the cache by this.
   Actual process memory rises by exactly double the reported "currententrycachesize".
4. Use ldapmodify on the running server to change cachesize back to 1.
   All entries are deleted from the cache according to a search over cn=config, but nothing about the system memory footprint changes.
5. Run the cycle again, loading all entries.
   Entries begin to load and the memory footprint immediately starts rising instead of using existing process memory that should have been freed.
   The next time cachesize is set back to 1, a large chunk of the memory footprint is freed up in the system, and goes back to about what it was the first time the entire cache was loaded -- approximately double the original memory used = currententrycachesize plus the dbcachesize -- even though no entries are in the cache and the reported currententrycachesize is the size of one entry.
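The cycle above could be scripted roughly as follows. This is a sketch under stated assumptions, not the exact commands used in the test: the backend name "userRoot", the suffix, the bind DN, the server URL and the `$LDAP_PW` variable are all placeholders, and the function names are hypothetical.

```shell
# Assumptions (not from the ticket): backend "userRoot", suffix
# "dc=example,dc=com", bind DN "cn=Directory Manager", password in $LDAP_PW,
# server reachable at ldap://localhost:389.
LDAP_URL="ldap://localhost:389"
BIND_DN="cn=Directory Manager"
SUFFIX="dc=example,dc=com"
BACKEND_DN="cn=userRoot,cn=ldbm database,cn=plugins,cn=config"

# cachesize_ldif VALUE: LDIF that replaces nsslapd-cachesize on the backend.
cachesize_ldif() {
    printf 'dn: %s\nchangetype: modify\nreplace: nsslapd-cachesize\nnsslapd-cachesize: %s\n' \
        "$BACKEND_DN" "$1"
}

# set_cachesize VALUE: apply the change to the running server (steps 2 and 4).
set_cachesize() {
    cachesize_ldif "$1" | ldapmodify -x -H "$LDAP_URL" -D "$BIND_DN" -w "$LDAP_PW"
}

# load_all_entries: touch every entry so it is pulled into the cache (step 3).
# "1.1" asks for no attributes, so only DNs come back.
load_all_entries() {
    ldapsearch -x -H "$LDAP_URL" -D "$BIND_DN" -w "$LDAP_PW" \
        -b "$SUFFIX" '(objectclass=*)' 1.1 > /dev/null
}

# reported_cache_size: currententrycachesize from the backend monitor entry.
reported_cache_size() {
    ldapsearch -x -LLL -H "$LDAP_URL" -D "$BIND_DN" -w "$LDAP_PW" \
        -b "cn=monitor,$BACKEND_DN" -s base currententrycachesize
}

# One cycle, run by hand while watching the process size:
#   set_cachesize -1 && load_all_entries && reported_cache_size
#   set_cachesize 1 && reported_cache_size
```

Comparing the reported currententrycachesize with the process size (for example from `ps -o rss= -p <pid>`) after each step gives the two numbers the test contrasts.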
My conclusion from this is that something in the system is holding onto large chunks of memory instead of freeing them. The server may not be "leaking" them per se, and ultimately freeing the space on server shutdown, but while running it is definitely not freeing large amounts of memory that probably should be freed. Fragmentation shouldn't be an issue when much of the memory should be completely freed up. It is acting like there is a copy or history of the previously cached entries being kept around in addition to the existing current entry cache. Then, adding fragmentation to that usage pattern could definitely cause even greater memory growth over time. The huge pages solution may help with the fragmentation, but it seems there is something more going on.
This patch doesn't fix the problem - it allows us to set different values for mxfast
The remainder of the work is still scheduled for 1.3.0.a1
master commit changeset:20dc4bc/389-ds-base
Author: Mark Reynolds mreynolds@redhat.com
Date: Thu Sep 6 13:21:27 2012 -0400
389-ds-base-1.2.11 commit changeset:8b33f23/389-ds-base
Author: Mark Reynolds mreynolds@redhat.com
Date: Fri Sep 7 13:47:10 2012 -0400
Ok. Didn't know that you guys were already looking at more than just fragmentation issues.
I ran some more tests on and off today and focused specifically on checking the functionality of the MXFAST setting. I set my server to cache only half of the entries, so they would generally fit well into memory. Here is what I found.
At each of the values (0, 32, 64), memory fragmentation did not overwhelm the server, and the memory footprint floated up and down within the range 19G to 27G using my newer cache exercise test. No speed difference was noticeable.
When SLAPD_MXFAST was commented out of /etc/sysconfig/dirsrv, the server quickly became overwhelmed with what must be fragmentation, slowed way down for a little while, and then crashed.
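A range like the one above can be collected by sampling the resident size of the process periodically and taking the minimum and maximum. A minimal sketch; the helper name `rss_range`, the log file name and the sampling interval are all hypothetical, and the sampling loop assumes a single ns-slapd instance.

```shell
# rss_range: read one RSS sample (in kB) per line on stdin and print
# "min max". Prints nothing if there are no samples.
rss_range() {
    awk 'NR == 1     { min = max = $1 + 0 }
         $1 + 0 < min { min = $1 + 0 }
         $1 + 0 > max { max = $1 + 0 }
         END          { if (NR) print min, max }'
}

# Sampling, left running next to the test:
#   while sleep 60; do ps -o rss= -p "$(pidof ns-slapd)"; done >> rss.log
#   rss_range < rss.log
```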
If you need a speed comparison of MXFAST versus no MXFAST, I'll have to change my cache parameters and run it all again. It seems clear though that this setting has a significant effect on fragmentation.
Thanks again for running these tests! Did you notice which setting (0, 32, 64) caused the least amount of memory growth?
I did not notice much of a difference, but I think 0 was the best, as might be expected. I'll probably let more tests keep running in the background while I'm getting other stuff done. If I find out more, I'll let you know.
Enabling TRIM_FASTBINS is likely to work as well and is preferable to setting MXFAST to zero. How about exposing that malloc option in a similar way?
TRIM_FASTBINS is a compile-time option, while MXFAST can be set via an environment variable.
Created a new ticket to investigate using TRIM_FASTBINS:
https://fedorahosted.org/389/ticket/489
Closing this ticket.
Metadata Update from @beall: - Issue assigned to mreynolds - Issue set to the milestone: 1.3.0.a1
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/386
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
Thank you for understanding. We apologize for all inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: Fixed)