I want to make sure we capture this thread: https://firstname.lastname@example.org/thread/EP6RSEGGKSSHCZHUOJ7QGWMVOR5MOWXB/
We should investigate the use of lock free data structures in 389. liblfds is already being used by nunc-stans. liblfds has lock free stack and queue data structures. There is a possibility of a lock free tree structure.
As part of this work, we need to develop tools/tests to measure 389 lock contention, so that we can identify places in the server to change, as well as to measure how the performance is related to lock contention, then verify that using lock free data structures improves both contention and performance.
My current work is the benchmark app. I've now got this to the point where it's not far from complete - I need to get gnuplots working again, and then add benchmarks.
Right now, as of tonight, the queue benchmark is back - so below are the current results for the freelist and queue.
http://fpaste.org/343983/ (new link, 01:31 GMT 23rd March - I previously had mistakenly used a debug liblfds, with a release libbenchmark and release benchmark; the new results are for a proper release build.)
This is a release build on my laptop, which is a Core i5. You can see the topology at the top of the output.
The lock-free results have ten lines each - this is where the exponential backoff length is varied, as this factor has a very great impact on performance. The results themselves are the number of ops on each core, where the core number is in brackets.
On the freelist, lock-free (as soon as you step outside of a single physical core) wins, and wins big - an order of magnitude.
For the queue, there is no win - with all four cores, the lock-free queue is second, s about 10% or 15% slower than the pthread spinlock (this is one run of the benchmarks, where each runs for five seconds; there's quite a bit of variability between runs, depending on exactly how things play out between the different logical cores on any given run). A reason for this is that this particular lock-free queue is pretty heavyweight, performing a couple of lock-free ops per enqueue/dequeue. Now I have a benchmark for it, I will be working on performance improvements.
Note the rwlocks in these two benchmarks are used only for write locks, as the data structures are a freelist and a queue. Tomorrow I'll be writing a benchmark for a btree, where there will be a mix of reads and writes.
The gnuplots when they're done (should be tomorrow as well) will be much easier to take in.
Latest results, now including btree benchmark.
No gnuplots yet - that was a 15 hours session after 5 hours sleep to get the btree benchmark up, time for some sleep :-)
There is in the literature a lock-free balanced red-black tree. This tree, in the library, is something I threw together off the top of my head in a day or two, a little while ago, just to get something tree-ish out there; it's add-only, and unbalanced.
The benchark initializes by creating 1024 elements per thread, gives them a random integer key, and inserts them into the tree. The benchmark stores the list of randomly generated keys. The benchmark itself them runs for n seconds (only 1 second for the fpaste here - normally it's 5, 10 is best) where each thread loops, performing a read and then a write - where a read means randomly generating an index into the array of keys and then searching the tree for the key, and then reading the value from that element, and a write is the same, but where we write a new value into the element.
The rwlock is appropriately using read and write type locks.
Where each loop iteration in the main loop is a read and a write, the mix is 50/50. Writes I believe favour lock-free, so an 80/20 mix I expect to show a smaller lead.
However, lock-free as it is, is 10x the performance of rwlock, and 5x that of the next best, the GCC spinlocks. (This being for four logical cores - that's all my laptop has). However, I think I can improve the non-lock-free versions, by making the btree API more suitable for normal locking-type data structures - right now, it behaves like the lock-free code in that there is one lock to find the element, then another lock to read or to write. So you can get an element at any time, and then if you later read from it, you'll get the correct value. We could merge the find and read (or write), which will halve the number of locks taken. This changes functionality, but it is the case here somewhat that the functionality offered by locking and lock-free data structures naturally varies. The lock-free stuff must always get the latest value - that's just how it is. Locking data structure can dispense with this functionality, and so gain some performance - although by the looks of it, they will still be much slower.
So given say an 80/20 mix and that reduction in functionality, the locking data structures should improve quite a bit. I expect however they will still be much slower than lock-free - but we will see.
This is preliminary!!
The locking benchmarks may be underplaying their performance - I may have them performing four locks per iteration when they should only be performing two - but I need to think more about how the benchmark works, whether or not what they're doing now is fair; it's not obviously clear.
Either way it won't matter - you can see in the final chart just how poorly locking scales compared to non-locking - even if locking doubled in performance, they'd all still be half the speed of lock-free.
Latest, improved gnuplot/benchmark.
I've modified the benchmark to favour locking versions of the data structure.
The comparason now isn't quite apples with apples - the lock-free data structure is being disadvantaged, but it's fine, as it's so much faster anyway. It's not like it's a borderline case.
Per ticket triage, set milestone to 1.3.6.
For the "low hanging fruits", let's open a ticket for each fruit and process it.
Metadata Update from @liblfds:
- Issue set to the milestone: 1.3.6 backlog
Metadata Update from @firstyear:
- Issue assigned to firstyear
okay, at this point we have lfds in nunc-stans, and libsds on the roadmap to include in DS.
I think we are well underway, and a better idea is to just target specific areas with these structures now. By integrating sds, we get this.
So we don't need a highlevel tracker, we just need to "do the work" at this point.
Metadata Update from @firstyear:
- Custom field reviewstatus adjusted to review
- Issue close_status updated to: None
Metadata Update from @firstyear:
- Issue close_status updated to: invalid
- Issue status updated to: Closed (was: Open)
to comment on this ticket.