/.../ The completely inlinable macros the old allocator is using seem quite hard to beat.
Is there a big difference? Tobias' comparison with the Pike benchmarks appears to show some speedups and some slowdowns in various tests, which I mostly attribute to noise. I couldn't see any obvious trend that it's slower.
Actually, I thought there would be a win from not inlining as much as Pike currently does - it's better to fit more code into the CPU caches than to inline some constants. I'm keen on reducing the macros and inlining in Pike in general.
I see no problem however, merging it back into the tree. /.../
That'd be great, I think. As I said, I'm in favor, and nobody has spoken against it so far. It'd be an experimental feature in the beginning though - we usually add a configure switch to be able to switch back to the old implementation.
Do you have any plans for switching to object-local storage? /.../
It would be worthwhile to test the overhead when only using a few blocks. The allocator tries to make initialization as cheap as possible, so it might not even hurt performance much. On the other hand, the memory overhead could be quite high.
Yes, allocating a full block for every object would not be good, unless the prospective use cases are only a few very large trees - but I take it this is intended as a general-use data structure?
I think the ideal memory management would be to use a realloc double-the-size strategy for small trees, and switch to a block allocator when they reach the size of a memory page. That's more complex, of course.
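The hybrid strategy could look something like this - a minimal sketch with hypothetical names (`struct tree`, `tree_reserve`, `TREE_PAGE_SIZE` are illustrative, not Pike's actual API):

```c
#include <stdlib.h>

/* Illustrative page size; a real implementation would query the OS. */
#define TREE_PAGE_SIZE 4096

struct node { int key; };

struct tree {
    struct node *nodes;   /* realloc'd array while the tree is small */
    size_t used, alloc;   /* nodes in use / nodes allocated */
};

/* Ensure room for one more node: double the array until it would
 * outgrow a memory page; past that point a block allocator would
 * take over (not shown). Returns 1 on success, 0 otherwise. */
static int tree_reserve(struct tree *t)
{
    if (t->used < t->alloc)
        return 1;
    size_t next = t->alloc ? t->alloc * 2 : 4;
    if (next * sizeof(struct node) > TREE_PAGE_SIZE)
        return 0;  /* switch-over point: hand over to the block allocator */
    struct node *p = realloc(t->nodes, next * sizeof(struct node));
    if (!p)
        return 0;
    t->nodes = p;
    t->alloc = next;
    return 1;
}
```

The extra complexity is confined to the switch-over point; below it, small trees pay only for `realloc` amortized over doublings.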
We had some plans to make a thread safe allocator out of it, allocating blocks from thread local pages, etc. However, it never got any further than some thinking.
It's easy enough to let nodes start out in a thread local area, but how do you keep them that way when you share the data structure between threads? You'd have to implement some sort of migration heuristic so that nodes can move to the thread local area of another thread, and that still wouldn't work well if several threads are manipulating the same large tree.
I can't see how that can be made to work well, so that's why I think it's better to opt for locality-by-relation, i.e. that nodes in the same (sub)tree tend to be kept close to each other. Then trees which are used by a single thread only will effectively be in thread local storage for that thread.
When multisets are reallocated, their nodes are always sorted in memory. That's no significant extra work since the operation is linear anyway. It keeps the nodes of the smallest subtrees right next to each other, and an iterator will access memory completely linearly.
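The relocation step can be sketched like this - a toy version that walks the old tree in order and appends each node to a fresh array, so an in-order iterator afterwards reads memory strictly left to right (the node layout is illustrative, not Pike's actual multiset node):

```c
#include <stddef.h>

struct node {
    int value;
    struct node *left, *right;
};

/* Copy the subtree rooted at n into out[] in sorted order; *pos is
 * the next free slot in out[]. Returns the new address of the copy. */
static struct node *copy_inorder(const struct node *n, struct node *out,
                                 size_t *pos)
{
    if (!n)
        return NULL;
    struct node *l = copy_inorder(n->left, out, pos);
    struct node *self = &out[(*pos)++];   /* claim the next linear slot */
    self->value = n->value;
    self->left = l;
    self->right = copy_inorder(n->right, out, pos);
    return self;
}
```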
Interesting, I'll have a read there. What we did in the CritBit implementation is much simpler than that. Modifications of the tree while iterators are active force the iterator to make one fresh lookup the next time it advances.
I take it you have a generation counter in each object then, so that the iterator can test whether its direct pointer is still valid?
It can potentially hurt performance quite a lot, though.
Performance would suffer only when iterators are used a lot in heavily mutating trees, wouldn't it? That's a fairly peculiar use case imo. Sounds to me like your simpler approach will do just fine in general.
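The counter scheme under discussion can be sketched as follows - the names are illustrative, not the CritBit module's actual API. The container bumps a counter on every structural change, and an iterator that cached a direct pointer compares counters before dereferencing it, falling back to one full lookup on mismatch:

```c
struct cb_tree {
    unsigned generation;   /* bumped on every structural change */
    /* ... nodes ... */
};

struct cb_iterator {
    struct cb_tree *tree;
    unsigned generation;   /* counter value when pos was cached */
    void *pos;             /* direct pointer into the tree */
};

/* Called on every insert/delete. */
static void cb_tree_modified(struct cb_tree *t)
{
    t->generation++;
}

/* Nonzero if it->pos may still be dereferenced directly; otherwise
 * the iterator must re-find its position with one fresh lookup. */
static int cb_iterator_valid(const struct cb_iterator *it)
{
    return it->generation == it->tree->generation;
}
```

The per-step cost in the common (unmodified) case is a single integer compare, which is why it only hurts when iterators and mutations are interleaved heavily.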
There are other places where object-local storage might help. Mapping rehash, for instance, could be made much simpler now that we have power-of-2 hashtables (bucket n gets split into buckets n and n + old size when the hashtable grows). However, I don't know of any easy way to make that work properly with iterators.
Doesn't the generation counter scheme solve that as well?
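The bucket split being discussed can be sketched as follows, assuming buckets are chosen by `hash & (size - 1)`: when the table doubles, every entry of old bucket n lands in new bucket n or n + old size, so each chain can be split without touching any other bucket (the entry type is illustrative, not Pike's mapping internals):

```c
#include <stddef.h>

struct entry {
    size_t hash;
    struct entry *next;
};

/* Redistribute all chains of old_tab (old_size buckets) into new_tab
 * (2 * old_size buckets, must be zero-initialized). Each entry of old
 * bucket n goes to new bucket n or n + old_size, never anywhere else. */
static void rehash_double(struct entry **old_tab, struct entry **new_tab,
                          size_t old_size)
{
    for (size_t n = 0; n < old_size; n++) {
        struct entry *e = old_tab[n];
        while (e) {
            struct entry *next = e->next;
            size_t idx = e->hash & (2 * old_size - 1); /* n or n+old_size */
            e->next = new_tab[idx];
            new_tab[idx] = e;
            e = next;
        }
    }
}
```

An iterator holding a bucket index across such a split would need the same kind of generation check as the tree iterators, which is presumably where the difficulty with iterators comes in.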