On Mon, Apr 16, 2012 at 12:30 AM, Martin Stjernholm, Roxen IS @ Pike developers forum 10353@lyskom.lysator.liu.se wrote:
Just to make it clear, you're talking about the linear lookup to find the right block when structs are freed?
Yes.
No, std::allocator is a block allocator used by std::list, std::map and so on. Compared to glibc free/malloc, the std::allocator is quite fast.
It would be interesting to see a comparison with plain old (glibc) malloc, just to put it into perspective.
There are updated graphs at https://github.com/arneg/GJAlloc/wiki/Performance-comparison. It turns out that I misattributed benefits to std::allocator that it does not have.
Slight ones. For the CritBits, we need the allocator in the object storage (block allocators are used tree-locally).
So the nodes for each critbit tree are allocated within its own blocks? Sifting through the code just briefly I got the impression the nodes were allocated from a single global pool, considering that CB_NODE_ALLOC() doesn't take any context. Did I miss something?
And my bad again. Currently the code does not do this, for comparability between the two block allocators as used by CritBits, I think.
http://fortranland.com/compare_7.9_block_alloc.txt
http://fortranland.com/compare_7.9_block_alloc_cleanup_on_exit.txt
Hmm, do you have any theories to explain the variations seen in those two?
Currently, no explanation that would make you go "Ah, OK, in that case, sure" is available. Especially not any based on facts. :)
Otherwise it seems to me like there's still a lot of noise, and I suspect none of those tests really stress BLOCK_ALLOC in the way where your implementation would show its strength, i.e. freeing structs when there's a large number allocated.
I figured I could add a test for that, but I can't compile CritBit from clean source on the arne/block_alloc branch:
#### Making dynamic: post_modules/CritBit
make[5]: *** No rule to make target `/home/mast/Pike/devel/src/post_modules/CritBit/bitvector.h', needed by `inttree.o'. Stop.
make[4]: *** [all] Error 2
make[3]: *** [CritBit] Error 1
That seems to be either an issue with dependencies, or you need to touch src/post_modules/CritBit/*.cmod, or both. In arne/block_alloc, bitvector.h was moved from src/post_modules/CritBit/ to src/ - any chance your src/post_modules/CritBit/dependencies has size > 0?
And of course, a couple of CritBit benchmarks would be welcome - they also ought to show off GJAlloc, right? ;)
That can be arranged.
There are updated graphs at https://github.com/arneg/GJAlloc/wiki/Performance-comparison. It turns out that I misattributed benefits to std::allocator that it does not have.
Nice, thank you. Afaik there are no near-term 7.9 release plans, so I'm for letting this allocator go in, provided there's a configure option to switch to the old one.
I'd rather have it directly in the source than as a bundle, but I guess it's up to you really. (I tried to symlink the bundle dir to your upstream git, but I couldn't get that to work very well - clearly the bundle system isn't made with that in mind. But that's fixable, of course.)
And my bad again. Currently the code does not do this, for comparability between the two block allocators as used by CritBits, I think.
Do you have any plans for switching to object-local storage? I'm not sure a single shared pool would do that well if there are many objects - I suspect it'd be bad for locality, and if we ever go properly multi-cpu it'd produce a lot of false sharing in cpu caches.
When I implemented the new multisets based on red-black trees, the node memory management was actually the most time-consuming part. My implementation separates the nodes for each multiset, and moves them to be able to resize the memory blocks and to compact them to limit internal fragmentation. It always allocates a single block for the whole multiset, regardless of how large it is. It'd probably be better to use multiple blocks when it fills more than one page, but it could still be important to support moving nodes to be able to compact smaller trees.
Note that handling moving nodes has a big effect on the iterators. They need to address nodes by index rather than by direct pointer if they are to be able to handle simultaneous changes to the tree. For multisets there's a "semi-frozen" state when there are active iterators: Since iterators address by index, the memory block as a whole may move (and thus be enlarged), but the nodes inside it may not, so compaction is postponed until all iterators are gone. (There's a comment near the top of multiset.h that goes into detail on this.)
So my point is just that handling object-local storage of the nodes could have a quite big effect on the code.
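A minimal sketch of the index addressing described above, using hypothetical names rather than the actual multiset.h structures:

/* Hypothetical sketch (not the actual multiset.h code): iterators keep an
 * index into the node block instead of a raw pointer, so the block as a
 * whole may be realloc'd (and thereby move) while iterators are active. */
#include <stddef.h>
#include <stdlib.h>

struct ms_node { size_t left, right; /* ... key/value, color ... */ };

struct ms_data {
  size_t size, allocated;
  struct ms_node *nodes;       /* may move when the block is resized */
};

struct ms_iterator {
  struct ms_data *msd;
  size_t pos;                  /* index; stays valid if msd->nodes moves */
};

/* Resolve the iterator on every access; safe even after the block moved. */
static struct ms_node *ms_iter_node(struct ms_iterator *it) {
  return &it->msd->nodes[it->pos];
}

/* Growing the block may move it, but node indices are unchanged.  Compaction
 * (which would change indices) has to wait until no iterators exist. */
static int ms_grow(struct ms_data *msd) {
  size_t n = msd->allocated * 2;
  struct ms_node *p = realloc(msd->nodes, n * sizeof(*p));
  if (!p) return -1;
  msd->nodes = p;
  msd->allocated = n;
  return 0;
}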
/.../ any chance your src/post_modules/CritBit/dependencies has size > 0?
Hmm, I'm pretty sure I tried "make depend" a few times before giving up yesterday, but now it works. Sorry for the noise.
And of course, a couple of CritBit benchmarks would be welcome - they also ought to show off GJAlloc, right? ;)
That can be arranged.
Cool. I've added a test that shows the difference better: It allocates and frees random objects while keeping 1000000 objects allocated. That shows a clear difference:
                   total    user    mem      (runs)
Old BLOCK_ALLOC    2.036s   1.689s  89112kb  (3)  (0/s)
GJAlloc            1.110s   0.805s  89084kb  (5)  (0/s)
Interesting for Roxen as well, since Roxen servers tend to have quite a lot of objects allocated. Sweet. :)
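The stress pattern of that test boils down to roughly the following. This is a hypothetical sketch with plain malloc/free standing in for the block allocator under test; the random frees in the middle are what exercise the free path where the old BLOCK_ALLOC does its linear block lookup:

/* Hypothetical sketch of the stress pattern: keep LIVE objects allocated and
 * repeatedly free a random one and allocate a replacement.  Plain malloc/free
 * stand in for the block allocator under test; error checking is omitted. */
#include <stdlib.h>

#define LIVE   1000000
#define ROUNDS 5000000

struct obj { int data[4]; };

int main(void) {
  static struct obj *live[LIVE];
  unsigned long long rng = 88172645463325252ULL;   /* simple LCG state */
  size_t i;

  for (i = 0; i < LIVE; i++) live[i] = malloc(sizeof(struct obj));

  for (i = 0; i < ROUNDS; i++) {
    rng = rng * 6364136223846793005ULL + 1442695040888963407ULL;
    size_t k = (size_t)((rng >> 33) % LIVE);
    free(live[k]);                         /* random free among many live */
    live[k] = malloc(sizeof(struct obj));  /* ...and replace it */
  }

  for (i = 0; i < LIVE; i++) free(live[i]);
  return 0;
}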
On Mon, 16 Apr 2012, Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
There are updated graphs at https://github.com/arneg/GJAlloc/wiki/Performance-comparison. It turns out that I misattributed benefits to std::allocator that it does not have.
Nice, thank you. Afaik there are no near-term 7.9 release plans, so I'm for letting this allocator go in, provided there's a configure option to switch to the old one.
I'd rather have it directly in the source than as a bundle, but I guess it's up to you really. (I tried to symlink the bundle dir to your upstream git, but I couldn't get that to work very well - clearly the bundle system isn't made with that in mind. But that's fixable, of course.)
When we first wrote it, we did this within the Pike tree. However, after our benchmarks did not quite give the same result we got when directly comparing the old one to the new one in micro-benchmarks (e.g. test.cpp in the GJAlloc source), we decided to move it into a separate library, since we were not sure whether it would ever be included. The completely inlinable macros the old allocator is using seem quite hard to beat. I see no problem, however, in merging it back into the tree. It might still be useful as a separate package, but that's another matter.
And my bad again. Currently the code does not do this, for comparability between the two block allocators as used by CritBits, I think.
Do you have any plans for switching to object-local storage? I'm not sure a single shared pool would do that well if there are many objects
- I suspect it'd be bad for locality, and if we ever go properly
multi-cpu it'd produce a lot of false sharing in cpu caches.
It would be worthwhile to test the overhead when using only a few blocks. The allocator tries to make initialization as cheap as possible, so it might not even hurt performance much. On the other hand, memory overhead could be quite high.
We had some plans to make a thread safe allocator out of it, allocating blocks from thread local pages, etc. However, it never got any further than some thinking.
When I implemented the new multisets based on red-black trees, the node memory management was actually the most time-consuming part. My implementation separates the nodes for each multiset, and moves them to be able to resize the memory blocks and to compact them to limit internal fragmentation. It always allocates a single block for the whole multiset, regardless of how large it is. It'd probably be better to use multiple blocks when it fills more than one page, but it could still be important to support moving nodes to be able to compact smaller trees.
Note that handling moving nodes has a big effect on the iterators. They need to address nodes by index rather than by direct pointer if they are to be able to handle simultaneous changes to the tree. For multisets there's a "semi-frozen" state when there are active iterators: Since iterators address by index, the memory block as a whole may move (and thus be enlarged), but the nodes inside it may not, so compaction is postponed until all iterators are gone. (There's a comment near the top of multiset.h that goes into detail on this.)
So my point is just that handling object-local storage of the nodes could have a quite big effect on the code.
Interesting, will have a read there. What we did in the CritBit implementation is much simpler than that. Modifications of the tree while iterators are active force the iterator to make one lookup next time. It can potentially hurt performance quite a lot, though.
There are other places where object-local storage might help. Mapping rehash, for instance, could be made much simpler now that we have power-of-2 hashtables (bucket n gets split into buckets n and n*2 when the hashtable grows). However, I don't know of any easy way to make that work properly with iterators.
On Tue, 17 Apr 2012, Arne Goedeke wrote:
There are other places where object-local storage might help. Mapping rehash, for instance, could be made much simpler now that we have power-of-2 hashtables (bucket n gets split into buckets n and n*2 when the hashtable grows). [...]
That's actually not correct. It's n and hashsize + n.
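The reason that works with power-of-two table sizes: the bucket is hash & (hashsize - 1), so doubling the table makes exactly one more hash bit significant, and every entry in old bucket n lands in either new bucket n or new bucket hashsize + n. A hypothetical sketch, not the actual Pike mapping code:

/* Hypothetical sketch: when the table doubles, each old bucket n is split
 * locally into new buckets n and hashsize + n, decided by the newly
 * significant hash bit. */
struct entry { struct entry *next; unsigned long hval; /* ... */ };

static void split_bucket(struct entry **old_tbl, struct entry **new_tbl,
                         unsigned long n, unsigned long hashsize) {
  struct entry *e = old_tbl[n], *next;
  for (; e; e = next) {
    next = e->next;
    /* hashsize is a power of two, so (hval & hashsize) is the new bit. */
    unsigned long b = (e->hval & hashsize) ? hashsize + n : n;
    e->next = new_tbl[b];
    new_tbl[b] = e;
  }
}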
/.../ The completely inlinable macros the old allocator is using seem quite hard to beat.
Is there a big difference? Tobias' comparison with the pike benchmarks appear to show some speedups and some slowdowns in various tests, which I attribute to noise mostly. I couldn't see any obvious trend that it's slower.
Actually, I thought there would be a win from not inlining so much as Pike currently does - it's better to fit more code into the cpu caches than to inline some constants. I'm keen on reducing the macros and inlining in Pike in general.
I see no problem, however, in merging it back into the tree. /.../
That'd be great, I think. As I said, I'm in favor, and nobody has spoken against it so far. It'd be an experimental feature in the beginning though - we usually add a configure switch to be able to switch back to the old implementation.
Do you have any plans for switching to object-local storage? /.../
It would be worthwhile to test the overhead when using only a few blocks. The allocator tries to make initialization as cheap as possible, so it might not even hurt performance much. On the other hand, memory overhead could be quite high.
Yes, allocating a full block for every object would not be good, unless the prospective use cases only are a few very large trees, but I take it this is intended as a general-use data structure?
I think the ideal memory management would be to use a realloc double-the-size strategy for small trees, and switch to a block allocator when they reach the size of a memory page. That's more complex, of course.
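A minimal sketch of that hybrid strategy, under the assumption of a doubling array of nodes; the block-allocator calls (make_block_allocator, block_alloc) are placeholders, not the actual Pike API:

/* Hypothetical sketch of the hybrid strategy: a doubling realloc'd array
 * while the tree is small, a block allocator once it would exceed a page.
 * make_block_allocator()/block_alloc() are placeholders, not the Pike API;
 * error handling is omitted. */
#include <stdlib.h>

#define PAGE_SIZE 4096

struct node { size_t left, right; /* ... key/value ... */ };

struct block_allocator;                               /* placeholder */
struct block_allocator *make_block_allocator(size_t elem_size);
void *block_alloc(struct block_allocator *a);

struct tree {
  size_t used, allocated;
  struct node *small;           /* doubling array while the tree is small */
  struct block_allocator *big;  /* used once the array would exceed a page */
};

struct node *tree_alloc_node(struct tree *t) {
  if (t->big)
    return block_alloc(t->big);
  if (t->used == t->allocated) {
    size_t n = t->allocated ? t->allocated * 2 : 8;
    if (n * sizeof(struct node) > PAGE_SIZE) {
      /* Switch over; nodes already in t->small stay put here, and whether
       * to migrate them is part of the extra complexity mentioned above. */
      t->big = make_block_allocator(sizeof(struct node));
      return block_alloc(t->big);
    }
    t->small = realloc(t->small, n * sizeof(struct node));
    t->allocated = n;
  }
  return &t->small[t->used++];
}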
We had some plans to make a thread safe allocator out of it, allocating blocks from thread local pages, etc. However, it never got any further than some thinking.
It's easy enough to let nodes start out in a thread local area, but how do you keep them that way when you share the data structure between threads? You'd have to implement some sort of migration heuristic so that nodes can move to the thread local area of another thread, and that still wouldn't work well if several threads are manipulating the same large tree.
I can't see how that can be made to work well, so that's why I think it's better to opt for locality-by-relation, i.e. that nodes in the same (sub)tree tend to be kept close to each other. Then trees which are used by a single thread only will effectively be in thread local storage for that thread.
When multisets are reallocated, their nodes are always sorted in memory. That's not any significant extra work since the operation is linear anyway, it keeps the nodes of the smallest subtrees right next to each other, and an iterator will access memory completely linearly.
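A hypothetical sketch of such an in-order copy during a rebuild, with index-based links rather than the actual multiset node layout:

/* Hypothetical sketch: when the node block is rebuilt, copy the nodes in
 * tree order so that in-order iteration afterwards walks memory linearly.
 * Links are indices; NIL means "no child". */
#include <stddef.h>

#define NIL ((size_t)-1)

struct rb_node { size_t left, right; /* ... key/value, color ... */ };

/* Copy the subtree rooted at src[root] into dst in sorted order, handing out
 * destination slots via *next; returns the new index of the subtree root. */
static size_t copy_in_order(const struct rb_node *src, size_t root,
                            struct rb_node *dst, size_t *next) {
  size_t l = NIL, r = NIL, me;
  if (src[root].left != NIL)
    l = copy_in_order(src, src[root].left, dst, next);
  me = (*next)++;              /* the left subtree got the slots before me */
  dst[me] = src[root];
  dst[me].left = l;
  if (src[root].right != NIL)
    r = copy_in_order(src, src[root].right, dst, next);
  dst[me].right = r;
  return me;
}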
Interesting, will have a read there. What we did in the CritBit implementation is much simpler than that. Modifications of the tree while iterators are active force the iterator to make one lookup next time.
I take it you have a generation counter in each object then, so that the iterator can test if its direct pointer still is valid?
It can potentially hurt performance quite a lot, though.
Performance would suffer only when iterators are used a lot in heavily mutating trees, wouldn't it? That's a fairly peculiar use case imo. Sounds to me like your simpler approach will do just fine in general.
There are other places where object-local storage might help. Mapping rehash, for instance, could be made much simpler now that we have power-of-2 hashtables (bucket n gets split into buckets n and n*2 when the hashtable grows). However, I don't know of any easy way to make that work properly with iterators.
Doesn't the generation counter scheme solve that as well?
On Sat, 21 Apr 2012, Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
/.../ The completely inlinable macros the old allocator is using seem quite hard to beat.
Is there a big difference? Tobias' comparison with the pike benchmarks appear to show some speedups and some slowdowns in various tests, which I attribute to noise mostly. I couldn't see any obvious trend that it's slower.
There are a few tests which suffer noticeably (e.g. AppendMapping(2) and some others) and I have no idea what the reason is. Most of the tests look the same or faster with the new one.
Actually, I thought there would be a win from not inlining so much as Pike currently does - it's better to fit more code into the cpu caches than to inline some constants. I'm keen on reducing the macros and inlining in Pike in general.
Yes, true. It depends on the code surrounding it. In those micro-benchmarks inlining is clearly a win. Inside the Pike source I don't know. I guess the 'noise' we see is somehow related to that as well. And -O3 doesn't make inlining very predictable.
I see no problem, however, in merging it back into the tree. /.../
That'd be great, I think. As I said, I'm in favor, and nobody has spoken against it so far. It'd be an experimental feature in the beginning though - we usually add a configure switch to be able to switch back to the old implementation.
Yes, that sounds good.
Do you have any plans for switching to object-local storage? /.../
It would be worthwhile to test the overhead when using only a few blocks. The allocator tries to make initialization as cheap as possible, so it might not even hurt performance much. On the other hand, memory overhead could be quite high.
Yes, allocating a full block for every object would not be good, unless the prospective use cases only are a few very large trees, but I take it this is intended as a general-use data structure?
I think the ideal memory management would be to use a realloc double-the-size strategy for small trees, and switch to a block allocator when they reach the size of a memory page. That's more complex, of course.
Interesting idea. So then one would grow the page until it reaches the allocator's page size and let the allocator handle it from then on?
We had some plans to make a thread safe allocator out of it, allocating blocks from thread local pages, etc. However, it never got any further than some thinking.
It's easy enough to let nodes start out in a thread local area, but how do you keep them that way when you share the data structure between threads? You'd have to implement some sort of migration heuristic so that nodes can move to the thread local area of another thread, and that still wouldn't work well if several threads are manipulating the same large tree.
The allocator page would be local in the sense that only one thread is serving allocations from it. Deallocations could happen from all threads, so some synchronization is necessary there. I was hoping that a CAS would be enough. It's not useful for Pike, because of the global lock. We were just pondering how one could go about making it thread safe without locks.
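For what it's worth, that scheme could look roughly like the following; all names are hypothetical and this was never implemented:

/* Hypothetical sketch: each page is owned by exactly one thread, which
 * allocates from it without synchronization.  A thread freeing a slot it
 * does not own pushes it onto the page's remote free list with a CAS; the
 * owning thread drains that list the next time it allocates. */
#include <stdatomic.h>
#include <stddef.h>

struct slot { struct slot *next; };

struct page {
  struct slot *local_free;             /* only touched by the owning thread */
  _Atomic(struct slot *) remote_free;  /* CAS-pushed by all other threads */
  /* ... the block memory itself ... */
};

void remote_free_slot(struct page *p, struct slot *s) {
  struct slot *head = atomic_load(&p->remote_free);
  do {
    s->next = head;
  } while (!atomic_compare_exchange_weak(&p->remote_free, &head, s));
}

struct slot *owner_alloc(struct page *p) {
  if (!p->local_free)   /* reclaim whatever other threads freed meanwhile */
    p->local_free = atomic_exchange(&p->remote_free, NULL);
  if (!p->local_free)
    return NULL;        /* page exhausted; the caller grabs a fresh page */
  struct slot *s = p->local_free;
  p->local_free = s->next;
  return s;
}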
I can't see how that can be made to work well, so that's why I think it's better to opt for locality-by-relation, i.e. that nodes in the same (sub)tree tend to be kept close to each other. Then trees which are used by a single thread only will effectively be in thread local storage for that thread.
Yes, true.
When multisets are reallocated, their nodes are always sorted in memory. That's not any significant extra work since the operation is linear anyway, it keeps the nodes of the smallest subtrees right next to each other, and an iterator will access memory completely linearly.
Interesting, will have a read there. What we did in the CritBit implementation is much simpler than that. Modifications of the tree while iterators are active force the iterator to make one lookup next time.
I take it you have a generation counter in each object then, so that the iterator can test if its direct pointer still is valid?
Exactly.
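A minimal sketch of that generation-counter scheme, with hypothetical names rather than the actual CritBit code:

/* Hypothetical sketch: every mutation bumps tree->generation; the iterator
 * caches the generation it last saw and, when it no longer matches,
 * re-finds its node by key (one extra lookup). */
struct cb_tree { unsigned long generation; /* ... root, etc. ... */ };
struct cb_node;

/* Assumed to exist for the sketch: an ordinary key lookup. */
struct cb_node *cb_find(struct cb_tree *t, const char *key);

struct cb_iterator {
  struct cb_tree *tree;
  struct cb_node *node;       /* only trusted while the generations match */
  const char *key;            /* key of the current node */
  unsigned long generation;
};

struct cb_node *cb_iter_current(struct cb_iterator *it) {
  if (it->generation != it->tree->generation) {
    /* The tree changed under us: pay for one lookup and revalidate. */
    it->node = cb_find(it->tree, it->key);
    it->generation = it->tree->generation;
  }
  return it->node;
}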
It can potentially hurt performance quite a lot, though.
Performance would suffer only when iterators are used a lot in heavily mutating trees, wouldn't it? That's a fairly peculiar use case imo. Sounds to me like your simpler approach will do just fine in general.
Right, it's probably not the dominant use case.
There are other places where object-local storage might help. Mapping rehash, for instance, could be made much simpler now that we have power-of-2 hashtables (bucket n gets split into buckets n and n*2 when the hashtable grows). However, I don't know of any easy way to make that work properly with iterators.
Doesn't the generation counter scheme solve that as well?
When the table has not been grown, a generation counter should be enough to restart from the same bucket. One would have to keep track of all entries the iterator has already seen in that bucket. If the table grows or shrinks, it gets complicated.
On Sat, 21 Apr 2012, Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
I think the ideal memory management would be to use a realloc double-the-size strategy for small trees, and switch to a block allocator when they reach the size of a memory page. That's more complex, of course.
True, it would be much more complex. It would effectively mean having two different algorithms in the data type using the blocks as well, I guess. There is also the additional danger of it leading to much higher memory fragmentation. Using realloc and copying has the advantage of always being able to shrink back.
One other (and quite experimental) solution would be to use realloc on reserved address space. There is interesting work by Niall Douglas and others on a new malloc API which could do that: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1527.pdf
I have not used it, and who knows if it will ever have the chance of becoming part of the standard... On the other hand, he has an implementation, so it would be possible to experiment with it.
On Sun, 22 Apr 2012, Arne Goedeke wrote:
On Sat, 21 Apr 2012, Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
I think the ideal memory management would be to use a realloc double-the-size strategy for small trees, and switch to a block allocator when they reach the size of a memory page. That's more complex, of course.
True, it would be much more complex. It would effectively mean having two different algorithms in the data type using the blocks as well, I guess. There is also the additional danger of it leading to much higher memory fragmentation. Using realloc and copying has the advantage of always being able to shrink back.
One other (and quite experimental) solution would be to use realloc on reserved address space. There is interesting work by Niall Douglas and others on a new malloc API which could do that: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1527.pdf
I have not used it, and who knows if it will ever have the chance of becoming part of the standard... On the other hand, he has an implementation, so it would be possible to experiment with it.
Sorry for mass posting here. As a follow-up: I just saw that the reserved space is merely a hint. It's intended as an optimization, not as a feature that guarantees the addresses stay valid. Also, the sizes we are looking at are probably much smaller than the memory page size, so it would also mean always using a minimum size of 4K.