I have added two new tests to pike -x benchmark, added a --tests='Glob here' argument to that program so that I can run only the tests I want, and then optimized cloning of the null Pike class by about 20%.
Cloning a non-null Pike class (one with a create method that actually does something, and a local variable) is also faster.
Before:
    test                        total   user    mem     (runs)
    ---------------------------------------------------------
    Clone null-object.......... 0.502s  0.421s  2936kb  (10)
    Clone object............... 0.800s  0.761s  2936kb  (7)
After:
    test                        total   user    mem     (runs)
    ---------------------------------------------------------
    Clone null-object.......... 0.360s  0.326s  2948kb  (14)
    Clone object............... 0.688s  0.651s  2948kb  (8)
On a related note, the same optimizations could rather easily be done to gc_check_object, gc_mark_object_as_referenced and real_gc_cycle_check_object.
The key to the whole thing is that a lot of struct pike_frame:s were created, initialized, linked and deinitialized unnecessarily, and this took quite a lot of time.
/ Per Hedbor ()
BTW, for the benchmark it would be nice to get an "iterations per second" measurement or perhaps "ms per iteration". That way it's easier to compare. Or perhaps the number shown is one iteration already? Hmm. That might make sense I guess. :-P
/ David Hedbor
Previous text:
2002-12-17 20:56: Subject: pike_frames vs. clone_object/destruct. (7.5)
I have added two new tests to pike -x benchmark, added a --tests='Glob here' argument to that program so that I can run only the tests I want, and then optimized cloning of the null Pike class by about 20%.
Cloning a non-null Pike class (one with a create method that actually does something, and a local variable) is also faster.
Before:
    test                        total   user    mem     (runs)
    Clone null-object.......... 0.502s  0.421s  2936kb  (10)
    Clone object............... 0.800s  0.761s  2936kb  (7)
After:
    test                        total   user    mem     (runs)
    Clone null-object.......... 0.360s  0.326s  2948kb  (14)
    Clone object............... 0.688s  0.651s  2948kb  (8)
On a related note, the same optimizations could rather easily be done to gc_check_object, gc_mark_object_as_referenced and real_gc_cycle_check_object.
The key to the whole thing is that a lot of struct pike_frame:s were created, initialized, linked and deinitialized unnecessarily, and this took quite a lot of time.
/ Per Hedbor ()
It is.
/ Per Hedbor ()
Previous text:
2002-12-17 21:04: Subject: pike_frames vs. clone_object/destruct. (7.5)
BTW, for the benchmark it would be nice to get an "iterations per second" measurement or perhaps "ms per iteration". That way it's easier to compare. Or perhaps the number shown is one iteration already? Hmm. That might make sense I guess. :-P
/ David Hedbor
I've thought of adding reporting of "n", so that you can get the number of operations per second. This is most useful for tests with a clear inner loop, such as this case and the loop test cases.
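The arithmetic behind such a figure could look roughly like the sketch below. It assumes that the reported time covers one run of a test and that each run performs a known number n of inner-loop operations. The helper names ops_per_second and ms_per_op are made up for illustration and are not part of pike -x benchmark.

```c
/* Hypothetical helpers, not part of pike -x benchmark.
 * seconds_per_run: the time reported for one run of a test.
 * n: number of inner-loop operations in one run (an assumed value
 *    that each test would have to report itself). */
double ops_per_second(double seconds_per_run, unsigned long n)
{
    if (seconds_per_run <= 0.0)
        return 0.0;                 /* avoid dividing by zero */
    return (double)n / seconds_per_run;
}

double ms_per_op(double seconds_per_run, unsigned long n)
{
    if (n == 0)
        return 0.0;                 /* no operations, nothing to report */
    return seconds_per_run * 1000.0 / (double)n;
}
```

With numbers like these, two tests with different inner-loop counts can be compared directly, which the raw per-run time does not allow.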
/ Mirar
What kept you from doing them in the gc functions?
I wonder if one can be a little naughty and allocate the pike_frames on the stack in those functions. One thing I don't understand though is why most functions carefully avoid having the extra ref during most of the frame's lifetime.
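As a rough illustration of what stack allocation could mean here, the sketch below uses a toy frame type, not the real struct pike_frame. Everything in it is an assumption: the member names, the global toy_fp standing in for the current-frame pointer, and the choice to hold one extra reference so that nothing can try to free stack memory.

```c
#include <stddef.h>

/* Toy stand-in for a frame; NOT the real struct pike_frame layout. */
struct toy_frame {
    struct toy_frame *next;     /* previous frame in the chain */
    int refs;                   /* reference count */
    void *current_object;       /* object the frame operates on */
};

/* Toy stand-in for the interpreter's current-frame pointer. */
struct toy_frame *toy_fp = NULL;

/* Link a caller-provided (here: stack-allocated) frame into the chain.
 * refs starts at 2: one for the chain plus one extra, so that a release
 * elsewhere can never drop it to zero and try to free stack memory. */
void toy_push_frame(struct toy_frame *f, void *obj)
{
    f->next = toy_fp;
    f->refs = 2;
    f->current_object = obj;
    toy_fp = f;
}

/* Unlink the frame. Returns 0 on success, or -1 if the count shows that
 * something else kept or dropped a reference in the meantime. */
int toy_pop_frame(struct toy_frame *f)
{
    toy_fp = f->next;
    f->refs--;
    return f->refs == 1 ? 0 : -1;
}

/* Example visitor: the frame lives on the C stack, so no allocator call
 * is needed. Returns the chain depth seen while the frame was linked. */
int toy_visit(void *obj)
{
    struct toy_frame frame;
    struct toy_frame *p;
    int depth = 0;

    toy_push_frame(&frame, obj);
    for (p = toy_fp; p != NULL; p = p->next)
        depth++;
    if (toy_pop_frame(&frame) != 0)
        return -1;
    return depth;
}
```

The obvious hazard is a reference to the frame that outlives the function, which is why the sketch checks the count when unlinking.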
/ Martin Stjernholm, Roxen IS
Mostly I'm lazy. I also thought that someone more involved in the GC-code should do that.
That might be possible. It's not the actual allocation that is most expensive, though; it's the initialization of the frames. Also, accessing prog, storage and similar members through pike_frame instead of a local variable generated extra memory operations; gcc did not really optimize that code all that well.
/ Per Hedbor ()
Previous text:
2002-12-17 22:33: Subject: pike_frames vs. clone_object/destruct. (7.5)
What kept you from doing them in the gc functions?
I wonder if one can be a little naughty and allocate the pike_frames on the stack in those functions. One thing I don't understand though is why most functions carefully avoid having the extra ref during most of the frame's lifetime.
/ Martin Stjernholm, Roxen IS
If it doesn't optimize common subexpressions well then there's a whole lot of trivial optimizations we can do; long access chains are very common in the pike core.
/ Martin Stjernholm, Roxen IS
Previous text:
2002-12-17 22:53: Subject: pike_frames vs. clone_object/destruct. (7.5)
Mostly I'm lazy. I also thought that someone more involved in the GC-code should do that.
That might be possible. It's not the actual allocation that is most expensive, though; it's the initialization of the frames. Also, accessing prog, storage and similar members through pike_frame instead of a local variable generated extra memory operations; gcc did not really optimize that code all that well.
/ Per Hedbor ()
It seems to fail to do that when there are function calls of some kind between the accesses. I would say that that is a feature, not a misfeature in gcc.
However, code like
    if(pike_frame->context.prog->event_handler)
      pike_frame->context.prog->event_handler(PROG_EVENT_GC_RECURSE);

    for(q=0;q<(int)pike_frame->context.prog->num_variable_index;q++)
    {
      int d=pike_frame->context.prog->variable_index[q];
      if(IDENTIFIER_IS_ALIAS(pike_frame->context.prog->identifiers[d].
                             identifier_flags))
      {
        ...
        gc_mark_svalues( s, 1 );
        ...
      }
    }
is not really all that optimal, since the memory access in the for-loop condition seems to be done once per iteration:
    2e09: 8b 73 4c        mov    0x4c(%ebx),%esi
    2e0c: 83 c4 10        add    $0x10,%esp
    2e0f: 89 f1           mov    %esi,%ecx
    2e11: 47              inc    %edi
    2e12: 0f b7 41 6e     movzwl 0x6e(%ecx),%eax
    2e16: 39 c7           cmp    %eax,%edi
    2e18: 7c a6           jl     2dc0 <gc_mark_object_as_referenced+0x1c0>
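One way to avoid the repeated loads is to read the access chain into local variables before the loop. The sketch below shows the idea on toy structs that only borrow a few member names from the code above; the types, the callback and the function names are hypothetical. It is only valid under the assumption that nothing called inside the loop changes the program pointer or its tables.

```c
#include <stddef.h>

/* Toy stand-ins; the real Pike structs have many more members. */
struct sketch_program {
    unsigned short num_variable_index;
    unsigned short *variable_index;
};
struct sketch_context { struct sketch_program *prog; };
struct sketch_frame { struct sketch_context context; };

/* Hypothetical per-variable callback, standing in for the work done
 * inside the loop body (such as the gc_mark_svalues() call). */
typedef void (*sketch_visit_fn)(int identifier, void *userdata);

/* Example callback: adds each identifier number to an int total. */
void sketch_sum(int identifier, void *userdata)
{
    *(int *)userdata += identifier;
}

/* Walks the variable index with the access chain hoisted into locals.
 * Assumes visit() does not modify frame->context.prog or its tables. */
int sketch_walk_variables(struct sketch_frame *frame,
                          sketch_visit_fn visit, void *userdata)
{
    struct sketch_program *prog = frame->context.prog;  /* loaded once */
    int n = (int)prog->num_variable_index;              /* loaded once */
    unsigned short *index = prog->variable_index;       /* loaded once */
    int q;

    for (q = 0; q < n; q++)
        visit((int)index[q], userdata);
    return n;
}
```

With the bound and the table pointer in locals, the compiler is free to keep them in registers across the call in the loop body.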
/ Per Hedbor ()
Previous text:
2002-12-17 23:34: Subject: pike_frames vs. clone_object/destruct. (7.5)
If it doesn't optimize common subexpressions well then there's a whole lot of trivial optimizations we can do; long access chains are very common in the pike core.
/ Martin Stjernholm, Roxen IS
gcc would have to do global optimizations to be able to do better, I guess. Still, there are lots of places which are spoiled by function calls.
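A small sketch of why a call in the loop body forces the reload: as far as the compiler knows, the callee may change what the pointer chain refers to, so a cached value and a reloaded value can legitimately differ. The types and functions below are toy examples invented for illustration, not Pike code.

```c
/* Toy stand-ins, not the real Pike structs. */
struct alias_program { int num_variable_index; };
struct alias_frame { struct alias_program *prog; };

struct alias_program alias_small_prog = { 2 };

/* A call that swaps the program behind the frame. A compiler that
 * cannot see into the callee has to assume any call might do this. */
void alias_callback(struct alias_frame *f)
{
    f->prog = &alias_small_prog;
}

/* Re-reads f->prog->num_variable_index on every iteration, as the
 * compiled loop does. Returns the number of callback calls made. */
int alias_count_reloading(struct alias_frame *f)
{
    int q, calls = 0;
    for (q = 0; q < f->prog->num_variable_index; q++) {
        alias_callback(f);
        calls++;
    }
    return calls;
}

/* Caches the bound in a local before the loop, so changes made by
 * the callback are not seen. Returns the number of callback calls. */
int alias_count_cached(struct alias_frame *f)
{
    int n = f->prog->num_variable_index;
    int q, calls = 0;
    for (q = 0; q < n; q++) {
        alias_callback(f);
        calls++;
    }
    return calls;
}
```

Starting from a program with five variables, the two functions make a different number of calls, so caching by hand is only safe where the programmer knows the callee leaves the chain alone.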
/ Martin Stjernholm, Roxen IS
Previous text:
2002-12-17 23:47: Subject: pike_frames vs. clone_object/destruct. (7.5)
It seems to fail to do that when there are function calls of some kind between the accesses. I would say that that is a feature, not a misfeature in gcc.
However, code like
    if(pike_frame->context.prog->event_handler)
      pike_frame->context.prog->event_handler(PROG_EVENT_GC_RECURSE);

    for(q=0;q<(int)pike_frame->context.prog->num_variable_index;q++)
    {
      int d=pike_frame->context.prog->variable_index[q];
      if(IDENTIFIER_IS_ALIAS(pike_frame->context.prog->identifiers[d].
                             identifier_flags))
      {
        ...
        gc_mark_svalues( s, 1 );
        ...
      }
    }
is not really all that optimal, since the memory access in the for-loop condition seems to be done once per iteration:
    2e09: 8b 73 4c        mov    0x4c(%ebx),%esi
    2e0c: 83 c4 10        add    $0x10,%esp
    2e0f: 89 f1           mov    %esi,%ecx
    2e11: 47              inc    %edi
    2e12: 0f b7 41 6e     movzwl 0x6e(%ecx),%eax
    2e16: 39 c7           cmp    %eax,%edi
    2e18: 7c a6           jl     2dc0 <gc_mark_object_as_referenced+0x1c0>
/ Per Hedbor ()
pike-devel@lists.lysator.liu.se