I've tested on a few more architectures now and it seems that it indeed is Athlon specific. The following are measurements with machine code. The last column is the difference in user time wrt without machine code.
sparc (Ultra-80):
test                         total   user    mem    (runs)                        diff
Compile....................  6.588s  4.096s   -1kb   (76)   (5894 lines/s)           2%
Compile & Exec.............  7.060s  4.488s   -1kb   (71)   (134009 lines/s)         0%
Ackermann..................  5.311s  2.841s   -1kb   (95)                          -10%
Loops Nested (local).......  4.426s  1.981s   -1kb  (100)   (8469060 iters/s)        9%
Loops Nested (global)......  4.992s  2.532s   -1kb  (100)   (6625550 iters/s)      -18%
Loops Recursed.............  4.747s  2.270s   -1kb  (100)   (461968 iters/s)       -16%
Seems like local variable accesses could be improved on sparc.
ia32 (PIII 700 MHz):
test                         total   user    mem      (runs)                       diff
Compile....................  3.462s  2.945s  5260kb   (100)  (8198 lines/s)          11%
Compile & Exec.............  3.408s  2.760s  3628kb   (100)  (217922 lines/s)         9%
Ackermann..................  1.789s  1.323s  3716kb   (100)                          -8%
Loops Nested (local).......  1.217s  0.740s  3504kb   (100)  (22678034 iters/s)     -27%
Loops Nested (global)......  1.857s  1.359s  3504kb   (100)  (12342540 iters/s)     -23%
Loops Recursed.............  1.512s  1.031s  3504kb   (100)  (1017443 iters/s)       -9%
ia32 (Athlon XP 1535 MHz):
test                         total   user    mem      (runs)                       diff
Compile....................  1.464s  1.251s  5244kb   (100)  (19300 lines/s)          7%
Compile & Exec.............  1.389s  1.197s  3656kb   (100)  (502423 lines/s)         6%
Ackermann..................  1.387s  1.198s  3724kb   (100)                          84% !!
Loops Nested (local).......  0.450s  0.261s  3508kb   (100)  (64182112 iters/s)     -39%
Loops Nested (global)......  0.787s  0.598s  3508kb   (100)  (28036812 iters/s)     -23%
Loops Recursed.............  1.518s  1.329s  3508kb   (100)  (789115 iters/s)       172% !!
I also tried with a binary copied from the PIII system on my Athlon in case there's some kind of compiler difference, but that didn't change anything much. It's amazing that some cpu difference can have this dramatic effect on function call performance.
It'd be interesting to see this on more systems.
/ Martin Stjernholm, Roxen IS
Previous text:
2003-07-31 02:18: Subject: Machine code efficiency
It's on an Athlon XP. But if that matters appreciably then it's a problem in itself. I doubt it, though. Anyway, please try it yourself and see if there's a difference.
/ Martin Stjernholm, Roxen IS
I did away with the assignments to the frame return addresses (for ia32 only). Now it's much better on Athlon, and there's some improvement on Intel too:
PIII:
test                         total   user    mem      (runs)                       diff
Compile....................  3.389s  2.906s  5316kb   (100)  (8306 lines/s)           9%
Compile & Exec.............  3.262s  2.710s  3760kb   (100)  (221952 lines/s)         7%
Ackermann..................  1.622s  1.109s  3752kb   (100)                         -23%
Loops Nested (local).......  1.137s  0.649s  3540kb   (100)  (25870792 iters/s)     -36%
Loops Nested (global)......  2.059s  1.327s  3540kb   (100)  (12643922 iters/s)     -25%
Loops Recursed.............  1.486s  0.928s  3540kb   (100)  (1129566 iters/s)      -18%
Athlon XP:
test                         total   user    mem      (runs)                       diff
Compile....................  1.529s  1.317s  3668kb   (100)  (18332 lines/s)         13%
Compile & Exec.............  1.474s  1.264s  3672kb   (100)  (475716 lines/s)        11%
Ackermann..................  0.732s  0.534s  3728kb   (100)                         -18%
Loops Nested (local).......  0.450s  0.248s  3520kb   (100)  (67677360 iters/s)     -42%
Loops Nested (global)......  0.780s  0.587s  3520kb   (100)  (28561814 iters/s)     -25%
Loops Recursed.............  0.571s  0.367s  3520kb   (100)  (2857154 iters/s)      -25%
/ Martin Stjernholm, Roxen IS
Previous text:
2003-08-05 14:53: Subject: Machine code efficiency
I've tested on a few more architectures now and it seems that it indeed is Athlon specific. The following are measurements with machine code. The last column is the difference in user time wrt without machine code.
sparc (Ultra-80):
test                         total   user    mem    (runs)                        diff
Compile....................  6.588s  4.096s   -1kb   (76)   (5894 lines/s)           2%
Compile & Exec.............  7.060s  4.488s   -1kb   (71)   (134009 lines/s)         0%
Ackermann..................  5.311s  2.841s   -1kb   (95)                          -10%
Loops Nested (local).......  4.426s  1.981s   -1kb  (100)   (8469060 iters/s)        9%
Loops Nested (global)......  4.992s  2.532s   -1kb  (100)   (6625550 iters/s)      -18%
Loops Recursed.............  4.747s  2.270s   -1kb  (100)   (461968 iters/s)       -16%
Seems like local variable accesses could be improved on sparc.
ia32 (PIII 700 MHz):
test                         total   user    mem      (runs)                       diff
Compile....................  3.462s  2.945s  5260kb   (100)  (8198 lines/s)          11%
Compile & Exec.............  3.408s  2.760s  3628kb   (100)  (217922 lines/s)         9%
Ackermann..................  1.789s  1.323s  3716kb   (100)                          -8%
Loops Nested (local).......  1.217s  0.740s  3504kb   (100)  (22678034 iters/s)     -27%
Loops Nested (global)......  1.857s  1.359s  3504kb   (100)  (12342540 iters/s)     -23%
Loops Recursed.............  1.512s  1.031s  3504kb   (100)  (1017443 iters/s)       -9%
ia32 (Athlon XP 1535 MHz):
test                         total   user    mem      (runs)                       diff
Compile....................  1.464s  1.251s  5244kb   (100)  (19300 lines/s)          7%
Compile & Exec.............  1.389s  1.197s  3656kb   (100)  (502423 lines/s)         6%
Ackermann..................  1.387s  1.198s  3724kb   (100)                          84% !!
Loops Nested (local).......  0.450s  0.261s  3508kb   (100)  (64182112 iters/s)     -39%
Loops Nested (global)......  0.787s  0.598s  3508kb   (100)  (28036812 iters/s)     -23%
Loops Recursed.............  1.518s  1.329s  3508kb   (100)  (789115 iters/s)       172% !!
I also tried with a binary copied from the PIII system on my Athlon in case there's some kind of compiler difference, but that didn't change anything much. It's amazing that some cpu difference can have this dramatic effect on function call performance.
It'd be interesting to see this on more systems.
/ Martin Stjernholm, Roxen IS
I did away with the assignments to the frame return addresses (for ia32 only). Now it's much better on Athlon, and there's some improvement on Intel too:
The dead cat in me has to ask: how? What did you replace it with?
/ Fredrik (Naranek) Hubinette (Real Build Master)
Previous text:
2003-08-06 21:21: Subject: Machine code efficiency
I did away with the assignments to the frame return addresses (for ia32 only). Now it's much better on Athlon, and there's some improvement on Intel too:
PIII:
test                         total   user    mem      (runs)                       diff
Compile....................  3.389s  2.906s  5316kb   (100)  (8306 lines/s)           9%
Compile & Exec.............  3.262s  2.710s  3760kb   (100)  (221952 lines/s)         7%
Ackermann..................  1.622s  1.109s  3752kb   (100)                         -23%
Loops Nested (local).......  1.137s  0.649s  3540kb   (100)  (25870792 iters/s)     -36%
Loops Nested (global)......  2.059s  1.327s  3540kb   (100)  (12643922 iters/s)     -25%
Loops Recursed.............  1.486s  0.928s  3540kb   (100)  (1129566 iters/s)      -18%
Athlon XP:
test                         total   user    mem      (runs)                       diff
Compile....................  1.529s  1.317s  3668kb   (100)  (18332 lines/s)         13%
Compile & Exec.............  1.474s  1.264s  3672kb   (100)  (475716 lines/s)        11%
Ackermann..................  0.732s  0.534s  3728kb   (100)                         -18%
Loops Nested (local).......  0.450s  0.248s  3520kb   (100)  (67677360 iters/s)     -42%
Loops Nested (global)......  0.780s  0.587s  3520kb   (100)  (28561814 iters/s)     -25%
Loops Recursed.............  0.571s  0.367s  3520kb   (100)  (2857154 iters/s)      -25%
/ Martin Stjernholm, Roxen IS
They return the address to jump to instead and there's a "jmp *%eax" after each such opcode call. The pipelines can track that much better, apparently.
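To make the pattern concrete, here is a minimal, self-contained C toy of the same idea (the mini-VM and all names are invented for illustration; it only mirrors the call-then-indirect-jump structure, not Pike's actual code generator):

#include <stdio.h>

/* Each opcode returns where to go next -- the software analogue of
 * "call F_xxx" followed by "jmp *%eax". */
typedef struct { int pc; int acc; } vm_state;
typedef int (*opcode_fn)(vm_state *);

static int op_incr(vm_state *s)   { s->acc++; return s->pc + 1; }
static int op_branch(vm_state *s) { return s->acc < 5 ? 0 : s->pc + 1; }
static int op_halt(vm_state *s)   { (void)s; return -1; }

int main(void)
{
    opcode_fn program[] = { op_incr, op_branch, op_halt };
    vm_state s = { 0, 0 };
    while (s.pc >= 0)
        s.pc = program[s.pc](&s);   /* "call", then jump to what was returned */
    printf("acc = %d\n", s.acc);    /* prints acc = 5 */
    return 0;
}

Presumably the pipelines like this better because the saved return address on the stack is never rewritten, so the CPU's return prediction stays valid, but that part is speculation rather than something measured here.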
/ Martin Stjernholm, Roxen IS
Previous text:
2003-08-07 00:34: Subject: Machine code efficiency
I did away with the assignments to the frame return addresses (for ia32 only). Now it's much better on Athlon, and there's some improvement on Intel too:
The dead cat in me has to ask: how? What did you replace it with?
/ Fredrik (Naranek) Hubinette (Real Build Master)
Could you test on P4 before and after too? I'm sort of curious about that. :-)
/ Per Hedbor ()
Previous text:
2003-08-06 21:21: Subject: Machine code efficiency
I did away with the assignments to the frame return addresses (for ia32 only). Now it's much better on Athlon, and there's some improvement on Intel too:
PIII:
test                         total   user    mem      (runs)                       diff
Compile....................  3.389s  2.906s  5316kb   (100)  (8306 lines/s)           9%
Compile & Exec.............  3.262s  2.710s  3760kb   (100)  (221952 lines/s)         7%
Ackermann..................  1.622s  1.109s  3752kb   (100)                         -23%
Loops Nested (local).......  1.137s  0.649s  3540kb   (100)  (25870792 iters/s)     -36%
Loops Nested (global)......  2.059s  1.327s  3540kb   (100)  (12643922 iters/s)     -25%
Loops Recursed.............  1.486s  0.928s  3540kb   (100)  (1129566 iters/s)      -18%
Athlon XP:
test                         total   user    mem      (runs)                       diff
Compile....................  1.529s  1.317s  3668kb   (100)  (18332 lines/s)         13%
Compile & Exec.............  1.474s  1.264s  3672kb   (100)  (475716 lines/s)        11%
Ackermann..................  0.732s  0.534s  3728kb   (100)                         -18%
Loops Nested (local).......  0.450s  0.248s  3520kb   (100)  (67677360 iters/s)     -42%
Loops Nested (global)......  0.780s  0.587s  3520kb   (100)  (28561814 iters/s)     -25%
Loops Recursed.............  0.571s  0.367s  3520kb   (100)  (2857154 iters/s)      -25%
/ Martin Stjernholm, Roxen IS
Be my guest. One can experiment with commenting out either or both of the OPCODE_INLINE_BRANCH and OPCODE_RETURN_JUMPADDR defines in ia32.h. Without either of them it should behave like before.
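For reference, the experiment boils down to toggling two plain defines (a sketch only; the exact surrounding context in ia32.h may differ):

/* In ia32.h -- comment out one or both to get the old call convention back: */
#define OPCODE_INLINE_BRANCH
#define OPCODE_RETURN_JUMPADDR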
/ Martin Stjernholm, Roxen IS
Previous text:
2003-08-07 03:40: Subject: Machine code efficiency
Could you test on P4 before and after too? I'm sort of curious about that. :-)
/ Per Hedbor ()
You seem to have broken the other machinecode targets though. Both sparc and PPC are now non-working, as can be observed in xenofarm.
/ Marcus Comstedt (ACROSS) (Hail Ilpalazzo!)
Previous text:
2003-08-06 21:21: Subject: Machine code efficiency
I did away with the assignments to the frame return addresses (for ia32 only). Now it's much better on Athlon, and there's some improvement on Intel too:
PIII:
test                         total   user    mem      (runs)                       diff
Compile....................  3.389s  2.906s  5316kb   (100)  (8306 lines/s)           9%
Compile & Exec.............  3.262s  2.710s  3760kb   (100)  (221952 lines/s)         7%
Ackermann..................  1.622s  1.109s  3752kb   (100)                         -23%
Loops Nested (local).......  1.137s  0.649s  3540kb   (100)  (25870792 iters/s)     -36%
Loops Nested (global)......  2.059s  1.327s  3540kb   (100)  (12643922 iters/s)     -25%
Loops Recursed.............  1.486s  0.928s  3540kb   (100)  (1129566 iters/s)      -18%
Athlon XP:
test                         total   user    mem      (runs)                       diff
Compile....................  1.529s  1.317s  3668kb   (100)  (18332 lines/s)         13%
Compile & Exec.............  1.474s  1.264s  3672kb   (100)  (475716 lines/s)        11%
Ackermann..................  0.732s  0.534s  3728kb   (100)                         -18%
Loops Nested (local).......  0.450s  0.248s  3520kb   (100)  (67677360 iters/s)     -42%
Loops Nested (global)......  0.780s  0.587s  3520kb   (100)  (28561814 iters/s)     -25%
Loops Recursed.............  0.571s  0.367s  3520kb   (100)  (2857154 iters/s)      -25%
/ Martin Stjernholm, Roxen IS
Seems to work on SPARC if compiled with debug. I suspect the return of the opcode leaf function bug.
/ Henrik Grubbström (Lysator)
Previous text:
2003-08-07 17:04: Subject: Machine code efficiency
You seem to have broken the other machinecode targets though. Both sparc and PPC are now non-working, as can be observed in xenofarm.
/ Marcus Comstedt (ACROSS) (Hail Ilpalazzo!)
In that case it should be possible to fix it by implementing the OPCODE_RETURN_JUMPADDR mode. We should get a jury verdict for the PPC case from the AIX machines any hour now... :-)
/ Marcus Comstedt (ACROSS) (Hail Ilpalazzo!)
Previous text:
2003-08-07 20:06: Subject: Machine code efficiency
Seems to work on SPARC if compiled with debug. I suspect the return of the opcode leaf function bug.
/ Henrik Grubbström (Lysator)
:-)
/ Henrik Grubbström (Lysator)
Previous text:
2003-08-07 20:17: Subject: Machine code efficiency
In that case it should be possible to fix it by implementing the OPCODE_RETURN_JUMPADDR mode. We should get a jury verdict for the PPC case from the AIX machines any hour now... :-)
/ Marcus Comstedt (ACROSS) (Hail Ilpalazzo!)
The Sparc problem was that I "cleaned up" the global DEF_PROG_COUNTER. It's now reinstated as GLOBAL_DEF_PROG_COUNTER (the ia32 support with MSVC needs to have one that isn't used both globally and in every opcode function).
So the ppc32 problem must be something different, but what it is I have no idea about. The intention was that nothing would change if OPCODE_RETURN_JUMPADDR isn't defined. Anyway, it's hopefully a non-issue soon.
/ Martin Stjernholm, Roxen IS
Previous text:
2003-08-07 20:06: Subject: Machine code efficiency
Seems to work on SPARC if compiled with debug. I suspect the return of the opcode leaf function bug.
/ Henrik Grubbström (Lysator)
The Sparc problem was that I "cleaned up" the global DEF_PROG_COUNTER.
Um, shouldn't that cause a compilation error rather than a core dump (which is what I get on SPARC)?
/ Marcus Comstedt (ACROSS) (Hail Ilpalazzo!)
Previous text:
2003-08-07 20:46: Subject: Machine code efficiency
The Sparc problem was that I "cleaned up" the global DEF_PROG_COUNTER. It's now reinstated as GLOBAL_DEF_PROG_COUNTER (the ia32 support with MSVC needs to have one that isn't used both globally and in every opcode function).
So the ppc32 problem must be something different, but what it is I have no idea about. The intention was that nothing would change if OPCODE_RETURN_JUMPADDR isn't defined. Anyway, it's hopefully a non-issue soon.
/ Martin Stjernholm, Roxen IS
No, since it was used locally too in every function. Sneaky.
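A tiny stand-alone C example of why that compiles silently (hypothetical names; only the shadowing effect matters):

#include <stdio.h>

/* int prog_counter = 42;        <- the removed "global" definition */

static void opcode(void)
{
    int prog_counter = 0;   /* every opcode function declares its own local,
                             * so dropping the global one still compiles... */
    printf("%d\n", prog_counter);   /* ...and quietly uses the local value */
}

int main(void) { opcode(); return 0; }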
/ Martin Stjernholm, Roxen IS
Previous text:
2003-08-07 20:49: Subject: Machine code efficiency
The Sparc problem was that I "cleaned up" the global DEF_PROG_COUNTER.
Um, shouldn't that cause a compilation error rather than a core dump (which is what I get on SPARC)?
/ Marcus Comstedt (ACROSS) (Hail Ilpalazzo!)
Btw, did you notice that the dmalloc build on mahoro seems to have started leaking objects since the OPCODE_RETURN_JUMPADDR change? That's a rather unforeseen development, no? :-)
/ Marcus Comstedt (ACROSS) (Hail Ilpalazzo!)
Previous text:
2003-08-07 20:46: Subject: Machine code efficiency
The Sparc problem was that I "cleaned up" the global DEF_PROG_COUNTER. It's now reinstated as GLOBAL_DEF_PROG_COUNTER (the ia32 support with MSVC needs to have one that isn't used both globally and in every opcode function).
So the ppc32 problem must be something different, but what it is I have no idea about. The intention was that nothing would change if OPCODE_RETURN_JUMPADDR isn't defined. Anyway, it's hopefully a non-issue soon.
/ Martin Stjernholm, Roxen IS
Yes, that's strange. Perhaps it has something to do with the fact that far fewer opcodes are inlined in dmalloc mode. Thanks for letting me know.
/ Martin Stjernholm, Roxen IS
Previous text:
2003-08-07 21:02: Subject: Machine code efficiency
Btw, did you notice that the dmalloc build on mahoro seems to have started leaking objects since the OPCODE_RETURN_JUMPADDR change? That's a rather unforeseen development, no? :-)
/ Marcus Comstedt (ACROSS) (Hail Ilpalazzo!)
Does it still leak? I plugged a few leaks in the Nettle module yesterday.
/ Henrik Grubbström (Lysator)
Previous text:
2003-08-07 21:02: Subject: Machine code efficiency
Btw, did you notice that the dmalloc build on mahoro seems to have started leaking objects since the OPCODE_RETURN_JUMPADDR change? That's a rather unforeseen development, no? :-)
/ Marcus Comstedt (ACROSS) (Hail Ilpalazzo!)
Right now it seems to segfault instead. Note that there were no Nettle-related checkins at the time the plupps went from green to yellow.
Btw, machine code on SPARC still seems a bit shaky; constants don't work properly:
pelix:~/Pike/7.5/build% /pike/home/marcus/Pike/7.5/build/pike -DNOT_INSTALLED -DPRECOMPILED_SEARCH_MORE -m/pike/home/marcus/Pike/7.5/build/master.pike -e 'String.trim_whites;'
-:3:Index 'trim_whites' not present in module 'String'.
Compilation failed.
/pike/home/marcus/Pike/7.5/build/master.pike:296: master()->compile_string("#define NOT(X) !(X)\n#define CHAR(X) 'X'\nmixed run(int argc, array(string) argv,mapping(string:string) env){String.trim_whites;;}",0,0)
pelix:~/Pike/7.5/build%
/ Marcus Comstedt (ACROSS) (Hail Ilpalazzo!)
Previous text:
2003-08-07 22:29: Subject: Machine code efficiency
Does it still leak? I plugged a few leaks in the Nettle module yesterday.
/ Henrik Grubbström (Lysator)
I've just found a problem: If a program only has one reference through a function being called then F_RETURN in that function will free the program. Thus there might not be any "jmp *%eax" after the opcode call anymore. Ho hum.. :\
/ Martin Stjernholm, Roxen IS
Previous text:
2003-08-07 22:45: Subject: Machine code efficiency
Right now it seems to segfault instead. Note that there were no Nettle-related checkins at the time the plupps went from green to yellow.
Btw, machine code on SPARC still seems a bit shaky; constants don't work properly:
pelix:~/Pike/7.5/build% /pike/home/marcus/Pike/7.5/build/pike -DNOT_INSTALLED -DPRECOMPILED_SEARCH_MORE -m/pike/home/marcus/Pike/7.5/build/master.pike -e 'String.trim_whites;'
-:3:Index 'trim_whites' not present in module 'String'.
Compilation failed.
/pike/home/marcus/Pike/7.5/build/master.pike:296: master()->compile_string("#define NOT(X) !(X)\n#define CHAR(X) 'X'\nmixed run(int argc, array(string) argv,mapping(string:string) env){String.trim_whites;;}",0,0)
pelix:~/Pike/7.5/build%
/ Marcus Comstedt (ACROSS) (Hail Ilpalazzo!)
Anyone got any good idea of how to cope with this? The best I can think of is to do destruct_objects_to_destruct on entry of each function instead of on exit.
/ Martin Stjernholm, Roxen IS
Previous text:
2003-08-07 22:55: Subject: Machine code efficiency
I've just found a problem: If a program only has one reference through a function being called then F_RETURN in that function will free the program. Thus there might not be any "jmp *%eax" after the opcode call anymore. Ho hum.. :\
/ Martin Stjernholm, Roxen IS
A simple workaround would be to replace
jsr F_RETURN
jmp %eax
with:
jmp static_return_stub
; Somewhere global
return_stub:
jsr F_RETURN
jmp %eax
/ Fredrik (Naranek) Hubinette (Real Build Master)
Previous text:
2003-08-07 23:43: Subject: Machine code efficiency
Anyone got any good idea of how to cope with this? The best I can think of is to do destruct_objects_to_destruct on entry of each function instead of on exit.
/ Martin Stjernholm, Roxen IS
Another option would be to put the responsibility of popping the frame on the caller. But that would probably screw things up...
Hmm, I haven't looked at the source in a while, but just where does an F_RETURN jump to? Shouldn't F_RETURN just do a return? Ie, shouldn't F_RETURN be compiled as:
jmp F_RETURN
instead of
jsr F_RETURN
jmp *%eax
?
/ Fredrik (Naranek) Hubinette (Real Build Master)
Previous text:
2003-08-07 23:57: Subject: Machine code efficiency
A simple workaround would be to replace
jsr F_RETURN
jmp %eax
with:
jmp static_return_stub
; Somewhere global
return_stub:
jsr F_RETURN
jmp %eax
/ Fredrik (Naranek) Hubinette (Real Build Master)
Well, in that case all jump opcodes could be changed to that approach as well. It'd be necessary to insert some assembler in each opcode function. Is it possible to get the stack restored correctly?
/ Martin Stjernholm, Roxen IS
Previous text:
2003-08-08 00:01: Subject: Machine code efficiency
Another option would be to put the responsibility of popping the frame on the caller. But that would probably screw things up...
Hmm, I haven't looked at the source in a while, but just where does an F_RETURN jump to? Shouldn't F_RETURN just do a return? Ie, shouldn't F_RETURN be compiled as:
jmp F_RETURN
instead of
jsr F_RETURN
jmp *%eax
?
/ Fredrik (Naranek) Hubinette (Real Build Master)
Another option would be to put the responsibility of popping the frame on the caller. But that would probably screw things up...
Hmm, when I read my copy of an early Pike 7.4, it seems that this is how it used to work whenever the function call originated from mega_apply(). Tail recursion would cause F_RETURN to free the frame, but in a tail-recursion call there would always be at least one more reference to the current program, so that shouldn't be a problem.
I assume this has changed? (Too lazy to read the new source right now...)
/ Fredrik (Naranek) Hubinette (Real Build Master)
Previous text:
2003-08-08 00:01: Subject: Machine code efficiency
Another option would be to put the responsibility of popping the frame on the caller. But that would probably screw things up...
Hmm, I haven't looked at the source in a while, but just where does an F_RETURN jump to? Shouldn't F_RETURN just do a return? Ie, shouldn't F_RETURN be compiled as:
jmp F_RETURN
instead of
jsr F_RETURN
jmp *%eax
?
/ Fredrik (Naranek) Hubinette (Real Build Master)
The calls via mega_apply aren't the problem. Rather it's those that use low_mega_apply and low_return at different points, i.e. whenever a pike function calls another pike function. I don't think that has changed. Wasn't it you that implemented the stuff that avoids recursion on the C stack in that case?
/ Martin Stjernholm, Roxen IS
Previous text:
2003-08-08 01:36: Subject: Machine code efficiency
Another option would be to put the responsibility of popping the frame on the caller. But that would probably screw things up...
Hmm, when I read my copy of an early Pike 7.4, it seems that this is how it used to work whenever the function call originated from mega_apply(). Tail recursion would cause F_RETURN to free the frame, but in a tail-recursion call there would always be at least one more reference to the current program, so that shouldn't be a problem.
I assume this has changed? (Too lazy to read the new source right now...)
/ Fredrik (Naranek) Hubinette (Real Build Master)
Here's another hare-brained idea; perhaps instead of assembling each jump instruction as:
call F_INSTRUCTION
jmp *%eax
Perhaps it is possible to use just "call F_INSTRUCTION", but instead of changing the return address, you just define DO_JUMP(X) as:
movl %ebp, %esp   ; Unlink stack frame (-fno-omit-frame-pointer)
popl %ebp         ; restore %ebp
addl $4,%esp      ; pop return address
movl X,%eax       ; Jump
jmp *%eax
I.e. a standard gcc function epilogue, but with a pop/jump instead of a ret. It should be possible to make it work with -fomit-frame-pointer as well, but that's a little trickier.
Although, it is possible that this will be just as slow as changing the return address on the stack.
/ Fredrik (Naranek) Hubinette (Real Build Master)
Previous text:
2003-08-08 01:52: Subject: Machine code efficiency
The calls via mega_apply aren't the problem. Rather it's those that use low_mega_apply and low_return at different points, i.e. whenever a pike function calls another pike function. I don't think that has changed. Wasn't it you that implemented the stuff that avoids recursion on the C stack in that case?
/ Martin Stjernholm, Roxen IS
I checked in the delay-past-destruct_objects_to_destruct kludge yesterday and I think it'll do. After all, there has never been a 100% guarantee that the current object is destructed immediately at return anyway, since the destruct_objects_to_destruct call has only been made if there are things to pop on the stack.
/ Martin Stjernholm, Roxen IS
Previous text:
2003-08-08 23:48: Subject: Machine code efficiency
Here's another hare-brained idea; perhaps instead of assembling each jump instruction as:
call F_INSTRUCTION
jmp *%eax
Perhaps it is possible to use just "call F_INSTRUCTION", but instead of changing the return address, you just define DO_JUMP(X) as:
movl %ebp, %esp   ; Unlink stack frame (-fno-omit-frame-pointer)
popl %ebp         ; restore %ebp
addl $4,%esp      ; pop return address
movl X,%eax       ; Jump
jmp *%eax
I.e. a standard gcc function epilogue, but with a pop/jump instead of a ret. It should be possible to make it work with -fomit-frame-pointer as well, but that's a little trickier.
Although, it is possible that this will be just as slow as changing the return address on the stack.
/ Fredrik (Naranek) Hubinette (Real Build Master)
A word of caution: I've made very similar arguments in the past, and more often than not, it has come back to bite me on the ass afterwards. Hopefully your kludge will work though. :)
Hmm... What happens if the object we are returning from is already destructed? In that case, program will be freed directly from really_free_pike_frame, won't it?
/ Fredrik (Naranek) Hubinette (Real Build Master)
Previous text:
2003-08-09 01:02: Subject: Machine code efficiency
I checked in the delay-past-destruct_objects_to_destruct kludge yesterday and I think it'll do. After all, there has never been a 100% guarantee that the current object is destructed immediately at return anyway, since the destruct_objects_to_destruct call has only been made if there are things to pop on the stack.
/ Martin Stjernholm, Roxen IS
Yes, I know. Unfortunately your old half-baked solutions have bitten more people than you.. :\ The prime examples are the cyclic resolver issues in the compiler and the decoder - layer upon layer upon layer of ever more elaborate kludges, and it still doesn't work all the time.
Hmm... What happens if the object we are returning from is already destructed? In that case, program will be freed directly from really_free_pike_frame, won't it?
Good point. Looks like we need something like a destruct_programs_to_destruct then. Or make a fake object for the program and directly link it into the objects_to_destruct list. :P
/ Martin Stjernholm, Roxen IS
Previous text:
2003-08-09 01:58: Subject: Machine code efficiency
A word of caution: I've made very similar arguments in the past, and more often than not, it has come back to bite me on the ass afterwards. Hopefully your kludge will work though. :)
Hmm... What happens if the object we are returning from is already destructed? In that case, program will be freed directly from really_free_pike_frame, won't it?
/ Fredrik (Naranek) Hubinette (Real Build Master)
Simple in principle, but it's not fun to have stubs for the whole plethora of return opcodes. :P
I'm thinking of adding an argument to destruct_objects_to_destruct so that the current object can be excluded. Maybe we can live with it being freed a bit later instead. It's after all not often that it's freed directly on function return anyway.
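A self-contained toy of what that extra argument could look like (the list handling and all names below are made up for illustration, not Pike's real internals):

#include <stdio.h>
#include <stdlib.h>

struct object { int id; struct object *next; };
static struct object *objects_to_destruct;

static void schedule_destruct(int id)
{
    struct object *o = malloc(sizeof *o);
    o->id = id;
    o->next = objects_to_destruct;
    objects_to_destruct = o;
}

/* Destruct everything queued except one object, so the object we are
 * returning from (and the program holding the machine code we still have
 * to jump through) is not freed under our feet. */
static void destruct_objects_to_destruct(struct object *except)
{
    struct object **prev = &objects_to_destruct, *o;
    while ((o = *prev)) {
        if (o == except) { prev = &o->next; continue; }
        *prev = o->next;
        printf("destructing object %d\n", o->id);
        free(o);
    }
}

int main(void)
{
    schedule_destruct(1);
    schedule_destruct(2);
    schedule_destruct(3);
    destruct_objects_to_destruct(objects_to_destruct);  /* spare the newest */
    return 0;
}

The spared object simply stays on the queue and gets picked up by a later pass, which is the "freed a bit later" trade-off mentioned above.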
/ Martin Stjernholm, Roxen IS
Previous text:
2003-08-07 23:57: Subject: Machine code efficiency
A simple workaround would be to replace
jsr F_RETURN
jmp %eax
with:
jmp static_return_stub
; Somewhere global
return_stub:
jsr F_RETURN
jmp %eax
/ Fredrik (Naranek) Hubinette (Real Build Master)
How about encoding a return like this:
push static_dangerous_jump_stub
jmp F_RETURN

static_dangerous_jump_stub:
jmp *%eax
That way you will only need one stub for all of the return opcodes.
Although, if you still use the PROG_COUNTER() macro for relative jumping, that wouldn't work very well...
/ Fredrik (Naranek) Hubinette (Real Build Master)
Previous text:
2003-08-08 00:36: Subject: Machine code efficiency
Simple in principle, but it's not fun to have stubs for the whole plethora of return opcodes. :P
I'm thinking of adding an argument to destruct_objects_to_destruct so that the current object can be excluded. Maybe we can live with it being freed a bit later instead. It's after all not often that it's freed directly on function return anyway.
/ Martin Stjernholm, Roxen IS
The frame return address is still used for relative jumping, to update Pike_fp->pc etc. Good point, that rules out any sort of stubs afaics.
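A small gcc-specific toy (not Pike code) of why a shared stub collides with that: if an opcode derives its position from the caller's return address, routing every call through one stub makes that address point at the stub instead of the real call site:

#include <stdio.h>

__attribute__((noinline)) static void opcode(void)
{
    /* stands in for PROG_COUNTER(): "where was I called from?" */
    printf("return address: %p\n", __builtin_return_address(0));
}

__attribute__((noinline)) static void shared_stub(void) { opcode(); }

int main(void)
{
    opcode();       /* an address inside main -- position info intact */
    shared_stub();  /* an address inside shared_stub -- position info lost */
    return 0;
}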
/ Martin Stjernholm, Roxen IS
Previous text:
2003-08-08 01:30: Subject: Machine code efficiency
How about encoding a return like this:
push static_dangerous_jump_stub
jmp F_RETURN

static_dangerous_jump_stub:
jmp *%eax
That way you will only need one stub for all of the return opcodes.
Although, if you still use the PROG_COUNTER() macro for relative jumping, that wouldn't work very well...
/ Fredrik (Naranek) Hubinette (Real Build Master)
Right now it seems to segfault instead. Note that there were no Nettle-related checkins at the time the plupps went from green to yellow.
I think the CVS browser links lie in this case, and that this is the relevant checkin:
o 2003-08-07 16:12:39 UTC (nilsson) Pike/7.5/lib/modules/Crypto.pmod/testsuite.in, (+111/-2) (209 lines): Nettle tests
/ Henrik Grubbström (Lysator)
Previous text:
2003-08-07 22:45: Subject: Machine code efficiency
Right now it seems to segfault instead. Note that there were no Nettle-related checkins at the time the plupps went from green to yellow.
Btw, machine code on SPARC still seems a bit shaky; constants don't work properly:
pelix:~/Pike/7.5/build% /pike/home/marcus/Pike/7.5/build/pike -DNOT_INSTALLED -DPRECOMPILED_SEARCH_MORE -m/pike/home/marcus/Pike/7.5/build/master.pike -e 'String.trim_whites;'
-:3:Index 'trim_whites' not present in module 'String'.
Compilation failed.
/pike/home/marcus/Pike/7.5/build/master.pike:296: master()->compile_string("#define NOT(X) !(X)\n#define CHAR(X) 'X'\nmixed run(int argc, array(string) argv,mapping(string:string) env){String.trim_whites;;}",0,0)
pelix:~/Pike/7.5/build%
/ Marcus Comstedt (ACROSS) (Hail Ilpalazzo!)
The testsuite never gets that far, so no, that is not what breaks the dmalloc build.
/ Martin Nilsson (ja till euro, nej till cent)
Previous text:
2003-08-07 22:58: Subject: Machine code efficiency
Right now it seems to segfault instead. Note that there were no Nettle-related checkins at the time the plupps went from green to yellow.
I think the CVS browser links lie in this case, and that this is the relevant checkin:
o 2003-08-07 16:12:39 UTC (nilsson) Pike/7.5/lib/modules/Crypto.pmod/testsuite.in, (+111/-2) (209 lines): Nettle tests
/ Henrik Grubbström (Lysator)
Courtesy of dmalloc, probably. See http://pike.ida.liu.se/generated/pikefarm/7.5/1010_168/verifylog.txt. You can see four leaked objects, along with lines where each object has been used and where it's referenced from. Really neat, actually. Use dmalloc. Dmalloc is good for you. And wear sensible shoes.
/ Martin Stjernholm, Roxen IS
Previous text:
2003-08-07 23:09: Subject: Machine code efficiency
Oh. Where do you see leaked objects?
/ Martin Nilsson (ja till euro, nej till cent)
That was verbose. I just wanted to know the build and client...
/ Martin Nilsson (ja till euro, nej till cent)
Previous text:
2003-08-07 23:36: Subject: Machine code efficiency
Courtesy of dmalloc, probably. See http://pike.ida.liu.se/generated/pikefarm/7.5/1010_168/verifylog.txt. You can see four leaked objects, along with lines where each object has been used and where it's referenced from. Really neat, actually. Use dmalloc. Dmalloc is good for you. And wear sensible shoes.
/ Martin Stjernholm, Roxen IS
pike-devel@lists.lysator.liu.se