I did away with the assignments to the frame return addresses (for ia32 only). Now it's much better on Athlon, and there's some improvement on Intel too:
PIII:
test                         total   user    mem    (runs)                    diff
Compile....................  3.389s  2.906s  5316kb  (100)  (8306 lines/s)      9%
Compile & Exec.............  3.262s  2.710s  3760kb  (100)  (221952 lines/s)    7%
Ackermann..................  1.622s  1.109s  3752kb  (100)                    -23%
Loops Nested (local).......  1.137s  0.649s  3540kb  (100)  (25870792 iters/s) -36%
Loops Nested (global)......  2.059s  1.327s  3540kb  (100)  (12643922 iters/s) -25%
Loops Recursed.............  1.486s  0.928s  3540kb  (100)  (1129566 iters/s)  -18%
Athlon XP:
test                         total   user    mem    (runs)                    diff
Compile....................  1.529s  1.317s  3668kb  (100)  (18332 lines/s)     13%
Compile & Exec.............  1.474s  1.264s  3672kb  (100)  (475716 lines/s)    11%
Ackermann..................  0.732s  0.534s  3728kb  (100)                     -18%
Loops Nested (local).......  0.450s  0.248s  3520kb  (100)  (67677360 iters/s) -42%
Loops Nested (global)......  0.780s  0.587s  3520kb  (100)  (28561814 iters/s) -25%
Loops Recursed.............  0.571s  0.367s  3520kb  (100)  (2857154 iters/s)  -25%
/ Martin Stjernholm, Roxen IS
Previous text:
2003-08-05 14:53: Subject: Machine code efficiency
I've tested on a few more architectures now, and it seems the slowdown is indeed Athlon-specific. The following are measurements with machine code; the last column is the difference in user time relative to running without machine code.
sparc (Ultra-80):
test                         total   user    mem    (runs)                    diff
Compile....................  6.588s  4.096s    -1kb   (76)  (5894 lines/s)      2%
Compile & Exec.............  7.060s  4.488s    -1kb   (71)  (134009 lines/s)    0%
Ackermann..................  5.311s  2.841s    -1kb   (95)                    -10%
Loops Nested (local).......  4.426s  1.981s    -1kb  (100)  (8469060 iters/s)   9%
Loops Nested (global)......  4.992s  2.532s    -1kb  (100)  (6625550 iters/s) -18%
Loops Recursed.............  4.747s  2.270s    -1kb  (100)  (461968 iters/s)  -16%
Seems like local variable accesses could be improved on sparc.
ia32 (PIII 700 MHz):
test                         total   user    mem    (runs)                    diff
Compile....................  3.462s  2.945s  5260kb  (100)  (8198 lines/s)     11%
Compile & Exec.............  3.408s  2.760s  3628kb  (100)  (217922 lines/s)    9%
Ackermann..................  1.789s  1.323s  3716kb  (100)                     -8%
Loops Nested (local).......  1.217s  0.740s  3504kb  (100)  (22678034 iters/s) -27%
Loops Nested (global)......  1.857s  1.359s  3504kb  (100)  (12342540 iters/s) -23%
Loops Recursed.............  1.512s  1.031s  3504kb  (100)  (1017443 iters/s)   -9%
ia32 (Athlon XP 1535 MHz):
test                         total   user    mem    (runs)                    diff
Compile....................  1.464s  1.251s  5244kb  (100)  (19300 lines/s)      7%
Compile & Exec.............  1.389s  1.197s  3656kb  (100)  (502423 lines/s)     6%
Ackermann..................  1.387s  1.198s  3724kb  (100)                      84% !!
Loops Nested (local).......  0.450s  0.261s  3508kb  (100)  (64182112 iters/s) -39%
Loops Nested (global)......  0.787s  0.598s  3508kb  (100)  (28036812 iters/s) -23%
Loops Recursed.............  1.518s  1.329s  3508kb  (100)  (789115 iters/s)   172% !!
I also tried a binary copied from the PIII system on my Athlon, in case some compiler difference was to blame, but that didn't change anything much. It's amazing that a CPU difference can have such a dramatic effect on function call performance.
It'd be interesting to see this on more systems.
/ Martin Stjernholm, Roxen IS