I did away with the assignments to the frame return addresses (for ia32 only). Now it's much better on Athlon, and there's some improvement on Intel too:
PIII:
test                         total   user    mem    (runs)                    diff
Compile....................  3.389s  2.906s  5316kb  (100)  (8306 lines/s)      9%
Compile & Exec.............  3.262s  2.710s  3760kb  (100)  (221952 lines/s)    7%
Ackermann..................  1.622s  1.109s  3752kb  (100)                    -23%
Loops Nested (local).......  1.137s  0.649s  3540kb  (100)  (25870792 iters/s) -36%
Loops Nested (global)......  2.059s  1.327s  3540kb  (100)  (12643922 iters/s) -25%
Loops Recursed.............  1.486s  0.928s  3540kb  (100)  (1129566 iters/s)  -18%
Athlon XP:
test                         total   user    mem    (runs)                    diff
Compile....................  1.529s  1.317s  3668kb  (100)  (18332 lines/s)     13%
Compile & Exec.............  1.474s  1.264s  3672kb  (100)  (475716 lines/s)    11%
Ackermann..................  0.732s  0.534s  3728kb  (100)                     -18%
Loops Nested (local).......  0.450s  0.248s  3520kb  (100)  (67677360 iters/s) -42%
Loops Nested (global)......  0.780s  0.587s  3520kb  (100)  (28561814 iters/s) -25%
Loops Recursed.............  0.571s  0.367s  3520kb  (100)  (2857154 iters/s)  -25%
/ Martin Stjernholm, Roxen IS
Previous text:
2003-08-05 14:53: Subject: Machine code efficiency
I've tested on a few more architectures now, and it seems the slowdown is indeed Athlon-specific. The following are measurements with machine code; the last column is the difference in user time relative to running without machine code.
sparc (Ultra-80):
test                         total   user    mem    (runs)                    diff
Compile....................  6.588s  4.096s    -1kb   (76)  (5894 lines/s)      2%
Compile & Exec.............  7.060s  4.488s    -1kb   (71)  (134009 lines/s)    0%
Ackermann..................  5.311s  2.841s    -1kb   (95)                    -10%
Loops Nested (local).......  4.426s  1.981s    -1kb  (100)  (8469060 iters/s)   9%
Loops Nested (global)......  4.992s  2.532s    -1kb  (100)  (6625550 iters/s) -18%
Loops Recursed.............  4.747s  2.270s    -1kb  (100)  (461968 iters/s)  -16%
Seems like local variable accesses could be improved on sparc.
ia32 (PIII 700 MHz):
test                         total   user    mem    (runs)                    diff
Compile....................  3.462s  2.945s  5260kb  (100)  (8198 lines/s)     11%
Compile & Exec.............  3.408s  2.760s  3628kb  (100)  (217922 lines/s)    9%
Ackermann..................  1.789s  1.323s  3716kb  (100)                     -8%
Loops Nested (local).......  1.217s  0.740s  3504kb  (100)  (22678034 iters/s) -27%
Loops Nested (global)......  1.857s  1.359s  3504kb  (100)  (12342540 iters/s) -23%
Loops Recursed.............  1.512s  1.031s  3504kb  (100)  (1017443 iters/s)   -9%
ia32 (Athlon XP 1535 MHz):
test                         total   user    mem    (runs)                    diff
Compile....................  1.464s  1.251s  5244kb  (100)  (19300 lines/s)      7%
Compile & Exec.............  1.389s  1.197s  3656kb  (100)  (502423 lines/s)     6%
Ackermann..................  1.387s  1.198s  3724kb  (100)                      84% !!
Loops Nested (local).......  0.450s  0.261s  3508kb  (100)  (64182112 iters/s) -39%
Loops Nested (global)......  0.787s  0.598s  3508kb  (100)  (28036812 iters/s) -23%
Loops Recursed.............  1.518s  1.329s  3508kb  (100)  (789115 iters/s)   172% !!
I also tried a binary copied from the PIII system on my Athlon, in case some compiler difference was to blame, but that didn't change anything much. It's amazing that a CPU difference can have such a dramatic effect on function call performance.
It'd be interesting to see this on more systems.
/ Martin Stjernholm, Roxen IS