I downloaded icc-7.0 and ran a comparison.
test                     gcc                        icc
Pike start overhead..... 0.001s                     0.001s
Ackermann............... 0.660s                     0.646s
Append array............ 0.470s (1063830/s)         0.534s (935551)
Append mapping.......... 2.890s (3460/s)            2.785s (3591)
Append multiset......... 0.476s (20997/s)           0.432s (23148)
Array & String Juggling. 0.687s                     0.604s
Read binary INT16....... 0.283s (3537736/s)         0.257s (3892944)
Read binary INT32....... 0.177s (2820513/s)         0.150s (3342618)
Read binary INT128...... 0.895s (11173/s)           0.857s (11673)
Clone null-object....... 0.228s (1313869/s)         0.177s (1691176)
Clone object............ 0.429s (699153/s)          0.369s (812641)
Compile................. 1.080s (22352 lines/s)     0.893s (27022)
Compile & Exec.......... 0.904s (665265 lines/s)    0.738s (814537)
GC...................... 0.587s                     0.578s
Insert in mapping....... 0.443s (1128668/s)         0.446s (1121076)
Insert in multiset...... 0.880s (568182/s)          0.777s (643777)
Matrix multiplication... 0.410s                     0.402s
Loops Nested (local).... 0.332s (50561473 iters/s)  0.487s (34473732)
Loops Nested (global)... 0.528s (31788409 iters/s)  0.723s (23209587)
Loops Recursed.......... 1.383s (758464 iters/s)    0.504s (2078675)
Note that icc function calls seem to be much faster (Loops Recursed), and that the speed is comparable even though the icc version is compiled without machine code. (Both are with 64-bit float and int.)
Has anyone looked into how difficult it would be to get icc to use the machine code stuff?
On Fri, 7 Feb 2003, Mirar @ Pike developers forum wrote:
I downloaded icc-7.0 and ran a comparison.
Note that icc function calls seems to be much faster (Loops Recursed), and that the speed is comparable even though the icc version is compiled without machine code. (Both are with 64-bit float and int.)
Nice results for Intel's compiler. How machine-dependent is this result? E.g. is this Pentium 4 specific, or can we expect AMD Athlon or Pentium III (II) to benefit likewise? What type of CPU did you run your tests on, by the way?
--- Ludger
Nice results for intels compiler. How machine dependent is this result?
This was on my Athlon XP 1900+ (1.6 GHz), so it seems that at least the Athlon XP benefits.
These are the compilation flags used:
icc: -Ob2 -ipo -ipo_obj -axKW -O2 -g
gcc: -O3 -pipe -fomit-frame-pointer -march=athlon-xp -mcpu=athlon-xp -g
"-ax<codes>  Generate code specialized for processor extensions specified
            by <codes> while also generating generic IA-32 code. <codes>
            includes one or more of the following characters:

              i -- Pentium Pro and Pentium II processor instructions
              M -- MMX(TM) instructions
              K -- Streaming SIMD Extensions
              W -- Pentium(R) 4 New Instructions"
I tried to get it to run machine code, but I got obscure linking errors (compilation went fine). It seems that some objects were missing eval_instruction().
/ Mirar
Previous text:
2003-02-07 14:29: Subject: Re: gcc/icc
On Fri, 7 Feb 2003, Mirar @ Pike developers forum wrote:
I downloaded icc-7.0 and ran a comparison.
Note that icc function calls seems to be much faster (Loops Recursed), and that the speed is comparable even though the icc version is compiled without machine code. (Both are with 64-bit float and int.)
Nice results for intels compiler. How machine dependent is this result? E.g. is this pentium 4 specific, or can we expect AMD Athlon or pentium III (II) to benefit likewise? What type of CPU did you run your tests btw.?
--- Ludger
Has anyone looked into how difficult it would be to get icc to use the machine code stuff?
/ Brevbäraren
I could add that my Athlon 1.2 GHz results I posted earlier were about the same. Results from P3 and P4 computers would be very interesting - especially P4.
/ David Hedbor
Previous text:
2003-02-07 17:46: Subject: Re: gcc/icc
Nice results for intels compiler. How machine dependent is this result?
This was on my Athlon XP 1900+ (1G6Hz), so it seems that at least Athlon XP benefits.
This is the compilation flags used:
icc: -Ob2 -ipo -ipo_obj -axKW -O2 -g
gcc: -O3 -pipe -fomit-frame-pointer -march=athlon-xp -mcpu=athlon-xp -g
"-ax<codes>  Generate code specialized for processor extensions specified
            by <codes> while also generating generic IA-32 code. <codes>
            includes one or more of the following characters:

              i -- Pentium Pro and Pentium II processor instructions
              M -- MMX(TM) instructions
              K -- Streaming SIMD Extensions
              W -- Pentium(R) 4 New Instructions"
I tried to get it to run machine code, but I got obscure linking errors (compilation went fine). It seems that some objects missed eval_instruction().
/ Mirar
I am downloading the compiler now. I'll be back. :-)
/ Per Hedbor ()
Previous text:
2003-02-07 18:44: Subject: Re: gcc/icc
I could add that my athlon 1.2 GHz results I posted earlier were about the same. Results from P3 and P4 computers would be very interesting - especially P4.
/ David Hedbor
You may or may not need to patch your bits/byteswap.h. On my Linux/glibc I needed to change this "16" to "32":

vv
 # define __bswap_16(x) \
     (__extension__ \
      ({ register unsigned int __x = (x); __bswap_constant_32 (__x); }))

(Just a tip for everyone who wants to get icc working. Newer/CVS versions of glibc don't have this problem.)
/ Mirar
Previous text:
2003-02-07 22:24: Subject: Re: gcc/icc
I am downloading the compiler now. I'll be back. :-)
/ Per Hedbor ()
What happened if you did not?
/ David Hedbor
Previous text:
2003-02-07 22:43: Subject: Re: gcc/icc
You may or may not need to patch your bits/byteswap.h. On my Linux/glibc you needed to change this "16" to "32":
vv
 # define __bswap_16(x) \
     (__extension__ \
      ({ register unsigned int __x = (x); __bswap_constant_32 (__x); }))
(Just a tip for everyone who wants to get icc working. Newer/CVS versions of glibc doesn't have this problem.)
/ Mirar
Any use of ntohl/htonl gives errors when the program is linked. It probably doesn't show up in old enough or new enough glibcs, but it at least shows up in the current Gentoo.
/ Mirar
Previous text:
2003-02-07 23:08: Subject: Re: gcc/icc
What happened if you did not?
/ David Hedbor
It worked for me.
On a not quite related note:
Matrix multiplication......Gmp.mpz conversion failed (Gmp.bignum not loaded).
/usr/local/pike/7.5.3/lib/modules/Tools.pmod/Shoot.pmod/MatrixMult.pike:21:
  Tools.Shoot.MatrixMult()->test(-1)
/usr/local/pike/7.5.3/lib/modules/Tools.pmod/Shoot.pmod/MatrixMult.pike:26:
  Tools.Shoot.MatrixMult()->perform()
/usr/local/pike/7.5.3/lib/modules/Tools.pmod/Shoot.pmod/module.pmod:129:
  Tools.Shoot->_shoot("MatrixMult")
-:3: -()->run(1,({"/usr/local/pike/7.5.3/bin/pike"}),mapping[34])
failed to spawn pike or run test
I can use Gmp.mpz (or Gmp.bignum) without trouble if I just start pike, though.
/ Per Hedbor ()
Adding main_resolv("Gmp.bignum"); in _main in the master fixed the problem (and almost exactly doubled the Pike start overhead :-)).
/ Per Hedbor ()
Previous text:
2003-02-07 23:12: Subject: Re: gcc/icc
It worked for me.
On a not quite related note:
I can use Gmp.mpz (or Gmp.bignum) without trouble if I just start the pike, though.
/ Per Hedbor ()
Yes, something that probably should load Gmp doesn't. It's a recent (a few weeks old) error, I think.
I patched the benchmarks a week or so ago so they load Gmp. *looks* Hmm, it never got checked in... *checks in*
/ Mirar
Don't.
/ Martin Nilsson (Åskblod)
Previous text:
2003-02-07 23:16: Subject: Re: gcc/icc
Yes, something that probably should doesn't load Gmp. It's a recent (few weeks) error, I think.
I patched the benchmarks a week ago or so so they load Gmp. *looks* Hmm, it never got checked in... *checks in*
/ Mirar
It's such an ugly fix that it shouldn't end up in the Pike main CVS.
Though I can't figure out why the error occurs in the first place. It does appear very much as if Gmp.bignum is resolved in the master before the benchmark is run.
/ Martin Nilsson (Åskblod)
Previous text:
2003-02-07 23:28: Subject: Re: gcc/icc
Any good reason they shouldn't? They need Gmp?
/ Mirar
Feel free to remove it when Gmp is loaded by something else. The problem seems only to appear in the benchmark test pike instances.
/ Mirar
Previous text:
2003-02-07 23:41: Subject: Re: gcc/icc
It's sunch an ugly fix that it shouldn't end up in the Pike main CVS.
Though I can't figure out why the error occurs in the first place. It does appear very much as if Gmp.bignum is resolved in the master before the benchmark is run.
/ Martin Nilsson (Åskblod)
I'm back!
Sadly, I don't have a modern P4, but my Celeron might work as an indication. However, it has 128 KB cache, and a modern P4 has 512 KB. My old 'normal' P4 has 256 KB cache, and is generally speaking somewhat slower per GHz than a modern one.
The 'gain' column is the percentage difference between gcc and icc. gcc-asm is the default compile; gcc is gcc without the assembly optimizations.
Enough preamble, here are the tests. :-)
lain: dual 560 MHz P3; 1024 MB PC112 SDRAM
------------------------------------------------------------------
test                          gcc-asm    gcc    icc   gain
------------------------------------------------------------------
Ackermann . . . . . . . . . .    1.96    2.31   1.99    14
Append array. . . . . . . . .    1.78    1.73   1.90   -10 (262697/s)
Append mapping. . . . . . . .   12.44   11.65   9.24    21 (1082/s)
Append multiset . . . . . . .    1.82    1.87   1.81     4 (5515/s)
Array & String Juggling . . .    3.84    4.01   4.68   -16
Clone null-object . . . . . .    0.74    0.75   0.65    14 (458015/s)
Clone object. . . . . . . . .    1.73    1.69   1.91   -13 (156794/s)
Compile . . . . . . . . . . .    4.30    4.07   3.42    16 (7048 lines/s)
Compile & Exec. . . . . . . .    4.48    4.11   3.57    14 (168696 lines/s)
GC. . . . . . . . . . . . . .    1.67    1.72   1.36    22
Insert in mapping . . . . . .    0.99    1.06   0.96    10 (518672/s)
Insert in multiset. . . . . .    2.71    3.09   2.41    22 (207469/s)
Loops Nested (global) . . . .    1.45    2.06   1.98     5 (8473341 iters/s)
Loops Nested (local). . . . .    0.90    1.48   1.31    12 (12807036 iters/s)
Loops Recursed. . . . . . . .    1.37    1.62   1.48     9 (708497 iters/s)
Matrix multiplication . . . .    1.62    1.78   1.38    23
Pike start overhead . . . . .    0.00    0.00   0.00     0
Read binary INT128. . . . . .    4.01    3.71   3.13    16 (3195/s)
Read binary INT16 . . . . . .    0.68    0.66   0.64     4 (1570680/s)
Read binary INT32 . . . . . .   10.28   10.04   9.15     9 (54645/s)
------------------------------------------------------------------

eiri: 450 MHz P2; 768 MB PC100 SDRAM
------------------------------------------------------------------
test                          gcc-asm    gcc    icc   gain
------------------------------------------------------------------
Ackermann . . . . . . . . . .    2.19    2.55   2.17    15
Append array. . . . . . . . .    2.11    2.12   2.32    -9 (215517/s)
Append mapping. . . . . . . .   14.26   12.67  10.47    18 (955/s)
Append multiset . . . . . . .    2.02    2.11   1.98     7 (5059/s)
Array & String Juggling . . .    4.56    4.78   5.20    -8
Clone null-object . . . . . .    0.77    0.84   0.72    15 (416667/s)
Clone object. . . . . . . . .    1.82    2.20   2.30    -4 (130435/s)
Compile . . . . . . . . . . .    4.82    4.39   3.75    15 (6437 lines/s)
Compile & Exec. . . . . . . .    5.02    4.78   4.30    10 (139860 lines/s)
GC. . . . . . . . . . . . . .    1.86    1.89   1.62    15
Insert in mapping . . . . . .    1.09    1.16   1.01    13 (493827/s)
Insert in multiset. . . . . .    3.25    3.53   2.75    23 (181818/s)
Loops Nested (global) . . . .    1.60    2.29   2.38    -4 (7049250 iters/s)
Loops Nested (local). . . . .    1.00    1.63   1.73    -6 (9679163 iters/s)
Loops Recursed. . . . . . . .    1.53    1.77   1.60    10 (655360 iters/s)
Matrix multiplication . . . .    1.83    1.86   1.54    18
Pike start overhead . . . . .    0.00    0.00   0.00    25
Read binary INT128. . . . . .    4.11    4.09   3.97     4 (2519/s)
Read binary INT16 . . . . . .    0.75    0.72   0.72     1 (1388889/s)
Read binary INT32 . . . . . .   10.74   11.42  10.27    11 (48685/s)
------------------------------------------------------------------

ayumu: 2.1 GHz P4 Celeron, 512 MB DDR333
------------------------------------------------------------------
test                          gcc-asm    gcc    icc   gain
------------------------------------------------------------------
Ackermann . . . . . . . . . .    0.66    0.69   0.55    21
Append array. . . . . . . . .    0.55    0.60   0.51    16 (985222/s)
Append mapping. . . . . . . .    2.94    3.06   2.52    18 (3968/s)
Append multiset . . . . . . .    0.44    0.45   0.44     3 (22843/s)
Array & String Juggling . . .    1.29    1.29   1.35    -4
Clone null-object . . . . . .    0.28    0.26   0.22    16 (1359516/s)
Clone object. . . . . . . . .    0.75    0.55   0.45    19 (663391/s)
Compile . . . . . . . . . . .    1.47    1.26   1.07    15 (22508 lines/s)
Compile & Exec. . . . . . . .    1.37    1.33   1.09    19 (552757 lines/s)
GC. . . . . . . . . . . . . .    0.57    0.57   0.47    17
Insert in mapping . . . . . .    0.27    0.30   0.25    19 (2034884/s)
Insert in multiset. . . . . .    0.81    0.85   0.73    14 (681818/s)
Loops Nested (global) . . . .    0.41    0.68   0.47    32 (36036980 iters/s)
Loops Nested (local). . . . .    0.26    0.44   0.33    25 (50423328 iters/s)
Loops Recursed. . . . . . . .    0.42    0.55   0.39    30 (2674939 iters/s)
Matrix multiplication . . . .    0.51    0.48   0.47     4
Pike start overhead . . . . .    0.00    0.00   0.00     0
Read binary INT128. . . . . .    1.13    1.07   0.94    13 (10638/s)
Read binary INT16 . . . . . .    0.20    0.19   0.17     9 (5802047/s)
Read binary INT32 . . . . . .    2.86    2.70   2.35    14 (212766/s)
------------------------------------------------------------------

sakura: 1.65 GHz P4, 512 MB PC800 RDRAM
------------------------------------------------------------------
test                          gcc-asm    gcc    icc   gain
------------------------------------------------------------------
Ackermann . . . . . . . . . .    0.76    0.81   0.62    25
Append array. . . . . . . . .    0.62    0.65   0.60     8 (829876/s)
Append mapping. . . . . . . .    3.42    3.76   2.84    25 (3521/s)
Append multiset . . . . . . .    0.52    0.57   0.49    14 (20270/s)
Array & String Juggling . . .    1.17    1.18   0.88    26
Clone null-object . . . . . .    0.32    0.31   0.27    15 (1111111/s)
Clone object. . . . . . . . .    0.62    0.63   0.55    14 (549199/s)
Compile . . . . . . . . . . .    1.42    1.29   1.05    19 (22990 lines/s)
Compile & Exec. . . . . . . .    1.41    1.33   1.16    13 (518448 lines/s)
GC. . . . . . . . . . . . . .    0.58    0.58   0.47    19
Insert in mapping . . . . . .    0.31    0.34   0.27    21 (1871658/s)
Insert in multiset. . . . . .    0.87    0.91   0.76    17 (657895/s)
Loops Nested (global) . . . .    0.44    0.71   0.58    19 (28802088 iters/s)
Loops Nested (local). . . . .    0.32    0.50   0.43    15 (39107724 iters/s)
Loops Recursed. . . . . . . .    0.52    0.64   0.46    29 (2279513 iters/s)
Matrix multiplication . . . .    0.47    0.47   0.41    13
Pike start overhead . . . . .    0.00    0.00   0.00    75
Read binary INT128. . . . . .    1.22    1.21   1.08    11 (9225/s)
Read binary INT16 . . . . . .    0.22    0.22   0.19    13 (5278592/s)
Read binary INT32 . . . . . .    3.54    3.46   3.03    13 (165289/s)
------------------------------------------------------------------
/ Per Hedbor ()
Nice results on the P4. Also nice to see that the reason for the low numbers in non-recursive loops is (mostly) -asm. It would indeed be interesting to see the result for icc + asm. :)
/ David Hedbor
Previous text:
2003-02-08 00:15: Subject: Re: gcc/icc
I'm back!
I sadly enough don't have a modern P4, but my Celeron might work as an indication. However, it has 128Kb cache, and a modern P4 has 512Kb. My old 'normal' P4 has 256Kb cache, and is generally speaking somewhat slower per GHz than a modern one.
The 'gain' column is the percentage difference between gcc and icc. gcc-asm is the default compile; gcc is gcc without the assembly optimizations.
Enough preamble, here are the tests. :-)
The P4 difference is nice enough...
My xenofarm can't seem to build with icc. I get "Fatal error: ilio_malloc: out of memory -- 153338 bytes requested".
Does xenofarm put a memory limit or something on the build processes? My Linux's max process size is 2 GB, and it hardly swapped at all before it gave up (1 GB RAM / 2 GB swap).
/ Mirar
Yes. Run client.sh with --no-limits to disable the ulimits.
/ Peter Bortas
Previous text:
2003-02-08 23:20: Subject: Re: gcc/icc
The P4 difference is nice enough...
My xenofarm can't seem to build with icc. I get "Fatal error: ilio_malloc: out of memory -- 153338 bytes requested".
Does xenofarm put a memory limit or something on the build processes? My Linux' max process size is 2Gb, and it hardly swapped at all before it gave up (1Gb RAM/2Gb swap).
/ Mirar
In the last episode (Feb 07), Ludger Merkens said:
On Fri, 7 Feb 2003, Mirar @ Pike developers forum wrote:
Note that icc function calls seems to be much faster (Loops Recursed), and that the speed is comparable even though the icc version is compiled without machine code. (Both are with 64-bit float and int.)
Nice results for intels compiler. How machine dependent is this result? E.g. is this pentium 4 specific, or can we expect AMD Athlon or pentium III (II) to benefit likewise? What type of CPU did you run your tests btw.?
icc is consistently faster than gcc; of course, they have no AMD-specific optimizations :)
Has anyone looked into how difficult it would be to get icc to use the machine code stuff?
I can't see anything that would indicate that machine-code is dependent on gcc. Everything in code/ia32 manually builds assembly instructions one word at a time.
The problem is that the pike machine code uses unusual calling conventions, which currently are implemented in a gcc-specific way.
/ Henrik Grubbström (Lysator)
Previous text:
2003-02-07 17:54: Subject: Re: gcc/icc
In the last episode (Feb 07), Ludger Merkens said:
On Fri, 7 Feb 2003, Mirar @ Pike developers forum wrote:
Note that icc function calls seem to be much faster (Loops Recursed), and that the speed is comparable even though the icc version is compiled without machine code. (Both are with 64-bit float and int.)
Nice results for Intel's compiler. How machine-dependent is this result? E.g. is it Pentium 4-specific, or can we expect AMD Athlon or Pentium III (II) to benefit likewise? What type of CPU did you run your tests on, btw?
icc is consistently faster than gcc; of course, they have no AMD-specific optimizations :)
Has anyone looked into how difficult it would be to get icc to use the machine code stuff?
I can't see anything that would indicate that machine code is dependent on gcc. Everything in code/ia32 manually builds assembly instructions one word at a time.
-- Dan Nelson dnelson@allantgroup.com
/ Brevbäraren
I can't see anything that would indicate that machine code is dependent on gcc. Everything in code/ia32 manually builds assembly instructions one word at a time.
The calling convention is probably important. It breaks if you compile interpret.o with -fomit-frame-pointer. Other than that, it should be able to work on any supported architecture. (Intel and sparc...?)
/ Mirar
Now with corresponding optimizations (-O3 -ipp7). Run on my Athlon XP. (Not a Pentium 4!)
test                     gcc (machine code)         icc (no machine code)
Pike start overhead..... 0.001s                     0.001s
Ackermann............... 0.660s                     0.653s
Append array............ 0.470s (1063830/s)         0.490s (1020408)
Append mapping.......... 2.890s (3460/s)            2.670s (3745)
Append multiset......... 0.476s (20997/s)           0.416s (24017)
Array & String Juggling. 0.687s                     0.591s
Read binary INT16....... 0.283s (3537736/s)         0.247s (4047619)
Read binary INT32....... 0.177s (2820513/s)         0.140s (3571429)
Read binary INT128...... 0.895s (11173/s)           0.767s (13043)
Clone null-object....... 0.228s (1313869/s)         0.188s (1598985)
Clone object............ 0.429s (699153/s)          0.367s (816327)
Compile................. 1.080s (22352 lines/s)     0.906s (26645)
Compile & Exec.......... 0.904s (665265 lines/s)    0.689s (873402)
GC...................... 0.587s                     0.579s
Insert in mapping....... 0.443s (1128668/s)         0.457s (1094092)
Insert in multiset...... 0.880s (568182/s)          0.770s (649351)
Matrix multiplication... 0.410s                     0.386s
Loops Nested (local).... 0.332s (50561473 iters/s)  0.457s (36711632)
Loops Nested (global)... 0.528s (31788409 iters/s)  0.694s (24164714)
Loops Recursed.......... 1.383s (758464 iters/s)    0.475s (2207528)
It's clearly impressive. (Or, gcc lost any impressiveness it had.)
/ Mirar
Now with corresponding optimizations (-O3 -ipp7).
I believe -ipp7 is the default actually, so the only difference would be -O3 (if you can figure out a way to enable that without enabling -O2 for icc and without enabling -O3 for other compilers, feel free to do so).
/ David Hedbor
In the last episode (Feb 07), David Hedbor @ Pike developers forum said:
Now with corresponding optimizations (-O3 -ipp7).
I believe -ipp7 is the default actually, so the only difference would be -O3 (if you can figure out a way to enable that without enabling -O2 for icc and without enabling -O3 for other compilers, feel free to do so).
You probably mean -tpp7, right? That is the same as -mcpu=p4. You'll get more performance by also adding -xW, which enables icc to emit p4 instructions and vectorize loops.
You probably mean -tpp7, right? That is the same as -mcpu=p4. You'll
Uh, yeah, right.
get more performance by also adding -xW, which enables icc to emit p4 instructions and vectorize loops.
If you look back, you see that I also made -axKW a default command line option. I don't know what the cost of the multi-arch optimizations is, though (i.e. if it has three versions of the same method, what is the method-calling overhead?)
/ David Hedbor
Previous text:
2003-02-07 20:42: Subject: Re: gcc/icc
In the last episode (Feb 07), David Hedbor @ Pike developers forum said:
Now with corresponding optimizations (-O3 -ipp7).
I believe -ipp7 is the default actually, so the only difference would be -O3 (if you can figure out a way to enable that without enabling -O2 for icc and without enabling -O3 for other compilers, feel free to do so).
You probably mean -tpp7, right? That is the same as -mcpu=p4. You'll get more performance by also adding -xW, which enables icc to emit p4 instructions and vectorize loops.
-- Dan Nelson dnelson@allantgroup.com
/ Brevbäraren
<lazy> What is Loops Recursed? A function iterating over itself? </lazy>
/ Peter Lundqvist (disjunkt)
In principle, yes. "pike -x benchmark" or "make benchmark" will run these, if you want to compare with your setup.
Recursive Loops:
| % cat pike/lib/modules/Tools.pmod/Shoot.pmod/RecursiveLoops.pike
| ...
| int n=0;
| int iter=16;
| int d=5;
|
| void recur(int d)
| {
|   if (d--)
|     for (int i=0; i<iter; i++) recur(d);
|   else
|     n++;
| }
|
| void perform()
| {
|   recur(d);
| }
Non-recursive loops (Local) looks like this:
| void perform()
| {
|   int iter = 16;
|   int x=0;
|
|   for (int a; a<iter; a++)
|     for (int b; b<iter; b++)
|       for (int c; c<iter; c++)
|         for (int d; d<iter; d++)
|           for (int e; e<iter; e++)
|             for (int f; f<iter; f++)
|               x++;
|
|   n=x;
| }
The difference between "Local" and "Global" is that the variables are function-local or object-global respectively.
/ Mirar
I knew that; the strange thing for me was the recursive loops. It seems like such a strange construct. Does it commonly appear in the wild?
/ Peter Lundqvist (disjunkt)
Oh, yes. Think tree data structure (XML-tree, file-tree, code-tree) traversal.
/ Martin Nilsson (Åskblod)
I think heavy function calling is what occurs in most Pike programs.
Whether it is recursion on one function or just, say, five to ten levels of function calls doesn't matter that much for the measurement, I think. (Can you recall many backtraces that have fewer than, say, four levels?)
If anything, iteration *without* function calls is the less common construct in Pike. Almost all of those are on the C level.
You should note that many common operations in expressions also yield function calls, for instance `+ on mappings and array creation.
/ Mirar
That one stared me right in the face, didn't it? It just never occurred to me as being a recursive loop.
/ Peter Lundqvist (disjunkt)
Only two tests where gcc is significantly faster.
/ Niels Möller ()
Yes, and both can easily be related to the use of machine code. ICC doesn't seem to want to link if I turn it on, so I can't even test to see if the calling convention is similar enough...
/ Mirar
pike-devel@lists.lysator.liu.se