I downloaded icc-7.0 and ran a comparison.
test                     gcc                        icc
Pike start overhead..... 0.001s                     0.001s
Ackermann............... 0.660s                     0.646s
Append array............ 0.470s (1063830/s)         0.534s (935551)
Append mapping.......... 2.890s (3460/s)            2.785s (3591)
Append multiset......... 0.476s (20997/s)           0.432s (23148)
Array & String Juggling. 0.687s                     0.604s
Read binary INT16....... 0.283s (3537736/s)         0.257s (3892944)
Read binary INT32....... 0.177s (2820513/s)         0.150s (3342618)
Read binary INT128...... 0.895s (11173/s)           0.857s (11673)
Clone null-object....... 0.228s (1313869/s)         0.177s (1691176)
Clone object............ 0.429s (699153/s)          0.369s (812641)
Compile................. 1.080s (22352 lines/s)     0.893s (27022)
Compile & Exec.......... 0.904s (665265 lines/s)    0.738s (814537)
GC...................... 0.587s                     0.578s
Insert in mapping....... 0.443s (1128668/s)         0.446s (1121076)
Insert in multiset...... 0.880s (568182/s)          0.777s (643777)
Matrix multiplication... 0.410s                     0.402s
Loops Nested (local).... 0.332s (50561473 iters/s)  0.487s (34473732)
Loops Nested (global)... 0.528s (31788409 iters/s)  0.723s (23209587)
Loops Recursed.......... 1.383s (758464 iters/s)    0.504s (2078675)
Note that icc function calls seem to be much faster (Loops Recursed), and that the speed is comparable even though the icc version is compiled without machine code. (Both are with 64-bit float and int.)
Has anyone looked into how difficult it would be to get icc to use the machine code stuff?
On Fri, 7 Feb 2003, Mirar @ Pike developers forum wrote:
I downloaded icc-7.0 and ran a comparison.
Note that icc function calls seems to be much faster (Loops Recursed), and that the speed is comparable even though the icc version is compiled without machine code. (Both are with 64-bit float and int.)
Nice results for Intel's compiler. How machine-dependent is this result? E.g. is this Pentium 4 specific, or can we expect AMD Athlon or Pentium III (II) to benefit likewise? What type of CPU did you run your tests on, by the way?
--- Ludger
Nice results for intels compiler. How machine dependent is this result?
This was on my Athlon XP 1900+ (1.6 GHz), so it seems that at least the Athlon XP benefits.
These are the compilation flags used:
icc: -Ob2 -ipo -ipo_obj -axKW -O2 -g
gcc: -O3 -pipe -fomit-frame-pointer -march=athlon-xp -mcpu=athlon-xp -g
"-ax<codes>  Generate code specialized for processor extensions specified
            by <codes> while also generating generic IA-32 code. <codes>
            includes one or more of the following characters:

              i -- Pentium Pro and Pentium II processor instructions
              M -- MMX(TM) instructions
              K -- Streaming SIMD Extensions
              W -- Pentium(R) 4 New Instructions"
I tried to get it to run machine code, but I got obscure linking errors (compilation went fine). It seems that some objects were missing eval_instruction().
/ Mirar
Previous text:
2003-02-07 14:29: Subject: Re: gcc/icc
On Fri, 7 Feb 2003, Mirar @ Pike developers forum wrote:
I downloaded icc-7.0 and ran a comparison.
Note that icc function calls seems to be much faster (Loops Recursed), and that the speed is comparable even though the icc version is compiled without machine code. (Both are with 64-bit float and int.)
Nice results for intels compiler. How machine dependent is this result? E.g. is this pentium 4 specific, or can we expect AMD Athlon or pentium III (II) to benefit likewise? What type of CPU did you run your tests btw.?
--- Ludger
Has anyone looked into how difficult it would be to get icc to use the machine code stuff?
/ Brevbäraren
I could add that my Athlon 1.2 GHz results I posted earlier were about the same. Results from P3 and P4 computers would be very interesting - especially P4.
/ David Hedbor
Previous text:
2003-02-07 17:46: Subject: Re: gcc/icc
Nice results for intels compiler. How machine dependent is this result?
This was on my Athlon XP 1900+ (1G6Hz), so it seems that at least Athlon XP benefits.
This is the compilation flags used:
icc: -Ob2 -ipo -ipo_obj -axKW -O2 -g
gcc: -O3 -pipe -fomit-frame-pointer -march=athlon-xp -mcpu=athlon-xp -g
"-ax<codes>  Generate code specialized for processor extensions specified
            by <codes> while also generating generic IA-32 code. <codes>
            includes one or more of the following characters:

              i -- Pentium Pro and Pentium II processor instructions
              M -- MMX(TM) instructions
              K -- Streaming SIMD Extensions
              W -- Pentium(R) 4 New Instructions"
I tried to get it to run machine code, but I got obscure linking errors (compilation went fine). It seems that some objects missed eval_instruction().
/ Mirar
I am downloading the compiler now. I'll be back. :-)
/ Per Hedbor ()
Previous text:
2003-02-07 18:44: Subject: Re: gcc/icc
I could add that my athlon 1.2 GHz results I posted earlier were about the same. Results from P3 and P4 computers would be very interesting - especially P4.
/ David Hedbor
You may or may not need to patch your bits/byteswap.h. On my Linux/glibc I needed to change this "16" to "32":

vv
 # define __bswap_16(x) \
     (__extension__ \
      ({ register unsigned int __x = (x); __bswap_constant_32 (__x); }))

(Just a tip for everyone who wants to get icc working. Newer/CVS versions of glibc don't have this problem.)
/ Mirar
Previous text:
2003-02-07 22:24: Subject: Re: gcc/icc
I am downloading the compiler now. I'll be back. :-)
/ Per Hedbor ()
What happened if you did not?
/ David Hedbor
Previous text:
2003-02-07 22:43: Subject: Re: gcc/icc
You may or may not need to patch your bits/byteswap.h. On my Linux/glibc you needed to change this "16" to "32":
vv
 # define __bswap_16(x) \
     (__extension__ \
      ({ register unsigned int __x = (x); __bswap_constant_32 (__x); }))
(Just a tip for everyone who wants to get icc working. Newer/CVS versions of glibc doesn't have this problem.)
/ Mirar
Any use of ntohl/htonl gives errors when the program is linked. It probably doesn't show up in old enough or new enough glibcs, but it at least shows up in the current Gentoo.
/ Mirar
Previous text:
2003-02-07 23:08: Subject: Re: gcc/icc
What happened if you did not?
/ David Hedbor
It worked for me.
On a not quite related note:
Matrix multiplication......Gmp.mpz conversion failed (Gmp.bignum not loaded).
/usr/local/pike/7.5.3/lib/modules/Tools.pmod/Shoot.pmod/MatrixMult.pike:21:
  Tools.Shoot.MatrixMult()->test(-1)
/usr/local/pike/7.5.3/lib/modules/Tools.pmod/Shoot.pmod/MatrixMult.pike:26:
  Tools.Shoot.MatrixMult()->perform()
/usr/local/pike/7.5.3/lib/modules/Tools.pmod/Shoot.pmod/module.pmod:129:
  Tools.Shoot->_shoot("MatrixMult")
-:3: -()->run(1,({"/usr/local/pike/7.5.3/bin/pike"}),mapping[34])
failed to spawn pike or run test
I can use Gmp.mpz (or Gmp.bignum) without trouble if I just start pike, though.
/ Per Hedbor ()
Adding main_resolv("Gmp.bignum"); in _main in the master fixed the problem (and almost exactly doubled the Pike start overhead :-)).
/ Per Hedbor ()
Previous text:
2003-02-07 23:12: Subject: Re: gcc/icc
It worked for me.
On a not quite related note:
I can use Gmp.mpz (or Gmp.bignum) without trouble if I just start the pike, though.
/ Per Hedbor ()
Yes, something that probably should load Gmp doesn't. It's a recent (a few weeks old) error, I think.
I patched the benchmarks a week or so ago so they load Gmp. *looks* Hmm, it never got checked in... *checks in*
/ Mirar
Don't.
/ Martin Nilsson (Åskblod)
Previous text:
2003-02-07 23:16: Subject: Re: gcc/icc
Yes, something that probably should doesn't load Gmp. It's a recent (few weeks) error, I think.
I patched the benchmarks a week ago or so so they load Gmp. *looks* Hmm, it never got checked in... *checks in*
/ Mirar
It's such an ugly fix that it shouldn't end up in the Pike main CVS.
Though I can't figure out why the error occurs in the first place. It does appear very much as if Gmp.bignum is resolved in the master before the benchmark is run.
/ Martin Nilsson (Åskblod)
Previous text:
2003-02-07 23:28: Subject: Re: gcc/icc
Any good reason they shouldn't? They need Gmp?
/ Mirar
Feel free to remove it when Gmp is loaded by something else. The problem seems only to appear in the benchmark test pike instances.
/ Mirar
Previous text:
2003-02-07 23:41: Subject: Re: gcc/icc
It's sunch an ugly fix that it shouldn't end up in the Pike main CVS.
Though I can't figure out why the error occurs in the first place. It does appear very much as if Gmp.bignum is resolved in the master before the benchmark is run.
/ Martin Nilsson (Åskblod)
I'm back!
Sadly, I don't have a modern P4, but my Celeron might work as an indication. However, it has 128 KB cache, and a modern P4 has 512 KB. My old 'normal' P4 has 256 KB cache, and is generally speaking somewhat slower per GHz than a modern one.
The 'gain' column is the percentage difference between gcc and icc. gcc-asm is the default compile; gcc is gcc without the assembly optimizations.
Enough preamble, here are the tests. :-)
lain: dual 560 MHz P3; 1024 MB PC112 SDRAM
------------------------------------------------------------------
test                          gcc-asm    gcc    icc   gain
------------------------------------------------------------------
Ackermann . . . . . . . . . .    1.96    2.31   1.99    14
Append array. . . . . . . . .    1.78    1.73   1.90   -10 (262697/s)
Append mapping. . . . . . . .   12.44   11.65   9.24    21 (1082/s)
Append multiset . . . . . . .    1.82    1.87   1.81     4 (5515/s)
Array & String Juggling . . .    3.84    4.01   4.68   -16
Clone null-object . . . . . .    0.74    0.75   0.65    14 (458015/s)
Clone object. . . . . . . . .    1.73    1.69   1.91   -13 (156794/s)
Compile . . . . . . . . . . .    4.30    4.07   3.42    16 (7048 lines/s)
Compile & Exec. . . . . . . .    4.48    4.11   3.57    14 (168696 lines/s)
GC. . . . . . . . . . . . . .    1.67    1.72   1.36    22
Insert in mapping . . . . . .    0.99    1.06   0.96    10 (518672/s)
Insert in multiset. . . . . .    2.71    3.09   2.41    22 (207469/s)
Loops Nested (global) . . . .    1.45    2.06   1.98     5 (8473341 iters/s)
Loops Nested (local). . . . .    0.90    1.48   1.31    12 (12807036 iters/s)
Loops Recursed. . . . . . . .    1.37    1.62   1.48     9 (708497 iters/s)
Matrix multiplication . . . .    1.62    1.78   1.38    23
Pike start overhead . . . . .    0.00    0.00   0.00     0
Read binary INT128. . . . . .    4.01    3.71   3.13    16 (3195/s)
Read binary INT16 . . . . . .    0.68    0.66   0.64     4 (1570680/s)
Read binary INT32 . . . . . .   10.28   10.04   9.15     9 (54645/s)
------------------------------------------------------------------

eiri: 450 MHz P2; 768 MB PC100 SDRAM
------------------------------------------------------------------
test                          gcc-asm    gcc    icc   gain
------------------------------------------------------------------
Ackermann . . . . . . . . . .    2.19    2.55   2.17    15
Append array. . . . . . . . .    2.11    2.12   2.32    -9 (215517/s)
Append mapping. . . . . . . .   14.26   12.67  10.47    18 (955/s)
Append multiset . . . . . . .    2.02    2.11   1.98     7 (5059/s)
Array & String Juggling . . .    4.56    4.78   5.20    -8
Clone null-object . . . . . .    0.77    0.84   0.72    15 (416667/s)
Clone object. . . . . . . . .    1.82    2.20   2.30    -4 (130435/s)
Compile . . . . . . . . . . .    4.82    4.39   3.75    15 (6437 lines/s)
Compile & Exec. . . . . . . .    5.02    4.78   4.30    10 (139860 lines/s)
GC. . . . . . . . . . . . . .    1.86    1.89   1.62    15
Insert in mapping . . . . . .    1.09    1.16   1.01    13 (493827/s)
Insert in multiset. . . . . .    3.25    3.53   2.75    23 (181818/s)
Loops Nested (global) . . . .    1.60    2.29   2.38    -4 (7049250 iters/s)
Loops Nested (local). . . . .    1.00    1.63   1.73    -6 (9679163 iters/s)
Loops Recursed. . . . . . . .    1.53    1.77   1.60    10 (655360 iters/s)
Matrix multiplication . . . .    1.83    1.86   1.54    18
Pike start overhead . . . . .    0.00    0.00   0.00    25
Read binary INT128. . . . . .    4.11    4.09   3.97     4 (2519/s)
Read binary INT16 . . . . . .    0.75    0.72   0.72     1 (1388889/s)
Read binary INT32 . . . . . .   10.74   11.42  10.27    11 (48685/s)
------------------------------------------------------------------

ayumu: 2.1 GHz P4 Celeron, 512 MB DDR333
------------------------------------------------------------------
test                          gcc-asm    gcc    icc   gain
------------------------------------------------------------------
Ackermann . . . . . . . . . .    0.66    0.69   0.55    21
Append array. . . . . . . . .    0.55    0.60   0.51    16 (985222/s)
Append mapping. . . . . . . .    2.94    3.06   2.52    18 (3968/s)
Append multiset . . . . . . .    0.44    0.45   0.44     3 (22843/s)
Array & String Juggling . . .    1.29    1.29   1.35    -4
Clone null-object . . . . . .    0.28    0.26   0.22    16 (1359516/s)
Clone object. . . . . . . . .    0.75    0.55   0.45    19 (663391/s)
Compile . . . . . . . . . . .    1.47    1.26   1.07    15 (22508 lines/s)
Compile & Exec. . . . . . . .    1.37    1.33   1.09    19 (552757 lines/s)
GC. . . . . . . . . . . . . .    0.57    0.57   0.47    17
Insert in mapping . . . . . .    0.27    0.30   0.25    19 (2034884/s)
Insert in multiset. . . . . .    0.81    0.85   0.73    14 (681818/s)
Loops Nested (global) . . . .    0.41    0.68   0.47    32 (36036980 iters/s)
Loops Nested (local). . . . .    0.26    0.44   0.33    25 (50423328 iters/s)
Loops Recursed. . . . . . . .    0.42    0.55   0.39    30 (2674939 iters/s)
Matrix multiplication . . . .    0.51    0.48   0.47     4
Pike start overhead . . . . .    0.00    0.00   0.00     0
Read binary INT128. . . . . .    1.13    1.07   0.94    13 (10638/s)
Read binary INT16 . . . . . .    0.20    0.19   0.17     9 (5802047/s)
Read binary INT32 . . . . . .    2.86    2.70   2.35    14 (212766/s)
------------------------------------------------------------------

sakura: 1.65 GHz P4, 512 MB PC800 RDRAM
------------------------------------------------------------------
test                          gcc-asm    gcc    icc   gain
------------------------------------------------------------------
Ackermann . . . . . . . . . .    0.76    0.81   0.62    25
Append array. . . . . . . . .    0.62    0.65   0.60     8 (829876/s)
Append mapping. . . . . . . .    3.42    3.76   2.84    25 (3521/s)
Append multiset . . . . . . .    0.52    0.57   0.49    14 (20270/s)
Array & String Juggling . . .    1.17    1.18   0.88    26
Clone null-object . . . . . .    0.32    0.31   0.27    15 (1111111/s)
Clone object. . . . . . . . .    0.62    0.63   0.55    14 (549199/s)
Compile . . . . . . . . . . .    1.42    1.29   1.05    19 (22990 lines/s)
Compile & Exec. . . . . . . .    1.41    1.33   1.16    13 (518448 lines/s)
GC. . . . . . . . . . . . . .    0.58    0.58   0.47    19
Insert in mapping . . . . . .    0.31    0.34   0.27    21 (1871658/s)
Insert in multiset. . . . . .    0.87    0.91   0.76    17 (657895/s)
Loops Nested (global) . . . .    0.44    0.71   0.58    19 (28802088 iters/s)
Loops Nested (local). . . . .    0.32    0.50   0.43    15 (39107724 iters/s)
Loops Recursed. . . . . . . .    0.52    0.64   0.46    29 (2279513 iters/s)
Matrix multiplication . . . .    0.47    0.47   0.41    13
Pike start overhead . . . . .    0.00    0.00   0.00    75
Read binary INT128. . . . . .    1.22    1.21   1.08    11 (9225/s)
Read binary INT16 . . . . . .    0.22    0.22   0.19    13 (5278592/s)
Read binary INT32 . . . . . .    3.54    3.46   3.03    13 (165289/s)
------------------------------------------------------------------
/ Per Hedbor ()
Nice results on the P4. Also nice to see that the reason for the low numbers in non-recursive loops is (mostly) -asm. It would indeed be interesting to see the result for icc + asm. :)
/ David Hedbor
Previous text:
2003-02-08 00:15: Subject: Re: gcc/icc
I'm back!
I sadly enough don't have a modern P4, but my Celeron might work as an indication. However, it has 128Kb cache, and a modern P4 has 512Kb. My old 'normal' P4 has 256Kb cache, and is generally speaking somewhat slower per GHz than a modern one.
The 'gain' column is the percentage difference between gcc and icc. gcc-asm is the default compile; gcc is gcc without the assembly optimizations.
Enough preamble, here are the tests. :-)
The P4 difference is nice enough...
My xenofarm can't seem to build with icc. I get "Fatal error: ilio_malloc: out of memory -- 153338 bytes requested".
Does xenofarm put a memory limit or something on the build processes? My Linux's max process size is 2 GB, and it hardly swapped at all before it gave up (1 GB RAM / 2 GB swap).
/ Mirar
Yes. Run client.sh with --no-limits to disable the ulimits.
/ Peter Bortas
Previous text:
2003-02-08 23:20: Subject: Re: gcc/icc
The P4 difference is nice enough...
My xenofarm can't seem to build with icc. I get "Fatal error: ilio_malloc: out of memory -- 153338 bytes requested".
Does xenofarm put a memory limit or something on the build processes? My Linux' max process size is 2Gb, and it hardly swapped at all before it gave up (1Gb RAM/2Gb swap).
/ Mirar
In the last episode (Feb 07), Ludger Merkens said:
On Fri, 7 Feb 2003, Mirar @ Pike developers forum wrote:
Note that icc function calls seems to be much faster (Loops Recursed), and that the speed is comparable even though the icc version is compiled without machine code. (Both are with 64-bit float and int.)
Nice results for intels compiler. How machine dependent is this result? E.g. is this pentium 4 specific, or can we expect AMD Athlon or pentium III (II) to benefit likewise? What type of CPU did you run your tests btw.?
icc is consistently faster than gcc; of course, they have no AMD-specific optimizations :)
Has anyone looked into how difficult it would be to get icc to use the machine code stuff?
I can't see anything that would indicate that machine-code is dependent on gcc. Everything in code/ia32 manually builds assembly instructions one word at a time.
The problem is that the pike machine code uses unusual calling conventions, which currently are implemented in a gcc-specific way.
/ Henrik Grubbström (Lysator)
Previous text:
2003-02-07 17:54: Subject: Re: gcc/icc
In the last episode (Feb 07), Ludger Merkens said:
On Fri, 7 Feb 2003, Mirar @ Pike developers forum wrote:
Note that icc function calls seem to be much faster (Loops Recursed), and that the speed is comparable even though the icc version is compiled without machine code. (Both are with 64-bit float and int.)
Nice results for Intel's compiler. How machine-dependent is this result? E.g. is it Pentium 4-specific, or can we expect AMD Athlon or Pentium III (II) to benefit likewise? What type of CPU did you run your tests on, btw?
icc is consistently faster than gcc; of course, they have no AMD-specific optimizations :)
Has anyone looked into how difficult it would be to get icc to use the machine code stuff?
I can't see anything that would indicate that machine code is dependent on gcc. Everything in code/ia32 manually builds assembly instructions one word at a time.
-- Dan Nelson dnelson@allantgroup.com
/ Brevbäraren
I can't see anything that would indicate that machine code is dependent on gcc. Everything in code/ia32 manually builds assembly instructions one word at a time.
The calling convention is probably important. It breaks if you compile interpret.o with -fomit-frame-pointer. Other than that, it should be able to work on any supported architecture. (Intel and sparc...?)
/ Mirar
Now with corresponding optimizations (-O3 -ipp7). Run on my Athlon XP. (Not a Pentium 4!)
test                     gcc (machine code)         icc (no machine code)
Pike start overhead..... 0.001s                     0.001s
Ackermann............... 0.660s                     0.653s
Append array............ 0.470s (1063830/s)         0.490s (1020408)
Append mapping.......... 2.890s (3460/s)            2.670s (3745)
Append multiset......... 0.476s (20997/s)           0.416s (24017)
Array & String Juggling. 0.687s                     0.591s
Read binary INT16....... 0.283s (3537736/s)         0.247s (4047619)
Read binary INT32....... 0.177s (2820513/s)         0.140s (3571429)
Read binary INT128...... 0.895s (11173/s)           0.767s (13043)
Clone null-object....... 0.228s (1313869/s)         0.188s (1598985)
Clone object............ 0.429s (699153/s)          0.367s (816327)
Compile................. 1.080s (22352 lines/s)     0.906s (26645)
Compile & Exec.......... 0.904s (665265 lines/s)    0.689s (873402)
GC...................... 0.587s                     0.579s
Insert in mapping....... 0.443s (1128668/s)         0.457s (1094092)
Insert in multiset...... 0.880s (568182/s)          0.770s (649351)
Matrix multiplication... 0.410s                     0.386s
Loops Nested (local).... 0.332s (50561473 iters/s)  0.457s (36711632)
Loops Nested (global)... 0.528s (31788409 iters/s)  0.694s (24164714)
Loops Recursed.......... 1.383s (758464 iters/s)    0.475s (2207528)
It's clearly impressive. (Or, gcc lost any impressiveness it had.)
/ Mirar
Now with corresponding optimizations (-O3 -ipp7).
I believe -ipp7 is the default actually, so the only difference would be -O3 (if you can figure out a way to enable that without enabling -O2 for icc and without enabling -O3 for other compilers, feel free to do so).
/ David Hedbor
In the last episode (Feb 07), David Hedbor @ Pike developers forum said:
Now with corresponding optimizations (-O3 -ipp7).
I believe -ipp7 is the default actually, so the only difference would be -O3 (if you can figure out a way to enable that without enabling -O2 for icc and without enabling -O3 for other compilers, feel free to do so).
You probably mean -tpp7, right? That is the same as -mcpu=p4. You'll get more performance by also adding -xW, which enables icc to emit p4 instructions and vectorize loops.
You probably mean -tpp7, right? That is the same as -mcpu=p4. You'll
Uh, yeah, right.
get more performance by also adding -xW, which enables icc to emit p4 instructions and vectorize loops.
If you look back, you see that I also made -axKW a default command line option. I don't know what the cost of the multi-arch optimizations is, though (i.e. if it has three versions of the same method, what is the method-calling overhead?)
/ David Hedbor
Previous text:
2003-02-07 20:42: Subject: Re: gcc/icc
In the last episode (Feb 07), David Hedbor @ Pike developers forum said:
Now with corresponding optimizations (-O3 -ipp7).
I believe -ipp7 is the default actually, so the only difference would be -O3 (if you can figure out a way to enable that without enabling -O2 for icc and without enabling -O3 for other compilers, feel free to do so).
You probably mean -tpp7, right? That is the same as -mcpu=p4. You'll get more performance by also adding -xW, which enables icc to emit p4 instructions and vectorize loops.
-- Dan Nelson dnelson@allantgroup.com
/ Brevbäraren
<lazy> What is Loops Recursed? A function iterating over itself? </lazy>
/ Peter Lundqvist (disjunkt)
In principle, yes. "pike -x benchmark" or "make benchmark" will run these, if you want to compare with your setup.
Recursive Loops:
| % cat pike/lib/modules/Tools.pmod/Shoot.pmod/RecursiveLoops.pike
| ...
| int n=0;
| int iter=16;
| int d=5;
|
| void recur(int d)
| {
|   if (d--)
|     for (int i=0; i<iter; i++) recur(d);
|   else
|     n++;
| }
|
| void perform()
| {
|   recur(d);
| }
Non-recursive loops (Local) looks like this:
| void perform()
| {
|   int iter = 16;
|   int x=0;
|
|   for (int a; a<iter; a++)
|     for (int b; b<iter; b++)
|       for (int c; c<iter; c++)
|         for (int d; d<iter; d++)
|           for (int e; e<iter; e++)
|             for (int f; f<iter; f++)
|               x++;
|
|   n=x;
| }
The difference between "Local" and "Global" is that the variables are function-local or object-global respectively.
/ Mirar
I knew that; the strange thing for me was the recursive loops. It seems like such a strange construct. Does it commonly appear in the wild?
/ Peter Lundqvist (disjunkt)
Oh, yes. Think tree data structure (XML-tree, file-tree, code-tree) traversal.
/ Martin Nilsson (Åskblod)
I think heavy function calling is what occurs in most Pike programs.
Whether it is recursion on one function or just, say, five to ten levels of function calls doesn't matter that much for the measurement, I think. (Can you recall many backtraces that have fewer than, say, four levels?)
If anything, iteration *without* function calls is the less common construct in Pike. Almost all of those are on the C level.
You should note that many common operations in expressions also yield function calls, for instance `+ on mappings and array creation.
/ Mirar
That one stared me right in the face, didn't it? It just never occurred to me as being a recursive loop.
/ Peter Lundqvist (disjunkt)
Only two tests where gcc is significantly faster.
/ Niels Möller ()
Yes, and both can easily be related to the use of machine code. ICC doesn't seem to want to link if I turn it on, so I can't even test to see if the calling convention is similar enough...
/ Mirar
pike-devel@lists.lysator.liu.se