On Mon, Sep 12, 2011 at 1:59 PM, Niels Möller <nisse@lysator.liu.se> wrote:
(2nd reply on list)
>> It seems that in x86-64 the ASM version is slower than the C one.
>
> Hmm, that's different from what I have seen. This is what I get,
> benchmarked on a 1.3 GHz Intel SU4100, by running
> examples/nettle-benchmark -f 1.3e9 memxor.
The CPU here reports itself as Intel(R) Xeon(R) CPU X5670 @ 2.93GHz (the system has 24 such CPUs). The output of nettle-benchmark on that machine follows.
x86-64 assembly:

[nikos@koninck examples]$ ./nettle-benchmark -f 1.3e9 memxor
sha1_compress: 463.00 cycles
benchmark call overhead: 0.001871 us 2.43 cycles

Algorithm         mode            Mbyte/s  cycles/byte  cycles/block
memxor            aligned        11887.63         0.10          0.83
memxor            unaligned      11194.60         0.11          0.89
memxor3           aligned        11863.08         0.10          0.84
memxor3           unaligned01    11167.71         0.11          0.89
memxor3           unaligned11     6479.55         0.19          1.53
memxor3           unaligned12    10540.27         0.12          0.94
C implementation (gcc-4.1.2):

[nikos@koninck nettle-2.4]$ examples/nettle-benchmark -f 1.3e9 memxor
sha1_compress: 463.30 cycles
benchmark call overhead: 0.001872 us 2.43 cycles

Algorithm         mode            Mbyte/s  cycles/byte  cycles/block
memxor            aligned        11854.78         0.10          0.84
memxor            unaligned      11186.80         0.11          0.89
memxor3           aligned        11896.14         0.10          0.83
memxor3           unaligned01    11169.13         0.11          0.89
memxor3           unaligned11     6437.96         0.19          1.54
memxor3           unaligned12    10485.28         0.12          0.95
I see no big difference between them. However, the results from my benchmark and from yours do vary.

> How do you benchmark? What is ncalls in time_function()?
My benchmark is simplistic: it measures speed as the number of memxor calls completed in a fixed amount of time.
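A benchmark of that sort might be sketched as follows. This is only an illustration of the idea, not the actual benchmark code from the thread; `memxor_simple`, `memxor_bench`, and the time slice are all assumptions.

```c
#include <stdint.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE 10240  /* 10 KByte, the block size nettle-benchmark uses */

/* Byte-wise XOR of src into dst; a simple stand-in for nettle's memxor. */
static void
memxor_simple(uint8_t *dst, const uint8_t *src, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] ^= src[i];
}

/* Count how many memxor calls complete in a fixed slice of CPU time
   (a tenth of a second here); more completed calls means higher
   throughput.  Mbyte/s then follows as calls * BUF_SIZE / elapsed. */
static unsigned long
memxor_bench(void)
{
  static uint8_t dst[BUF_SIZE], src[BUF_SIZE];
  unsigned long calls = 0;
  clock_t limit = clock() + CLOCKS_PER_SEC / 10;

  memset(src, 0x5a, sizeof src);
  while (clock() < limit)
    {
      memxor_simple(dst, src, sizeof dst);
      calls++;
    }
  return calls;
}
```

Note that counting calls in a fixed time window, as here, and timing a fixed number of calls (roughly what nettle-benchmark's time_function() does) should converge to the same throughput figure when the window is long enough.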
>> Moreover I noticed that the loop unrolling techniques used in the C
>> code have no visible performance benefit.
>
> That's what I have seen as well. I keep the small amount of manual
> unrolling for the benefit of other machines and/or compilers (but I'm
> not sure where it really matters).
My personal preference would have been cleaner code.
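For concreteness, the kind of manual unrolling under discussion can be sketched like this. This is a hypothetical illustration, not Nettle's actual memxor.c; both loops compute the same result, and per the measurements above neither is visibly faster on these machines.

```c
#include <stdint.h>
#include <stddef.h>

/* Straightforward word-at-a-time XOR loop. */
static void
memxor_words(uint64_t *dst, const uint64_t *src, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] ^= src[i];
}

/* The same loop manually unrolled four ways (n is assumed to be a
   multiple of 4 in this sketch).  When the loop is bound by load/store
   bandwidth rather than loop overhead, the unrolling buys nothing. */
static void
memxor_words_unrolled(uint64_t *dst, const uint64_t *src, size_t n)
{
  for (size_t i = 0; i < n; i += 4)
    {
      dst[i]     ^= src[i];
      dst[i + 1] ^= src[i + 1];
      dst[i + 2] ^= src[i + 2];
      dst[i + 3] ^= src[i + 3];
    }
}
```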
> I'm also not quite sure what's the right way to think about memory
> bandwidth. nettle-benchmark uses blocks of 10 KByte and processes the
> same block repeatedly, which means that it ought to fit in L1 (or at
> least L2) cache.
I saw no difference in my benchmark when I decreased the buffer to 10 KByte.
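One way to probe whether the cache level matters is to sweep the buffer size across typical cache capacities and watch for a drop in throughput. Again a sketch with assumed helper names, not part of nettle-benchmark:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Byte-wise XOR, a stand-in for nettle's memxor. */
static void
memxor_simple(uint8_t *dst, const uint8_t *src, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] ^= src[i];
}

/* Measure memxor throughput in Mbyte/s for one buffer size by
   re-processing the same buffer for about 0.2 s of CPU time,
   mimicking nettle-benchmark's reuse of a single block. */
static double
throughput(size_t size)
{
  uint8_t *dst = malloc(size), *src = malloc(size);
  unsigned long calls = 0;
  clock_t limit = clock() + CLOCKS_PER_SEC / 5;

  memset(dst, 0x00, size);
  memset(src, 0xa5, size);
  while (clock() < limit)
    {
      memxor_simple(dst, src, size);
      calls++;
    }
  free(dst);
  free(src);
  return (double) calls * size * 5.0 / 1e6;
}

/* Print throughput for sizes straddling typical L1/L2/L3 capacities;
   a drop at some size suggests which cache level bounds the loop. */
static void
sweep(void)
{
  for (size_t size = 1024; size <= 4u * 1024 * 1024; size *= 4)
    printf("%8lu bytes: %.1f Mbyte/s\n",
           (unsigned long) size, throughput(size));
}
```

If, as above, the numbers stay flat from 1 KByte up through several MByte, the loop is not limited by L1 capacity on that machine.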
Regards,
Nikos