Nikos Mavrogiannopoulos <n.mavrogiannopoulos@gmail.com> writes:
> It seems that in x86-64 the ASM version is slower than the C one.
Hmm, that's different from what I have seen. This is what I get, benchmarked on a 1.3 GHz Intel SU4100, by running examples/nettle-benchmark -f 1.3e9 memxor.
C implementation (compiled with gcc-4.4.5):
Algorithm  mode         Mbyte/s  cycles/byte  cycles/block
memxor     aligned      4885.87         0.25          2.03
memxor     unaligned    2771.14         0.45          3.58
memxor3    aligned      4569.70         0.27          2.17
memxor3    unaligned01  2528.03         0.49          3.92
memxor3    unaligned11  2603.92         0.48          3.81
memxor3    unaligned12  1496.87         0.83          6.63
x86_64 assembly:
Algorithm  mode         Mbyte/s  cycles/byte  cycles/block
memxor     aligned      4895.18         0.25          2.03
memxor     unaligned    3284.10         0.38          3.02
memxor3    aligned      4890.66         0.25          2.03
memxor3    unaligned01  3409.62         0.36          2.91
memxor3    unaligned11  2697.37         0.46          3.68
memxor3    unaligned12  2030.06         0.61          4.89
So there is no difference in the aligned case. I think two cycles per 64-bit word ("block" above means unsigned long) is the memory bandwidth. For the various unaligned cases, the assembly version is a bit faster, shaving off between about 0.1 and 1.7 cycles per word.
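For reference, the cycle columns follow from the throughput column and the clock frequency given with -f. The sketch below shows the arithmetic. The function names are made up for illustration and are not part of nettle, and it assumes "Mbyte" means 2^20 bytes, since that is what reproduces the figures above.

```c
#include <assert.h>

/* Hypothetical helpers, not part of nettle: convert a measured
   throughput into cycles per byte and cycles per block, given the
   clock frequency passed with -f. Assumes 1 Mbyte = 2^20 bytes. */
static double
cycles_per_byte(double clock_hz, double mbyte_per_sec)
{
  assert(mbyte_per_sec > 0.0);
  return clock_hz / (mbyte_per_sec * 1048576.0);
}

static double
cycles_per_block(double clock_hz, double mbyte_per_sec, unsigned block_size)
{
  /* block_size is in bytes; 8 for a 64-bit unsigned long. */
  return cycles_per_byte(clock_hz, mbyte_per_sec) * block_size;
}
```

With a 1.3e9 Hz clock and 8-byte blocks, 4885.87 Mbyte/s comes out at about 2.03 cycles per block, and 1496.87 Mbyte/s at about 6.63.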
> Moreover I noticed that the loop unrolling techniques used in the C code have no visible performance benefit.
That's what I have seen as well. I keep the small amount of manual unrolling for the benefit of other machines and/or compilers (but I'm not sure where it really matters).
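To make "a small amount of manual unrolling" concrete, here is a minimal sketch of a word-at-a-time XOR loop unrolled by two. It is illustrative only and is not nettle's actual memxor code; the function name is hypothetical, and both pointers are assumed to be word aligned.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only, not nettle's memxor: XOR n words of src into dst,
   one unsigned long at a time, with the loop unrolled by two. */
static void
xor_words_unrolled(unsigned long *dst, const unsigned long *src, size_t n)
{
  size_t i = 0;

  assert(n == 0 || (dst != NULL && src != NULL));

  /* Main loop: two words per iteration. */
  for (; i + 2 <= n; i += 2)
    {
      dst[i] ^= src[i];
      dst[i + 1] ^= src[i + 1];
    }
  /* Tail: at most one leftover word when n is odd. */
  if (i < n)
    dst[i] ^= src[i];
}
```

Whether this beats the plain one-word-per-iteration loop depends on the compiler and the machine, which is the point made above.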
> However, an SSE2 version of memxor (attached) increases performance by 30% or more in the same CPU.
I'll have a look at that.
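For readers without the attachment, an SSE2-based XOR can be sketched with compiler intrinsics roughly as follows. This is a hypothetical sketch, not the attached patch: the function name is invented, unaligned loads and stores are used throughout for simplicity, and a plain byte loop serves as the fallback when SSE2 is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical sketch, not the attached patch: XOR n bytes of src into
   dst, 16 bytes at a time where SSE2 is available. Unaligned loads and
   stores mean no alignment is required of either pointer. */
static void
xor_bytes_sse2(uint8_t *dst, const uint8_t *src, size_t n)
{
  size_t i = 0;

  assert(n == 0 || (dst != NULL && src != NULL));
#if defined(__SSE2__)
  for (; i + 16 <= n; i += 16)
    {
      __m128i d = _mm_loadu_si128((const __m128i *) (dst + i));
      __m128i s = _mm_loadu_si128((const __m128i *) (src + i));
      _mm_storeu_si128((__m128i *) (dst + i), _mm_xor_si128(d, s));
    }
#endif
  /* Remaining bytes (or everything, when built without SSE2). */
  for (; i < n; i++)
    dst[i] ^= src[i];
}
```

A tuned version would presumably use aligned loads where the pointers allow it; this sketch makes no claim about relative speed.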
> - SSE2:
>   Xoring in chunks of 32768 bytes: done. 69.94 Gb in 5.00 secs: 13.98 Gb/sec
>   Xoring (unaligned) in chunks of 32768 bytes: done. 65.96 Gb in 5.00 secs: 13.19 Gb/sec
I'm a bit puzzled by your results; I didn't expect any speedup from SSE2 instructions in the aligned case. What machine are you benchmarking on?
I'm also not quite sure what the right way to think about memory bandwidth is. nettle-benchmark uses blocks of 10 KByte and processes the same block repeatedly, which means that the data ought to fit in L1 (or at least L2) cache.
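The measurement pattern described here, the same 10 KByte block processed over and over, can be sketched as below. This is a hypothetical harness, not nettle-benchmark itself: the function name, the byte-wise XOR loop, and the use of clock() are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

#define BLOCK_SIZE 10240  /* 10 KByte, as in the text */

/* Hypothetical harness, not nettle-benchmark: XOR the same 10 KByte
   block repeatedly for at least min_seconds of CPU time and return the
   throughput in Mbyte/s (1 Mbyte = 2^20 bytes). */
static double
bench_xor_same_block(double min_seconds)
{
  static unsigned char dst[BLOCK_SIZE], src[BLOCK_SIZE];
  volatile unsigned char sink;
  unsigned long rounds = 0;
  clock_t start, now;
  size_t i;

  assert(min_seconds > 0.0);
  memset(src, 0x5a, sizeof(src));
  memset(dst, 0, sizeof(dst));

  start = clock();
  do
    {
      for (i = 0; i < BLOCK_SIZE; i++)
        dst[i] ^= src[i];
      sink = dst[0];  /* keep the compiler from discarding the work */
      rounds++;
      now = clock();
    }
  while ((double) (now - start) / CLOCKS_PER_SEC < min_seconds);

  (void) sink;
  return (double) rounds * BLOCK_SIZE / 1048576.0
    / ((double) (now - start) / CLOCKS_PER_SEC);
}
```

Because source and destination together occupy 20 KByte and are reused on every round, a loop like this mostly measures cache throughput, not main memory bandwidth. Calling clock() on every round also adds some overhead, so the figure it returns is only a rough one.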
Regards,
/nisse