On Mon, Sep 12, 2011 at 1:59 PM, Niels Möller <nisse@lysator.liu.se> wrote:
(2nd reply on list)
>> It seems that in x86-64 the ASM version is slower than the C one.
>
> Hmm, that's different from what I have seen. This is what I get,
> benchmarked on a 1.3 GHz Intel SU4100, by running
> examples/nettle-benchmark -f 1.3e9 memxor.
The CPU here reports itself as Intel(R) Xeon(R) CPU X5670 @ 2.93GHz (the system has 24 such CPUs). The output of nettle-benchmark on that machine follows.
x86-64 assembly:

[nikos@koninck examples]$ ./nettle-benchmark -f 1.3e9 memxor
sha1_compress: 463.00 cycles
benchmark call overhead: 0.001871 us 2.43 cycles

Algorithm         mode            Mbyte/s  cycles/byte  cycles/block
memxor            aligned        11887.63         0.10          0.83
memxor            unaligned      11194.60         0.11          0.89
memxor3           aligned        11863.08         0.10          0.84
memxor3           unaligned01    11167.71         0.11          0.89
memxor3           unaligned11     6479.55         0.19          1.53
memxor3           unaligned12    10540.27         0.12          0.94
C implementation (gcc-4.1.2):

[nikos@koninck nettle-2.4]$ examples/nettle-benchmark -f 1.3e9 memxor
sha1_compress: 463.30 cycles
benchmark call overhead: 0.001872 us 2.43 cycles

Algorithm         mode            Mbyte/s  cycles/byte  cycles/block
memxor            aligned        11854.78         0.10          0.84
memxor            unaligned      11186.80         0.11          0.89
memxor3           aligned        11896.14         0.10          0.83
memxor3           unaligned01    11169.13         0.11          0.89
memxor3           unaligned11     6437.96         0.19          1.54
memxor3           unaligned12    10485.28         0.12          0.95
I see no big difference between them. However, the results from my benchmark and from yours do vary.

> How do you benchmark? What is ncalls in time_function()?
My benchmark is simplistic: it measures speed as the number of memxor calls completed in a fixed amount of time.
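A benchmark of that sort might be sketched as follows. This is only an illustration of the idea, not the actual benchmark code from the thread; `memxor_simple`, `memxor_bench`, and the time slice are all assumptions.

```c
#include <stdint.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE 10240  /* 10 KByte, the block size nettle-benchmark uses */

/* Byte-wise XOR of src into dst; a simple stand-in for nettle's memxor. */
static void
memxor_simple(uint8_t *dst, const uint8_t *src, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] ^= src[i];
}

/* Count how many memxor calls complete in a fixed slice of CPU time
   (a tenth of a second here); more completed calls means higher
   throughput.  Mbyte/s then follows as calls * BUF_SIZE / elapsed. */
static unsigned long
memxor_bench(void)
{
  static uint8_t dst[BUF_SIZE], src[BUF_SIZE];
  unsigned long calls = 0;
  clock_t limit = clock() + CLOCKS_PER_SEC / 10;

  memset(src, 0x5a, sizeof src);
  while (clock() < limit)
    {
      memxor_simple(dst, src, sizeof dst);
      calls++;
    }
  return calls;
}
```

Note that counting calls in a fixed time window, as here, and timing a fixed number of calls (roughly what nettle-benchmark's time_function() does) should converge to the same throughput figure when the window is long enough.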
>> Moreover I noticed that the loop unrolling techniques used in the C
>> code have no visible performance benefit.
>
> That's what I have seen as well. I keep the small amount of manual
> unrolling for the benefit of other machines and/or compilers (but I'm
> not sure where it really matters).
My personal preference would have been cleaner code.
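For concreteness, the kind of manual unrolling under discussion can be sketched like this. This is a hypothetical illustration, not Nettle's actual memxor.c; both loops compute the same result, and per the measurements above neither is visibly faster on these machines.

```c
#include <stdint.h>
#include <stddef.h>

/* Straightforward word-at-a-time XOR loop. */
static void
memxor_words(uint64_t *dst, const uint64_t *src, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] ^= src[i];
}

/* The same loop manually unrolled four ways (n is assumed to be a
   multiple of 4 in this sketch).  When the loop is bound by load/store
   bandwidth rather than loop overhead, the unrolling buys nothing. */
static void
memxor_words_unrolled(uint64_t *dst, const uint64_t *src, size_t n)
{
  for (size_t i = 0; i < n; i += 4)
    {
      dst[i]     ^= src[i];
      dst[i + 1] ^= src[i + 1];
      dst[i + 2] ^= src[i + 2];
      dst[i + 3] ^= src[i + 3];
    }
}
```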
> I'm also not quite sure what's the right way to think about memory
> bandwidth. nettle-benchmark uses blocks of 10 KByte and processes the
> same block repeatedly, which means that it ought to fit in L1 (or at
> least L2) cache.
I saw no difference in my benchmark when I decreased the buffer to 10 KByte.
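One way to probe whether the cache level matters is to sweep the buffer size across typical cache capacities and watch for a drop in throughput. Again a sketch with assumed helper names, not part of nettle-benchmark:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Byte-wise XOR, a stand-in for nettle's memxor. */
static void
memxor_simple(uint8_t *dst, const uint8_t *src, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] ^= src[i];
}

/* Measure memxor throughput in Mbyte/s for one buffer size by
   re-processing the same buffer for about 0.2 s of CPU time,
   mimicking nettle-benchmark's reuse of a single block. */
static double
throughput(size_t size)
{
  uint8_t *dst = malloc(size), *src = malloc(size);
  unsigned long calls = 0;
  clock_t limit = clock() + CLOCKS_PER_SEC / 5;

  memset(dst, 0x00, size);
  memset(src, 0xa5, size);
  while (clock() < limit)
    {
      memxor_simple(dst, src, size);
      calls++;
    }
  free(dst);
  free(src);
  return (double) calls * size * 5.0 / 1e6;
}

/* Print throughput for sizes straddling typical L1/L2/L3 capacities;
   a drop at some size suggests which cache level bounds the loop. */
static void
sweep(void)
{
  for (size_t size = 1024; size <= 4u * 1024 * 1024; size *= 4)
    printf("%8lu bytes: %.1f Mbyte/s\n",
           (unsigned long) size, throughput(size));
}
```

If, as above, the numbers stay flat from 1 KByte up through several MByte, the loop is not limited by L1 capacity on that machine.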
Regards,
Nikos