Re: memxor

12 Sep 2011


      Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com writes:
...
> It seems that in x86-64 the ASM version is slower than the C one.

Hmm, that's different from what I have seen. This is what I get,
benchmarked on a 1.3 GHz Intel SU4100, by running
examples/nettle-benchmark -f 1.3e9 memxor.
C-implementation (compiled with gcc-4.4.5):
Algorithm  mode         Mbyte/s  cycles/byte  cycles/block
memxor     aligned      4885.87         0.25          2.03
memxor     unaligned    2771.14         0.45          3.58
memxor3    aligned      4569.70         0.27          2.17
memxor3    unaligned01  2528.03         0.49          3.92
memxor3    unaligned11  2603.92         0.48          3.81
memxor3    unaligned12  1496.87         0.83          6.63
x86_64 assembly:
Algorithm  mode         Mbyte/s  cycles/byte  cycles/block
memxor     aligned      4895.18         0.25          2.03
memxor     unaligned    3284.10         0.38          3.02
memxor3    aligned      4890.66         0.25          2.03
memxor3    unaligned01  3409.62         0.36          2.91
memxor3    unaligned11  2697.37         0.46          3.68
memxor3    unaligned12  2030.06         0.61          4.89
So no difference for the aligned case. I think two cycles per 64-bit
word ("block" above means unsigned long) is the memory bandwidth. For
the different unaligned cases, the assembly version is a bit faster,
shaving off roughly 0.1-1.7 cycles per word.
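
As a check on how the columns relate: the figures above are consistent with cycles/byte = clock frequency / throughput, taking 1 Mbyte = 2^20 bytes (an assumption, inferred from the numbers rather than stated in the mail), and cycles/block = 8 * cycles/byte for a 64-bit unsigned long. A minimal sketch in C; the function names are made up for illustration and are not part of nettle-benchmark.

```c
#include <assert.h>

/* Assumption: "Mbyte" in the benchmark output means 2^20 bytes. */
#define BYTES_PER_MBYTE (1024.0 * 1024.0)

/* Cycles spent per byte, given the clock frequency in Hz (the value
   passed with -f) and the measured throughput in Mbyte/s. */
double
cycles_per_byte(double clock_hz, double mbyte_per_sec)
{
  assert(mbyte_per_sec > 0.0);
  return clock_hz / (mbyte_per_sec * BYTES_PER_MBYTE);
}

/* Cycles per block of block_size bytes (8 for a 64-bit unsigned long). */
double
cycles_per_block(double clock_hz, double mbyte_per_sec, unsigned block_size)
{
  return cycles_per_byte(clock_hz, mbyte_per_sec) * block_size;
}
```

With 1.3e9 Hz and 4885.87 Mbyte/s this gives about 0.25 cycles/byte, or 2.03 cycles per 8-byte block, matching the first row of the C table.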
...
> Moreover I noticed that the loop unrolling techniques used in the C
> code have no visible performance benefit.

That's what I have seen as well. I keep the small amount of manual
unrolling for the benefit of other machines and/or compilers (but I'm
not sure where it really matters).
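
For readers without the source at hand, the kind of small manual unrolling under discussion might look like the sketch below. This is a hypothetical illustration, not Nettle's memxor: the name memxor_sketch and the two-words-per-iteration factor are assumptions, and memcpy is used for the word loads and stores so the code is valid for any alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: XOR n bytes of src into dst, one unsigned long
   at a time, with the main loop unrolled to two words per iteration. */
void
memxor_sketch(void *dst_in, const void *src_in, size_t n)
{
  unsigned char *dst = dst_in;
  const unsigned char *src = src_in;
  const size_t w = sizeof(unsigned long);

  assert(n == 0 || (dst != NULL && src != NULL));

  while (n >= 2 * w)
    {
      unsigned long d0, d1, s0, s1;

      /* memcpy of a single word avoids alignment and aliasing problems;
         compilers commonly turn it into a plain load or store. */
      memcpy(&d0, dst, w);
      memcpy(&s0, src, w);
      memcpy(&d1, dst + w, w);
      memcpy(&s1, src + w, w);

      d0 ^= s0;
      d1 ^= s1;

      memcpy(dst, &d0, w);
      memcpy(dst + w, &d1, w);

      dst += 2 * w;
      src += 2 * w;
      n -= 2 * w;
    }

  /* Leftover bytes, fewer than two words. */
  for (; n > 0; n--)
    *dst++ ^= *src++;
}
```

Whether the unrolled loop beats a one-word-per-iteration loop depends on the compiler and the machine, which is the point made above.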
...
> However, an SSE2 version of memxor (attached) increases performance by
> 30% or more in the same CPU.

I'll have a look at that.
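
The attachment is not reproduced in this archive. For orientation only, an SSE2 memxor could be sketched with compiler intrinsics roughly as follows. This is an assumed illustration, not the attached code: the name memxor_sse2_sketch is made up, it uses unaligned loads and stores throughout, and it needs an x86 target with SSE2 (always present on x86-64).

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical sketch: XOR n bytes of src into dst, 16 bytes at a time. */
void
memxor_sse2_sketch(void *dst_in, const void *src_in, size_t n)
{
  unsigned char *dst = dst_in;
  const unsigned char *src = src_in;

  assert(n == 0 || (dst != NULL && src != NULL));

  for (; n >= 16; dst += 16, src += 16, n -= 16)
    {
      /* The "u" variants accept any alignment. */
      __m128i d = _mm_loadu_si128((const __m128i *) dst);
      __m128i s = _mm_loadu_si128((const __m128i *) src);
      _mm_storeu_si128((__m128i *) dst, _mm_xor_si128(d, s));
    }

  /* Leftover bytes, fewer than 16. */
  for (; n > 0; n--)
    *dst++ ^= *src++;
}
```

A tuned version would more likely align the destination first and use aligned stores; that is left out here to keep the sketch short.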
...

> SSE2:
>   Xoring in chunks of 32768 bytes: done. 69.94 Gb in 5.00 secs: 13.98 Gb/sec
>   Xoring (unaligned) in chunks of 32768 bytes: done. 65.96 Gb in 5.00 secs: 13.19 Gb/sec

I'm a bit puzzled by your results, I didn't expect any speedup with sse2
instructions for the aligned case. What machine are you benchmarking on?
I'm also not quite sure what's the right way to think about memory
bandwidth. nettle-benchmark uses blocks of 10 KByte and processes the
same block repeatedly, which means that it ought to fit in L1 (or at
least L2) cache.
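
To make the measurement setup concrete, a benchmark that processes the same 10 KByte block over and over can be sketched as below. This is a simplified stand-in and not nettle-benchmark's code: bench_xor, xor_bytes and the batch size of 1000 are assumptions, timing uses the standard clock() function, and Mbyte is taken as 2^20 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

#define BENCH_BLOCK_SIZE 10240  /* 10 KByte per buffer */

/* Stand-in for the function under test. */
static void
xor_bytes(unsigned char *dst, const unsigned char *src, size_t n)
{
  for (; n > 0; n--)
    *dst++ ^= *src++;
}

/* Hypothetical sketch: XOR the same block repeatedly for at least
   min_seconds of CPU time and return the throughput in Mbyte/s. */
double
bench_xor(double min_seconds)
{
  static unsigned char dst[BENCH_BLOCK_SIZE], src[BENCH_BLOCK_SIZE];
  unsigned long reps = 0;
  clock_t start;
  double elapsed;

  assert(min_seconds > 0.0);
  memset(src, 0x5a, sizeof(src));

  start = clock();
  do
    {
      unsigned i;
      /* Run a batch between clock() calls to keep timing overhead low. */
      for (i = 0; i < 1000; i++)
        xor_bytes(dst, src, sizeof(dst));
      reps += 1000;
      elapsed = (double) (clock() - start) / CLOCKS_PER_SEC;
    }
  while (elapsed < min_seconds);

  return (double) reps * BENCH_BLOCK_SIZE / (elapsed * 1024.0 * 1024.0);
}
```

Because both buffers are reused on every pass, after the first pass the loop mostly measures cache bandwidth rather than main memory bandwidth, provided the buffers fit in cache on the machine in question.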
Regards,
/nisse
-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
