Hello, I've run some tests with memxor on an x86-64 machine. My results are:

* C implementation (compiled with gcc 4.4):
Xoring in chunks of 32768 bytes: done. 50.09 Gb in 5.00 secs: 10.02 Gb/sec
Xoring (unaligned) in chunks of 32768 bytes: done. 39.90 Gb in 5.00 secs: 7.98 Gb/sec
* ASM implementation:
Xoring in chunks of 32768 bytes: done. 38.32 Gb in 5.00 secs: 7.66 Gb/sec
Xoring (unaligned) in chunks of 32768 bytes: done. 30.16 Gb in 5.00 secs: 6.03 Gb/sec
It seems that on x86-64 the ASM version is slower than the C one. Moreover, I noticed that the loop unrolling techniques used in the C code have no visible performance benefit.
However, an SSE2 version of memxor (attached) increases performance by 30% or more on the same CPU.
* SSE2:
Xoring in chunks of 32768 bytes: done. 69.94 Gb in 5.00 secs: 13.98 Gb/sec
Xoring (unaligned) in chunks of 32768 bytes: done. 65.96 Gb in 5.00 secs: 13.19 Gb/sec
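For illustration, a minimal sketch of what such an SSE2 inner loop could look like, written with compiler intrinsics; the actual attached code may differ in detail, and the function name here is only illustrative:

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* XOR src into dst, 16 bytes per iteration. Unaligned loads and
   stores (_mm_loadu_si128/_mm_storeu_si128) keep it correct for
   arbitrary pointers; the tail is handled one byte at a time. */
static void
memxor_sse2(uint8_t *dst, const uint8_t *src, size_t n)
{
  while (n >= 16)
    {
      __m128i d = _mm_loadu_si128((const __m128i *) dst);
      __m128i s = _mm_loadu_si128((const __m128i *) src);
      _mm_storeu_si128((__m128i *) dst, _mm_xor_si128(d, s));
      dst += 16;
      src += 16;
      n -= 16;
    }
  while (n-- > 0)
    *dst++ ^= *src++;
}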
regards, Nikos
Nikos Mavrogiannopoulos <n.mavrogiannopoulos@gmail.com> writes:
> It seems that on x86-64 the ASM version is slower than the C one.
Hmm, that's different from what I have seen. This is what I get, benchmarked on a 1.3 GHz Intel SU4100, by running examples/nettle-benchmark -f 1.3e9 memxor.
C-implementation (compiled with gcc-4.4.5):
Algorithm   mode          Mbyte/s  cycles/byte  cycles/block
memxor      aligned       4885.87         0.25          2.03
memxor      unaligned     2771.14         0.45          3.58
memxor3     aligned       4569.70         0.27          2.17
memxor3     unaligned01   2528.03         0.49          3.92
memxor3     unaligned11   2603.92         0.48          3.81
memxor3     unaligned12   1496.87         0.83          6.63
x86_64 assembly:
Algorithm   mode          Mbyte/s  cycles/byte  cycles/block
memxor      aligned       4895.18         0.25          2.03
memxor      unaligned     3284.10         0.38          3.02
memxor3     aligned       4890.66         0.25          2.03
memxor3     unaligned01   3409.62         0.36          2.91
memxor3     unaligned11   2697.37         0.46          3.68
memxor3     unaligned12   2030.06         0.61          4.89
So no difference for the aligned case. I think two cycles per 64-bit word ("block" above means unsigned long) is the memory bandwidth limit. For the different unaligned cases, on the other hand, the assembly version is a bit faster, shaving off 0.5-2 cycles per word.
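(Sanity check on the arithmetic: at 1.3 GHz, two cycles per 8-byte word works out to 1.3e9 / 2 * 8 = 5.2 GByte/s, close to the ~4.9 GByte/s measured for the aligned case above.)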
> Moreover, I noticed that the loop unrolling techniques used in the C code have no visible performance benefit.
That's what I have seen as well. I keep the small amount of manual unrolling for the benefit of other machines and/or compilers (but I'm not sure where it really matters).
> However, an SSE2 version of memxor (attached) increases performance by 30% or more on the same CPU.
I'll have a look at that.
> * SSE2:
> Xoring in chunks of 32768 bytes: done. 69.94 Gb in 5.00 secs: 13.98 Gb/sec
> Xoring (unaligned) in chunks of 32768 bytes: done. 65.96 Gb in 5.00 secs: 13.19 Gb/sec
I'm a bit puzzled by your results; I didn't expect any speedup with sse2 instructions for the aligned case. What machine are you benchmarking on?
I'm also not quite sure what's the right way to think about memory bandwidth. nettle-benchmark uses blocks of 10 KByte and processes the same block repeatedly, which means that it ought to fit in L1 (or at least L2) cache.
Regards, /nisse
On Mon, Sep 12, 2011 at 1:59 PM, Niels Möller <nisse@lysator.liu.se> wrote:
(2nd reply on list)
>> It seems that on x86-64 the ASM version is slower than the C one.
> Hmm, that's different from what I have seen. This is what I get, benchmarked on a 1.3 GHz Intel SU4100, by running examples/nettle-benchmark -f 1.3e9 memxor.
The CPU reports itself as Intel(R) Xeon(R) CPU X5670 @ 2.93GHz (the system has 24 such CPUs). The output of nettle-benchmark on that machine follows.
x86-64 assembly:

[nikos@koninck examples]$ ./nettle-benchmark -f 1.3e9 memxor
sha1_compress: 463.00 cycles
benchmark call overhead: 0.001871 us 2.43 cycles
Algorithm   mode          Mbyte/s  cycles/byte  cycles/block
memxor      aligned      11887.63         0.10          0.83
memxor      unaligned    11194.60         0.11          0.89
memxor3     aligned      11863.08         0.10          0.84
memxor3     unaligned01  11167.71         0.11          0.89
memxor3     unaligned11   6479.55         0.19          1.53
memxor3     unaligned12  10540.27         0.12          0.94
C-implementation (gcc-4.1.2):

[nikos@koninck nettle-2.4]$ examples/nettle-benchmark -f 1.3e9 memxor
sha1_compress: 463.30 cycles
benchmark call overhead: 0.001872 us 2.43 cycles
Algorithm   mode          Mbyte/s  cycles/byte  cycles/block
memxor      aligned      11854.78         0.10          0.84
memxor      unaligned    11186.80         0.11          0.89
memxor3     aligned      11896.14         0.10          0.83
memxor3     unaligned01  11169.13         0.11          0.89
memxor3     unaligned11   6437.96         0.19          1.54
memxor3     unaligned12  10485.28         0.12          0.95
I see no big difference between them. However, the results from my benchmark and yours differ. How do you benchmark? What is ncalls in time_function()?
My benchmark is simplistic: it measures throughput, i.e. the number of memxor calls completed in a fixed amount of time.
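A minimal sketch of that kind of fixed-time measurement (illustrative only, not the actual test program; the chunk size and duration just mirror the figures quoted earlier, and reading the buffer at the end keeps the compiler from discarding the loop):

#include <signal.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

#define CHUNK 32768
#define SECS 5

static volatile sig_atomic_t done = 0;

static void
alarm_handler(int signo)
{
  (void) signo;
  done = 1;
}

int
main(void)
{
  static unsigned char a[CHUNK], b[CHUNK];
  unsigned long long chunks = 0;
  size_t i;

  signal(SIGALRM, alarm_handler);
  alarm(SECS);

  while (!done)
    {
      for (i = 0; i < CHUNK; i++)   /* stand-in for memxor(a, b, CHUNK) */
        a[i] ^= b[i];
      chunks++;
    }

  printf("done. %.2f Gb in %d secs: %.2f Gb/sec (checksum %u)\n",
         (double) chunks * CHUNK * 8 / 1e9, SECS,
         (double) chunks * CHUNK * 8 / SECS / 1e9, (unsigned) a[0]);
  return 0;
}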
>> Moreover, I noticed that the loop unrolling techniques used in the C code have no visible performance benefit.
> That's what I have seen as well. I keep the small amount of manual unrolling for the benefit of other machines and/or compilers (but I'm not sure where it really matters).
My personal preference would have been cleaner code.
> I'm also not quite sure what's the right way to think about memory bandwidth. nettle-benchmark uses blocks of 10 KByte and processes the same block repeatedly, which means that it ought to fit in L1 (or at least L2) cache.
I saw no difference in my benchmark when I decreased the buffer to 10k.
regards, Nikos
Nikos Mavrogiannopoulos <n.mavrogiannopoulos@gmail.com> writes:
> The CPU reports itself as Intel(R) Xeon(R) CPU X5670 @ 2.93GHz (the system has 24 such CPUs). The output of nettle-benchmark on that machine follows.
> x86-64 assembly:
> [nikos@koninck examples]$ ./nettle-benchmark -f 1.3e9 memxor
To get the printed cycle numbers to make sense, you have to pass the correct clock frequency to the -f option. -f 2.93e9 in your case.
> However, the results from my benchmark and yours differ.
Right, we'll have to figure out why. I'm puzzled.
> How do you benchmark? What is ncalls in time_function()?
time_function loops around the benchmarked function ncalls times, and reads the clock before and after the loop. Then, if the elapsed time was too short, it increases ncalls and starts over.
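In outline, something like the following (a paraphrase with made-up constants, not the exact code in examples/nettle-benchmark.c):

#include <time.h>

static double
time_function(void (*f)(void *), void *arg)
{
  unsigned ncalls = 10;

  for (;;)
    {
      clock_t before, after;
      unsigned i;
      double elapsed;

      before = clock();
      for (i = 0; i < ncalls; i++)
        f(arg);
      after = clock();

      elapsed = (double) (after - before) / CLOCKS_PER_SEC;
      if (elapsed > 0.1)            /* interval long enough to trust */
        return elapsed / ncalls;    /* seconds per call */

      ncalls *= 2;                  /* too short: start over with more calls */
    }
}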
> My benchmark is simplistic: it measures throughput, i.e. the number of memxor calls completed in a fixed amount of time.
I guess that should be good enough. I'm not so familiar with SIGALRM, but I don't see anything obviously wrong with it.
>> That's what I have seen as well. I keep the small amount of manual unrolling for the benefit of other machines and/or compilers (but I'm not sure where it really matters).
> My personal preference would have been cleaner code.
Well, for the unaligned case, the unrolling is also a natural way to avoid moving values between s1 and s0, which I think is nice.
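For reference, a minimal sketch of that technique, assuming little-endian, a word-aligned dst, a nonzero src offset, and an even word count, with the byte-granularity head and tail handling omitted (nettle's actual memxor.c is more general):

#include <stddef.h>
#include <stdint.h>

/* Each destination word is merged from two adjacent aligned source
   words. Unrolling twice lets s0 and s1 swap roles between the two
   halves, so no "s0 = s1" register move is needed inside the loop. */
static void
memxor_unaligned(unsigned long *dst, const unsigned char *src, size_t n)
{
  unsigned offset = (uintptr_t) src % sizeof(unsigned long); /* 1..7 assumed */
  unsigned shl = 8 * offset;
  unsigned shr = 8 * sizeof(unsigned long) - shl;
  const unsigned long *s = (const unsigned long *) (src - offset);
  unsigned long s0, s1;

  s0 = *s++;                    /* aligned word containing src[0] */
  for (; n >= 2; n -= 2, dst += 2)
    {
      s1 = *s++;
      dst[0] ^= (s0 >> shl) | (s1 << shr);
      s0 = *s++;
      dst[1] ^= (s1 >> shl) | (s0 << shr);
    }
}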
Regards, /Niels
On Mon, Sep 12, 2011 at 7:03 PM, Niels Möller <nisse@lysator.liu.se> wrote:
> To get the printed cycle numbers to make sense, you have to pass the correct clock frequency to the -f option. -f 2.93e9 in your case.
>> However, the results from my benchmark and yours differ.
> Right, we'll have to figure out why. I'm puzzled.
It seems --disable-assembler doesn't work for memxor. That is, the figures were so similar with nettle-benchmark because the assembler version was used in both cases. To be honest, I've not understood how the assembler machinery works, so I am not able to disable it even manually.
regards, Nikos
On Tue, Sep 13, 2011 at 10:14 AM, Nikos Mavrogiannopoulos <n.mavrogiannopoulos@gmail.com> wrote:
Corrected figures for nettle-benchmark. My previous issue seems to have been because a new ./configure doesn't really undo the previous settings. SSE2 is faster than the previous implementations (asm or C), but ASM performs better than C in the unaligned case. I cannot figure out why my benchmark shows otherwise (our unaligned tests seem to be pretty much identical). I include the call overhead that you subtract; it seems to be identical in all cases.
* ASM:
benchmark call overhead: 0.001862 us 5.46 cycles
Algorithm   mode          Mbyte/s  cycles/byte  cycles/block
memxor      aligned      11980.56         0.23          1.87
memxor      unaligned    11269.30         0.25          1.98
* C implementation:
benchmark call overhead: 0.001875 us 5.49 cycles
Algorithm   mode          Mbyte/s  cycles/byte  cycles/block
memxor      aligned      11777.25         0.24          1.90
memxor      unaligned     7794.15         0.36          2.87
* SSE2:
benchmark call overhead: 0.001868 us 5.47 cycles
Algorithm   mode          Mbyte/s  cycles/byte  cycles/block
memxor      aligned      15961.09         0.18          1.40
memxor      unaligned    15882.32         0.18          1.41
regards, Nikos
Nikos Mavrogiannopoulos <n.mavrogiannopoulos@gmail.com> writes:
> It seems --disable-assembler doesn't work for memxor.
Maybe you forgot to run make distclean? You should do that whenever you change any interesting configure options.
I usually don't build in the source directory, and then I can have multiple build directories configured with different options. Currently I have build trees with plain ./configure, ./configure --disable-assembler, ./configure --enable-shared, and ./configure CC="gcc -m32" CXX="g++ -m32".
> To be honest, I've not understood how the assembler machinery works, so I am not able to disable it even manually.
I'll try to explain. It's not very complicated.
./configure looks for certain files in the appropriate machine-specific directory, and creates symlinks in the top-level nettle build directory. Then, if make needs to build foo.o and finds both foo.c and foo.asm, it will use the foo.asm file, by the order of the suffix list.
The rule to build foo.o from foo.asm first runs m4 to produce foo.s, and then invokes the compiler $(CC) -c ... on foo.s.
The symlinks are created by config.status (which is also run at the end of ./configure), and deleted by make distclean. If you create or delete any links by hand, make will follow suit.
The setup is inspired by gmp. If you look at what gmp does, there are two main differences:
1. There's a hierarchy of machine-specific directories for different flavors of the same architecture, and configure searches in multiple directories.
2. It doesn't rely on make's ordering of the suffix list. Instead, it puts all C files in the "mpn/generic" subdirectory, and creates symlinks also for the portable C files.
And then gmp also has the possibility of creating a fat binary, including optimized code for all different flavors of the given architecture.
Regards, /Niels
On Tue, Sep 13, 2011 at 11:33 AM, Niels Möller <nisse@lysator.liu.se> wrote:
>> It seems --disable-assembler doesn't work for memxor.
> Maybe you forgot to run make distclean? You should do that whenever you change any interesting configure options.
Indeed. I've now figured it out. It would also be better if configure.ac were self-contained. If I modify it and run autoconf I get:

configure.ac:366: warning: AC_LANG_CONFTEST: no AC_LANG_SOURCE call detected in body
autoconf/lang.m4:198: AC_LANG_CONFTEST is expanded from...
autoconf/general.m4:2599: _AC_COMPILE_IFELSE is expanded from...
autoconf/general.m4:2609: AC_COMPILE_IFELSE is expanded from...
../../lib/m4sugar/m4sh.m4:610: AS_IF is expanded from...
autoconf/general.m4:2047: AC_CACHE_VAL is expanded from...
autoconf/general.m4:2060: AC_CACHE_CHECK is expanded from...
configure.ac:366: the top level
If I run .bootstrap from the lsh dir it succeeds, but running configure prints the warnings below and make fails.

checking malloc.h presence... no
configure: WARNING: malloc.h: accepted by the compiler, rejected by the preprocessor!
configure: WARNING: malloc.h: proceeding with the compiler's result
Anyway, I attach an untested patch that shows how SSE2 or other CPU-specific optimizations can be enabled at run-time.
regards, Nikos
Nikos Mavrogiannopoulos <n.mavrogiannopoulos@gmail.com> writes:
> If I run .bootstrap from the lsh dir it succeeds, but running configure prints the warnings below and make fails.
Strange. I'm not sure what the warnings mean, but make definitely shouldn't fail. I take it you have checked it out from cvs? Exactly what did you do? I'd expect the following to work:
cvs ... co lsh
cd lsh
./.bootstrap
cd nettle
./configure
make
Then you shouldn't need to bother about the lsh directory again. You have a symlink to the shared aclocal.m4 (and to some other shared files).
> Anyway, I attach an untested patch that shows how SSE2 or other CPU-specific optimizations can be enabled at run-time.
Thanks. I'll have to think some more on how to organize this. Some properties I'd like to have:
1. Don't require users to call any init function.
One could define memxor to jump via a function pointer, whose initial value points to a routine that sets the pointer to the right function and then calls it (see the sketch after this list). Overwriting the pointer should be atomic, so no locking is needed even for multithreaded programs.
Or for library formats that support that, hook in the initialization in the same way as C++ constructors for global data.
2. Have configure options like
--enable-x86-sse2/--disable-x86-sse2
which omits that wrapper function and its function pointer.
3. Avoid using gcc-specific things, including inline asm, in the C source files.
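To make point 1 concrete, a minimal sketch of the self-initializing function pointer (memxor_c, memxor_sse2 and have_sse2 are hypothetical helpers, not existing nettle symbols):

#include <stddef.h>
#include <stdint.h>

uint8_t *memxor_c(uint8_t *dst, const uint8_t *src, size_t n);    /* portable */
uint8_t *memxor_sse2(uint8_t *dst, const uint8_t *src, size_t n); /* optimized */
int have_sse2(void);                                              /* cpu check */

static uint8_t *memxor_dispatch(uint8_t *dst, const uint8_t *src, size_t n);

/* Starts out pointing at the dispatcher; the first call overwrites it
   with the selected implementation. The store is a single aligned
   pointer write, so racing first calls from several threads are
   harmless: they all store the same value. */
static uint8_t *(*memxor_fn)(uint8_t *, const uint8_t *, size_t)
  = memxor_dispatch;

static uint8_t *
memxor_dispatch(uint8_t *dst, const uint8_t *src, size_t n)
{
  memxor_fn = have_sse2() ? memxor_sse2 : memxor_c;
  return memxor_fn(dst, src, n);
}

uint8_t *
memxor(uint8_t *dst, const uint8_t *src, size_t n)
{
  return memxor_fn(dst, src, n);
}

With the configure option from point 2, the wrapper and the pointer would be compiled out, and memxor would simply be the chosen implementation directly.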
Other obvious uses for cpu detection in nettle:
* The AES code could check for the special aes instructions.
* The serpent code can use %xmm and %ymm registers, when present. On x86_64, as far as I'm aware all current implementations have sse2, but one could check for, and make use of, the 256-bit %ymm registers.
Regards, /Niels