Michael Weiser <michael.weiser@gmx.de> writes:
It comes out at around seven cycles per block slowdown for chacha-3core and five for salsa20-2core. I trace this to vst1.8. It's just slower than vstm (in contrast to vldm vs. vld1.32). I managed to save a cumulative two cycles by rescheduling instructions so that there are no two consecutive vst1.8s, which seems to avoid pipeline stalls or bus access waits (at least on my machine). Element width (8 vs. 32 vs. 64) doesn't seem to play into it.
Thanks for investigating. Maybe keep some IF_BE / IF_LE just for the store instructions, to stay with vstm on little-endian?
(BTW: Am I using the benchmark correctly, particularly the frequency parameter?)
I think it's right. But it's a floating point number, so -f 1e9 for 1 GHz should work too.
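To make the role of the frequency concrete, here is a minimal sketch in C of the conversion from measured time to cycles per block. It assumes the frequency is used only to scale elapsed time into cycle counts; the function name and parameters are hypothetical and not taken from the benchmark source.

```c
#include <assert.h>

/* Hypothetical helper: convert a measured wall-clock time into
   cycles per block. freq_hz is the assumed CPU clock frequency
   in Hz, i.e. the value one would pass with -f (1e9 for 1 GHz). */
static double
cycles_per_block(double freq_hz, double seconds, unsigned long n_blocks)
{
  assert(n_blocks > 0);
  /* Total cycles spent, divided by the number of blocks processed. */
  return freq_hz * seconds / (double) n_blocks;
}
```

With this model, a wrong frequency scales every cycle count by the same factor, so differences between two implementations keep their ratio but not their absolute size.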
Alignment can't be used to improve performance: The tests immediately fail with a bus error when vst1.8 is given a :64 alignment hint.
Unfortunately, I'm not aware of any nice and portable way to enforce alignment from the calling C code.
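For illustration only, the most a caller or wrapper can do portably in practice is test whether a given pointer happens to be suitably aligned. The sketch below does not enforce anything and is not what Nettle does; the helper name is hypothetical, and the pointer-to-integer conversion is implementation-defined, although it behaves as expected on common platforms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return nonzero if p is aligned to `align`
   bytes. align must be a nonzero power of two; a :64 hint
   corresponds to align == 8. */
static int
is_aligned(const void *p, size_t align)
{
  assert(align != 0 && (align & (align - 1)) == 0);
  /* The low bits of the address must all be zero. */
  return ((uintptr_t) p & (align - 1)) == 0;
}
```

A check like this could only select between an aligned and an unaligned store path at run time, and the extra branch would have to cost less than whatever the hint saves.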
Do you (or anybody else) have a hardware arm board for testing, possibly with a Cortex A8 or A9 implementation to see how it behaves there?
I have access to the GMP test systems on https://gmplib.org/devel/testsystems, but little time to benchmark things in the near future.
I've got one side-track which might point to some peculiarity of my machine: The unmodified assembler code *without* chacha-3core and salsa20-2core (files moved out of the way before configure) is no faster or even slower than what the C compiler produces:
[...]
Does this seem reasonable or does it point to some flaw in my benchmarking or system software/hardware?
That's unexpected. In principle I guess it's possible for the C compiler to generate great vectorized code, but that seems a bit unlikely. Do you get the same results if you build Nettle-3.6?
From ChangeLog comments, it seems I got a 45% speedup for Salsa20 compared to the C implementation when I wrote the original NEON assembly code. At the time, I benchmarked on a Pandaboard (Cortex-A9), if I remember correctly.
Is it for a fat build? If so, it's possible that the fat setup logic selects the C implementation in this hacked setup (but on the other hand, I'd guess a fat build would just fail at link time if these files are removed).
Regards, /Niels