Hello Niels,
On Fri, Dec 25, 2020 at 10:48:19PM +0100, Niels Möller wrote:
> Since we have plenty of registers available, (including r3 which seems unused and free to clobber), I'd suggest using
>   define(`SRCp32', `r3')
> and an
>   add SRCp32, SRC, #32
> in function entry, and then leave both SRC and SRCp32 unmodified for the rest of the function.
I've done that and according to nettle-benchmark it saves one to two cycles per block compared to the mov+postincrement approach.
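For illustration, the resulting pattern looks roughly like this in nettle's m4 assembly style (register assignments and load widths are my own illustrative choices, not taken verbatim from the patch):

```
define(`SRC', `r1')	C source pointer argument (illustrative)
define(`SRCp32', `r3')	C r3 is free to clobber per the AAPCS

	C On function entry, set up the second pointer once:
	add	SRCp32, SRC, #32

	C Both pointers then stay unmodified for the whole function:
	vld1.32	{q0,q1}, [SRC]		C words 0..7 of the 64-byte block
	vld1.32	{q2,q3}, [SRCp32]	C words 8..15
	C ...
	vld1.32	{q12,q13}, [SRC]	C later reload of the same data,
	vld1.32	{q14,q15}, [SRCp32]	C no post-increment bookkeeping
```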
As expected, all the special treatment of transposed operands can simply go away because the transposition no longer happens. Also, vld1.32 (for sequential loads of 32-bit operands in host endianness) and vst1.8 (for sequential stores of register contents, giving an implicit little-endian store without any vrev32.u8) work the same on LE as well as BE.
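Spelled out for the store side (DST is an illustrative name; this mirrors what the patch relies on rather than being a verbatim excerpt):

```
	C q0 holds 32-bit words produced by host-endian arithmetic.
	C A BE build would otherwise need an explicit byte swap:
	C   vrev32.u8  q0, q0
	C   vst1.32    {q0}, [DST]
	C vst1.8 writes the 16 bytes in little-endian element order
	C on both LE and BE, so one instruction serves both:
	vst1.8	{q0}, [DST]
```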
> Neat. Use of vst1.8 is worth a comment in the code (and/or arm/README).
I added those where it seemed to make sense. It was already in the README but I've extended it a bit with the new findings.
>> Option 2: By coincidence I found that vldm/vstm can work with s registers originally intended for use with VFP. They're just a different
> That sounds a bit complicated, and since there's no great benefit over vld1, maybe best to stay away from that?
Also, interestingly, when I use vldm to s regs wherever possible (see second attached patch), it doesn't give any speedup. It saves the scratch register in all routines I've touched, though. In general, it seems that add+2*vld1.32 is exactly the same number of cycles as the equivalent vldm.
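As a sketch of the equivalence (register numbers illustrative; s0-s7 alias d0-d3, i.e. q0-q1):

```
	C Variant 1: scratch pointer plus two element loads:
	add	r3, SRC, #16
	vld1.32	{q0}, [SRC]
	vld1.32	{q1}, [r3]

	C Variant 2: a single vldm to the aliased VFP s registers
	C loads the same eight 32-bit words host-endian, without the
	C scratch register, at the same observed cycle count:
	vldm	SRC, {s0-s7}
```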
>> Switch arm neon assembler routines to endianness-agnostic loads and stores where possible to avoid modifications to the rest of the code. This involves switching to vld1.32 for loading consecutive 32-bit words in host endianness as well as vst1.8 for storing back to memory in little-endian order as required by the caller.
> I like this approach. It would be nice if you could benchmark it on little-endian, to verify that there's no unexpectedly large speed regression (a regression of just a cycle or two per block, if that's at all measurable, is ok, I think).
It comes out at around seven cycles per block slowdown for chacha-3core and five for salsa20-2core. I trace this to vst1.8: it's simply slower than vstm (in contrast to vldm vs. vld1.32). I managed to save a cumulative two cycles by rescheduling instructions so that no two vst1.8s are consecutive, which seems to avoid pipeline stalls or bus-access waits (at least on my machine). Element width (8 vs. 32 vs. 64) doesn't seem to factor into it. Alignment hints can't be used to improve performance: the testsuite immediately bus-errors when a :64 alignment hint is given to vst1.8.
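The rescheduling amounts to separating the stores with independent work, roughly like this (the surrounding instructions and registers are illustrative, not taken from the actual routines):

```
	C Slower: back-to-back byte stores appear to stall:
	vst1.8	{q0}, [DST]!
	vst1.8	{q1}, [DST]!

	C Faster: an unrelated instruction between the stores:
	vst1.8	{q0}, [DST]!
	vadd.u32	q2, q2, q8	C independent computation
	vst1.8	{q1}, [DST]!
```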
Baseline with --disable-assembler comes in with these numbers on my Cubieboard2 with a 1 GHz Allwinner A20, which is a Cortex-A7 implementation:
[michael@c2-le:~/nettle/build-noasm/examples] LD_LIBRARY_PATH=../.lib ./nettle-benchmark -f 1000000000 chacha salsa20
       Algorithm     mode Mbyte/s cycles/byte cycles/block
          chacha  encrypt   30.43       31.34      2005.82
          chacha  decrypt   30.41       31.36      2006.89
 chacha_poly1305  encrypt   23.57       40.47      2589.77
 chacha_poly1305  decrypt   23.55       40.50      2592.15
 chacha_poly1305   update  104.42        9.13       584.51
         salsa20  encrypt   35.10       27.17      1738.73
         salsa20  decrypt   35.10       27.17      1738.75
      salsa20r12  encrypt   50.12       19.03      1217.75
      salsa20r12  decrypt   50.15       19.01      1216.93
(BTW: Am I using the benchmark correctly, particularly the frequency parameter?)
Baseline unmodified assembler routines (without --enable-fat) come in at:
[michael@c2-le:~/nettle/build-orig/examples] LD_LIBRARY_PATH=../.lib ./nettle-benchmark -f 1000000000 chacha salsa20
       Algorithm     mode Mbyte/s cycles/byte cycles/block
          chacha  encrypt   63.06       15.12       967.83
          chacha  decrypt   63.06       15.12       967.82
 chacha_poly1305  encrypt   39.18       24.34      1557.72
 chacha_poly1305  decrypt   39.18       24.34      1557.96
 chacha_poly1305   update  104.38        9.14       584.75
         salsa20  encrypt   62.15       15.34       982.04
         salsa20  decrypt   62.07       15.36       983.33
      salsa20r12  encrypt   92.69       10.29       658.48
      salsa20r12  decrypt   92.70       10.29       658.43
Attached unified code (patch 0001) comes in like this:
[michael@c2-le:~/nettle/build-unified-add/examples] LD_LIBRARY_PATH=../.lib ./nettle-benchmark -f 1000000000 chacha salsa20
       Algorithm     mode Mbyte/s cycles/byte cycles/block
          chacha  encrypt   62.61       15.23       974.79
          chacha  decrypt   62.62       15.23       974.72
 chacha_poly1305  encrypt   39.14       24.36      1559.28
 chacha_poly1305  decrypt   39.18       24.34      1558.00
 chacha_poly1305   update  103.65        9.20       588.88
         salsa20  encrypt   61.80       15.43       987.65
         salsa20  decrypt   61.81       15.43       987.51
      salsa20r12  encrypt   91.88       10.38       664.30
      salsa20r12  decrypt   91.91       10.38       664.07
What's nice is that the same code gives very consistent numbers on BE (no idea what's going on with poly1305 though):
[michael@c2-be:~/nettle/build-unified-add/examples] LD_LIBRARY_PATH=../.lib ./nettle-benchmark -f 1000000000 chacha salsa20
       Algorithm     mode Mbyte/s cycles/byte cycles/block
          chacha  encrypt   62.56       15.25       975.69
          chacha  decrypt   62.62       15.23       974.68
 chacha_poly1305  encrypt   38.40       24.83      1589.32
 chacha_poly1305  decrypt   38.40       24.83      1589.38
 chacha_poly1305   update   99.92        9.54       610.86
         salsa20  encrypt   61.80       15.43       987.58
         salsa20  decrypt   61.81       15.43       987.41
      salsa20r12  encrypt   91.90       10.38       664.14
      salsa20r12  decrypt   91.93       10.37       663.92
As mentioned, the second patch (switching back to vldm via s registers where possible) doesn't change these numbers at all (but saves a register).
(What's nice about my boards is that, due to missing power-saving and frequency-scaling functionality, they give very, very consistent numbers across multiple runs.)
My first reflex is that 400 Kbyte/s for chacha and 350 Kbyte/s for salsa20 is relevant enough to either keep separate implementations for LE and BE in the code *or* dig deeper into why vst1.8 is so much slower.
Do you (or anybody else) have a hardware arm board for testing, possibly with a Cortex A8 or A9 implementation to see how it behaves there?
I have a couple of RasPis and little- and big-endian pine64s (aarch64) gathering dust in a box which I could fire up for some testing (not sure about 32-bit support on the pine64s, though).
> and reuse SRCp32 for the second load of the same data, further down (assuming r3 really is free to use for this purpose; if we have to save
I read the AAPCS as saying that r3 can be used as a scratch register between subroutine calls. Since we don't make any subroutine calls, its use should be fine.
I've got one side-track which might point to some peculiarity of my machine: the unmodified assembler code *without* chacha-3core and salsa20-2core (files moved out of the way before configure) is no faster than, and sometimes even slower than, what the C compiler produces:
[michael@c2-le:~/nettle/build-no23core/examples] LD_LIBRARY_PATH=../.lib ./nettle-benchmark -f 1000000000 chacha salsa20
       Algorithm     mode Mbyte/s cycles/byte cycles/block
          chacha  encrypt   31.35       30.42      1946.66
          chacha  decrypt   31.34       30.43      1947.30
 chacha_poly1305  encrypt   24.10       39.57      2532.24
 chacha_poly1305  decrypt   24.10       39.57      2532.21
 chacha_poly1305   update  104.42        9.13       584.53
         salsa20  encrypt   30.38       31.39      2008.96
         salsa20  decrypt   30.39       31.38      2008.34
      salsa20r12  encrypt   47.00       20.29      1298.56
      salsa20r12  decrypt   47.01       20.29      1298.25
Does this seem reasonable, or does it point to some flaw in my benchmarking or system software/hardware? (I've done my best using gdb to verify that the asm routines are in use. Unfortunately, nettle-benchmark resists attempts to ltrace or gdb-debug it, so I diagnosed the testsuite tests instead.)