Hello Mamone,
On Sat, Jan 23, 2021 at 08:52:30PM +0200, Maamoun TK wrote:
@@ -280,9 +266,9 @@
L1x:
	tst	LENGTH,#-16
	b.eq	Lmod
-	ld1	{H1M.16b,H1L.16b},[TABLE]
+	ld1	{H1M.2d,H1L.2d},[TABLE]
-	ld1	{C0.16b},[DATA],#16
+	ld1	{C0.2d},[DATA],#16
IF_LE(`
	rev64	C0.16b,C0.16b
')
First off: All three patches from my previous mail had the gcm-hash test passing on LE and BE. I just reconfirmed the last patch with the whole testsuite on LE and BE. So they should work and cause no regression.
I have one question here: do operations on doublewords transpose both doubleword parts in BE mode? For example, the pmull instruction transposes doublewords in LE mode when it operates; in BE I don't expect the same behavior, hence we can't get this patch working in BE mode. The core of the pmull instruction is shift and xor operations, so we can't perform pmull on byte-reversed doublewords, as that's going to produce wrong results.
I think this directly corresponds to your next question:
Dealing with vector registers in aarch64 is really challenging. Both x86_64 and PowerPC don't drag endianness issues into vector registers; endianness only applies to memory, and once the data is loaded from memory into a vector register, all endianness concerns end. Although PowerPC supports both endianness modes, AltiVec instructions operate the same on vector registers in both modes. It's a weird decision on the Arm side.
I think there might be a misunderstanding here (possibly caused by my attempts at explaining what ldr does, sorry):
On arm(32) and aarch64, endianness is also exclusively handled on load and store operations. Register layout and operation behaviour is identical in both modes. I think ARM also speaks of "memory endianness" for just that reason. There is no adjustable "CPU endianness". It's always "CPU-native".
So pmull will behave exactly the same in BE and LE mode. We just have to make sure our load operations put the operands in the correct (i.e. CPU-native) representation into the correct vector register indices upon load.
So as an example:
pmull2 v0.1q,v1.2d,v2.2d
will always work on d[1] of v1 and v2 and put the result into all of v0. And it expects its operands there in exactly one format, i.e. the least significant bit at one end and the most significant bit at the other (and it's the same ends/bits in both memory-endianness modes :). And it will also store to v0 in exactly the same representation in LE and BE mode. Nothing changes with an endianness mode switch.
That's where load and store come in:
ld1 {v1.2d,v2.2d},[x0]
will load v1 and v2 with one-dimensional vectors from memory. So v1.d[0] will be read from x0+0, v1.d[1] from x0+8 (bytes) and v2.d[0] from x0+16 and v2.d[1] from x0+24. That'll also be the same in LE and BE mode because that's the structure of the vector prescribed by the load operation we choose. Endianness will be applied to the individual doublewords but the order in which they're loaded from memory and in which they're put into d[0] and d[1] won't change, because they're vectors.
So if you've actually stored a vector from CPU registers using st1 {v1.2d, v2.2d},[x0] and then load them back using ld1 {v1.2d, v2.2d},[x0] there's nothing else that needs to be done. The individual bytes of the doublewords will be stored LE in memory in LE mode and BE in BE mode but you won't notice. And the order of the doublewords in memory will be the same in both modes.
If you're loading something that isn't stored LE or has no endianness at all, e.g. just a sequence of data bytes (as in DATA in our code) or something that was explicitly stored BE even on an LE CPU (as in TABLE[128] in our code, I gather) but want to treat it as a larger datatype, then you have to define endianness and need to apply correction. That's why we need to rev64 in one mode (e.g. LE) to get the same register-content in both endianness modes if what's loaded isn't actually stored in that endianness in memory.
Another way is to explicitly load a vector of bytes using ld1 {v1.16b, v2.16b},[x0]. Then you can be sure what you get as register content, no matter what memory endianness the CPU is using. If it's really just a sequence of data bytes stored in their correct and necessary order in memory and we only want to apply shifts and logical operations to each of them, we'd be all set.
Here we can also exploit, but need to be careful to understand, the different views of the register: the fact that b[0] through b[7] map to d[0], and that b[0] will be the least significant byte in d[0] while b[7] will be the MSB. This layout is CPU-native, i.e. also the same in both endianness modes. It's just that an ld1 {v1.16b} will always load consecutive bytes from memory into b[0] through b[15], so the first eight bytes land in b[0] through b[7], making it an LSB-first load when interpreted as a larger data type. If we then look at that data through d[0] it will appear reversed if it isn't really a doubleword that was stored little-endian.
That's why an ld1 {v1.16b,v2.16b},[x0] will produce incorrect results with a pmull2 v0.1q,v1.2d,v2.2d in at least one endianness, because we're telling one operation that it's dealing with a byte vector while the other expects us to provide a vector of doublewords. If what we're loading was actually stored as doublewords in current memory endianness, then ld1 {v1.2d,v2.2d} is the correct load operation. If it's data bytes we want to *treat* as big-endian doublewords, we can use either ld1 {v1.16b,v2.16b} or ld1 {v1.2d,v2.2d}, but in both cases need to rev64 the register content if memory endianness is LE.
Now what *ldr* does is load a single 128bit quadword. And this will indeed transpose the doublewords in BE mode when looked at through d[0] and d[1]. Because as a big-endian load it will of course load the byte at x0 into the most significant byte of e.g. v2, i.e. v2.d[1], i.e. v2.b[15] and not v2.d[0], i.e. v2.b[7] (as with ld1.2d) or v2.b[0] (as with ld1.16b). Similarly, x0+15 will go into v2.b[0] in BE and v2.b[15] in LE mode. So this will only make sense if what we're loading was actually stored using str as a 128bit quadword in current memory endianness. If it's a sequence of bytes (st1.16b) we want to treat as a vector of doublewords, not only will the bytes appear inverted when looked at through d[0] and d[1] but also what's in d[0] will be in d[1] in the other endianness mode and vice-versa. If it's a vector of doublewords in memory endianness (st1.2d), byte order in the register will be correct in both modes (because it's different in memory) but d[0] and d[1] will still be transposed.
That's where all my rambling about doubleword transposition came from. Does that make sense?
I just found this document from the LLVM guys with pictures! :) https://llvm.org/docs/BigEndianNEON.html
BTW: ARM even goes as far as always storing *instructions* themselves, i.e. the actual opcodes the CPU decodes and executes, little-endian, even in BE binaries. So the instruction fetch and decode stage always operates little-endian. When the instruction is executed, it's then just an additional flag that tells load and store instructions how to behave when accessing memory. (I'm actually extrapolating from what I know to be true for classic arm32, but it makes sense for that to be true for aarch64 as well.)
Please excuse my laboured and longwinded thinking. ;) I really have to start thinking in vectors also.
Actually, I'm impressed how you get and handle all these ideas in your mind and turn around quickly once you get a new one.
Uh, thanks, FWIW. :)
I gather that you (same as me) prefer to think in big-endian representation. Since little-endian is the default for arm and aarch64, do you think the routine could be changed to move the special endianness treatment using rev64 to BE mode, i.e. avoid it in the standard LE case? It's certainly beyond me, but it might give some additional speedup.
Or would it be irrelevant compared to the speedup already given by using pmull in the first place?
@@ -335,9 +321,7 @@
Lmod_8_done:
	REDUCTION	D
Ldone:
-IF_LE(`
-	rev64	D.16b,D.16b
-')
	st1	{D.16b},[X]
	ret
EPILOGUE(_nettle_gcm_hash)
I like your ideas so far, as you're shrinking the gap between the code for both endiannesses, but if my previous concern is right, we still can't get this patch to work either.
As said, the testsuite is passing with all three diffs from my previous mail.
[...]
PASS: symbols
PASS: dlopen
====================
All 110 tests passed
====================
make[1]: Leaving directory '/home/michael/build-aarch64_be/testsuite'
Making check in examples
make[1]: Entering directory '/home/michael/build-aarch64_be/examples'
TEST_SHLIB_DIR="/home/michael/build-aarch64_be/.lib" \
  srcdir="../../nettle/examples" EMULATOR="" EXEEXT="" \
  "../../nettle"/run-tests rsa-sign-test rsa-verify-test rsa-encrypt-test
xxxxxx
xxxxxx
PASS: rsa-sign
PASS: rsa-verify
PASS: rsa-encrypt
==================
All 3 tests passed
==================
make[1]: Leaving directory '/home/michael/build-aarch64_be/examples'
[michael@aarch64-be:~/build-aarch64_be]
[...]
PASS: symbols
PASS: dlopen
====================
All 110 tests passed
====================
make[1]: Leaving directory '/home/michael/build-aarch64/testsuite'
Making check in examples
make[1]: Entering directory '/home/michael/build-aarch64/examples'
TEST_SHLIB_DIR="/home/michael/build-aarch64/.lib" \
  srcdir="../../nettle/examples" EMULATOR="" EXEEXT="" \
  "../../nettle"/run-tests rsa-sign-test rsa-verify-test rsa-encrypt-test
xxxxxx
xxxxxx
ee
PASS: rsa-sign
PASS: rsa-verify
PASS: rsa-encrypt
==================
All 3 tests passed
==================
make[1]: Leaving directory '/home/michael/build-aarch64/examples'
[michael@aarch64:~/build-aarch64]
And as always after all this guesswork I have found a likely very relevant comment in gcm.c:
/* Shift uses big-endian representation. */
#if WORDS_BIGENDIAN
  reduce = shift_table[x->u64[1] & 0xff];
Is that it? Or is TABLE just internal to the routine and we can store there however we please? (Apart from H at TABLE[128] initialised for us by gcm_set_key and stored BE?)
The assembly implementation of GHASH has a whole different scheme from the C table-lookup implementation; you don't have to worry about any of that.
Perfect. So whether we use ld1/st1.16b or ld1/st1.2d for TABLE doesn't matter. I wouldn't expect it but we could benchmark whether one is faster than the other though!?
For clarification: How is H, i.e. TABLE[128], defined as an interface to gcm_set_key? I see that gcm_set_key calls a cipher function to fill it. So I guess it provides the routine with a sequence of bytes (similar to DATA), i.e. the key, which will be the same on LE and BE, and we *treat* it as a big-endian doubleword for the sake of using pmull on it. Correct?