Hello Michael,
On Sun, Jan 24, 2021 at 3:15 PM Michael Weiser michael.weiser@gmx.de wrote:
I think there might be a misunderstanding here (possibly caused by my attempts at explaining what ldr does, sorry):
On arm(32) and aarch64, endianness is also exclusively handled on load and store operations. Register layout and operation behaviour are identical in both modes. I think ARM also speaks of "memory endianness" for just that reason. There is no adjustable "CPU endianness"; it's always CPU-native.
So pmull will behave exactly the same in BE and LE mode. We just have to make sure our load operations put the operands in the correct (i.e. CPU-native) representation into the correct vector register indices upon load.
So as an example:
pmull2 v0.1q,v1.2d,v2.2d
will always work on d[1] (the upper doubleword) of v1 and v2 and put the 128-bit result into all of v0. And it expects its operands there in exactly one format, i.e. the least significant bit at one end and the most significant bit at the other (and it's the same ends/bits in both memory-endianness modes :). And it will also store to v0 in exactly the same representation in LE and BE mode. Nothing changes with an endianness mode switch.
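(If it helps to see this outside of assembly: the same operation is available in C as the vmull_high_p64() intrinsic from arm_neon.h. A minimal sketch, assuming an aarch64 compiler with the crypto extension enabled, e.g. -march=armv8-a+crypto; the point is just that it's a pure register-to-register operation with no endianness dependence:)

  /* Sketch: pmull2 as a C intrinsic.  The multiply itself is
     endianness-agnostic; only how the operands are loaded matters. */
  #include <arm_neon.h>

  poly128_t
  mul_high (poly64x2_t a, poly64x2_t b)
  {
    /* Carry-less multiply of lane 1 of a by lane 1 of b, producing a
       full 128-bit result; identical in LE and BE mode. */
    return vmull_high_p64 (a, b);
  }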
That's where load and store come in:
ld1 {v1.2d,v2.2d},[x0]
will load v1 and v2 with one-dimensional vectors from memory. So v1.d[0] will be read from x0+0, v1.d[1] from x0+8 (bytes) and v2.d[0] from x0+16 and v2.d[1] from x0+24. That'll also be the same in LE and BE mode because that's the structure of the vector prescribed by the load operation we choose. Endianness will be applied to the individual doublewords but the order in which they're loaded from memory and in which they're put into d[0] and d[1] won't change, because they're vectors.
So if you've actually stored a vector from CPU registers using st1 {v1.2d,v2.2d},[x0] and then load it back using ld1 {v1.2d,v2.2d},[x0], there's nothing else that needs to be done. The individual bytes of the doublewords will be stored LE in memory in LE mode and BE in BE mode, but you won't notice. And the order of the doublewords in memory will be the same in both modes.
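(A little C sketch of the memory view that ld1 {v1.2d,v2.2d} implies; load_2d() is a made-up helper, not anything from our code. The element order is fixed by the instruction, and memcpy leaves each doubleword in the host's native byte order, which is exactly what the hardware does:)

  /* Sketch: what ld1 {v.2d},[x0] guarantees about element order. */
  #include <stdint.h>
  #include <string.h>

  struct vec2d { uint64_t d[2]; };

  static struct vec2d
  load_2d (const unsigned char *p)
  {
    struct vec2d v;
    memcpy (&v.d[0], p,     8);  /* d[0] always comes from p+0 */
    memcpy (&v.d[1], p + 8, 8);  /* d[1] always comes from p+8 */
    /* Byte order within each doubleword is the host's native one;
       the d[0]/d[1] order never changes with the endianness mode. */
    return v;
  }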
If you're loading something that isn't stored LE or has no endianness at all, e.g. just a sequence of data bytes (as in DATA in our code) or something that was explicitly stored BE even on an LE CPU (as in TABLE[128] in our code, I gather), but want to treat it as a larger datatype, then you have to define an endianness and apply a correction. That's why we need to rev64 in one mode (e.g. LE) to get the same register content in both endianness modes if what's loaded isn't actually stored in that endianness in memory.
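(In C terms the rev64 fixup is just a byte swap within each doubleword, leaving the element order alone. A sketch using the GCC/clang builtin, applied to the register view from the load_2d() sketch above:)

  /* Sketch: the effect of rev64 v1.16b,v1.16b on the register content. */
  #include <stdint.h>

  static void
  rev64_fixup (uint64_t d[2])
  {
    d[0] = __builtin_bswap64 (d[0]);  /* reverse bytes within d[0] */
    d[1] = __builtin_bswap64 (d[1]);  /* reverse bytes within d[1] */
  }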
Another way is to explicitly load a vector of bytes using ld1 {v1.16b, v2.16b},[x0]. Then you can be sure what you get as register content, no matter what memory endianness the CPU is using. If it's really just a sequence of data bytes stored in their correct and necessary order in memory and we only want to apply shifts and logical operations to each of them, we'd be all set.
Here we can also exploit the different views on the register, but we need to be careful to understand them: b[0] through b[7] map to d[0], and b[0] will be the least significant byte in d[0] while b[7] will be the MSB. This layout is CPU-native, i.e. also the same in both endianness modes. It's just that an ld1 {v1.16b} will always load consecutive bytes from memory into b[0] through b[15], so the first eight bytes land in b[0] through b[7], making it an LSB-first load when interpreted as a larger data type. If we then look at that data through d[0] it will appear reversed if it isn't really a doubleword that was stored little-endian.
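(To make the LSB-first point concrete, here is a sketch of the value you see through d[0] after an ld1 {v1.16b} load; view_d0() is a made-up name:)

  /* Sketch: ld1 {v.16b} puts the byte at p+i into b[i], and b[i] sits
     at bits 8*i..8*i+7 of d[0] in both endianness modes, so viewed as
     a doubleword the load is always LSB-first. */
  #include <stdint.h>

  static uint64_t
  view_d0 (const unsigned char *p)
  {
    uint64_t d0 = 0;
    for (int i = 0; i < 8; i++)
      d0 |= (uint64_t) p[i] << (8 * i);
    return d0;
  }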
That's why an ld1 {v1.16b,v2.16b},[x0] will produce incorrect results with a pmull2 v0.1q,v1.2d,v2.2d in at least one endianness, because we're telling one operation that it's dealing with a byte vector while the other expects us to provide a vector of doublewords. If what we're loading is actually something that was stored as doublewords in current memory endianness, then ld1 {v1.2d,v2.2d} is the correct load operation. If it's data bytes we want to *treat* as a big-endian doubleword, we can use either ld1 {v1.16b,v2.16b} or ld1 {v1.2d,v2.2d}, but in both cases we need to rev64 the register content if memory endianness is LE.
Now what *ldr* does is load a single 128-bit quadword. And this will indeed transpose the doublewords in BE mode when looked at through d[0] and d[1]: as a big-endian load it will of course load the byte at x0 into the most significant byte of e.g. v2, i.e. v2.b[15] in v2.d[1], and not v2.b[7] in v2.d[0] (as with ld1.2d) or v2.b[0] (as with ld1.16b). Similarly, x0+15 will go into v2.b[0] in BE and v2.b[15] in LE mode. So this will only make sense if what we're loading was actually stored using str as a 128-bit quadword in current memory endianness. If it's a sequence of bytes (st1.16b) we want to treat as a vector of doublewords, not only will the bytes appear inverted when looked at through d[0] and d[1], but what's in d[0] in one endianness mode will be in d[1] in the other and vice versa. If it's a vector of doublewords in memory endianness (st1.2d), byte order in the register will be correct in both modes (because it's different in memory) but d[0] and d[1] will still be transposed.
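(Here is a C sketch modelling where ldr q2,[x0] puts the bytes in both modes; ldr_q() and big_endian_mode are made up for illustration, and the lane values are computed as plain numbers so the sketch itself runs the same on any host:)

  /* Sketch: the d[0]/d[1] values after ldr q2,[x0] in LE and BE mode. */
  #include <stdint.h>

  struct vec2d { uint64_t d[2]; };

  static struct vec2d
  ldr_q (const unsigned char *p, int big_endian_mode)
  {
    struct vec2d v = { { 0, 0 } };
    if (!big_endian_mode)
      {
        /* LE mode: p+0..p+7 go into b[0]..b[7], i.e. d[0] LSB-first;
           p+8..p+15 into d[1].  Same result as ld1 {v.2d}. */
        for (int i = 0; i < 8; i++)
          {
            v.d[0] |= (uint64_t) p[i]     << (8 * i);
            v.d[1] |= (uint64_t) p[i + 8] << (8 * i);
          }
      }
    else
      {
        /* BE mode: p+0 becomes b[15], the MSB of d[1], and p+15
           becomes b[0], the LSB of d[0]; the doublewords come out
           transposed relative to ld1 {v.2d}. */
        for (int i = 0; i < 8; i++)
          {
            v.d[1] = (v.d[1] << 8) | p[i];
            v.d[0] = (v.d[0] << 8) | p[i + 8];
          }
      }
    return v;
  }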
That's where all my rambling about doubleword transposition came from. Does that make sense?
I just found this document from the LLVM guys with pictures! :) https://llvm.org/docs/BigEndianNEON.html
BTW: ARM even goes as far as always storing *instructions* themselves, i.e. the actual opcodes the CPU decodes and executes, little-endian, even in BE binaries. So the instruction fetch and decode stage always operates little-endian. When the instruction is executed, it's then just an additional flag that tells load and store instructions how to behave when accessing memory. (I'm actually extrapolating from what I know to be true for classic arm32, but it makes sense for that to be true for aarch64 as well.)
That explains everything. It also explains why the ld1 instruction reverses the byte order according to the load's element type on BE and always maintains the same order on LE. The non-memory-related instructions maintain the same behavior no matter what endianness mode they run in. Thanks for the detailed explanation. This scheme has a couple of advantages:
- Taking advantage of the performance benefit of the LE data layout on both the memory and register side.
- Eliminating the overhead of transposing data order for every potential load/store operation on LE, since it's the more popular mode.
I gather you (same as me) prefer to think in big-endian representation. Since little-endian is the default on arm and aarch64, do you think the routine could be changed to move the special endianness treatment using rev64 to BE mode, i.e. avoid it in the standard LE case? It's certainly beyond me, but it might give some additional speedup.
Or would it be irrelevant compared to the speedup already given by using pmull in the first place?
I don't know how it would affect performance, but it's an irrelevant margin indeed. TBH I liked the patch with the special endianness treatment, but it's up to you to decide!
And as always after all this guesswork I have found a likely very relevant comment in gcm.c:
  /* Shift uses big-endian representation. */
  #if WORDS_BIGENDIAN
    reduce = shift_table[x->u64[1] & 0xff];
Is that it? Or is TABLE just internal to the routine and we can store there however we please? (Apart from H at TABLE[128] initialised for us by gcm_set_key and stored BE?)
The assembly implementation of GHASH has a whole different scheme from the C table-lookup implementation; you don't have to worry about any of that.
Perfect. So whether we use ld1/st1.16b or ld1/st1.2d for TABLE doesn't matter. I wouldn't expect a difference, but we could benchmark whether one is faster than the other!?
Yeah, it doesn't matter, since gcm_init_key() and gcm_hash() are the only functions that use the table. Keeping it ld1/st1.16b is fine; either way, there is a table layout at the header of the file that gives a sense of the table structure used by the assembly implementation.
For clarification: How is H, i.e. TABLE[128], defined as an interface to gcm_set_key? I see that gcm_set_key calls a cipher function to fill it. So I guess it provides the routine with a sequence of bytes (similar to DATA), i.e. the key, which will be the same on LE and BE, and we *treat* it as a big-endian doubleword for the sake of using pmull on it. Correct?
The subkey 'H' is calculated by enciphering (usually using AES) an all-zero block. gcm_set_key() then assigns the calculated value (subkey 'H') to the middle of the TABLE array, that is TABLE[0x80] (i.e. TABLE[128]). The remaining fields of the array are meant to be filled by the C gcm_init_key() routine to serve as assistance subkeys for the C table-lookup implementation. Since the assembly implementation uses a different scheme, we don't need those assistance subkeys, so we grab the main subkey (H) value from the middle of the table and hook our own needed assistance values onto this table to be used by gcm_hash(). Hope it makes sense to you; let me know if you want further explanation.
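(For illustration, a rough C sketch of that flow; cipher_encrypt(), set_key_sketch() and TABLE_SIZE are placeholders rather than the actual Nettle interfaces, and the 8-bit table size is an assumption:)

  /* Sketch: how the subkey H ends up in the middle of the table. */
  #include <stdint.h>

  #define GCM_BLOCK_SIZE 16
  #define TABLE_SIZE 256  /* assumed; matches an 8-bit lookup table */

  /* Placeholder for the block cipher, e.g. AES encryption. */
  extern void cipher_encrypt (const void *ctx, uint8_t *dst,
                              const uint8_t *src);

  static void
  set_key_sketch (uint8_t table[TABLE_SIZE][GCM_BLOCK_SIZE],
                  const void *cipher_ctx)
  {
    static const uint8_t zero_block[GCM_BLOCK_SIZE] = { 0 };

    /* H = E_K(0^128), placed at the middle of the table; the assembly
       gcm_init_key() then derives its own helper values from it
       instead of the C lookup tables. */
    cipher_encrypt (cipher_ctx, table[0x80], zero_block);
  }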
regards, Mamone