Hello Michael,
On Sun, Jan 24, 2021 at 3:15 PM Michael Weiser michael.weiser@gmx.de wrote:
I think there might be a misunderstanding here (possibly caused by my attempts at explaining what ldr does, sorry):
On arm(32) and aarch64, endianness is also exclusively handled on load and store operations. Register layout and operation behaviour are identical in both modes. I think ARM also speaks of "memory endianness" for just that reason. There is no adjustable "CPU endianness"; it's always CPU-native.
So pmull will behave exactly the same in BE and LE mode. We just have to make sure our load operations put the operands in the correct (i.e. CPU-native) representation into the correct vector register indices upon load.
So as an example:
pmull2 v0.1q,v1.2d,v2.2d
will always work on d[1] (the upper doubleword) of v1 and v2 and put the 128-bit result into all of v0. And it expects its operands there in exactly one format, i.e. the least significant bit at one end and the most significant bit at the other (and it's the same ends/bits in both memory-endianness modes :). And it will also store to v0 in exactly the same representation in LE and BE mode. Nothing changes with an endianness mode switch.
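(If it helps to see this outside of assembly: the same operation is available in C as the vmull_high_p64() intrinsic from arm_neon.h. A minimal sketch, assuming an aarch64 compiler with the crypto extension enabled, e.g. -march=armv8-a+crypto; the point is just that it's a pure register-to-register operation with no endianness dependence:)

  /* Sketch: pmull2 as a C intrinsic.  The multiply itself is
     endianness-agnostic; only how the operands are loaded matters. */
  #include <arm_neon.h>

  poly128_t
  mul_high (poly64x2_t a, poly64x2_t b)
  {
    /* Carry-less multiply of lane 1 of a by lane 1 of b, producing a
       full 128-bit result; identical in LE and BE mode. */
    return vmull_high_p64 (a, b);
  }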
That's where load and store come in:
ld1 {v1.2d,v2.2d},[x0]
will load v1 and v2 with one-dimensional vectors from memory. So v1.d[0] will be read from x0+0, v1.d[1] from x0+8 (bytes) and v2.d[0] from x0+16 and v2.d[1] from x0+24. That'll also be the same in LE and BE mode because that's the structure of the vector prescribed by the load operation we choose. Endianness will be applied to the individual doublewords but the order in which they're loaded from memory and in which they're put into d[0] and d[1] won't change, because they're vectors.
So if you've actually stored a vector from CPU registers using st1 {v1.2d,v2.2d},[x0] and then load it back using ld1 {v1.2d,v2.2d},[x0], there's nothing else that needs to be done. The individual bytes of the doublewords will be stored LE in memory in LE mode and BE in BE mode, but you won't notice. And the order of the doublewords in memory will be the same in both modes.
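(A little C sketch of the memory view that ld1 {v1.2d,v2.2d} implies; load_2d() is a made-up helper, not anything from our code. The element order is fixed by the instruction, and memcpy leaves each doubleword in the host's native byte order, which is exactly what the hardware does:)

  /* Sketch: what ld1 {v.2d},[x0] guarantees about element order. */
  #include <stdint.h>
  #include <string.h>

  struct vec2d { uint64_t d[2]; };

  static struct vec2d
  load_2d (const unsigned char *p)
  {
    struct vec2d v;
    memcpy (&v.d[0], p,     8);  /* d[0] always comes from p+0 */
    memcpy (&v.d[1], p + 8, 8);  /* d[1] always comes from p+8 */
    /* Byte order within each doubleword is the host's native one;
       the d[0]/d[1] order never changes with the endianness mode. */
    return v;
  }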
If you're loading something that isn't stored LE or has no endianness at all, e.g. just a sequence of data bytes (as in DATA in our code) or something that was explicitly stored BE even on an LE CPU (as in TABLE[128] in our code, I gather), but want to treat it as a larger datatype, then you have to define an endianness and apply a correction. That's why we need to rev64 in one mode (e.g. LE) to get the same register content in both endianness modes if what's loaded isn't actually stored in that endianness in memory.
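(In C terms the rev64 fixup is just a byte swap within each doubleword, leaving the element order alone. A sketch using the GCC/clang builtin, applied to the register view from the load_2d() sketch above:)

  /* Sketch: the effect of rev64 v1.16b,v1.16b on the register content. */
  #include <stdint.h>

  static void
  rev64_fixup (uint64_t d[2])
  {
    d[0] = __builtin_bswap64 (d[0]);  /* reverse bytes within d[0] */
    d[1] = __builtin_bswap64 (d[1]);  /* reverse bytes within d[1] */
  }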
Another way is to explicitly load a vector of bytes using ld1 {v1.16b, v2.16b},[x0]. Then you can be sure what you get as register content, no matter what memory endianness the CPU is using. If it's really just a sequence of data bytes stored in their correct and necessary order in memory and we only want to apply shifts and logical operations to each of them, we'd be all set.
Here we can also exploit the different views on the register, but we need to be careful to understand them: b[0] through b[7] map to d[0], and b[0] will be the least significant byte in d[0] while b[7] will be the MSB. This layout is CPU-native, i.e. also the same in both endianness modes. It's just that an ld1 {v1.16b} will always load consecutive bytes from memory into b[0] through b[15], so the first eight bytes land in b[0] through b[7], making it an LSB-first load when interpreted as a larger data type. If we then look at that data through d[0] it will appear reversed if it isn't really a doubleword that was stored little-endian.
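(To make the LSB-first point concrete, here is a sketch of the value you see through d[0] after an ld1 {v1.16b} load; view_d0() is a made-up name:)

  /* Sketch: ld1 {v.16b} puts the byte at p+i into b[i], and b[i] sits
     at bits 8*i..8*i+7 of d[0] in both endianness modes, so viewed as
     a doubleword the load is always LSB-first. */
  #include <stdint.h>

  static uint64_t
  view_d0 (const unsigned char *p)
  {
    uint64_t d0 = 0;
    for (int i = 0; i < 8; i++)
      d0 |= (uint64_t) p[i] << (8 * i);
    return d0;
  }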
That's why an ld1 {v1.16b,v2.16b},[x0] will produce incorrect results with a pmull2 v0.1q,v1.2d,v2.2d in at least one endianness, because we're telling one operation that it's dealing with a byte vector while the other expects us to provide a vector of doublewords. If what we're loading is actually something that was stored as doublewords in current memory endianness, then ld1 {v1.2d,v2.2d} is the correct load operation. If it's data bytes we want to *treat* as a big-endian doubleword, we can use either ld1 {v1.16b,v2.16b} or ld1 {v1.2d,v2.2d}, but in both cases we need to rev64 the register content if memory endianness is LE.
Now what *ldr* does is load a single 128-bit quadword. And this will indeed transpose the doublewords in BE mode when looked at through d[0] and d[1]: as a big-endian load it will of course load the byte at x0 into the most significant byte of e.g. v2, i.e. v2.b[15] in v2.d[1], and not v2.b[7] in v2.d[0] (as with ld1.2d) or v2.b[0] (as with ld1.16b). Similarly, x0+15 will go into v2.b[0] in BE and v2.b[15] in LE mode. So this will only make sense if what we're loading was actually stored using str as a 128-bit quadword in current memory endianness. If it's a sequence of bytes (st1.16b) we want to treat as a vector of doublewords, not only will the bytes appear inverted when looked at through d[0] and d[1], but what's in d[0] in one endianness mode will be in d[1] in the other and vice versa. If it's a vector of doublewords in memory endianness (st1.2d), byte order in the register will be correct in both modes (because it's different in memory) but d[0] and d[1] will still be transposed.
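(Here is a C sketch modelling where ldr q2,[x0] puts the bytes in both modes; ldr_q() and big_endian_mode are made up for illustration, and the lane values are computed as plain numbers so the sketch itself runs the same on any host:)

  /* Sketch: the d[0]/d[1] values after ldr q2,[x0] in LE and BE mode. */
  #include <stdint.h>

  struct vec2d { uint64_t d[2]; };

  static struct vec2d
  ldr_q (const unsigned char *p, int big_endian_mode)
  {
    struct vec2d v = { { 0, 0 } };
    if (!big_endian_mode)
      {
        /* LE mode: p+0..p+7 go into b[0]..b[7], i.e. d[0] LSB-first;
           p+8..p+15 into d[1].  Same result as ld1 {v.2d}. */
        for (int i = 0; i < 8; i++)
          {
            v.d[0] |= (uint64_t) p[i]     << (8 * i);
            v.d[1] |= (uint64_t) p[i + 8] << (8 * i);
          }
      }
    else
      {
        /* BE mode: p+0 becomes b[15], the MSB of d[1], and p+15
           becomes b[0], the LSB of d[0]; the doublewords come out
           transposed relative to ld1 {v.2d}. */
        for (int i = 0; i < 8; i++)
          {
            v.d[1] = (v.d[1] << 8) | p[i];
            v.d[0] = (v.d[0] << 8) | p[i + 8];
          }
      }
    return v;
  }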
That's where all my rambling about doubleword transposition came from. Does that make sense?
I just found this document from the LLVM guys with pictures! :) https://llvm.org/docs/BigEndianNEON.html
BTW: ARM even goes as far as always storing *instructions* themselves, i.e. the actual opcodes the CPU decodes and executes, little-endian, even in BE binaries. So the instruction fetch and decode stage always operates little-endian. When the instruction is executed, it's then just an additional flag that tells load and store instructions how to behave when accessing memory. (I'm actually extrapolating from what I know to be true for classic arm32, but it makes sense for that to be true for aarch64 as well.)
That explains everything. It also explains why the ld1 instruction reverses the byte order according to the load's element type on BE and always maintains the same order on LE. The non-memory-related instructions maintain the same behavior no matter what endianness mode they run in. Thanks for the detailed explanation. This scheme has a couple of advantages:
- Taking advantage of the performance benefit of the LE data layout on both the memory and register side.
- Eliminating the overhead of transposing data order for every potential load/store operation on LE, since it's the more popular mode.
I gather you (same as me) prefer to think in big-endian representation. Since little-endian is the default on arm and aarch64, do you think the routine could be changed to move the special endianness treatment using rev64 to BE mode, i.e. avoid it in the standard LE case? It's certainly beyond me, but it might give some additional speedup.
Or would it be irrelevant compared to the speedup already given by using pmull in the first place?
I don't know how it would affect performance, but it's an irrelevant margin indeed. TBH I liked the patch with the special endianness treatment, but it's up to you to decide!
And as always after all this guesswork I have found a likely very relevant comment in gcm.c:
  /* Shift uses big-endian representation. */
  #if WORDS_BIGENDIAN
    reduce = shift_table[x->u64[1] & 0xff];
Is that it? Or is TABLE just internal to the routine and we can store there however we please? (Apart from H at TABLE[128] initialised for us by gcm_set_key and stored BE?)
The assembly implementation of GHASH has a whole different scheme from the C table-lookup implementation; you don't have to worry about any of that.
Perfect. So whether we use ld1/st1.16b or ld1/st1.2d for TABLE doesn't matter. I wouldn't expect a difference, but we could benchmark whether one is faster than the other!?
Yeah, it doesn't matter, since gcm_init_key() and gcm_hash() are the only functions that use the table. Keeping it ld1/st1.16b is fine; either way, there is a table layout at the header of the file that gives a sense of the table structure used by the assembly implementation.
For clarification: How is H, i.e. TABLE[128], defined as an interface to gcm_set_key? I see that gcm_set_key calls a cipher function to fill it. So I guess it provides the routine with a sequence of bytes (similar to DATA), i.e. the key, which will be the same on LE and BE, and we *treat* it as a big-endian doubleword for the sake of using pmull on it. Correct?
The subkey 'H' is calculated by enciphering (usually using AES) an all-zero block. gcm_set_key() then assigns the calculated value (subkey 'H') to the middle of the TABLE array, that is TABLE[0x80] (i.e. TABLE[128]). The remaining fields of the array are meant to be filled by the C gcm_init_key() routine to serve as assistance subkeys for the C table-lookup implementation. Since the assembly implementation uses a different scheme, we don't need those assistance subkeys, so we grab the main subkey (H) value from the middle of the table and hook our own needed assistance values onto this table to be used by gcm_hash(). Hope it makes sense to you; let me know if you want further explanation.
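(For illustration, a rough C sketch of that flow; cipher_encrypt(), set_key_sketch() and TABLE_SIZE are placeholders rather than the actual Nettle interfaces, and the 8-bit table size is an assumption:)

  /* Sketch: how the subkey H ends up in the middle of the table. */
  #include <stdint.h>

  #define GCM_BLOCK_SIZE 16
  #define TABLE_SIZE 256  /* assumed; matches an 8-bit lookup table */

  /* Placeholder for the block cipher, e.g. AES encryption. */
  extern void cipher_encrypt (const void *ctx, uint8_t *dst,
                              const uint8_t *src);

  static void
  set_key_sketch (uint8_t table[TABLE_SIZE][GCM_BLOCK_SIZE],
                  const void *cipher_ctx)
  {
    static const uint8_t zero_block[GCM_BLOCK_SIZE] = { 0 };

    /* H = E_K(0^128), placed at the middle of the table; the assembly
       gcm_init_key() then derives its own helper values from it
       instead of the C lookup tables. */
    cipher_encrypt (cipher_ctx, table[0x80], zero_block);
  }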
regards, Mamone