Hello Mamone,
On Sat, Jan 23, 2021 at 08:52:30PM +0200, Maamoun TK wrote:
@@ -280,9 +266,9 @@
L1x:
	tst	LENGTH,#-16
	b.eq	Lmod
-	ld1	{H1M.16b,H1L.16b},[TABLE]
+	ld1	{H1M.2d,H1L.2d},[TABLE]
-	ld1	{C0.16b},[DATA],#16
+	ld1	{C0.2d},[DATA],#16
IF_LE(`
	rev64	C0.16b,C0.16b
')
First off: All three patches from my previous mail had the gcm-hash test passing on LE and BE. I just reconfirmed the last patch with the whole testsuite on LE and BE. So they should work and cause no regression.
I have one question here: do operations on doublewords transpose both doubleword parts in BE mode? For example, the pmull instruction transposes doublewords in LE mode when it operates; in BE I don't expect the same behavior, hence we can't get this patch working in BE mode. The core of the pmull instruction is shift and xor operations, so we can't perform pmull on byte-reversed doublewords, as that's going to produce wrong results.
I think this directly corresponds to your next question:
Dealing with vector registers in aarch64 is really challenging. Both x86_64 and PowerPC don't drag endianness issues into vector registers; endianness only applies to memory, and once the data is loaded from memory into a vector register, all endianness concerns end. Although PowerPC supports both endianness modes, AltiVec instructions operate the same on vector registers in both modes. It's a weird decision on the Arm side.
I think there might be a misunderstanding here (possibly caused by my attempts at explaining what ldr does, sorry):
On arm(32) and aarch64, endianness is also exclusively handled on load and store operations. Register layout and operation behaviour is identical in both modes. I think ARM also speaks of "memory endianness" for just that reason. There is no adjustable "CPU endianness". It's always "CPU-native".
So pmull will behave exactly the same in BE and LE mode. We just have to make sure our load operations put the operands in the correct (i.e. CPU-native) representation into the correct vector register indices upon load.
So as an example:
pmull2 v0.1q,v1.2d,v2.2d
will always work on d[1] of v1 and v2 and put the result into all of v0. And it expects its operands there in exactly one format, i.e. the least significant bit at one end and the most significant bit at the other (and it's the same ends/bits in both memory-endianness modes :). And it will also store to v0 in exactly the same representation in LE and BE mode. Nothing changes with an endianness mode switch.
That's where load and store come in:
ld1 {v1.2d,v2.2d},[x0]
will load v1 and v2 with one-dimensional vectors from memory. So v1.d[0] will be read from x0+0, v1.d[1] from x0+8 (bytes) and v2.d[0] from x0+16 and v2.d[1] from x0+24. That'll also be the same in LE and BE mode because that's the structure of the vector prescribed by the load operation we choose. Endianness will be applied to the individual doublewords but the order in which they're loaded from memory and in which they're put into d[0] and d[1] won't change, because they're vectors.
So if you've actually stored a vector from CPU registers using st1 {v1.2d, v2.2d},[x0] and then load them back using ld1 {v1.2d, v2.2d},[x0] there's nothing else that needs to be done. The individual bytes of the doublewords will be stored LE in memory in LE mode and BE in BE mode but you won't notice. And the order of the doublewords in memory will be the same in both modes.
If you're loading something that isn't stored LE or has no endianness at all, e.g. just a sequence of data bytes (as in DATA in our code) or something that was explicitly stored BE even on an LE CPU (as in TABLE[128] in our code, I gather) but want to treat it as a larger datatype, then you have to define endianness and need to apply correction. That's why we need to rev64 in one mode (e.g. LE) to get the same register-content in both endianness modes if what's loaded isn't actually stored in that endianness in memory.
Another way is to explicitly load a vector of bytes using ld1 {v1.16b, v2.16b},[x0]. Then you can be sure what you get as register content, no matter what memory endianness the CPU is using. If it's really just a sequence of data bytes stored in their correct and necessary order in memory and we only want to apply shifts and logical operations to each of them, we'd be all set.
Here we can also exploit, but need to be careful to understand, the different views of the register: the fact that b[0] through b[7] map to d[0], and that b[0] will be the least significant byte in d[0] while b[7] will be the MSB. This layout is CPU-native, i.e. also the same in both endianness modes. It's just that an ld1 {v1.16b} will always load consecutive bytes from memory into b[0] through b[15], so the first eight bytes land in b[0] through b[7], making it an LSB-first load when interpreted as a larger data type. If we then look at that data through d[0] it will appear reversed if it isn't really a doubleword that was stored little-endian.
That's why an ld1 {v1.16b,v2.16b},[x0] will produce incorrect results with a pmull2 v0.1q,v1.2d,v2.2d in at least one endianness, because we're telling one operation that it's dealing with a byte vector while the other expects us to provide a vector of doublewords. If what we're loading was actually stored as doublewords in current memory endianness, then ld1 {v1.2d,v2.2d} is the correct load operation. If it's data bytes we want to *treat* as big-endian doublewords, we can use either ld1 {v1.16b,v2.16b} or ld1 {v1.2d,v2.2d}, but in both cases need to rev64 the register content if memory endianness is LE.
Now what *ldr* does is load a single 128bit quadword. And this will indeed transpose the doublewords in BE mode when looked at through d[0] and d[1]. Because as a big-endian load it will of course load the byte at x0 into the most significant byte of e.g. v2, i.e. v2.d[1], i.e. v2.b[15] and not v2.d[0], i.e. v2.b[7] (as with ld1.2d) or v2.b[0] (as with ld1.16b). Similarly, x0+15 will go into v2.b[0] in BE and v2.b[15] in LE mode. So this will only make sense if what we're loading was actually stored using str as a 128bit quadword in current memory endianness. If it's a sequence of bytes (st1.16b) we want to treat as a vector of doublewords, not only will the bytes appear inverted when looked at through d[0] and d[1] but also what's in d[0] will be in d[1] in the other endianness mode and vice-versa. If it's a vector of doublewords in memory endianness (st1.2d), byte order in the register will be correct in both modes (because it's different in memory) but d[0] and d[1] will still be transposed.
That's where all my rambling about doubleword transposition came from. Does that make sense?
I just found this document from the LLVM guys with pictures! :) https://llvm.org/docs/BigEndianNEON.html
BTW: ARM even goes as far as always storing *instructions* themselves, i.e. the actual opcodes the CPU decodes and executes, little-endian, even in BE binaries. So the instruction fetch and decode stage always operates little-endian. When the instruction is executed, it's then just an additional flag that tells load and store instructions how to behave when accessing memory. (I'm actually extrapolating from what I know to be true for classic arm32, but it makes sense for that to be true for aarch64 as well.)
Please excuse my laboured and longwinded thinking. ;) I really have to start thinking in vectors also.
Actually, I'm impressed how you get and handle all these ideas in your mind and turn around quickly once you get a new one.
Uh, thanks, FWIW. :)
I gather that you (same as me) prefer to think in big-endian representation. Since little-endian is the default for arm and aarch64, do you think the routine could be changed to move the special endianness treatment using rev64 to BE mode, i.e. avoid it in the standard LE case? It's certainly beyond me, but it might give some additional speedup.
Or would it be irrelevant compared to the speedup already given by using pmull in the first place?
@@ -335,9 +321,7 @@
Lmod_8_done:
	REDUCTION	D
Ldone:
-IF_LE(`
-	rev64	D.16b,D.16b
-')
	st1	{D.16b},[X]
	ret
EPILOGUE(_nettle_gcm_hash)
I like your ideas so far, as you're shrinking the gap between the code for both endiannesses, but if my previous concern is right, we still can't get this patch to work either.
As said, the testsuite is passing with all three diffs from my previous mail.
[...]
PASS: symbols
PASS: dlopen
====================
All 110 tests passed
====================
make[1]: Leaving directory '/home/michael/build-aarch64_be/testsuite'
Making check in examples
make[1]: Entering directory '/home/michael/build-aarch64_be/examples'
TEST_SHLIB_DIR="/home/michael/build-aarch64_be/.lib" \
  srcdir="../../nettle/examples" EMULATOR="" EXEEXT="" \
  "../../nettle"/run-tests rsa-sign-test rsa-verify-test rsa-encrypt-test
xxxxxx
xxxxxx
PASS: rsa-sign
PASS: rsa-verify
PASS: rsa-encrypt
==================
All 3 tests passed
==================
make[1]: Leaving directory '/home/michael/build-aarch64_be/examples'
[michael@aarch64-be:~/build-aarch64_be]
[...]
PASS: symbols
PASS: dlopen
====================
All 110 tests passed
====================
make[1]: Leaving directory '/home/michael/build-aarch64/testsuite'
Making check in examples
make[1]: Entering directory '/home/michael/build-aarch64/examples'
TEST_SHLIB_DIR="/home/michael/build-aarch64/.lib" \
  srcdir="../../nettle/examples" EMULATOR="" EXEEXT="" \
  "../../nettle"/run-tests rsa-sign-test rsa-verify-test rsa-encrypt-test
xxxxxx
xxxxxx
ee
PASS: rsa-sign
PASS: rsa-verify
PASS: rsa-encrypt
==================
All 3 tests passed
==================
make[1]: Leaving directory '/home/michael/build-aarch64/examples'
[michael@aarch64:~/build-aarch64]
And as always after all this guesswork I have found a likely very relevant comment in gcm.c:
/* Shift uses big-endian representation. */
#if WORDS_BIGENDIAN
  reduce = shift_table[x->u64[1] & 0xff];
Is that it? Or is TABLE just internal to the routine and we can store there however we please? (Apart from H at TABLE[128] initialised for us by gcm_set_key and stored BE?)
The assembly implementation of GHASH has a whole different scheme from the C table-lookup implementation; you don't have to worry about any of that.
Perfect. So whether we use ld1/st1.16b or ld1/st1.2d for TABLE doesn't matter. I wouldn't expect it but we could benchmark whether one is faster than the other though!?
For clarification: How is H, i.e. TABLE[128], defined as an interface to gcm_set_key? I see that gcm_set_key calls a cipher function to fill it. So I guess it provides the routine with a sequence of bytes (similar to DATA), i.e. the key, which will be the same on LE and BE, and we *treat* it as a big-endian doubleword for the sake of using pmull on it. Correct?