On Fri, Jan 22, 2021 at 1:45 AM Michael Weiser <michael.weiser@gmx.de> wrote:
Longer story: ldr does a 128-bit load. On LE and BE it loads the bytes into the register in exactly opposite order. As you describe above, the macros for LE expect a state which is neither of those: the bytes transposed as if on BE, but the doublewords as loaded on LE. For BE this poses the opposite problem: it natively loads the bytes in the order LE has to reproduce using rev64, but then needs to reproduce the doubleword order of LE for the LE routines to work, or else have native BE routines.
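The byte-order argument above can be checked with a small model. This is only a sketch in Python, not nettle code: a register is modelled as a list of 16 lanes b[0]..b[15], and the helper names (ldr_q, rev64, ext8) are made up for illustration. It assumes that ldr of a Q register treats the 16 bytes as one 128-bit integer in the current endianness, with b[0] as the least significant byte.

```python
def ldr_q(mem, big_endian):
    """Model `ldr Qn,[addr]`: one 128-bit load, lane b[0] = least significant byte."""
    mem = list(mem)
    return mem[::-1] if big_endian else mem


def rev64(v):
    """Model `rev64 Vd.16b,Vn.16b`: reverse the bytes inside each doubleword."""
    return v[7::-1] + v[15:7:-1]


def ext8(v):
    """Model `ext Vd.16b,Vn.16b,Vn.16b,#8`: swap the two doublewords."""
    return v[8:] + v[:8]


# Each memory byte is labelled with its own address offset.
mem = list(range(16))

le_state = rev64(ldr_q(mem, big_endian=False))  # LE path: ldr + rev64
be_state = ext8(ldr_q(mem, big_endian=True))    # BE path: ldr + ext #8

# Both paths end in the layout the LE macros expect.
assert le_state == be_state
```

In this model a full 16-byte reversal equals rev64 followed by a doubleword swap, which is why a single ext on BE would be enough to reach the LE layout.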
That's what my last pedestrian change did. After today I'd perhaps write it like this (untested):
@@ -125,10 +135,12 @@ IF_BE(`
 PROLOGUE(_nettle_gcm_init_key)
     ldr            HQ,[TABLE,#16*H_Idx]
-    dup            EMSB.16b,H.b[0]
 IF_LE(`
     rev64          H.16b,H.16b
+',`
+    ext            H.16b,H.16b,H.16b,#8
 ')
+    dup            EMSB.16b,H.b[7]
     mov            x1,#0xC200000000000000
     mov            x2,#1
     mov            POLY.d[0],x1
When trying to cater to the current layout on LE, all the other vectors which are later used in conjunction with H have to be reversed as well. That leads to this diff to your initial patch:
@@ -125,14 +135,21 @@ IF_BE(`
 PROLOGUE(_nettle_gcm_init_key)
     ldr            HQ,[TABLE,#16*H_Idx]
-    dup            EMSB.16b,H.b[0]
 IF_LE(`
+    dup            EMSB.16b,H.b[0]
     rev64          H.16b,H.16b
+',`
+    dup            EMSB.16b,H.b[15]
 ')
     mov            x1,#0xC200000000000000
     mov            x2,#1
+IF_LE(`
     mov            POLY.d[0],x1
     mov            POLY.d[1],x2
+',`
+    mov            POLY.d[1],x1
+    mov            POLY.d[0],x2
+')
     sshr           EMSB.16b,EMSB.16b,#7
     and            EMSB.16b,EMSB.16b,POLY.16b
     ushr           B.2d,H.2d,#63
@@ -142,7 +159,11 @@ IF_LE(`
     orr            H.16b,H.16b,B.16b
     eor            H.16b,H.16b,EMSB.16b
+IF_LE(`
     dup            POLY.2d,POLY.d[0]
+',`
+    dup            POLY.2d,POLY.d[1]
+')
 C --- calculate H^2 = H*H ---
The difference in index in dup EMSB nicely shows the doubleword transposition compared to LE. If on LE the dup was done after the rev64, it'd be H.b[7] vs. H.b[15].
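The lane indices mentioned here can be reproduced with a small model. This is a sketch in Python rather than nettle code: ldr_q and rev64 are hypothetical helpers, and a register is a list of 16 lanes with b[0] as the least significant byte. Each memory byte is labelled with its address offset, so searching for 0 finds the lane that holds the first byte of H in memory.

```python
def ldr_q(mem, big_endian):
    """Model `ldr Qn,[addr]`: one 128-bit load, lane b[0] = least significant byte."""
    mem = list(mem)
    return mem[::-1] if big_endian else mem


def rev64(v):
    """Model `rev64 Vd.16b,Vn.16b`: reverse the bytes inside each doubleword."""
    return v[7::-1] + v[15:7:-1]


mem = list(range(16))  # byte value == address offset
le_loaded = ldr_q(mem, big_endian=False)
be_loaded = ldr_q(mem, big_endian=True)

# Lane holding the first memory byte, i.e. the index to pass to dup.
lane_le_before_rev64 = le_loaded.index(0)
lane_le_after_rev64 = rev64(le_loaded).index(0)
lane_be = be_loaded.index(0)

assert lane_le_before_rev64 == 0   # dup EMSB.16b,H.b[0] on LE before rev64
assert lane_le_after_rev64 == 7    # H.b[7] if the dup came after rev64
assert lane_be == 15               # dup EMSB.16b,H.b[15] on BE
```

The gap of 8 between the two post-transposition indices (7 versus 15) is the doubleword swap in this model.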
I see what you did here, but I'm confused about the ld1 and st1 instructions, so let me clarify one thing before going on: how do ld1 and st1 load from and store to memory in BE mode? If they behave the normal way, there is no point in using ldr at all; I only used it because it accepts an immediate offset. To replace the line "ldr HQ,[TABLE,#16*H_Idx]" we can add the offset to a register holding the address, "add x1,TABLE,#16*H_Idx", and then load the H value with ld1, "ld1 {H.16b},[x1]". This way we still have to deal with transposed doublewords on LE, while BE is handled the normal way (neither transposed doublewords nor a transposed quadword).
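To make the question concrete, here is a small Python model of the two load instructions. It is a sketch under an assumption, not a statement about the patch: it assumes that ld1 with the .16b arrangement loads byte elements one by one, placing the byte at address+i into lane i regardless of endianness, which is how I understand the architecture to define element loads. The helper names ldr_q and ld1_16b are made up for illustration.

```python
def ldr_q(mem, big_endian):
    """Model `ldr Qn,[addr]`: one 128-bit load, lane b[0] = least significant byte."""
    mem = list(mem)
    return mem[::-1] if big_endian else mem


def ld1_16b(mem, big_endian):
    """Model `ld1 {Vt.16b},[addr]` (assumed): lane i <- byte at address+i.

    The big_endian flag is accepted only to show that it is unused here.
    """
    return list(mem)


mem = list(range(16))  # byte value == address offset

# Under this assumption ld1 gives the same lanes in both modes ...
assert ld1_16b(mem, big_endian=True) == ld1_16b(mem, big_endian=False)
# ... which match what ldr produces on LE ...
assert ld1_16b(mem, big_endian=True) == ldr_q(mem, big_endian=False)
# ... while ldr on BE gives the fully reversed register.
assert ld1_16b(mem, big_endian=True) != ldr_q(mem, big_endian=True)
```

If the assumption holds, loading H with ld1 {H.16b} would leave the register in the same state on LE and BE, so whatever transposition follows the load would not need an endianness-specific variant.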
regards, Mamone