Re: [AArch64] Optimize GHASH

22 Jan 2021


On Fri, Jan 22, 2021 at 1:45 AM Michael Weiser michael.weiser@gmx.de wrote:
...
Longer story: ldr does a 128-bit load, which places the bytes in the
register in exactly opposite orders on LE and BE. As you describe above,
the macros for LE expect a state which is neither of those: the bytes
transposed as if on BE, but the doublewords as loaded on LE. For BE this
poses the opposite problem: it natively loads bytes in the order LE has
to reproduce using rev64, but then needs to reproduce the doubleword
order of LE for the LE routines to work, or else have native BE
routines.
That's what my last pedestrian change did. After today I'd perhaps write
it like this (untested):
@@ -125,10 +135,12 @@ IF_BE(`
 PROLOGUE(_nettle_gcm_init_key)
     ldr            HQ,[TABLE,#16*H_Idx]
-    dup            EMSB.16b,H.b[0]
 IF_LE(`
     rev64          H.16b,H.16b
+',`
+    ext            H.16b,H.16b,H.16b,#8
 ')
+    dup            EMSB.16b,H.b[7]
     mov            x1,#0xC200000000000000
     mov            x2,#1
     mov            POLY.d[0],x1
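
The layout argument behind this variant can be checked with a small Python
model. This is only a sketch, not Nettle code: it assumes a Q register can be
treated as 16 byte lanes with lane 0 least significant, and that a 128-bit ldr
on BE puts the first memory byte into the most significant lane. The helper
names are hypothetical.

```python
# Toy model of a 128-bit vector register: v[i] is byte lane b[i],
# with lane 0 the least significant byte. Helper names are hypothetical.

def ldr_q(mem, big_endian):
    """128-bit load: memory is read as one integer in the data endianness."""
    return list(reversed(mem)) if big_endian else list(mem)

def rev64_16b(v):
    """rev64 Vd.16b,Vn.16b: reverse the bytes inside each 64-bit doubleword."""
    return v[7::-1] + v[15:7:-1]

def ext_16b_8(v):
    """ext Vd.16b,Vn.16b,Vn.16b,#8 with the same source twice: swap doublewords."""
    return v[8:] + v[:8]

mem = list(range(16))  # memory byte k holds the value k

le = rev64_16b(ldr_q(mem, big_endian=False))  # LE path: ldr + rev64
be = ext_16b_8(ldr_q(mem, big_endian=True))   # BE path: ldr + ext #8

print("LE after rev64:", le)
print("BE after ext  :", be)
```

In this model both paths end up with the same lane contents, which is why a
single dup from H.b[7] after the IF_LE/else block would serve both cases.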

When trying to cater to the current layout on LE, all the other vectors
which are later used in conjunction with H have to be reversed as well.
That leads to this diff to your initial patch:
@@ -125,14 +135,21 @@ IF_BE(`
 PROLOGUE(_nettle_gcm_init_key)
     ldr            HQ,[TABLE,#16*H_Idx]
-    dup            EMSB.16b,H.b[0]
 IF_LE(`
+    dup            EMSB.16b,H.b[0]
     rev64          H.16b,H.16b
+',`
+    dup            EMSB.16b,H.b[15]
 ')
     mov            x1,#0xC200000000000000
     mov            x2,#1
+IF_LE(`
     mov            POLY.d[0],x1
     mov            POLY.d[1],x2
+',`
+    mov            POLY.d[1],x1
+    mov            POLY.d[0],x2
+')
     sshr           EMSB.16b,EMSB.16b,#7
     and            EMSB.16b,EMSB.16b,POLY.16b
     ushr           B.2d,H.2d,#63
@@ -142,7 +159,11 @@ IF_LE(`
     orr            H.16b,H.16b,B.16b
     eor            H.16b,H.16b,EMSB.16b
+IF_LE(`
     dup            POLY.2d,POLY.d[0]
+',`
+    dup            POLY.2d,POLY.d[1]
+')
  C --- calculate H^2 = H*H ---


The difference in index in dup EMSB nicely shows the doubleword
transposition compared to LE. If on LE the dup was done after the rev64,
it'd be H.b[7] vs. H.b[15].
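
The lane indices mentioned here can be reproduced with a short Python sketch.
It is a model under stated assumptions, not code from the patch: the register
is a list of 16 byte lanes (lane 0 least significant), a 128-bit ldr on BE is
assumed to reverse the byte order relative to LE, and the helper names are
invented for illustration.

```python
# Where does the first byte of H in memory end up in the register?
# v[i] models byte lane b[i]; helper names are hypothetical.

def ldr_q(mem, big_endian):
    """128-bit load: lane order follows the data endianness."""
    return list(reversed(mem)) if big_endian else list(mem)

def rev64_16b(v):
    """rev64 Vd.16b,Vn.16b: reverse the bytes inside each doubleword."""
    return v[7::-1] + v[15:7:-1]

mem = list(range(16))  # memory byte k holds the value k
first = mem[0]         # the byte that the dup into EMSB has to pick up

le_loaded = ldr_q(mem, big_endian=False)
le_reversed = rev64_16b(le_loaded)
be_loaded = ldr_q(mem, big_endian=True)

lanes = {
    "LE, before rev64": le_loaded.index(first),
    "LE, after rev64": le_reversed.index(first),
    "BE, as loaded": be_loaded.index(first),
}
for case, lane in lanes.items():
    print(f"{case}: H.b[{lane}]")
```

The model puts the byte in lane 0 on LE before rev64, lane 7 on LE after rev64,
and lane 15 on BE, matching the indices used in the diff and in the remark
above.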

I see what you did here, but I'm confused about the ld1 and st1
instructions, so let me clarify one thing before going on: how do ld1 and
st1 load from and store to memory in BE mode? If they behave normally
there, then there is no point in using ldr at all; I only used it because
it accepts an immediate offset. To replace the line
"ldr HQ,[TABLE,#16*H_Idx]" we can add the offset to a register that holds
the address, "add x1,TABLE,#16*H_Idx", and then load the H value with
"ld1 {H.16b},[x1]". In this way we still have to deal with LE as
transposed doublewords, and with BE in the normal way (no transposed
doublewords or transposed quadword).
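
For illustration, here is how the two loads compare in a small Python model.
It is a sketch under assumptions rather than an answer from the thread: it
assumes ld1 with the .16b arrangement loads byte elements in memory order
regardless of endianness, while a 128-bit ldr follows the data endianness. The
register is modelled as 16 byte lanes with lane 0 least significant, and the
helper names are hypothetical.

```python
# Compare "ldr HQ,[...]" with "ld1 {H.16b},[x1]" on LE and BE.
# v[i] models byte lane b[i]; helper names are hypothetical.

def ldr_q(mem, big_endian):
    """128-bit scalar-style load: lane order follows the data endianness."""
    return list(reversed(mem)) if big_endian else list(mem)

def ld1_16b(mem, big_endian):
    """ld1 {Vt.16b}: byte elements go to lanes in memory order.
    The big_endian flag is deliberately unused in this model."""
    return list(mem)

mem = list(range(16))  # memory byte k holds the value k

ld1_le = ld1_16b(mem, big_endian=False)
ld1_be = ld1_16b(mem, big_endian=True)
ldr_le = ldr_q(mem, big_endian=False)
ldr_be = ldr_q(mem, big_endian=True)

print("ld1 LE:", ld1_le)
print("ld1 BE:", ld1_be)
print("ldr LE:", ldr_le)
print("ldr BE:", ldr_be)
```

Under these assumptions ld1 yields the same lanes on LE and BE, identical to
ldr on LE, and only ldr on BE differs. Whether that is enough to drop the BE
special cases in the real routines is a separate question that this model does
not settle.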
regards,
Mamone

    
