Re: [AArch64] Optimize GHASH

22 Jan 2021


Hello Mamone,

On Wed, Jan 20, 2021 at 10:25:19PM +0200, Maamoun TK wrote:
...
> I'm trying to install Gentoo on VMware by walking through this recipe:
> https://medium.com/@steensply/vmware-installation-of-gentoo-linux-from-scrat...
> I'm in the middle of the recipe now, but there are a lot of instructions
> there, so I'm gonna get the OS working in the end.
As far as I can tell that recipe only encompasses basic installation.
You'd additionally need to run crossdev to create a cross-toolchain and
then install qemu as well. Gentoo has a very steep learning curve. There's
no benefit compared to buildroot for our use-case here, IMO.
...
> Here is how I get the vector instructions to operate on registers in LE
> mode. I'll take this instruction as an example: pmull  v0.1q,v1.1d,v2.1d
> Input represented as indexes:
> v1: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> v2: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> The instruction byte-reverses each 64-bit part of the register, so the
> instruction considers the registers as follows:
> v1: 7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8
> v2: 7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8
> So what I did in LE mode is reverse the 64-bit parts with the rev64
> instruction before executing the doubleword operation. Accordingly, the
> pmull output will be 128-bit byte-reversed.
> Output represented as indexes:
> v0: 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
...
> What I'm assuming in BE mode is that operations are performed the normal
> way on the register side, so there is no need to reverse the inputs and
> the output is normal as well. Hence the macros "REDUCTION" and
> "PMUL_PARAM" differ in structure. It's not a matter of the zip
> instruction performing better but of how to handle the weird situation
> in LE mode.
I've tried for a number of hours to make this work today. Always when I
added correct handling of the transposed doublewords to one macro,
another broke down. To me the problem comes down to this: ldr
HQ,[TABLE...] and st1.16b are fighting each other and can't be brought
together without a lot of additional instructions (at least not by me).
Longer story: ldr does a 128-bit load. The bytes end up in the register
in exactly reverse order on LE compared to BE. As you describe above,
the macros for LE expect a state which is neither of those: the bytes
transposed as if on BE but the doublewords as loaded on LE. For BE this
poses the opposite problem: it natively loads bytes in the order LE has
to reproduce using rev64, but then it either needs to reproduce the
doubleword order of LE for the LE routines to work or basically needs
native BE routines.
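To make the three states concrete, here is a minimal Python sketch. It
is not part of the patch and only models lane placement: a register is
a list of 16 byte lanes b[0]..b[15], and each lane holds the memory
offset the byte came from. The helper names are made up for this
illustration.

```python
# Model only: lane i of the list stands for b[i] of a vector register.
MEM = list(range(16))  # memory offsets 0..15 of one 16-byte block

def ldr_q(mem, big_endian):
    """128-bit ldr: offset 0 lands in b[0] on LE and in b[15] on BE."""
    return list(reversed(mem)) if big_endian else list(mem)

def rev64(reg):
    """rev64 Vd.16b,Vn.16b: reverse the bytes inside each doubleword."""
    return reg[7::-1] + reg[15:7:-1]

def ext8(reg):
    """ext Vd.16b,Vn.16b,Vn.16b,#8: swap the two doublewords."""
    return reg[8:] + reg[:8]

le = ldr_q(MEM, big_endian=False)
be = ldr_q(MEM, big_endian=True)

le_macro_state = rev64(le)  # what the LE macros expect to work on
be_swapped = ext8(be)       # BE load with the doublewords swapped

print("LE ldr        :", le)
print("BE ldr        :", be)
print("LE ldr + rev64:", le_macro_state)
print("BE ldr + ext  :", be_swapped)
```

In this model the last two lines come out identical, which is the
layout the LE macros are written for.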
That's what my last pedestrian change did. After today I'd perhaps write
it like this (untested):
@@ -125,10 +135,12 @@ IF_BE(`
PROLOGUE(_nettle_gcm_init_key)
     ldr            HQ,[TABLE,#16*H_Idx]
-    dup            EMSB.16b,H.b[0]
 IF_LE(`
     rev64          H.16b,H.16b
+',`
+    ext            H.16b,H.16b,H.16b,#8
 ')
+    dup            EMSB.16b,H.b[7]
     mov            x1,#0xC200000000000000
     mov            x2,#1
     mov            POLY.d[0],x1
When trying to cater to the current layout on LE, all the other vectors
which are later used in conjunction with H have to be reversed as well.
That leads to this diff to your initial patch:
@@ -125,14 +135,21 @@ IF_BE(`
PROLOGUE(_nettle_gcm_init_key)
     ldr            HQ,[TABLE,#16*H_Idx]
-    dup            EMSB.16b,H.b[0]
 IF_LE(`
+    dup            EMSB.16b,H.b[0]
     rev64          H.16b,H.16b
+',`
+    dup            EMSB.16b,H.b[15]
 ')
     mov            x1,#0xC200000000000000
     mov            x2,#1
+IF_LE(`
     mov            POLY.d[0],x1
     mov            POLY.d[1],x2
+',`
+    mov            POLY.d[1],x1
+    mov            POLY.d[0],x2
+')
     sshr           EMSB.16b,EMSB.16b,#7
     and            EMSB.16b,EMSB.16b,POLY.16b
     ushr           B.2d,H.2d,#63
@@ -142,7 +159,11 @@ IF_LE(`
     orr            H.16b,H.16b,B.16b
     eor            H.16b,H.16b,EMSB.16b
+IF_LE(`
     dup            POLY.2d,POLY.d[0]
+',`
+    dup            POLY.2d,POLY.d[1]
+')
C --- calculate H^2 = H*H ---
The difference in index in dup EMSB nicely shows the doubleword
transposition compared to LE. If on LE the dup was done after the rev64,
it'd be H.b[7] vs. H.b[15].
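The lane indices can be double-checked with a small Python model (a
sketch, not part of the patch). It assumes the byte that dup should
replicate is the one at memory offset 0 of H and simply looks up which
lane it ends up in for each state.

```python
# Model only: lane i of each list stands for H.b[i], the value is the
# memory offset the byte was loaded from.
MEM = list(range(16))

states = {
    "LE after ldr": list(MEM),
    "LE after ldr + rev64": MEM[7::-1] + MEM[15:7:-1],
    "BE after ldr": MEM[::-1],
    "BE after ldr + ext #8": MEM[::-1][8:] + MEM[::-1][:8],
}

# Lane index holding memory offset 0 in each state.
lane_of_first_byte = {name: reg.index(0) for name, reg in states.items()}

for name, lane in lane_of_first_byte.items():
    print(f"{name}: H.b[{lane}]")
```

With these assumptions the lookup gives b[0] and b[15] right after the
load on LE and BE, and b[7] on both once rev64 or ext has been applied.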
With this layout PMUL_PARAM can work on H and POLY but then needs to use
pmull instead of pmull2 because the relevant data is in the other
doublewords compared to LE. On the other hand, since the output of
PMUL_PARAM is to be stored using st1.16b it must not have the
doublewords transposed ("load-inverted" I termed it in the comments ;).
That leads to the following adjustment and makes the first 16 bytes of
TABLE identical to LE:
@@ -109,11 +118,12 @@ define(`H4L', `v30')
.macro PMUL_PARAM in, param1, param2
 IF_BE(`
-    pmull2         Hp.1q,\in().2d,POLY.2d
+    pmull          Hp.1q,\in().1d,POLY.1d
     ext            Hm.16b,\in().16b,\in().16b,#8
     eor            Hm.16b,Hm.16b,Hp.16b
-    zip            \param1().2d,\in().2d,Hm.2d
-    zip2           \param2().2d,\in().2d,Hm.2d
+    C output must be in native register order (not load-inverted) for st1.16b to work
+    zip2           \param1().2d,\in().2d,Hm.2d
+    zip1           \param2().2d,\in().2d,Hm.2d
 ',`
     pmull2         Hp.1q,\in().2d,POLY.2d
     eor            Hm.16b,\in().16b,Hp.16b
In PMUL is where it breaks down, at least for my brain: Its first call
is handed H (which has doublewords "transposed" from the initial ldr) and
H1M and H1L (which must not have doublewords transposed so st1.16b
writes them to memory in correct order). It wants to pmull/pmull2 them
which requires corresponding doublewords at the same index. So we'd
need to temporarily transpose \in for that:
@@ -46,25 +46,34 @@ define(`R1', `v19')
C common macros:
 .macro PMUL in, param1, param2
-    pmull          F.1q,\param2().1d,\in().1d
-    pmull2         F1.1q,\param2().2d,\in().2d
-    pmull          R.1q,\param1().1d,\in().1d
-    pmull2         R1.1q,\param1().2d,\in().2d
+    C PMUL_PARAM left us with \param1 and \param2 in native register order but
+    C \in is load-inverted from initial load of H using ldr, something must give
+IF_BE(`
+    ext            T.16b,\in().16b,\in().16b,#8
+',`
+    mov            T.16b,\in().16b
+')
+    pmull          F.1q,\param2().1d,T.1d
+    pmull2         F1.1q,\param2().2d,T.2d
+    pmull          R.1q,\param1().1d,T.1d
+    pmull2         R1.1q,\param1().2d,T.2d
     eor            F.16b,F.16b,F1.16b
     eor            R.16b,R.16b,R1.16b
 .endm
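Why the temporary transposition is needed can be shown with a small
Python model (a sketch under simplifying assumptions, not the real
macro). A register is modelled as a tuple (d[0], d[1]) of 64-bit
values, clmul64 is a plain bitwise carry-less multiply, and the operand
values are arbitrary.

```python
def clmul64(a, b):
    """Carry-less (polynomial over GF(2)) multiply of two 64-bit values."""
    result = 0
    for i in range(64):
        if (b >> i) & 1:
            result ^= a << i
    return result

def pmull(n, m):
    """pmull Vd.1q,Vn.1d,Vm.1d: multiplies the low doublewords, d[0]."""
    return clmul64(n[0], m[0])

def pmull2(n, m):
    """pmull2 Vd.1q,Vn.2d,Vm.2d: multiplies the high doublewords, d[1]."""
    return clmul64(n[1], m[1])

def ext8(reg):
    """ext Vd.16b,Vn.16b,Vn.16b,#8: swap the two doublewords."""
    return (reg[1], reg[0])

param = (0x3, 0x7)  # native register order, arbitrary values
h_in = (0x5, 0x9)   # doublewords transposed relative to param

# Without the swap, pmull pairs param.d[0] with the wrong half of h_in.
wrong = pmull(param, h_in)

# With the temporary swap, corresponding halves meet at the same index.
t = ext8(h_in)
low = pmull(param, t)
high = pmull2(param, t)
```

Both instructions only ever pair lanes with the same index, so one of
the operands has to give.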
If we finally artificially restore the doubleword transposition in
REDUCTION for H2 and H3 we're all set for the next calls:
.macro REDUCTION out
 IF_BE(`
-    pmull          T.1q,F.1d,POLY.1d
     ext            \out().16b,F.16b,F.16b,#8
-    eor            R.16b,R.16b,T.16b
-    eor            \out().16b,\out().16b,R.16b
+    pmull2         T.1q,\out().2d,POLY.2d
 ',`
     pmull          T.1q,F.1d,POLY.1d
+')
     eor            R.16b,R.16b,T.16b
     ext            R.16b,R.16b,R.16b,#8
     eor            \out().16b,F.16b,R.16b
+C artificially restore load inversion for PMUL_PARAM :-(
+IF_BE(`
+    ext            \out().16b,\out().16b,\out().16b,#8
 ')
 .endm
So all we're doing is catering to the quirk of the very first ldr
operation. The easiest solution seems to me to align all types of load
and store operations with each other or counteract their quirks right
after or before executing them. That way we end up with identical
register contents on LE and BE and don't have to maintain separate
implementations.
That'd be in line with what we ended up with on arm32 NEON as well.
memxor3.asm does do the dance of working with different register content
but there it's only bitwise operations and the load and store operations
have identical behaviour.
The advantage of the current implementation with transposed doublewords
and only the LE routines seems to me that overhead on LE and BE would
be about the same.
Do you think it makes sense to try and adjust the code to work with the
BE layout natively and have a full 128-bit reverse after ldr-like loads
on LE instead (considering that 99.999% of aarch64 users will run LE)?
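As a rough illustration of that alternative, the following Python
sketch models a full 128-bit reverse on LE as rev64 followed by ext #8.
That instruction pair is an assumption for the sake of the example and
not something taken from the patch. Lanes hold the memory offset each
byte was loaded from.

```python
MEM = list(range(16))  # memory offsets 0..15 of one 16-byte block

def ldr_q(mem, big_endian):
    """128-bit ldr: offset 0 lands in b[0] on LE and in b[15] on BE."""
    return list(reversed(mem)) if big_endian else list(mem)

def rev64(reg):
    """Reverse the bytes inside each 64-bit doubleword."""
    return reg[7::-1] + reg[15:7:-1]

def ext8(reg):
    """Swap the two doublewords."""
    return reg[8:] + reg[:8]

def rev128(reg):
    """Full 128-bit byte reverse, modelled as rev64 followed by ext #8."""
    return ext8(rev64(reg))

le_fixed = rev128(ldr_q(MEM, big_endian=False))
be_native = ldr_q(MEM, big_endian=True)

print("LE ldr + full reverse:", le_fixed)
print("BE ldr               :", be_native)
```

In this model both registers hold the same lanes, so a single set of
routines could work on them. The price would be the extra instructions
after every ldr-like load on LE.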
...
...
>> Otherwise, what's your error message from podman? It's got no daemon, so
>> it shouldn't need a socket to connect to it like docker does. Out to the
>> Internet for image download it's also a standard client and respects
>> environment variables for proxies as usual.
> I got "Error: error creating network namespace for container". I think I
> can fix it by tracing the problem, but let's try the other methods first
> as I think it's gonna be simpler for me.
I found this error on the Net in conjunction with a Debian/Ubuntu
security-related custom kernel knob for disabling unprivileged user
namespaces that was enabled by default once. I tested that with Ubuntu
18.04, 20.04 and 20.10 yesterday and it's disabled (i.e. namespaces for
unprivileged users enabled) on all of them. You can still have a look at
your setting in /proc/sys/kernel/unprivileged_userns_clone or with
sysctl kernel.unprivileged_userns_clone. It needs to be set to 1 for
rootless podman to work.
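If you'd rather script the check, a minimal Python sketch could look
like this. The function name is made up, and treating a missing file as
"absent" rests on the assumption that kernels without this
Debian/Ubuntu knob simply don't expose it.

```python
from pathlib import Path

# Path of the Debian/Ubuntu-specific knob mentioned above.
KNOB = Path("/proc/sys/kernel/unprivileged_userns_clone")

def userns_clone_state(path=KNOB):
    """Return 'enabled', 'disabled' or 'absent' for the knob at path."""
    try:
        value = Path(path).read_text().strip()
    except FileNotFoundError:
        # Assumption: no such file means the kernel lacks the knob.
        return "absent"
    return "enabled" if value == "1" else "disabled"

if __name__ == "__main__":
    print("unprivileged_userns_clone:", userns_clone_state())
```

Rootless podman needs "enabled" here, i.e. a value of 1.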
You're not by any chance running the Windows Subsystem for Linux (WSL)?
https://github.com/containers/podman/issues/3288#issuecomment-501356136 :)
Or inside another container at a hosting service?
https://github.com/containers/podman/issues/4056
Otherwise I have no idea what could be causing that and have never seen
that error.
-- 
Thanks,
Michael

    
