Re: [AArch64] Optimize GHASH

20 Jan 2021

Hello Michael,
On Tue, Jan 19, 2021 at 11:45 PM Michael Weiser michael.weiser@gmx.de
wrote:
...
Yes, there are no packages for aarch64_be in any mainstream distribution
I'm aware of. Buildroot and Gentoo are the ones I know that can target
it, Yocto likely as well. All are compile-yourself-distributions and not
for the faint of heart. Also, I've just learned that Buildroot has made
a conscious decision not to produce native toolchains for the target. So
you can only ever cross-compile nettle to it, run it on an actual board
or under qemu and then go back to the cross-compiler on the host.
I'm trying to install Gentoo on VMware by following this recipe:
https://medium.com/@steensply/vmware-installation-of-gentoo-linux-from-scrat...
I'm in the middle of the recipe now. There are a lot of instructions in it,
but I expect to get the OS working in the end.
...
...
...
I did a search of the aarch64 instruction set and saw that there's zip1
and zip2 instructions. So as a first test I just changed zip to zip1
which made it compile. As was to be expected, the testsuite failed
though.
You are on the right track so far.
I've poked at the code a bit more and seemingly made the key init
function work by eliminating all the BE specific macros and instead
adjusting the load from memory to produce the same register content. At
least register values and the final output to memory look the same in
an x/64xb $x0-64 and x/64xb $x0 for the first test cases in gcm-test
(which they did not before).
137         PMUL_PARAM v5,v29,v30
(gdb)
139         st1            {v27.16b,v28.16b,v29.16b,v30.16b},[x0]
(gdb)
141         ret
(gdb) x/64xb $x0-64
0xaaaaaaac5390: 0x77    0x58    0x14    0xdf    0xa9    0x97    0xd2    0xcd
[.. all the same on BE and LE ...]
0xaaaaaaac53c8: 0x0d    0x12    0x63    0x69    0x37    0x20    0xd3    0xfe
(gdb) x/64xb $x0
0xaaaaaaac53d0: 0xf9    0xfa    0x22    0xc3    0x02    0xe7    0x95    0x86
[.. all the same on BE and LE ...]
0xaaaaaaac5408: 0x45    0x91    0xbd    0x48    0x73    0xd9    0x8b    0x5c
(gdb)
Here is how I get the vector instructions to operate on registers in LE mode.
I'll take this instruction as an example: pmull  v0.1q,v1.1d,v2.1d
Input represented as byte indexes:
v1: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
v2: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
The instruction byte-reverses each 64-bit half of the register, so it
sees the registers as follows:
v1: 7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8
v2: 7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8
So what I did in LE mode is reverse the 64-bit halves with the rev64
instruction before executing the doubleword operation. As a result, the
pmull output is the 128-bit byte-reversed value.
Output represented as byte indexes:
v0: 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
What I'm assuming for BE mode is that operations are performed on the
registers in the normal way, so there is no need to reverse the inputs to
get normal output. That is why the macros "REDUCTION" and "PMUL_PARAM"
differ in structure between the two modes. It's not a matter of the zip
instructions performing better, but of how to handle the odd situation in
LE mode.
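To make the index picture above concrete, here is a small Python toy model. It is not nettle code: a register is just a list of 16 byte indexes, and the helper names rev64 and ext8 are made up to mirror the AArch64 mnemonics they imitate. It only shows the permutations, namely that rev64 reverses each 64-bit half and that following it with a doubleword swap (ext ...,#8 with both sources equal) gives the full 128-bit byte reversal.

```python
# Toy model of a 128-bit vector register as a list of 16 byte indexes.
# Assumption: index 0 is byte lane 0. Helper names are illustrative only.

def rev64(v):
    """Reverse the bytes inside each 64-bit half (like rev64 Vd.16b,Vn.16b)."""
    return v[7::-1] + v[15:7:-1]

def ext8(n, m):
    """Like ext Vd.16b,Vn.16b,Vm.16b,#8: upper half of n, then lower half of m."""
    return n[8:] + m[:8]

v = list(range(16))
halves_reversed = rev64(v)
fully_reversed = ext8(halves_reversed, halves_reversed)

print("rev64:        ", halves_reversed)
print("rev64 + ext 8:", fully_reversed)
```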
...
The problem here once more seems to be that after a 128bit LE load which
is later used as two 64bit operands, not only the bytes of the operands
are reversed (which you already counter by rev64'ing them, I gather) but
the operands (doublewords) also end up transposed in the register. This
is something the rest of the routine expects but is only true on LE. So
I adjusted for it on BE in a very pedestrian way:

diff --git a/arm64/v8/gcm-hash.asm b/arm64/v8/gcm-hash.asm
index 1c14db54..74cd656a 100644
--- a/arm64/v8/gcm-hash.asm
+++ b/arm64/v8/gcm-hash.asm
@@ -55,17 +55,10 @@ C common macros:
 .endm

 .macro REDUCTION out
-IF_BE(`
-    pmull          T.1q,F.1d,POLY.1d
-    ext            \out().16b,F.16b,F.16b,#8
-    eor            R.16b,R.16b,T.16b
-    eor            \out().16b,\out().16b,R.16b
-',`
     pmull          T.1q,F.1d,POLY.1d
     eor            R.16b,R.16b,T.16b
     ext            R.16b,R.16b,R.16b,#8
     eor            \out().16b,F.16b,R.16b
-')
 .endm

 C void gcm_init_key (union gcm_block *table)
@@ -108,19 +101,11 @@ define(`H4M', `v29')
 define(`H4L', `v30')

 .macro PMUL_PARAM in, param1, param2
-IF_BE(`
-    pmull2         Hp.1q,\in().2d,POLY.2d
-    ext            Hm.16b,\in().16b,\in().16b,#8
-    eor            Hm.16b,Hm.16b,Hp.16b
-    zip            \param1().2d,\in().2d,Hm.2d
-    zip2           \param2().2d,\in().2d,Hm.2d
-',`
     pmull2         Hp.1q,\in().2d,POLY.2d
     eor            Hm.16b,\in().16b,Hp.16b
     ext            \param1().16b,Hm.16b,\in().16b,#8
     ext            \param2().16b,\in().16b,Hm.16b,#8
     ext            \param1().16b,\param1().16b,\param1().16b,#8
-')
 .endm

 PROLOGUE(_nettle_gcm_init_key)
@@ -128,6 +113,10 @@ PROLOGUE(_nettle_gcm_init_key)
     dup            EMSB.16b,H.b[0]
 IF_LE(`
     rev64          H.16b,H.16b
+',`
+    mov            x1,H.d[0]
+    mov            H.d[0],H.d[1]
+    mov            H.d[1],x1
 ')
     mov            x1,#0xC200000000000000
     mov            x2,#1
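As a side note on the REDUCTION hunk, the following Python sketch models the two branches of the macro on 128-bit integers. It is only a model under assumptions: registers are Python ints with d[0] in the low 64 bits, clmul64 stands in for pmull on .1d operands, and the function names and test values are invented for illustration. What it suggests is that, for identical register inputs, the removed IF_BE branch yields the doubleword-swapped result of the branch that is kept.

```python
MASK64 = (1 << 64) - 1

def clmul64(a, b):
    """Carry-less 64x64 -> 128-bit multiply (what pmull does on .1d operands)."""
    r = 0
    for i in range(64):
        if (b >> i) & 1:
            r ^= a << i
    return r

def swap_dwords(x):
    """Model of ext x,x,#8: exchange the two 64-bit halves of a 128-bit value."""
    return ((x & MASK64) << 64) | (x >> 64)

def reduction_removed_be_branch(F, R, POLY):
    T = clmul64(F & MASK64, POLY & MASK64)  # pmull T.1q,F.1d,POLY.1d
    out = swap_dwords(F)                    # ext  out,F,F,#8
    R ^= T                                  # eor  R,R,T
    return out ^ R                          # eor  out,out,R

def reduction_kept_branch(F, R, POLY):
    T = clmul64(F & MASK64, POLY & MASK64)  # pmull T.1q,F.1d,POLY.1d
    R ^= T                                  # eor  R,R,T
    R = swap_dwords(R)                      # ext  R,R,R,#8
    return F ^ R                            # eor  out,F,R

# Arbitrary illustrative inputs, not values taken from nettle.
F = 0x0123456789ABCDEFFEDCBA9876543210
R = 0x00112233445566778899AABBCCDDEEFF
POLY = 0xC200000000000000

a = reduction_removed_be_branch(F, R, POLY)
b = reduction_kept_branch(F, R, POLY)
print(hex(a))
print(hex(swap_dwords(b)))
```

The two printed values agree because exchanging the halves distributes over XOR, so the branches differ only in which doubleword ends up where.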
If my understanding is correct, we could avoid the doubleword swap for
both LE and BE if we were to load using ld1 to {H.16b} instead (with a
precalculation of the offset because ld1 won't take an immediate offset
that high, correct?). But then the rest of the routine would need to
change its expectation of what H.d[0] and H.d[1] contain, respectively,
because they will no longer be transposed by either the load on LE or
an explicit swap on BE.
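The transposition described above can be illustrated with a small Python model. This is a sketch under assumptions, not nettle code: it assumes the 128-bit load behaves like a single ldr q load (the 16 bytes form one integer in the CPU's byte order), that d[0] is the low 64 bits of the register, and the helper names are invented for illustration.

```python
MASK64 = (1 << 64) - 1

def ldr_q(mem, byteorder):
    """Model one 128-bit load; returns (d[0], d[1]) for the given endianness."""
    q = int.from_bytes(mem, byteorder)
    return q & MASK64, q >> 64

def ld1_16b(mem):
    """Model ld1 {v.16b}: byte lane i = mem[i] on either endianness."""
    q = int.from_bytes(mem, "little")
    return q & MASK64, q >> 64

def rev64_dword(d):
    """Byte-reverse one 64-bit value (what rev64 does to each doubleword)."""
    return int.from_bytes(d.to_bytes(8, "little"), "big")

mem = bytes(range(16))

le = ldr_q(mem, "little")
le_rev = tuple(rev64_dword(d) for d in le)   # LE path: load, then rev64
be = ldr_q(mem, "big")                       # BE path: load only
be_swapped = (be[1], be[0])                  # BE path plus explicit swap

for name, regs in [("LE", le), ("LE+rev64", le_rev),
                   ("BE", be), ("BE+swap", be_swapped)]:
    print(f"{name:9s} d[0]={regs[0]:016x} d[1]={regs[1]:016x}")
```

In this model the LE load followed by rev64 and the BE load followed by the doubleword swap end up with the same d[0] and d[1], while the plain BE load has them transposed.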
Somehow I have a feeling, I'm terribly missing the actual point here,
though. Are the zip instructions likely to give even further speedup
beyond the LE version? Could this be exploited for LE as well by
adjusting the loading scheme even more?
If my assumption about how the instructions operate in BE mode is right,
then yes, this is not the actual point.
...
But I have made the cross-compiling and -debugging setup of the
container available on a vserver on the Net. Send me a mail directly
with an SSH ID public key if you'd like to try this out and I'll send
you instructions for login and use. We could meet up there in a
tmux/screen session and work on it together as well.
Let's try the second solution before we get to this.
...
I have also tried to extract the buildroot toolchain from the image and
run it on my Gentoo box as well as Debian. It even seems relocatable, so
you can just put it anywhere and add it to PATH and it'll work. If you
want, I can put a tarball with the toolchain and qemu wrappers up on a
web server somewhere for you to grab. (I just thought, a container image
would be the easier delivery method nowadays. :)
I would like to try this method in case my Gentoo installation fails, or
simply because it may be easier to extract your uploaded packages and add
them to PATH.
Update: while writing this message I got "no space left on device". It
seems I set the sizes too low while partitioning the disk. Let's try the
above method before I start over with the Gentoo installation.
...
Otherwise, what's your error message from podman? It's got no daemon, so
it shouldn't need a socket to connect to it like docker does. Out to the
Internet for image download it's also a standard client and respects
environment variables for proxies as usual.
I got "Error: error creating network namespace for container". I think I
can fix it by tracing the problem, but let's try the other methods first,
as I think that will be simpler for me.
regards,
Mamone

    
