Hello Michael,
On Tue, Jan 19, 2021 at 11:45 PM Michael Weiser <michael.weiser@gmx.de> wrote:
Yes, there are no packages for aarch64_be in any mainstream distribution I'm aware of. Buildroot and Gentoo are the ones I know that can target it, Yocto likely as well. All are compile-yourself-distributions and not for the faint of heart. Also, I've just learned that Buildroot has made a conscious decision not to produce native toolchains for the target. So you can only ever cross-compile nettle to it, run it on an actual board or under qemu and then go back to the cross-compiler on the host.
I'm trying to install Gentoo on VMware by following this recipe: https://medium.com/@steensply/vmware-installation-of-gentoo-linux-from-scrat... I'm in the middle of the recipe now. There are a lot of instructions in it, but I expect to get the OS working in the end.
I did a search of the aarch64 instruction set and saw that there are zip1 and zip2 instructions. So as a first test I just changed zip to zip1 which made it compile. As was to be expected, the testsuite failed though.
You are on the right track so far.
I've poked at the code a bit more and seemingly made the key init function work by eliminating all the BE-specific macros and instead adjusting the load from memory to produce the same register content. At least register values and the final output to memory look the same in an x/64xb $x0-64 and x/64xb $x0 for the first test cases in gcm-test (which they did not before).
137         PMUL_PARAM v5,v29,v30
(gdb)
139         st1 {v27.16b,v28.16b,v29.16b,v30.16b},[x0]
(gdb)
141         ret
(gdb) x/64xb $x0-64
0xaaaaaaac5390: 0x77 0x58 0x14 0xdf 0xa9 0x97 0xd2 0xcd
[.. all the same on BE and LE ...]
0xaaaaaaac53c8: 0x0d 0x12 0x63 0x69 0x37 0x20 0xd3 0xfe
(gdb) x/64xb $x0
0xaaaaaaac53d0: 0xf9 0xfa 0x22 0xc3 0x02 0xe7 0x95 0x86
[.. all the same on BE and LE ...]
0xaaaaaaac5408: 0x45 0x91 0xbd 0x48 0x73 0xd9 0x8b 0x5c
(gdb)
Here is how I understand the vector instructions to operate on registers in LE mode. I'll take this instruction as an example:

pmull v0.1q,v1.1d,v2.1d

Input, represented as byte indexes:
v1: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
v2: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

The instruction byte-reverses each 64-bit half of the register, so it sees the registers as follows:
v1: 7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8
v2: 7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8

So what I did in LE mode is reverse the 64-bit halves with the rev64 instruction before executing the doubleword operation. Accordingly, the pmull output will be 128-bit byte-reversed.

Output, represented as byte indexes:
v0: 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
What I'm assuming for BE mode is that operations are performed the normal way on the register side, so there is no need to reverse the inputs in order to get normal output. Hence the macros "REDUCTION" and "PMUL_PARAM" differ in structure. It's not a matter of the zip instructions performing better, but of how to handle the weird situation in LE mode.
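The byte shuffling described above can be sketched in Python. This is only a model of the index permutations in my description, not a statement of how pmull is specified; the function names are made up for illustration.

```python
def rev64(reg):
    """Reverse the bytes inside each 64-bit half of a 16-byte register."""
    assert len(reg) == 16
    return reg[7::-1] + reg[15:7:-1]


def rev128(reg):
    """Reverse all 16 bytes of a register."""
    assert len(reg) == 16
    return reg[::-1]


# Byte indexes as they appear in the description above.
v1 = list(range(16))

print(rev64(v1))   # [7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8]
print(rev128(v1))  # [15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
```

Applying rev64 twice gives back the original order, which is why a single rev64 before the doubleword operation is enough to undo the per-half reversal in this model.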
The problem here once more seems to be that after a 128bit LE load which is later used as two 64bit operands, not only the bytes of the operands are reversed (which you already counter by rev64'ing them, I gather) but the operands (doublewords) also end up transposed in the register. This is something the rest of the routine expects but is only true on LE. So I adjusted for it on BE in a very pedestrian way:
diff --git a/arm64/v8/gcm-hash.asm b/arm64/v8/gcm-hash.asm
index 1c14db54..74cd656a 100644
--- a/arm64/v8/gcm-hash.asm
+++ b/arm64/v8/gcm-hash.asm
@@ -55,17 +55,10 @@ C common macros:
 .endm
 
 .macro REDUCTION out
-IF_BE(`
-    pmull T.1q,F.1d,POLY.1d
-    ext \out().16b,F.16b,F.16b,#8
-    eor R.16b,R.16b,T.16b
-    eor \out().16b,\out().16b,R.16b
-',`
     pmull T.1q,F.1d,POLY.1d
     eor R.16b,R.16b,T.16b
     ext R.16b,R.16b,R.16b,#8
     eor \out().16b,F.16b,R.16b
-')
 .endm
 
 C void gcm_init_key (union gcm_block *table)
@@ -108,19 +101,11 @@ define(`H4M', `v29')
 define(`H4L', `v30')
 
 .macro PMUL_PARAM in, param1, param2
-IF_BE(`
-    pmull2 Hp.1q,\in().2d,POLY.2d
-    ext Hm.16b,\in().16b,\in().16b,#8
-    eor Hm.16b,Hm.16b,Hp.16b
-    zip \param1().2d,\in().2d,Hm.2d
-    zip2 \param2().2d,\in().2d,Hm.2d
-',`
     pmull2 Hp.1q,\in().2d,POLY.2d
     eor Hm.16b,\in().16b,Hp.16b
     ext \param1().16b,Hm.16b,\in().16b,#8
     ext \param2().16b,\in().16b,Hm.16b,#8
     ext \param1().16b,\param1().16b,\param1().16b,#8
-')
 .endm
 
 PROLOGUE(_nettle_gcm_init_key)
@@ -128,6 +113,10 @@ PROLOGUE(_nettle_gcm_init_key)
     dup EMSB.16b,H.b[0]
 IF_LE(`
     rev64 H.16b,H.16b
+',`
+    mov x1,H.d[0]
+    mov H.d[0],H.d[1]
+    mov H.d[1],x1
 ')
     mov x1,#0xC200000000000000
     mov x2,#1
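The doubleword transposition and the BE-side swap in the patch can be modeled with a small Python sketch. It assumes a 128-bit `ldr q` load reads the 16 bytes as one integer in the current data endianness; the helper names are hypothetical and only mimic the instructions named in their docstrings.

```python
MASK64 = (1 << 64) - 1


def ldr_q(mem, big_endian):
    """Model a 128-bit load: 16 bytes read as one integer; return (d[0], d[1])."""
    value = int.from_bytes(mem, "big" if big_endian else "little")
    return value & MASK64, value >> 64


def rev64(lanes):
    """Byte-reverse each 64-bit lane, like `rev64 H.16b,H.16b`."""
    return tuple(int.from_bytes(d.to_bytes(8, "little"), "big") for d in lanes)


def swap_doublewords(lanes):
    """Exchange d[0] and d[1], like the three `mov` lines added for BE."""
    d0, d1 = lanes
    return d1, d0


mem = bytes(range(16))  # stand-in for the 16 bytes of H in memory

le = rev64(ldr_q(mem, big_endian=False))  # LE path: load, then rev64
be = ldr_q(mem, big_endian=True)          # BE path: plain load

print([f"{d:016x}" for d in le])
print([f"{d:016x}" for d in be])
print(swap_doublewords(be) == le)  # True
```

In this model the BE load already has the bytes of each doubleword in the order the LE path reaches via rev64, but the two doublewords sit in opposite lanes, which is what the explicit swap compensates for.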
If my understanding is correct, we could avoid the doubleword swap for both LE and BE if we were to load using ld1 to {H.16b} instead (with a precalculation of the offset because ld1 won't take an immediate offset that high, correct?). But then the rest of the routine would need to change its expectation of what H.d[0] and H.d[1] contain, respectively, because they will no longer be transposed by either the load on LE or an explicit swap on BE.
Somehow I have a feeling, I'm terribly missing the actual point here, though. Are the zip instructions likely to give even further speedup beyond the LE version? Could this be exploited for LE as well by adjusting the loading scheme even more?
If my assumption about how the instructions operate in BE mode is right, then yes, this is not the actual point.
But I have made the cross-compiling and -debugging setup of the container available on a vserver on the Net. Send me a mail directly with an SSH ID public key if you'd like to try this out and I'll send you instructions for login and use. We could meet up there in a tmux/screen session and work on it together as well.
Let's try the second solution before we get to this.
I have also tried to extract the buildroot toolchain from the image and run it on my Gentoo box as well as Debian. It even seems relocatable, so you can just put it anywhere and add it to PATH and it'll work. If you want, I can put a tarball with the toolchain and qemu wrappers up on a web server somewhere for you to grab. (I just thought, a container image would be the easier delivery method nowadays. :)
I would like to try this method in case my Gentoo installation fails, or simply because extracting your uploaded packages and adding them to PATH is easier. Update: while I was writing this message I got "No space left on device". It seems I set the partition sizes too low. Let's try the above method before I start the Gentoo installation over.
Otherwise, what's your error message from podman? It's got no daemon, so it shouldn't need a socket to connect to it like docker does. Out to the Internet for image download it's also a standard client and respects environment variables for proxies as usual.
I got "Error: error creating network namespace for container". I think I can fix it by tracing the problem, but let's try the other methods first as I think they will be simpler for me.
regards, Mamone