Hello Mamone,
On Mon, Jan 18, 2021 at 06:27:40PM +0200, Maamoun TK wrote:
It would be nice to get the implementation of the enhanced algorithm working for both endian modes, as it yields a good performance boost. Also, there is not much effort involved here; the only thing I'm struggling with is getting the binary built for Aarch64_be. I'm using Ubuntu on x86_64 as host and it seems there is no official package to cross-compile for Aarch64_be.
Yes, there are no packages for aarch64_be in any mainstream distribution I'm aware of. Buildroot and Gentoo are the ones I know that can target it, Yocto likely as well. All are compile-yourself-distributions and not for the faint of heart. Also, I've just learned that Buildroot has made a conscious decision not to produce native toolchains for the target. So you can only ever cross-compile nettle to it, run it on an actual board or under qemu and then go back to the cross-compiler on the host.
I did a search of the aarch64 instruction set and saw that there's zip1 and zip2 instructions. So as a first test I just changed zip to zip1 which made it compile. As was to be expected, the testsuite failed though.
You are on the right track so far.
I've poked at the code a bit more and seemingly made the key init function work by eliminating all the BE-specific macros and instead adjusting the load from memory to produce the same register content. At least register values and the final output to memory look the same in an x/64xb $x0-64 and x/64xb $x0 for the first test cases in gcm-test (which they did not before).
137         PMUL_PARAM v5,v29,v30
(gdb)
139         st1 {v27.16b,v28.16b,v29.16b,v30.16b},[x0]
(gdb)
141         ret
(gdb) x/64xb $x0-64
0xaaaaaaac5390: 0x77 0x58 0x14 0xdf 0xa9 0x97 0xd2 0xcd
[.. all the same on BE and LE ...]
0xaaaaaaac53c8: 0x0d 0x12 0x63 0x69 0x37 0x20 0xd3 0xfe
(gdb) x/64xb $x0
0xaaaaaaac53d0: 0xf9 0xfa 0x22 0xc3 0x02 0xe7 0x95 0x86
[.. all the same on BE and LE ...]
0xaaaaaaac5408: 0x45 0x91 0xbd 0x48 0x73 0xd9 0x8b 0x5c
(gdb)
The problem here once more seems to be that after a 128bit LE load which is later used as two 64bit operands, not only the bytes of the operands are reversed (which you already counter by rev64'ing them, I gather) but the operands (doublewords) also end up transposed in the register. This is something the rest of the routine expects but is only true on LE. So I adjusted for it on BE in a very pedestrian way:
diff --git a/arm64/v8/gcm-hash.asm b/arm64/v8/gcm-hash.asm
index 1c14db54..74cd656a 100644
--- a/arm64/v8/gcm-hash.asm
+++ b/arm64/v8/gcm-hash.asm
@@ -55,17 +55,10 @@ C common macros:
 .endm
 
 .macro REDUCTION out
-IF_BE(`
-    pmull          T.1q,F.1d,POLY.1d
-    ext            \out().16b,F.16b,F.16b,#8
-    eor            R.16b,R.16b,T.16b
-    eor            \out().16b,\out().16b,R.16b
-',`
     pmull          T.1q,F.1d,POLY.1d
     eor            R.16b,R.16b,T.16b
     ext            R.16b,R.16b,R.16b,#8
     eor            \out().16b,F.16b,R.16b
-')
 .endm
 
 C void gcm_init_key (union gcm_block *table)
@@ -108,19 +101,11 @@ define(`H4M', `v29')
 define(`H4L', `v30')
 
 .macro PMUL_PARAM in, param1, param2
-IF_BE(`
-    pmull2         Hp.1q,\in().2d,POLY.2d
-    ext            Hm.16b,\in().16b,\in().16b,#8
-    eor            Hm.16b,Hm.16b,Hp.16b
-    zip            \param1().2d,\in().2d,Hm.2d
-    zip2           \param2().2d,\in().2d,Hm.2d
-',`
     pmull2         Hp.1q,\in().2d,POLY.2d
     eor            Hm.16b,\in().16b,Hp.16b
     ext            \param1().16b,Hm.16b,\in().16b,#8
     ext            \param2().16b,\in().16b,Hm.16b,#8
     ext            \param1().16b,\param1().16b,\param1().16b,#8
-')
 .endm
 
 PROLOGUE(_nettle_gcm_init_key)
@@ -128,6 +113,10 @@ PROLOGUE(_nettle_gcm_init_key)
     dup            EMSB.16b,H.b[0]
 IF_LE(`
     rev64          H.16b,H.16b
+',`
+    mov            x1,H.d[0]
+    mov            H.d[0],H.d[1]
+    mov            H.d[1],x1
 ')
     mov            x1,#0xC200000000000000
     mov            x2,#1
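To illustrate the layout issue and the effect of the three added mov instructions, here is a small Python model. It is only a sketch under assumptions: it assumes H is loaded with a plain 128-bit ldr of a q register (the load itself is not part of the diff), the helper names are made up for illustration, and the sample bytes 0x00..0x0f stand in for the real key block.

```python
MASK64 = (1 << 64) - 1

def ldr_q(mem: bytes, big_endian: bool) -> int:
    """Model a 128-bit LDR Qn: 16 bytes read as one integer in CPU byte order."""
    assert len(mem) == 16
    return int.from_bytes(mem, "big" if big_endian else "little")

def d(reg: int, i: int) -> int:
    """Doubleword lane i of a 128-bit register (d[0] = bits 63:0, d[1] = bits 127:64)."""
    return (reg >> (64 * i)) & MASK64

def rev64(reg: int) -> int:
    """Model REV64 Vd.16b: reverse the bytes inside each 64-bit lane."""
    lanes = [int.from_bytes(d(reg, i).to_bytes(8, "little"), "big") for i in (0, 1)]
    return lanes[0] | (lanes[1] << 64)

def swap_d(reg: int) -> int:
    """Model the three MOVs from the diff: exchange d[0] and d[1]."""
    return d(reg, 1) | (d(reg, 0) << 64)

def ext8(vn: int, vm: int) -> int:
    """Model EXT Vd.16b,Vn.16b,Vm.16b,#8: upper half of Vn, then lower half of Vm."""
    return d(vn, 1) | (d(vm, 0) << 64)

mem = bytes(range(16))                      # hypothetical sample data
le = rev64(ldr_q(mem, big_endian=False))    # LE path: load, then rev64
be_plain = ldr_q(mem, big_endian=True)      # BE path: load only
be_fixed = swap_d(be_plain)                 # BE path: load, then doubleword swap

for name, reg in (("LE + rev64", le), ("BE plain", be_plain), ("BE + swap", be_fixed)):
    print(f"{name:10s} d[0]={d(reg, 0):016x} d[1]={d(reg, 1):016x}")
```

In this model the BE register only matches the LE one after the doubleword swap. The model also suggests that the swap could be written as a single ext of H with itself by #8 (ext8(be_plain, be_plain) gives the same value), which is the idiom REDUCTION already uses for R; whether that is preferable in the real routine is untested here.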
If my understanding is correct, we could avoid the doubleword swap for both LE and BE if we were to load using ld1 to {H.16b} instead (with a precalculation of the offset because ld1 won't take an immediate offset that high, correct?). But then the rest of the routine would need to change its expectation of what H.d[0] and H.d[1] contain, respectively, because they will no longer be transposed by either the load on LE or an explicit swap on BE.
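For reference, a byte-element ld1 can be modelled in the same spirit. This is a sketch, not a statement about what the routine should do: the function name is hypothetical and the sample bytes are made up. The one architectural assumption is that ld1 with .16b elements places the byte at address base+i into byte lane i, regardless of the CPU's endian mode.

```python
MASK64 = (1 << 64) - 1

def ld1_16b(mem: bytes) -> int:
    """Model LD1 {Vt.16b}: byte lane i receives mem[i]; no endianness parameter needed."""
    assert len(mem) == 16
    return sum(b << (8 * i) for i, b in enumerate(mem))

def d(reg: int, i: int) -> int:
    """Doubleword lane i of a 128-bit register (d[0] = bits 63:0, d[1] = bits 127:64)."""
    return (reg >> (64 * i)) & MASK64

mem = bytes(range(16))          # hypothetical sample data, bytes 0x00..0x0f
reg = ld1_16b(mem)

print(f"d[0]={d(reg, 0):016x} d[1]={d(reg, 1):016x}")
```

In this model d[0] always holds the first eight bytes from memory with mem[0] in its least significant byte, so the register content is the same on LE and BE. Which doubleword the rest of the routine expects where is exactly the open question above.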
Somehow I have a feeling I'm terribly missing the actual point here, though. Are the zip instructions likely to give even further speedup beyond the LE version? Could this be exploited for LE as well by adjusting the loading scheme even more?
Also, it's not fully working yet. Before digging deeper I wanted to give a bit of an update and get guidance as to how to proceed.
podman run -it -v ~/Downloads/nettle:/nettle
I tried that but I'm having difficulty getting it to work. It seems there is a problem in my system configuration that prevents podman from establishing a socket for connection. I spent some time looking for alternative solutions with no luck. Do you have any other solutions? All I can think of is either setting up an ssh connection or working together to get it to work, if you are into it!
I mulled this over from all directions. Access to the actual board is somewhat complicated by the limits of my available Internet connections (CGNAT being one, missing DMZ functionality on the routers another). It can certainly be done, I just would need some time to set it up.
But I have made the cross-compiling and -debugging setup of the container available on a vserver on the Net. Send me a mail directly with an SSH ID public key if you'd like to try this out and I'll send you instructions for login and use. We could meet up there in a tmux/screen session and work on it together as well.
I have also tried to extract the buildroot toolchain from the image and run it on my Gentoo box as well as Debian. It even seems relocatable, so you can just put it anywhere and add it to PATH and it'll work. If you want, I can put a tarball with the toolchain and qemu wrappers up on a web server somewhere for you to grab. (I just thought, a container image would be the easier delivery method nowadays. :)
Otherwise, what's your error message from podman? It's got no daemon, so it shouldn't need a socket to connect to it like docker does. Out to the Internet for image download it's also a standard client and respects environment variables for proxies as usual.
rootless podman (running as your standard user instead of root) can take a bit of tweaking before it stops throwing error messages but once that's done it works nicely. I've never actually run podman as root by luck of late birth with regards to containers.
Here's my command sequence on an Ubuntu 20.04 VM that's never seen rootless podman before as per https://www.vultr.com/docs/how-to-install-and-use-podman-on-ubuntu-20-04 (literally the first hit on search, can't vouch for the packages from opensuse though):
michael@demo:~$ podman
Command 'podman' not found, did you mean:
command 'pod2man' from deb perl (5.30.0-9ubuntu0.2)
Try: sudo apt install <deb name>
michael@demo:~$ source /etc/os-release
michael@demo:~$ sudo sh -c "echo 'deb http://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stabl... /' > /etc/apt/sources.list.d/devel:kubic:libcontainers:stable.list"
michael@demo:~$ wget -nv https://download.opensuse.org/repositories/devel:kubic:libcontainers:stable/... -O- | sudo apt-key add -
2021-01-19 21:13:19 URL:https://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stab... [1093/1093] -> "-" [1]
OK
michael@demo:~$ sudo apt-get update -qq
michael@demo:~$ sudo apt-get -qq --yes install podman fuse-overlayfs slirp4netns
[...]
michael@demo:~$ podman run -it michaelweisernettleci/buildroot:2020.11.1-aarch64_be-glibc-gdb
Completed short name "michaelweisernettleci/buildroot" with unqualified-search registries (origin: /etc/containers/registries.conf)
Trying to pull docker.io/michaelweisernettleci/buildroot:2020.11.1-aarch64_be-glibc-gdb...
Getting image source signatures
Copying blob 6c33745f49b4 done
Copying blob ff35d554f2d5 done
Copying blob 3927b287d6b9 done
Copying blob 6bbc022f227c done
Copying config 21663e44fe done
Writing manifest to image destination
Storing signatures
root@06e70f1e12e4:/# aarch64_be-buildroot-linux-gnu-gcc -v
Using built-in specs.
COLLECT_GCC=/buildroot/output/host/bin/aarch64_be-buildroot-linux-gnu-gcc.br_real
COLLECT_LTO_WRAPPER=/buildroot/output/host/bin/../libexec/gcc/aarch64_be-buildroot-linux-gnu/9.3.0/lto-wrapper
Target: aarch64_be-buildroot-linux-gnu
Configured with: ./configure --prefix=/buildroot/output/per-package/host-gcc-final/host [...] --enable-shared --disable-libgomp --silent
Thread model: posix
gcc version 9.3.0 (Buildroot 2020.11.1)
root@06e70f1e12e4:/# git clone https://git.lysator.liu.se/nettle/nettle
bash: git: command not found
root@06e70f1e12e4:/# apt-get update
Get:1 http://security.debian.org/debian-security buster/updates InRelease [65.4 kB]
Get:2 http://deb.debian.org/debian buster InRelease [121 kB]
[...]
root@06e70f1e12e4:/# apt-get install git
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
  ca-certificates git-man krb5-locales less libbsd0 libcurl3-gnutls
[...]
root@06e70f1e12e4:/# git clone https://git.lysator.liu.se/nettle/nettle
Cloning into 'nettle'...
warning: redirecting to https://git.lysator.liu.se/nettle/nettle.git/
remote: Enumerating objects: 721, done.
remote: Counting objects: 100% (721/721), done.
remote: Compressing objects: 100% (349/349), done.
remote: Total 21095 (delta 479), reused 593 (delta 372), pack-reused 20374
Receiving objects: 100% (21095/21095), 5.90 MiB | 3.47 MiB/s, done.
Resolving deltas: 100% (15748/15748), done.
root@06e70f1e12e4:/#
That was a lot easier than even I expected. Necessary stuff like entries in /etc/subuid are automatically added by useradd as standard nowadays without podman even being installed:
michael@demo:~$ cat /etc/subuid
michael:100000:65536
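If rootless podman complains about user namespaces, one quick thing to check is whether such an entry exists for your user. The following Python sketch parses the name:start:count format shown above; the function and type names are hypothetical and the sample text is simply the line from the demo VM.

```python
from typing import List, NamedTuple

class SubIdRange(NamedTuple):
    user: str    # login name (or numeric UID) owning the range
    start: int   # first subordinate ID
    count: int   # number of subordinate IDs in the range

def parse_subuid(text: str) -> List[SubIdRange]:
    """Parse /etc/subuid-style content, one 'user:start:count' entry per line."""
    ranges = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user, start, count = line.split(":")
        ranges.append(SubIdRange(user, int(start), int(count)))
    return ranges

sample = "michael:100000:65536\n"   # the entry from the demo VM above
entries = parse_subuid(sample)
print(entries)
```

On a real system you would pass the contents of /etc/subuid instead of the sample string and look for your own login name among the entries.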
Hope that helps.
If all else fails and it's not too trying for your patience I'm up for making it work iteratively by trial, error and discussion as above. ;)