Hello Mamone,
On Wed, Jan 20, 2021 at 10:25:19PM +0200, Maamoun TK wrote:
I'm trying to install Gentoo on VMware by walking through this recipe https://medium.com/@steensply/vmware-installation-of-gentoo-linux-from-scrat... I'm in the middle of the recipe now but there are a lot of instructions there so I'm gonna get the os working in the end.
As far as I can tell that recipe only encompasses basic installation. You'd additionally need to run crossdev to create a cross-toolchain and then install qemu as well. Gentoo has a very steep learning curve. There's no benefit compared to buildroot for our use-case here, IMO.
Here is how I get the vector instructions to operate on registers in LE mode. I'll take this instruction as an example:
pmull v0.1q,v1.1d,v2.1d
Input represented as indexes:
v1: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
v2: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
The instruction byte-reverses each of the 64-bit parts of the register, so the instruction considers the registers as follows:
v1: 7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8
v2: 7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8
So what I did in LE mode is reverse the 64-bit parts using the rev64 instruction before executing the doubleword operation. Accordingly, the pmull output will be 128-bit byte-reversed.
Output represented as indexes:
v0: 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
What I'm assuming in BE mode is that operations are performed in the normal way on the register side, so there is no need to reverse the inputs and the output comes out normal as well. Hence the macros "REDUCTION" and "PMUL_PARAM" differ in their structure. It's not a matter of the zip instructions performing better but of how to handle the weird situation in LE mode.
I've tried for a number of hours to make this work today. Always when I added correct handling of the transposed doublewords to one macro, another broke down. To me the problem comes down to this: ldr HQ,[TABLE...] and st1.16b are fighting each other and can't be brought together without a lot of additional instructions (at least not by me).
Longer story: ldr does a 128-bit load. The byte order it produces in the register on LE is exactly the reverse of the one on BE. As you describe above, the macros for LE expect a state which is neither of those: the bytes transposed as if BE but the doublewords as loaded on LE. For BE this poses the opposite problem: it natively loads bytes in the order LE has to reproduce using rev64, but then needs to reproduce the doubleword order of LE for the LE routines to work, or basically have native BE routines.
That's what my last pedestrian change did. After today I'd perhaps write it like this (untested):
@@ -125,10 +135,12 @@ IF_BE(`
 PROLOGUE(_nettle_gcm_init_key)
     ldr HQ,[TABLE,#16*H_Idx]
-    dup EMSB.16b,H.b[0]
 IF_LE(`
     rev64 H.16b,H.16b
+',`
+    ext H.16b,H.16b,H.16b,#8
 ')
+    dup EMSB.16b,H.b[7]
     mov x1,#0xC200000000000000
     mov x2,#1
     mov POLY.d[0],x1
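To illustrate why rev64 on LE and ext #8 on BE end up with the same register contents, here is a small Python sketch of the register model. It assumes a register is a list of 16 byte lanes with lane 0 being b[0], and that ldr puts the lowest-addressed byte into lane 0 on LE and into lane 15 on BE. The helper names are made up for this sketch and are not part of nettle.

```python
# Model of a 128-bit vector register as 16 byte lanes; index 0 is b[0].
# Memory is represented by byte indexes 0..15 (0 = lowest address).

def ldr_q(mem, big_endian):
    """128-bit ldr: lowest address lands in lane 0 on LE, lane 15 on BE."""
    return list(reversed(mem)) if big_endian else list(mem)

def rev64_16b(reg):
    """rev64 Vd.16b: reverse the bytes inside each 64-bit doubleword."""
    return reg[7::-1] + reg[15:7:-1]

def ext8(reg):
    """ext Vd.16b,Vn.16b,Vn.16b,#8 with identical sources: swap doublewords."""
    return reg[8:] + reg[:8]

mem = list(range(16))
le = rev64_16b(ldr_q(mem, big_endian=False))
be = ext8(ldr_q(mem, big_endian=True))
print("LE after rev64:", le)
print("BE after ext  :", be)
```

In this model both branches leave memory byte 0 in lane 7, which is why a single dup from H.b[7] after the IF_LE block can serve both endiannesses.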
When trying to cater to the current layout on LE, all the other vectors which are later used in conjunction with H have to be reversed as well. That leads to this diff to your initial patch:
@@ -125,14 +135,21 @@ IF_BE(`
 PROLOGUE(_nettle_gcm_init_key)
     ldr HQ,[TABLE,#16*H_Idx]
-    dup EMSB.16b,H.b[0]
 IF_LE(`
+    dup EMSB.16b,H.b[0]
     rev64 H.16b,H.16b
+',`
+    dup EMSB.16b,H.b[15]
 ')
     mov x1,#0xC200000000000000
     mov x2,#1
+IF_LE(`
     mov POLY.d[0],x1
     mov POLY.d[1],x2
+',`
+    mov POLY.d[1],x1
+    mov POLY.d[0],x2
+')
     sshr EMSB.16b,EMSB.16b,#7
     and EMSB.16b,EMSB.16b,POLY.16b
     ushr B.2d,H.2d,#63
@@ -142,7 +159,11 @@ IF_LE(`
     orr H.16b,H.16b,B.16b
     eor H.16b,H.16b,EMSB.16b
+IF_LE(`
     dup POLY.2d,POLY.d[0]
+',`
+    dup POLY.2d,POLY.d[1]
+')
 C --- calculate H^2 = H*H ---
The difference in index in dup EMSB nicely shows the doubleword transposition compared to LE. If on LE the dup was done after the rev64, it'd be H.b[7] vs. H.b[15].
With this layout PMUL_PARAM can work on H and POLY but then needs to use pmull instead of pmull2 because the relevant data is in the other doublewords compared to LE. On the other hand, since the output of PMUL_PARAM is to be stored using st1.16b it must not have the doublewords transposed ("load-inverted" I termed it in the comments ;). That leads to the following adjustment and makes the first 16 bytes of TABLE identical to LE:
@@ -109,11 +118,12 @@ define(`H4L', `v30')
 .macro PMUL_PARAM in, param1, param2
 IF_BE(`
-    pmull2 Hp.1q,\in().2d,POLY.2d
+    pmull Hp.1q,\in().1d,POLY.1d
     ext Hm.16b,\in().16b,\in().16b,#8
     eor Hm.16b,Hm.16b,Hp.16b
-    zip \param1().2d,\in().2d,Hm.2d
-    zip2 \param2().2d,\in().2d,Hm.2d
+    C output must be in native register order (not load-inverted) for st1.16b to work
+    zip2 \param1().2d,\in().2d,Hm.2d
+    zip1 \param2().2d,\in().2d,Hm.2d
 ',`
     pmull2 Hp.1q,\in().2d,POLY.2d
     eor Hm.16b,\in().16b,Hp.16b
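The tension between ldr and st1.16b can be sketched in Python as follows. This is only a model under the assumption that st1 with a .16b arrangement writes lane i to address base+i regardless of endianness, while ldr/str of a q register treat the 128 bits as one integer in memory byte order. The function names are hypothetical helpers for this sketch.

```python
# Registers are lists of 16 byte lanes (index 0 is b[0]);
# memory is a list of 16 bytes (index 0 is the lowest address).

def ldr_q(mem, big_endian):
    """128-bit ldr: lowest address lands in lane 0 on LE, lane 15 on BE."""
    return list(reversed(mem)) if big_endian else list(mem)

def str_q(reg, big_endian):
    """128-bit str: exact inverse of ldr_q for the same endianness."""
    return list(reversed(reg)) if big_endian else list(reg)

def st1_16b(reg):
    """st1 {Vt.16b}: lane i goes to address i on LE and BE alike."""
    return list(reg)

mem = list(range(16))
for big_endian in (False, True):
    name = "BE" if big_endian else "LE"
    reg = ldr_q(mem, big_endian)
    print(name, "ldr then str    :", str_q(reg, big_endian))
    print(name, "ldr then st1.16b:", st1_16b(reg))
```

In this model st1.16b output depends only on lane contents, so getting identical TABLE bytes on LE and BE requires identical lane contents at the time of the store.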
In PMUL is where it breaks down, at least for my brain: Its first call is handed H (which has doublewords "transposed" from the initial ldr) and H1M and H1L (which must not have doublewords transposed so st1.16b writes them to memory in correct order). It wants to pmull/pmull2 them which requires corresponding doublewords at the same index. So we'd need to temporarily transpose \in for that:
@@ -46,25 +46,34 @@ define(`R1', `v19')
 C common macros:
 .macro PMUL in, param1, param2
-    pmull F.1q,\param2().1d,\in().1d
-    pmull2 F1.1q,\param2().2d,\in().2d
-    pmull R.1q,\param1().1d,\in().1d
-    pmull2 R1.1q,\param1().2d,\in().2d
+    C PMUL_PARAM left us with \param1 and \param2 in native register order but
+    C \in is load-inverted from initial load of H using ldr, something must give
+IF_BE(`
+    ext T.16b,\in().16b,\in().16b,#8
+',`
+    mov T.16b,\in().16b
+')
+    pmull F.1q,\param2().1d,T.1d
+    pmull2 F1.1q,\param2().2d,T.2d
+    pmull R.1q,\param1().1d,T.1d
+    pmull2 R1.1q,\param1().2d,T.2d
     eor F.16b,F.16b,F1.16b
     eor R.16b,R.16b,R1.16b
 .endm
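Why the temporary transposition is needed can be shown with a toy Python model of pmull/pmull2. It assumes a register is a pair (d[0], d[1]) of 64-bit integers, that pmull multiplies the two d[0] halves and pmull2 the two d[1] halves, and it uses arbitrary made-up operand values. This is a sketch of the instruction semantics only, not of the actual GHASH computation.

```python
def clmul64(a, b):
    """Carry-less (polynomial over GF(2)) multiply of two 64-bit values."""
    result = 0
    for i in range(64):
        if (b >> i) & 1:
            result ^= a << i
    return result

def pmull(n, m):
    """pmull Vd.1q,Vn.1d,Vm.1d: multiply the lower doublewords."""
    return clmul64(n[0], m[0])

def pmull2(n, m):
    """pmull2 Vd.1q,Vn.2d,Vm.2d: multiply the upper doublewords."""
    return clmul64(n[1], m[1])

def ext8(v):
    """ext #8 with identical sources: swap the two doublewords."""
    return (v[1], v[0])

param = (0x1111111111111111, 0x8000000000000001)  # arbitrary example values
h = (0x0123456789ABCDEF, 0xFEDCBA9876543210)      # arbitrary example values
h_inverted = ext8(h)                              # doublewords transposed

print(pmull(param, h_inverted) == pmull(param, h))        # False: wrong pairing
print(pmull(param, ext8(h_inverted)) == pmull(param, h))  # True after ext
```

With one operand transposed, pmull pairs param.d[0] with what should have been h.d[1], so the ext into T restores the intended pairing for both pmull and pmull2.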
If we finally artificially restore the doubleword transposition in REDUCTION for H2 and H3 we're all set for the next calls:
 .macro REDUCTION out
 IF_BE(`
-    pmull T.1q,F.1d,POLY.1d
     ext \out().16b,F.16b,F.16b,#8
-    eor R.16b,R.16b,T.16b
-    eor \out().16b,\out().16b,R.16b
+    pmull2 T.1q,\out().2d,POLY.2d
 ',`
     pmull T.1q,F.1d,POLY.1d
+')
     eor R.16b,R.16b,T.16b
     ext R.16b,R.16b,R.16b,#8
     eor \out().16b,F.16b,R.16b
+C artificially restore load inversion for PMUL_PARAM :-(
+IF_BE(`
+    ext \out().16b,\out().16b,\out().16b,#8
 ')
 .endm
So all we're doing is catering to the quirk of the very first ldr operation. The easiest solution seems to me to align all types of load and store operations with each other or counteract their quirks right after or before executing them. That way we end up with identical register contents on LE and BE and don't have to maintain separate implementations.
That'd be in line with what we ended up with on arm32 NEON as well. memxor3.asm does do the dance of working with different register content but there it's only bitwise operations and the load and store operations have identical behaviour.
The advantage of the current implementation with transposed doublewords and only the LE routines seems to me that overhead on LE and BE would be about the same.
Do you think it makes sense to try and adjust the code to work with the BE layout natively and have a full 128-bit reverse after ldr-like loads on LE instead (considering that 99.999% of aarch64 users will run LE)?
Otherwise, what's your error message from podman? It's got no daemon, so it shouldn't need a socket to connect to it like docker does. Out to the Internet for image download it's also a standard client and respects environment variables for proxies as usual.
I got Error: error creating network namespace for container. I think I can fix it by tracing the problem but let's try the other methods first as I think it's gonna be simpler for me..
I found this error on the Net in conjunction with a Debian/Ubuntu security-related custom kernel knob for disabling unprivileged user namespaces that was enabled by default once. I tested that with Ubuntu 18.04, 20.04 and 20.10 yesterday and it's disabled (i.e. namespaces for unprivileged users enabled) on all of them. You can still have a look at your setting in /proc/sys/kernel/unprivileged_userns_clone or with sysctl kernel.unprivileged_userns_clone. It needs to be set to 1 for rootless podman to work.
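If you'd rather script that check, a minimal Python sketch could look like this. It only reads the knob named above; since that knob comes from a distribution-specific kernel patch, the sketch assumes the file may simply not exist on other kernels and reports that separately. The function name and return strings are made up for illustration.

```python
from pathlib import Path

# Debian/Ubuntu-specific knob; may be absent on other kernels.
KNOB = Path("/proc/sys/kernel/unprivileged_userns_clone")

def userns_clone_state(path=KNOB):
    """Return 'enabled', 'disabled' or 'absent' for the knob at path."""
    try:
        value = Path(path).read_text().strip()
    except FileNotFoundError:
        return "absent"
    # Rootless podman needs the value to be 1.
    return "enabled" if value == "1" else "disabled"

print("unprivileged user namespaces knob:", userns_clone_state())
```

A result of "disabled" would point at the knob as the likely cause; "absent" or "enabled" means the problem lies elsewhere.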
You're not by any chance running the Windows Subsystem for Linux (WSL)? https://github.com/containers/podman/issues/3288#issuecomment-501356136 :)
Or inside another container at a hosting service? https://github.com/containers/podman/issues/4056
Otherwise I have no idea what could be causing that and have never seen that error.