Hello Niels,
On Sun, Feb 11, 2018 at 11:03:41AM +0100, Niels Möller wrote:
> > dst[i] = LE_SWAP32 (t);
> >
> > Would this apply to chacha-core-internal.asm, too?
>
> That's right, it is expected to produce the output as an array of 16
> 32-bit words, stored in *little* endian order. So byteswap is needed.
Right. When this still didn't fix it, I compared little- and big-endian behaviour and found that a) vldm and vstm switch the two doublewords of a quadword register (for no reason I can see or find documented) and b) vext extracts from the top of the vector, not the bottom. Taking both into account, I now have chacha and salsa20 passing their tests.
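To make sure I have the output contract right: on BE the cores now do a vrev32.u8 on each result register before the vstm, which should match what the C implementation does. Roughly the following (an untested sketch, not the actual nettle code; LE_SWAP32 is approximated with __builtin_bswap32 and WORDS_BIGENDIAN is the configure result):

  #include <stdint.h>

  /* identity on little-endian, byte swap on big-endian */
  #ifdef WORDS_BIGENDIAN
  # define LE_SWAP32(v) __builtin_bswap32(v)
  #else
  # define LE_SWAP32(v) (v)
  #endif

  /* Add the input words back in and store the sums so that the
     uint32_t result array, seen as bytes, is little-endian on any
     host.  The asm core has to end up with the same byte order. */
  static void
  store_result (uint32_t *dst, const uint32_t x[16], const uint32_t src[16])
  {
    unsigned i;
    for (i = 0; i < 16; i++)
      {
        uint32_t t = x[i] + src[i];
        dst[i] = LE_SWAP32 (t);
      }
  }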
> > FAIL: umac
> >
> > I'm drawing a bit of a blank on this one. It fails on the very first
>
> It could be that if the all-zero input is misaligned, the aligned
> read + rotation trick gets non-zero data from outside of the input
> area into the computation.
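(For reference, this is how I understand the trick you mean; an untested C sketch, all names mine:)

  #include <stdint.h>

  /* Read a 32-bit word from a possibly misaligned pointer the way the
     asm loops do: load the two enclosing aligned words and combine
     them with shifts.  This is the little-endian variant; big-endian
     shifts the other way around.  Note that base[1] touches bytes
     beyond the requested range, which is where stray non-zero data
     could come from. */
  static uint32_t
  load_misaligned_le (const uint8_t *p)
  {
    uintptr_t a = (uintptr_t) p;
    const uint32_t *base = (const uint32_t *) (a & ~(uintptr_t) 3);
    unsigned shift = 8 * (a & 3);

    if (shift == 0)
      return base[0];
    return (base[0] >> shift) | (base[1] << (32 - shift));
  }

That doesn't seem to have been the problem in my case, though.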
Apparently, NEON adjusts for endianness, meaning shifts switch direction as well, which would explain why chacha and salsa20 didn't need more adjustment. All I needed to do in the end was change the order of registers for the 64-bit return value in umac-nh, and the checks passed.
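If I read the AAPCS right, a 64-bit return value is handed over in r0/r1 in memory order (as if loaded with ldm), so on BE r0 carries the high word. A tiny, untested illustration of that memory-order view:

  #include <inttypes.h>
  #include <stdio.h>

  int
  main (void)
  {
    /* The two 32-bit words a uint64_t occupies in memory: w[0] is the
       low half on little-endian but the high half on big-endian, which
       is why the vmov in umac-nh has to swap r0 and r1. */
    union { uint64_t v; uint32_t w[2]; } u = { .v = 0x1122334455667788ULL };
    printf ("w[0]=%08" PRIx32 " w[1]=%08" PRIx32 "\n", u.w[0], u.w[1]);
    return 0;
  }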
I now have the whole testsuite passing apart from these two:
PASS: cxx
./sexp-conv-test: line 17: ../tools/sexp-conv: No such file or directory
cmp: EOF on test1.out which is empty
FAIL: sexp-conv
SKIP: pkcs1-conv
./nettle-pbkdf2-test: line 18: ../tools/nettle-pbkdf2: No such file or directory
cmp: EOF on test1.out which is empty
FAIL: nettle-pbkdf2
PASS: symbols
PASS: dlopen
====================
2 of 93 tests failed
====================
They've been failing all along. Can they be ignored?
> > I was wrong: While the compiler is able to output big-endian objects
> > with -mbig-endian, it needs matching libs as well (e.g. libgcc_s).
> > Debian doesn't have anything precompiled for armeb.
>
> Maybe full-system qemu is easier then (assuming there's some dist
> supporting big-endian arm)?
Weeell, depends on what you consider easier: I haven't found any binary distribution that supports armeb. Yocto and buildroot seem to support it but still require compiling the whole thing.
> > -AC_C_BIGENDIAN
> > +AC_C_BIGENDIAN([AC_DEFINE([WORDS_BIGENDIAN], 1)
> > +                [ASM_WORDS_BIGENDIAN=yes]],
> > +               [ASM_WORDS_BIGENDIAN=no])
> I think I'd indent differently, to group the two parts ACTION-IF-TRUE
> together, and drop the redundant square brackets in
> "[ASM_WORDS_BIGENDIAN=yes]".
Done.
> You leave the ACTION-IF-UNIVERSAL as default, which I think is good.
> I hope that's not relevant for arm.
Ahem, it seems I was looking at old autoconf documentation which didn't have ACTION-IF-UNIVERSAL.
Apple does do arm, and someone could conceivably want to build a fat nettle that supports x86_64 and arm, or rather arm and arm64.
Does nettle currently support being compiled fat with assembly at all? I guess it would require building each platform's asm sources separately and then lipo-ing them together; I'm not clear on the specifics.
> Might still be good to set ASM_WORDS_BIGENDIAN to some default value
> before this check, and have IF_BE fail if used with an unknown
> endianness.
But then I want to have a nice error message so as not to leave the user with an aborted build and no apparent reason. :) Is this portable?
The patch has gotten quite large now. Would it be better if I made a series out of it?
From 67de31a70f8b8076681d6ddd221605365080103f Mon Sep 17 00:00:00 2001
From: Michael Weiser <michael.weiser@gmx.de>
Date: Wed, 7 Feb 2018 00:11:24 +0100
Subject: [PATCH] Support big-endian arm in assembly code

Introduce m4 macros to conditionally handle differences of little- and
big-endian arm in assembler code. Adjust sha1-compress, sha256-compress,
umac-nh, chacha-core-internal, salsa20-core-internal and memxor for arm
to work in big-endian mode.
---
 arm/memxor.asm                     | 21 ++++++++++++----
 arm/memxor3.asm                    | 49 +++++++++++++++++++++++++-------------
 arm/neon/chacha-core-internal.asm  | 37 +++++++++++++++++++++++-----
 arm/neon/salsa20-core-internal.asm | 43 ++++++++++++++++++++++++++++-----
 arm/neon/umac-nh.asm               |  4 +++-
 arm/v6/sha1-compress.asm           |  8 +++++--
 arm/v6/sha256-compress.asm         | 14 +++++++----
 asm.m4                             |  8 +++++++
 config.m4.in                       |  1 +
 configure.ac                       |  7 +++++-
 10 files changed, 149 insertions(+), 43 deletions(-)
diff --git a/arm/memxor.asm b/arm/memxor.asm
index a50e91bc..239a4034 100644
--- a/arm/memxor.asm
+++ b/arm/memxor.asm
@@ -44,6 +44,11 @@ define(<N>, <r2>)
 define(<CNT>, <r6>)
 define(<TNC>, <r12>)
 
+C little-endian and big-endian need to shift in different directions for
+C alignment correction
+define(<S0ADJ>, IF_LE(<lsr>, <lsl>))
+define(<S1ADJ>, IF_LE(<lsl>, <lsr>))
+
 	.syntax unified
 
 	.file "memxor.asm"
@@ -99,6 +104,8 @@ PROLOGUE(nettle_memxor)
 	C
 	C With little-endian, we need to do
 	C DST[i] ^= (SRC[i] >> CNT) ^ (SRC[i+1] << TNC)
+	C With big-endian, we need to do
+	C DST[i] ^= (SRC[i] << CNT) ^ (SRC[i+1] >> TNC)
 
 	push	{r4,r5,r6}
 
@@ -117,14 +124,14 @@ PROLOGUE(nettle_memxor)
 .Lmemxor_word_loop:
 	ldr	r5, [SRC], #+4
 	ldr	r3, [DST]
-	eor	r3, r3, r4, lsr CNT
-	eor	r3, r3, r5, lsl TNC
+	eor	r3, r3, r4, S0ADJ CNT
+	eor	r3, r3, r5, S1ADJ TNC
 	str	r3, [DST], #+4
 .Lmemxor_odd:
 	ldr	r4, [SRC], #+4
 	ldr	r3, [DST]
-	eor	r3, r3, r5, lsr CNT
-	eor	r3, r3, r4, lsl TNC
+	eor	r3, r3, r5, S0ADJ CNT
+	eor	r3, r3, r4, S1ADJ TNC
 	str	r3, [DST], #+4
 	subs	N, #8
 	bcs	.Lmemxor_word_loop
@@ -132,10 +139,14 @@ PROLOGUE(nettle_memxor)
 	beq	.Lmemxor_odd_done
 
 	C We have TNC/8 left-over bytes in r4, high end
-	lsr	r4, CNT
+	S0ADJ	r4, CNT
 	ldr	r3, [DST]
 	eor	r3, r4
 
+	C memxor_leftover does an LSB store
+	C so we need to reverse if actually BE
+IF_BE(<	rev	r3, r3>)
+
 	pop	{r4,r5,r6}
 
 	C Store bytes, one by one.
diff --git a/arm/memxor3.asm b/arm/memxor3.asm
index 139fd208..69598e1c 100644
--- a/arm/memxor3.asm
+++ b/arm/memxor3.asm
@@ -49,6 +49,11 @@ define(<ATNC>, <r10>)
 define(<BCNT>, <r11>)
 define(<BTNC>, <r12>)
 
+C little-endian and big-endian need to shift in different directions for
+C alignment correction
+define(<S0ADJ>, IF_LE(<lsr>, <lsl>))
+define(<S1ADJ>, IF_LE(<lsl>, <lsr>))
+
 	.syntax unified
 
 	.file "memxor3.asm"
@@ -124,6 +129,8 @@ PROLOGUE(nettle_memxor3)
 	C
 	C With little-endian, we need to do
 	C DST[i-i] ^= (SRC[i-i] >> CNT) ^ (SRC[i] << TNC)
+	C With big-endian, we need to do
+	C DST[i-i] ^= (SRC[i-i] << CNT) ^ (SRC[i] >> TNC)
 
 	rsb	ATNC, ACNT, #32
 	bic	BP, #3
@@ -138,14 +145,14 @@ PROLOGUE(nettle_memxor3)
 .Lmemxor3_au_loop:
 	ldr	r5, [BP, #-4]!
 	ldr	r6, [AP, #-4]!
-	eor	r6, r6, r4, lsl ATNC
-	eor	r6, r6, r5, lsr ACNT
+	eor	r6, r6, r4, S1ADJ ATNC
+	eor	r6, r6, r5, S0ADJ ACNT
 	str	r6, [DST, #-4]!
 .Lmemxor3_au_odd:
 	ldr	r4, [BP, #-4]!
 	ldr	r6, [AP, #-4]!
-	eor	r6, r6, r5, lsl ATNC
-	eor	r6, r6, r4, lsr ACNT
+	eor	r6, r6, r5, S1ADJ ATNC
+	eor	r6, r6, r4, S0ADJ ACNT
 	str	r6, [DST, #-4]!
 	subs	N, #8
 	bcs	.Lmemxor3_au_loop
@@ -154,7 +161,11 @@ PROLOGUE(nettle_memxor3)
 
 	C Leftover bytes in r4, low end
 	ldr	r5, [AP, #-4]
-	eor	r4, r5, r4, lsl ATNC
+	eor	r4, r5, r4, S1ADJ ATNC
+
+	C leftover does an LSB store
+	C so we need to reverse if actually BE
+IF_BE(<	rev	r4, r4>)
 
 .Lmemxor3_au_leftover:
 	C Store a byte at a time
@@ -247,21 +258,25 @@ PROLOGUE(nettle_memxor3)
 	ldr	r5, [AP, #-4]!
 	ldr	r6, [BP, #-4]!
 	eor	r5, r6
-	lsl	r4, ATNC
-	eor	r4, r4, r5, lsr ACNT
+	S1ADJ	r4, ATNC
+	eor	r4, r4, r5, S0ADJ ACNT
 	str	r4, [DST, #-4]!
 .Lmemxor3_uu_odd:
 	ldr	r4, [AP, #-4]!
 	ldr	r6, [BP, #-4]!
 	eor	r4, r6
-	lsl	r5, ATNC
-	eor	r5, r5, r4, lsr ACNT
+	S1ADJ	r5, ATNC
+	eor	r5, r5, r4, S0ADJ ACNT
 	str	r5, [DST, #-4]!
 	subs	N, #8
 	bcs	.Lmemxor3_uu_loop
 	adds	N, #8
 	beq	.Lmemxor3_done
 
+	C leftover does an LSB store
+	C so we need to reverse if actually BE
+IF_BE(<	rev	r4, r4>)
+
 	C Leftover bytes in a4, low end
 	ror	r4, ACNT
 .Lmemxor3_uu_leftover:
@@ -290,18 +305,18 @@ PROLOGUE(nettle_memxor3)
 .Lmemxor3_uud_loop:
 	ldr	r5, [AP, #-4]!
 	ldr	r7, [BP, #-4]!
-	lsl	r4, ATNC
-	eor	r4, r4, r6, lsl BTNC
-	eor	r4, r4, r5, lsr ACNT
-	eor	r4, r4, r7, lsr BCNT
+	S1ADJ	r4, ATNC
+	eor	r4, r4, r6, S1ADJ BTNC
+	eor	r4, r4, r5, S0ADJ ACNT
+	eor	r4, r4, r7, S0ADJ BCNT
 	str	r4, [DST, #-4]!
 .Lmemxor3_uud_odd:
 	ldr	r4, [AP, #-4]!
 	ldr	r6, [BP, #-4]!
-	lsl	r5, ATNC
-	eor	r5, r5, r7, lsl BTNC
-	eor	r5, r5, r4, lsr ACNT
-	eor	r5, r5, r6, lsr BCNT
+	S1ADJ	r5, ATNC
+	eor	r5, r5, r7, S1ADJ BTNC
+	eor	r5, r5, r4, S0ADJ ACNT
+	eor	r5, r5, r6, S0ADJ BCNT
 	str	r5, [DST, #-4]!
 	subs	N, #8
 	bcs	.Lmemxor3_uud_loop
diff --git a/arm/neon/chacha-core-internal.asm b/arm/neon/chacha-core-internal.asm
index 6f623106..43bacda6 100644
--- a/arm/neon/chacha-core-internal.asm
+++ b/arm/neon/chacha-core-internal.asm
@@ -90,31 +90,50 @@ PROLOGUE(_nettle_chacha_core)
 	vmov	S2, X2
 	vmov	S3, X3
 
-	C Input rows:
+	C Input rows little-endian:
 	C	 0  1  2  3	X0
 	C	 4  5  6  7	X1
 	C	 8  9 10 11	X2
 	C	12 13 14 15	X3
+	C Input rows big-endian:
+	C	 2  3  0  1	X0
+	C	 6  7  4  5	X1
+	C	10 11  8  9	X2
+	C	14 15 12 13	X3
+	C because vldm switches doublewords
+
 .Loop:
 	QROUND(X0, X1, X2, X3)
 
-	C Rotate rows, to get
+	C In little-endian rotate rows, to get
 	C	 0  1  2  3
 	C	 5  6  7  4  >>> 3
 	C	10 11  8  9  >>> 2
 	C	15 12 13 14  >>> 1
-	vext.32	X1, X1, X1, #1
+
+	C In big-endian rotate rows, to get
+	C	 2  3  0  1
+	C	 7  4  5  6  >>> 3
+	C	 8  9 10 11  >>> 2
+	C	13 14 15 12  >>> 1
+
+	C vext extracts from the top of the vector instead of bottom
+IF_LE(<	vext.32	X1, X1, X1, #1>)
+IF_BE(<	vext.32	X1, X1, X1, #3>)
 	vext.32	X2, X2, X2, #2
-	vext.32	X3, X3, X3, #3
+IF_LE(<	vext.32	X3, X3, X3, #3>)
+IF_BE(<	vext.32	X3, X3, X3, #1>)
 
 	QROUND(X0, X1, X2, X3)
 
 	subs	ROUNDS, ROUNDS, #2
 	C Inverse rotation
-	vext.32	X1, X1, X1, #3
+IF_LE(<	vext.32	X1, X1, X1, #3>)
+IF_BE(<	vext.32	X1, X1, X1, #1>)
 	vext.32	X2, X2, X2, #2
-	vext.32	X3, X3, X3, #1
+IF_LE(<	vext.32	X3, X3, X3, #1>)
+IF_BE(<	vext.32	X3, X3, X3, #3>)
 
 	bhi	.Loop
 
@@ -123,6 +142,12 @@ PROLOGUE(_nettle_chacha_core)
 	vadd.u32	X2, X2, S2
 	vadd.u32	X3, X3, S3
 
+	C caller expects result little-endian
+IF_BE(<	vrev32.u8	X0, X0
+	vrev32.u8	X1, X1
+	vrev32.u8	X2, X2
+	vrev32.u8	X3, X3>)
+
 	vstm	DST, {X0,X1,X2,X3}
 	bx	lr
 EPILOGUE(_nettle_chacha_core)
diff --git a/arm/neon/salsa20-core-internal.asm b/arm/neon/salsa20-core-internal.asm
index 34eb1fba..12a812d8 100644
--- a/arm/neon/salsa20-core-internal.asm
+++ b/arm/neon/salsa20-core-internal.asm
@@ -88,7 +88,7 @@ define(<QROUND>, <
 PROLOGUE(_nettle_salsa20_core)
 	vldm	SRC, {X0,X1,X2,X3}
 
-	C Input rows:
+	C Input rows little-endian:
 	C	 0  1  2  3	X0
 	C	 4  5  6  7	X1
 	C	 8  9 10 11	X2
@@ -99,6 +99,18 @@ PROLOGUE(_nettle_salsa20_core)
 	C	 8 13  2  7
 	C	12  1  6 11
 
+	C Input rows big-endian:
+	C	 2  3  0  1	X0
+	C	 6  7  4  5	X1
+	C	10 11  8  9	X2
+	C	14 15 12 13	X3
+	C because vldm switches doublewords
+	C Permuted to:
+	C	10 15  0  5
+	C	14  3  4  9
+	C	 2  7  8 13
+	C	 6 11 12  1
+
 	C FIXME: Construct in some other way?
 	adr	r12, .Lmasks
 	vldm	r12, {M0101, M0110, M0011}
@@ -112,6 +124,7 @@ PROLOGUE(_nettle_salsa20_core)
 	C	 4  1  6  3	T0 v
 	C	 8 13 10 15	T1 ^
 	C	12  9 14 11	X3 v
+	C same in big endian just with transposed double-rows
 	vmov	T0, X1
 	vmov	T1, X2
 	vbit	T0, X0, M0101
@@ -140,22 +153,34 @@
 .Loop:
 	QROUND(X0, X1, X2, X3)
 
-	C Rotate rows, to get
+	C In little-endian rotate rows, to get
 	C	 0  5 10 15
 	C	 3  4  9 14  >>> 1
 	C	 2  7  8 13  >>> 2
 	C	 1  6 11 12  >>> 3
-	vext.32	X1, X1, X1, #3
+
+	C In big-endian rotate rows, to get
+	C	10 15  0  5
+	C	 9 14  3  4  >>> 1
+	C	 8 13  2  7  >>> 2
+	C	11 12  1  6  >>> 3
+
+	C vext extracts from the top of the vector instead of bottom
+IF_LE(<	vext.32	X1, X1, X1, #3>)
+IF_BE(<	vext.32	X1, X1, X1, #1>)
 	vext.32	X2, X2, X2, #2
-	vext.32	X3, X3, X3, #1
+IF_LE(<	vext.32	X3, X3, X3, #1>)
+IF_BE(<	vext.32	X3, X3, X3, #3>)
 
 	QROUND(X0, X3, X2, X1)
 
 	subs	ROUNDS, ROUNDS, #2
 	C Inverse rotation
-	vext.32	X1, X1, X1, #1
+IF_LE(<	vext.32	X1, X1, X1, #1>)
+IF_BE(<	vext.32	X1, X1, X1, #3>)
 	vext.32	X2, X2, X2, #2
-	vext.32	X3, X3, X3, #3
+IF_LE(<	vext.32	X3, X3, X3, #3>)
+IF_BE(<	vext.32	X3, X3, X3, #1>)
 
 	bhi	.Loop
 
@@ -181,6 +206,12 @@ PROLOGUE(_nettle_salsa20_core)
 	vadd.u32	X2, X2, S2
 	vadd.u32	X3, X3, S3
 
+	C caller expects result little-endian
+IF_BE(<	vrev32.u8	X0, X0
+	vrev32.u8	X1, X1
+	vrev32.u8	X2, X2
+	vrev32.u8	X3, X3>)
+
 	vstm	DST, {X0,X1,X2,X3}
 	bx	lr
 EPILOGUE(_nettle_salsa20_core)
diff --git a/arm/neon/umac-nh.asm b/arm/neon/umac-nh.asm
index 158a5686..2b617202 100644
--- a/arm/neon/umac-nh.asm
+++ b/arm/neon/umac-nh.asm
@@ -97,6 +97,8 @@ PROLOGUE(_nettle_umac_nh)
 	bhi	.Loop
 
 	vadd.i64	D0REG(QY), D0REG(QY), D1REG(QY)
-	vmov	r0, r1, D0REG(QY)
+	C return values use memory endianness
+IF_LE(<	vmov	r0, r1, D0REG(QY)>)
+IF_BE(<	vmov	r1, r0, D0REG(QY)>)
 	bx	lr
 EPILOGUE(_nettle_umac_nh)
diff --git a/arm/v6/sha1-compress.asm b/arm/v6/sha1-compress.asm
index 59d6297e..8cc22be7 100644
--- a/arm/v6/sha1-compress.asm
+++ b/arm/v6/sha1-compress.asm
@@ -52,7 +52,7 @@ define(<LOAD>, <
 	sel	W, WPREV, T0
 	ror	W, W, SHIFT
 	mov	WPREV, T0
-	rev	W, W
+IF_LE(<	rev	W, W>)
 	str	W, [SP,#eval(4*$1)]
 >)
 define(<EXPN>, <
@@ -127,8 +127,12 @@ PROLOGUE(_nettle_sha1_compress)
 	lsl	SHIFT, SHIFT, #3
 	mov	T0, #0
 	movne	T0, #-1
-	lsl	W, T0, SHIFT
+IF_LE(<	lsl	W, T0, SHIFT>)
+IF_BE(<	lsr	W, T0, SHIFT>)
 	uadd8	T0, T0, W	C Sets APSR.GE bits
+	C on BE rotate right by 32-SHIFT bits
+	C because there is no rotate left
+IF_BE(<	rsb	SHIFT, SHIFT, #32>)
 
 	ldr	K, .LK1
 	ldm	STATE, {SA,SB,SC,SD,SE}
diff --git a/arm/v6/sha256-compress.asm b/arm/v6/sha256-compress.asm
index e6f4e1e9..324730c7 100644
--- a/arm/v6/sha256-compress.asm
+++ b/arm/v6/sha256-compress.asm
@@ -137,8 +137,12 @@ PROLOGUE(_nettle_sha256_compress)
 	lsl	SHIFT, SHIFT, #3
 	mov	T0, #0
 	movne	T0, #-1
-	lsl	I1, T0, SHIFT
+IF_LE(<	lsl	I1, T0, SHIFT>)
+IF_BE(<	lsr	I1, T0, SHIFT>)
 	uadd8	T0, T0, I1	C Sets APSR.GE bits
+	C on BE rotate right by 32-SHIFT bits
+	C because there is no rotate left
+IF_BE(<	rsb	SHIFT, SHIFT, #32>)
 
 	mov	DST, sp
 	mov	ILEFT, #4
@@ -146,16 +150,16 @@
 	ldm	INPUT!, {I1,I2,I3,I4}
 	sel	I0, I0, I1
 	ror	I0, I0, SHIFT
-	rev	I0, I0
+IF_LE(<	rev	I0, I0>)
 	sel	I1, I1, I2
 	ror	I1, I1, SHIFT
-	rev	I1, I1
+IF_LE(<	rev	I1, I1>)
 	sel	I2, I2, I3
 	ror	I2, I2, SHIFT
-	rev	I2, I2
+IF_LE(<	rev	I2, I2>)
 	sel	I3, I3, I4
 	ror	I3, I3, SHIFT
-	rev	I3, I3
+IF_LE(<	rev	I3, I3>)
 	subs	ILEFT, ILEFT, #1
 	stm	DST!, {I0,I1,I2,I3}
 	mov	I0, I4
diff --git a/asm.m4 b/asm.m4
index 4018c235..8c290551 100644
--- a/asm.m4
+++ b/asm.m4
@@ -51,6 +51,14 @@ define(<ALIGN>, <.align ifelse(ALIGN_LOG,yes,<m4_log2($1)>,$1)
 >)
 
+define(<IF_BE>, <ifelse(
+WORDS_BIGENDIAN,yes,<$1>,
+WORDS_BIGENDIAN,no,<$2>,
+<errprint(__file__:__line__:,<Unsupported endianness value>,WORDS_BIGENDIAN,<
+>)
+  m4exit(1)>)>)
+define(<IF_LE>, <IF_BE(<$2>, <$1>)>)
+
 dnl Struct defining macros
 
 dnl STRUCTURE(prefix)
diff --git a/config.m4.in b/config.m4.in
index e39c880c..11f90a40 100644
--- a/config.m4.in
+++ b/config.m4.in
@@ -7,6 +7,7 @@ define(<TYPE_PROGBITS>, <@ASM_TYPE_PROGBITS@>)dnl
 define(<ALIGN_LOG>, <@ASM_ALIGN_LOG@>)dnl
 define(<W64_ABI>, <@W64_ABI@>)dnl
 define(<RODATA>, <@ASM_RODATA@>)dnl
+define(<WORDS_BIGENDIAN>, <@ASM_WORDS_BIGENDIAN@>)dnl
 divert(1)
 @ASM_MARK_NOEXEC_STACK@
 divert
diff --git a/configure.ac b/configure.ac
index 41bf0998..b57d2fb8 100644
--- a/configure.ac
+++ b/configure.ac
@@ -201,7 +201,11 @@ LSH_FUNC_STRERROR
 # getenv_secure is used for fat overrides,
 # getline is used in the testsuite
 AC_CHECK_FUNCS(secure_getenv getline)
-AC_C_BIGENDIAN
+
+ASM_WORDS_BIGENDIAN=unknown
+AC_C_BIGENDIAN([AC_DEFINE([WORDS_BIGENDIAN], 1)
+                ASM_WORDS_BIGENDIAN=yes],
+               [ASM_WORDS_BIGENDIAN=no])
 
 AC_CACHE_CHECK([for __builtin_bswap64],
   nettle_cv_c_builtin_bswap64,
@@ -811,6 +815,7 @@ AC_SUBST(ASM_TYPE_PROGBITS)
 AC_SUBST(ASM_MARK_NOEXEC_STACK)
 AC_SUBST(ASM_ALIGN_LOG)
 AC_SUBST(W64_ABI)
+AC_SUBST(ASM_WORDS_BIGENDIAN)
 AC_SUBST(EMULATOR)
 
 AC_SUBST(LIBNETTLE_MAJOR)