Hi Niels,
On Wed, Feb 07, 2018 at 01:13:32PM +0100, Niels Möller wrote:
What's the host triplet?
armv7veb-hardfloat-linux-gnueabi
^^
And the "eb" is for big-endian?
Only the b, actually. The "ve" stands for virtualization extensions: http://gcc.gnu.org/ml/gcc-patches/2013-12/msg01783.html. But that's just my fancy. More common triples would most likely use armv7b or armv7eb, and the above should perhaps have been armv7veeb. :)
# define WORDS_BIGENDIAN 1
Can you check if it's detected correctly also when cross-compiling?
# ./configure --host=armv7veb-hardfloat-linux-gnueabi
checking build system type... x86_64-unknown-linux-gnu
checking host system type... armv7veb-hardfloat-linux-gnueabi
[...]
configure: summary of build options:

  Version:           nettle 3.4
  Host type:         armv7veb-hardfloat-linux-gnueabi
  ABI:               standard
  Assembly files:    arm/v6 arm
  Install prefix:    /usr/local
  Library directory: ${exec_prefix}/lib
  Compiler:          armv7veb-hardfloat-linux-gnueabi-gcc
  Static libraries:  yes
  Shared libraries:  yes
  Public key crypto: no
  Using mini-gmp:    no
  Documentation:     yes
# grep WORDS_BIG config.h
/* Define WORDS_BIGENDIAN to 1 if your processor stores words with the most
# define WORDS_BIGENDIAN 1
# ifndef WORDS_BIGENDIAN
# define WORDS_BIGENDIAN 1
Seems fine.
FAIL: memxor
This also does some tricks with word reads and rotate. (The C code does that too, but with conditions on WORDS_BIGENDIAN).
I think I got memxor, sha1 and sha256 sorted. Patch below.
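To illustrate the alignment correction that the memxor patch deals with, here is a minimal C sketch (not the nettle code). It assumes two adjacent aligned 32-bit words, w0 at the lower address and w1 at the higher one, and rebuilds the word that starts `offset` bytes into w0. The function name combine_unaligned and its parameters are made up for this example; the point is only that the two shift directions swap between little- and big-endian.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only.
 * w0 and w1 are two adjacent aligned words as loaded by the host
 * (w0 from the lower address). The result is the word an unaligned
 * load at byte offset `offset` (1..3) into w0 would have produced. */
uint32_t combine_unaligned(uint32_t w0, uint32_t w1,
                           unsigned offset, int big_endian)
{
    unsigned cnt, tnc;

    assert(offset >= 1 && offset <= 3);
    cnt = 8 * offset;   /* bits to drop from w0 */
    tnc = 32 - cnt;     /* bits to take from w1 */

    /* Little-endian: the first byte in memory is the least significant,
     * so the wanted bytes of w0 are shifted down and w1 is shifted up.
     * Big-endian: exactly the other way round. */
    return big_endian ? (w0 << cnt) | (w1 >> tnc)
                      : (w0 >> cnt) | (w1 << tnc);
}
```

With the bytes 00 11 22 33 44 55 66 77 in memory, a little-endian host loads w0 = 0x33221100 and w1 = 0x77665544, while a big-endian host loads 0x00112233 and 0x44556677. In both cases the result for offset 1 corresponds to the bytes 11 22 33 44.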
FAIL: chacha
The chacha code doesn't look endian-dependent to me. I'd guess it's a consequence of incorrect memxor (below).
This one is still failing, even though memxor and sha are fixed. I've been looking at the code and can't find any apparent reason. In chacha-core-internal.c I see the following bit of code that does seem to do endianness handling:
dst[i] = LE_SWAP32 (t);
Would this apply to chacha-core-internal.asm, too?
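For reference, a rough C sketch of what a swap of the LE_SWAP32 kind is generally expected to do: nothing on a little-endian host and a byte reversal on a big-endian one. This is not nettle's macro definition. The names bswap32 and le_swap32, and the explicit host flag, are assumptions made for the example (the real code selects at compile time via WORDS_BIGENDIAN).

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the byte order of a 32-bit word. */
uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000ff00u)
         | ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* Hypothetical stand-in for an LE_SWAP32-style conversion: returns the
 * value whose in-memory representation on this host is the
 * little-endian encoding of x. */
uint32_t le_swap32(uint32_t x, int host_is_big_endian)
{
    assert(host_is_big_endian == 0 || host_is_big_endian == 1);
    return host_is_big_endian ? bswap32(x) : x;
}
```

If the assembly version stores its output words directly, it would presumably need an equivalent byte reversal on big-endian wherever the C code applies the swap.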
FAIL: umac
Similar problem, I would guess. But this time, loading 64 bits at a time into neon registers.
I'm drawing a bit of a blank on this one. It fails on the very first test case of umac32, where only umac-nh is used and all the input is zeroes. So there does seem to be another endianness dependency in the actual computation code. Have I understood correctly that vld1.8 reads a byte stream and should therefore be endianness-neutral anyway, and that the keys are in host endianness?
If you feel like it, v6/aes-*.asm could also use better code for aligned reading of input data.
Huh, getting existing code to work again is one thing. But actual better code is certainly beyond me. :-/
Aarch64 assembly (for both endian flavors) would be nice, but it's a separate project. I haven't yet looked into aarch64-assembly. I made an attempt to build nettle under termux on my android phone a while ago, but it failed because it didn't provide /bin/sh at the expected place.
Sorry, I think I had confused nettle with another library I came across during debugging which had armv8 code. Again, I think I should leave producing actually working and efficient assembler code to someone who knows what they're doing. :)
Before attempting to support big-endian arm, I'd need some idea on how to test it.
Any halfway current ARM cross toolchain should also be able to output big-endian arm binaries (-mbig-endian). Then you could test those with qemu-user-armeb, which is very light-weight in that it doesn't need a kernel or emulated system and allows running binaries directly.
Sounds good. I hope the needed tools are packaged in debian, I'll have to check that.
I was wrong: While the compiler is able to output big-endian objects with -mbig-endian, it needs matching libs as well (e.g. libgcc_s). Debian doesn't have anything precompiled for armeb. They refer you to Linaro's toolchains or rebootstrap for building from scratch instead (I do something similar with crossdev on Gentoo).
This Linaro toolchain works for me: https://releases.linaro.org/components/toolchain/binaries/latest/armeb-linux...
michael@debian:~/nettle$ PATH=$HOME/gcc-linaro-7.2.1-2017.11-x86_64_armeb-linux-gnueabihf/bin:$PATH ./configure --host=armeb-linux-gnueabihf
michael@debian:~/nettle$ make
[...]
michael@debian:~/nettle$ file libnettle.so
libnettle.so: ELF 32-bit MSB shared object, ARM, EABI5 BE8 version 1 (SYSV), dynamically linked, BuildID[sha1]=1a8daa9c1d3e61b9d99d34f462337d02c47c9d74, with debug_info, not stripped
michael@debian:~/nettle$ make testsuite/sha1-test
Now qemu can be installed, which automatically registers with binfmt so that arm binaries can just be executed:
michael@debian:~/nettle$ sudo apt-get install qemu-user-static
michael@debian:~/nettle$ file testsuite/sha1-test
testsuite/sha1-test: ELF 32-bit MSB executable, ARM, EABI5 BE8 version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-armhf.so.3, for GNU/Linux 3.2.0, BuildID[sha1]=ec39b7153f4c09d11cac92d34c8e509bb1f4d0a0, with debug_info, not stripped
michael@debian:~/nettle$ testsuite/sha1-test
/lib/ld-linux-armhf.so.3: No such file or directory
michael@debian:~/nettle$ QEMU_LD_PREFIX=$HOME/gcc-linaro-7.2.1-2017.11-x86_64_armeb-linux-gnueabihf/armeb-linux-gnueabihf/libc testsuite/sha1-test
qemu: uncaught target signal 11 (Segmentation fault) - core dumped
Segmentation fault
This segfaults because of a bug in qemu where it tries to use the host's /etc/ld.so.cache. Deleting it "solves" that. Alternatively, the test could be run in a chroot to avoid the segfault, but that would require some fiddling with the compiler's sysroot.
michael@debian:~/nettle$ sudo rm /etc/ld.so.cache
michael@debian:~/nettle$ QEMU_LD_PREFIX=$HOME/gcc-linaro-7.2.1-2017.11-x86_64_armeb-linux-gnueabihf/armeb-linux-gnueabihf/libc testsuite/sha1-test
testsuite/sha1-test: error while loading shared libraries: libnettle.so.6: cannot open shared object file: No such file or directory
michael@debian:~/nettle$ ln -sfn libnettle.so libnettle.so.6
michael@debian:~/nettle$ QEMU_LD_PREFIX=$HOME/gcc-linaro-7.2.1-2017.11-x86_64_armeb-linux-gnueabihf/armeb-linux-gnueabihf/libc LD_LIBRARY_PATH=. testsuite/sha1-test
This worked because configure detected only generic arm support:
Assembly files: arm
So plain arm assembly seems to be BE-safe. :) After hacking configure to also enable arm/v6 with this triple I get:
michael@debian:~/nettle$ QEMU_LD_PREFIX=$HOME/gcc-linaro-7.2.1-2017.11-x86_64_armeb-linux-gnueabihf/armeb-linux-gnueabihf/libc LD_LIBRARY_PATH=. testsuite/sha1-test
Got:
9844f81e1408f6ec b932137d33bed7cf dcf518a3
Expected:
da39a3ee5e6b4b0d 3255bfef95601890 afd80709
qemu: uncaught target signal 6 (Aborted) - core dumped
Aborted
Which seems about right. With the patch that goes away:
michael@debian:~/nettle$ git am 0001-Support-big-endian-arm-in-sha1-armv6-assembly-code.patch
Applying: Support big-endian arm in sha1 armv6 assembly code
[make && make check]
michael@debian:~/nettle$ QEMU_LD_PREFIX=$HOME/gcc-linaro-7.2.1-2017.11-x86_64_armeb-linux-gnueabihf/armeb-linux-gnueabihf/libc LD_LIBRARY_PATH=. testsuite/sha1-test
michael@debian:~/nettle$
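To show why the byte reversal in the SHA-1 input loading only belongs on little-endian, here is a small C sketch under stated assumptions. SHA-1 consumes its message as big-endian 32-bit words, so a native word load already has the right value on a big-endian host and needs reversing only on a little-endian one. The function names below are hypothetical and the host flag is a runtime argument purely for illustration; this is not how the assembly is structured.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Runtime probe: 1 on a big-endian host, 0 on a little-endian one. */
int host_is_big_endian(void)
{
    const uint16_t probe = 1;
    uint8_t first;
    memcpy(&first, &probe, 1);
    return first == 0;
}

/* Reference: assemble the big-endian word byte by byte (endian-neutral). */
uint32_t load_be32_bytes(const uint8_t *p)
{
    assert(p != NULL);
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Word-load variant: one native load, then a byte reversal only on
 * little-endian hosts (the role "rev" plays when wrapped in IF_LE). */
uint32_t load_be32_word(const uint8_t *p, int big_endian_host)
{
    uint32_t w;

    assert(p != NULL);
    memcpy(&w, p, sizeof w);
    if (!big_endian_host)
        w = (w >> 24) | ((w >> 8) & 0x0000ff00u)
          | ((w << 8) & 0x00ff0000u) | (w << 24);
    return w;
}
```

Both loaders should agree on any host when load_be32_word is passed the result of host_is_big_endian(). Applying the reversal unconditionally on a big-endian host would feed byte-swapped words into the compression function, which is consistent with the wrong digest shown above.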
I also tried rebootstrap but this quickly got really involved.
--- a/configure.ac
+++ b/configure.ac
@@ -691,6 +691,7 @@ ASM_TYPE_FUNCTION='@function'
 ASM_TYPE_PROGBITS='@progbits'
 ASM_MARK_NOEXEC_STACK=''
 ASM_ALIGN_LOG=''
+ASM_WORDS_BIGENDIAN="$ac_cv_c_bigendian"
If you have the time, it would be good to file an autoconf bug report, asking them to document (and support) that AC_C_BIGENDIAN sets the shell variable ac_cv_c_bigendian.
Instead, I augmented the default action (which is documented and shouldn't change) so that it sets ASM_WORDS_BIGENDIAN directly. This should also make the explicit value checking in IF_BE redundant, because we now know for sure that configure will never emit anything other than yes or no: the documentation says that AC_C_BIGENDIAN will abort if endianness can't be determined.
From db70ecccdc65a97c103f3900b4f45d8370c1dd62 Mon Sep 17 00:00:00 2001
From: Michael Weiser <michael.weiser@gmx.de>
Date: Wed, 7 Feb 2018 00:11:24 +0100
Subject: [PATCH] Support big-endian arm in assembly code

Introduce m4 macros to conditionally handle differences of little- and
big-endian arm in assembler code. Adjust sha1-compress, sha256-compress
and memxor for arm to work in big-endian mode.
---
 arm/memxor.asm             | 21 +++++++++++++++-----
 arm/memxor3.asm            | 49 ++++++++++++++++++++++++++++++----------------
 arm/v6/sha1-compress.asm   |  8 ++++++--
 arm/v6/sha256-compress.asm | 14 ++++++++-----
 asm.m4                     |  3 +++
 config.m4.in               |  1 +
 configure.ac               |  5 ++++-
 7 files changed, 71 insertions(+), 30 deletions(-)
diff --git a/arm/memxor.asm b/arm/memxor.asm
index a50e91bc..239a4034 100644
--- a/arm/memxor.asm
+++ b/arm/memxor.asm
@@ -44,6 +44,11 @@ define(<N>, <r2>)
 define(<CNT>, <r6>)
 define(<TNC>, <r12>)
 
+C little-endian and big-endian need to shift in different directions for
+C alignment correction
+define(<S0ADJ>, IF_LE(<lsr>, <lsl>))
+define(<S1ADJ>, IF_LE(<lsl>, <lsr>))
+
 	.syntax unified
 
 	.file "memxor.asm"
@@ -99,6 +104,8 @@ PROLOGUE(nettle_memxor)
 	C
 	C With little-endian, we need to do
 	C DST[i] ^= (SRC[i] >> CNT) ^ (SRC[i+1] << TNC)
+	C With big-endian, we need to do
+	C DST[i] ^= (SRC[i] << CNT) ^ (SRC[i+1] >> TNC)
 
 	push	{r4,r5,r6}
 
@@ -117,14 +124,14 @@ PROLOGUE(nettle_memxor)
 .Lmemxor_word_loop:
 	ldr	r5, [SRC], #+4
 	ldr	r3, [DST]
-	eor	r3, r3, r4, lsr CNT
-	eor	r3, r3, r5, lsl TNC
+	eor	r3, r3, r4, S0ADJ CNT
+	eor	r3, r3, r5, S1ADJ TNC
 	str	r3, [DST], #+4
 .Lmemxor_odd:
 	ldr	r4, [SRC], #+4
 	ldr	r3, [DST]
-	eor	r3, r3, r5, lsr CNT
-	eor	r3, r3, r4, lsl TNC
+	eor	r3, r3, r5, S0ADJ CNT
+	eor	r3, r3, r4, S1ADJ TNC
 	str	r3, [DST], #+4
 	subs	N, #8
 	bcs	.Lmemxor_word_loop
@@ -132,10 +139,14 @@ PROLOGUE(nettle_memxor)
 	beq	.Lmemxor_odd_done
 
 	C We have TNC/8 left-over bytes in r4, high end
-	lsr	r4, CNT
+	S0ADJ	r4, CNT
 	ldr	r3, [DST]
 	eor	r3, r4
 
+	C memxor_leftover does an LSB store
+	C so we need to reverse if actually BE
+IF_BE(<	rev	r3, r3>)
+
 	pop	{r4,r5,r6}
 
 	C Store bytes, one by one.
diff --git a/arm/memxor3.asm b/arm/memxor3.asm
index 139fd208..69598e1c 100644
--- a/arm/memxor3.asm
+++ b/arm/memxor3.asm
@@ -49,6 +49,11 @@ define(<ATNC>, <r10>)
 define(<BCNT>, <r11>)
 define(<BTNC>, <r12>)
 
+C little-endian and big-endian need to shift in different directions for
+C alignment correction
+define(<S0ADJ>, IF_LE(<lsr>, <lsl>))
+define(<S1ADJ>, IF_LE(<lsl>, <lsr>))
+
 	.syntax unified
 
 	.file "memxor3.asm"
@@ -124,6 +129,8 @@ PROLOGUE(nettle_memxor3)
 	C
 	C With little-endian, we need to do
 	C DST[i-i] ^= (SRC[i-i] >> CNT) ^ (SRC[i] << TNC)
+	C With big-endian, we need to do
+	C DST[i-i] ^= (SRC[i-i] << CNT) ^ (SRC[i] >> TNC)
 	rsb	ATNC, ACNT, #32
 	bic	BP, #3
 
@@ -138,14 +145,14 @@ PROLOGUE(nettle_memxor3)
 .Lmemxor3_au_loop:
 	ldr	r5, [BP, #-4]!
 	ldr	r6, [AP, #-4]!
-	eor	r6, r6, r4, lsl ATNC
-	eor	r6, r6, r5, lsr ACNT
+	eor	r6, r6, r4, S1ADJ ATNC
+	eor	r6, r6, r5, S0ADJ ACNT
 	str	r6, [DST, #-4]!
 .Lmemxor3_au_odd:
 	ldr	r4, [BP, #-4]!
 	ldr	r6, [AP, #-4]!
-	eor	r6, r6, r5, lsl ATNC
-	eor	r6, r6, r4, lsr ACNT
+	eor	r6, r6, r5, S1ADJ ATNC
+	eor	r6, r6, r4, S0ADJ ACNT
 	str	r6, [DST, #-4]!
 	subs	N, #8
 	bcs	.Lmemxor3_au_loop
@@ -154,7 +161,11 @@ PROLOGUE(nettle_memxor3)
 
 	C Leftover bytes in r4, low end
 	ldr	r5, [AP, #-4]
-	eor	r4, r5, r4, lsl ATNC
+	eor	r4, r5, r4, S1ADJ ATNC
+
+	C leftover does an LSB store
+	C so we need to reverse if actually BE
+IF_BE(<	rev	r4, r4>)
 
 .Lmemxor3_au_leftover:
 	C Store a byte at a time
@@ -247,21 +258,25 @@ PROLOGUE(nettle_memxor3)
 	ldr	r5, [AP, #-4]!
 	ldr	r6, [BP, #-4]!
 	eor	r5, r6
-	lsl	r4, ATNC
-	eor	r4, r4, r5, lsr ACNT
+	S1ADJ	r4, ATNC
+	eor	r4, r4, r5, S0ADJ ACNT
 	str	r4, [DST, #-4]!
 .Lmemxor3_uu_odd:
 	ldr	r4, [AP, #-4]!
 	ldr	r6, [BP, #-4]!
 	eor	r4, r6
-	lsl	r5, ATNC
-	eor	r5, r5, r4, lsr ACNT
+	S1ADJ	r5, ATNC
+	eor	r5, r5, r4, S0ADJ ACNT
 	str	r5, [DST, #-4]!
 	subs	N, #8
 	bcs	.Lmemxor3_uu_loop
 	adds	N, #8
 	beq	.Lmemxor3_done
 
+	C leftover does an LSB store
+	C so we need to reverse if actually BE
+IF_BE(<	rev	r4, r4>)
+
 	C Leftover bytes in a4, low end
 	ror	r4, ACNT
 .Lmemxor3_uu_leftover:
@@ -290,18 +305,18 @@ PROLOGUE(nettle_memxor3)
 .Lmemxor3_uud_loop:
 	ldr	r5, [AP, #-4]!
 	ldr	r7, [BP, #-4]!
-	lsl	r4, ATNC
-	eor	r4, r4, r6, lsl BTNC
-	eor	r4, r4, r5, lsr ACNT
-	eor	r4, r4, r7, lsr BCNT
+	S1ADJ	r4, ATNC
+	eor	r4, r4, r6, S1ADJ BTNC
+	eor	r4, r4, r5, S0ADJ ACNT
+	eor	r4, r4, r7, S0ADJ BCNT
 	str	r4, [DST, #-4]!
 .Lmemxor3_uud_odd:
 	ldr	r4, [AP, #-4]!
 	ldr	r6, [BP, #-4]!
-	lsl	r5, ATNC
-	eor	r5, r5, r7, lsl BTNC
-	eor	r5, r5, r4, lsr ACNT
-	eor	r5, r5, r6, lsr BCNT
+	S1ADJ	r5, ATNC
+	eor	r5, r5, r7, S1ADJ BTNC
+	eor	r5, r5, r4, S0ADJ ACNT
+	eor	r5, r5, r6, S0ADJ BCNT
 	str	r5, [DST, #-4]!
 	subs	N, #8
 	bcs	.Lmemxor3_uud_loop
diff --git a/arm/v6/sha1-compress.asm b/arm/v6/sha1-compress.asm
index 59d6297e..52739b69 100644
--- a/arm/v6/sha1-compress.asm
+++ b/arm/v6/sha1-compress.asm
@@ -52,7 +52,7 @@ define(<LOAD>, <
 	sel	W, WPREV, T0
 	ror	W, W, SHIFT
 	mov	WPREV, T0
-	rev	W, W
+IF_LE(<	rev	W, W>)
 	str	W, [SP,#eval(4*$1)]
 >)
 define(<EXPN>, <
@@ -127,8 +127,12 @@ PROLOGUE(_nettle_sha1_compress)
 	lsl	SHIFT, SHIFT, #3
 	mov	T0, #0
 	movne	T0, #-1
-	lsl	W, T0, SHIFT
+IF_LE(<	lsl	W, T0, SHIFT>)
+IF_BE(<	lsr	W, T0, SHIFT>)
 	uadd8	T0, T0, W		C Sets APSR.GE bits
+	C on BE rotate right by 32-SHIFT bits
+	C because there is no rotate left
+IF_BE(<	rsb	SHIFT, SHIFT, #32>)
 
 	ldr	K, .LK1
 	ldm	STATE, {SA,SB,SC,SD,SE}
diff --git a/arm/v6/sha256-compress.asm b/arm/v6/sha256-compress.asm
index e6f4e1e9..324730c7 100644
--- a/arm/v6/sha256-compress.asm
+++ b/arm/v6/sha256-compress.asm
@@ -137,8 +137,12 @@ PROLOGUE(_nettle_sha256_compress)
 	lsl	SHIFT, SHIFT, #3
 	mov	T0, #0
 	movne	T0, #-1
-	lsl	I1, T0, SHIFT
+IF_LE(<	lsl	I1, T0, SHIFT>)
+IF_BE(<	lsr	I1, T0, SHIFT>)
 	uadd8	T0, T0, I1		C Sets APSR.GE bits
+	C on BE rotate right by 32-SHIFT bits
+	C because there is no rotate left
+IF_BE(<	rsb	SHIFT, SHIFT, #32>)
 
 	mov	DST, sp
 	mov	ILEFT, #4
@@ -146,16 +150,16 @@ PROLOGUE(_nettle_sha256_compress)
 	ldm	INPUT!, {I1,I2,I3,I4}
 	sel	I0, I0, I1
 	ror	I0, I0, SHIFT
-	rev	I0, I0
+IF_LE(<	rev	I0, I0>)
 	sel	I1, I1, I2
 	ror	I1, I1, SHIFT
-	rev	I1, I1
+IF_LE(<	rev	I1, I1>)
 	sel	I2, I2, I3
 	ror	I2, I2, SHIFT
-	rev	I2, I2
+IF_LE(<	rev	I2, I2>)
 	sel	I3, I3, I4
 	ror	I3, I3, SHIFT
-	rev	I3, I3
+IF_LE(<	rev	I3, I3>)
 	subs	ILEFT, ILEFT, #1
 	stm	DST!, {I0,I1,I2,I3}
 	mov	I0, I4
diff --git a/asm.m4 b/asm.m4
index 4018c235..343a55fc 100644
--- a/asm.m4
+++ b/asm.m4
@@ -51,6 +51,9 @@ define(<ALIGN>,
 <.align ifelse(ALIGN_LOG,yes,<m4_log2($1)>,$1)
 >)
 
+define(<IF_BE>, <ifelse(WORDS_BIGENDIAN,yes,<$1>,<$2>)>)
+define(<IF_LE>, <IF_BE(<$2>, <$1>)>)
+
 dnl Struct defining macros
 
 dnl STRUCTURE(prefix)
diff --git a/config.m4.in b/config.m4.in
index e39c880c..11f90a40 100644
--- a/config.m4.in
+++ b/config.m4.in
@@ -7,6 +7,7 @@ define(<TYPE_PROGBITS>, <@ASM_TYPE_PROGBITS@>)dnl
 define(<ALIGN_LOG>, <@ASM_ALIGN_LOG@>)dnl
 define(<W64_ABI>, <@W64_ABI@>)dnl
 define(<RODATA>, <@ASM_RODATA@>)dnl
+define(<WORDS_BIGENDIAN>, <@ASM_WORDS_BIGENDIAN@>)dnl
 divert(1)
 @ASM_MARK_NOEXEC_STACK@
 divert
diff --git a/configure.ac b/configure.ac
index 41bf0998..21eba3b5 100644
--- a/configure.ac
+++ b/configure.ac
@@ -201,7 +201,9 @@ LSH_FUNC_STRERROR
 # getenv_secure is used for fat overrides,
 # getline is used in the testsuite
 AC_CHECK_FUNCS(secure_getenv getline)
-AC_C_BIGENDIAN
+AC_C_BIGENDIAN([AC_DEFINE([WORDS_BIGENDIAN], 1)
+	[ASM_WORDS_BIGENDIAN=yes]],
+	[ASM_WORDS_BIGENDIAN=no])
 
 AC_CACHE_CHECK([for __builtin_bswap64],
     nettle_cv_c_builtin_bswap64,
@@ -811,6 +813,7 @@ AC_SUBST(ASM_TYPE_PROGBITS)
 AC_SUBST(ASM_MARK_NOEXEC_STACK)
 AC_SUBST(ASM_ALIGN_LOG)
 AC_SUBST(W64_ABI)
+AC_SUBST(ASM_WORDS_BIGENDIAN)
 AC_SUBST(EMULATOR)
 
 AC_SUBST(LIBNETTLE_MAJOR)