Hi Niels,
thanks for getting back to me so quickly.
On Tue, Feb 06, 2018 at 07:36:22PM +0100, Niels Möller wrote:
Is there maybe a problem with the list or my email that you can discern?
None previously known to me (it's a plain mailman installation). I think the most commonly used way to subscribe is to use the web frontend, and then go to the link in the confirmation email using a web browser.
Mhmm, with another From address and a few more tries I now got myself subscribed. Before that, a number of confirmation emails left my server and were accepted by mail.lysator.liu.se, but the responses never made it back to my server, even with that other From address.
Anywho, I'll leave all the quoted text for now so people can maybe still pick up on the conversation. Sorry for the mess.
I just ran into a problem where gnutls's certificate verification fails only on big-endian arm Linux boards but not on the otherwise identical little-endian ones. After recompiling nettle with --disable-assembler the problem goes away on big-endian arm as well. Considering that big-endian arm isn't all that common, I suspect nettle's optimised arm asm might have some endianness issues.
I have done no testing on big-endian arm. My recent big-endian tests have been on the ultrasparc t5 in the gcc compile farm (gcc202.fsffrance.org), and locally using debian's mips cross compiler and qemu. So I'm fairly confident that the C code is endian-safe.
Lots of questions, since I'm unfamiliar with such systems:
What board and linux (dist?) are you running this on?
I have a number of Cubieboard2s that run Gentoo Linux with a vanilla, mainline Linus kernel.
# uname -a
Linux b 4.15.0-gentoo #2 SMP Sun Feb 4 18:46:30 CET 2018 armv7b ARMv7 Processor rev 4 (v7b) Allwinner sun7i (A20) Family GNU/Linux
The only difference between little- and big-endian boards is the following Linux kernel config options:
-# CONFIG_CPU_BIG_ENDIAN is not set
+CONFIG_CPU_BIG_ENDIAN=y
+CONFIG_CPU_ENDIAN_BE8=y
This makes the kernel switch the CPU to big-endian mode on boot. Userland is big-endian as well.
Big-endian on ARM is somewhat curious in that instruction encoding stays little-endian but loads and stores use big-endian byte order - if the CPU is in that mode. It can be switched back and forth at will and it basically only changes where it starts loading/storing bytes and in which order it continues.
Because that would be too easy, it has two different big-endian operating modes called BE32 and BE8. From what I understand, BE32 actually stores bytes in the same order as little-endian in memory but redirects accesses to individual bytes of words to make them appear to be stored big-endian, while BE8 actually stores words in big-endian byte order and accesses individual bytes directly. The gory details are here if you're interested:
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0290g/ch06s05s01.html
https://blog.richliu.com/2010/04/08/907/arm11-be8-and-be32
BE32 is deprecated, newer cores don't even support it.
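To make the BE8 data-access behaviour concrete, here is a minimal Python sketch (not part of the original mail). It only models how the same four bytes in memory turn into different register values depending on the data endianness, and what a byte reversal like the ARM rev instruction does to the result. The function names are made up for illustration.

```python
import struct

# Four bytes as they sit in memory at increasing addresses.
memory = bytes([0x12, 0x34, 0x56, 0x78])


def load_word_le(buf: bytes, offset: int = 0) -> int:
    """Model a 32-bit load with the CPU in little-endian data mode."""
    return struct.unpack_from("<I", buf, offset)[0]


def load_word_be(buf: bytes, offset: int = 0) -> int:
    """Model a 32-bit load with the CPU in big-endian (BE8) data mode."""
    return struct.unpack_from(">I", buf, offset)[0]


def rev(word: int) -> int:
    """Model a byte reversal of a 32-bit word (what ARM `rev` does)."""
    return int.from_bytes(word.to_bytes(4, "little"), "big")


print(hex(load_word_le(memory)))       # 0x78563412
print(hex(load_word_be(memory)))       # 0x12345678
print(hex(rev(load_word_le(memory))))  # 0x12345678
```

Code that wants the big-endian interpretation of a byte stream (as SHA-1 and SHA-256 do) therefore needs the byte reversal after a little-endian load, and must not apply it after a big-endian load.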
I'm running BE8, so kernel and userland are BE8:
# file /usr/bin/nettle-hash
/usr/bin/nettle-hash: ELF 32-bit MSB shared object, ARM, EABI5 BE8 version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-armhf.so.3, for GNU/Linux 3.2.0, stripped
What's the host triplet?
armv7veb-hardfloat-linux-gnueabi
gcc: Using built-in specs. COLLECT_GCC=gcc COLLECT_LTO_WRAPPER=/usr/libexec/gcc/armv7veb-hardfloat-linux-gnueabi/7.2.0/lto-wrapper Target: armv7veb-hardfloat-linux-gnueabi Configured with: /var/tmp/portage/sys-devel/gcc-7.2.0-r1/work/gcc-7.2.0/configure --host=armv7veb-hardfloat-linux-gnueabi --build=armv7veb-hardfloat-linux-gnueabi --prefix=/usr --bindir=/usr/armv7veb-hardfloat-linux-gnueabi/gcc-bin/7.2.0 --includedir=/usr/lib/gcc/armv7veb-hardfloat-linux-gnueabi/7.2.0/include --datadir=/usr/share/gcc-data/armv7veb-hardfloat-linux-gnueabi/7.2.0 --mandir=/usr/share/gcc-data/armv7veb-hardfloat-linux-gnueabi/7.2.0/man --infodir=/usr/share/gcc-data/armv7veb-hardfloat-linux-gnueabi/7.2.0/info --with-gxx-include-dir=/usr/lib/gcc/armv7veb-hardfloat-linux-gnueabi/7.2.0/include/g++-v7 --with-python-dir=/share/gcc-data/armv7veb-hardfloat-linux-gnueabi/7.2.0/python --enable-languages=c,c++ --enable-obsolete --enable-secureplt --disable-werror --with-system-zlib --disable-nls --enable-checking=release --with-bugurl=https://bugs.gentoo.org/ --with-pkgversion='Gentoo Hardened 7.2.0-r1 p1.1' --enable-esp --enable-libstdcxx-time --disable-libstdcxx-pch --enable-shared --enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu --disable-multilib --disable-altivec --disable-fixed-point --with-float=hard --with-float=hard --with-fpu=vfpv3-d16 --disable-libgcj --disable-libgomp --disable-libmudflap --disable-libssp --disable-libcilkrts --disable-libmpx --enable-vtable-verify --enable-libvtv --disable-libquadmath --enable-lto --without-isl --disable-libsanitizer --enable-default-pie --enable-default-ssp --with-arch=armv7ve Thread model: posix gcc version 7.2.0 (Gentoo Hardened 7.2.0-r1 p1.1)
Are you cross compiling, or compiling natively?
It's all native on the board. I have a cross-toolchain and qemu on standby on x86_64 if necessary.
Does configure detect it as big-endian (check for WORDS_BIGENDIAN in config.h)?
It seems so:
nettle-3.4 # grep WORDS_BIG config.h
/* Define WORDS_BIGENDIAN to 1 if your processor stores words with the most
# define WORDS_BIGENDIAN 1
# ifndef WORDS_BIGENDIAN
# define WORDS_BIGENDIAN 1
Which of nettle's own tests (make check) fail?
With --disable-assembler all checks pass. Here's the make check output with arm asm:
PASS: aes
PASS: arcfour
PASS: arctwo
PASS: blowfish
PASS: cast128
PASS: base16
PASS: base64
PASS: camellia
Error, expected:
3e00ef2f895f40d6 7f5bb8e81f09a5a1 2c840ec3ce9a7f3b 181be188ef711a1e 984ce172b9216f41 9f445367456d5619 314a42a3da86b001 387bfdb80e0cfe42
Got:
4c4389ad2c1a14e5 9a58af4d26f726c8 2beb7f103f529dc2 a8a203e7c69fb546 141dc7988d5106d1 c03d895bb6576d0f 9718ecc1ae929e8d c8dc1f6ee7038bbb
../run-tests: line 57: 7819 Aborted "$1" $testflags
FAIL: chacha
PASS: des
PASS: des3
PASS: des-compat
PASS: md2
PASS: md4
PASS: md5
PASS: md5-compat
PASS: memeql
Assert failed: memxor-test.c:106: MEMEQ (size, dst, c)
../run-tests: line 57: 7865 Aborted "$1" $testflags
FAIL: memxor
PASS: gosthash94
PASS: ripemd160
Got:
f278b7b482ad71f2 83dcb4f9fe864547 5ed675894046fbeb 59de3b76052dfd99
Expected:
077709362c2e32df 0ddc3f0dc47bba63 90b6c73bb50f9c31 22ec844ad7c2b3e5
../run-tests: line 57: 7881 Aborted "$1" $testflags
FAIL: hkdf
Encrypt failed:
Input: 0000000000000000
Output: 0ab22555fbcab5ce
Expected: fc207dbfc76c5e17
../run-tests: line 57: 7887 Aborted "$1" $testflags
FAIL: salsa20
Got:
9844f81e1408f6ec b932137d33bed7cf dcf518a3
Expected:
da39a3ee5e6b4b0d 3255bfef95601890 afd80709
../run-tests: line 57: 7893 Aborted "$1" $testflags
FAIL: sha1
Got:
51a8a435f33f1941 a7646a966b9f99e5 095b59c1072c0acd 2a893d99
Expected:
23097d223405d822 8642a477bda255b3 2aadbce4bda0b3f7 e36c9da7
../run-tests: line 57: 7899 Aborted "$1" $testflags
FAIL: sha224
Got:
312892d3e4bda557 75e12e46320ed33a 329b15d73167b830 ec07ba0845c7b4cf
Expected:
ba7816bf8f01cfea 414140de5dae2223 b00361a396177a9c b410ff61f20015ad
../run-tests: line 57: 7905 Aborted "$1" $testflags
FAIL: sha256
PASS: sha384
PASS: sha512
PASS: sha512-224
PASS: sha512-256
PASS: sha3-permute
PASS: sha3-224
PASS: sha3-256
PASS: sha3-384
PASS: sha3-512
PASS: serpent
PASS: twofish
PASS: version
PASS: knuth-lfib
PASS: cbc
PASS: cfb
PASS: ctr
PASS: gcm
PASS: eax
CCM digest failed:
Adata: 0001020304050607
Input: 20212223
Output: 98055abb
Expected: 4dac255d
../run-tests: line 57: 8001 Aborted "$1" $testflags
FAIL: ccm
PASS: poly1305
Assert failed: testutils.c:619: MEMEQ(length, data, ciphertext->data)
../run-tests: line 57: 8012 Aborted "$1" $testflags
FAIL: chacha-poly1305
Assert failed: hmac-test.c:205: MEMEQ ((tstring_hex("b617318655057264 e28bc0b6fb378c8e f146be00"))->length, digest, (tstring_hex("b617318655057264 e28bc0b6fb378c8e f146be00"))->data)
../run-tests: line 57: 8018 Aborted "$1" $testflags
FAIL: hmac
umac32 failed
msg: length: 0
tag: 9f972a17
ref: 113145fb
../run-tests: line 57: 8024 Aborted "$1" $testflags
FAIL: umac
PASS: meta-hash
PASS: meta-cipher
PASS: meta-aead
PASS: meta-armor
PASS: buffer
Assert failed: yarrow-test.c:185: memcmp(digest, expected_input, sizeof(digest)) == 0
../run-tests: line 57: 8055 Aborted "$1" $testflags
FAIL: yarrow
Assert failed: pbkdf2-test.c:38: MEMEQ ((tstring_hex("0c60c80f961f0e71f3a9b524af6012062fe037a6"))->length, dk, (tstring_hex("0c60c80f961f0e71f3a9b524af6012062fe037a6"))->data)
../run-tests: line 57: 8061 Aborted "$1" $testflags
FAIL: pbkdf2
Assert failed: pss-mgf1-test.c:22: MEMEQ (expected->length, mask, expected->data)
../run-tests: line 57: 8067 Aborted "$1" $testflags
FAIL: pss-mgf1
PASS: sexp
PASS: sexp-format
PASS: rsa2sexp
PASS: sexp2rsa
PASS: bignum
PASS: random-prime
PASS: pkcs1
Assert failed: pss-test.c:29: mpz_cmp(m, expected) == 0
../run-tests: line 57: 8108 Aborted "$1" $testflags
FAIL: pss
PASS: rsa-sign-tr
Assert failed: rsa-pss-sign-tr-test.c:72: mpz_cmp(signature, expected) == 0
../run-tests: line 57: 8119 Aborted "$1" $testflags
FAIL: rsa-pss-sign-tr
Assert failed: testutils.c:1004: mpz_cmp (signature, expected) == 0
../run-tests: line 57: 8125 Aborted "$1" $testflags
FAIL: rsa
PASS: rsa-encrypt
Assert failed: testutils.c:1004: mpz_cmp (signature, expected) == 0
../run-tests: line 57: 8136 Aborted "$1" $testflags
FAIL: rsa-keygen
Assert failed: testutils.c:1189: mpz_cmp (signature.r, expected->r) == 0 && mpz_cmp (signature.s, expected->s) == 0
../run-tests: line 57: 8142 Aborted "$1" $testflags
FAIL: dsa
PASS: dsa-keygen
PASS: curve25519-dh
PASS: ecc-mod
PASS: ecc-modinv
PASS: ecc-redc
PASS: ecc-sqrt
PASS: ecc-dup
PASS: ecc-add
PASS: ecc-mul-g
PASS: ecc-mul-a
PASS: ecdsa-sign
PASS: ecdsa-verify
PASS: ecdsa-keygen
PASS: ecdh
PASS: eddsa-compress
PASS: eddsa-sign
PASS: eddsa-verify
PASS: ed25519
PASS: cxx
PASS: sexp-conv
1c1
< 2de201fee759dffb05a5ff127f4b0b134bf10f466cf174ebff52d387e551225a61e30ec850c38681574a1a8cefa1aa6030481cebc92268863871796ed1afd017969a1d70bb1c936fa1a71a975ddcc07a8d492d6caf5942182b03fa69fea603d904e1cd7c2c9f78e060662d7cf5ec2a5d5af7988e3054513f9f356b749360ec13
---
> 5c96ffe7e925224ce6e98648bf2ed3193cab2fc82af9c7fa7fdc5b623bde1d77c5409129d16d1127ae4fad519c24059fe85f4a4360a900f3dee906e6de2ecd010fa56c02d3f7d0772d43439464a91b025722a6f0b6cb65aee1017b29aff4511f90315caae0be74c2ac496474896e7e3ad200cb7c609ddef5c674272964e4b780
FAIL: pkcs1-conv
test1.out test2.out differ: char 1, line 1
FAIL: nettle-pbkdf2
PASS: symbols
PASS: dlopen
=====================
21 of 94 tests failed
=====================
make[1]: *** [Makefile:136: check] Error 1
Looks bad.
I've not narrowed this down to a proper test case yet because I'm wondering if this even warrants digging into. Might this be an easy fix or do I have to expect this to get so involved that I might just as well just disable asm on big-endian arm and leave it at that? I *am* all set to dive into this to provide a better test case and perhaps even a patch - just asking for the odds to solve this with only a beginner's arm asm skills.
Note how all the SHA1/256 digests below differ for the same certificate.
From these symptoms, the main suspect is the data load in
arm/v6/sha1-compress.asm (see the LOAD macro) and arm/v6/sha256-compress.asm (look at the code after the .Lcopy label).
As a quick test, you could try just deleting all use of the "rev" instruction in those two files.
I'm not sure what's needed to properly support big-endian there, maybe deleting rev isn't enough, one might also need to shift differently in the unaligned case. If you want to use the same assembly source file for both big- and little-endian, with only some m4 ifelse to do conditional things in the asm files, you should let configure substitute something in config.m4.in to test on.
Yes, the masking and shifting needs some adjustment, too. I got sha1-test to succeed with below patch. What do you think: Could we go some route like that for the other arm asm code as well? I'd be willing to throw in aarch64 as well because I've got some Pine64s running BE floating around also. :)
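Why the shift direction has to flip can be illustrated with a small Python model (an illustration only, not the sel/ror sequence the sha1 assembly actually uses). It rebuilds an unaligned 32-bit word from two aligned loads and checks the result against a direct unaligned read. The helper name unaligned_word is hypothetical.

```python
import struct

MASK32 = 0xFFFFFFFF


def unaligned_word(buf: bytes, offset: int, big_endian: bool) -> int:
    """Return the 32-bit word at `offset` using only aligned loads and shifts."""
    fmt = ">I" if big_endian else "<I"
    base = offset & ~3          # round down to the aligned address
    cnt = 8 * (offset & 3)      # misalignment in bits
    w0 = struct.unpack_from(fmt, buf, base)[0]
    if cnt == 0:
        return w0
    w1 = struct.unpack_from(fmt, buf, base + 4)[0]
    tnc = 32 - cnt
    if big_endian:
        # Big-endian: the wanted bytes are at the low end of w0 and the
        # high end of w1, so shift w0 left and w1 right.
        return ((w0 << cnt) | (w1 >> tnc)) & MASK32
    # Little-endian: exactly the other way round.
    return ((w0 >> cnt) | (w1 << tnc)) & MASK32


data = bytes(range(16))
for off in range(1, 12):
    for be in (False, True):
        direct = struct.unpack_from(">I" if be else "<I", data, off)[0]
        assert unaligned_word(data, off, be) == direct
```

The same direction swap shows up in any code that combines neighbouring aligned words to correct for misalignment, which is why dropping rev alone is not sufficient.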
The aes code also loads unaligned data, but it reads it byte-by-byte, without the tricks to use aligned word loads + rotate + sel.
Before attempting to support big-endian arm, I'd need some idea on how to test it.
Any halfway current ARM cross toolchain should be able to also output big-endian arm binaries (-mbig-endian). Then you could test those with qemu-user-armeb, which is very light-weight in that it doesn't need a kernel or emulated system and allows to run binaries directly.
If it's hard for me to test, the safest change may be to just disable all arm assembly on big-endian.
I'm not ready to go there yet. Poking around ARM ASM unexpectedly is fun. :)
From f876368b333c72878808e74a0af5aa631d42d357 Mon Sep 17 00:00:00 2001
From: Michael Weiser michael.weiser@gmx.de
Date: Wed, 7 Feb 2018 00:11:24 +0100
Subject: [PATCH] Support big-endian arm in sha1 armv6 assembly code

---
 arm/v6/sha1-compress.asm | 10 ++++++++++
 asm.m4                   | 10 ++++++++++
 config.m4.in             |  1 +
 configure.ac             |  2 ++
 4 files changed, 23 insertions(+)

diff --git a/arm/v6/sha1-compress.asm b/arm/v6/sha1-compress.asm
index 59d6297e..116a80f0 100644
--- a/arm/v6/sha1-compress.asm
+++ b/arm/v6/sha1-compress.asm
@@ -52,7 +52,9 @@ define(<LOAD>, <
 	sel	W, WPREV, T0
 	ror	W, W, SHIFT
 	mov	WPREV, T0
+NOT_IF_BE(<
 	rev	W, W
+>)
 	str	W, [SP,#eval(4*$1)]
 >)
 define(<EXPN>, <
@@ -127,8 +129,16 @@ PROLOGUE(_nettle_sha1_compress)
 	lsl	SHIFT, SHIFT, #3
 	mov	T0, #0
 	movne	T0, #-1
+IF_BE(<
+	lsr	W, T0, SHIFT
+>, <
 	lsl	W, T0, SHIFT
+>)
 	uadd8	T0, T0, W		C Sets APSR.GE bits
+IF_BE(<
+	neg	SHIFT, SHIFT		C Rotate right by 32-SHIFT bits
+	add	SHIFT, SHIFT, #32	C because there's no rotate left
+>, <>)
 
 	ldr	K, .LK1
 	ldm	STATE, {SA,SB,SC,SD,SE}
diff --git a/asm.m4 b/asm.m4
index 4018c235..34e39317 100644
--- a/asm.m4
+++ b/asm.m4
@@ -51,6 +51,16 @@ define(<ALIGN>,
 <.align ifelse(ALIGN_LOG,yes,<m4_log2($1)>,$1)
 >)
 
+define(<IF_BE>,
+<ifelse(WORDS_BIGENDIAN,yes,
+<$1>,
+<$2>)>)
+
+define(<NOT_IF_BE>,
+<ifelse(WORDS_BIGENDIAN,no,
+<$1>,
+<>)>)
+
 dnl Struct defining macros
 
 dnl STRUCTURE(prefix)
diff --git a/config.m4.in b/config.m4.in
index e39c880c..11f90a40 100644
--- a/config.m4.in
+++ b/config.m4.in
@@ -7,6 +7,7 @@ define(<TYPE_PROGBITS>, <@ASM_TYPE_PROGBITS@>)dnl
 define(<ALIGN_LOG>, <@ASM_ALIGN_LOG@>)dnl
 define(<W64_ABI>, <@W64_ABI@>)dnl
 define(<RODATA>, <@ASM_RODATA@>)dnl
+define(<WORDS_BIGENDIAN>, <@ASM_WORDS_BIGENDIAN@>)dnl
 divert(1)
 @ASM_MARK_NOEXEC_STACK@
 divert
diff --git a/configure.ac b/configure.ac
index 41bf0998..5db72be8 100644
--- a/configure.ac
+++ b/configure.ac
@@ -691,6 +691,7 @@ ASM_TYPE_FUNCTION='@function'
 ASM_TYPE_PROGBITS='@progbits'
 ASM_MARK_NOEXEC_STACK=''
 ASM_ALIGN_LOG=''
+ASM_WORDS_BIGENDIAN="$ac_cv_c_bigendian"
 
 if test x$enable_assembler = xyes ; then
   AC_CACHE_CHECK([if globals are prefixed by underscore],
@@ -811,6 +812,7 @@ AC_SUBST(ASM_TYPE_PROGBITS)
 AC_SUBST(ASM_MARK_NOEXEC_STACK)
 AC_SUBST(ASM_ALIGN_LOG)
 AC_SUBST(W64_ABI)
+AC_SUBST(ASM_WORDS_BIGENDIAN)
 AC_SUBST(EMULATOR)
 
 AC_SUBST(LIBNETTLE_MAJOR)
Michael Weiser michael@weiser.dinsnail.net writes:
What board and linux (dist?) are you running this on?
I have a number of Cubieboard2s that run Gentoo Linux with a vanilla, mainline Linus kernel.
# uname -a
Linux b 4.15.0-gentoo #2 SMP Sun Feb 4 18:46:30 CET 2018 armv7b ARMv7 Processor rev 4 (v7b) Allwinner sun7i (A20) Family GNU/Linux
The only difference between little- and big-endian boards is the following Linux kernel config options:
-# CONFIG_CPU_BIG_ENDIAN is not set
+CONFIG_CPU_BIG_ENDIAN=y
+CONFIG_CPU_ENDIAN_BE8=y
This makes the kernel switch the CPU to big-endian mode on boot. Userland is big-endian as well.
Cool.
What's the host triplet?
armv7veb-hardfloat-linux-gnueabi
      ^^
And the "eb" is for big-endian?
Are you cross compiling, or compiling natively?
It's all native on the board. I have a cross-toolchain and qemu on standby on x86_64 if necessary.
Does configure detect it as big-endian (check for WORDS_BIGENDIAN in config.h)?
It seems so:
nettle-3.4 # grep WORDS_BIG config.h
/* Define WORDS_BIGENDIAN to 1 if your processor stores words with the most
# define WORDS_BIGENDIAN 1
# ifndef WORDS_BIGENDIAN
# define WORDS_BIGENDIAN 1
Can you check if it's detected correctly also when cross-compiling?
Which of nettle's own tests (make check) fail?
With --disable-assembler all checks pass. Here's the make check output with arm asm:
FAIL: chacha
The chacha code doesn't look endian-dependent to me. I'd guess it's a consequence of incorrect memxor (below).
FAIL: memxor
This also does some tricks with word reads and rotate. (The C code does that too, but with conditions on WORDS_BIGENDIAN).
FAIL: hkdf
Probably due to broken sha1 or sha256.
FAIL: sha1
This one you have already looked into.
FAIL: sha256
Similar problems.
FAIL: umac
Similar problem, I would guess. But this time, loading 64 bits at a time into neon registers.
The remaining failures are most likely not independent issues.
Yes, the masking and shifting needs some adjustment, too. I got sha1-test to succeed with below patch. What do you think: Could we go some route like that for the other arm asm code as well?
Sounds reasonable, I'm happy to apply patches. If you feel like it, v6/aes-*.asm could also use better code for aligned reading of input data.
I'd be willing to throw in aarch64 as well because I've got some Pine64s running BE floating around also. :)
Aarch64 assembly (for both endian flavors) would be nice, but it's a separate project. I haven't yet looked into aarch64-assembly. I made an attempt to build nettle under termux on my android phone a while ago, but it failed because it didn't provide /bin/sh at the expected place.
Before attempting to support big-endian arm, I'd need some idea on how to test it.
Any halfway current ARM cross toolchain should be able to also output big-endian arm binaries (-mbig-endian). Then you could test those with qemu-user-armeb, which is very light-weight in that it doesn't need a kernel or emulated system and allows to run binaries directly.
Sounds good. I hope the needed tools are packaged in debian, I'll have to check that.
From f876368b333c72878808e74a0af5aa631d42d357 Mon Sep 17 00:00:00 2001
From: Michael Weiser michael.weiser@gmx.de
Date: Wed, 7 Feb 2018 00:11:24 +0100
Subject: [PATCH] Support big-endian arm in sha1 armv6 assembly code

 arm/v6/sha1-compress.asm | 10 ++++++++++
 asm.m4                   | 10 ++++++++++
 config.m4.in             |  1 +
 configure.ac             |  2 ++
 4 files changed, 23 insertions(+)

diff --git a/arm/v6/sha1-compress.asm b/arm/v6/sha1-compress.asm
index 59d6297e..116a80f0 100644
--- a/arm/v6/sha1-compress.asm
+++ b/arm/v6/sha1-compress.asm
@@ -52,7 +52,9 @@ define(<LOAD>, <
 	sel	W, WPREV, T0
 	ror	W, W, SHIFT
 	mov	WPREV, T0
+NOT_IF_BE(<
 	rev	W, W
+>)
I'd prefer IF_LE or IF_NOT_BE or UNLESS_BE over NOT_IF_BE. And it might look better as a single line,
IF_LE(< rev W, W>)
+IF_BE(<
+	neg	SHIFT, SHIFT		C Rotate right by 32-SHIFT bits
+	add	SHIFT, SHIFT, #32	C because there's no rotate left
+>, <>)
Can the rsb instruction be used for this?
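As a quick sanity check of that suggestion, the following Python sketch (illustrative, with made-up helper names) models 32-bit registers and confirms that neg followed by add #32 yields the same value as a single reverse subtract, and that rotating right by that amount is a rotate left by the original shift.

```python
MASK32 = 0xFFFFFFFF


def neg_then_add(shift: int) -> int:
    """Model `neg SHIFT, SHIFT` followed by `add SHIFT, SHIFT, #32`."""
    negated = (-shift) & MASK32
    return (negated + 32) & MASK32


def rsb(shift: int, imm: int = 32) -> int:
    """Model `rsb SHIFT, SHIFT, #imm`, which computes imm - SHIFT."""
    return (imm - shift) & MASK32


def ror(word: int, n: int) -> int:
    """Rotate a 32-bit word right by n bits (n taken modulo 32)."""
    n %= 32
    return ((word >> n) | (word << (32 - n))) & MASK32


def rol(word: int, n: int) -> int:
    """Rotate a 32-bit word left by n bits (n taken modulo 32)."""
    n %= 32
    return ((word << n) | (word >> (32 - n))) & MASK32


# SHIFT is a byte offset times 8, so only these values occur.
for shift in (0, 8, 16, 24):
    assert neg_then_add(shift) == rsb(shift)
    assert ror(0x12345678, rsb(shift)) == rol(0x12345678, shift)
```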
+define(<IF_BE>,
+<ifelse(WORDS_BIGENDIAN,yes,
+<$1>,
+<$2>)>)
Would be good to check explicitly for the supported values "yes" and "no" (m4 ifelse can have more than two alternatives), and fail with m4exit if configure produced any other value, e.g., "unknown".
+define(<NOT_IF_BE>,
+<ifelse(WORDS_BIGENDIAN,no,
+<$1>,
+<>)>)
As above, I'm not so fond of the name. And for symmetry, it would be nice with an else clause just as for IF_BE.
--- a/configure.ac
+++ b/configure.ac
@@ -691,6 +691,7 @@ ASM_TYPE_FUNCTION='@function'
 ASM_TYPE_PROGBITS='@progbits'
 ASM_MARK_NOEXEC_STACK=''
 ASM_ALIGN_LOG=''
+ASM_WORDS_BIGENDIAN="$ac_cv_c_bigendian"
If you have the time, it would be good to file an autoconf bug report, asking them to document (and support) that AC_C_BIGENDIAN sets the shell variable ac_cv_c_bigendian.
Regards, /Niels
On Wed, 2018-02-07 at 13:13 +0100, Niels Möller wrote:
I'd be willing to throw in aarch64 as well because I've got some Pine64s running BE floating around also. :)
Aarch64 assembly (for both endian flavors) would be nice, but it's a separate project. I haven't yet looked into aarch64-assembly. I made an attempt to build nettle under termux on my android phone a while ago, but it failed because it didn't provide /bin/sh at the expected place.
If you need access to the aarch64 CI server, let me know.
regards, Nikos
Hi Niels,
On Wed, Feb 07, 2018 at 01:13:32PM +0100, Niels Möller wrote:
What's the host triplet?
armv7veb-hardfloat-linux-gnueabi
      ^^
And the "eb" is for big-endian?
Only the b actually. ve stands for virtualization extensions: http://gcc.gnu.org/ml/gcc-patches/2013-12/msg01783.html. But that's just my fancy. More common triples would most likely use armv7b or armv7eb and the above should perhaps have been armv7veeb. :)
# define WORDS_BIGENDIAN 1
Can you check if it's detected correctly also when cross-compiling?
# ./configure --host=armv7veb-hardfloat-linux-gnueabi
checking build system type... x86_64-unknown-linux-gnu
checking host system type... armv7veb-hardfloat-linux-gnueabi
[...]
configure: summary of build options:

  Version:           nettle 3.4
  Host type:         armv7veb-hardfloat-linux-gnueabi
  ABI:               standard
  Assembly files:    arm/v6 arm
  Install prefix:    /usr/local
  Library directory: ${exec_prefix}/lib
  Compiler:          armv7veb-hardfloat-linux-gnueabi-gcc
  Static libraries:  yes
  Shared libraries:  yes
  Public key crypto: no
  Using mini-gmp:    no
  Documentation:     yes

# grep WORDS_BIG config.h
/* Define WORDS_BIGENDIAN to 1 if your processor stores words with the most
# define WORDS_BIGENDIAN 1
# ifndef WORDS_BIGENDIAN
# define WORDS_BIGENDIAN 1
Seems fine.
FAIL: memxor
This also does some tricks with word reads and rotate. (The C code does that too, but with conditions on WORDS_BIGENDIAN).
I think I got memxor, sha1 and sha256 sorted. Patch below.
FAIL: chacha
The chacha code doesn't look endian-dependent to me. I'd guess it's a consequence of incorrect memxor (below).
This one is still failing, even though memxor and sha are fixed. I've been looking at the code and can't find any apparent reason. In chacha-core-internal.c I see the following bit of code that does seem to do endianness handling:
dst[i] = LE_SWAP32 (t);
Would this apply to chacha-core-internal.asm, too?
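For readers unfamiliar with that macro, here is a rough Python model of what LE_SWAP32 appears to be for. It assumes the macro is a no-op on little-endian hosts and a byte swap on big-endian hosts, so that a following native word store always produces little-endian bytes in memory; check nettle's macros.h for the authoritative definition. The function names and the test constant are chosen purely for illustration.

```python
import struct


def le_swap32(word: int, host_big_endian: bool) -> int:
    """Assumed behaviour of LE_SWAP32: swap bytes only on big-endian hosts."""
    if not host_big_endian:
        return word
    return int.from_bytes(word.to_bytes(4, "big"), "little")


def store_native(word: int, host_big_endian: bool) -> bytes:
    """Model a native 32-bit store, returning the bytes as laid out in memory."""
    return struct.pack(">I" if host_big_endian else "<I", word)


# 0x61707865 spells "expa" when written out in little-endian byte order.
t = 0x61707865
for be in (False, True):
    # With the swap applied first, both host types write identical bytes.
    assert store_native(le_swap32(t, be), be) == b"expa"

# Without the swap, a big-endian host writes the bytes reversed.
assert store_native(t, True) == b"apxe"
```

If the assembly version stores its result with plain word stores and no equivalent swap, it would produce the reversed layout on a big-endian host, which would be consistent with the failing chacha test.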
FAIL: umac
Similar problem, I would guess. But this time, loading 64 bits at a time into neon registers.
I'm drawing a bit of a blank on this one. It fails on the very first test case of umac32 where only umac-nh is used and all the input is zeroes. So there does seem to be another endianness dependency in the actual computation code. Have I understood correctly, that vld1.8 reads a byte stream and should be endianness-neutral anyway and the keys are in host endianness?
If you feel like it, v6/aes-*.asm could also use better code for aligned reading of input data.
Huh, getting existing code to work again is one thing. But actual better code is certainly beyond me. :-/
Aarch64 assembly (for both endian flavors) would be nice, but it's a separate project. I haven't yet looked into aarch64-assembly. I made an attempt to build nettle under termux on my android phone a while ago, but it failed because it didn't provide /bin/sh at the expected place.
Sorry, I think I had confused nettle with another library I came across during debugging which had armv8 code. Again, I think I should leave producing actually working and efficient assembler code to someone who knows what they're doing. :)
Before attempting to support big-endian arm, I'd need some idea on how to test it.
Any halfway current ARM cross toolchain should be able to also output big-endian arm binaries (-mbig-endian). Then you could test those with qemu-user-armeb, which is very light-weight in that it doesn't need a kernel or emulated system and allows to run binaries directly.
Sounds good. I hope the needed tools are packaged in debian, I'll have to check that.
I was wrong: While the compiler is able to output big-endian objects with -mbig-endian, it needs matching libs as well (e.g. libgcc_s). Debian doesn't have anything precompiled for armeb. They refer you to Linaro's toolchains or rebootstrap for building from scratch instead (I do something similar with crossdev on Gentoo).
This Linaro toolchain works for me: https://releases.linaro.org/components/toolchain/binaries/latest/armeb-linux...
michael@debian:~/nettle$ PATH=$HOME/gcc-linaro-7.2.1-2017.11-x86_64_armeb-linux-gnueabihf/bin:$PATH ./configure --host=armeb-linux-gnueabihf
michael@debian:~/nettle$ make
[...]
michael@debian:~/nettle$ file libnettle.so
libnettle.so: ELF 32-bit MSB shared object, ARM, EABI5 BE8 version 1 (SYSV), dynamically linked, BuildID[sha1]=1a8daa9c1d3e61b9d99d34f462337d02c47c9d74, with debug_info, not stripped
michael@debian:~/nettle$ make testsuite/sha1-test
Now qemu can be installed, which automatically registers with binfmt so that arm binaries can just be executed:
michael@debian:~/nettle$ sudo apt-get install qemu-user-static
michael@debian:~/nettle$ file testsuite/sha1-test
testsuite/sha1-test: ELF 32-bit MSB executable, ARM, EABI5 BE8 version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-armhf.so.3, for GNU/Linux 3.2.0, BuildID[sha1]=ec39b7153f4c09d11cac92d34c8e509bb1f4d0a0, with debug_info, not stripped
michael@debian:~/nettle$ testsuite/sha1-test
/lib/ld-linux-armhf.so.3: No such file or directory
michael@debian:~/nettle$ QEMU_LD_PREFIX=$HOME/gcc-linaro-7.2.1-2017.11-x86_64_armeb-linux-gnueabihf/armeb-linux-gnueabihf/libc testsuite/sha1-test
qemu: uncaught target signal 11 (Segmentation fault) - core dumped
Segmentation fault
This segfaults because of a bug in qemu where it tries to use the host's /etc/ld.so.cache. Deleting it "solves" that. Alternatively, it could be run in a chroot to avoid the segfault but would require some fiddling with the compiler's sysroot.
michael@debian:~/nettle$ sudo rm /etc/ld.so.cache
michael@debian:~/nettle$ QEMU_LD_PREFIX=$HOME/gcc-linaro-7.2.1-2017.11-x86_64_armeb-linux-gnueabihf/armeb-linux-gnueabihf/libc testsuite/sha1-test
testsuite/sha1-test: error while loading shared libraries: libnettle.so.6: cannot open shared object file: No such file or directory
michael@debian:~/nettle$ ln -sfn libnettle.so libnettle.so.6
michael@debian:~/nettle$ QEMU_LD_PREFIX=$HOME/gcc-linaro-7.2.1-2017.11-x86_64_armeb-linux-gnueabihf/armeb-linux-gnueabihf/libc LD_LIBRARY_PATH=. testsuite/sha1-test
This worked because configure detected only generic arm support:
Assembly files: arm
So plain arm assembly seems to be BE-safe. :) After hacking configure to also enable arm/v6 with this triple I get:
michael@debian:~/nettle$ QEMU_LD_PREFIX=$HOME/gcc-linaro-7.2.1-2017.11-x86_64_armeb-linux-gnueabihf/armeb-linux-gnueabihf/libc LD_LIBRARY_PATH=. testsuite/sha1-test
Got:
9844f81e1408f6ec b932137d33bed7cf dcf518a3
Expected:
da39a3ee5e6b4b0d 3255bfef95601890 afd80709
qemu: uncaught target signal 6 (Aborted) - core dumped
Aborted
Which seems about right. With the patch that goes away:
michael@debian:~/nettle$ git am 0001-Support-big-endian-arm-in-sha1-armv6-assembly-code.patch
Applying: Support big-endian arm in sha1 armv6 assembly code
[make && make check]
michael@debian:~/nettle$ QEMU_LD_PREFIX=$HOME/gcc-linaro-7.2.1-2017.11-x86_64_armeb-linux-gnueabihf/armeb-linux-gnueabihf/libc LD_LIBRARY_PATH=. testsuite/sha1-test
michael@debian:~/nettle$
I also tried rebootstrap but this quickly got really involved.
--- a/configure.ac
+++ b/configure.ac
@@ -691,6 +691,7 @@ ASM_TYPE_FUNCTION='@function'
 ASM_TYPE_PROGBITS='@progbits'
 ASM_MARK_NOEXEC_STACK=''
 ASM_ALIGN_LOG=''
+ASM_WORDS_BIGENDIAN="$ac_cv_c_bigendian"
If you have the time, it would be good to file an autoconf bug report, asking them to document (and support) that AC_C_BIGENDIAN sets the shell variable ac_cv_c_bigendian.
Instead I augmented the default action (which is documented and shouldn't change) by setting ASM_WORDS_BIGENDIAN directly. Also this should make the explicit value checking in IF_BE redundant because we now know for sure configure will never emit anything other than yes and no. Documentation says that AC_C_BIGENDIAN will abort if endianness can't be determined.
From db70ecccdc65a97c103f3900b4f45d8370c1dd62 Mon Sep 17 00:00:00 2001
From: Michael Weiser michael.weiser@gmx.de
Date: Wed, 7 Feb 2018 00:11:24 +0100
Subject: [PATCH] Support big-endian arm in assembly code

Introduce m4 macros to conditionally handle differences of little- and
big-endian arm in assembler code. Adjust sha1-compress, sha256-compress and
memxor for arm to work in big-endian mode.
---
 arm/memxor.asm             | 21 +++++++++++++++-----
 arm/memxor3.asm            | 49 ++++++++++++++++++++++++++++++----------------
 arm/v6/sha1-compress.asm   |  8 ++++++--
 arm/v6/sha256-compress.asm | 14 ++++++++-----
 asm.m4                     |  3 +++
 config.m4.in               |  1 +
 configure.ac               |  5 ++++-
 7 files changed, 71 insertions(+), 30 deletions(-)
diff --git a/arm/memxor.asm b/arm/memxor.asm index a50e91bc..239a4034 100644 --- a/arm/memxor.asm +++ b/arm/memxor.asm @@ -44,6 +44,11 @@ define(<N>, <r2>) define(<CNT>, <r6>) define(<TNC>, <r12>)
+C little-endian and big-endian need to shift in different directions for +C alignment correction +define(<S0ADJ>, IF_LE(<lsr>, <lsl>)) +define(<S1ADJ>, IF_LE(<lsl>, <lsr>)) + .syntax unified
.file "memxor.asm" @@ -99,6 +104,8 @@ PROLOGUE(nettle_memxor) C C With little-endian, we need to do C DST[i] ^= (SRC[i] >> CNT) ^ (SRC[i+1] << TNC) + C With big-endian, we need to do + C DST[i] ^= (SRC[i] << CNT) ^ (SRC[i+1] >> TNC)
push {r4,r5,r6} @@ -117,14 +124,14 @@ PROLOGUE(nettle_memxor) .Lmemxor_word_loop: ldr r5, [SRC], #+4 ldr r3, [DST] - eor r3, r3, r4, lsr CNT - eor r3, r3, r5, lsl TNC + eor r3, r3, r4, S0ADJ CNT + eor r3, r3, r5, S1ADJ TNC str r3, [DST], #+4 .Lmemxor_odd: ldr r4, [SRC], #+4 ldr r3, [DST] - eor r3, r3, r5, lsr CNT - eor r3, r3, r4, lsl TNC + eor r3, r3, r5, S0ADJ CNT + eor r3, r3, r4, S1ADJ TNC str r3, [DST], #+4 subs N, #8 bcs .Lmemxor_word_loop @@ -132,10 +139,14 @@ PROLOGUE(nettle_memxor) beq .Lmemxor_odd_done
C We have TNC/8 left-over bytes in r4, high end - lsr r4, CNT + S0ADJ r4, CNT ldr r3, [DST] eor r3, r4
+ C memxor_leftover does an LSB store + C so we need to reverse if actually BE +IF_BE(< rev r3, r3>) + pop {r4,r5,r6}
C Store bytes, one by one. diff --git a/arm/memxor3.asm b/arm/memxor3.asm index 139fd208..69598e1c 100644 --- a/arm/memxor3.asm +++ b/arm/memxor3.asm @@ -49,6 +49,11 @@ define(<ATNC>, <r10>) define(<BCNT>, <r11>) define(<BTNC>, <r12>)
+C little-endian and big-endian need to shift in different directions for +C alignment correction +define(<S0ADJ>, IF_LE(<lsr>, <lsl>)) +define(<S1ADJ>, IF_LE(<lsl>, <lsr>)) + .syntax unified
.file "memxor3.asm" @@ -124,6 +129,8 @@ PROLOGUE(nettle_memxor3) C C With little-endian, we need to do C DST[i-i] ^= (SRC[i-i] >> CNT) ^ (SRC[i] << TNC) + C With big-endian, we need to do + C DST[i-i] ^= (SRC[i-i] << CNT) ^ (SRC[i] >> TNC) rsb ATNC, ACNT, #32 bic BP, #3
@@ -138,14 +145,14 @@ PROLOGUE(nettle_memxor3) .Lmemxor3_au_loop: ldr r5, [BP, #-4]! ldr r6, [AP, #-4]! - eor r6, r6, r4, lsl ATNC - eor r6, r6, r5, lsr ACNT + eor r6, r6, r4, S1ADJ ATNC + eor r6, r6, r5, S0ADJ ACNT str r6, [DST, #-4]! .Lmemxor3_au_odd: ldr r4, [BP, #-4]! ldr r6, [AP, #-4]! - eor r6, r6, r5, lsl ATNC - eor r6, r6, r4, lsr ACNT + eor r6, r6, r5, S1ADJ ATNC + eor r6, r6, r4, S0ADJ ACNT str r6, [DST, #-4]! subs N, #8 bcs .Lmemxor3_au_loop @@ -154,7 +161,11 @@ PROLOGUE(nettle_memxor3)
C Leftover bytes in r4, low end ldr r5, [AP, #-4] - eor r4, r5, r4, lsl ATNC + eor r4, r5, r4, S1ADJ ATNC + + C leftover does an LSB store + C so we need to reverse if actually BE +IF_BE(< rev r4, r4>)
.Lmemxor3_au_leftover: C Store a byte at a time @@ -247,21 +258,25 @@ PROLOGUE(nettle_memxor3) ldr r5, [AP, #-4]! ldr r6, [BP, #-4]! eor r5, r6 - lsl r4, ATNC - eor r4, r4, r5, lsr ACNT + S1ADJ r4, ATNC + eor r4, r4, r5, S0ADJ ACNT str r4, [DST, #-4]! .Lmemxor3_uu_odd: ldr r4, [AP, #-4]! ldr r6, [BP, #-4]! eor r4, r6 - lsl r5, ATNC - eor r5, r5, r4, lsr ACNT + S1ADJ r5, ATNC + eor r5, r5, r4, S0ADJ ACNT str r5, [DST, #-4]! subs N, #8 bcs .Lmemxor3_uu_loop adds N, #8 beq .Lmemxor3_done
+ C leftover does an LSB store + C so we need to reverse if actually BE +IF_BE(< rev r4, r4>) + C Leftover bytes in a4, low end ror r4, ACNT .Lmemxor3_uu_leftover: @@ -290,18 +305,18 @@ PROLOGUE(nettle_memxor3) .Lmemxor3_uud_loop: ldr r5, [AP, #-4]! ldr r7, [BP, #-4]! - lsl r4, ATNC - eor r4, r4, r6, lsl BTNC - eor r4, r4, r5, lsr ACNT - eor r4, r4, r7, lsr BCNT + S1ADJ r4, ATNC + eor r4, r4, r6, S1ADJ BTNC + eor r4, r4, r5, S0ADJ ACNT + eor r4, r4, r7, S0ADJ BCNT str r4, [DST, #-4]! .Lmemxor3_uud_odd: ldr r4, [AP, #-4]! ldr r6, [BP, #-4]! - lsl r5, ATNC - eor r5, r5, r7, lsl BTNC - eor r5, r5, r4, lsr ACNT - eor r5, r5, r6, lsr BCNT + S1ADJ r5, ATNC + eor r5, r5, r7, S1ADJ BTNC + eor r5, r5, r4, S0ADJ ACNT + eor r5, r5, r6, S0ADJ BCNT str r5, [DST, #-4]! subs N, #8 bcs .Lmemxor3_uud_loop diff --git a/arm/v6/sha1-compress.asm b/arm/v6/sha1-compress.asm index 59d6297e..52739b69 100644 --- a/arm/v6/sha1-compress.asm +++ b/arm/v6/sha1-compress.asm @@ -52,7 +52,7 @@ define(<LOAD>, < sel W, WPREV, T0 ror W, W, SHIFT mov WPREV, T0 - rev W, W +IF_LE(< rev W, W>) str W, [SP,#eval(4*$1)]
)
define(<EXPN>, < @@ -127,8 +127,12 @@ PROLOGUE(_nettle_sha1_compress) lsl SHIFT, SHIFT, #3 mov T0, #0 movne T0, #-1 - lsl W, T0, SHIFT +IF_LE(< lsl W, T0, SHIFT>) +IF_BE(< lsr W, T0, SHIFT>) uadd8 T0, T0, W C Sets APSR.GE bits + C on BE rotate right by 32-SHIFT bits + C because there is no rotate left +IF_BE(< rsb SHIFT, SHIFT, #32>) ldr K, .LK1 ldm STATE, {SA,SB,SC,SD,SE} diff --git a/arm/v6/sha256-compress.asm b/arm/v6/sha256-compress.asm index e6f4e1e9..324730c7 100644 --- a/arm/v6/sha256-compress.asm +++ b/arm/v6/sha256-compress.asm @@ -137,8 +137,12 @@ PROLOGUE(_nettle_sha256_compress) lsl SHIFT, SHIFT, #3 mov T0, #0 movne T0, #-1 - lsl I1, T0, SHIFT +IF_LE(< lsl I1, T0, SHIFT>) +IF_BE(< lsr I1, T0, SHIFT>) uadd8 T0, T0, I1 C Sets APSR.GE bits + C on BE rotate right by 32-SHIFT bits + C because there is no rotate left +IF_BE(< rsb SHIFT, SHIFT, #32>)
mov DST, sp mov ILEFT, #4 @@ -146,16 +150,16 @@ PROLOGUE(_nettle_sha256_compress) ldm INPUT!, {I1,I2,I3,I4} sel I0, I0, I1 ror I0, I0, SHIFT - rev I0, I0 +IF_LE(< rev I0, I0>) sel I1, I1, I2 ror I1, I1, SHIFT - rev I1, I1 +IF_LE(< rev I1, I1>) sel I2, I2, I3 ror I2, I2, SHIFT - rev I2, I2 +IF_LE(< rev I2, I2>) sel I3, I3, I4 ror I3, I3, SHIFT - rev I3, I3 +IF_LE(< rev I3, I3>) subs ILEFT, ILEFT, #1 stm DST!, {I0,I1,I2,I3} mov I0, I4 diff --git a/asm.m4 b/asm.m4 index 4018c235..343a55fc 100644 --- a/asm.m4 +++ b/asm.m4 @@ -51,6 +51,9 @@ define(<ALIGN>, <.align ifelse(ALIGN_LOG,yes,<m4_log2($1)>,$1)
)
+define(<IF_BE>, <ifelse(WORDS_BIGENDIAN,yes,<$1>,<$2>)>) +define(<IF_LE>, <IF_BE(<$2>, <$1>)>) + dnl Struct defining macros
dnl STRUCTURE(prefix) diff --git a/config.m4.in b/config.m4.in index e39c880c..11f90a40 100644 --- a/config.m4.in +++ b/config.m4.in @@ -7,6 +7,7 @@ define(<TYPE_PROGBITS>, <@ASM_TYPE_PROGBITS@>)dnl define(<ALIGN_LOG>, <@ASM_ALIGN_LOG@>)dnl define(<W64_ABI>, <@W64_ABI@>)dnl define(<RODATA>, <@ASM_RODATA@>)dnl +define(<WORDS_BIGENDIAN>, <@ASM_WORDS_BIGENDIAN@>)dnl divert(1) @ASM_MARK_NOEXEC_STACK@ divert diff --git a/configure.ac b/configure.ac index 41bf0998..21eba3b5 100644 --- a/configure.ac +++ b/configure.ac @@ -201,7 +201,9 @@ LSH_FUNC_STRERROR # getenv_secure is used for fat overrides, # getline is used in the testsuite AC_CHECK_FUNCS(secure_getenv getline) -AC_C_BIGENDIAN +AC_C_BIGENDIAN([AC_DEFINE([WORDS_BIGENDIAN], 1) + [ASM_WORDS_BIGENDIAN=yes]], + [ASM_WORDS_BIGENDIAN=no])
AC_CACHE_CHECK([for __builtin_bswap64], nettle_cv_c_builtin_bswap64, @@ -811,6 +813,7 @@ AC_SUBST(ASM_TYPE_PROGBITS) AC_SUBST(ASM_MARK_NOEXEC_STACK) AC_SUBST(ASM_ALIGN_LOG) AC_SUBST(W64_ABI) +AC_SUBST(ASM_WORDS_BIGENDIAN) AC_SUBST(EMULATOR)
AC_SUBST(LIBNETTLE_MAJOR)
Michael Weiser michael@weiser.dinsnail.net writes:
Hi Niels,
On Wed, Feb 07, 2018 at 01:13:32PM +0100, Niels Möller wrote:
Can you check if it's detected correctly also when cross-compiling?
[...] Seems fine.
Good!
FAIL: memxor
This also does some tricks with word reads and rotate. (The C code does that too, but with conditions on WORDS_BIGENDIAN).
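As a rough illustration of the word-read trick mentioned here (not nettle's actual code), the following Python sketch rebuilds an unaligned 32-bit word from two aligned loads. The helper names load_word and unaligned_word are made up for this example, and the two halves are combined with OR, whereas memxor folds them into the destination with XOR. The point is only that the shift directions swap between little- and big-endian.

```python
def load_word(buf, offset, byteorder):
    """Aligned 32-bit load, as a CPU running in the given byte order sees it."""
    return int.from_bytes(buf[offset:offset + 4], byteorder)


def unaligned_word(buf, offset, byteorder):
    """Rebuild the 32-bit word at an unaligned offset from two aligned loads."""
    base = offset & ~3          # round down to the aligned address
    cnt = 8 * (offset & 3)      # CNT: bits to drop from the first word
    if cnt == 0:
        return load_word(buf, base, byteorder)
    tnc = 32 - cnt              # TNC: bits to take from the second word
    w0 = load_word(buf, base, byteorder)
    w1 = load_word(buf, base + 4, byteorder)
    if byteorder == "little":
        # The lowest address lives in the least significant byte.
        word = (w0 >> cnt) | (w1 << tnc)
    else:
        # The lowest address lives in the most significant byte.
        word = (w0 << cnt) | (w1 >> tnc)
    return word & 0xFFFFFFFF


buf = bytes(range(0x10, 0x20))
print(hex(unaligned_word(buf, 1, "little")), hex(unaligned_word(buf, 1, "big")))
```

Using the little-endian shift directions on a big-endian CPU picks up the wrong bytes from each aligned word, which would fit the memxor test failing.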
I think I got memxor, sha1 and sha256 sorted. Patch below.
Nice. Only some quick comments for now.
This one is still failing, even though memxor and sha are fixed. I've been looking at the code and can't find any apparent reason. In chacha-core-internal.c I see the following bit of code that does seem to do endianness handling:
dst[i] = LE_SWAP32 (t);
Would this apply to chacha-core-internal.asm, too?
That's right, it is expected to produce the output as an array of 16 32-bit words, stored in *little* endian order. So byteswap is needed before the final
vstm DST, {X0,X1,X2,X3}
instruction. The idea is that the buffer can be used directly with memxor, without the C code having to do any byte swaps.
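The output convention described here can be sketched in Python. This is only a model under stated assumptions: bswap32, native_store and core_output are invented names, and the four example words are the ChaCha constants that show up in the gdb dumps later in the thread. A plain word store on a big-endian host writes the bytes in the wrong order, so the words are byte-swapped first.

```python
def bswap32(word):
    """Reverse the four bytes of a 32-bit word."""
    return int.from_bytes(word.to_bytes(4, "little"), "big")


def native_store(words, host):
    """Store 32-bit words the way a plain word store does on this host."""
    return b"".join(w.to_bytes(4, host) for w in words)


def core_output(words, host):
    """Byte-swap on big-endian hosts so memory is little-endian either way."""
    if host == "big":
        words = [bswap32(w) for w in words]
    return native_store(words, host)


state = [0x61707865, 0x3320646E, 0x79622D32, 0x6B206574]
print(core_output(state, "little"))
print(core_output(state, "big"))
```

With the swap in place both hosts produce the same byte stream, so the buffer can be passed straight to memxor.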
FAIL: umac
Similar problem, I would guess. But this time, loading 64 bits at a time into neon registers.
I'm drawing a bit of a blank on this one. It fails on the very first test case of umac32 where only umac-nh is used and all the input is zeroes. So there does seem to be another endianness dependency in the actual computation code.
It could be that if the all-zero input is misaligned, the aligned read + rotation tricks get non-zero data from outside of the input area into the computation.
Have I understood correctly, that vld1.8 reads a byte stream and should be endianness-neutral
Don't remember, have to consult the arm docs.
and the keys are in host endianness?
I think so.
I was wrong: While the compiler is able to output big-endian objects with -mbig-endian, it needs matching libs as well (e.g. libgcc_s). Debian doesn't have anything precompiled for armeb.
Maybe full-system qemu is easier then (assuming there's some dist supporting big-endian arm)?
Instead I augmented the default action (which is documented and shouldn't change) by setting ASM_WORDS_BIGENDIAN directly. Also this should make the explicit value checking in IF_BE redundant because we now know for sure configure will never emit anything other than yes and no. Documentation says that AC_C_BIGENDIAN will abort if endianness can't be determined.
[...]
--- a/configure.ac
+++ b/configure.ac
@@ -201,7 +201,9 @@ LSH_FUNC_STRERROR
 # getenv_secure is used for fat overrides,
 # getline is used in the testsuite
 AC_CHECK_FUNCS(secure_getenv getline)
-AC_C_BIGENDIAN
+AC_C_BIGENDIAN([AC_DEFINE([WORDS_BIGENDIAN], 1)
+	[ASM_WORDS_BIGENDIAN=yes]],
+	[ASM_WORDS_BIGENDIAN=no])
I think I'd indent differently, to group the two parts ACTION-IF-TRUE together, and drop the redundant square brackets in "[ASM_WORDS_BIGENDIAN=yes]".
You leave the ACTION-IF-UNIVERSAL as default, which I think is good. I hope that's not relevant for arm. Might still be good to set ASM_WORDS_BIGENDIAN to some default value before this check, and have IF_BE fail if used with an unknown endianness.
Regards, /Niels
Hello Niels,
On Sun, Feb 11, 2018 at 11:03:41AM +0100, Niels Möller wrote:
dst[i] = LE_SWAP32 (t);
Would this apply to chacha-core-internal.asm, too?
That's right, it is expected to produce the output as an array of 16 32-bit words, stored in *little* endian order. So byteswap is needed
Right. When this still didn't fix it, I compared little- and big-endian behaviour and found that a.) vldm and vstm switch doublewords for no reason I can see or find documentation about and b.) vext extracts from the top of the vector, not bottom. Taking both into account, I now have chacha and salsa20 passing tests.
FAIL: umac
I'm drawing a bit of a blank on this one. It fails on the very first
It could be that if the all-zero input is misaligned, the aligned read + rotation tricks get non-zero data from outside of the input area into the computation.
Apparently, NEON adjusts for endianness, meaning shifts switch direction as well. Which would explain why chacha and salsa20 didn't need more adjustment. All I needed to do in the end was change the order of registers for the 64bit return value in umac-nh and the checks passed.
I now have the whole testsuite passing apart from these two:
PASS: cxx
./sexp-conv-test: line 17: ../tools/sexp-conv: No such file or directory
cmp: EOF on test1.out which is empty
FAIL: sexp-conv
SKIP: pkcs1-conv
./nettle-pbkdf2-test: line 18: ../tools/nettle-pbkdf2: No such file or directory
cmp: EOF on test1.out which is empty
FAIL: nettle-pbkdf2
PASS: symbols
PASS: dlopen
====================
2 of 93 tests failed
====================
They've been failing all along. Can they be ignored?
I was wrong: While the compiler is able to output big-endian objects with -mbig-endian, it needs matching libs as well (e.g. libgcc_s). Debian doesn't have anything precompiled for armeb.
Maybe full-system qemu is easier then (assuming there's some dist supporting big-endian arm)?
Weeell, depends on what you consider easier: I haven't found any binary distribution that supports armeb. Yocto and buildroot seem to support it but still require compiling the whole thing.
-AC_C_BIGENDIAN
+AC_C_BIGENDIAN([AC_DEFINE([WORDS_BIGENDIAN], 1)
+	[ASM_WORDS_BIGENDIAN=yes]],
+	[ASM_WORDS_BIGENDIAN=no])
I think I'd indent differently, to group the two parts ACTION-IF-TRUE together, and drop the redundant square brackets in "[ASM_WORDS_BIGENDIAN=yes]".
Done.
You leave the ACTION-IF-UNIVERSAL as default, which I think is good. I hope that's not relevant for arm.
Ahem, seems I was looking at old documentation of autoconf which didn't have action-if-universal.
Apple does do arm and someone could potentially want to build a fat nettle that supports x86_64 and arm or rather arm and arm64.
Does nettle currently support being compiled fat with assembly at all? It would require building the individual platform's asm source and then lipo-ing them together, I guess. I'm not clear on the specifics.
Might still be good to set ASM_WORDS_BIGENDIAN to some default value before this check, and have IF_BE fail if used with an unknown endianness.
But then I want to have a nice error message so as to not leave the user with an aborted build and no apparent reason. :) Is this portable?
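The behaviour being discussed (select one of two bodies, and fail loudly on anything other than yes or no) can be modelled in Python as follows. This is only a sketch of the selection logic, not the m4 itself: if_be and if_le are hypothetical stand-ins for the IF_BE and IF_LE macros, and a ValueError stands in for the errprint and m4exit pair.

```python
def if_be(words_bigendian, be_text, le_text=""):
    """Return be_text on big-endian, le_text on little-endian, else fail."""
    if words_bigendian == "yes":
        return be_text
    if words_bigendian == "no":
        return le_text
    # Stand-in for errprint(...) followed by m4exit(1).
    raise ValueError(f"Unsupported endianness value {words_bigendian!r}")


def if_le(words_bigendian, le_text, be_text=""):
    """Defined in terms of if_be with the two bodies swapped."""
    return if_be(words_bigendian, be_text, le_text)


print(if_le("no", "rev W, W"))   # emitted only on little-endian
print(if_be("no", "rev r3, r3"))  # empty on little-endian
```

Because if_le is defined through if_be, the unknown-value check only has to live in one place.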
The patch got quite large now. Should I better make a series out of it?
From 67de31a70f8b8076681d6ddd221605365080103f Mon Sep 17 00:00:00 2001
From: Michael Weiser michael.weiser@gmx.de
Date: Wed, 7 Feb 2018 00:11:24 +0100
Subject: [PATCH] Support big-endian arm in assembly code

Introduce m4 macros to conditionally handle differences of little- and big-endian arm in assembler code. Adjust sha1-compress, sha256-compress, umac-nh, chacha-core-internal, salsa20-core-internal and memxor for arm to work in big-endian mode.
---
 arm/memxor.asm                     | 21 ++++++++++++----
 arm/memxor3.asm                    | 49 +++++++++++++++++++++++++-------------
 arm/neon/chacha-core-internal.asm  | 37 +++++++++++++++++++++++-----
 arm/neon/salsa20-core-internal.asm | 43 ++++++++++++++++++++++++++++-----
 arm/neon/umac-nh.asm               |  4 +++-
 arm/v6/sha1-compress.asm           |  8 +++++--
 arm/v6/sha256-compress.asm         | 14 +++++++----
 asm.m4                             |  8 +++++++
 config.m4.in                       |  1 +
 configure.ac                       |  7 +++++-
 10 files changed, 149 insertions(+), 43 deletions(-)
diff --git a/arm/memxor.asm b/arm/memxor.asm index a50e91bc..239a4034 100644 --- a/arm/memxor.asm +++ b/arm/memxor.asm @@ -44,6 +44,11 @@ define(<N>, <r2>) define(<CNT>, <r6>) define(<TNC>, <r12>)
+C little-endian and big-endian need to shift in different directions for +C alignment correction +define(<S0ADJ>, IF_LE(<lsr>, <lsl>)) +define(<S1ADJ>, IF_LE(<lsl>, <lsr>)) + .syntax unified
.file "memxor.asm" @@ -99,6 +104,8 @@ PROLOGUE(nettle_memxor) C C With little-endian, we need to do C DST[i] ^= (SRC[i] >> CNT) ^ (SRC[i+1] << TNC) + C With big-endian, we need to do + C DST[i] ^= (SRC[i] << CNT) ^ (SRC[i+1] >> TNC)
push {r4,r5,r6} @@ -117,14 +124,14 @@ PROLOGUE(nettle_memxor) .Lmemxor_word_loop: ldr r5, [SRC], #+4 ldr r3, [DST] - eor r3, r3, r4, lsr CNT - eor r3, r3, r5, lsl TNC + eor r3, r3, r4, S0ADJ CNT + eor r3, r3, r5, S1ADJ TNC str r3, [DST], #+4 .Lmemxor_odd: ldr r4, [SRC], #+4 ldr r3, [DST] - eor r3, r3, r5, lsr CNT - eor r3, r3, r4, lsl TNC + eor r3, r3, r5, S0ADJ CNT + eor r3, r3, r4, S1ADJ TNC str r3, [DST], #+4 subs N, #8 bcs .Lmemxor_word_loop @@ -132,10 +139,14 @@ PROLOGUE(nettle_memxor) beq .Lmemxor_odd_done
C We have TNC/8 left-over bytes in r4, high end - lsr r4, CNT + S0ADJ r4, CNT ldr r3, [DST] eor r3, r4
+ C memxor_leftover does an LSB store + C so we need to reverse if actually BE +IF_BE(< rev r3, r3>) + pop {r4,r5,r6}
C Store bytes, one by one. diff --git a/arm/memxor3.asm b/arm/memxor3.asm index 139fd208..69598e1c 100644 --- a/arm/memxor3.asm +++ b/arm/memxor3.asm @@ -49,6 +49,11 @@ define(<ATNC>, <r10>) define(<BCNT>, <r11>) define(<BTNC>, <r12>)
+C little-endian and big-endian need to shift in different directions for +C alignment correction +define(<S0ADJ>, IF_LE(<lsr>, <lsl>)) +define(<S1ADJ>, IF_LE(<lsl>, <lsr>)) + .syntax unified
.file "memxor3.asm" @@ -124,6 +129,8 @@ PROLOGUE(nettle_memxor3) C C With little-endian, we need to do C DST[i-i] ^= (SRC[i-i] >> CNT) ^ (SRC[i] << TNC) + C With big-endian, we need to do + C DST[i-i] ^= (SRC[i-i] << CNT) ^ (SRC[i] >> TNC) rsb ATNC, ACNT, #32 bic BP, #3
@@ -138,14 +145,14 @@ PROLOGUE(nettle_memxor3) .Lmemxor3_au_loop: ldr r5, [BP, #-4]! ldr r6, [AP, #-4]! - eor r6, r6, r4, lsl ATNC - eor r6, r6, r5, lsr ACNT + eor r6, r6, r4, S1ADJ ATNC + eor r6, r6, r5, S0ADJ ACNT str r6, [DST, #-4]! .Lmemxor3_au_odd: ldr r4, [BP, #-4]! ldr r6, [AP, #-4]! - eor r6, r6, r5, lsl ATNC - eor r6, r6, r4, lsr ACNT + eor r6, r6, r5, S1ADJ ATNC + eor r6, r6, r4, S0ADJ ACNT str r6, [DST, #-4]! subs N, #8 bcs .Lmemxor3_au_loop @@ -154,7 +161,11 @@ PROLOGUE(nettle_memxor3)
C Leftover bytes in r4, low end ldr r5, [AP, #-4] - eor r4, r5, r4, lsl ATNC + eor r4, r5, r4, S1ADJ ATNC + + C leftover does an LSB store + C so we need to reverse if actually BE +IF_BE(< rev r4, r4>)
.Lmemxor3_au_leftover: C Store a byte at a time @@ -247,21 +258,25 @@ PROLOGUE(nettle_memxor3) ldr r5, [AP, #-4]! ldr r6, [BP, #-4]! eor r5, r6 - lsl r4, ATNC - eor r4, r4, r5, lsr ACNT + S1ADJ r4, ATNC + eor r4, r4, r5, S0ADJ ACNT str r4, [DST, #-4]! .Lmemxor3_uu_odd: ldr r4, [AP, #-4]! ldr r6, [BP, #-4]! eor r4, r6 - lsl r5, ATNC - eor r5, r5, r4, lsr ACNT + S1ADJ r5, ATNC + eor r5, r5, r4, S0ADJ ACNT str r5, [DST, #-4]! subs N, #8 bcs .Lmemxor3_uu_loop adds N, #8 beq .Lmemxor3_done
+ C leftover does an LSB store + C so we need to reverse if actually BE +IF_BE(< rev r4, r4>) + C Leftover bytes in a4, low end ror r4, ACNT .Lmemxor3_uu_leftover: @@ -290,18 +305,18 @@ PROLOGUE(nettle_memxor3) .Lmemxor3_uud_loop: ldr r5, [AP, #-4]! ldr r7, [BP, #-4]! - lsl r4, ATNC - eor r4, r4, r6, lsl BTNC - eor r4, r4, r5, lsr ACNT - eor r4, r4, r7, lsr BCNT + S1ADJ r4, ATNC + eor r4, r4, r6, S1ADJ BTNC + eor r4, r4, r5, S0ADJ ACNT + eor r4, r4, r7, S0ADJ BCNT str r4, [DST, #-4]! .Lmemxor3_uud_odd: ldr r4, [AP, #-4]! ldr r6, [BP, #-4]! - lsl r5, ATNC - eor r5, r5, r7, lsl BTNC - eor r5, r5, r4, lsr ACNT - eor r5, r5, r6, lsr BCNT + S1ADJ r5, ATNC + eor r5, r5, r7, S1ADJ BTNC + eor r5, r5, r4, S0ADJ ACNT + eor r5, r5, r6, S0ADJ BCNT str r5, [DST, #-4]! subs N, #8 bcs .Lmemxor3_uud_loop diff --git a/arm/neon/chacha-core-internal.asm b/arm/neon/chacha-core-internal.asm index 6f623106..43bacda6 100644 --- a/arm/neon/chacha-core-internal.asm +++ b/arm/neon/chacha-core-internal.asm @@ -90,31 +90,50 @@ PROLOGUE(_nettle_chacha_core) vmov S2, X2 vmov S3, X3
- C Input rows: + C Input rows little-endian: C 0 1 2 3 X0 C 4 5 6 7 X1 C 8 9 10 11 X2 C 12 13 14 15 X3
+ C Input rows big-endian: + C 2 3 0 1 X0 + C 6 7 4 5 X1 + C 10 11 8 9 X2 + C 14 15 12 13 X3 + C because vldm switches doublewords + .Loop: QROUND(X0, X1, X2, X3)
- C Rotate rows, to get + C In little-endian rotate rows, to get C 0 1 2 3 C 5 6 7 4 >>> 3 C 10 11 8 9 >>> 2 C 15 12 13 14 >>> 1 - vext.32 X1, X1, X1, #1 + + C In big-endian rotate rows, to get + C 2 3 0 1 + C 7 4 5 6 >>> 3 + C 8 9 10 11 >>> 2 + C 13 14 15 12 >>> 1 + + C vext extracts from the top of the vector instead of bottom +IF_LE(< vext.32 X1, X1, X1, #1>) +IF_BE(< vext.32 X1, X1, X1, #3>) vext.32 X2, X2, X2, #2 - vext.32 X3, X3, X3, #3 +IF_LE(< vext.32 X3, X3, X3, #3>) +IF_BE(< vext.32 X3, X3, X3, #1>)
QROUND(X0, X1, X2, X3)
subs ROUNDS, ROUNDS, #2 C Inverse rotation - vext.32 X1, X1, X1, #3 +IF_LE(< vext.32 X1, X1, X1, #3>) +IF_BE(< vext.32 X1, X1, X1, #1>) vext.32 X2, X2, X2, #2 - vext.32 X3, X3, X3, #1 +IF_LE(< vext.32 X3, X3, X3, #1>) +IF_BE(< vext.32 X3, X3, X3, #3>)
bhi .Loop
@@ -123,6 +142,12 @@ PROLOGUE(_nettle_chacha_core) vadd.u32 X2, X2, S2 vadd.u32 X3, X3, S3
+ C caller expects result little-endian +IF_BE(< vrev32.u8 X0, X0 + vrev32.u8 X1, X1 + vrev32.u8 X2, X2 + vrev32.u8 X3, X3>) + vstm DST, {X0,X1,X2,X3} bx lr EPILOGUE(_nettle_chacha_core) diff --git a/arm/neon/salsa20-core-internal.asm b/arm/neon/salsa20-core-internal.asm index 34eb1fba..12a812d8 100644 --- a/arm/neon/salsa20-core-internal.asm +++ b/arm/neon/salsa20-core-internal.asm @@ -88,7 +88,7 @@ define(<QROUND>, < PROLOGUE(_nettle_salsa20_core) vldm SRC, {X0,X1,X2,X3}
- C Input rows: + C Input rows little-endian: C 0 1 2 3 X0 C 4 5 6 7 X1 C 8 9 10 11 X2 @@ -99,6 +99,18 @@ PROLOGUE(_nettle_salsa20_core) C 8 13 2 7 C 12 1 6 11
+ C Input rows big-endian: + C 2 3 0 1 X0 + C 6 7 4 5 X1 + C 10 11 8 9 X2 + C 14 15 12 13 X3 + C because vldm switches doublewords + C Permuted to: + C 10 15 0 5 + C 14 3 4 9 + C 2 7 8 13 + C 6 11 12 1 + C FIXME: Construct in some other way? adr r12, .Lmasks vldm r12, {M0101, M0110, M0011} @@ -112,6 +124,7 @@ PROLOGUE(_nettle_salsa20_core) C 4 1 6 3 T0 v C 8 13 10 15 T1 ^ C 12 9 14 11 X3 v + C same in big endian just with transposed double-rows vmov T0, X1 vmov T1, X2 vbit T0, X0, M0101 @@ -140,22 +153,34 @@ PROLOGUE(_nettle_salsa20_core) .Loop: QROUND(X0, X1, X2, X3)
- C Rotate rows, to get + C In little-endian rotate rows, to get C 0 5 10 15 C 3 4 9 14 >>> 1 C 2 7 8 13 >>> 2 C 1 6 11 12 >>> 3 - vext.32 X1, X1, X1, #3 + + C In big-endian rotate rows, to get + C 10 15 0 5 + C 9 14 3 4 >>> 1 + C 8 13 2 7 >>> 2 + C 11 12 1 6 >>> 3 + + C vext extracts from the top of the vector instead of bottom +IF_LE(< vext.32 X1, X1, X1, #3>) +IF_BE(< vext.32 X1, X1, X1, #1>) vext.32 X2, X2, X2, #2 - vext.32 X3, X3, X3, #1 +IF_LE(< vext.32 X3, X3, X3, #1>) +IF_BE(< vext.32 X3, X3, X3, #3>)
QROUND(X0, X3, X2, X1)
subs ROUNDS, ROUNDS, #2 C Inverse rotation - vext.32 X1, X1, X1, #1 +IF_LE(< vext.32 X1, X1, X1, #1>) +IF_BE(< vext.32 X1, X1, X1, #3>) vext.32 X2, X2, X2, #2 - vext.32 X3, X3, X3, #3 +IF_LE(< vext.32 X3, X3, X3, #3>) +IF_BE(< vext.32 X3, X3, X3, #1>)
bhi .Loop
@@ -181,6 +206,12 @@ PROLOGUE(_nettle_salsa20_core) vadd.u32 X2, X2, S2 vadd.u32 X3, X3, S3
+ C caller expects result little-endian +IF_BE(< vrev32.u8 X0, X0 + vrev32.u8 X1, X1 + vrev32.u8 X2, X2 + vrev32.u8 X3, X3>) + vstm DST, {X0,X1,X2,X3} bx lr EPILOGUE(_nettle_salsa20_core) diff --git a/arm/neon/umac-nh.asm b/arm/neon/umac-nh.asm index 158a5686..2b617202 100644 --- a/arm/neon/umac-nh.asm +++ b/arm/neon/umac-nh.asm @@ -97,6 +97,8 @@ PROLOGUE(_nettle_umac_nh) bhi .Loop
vadd.i64 D0REG(QY), D0REG(QY), D1REG(QY) - vmov r0, r1, D0REG(QY) + C return values use memory endianness +IF_LE(< vmov r0, r1, D0REG(QY)>) +IF_BE(< vmov r1, r0, D0REG(QY)>) bx lr EPILOGUE(_nettle_umac_nh) diff --git a/arm/v6/sha1-compress.asm b/arm/v6/sha1-compress.asm index 59d6297e..8cc22be7 100644 --- a/arm/v6/sha1-compress.asm +++ b/arm/v6/sha1-compress.asm @@ -52,7 +52,7 @@ define(<LOAD>, < sel W, WPREV, T0 ror W, W, SHIFT mov WPREV, T0 - rev W, W +IF_LE(< rev W, W>) str W, [SP,#eval(4*$1)]
)
define(<EXPN>, < @@ -127,8 +127,12 @@ PROLOGUE(_nettle_sha1_compress) lsl SHIFT, SHIFT, #3 mov T0, #0 movne T0, #-1 - lsl W, T0, SHIFT +IF_LE(< lsl W, T0, SHIFT>) +IF_BE(< lsr W, T0, SHIFT>) uadd8 T0, T0, W C Sets APSR.GE bits + C on BE rotate right by 32-SHIFT bits + C because there is no rotate left +IF_BE(< rsb SHIFT, SHIFT, #32>) ldr K, .LK1 ldm STATE, {SA,SB,SC,SD,SE} diff --git a/arm/v6/sha256-compress.asm b/arm/v6/sha256-compress.asm index e6f4e1e9..324730c7 100644 --- a/arm/v6/sha256-compress.asm +++ b/arm/v6/sha256-compress.asm @@ -137,8 +137,12 @@ PROLOGUE(_nettle_sha256_compress) lsl SHIFT, SHIFT, #3 mov T0, #0 movne T0, #-1 - lsl I1, T0, SHIFT +IF_LE(< lsl I1, T0, SHIFT>) +IF_BE(< lsr I1, T0, SHIFT>) uadd8 T0, T0, I1 C Sets APSR.GE bits + C on BE rotate right by 32-SHIFT bits + C because there is no rotate left +IF_BE(< rsb SHIFT, SHIFT, #32>)
 	mov DST, sp
 	mov ILEFT, #4
@@ -146,16 +150,16 @@ PROLOGUE(_nettle_sha256_compress)
 	ldm INPUT!, {I1,I2,I3,I4}
 	sel I0, I0, I1
 	ror I0, I0, SHIFT
-	rev I0, I0
+IF_LE(<	rev I0, I0>)
 	sel I1, I1, I2
 	ror I1, I1, SHIFT
-	rev I1, I1
+IF_LE(<	rev I1, I1>)
 	sel I2, I2, I3
 	ror I2, I2, SHIFT
-	rev I2, I2
+IF_LE(<	rev I2, I2>)
 	sel I3, I3, I4
 	ror I3, I3, SHIFT
-	rev I3, I3
+IF_LE(<	rev I3, I3>)
 	subs ILEFT, ILEFT, #1
 	stm DST!, {I0,I1,I2,I3}
 	mov I0, I4
diff --git a/asm.m4 b/asm.m4
index 4018c235..8c290551 100644
--- a/asm.m4
+++ b/asm.m4
@@ -51,6 +51,14 @@ define(<ALIGN>,
 <.align ifelse(ALIGN_LOG,yes,<m4_log2($1)>,$1)
 >)
 
+define(<IF_BE>, <ifelse(
+WORDS_BIGENDIAN,yes,<$1>,
+WORDS_BIGENDIAN,no,<$2>,
+<errprint(__file__:__line__:,<Unsupported endianness value>,WORDS_BIGENDIAN,<
+>)
+  m4exit(1)>)>)
+define(<IF_LE>, <IF_BE(<$2>, <$1>)>)
+
 dnl Struct defining macros
 
 dnl STRUCTURE(prefix)
diff --git a/config.m4.in b/config.m4.in
index e39c880c..11f90a40 100644
--- a/config.m4.in
+++ b/config.m4.in
@@ -7,6 +7,7 @@ define(<TYPE_PROGBITS>, <@ASM_TYPE_PROGBITS@>)dnl
 define(<ALIGN_LOG>, <@ASM_ALIGN_LOG@>)dnl
 define(<W64_ABI>, <@W64_ABI@>)dnl
 define(<RODATA>, <@ASM_RODATA@>)dnl
+define(<WORDS_BIGENDIAN>, <@ASM_WORDS_BIGENDIAN@>)dnl
 divert(1)
 @ASM_MARK_NOEXEC_STACK@
 divert
diff --git a/configure.ac b/configure.ac
index 41bf0998..b57d2fb8 100644
--- a/configure.ac
+++ b/configure.ac
@@ -201,7 +201,11 @@ LSH_FUNC_STRERROR
 # getenv_secure is used for fat overrides,
 # getline is used in the testsuite
 AC_CHECK_FUNCS(secure_getenv getline)
-AC_C_BIGENDIAN
+
+ASM_WORDS_BIGENDIAN=unknown
+AC_C_BIGENDIAN([AC_DEFINE([WORDS_BIGENDIAN], 1)
+	ASM_WORDS_BIGENDIAN=yes],
+	[ASM_WORDS_BIGENDIAN=no])
 
 AC_CACHE_CHECK([for __builtin_bswap64],
 	nettle_cv_c_builtin_bswap64,
@@ -811,6 +815,7 @@ AC_SUBST(ASM_TYPE_PROGBITS)
 AC_SUBST(ASM_MARK_NOEXEC_STACK)
 AC_SUBST(ASM_ALIGN_LOG)
 AC_SUBST(W64_ABI)
+AC_SUBST(ASM_WORDS_BIGENDIAN)
 AC_SUBST(EMULATOR)
 
 AC_SUBST(LIBNETTLE_MAJOR)
Michael Weiser michael@weiser.dinsnail.net writes:
Right. When this still didn't fix it, I compared little- and big-endian behaviour and found that a.) vldm and vstm switch doublewords for no reason I can see or find documentation about and b.)
By "doublewords", you mean 64-bit words, right?
It might make sense to view it as big-endian or little-endian load of 128-bit values, and a 128-bit (16-byte) byte swap will then also swap the low and high 64-bit halves.
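The arithmetic behind this remark can be checked with a small Python sketch. It says nothing about what vldm actually does; it only shows that reversing all 16 bytes of a 128-bit value also exchanges its two 64-bit halves. The names bswap and halves are made up for the example.

```python
def bswap(value, nbytes):
    """Reverse the byte order of an nbytes-wide unsigned integer."""
    return int.from_bytes(value.to_bytes(nbytes, "little"), "big")


def halves(v128):
    """Split a 128-bit integer into its (high, low) 64-bit halves."""
    return v128 >> 64, v128 & (2**64 - 1)


v = (0x0011223344556677 << 64) | 0x8899AABBCCDDEEFF
hi, lo = halves(v)
swapped_hi, swapped_lo = halves(bswap(v, 16))

# Each half is byte-reversed and the two halves have traded places.
print(hex(swapped_hi), hex(swapped_lo))
```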
vext extracts from the top of the vector, not bottom. Taking both into account, I now have chacha and salsa20 passing tests.
If it's hard to find docs, I take it as a sign big-endian arm is a bit obscure... Could you add a short note to arm/README with your findings? (It's quite some time since I did neon assembly, so I don't recall off the top of my head any details on what the various instructions, in particular vext, do).
PASS: cxx
./sexp-conv-test: line 17: ../tools/sexp-conv: No such file or directory
cmp: EOF on test1.out which is empty
FAIL: sexp-conv
SKIP: pkcs1-conv
./nettle-pbkdf2-test: line 18: ../tools/nettle-pbkdf2: No such file or directory
cmp: EOF on test1.out which is empty
FAIL: nettle-pbkdf2
PASS: symbols
PASS: dlopen
====================
2 of 93 tests failed
====================
They've been failing all along. Can they be ignored?
They're not that relevant to your changes, but I'd like to understand why they fail. What are the contents of the tools dir in your build tree? You haven't done something like switching from building in the source tree to a separate build tree, without a proper cleaning (make distclean) in the source tree?
Weeell, depends on what you consider easier: I haven't found any binary distribution that supports armeb. Yocto and buildroot seem to support it but still require compiling the whole thing.
Hmm. Sounds more than a bit inconvenient.
Apple does do arm and someone could potentially want to build a fat nettle that supports x86_64 and arm or rather arm and arm64.
My concern is not breaking any setup which currently works, e.g., a non-assembly "universal" build involving architectures with different endianness.
Does nettle currently support being compiled fat with assembly at all?
I don't think so. I'd expect one would have to build for one arch at a time, and have some postprocessing scripts to produce apple-fat libraries.
But then I want to have a nice error message so as to not leave the user with an aborted build and no apparent reason. :) Is this portable?
According to http://pubs.opengroup.org/onlinepubs/9699919799/utilities/m4.html, errprint and m4exit are standard m4. (If they're also supported in practice is a different question, it's desirable to at least work with both GNU and BSD m4). If __file__ and __line__ are unportable, you could omit that. Since the error message reports a pretty global config problem, precise location isn't that important.
The patch got quite large now. Should I better make a series out of it?
As you prefer, I think it is workable as is. It might help to split out the configure-related changes.
Regards, /Niels
Hi Niels,
On Mon, Feb 12, 2018 at 08:59:16AM +0100, Niels Möller wrote:
Right. When this still didn't fix it, I compared little- and big-endian behaviour and found that a.) vldm and vstm switch doublewords for no reason I can see or find documentation about and b.)
By "doublewords", you mean 64-bit words, right?
Yes. ARM talks in bytes, halfwords, words, doublewords and quadwords.
It might make sense to view it as big-endian or little-endian load of 128-bit values, and a 128-bit (16-byte) byte swap will then also swap the low and high 64-bit halves.
[...]
If it's hard to find docs, I take it as a sign big-endian arm is a bit obscure...
Actually, it's all quite well-documented, just not always as obviously as I'd like: The ARM ARM (Architecture Reference Manual) spells out the low-level details. By additionally looking very closely at the gdb output, I found the following for the chacha and salsa implementations:
1. There's no vldm or vstm on quadword registers in the architecture. It gets translated into vldm on the corresponding number of doubleword registers.
Disassembly of section .text:
00000000 <_nettle_chacha_core>:
   0:	ec910b10 	vldmia	r1, {d0-d7}
This is hinted at here http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204j/Bcfchhi... by saying: "If Q registers are specified, on disassembly they are shown as D registers."
2. vldm and vstm on doubleword registers swap 32-bit words inside the doubleword to get a full byte-swap in addition to the byte- and halfword-swapping the word-access already does. Since chacha and salsa input is a matrix of 32-bit words, the word swap transposes even and odd columns (not doublewords):
// Combine the word-aligned words in the correct order for current endianness.
D[d+r] = if BigEndian() then word1:word2 else word2:word1;
3. The input to chacha-core is 32bit words in host endianness.
4. gdb's print output ordering is really confusing.
So all that's basically happening is that odd and even columns get switched. The individual words' values are exactly the same because the input is in host endianness already. So NEON doesn't adjust for endianness after all.
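The pseudocode in point 2 can be modelled in a few lines of Python. This is a sketch under assumptions, not a description of the hardware: vldm_d, lanes32 and load_q are invented names, and lane 0 is taken to be the least significant 32 bits of a doubleword register. The example words are the ChaCha constants from the gdb dumps.

```python
def vldm_d(mem, address, byteorder):
    """Load one doubleword register as two word accesses, per the pseudocode."""
    word1 = int.from_bytes(mem[address:address + 4], byteorder)
    word2 = int.from_bytes(mem[address + 4:address + 8], byteorder)
    if byteorder == "big":
        return (word1 << 32) | word2   # word1:word2
    return (word2 << 32) | word1       # word2:word1


def lanes32(d):
    """32-bit lanes of a doubleword, lane 0 first (assumed to be the low half)."""
    return [d & 0xFFFFFFFF, d >> 32]


def load_q(words, byteorder):
    """Store four host-endian words to memory, then load them as two D registers."""
    mem = b"".join(w.to_bytes(4, byteorder) for w in words)
    return lanes32(vldm_d(mem, 0, byteorder)) + lanes32(vldm_d(mem, 8, byteorder))


row = [0x61707865, 0x3320646E, 0x79622D32, 0x6B206574]
print([hex(w) for w in load_q(row, "little")])
print([hex(w) for w in load_q(row, "big")])
```

In this model the word values are identical on both hosts. Only their lane positions differ, in the 1:0:3:2 pattern, which matches the observation that odd and even columns trade places.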
What's been fooling me is that apparently gdb tries to show the values of vector registers as if they had been stored to memory by an operation of the full bit-size of the register shown and then read back again as consecutive elements of various other sizes (8, 16, 32, 64-bit):
p/x $q0
le: u8 = {0x65, 0x78, 0x70, 0x61, 0x6e, 0x64, 0x20, 0x33, 0x32, 0x2d, 0x62, 0x79, 0x74, 0x65, 0x20, 0x6b}
be: u8 = {0x79, 0x62, 0x2d, 0x32, 0x6b, 0x20, 0x65, 0x74, 0x61, 0x70, 0x78, 0x65, 0x33, 0x20, 0x64, 0x6e}
    ^ bytes reversed by 128-bit store + read as byte sequence
    -> vldm 1:0:3:2 column swap still visible

le: u32 = {0x61707865, 0x3320646e, 0x79622d32, 0x6b206574}
be: u32 = {0x79622d32, 0x6b206574, 0x61707865, 0x3320646e}
    ^ bytes reversed by 128-bit store + read as four consecutive big-endian 32-bit words + vldm column swap
    -> makes it appear doublewords have been swapped
The realisation that even and odd columns get switched also explains the necessary vext adjustments. So it's also not true that vext changes the end of the vector where it extracts.
Regarding umac it's similar: vld1.8 loads a byte sequence from memory without any swapping with either le or be. vld1.i32 reads the keys stored in host endianness as words from memory. So the representation ending up in the registers is the same as well which is why the code doesn't need any adjustment.
Finally, the register switch for the return value with vmov in umac-nh stems from the calling convention. AAPCS says:
"Fundamental types larger than 32 bits may be passed as parameters to, or returned as the result of, function calls. When these types are in core registers the following rules apply: * A doubleword sized type is passed in two consecutive registers (e.g., r0 and r1, or r2 and r3). The content of the registers is as if the value had been loaded from memory representation with a single LDM instruction."
When loading a big-endian doubleword using ldm, the words end up in the registers with the right values but transposed. Since the calling convention mandates exactly this, we have to transpose the words upon function exit as well.
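To make the register assignment concrete, here is a small Python model of the AAPCS rule quoted above. It is a sketch only: ldm_pair is a made-up name, and it assumes LDM fills r0 from the lower address and r1 from the next word.

```python
def ldm_pair(value64, byteorder):
    """Return (r0, r1) as if a 64-bit value in memory were loaded with LDM."""
    mem = value64.to_bytes(8, byteorder)           # memory representation
    r0 = int.from_bytes(mem[0:4], byteorder)       # word at the lower address
    r1 = int.from_bytes(mem[4:8], byteorder)       # word at the higher address
    return r0, r1


value = 0x0123456789ABCDEF
print([hex(r) for r in ldm_pair(value, "little")])  # r0 = low word, r1 = high word
print([hex(r) for r in ldm_pair(value, "big")])     # r0 = high word, r1 = low word
```

So a big-endian caller expects the high word in r0, which is consistent with the patch exchanging r0 and r1 in the final vmov of umac-nh on big-endian.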
Phew.
Could you add a short note to arm/README with your findings? (It's quite some time since I did neon assembly, so I don't recall off the top of my head any details on what the various instructions, in particular vext, do).
Done.
FAIL: sexp-conv
FAIL: nettle-pbkdf2
They've been failing all along. Can they be ignored?
They're not that relevant to your changes, but I'd like to understand why they fail. What are the contents of the tools dir in your build tree? You haven't done something like switching from building in the source tree to a separate build tree, without a proper cleaning (make distclean) in the source tree?
No. But I have been ignoring an annoying build failure due to TeX being missing. After reconfiguring with --disable-documentation build and testsuite succeed. My bad.
Weeell, depends on what you consider easier: I haven't found any binary distribution that supports armeb. Yocto and buildroot seem to support it but still require compiling the whole thing.
Hmm. Sounds more than a bit inconvenient.
The qemu-user chroot route with the linaro cross toolchain isn't too bad actually:
cd $HOME/gcc-linaro-7.2.1-2017.11-x86_64_armeb-linux-gnueabihf/armeb-linux-gnueabihf/libc
cp /usr/bin/qemu-armeb-static usr/bin
wget https://gmplib.org/download/gmp/gmp-6.1.2.tar.lz
tar -xf gmp-6.1.2.tar.lz
cd gmp-6.1.2
# segfaults in qemu with -march=armv4 default
PATH=$PWD/../../../bin:$PATH CFLAGS="-march=armv7-a" ./configure --host=armeb-linux-gnueabihf --prefix=$PWD/../gmp
PATH=$PWD/../../../bin:$PATH make -j4 install

git clone https://git.lysator.liu.se/nettle/nettle.git
cd nettle
autoreconf
PATH=$PWD/../../../bin:$PATH ./configure --host=armeb-linux-gnueabihf --enable-arm-neon --with-include-path=$PWD/../gmp/include --with-lib-path=$PWD/../gmp/lib
PATH=$PWD/../../../bin:$PATH make -j4
NETTLE_TEST_ROOT=/nettle/testsuite PATH=$PWD/../../../bin:$PATH make -j4 check EMULATOR="sudo QEMU_SET_ENV=LD_LIBRARY_PATH=/nettle/.lib:/gmp/lib chroot $PWD/.."
with this small patch to run-tests:

diff --git a/run-tests b/run-tests
index 3d5655cf..bbc2bb4c 100755
--- a/run-tests
+++ b/run-tests
@@ -37,7 +37,7 @@ find_program () {
 	    ;;
 	*)
 	    if [ -x "$1" ] ; then
-		echo "./$1"
+		echo "${NETTLE_TEST_ROOT:=.}/$1"
 	    else
 		echo "$srcdir/$1"
 	    fi
Apple does do arm and someone could potentially want to build a fat nettle that supports x86_64 and arm or rather arm and arm64.
My concern is not breaking any setup which currently works, e.g., a non-assembly "universal" build involving architectures with different endianness.
Right, that should be fine then.
Does nettle currently support being compiled fat with assembly at all?
I don't think so. I'd expect one would have to build for one arch at a time, and have some postprocessing scripts to produce apple-fat libraries.
Apple have wrapped this in the compiler driver using multiple -arch arguments. "gcc -arch x86_64 -arch arm" will run the compiler twice on the same file and lipo the resulting objects together into a fat object. The linker supports linking those into fat binaries.
If all the assembler implementations of the same routine were in one file wrapped by #ifdefs the same could be done there. Otherwise, assembly and lipoing would have to be done explicitly for those files.
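To illustrate the explicit variant, a rough dry-run sketch that only prints the commands one would need per assembly file. The function name and the per-architecture file naming (NAME-ARCH.s) are assumptions made up for this example, and nothing in nettle's build system does this today; only the -arch option and "lipo -create -output" are the real tool interfaces.

```shell
# Dry run: print the per-architecture assemble steps and the final
# lipo step instead of executing them (no Apple toolchain required).
# Usage: fat_asm_commands NAME ARCH...
fat_asm_commands() {
  out=$1
  shift
  objs=""
  for arch in "$@"; do
    # Hypothetical layout: one preprocessed assembly file per arch.
    echo "clang -arch $arch -c -o $out.$arch.o $out-$arch.s"
    objs="$objs $out.$arch.o"
  done
  # Glue the thin objects into one fat object.
  echo "lipo -create -output $out.o$objs"
}

fat_asm_commands memxor x86_64 arm64
```

The point is merely that each architecture gets its own source file and its own assembler run, and lipo combines the results afterwards.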
# clang -v -arch x86_64 -arch i386 -c -o t.o t.c
[...]
Apple LLVM version 9.0.0 (clang-900.0.39.2)
Target: i386-apple-darwin17.4.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
 "/Library/Developer/CommandLineTools/usr/bin/clang" -cc1 -triple x86_64-apple-macosx10.13.0 ...
[...]
 "/Library/Developer/CommandLineTools/usr/bin/clang" -cc1 -triple i386-apple-macosx10.13.0 ...
[...]
 "/Library/Developer/CommandLineTools/usr/bin/lipo" -create -output t.o /var/folders/ft/dp06pw254ybbzt42f1qn65pm0000gp/T/t-5eeded.o /var/folders/ft/dp06pw254ybbzt42f1qn65pm0000gp/T/t-b25776.o
# file t.o
t.o: Mach-O universal binary with 2 architectures: [x86_64:Mach-O 64-bit object x86_64] [i386:Mach-O object i386]
t.o (for architecture x86_64): Mach-O 64-bit object x86_64
t.o (for architecture i386): Mach-O object i386
But then I want to have a nice error message so as to not leave the user with an aborted build and no apparent reason. :) Is this portable?
According to http://pubs.opengroup.org/onlinepubs/9699919799/utilities/m4.html, errprint and m4exit are standard m4. (Whether they're also supported in practice is a different question; it's desirable to at least work with both GNU and BSD m4.) If __file__ and __line__ are unportable, you could omit them. Since the error message reports a pretty global config problem, the precise location isn't that important.
Not critical, __file__ and __line__ dropped. Net/Free/OpenBSD m4 support them though.
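For reference, a small sketch of the errprint/m4exit pattern, wrapped in a shell function so it can be tried against whichever m4 is installed. The macro name CONFIG_VALUE, the accepted values and the message text are placeholders invented for this example and not what the patch uses; it also sticks to m4's default quotes.

```shell
# Feed a tiny m4 input that aborts with a readable message when the
# (hypothetical) CONFIG_VALUE macro is neither "yes" nor "no".
check_config_value() {
  m4 -DCONFIG_VALUE="$1" <<'EOF'
ifelse(CONFIG_VALUE, `yes', `', CONFIG_VALUE, `no', `',
  `errprint(`ERROR: CONFIG_VALUE should be yes or no
')m4exit(1)')dnl
config ok
EOF
}

# Valid setting: m4 continues and emits the rest of the input.
check_config_value yes

# Invalid setting: message on stderr, non-zero exit status.
check_config_value maybe || echo "m4 aborted as intended"
```

Only errprint, m4exit, ifelse and dnl are used, so this should behave the same on GNU and BSD m4, though I have not verified every implementation.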
The patch has grown quite large by now. Should I make a series out of it instead?
As you prefer, I think it is workable as is. It might help to split out the configure-related changes.
Series forthcoming.