Hi Niels,
On Mon, Feb 12, 2018 at 08:59:16AM +0100, Niels Möller wrote:
Right. When this still didn't fix it, I compared little- and big-endian behaviour and found that a.) vldm and vstm switch doublewords for no reason I can see or find documentation about and b.)
By "doublewords", you mean 64-bit words, right?
Yes. ARM talks in bytes, halfwords, words, doublewords and quadwords.
It might make sense to view it as big-endian or little-endian load of 128-bit values, and a 128-bit (16-byte) byte swap will then also swap the low and high 64-bit halves.
[...]
If it's hard to find docs, I take it as a sign big-endian arm is a bit obscure...
Actually, it's all quite well-documented, just not always as obviously as I'd like: The ARM ARM (Architecture Reference Manual) spells out the low-level details. By additionally looking very closely at the gdb output, I found the following for the chacha and salsa implementations:
1. There's no vldm or vstm on quadword registers in the architecture. It gets translated into vldm on the corresponding number of doubleword registers.
Disassembly of section .text:

00000000 <_nettle_chacha_core>:
   0:   ec910b10    vldmia  r1, {d0-d7}
This is hinted at here http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204j/Bcfchhi... by saying: "If Q registers are specified, on disassembly they are shown as D registers."
2. vldm and vstm on doubleword registers swap 32-bit words inside the doubleword to get a full byte-swap in addition to the byte- and halfword-swapping the word-access already does. Since chacha and salsa input is a matrix of 32-bit words, the word swap transposes even and odd columns (not doublewords):
// Combine the word-aligned words in the correct order for current endianness.
D[d+r] = if BigEndian() then word1:word2 else word2:word1;

3. The input to chacha-core is 32-bit words in host endianness.
4. gdb's print output ordering is really confusing.
So all that's basically happening is that odd and even columns get switched. The individual words' values are exactly the same because the input is in host endianness already. So NEON doesn't adjust for endianness after all.
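The column swap can be reproduced with a small model of the doubleword load. The following Python sketch is only an illustration: it assumes memory holds four host-endian 32-bit words and applies the ARM ARM rule quoted above (word1:word2 on big-endian, word2:word1 on little-endian). The helper names are made up for this example and are not part of nettle.

```python
# Toy model of "vldm rN, {d0-d1}" on a row of four 32-bit host-endian words.
# All names here are hypothetical helpers for illustration only.

def vldm_doubleword(word1, word2, big_endian):
    """Combine two consecutive memory words into one 64-bit D register.

    Mirrors the ARM ARM pseudocode:
    D[d+r] = if BigEndian() then word1:word2 else word2:word1
    """
    hi, lo = (word1, word2) if big_endian else (word2, word1)
    return (hi << 32) | lo

def lanes32(d):
    """Return the two 32-bit lanes of a D register, lane 0 (low bits) first."""
    return [d & 0xFFFFFFFF, d >> 32]

def load_row(words, big_endian):
    """Load an even number of words pairwise and list the resulting lanes."""
    lanes = []
    for i in range(0, len(words), 2):
        lanes += lanes32(vldm_doubleword(words[i], words[i + 1], big_endian))
    return lanes

# First row of the chacha state ("expand 32-byte k") as 32-bit words.
ROW = [0x61707865, 0x3320646E, 0x79622D32, 0x6B206574]

if __name__ == "__main__":
    print([hex(w) for w in load_row(ROW, big_endian=False)])  # order 0:1:2:3
    print([hex(w) for w in load_row(ROW, big_endian=True)])   # order 1:0:3:2
```

In this model the word values are identical on both endiannesses; only the lane positions of even and odd columns differ.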
What's been fooling me is that apparently gdb tries to show the values of vector registers as if they had been stored to memory by an operation of the full bit-size of the register shown and then read back again as consecutive elements of various other sizes (8, 16, 32, 64-bit):
p/x $q0
le: u8 = {0x65, 0x78, 0x70, 0x61, 0x6e, 0x64, 0x20, 0x33, 0x32, 0x2d, 0x62, 0x79, 0x74, 0x65, 0x20, 0x6b}
be: u8 = {0x79, 0x62, 0x2d, 0x32, 0x6b, 0x20, 0x65, 0x74, 0x61, 0x70, 0x78, 0x65, 0x33, 0x20, 0x64, 0x6e}
    ^ bytes reversed by 128-bit store + read as byte sequence
      -> vldm 1:0:3:2 column swap still visible

le: u32 = {0x61707865, 0x3320646e, 0x79622d32, 0x6b206574}
be: u32 = {0x79622d32, 0x6b206574, 0x61707865, 0x3320646e}
    ^ bytes reversed by 128-bit store + read as four consecutive
      big-endian 32-bit words + vldm column swap
      -> makes it appear doublewords have been swapped
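The gdb listing above can be reproduced under the assumption described there, namely that gdb stores the full 128-bit register value in target byte order and then reads it back as consecutive elements. The Python sketch below models only that assumption; it is not derived from gdb's source, and the function names are invented for the example.

```python
# Model of the presumed gdb display logic for a 128-bit Q register.
# pack_q() and gdb_view() are hypothetical helpers for illustration.

def pack_q(lanes):
    """Build the 128-bit register value from four 32-bit lanes (lane 0 lowest)."""
    q = 0
    for i, lane in enumerate(lanes):
        q |= lane << (32 * i)
    return q

def gdb_view(q, byteorder):
    """Store q as 16 bytes in target byte order, then re-read as u8 and u32."""
    raw = q.to_bytes(16, byteorder)
    u8 = list(raw)
    u32 = [int.from_bytes(raw[i:i + 4], byteorder) for i in range(0, 16, 4)]
    return u8, u32

# Lane contents after vldm: unchanged on le, columns 1:0:3:2 on be.
LE_LANES = [0x61707865, 0x3320646E, 0x79622D32, 0x6B206574]
BE_LANES = [0x3320646E, 0x61707865, 0x6B206574, 0x79622D32]

if __name__ == "__main__":
    for name, lanes, order in (("le", LE_LANES, "little"), ("be", BE_LANES, "big")):
        u8, u32 = gdb_view(pack_q(lanes), order)
        print(name, "u8  =", [hex(b) for b in u8])
        print(name, "u32 =", [hex(w) for w in u32])
```

With the column-swapped lanes as input, the big-endian view comes out as if the two doublewords had traded places, which matches the be lines shown above.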
The realisation that even and odd columns get switched also explains the necessary vext adjustments. So it's also not true that vext changes the end of the vector where it extracts.
Regarding umac it's similar: vld1.8 loads a byte sequence from memory without any swapping with either le or be. vld1.i32 reads the keys stored in host endianness as words from memory. So the representation ending up in the registers is the same as well which is why the code doesn't need any adjustment.
Finally, the register switch for the return value with vmov in umac-nh stems from the calling convention. AAPCS says:
"Fundamental types larger than 32 bits may be passed as parameters to, or returned as the result of, function calls. When these types are in core registers the following rules apply:

* A doubleword sized type is passed in two consecutive registers (e.g., r0 and r1, or r2 and r3). The content of the registers is as if the value had been loaded from memory representation with a single LDM instruction."
When loading a big-endian doubleword using ldm, the words end up in the registers with the right values but transposed. Since the calling convention mandates exactly this, we have to transpose the words upon function exit as well.
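A small model makes the transposition concrete. This Python sketch assumes only what the AAPCS quote states: the register pair holds what a single LDM would load from the in-memory representation, with r0 taking the word at the lower address. The function name is hypothetical.

```python
import struct

def ldm_pair(value, big_endian):
    """Return (r0, r1) as if a 64-bit value were loaded from memory by LDM.

    The value is first laid out in memory in the target's byte order,
    then read back as two consecutive 32-bit words in that same order.
    """
    order = ">" if big_endian else "<"
    memory = struct.pack(order + "Q", value)
    r0, r1 = struct.unpack(order + "II", memory)
    return r0, r1

if __name__ == "__main__":
    value = 0x0011223344556677
    print([hex(r) for r in ldm_pair(value, big_endian=False)])  # low word in r0
    print([hex(r) for r in ldm_pair(value, big_endian=True)])   # high word in r0
```

On little-endian r0 holds the low word, on big-endian it holds the high word, so code that computes the result as low/high halves has to swap them before returning on big-endian.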
Phew.
Could you add a short note to arm/README with your findings? (It's quite some time since I did neon assembly, so I don't recall off the top of my head any details on what the various instructions, in particular vextr, do).
Done.
FAIL: sexp-conv
FAIL: nettle-pbkdf2

They've been failing all along. Can they be ignored?
They're not that relevant to your changes, but I'd like to understand why they fail. What's the contents of the tools dir in your build tree? You haven't done something like switching from building in the source tree to a separate build tree, without a proper cleaning (make distclean) in the source tree?
No. But I have been ignoring an annoying build failure due to TeX being missing. After reconfiguring with --disable-documentation, both the build and the testsuite succeed. My bad.
Weeell, depends on what you consider easier: I haven't found any binary distribution that supports armeb. Yocto and buildroot seem to support it but still require compiling the whole thing.
Hmm. Sounds more than a bit inconvenient.
The qemu-user chroot route with the linaro cross toolchain isn't too bad actually:
cd $HOME/gcc-linaro-7.2.1-2017.11-x86_64_armeb-linux-gnueabihf/armeb-linux-gnueabihf/libc
cp /usr/bin/qemu-armeb-static usr/bin
wget https://gmplib.org/download/gmp/gmp-6.1.2.tar.lz
tar -xf gmp-6.1.2.tar.lz
cd gmp-6.1.2
# segfaults in qemu with -march=armv4 default
PATH=$PWD/../../../bin:$PATH CFLAGS="-march=armv7-a" ./configure --host=armeb-linux-gnueabihf --prefix=$PWD/../gmp
PATH=$PWD/../../../bin:$PATH make -j4 install
git clone https://git.lysator.liu.se/nettle/nettle.git
cd nettle
autoreconf
PATH=$PWD/../../../bin:$PATH ./configure --host=armeb-linux-gnueabihf --enable-arm-neon --with-include-path=$PWD/../gmp/include --with-lib-path=$PWD/../gmp/lib
PATH=$PWD/../../../bin:$PATH make -j4
NETTLE_TEST_ROOT=/nettle/testsuite PATH=$PWD/../../../bin:$PATH make -j4 check EMULATOR="sudo QEMU_SET_ENV=LD_LIBRARY_PATH=/nettle/.lib:/gmp/lib chroot $PWD/.."
with this small patch to run-tests:

diff --git a/run-tests b/run-tests
index 3d5655cf..bbc2bb4c 100755
--- a/run-tests
+++ b/run-tests
@@ -37,7 +37,7 @@ find_program () {
 	;;
     *)
 	if [ -x "$1" ] ; then
-	    echo "./$1"
+	    echo "${NETTLE_TEST_ROOT:=.}/$1"
 	else
 	    echo "$srcdir/$1"
 	fi
Apple does do arm and someone could potentially want to build a fat nettle that supports x86_64 and arm or rather arm and arm64.
My concern is not breaking any setup which currently works, e.g., a non-assembly "universal" build involving architectures with different endianness.
Right, that should be fine then.
Does nettle currently support being compiled fat with assembly at all?
I don't think so. I'd expect one would have to build for one arch at a time, and have some postprocessing scripts to produce apple-fat libraries.
Apple have wrapped this in the compiler driver using multiple -arch arguments. "gcc -arch x86_64 -arch arm" will run the compiler twice on the same file and lipo the resulting objects together into a fat object. The linker supports linking those into fat binaries.
If all the assembler implementations of the same routine were in one file wrapped by #ifdefs the same could be done there. Otherwise, assembly and lipoing would have to be done explicitly for those files.
# clang -v -arch x86_64 -arch i386 -c -o t.o t.c
[...]
Apple LLVM version 9.0.0 (clang-900.0.39.2)
Target: i386-apple-darwin17.4.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
 "/Library/Developer/CommandLineTools/usr/bin/clang" -cc1 -triple x86_64-apple-macosx10.13.0 ...
[...]
 "/Library/Developer/CommandLineTools/usr/bin/clang" -cc1 -triple i386-apple-macosx10.13.0 ...
[...]
 "/Library/Developer/CommandLineTools/usr/bin/lipo" -create -output t.o /var/folders/ft/dp06pw254ybbzt42f1qn65pm0000gp/T/t-5eeded.o /var/folders/ft/dp06pw254ybbzt42f1qn65pm0000gp/T/t-b25776.o
# file t.o
t.o: Mach-O universal binary with 2 architectures: [x86_64:Mach-O 64-bit object x86_64] [i386:Mach-O object i386]
t.o (for architecture x86_64):  Mach-O 64-bit object x86_64
t.o (for architecture i386):    Mach-O object i386
But then I want to have a nice error message so as to not leave the user with an aborted build and no apparent reason. :) Is this portable?
According to http://pubs.opengroup.org/onlinepubs/9699919799/utilities/m4.html, errprint and m4exit are standard m4. (Whether they're also supported in practice is a different question; it's desirable to at least work with both GNU and BSD m4.) If __file__ and __line__ are unportable, you could omit them. Since the error message reports a pretty global config problem, precise location isn't that important.
Not critical, __file__ and __line__ dropped. Net/Free/OpenBSD m4 support them though.
The patch got quite large now. Should I better make a series out of it?
As you prefer, I think it is workable as is. It might help to split out the configure-related changes.
Series forthcoming.