Re: Miscomputation with big-endian arm asm

13 Feb 2018

Hi Niels,
On Mon, Feb 12, 2018 at 08:59:16AM +0100, Niels Möller wrote:
...
...
Right. When this still didn't fix it, I compared little- and big-endian
behaviour and found that a.) vldm and vstm switch doublewords for no
reason I can see or find documentation about and b.)
By "doublewords", you mean 64-bit words, right?
Yes. ARM talks in bytes, halfwords, words, doublewords and quadwords.
...
It might make sense to view it as big-endian or little-endian load of
128-bit values, and a 128-bit (16-byte) byte swap will then also swap
the low and high 64-bit halves.
[...]
...
If it's hard to find docs, I take it as a sign big-endian arm is a bit
obscure...
Actually, it's all quite well-documented, just not always as obviously
as I'd like: The ARM ARM (Architecture Reference Manual) spells out the
low-level details. By additionally looking very closely at the gdb
output, I found the following for the chacha and salsa implementations:
1. There's no vldm or vstm on quadword registers in the architecture. It
gets translated into vldm on the corresponding number of doubleword
registers.
Disassembly of section .text:
00000000 <_nettle_chacha_core>:
   0:   ec910b10        vldmia  r1, {d0-d7}
This is hinted at here
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204j/Bcfchhi...
by saying: "If Q registers are specified, on disassembly they are shown
as D registers."
2. vldm and vstm on doubleword registers swap 32-bit words inside the
doubleword to get a full byte-swap in addition to the byte- and
halfword-swapping the word-access already does. Since chacha and salsa
input is a matrix of 32-bit words, the word swap transposes even and odd
columns (not doublewords):
// Combine the word-aligned words in the correct order for current endianness. 
D[d+r] = if BigEndian() then word1:word2 else word2:word1;
3. The input to chacha-core is 32-bit words in host endianness.
4. gdb's print output ordering is really confusing.
So all that's basically happening is that odd and even columns get
switched. The individual words' values are exactly the same because the
input is in host endianness already. So NEON doesn't adjust for
endianness after all.
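
To make the column swap concrete, here is a minimal Python sketch of point
2. It assumes the ARM ARM pseudocode quoted above is the whole story for
vldm on D registers. The names vldm_d and lanes32 are made up for this
illustration, and the input is the first row of the chacha state stored as
32-bit words in host endianness:

```python
import struct

def vldm_d(mem, big_endian):
    """Model vldm on D registers: two word loads combined per doubleword."""
    fmt = ">" if big_endian else "<"
    words = struct.unpack("%s%dI" % (fmt, len(mem) // 4), mem)
    dregs = []
    for i in range(0, len(words), 2):
        word1, word2 = words[i], words[i + 1]
        # D[d+r] = if BigEndian() then word1:word2 else word2:word1
        if big_endian:
            dregs.append((word1 << 32) | word2)
        else:
            dregs.append((word2 << 32) | word1)
    return dregs

def lanes32(dregs):
    """Split each D register into 32-bit lanes, lane 0 = least significant."""
    out = []
    for d in dregs:
        out.append(d & 0xFFFFFFFF)
        out.append(d >> 32)
    return out

# "expand 32-byte k" as four 32-bit words
WORDS = [0x61707865, 0x3320646E, 0x79622D32, 0x6B206574]

le_lanes = lanes32(vldm_d(struct.pack("<4I", *WORDS), big_endian=False))
be_lanes = lanes32(vldm_d(struct.pack(">4I", *WORDS), big_endian=True))

print([hex(w) for w in le_lanes])  # same order as WORDS
print([hex(w) for w in be_lanes])  # order 1:0:3:2, values unchanged
```

In this model the big-endian lanes hold exactly the same word values as the
little-endian ones, only with even and odd columns exchanged.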
What's been fooling me is that apparently gdb tries to show the values
of vector registers as if they had been stored to memory by an operation
of the full bit-size of the register shown and then read back again as
consecutive elements of various other sizes (8, 16, 32, 64-bit):
p/x $q0
le: u8 = {0x65, 0x78, 0x70, 0x61, 0x6e, 0x64, 0x20, 0x33, 0x32, 0x2d, 0x62, 0x79, 0x74, 0x65, 0x20, 0x6b}
be: u8 = {0x79, 0x62, 0x2d, 0x32, 0x6b, 0x20, 0x65, 0x74, 0x61, 0x70, 0x78, 0x65, 0x33, 0x20, 0x64, 0x6e}
      ^ bytes reversed by 128-bit store + read as byte sequence -> vldm
1:0:3:2 column swap still visible
le: u32 = {0x61707865, 0x3320646e, 0x79622d32, 0x6b206574}
be: u32 = {0x79622d32, 0x6b206574, 0x61707865, 0x3320646e}
       ^ bytes reversed by 128-bit store + read as four consecutive
big-endian 32-bit words + vldm column swap -> makes it appear
doublewords have been swapped
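
The gdb output above can be reproduced with a small Python model. This is
only a sketch of the behaviour as I inferred it (store the full 128-bit
register in target byte order, then read it back as consecutive elements),
not a description of gdb internals. gdb_view is a made-up name, and the two
q0 values are what the vldm pseudocode yields for d1:d0 on each endianness:

```python
def gdb_view(q, elem_bytes, big_endian):
    """Model 'p/x $q0': 128-bit store, then read back as smaller elements."""
    order = "big" if big_endian else "little"
    raw = q.to_bytes(16, order)
    return [int.from_bytes(raw[i:i + elem_bytes], order)
            for i in range(0, 16, elem_bytes)]

# q0 = d1:d0 after vldm of "expand 32-byte k" stored as host-endian words
Q0_LE = (0x6B20657479622D32 << 64) | 0x3320646E61707865
Q0_BE = (0x79622D326B206574 << 64) | 0x617078653320646E

le_u8 = gdb_view(Q0_LE, 1, big_endian=False)
be_u8 = gdb_view(Q0_BE, 1, big_endian=True)
le_u32 = gdb_view(Q0_LE, 4, big_endian=False)
be_u32 = gdb_view(Q0_BE, 4, big_endian=True)

print("le: u32 =", [hex(w) for w in le_u32])
print("be: u32 =", [hex(w) for w in be_u32])
```

The big-endian u32 view comes out as if doublewords had been swapped, even
though the register lanes only differ by the even/odd column exchange.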
The realisation that even and odd columns get switched also explains the
necessary vext adjustments. So it's also not true that vext changes the
end of the vector where it extracts.
Regarding umac it's similar: vld1.8 loads a byte sequence from memory
without any swapping with either le or be. vld1.i32 reads the keys
stored in host endianness as words from memory. So the representation
ending up in the registers is the same as well which is why the code
doesn't need any adjustment.
Finally, the register switch for the return value with vmov in umac-nh
stems from the calling convention. AAPCS says:
"Fundamental types larger than 32 bits may be passed as parameters to, or
returned as the result of, function calls. When these types are in core
registers the following rules apply:
* A doubleword sized type is passed in two consecutive registers (e.g.,
r0 and r1, or r2 and r3). The content of the registers is as if the
value had been loaded from memory representation with a single LDM
instruction."
When loading a big-endian doubleword using ldm, the words end up in the
registers with the right values but transposed. Since the calling
convention mandates exactly this, we have to transpose the words upon
function exit as well.
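
A minimal Python sketch of that AAPCS rule, assuming ldm simply puts the
word at the lower address into r0 and the next word into r1. The helper
name aapcs_regs and the sample value are made up for illustration:

```python
def aapcs_regs(value, big_endian):
    """Return (r0, r1) for a 64-bit value as if loaded from memory by ldm."""
    order = "big" if big_endian else "little"
    raw = value.to_bytes(8, order)        # memory representation
    r0 = int.from_bytes(raw[0:4], order)  # word at the lower address
    r1 = int.from_bytes(raw[4:8], order)  # word at the higher address
    return r0, r1

le_r0, le_r1 = aapcs_regs(0x1122334455667788, big_endian=False)
be_r0, be_r1 = aapcs_regs(0x1122334455667788, big_endian=True)

print(hex(le_r0), hex(le_r1))  # low word in r0, high word in r1
print(hex(be_r0), hex(be_r1))  # high word in r0, low word in r1
```

The word values are identical in both cases; only the register they land in
differs, which is why a register swap on function exit is enough.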
Phew.
...
Could you add a short note to arm/README with your findings?
(It's quite some time since I did neon assembly, so I don't recall off
the top of my head any details on what the various instructions, in
particular vextr, do).
Done.
...
...
FAIL: sexp-conv
FAIL: nettle-pbkdf2
They've been failing all along. Can they be ignored?
They're not that relevant to your changes, but I'd like to understand
why they fail. What are the contents of the tools dir in your build tree?
You haven't done something like switching from building in the source
tree to a separate build tree, without a proper cleaning (make
distclean) in the source tree?
No. But I have been ignoring an annoying build failure due to TeX being
missing. After reconfiguring with --disable-documentation, both build and
testsuite succeed. My bad.
...
...
Weeell, depends on what you consider easier: I haven't found any binary
distribution that supports armeb. Yocto and buildroot seem to support it
but still require compiling the whole thing.
Hmm. Sounds more than a bit inconvenient.
The qemu-user chroot route with the linaro cross toolchain isn't too bad
actually:
cd $HOME/gcc-linaro-7.2.1-2017.11-x86_64_armeb-linux-gnueabihf/armeb-linux-gnueabihf/libc
cp /usr/bin/qemu-armeb-static usr/bin
wget https://gmplib.org/download/gmp/gmp-6.1.2.tar.lz
tar -xf gmp-6.1.2.tar.lz
cd gmp-6.1.2
# segfaults in qemu with -march=armv4 default
PATH=$PWD/../../../bin:$PATH CFLAGS="-march=armv7-a" ./configure --host=armeb-linux-gnueabihf --prefix=$PWD/../gmp
PATH=$PWD/../../../bin:$PATH make -j4 install
git clone https://git.lysator.liu.se/nettle/nettle.git
cd nettle
autoreconf
PATH=$PWD/../../../bin:$PATH ./configure --host=armeb-linux-gnueabihf --enable-arm-neon --with-include-path=$PWD/../gmp/include --with-lib-path=$PWD/../gmp/lib
PATH=$PWD/../../../bin:$PATH make -j4
NETTLE_TEST_ROOT=/nettle/testsuite PATH=$PWD/../../../bin:$PATH make -j4 check EMULATOR="sudo QEMU_SET_ENV=LD_LIBRARY_PATH=/nettle/.lib:/gmp/lib chroot $PWD/.."
with this small patch to run-tests:

diff --git a/run-tests b/run-tests
index 3d5655cf..bbc2bb4c 100755
--- a/run-tests
+++ b/run-tests
@@ -37,7 +37,7 @@ find_program () {
          ;;
        *)
          if [ -x "$1" ] ; then
-             echo "./$1"
+             echo "${NETTLE_TEST_ROOT:=.}/$1"
          else
              echo "$srcdir/$1"
          fi
...
...
Apple does do arm and someone could potentially want to build a fat
nettle that supports x86_64 and arm or rather arm and arm64.
My concern is not breaking any setup which currently works, e.g, a non
assembly "universal" build involving architectures with different
endianness.
Right, that should be fine then.
...
...
Does nettle currently support being compiled fat with assembly at all?
I don't think so. I'd expect one would have to build for one arch at a
time, and have some postprocessing scripts to produce apple-fat
libraries.
Apple have wrapped this in the compiler driver using multiple -arch
arguments. "gcc -arch x86_64 -arch arm" will run the compiler twice on
the same file and lipo the resulting objects together into a fat object.
The linker supports linking those into fat binaries.
If all the assembler implementations of the same routine were in one
file wrapped by #ifdefs the same could be done there. Otherwise,
assembly and lipoing would have to be done explicitly for those files.
# clang -v -arch x86_64 -arch i386 -c -o t.o t.c
[...]
Apple LLVM version 9.0.0 (clang-900.0.39.2)
Target: i386-apple-darwin17.4.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
 "/Library/Developer/CommandLineTools/usr/bin/clang" -cc1 -triple
x86_64-apple-macosx10.13.0 ...
[...]
 "/Library/Developer/CommandLineTools/usr/bin/clang" -cc1 -triple
i386-apple-macosx10.13.0 ...
[...]
"/Library/Developer/CommandLineTools/usr/bin/lipo" -create -output t.o
/var/folders/ft/dp06pw254ybbzt42f1qn65pm0000gp/T/t-5eeded.o
/var/folders/ft/dp06pw254ybbzt42f1qn65pm0000gp/T/t-b25776.o
# file t.o
t.o: Mach-O universal binary with 2 architectures: [x86_64:Mach-O 64-bit
object x86_64] [i386:Mach-O object i386]
t.o (for architecture x86_64):	Mach-O 64-bit object x86_64
t.o (for architecture i386):	Mach-O object i386
...
...
But then I want to have a nice error message so as to not leave the user
with an aborted build and no apparent reason. :) Is this portable?
According to
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/m4.html,
errprint and m4exit are standard m4. (Whether they're also supported in
practice is a different question; it's desirable to at least work with
both GNU and BSD m4). If __file__ and __line__ are unportable, you could
omit that. Since the error message reports a pretty global config
problem, precise location isn't that important.
Not critical, __file__ and __line__ dropped. Net/Free/OpenBSD m4
support them though.
...
...
The patch got quite large now. Should I better make a series out of it?
As you prefer, I think it is workable as is. It might help to split out
the configure-related changes.
Series forthcoming.
-- 
Thanks,
Michael

    
