Hello,
Sun, 23 Feb 2020 at 20:00, Michael Weiser michael@weiser.dinsnail.net:
Hi Dmitry,
On Sun, Feb 23, 2020 at 07:09:43PM +0300, Dmitry Baryshkov wrote:
I will check with fresh Yocto build later or tomorrow.
Thanks!
I have checked both armv7vet2b and armv5eb targets with qemu. Your patch fixes the issue for me.
Could Yocto be used for CI then? Do they do any kind of binary releases for armeb? How long and voluminous is a build of an armeb system with and without a native toolchain?
No, I found no package feeds/binary releases for armeb. So to use Yocto for CI, we'd have to build an image with Yocto SDK inside. I can try implementing it.
Hi Dmitry,
On Mon, Feb 24, 2020 at 08:58:44PM +0300, Dmitry Baryshkov wrote:
I will check with fresh Yocto build later or tomorrow.
Thanks!
I have checked both armv7vet2b and armv5eb targets with qemu. Your patch fixes the issue for me.
Perfect!
@Niels: What do you think of these changes?
These comment changes are bugging me:
diff --git a/arm/memxor.asm b/arm/memxor.asm
index 239a4034..b802e95c 100644
--- a/arm/memxor.asm
+++ b/arm/memxor.asm
@@ -138,24 +138,24 @@ PROLOGUE(nettle_memxor)
 	adds	N, #8
 	beq	.Lmemxor_odd_done
 
-	C We have TNC/8 left-over bytes in r4, high end
+	C We have TNC/8 left-over bytes in r4, (since working upwards) low
+	C end on LE and high end on BE
 	S0ADJ	r4, CNT
 	ldr	r3, [DST]
 	eor	r3, r4

diff --git a/arm/memxor3.asm b/arm/memxor3.asm
index 69598e1c..76b8aae6 100644
--- a/arm/memxor3.asm
+++ b/arm/memxor3.asm
@@ -159,21 +159,21 @@ PROLOGUE(nettle_memxor3)
 	adds	N, #8
 	beq	.Lmemxor3_done
 
-	C Leftover bytes in r4, low end
+	C Leftover bytes in r4, (since working downwards) in high end on LE and
+	C low end on BE
 	ldr	r5, [AP, #-4]
 	eor	r4, r5, r4, S1ADJ ATNC
Have I totally misunderstood how strb works, or how the comment is meant? To my thinking, the bytes are sitting in the low-end or high-end bits of the register, and ror #24 and lsr #8 bring the next byte down into the lowermost 8 bits for saving by strb.
Full patch for reference again below and at https://git.lysator.liu.se/michaelweiser/nettle/-/tree/arm-memxor-generic.
If it's acceptable, shall I rather git send-email it or do an MR on gitlab?
Could Yocto be used for CI then? Do they do any kind of binary releases for armeb? How long and voluminous is a build of an armeb system with and without a native toolchain?
No, I found no package feeds/binary releases for armeb. So to use Yocto for CI, we'd have to build an image with Yocto SDK inside. I can try implementing it.
I'm currently doing the same with buildroot. The advantage there is that it builds relatively quickly (around an hour on my quad-core workstation) and with minimal configuration:
FROM debian:buster AS build
MAINTAINER Nikos Mavrogiannopoulos nmav@redhat.com
RUN apt-get update -qq -y
RUN apt-get install -y dash [...] g++ cpio unzip bc
# tlsfuzzer deps
RUN apt-get install -y python-six

RUN useradd -m buildroot
USER buildroot

WORKDIR /home/buildroot
RUN git clone https://github.com/buildroot/buildroot
WORKDIR /home/buildroot/buildroot
RUN ( \
	echo 'BR2_armeb=y' ; \
	echo 'BR2_TOOLCHAIN_BUILDROOT_GLIBC=y' ; \
	echo 'BR2_TOOLCHAIN_BUILDROOT_CXX=y' ; \
	echo 'BR2_PACKAGE_GMP=y' ; \
	) > .config
RUN make olddefconfig
RUN make -j16
The downside is that it will not output a native toolchain for armeb. So nettle needs to be cross-compiled using the buildroot toolchain and then run either using the EMULATOR mechanism of the testsuite or by chrooting into the buildroot rootfs.
So if Yocto can be made to build a native toolchain, that would certainly simplify things (at the cost of image build time).
Do you know Nikos' build-images project for gnutls/nettle (https://gitlab.com/gnutls/build-images)? There's some qemu bits (specific to Debian's multiarch though) in docker-debian-cross that might be helpful.
Michael Weiser michael.weiser@gmx.de writes:
These comment changes are bugging me:
diff --git a/arm/memxor.asm b/arm/memxor.asm
index 239a4034..b802e95c 100644
--- a/arm/memxor.asm
+++ b/arm/memxor.asm
@@ -138,24 +138,24 @@ PROLOGUE(nettle_memxor)
 	adds	N, #8
 	beq	.Lmemxor_odd_done
 
-	C We have TNC/8 left-over bytes in r4, high end
+	C We have TNC/8 left-over bytes in r4, (since working upwards) low
+	C end on LE and high end on BE
 	S0ADJ	r4, CNT
 	ldr	r3, [DST]
 	eor	r3, r4
The correctness in all cases is not that obvious to me now, but the idea is that we write aligned words, and read aligned words. But since input and output may have different alignment, src words are shifted around so that matching bytes are xored together. The ascii art a bit higher up in the file tries to illustrate that.
At this point in the code, r4 holds an aligned word read from the src area (possibly reading to a word edge beyond the end of the input bytes). The first bytes in this word (on LE, those are the least significant "low end" bytes of r4) have already been xored with the previous destination word and stored back. The "left-over" bytes referred to in the comment are the bytes in r4 that have not yet been processed; those are the last bytes, which on LE are the most significant bytes of r4, located at the "high end" of the register.
And not all of the left-over bytes should be stored back after xoring, since some of them may have been read from beyond the end of the input.
Does that make sense?
The C code does something similar, except that I think it avoids reading anything beyond the end of the input, since that is undefined behavior in C.
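To make the word-splicing idea a bit more concrete, here is a rough C sketch (little-endian only, with made-up names; it is not the actual memxor.c and not a translation of the asm, and like the asm it may touch a few bytes past the end of the source, because the last aligned word is read in full):

#include <stdint.h>
#include <stddef.h>

/* dst is word-aligned; the source starts cnt/8 bytes into its first aligned
   word, so every dst word gets xored with bytes spliced together from two
   consecutive aligned src words. */
static void
memxor_shifted_sketch (uint32_t *dst, const uint32_t *src_aligned,
                       unsigned cnt, size_t nwords)
{
  unsigned tnc = 32 - cnt;               /* cnt is 8, 16 or 24 */
  uint32_t prev = src_aligned[0];
  for (size_t i = 0; i < nwords; i++)
    {
      uint32_t next = src_aligned[i + 1];
      /* the high tnc bits of prev hold the earlier source bytes, the low
         cnt bits of next the following ones; shift both into place */
      dst[i] ^= (prev >> cnt) | (next << tnc);
      prev = next;
    }
}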
Full patch for reference again below and at https://git.lysator.liu.se/michaelweiser/nettle/-/tree/arm-memxor-generic.
If it's acceptable, shall I rather git send-email it or do an MR on gitlab?
Either alternative is ok (although I'm still not that used to gitlab MRs).
Regards, /Niels
Hello Niels,
On Tue, Mar 03, 2020 at 06:57:25PM +0100, Niels Möller wrote:
The correctness in all cases is not that obvious to me now, but the idea is that we write aligned words, and read aligned words. But since input and output may have different alignment, src words are shifted around so that matching bytes are xored together. The ascii art a bit higher up in the file tries to illustrate that.
At this point in the code, r4 holds an aligned word read from the src area (possibly reading to a word edge beyond the end of the input bytes). The first bytes in this word (on LE, those are the least significant "low end" bytes of r4) have already been xored with the previous destination word and stored back. The "left-over" bytes referred to in the comment are the bytes in r4 that have not yet been
Thanks for the explanation. It confirms my understanding so far.
processed; those are the last bytes, which on LE are the most significant bytes of r4, located at the "high end" of the register.
This is where after a lot of scratching of my thinking cap I got to the conclusion that in LE we're actually working with the least significant bytes of r4 at the low end of the register. My guess is that it's just a matter of interpretation what end of the register is "high" and most significant. I just want to make sure I have a correct understanding of what the code is doing while messing with it.
Rereading the ARM ARM[1], it says that register contents are little-endian, i.e. the lowest-numbered bits are the least significant ([1] D6.3.2). Also, higher bits are left and lower ones are right ([1] D6.5.3). Byte order is converted only on memory access ([1] A3.3). strb stores the lowest (rightmost) byte of r4 (bits 7..0) to memory ([1] A7.7.160).
This matches what the code is doing: on LE it saves the lowest byte to DST, increments by one (moving upward in memory), then right-shifts r4 down 8 bits and saves the next byte. So it's saving the least significant byte first, which on LE matches how ldr read the word from memory into the register. For validation I'm inferring that it must be the least significant (rightmost) byte because lsr discards what it shifts out of the register to the right, which would result in data loss if strb were to save the highest (leftmost) byte instead.
On BE we now do essentially a rotate-left by 8 bits (by doing a rotate right by 24 bits) to get the highest byte (bits 31..24) down to bits 7..0 of the register while preserving the rest by shifting/rotating them into the upper part (no discard as with lsr). Storing the most significant byte first again matches how ldr loaded it from memory.
memxor3 seems to do everything the other way around because it works downward in memory.
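If it helps, this is how I would write the two byte-store orders in C (just an illustrative sketch with made-up names, not nettle code; the endianness check via __ARMEB__/__BIG_ENDIAN__ is an assumption):

#include <stdint.h>

/* Store the bytes of a word-loaded value back to memory in address order,
   mirroring the strb loops discussed above. */
static void
store_bytes_in_memory_order (uint32_t w, unsigned char *dst, unsigned n)
{
  for (unsigned i = 0; i < n; i++)
    {
#if defined(__ARMEB__) || defined(__BIG_ENDIAN__)
      w = (w << 8) | (w >> 24);  /* like "ror #24": bring the top byte down */
      *dst++ = w & 0xff;         /* strb stores bits 7..0 */
#else
      *dst++ = w & 0xff;         /* on LE the low byte is first in memory */
      w >>= 8;                   /* like "lsr #8": next byte, top bits dropped */
#endif
    }
}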
Sorry if all this seems pedestrian but it isn't my daily fare. And sorry for quoting regulation by paragraph but I really want to make sure that I'm not misunderstanding all this miserably. :)
The C code does something similar, except that I think it avoids reading anything beyond end of input, since that is undefined behavior in C.
Ah, but the C code of memxor works downward in memory, doesn't it?
Long story short: Let me know what to do with those comments based on how much my thinking is off.
[1] ARM ARMv7-M Architecture Reference Manual https://static.docs.arm.com/ddi0403/e/DDI0403E_B_armv7m_arm.pdf (only one I could find publicly available)
Michael Weiser michael.weiser@gmx.de writes:
This is where after a lot of scratching of my thinking cap I got to the conclusion that in LE we're actually working with the least significant bytes of r4 at the low end of the register.
My understanding of LE here is that the least significant CNT bits (first in memory) were processed earlier, and the remaining TNC bits we need to process are at the high end. So we first shift right CNT bits to discard the bits already processed, and then do eight bits at a time, shifting right in the loop.
My guess is that it's just a matter of interpretation what end of the register is "high" and most significant.
The way I think about it, a 32-bit register holds a binary integer. Each bit has a weight, from 1 to 2^31 (let's stick to unsigned interpretation). And then I try to stay with the widely used convention that bits are numbered 0-31, with bit k having weight 2^k, and with "right", "lower", "less significant", being synonymous, all meaning the bits with smaller weight, and "left", "higher", "more significant" meaning bits with higher weight.
Note this terminology is endian-independent. It's only when storing the integer in memory on a byte-addressed machine that it becomes relevant to ask which 8 bits get stored at which address, with "little-endian" meaning that the lower/right/less significant bits in the register get stored at lower addresses in memory.
Maybe it's illuminating to compare with bit order. On a byte-addressed machine, we can't really talk about in which "order" individual bits of a byte are stored in memory. But we'd have to care about bit order, e.g., if transmitting a byte serially over a wire.
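A tiny standalone C example of that point (nothing nettle-specific):

#include <stdint.h>
#include <string.h>
#include <stdio.h>

int
main (void)
{
  uint32_t x = 0x11223344;   /* bit 0 is in the 0x44 byte, bit 31 in the 0x11 byte */
  unsigned char b[4];
  memcpy (b, &x, sizeof b);  /* only now does byte order come into play */
  /* prints "44 33 22 11" on a little-endian machine, "11 22 33 44" on big-endian */
  printf ("%02x %02x %02x %02x\n", b[0], b[1], b[2], b[3]);
  return 0;
}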
Long story short: Let me know what to do with those comments based on how much my thinking is off.
You could maybe just say "We have TNC/8 left-over bytes in r4 (high end if little endian, low end if big endian)". Or feel free to rephrase.
Also, I see the final
b .Lmemxor_bytes
is slightly suboptimal, in that it will reread individual bytes from the word at DST. It might be better to check if N > TNC/8, and if so read and xor one more source word.
	ldr	r4, [SRC]
	eor	r3, r4, S1ADJ TNC
And we can then have the byte-storing loop run until N == 0, without updating or checking TNC. But that's a separate improvement.
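In rough C terms, the idea would be something like the sketch below (hypothetical names, little-endian only, and assuming no more than a word's worth of bytes remains so everything fits in one register):

#include <stdint.h>
#include <stddef.h>

/* dst_word was loaded from DST, src_word is the last aligned source word
   with its low cnt bits already consumed, next_src_word the following
   aligned word, n the bytes left to store. */
static void
xor_tail_sketch (unsigned char *dst, uint32_t dst_word, uint32_t src_word,
                 uint32_t next_src_word, unsigned cnt, size_t n)
{
  unsigned tnc = 32 - cnt;
  uint32_t r3 = dst_word ^ (src_word >> cnt);  /* tnc/8 leftover bytes, low end */
  if (n > tnc / 8)
    r3 ^= next_src_word << tnc;                /* read and xor one more source word */
  while (n-- > 0)                              /* run until n == 0, no TNC checks */
    {
      *dst++ = r3 & 0xff;                      /* strb r3, [DST], #+1 */
      r3 >>= 8;                                /* lsr r3, #8 */
    }
}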
Regards, /Niels
Hello Niels,
This is where after a lot of scratching of my thinking cap I got to the conclusion that in LE we're actually working with the least significant bytes of r4 at the low end of the register.
My understanding of LE here is that the least significant CNT bits (first in memory) were processed earlier, and the remaining TNC bits we need to process are at the high end. So we first shift right CNT bits to discard the bits already processed, and then do eight bits at a time, shifting right in the loop.
Okay, so the comment is referring to the situation in the register literally just before the next instruction. I was clearly looking too far afield for clues on what's going on.
My guess is that it's just a matter of interpretation what end of the register is "high" and most significant.
The way I think about it, a 32-bit register holds a binary integer. Each bit has a weight, from 1 to 2^31 (let's stick to unsigned interpretation). And then I try to stay with the widely used convention that bits are numbered 0-31, with bit k having weight 2^k, and with "right", "lower", "less significant", being synonymous, all meaning the bits with smaller weight, and "left", "higher", "more significant" meaning bits with higher weight.
Note this terminology is endian-independent. It's only when storing the integer in memory on a byte-addressed machine that it becomes relevant to ask which 8 bits get stored at which address, with "little-endian" meaning that the lower/right/less significant bits in the register get stored at lower addresses in memory.
Maybe it's illuminating to compare with bit order. On a byte-addressed machine, we can't really talk about in which "order" individual bits of a byte are stored in memory. But we'd have to care about bit order, e.g., if transmitting a byte serially over a wire.
Thanks for indulging me. :) I had a similar understanding and treatise to that effect all typed up but then found ARM's view of the world in their ARM ARM and felt stupid.
Long story short: Let me know what to do with those comments based on how much my thinking is off.
You could maybe just say "We have TNC/8 left-over bytes in r4 (high end if little endian, low end if big endian)". Or feel free to rephrase.
I've gone with:
diff --git a/arm/memxor.asm b/arm/memxor.asm
index 239a4034..e4619629 100644
--- a/arm/memxor.asm
+++ b/arm/memxor.asm
@@ -138,24 +138,25 @@ PROLOGUE(nettle_memxor)
 	adds	N, #8
 	beq	.Lmemxor_odd_done
 
-	C We have TNC/8 left-over bytes in r4, high end
+	C We have TNC/8 left-over bytes in r4, high end on LE and low end on
+	C BE, excess bits to be discarded by alignment adjustment at the other
 	S0ADJ	r4, CNT
+	C now byte-aligned at low end on LE and high end on BE
 	ldr	r3, [DST]
 	eor	r3, r4
Patch in followup mail for your consideration.
ARM assembly adjustments for big-endian systems contained armv6+-only instructions (rev) in generic arm memxor code. Replace those with an actual conversion of the leftover byte store routines for big-endian systems. This also provides a slight optimisation by removing the additional instruction as well as increased symmetry between little- and big-endian implementations.
Signed-off-by: Michael Weiser michael.weiser@gmx.de
---
 arm/memxor.asm  | 13 +++++++------
 arm/memxor3.asm | 31 ++++++++++++++++++-------------
 2 files changed, 25 insertions(+), 19 deletions(-)
diff --git a/arm/memxor.asm b/arm/memxor.asm
index 239a4034..e4619629 100644
--- a/arm/memxor.asm
+++ b/arm/memxor.asm
@@ -138,24 +138,25 @@ PROLOGUE(nettle_memxor)
 	adds	N, #8
 	beq	.Lmemxor_odd_done
 
-	C We have TNC/8 left-over bytes in r4, high end
+	C We have TNC/8 left-over bytes in r4, high end on LE and low end on
+	C BE, excess bits to be discarded by alignment adjustment at the other
 	S0ADJ	r4, CNT
+	C now byte-aligned at low end on LE and high end on BE
 	ldr	r3, [DST]
 	eor	r3, r4
 
-	C memxor_leftover does an LSB store
-	C so we need to reverse if actually BE
-IF_BE(<	rev	r3, r3>)
-
 	pop	{r4,r5,r6}
 
 	C Store bytes, one by one.
 .Lmemxor_leftover:
+	C bring uppermost byte down for saving while preserving lower ones
+IF_BE(<	ror	r3, #24>)
 	strb	r3, [DST], #+1
 	subs	N, #1
 	beq	.Lmemxor_done
 	subs	TNC, #8
-	lsr	r3, #8
+	C bring down next byte, no need to preserve
+IF_LE(<	lsr	r3, #8>)
 	bne	.Lmemxor_leftover
 	b	.Lmemxor_bytes
 .Lmemxor_odd_done:

diff --git a/arm/memxor3.asm b/arm/memxor3.asm
index 69598e1c..b6c6da49 100644
--- a/arm/memxor3.asm
+++ b/arm/memxor3.asm
@@ -159,21 +159,23 @@ PROLOGUE(nettle_memxor3)
 	adds	N, #8
 	beq	.Lmemxor3_done
 
-	C Leftover bytes in r4, low end
+	C Leftover bytes in r4, low end on LE and high end on BE before
+	C preparatory alignment correction
 	ldr	r5, [AP, #-4]
 	eor	r4, r5, r4, S1ADJ ATNC
-
-	C leftover does an LSB store
-	C so we need to reverse if actually BE
-IF_BE(<	rev	r4, r4>)
+	C now byte-aligned in high end on LE and low end on BE because we're
+	C working downwards in saving the very first bytes of the buffer
 
 .Lmemxor3_au_leftover:
 	C Store a byte at a time
-	ror	r4, #24
+	C bring uppermost byte down for saving while preserving lower ones
+IF_LE(<	ror	r4, #24>)
 	strb	r4, [DST, #-1]!
 	subs	N, #1
 	beq	.Lmemxor3_done
 	subs	ACNT, #8
+	C bring down next byte, no need to preserve
+IF_BE(<	lsr	r4, #8>)
 	sub	AP, #1
 	bne	.Lmemxor3_au_leftover
 	b	.Lmemxor3_bytes
@@ -273,18 +275,21 @@ IF_BE(<	rev	r4, r4>)
 	adds	N, #8
 	beq	.Lmemxor3_done
 
-	C leftover does an LSB store
-	C so we need to reverse if actually BE
-IF_BE(<	rev	r4, r4>)
-
-	C Leftover bytes in a4, low end
-	ror	r4, ACNT
+	C Leftover bytes in r4, low end on LE and high end on BE before
+	C preparatory alignment correction
+IF_LE(<	ror	r4, ACNT>)
+IF_BE(<	ror	r4, ATNC>)
+	C now byte-aligned in high end on LE and low end on BE because we're
+	C working downwards in saving the very first bytes of the buffer
 
 .Lmemxor3_uu_leftover:
-	ror	r4, #24
+	C bring uppermost byte down for saving while preserving lower ones
+IF_LE(<	ror	r4, #24>)
 	strb	r4, [DST, #-1]!
 	subs	N, #1
 	beq	.Lmemxor3_done
 	subs	ACNT, #8
+	C bring down next byte, no need to preserve
+IF_BE(<	lsr	r4, #8>)
 	bne	.Lmemxor3_uu_leftover
 	b	.Lmemxor3_bytes
Michael Weiser michael.weiser@gmx.de writes:
ARM assembly adjustments for big-endian systems contained armv6+-only instructions (rev) in generic arm memxor code. Replace those with an actual conversion of the leftover byte store routines for big-endian systems. This also provides a slight optimisation by removing the additional instruction as well as increased symmetry between little- and big-endian implementations.
Merged onto the master-update branch (to let the ci system verify that ARM LE isn't broken). Thanks.
It would be nice to have arm be in the ci as well, but you need to coordinate with Nikos. I have a somewhat fuzzy understanding of how it's set up, and know very little about the various system images being used.
Regards, /Niels
Hello Niels,
On Thu, Mar 12, 2020 at 10:01:51PM +0100, Niels Möller wrote:
ARM assembly adjustments for big-endian systems contained armv6+-only instructions (rev) in generic arm memxor code. Replace those with an actual conversion of the leftover byte store routines for big-endian systems. This also provides a slight optimisation by removing the additional instruction as well as increased symmetry between little- and big-endian implementations.
Merged onto the master-update branch (to let the ci system verify that ARM LE isn't broken). Thanks.
FWIW it just cleared preliminary BE CI: https://gitlab.com/michaelweiser/nettle/pipelines/125862809 https://gitlab.com/michaelweiser/nettle/pipelines/125862752
It would be nice to have arm be in the ci as well, but you need to coordinate with Nikos. I have a somewhat fuzzy understanding of how it's set up, and know very little about the various system images being used.
Will do.