Michael Weiser michael@weiser.dinsnail.net writes:
Porting over the basic IF_[LB]E mechanism from chacha-core-internal was easy and fixed up the first of the three interleaved blocks right away. For the other two I am still in the process of wrapping my head around how the interleaving works and how it would need some adjustment for BE.
The 3-way functions don't do anything fancy, just each of the three blocks represented in separate registers, and the same instruction sequence as for the 1-way version, duplicated three times and interleaved.
The 2-way version (for ARM, that's salsa only) tries to be a bit more clever, with registers representing either odd or even words from both blocks.
Not sure how endianness affects the code to move words around.
Byte swapping should go close to the final stores, but after the addition of the initial state.
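The ordering point above can be illustrated with a plain C sketch (names are illustrative, not Nettle's actual routines): the working words stay in host order through all rounds, the initial state is added first, and only the final store converts each word to little-endian bytes.

```c
#include <stdint.h>
#include <stddef.h>

/* Store one 32-bit word in little-endian byte order regardless of
   host endianness; this models the byte swap placed close to the
   final store on a big-endian host. */
static void
store_le32(uint8_t *p, uint32_t w)
{
  p[0] = w & 0xff;
  p[1] = (w >> 8) & 0xff;
  p[2] = (w >> 16) & 0xff;
  p[3] = (w >> 24) & 0xff;
}

/* Final step of a chacha/salsa-like core: x[] holds the words after
   the rounds, s0[] the initial state. The addition happens on
   host-order words; only the store converts to little-endian. */
static void
final_add_and_store(uint8_t *dst, const uint32_t *x,
                    const uint32_t *s0, size_t n)
{
  size_t i;
  for (i = 0; i < n; i++)
    store_le32(dst + 4*i, x[i] + s0[i]);
}
```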
Regards, /Niels
Hello Niels,
On Sat, Dec 19, 2020 at 09:51:45AM +0100, Niels Möller wrote:
Porting over the basic IF_[LB]E mechanism from chacha-core-internal was easy and fixed up the first of the three interleaved blocks right away. For the other two I am still in the process of wrapping my head around how the interleaving works and how it would need some adjustment for BE.
The 3-way functions don't do anything fancy, just each of the three blocks represented in separate registers, and the same instruction sequence as for the 1-way version, duplicated three times and interleaved.
I've got the tests passing for chacha now. Apart from the straightforward porting-over of the BE shift and reverse-on-store logic from chacha-core-internal.asm, special treatment is necessary for the part of the state that's treated as a 64-bit counter. The two 32-bit words it's comprised of are in host endianness but consecutive order, so they get reversed by the BE load. This is actually the case for all 32-bit operands throughout the routine on BE (and for chacha-core-internal also) and cancels itself out on the final store. But for the 64-bit counter it needs to be taken into account for the addition to produce correct results.
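The counter problem can be demonstrated with a plain C model (hypothetical helper names; the real code operates on NEON registers): once the BE load has exchanged the counter's two 32-bit halves in the register, a 64-bit add lands on the wrong half, while the transpose-add-transpose sequence produces the correct result in the transposed layout.

```c
#include <stdint.h>

/* Model of the BE situation: the counter (hi:lo) was loaded as one
   64-bit unit, which left its two 32-bit words exchanged in the
   register. A straight 64-bit add then increments the wrong half. */
static uint64_t
add1_on_swapped(uint32_t hi, uint32_t lo)
{
  uint64_t reg = ((uint64_t)lo << 32) | hi;  /* halves exchanged by the load */
  reg += 1;                                  /* 64-bit add, as in the asm */
  return reg;
}

/* Transpose-add-transpose: swap the halves back into host order, do
   the 64-bit add, and swap again so the counter ends up as
   "transposed" as all the other operands. */
static uint64_t
add1_with_transpose(uint32_t hi, uint32_t lo)
{
  uint64_t reg = ((uint64_t)lo << 32) | hi;
  uint64_t t = (reg << 32) | (reg >> 32);    /* now correct (hi:lo) */
  t += 1;
  return (t << 32) | (t >> 32);              /* back to transposed layout */
}
```

With hi = 0, lo = 0xffffffff, the correct increment carries into the high word; the naive 64-bit add on the swapped register instead increments the high word's lane directly and misses the carry entirely.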
See the attached patch for my current approach to fixing it, which is explicit transposing, adding and then transposing again to be as transposed as the other operands. I wonder if the surrounding C code could be changed to supply that part of the state as a 64-bit doubleword in host endianness to the assembler routine to cut down on adjustment.
Alternatively, could the 64-bit operation be broken down into two 32-bit operations which implicitly adjust to the transposed 32-bit words on BE?
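One way such a split could look, sketched in plain C (an illustrative model, with lane[0]/lane[1] standing in for the two 32-bit halves of a d register; not actual NEON code): add to the lane that holds the counter's low word, and propagate the carry into the other lane.

```c
#include <stdint.h>

/* lane[0]/lane[1] model the two 32-bit halves of the register after
   the BE load: lane[1] holds the counter's low word, lane[0] the
   high word (they arrive exchanged). */
static void
ctr_add1_split(uint32_t lane[2])
{
  lane[1] += 1;        /* 32-bit add to the low word, in place */
  if (lane[1] == 0)    /* carry out of the low word */
    lane[0] += 1;      /* propagate into the high word */
}
```

Whether the carry propagation can be expressed efficiently with NEON instructions, rather than via the transpose, 64-bit add, transpose sequence, is exactly the open question here.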
The 2-way version (for ARM, that's salsa only) tries to be a bit more clever, with registers representing either odd or even words from both blocks.
For a start this also needs adjustment for the 64-bit counter treatment.
Not sure how endianness affects the code to move words around.
The routine "suffers" from the same effect as chacha: The 32-bit input operands are in host order in memory and their individual values end up correctly in the registers. But since vldm loads consecutive 64-bit values, it ends up transposing 32-bit words that comprise the 64-bit register value. After the initial swap and transpose operations, the X and Y matrices are basically correctly filled but flipped two ways.
I've tried to document what I see in the registers on armeb to get a handle on how to proceed:
 vtrn.32 X0, Y3 C X0:  0  0  2  2  Y3:  1  1  3  3
 vtrn.32 X1, Y0 C X1:  4  4  6  6  Y0:  5  5  7  7
-vtrn.32 X2, Y1 C X2:  8  8 10 10  Y1:  9  9  1  1 <- typo?
+vtrn.32 X2, Y1 C X2:  8  8 10 10  Y1:  9  9 11 11
 vtrn.32 X3, Y2 C X3: 12 12 14 14  Y2: 13 13 15 15
+C BE:
+C X0:  3  3  1  1  Y3:  2  2  0  0
+C X1:  7  7  5  5  Y0:  6  6  4  4
+C X2: 11 11  9  9  Y1: 10 10  8  8
+C X3: 15 15 13 13  Y2: 14 14 12 12
C Swap, to get
C X0:  0 10  Y0:  5 15
C X1:  4 14  Y1:  9  3
C X2:  8  2  Y2: 13  7
C X3: 12  6  Y3:  1 11
vswp D1REG(X0), D1REG(X2)
vswp D1REG(X1), D1REG(X3)
vswp D1REG(Y0), D1REG(Y2)
vswp D1REG(Y1), D1REG(Y3)
+C BE:
+C X0: 11  1  Y0: 14  4
+C X1: 15  5  Y1:  2  8
+C X2:  3  9  Y2:  6 12
+C X3:  7 13  Y3: 10  0
I wonder if the code working on them contains some symmetry that could be exploited to (with minimal changes) get correct results on these transposed matrices.
Otherwise I wonder if it would be possible for both chacha and salsa to change the actual loading and storing so there's no transposing of 32-bit operands. I looked at vld4.32 but that does some fancy de-interleaving and needs two operations to load four q registers.
Otherwise we'd need a lot of vrev64.u32s to basically revert the 32-bit transposition happening upon load and save to end up with identical matrices to LE.
Michael Weiser michael.weiser@gmx.de writes:
See the attached patch for my current approach to fixing it, which is explicit transposing, adding and then transposing again to be as transposed as the other operands.
I haven't yet read the code, but I have some comments based on your description only.
I wonder if the surrounding C code could be changed to supply that part of the state as a 64-bit doubleword in host endianness to the assembler routine to cut down on adjustment.
I think it will be a bit cumbersome to change the interface to the C code.
Alternatively, could the 64-bit operation be broken down into two 32-bit operations which implicitly adjust to the transposed 32-bit words on BE?
Maybe. But we still need to propagate the carry, can that be done in a better way than transpose, 64-bit add, transpose?
I've tried to document what I see in the registers on armeb to get a handle on how to proceed:
vtrn.32 X0, Y3 C X0: 0 0 2 2 Y3: 1 1 3 3
vtrn.32 X1, Y0 C X1: 4 4 6 6 Y0: 5 5 7 7
- vtrn.32 X2, Y1 C X2: 8 8 10 10 Y1: 9 9 1 1 <- typo?
+ vtrn.32 X2, Y1 C X2: 8 8 10 10 Y1: 9 9 11 11
Indeed a typo. I just checked in the fix, thanks!
vtrn.32 X3, Y2 C X3: 12 12 14 14 Y2: 13 13 15 15
C BE:
C X0: 3 3 1 1 Y3: 2 2 0 0
C X1: 7 7 5 5 Y0: 6 6 4 4
C X2: 11 11 9 9 Y1: 10 10 8 8
C X3: 15 15 13 13 Y2: 14 14 12 12
Also, it's somewhat important to keep track of which block a word belongs to. In the LE code, X0 really is A0 B0 A2 B2, where A refers to the first block, and B to the second.
What's the layout before the transpose, immediately after load? I'd guess you get X0: 1 0 3 2?
For the little endian code, the transpose can be viewed as
X0: A0 A1 A2 A3
      /     /        / denotes elements swapped.
Y3: B0 B1 B2 B3
If instead we start with the order 1 0 3 2, we get the same result (but with registers swapped) if we do
Y3: B1 B0 B3 B2
      \     \        \ denotes elements swapped.
X0: A1 A0 A3 A2
So I would expect there's some clever way to get the BE case to work with about the same number of transpose instructions, even if initial word order is somewhat different.
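The lane bookkeeping above can be checked with a small plain-C model of vtrn.32 (a sketch, not NEON intrinsics): it swaps lane 1 of the first register with lane 0 of the second, and lane 3 with lane 2.

```c
#include <stdint.h>

/* Model of vtrn.32 on two 4-lane vectors: treats each register pair
   as 2x2 matrices and transposes them, i.e. swaps x[1] <-> y[0]
   and x[3] <-> y[2]. */
static void
vtrn32(uint32_t x[4], uint32_t y[4])
{
  uint32_t t;
  t = x[1]; x[1] = y[0]; y[0] = t;  /* swap x[1] <-> y[0] */
  t = x[3]; x[3] = y[2]; y[2] = t;  /* swap x[3] <-> y[2] */
}
```

Starting from memory order, vtrn32(X0, Y3) turns A0 A1 A2 A3 / B0 B1 B2 B3 into A0 B0 A2 B2 / A1 B1 A3 B3, matching the LE layout described above; starting from the BE order 1 0 3 2 with the registers swapped, the same operation yields X0 = B0 A0 B2 A2, the same interleaving up to the A/B order within each pair.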
I wonder if the code working on them contains some symmetry that could be exploited to (with minimal changes) get correct results on these transposed matrices.
At least, both blocks are treated equally (except that the initial counter addition is done to only the second block, and that the final result is written in the right order). So it doesn't matter if X0 contains A0 B0 A2 B2 or B0 A0 B2 A2. And unlike the one-way code, we only use
vext.32 ... #2
to rotate data between rounds, never #1 or #3.
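For reference, here is a plain-C model of what vext.32 with #2 does (a sketch, not intrinsics): the result takes lanes starting at index 2 of the first operand and continues into the second.

```c
#include <stdint.h>

/* Model of "vext.32 r, a, b, #2" on 4-lane vectors: the result is
   a[2], a[3], b[0], b[1]. */
static void
vext32_2(const uint32_t a[4], const uint32_t b[4], uint32_t r[4])
{
  r[0] = a[2];
  r[1] = a[3];
  r[2] = b[0];
  r[3] = b[1];
}
```

With both operands the same register this rotates by two whole lanes, {w0,w1,w2,w3} -> {w2,w3,w0,w1}, which moves lane pairs around intact. That is why the A/B order within each pair doesn't matter here, whereas a #1 or #3 rotation would split the pairs.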
Otherwise I wonder if it would be possible for both chacha and salsa to change the actual loading and storing so there's no transposing of 32-bit operands. I looked at vld4.32 but that does some fancy de-interleaving and needs two operations to load four q registers.
The new powerpc code uses load and store instructions that behave the same in this respect, for both BE and LE. But not sure if there's any easy way on ARM. I'm not that familiar with the more special load and store instructions. Would vst2.32 be useful in some way for the final store (and vst3.32 for chacha-3core)?
Otherwise we'd need a lot of vrev64.u32s to basically revert the 32-bit transposition happening upon load and save to end up with identical matrices to LE.
If that's an easier way to get it working, I think it's a good start. I'd expect that'd still give a reasonable speedup over the 1-way version.
Regards, /Niels
Hello Niels,
On Mon, Dec 21, 2020 at 09:16:25PM +0100, Niels Möller wrote:
What's the layout before the transpose, immediately after load? I'd guess you get X0: 1 0 3 2?
TL;DR: Yes, it is. I abandoned this approach for now though, since I found some options to eliminate the word transposition effect of vldm/vstm in the first place (see below).
Longer story for completeness: It seems I ran afoul of gdb's way of displaying registers in memory endianness again. I knew all this once already.[1] I should likely do this more often than every couple of years. ;)
[1] https://marc.info/?l=nettle-bugs&m=152436948907236&w=2
On LE I get after the initial load:
Breakpoint 1, _nettle_salsa20_2core () at salsa20-2core.s:39
39      vldm    r1, {q0,q1,q2,q3}
(gdb) s
40      adr     r12, .Lcount1
(gdb) i r q0 q1 q2 q3
q0  {u8 = {0x0, 0x0, 0x0, 0x0, 0x1, 0x0, 0x0, 0x0, 0x2, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0},
     u16 = {0x0, 0x0, 0x1, 0x0, 0x2, 0x0, 0x3, 0x0}, u32 = {0x0, 0x1, 0x2, 0x3},
     u64 = {0x100000000, 0x300000002}, f32 = {0x0, 0x0, 0x0, 0x0}, f64 = {0x0, 0x0}}
q1  {u8 = {0x4, 0x0, 0x0, 0x0, 0x5, 0x0, 0x0, 0x0, 0x6, 0x0, 0x0, 0x0, 0x7, 0x0, 0x0, 0x0},
     u16 = {0x4, 0x0, 0x5, 0x0, 0x6, 0x0, 0x7, 0x0}, u32 = {0x4, 0x5, 0x6, 0x7},
     u64 = {0x500000004, 0x700000006}, f32 = {0x0, 0x0, 0x0, 0x0}, f64 = {0x0, 0x0}}
q2  {u8 = {0xff, 0xff, 0xff, 0xff, 0x9, 0x0, 0x0, 0x0, 0xa, 0x0, 0x0, 0x0, 0xb, 0x0, 0x0, 0x0},
     u16 = {0xffff, 0xffff, 0x9, 0x0, 0xa, 0x0, 0xb, 0x0}, u32 = {0xffffffff, 0x9, 0xa, 0xb},
     u64 = {0x9ffffffff, 0xb0000000a}, f32 = {0x0, 0x0, 0x0, 0x0}, f64 = {0x0, 0x0}}
q3  {u8 = {0xc, 0x0, 0x0, 0x0, 0xd, 0x0, 0x0, 0x0, 0xe, 0x0, 0x0, 0x0, 0xf, 0x0, 0x0, 0x0},
     u16 = {0xc, 0x0, 0xd, 0x0, 0xe, 0x0, 0xf, 0x0}, u32 = {0xc, 0xd, 0xe, 0xf},
     u64 = {0xd0000000c, 0xf0000000e}, f32 = {0x0, 0x0, 0x0, 0x0}, f64 = {0x0, 0x0}}
On the u8 representation we can see that gdb prints them as if they were stored little-endian in memory. That's why the u32 representation actually matches up with our expectations.
On BE I get:
Breakpoint 1, _nettle_salsa20_2core () at salsa20-2core.s:39
39      vldm    r1, {q0,q1,q2,q3}
(gdb) s
40      adr     r12, .Lcount1
(gdb) i r q0 q1 q2 q3
q0  {u8 = {0x0, 0x0, 0x0, 0x2, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1},
     u16 = {0x0, 0x2, 0x0, 0x3, 0x0, 0x0, 0x0, 0x1}, u32 = {0x2, 0x3, 0x0, 0x1},
     u64 = {0x200000003, 0x1}, f32 = {0x0, 0x0, 0x0, 0x0}, f64 = {0x0, 0x0}}
q1  {u8 = {0x0, 0x0, 0x0, 0x6, 0x0, 0x0, 0x0, 0x7, 0x0, 0x0, 0x0, 0x4, 0x0, 0x0, 0x0, 0x5},
     u16 = {0x0, 0x6, 0x0, 0x7, 0x0, 0x4, 0x0, 0x5}, u32 = {0x6, 0x7, 0x4, 0x5},
     u64 = {0x600000007, 0x400000005}, f32 = {0x0, 0x0, 0x0, 0x0}, f64 = {0x0, 0x0}}
q2  {u8 = {0x0, 0x0, 0x0, 0xa, 0x0, 0x0, 0x0, 0xb, 0xff, 0xff, 0xff, 0xff, 0x0, 0x0, 0x0, 0x9},
     u16 = {0x0, 0xa, 0x0, 0xb, 0xffff, 0xffff, 0x0, 0x9}, u32 = {0xa, 0xb, 0xffffffff, 0x9},
     u64 = {0xa0000000b, 0xffffffff00000009}, f32 = {0x0, 0x0, 0x0, 0x0}, f64 = {0x0, 0x0}}
q3  {u8 = {0x0, 0x0, 0x0, 0xe, 0x0, 0x0, 0x0, 0xf, 0x0, 0x0, 0x0, 0xc, 0x0, 0x0, 0x0, 0xd},
     u16 = {0x0, 0xe, 0x0, 0xf, 0x0, 0xc, 0x0, 0xd}, u32 = {0xe, 0xf, 0xc, 0xd},
     u64 = {0xe0000000f, 0xc0000000d}, f32 = {0x0, 0x0, 0x0, 0x0}, f64 = {0x0, 0x0}}
Here gdb prints them as if they were stored in big-endian order in memory. So we have to read them in reverse to compare them to the LE output.
That would mean, if we read the LE output for q0.u32 as 0 1 2 3, the equivalent BE output would read 1 0 3 2 just as you guessed. So you're right and all my notes in the code are likely wrong because I read the output the wrong way around (again).
Otherwise I wonder if it would be possible for both chacha and salsa to change the actual loading and storing so there's no transposing of 32-bit operands. I looked at vld4.32 but that does some fancy de-interleaving and needs two operations to load four q registers.
The new powerpc code uses load and store instructions that behave the same in this respect, for both BE and LE. But not sure if there's any easy way on ARM. I'm not that familiar with the more special load and store instructions. Would vst2.32 be useful in some way for the final store (and vst3.32 for chacha-3core)?
For this I have found two candidates, once I wrapped my head around the (de-)interleaving part of VLDn/VSTn:
Option 1: VLDn.dt/VSTn.dt [2, C.13.5, page C-63]: It turns out the n in VLDn/VSTn is the number of interleaved elements and the .dt is the width/datatype of those elements. So vld2.32 loads 32-bit operands from memory that it assumes to hold two interleaved vectors, sending odd-numbered elements to one register and even-numbered ones to another. We neither need nor want that. That's where vld1/vst1 come in: they do no (de-)interleaving, just sequential loading or storing of elements. (It's already in use in umac-nh.asm but I didn't remember.)
The number of elements it loads only depends on the number of registers given. So vld1.64 {q0, q1}, [r1] does not mean "load one 64-bit operand into some part of q0 or q1" but "load 64-bit operands sequentially without deinterleaving until q0 and q1 are 'full'", i.e. four of them.
So for our case where we have a matrix of 32-bit words in host endianness that we need to load sequentially into q registers without any transposing we can use vld1.32 {q0, q1}, [r1].
This is also a drop-in fix for the 64-bit counter addition.
The drawback compared to vldm is that we need to issue two operations to load four q registers because each vld1/vst1 can only work with up to four d (i.e. two q) registers. This also means that we need to increment the base address for the second load which requires a scratch register if we want to keep the original value for later reference.
Regarding performance I found a document from ARM for the Cortex-A8 which had some cycle numbers[2]. According to it, two vld1's should take (at worst/no alignment) six cycles where vldm would run five cycles for the same amount of registers. This doesn't include any mov necessary to initialise the base address scratch register. The element size (e.g. .8 vs. .64) doesn't seem to play into it at all. It gets faster with better alignment. Here's a quick calculation with a bit of code for illustration:
C vst1.8 because caller expects results little-endian
C speed:
C 1 q register == 2 d registers, doc talks d registers
C vstm: (number of registers/2) + mod(number of registers, 2) + 1
C   == (8/2) + mod(8, 2) + 1 == 4 + 0 + 1 = 5 cycles
C vst1.8: 2 ops, each 4-reg unaligned: 2*3 == 6 cycles
C   (plus potentially mov to set up address counter)
IF_LE(` vstm    DST, {X0,X1,X2,X3}')
IF_BE(` vst1.8  {X0,X1}, [DST]!
        vst1.8  {X2,X3}, [DST]')
My feeling is that it doesn't matter much because it happens outside the main loop.
Attached are two patches, be-neon-asm-2.diff and 0002-arm-Unify-neon-asm-for-big-and-little-endian-modes.patch, for illustration of what using those instructions would look like. An armeb CI run is at https://gitlab.com/michaelweiser/nettle/-/jobs/932123909.
As expected, all the special treatment of transposed operands can just go away because the transposition doesn't happen any more. Also, vld1.32 (for sequential loads of 32-bit operands in host endianness) and vst1.8 (for sequential stores of register contents to get an implicit little-endian store without any vrev32.u8s) work the same on LE as well as BE. So we could use those as a separate BE implementation and leave the LE code conditionalized but otherwise intact, or we could unify the code to work for both cases without difference.
Option 2: By coincidence I found that vldm/vstm can work with s registers originally intended for use with VFP. They're just a different view of the d0-d15 or q0-q7 registers. When giving s registers as arguments to vldm/vstm they start to behave identically to vst1.32, i.e. load/save 32-bit words sequentially.
The drawback is that only q0 through q7 are mapped as s0 through s31. So we cannot use that mechanism to load directly into the higher eight q registers. The attached patch be-neon-asm-1.diff showcases what using those would look like. Where necessary, I loaded or stored via s0-s15 (i.e. q0-q3) using vmov or the already present vrev32s. Since that routinely clobbers those lower registers, I needed to add a second reload of the original context into T0-T3 in salsa20-2core.asm.
Also, it's not entirely clear to me from the documentation if this will work on every ARM core that supports NEON. The NEON programmer's guide[3] states that VLDM/VSTM is a shared VFP/NEON instruction and s registers *can* be specified. I read that to mean that it will work on every NEON core. It appears that every core that has NEON also has at least VFP3 but I've found no definite statement to that effect. Some sources speak of NEON as an extension to VFP but I've found no confirmation by ARM.
Also, it does not get rid of all those vrev32.u8s before the store on BE. All in all, option 1 (vld1/vst1) seems more straightforward and elegant to me. We could also opportunistically use both approaches where they fit best, i.e. vldm/vstm when working with q0-q7 and vld1.{8,32} for q8-q15.
[2] https://static.docs.arm.com/den0018/a/DEN0018A_neon_programmers_guide_en.pdf
[3] https://developer.arm.com/documentation/ddi0344/b/instruction-cycle-timing/i...
Michael Weiser michael.weiser@gmx.de writes:
Longer story for completeness: It seems I ran afoul of gdb's way of displaying registers in memory endianness again. I knew all this once already.[1] I should likely do this more often than every couple of years. ;)
I'm always confused by the conventions for ordering of the components of vector registers. When I write out values in code comments, I try to use the order in which the elements appeared in memory.
So for our case where we have a matrix of 32-bit words in host endianness that we need to load sequentially into q registers without any transposing we can use vld1.32 {q0, q1}, [r1].
This is also a drop-in fix for the 64-bit counter addition.
Sounds good.
The drawback compared to vldm is that we need to issue two operations to load four q registers because each vld1/vst1 can only work with up to four d (i.e. two q) registers. This also means that we need to increment the base address for the second load which requires a scratch register if we want to keep the original value for later reference.
Since we have plenty of registers available, (including r3 which seems unused and free to clobber), I'd suggest using
define(`SRCp32', `r3')
and an
add SRCp32, SRC, #32
in function entry, and then leave both SRC and SRCp32 unmodified for the rest of the function.
Regarding performance I found a document from ARM for the Cortex-A8 which had some cycle numbers[2]. According to it, two vld1's should take (at worst/no alignment) six cycles where vldm would run five cycles for the same amount of registers. [...]
My feeling is that it doesn't matter much because it happens outside the main loop.
If it's just a cycle or two per call, I think it's ok.
As expected, all the special treatment of transposed operands can just go away because the transposition doesn't happen any more. Also, vld1.32 (for sequential loads of 32-bit operands in host endianness) and vst1.8 (for sequential stores of register contents to get an implicit little-endian store without any vrev32.u8s) work the same on LE as well as BE.
Neat. Use of vst1.8 is worth a comment in the code (and/or arm/README).
Option 2: By coincidence I found that vldm/vstm can work with s registers originally intended for use with VFP. They're just a different view of the d0-d15 or q0-q7 registers. When giving s registers as arguments to vldm/vstm they start to behave identically to vst1.32, i.e. load/save 32-bit words sequentially.
[...]
Also, it's not entirely clear to me from the documentation if this will work on every ARM core that supports NEON. The NEON programmer's guide[3] states that VLDM/VSTM is a shared VFP/NEON instruction and s registers *can* be specified. I read that to mean that it will work on every NEON core. It appears that every core that has NEON also has at least VFP3 but I've found no definite statement to that effect. Some sources speak of NEON as an extension to VFP but I've found no confirmation by ARM.
That sounds a bit complicated, and since there's no great benefit over vld1, maybe best to stay away from that?
All in all, option 1 (vld1/vst1) seems more straightforward and elegant to me.
Sounds good to me too.
From 07c7ea6d62b33aa0c3e176c0e54ffc409fd78516 Mon Sep 17 00:00:00 2001
From: Michael Weiser michael.weiser@gmx.de
Date: Fri, 25 Dec 2020 17:13:52 +0100
Subject: [PATCH 2/2] arm: Unify neon asm for big- and little-endian modes
Switch arm neon assembler routines to endianness-agnostic loads and stores where possible to avoid modifications to the rest of the code. This involves switching to vld1.32 for loading consecutive 32-bit words in host endianness as well as vst1.8 for storing back to memory in little-endian order as required by the caller.
I like this approach. It would be nice if you could benchmark it on little-endian, to verify that there's no unexpectedly large speed regression (a regression of just a cycle or two per block, if that's at all measurable, is ok, I think).
PROLOGUE(_nettle_chacha_3core)
- vldm SRC, {X0,X1,X2,X3}
+ mov r12, SRC
+ vld1.32 {X0,X1}, [r12]!
+ vld1.32 {X2,X3}, [r12]
My suggestion is to do this as
add SRCp32, SRC, #32
vld1.32 {X0,X1}, [SRC]
vld1.32 {X2,X3}, [SRCp32]
and reuse SRCp32 for the second load of the same data, further down (assuming r3 really is free to use for this purpose; if we have to save and restore a register to do this, your approach with temporary use of r12 seems better). Another option, with no need for an extra register, is to just use post-increment, modifying SRC here, and either explicitly subtract 32, or use the opposite load order and pre-decrement for the second load.
Regards, /Niels
Hello Niels,
On Fri, Dec 25, 2020 at 10:48:19PM +0100, Niels Möller wrote:
Since we have plenty of registers available, (including r3 which seems unused and free to clobber), I'd suggest using
define(`SRCp32', `r3')
and an
add SRCp32, SRC, #32
in function entry, and then leave both SRC and SRCp32 unmodified for the rest of the function.
I've done that and according to nettle-benchmark it saves one to two cycles per block compared to the mov+postincrement approach.
As expected, all the special treatment of transposed operands can just go away because the transposition doesn't happen any more. Also, vld1.32 (for sequential loads of 32-bit operands in host endianness) and vst1.8 (for sequential stores of register contents to get an implicit little-endian store without any vrev32.u8s) work the same on LE as well as BE.
Neat. Use of vst1.8 is worth a comment in the code (and/or arm/README).
I added those where it seemed to make sense. It was already in the README but I've extended it a bit with the new findings.
Option 2: By coincidence I found that vldm/vstm can work with s registers originally intended for use with VFP. They're just a different
That sounds a bit complicated, and since there's no great benefit over vld1, maybe best to stay away from that?
Also, interestingly, when I use vldm to s regs wherever possible (see second attached patch), it doesn't give any speedup. It saves the scratch register in all routines I've touched, though. In general, it seems that add+2*vld1.32 is exactly the same number of cycles as the equivalent vldm.
Switch arm neon assembler routines to endianness-agnostic loads and stores where possible to avoid modifications to the rest of the code. This involves switching to vld1.32 for loading consecutive 32-bit words in host endianness as well as vst1.8 for storing back to memory in little-endian order as required by the caller.
I like this approach. It would be nice if you could benchmark it on little-endian, to verify that there's no unexpectedly large speed regression (a regression of just a cycle or two per block, if that's at all measurable, is ok, I think).
It comes out at around seven cycles per block slowdown for chacha-3core and five for salsa20-2core. I trace this to vst1.8. It's just slower than vstm (in contrast to vldm vs. vld1.32). I managed to save a cumulative two cycles by rescheduling instructions so that there's no two consecutive vst1.8s which seems to avoid stalls in the pipeline or bus access waits (at least on my machine). Element width (8 vs. 32 vs. 64) doesn't seem to play into it. Alignment can't be used to improve performance: The tests immediately bus error when giving a :64 alignment hint to vst1.8.
Baseline with --disable-assembler comes in with these numbers on my Cubieboard2 with 1GHz Allwinner A20 which is a Cortex-A7 implementation:
[michael@c2-le:~/nettle/build-noasm/examples] LD_LIBRARY_PATH=../.lib ./nettle-benchmark -f 1000000000 chacha salsa20
Algorithm mode Mbyte/s cycles/byte cycles/block
chacha          encrypt     30.43      31.34    2005.82
chacha          decrypt     30.41      31.36    2006.89

chacha_poly1305 encrypt     23.57      40.47    2589.77
chacha_poly1305 decrypt     23.55      40.50    2592.15
chacha_poly1305 update     104.42       9.13     584.51

salsa20         encrypt     35.10      27.17    1738.73
salsa20         decrypt     35.10      27.17    1738.75

salsa20r12      encrypt     50.12      19.03    1217.75
salsa20r12      decrypt     50.15      19.01    1216.93
(BTW: Am I using the benchmark correctly, particularly the frequency parameter?)
Baseline unmodified assembler routines (without --enable-fat) come in at:
[michael@c2-le:~/nettle/build-orig/examples] LD_LIBRARY_PATH=../.lib ./nettle-benchmark -f 1000000000 chacha salsa20
Algorithm mode Mbyte/s cycles/byte cycles/block
chacha          encrypt     63.06      15.12     967.83
chacha          decrypt     63.06      15.12     967.82

chacha_poly1305 encrypt     39.18      24.34    1557.72
chacha_poly1305 decrypt     39.18      24.34    1557.96
chacha_poly1305 update     104.38       9.14     584.75

salsa20         encrypt     62.15      15.34     982.04
salsa20         decrypt     62.07      15.36     983.33

salsa20r12      encrypt     92.69      10.29     658.48
salsa20r12      decrypt     92.70      10.29     658.43
Attached unified code (patch 0001) comes in like this:
[michael@c2-le:~/nettle/build-unified-add/examples] LD_LIBRARY_PATH=../.lib ./nettle-benchmark -f 1000000000 chacha salsa20
Algorithm mode Mbyte/s cycles/byte cycles/block
chacha          encrypt     62.61      15.23     974.79
chacha          decrypt     62.62      15.23     974.72

chacha_poly1305 encrypt     39.14      24.36    1559.28
chacha_poly1305 decrypt     39.18      24.34    1558.00
chacha_poly1305 update     103.65       9.20     588.88

salsa20         encrypt     61.80      15.43     987.65
salsa20         decrypt     61.81      15.43     987.51

salsa20r12      encrypt     91.88      10.38     664.30
salsa20r12      decrypt     91.91      10.38     664.07
What's nice is that the same code gives very consistent numbers on BE (no idea what's going on with poly1305 though):
[michael@c2-be:~/nettle/build-unified-add/examples] LD_LIBRARY_PATH=../.lib ./nettle-benchmark -f 1000000000 chacha salsa20
Algorithm mode Mbyte/s cycles/byte cycles/block
chacha          encrypt     62.56      15.25     975.69
chacha          decrypt     62.62      15.23     974.68

chacha_poly1305 encrypt     38.40      24.83    1589.32
chacha_poly1305 decrypt     38.40      24.83    1589.38
chacha_poly1305 update      99.92       9.54     610.86

salsa20         encrypt     61.80      15.43     987.58
salsa20         decrypt     61.81      15.43     987.41

salsa20r12      encrypt     91.90      10.38     664.14
salsa20r12      decrypt     91.93      10.37     663.92
As said, the second patch (switching back to vldm via s regs where possible) doesn't change these numbers at all (but saves a register).
(What's nice about my boards is that due to missing power-saving and frequency-scaling functionality they give very, very consistent numbers across multiple runs.)
My first reflex is that 400Kbyte/s for chacha and 350Kbyte/s for salsa20 is relevant enough to keep separate implementations for LE and BE in the code *or* dig deeper into why vst1.8 is so much slower.
Do you (or anybody else) have a hardware arm board for testing, possibly with a Cortex A8 or A9 implementation to see how it behaves there?
I have a couple of RasPis and little- and big-endian pine64s (aarch64) gathering dust in a box which I could fire up for some testing (not sure about 32-bit support on the pine64s, though).
and reuse SRCp32 for the second load of the same data, further down (assuming r3 really is free to use for this purpose; if we have to save
I read AAPCS as saying that r3 can be used as a scratch register in between subroutine calls. Since we don't do subroutine calls, its use should be fine.
I've got one side-track which might point to some peculiarity of my machine: The unmodified assembler code *without* chacha-3core and salsa20-2core (files moved out of the way before configure) is no faster or even slower than what the C compiler produces:
[michael@c2-le:~/nettle/build-no23core/examples] LD_LIBRARY_PATH=../.lib ./nettle-benchmark -f 1000000000 chacha salsa20
Algorithm mode Mbyte/s cycles/byte cycles/block
chacha          encrypt     31.35      30.42    1946.66
chacha          decrypt     31.34      30.43    1947.30

chacha_poly1305 encrypt     24.10      39.57    2532.24
chacha_poly1305 decrypt     24.10      39.57    2532.21
chacha_poly1305 update     104.42       9.13     584.53

salsa20         encrypt     30.38      31.39    2008.96
salsa20         decrypt     30.39      31.38    2008.34

salsa20r12      encrypt     47.00      20.29    1298.56
salsa20r12      decrypt     47.01      20.29    1298.25
Does this seem reasonable or does it point to some flaw in my benchmarking or system software/hardware? (I've done my best using gdb to verify that the asm routines are in use. Unfortunately, nettle-benchmark is resisting attempts to ltrace or gdb-debug it, so I diagnosed the testsuite tests instead.)
Michael Weiser michael.weiser@gmx.de writes:
It comes out at around seven cycles per block slowdown for chacha-3core and five for salsa20-2core. I trace this to vst1.8. It's just slower than vstm (in contrast to vldm vs. vld1.32). I managed to save a cumulative two cycles by rescheduling instructions so that there's no two consecutive vst1.8s which seems to avoid stalls in the pipeline or bus access waits (at least on my machine). Element width (8 vs. 32 vs. 64) doesn't seem to play into it.
Thanks for investigating. Maybe keep some IF_BE / IF_LE just for the store instructions, to stay with vstm on little-endian?
(BTW: Am I using the benchmark correctly, particularly the frequency parameter?)
I think it's right. But it's a floating point number, so -f 1e9 for 1 GHz should work too.
Alignment can't be used to improve performance: The tests immediately bus error when giving a :64 alignment hint to vst1.8.
Unfortunately, I'm not aware of any nice and portable way to enforce alignment from the calling C code.
Do you (or anybody else) have a hardware arm board for testing, possibly with a Cortex A8 or A9 implementation to see how it behaves there?
I have access to the GMP test systems on https://gmplib.org/devel/testsystems, but little time to benchmark things in the near future.
I've got one side-track which might point to some peculiarity of my machine: The unmodified assembler code *without* chacha-3core and salsa20-2core (files moved out of the way before configure) is no faster or even slower than what the C compiler produces:
[...]
Does this seem reasonable or does it point to some flaw in my benchmarking or system software/hardware?
That's unexpected. In principle I guess it's possible for the C compiler to generate great vectorized code, but that seems a bit unlikely. Do you get the same results if you build Nettle-3.6?
From ChangeLog comments, it seems I got a 45% speedup for Salsa20, compared to the C implementation, when I wrote the original neon assembly code. At the time, benchmarked on a pandaboard (cortex a9), if I remember correctly.
Is it a fat build? If so, it's possible that the fat setup logic selects the C implementation in this hacked setup (but on the other hand, I'd guess a fat build would just fail at link time if these files are removed).
Regards, /Niels
Happy new year, Niels and all around,
On Wed, Dec 30, 2020 at 09:12:24PM +0100, Niels Möller wrote:
It comes out at around seven cycles per block slowdown for chacha-3core and five for salsa20-2core. I trace this to vst1.8. It's just slower
Thanks for investigating. Maybe keep some IF_BE / IF_LE just for the store instructions, to stay with vstm on little-endian?
Sounds good. I'll try to finalise a patch and reconfirm that there's no speed regression from it.
Does this seem reasonable or does it point to some flaw in my benchmarking or system software/hardware?
That's unexpected. In principle I guess it's possible for the C compiler to generate great vectorized code, but that seems a bit unlikely. Do you get the same results if you build Nettle-3.6?
With the help of Jeff I've gone on a bit of a benchmark binge using a:
- Raspberry Pi 1B (Broadcom BCM2835, arm11),
- Cubieboard2 (Allwinner A20, Cortex-A7),
- Wandboard (Freescale i.MX6 DualLite, Cortex-A9),
- Tinkerboard (Rockchip RK3288, Cortex-A17) and
- Raspberry Pi 4 (Broadcom BCM2711, Cortex-A72).
The rpi1b doesn't do NEON, so there are no numbers for that. I booted the rpi4 with Ubuntu 20.04 armhf with 32-bit ARM kernel and userland to avoid any influence of switches from/to 64-bit mode. Some other metrics of the systems (such as compiler) and the build commands used are in the attached result notes. The Debian and Ubuntu systems had cpufreq activated. Since I didn't want to mess with that, I ran the benchmark multiple times in a loop to get cpufreq to scale up.
I've put together a small script that parses the manual notes for plotting using gnuplot. That produced the attached charts, which are quite interesting.
t=$(mktemp) ; cat nettle-arm-bench.txt | python3 nettle-arm-bench.py >$t ; gnuplot -e "set term pngcairo font 'sans,9' size 960, 540; set style data histograms; set ylabel 'cycles/block'; set xtics rotate out; set style fill solid border; set style histogram clustered; plot for [COL=2:5] '$t' using COL:xticlabels(1) title columnheader;" >nettle-arm-bench-chart.png ; rm -f "$t"
From ChangeLog comments, it seems I got 45% speedup for Salsa20,
compared to the C implementation, when I wrote the original neon assembly code. At the time, benchmarked on a pandaboard (cortex a9), if I remember correctly.
I've disassembled an example of what the C compiler produces (I think chacha-core-internal.o) and there were no NEON instructions in there. At first glance it looked very similar to the armv6 assembler code.
BTW: The compilers default to their respective architecture, so would produce armv5 code on the rpi1b and armv7 on tinkerboard/wandboard/cubieboard2/rpi4.
If these numbers are correct, it would seem that gcc got a *lot* better in optimising for ARM in recent versions. And ARM seems to have continuously improved native ARM instruction performance but NEON has been stagnant.
What confuses me is that the arm, armv6 and neon routines all give approximately the same speed. I'd have expected some visible difference there. Maybe I'm still just doing something wrong here?
At least the numbers rule out some peculiarity of the Cubieboards or my Gentoo installation, IMO.
Is it for a fat build? If so, it's possible that the fat setup logic selects the C implementation in this hacked setup (but on the other hand, I'd guess a fat build would just fail at link time if these files are removed).
I did not enable fat for nettle 3.6 and explicitly disabled it for master. I forced selection of specific routines using configure options.
Michael Weiser michael.weiser@gmx.de writes:
Happy new year, Niels and all around,
On Wed, Dec 30, 2020 at 09:12:24PM +0100, Niels Möller wrote:
It comes out at around seven cycles per block slowdown for chacha-3core and five for salsa20-2core. I trace this to vst1.8. It's just slower
Thanks for investigating. Maybe keep some IF_BE / IF_LE just for the store instructions, to stay with vstm on little-endian?
Sounds good. I'll try to finalise a patch and reconfirm that there's no speed regression from it.
Sounds good!
With the help of Jeff I've gone on a bit of a benchmark binge using a:
- Raspberry Pi 1B (Broadcom BCM2835, arm11),
- Cubieboard2 (Allwinner A20, Cortex-A7),
- Wandboard (Freescale i.MX6 DualLite, Cortex-A9),
- Tinkerboard (Rockchip RK3288, Cortex-A17) and
- Raspberry Pi 4 (Broadcom BCM2711, Cortex-A72).
The rpi1b doesn't do NEON, so there's no numbers for that. I booted the rpi4 with Ubuntu 20.04 armhf with arm32 kernel and userland to avoid any influence of switches from/to 64bit mode. Some other metrics of the systems (such as compiler) and the build commands used are in the attached result notes. The Debian and Ubuntu systems had cpufreq activated. Since I didn't want to mess with that, I ran the benchmark multiple times in a loop to get cpufreq to scale up.
I've put together a small script that parses the manual notes for plotting using gnuplot. That produced the attached charts, which are quite interesting.
Thanks for investigating. So from these charts, it looks like the single-block Neon code is of no benefit on any of the test systems. And even significantly slower on the tinkerboard and rpi4.
If that's right, the code should probably just be deleted. But I'll have to do a little benchmarking on my own before doing that.
If these numbers are correct, it would seem that gcc got a *lot* better in optimising for ARM in recent versions. And ARM seems to have continuously improved native ARM instruction performance but NEON has been stagnant.
Interesting.
What confuses me is that the arm, armv6 and neon routines all give approximately the same speed. I'd have expected some visible difference there. Maybe I'm still just doing something wrong here?
If you look specifically at salsa20 and chacha performance, there's no arm or armv6 assembly, so arm, armv6 and noasm should all use the C implementation. While neon will run different code (unless something is highly messed up in the config).
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
Thanks for investigating. So from these charts, it looks like the single-block Neon code is of no benefit on any of the test systems. And even significantly slower on the tinkerboard and rpi4.
If that's right, the code should probably just be deleted. But I'll have to do a little benchmarking on my own before doing that.
I've done a benchmark run of nettle-3.6 on the GMP "nanot2" system, with a Cortex-A9 processor. The installed compiler is gcc-5.4 (a few years old). This is what I get:
nisse@nanot2:~/build$ nettle-nanot2-noasm/config.status --version
nettle config.status 3.6
configured by /home/nisse/hack/nettle-3.6/configure, generated by GNU Autoconf 2.69,
  with options "'--disable-shared' '--disable-assembler'"
nisse@nanot2:~/build$ nettle-nanot2-noasm/examples/nettle-benchmark -f 1.4e9 salsa20
benchmark call overhead: 0.006500 us 9.10 cycles
Algorithm mode Mbyte/s cycles/byte cycles/block
salsa20        encrypt      78.52       17.00      1088.22
salsa20        decrypt      78.52       17.00      1088.22

salsa20r12     encrypt     111.62       11.96       765.57
salsa20r12     decrypt     111.62       11.96       765.57
nisse@nanot2:~/build$ nettle-nanot2-noasm/examples/nettle-benchmark -f 1.4e9 chacha
benchmark call overhead: 0.006500 us 9.10 cycles
Algorithm mode Mbyte/s cycles/byte cycles/block
chacha         encrypt      66.21       20.17      1290.57
chacha         decrypt      66.21       20.17      1290.57
-------------
nisse@nanot2:~/build$ nettle-nanot2-neon/config.status --version
nettle config.status 3.6
configured by /home/nisse/hack/nettle-3.6/configure, generated by GNU Autoconf 2.69,
  with options "'--disable-shared' '--enable-arm-neon'"
nisse@nanot2:~/build$ nettle-nanot2-neon/examples/nettle-benchmark -f 1.4e9 salsa20
benchmark call overhead: 0.006450 us 9.03 cycles
Algorithm mode Mbyte/s cycles/byte cycles/block
salsa20        encrypt      74.41       17.94      1148.38
salsa20        decrypt      74.41       17.94      1148.38

salsa20r12     encrypt     113.56       11.76       752.44
salsa20r12     decrypt     113.56       11.76       752.44
nisse@nanot2:~/build$ nettle-nanot2-neon/examples/nettle-benchmark -f 1.4e9 chacha
benchmark call overhead: 0.006438 us 9.01 cycles
Algorithm mode Mbyte/s cycles/byte cycles/block
chacha         encrypt      75.12       17.77      1137.44
chacha         decrypt      75.12       17.77      1137.44
So no big differences, but the neon code improves performance slightly for chacha and salsa20r12, and degrades performance slightly for salsa20.
I had a quick look at the disassembly of the C implementations, and it uses a fair amount of loads and stores to the stack in the loop (since there are too few general purpose registers for the state to fit). But maybe it's well enough scheduled that many instructions can be executed in parallel. Compare that to the Neon code, which does more work per instruction, but with dependencies forcing sequential execution of the instructions.
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
I've done a benchmark run of nettle-3.6 on the GMP "nanot2" system, with a Cortex-A9 processor. The installed compiler is gcc-5.4 (a few years old).
I chose Cortex-A9 for this test in an attempt to reproduce my old numbers. Even if it's probably not that relevant today.
So no big differences, but the neon code improves performance slightly for chacha and salsa20r12, and degrades performance slightly for salsa20.
(The improvement for chacha actually seems significant, 13% speedup for the Neon code).
This is all about the old single-block functions. The Neon code for both salsa20 and chacha uses instructions operating on four 32-bit entries at a time. But most instructions depend on the result of the previous instruction, and latency of Neon instructions is pretty high. According to measurements by Torbjörn Granlund, we typically have a latency of at *least* two cycles (the only observed case of single-cycle latency was for veor on A53 and A55).
In addition, two shift operations, even if they are independent, typically can't be issued in the same cycle, because they compete for a single shift unit. So if we look at a single step (i.e., a quarter of a QROUND) and annotate with latency numbers, i.e., the earliest cycle each instruction can be started, and for simplicity assume that all instructions but veor have a latency of 2 cycles, we get (this is for salsa20):
vadd.i32  q8, q0, q3    0    t = x0 + x1
vshl.i32  q9, q8, #7    2    t <<<= 7
vshr.u32  q8, q8, #25   3
veor      q1, q1, q8    4    x1 ^= t
veor      q1, q1, q9    5

vadd.i32  q8, q0, q1    6    (next QROUND)
So that's 6 cycles, for the same work as 12 scalar (32-bit) operations (rotation is a single operation if done on scalar registers). So at best, we can expect to get two 32-bit operations done per cycle. For SIMD, that's not great at all.
For processors that can issue two instructions per cycle, and with shorter latency, scalar code (i.e., code using only the general purpose 32-bit registers) could get more or less the same throughput. The scalar code also gets the advantage that there's a handy rotate instruction (instead of the shift right + shift left + combine used in the Neon code), but it has the disadvantage of register shortage, and will need a bunch of load and store instructions to access the state.
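To make the comparison concrete, here is a sketch of one salsa20 quarterround on scalar 32-bit values, in plain C (the rotl32 helper and variable names are mine, not Nettle's):

```c
#include <stdint.h>

/* Rotate left by n bits (0 < n < 32); gcc and clang recognize this
 * pattern and emit a single rotate instruction on ARM. */
static inline uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* One salsa20 quarterround on scalar registers.  Each line is an add,
 * a rotate and an xor: three 32-bit operations per word, versus the
 * five-instruction vadd/vshl/vshr/veor/veor sequence the Neon code
 * needs per 128-bit vector. */
static void quarterround(uint32_t *y0, uint32_t *y1, uint32_t *y2, uint32_t *y3)
{
    *y1 ^= rotl32(*y0 + *y3, 7);
    *y2 ^= rotl32(*y1 + *y0, 9);
    *y3 ^= rotl32(*y2 + *y1, 13);
    *y0 ^= rotl32(*y3 + *y2, 18);
}
```

With the test vector from the Salsa20 spec, quarterround on (1, 0, 0, 0) gives (0x08008145, 0x80, 0x10200, 0x20500000). The catch, as noted above, is that the full 16-word state doesn't fit in the scalar register file, so real scalar code pays for spills that the Neon version avoids.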
That doesn't quite explain why I saw a 45% speedup with Neon in 2013, which has now disappeared. But maybe current gcc has good enough instruction scheduling to produce code that can issue 2 instructions per cycle on Cortex-A9 (which has quite limited out-of-order capabilities), and gcc back then couldn't do that?
So what's next? Should the old code just be deleted?
With the new 2-way or 3-way functions, performance of the single-block functions isn't that critical, so deletion may be ok even if it causes some small regression on some processors (e.g., single-block chacha getting 13% slower on the old Cortex-A9)
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
For processors that can issue two instructions per cycle, and with shorter latency, scalar code (i.e., code using only the general purpose 32-bit registers) could get more or less the same throughput. The scalar code also gets the advantage that there's a handy rotate instruction (instead of the shift right + shift left + combine used in the Neon code), but it has the disadvantage of register shortage, and will need a bunch of load and store instructions to access the state.
That doesn't quite explain why I saw a 45% speedup with Neon in 2013, which has now disappeared. But maybe current gcc has good enough instruction scheduling to produce code that can issue 2 instructions per cycle on Cortex-A9 (which has quite limited out-of-order capabilities), and gcc back then couldn't do that?
So what's next? Should the old code just be deleted?
With the new 2-way or 3-way functions, performance of the single-block functions isn't that critical, so deletion may be ok even if it causes some small regression on some processors (e.g., single-block chacha getting 13% slower on the old Cortex-A9)
I've made a branch with deletion of this code, "delete-1-way-neon". Any comments before I merge to master?
Regards, /Niels
Hello Niels,
On Thu, Jan 28, 2021 at 07:26:46PM +0100, Niels Möller wrote:
With the new 2-way or 3-way functions, performance of the single-block functions isn't that critical, so deletion may be ok even if it causes some small regression on some processors (e.g., single-block chacha getting 13% slower on the old Cortex-A9)
I've made a branch with deletion of this code, "delete-1-way-neon". Any comments before I merge to master?
Removing them also lowers the amount of code to maintain. I've done a few quick builds of the branch, with and without assembly and NEON in particular enabled. All combinations build and pass the testsuite, and the benchmark results look consistent to me.
Hello Niels,
On Fri, Jan 01, 2021 at 06:07:14PM +0100, Niels Möller wrote:
With the help of Jeff I've gone on a bit of a benchmark binge using a:
- Raspberry Pi 1B (Broadcom BCM2835, arm11),
- Cubieboard2 (Allwinner A20, Cortex-A7),
- Wandboard (Freescale i.MX6 DualLite, Cortex-A9),
- Tinkerboard (Rockchip RK3288, Cortex-A17) and
- Raspberry Pi 4 (Broadcom BCM2711, Cortex-A72).
Thanks for investigating. So from these charts, it looks like the single-block Neon code is of no benefit on any of the test systems. And even significantly slower on the tinkerboard and rpi4.
Attached is the new patch that unconditionally switches from vldm to vld1.32, but for stores on little-endian keeps vstm rather than switching to vst1.8.
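The trade-off behind that choice can be sketched in C (hypothetical helper names, not Nettle code): a byte-granular store like vst1.8 fixes the output byte order regardless of host endianness, while a word store like vstm emits host order, so only the big-endian path needs the reverse-on-store step.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers, not Nettle code.  A byte-granular store
 * (roughly what vst1.8 does for a register holding bytes in lane
 * order) produces the same memory layout on LE and BE hosts: */
static void store_le32_bytes(uint8_t *dst, uint32_t w)
{
    dst[0] = (uint8_t) (w & 0xff);
    dst[1] = (uint8_t) ((w >> 8) & 0xff);
    dst[2] = (uint8_t) ((w >> 16) & 0xff);
    dst[3] = (uint8_t) ((w >> 24) & 0xff);
}

/* A word store (roughly what vstm does) emits the word in host byte
 * order, so on big-endian the word has to be byte-reversed first --
 * the reverse-on-store logic in the IF_BE path: */
static void store_host32(uint8_t *dst, uint32_t w)
{
    memcpy(dst, &w, sizeof w);
}
```

On little-endian the two agree, which is why the patch can keep the faster vstm there and reserve the extra work for BE.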
I've done some additional benchmarks to verify the impact on performance. I've used the wandboard, tinkerboard and rpi4 as before, and cubieboard2s in little- and big-endian modes. This time I switched the cpufreq governor of the first three to "performance" to get more stable numbers (which helped noticeably; @Jeff: I switched back to ondemand on your boxes afterwards). Also, I did ten consecutive runs of the benchmark and naively averaged the numbers (see attached raw data document). With another python script (attached) I created another chart using gnuplot[1].
This time I've normalised the numbers to percentages with unmodified master as reference to give a clearer indication of very small changes. So the first and fourth bar of each group (master and master-no23core) represent 100 percent for the following two bars respectively. The second bar (-unified) shows the values for the attached patch. The third bar (-unified-full) shows the values for the previous patch which unconditionally used vst1. -no23core again shows performance with chacha-3core and salsa20-2core disabled.
The graph shows the expected slowdown when using vst1 for cubieboard and wandboard. The slowdown for the big-endian cubieboard (second cluster) can be ignored because the faster routines on unmodified master are broken. So the second and third bar just show the performance that needs to be sacrificed to get them working compared to LE.
On cubieboard, wandboard and tinkerboard there's still a small overhead from the switch to vld1.32 which was not reliably visible in my earlier benchmarks.
What's interesting is that on both tinkerboard and rpi4 there are also speedups from the switch to vld1.32 and even vst1.8 (the latter also on the wandboard, but only for the likely irrelevant single-core routines). So it seems the performance penalty isn't set in stone and may differ between generations and implementations.
From that point of view, taking the slight performance hit for vld1.32 but keeping vstm on LE seems the best compromise, at least for the benchmarked set of machines.
Do you have any ideas how it might be that the wandboard, tinkerboard and rpi4 show speedups with vst1.8 for one algorithm but slowdowns for the other and even contradict each other in that? Does it make sense to dig into that some more or should we leave it be for now?
[1] t=$(mktemp) ; cat nettle-arm-bench-2.txt | python3 nettle-arm-bench-2.py >$t ; gnuplot -e "set term pngcairo font 'sans,9' size 960, 540; set style data histograms; set ylabel 'cycles/block'; set yrange [98:]; set xtics rotate out; set style fill solid border; set style histogram clustered; plot for [COL=2:7] '$t' using COL:xticlabels(1) title columnheader;" >nettle-arm-bench-chart-2.png ; rm -f "$t"
What confuses me is that the arm, armv6 and neon routines all give approximately the same speed. I'd have expected some visible difference
If you look specifically at salsa20 and chacha performance, there's no arm or armv6 assembly, so arm, armv6 and noasm should all use the C implementation. While neon will run different code (unless something is
Duh. So the slight differences were most likely due to the arm native assembly memxor routines.
Michael Weiser michael.weiser@gmx.de writes:
Attached is the new patch that unconditionally switches from vldm to vld1.32, but for stores on little-endian keeps vstm rather than switching to vst1.8.
Thanks! Applied now.
From that point of view, taking the slight performance hit for vld1.32 but keeping vstm on LE seems the best compromise, at least for the benchmarked set of machines.
I agree. One could consider having several variants and do code selection depending on processor flavor. But I don't think that's worth the effort if difference is just a percent or so.
Do you have any ideas how it might be that the wandboard, tinkerboard and rpi4 show speedups with vst1.8 for one algorithm but slowdowns for the other and even contradict each other in that? Does it make sense to dig into that some more or should we leave it be for now?
I'd guess the algorithms differ in the details in how vst1.8 is scheduled, and that's why vst1.8 is more or less efficient.
Regards, /Niels
Hello Niels,
On Wed, Jan 13, 2021 at 01:43:38PM +0100, Niels Möller wrote:
Attached is the new patch that unconditionally switches from vldm to vld1.32, but for stores on little-endian keeps vstm rather than switching to vst1.8.
Thanks! Applied now.
Perfect! Incidentally: the other day I was migrating my big-endian cubieboard from LibreSSL to OpenSSL. Afterwards I was wondering why their assembly routines didn't fail as I remembered, which was why I had disabled them for LibreSSL. From a quick glance at the OpenSSL code, it seems they're doing exactly the same thing using v{ld,st}1 [1,2].
[1] https://github.com/openssl/openssl/blob/8bc5b0a570c8a2c9886a3cae9dea2016d510... [2] https://github.com/openssl/openssl/blob/8bc5b0a570c8a2c9886a3cae9dea2016d510...
On Tue, Dec 29, 2020 at 5:15 PM Michael Weiser michael.weiser@gmx.de wrote:
... Do you (or anybody else) have a hardware arm board for testing, possibly with a Cortex A8 or A9 implementation to see how it behaves there?
I've got a Wandboard/Cortex-A9 and a Tinkerboard/Cortex-A17 hanging off the internet with SSH access.
Send over your authorized_keys if you want access.
Jeff
nettle-bugs@lists.lysator.liu.se