I think I'd like to make a nettle-2.6 release fairly soon. Recent changes:
1. I disabled the x86_64 assembly for sha3_permute. It gave a very modest speedup on the Intel processor I benchmarked it on, and a severe slowdown on the AMD processor I also benchmarked it on. The latter machine seemed to execute the loop at only one instruction per cycle, rather than three as it should; my best guess is that it's the moves of data between regular registers and xmm registers that somehow stall.
Maybe it could be rewritten to use xmm registers exclusively, but then register allocation gets *very* tight, so one might need to keep a few words of the state on the stack instead. But I don't think I'll try that soon; the current C implementation is reasonably efficient, with performance of sha256 and sha3-256 in the same ballpark (but both slower than sha512).
2. I think I fixed the bugs in some subdirectory make targets which broke "make install" without a preceding "make all".
Ah, and a technical detail. There are no new features added to libhogweed, but I still intend to increment the minor number of that shared library in the release. Is that right, or should I keep the same hogweed minor number as in nettle-2.5 (i.e., libhogweed.so.2.2)?
I've updated the NEWS file (current version of the 2.6 entries appended below, for convenience). Are you aware of any missing pieces, either in the code, in NEWS, or in other documentation?
Regards, /Niels
NEWS for the 2.6 release
Bug fixes:
* Fixed a bug in ctr_crypt. For zero length (which should be a NOP), it sometimes incremented the counter. Reported by Tim Kosse.
* Fixed a small memory leak in nettle_realloc and nettle_xrealloc.
New features:
* Support for PKCS #5 PBKDF2. Contributed by Simon Josefsson. Specification in RFC 2898 and test vectors in RFC 6070.
* Support for GOST R 34.11-94 hash algorithm. Ported from librhash by Nikos Mavrogiannopoulos. Written by Aleksey Kravchenko. More information in RFC 4357. Test vectors taken from the GOST hash Wikipedia page.
* Support for SHA3.
Miscellaneous:
* The include file <nettle/sha.h> has been split into <nettle/sha1.h> and <nettle/sha2.h>. For now, sha.h is kept for backwards compatibility and it simply includes both files, but applications are encouraged to use the new names. The new SHA3 functions are declared in <nettle/sha3.h>.
* Testsuite can be run under valgrind, using
make check EMULATOR='$(VALGRIND)'
For this to work, test programs and other executables now deallocate storage.
* New configure options --disable-documentation and --disable-static. Contributed by Sam Thursfield and Alon Bar-Lev, respectively.
* The section on hash functions in the manual is split into separate nodes for recommended hash functions and legacy hash functions.
* Various smaller improvements, most of them portability fixes. Credits go to David Woodhouse, Tim Rühsen, Martin Storsjö, Nikos Mavrogiannopoulos, Fredrik Thulin and Dennis Clarke.
Finally, a note on the naming of the various "SHA" hash functions. Naming is a bit inconsistent; we have, e.g.,
SHA1: sha1_digest
SHA2: sha256_digest (not sha2_256_digest)
SHA3: sha3_256_digest
Renaming the SHA2 functions to make Nettle's naming more consistent has been considered, but the current naming follows common usage. Most documents (including the specification for SHA2) refer to 256-bit SHA2 as "SHA-256" or "SHA256" rather than "SHA2-256".
The libraries are intended to be binary compatible with nettle-2.2 and later. The shared library names are libnettle.so.4.5 and libhogweed.so.2.3, with sonames still libnettle.so.4 and libhogweed.so.2.
"NM" == Niels Möller nisse@lysator.liu.se writes:
NM> my best guess is that it's the
NM> moves of data between regular registers and xmm registers that
NM> somehow stall.
IIRC, the advice I've seen is to always move data between the integer registers and the xmm registers via the stack.
All of the relevant gcc- and llvm-produced code I've looked at (at least over the last few months; I can't remember too far back) follows that pattern.
Yes. The 47414_15h_sw_opt_guide.pdf, in §10.4, says:
,----< §10.4, p169 of 47414_15h_sw_opt_guide.pdf¹ >
| Optimization
|
| When moving data from a GPR to an XMM register, use separate store and
| load instructions to move the data first from the source register to a
| temporary location in memory and then from memory into the destination
| register, taking the memory latency into account when scheduling both
| stages of the load-store sequence.
|
| When moving data from an XMM register to a general-purpose register,
| use the VMOVD instruction.
|
| Whenever possible, use loads and stores of the same data length. (See
| 6.3, "Store-to-Load Forwarding Restrictions" on page 98 for more
| information.)
`----
VMOVD, obviously, doesn't apply for fam10 and earlier; I didn't look through my archive to find the sw_opt_guide for earlier processors, though.
1] http://support.amd.com/us/Processor_TechDocs/47414_15h_sw_opt_guide.pdf
-JimC
James Cloos cloos@jhcloos.com writes:
,----< §10.4, p169 of 47414_15h_sw_opt_guide.pdf >
| Optimization
|
| When moving data from a GPR to an XMM register, use separate store and
| load instructions to move the data first from the source register to a
| temporary location in memory and then from memory into the destination
| register, taking the memory latency into account when scheduling both
| stages of the load-store sequence.
`----
Thanks for the hint. Maybe I can try that, it sounds like a fairly easy fix. If I can get the code to run at three instructions per cycle, that would be a pretty nice speedup on amd processors.
| Whenever possible, use loads and stores of the same data length. (See
| 6.3, "Store-to-Load Forwarding Restrictions" on page 98 for more
| information.)
Not sure how to interpret this. The interesting cases here are:
1. Writing the 64 low bits of an xmm register (movq with a memory destination), and reading it back into a gpr.
2. Writing a 128-bit xmm register (movaps), and reading it back into two gpr registers.
And then the opposite direction.
Regards, /Niels
"NM" == Niels Möller nisse@lysator.liu.se writes:
NM> Thanks for the hint. Maybe I can try that, it sounds like a fairly easy
NM> fix. If I can get the code to run at three instructions per cycle, that
NM> would be a pretty nice speedup on amd processors.
Indeed.
| Whenever possible, use loads and stores of the same data length. (See
| 6.3, "Store-to-Load Forwarding Restrictions" on page 98 for more
| information.)
NM> Not sure how to interpret this. The interesting cases here are:
In the context of saving a 128-bit xmm register and reading the halves into two 64-bit integer registers, I think it means make sure you use the instruction which includes the 0x66 prefix octet (which specifies that the 128 bits are two 64-bit values rather than four 32-bit values).
I don't see a 4x32 version of MOVDQA in the original xmm book, just the 2x64, so it shouldn't be an issue for this application. If there were, you'd want to be sure to use the '66 0F 6F /r' version and not the putative '0F 6F /r' version.
It is more of an issue when dealing with packed floats vs packed doubles. E.g., XORPS and XORPD both do a 128-bit bit-for-bit XOR, but if you use the XORPS version in code otherwise dealing with packed doubles, or vice versa, the pipeline will stall.
There is a similar issue when mixing float or double instructions with non-floating-point loads and stores.
I think that, internally, they use different register files for packed doubles and packed singles. Or, more generally, packed 64-bit-at-a-time vs packed 32-bit-at-a-time. But that is conjecture.
-JimC
On Wed, Jan 2, 2013 at 9:45 PM, Niels Möller nisse@lysator.liu.se wrote:
I think I'd like to make a nettle-2.6 release fairly soon.
I tried to use the new stream algorithm salsa20, but noticed that the variant implemented is mentioned neither in the header nor in the documentation. From the code and the previous discussion in the ML I see that the 20-round variant is there. While this is nice, I do think that having the 12-round variant as well is advantageous since this is the variant accepted in profile 1 of estream (http://www.ecrypt.eu.org/stream/).
regards, Nikos
Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com writes:
> I tried to use the new stream algorithm salsa20, but noticed that the
> variant implemented is mentioned neither in the header nor in the
> documentation.
What's missing, more precisely? E.g., salsa20_crypt is both in the salsa20.h header file and in the Salsa20 section in the manual.
> From the code and the previous discussion in the ML I see that the
> 20-round variant is there.
Right, other variants were postponed for lack of clear use cases.
> While this is nice, I do think that having the 12-round variant as well
> is advantageous since this is the variant accepted in profile 1 of
> estream (http://www.ecrypt.eu.org/stream/).
What name should it use, salsa20_12_crypt? I imagine there are some test vectors somewhere on the ecrypt site?
It should be straightforward to implement on top of _salsa20_core, just like in salsa20-crypt.c. For an all-assembly implementation including the XORing, I guess one would want a _salsa20_crypt with an argument specifying number of rounds.
Regards, /Niels
On 01/03/2013 06:09 PM, Niels Möller wrote:
> What's missing, more precisely? E.g., salsa20_crypt is both in the
> salsa20.h header file and in the Salsa20 section in the manual.
It may be better to refer to it as salsa20/20. Then it would be clear which variant it is.
>> From the code and the previous discussion in the ML I see that the
>> 20-round variant is there.
> Right, other variants were postponed for lack of clear use cases.
As far as I know, Salsa20 isn't standardized in a protocol. However, if that occurs it may be that the salsa20/12 variant is selected because of estream.
> What name should it use, salsa20_12_crypt? I imagine there are some
> test vectors somewhere on the ecrypt site?
No idea.
regards, Nikos