Jeffrey Walton <noloader(a)gmail.com> writes:
> Looks good on a Celeron J3455, which is a [low-end] Goldmont machine
> with the instructions:
[...]
> goldmont:nettle$ LD_LIBRARY_PATH=.lib:/usr/local/lib64/ ./examples/nettle-benchmark
> sha1_compress: 84.60 cycles
85 cycles is a lot less than the 136 cycles I observed in my testing.
The function is 131 instructions long, so that's approximately 1.5
instructions per cycle.
> sha1 update 1194.33
> openssl sha1 update 1321.71
And this is an 11% difference (compared to 8% in my benchmarks). That
makes sense: if the main crunching takes fewer cycles, then the per-block
function call overhead is relatively larger.
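The arithmetic behind these figures can be checked with a short sketch. It uses only numbers quoted in this thread (131 instructions, 84.60 cycles, the 64-byte SHA-1 block, and the two "sha1 update" throughput lines); the helper names are hypothetical and are not part of nettle.

```python
# Figures quoted in this thread; the helper names are made up for illustration.
SHA1_BLOCK_BYTES = 64  # sha1_compress handles one 64-byte block per call


def instructions_per_cycle(instructions, cycles):
    """Average number of instructions retired per cycle."""
    return instructions / cycles


def cycles_per_byte(cycles_per_block, block_bytes=SHA1_BLOCK_BYTES):
    """Cycles spent in the compression function per input byte."""
    return cycles_per_block / block_bytes


def percent_faster(fast, slow):
    """How much faster `fast` is than `slow`, as a percentage of `slow`."""
    return (fast / slow - 1.0) * 100.0


if __name__ == "__main__":
    print(f"{instructions_per_cycle(131, 84.60):.2f} instructions/cycle")
    print(f"{cycles_per_byte(84.60):.2f} cycles/byte in sha1_compress")
    print(f"openssl is {percent_faster(1321.71, 1194.33):.1f}% faster on sha1 update")
```

Note that the cycles/byte value covers the compression function only; the "update" throughput also includes the per-block call overhead discussed above.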
> A small suggestion may be to update Section 8 Installation
> (https://www.lysator.liu.se/~nisse/nettle/nettle.html). It was not
> obvious to me how to enable the hardware acceleration.
There's an --enable-x86-aesni configure option, which should enable the
aesni code unconditionally in non-fat builds, and an --enable-arm-neon.
It seems I forgot to add a corresponding --enable-x86-sha-ni.
However, --enable-fat is the most common way to enable the support. I'm
considering enabling it by default in the next release.
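Since a fat build picks an implementation at run time based on what the CPU supports, it can help to first check which features the machine reports. Below is a minimal sketch, assuming a Linux system where /proc/cpuinfo lists the feature flags as "aes" and "sha_ni"; the function names are hypothetical and nothing here is part of nettle.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def report(cpuinfo_text, wanted=("aes", "sha_ni")):
    """Map each wanted flag name to whether the CPU reports it."""
    flags = cpu_flags(cpuinfo_text)
    return {name: name in flags for name in wanted}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not Linux, or /proc unavailable: everything reports "no"
    for name, present in report(text).items():
        print(f"{name}: {'yes' if present else 'no'}")
```

If both flags show up and the library was configured with --enable-fat, the accelerated code paths should be eligible for selection at run time.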
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
Forwarded to the list.
---------- Forwarded message ----------
From: Jeffrey Walton <noloader(a)gmail.com>
To: "Niels Möller" <nisse(a)lysator.liu.se>
Cc: nettle-bugs(a)lists.lysator.liu.se
Bcc:
Date: Thu, 8 Feb 2018 16:34:43 -0500
Subject: Re: x86 sha_ni
On Thu, Feb 8, 2018 at 12:18 PM, Niels Möller <nisse(a)lysator.liu.se> wrote:
> nisse(a)lysator.liu.se (Niels Möller) writes:
>
>> Below replacement for sha1-compress.asm seems to run on roughly 2
>> cycles/byte when I benchmark it on an "AMD Ryzen 7 1700X" cpu in the gcc
>> compile farm. Still slightly slower than openssl; to squeeze out a few
>> more cycles, it might help to change the sha1_compress interface to let
>> it process more than one 64-byte block at a time.
>>
>> I hope to be able to wire it up via fat-x86_64.c reasonably soon. In the
>> meantime, if anyone wants to try it out, just change the
>> sha1-compress.asm symlink to point to this file.
>
> Enabled via fat-x86_64 now, and pushed to a branch named
> x86_64-sha_ni-sha1.
Looks good on a Celeron J3455, which is a [low-end] Goldmont machine
with the instructions:
goldmont:nettle$ autoreconf -f -i
...
goldmont:nettle$ ./configure --enable-fat
...
goldmont:nettle$ make && make check
...
goldmont:nettle$ LD_LIBRARY_PATH=.lib:/usr/local/lib64/ ./examples/nettle-benchmark
sha1_compress: 84.60 cycles
salsa20_core: 282.80 cycles
sha3_permute: 1542.60 cycles (64.27 / round)
benchmark call overhead: 0.001604 us
Algorithm       mode     Mbyte/s
...
md2             update      6.90
md4             update    568.11
md5             update    384.08
openssl md5     update    443.76
sha1            update   1194.33
openssl sha1    update   1321.71
sha224          update    110.31
sha256          update    110.10
sha384          update    174.32
sha512          update    173.99
sha512-224      update    174.35
sha512-256      update    174.16
sha3_224        update    136.77
sha3_256        update    129.46
sha3_384        update     99.23
sha3_512        update     69.25
ripemd160       update    161.00
gosthash94      update     39.48
umac32          update   6560.05
umac64          update   3130.26
umac96          update   2457.21
umac128         update   1936.56
poly1305-aes    update    914.79
...
A small suggestion may be to update Section 8 Installation
(https://www.lysator.liu.se/~nisse/nettle/nettle.html). It was not
obvious to me how to enable the hardware acceleration. A quick
sentence on how to enable AES-NI and SHA would make it obvious for
future readers. (Thanks for the offline help).
Jeff
--
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
I wrote a crude/simple test program to compare the performance of
AES-128-CBC across openssl, gcrypt, nettle and gnutls, and was
surprised to find that nettle is consistently ~25% slower than
the other libraries for its AESNI implementation.
On my Core i7-6820HQ I get
nettle: 850 MB/s
gcrypt: 1172 MB/s
gnutls: 1230 MB/s
openssl: 1153 MB/s
with versions
nettle-3.3-2.fc26.x86_64
libgcrypt-1.7.8-1.fc26.x86_64
gnutls-3.5.14-1.fc26.x86_64
openssl-1.1.0f-7.fc26.x86_64
And on Xeon E5-2609 I get
nettle: 325 MB/s
gcrypt: 403 MB/s
gnutls: 414 MB/s
openssl: 414 MB/s
with versions
nettle-3.3-1.fc25.x86_64
libgcrypt-1.7.8-1.fc25.x86_64
gnutls-3.5.14-1.fc25.x86_64
openssl-1.0.2k-1.fc25.x86_64
Naively I would have expected them all to be pretty much equal given that
they're delegating to the same hardware routines. Has anyone else done
comparative benchmarks of nettle's implementation against others and seen
the same kind of results? I'll attach my test program to this mail, so if I made
a mistake in usage there feel free to point it out.
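To put a number on the gap, the throughput figures above can be turned into a per-library slowdown. This is a small sketch using only the MB/s values quoted in this mail; the dictionary and function names are hypothetical.

```python
# AES-128-CBC throughput in MB/s, as quoted above.
I7_6820HQ = {"nettle": 850, "gcrypt": 1172, "gnutls": 1230, "openssl": 1153}
XEON_E5_2609 = {"nettle": 325, "gcrypt": 403, "gnutls": 414, "openssl": 414}


def percent_slower(results, library="nettle"):
    """For each other library, how much slower `library` is, in percent."""
    base = results[library]
    return {
        name: (1.0 - base / rate) * 100.0
        for name, rate in results.items()
        if name != library
    }


if __name__ == "__main__":
    for label, results in (("Core i7-6820HQ", I7_6820HQ),
                           ("Xeon E5-2609", XEON_E5_2609)):
        print(label)
        for name, pct in percent_slower(results).items():
            print(f"  nettle is {pct:.1f}% slower than {name}")
```

On these figures the gap is roughly 26 to 31 percent on the Core i7 and 19 to 22 percent on the Xeon.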
FWIW, I also found there is some weird interaction between nettle and
glibc-2.23. If I have that glibc version and run with NETTLE_FAT_VERBOSE=1
it claims it is picking the AESNI impl, but the performance figures clearly
show it is actually running the pure software impl because they're 100 MB/s
instead of 325 MB/s. I upgraded to glibc 2.24 and this weirdness went away,
so I've not investigated that further.
Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|