I was trying the nettle-benchmark program and found that it hangs at
startup burning 100% CPU.
Debugging shows this is when measuring benchmark overhead. With a quick
printf of the "ncalls" variable in time_function(), I can see that it
overflows:
time_function ncalls=100 elapsed=0.000010
time_function ncalls=1000 elapsed=0.000003
time_function ncalls=10000 elapsed=0.000002
time_function ncalls=100000 elapsed=0.000002
time_function ncalls=1000000 elapsed=0.000002
time_function ncalls=10000000 elapsed=0.000002
time_function ncalls=100000000 elapsed=0.000002
time_function ncalls=1000000000 elapsed=0.000002
time_function ncalls=1410065408 elapsed=0.000002
time_function ncalls=1215752192 elapsed=0.000002
time_function ncalls=-727379968 elapsed=0.000002
time_function ncalls=1316134912 elapsed=0.000002
time_function ncalls=276447232 elapsed=0.000002
time_function ncalls=-1530494976 elapsed=0.000002
time_function ncalls=1874919424 elapsed=0.000002
time_function ncalls=1569325056 elapsed=0.000002
time_function ncalls=-1486618624 elapsed=0.000002
time_function ncalls=-1981284352 elapsed=0.000002
time_function ncalls=1661992960 elapsed=0.000002
time_function ncalls=-559939584 elapsed=0.000002
time_function ncalls=-1304428544 elapsed=0.000002
time_function ncalls=-159383552 elapsed=0.000002
time_function ncalls=-1593835520 elapsed=0.000002
time_function ncalls=1241513984 elapsed=0.000002
time_function ncalls=-469762048 elapsed=0.000002
time_function ncalls=-402653184 elapsed=0.000002
time_function ncalls=268435456 elapsed=0.000002
time_function ncalls=-1610612736 elapsed=0.000002
time_function ncalls=1073741824 elapsed=0.000002
time_function ncalls=-2147483648 elapsed=0.000002
time_function ncalls=0 elapsed=0.000002
time_function ncalls=0 elapsed=0.000002
time_function ncalls=0 elapsed=0.000002
The elapsed time is the same regardless of ncalls, so I'm thinking that
the compiler has been clever and optimized bench_nothing() into literally
nothing. If I modify it to
static void
bench_nothing(void *arg UNUSED)
{
  static int i = 0;
  i++;
  return;
}
then things work, but of course we're not benchmarking "nothing" anymore.
This is on Fedora 32 with gcc-10.1.1-1.fc32.x86_64
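The wraparound in the log above can be reproduced in isolation. The following is a minimal sketch, not the code from nettle-benchmark: it assumes ncalls is a 32-bit int that is multiplied by 10 on each attempt (which is what the printed values suggest), and the helper names wrap_mul10 and can_grow are hypothetical. The multiplication is done in unsigned arithmetic so that the sketch itself avoids signed-overflow undefined behaviour.

```c
#include <limits.h>
#include <stdint.h>

/* Multiply by 10 with 32-bit two's-complement wraparound.
   Hypothetical helper; models what the log shows, not nettle's code. */
static int32_t
wrap_mul10 (int32_t ncalls)
{
  uint32_t u = (uint32_t) ncalls * 10u;   /* well-defined modulo 2^32 */
  int64_t v = u;
  if (v > INT32_MAX)
    v -= INT64_C (4294967296);            /* map back into the signed range */
  return (int32_t) v;
}

/* Hypothetical guard: true only while another factor of 10 still fits. */
static int
can_grow (int ncalls)
{
  return ncalls > 0 && ncalls <= INT_MAX / 10;
}
```

With this model, wrap_mul10(1000000000) gives 1410065408, matching the log, and the value reaches 0 once enough factors of two have accumulated. A guard along the lines of can_grow() would let a calibration loop bail out instead of spinning once the count can no longer grow.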
Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
I've spent an evening taking a closer look at intel AES performance,
after it was pointed out to me by Torbjörn that the Intel AESNI
instructions are highly pipelined on most or all processors supporting
them. Meaning that the processor can start execution of one or even two
instructions per cycle, provided that they are independent, while it
takes several cycles until the results become available for depending
instructions.
Out-of-order (OoO) execution can help to run things in parallel, even if
the instruction stream locally is a sequence of dependent instructions.
E.g., my main development machine has a broadwell processor, and it can
issue one aesni instruction per cycle, and I think the latency is 7 cycles.
Encrypting one block needs 10 aesni instructions (one per round, and then
there's an eleventh subkey applied with plain xor which can issue in
parallel with another instruction in the same cycle).
When I benchmark, aes128 ECB runs in 10.2 cycles per block, which means
that out-of-order execution is very successful, executing many
iterations of the loop in parallel. CBC encrypt, which is inherently
non-parallel, runs *much* slower, I get 91 cycles per block, where the
latency of just the aes encryption would be 71 cycles (if 7 cycles latency
per aesni instruction is correct, and then one more cycle for the xor
subkey, the remaining 20 cycles for the CBC processing which seems to be
a bit slow in itself).
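The estimate above is simple enough to spell out. This is only a sketch of the arithmetic, using the figures guessed in the message (10 aesni instructions, 7 cycles latency each, 1 cycle for the xor); the function names are hypothetical.

```c
/* Latency in cycles of one fully dependent block encryption.
   All inputs are the assumptions stated in the message, not measurements. */
static unsigned
aes_chain_latency (unsigned aesni_insns, unsigned aesni_latency,
                   unsigned xor_latency)
{
  return aesni_insns * aesni_latency + xor_latency;
}

/* Cycles per block that the AES dependency chain does not account for. */
static unsigned
cbc_overhead (unsigned measured_cycles, unsigned chain_cycles)
{
  return measured_cycles - chain_cycles;
}
```

With these inputs, aes_chain_latency(10, 7, 1) is 71 and cbc_overhead(91, 71) is 20, the two numbers quoted in the paragraph.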
I'm considering rearranging the loops to interleave multiple blocks. See
below code which implements 4-way interleaving (a drop-in replacement
for x86_64/aesni/aes-encrypt-internal). If this approach turns out to be
useful, it might be beneficial to extend to 8-way interleaving: That would
allow instructions to run nicely in parallel even without any
out-of-order-execution.
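A back-of-the-envelope model shows why 8-way interleaving would be enough under the numbers above. It is a rough sketch that assumes a strictly in-order core issuing one aesni instruction per cycle with 7 cycles latency, and it ignores loads, stores and loop overhead; the function names are hypothetical.

```c
/* Cycles an idealized in-order core needs to push `ways` independent
   blocks through `rounds` dependent aesni instructions each, with one
   issue slot per cycle and results available after `latency` cycles.
   Each round costs max(ways, latency) cycles: either the issue slots or
   the dependency chain is the bottleneck. */
static unsigned
group_cycles (unsigned rounds, unsigned latency, unsigned ways)
{
  unsigned per_round = ways > latency ? ways : latency;
  return rounds * per_round;
}

/* Same figure, expressed per block. */
static double
cycles_per_block (unsigned rounds, unsigned latency, unsigned ways)
{
  return (double) group_cycles (rounds, latency, ways) / ways;
}
```

Under these assumptions the model gives 70 cycles per block without interleaving, 17.5 with 4-way and 10 with 8-way. Interleaving at least as many blocks as the instruction latency reaches the issue limit even without out-of-order execution.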
However, the interleaved code makes no change to performance in my
benchmarks on my machine, since the old code is already close to 10
cycles / block, which is a hard limit from instruction issue of the
aesni instructions, and it seems almost all other instructions are
already executing in parallel with them.
But it might be an improvement on other processors, or for applications
that process a small number of blocks at a time (in the middle between
CBC which always does one block at a time, and the ECB benchmark which
does 10 KB, or 640 blocks, at a time), by making things easier for the
processor's OoO-machinery.
If you have any application benchmarks it would be interesting if you
could try out the interleaved version. I'm also thinking that maybe we
should add benchmarks for various message sizes?
Regards,
/Niels
C x86_64/aesni/aes-encrypt-internal.asm
ifelse(<
Copyright (C) 2015, 2018 Niels Möller
This file is part of GNU Nettle.
GNU Nettle is free software: you can redistribute it and/or
modify it under the terms of either:
* the GNU Lesser General Public License as published by the Free
Software Foundation; either version 3 of the License, or (at your
option) any later version.
or
* the GNU General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your
option) any later version.
or both in parallel, as here.
GNU Nettle is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.
You should have received copies of the GNU General Public License and
the GNU Lesser General Public License along with this program. If
not, see http://www.gnu.org/licenses/.
>)
C Input argument
define(<ROUNDS>, <%rdi>)
define(<KEYS>, <%rsi>)
C define(<TABLE>, <%rdx>) C Unused here
define(<LENGTH>,<%rcx>)
define(<DST>, <%r8>)
define(<SRC>, <%r9>)
define(<CNT>, <%r10>)
define(<TMP>, <%r10>) C Can overlap CNT
define(<TAB>, <%rdx>)
define(<SUBKEY>, <%xmm0>)
define(<B0>, <%xmm1>)
define(<B1>, <%xmm2>)
define(<B2>, <%xmm3>)
define(<B3>, <%xmm4>)
.file "aes-encrypt-internal.asm"
C _aes_encrypt(unsigned rounds, const uint32_t *keys,
C const struct aes_table *T,
C size_t length, uint8_t *dst,
C uint8_t *src)
.text
ALIGN(16)
PROLOGUE(_nettle_aes_encrypt)
W64_ENTRY(6, 5)
shr $4, LENGTH
test LENGTH, LENGTH
jz .Lend
C Each round uses one subkey, i.e., 16 bytes. We have
C an initial xor round, rounds-1 regular rounds, and one final
C round. Adjust so that ROUNDS reflects the number of regular
C rounds, KEYS points to the key for the final round, and KEYS
C + ROUNDS points to the key for the first regular round.
dec XREG(ROUNDS)
shl $4, XREG(ROUNDS)
lea 16(ROUNDS, KEYS), KEYS
neg ROUNDS
jmp .Loop_end
.Lloop_4w:
mov ROUNDS, CNT
movups -16(KEYS, ROUNDS), SUBKEY
movups (SRC), B0
movups 16(SRC), B1
movups 32(SRC), B2
movups 48(SRC), B3
pxor SUBKEY, B0
pxor SUBKEY, B1
pxor SUBKEY, B2
pxor SUBKEY, B3
.Lround_loop_4w:
movups (KEYS, CNT), SUBKEY
add $16, CNT
aesenc SUBKEY, B0
aesenc SUBKEY, B1
aesenc SUBKEY, B2
aesenc SUBKEY, B3
jne .Lround_loop_4w
movups (KEYS), SUBKEY
aesenclast SUBKEY, B0
aesenclast SUBKEY, B1
aesenclast SUBKEY, B2
aesenclast SUBKEY, B3
movups B0, (DST)
movups B1, 16(DST)
movups B2, 32(DST)
movups B3, 48(DST)
add $64, SRC
add $64, DST
sub $4, LENGTH
.Loop_end:
cmp $4, LENGTH
jnc .Lloop_4w
lea .Ljmptab(%rip), TAB
movslq (TAB, LENGTH, 4), TMP
lea (TAB, TMP), TMP
jmp *TMP
.Ltail3:
mov ROUNDS, CNT
movups -16(KEYS, ROUNDS), SUBKEY
movups (SRC), B0
movups 16(SRC), B1
movups 32(SRC), B2
pxor SUBKEY, B0
pxor SUBKEY, B1
pxor SUBKEY, B2
.Lround_loop_3w:
movups (KEYS, CNT), SUBKEY
add $16, CNT
aesenc SUBKEY, B0
aesenc SUBKEY, B1
aesenc SUBKEY, B2
jne .Lround_loop_3w
movups (KEYS), SUBKEY
aesenclast SUBKEY, B0
aesenclast SUBKEY, B1
aesenclast SUBKEY, B2
movups B0, (DST)
movups B1, 16(DST)
movups B2, 32(DST)
jmp .Lend
.Ltail2:
mov ROUNDS, CNT
movups -16(KEYS, ROUNDS), SUBKEY
movups (SRC), B0
movups 16(SRC), B1
pxor SUBKEY, B0
pxor SUBKEY, B1
.Lround_loop_2w:
movups (KEYS, CNT), SUBKEY
add $16, CNT
aesenc SUBKEY, B0
aesenc SUBKEY, B1
jne .Lround_loop_2w
movups (KEYS), SUBKEY
aesenclast SUBKEY, B0
aesenclast SUBKEY, B1
movups B0, (DST)
movups B1, 16(DST)
jmp .Lend
.Ltail1:
mov ROUNDS, CNT
movups -16(KEYS, ROUNDS), SUBKEY
movups (SRC), B0
pxor SUBKEY, B0
.Lround_loop_1w:
movups (KEYS, CNT), SUBKEY
add $16, CNT
aesenc SUBKEY, B0
jne .Lround_loop_1w
movups (KEYS), SUBKEY
aesenclast SUBKEY, B0
movups B0, (DST)
.Lend:
W64_EXIT(6, 5)
ret
EPILOGUE(_nettle_aes_encrypt)
C FIXME: Put in rodata?
ALIGN(4)
.Ljmptab:
.long .Lend - .Ljmptab
.long .Ltail1 - .Ljmptab
.long .Ltail2 - .Ljmptab
.long .Ltail3 - .Ljmptab
--
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
Hi all,
Here are a couple of small patches for nettle-benchmark:
- removes the deprecated OpenSSL hash API
- adds more OpenSSL sha2 hashes into the mix
Kindly merge or let me know of your concerns :-)
Related question:
As you can see from the second patch, nettle performance is a little low
wrt OpenSSL - ~55% for sha1, and ~65% for sha2.
Is that normal, or is there something off with my system/build?
I'm using the default "--enable-assembler" and "-O2" as seen in the
configure.ac, plus my processor lacks the SHA_NI ISA.
Thanks
Emil
P.S. More misc patches coming shortly, so stay tuned :-P
Emil Velikov (2):
examples: don't use deprecated OpenSSL hashing API
external: add more openssl sha2 digests to the benchmark
examples/nettle-benchmark.c | 6 +-
examples/nettle-openssl.c | 112 ++++++++++++++----------------------
nettle-internal.h | 3 +
3 files changed, 50 insertions(+), 71 deletions(-)
--
2.25.1
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256
I'm happy to announce a new release of GNU Nettle, a low-level
cryptographic library. This version includes several new features, and
a couple of bug fixes, see NEWS entries below.
The Nettle home page can be found at
https://www.lysator.liu.se/~nisse/nettle/, and the manual at
https://www.lysator.liu.se/~nisse/nettle/nettle.html.
The release can be downloaded from
https://ftp.gnu.org/gnu/nettle/nettle-3.6.tar.gz
ftp://ftp.gnu.org/gnu/nettle/nettle-3.6.tar.gz
https://www.lysator.liu.se/~nisse/archive/nettle-3.6.tar.gz
Happy hacking,
/Niels Möller
NEWS for the Nettle 3.6 release
This release adds a couple of new features, most notable being
support for ED448 signatures.
It is not binary compatible with earlier releases. The shared
library names are libnettle.so.8.0 and libhogweed.so.6.0, with
sonames libnettle.so.8 and libhogweed.so.6. The changed
sonames are mainly to avoid upgrade problems with recent
GnuTLS versions, that depend on Nettle internals outside of
the advertised ABI. But also because of the removal of
internal poly1305 functions which were undocumented but
declared in an installed header file, see Interface changes
below.
New features:
* Support for Curve448 and ED448 signatures. Contributed by
Daiki Ueno.
* Support for SHAKE256 (SHA3 variant with arbitrary output
size). Contributed by Daiki Ueno.
* Support for SIV-CMAC (Synthetic Initialization Vector) mode,
contributed by Nikos Mavrogiannopoulos.
* Support for CMAC64, contributed by Dmitry Baryshkov.
* Support for the "CryptoPro" variant of the GOST hash
function, as gosthash94cp. Contributed by Dmitry Baryshkov.
* Support for GOST DSA signatures, including GOST curves
gc256b and gc512a. Contributed by Dmitry Baryshkov.
* Support for Intel CET in x86 and x86_64 assembly files, if
enabled via CFLAGS (gcc -fcf-protection=full). Contributed
by H.J. Lu and Simo Sorce.
* A few new functions to improve support for the Chacha
variant with 96-bit nonce and 32-bit block counter (the
existing functions use nonce and counter of 64-bit each),
and functions to set the counter. Contributed by Daiki Ueno.
* New interface, struct nettle_mac, for MAC (message
authentication code) algorithms. This abstraction is only
for MACs that don't require a per-message nonce. For HMAC,
the key size is fixed, and equal to the digest size of the
underlying hash function.
Bug fixes:
* Fix bug in cfb8_decrypt. Previously, the IV was not updated
correctly in the case of input data shorter than the block
size. Reported by Stephan Mueller, fixed by Daiki Ueno.
* Fix configure check for __builtin_bswap64, the incorrect
check would result in link errors on platforms missing this
function. Patch contributed by George Koehler.
* All uses of old-fashioned suffix rules in the Makefiles have
been replaced with %-pattern rules. Nettle's use of suffix
rules in earlier versions depended on undocumented GNU make
behavior, which is being deprecated in GNU make 4.3.
Building with other make programs than GNU make is untested
and unsupported. (Building with BSD make or Solaris make
used to work years ago, but has not been tested recently).
Interface changes:
* Declarations of internal poly1305 functions have been
removed from the header file poly1305.h, to make it clear
that they are not part of the advertised API or ABI.
Miscellaneous:
* Building the public key support of nettle now requires GMP
version 6.1.0 or later (unless --enable-mini-gmp is used).
* A fair amount of changes to ECC internals, with a few
deleted and a few new fields in the internal struct
ecc_curve. Files and functions have been renamed to more
consistently match the curve name, e.g., ecc-256.c has been
renamed to ecc-secp256r1.c.
* Documentation for chacha-poly1305 updated. It is no longer
experimental. The implementation was updated to follow RFC
8439 in Nettle-3.1, but that was not documented or announced
at the time.
- --
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
-----BEGIN PGP SIGNATURE-----
iQEzBAEBCAAdFiEEy0li0HDXfX/Li6Nicdjx/zaMZncFAl6p3hsACgkQcdjx/zaM
ZneC5gf7BZuz13jnIzETuRCtqwcV8BaFZOhBrDmqPxHeCVL2BVZwUxVpIVZAhqKu
ngj5i4GEQBHLg5BRJk/97gyn4YCbWfr7397tqBdUWO2VWFKaG+5QGCG3pjjxyjgm
hECNrRpSLHHVzUFi2bLCo4Ur+R2d52I1l+hI7CekTxAk1c01xhpobs0pSUDUCfco
/c8gNbbrNZc/KxUq1qtaWucxvysa4BsfnqucnhjAftMrmishFdr282gWNrnK3q9K
kHIxCL01bYIQVQmYdH0VglGtq7rYCkL870Ip21OOaL+LIHm1FMaDpXHbXi/GkGqK
Ukre//RxgMbwPMsM7eh5rp7pOAqdug==
=QvUR
-----END PGP SIGNATURE-----
I've made another tarball,
http://www.lysator.liu.se/~nisse/archive/nettle-3.6rc2.tar.gz
http://www.lysator.liu.se/~nisse/archive/nettle-3.6rc1.tar.gz.sig
Changes since rc1:
* Sonames of *both* libnettle and libhogweed are updated.
* Merged gost_vko.
* Other api/abi cleanups, affecting gosthash and poly1305.
* Deleted the test asserts that failed when linking
hogweed.dll in wine (and presumably on windows too).
* Some more Makefile and test cleanups, deleted the .testrules.make
file (replaced with %-pattern rules in the main
testsuite/Makefile.in), deleted workaround with extra dll symlinks,
and instead set WINEPATH.
* NEWS edits.
--
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
I have a question to the list, if anyone here knows details of how
windows dlls work.
I cross compile for windows with
./configure --host=x86_64-w64-mingw32 --enable-mini-gmp CXX=/bin/false
make
and run tests with
make check EMULATOR=wine64
It fails the ecc-dup and ecc-add tests, in this and similar asserts:
ASSERT (ecc->dup == ecc_dup_jj);
Here, the right hand side is a symbol from libhogweed-x.dll, and on the
left hand side, ecc refers to a constant struct in the same dll.
My guess is that dynamic linking on windows doesn't provide function
pointer comparisons as specified by the C standard. Maybe the left-hand
side is the real function entry point, and the right-hand side is the
address of some glue code related to dynamic linking. Is that right? If
so, I can just disable this assert for windows, but I'd like to
understand what's going on.
I think the reason it works with ELF, is that ELF dynamic linking tries
harder to make all references point to the same PLT glue code, but I
don't fully understand the details there either.
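For reference, the comparison that the assert relies on can be shown in a single translation unit, where the C standard guarantees that pointers to the same function compare equal. This is only a sketch with hypothetical names (demo_ecc and demo_dup_jj stand in for the internal struct ecc_curve and ecc_dup_jj); it does not reproduce the DLL case, where the struct and the test program live in different modules.

```c
/* Hypothetical stand-in for the internal struct ecc_curve. */
struct demo_ecc
{
  void (*dup) (int *r, const int *p);
};

/* Hypothetical stand-in for ecc_dup_jj. */
static void
demo_dup_jj (int *r, const int *p)
{
  *r = 2 * *p;
}

/* Constant struct initialized with the function's address,
   like the curve structs inside the library. */
static const struct demo_ecc demo_secp = { demo_dup_jj };

/* The kind of check the failing test makes. */
static int
uses_dup_jj (const struct demo_ecc *ecc)
{
  return ecc->dup == demo_dup_jj;
}
```

Within one module uses_dup_jj(&demo_secp) is true. If the guess above is right, the same check can fail across a DLL boundary when the two sides resolve the function name to different addresses.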
Somewhat related to this: The ecc pointer is taken from this array in
testutils.c:
const struct ecc_curve * const ecc_curves[] = {
  &_nettle_secp_192r1,
  &_nettle_secp_224r1,
  &_nettle_secp_256r1,
  &_nettle_secp_384r1,
  &_nettle_secp_521r1,
  &_nettle_curve25519,
  &_nettle_curve448,
  &_nettle_gost_gc256b,
  &_nettle_gost_gc512a,
  NULL
};
The entries are references to data objects in the dll. I'm a bit
surprised that works at all, is it expected to work? I had the
impression that dlls only exported functions. Should the test code be
changed to use the advertised getter functions, nettle_get_secp_256r1
and friends?
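As a sketch of what using getters could look like: a constant array initializer cannot call functions, so the table would hold the getter functions themselves and call them at run time. Everything below is hypothetical (demo_curve, the demo_get_* functions and the bit sizes are made up for illustration, and are not the nettle declarations).

```c
#include <stddef.h>

/* Hypothetical curve type; the real struct ecc_curve is internal. */
struct demo_curve
{
  unsigned bit_size;
};

static const struct demo_curve demo_secp_256r1 = { 256 };
static const struct demo_curve demo_secp_384r1 = { 384 };

/* Getter functions in the style of nettle_get_secp_256r1. */
static const struct demo_curve *
demo_get_secp_256r1 (void)
{
  return &demo_secp_256r1;
}

static const struct demo_curve *
demo_get_secp_384r1 (void)
{
  return &demo_secp_384r1;
}

typedef const struct demo_curve *(*demo_getter) (void);

/* Table of getters instead of a table of data addresses. */
static const demo_getter demo_getters[] = {
  demo_get_secp_256r1,
  demo_get_secp_384r1,
  NULL
};
```

A test loop would then call demo_getters[i]() for each non-NULL entry, so only function symbols need to be resolved across the library boundary.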
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.