I wrote a crude/simple test program to compare the performance of AES-128-CBC across openssl, gcrypt, nettle and gnutls, and was surprised to find that nettle is consistently ~25% slower than the other libraries for its AESNI implementation.
On my Core i7-6820HQ I get
  nettle:  850 MB/s
  gcrypt:  1172 MB/s
  gnutls:  1230 MB/s
  openssl: 1153 MB/s
with versions
  nettle-3.3-2.fc26.x86_64
  libgcrypt-1.7.8-1.fc26.x86_64
  gnutls-3.5.14-1.fc26.x86_64
  openssl-1.1.0f-7.fc26.x86_64
And on Xeon E5-2609 I get
  nettle:  325 MB/s
  gcrypt:  403 MB/s
  gnutls:  414 MB/s
  openssl: 414 MB/s
with versions
  nettle-3.3-1.fc25.x86_64
  libgcrypt-1.7.8-1.fc25.x86_64
  gnutls-3.5.14-1.fc25.x86_64
  openssl-1.0.2k-1.fc25.x86_64
Naively I would have expected them all to be pretty much equal, given that they're delegating to the same hardware routines. Has anyone else done comparative benchmarks of nettle's impl against others and seen the same kind of results? I'll attach my test program to this mail, so if I made a mistake in usage there, feel free to point it out.
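For reference, the nettle side of the measurement boils down to roughly the following (a minimal sketch rather than the attached program; the key/IV values, buffer size and iteration count are arbitrary, and warm-up/averaging is omitted):

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>
  #include <nettle/aes.h>
  #include <nettle/cbc.h>

  #define BUF_SIZE (64 * 1024)
  #define ITERATIONS 10000

  int
  main(void)
  {
    struct aes128_ctx ctx;
    uint8_t key[AES128_KEY_SIZE] = {0};
    uint8_t iv[AES_BLOCK_SIZE] = {0};
    uint8_t *buf = calloc(1, BUF_SIZE);
    struct timespec t0, t1;
    double secs;
    int i;

    aes128_set_encrypt_key(&ctx, key);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < ITERATIONS; i++)
      /* Encrypt in place; the aes function is cast to the generic
         nettle_cipher_func type expected by cbc_encrypt. */
      cbc_encrypt(&ctx, (nettle_cipher_func *) aes128_encrypt,
                  AES_BLOCK_SIZE, iv, BUF_SIZE, buf, buf);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("nettle: %.0f MB/s\n",
           (double) BUF_SIZE * ITERATIONS / (1024 * 1024) / secs);

    free(buf);
    return 0;
  }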
FWIW, I also found there is some weird interaction between nettle and glibc-2.23. If I have that glibc version and run with NETTLE_FAT_VERBOSE=1, it claims it is picking the AESNI impl, but the performance figures clearly show it is actually running the pure software impl, because they're 100 MB/s instead of 325 MB/s. I upgraded to glibc 2.24 and this weirdness went away, so I've not investigated it further.
Regards, Daniel
"Daniel P. Berrange" berrange@redhat.com writes:
Naively I would have expected them all to be pretty much equal given that they're delegating to the same hardware routines.
This is not completely unexpected.
* Nettle's AESNI assembly routines were written for simplicity and small code size, without putting a lot of effort into it. They could probably be sped up by some unrolling or more careful instruction scheduling. Patches welcome (but we shouldn't use excessive unrolling unless there's a significant speedup).
* Nettle's AES-CBC uses general CBC functions invoking the AES encrypt and decrypt functions. In particular for CBC *en*crypt, this adds significant overhead for function calls, and the memxor function will examine src/dst alignment once per block. CBC *de*crypt is usually a bit faster, since we can then decrypt more than one block at a time.
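For illustration, the generic CBC encrypt loop looks roughly like this (a simplified sketch of what nettle's cbc.c does, not the exact source):

  #include <assert.h>
  #include <string.h>
  #include <nettle/memxor.h>
  #include <nettle/nettle-types.h>

  /* Simplified sketch: for every block there is one memxor (which
     examines src/dst alignment on each call), one indirect call to the
     cipher function, and one memcpy to update the chaining value. */
  void
  cbc_encrypt(const void *ctx, nettle_cipher_func *f,
              size_t block_size, uint8_t *iv,
              size_t length, uint8_t *dst, const uint8_t *src)
  {
    assert(!(length % block_size));

    for (; length; length -= block_size, src += block_size, dst += block_size)
      {
        memxor(iv, src, block_size);   /* iv ^= plaintext block */
        f(ctx, block_size, dst, iv);   /* encrypt the chained block */
        memcpy(iv, dst, block_size);   /* next chaining value = ciphertext */
      }
  }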
comparative benchmarks of nettle's impl against others & seen the same kind of results ?
You can also try the ./examples/nettle-benchmark program; if openssl was found at configure time, it includes benchmarks of some openssl functions for comparison.
FWIW, I also found there is some weird interaction between nettle and glibc-2.23. If I have that glibc version and run with NETTLE_FAT_VERBOSE=1 it claims it is picking the AESNI impl, but the performance figures clearly show it is actually running the pure software impl because they're 100 MB/s instead of 325 MB/s.
Odd. Nettle-3.1 used glibc's IFUNC feature, but that was disabled in later versions due to problems with the order in which the resolver functions were called.
Regards, /Niels
On Wed, Aug 02, 2017 at 04:25:42PM +0200, Niels Möller wrote:
"Daniel P. Berrange" berrange@redhat.com writes:
Naively I would have expected them all to be pretty much equal given that they're delegating to the same hardware routines.
This is not completely unexpected.
- Nettle's AESNI assembly routines were written for simplicity and small code size, without putting a lot of effort into it. They could probably be sped up by some unrolling or more careful instruction scheduling. Patches welcome (but we shouldn't use excessive unrolling unless there's a significant speedup).
Unfortunately I don't have any useful expertise in asm code, so I won't be able to provide any patches in this area.
- Nettle's AES-CBC uses general CBC functions invoking the AES encrypt and decrypt functions. In particular for CBC *en*crypt, this adds significant overhead for function calls, and the memxor function will examine src/dst alignment once per block. CBC *de*crypt is usually a bit faster, since we can then decrypt more than one block at a time.
comparative benchmarks of nettle's impl against others & seen the same kind of results ?
You can also try the ./examples/nettle-benchmark program; if openssl was found at configure time, it includes benchmarks of some openssl functions for comparison.
FYI, that benchmark program is somewhat misleading, because it directly uses the openssl AES APIs, which always go to the generic software version, and thus make openssl look real slow by comparison. To exercise the AESNI impls in openssl, it would need to be rewritten to use the openssl EVP APIs which dynamically choose the best impl.
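For illustration, the EVP-based equivalent would look roughly like this (a sketch only; error checking and the timing loop are omitted):

  #include <openssl/evp.h>

  /* Sketch: AES-128-CBC encryption through the EVP layer, which picks
     the AES-NI code path at runtime when the CPU supports it. */
  static void
  evp_aes128_cbc_encrypt(const unsigned char *key, const unsigned char *iv,
                         const unsigned char *src, int length,
                         unsigned char *dst)
  {
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    int outl;

    EVP_EncryptInit_ex(ctx, EVP_aes_128_cbc(), NULL, key, iv);
    EVP_CIPHER_CTX_set_padding(ctx, 0);   /* benchmark whole blocks only */
    EVP_EncryptUpdate(ctx, dst, &outl, src, length);
    EVP_EncryptFinal_ex(ctx, dst + outl, &outl);
    EVP_CIPHER_CTX_free(ctx);
  }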
Regards, Daniel
"Daniel P. Berrange" berrange@redhat.com writes:
FYI, that benchmark program is somewhat misleading, because it directly uses the openssl AES APIs, which always go to the generic software version, and thus make openssl look real slow by comparison. To exercise the AESNI impls in openssl, it would need to be rewritten to use the openssl EVP APIs which dynamically choose the best impl.
Thanks, I wasn't aware of that. (It's more than a decade since I last wrote any real code using ssleay/openssl).
Would you like to help fix that?
Regards, /Niels
On Wed, Aug 02, 2017 at 05:30:48PM +0200, Niels Möller wrote:
"Daniel P. Berrange" berrange@redhat.com writes:
FYI, that benchmark program is somewhat misleading, because it directly uses the openssl AES APIs, which always go to the generic software version, and thus make openssl look real slow by comparison. To exercise the AESNI impls in openssl, it would need to be rewritten to use the openssl EVP APIs which dynamically choose the best impl.
Thanks, I wasn't aware of that. (It's more than a decade since I last wrote any real code using ssleay/openssl).
I only learnt it when I was investigating why it reported such different results than the 'openssl speed' command :-)
Would you like to help fix that?
Sure, I can look at providing a patch
Regards, Daniel
"Daniel P. Berrange" berrange@redhat.com writes:
I wrote a crude/simple test program to compare the performance of AES-128-CBC across openssl, gcrypt, nettle and gnutls, and was surprised to find that nettle is consistently ~25% slower than the other libraries for its AESNI implementation.
I've now pushed new aesni code to the master-updates branch. It reads all subkeys into registers upfront, and unrolls the round loop. This brings a great speedup when calling the aes functions with many blocks at a time, but little difference when doing only one block at a time. Results for aes128, when benchmarking on my machine (Intel Broadwell):
ECB encrypt and decrypt: About 90% speedup, from 1.25 cycles/byte to 0.65, about the same as openssl, or even *slightly* faster.
CBC encrypt: No significant change, about 5.7 cycles/byte.
CBC decrypt: About 60% speedup, from 1.5 cycles/byte down to 0.93.
CTR mode: No significant change, about 2.5 cycles/byte.
I think it's reasonable to speed up CTR mode by passing more blocks per call to the encryption function (currently it does 4 blocks at a time), and maybe by some more efficient routine to generate the counter input.
To improve CBC would need some structural and possibly ugly changes.
For now, I don't have separate assembly functions for aes128, aes192 and aes256, and I've tried to organize it so that aes128 gets the least penalty for this generality. See https://git.lysator.liu.se/nettle/nettle/blob/master-updates/x86_64/aesni/ae...
I wonder if there are any chips that can execute two independent aesenc instructions in parallel? If so, it would be pretty straightforward to do two blocks at a time in parallel, doubling the speed for aes128 and aes192 (for aes256, we don't have enough registers for all 15 subkeys and two blocks of data).
Regards, /Niels
On Wed, Jan 3, 2018 at 7:36 PM, Niels Möller nisse@lysator.liu.se wrote:
"Daniel P. Berrange" berrange@redhat.com writes:
I wrote a crude/simple test program to compare the performance of AES-128-CBC across openssl, gcrypt, nettle and gnutls, and was surprised to find that nettle is consistently ~25% slower than the other libraries for its AESNI implementation.
I've now pushed new aesni code to the master-updates branch. It reads all subkeys into registers upfront, and unrolls the round loop. This brings a great speedup when calling the aes functions with many blocks at a time, but little difference when doing only one block at a time. Results for aes128, when benchmarking on my machine (Intel Broadwell):
ECB encrypt and decrypt: About 90% speedup, from 1.25 cycles/byte to 0.65, about the same as openssl, or even *slightly* faster.
That's great news.
CBC encrypt: No significant change, about 5.7 cycles/byte.
CBC decrypt: About 60% speedup, from 1.5 cycles/byte down to 0.93.
CTR mode: No significant change, about 2.5 cycles/byte.
I think it's reasonable to speed up CTR mode by passing more blocks per call to the encryption function (currently it does 4 blocks at a time), and maybe by some more efficient routine to generate the counter input.
To improve CBC would need some structural and possibly ugly changes.
If I had to choose between optimizing one of the two, I'd say CTR. All the modern AEAD modes (GCM, CCM) use CTR, while CBC is only used as a legacy, backwards-compatible mode.
regards, Nikos
Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com writes:
If I had to choose between optimizing one of the two, I'd say CTR.
I agree CTR seems more important. I'm guessing that the loop
  for (p = dst, left = length;
       left >= block_size;
       left -= block_size, p += block_size)
    {
      memcpy (p, ctr, block_size);
      INCREMENT(block_size, ctr);
    }
in ctr_crypt contributes quite a few cycles per byte. It would be faster to use an always word-aligned area, and do the copying and incrementing using word operations (and a final byteswap when running on a little-endian platform), with no intermediate stores.
Would be a pretty simple routine (maybe we don't even need to go to assembly) if we require that the block size is a multiple of sizeof(unsigned long), and even simpler if we restrict to block size 16. But it gets uglier and less efficient if it needs to support the general case.
Maybe we could have a special case for blocksize 16, and accept that unusual blocksizes will be much slower. Or could we drop support for all but the most relevant block sizes here?
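Something along these lines for the block size 16 special case (an untested sketch; the name is made up, and it assumes a little-endian host, hence the unconditional byteswaps):

  #include <stdint.h>
  #include <string.h>

  /* Fill BUFFER with N consecutive 16-byte big-endian counter values
     starting at CTR, and leave CTR incremented by N.  Unaligned access
     is done via memcpy, which compilers turn into plain loads/stores. */
  static void
  ctr_fill16_sketch(uint8_t *ctr, size_t n, uint8_t *buffer)
  {
    uint64_t hi, lo, be;
    size_t i;

    memcpy(&hi, ctr, 8);
    memcpy(&lo, ctr + 8, 8);
    hi = __builtin_bswap64(hi);
    lo = __builtin_bswap64(lo);

    for (i = 0; i < n; i++, buffer += 16)
      {
        be = __builtin_bswap64(hi);
        memcpy(buffer, &be, 8);
        be = __builtin_bswap64(lo);
        memcpy(buffer + 8, &be, 8);
        if (++lo == 0)   /* carry from the low into the high word */
          hi++;
      }

    hi = __builtin_bswap64(hi);
    lo = __builtin_bswap64(lo);
    memcpy(ctr, &hi, 8);
    memcpy(ctr + 8, &lo, 8);
  }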
Regards, /Niels
On Thu, Jan 4, 2018 at 2:15 PM, Niels Möller nisse@lysator.liu.se wrote:
Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com writes:
If I had to choose between optimizing one of the two, I'd say CTR.
I agree CTR seems more important. I'm guessing that the loop
  for (p = dst, left = length;
       left >= block_size;
       left -= block_size, p += block_size)
    {
      memcpy (p, ctr, block_size);
      INCREMENT(block_size, ctr);
    }
in ctr_crypt contributes quite a few cycles per byte. It would be faster to use an always word-aligned area, and do the copying and incrementing using word operations (and a final byteswap when running on a little-endian platform), with no intermediate stores.
Would be a pretty simple routine (maybe we don't even need to go to assembly) if we require that the block size is a multiple of sizeof(unsigned long), and even simpler if we restrict to block size 16. But uglier and less efficient, if it needs to support the general case.
Maybe we could have a special case for blocksize 16, and accept that unusual blocksizes will be much slower. Or could we drop support for all but the most relevant block sizes here?
I wouldn't expect anyone to use 3des in CTR mode, but I wouldn't be surprised by it either. What about introducing ctr_crypt128() and having it used by CCM and EAX? (It seems gcm is not using it anyway.)
regards, Nikos
Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com writes:
I wouldn't expect anyone to use 3des in CTR mode, but I wouldn't be surprised by it either.
It's in the ssh specs, with "recommended" status. See RFC 4344. I'd guess it's rarely used, though.
Back to AESNI, I've now pushed the change to the master branch. It would be interesting with some benchmarks on other machines than mine.
Regards, /Niels
Hello,
2018-01-04 21:36 GMT+03:00 Niels Möller nisse@lysator.liu.se:
Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com writes:
I wouldn't expect anyone to use 3des in CTR mode, but I wouldn't be surprised by it either.
It's in the ssh specs, with "recommended" status. See RFC 4344. I'd guess it's rarely used, though.
Back to AESNI, I've now pushed the change to the master branch. It would be interesting with some benchmarks on other machines than mine.
I'm attaching a log from my i3-4005U @ 1.6GHz box.
BTW: it might be interesting to enable 'fat' binaries by default. Otherwise distributions might easily build nettle w/o the optimized function versions.
On Thu, 2018-01-04 at 23:43 +0300, Dmitry Eremin-Solenikov wrote:
Hello,
2018-01-04 21:36 GMT+03:00 Niels Möller nisse@lysator.liu.se:
Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com writes:
I wouldn't expect anyone to use 3des in CTR mode, but I wouldn't be surprised by it either.
It's in the ssh specs, with "recommended" status. See RFC 4344. I'd guess it's rarely used, though.
Back to AESNI, I've now pushed the change to the master branch. It would be interesting with some benchmarks on other machines than mine.
I'm attaching a log from my i3-4005U @ 1.6GHz box.
Also attached is my log for an Intel i7-5600U @ 2.60GHz.
regards, Nikos
nisse@lysator.liu.se (Niels Möller) writes:
I agree CTR seems more important. I'm guessing that the loop
  for (p = dst, left = length;
       left >= block_size;
       left -= block_size, p += block_size)
    {
      memcpy (p, ctr, block_size);
      INCREMENT(block_size, ctr);
    }
in ctr_crypt contributes quite a few cycles per byte. It would be faster to use an always word-aligned area, and do the copying and incrementing using word operations (and a final byteswap when running on a little-endian platform), with no intermediate stores.
I've tried this, with special code for block size 16. (Without any assembly, but using __builtin_bswap64). Pushed to the ctr-opt branch. Gives a nice speedup. On my machine:
Nettle-3.4:
  Algorithm         mode           Mbyte/s  cycles/byte  cycles/block
  aes128            ECB encrypt    1589.75         1.26         20.16
  aes128            ECB decrypt    1642.91         1.22         19.50
  aes128            CBC encrypt     354.43         5.65         90.41
  aes128            CBC decrypt    1519.10         1.32         21.09
  aes128              (in-place)   1338.70         1.50         23.94
  aes128            CTR             727.24         2.75         44.06
  aes128              (in-place)    774.78         2.58         41.36
master branch:
  Algorithm         mode           Mbyte/s  cycles/byte  cycles/block
  aes128            ECB encrypt    3143.18         0.64         10.19
  aes128            ECB decrypt    3159.88         0.63         10.14
  aes128            CBC encrypt     351.37         5.70         91.20
  aes128            CBC decrypt    2726.47         0.73         11.75
  aes128              (in-place)   2131.99         0.94         15.03
  aes128            CTR             970.08         2.06         33.03
  aes128              (in-place)    796.31         2.51         40.24
ctr-opt branch:
  Algorithm         mode           Mbyte/s  cycles/byte  cycles/block
  aes128            ECB encrypt    3159.18         0.63         10.14
  aes128            ECB decrypt    3159.82         0.63         10.14
  aes128            CBC encrypt     351.80         5.69         91.08
  aes128            CBC decrypt    2723.80         0.74         11.76
  aes128              (in-place)   2156.27         0.93         14.86
  aes128            CTR            1778.84         1.13         18.01
  aes128              (in-place)   1550.39         1.29         20.67
Which means that aes128-ctr is twice as fast as in 3.4.
If anyone has a big-endian machine handy, it would be nice to get additional testing for both correctness and performance (I have access to a few virtual machines with non-x86 architectures, where I can test this before merging to the master branch, but that's not so useful for benchmarking).
Regards, /Niels
On Tue, 2018-01-09 at 08:29 +0100, Niels Möller wrote:
nisse@lysator.liu.se (Niels Möller) writes:
I agree CTR seems more important. I'm guessing that the loop
  for (p = dst, left = length;
       left >= block_size;
       left -= block_size, p += block_size)
    {
      memcpy (p, ctr, block_size);
      INCREMENT(block_size, ctr);
    }
in ctr_crypt contributes quite a few cycles per byte. It would be faster to use an always word-aligned area, and do the copying and incrementing using word operations (and a final byteswap when running on a little-endian platform), with no intermediate stores.
I've tried this, with special code for block size 16. (Without any assembly, but using __builtin_bswap64). Pushed to the ctr-opt branch. Gives a nice speedup. On my machine:
I see quite a large speedup on CTR on my x86_64 too. Note, however, that GCM performance is not affected.
regards, Nikos
On Tue, 2018-01-09 at 09:17 +0100, Nikos Mavrogiannopoulos wrote:
in ctr_crypt contributes quite a few cycles per byte. It would be faster to use an always word-aligned area, and do the copying and incrementing using word operations (and a final byteswap when running on a little-endian platform), with no intermediate stores.
I've tried this, with special code for block size 16. (Without any assembly, but using __builtin_bswap64). Pushed to the ctr-opt branch. Gives a nice speedup. On my machine:
I see a quite large speedup on my x86_64 too on CTR. Note however that GCM performance is not affected.
To follow up on this, gcm would get an 8% speedup (on my system) by switching gcm_crypt() to ctr_crypt(). With that change as is, however, the 32-bit counter is replaced with an "unlimited" counter. Wouldn't introducing an assert on the decrypt and encrypt lengths be sufficient to share that code?
regards, Nikos
Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com writes:
To follow up on this, gcm would get an 8% (on my system) speedup by switching gcm_crypt() with ctr_crypt(). With that change as is however, the 32-bit counter is replaced with an "unlimited" counter. Wouldn't introducing an assert on decrypt and encrypt length be sufficient to share that code?
I think it's valid to use gcm with an IV which makes the 32-bit counter start close to 2^32 - 1, and then propagating carry further than 32 bits would produce incorrect results. Right? (I'm afraid there's no test case for that, though.)
I agree it would be very nice to reuse ctr_crypt and not duplicate most of the logic. But I think we need a gcm-specific variant of ctr_fill. To do that, it would make sense to add a field
uint32_t u32[4];
to the nettle_block16 union.
To reduce code duplication, we could add a fill function pointer as an argument to ctr_crypt16, and use that for gcm. Not sure if that's a good idea, but it might be nice and clean, and the indirect call to the fill function should be negligible.
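Such a gcm-specific fill could look roughly like this (an untested sketch working on raw bytes rather than the suggested u32 field; only the low 32 bits of the counter are incremented, wrapping modulo 2^32):

  #include <stdint.h>
  #include <string.h>

  /* Untested sketch: the first 12 bytes of CTR stay fixed, and only the
     last 32 bits are incremented (big-endian, with wraparound), one
     counter value per 16-byte output block. */
  static void
  gcm_fill_sketch(uint8_t *ctr, size_t n, uint8_t *buffer)
  {
    uint32_t c;
    size_t i;

    c = ((uint32_t) ctr[12] << 24) | ((uint32_t) ctr[13] << 16)
      | ((uint32_t) ctr[14] << 8) | (uint32_t) ctr[15];

    for (i = 0; i < n; i++, buffer += 16, c++)
      {
        memcpy(buffer, ctr, 12);
        buffer[12] = c >> 24; buffer[13] = c >> 16;
        buffer[14] = c >> 8;  buffer[15] = c;
      }

    ctr[12] = c >> 24; ctr[13] = c >> 16;
    ctr[14] = c >> 8;  ctr[15] = c;
  }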
Regards, /Niels
On Tue, 2018-01-30 at 20:57 +0100, Niels Möller wrote:
Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com writes:
To follow up on this, gcm would get an 8% (on my system) speedup by switching gcm_crypt() with ctr_crypt(). With that change as is however, the 32-bit counter is replaced with an "unlimited" counter. Wouldn't introducing an assert on decrypt and encrypt length be sufficient to share that code?
I think it's valid to use gcm with an IV which makes the 32-bit counter start close to 2^32 - 1, and then propagating carry further than 32 bits would produce incorrect results. Right? (I'm afraid there's no test case for that, though.)
I agree it would be very nice to reuse ctr_crypt and not duplicate most of the logic. But I think we need a gcm-specific variant of ctr_fill. To do that, it would make sense to add a field
uint32_t u32[4];
to the nettle_block16 union.
To reduce code duplication, we could add a fill function pointer as argument to ctr_crypt16, and use that for gcm. Not sure if that's a good idea, but it might be nice and clean and indirect call to the fill function should be negligible.
It seems that ctr_crypt16() would not handle the whole input, and that was complicating things. I've modified it to do so, and added the parameter. I did a gcm_fill(), but I didn't see the need for the nettle_block16 update, as the version I did (quite simplistic) didn't seem to differ in performance compared to ctr_fill16.
regards, Nikos
Nikos Mavrogiannopoulos nmav@redhat.com writes:
It seems that ctr_crypt16() would not handle the whole input and that was complicating things.
I was afraid of that. Doing the extra block would be something like
  done = ctr_crypt16(cipher, f, gcm_fill, ctx->ctr.b, length, dst, src);
  if (done < length)
    {
      uint8_t block[16];
      assert (done % 16 == 0);
      assert (length - done < 16);
      f(ctx, 16, block, ctx->ctr.b);
      memxor3(dst + done, src + done, block, length - done);
    }
(if we skip updating the counter in this case; I don't think gcm promises anything about the counter after a partial block).
But I agree it makes sense to let ctr_crypt16 do that.
More detailed comments later.
Regards, /Niels
Nikos Mavrogiannopoulos nmav@redhat.com writes:
It seems that ctr_crypt16() would not handle the whole input and that was complicating things. I've modified it towards that, and added the parameter. I did a gcm_fill(), but I didn't see the need for the nettle_block16 update, as the version I did (quite simplistic), didn't seem to differ in performance comparing to ctr_fill16.
I've applied the first part with some reorganization. ctr-internal.h now declares
  /* Fill BUFFER (n blocks) with incrementing CTR values. It would be
     nice if CTR was always 64-bit aligned, but it isn't when called
     from ctr_crypt. */
  typedef void nettle_fill16_func(uint8_t *ctr, size_t n,
                                  union nettle_block16 *buffer);

  void
  _ctr_crypt16(const void *ctx, nettle_cipher_func *f,
               nettle_fill16_func *fill, uint8_t *ctr,
               size_t length, uint8_t *dst,
               const uint8_t *src);
And I moved the implementation to a separate file ctr16.c.
Your change to gcm.c is then applied almost unchanged on top of that. Result pushed to a branch named "gcm-ctr-opt". On my machine, it gives a gcm_aes128 speedup of 54% (from 12.2 cycles/byte to 7.9).
Very nice! Needs a little testing on big-endian before merge to master.
Thanks, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
I've tried this, with special code for block size 16. (Without any assembly, but using __builtin_bswap64). Pushed to the ctr-opt branch.
For the ctr changes, I need some testing on big-endian before merging to master. Most of the gmp virtual test machines are down at the moment, pending security upgrades related to Spectre and Meltdown.
I've applied for a gcc compile farm account, and was approved Wednesday evening, but it seems my account and ssh key haven't yet been propagated to the farm machines.
Is anyone on the list familiar with Debian cross compilers? It would be convenient to be able to locally cross-compile for, e.g., Debian mips and run tests with qemu-user, if the needed pieces are packaged.
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
I've applied for a gcc compile farm account, and was approved Wednesday evening, but it seems my account and ssh key haven't yet been propagated to the farm machines.
Got this set up now, and tested successfully on an Ultrasparc T5 (gcc202.fsffrance.org). Speedup of aes128 ctr was around 20%.
Speaking of sparc, recent chips have some crypto instructions which could be used to speed up aes considerably. I don't think I'm going to do any more sparc assembly hacking soon, but if someone else is interested in sparc performance, it might be a reasonably easy project with a large speedup.
The current sparc aes code was written back in 2005, using the 32-bit sparcstation I had at home at the time, running Red Hat Linux. It was adapted for sparc64 by Henrik Grubbström, and essentially the same code is used for both 32-bit and 64-bit.
Regards, /Niels