Hi,
I got pinged by someone testing the performance of TLS handshakes, and it seems that gnutls/nettle with RSA is significantly slower than openssl. On the other hand, secp256r1 and ed25519 are faster. (By the way, both openssl and gnutls/nettle are slower than rustls.)
Nevertheless, the RSA result caught my attention because I had the impression that nettle was at some point equivalent, if not faster. I see that the hogweed benchmark values in nettle show a 3x difference in signing for the TR version and ~2x for the unprotected one. Going back to 3.1 did not affect that. Was that always the case? If not, any ideas what could have caused it? Did we miss some optimizations? (From a quick review of openssl's RSA code, I see that smooth CRT RSA was added relatively recently, but could that give such a big performance benefit?)
name              size  sign/ms  verify/ms
rsa               2048   0.8881    27.1422
rsa (openssl)     2048   1.4249    45.2295
rsa-tr            2048   0.4257    29.1152
rsa-tr (openssl)  2048   1.3735    46.1692
regards, Nikos
On Mon, 2019-12-02 at 13:24 +0100, Nikos Mavrogiannopoulos wrote:
> Hi, I got pinged by someone testing the performance of TLS handshakes and it seems that gnutls/nettle with RSA is significantly slower than openssl. On the other hand, secp256r1 and ed25519 are faster. (btw. both openssl and gnutls/nettle are slower than rustls).
FYI, the last time I checked, rustls did not employ any countermeasures, not even blinding; it is easy to be fast that way.
> Nevertheless the RSA caught my attention because I had the impression that nettle was at some point equivalent if not faster. I see that the hogweed benchmark values in nettle show a 3x difference in signing for the TR version and ~2x for the unprotected. Going back to 3.1 did not affect that. Was that always the case? If not any ideas what could have caused that? Did we miss some optimizations? (from a quick review of openssl's RSA code, I see that smooth CRT RSA was added relatively recently, but could that get such a big performance benefit?)
Would you be able to measure OpenSSL's RSA from a release before smooth CRT was added?
> name              size  sign/ms  verify/ms
> rsa               2048   0.8881    27.1422
> rsa (openssl)     2048   1.4249    45.2295
> rsa-tr            2048   0.4257    29.1152
> rsa-tr (openssl)  2048   1.3735    46.1692
> regards,
> Nikos
> _______________________________________________
> nettle-bugs mailing list
> nettle-bugs@lists.lysator.liu.se
> http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Nikos Mavrogiannopoulos <n.mavrogiannopoulos@gmail.com> writes:
> I got pinged by someone testing the performance of TLS handshakes and it seems that gnutls/nettle with RSA is significantly slower than openssl.
To quote the NEWS file for Nettle-3.4.1:
Performance regression:
* All RSA private key operations employing RSA blinding, i.e., rsa_decrypt_tr, rsa_*_sign_tr, the new rsa_sec_decrypt, and rsa_compute_root_tr, are significantly slower. This is because (i) RSA blinding now use side-channel silent operations, (ii) blinding includes a modular inversion, and (iii) side-channel silent modular inversion, implemented as mpn_sec_invert, is very expensive. A 60% slowdown for 2048-bit RSA keys have been measured.
> name              size  sign/ms  verify/ms
> rsa               2048   0.8881    27.1422
> rsa (openssl)     2048   1.4249    45.2295
> rsa-tr            2048   0.4257    29.1152
> rsa-tr (openssl)  2048   1.3735    46.1692
The above explains why Nettle's rsa-tr is much slower than the non-tr version. But it's disappointing that there also looks like a pretty large general slowdown.
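To make the cost described in the NEWS entry concrete, here is a toy sketch of textbook RSA signing with blinding, showing where the modular inversion enters. It is only an illustration under loud assumptions: the tiny key (P, Q, E, D) and the function names are made up for the example, Python's pow() is not side-channel silent, and this is not how Nettle structures its code. In Nettle the inversion is done by mpn_sec_invert, which is the expensive part.

```python
import math
import secrets

# Toy textbook key (hypothetical, far too small for real use).
P, Q = 61, 53
N = P * Q      # modulus
E = 17         # public exponent
D = 2753       # private exponent: E * D == 1 (mod (P-1)*(Q-1))


def sign_plain(m):
    """Unblinded private-key operation: m^D mod N."""
    return pow(m, D, N)


def sign_blinded(m):
    """Same result as sign_plain, but the exponentiation sees a random input."""
    # Pick a random blinding factor r that is invertible mod N.
    while True:
        r = secrets.randbelow(N - 2) + 2
        if math.gcd(r, N) == 1:
            break
    blinded = (m * pow(r, E, N)) % N    # m * r^E
    s_blinded = pow(blinded, D, N)      # (m * r^E)^D == m^D * r (mod N)
    r_inv = pow(r, -1, N)               # the modular inversion (Python 3.8+)
    return (s_blinded * r_inv) % N      # unblind: m^D


if __name__ == "__main__":
    msg = 65
    print(sign_plain(msg) == sign_blinded(msg))
```

The blinded path does one extra public-exponent exponentiation, two extra multiplications and one inversion per signature; only the inversion is costly when it has to be side-channel silent.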
I think most of the running time for RSA operations, except for modular inversion, is spent in well-tuned GMP functions. For best speed, make sure GMP is either compiled with --enable-fat, or configured for the machine it's running on, and use a recent version. To track down any problems, it's important to know more precisely what processor it's running on and how gmp was configured.
For what it's worth, this is what I get on the laptop (quite old, "U4100 @ 1.30GHz" according to /proc/cpuinfo, should probably be "SU4100", detected as core2-pc-linux-gnu by gmp) I'm sitting in front of right now:
$ ../examples/hogweed-benchmark rsa
name              size  sign/ms  verify/ms
rsa               2048   0.2106     7.2703
rsa-tr            2048   0.1158     6.8202
rsa (openssl)     2048   0.2024     6.4992
rsa-tr (openssl)  2048   0.1959     6.4983
So here, Nettle is slightly faster except for side-channel silent signing. It's a bit odd that *verify* for rsa-tr appears slower than the non-tr, since no secrets are involved, and the same function is called. May be a problem in the benchmark program.
Is "Smooth CRT" something that I should look up?
Regards, /Niels
On Mon, Dec 2, 2019 at 9:47 PM Niels Möller <nisse@lysator.liu.se> wrote:
>> name              size  sign/ms  verify/ms
>> rsa               2048   0.8881    27.1422
>> rsa (openssl)     2048   1.4249    45.2295
>> rsa-tr            2048   0.4257    29.1152
>> rsa-tr (openssl)  2048   1.3735    46.1692
> The above explains why Nettle's rsa-tr is much slower than the non-tr version. But it's disappointing that there also looks like a pretty large general slowdown.
> I think most of the running time for RSA operations, except for modular inversion, are in well-tuned GMP functions. For best speed, make sure GMP is either compiled with --enable-fat, or configured for the machine it's running on, and use a recent version. To track down any problems, it's important to know more precisely what processor it's running on and how gmp was configured.
That seemed too trivial to be the cause before I wrote this email, but it was actually the case: the Fedora maintainer had removed the --enable-fat option in a seemingly unrelated commit. I've reported it at: https://bugzilla.redhat.com/show_bug.cgi?id=1779060
> Is "Smooth CRT" something that I should look up?
I do not know more about it. I only saw that the openssl commit claims a speedup, but without giving any numbers.
regards, Nikos
On Tue, 2019-12-03 at 08:59 +0100, Nikos Mavrogiannopoulos wrote:
> On Mon, Dec 2, 2019 at 9:47 PM Niels Möller <nisse@lysator.liu.se> wrote:
>>> name              size  sign/ms  verify/ms
>>> rsa               2048   0.8881    27.1422
>>> rsa (openssl)     2048   1.4249    45.2295
>>> rsa-tr            2048   0.4257    29.1152
>>> rsa-tr (openssl)  2048   1.3735    46.1692
>> The above explains why Nettle's rsa-tr is much slower than the non-tr version. But it's disappointing that there also looks like a pretty large general slowdown.
>> I think most of the running time for RSA operations, except for modular inversion, are in well-tuned GMP functions. For best speed, make sure GMP is either compiled with --enable-fat, or configured for the machine it's running on, and use a recent version. To track down any problems, it's important to know more precisely what processor it's running on and how gmp was configured.
> That seemed trivial before I wrote this email, but that was actually the case. The Fedora maintainer had removed the --enable-fat option in a seemingly unrelated commit. I've reported it at: https://bugzilla.redhat.com/show_bug.cgi?id=1779060
Hmm, even after --enable-fat was given to gmp, not much has changed.
My CPU is Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz and that's what I see:
1. gmp without --enable-fat
   rsa               2048   0.8881    27.1422
2. gmp with --enable-fat
   rsa               2048   1.0973    40.4561
3. gmp with --enable-fat, compiled outside the distribution (as ./configure --enable-fat)
   rsa               2048   1.5127    53.6693
The corresponding value on that CPU for openssl's RSA is:
   rsa (openssl)     2048   1.9212    61.4107
So it may be that it is quite hard to get good values out of gmp without a custom compilation. In particular, I see that locally I have:
   -mtune=skylake -march=broadwell -fomit-frame-pointer
while Fedora sets:
   -mtune=generic
and --enable-fat is not sufficient to overcome this.
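As a rough aid for comparing the three builds, the figures above can be expressed relative to the OpenSSL run on the same machine. This is only a sketch: it assumes the columns are operations per millisecond (higher is better), as the benchmark header suggests, and the names NETTLE_BUILDS, OPENSSL and relative_throughput are made up for the illustration.

```python
# (signs/ms, verifies/ms) for 2048-bit RSA, copied from the message above.
NETTLE_BUILDS = {
    "gmp without --enable-fat": (0.8881, 27.1422),
    "gmp with --enable-fat": (1.0973, 40.4561),
    "gmp with --enable-fat, built outside the distribution": (1.5127, 53.6693),
}
OPENSSL = (1.9212, 61.4107)


def relative_throughput(build, reference=OPENSSL):
    """Return (sign, verify) throughput of `build` as a fraction of `reference`."""
    sign, verify = build
    return sign / reference[0], verify / reference[1]


if __name__ == "__main__":
    for name, numbers in NETTLE_BUILDS.items():
        sign_ratio, verify_ratio = relative_throughput(numbers)
        print(f"{name}: sign {sign_ratio:.0%}, verify {verify_ratio:.0%} of OpenSSL")
```

Read this way, the distribution build with --enable-fat recovers only part of the gap in signing, while the locally compiled gmp comes much closer to OpenSSL.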
regards, Nikos
Nikos Mavrogiannopoulos <nmav@redhat.com> writes:
> Hmm even after --enable-fat was given to gmp not much has changed.
> My CPU is Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz and that's what I see:
> gmp without --enable-fat
>    rsa               2048   0.8881    27.1422
> gmp with --enable-fat
>    rsa               2048   1.0973    40.4561
> gmp with --enable-fat compiled outside distribution (as ./configure --enable-fat)
>    rsa               2048   1.5127    53.6693
That's quite a big difference.
> The corresponding value on that cpu for openssl's RSA is:
>    rsa (openssl)     2048   1.9212    61.4107
> So it may be that it is quite hard to get good values out of gmp without having a custom compilation. In particular I see that locally I have:
>    -mtune=skylake -march=broadwell -fomit-frame-pointer
> while fedora sets:
>    -mtune=generic
> and the --enable-fat is not sufficient to overcome this.
Those flags should only affect the code generated by the C compiler, and I'd expect all critical loops to be in assembly on your machine.
But --enable-fat is more than just selecting the right assembly code at runtime; various thresholds should also be set depending on the CPU type. What performance do you get from a default (non-fat) build on your machine? The default will select code and thresholds based on (the gmp-specific) config.guess.
Maybe send a mail to the gmp-discuss or gmp-bugs list and ask for advice (see https://gmplib.org/#MAILINGLISTS)? Don't forget to say precisely which GMP version you're using.
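For such a report, the GMP version and processor model can be collected with a small script. The following is a sketch under assumptions: it inspects whichever libgmp the dynamic loader finds (which may not be the copy Nettle is linked against), it reads the __gmp_version symbol that gmp.h exposes as gmp_version, it relies on the Linux /proc/cpuinfo layout, and the helper names are made up. It does not reveal how gmp was configured.

```python
import ctypes
import ctypes.util
import platform


def gmp_runtime_version():
    """Return the version string of the libgmp found at runtime, or None."""
    path = ctypes.util.find_library("gmp")
    if path is None:
        return None
    try:
        lib = ctypes.CDLL(path)
        # gmp.h defines gmp_version as a macro for the exported symbol
        # __gmp_version, a `const char * const`.
        value = ctypes.c_char_p.in_dll(lib, "__gmp_version").value
    except (OSError, ValueError):
        return None
    return value.decode() if value else None


def cpu_model():
    """Return the CPU model name, falling back to a generic platform string."""
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.lower().startswith("model name"):
                    return line.split(":", 1)[1].strip()
    except OSError:
        pass
    return platform.processor() or platform.machine()


def report():
    """Collect the details worth including in a performance report."""
    return {"cpu": cpu_model(), "gmp_version": gmp_runtime_version()}


if __name__ == "__main__":
    for key, value in report().items():
        print(f"{key}: {value}")
```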
Regards, /Niels