Hi
I just published a module used in some PBKDF2-HMAC-SHA512 testing I've been doing under a contract with NORDUnet A/S.
https://github.com/fredrikt/python-ndnkdf
I invoke the new PBKDF2 functions in libnettle using Python ctypes, which achieves a ~ 25x speedup compared to the standard python-pbkdf2 that uses SHA512 from hashlib (presumably a C function), but does the xoring in native Python.
PBKDF2-HMAC-SHA512 benchmark result :
N= 16 -> Python == 0 ms, Nettle == 0 ms N= 32 -> Python == 1 ms, Nettle == 0 ms N= 64 -> Python == 2 ms, Nettle == 0 ms N= 128 -> Python == 4 ms, Nettle == 0 ms N= 256 -> Python == 8 ms, Nettle == 0 ms N= 512 -> Python == 18 ms, Nettle == 0 ms N= 1024 -> Python == 35 ms, Nettle == 1 ms N= 2048 -> Python == 68 ms, Nettle == 2 ms N= 4096 -> Python == 133 ms, Nettle == 5 ms N= 8192 -> Python == 262 ms, Nettle == 11 ms N= 16384 -> Python == 521 ms, Nettle == 22 ms
If someone has access to a modern AMD CPU, I would be very interested in getting the benchmark output of examples/pbkdf2-plot on that machine. Thanks.
/Fredrik
Fredrik Thulin fredrik@thulin.net writes:
I just published a module used in some PBKDF2-HMAC-SHA512 testing I've been doing under a contract with NORDUnet A/S.
Cool. Why SHA512 rather than SHA256, is using it specified somewhere?
If PBKDF2-HMAC-SHA512 is widely used, it would sense to add a convenience function pbkdf2_hmac_sha512 to nettle.
What's the source of your test vectors? It would be nice with additional test vectors also for nettle's testsuite/pbkdf2-test.c.
I invoke the new PBKDF2 functions in libnettle using Python ctypes, which achieves a ~ 25x speedup compared to the standard python-pbkdf2 that uses SHA512 from hashlib (presumably a C function), but does the xoring in native Python.
Is the speed of sha512 itself comparable? Nettle's implementatiion is fairly straight-forward C code.
If someone has access to a modern AMD CPU, I would be very interested in getting the benchmark output of examples/pbkdf2-plot on that machine. Thanks.
My machine at home is a lowend but reasonably modern AMD, E-350, iirc. I'm not very familiar with python, but if you tell me the steps needed to get the benchmark running on a debian system I can give it a try.
Regards, /Niels
On Tue, 2012-11-27 at 14:06 +0100, Niels Möller wrote:
Fredrik Thulin fredrik@thulin.net writes:
I just published a module used in some PBKDF2-HMAC-SHA512 testing I've been doing under a contract with NORDUnet A/S.
Cool. Why SHA512 rather than SHA256, is using it specified somewhere?
I wanted maximum speed on 64 bits CPUs without settling for SHA-1, but I won't claim this was a particularly well illuminated decision.
If PBKDF2-HMAC-SHA512 is widely used, it would sense to add a convenience function pbkdf2_hmac_sha512 to nettle.
I wouldn't mind that, although I've already failed to convince Simon =).
It would remove the hack of passing a 1k buffer as sha512ctx (https://github.com/fredrikt/python-ndnkdf/blob/master/ndnkdf/ndnkdf.py#L76).
What's the source of your test vectors? It would be nice with additional test vectors also for nettle's testsuite/pbkdf2-test.c.
IIRC I actually took the test vectors *from* Nettle - my basic concern with testing was to verify that I was successfully calling libnettle, not that libnettle works.
I did implement a test case that compares the output of Nettle with that of python-pbkdf2 for a number of (key, salt, iterations) though.
I invoke the new PBKDF2 functions in libnettle using Python ctypes, which achieves a ~ 25x speedup compared to the standard python-pbkdf2 that uses SHA512 from hashlib (presumably a C function), but does the xoring in native Python.
Is the speed of sha512 itself comparable? Nettle's implementatiion is fairly straight-forward C code.
Haven't measured. Optimizing a SHA512 implementation is really above my head, but I've heard talks about using AMD XOP instruction set to optimize SHA512 on other mailing lists...
If someone has access to a modern AMD CPU, I would be very interested in getting the benchmark output of examples/pbkdf2-plot on that machine. Thanks.
My machine at home is a lowend but reasonably modern AMD, E-350, iirc. I'm not very familiar with python, but if you tell me the steps needed to get the benchmark running on a debian system I can give it a try.
Thanks. Something like
$ git clone https://github.com/dlitz/python-pbkdf2 $ git clone https://github.com/fredrikt/python-ndnkdf $ cd python-ndnkdf/examples $ PYTHONPATH=../../python-pbkdf2 ./pbkdf2-plot
(assuming there is a new enough libnettle in the system library path).
/Fredrik
Fredrik Thulin fredrik@thulin.net writes:
On Tue, 2012-11-27 at 14:06 +0100, Niels Möller wrote:
Cool. Why SHA512 rather than SHA256, is using it specified somewhere?
I wanted maximum speed on 64 bits CPUs without settling for SHA-1, but I won't claim this was a particularly well illuminated decision.
Makes some sense. But on the other hand, the point of pbkdf2 is to be slow (for the attacker), so selecting a faster hash function just means that you need to use a larger iteration count...
It would remove the hack of passing a 1k buffer as sha512ctx (https://github.com/fredrikt/python-ndnkdf/blob/master/ndnkdf/ndnkdf.py#L76).
The right way would be to use some configure test to get the size of that struct, I guess. (pyconfigure-0.1 was released yesterday, btw. See http://www.gnu.org/software/pyconfigure/. I haven't done any python modules since ancient versions of the gmp module, but I imagine pyconfigure could be useful if you need configure tests for your python module).
I did implement a test case that compares the output of Nettle with that of python-pbkdf2 for a number of (key, salt, iterations) though.
For lack of more authoritative test vectors, adding a couple of testvectors generated by python-pbkdf2, to the nettle testsuite would be nice.
Haven't measured. Optimizing a SHA512 implementation is really above my head, but I've heard talks about using AMD XOP instruction set to optimize SHA512 on other mailing lists...
That's a project for another day. It's some time since I wrote the C implementation, and I can't guess if a clever assembly implementation would gain a 10% or a 100% speedup compared to what gcc generates.
$ git clone https://github.com/dlitz/python-pbkdf2 $ git clone https://github.com/fredrikt/python-ndnkdf $ cd python-ndnkdf/examples $ PYTHONPATH=../../python-pbkdf2 ./pbkdf2-plot
Ok. With
$ NDNKDF_PATH=~/build/nettle-shared/.lib PYTHONPATH=../../python-pbkdf2:.. ./pbkdf2-plot
I get
PBKDF2-HMAC-SHA512 benchmark result :
N= 16 -> Python == 2 ms, Nettle == 0 ms N= 32 -> Python == 6 ms, Nettle == 0 ms N= 64 -> Python == 8 ms, Nettle == 0 ms N= 128 -> Python == 16 ms, Nettle == 0 ms N= 256 -> Python == 30 ms, Nettle == 1 ms N= 512 -> Python == 57 ms, Nettle == 2 ms N= 1024 -> Python == 114 ms, Nettle == 3 ms N= 2048 -> Python == 229 ms, Nettle == 7 ms N= 4096 -> Python == 453 ms, Nettle == 15 ms N= 8192 -> Python == 909 ms, Nettle == 30 ms N= 16384 -> Python == 1834 ms, Nettle == 59 ms
The machine has an "AMD E-350" processor, 1.6 GHz dual core (but I guess the number of cores doesn't matter here). GMP's configure refers to the cpu as "bobcat", which if I understand these things correctly is AMD's current low-end.
Regards, /Niels
On Tue, 2012-11-27 at 22:47 +0100, Niels Möller wrote:
Fredrik Thulin fredrik@thulin.net writes:
On Tue, 2012-11-27 at 14:06 +0100, Niels Möller wrote:
Cool. Why SHA512 rather than SHA256, is using it specified somewhere?
I wanted maximum speed on 64 bits CPUs without settling for SHA-1, but I won't claim this was a particularly well illuminated decision.
Makes some sense. But on the other hand, the point of pbkdf2 is to be slow (for the attacker), so selecting a faster hash function just means that you need to use a larger iteration count...
Right - I realized it was poorly phrased after I sent it.
To be a bit more elaborate, my reasoning is that to minimize an attackers advantage (bot nets will always have more CPU than I do), I have to choose an algorithm that is as fast as possible for me and use it with as many iterations as I can afford. I know I will have 64 bits, but not yet if I should go with in Intel or AMD here.
On top of that, it doesn't hurt to choose an algorithm that might me more expensive for attackers than me. AFAICT, most GPUs today are 32 bits (so SHA512 would be the slowest SHA currently), although I guess it is safe to assume GPUs will become 64 bits too.
Naturally, what I'm building will be upgradeable with new algorithms and/or iteration counts over time, but SHA512 seems like the strongest choice today.
The only reason *not* to go with SHA512 (if we're limiting ourselves to SHAs) seems to be that it is considered overkill - admittedly by people that know far more than me about secure hashes. I don't mind the overkill though, and prefer safe to sorry.
For lack of more authoritative test vectors, adding a couple of testvectors generated by python-pbkdf2, to the nettle testsuite would be nice.
I'll see what I can do.
Haven't measured. Optimizing a SHA512 implementation is really above my head, but I've heard talks about using AMD XOP instruction set to optimize SHA512 on other mailing lists...
That's a project for another day. It's some time since I wrote the C implementation, and I can't guess if a clever assembly implementation would gain a 10% or a 100% speedup compared to what gcc generates.
An indication could perhaps be gleaned from
http://cvs.openssl.org/chngview?cn=22648
If I understand the "Current performance in cycles per processed byte" data correctly it seems like use of AVX/XOP could give a 50% performance improvement for SHA256 and a 38% improvement for SHA512 on Sandy Bridge. I have no idea how OpenSSLs previous implementation performed compared to Nettle though.
...
I get
PBKDF2-HMAC-SHA512 benchmark result :
... N= 16384 -> Python == 1834 ms, Nettle == 59 ms
The machine has an "AMD E-350" processor, 1.6 GHz dual core (but I guess the number of cores doesn't matter here). GMP's configure refers to the cpu as "bobcat", which if I understand these things correctly is AMD's current low-end.
Thanks. I know not to use a low-end AMD then ;). You are correct that the benchmark only uses one core.
/Fredrik
nettle-bugs@lists.lysator.liu.se