I've now written some ARM assembly files for AES, SHA1 and SHA256. I haven't tried any clever scheduling (it seems quite unpredictable), so I think the win over the C code is mainly because I fit the important values in registers. I have benchmarked on a Cortex-A9 system, with nice but not earth-shattering improvements over the C implementation.
In my benchmarks, AES decrypt is significantly slower than AES decrypt, for no reason that I understand, and I have tried some other variants of instruction scheduling.
I have one question on the ABI, if there's anyone on the list with more ARM experience: In the ABI spec, register r9 is reserved for things such as thread local storage. I'm almost sure that a leaf function can use r9 like any other callee-save register, but I'd like to have that confirmed before I make use of it. A potential problem case is if the function is interrupted by a signal and the signal handler somehow depends on having a valid value in r9, or if context switching somehow assumes that r9 is never modified.
Any suggestions for what to try optimizing next? I imagine all algorithms that make use of 64-bit data or are designed for SIMD operations could be sped up a bit using the NEON SIMD instructions (which I so far haven't made any use of). This includes sha512, sha3, serpent, salsa, maybe camellia.
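To illustrate why 64-bit algorithms are awkward on the 32-bit core registers, here is a rough C sketch (not Nettle code) of a 64-bit rotate, as used heavily in sha512, done on two 32-bit halves. The names `u64_pair`, `rotr64_pair` and `rotr64` are made up for this example, and the pair version only handles rotation counts between 1 and 31.

```c
#include <stdint.h>

/* Hypothetical type: a 64-bit word split into two 32-bit halves, the way it
   would occupy a pair of 32-bit core registers. */
struct u64_pair {
    uint32_t hi;
    uint32_t lo;
};

/* Rotate right by n bits, valid for 0 < n < 32 only.
   Each output half needs two shifts and an OR, with bits crossing
   from one half to the other. */
static struct u64_pair rotr64_pair(struct u64_pair x, unsigned n)
{
    struct u64_pair r;
    r.lo = (x.lo >> n) | (x.hi << (32 - n));
    r.hi = (x.hi >> n) | (x.lo << (32 - n));
    return r;
}

/* Reference: the same rotate on a native 64-bit value, valid for 0 < n < 64. */
static uint64_t rotr64(uint64_t x, unsigned n)
{
    return (x >> n) | (x << (64 - n));
}
```

With 64-bit lanes available, the whole rotate operates on a single value instead of two dependent halves, which is where I would expect the gain to come from.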
Regards, /Niels
On 03/12/2013 05:51 AM, Niels Möller wrote:
In my benchmarks, AES decrypt is significantly slower than AES decrypt, for no reason that I understand, and I have tried some other variants of instruction scheduling.
You've got "AES decrypt" in both parts of the sentence above -- i assume one of them is supposed to be "AES encrypt", but i'm not sure which one. I'm curious, though :)
Sorry to not have anything helpful to offer on the questions you asked.
Regards,
--dkg
Daniel Kahn Gillmor <dkg@fifthhorseman.net> writes:
You've got "AES decrypt" in both parts of the sentence above -- i assume one of them is supposed to be "AES encrypt", but i'm not sure which one. I'm curious, though :)
Ooops. In my benchmarks, decrypt is slower than encrypt (it would have been a bit worse if it were the other way round, since some constructions, in particular CTR mode, use encrypt only).
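For anyone wondering why CTR mode never touches the decrypt function, here is a minimal C sketch. It is not the Nettle implementation: `toy_encrypt_block` is a hypothetical, insecure stand-in for a real block cipher's encrypt direction, and `ctr_increment` and `ctr_crypt` are names invented for this example. The same `ctr_crypt` call both encrypts and decrypts, because the cipher is only used to generate a keystream that is XORed with the data.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 16

/* Hypothetical stand-in for the forward (encrypt) direction of a block
   cipher. This is NOT AES and NOT secure; it only mixes key and input. */
static void toy_encrypt_block(const uint8_t key[BLOCK_SIZE],
                              const uint8_t in[BLOCK_SIZE],
                              uint8_t out[BLOCK_SIZE])
{
    for (size_t i = 0; i < BLOCK_SIZE; i++)
        out[i] = (uint8_t)((in[i] ^ key[i]) * 167u + 13u);
}

/* Increment the counter block as a big-endian integer. */
static void ctr_increment(uint8_t ctr[BLOCK_SIZE])
{
    for (size_t i = BLOCK_SIZE; i-- > 0; )
        if (++ctr[i] != 0)
            break;
}

/* CTR mode: encrypt the counter to get a keystream block, XOR it with the
   data, then step the counter. Encryption and decryption are the same
   operation, so only the cipher's encrypt direction is ever called. */
static void ctr_crypt(const uint8_t key[BLOCK_SIZE],
                      uint8_t ctr[BLOCK_SIZE],
                      const uint8_t *src, uint8_t *dst, size_t length)
{
    uint8_t keystream[BLOCK_SIZE];

    while (length > 0) {
        size_t n = length < BLOCK_SIZE ? length : BLOCK_SIZE;

        toy_encrypt_block(key, ctr, keystream);
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i] ^ keystream[i];

        ctr_increment(ctr);
        src += n;
        dst += n;
        length -= n;
    }
}
```

Running `ctr_crypt` twice with the same key and the same starting counter gives back the original data, which is why encrypt speed is the one that matters for this mode.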
Regards, /Niels
On 03/12/2013 10:51 AM, Niels Möller wrote:
I've now written some ARM assembly files for AES, SHA1 and SHA256. I haven't tried any clever scheduling (it seems quite unpredictable), so I think the win over the C code is mainly because I fit the important values in registers. I have benchmarked on a Cortex-A9 system, with nice but not earth-shattering improvements over the C implementation.
Have you performed comparisons with the openssl AES-arm implementation? I would be curious about it because openssl's version was written for armv4 and, as far as I remember, it outperformed nettle's C implementation by around 10-15% (values from memory; I couldn't find my measurements).
regards, Nikos
nettle-bugs@lists.lysator.liu.se