I've now written some ARM assembly files for AES, SHA1 and SHA256. I
haven't tried any clever scheduling (it seems quite unpredictable), so I
think the win over the C code is mainly because I fit the important
values in registers. I have benchmarked on a Cortex-A9 system, with nice
but not earth-shattering improvements over the C implementation.

In my benchmarks, AES decrypt is significantly slower than AES encrypt,
for no reason that I understand, even after trying some other variants
of instruction scheduling.

I have one question on the ABI, if there's anyone on the list with more
ARM experience: In the ABI spec, register r9 is reserved for things such
as thread-local storage. I'm almost sure that a leaf function can use r9
like any other callee-save register, but I'd like to have that confirmed
before I make use of it. The potential problem cases are if the function
is interrupted by a signal and the signal handler somehow depends on
having a valid value in r9, or if context switching somehow assumes that
r9 is never modified.

Any suggestions for what to try optimizing next? I imagine all
algorithms that make use of 64-bit data or are designed for SIMD
operations could be sped up a bit using the NEON SIMD instructions
(which I so far haven't made any use of). This includes SHA512, SHA3,
Serpent, Salsa, maybe Camellia.
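
To illustrate why the 64-bit algorithms look like candidates, here is a
minimal C sketch of the 64-bit rotate that SHA512 uses heavily. It is
not taken from the existing code, and the names rotr64 and rotr64_halves
are made up for illustration. Without 64-bit registers, each rotate has
to be assembled from two 32-bit halves, roughly as in the second
function.

```c
#include <stdint.h>

/* Reference: rotate right on a native 64-bit value, for 0 < n < 64. */
static uint64_t
rotr64(uint64_t x, unsigned n)
{
  return (x >> n) | (x << (64 - n));
}

/* The same rotate built from 32-bit halves, as is needed when only
   32-bit registers are available. Valid for 0 < n < 64. */
static uint64_t
rotr64_halves(uint32_t hi, uint32_t lo, unsigned n)
{
  if (n >= 32)
    {
      /* Rotating by 32 just swaps the two halves. */
      uint32_t t = hi;
      hi = lo;
      lo = t;
      n -= 32;
    }
  if (n > 0)
    {
      /* Bits shifted out of one half enter the top of the other. */
      uint32_t new_lo = (lo >> n) | (hi << (32 - n));
      uint32_t new_hi = (hi >> n) | (lo << (32 - n));
      lo = new_lo;
      hi = new_hi;
    }
  return ((uint64_t) hi << 32) | lo;
}
```

NEON has shift instructions that work on 64-bit lanes, so the split
into halves should not be needed there. Whether that turns into a net
win would have to be measured.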
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.