I've now written some ARM assembly files for AES, SHA1 and SHA256. I haven't tried any clever scheduling (it seems quite unpredictable), so I think the win over the C code is mainly because I fit the important values in registers. I have benchmarked on a Cortex-A9 system, with nice but not earth-shattering improvements over the C implementation.
In my benchmarks, AES decrypt is significantly slower than AES encrypt, for no reason that I understand, even though I have tried some other variants of instruction scheduling.
I have one question on the ABI, if there's anyone on the list with more ARM experience: In the ABI spec, register r9 is reserved for things such as thread local storage. I'm almost sure that a leaf function can use r9 like any other callee-save register, but I'd like to have that confirmed before I make use of it. The potential problem case is if the function is interrupted by a signal and the signal handler somehow depends on having a valid value in r9, or if context switching somehow assumes that r9 is never modified.
Any suggestions for what to try optimizing next? I imagine all algorithms that make use of 64-bit data or are designed for SIMD operations could be sped up a bit using the NEON SIMD instructions (which I so far haven't made any use of). This includes sha512, sha3, serpent, salsa, maybe camellia.
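To illustrate the 64-bit point, here is a minimal sketch in plain C of the kind of operation I have in mind, using one of the SHA-512 round functions as defined in FIPS 180-4. The names rotr64 and sha512_S0 are made up for this example and are not taken from any existing code. With only 32-bit general-purpose registers, each 64-bit rotate and xor has to be done on a register pair, which is where I'd guess the 64-bit NEON registers could help.

```c
#include <stdint.h>

/* Rotate a 64-bit word right by n bits; valid for 0 < n < 64.
   Hypothetical helper, named for this example only. */
static uint64_t
rotr64(uint64_t x, unsigned n)
{
  return (x >> n) | (x << (64 - n));
}

/* SHA-512 "big sigma 0" (FIPS 180-4):
   ROTR^28(x) xor ROTR^34(x) xor ROTR^39(x).
   On a 32-bit core each rotate works on two registers. */
static uint64_t
sha512_S0(uint64_t x)
{
  return rotr64(x, 28) ^ rotr64(x, 34) ^ rotr64(x, 39);
}
```

For example, sha512_S0(1) sets bits 36, 30 and 25, giving 0x1042000000.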
Regards, /Niels