ARM assembly

12 Mar 2013


I've now written some ARM assembly files for AES, SHA1 and SHA256. I
haven't tried any clever scheduling (it seems quite unpredictable), so I
think the win over the C code is mainly because I fit the important
values in registers. I have benchmarked on a Cortex-A9 system, with nice
but not earth-shattering improvements over the C implementation.

In my benchmarks, AES decrypt is significantly slower than AES encrypt,
for no reason that I understand, even though I have tried some other
variants of instruction scheduling.

I have one question on the ABI, if there's anyone on the list with more
ARM experience: In the ABI spec, register r9 is reserved for use for
things such as thread local storage. I'm almost sure that a leaf
function can use r9 like any other callee-save register, but I'd like to
have that confirmed before I make use of it. The potential problem case
is if the function is interrupted by a signal and the signal handler
somehow depends on having a valid value in r9, or if context switching
somehow assumes that r9 is never modified.

Any suggestions for what to try optimizing next? I imagine all
algorithms that make use of 64-bit data or are designed for SIMD
operations could be sped up a bit using the NEON SIMD instructions
(which I so far haven't made any use of). This includes SHA512, SHA3,
Serpent, Salsa, maybe Camellia.
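
To illustrate why 64-bit algorithms are a candidate: without NEON, a
64-bit word has to live in a pair of 32-bit registers, so each 64-bit
rotation (of the kind SHA512 uses heavily) takes several 32-bit
operations. The following is only a rough C sketch of that cost, not
code from the actual assembly files; the names u64_pair and rotr64_pair
are made up for this example.

```c
#include <stdint.h>

/* Hypothetical type: a 64-bit value split into two 32-bit halves,
   the way it would occupy a pair of 32-bit ARM registers. */
struct u64_pair { uint32_t lo, hi; };

/* Rotate a 64-bit value right by n bits (0 <= n < 64) using only
   32-bit shifts and ORs. */
static struct u64_pair
rotr64_pair(struct u64_pair x, unsigned n)
{
  struct u64_pair r;

  if (n >= 32)
    {
      /* Rotating by 32 just swaps the two halves. */
      uint32_t t = x.lo;
      x.lo = x.hi;
      x.hi = t;
      n -= 32;
    }
  if (n == 0)
    return x;  /* Avoid an undefined 32-bit shift by 32 below. */

  /* Each result half combines bits from both input halves. */
  r.lo = (x.lo >> n) | (x.hi << (32 - n));
  r.hi = (x.hi >> n) | (x.lo << (32 - n));
  return r;
}

/* Reference version using native 64-bit arithmetic, valid for
   0 < n < 64. */
static uint64_t
rotr64(uint64_t x, unsigned n)
{
  return (x >> n) | (x << (64 - n));
}
```

With registers that hold a full 64-bit lane, the split into halves and
the cross-half ORs go away, which is where I'd expect the speedup to
come from.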

Regards,
/Niels
-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
