nisse@lysator.liu.se (Niels Möller) writes:
> So instead of, e.g.,
>
>   uxtb T0, X, ror #8
>   ldr  [TABLE, T0, lsl #2]
>
> put the value 0x3fc in MASK, and do
>
>   and  T0, MASK, X, ror #6
>   ldr  [TABLE, T0]
>
> Eliminating a shift/rotate operation might even make the code faster.
I have tried this now. I get the same speed if I apply this trick to the main round transformations. But uxtb is also used in the final round, and I'm having some difficulty replacing it with "and" there without making the code slower.
Timing on the A9 appears to be very sensitive: adding a single instruction, even a nop, can slow it down a lot. In the final round we do substitutions via the sbox table, and hence need the mask 0xff rather than 0x3fc, and the single instruction needed to set that up seems remarkably expensive.
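For reference, here is a small C sketch of why the two index computations agree, and why the final round needs a different mask. It only models the arithmetic, not the timing. The helper names (ror32, offset_uxtb, offset_and, sbox_index_and, check_equivalence) are made up for illustration and are not taken from the assembly code.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate right by n bits, 0 < n < 32 (models the ARM "ror #n" operand). */
static uint32_t ror32(uint32_t x, unsigned n)
{
  return (x >> n) | (x << (32 - n));
}

/* Main rounds, original form: uxtb T0, X, ror #8 extracts a byte,
   and the load then scales it with lsl #2 (table of 32-bit words). */
static uint32_t offset_uxtb(uint32_t x)
{
  return (ror32(x, 8) & 0xff) << 2;
}

/* Main rounds, alternative form: and T0, MASK, X, ror #6 with
   MASK = 0x3fc yields the already scaled offset in one instruction. */
static uint32_t offset_and(uint32_t x)
{
  return ror32(x, 6) & 0x3fc;
}

/* Final round: the sbox is indexed by bytes, so there is no scaling
   and the mask has to be 0xff instead of 0x3fc. */
static uint32_t sbox_index_and(uint32_t x)
{
  return ror32(x, 8) & 0xff;
}

/* Compare the two forms on a sequence of pseudo-random inputs. */
void check_equivalence(void)
{
  uint32_t x = 0x12345678;
  unsigned i;
  for (i = 0; i < 1000; i++) {
    assert(offset_and(x) == offset_uxtb(x));
    assert(sbox_index_and(x) == ((x >> 8) & 0xff));
    x = x * 1664525u + 1013904223u; /* simple LCG step */
  }
}
```

For example, with X = 0x12345678 both main round forms give the offset 0x158 (byte 0x56 times 4), while the final round index is 0x56. The two masks differ, which is why the final round needs its own register setup.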
So I think we may need separate versions.
Regards,
/Niels