"Yuriy M. Kaminskiy" yumkam@gmail.com writes:
I've had another look, trying to understand how it differs.
> Does not use pre-rotated tables (as in AES_SMALL), so reduces d-cache footprint from 4.25K to 1K (enc)/1.25K (dec); completely unrolled, so increases i-cache footprint from 948b to 4416b (enc)/4032b (dec)
Not sure unrolling is that beneficial; Nettle's implementation does two rounds at a time (since, just like in your patch, source and destination registers alternate when doing a round), and that's so many instructions that loop overhead should be pretty small.
> As it completely replaces current implementation, I just attached new files (will post final version as a patch).
As you say, it doesn't use prerotated tables, but instead adds a ", ror #x" operand to the relevant eor instructions.
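To make the table trade-off concrete, here is a rough C sketch (not code from either implementation). It assumes each pre-rotated table is the first table rotated left by a multiple of 8 bits, which is the same as a ror by 24, 16 or 8. The names T0, T, build_tables, column_prerotated and column_rotated are made up for illustration, and the byte order within a table word is an assumption as well.

```c
#include <stdint.h>

uint32_t T0[256];     /* single table: 256 * 4 bytes = 1K */
uint32_t T[4][256];   /* pre-rotated variant: 4 * 1K = 4K */

static uint8_t rotl8(uint8_t x, unsigned n)
{
  return (uint8_t)((x << n) | (x >> (8 - n)));
}

static uint32_t rotl32(uint32_t x, unsigned n)
{
  return n ? (x << n) | (x >> (32 - n)) : x;
}

/* Fill T0 with (2s, s, s, 3s) for s = sbox[x], and T[k] with T0 rotated
   left by 8*k bits. The S-box is generated rather than typed in. */
void build_tables(void)
{
  uint8_t sbox[256];
  uint8_t p = 1, q = 1;
  unsigned x, k;

  do {
    /* p runs over all nonzero field elements, q is its inverse. */
    p = (uint8_t)(p ^ (p << 1) ^ ((p & 0x80) ? 0x1b : 0));
    q ^= (uint8_t)(q << 1);
    q ^= (uint8_t)(q << 2);
    q ^= (uint8_t)(q << 4);
    q ^= (q & 0x80) ? 0x09 : 0;
    sbox[p] = (uint8_t)(0x63 ^ q ^ rotl8(q, 1) ^ rotl8(q, 2)
                        ^ rotl8(q, 3) ^ rotl8(q, 4));
  } while (p != 1);
  sbox[0] = 0x63;

  for (x = 0; x < 256; x++) {
    uint8_t s = sbox[x];
    uint8_t s2 = (uint8_t)((s << 1) ^ ((s & 0x80) ? 0x1b : 0));
    uint8_t s3 = (uint8_t)(s2 ^ s);
    T0[x] = (uint32_t)s2 | ((uint32_t)s << 8)
          | ((uint32_t)s << 16) | ((uint32_t)s3 << 24);
    for (k = 0; k < 4; k++)
      T[k][x] = rotl32(T0[x], 8 * k);
  }
}

/* One output column using four pre-rotated tables. */
uint32_t column_prerotated(uint32_t w0, uint32_t w1, uint32_t w2, uint32_t w3)
{
  return T[0][w0 & 0xff] ^ T[1][(w1 >> 8) & 0xff]
       ^ T[2][(w2 >> 16) & 0xff] ^ T[3][w3 >> 24];
}

/* Same column using only T0; each rotation corresponds to what a
   ", ror #x" operand on the eor would do. */
uint32_t column_rotated(uint32_t w0, uint32_t w1, uint32_t w2, uint32_t w3)
{
  return T0[w0 & 0xff] ^ rotl32(T0[(w1 >> 8) & 0xff], 8)
       ^ rotl32(T0[(w2 >> 16) & 0xff], 16) ^ rotl32(T0[w3 >> 24], 24);
}
```

After build_tables(), both column functions return the same word for any inputs, so the only cost of dropping the three extra tables is the rotation, which on ARM can be folded into the eor operand.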
Load and store of the cleartext and ciphertext bytes is different (and I have some difficulty following it).
Masking to get table indices is the same as in nettle's arm/aes-encrypt-internal.asm, while nettle's v6 code uses the uxtb instruction, which saves one register (which the code doesn't take much advantage of, though).
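For comparison, the two ways of getting a table index can be sketched in C roughly as follows. This is an illustration of the idea, not a transcription of either assembly file: the function names are hypothetical, and the 0x3fc mask (a byte index already scaled by 4) is my assumption about how the masking variant would typically be arranged.

```c
#include <stdint.h>

/* Mask style: shift the word so byte k lands in bits 2..9, then AND with
   0x3fc. The result is a byte offset into a table of 32-bit words. In
   assembly the mask constant has to live in a register, since the other
   operand of the "and" is the shifted word. */
uint32_t offset_by_mask(uint32_t w, unsigned k)
{
  const uint32_t mask = 0x3fc;
  uint32_t shifted = (k == 0) ? (w << 2) : (w >> (8 * k - 2));
  return shifted & mask;
}

/* Extract style, roughly what uxtb with a rotation gives: byte k of w,
   zero-extended. No mask register is needed; scaling by 4 is left to
   the addressing mode of the load. */
uint32_t index_by_extract(uint32_t w, unsigned k)
{
  return (uint8_t)(w >> (8 * k));
}

/* Both variants select the same table entry. */
uint32_t lookup_by_mask(const uint32_t *table, uint32_t w, unsigned k)
{
  return table[offset_by_mask(w, k) / 4];
}

uint32_t lookup_by_extract(const uint32_t *table, uint32_t w, unsigned k)
{
  return table[index_by_extract(w, k)];
}
```

For k in 0..3, offset_by_mask(w, k) equals 4 * index_by_extract(w, k), so the difference is only in which registers are tied up, not in which entry is loaded.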
The code in your patch has more careful instruction scheduling, e.g., interleaving addition of roundkeys with the sbox table lookups. Nettle's code is written with only a single temporary register used for everything, which makes it impossible to interleave independent parts of the mangling, whereas your patch seems to alternate between three different temporaries.
Regards, /Niels