On Raspberry Pi 3B+ (Cortex-A53 @ 1.4GHz):

Before:
 aes128         |  nanosecs/byte   mebibytes/sec   cycles/byte
        ECB enc |     39.58 ns/B     24.10 MiB/s         - c/B
        ECB dec |     39.57 ns/B     24.10 MiB/s         - c/B

After:
        ECB enc |     15.24 ns/B     62.57 MiB/s         - c/B
        ECB dec |     15.68 ns/B     60.80 MiB/s         - c/B
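The cycles/byte column is empty in the output above, so here is a small sketch of how the other two columns relate and what the missing one would roughly be. The function names are hypothetical, and the cycles/byte figure assumes the core really ran at the nominal 1.4GHz for the whole benchmark (no throttling), which the numbers above do not confirm.

```c
#include <assert.h>

/* Convert a per-byte time in nanoseconds to throughput in MiB/s. */
double mib_per_sec(double ns_per_byte)
{
    assert(ns_per_byte > 0.0);
    return 1e9 / ns_per_byte / (1024.0 * 1024.0);
}

/* Estimate cycles per byte.  ASSUMPTION: the core runs at a fixed
 * clock of clock_ghz GHz during the measurement. */
double cycles_per_byte(double ns_per_byte, double clock_ghz)
{
    assert(ns_per_byte > 0.0 && clock_ghz > 0.0);
    return ns_per_byte * clock_ghz;
}

/* Ratio of the old per-byte time to the new one. */
double speedup(double before_ns_per_byte, double after_ns_per_byte)
{
    assert(before_ns_per_byte > 0.0 && after_ns_per_byte > 0.0);
    return before_ns_per_byte / after_ns_per_byte;
}
```

Under that assumption, cycles_per_byte(39.58, 1.4) is about 55.4 c/B before and cycles_per_byte(15.24, 1.4) about 21.3 c/B after, and speedup(39.58, 15.24) is about 2.6 for ECB encryption.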
Passes the nettle regression tests (little-endian only, though).
Like AES_SMALL, it does not use pre-rotated tables, which reduces the d-cache footprint from 4.25K to 1K (enc)/1.25K (dec). It is completely unrolled, which increases the i-cache footprint from 948 bytes to 4416 bytes (enc)/4032 bytes (dec).
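For readers unfamiliar with the table trade-off, here is a minimal C sketch of the idea (not the attached assembly). The function names are hypothetical, and the byte order inside a table word is an assumption chosen for a little-endian layout. Four pre-rotated tables take 4 x 256 x 4 = 4096 bytes, which together with a 256-byte S-box gives the 4.25K figure; a single table is 1024 bytes, and the other three lookups are recovered by rotating the word. On 32-bit ARM that rotation can typically be folded into the shifted operand of the EOR, so it costs little. For brevity the sketch takes all four index bytes from one word, whereas a real round takes them from different state columns.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word left by n bits, 0 < n < 32. */
uint32_t rotl32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x << n) | (x >> (32 - n));
}

/* Multiply two elements of GF(2^8) modulo the AES polynomial 0x11b. */
uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0));
        b >>= 1;
    }
    return r;
}

/* AES S-box: multiplicative inverse (0 maps to 0), then the affine map. */
uint8_t sbox_byte(uint8_t x)
{
    uint8_t inv = 0;
    for (unsigned c = 1; x != 0 && c < 256; c++) {
        if (gf_mul(x, (uint8_t)c) == 1) {
            inv = (uint8_t)c;
            break;
        }
    }
    uint8_t s = inv;
    for (unsigned i = 1; i <= 4; i++)
        s ^= (uint8_t)((inv << i) | (inv >> (8 - i)));
    return (uint8_t)(s ^ 0x63);
}

/* Build T0 (bytes 2s, s, s, 3s from low to high; assumed layout)
 * and its three byte-rotated copies T1..T3. */
void build_tables(uint32_t full[4][256])
{
    for (unsigned x = 0; x < 256; x++) {
        uint8_t s = sbox_byte((uint8_t)x);
        uint32_t t0 = (uint32_t)gf_mul(s, 2) | ((uint32_t)s << 8)
                    | ((uint32_t)s << 16) | ((uint32_t)gf_mul(s, 3) << 24);
        full[0][x] = t0;
        for (unsigned k = 1; k < 4; k++)
            full[k][x] = rotl32(t0, 8 * k);
    }
}

/* Four lookups into four pre-rotated tables (4096 bytes of tables). */
uint32_t column_prerotated(uint32_t full[4][256], uint32_t w)
{
    return full[0][w & 0xff] ^ full[1][(w >> 8) & 0xff]
         ^ full[2][(w >> 16) & 0xff] ^ full[3][w >> 24];
}

/* Same result from one table (1024 bytes) plus three rotations. */
uint32_t column_single(const uint32_t t0[256], uint32_t w)
{
    return t0[w & 0xff] ^ rotl32(t0[(w >> 8) & 0xff], 8)
         ^ rotl32(t0[(w >> 16) & 0xff], 16) ^ rotl32(t0[w >> 24], 24);
}
```

After build_tables(full), column_prerotated(full, w) and column_single(full[0], w) return the same word for any w, so only full[0] has to stay in the d-cache. With this layout the plain S-box byte sits inside each T0 entry, which would explain why encryption needs no separate S-box; decryption presumably needs an extra 256-byte inverse S-box for the final round, which would account for 1.25K.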
As it completely replaces the current implementation, I have just attached the new files (I will post the final version as a patch).
P.S. Yes, I tried converting the macros to m4: complete failure (no named parameters, problems with more than 9 arguments, weird expansion rules), so I fell back to good ol' gas. Sorry.
P.P.S. With this change, gcm/neon and (to-be-published) chacha_blocks/neon, gnutls-cli --benchmark-ciphers gives:

Before:
Checking cipher-MAC combinations, payload size: 16384
        AES-128-GCM          13.56 MB/sec
        CHACHA20-POLY1305    68.26 MB/sec
        AES-128-CBC-SHA1     16.72 MB/sec
        AES-128-CBC-SHA256   15.07 MB/sec

After:
        AES-128-GCM          35.32 MB/sec
        CHACHA20-POLY1305    94.94 MB/sec
        AES-128-CBC-SHA1     27.53 MB/sec
        AES-128-CBC-SHA256   23.30 MB/sec