On Raspberry Pi 3B+ (Cortex-A53 @ 1.4GHz):
Before:
aes128 | nanosecs/byte mebibytes/sec cycles/byte
ECB enc | 39.58 ns/B 24.10 MiB/s - c/B
ECB dec | 39.57 ns/B 24.10 MiB/s - c/B
After:
aes128 | nanosecs/byte mebibytes/sec cycles/byte
ECB enc | 15.24 ns/B 62.57 MiB/s - c/B
ECB dec | 15.68 ns/B 60.80 MiB/s - c/B
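The c/B column is empty in both runs. As a rough cross-check, and assuming the core really ran at the nominal 1.4 GHz for the whole benchmark (no throttling), cycles/byte is just ns/byte times the clock in GHz. A small C sketch of that arithmetic (the helper names are made up for illustration):

```c
/* Convert a measured cost in ns/byte to cycles/byte.
 * clock_ghz is an assumption: the nominal clock, not a measured one. */
double cycles_per_byte(double ns_per_byte, double clock_ghz)
{
    return ns_per_byte * clock_ghz;
}

/* Convert ns/byte to throughput in MiB/s (1 MiB = 1024 * 1024 bytes). */
double mib_per_sec(double ns_per_byte)
{
    return 1e9 / ns_per_byte / (1024.0 * 1024.0);
}

/* Speedup factor between two ns/byte measurements. */
double speedup(double before_ns, double after_ns)
{
    return before_ns / after_ns;
}
```

With the ECB enc numbers above, this gives about 55.4 c/B before and 21.3 c/B after, i.e. roughly a 2.6x speedup.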
Passes the nettle regression tests (little-endian only, though).
It does not use pre-rotated tables (as in AES_SMALL), which reduces the
d-cache footprint from 4.25K to 1K (enc)/1.25K (dec).
It is completely unrolled, which increases the i-cache footprint
from 948 bytes to 4416 bytes (enc)/4032 bytes (dec).
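To illustrate the table trade-off (this is not the attached assembly, just a C sketch of the general idea): the three extra tables are byte rotations of the first one, so a single 1K table plus a rotate per lookup gives the same result. The byte layout (2s, s, s, 3s from the low byte up), the rotation direction and all function names here are my assumptions for the example; the round key XOR is left out.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (n may be 0). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return n ? (x << n) | (x >> (32 - n)) : x;
}

/* Rotate a byte left by n bits, n in 1..7. */
static uint8_t rotl8(uint8_t x, unsigned n)
{
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

/* Multiply by 2 in GF(2^8), reduced modulo the AES polynomial 0x11b. */
static uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
}

/* Generic GF(2^8) multiplication (shift-and-add). */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        a = xtime(a);
        b >>= 1;
    }
    return r;
}

/* AES S-box: multiplicative inverse (brute force), then the affine map. */
static uint8_t aes_sbox(uint8_t x)
{
    uint8_t inv = 0;
    for (unsigned c = 1; c < 256 && x != 0; c++) {
        if (gf_mul(x, (uint8_t)c) == 1) {
            inv = (uint8_t)c;
            break;
        }
    }
    return (uint8_t)(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^
                     rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63);
}

/* Build T[0] and its three pre-rotated copies T[1..3] (4K in total). */
void aes_build_tables(uint32_t T[4][256])
{
    for (unsigned x = 0; x < 256; x++) {
        uint8_t s = aes_sbox((uint8_t)x);
        uint8_t s2 = xtime(s);
        uint32_t w = (uint32_t)s2 | ((uint32_t)s << 8) |
                     ((uint32_t)s << 16) | ((uint32_t)(s2 ^ s) << 24);
        for (unsigned i = 0; i < 4; i++)
            T[i][x] = rotl32(w, 8 * i);
    }
}

/* One output column of a round using four pre-rotated tables. */
uint32_t aes_column_four_tables(uint32_t T[4][256], uint32_t w0,
                                uint32_t w1, uint32_t w2, uint32_t w3)
{
    return T[0][w0 & 0xff] ^ T[1][(w1 >> 8) & 0xff] ^
           T[2][(w2 >> 16) & 0xff] ^ T[3][w3 >> 24];
}

/* Same column using only the first table and explicit rotations. */
uint32_t aes_column_one_table(const uint32_t *T0, uint32_t w0,
                              uint32_t w1, uint32_t w2, uint32_t w3)
{
    return T0[w0 & 0xff] ^ rotl32(T0[(w1 >> 8) & 0xff], 8) ^
           rotl32(T0[(w2 >> 16) & 0xff], 16) ^ rotl32(T0[w3 >> 24], 24);
}
```

On 32-bit ARM the rotation can be folded into the XOR itself via the shifted second operand (e.g. `eor r0, r0, r1, ror #24`), so dropping the extra tables need not cost extra instructions.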
As it completely replaces the current implementation, I have just attached
the new files (I will post the final version as a patch).
P.S. Yes, I tried to convert the macros to m4: complete failure (no named
parameters, problems with more than 9 arguments, weird expansion rules),
so I fell back to good ol' gas. Sorry.
P.P.S. With this change, gcm/neon and (to-be-published) chacha_blocks/neon,
gnutls-cli --benchmark-ciphers reports:
Before:
Checking cipher-MAC combinations, payload size: 16384
AES-128-GCM 13.56 MB/sec
CHACHA20-POLY1305 68.26 MB/sec
AES-128-CBC-SHA1 16.72 MB/sec
AES-128-CBC-SHA256 15.07 MB/sec
After:
AES-128-GCM 35.32 MB/sec
CHACHA20-POLY1305 94.94 MB/sec
AES-128-CBC-SHA1 27.53 MB/sec
AES-128-CBC-SHA256 23.30 MB/sec