On a Raspberry Pi 3B+ (Cortex-A53 @ 1.4 GHz):

Before:
 aes128  | nanosecs/byte   mebibytes/sec   cycles/byte
 ECB enc |     39.58 ns/B     24.10 MiB/s          - c/B
 ECB dec |     39.57 ns/B     24.10 MiB/s          - c/B

After:
 ECB enc |     15.24 ns/B     62.57 MiB/s          - c/B
 ECB dec |     15.68 ns/B     60.80 MiB/s          - c/B
Passes the nettle regression test (little-endian only, though).
Does not use pre-rotated tables (same as AES_SMALL), so it reduces the d-cache footprint from 4.25K to 1K (enc)/1.25K (dec); it is completely unrolled, so it increases the i-cache footprint from 948 bytes to 4416 (enc)/4032 (dec).
As it completely replaces the current implementation, I just attached the new files (I will post the final version as a patch).
P.S. Yes, I tried converting the macros to m4: complete failure (no named parameters, problems with more than 9 arguments, weird expansion rules); so I fell back to good ol' gas. Sorry.
P.P.S. With this change, gcm/neon, and the (to-be-published) chacha_blocks/neon, gnutls-cli --benchmark-ciphers:

Before:
Checking cipher-MAC combinations, payload size: 16384
           AES-128-GCM 13.56 MB/sec
     CHACHA20-POLY1305 68.26 MB/sec
      AES-128-CBC-SHA1 16.72 MB/sec
    AES-128-CBC-SHA256 15.07 MB/sec

After:
           AES-128-GCM 35.32 MB/sec
     CHACHA20-POLY1305 94.94 MB/sec
      AES-128-CBC-SHA1 27.53 MB/sec
    AES-128-CBC-SHA256 23.30 MB/sec
"Yuriy M. Kaminskiy" yumkam@gmail.com writes:
On a Raspberry Pi 3B+ (Cortex-A53 @ 1.4 GHz):

Before:
 aes128  | nanosecs/byte   mebibytes/sec   cycles/byte
 ECB enc |     39.58 ns/B     24.10 MiB/s          - c/B
 ECB dec |     39.57 ns/B     24.10 MiB/s          - c/B

After:
 ECB enc |     15.24 ns/B     62.57 MiB/s          - c/B
 ECB dec |     15.68 ns/B     60.80 MiB/s          - c/B
Passes the nettle regression test (little-endian only, though).
Cool!
Does not use pre-rotated tables (same as AES_SMALL), so it reduces the d-cache footprint from 4.25K to 1K (enc)/1.25K (dec);
We could figure out a way to exclude the unneeded tables in builds that unconditionally use this code.
I think I tried this years ago, and found it slower, recorded in this comment:
C It's tempting to use eor with rotation, but that's slower.
But things may have changed, or you're doing it in a different way than I tried. Have you benchmarked with small and large tables?
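For concreteness, the difference is roughly the following (a minimal sketch with hypothetical register names, not either library's actual code; the exact rotate amounts depend on the table layout):

	C Variant A: four pre-rotated tables, 4 KiB of d-cache
	ldr	T0, [TABLE, I0, lsl #2]		C table[0][I0]
	add	T1, TABLE, #1024		C table[1] is a rotated copy of table[0]
	ldr	T1, [T1, I1, lsl #2]
	eor	W, T0, T1

	C Variant B: one 1 KiB table, rotation folded into the eor
	ldr	T0, [TABLE, I0, lsl #2]
	ldr	T1, [TABLE, I1, lsl #2]
	eor	W, T0, T1, ror #24		C rotate varies with the byte position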
it is completely unrolled, so it increases the i-cache footprint from 948 bytes to 4416 (enc)/4032 (dec)
Do you have any numbers for the performance gain from unrolling?
With complete unrolling, it may be good to have separate entry points (and possibly separate files) for aes128, aes192, and aes256. I've been considering doing that for x86_64/aesni (and then the old-style aes_encrypt needs to be changed to not use the _aes_encrypt function with a rounds argument; I have a branch doing that lying around somewhere).
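One possible shape for that, as a purely hypothetical sketch (the entry point names and dispatch are for illustration; a fully unrolled version would instead let the smaller key sizes branch into the tail of the aes256 round sequence):

	PROLOGUE(nettle_aes128_encrypt)
		mov	ROUNDS, #10
		b	.Laes_encrypt_body
	EPILOGUE(nettle_aes128_encrypt)
	PROLOGUE(nettle_aes256_encrypt)
		mov	ROUNDS, #14
		C fall through to the shared body
	.Laes_encrypt_body:
		...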
P.S. Yes, I tried converting the macros to m4: complete failure (no named parameters, problems with more than 9 arguments, weird expansion rules); so I fell back to good ol' gas. Sorry.
No named arguments may be a bit annoying. At least for the AES code, I see no macros with more than 9 arguments.
define(<KEYSCHEDULE_REVERSED>,<yes>)
define(<IF_KEYSCHEDULE_REVERSED>,<ifelse(
	KEYSCHEDULE_REVERSED,yes,<$1>,
	KEYSCHEDULE_REVERSED,no,<$2>)>)
What is this for?
C helper macros
.macro ldr_unaligned_le rout rsrc offs rtmp
	ldrb	\rout, [\rsrc, #((\offs) + 0)]
	ldrb	\rtmp, [\rsrc, #((\offs) + 1)]
	orr	\rout, \rout, \rtmp, lsl #8
	ldrb	\rtmp, [\rsrc, #((\offs) + 2)]
	orr	\rout, \rout, \rtmp, lsl #16
	ldrb	\rtmp, [\rsrc, #((\offs) + 3)]
	orr	\rout, \rout, \rtmp, lsl #24
.endm
A different way to read unaligned data is to read aligned words, and rotate and shift on the fly. There's an example of this in arm/v6/sha256-compress.asm, using ldm, sel and ror, plus some setup code and one extra register for keeping left-over bytes.
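The idea, in a simplified form (assuming a fixed one-byte misalignment and little-endian order; the sha256 code handles the general case, and SRC is a hypothetical register alias):

	bic	r2, SRC, #3		C round the pointer down to a word boundary
	ldm	r2, {r4, r5}		C two aligned loads
	mov	r4, r4, lsr #8		C discard the byte before the buffer
	orr	r4, r4, r5, lsl #24	C splice in the first byte of the next word
	C r4 now holds the unaligned word at SRC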
PROLOGUE(_nettle_aes_decrypt)
	.cfi_startproc
	teq	PARAM_LENGTH, #0
	bxeq	lr
	push	{r0,r3,%r4-%r11, %ip, %lr}
	.cfi_adjust_cfa_offset 48
	.cfi_rel_offset r0, 0	C PARAM_LENGTH
	.cfi_rel_offset r3, 4	C PARAM_ROUNDS
	.cfi_rel_offset r4, 8
	.cfi_rel_offset r5, 12
	.cfi_rel_offset r6, 16
	.cfi_rel_offset r7, 20
	.cfi_rel_offset r8, 24
	.cfi_rel_offset r9, 28
	.cfi_rel_offset r10, 32
	.cfi_rel_offset r11, 36
	.cfi_rel_offset ip, 40
	.cfi_rel_offset lr, 44
Are these .cfi_* pseudo-ops essential? I'm afraid I'm ignorant of the fine details here; I just see from the gas manual that they appear to be related to stack unwinding.
Regards, /Niels
On 17.03.2019 11:08, Niels Möller wrote:
"Yuriy M. Kaminskiy" yumkam@gmail.com writes:
On a Raspberry Pi 3B+ (Cortex-A53 @ 1.4 GHz):

Before:
 aes128  | nanosecs/byte   mebibytes/sec   cycles/byte
 ECB enc |     39.58 ns/B     24.10 MiB/s          - c/B
 ECB dec |     39.57 ns/B     24.10 MiB/s          - c/B

After:
 ECB enc |     15.24 ns/B     62.57 MiB/s          - c/B
 ECB dec |     15.68 ns/B     60.80 MiB/s          - c/B
Passes the nettle regression test (little-endian only, though).
Cool!
Does not use pre-rotated tables (same as AES_SMALL), so it reduces the d-cache footprint from 4.25K to 1K (enc)/1.25K (dec);
We could figure out a way to exclude the unneeded tables in builds that unconditionally use this code.
I think I tried this years ago, and found it slower, recorded in this comment:
C It's tempting to use eor with rotation, but that's slower.
But things may have changed, or you're doing it in a different way than I tried. Have you benchmarked with small and large tables?
Well, it is not my code. I just took it from libgcrypt and adapted it to nettle (reordered arguments, added a loop over blocks, changed the decrypt key schedule to nettle's), without any major changes.
it is completely unrolled, so it increases the i-cache footprint from 948 bytes to 4416 (enc)/4032 (dec)
Do you have any numbers for the performance gain from unrolling?
With complete unrolling, it may be good to have separate entry points (and possibly separate files) for aes128, aes192, and aes256.
Right now it at least reuses some code: if an application calls both aes128 and aes256, some i-cache is saved.
I've been considering doing that for x86_64/aesni (and then the old-style aes_encrypt needs to be changed to not use the _aes_encrypt function with a rounds argument; I have a branch doing that lying around somewhere).
P.S. Yes, I tried converting the macros to m4: complete failure (no named parameters, problems with more than 9 arguments, weird expansion rules); so I fell back to good ol' gas. Sorry.
No named arguments may be a bit annoying. At least for the AES code, I see no macros with more than 9 arguments.
Some with 10:

	.macro do_encround next_r ra rb rc rd rna rnb rnc rnd preload_key
	.macro encround round ra rb rc rd rna rnb rnc rnd preload_key
(And I failed to grok the m4 way of making an indirect macro call.)
define(<KEYSCHEDULE_REVERSED>,<yes>)
define(<IF_KEYSCHEDULE_REVERSED>,<ifelse(
	KEYSCHEDULE_REVERSED,yes,<$1>,
	KEYSCHEDULE_REVERSED,no,<$2>)>)
What is this for?
See the FIXME comment in aes-invert-internal.c; the original gcrypt code used an unswapped key schedule and walked backwards over it in aes_decrypt. For the nettle port, I had to change this, but I left the original code as an option (in case nettle some day decides to follow the FIXME and switch).
BTW, mtable in aes-invert-internal.c is exactly the same as _aes_decrypt_table.table[0]; it would be good to merge them.
(Another trick used in the last round of gcrypt's aes-encrypt:

	_aes_encrypt_table.sbox[i] == (_aes_encrypt_table.table[0][i] >>  8) & 0xff
	_aes_encrypt_table.sbox[i] == (_aes_encrypt_table.table[0][i] >> 16) & 0xff

so it does not use sbox at all, saving 256 bytes of d-cache footprint [but the same trick does not work for aes-decrypt].)
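Using that identity, the last round can look roughly like this (hypothetical register names, my own sketch of the idea rather than gcrypt's code; uxtb with rotate needs armv6):

	ldr	T, [TABLE, IDX, lsl #2]	C wide table entry; bytes 1 and 2 both hold sbox[IDX]
	uxtb	T, T, ror #8		C extract byte 1, i.e. sbox[IDX]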
C helper macros
.macro ldr_unaligned_le rout rsrc offs rtmp
	ldrb	\rout, [\rsrc, #((\offs) + 0)]
	ldrb	\rtmp, [\rsrc, #((\offs) + 1)]
	orr	\rout, \rout, \rtmp, lsl #8
	ldrb	\rtmp, [\rsrc, #((\offs) + 2)]
	orr	\rout, \rout, \rtmp, lsl #16
	ldrb	\rtmp, [\rsrc, #((\offs) + 3)]
	orr	\rout, \rout, \rtmp, lsl #24
.endm
A different way to read unaligned data is to read aligned words, and rotate and shift on the fly. There's an example of this in arm/v6/sha256-compress.asm, using ldm, sel and ror, plus some setup code and one extra register for keeping left-over bytes.
Actually, this is unused in the last version of the code: armv6 supports unaligned ldr, so I replaced

	if (aligned) { ldm; IF_BE(rev); } else { ldr_unaligned_le }

with a simple unconditional ldr; IF_BE(rev). That was a bit faster with misaligned buffers, and almost the same speed with aligned buffers. I've left the macro in, in case someone wants to use this on armv5 (or it turns out to be slower on some other CPU).
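For reference, the replacement load then looks roughly like this (following the patch's IF_BE/m4 conventions; unaligned ldr on armv6+ assumes the usual Linux configuration with SCTLR.A clear):

	ldr	r4, [SRC], #+4		C armv6+ permits unaligned ldr
IF_BE(<	rev	r4, r4>)		C byte-swap to host order on big-endian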
PROLOGUE(_nettle_aes_decrypt)
	.cfi_startproc
	teq	PARAM_LENGTH, #0
	bxeq	lr
	push	{r0,r3,%r4-%r11, %ip, %lr}
	.cfi_adjust_cfa_offset 48
	.cfi_rel_offset r0, 0	C PARAM_LENGTH
	.cfi_rel_offset r3, 4	C PARAM_ROUNDS
	.cfi_rel_offset r4, 8
...
Are these .cfi_* pseudo-ops essential? I'm afraid I'm ignorant of the fine details here; I just see from the gas manual that they appear to be related to stack unwinding.
They are useful for gdb, valgrind, etc. to produce a sensible backtrace and to move up/down the call chain (without losing values of callee-saved registers, and so on), and AFAIK they add no runtime overhead, so I add them whenever possible (FWIW, they were not present in the original gcrypt code).
P.S. There was a stupid last-minute error in the posted aes-encrypt; patch attached.
"Yuriy M. Kaminskiy" yumkam@gmail.com writes:
I've had another look, trying to understand how it differs.
Does not use pre-rotated tables (same as AES_SMALL), so it reduces the d-cache footprint from 4.25K to 1K (enc)/1.25K (dec); it is completely unrolled, so it increases the i-cache footprint from 948 bytes to 4416 (enc)/4032 (dec).
Not sure unrolling is that beneficial; Nettle's implementation does two rounds at a time (since, just like in your patch, source and destination registers alternate when doing a round), and that's so many instructions that the loop overhead should be pretty small.
As it completely replaces the current implementation, I just attached the new files (I will post the final version as a patch).
As you say, it doesn't use pre-rotated tables, but instead adds a ", ror #x" operand to the relevant eor instructions.
Load and store of the cleartext and ciphertext bytes is different (and I have some difficulty following it).
Masking to get table indices is the same as in nettle's arm/aes-encrypt-internal.asm, while nettle's v6 code uses the uxtb instruction, which saves one register (which the code doesn't take much advantage of, though).
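For readers comparing the two approaches, the contrast is roughly the following (hypothetical register names; here extracting byte 2 of a state word W as a table index):

	C register-mask style (arm/aes-encrypt-internal.asm): index pre-scaled by 4
	and	I, MASK, W, lsr #14	C MASK holds 0x3fc
	ldr	T, [TABLE, I]
	C v6 style: uxtb with rotate, no mask register needed
	uxtb	I, W, ror #16
	ldr	T, [TABLE, I, lsl #2]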
The code in your patch has more careful instruction scheduling, e.g., interleaving the addition of round keys with the sbox table lookups. Nettle's code is written with only a single temporary register used for everything, which makes it impossible to interleave independent parts of the mangling, while your patch alternates between three different temporaries.
Regards, /Niels
On Sun, Mar 24, 2019 at 08:45:28PM +0100, Niels Möller wrote:
"Yuriy M. Kaminskiy" yumkam@gmail.com writes:
I've had another look, trying to understand how it differs.
Does not use pre-rotated tables (same as AES_SMALL), so it reduces the d-cache footprint from 4.25K to 1K (enc)/1.25K (dec); it is completely unrolled, so it increases the i-cache footprint from 948 bytes to 4416 (enc)/4032 (dec).
Not sure unrolling is that beneficial; Nettle's implementation does two rounds at a time (since, just like in your patch, source and destination registers alternate when doing a round), and that's so many instructions that the loop overhead should be pretty small.
As the gcrypt implementation uses all registers, nothing is left for keeping a round counter, so there isn't much choice (I'll probably try spilling it on the stack and auto-{inc,dec}rementing the key pointer later).
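A rolled variant might look something like this untested sketch (FRAME_CTR is a hypothetical stack slot; the special final round is ignored here):

	str	ROUNDS, [sp, #FRAME_CTR]
.Lround_loop:
	C ... two rounds of work, round keys consumed via ldm KEY!, {k0,k1,k2,k3} ...
	ldr	T0, [sp, #FRAME_CTR]
	subs	T0, T0, #2
	str	T0, [sp, #FRAME_CTR]
	bhi	.Lround_loop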
As it completely replaces the current implementation, I just attached the new files (I will post the final version as a patch).
As you say, it doesn't use pre-rotated tables, but instead adds a ", ror #x" operand to the relevant eor instructions.
Load and store of the cleartext and ciphertext bytes is different (and I have some difficulty following it).
Masking to get table indices is the same as in nettle's arm/aes-encrypt-internal.asm, while nettle's v6 code uses the uxtb instruction, which saves one register (which the code doesn't take much advantage of, though).
The code in your patch has more careful instruction scheduling, e.g., interleaving the addition of round keys with the sbox table lookups. Nettle's code is written with only a single temporary register used for everything, which makes it impossible to interleave independent parts of the mangling, while your patch alternates between three different temporaries.
P.S.
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
BTW, short PGP keyids considered harmful.