Hi Niels,
Here is version 2 of the AES/GCM stitched patch. The stitched code is all in assembly, and m4 macros are used. Overall performance improved by around 110% for encrypt and 120% for decrypt, respectively. Please see the attached patch and AES benchmark.
Thanks. -Danny
On Nov 22, 2023, at 2:27 AM, Niels Möller nisse@lysator.liu.se wrote:
Danny Tsen dtsen@us.ibm.com writes:
Interleaving at the instruction level may be a good option, but due to the PPC instruction pipeline it needs sufficient registers/vectors. Using the same vectors in successive instructions may cost extra cycles; in that case more vector/scalar registers get involved and the whole vector register assignment may have to change. That's the reason I avoided it in this case.
To investigate the potential, I would suggest some experiments with software pipelining.
Write a loop to do 4 blocks of ctr-aes128 at a time, fully unrolling the round loop. I think that should be 44 instructions of aes mangling, plus instructions to set up the counter input, and do the final xor and endianness things with the message. Arrange so that it loads the AES state in a set of registers we can call A, operating in-place on these registers. But at the end, arrange the XORing so that the final cryptotext is located in a different set of registers, B.
Then, write the instructions to do ghash using the B registers as input, I think that should be about 20-25 instructions. Interleave those as well as possible with the AES instructions (say, two aes instructions, one ghash instruction, etc).
Software pipelining means that each iteration of the loop does aes-ctr on four blocks, plus ghash on the output of the four *previous* blocks (so one needs extra code outside of the loop to deal with the first and last 4 blocks). Decrypt processing should be simpler.
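In rough C, the loop structure would look something like the sketch below. The block4_t type and the two helper functions are made-up names, just to show the data dependencies; the real thing would of course be assembly, with the helpers' instructions interleaved by hand.

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  /* Made-up helpers, not Nettle functions: encrypt four counter blocks
     and xor with src (advancing the counter), and fold four ciphertext
     blocks into the ghash accumulator x. */
  typedef struct { uint8_t b[4][16]; } block4_t;
  void aes128_ctr_4blocks (const void *rkeys, uint8_t *ctr,
                           block4_t *cipher, const uint8_t *src);
  void ghash_4blocks (const void *key, uint8_t *x, const block4_t *cipher);

  /* Software-pipelined encrypt; assumes n4 >= 1 groups of four blocks. */
  static void
  gcm_aes128_stitched (const void *rkeys, const void *key,
                       uint8_t *ctr, uint8_t *x, size_t n4,
                       uint8_t *dst, const uint8_t *src)
  {
    block4_t prev;
    size_t i;

    /* Prologue: encrypt the first four blocks; nothing to ghash yet. */
    aes128_ctr_4blocks (rkeys, ctr, &prev, src);
    memcpy (dst, &prev, sizeof (prev));

    for (i = 1; i < n4; i++)
      {
        block4_t cur;
        /* These two calls have no data dependency, so in assembly their
           instructions can be interleaved freely. */
        aes128_ctr_4blocks (rkeys, ctr, &cur, src + 64*i);
        ghash_4blocks (key, x, &prev);
        memcpy (dst + 64*i, &cur, sizeof (cur));
        prev = cur;
      }

    /* Epilogue: ghash the last four ciphertext blocks. */
    ghash_4blocks (key, x, &prev);
  }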
Then you can benchmark that loop in isolation. It doesn't need to be the complete function, the handling of first and last blocks can be omitted, and it doesn't even have to be completely correct, as long as it's the right instruction mix and the right data dependencies. The benchmark should give a good idea for the potential speedup, if any, from instruction-level interleaving.
I would hope 4-way is doable with available vector registers (and this inner loop should be less than 100 instructions, so not too unmanageable). Going up to 8-way (like the current AES code) would also be interesting, but as you say, you might have a shortage of registers. If you have to copy state between registers and memory in each iteration of an 8-way loop (which it looks like you also have to do in your current patch), that overhead cost may outweigh the gains you have from more independence in the AES rounds.
Regards, /Niels
-- Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677. Internet email is subject to wholesale government surveillance.
Danny Tsen dtsen@us.ibm.com writes:
Here is version 2 of the AES/GCM stitched patch. The stitched code is all in assembly, and m4 macros are used. Overall performance improved by around 110% for encrypt and 120% for decrypt, respectively. Please see the attached patch and AES benchmark.
Thanks, comments below.
--- a/fat-ppc.c
+++ b/fat-ppc.c
@@ -226,6 +231,8 @@ fat_init (void)
          _nettle_ghash_update_arm64() */
       _nettle_ghash_set_key_vec = _nettle_ghash_set_key_ppc64;
       _nettle_ghash_update_vec = _nettle_ghash_update_ppc64;
+      _nettle_ppc_gcm_aes_encrypt_vec = _nettle_ppc_gcm_aes_encrypt_ppc64;
+      _nettle_ppc_gcm_aes_decrypt_vec = _nettle_ppc_gcm_aes_decrypt_ppc64;
     }
   else
     {
Fat setup is a bit tricky; here it looks like _nettle_ppc_gcm_aes_decrypt_vec is left undefined by the else clause. I would suspect that breaks when the extensions aren't available. You can test that with NETTLE_FAT_OVERRIDE=none.
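I.e., both branches need to assign the new function pointers, roughly like this (the feature test and the fallback names are only illustrative here; the fallbacks could be plain C implementations or nops):

  if (features.have_crypto_ext)   /* whatever feature test fat_init uses */
    {
      _nettle_ppc_gcm_aes_encrypt_vec = _nettle_ppc_gcm_aes_encrypt_ppc64;
      _nettle_ppc_gcm_aes_decrypt_vec = _nettle_ppc_gcm_aes_decrypt_ppc64;
    }
  else
    {
      /* Illustrative fallbacks; the point is that the pointers are
         never left undefined when the extensions are missing. */
      _nettle_ppc_gcm_aes_encrypt_vec = _nettle_ppc_gcm_aes_encrypt_c;
      _nettle_ppc_gcm_aes_decrypt_vec = _nettle_ppc_gcm_aes_decrypt_c;
    }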
 gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
                    size_t length, uint8_t *dst, const uint8_t *src)
 {
+#if defined(HAVE_NATIVE_AES_GCM_STITCH)
+  if (length >= 128) {
+    PPC_GCM_CRYPT(1, _AES128_ROUNDS, ctx, length, dst, src);
+    if (length == 0) {
+      return;
+    }
+  }
+#endif /* HAVE_NATIVE_AES_GCM_STITCH */
+
   GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
 }
In a non-fat build, it's right to use a compile-time #if to select whether the optimized code should be called. But in a fat build, we'd need a valid function in all cases, doing different things depending on the runtime fat initialization. One could do that with two versions of gcm_aes128_encrypt (which is likely preferable if we do something similar for other archs that have separate assembly for aes128, aes192, etc). Or we would need to call some function unconditionally, which would be a nop if the extensions are not available. E.g., do something like
#if HAVE_NATIVE_fat_aes_gcm_encrypt
void
gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
                   size_t length, uint8_t *dst, const uint8_t *src)
{
  size_t done = _gcm_aes_encrypt (&ctx->key, &ctx->gcm.x, &ctx->gcm.ctr,
                                  _AES128_ROUNDS, &ctx->cipher.keys,
                                  length, dst, src);
  ctx->data_size += done;
  length -= done;
  if (length > 0)
    {
      src += done;
      dst += done;
      GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
    }
}
#endif
where the C-implementation of _gcm_aes_encrypt could just return 0.
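A minimal sketch of that fallback, with parameter types only approximating the call above:

  #include "gcm.h"

  /* Generic fallback: processes nothing, so the caller falls through to
     GCM_ENCRYPT for the whole buffer. Types are approximate, following
     the example call above. */
  size_t
  _gcm_aes_encrypt (const struct gcm_key *key, union nettle_block16 *x,
                    union nettle_block16 *ctr, unsigned rounds,
                    const void *rkeys, size_t length,
                    uint8_t *dst, const uint8_t *src)
  {
    return 0;
  }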
And it's preferable that the same interface can be used on other archs, even if they don't do 8 blocks at a time like your ppc code.
--- a/gcm.h
+++ b/gcm.h
@@ -195,6 +195,47 @@ gcm_digest(struct gcm_ctx *ctx, const struct gcm_key *key,
                          (nettle_cipher_func *) (encrypt),     \
                          (length), (digest)))
+#if defined(HAVE_NATIVE_AES_GCM_STITCH)
+#define _ppc_gcm_aes_encrypt _nettle_ppc_gcm_aes_encrypt
+#define _ppc_gcm_aes_decrypt _nettle_ppc_gcm_aes_decrypt
+void
+_ppc_gcm_aes_encrypt (void *ctx, size_t rounds, uint8_t *ctr,
+                      size_t len, uint8_t *dst, const uint8_t *src);
+void
+_ppc_gcm_aes_decrypt (void *ctx, size_t rounds, uint8_t *ctr,
+                      size_t len, uint8_t *dst, const uint8_t *src);
+struct ppc_gcm_aes_context {
+  uint8_t *x;
+  uint8_t *htable;
+  struct aes128_ctx *rkeys;
+};
+#define GET_PPC_CTX(gcm_aes_ctx, ctx, key, cipher) \
+  { \
+    (gcm_aes_ctx)->x = (uint8_t *) &(ctx)->x; \
+    (gcm_aes_ctx)->htable = (uint8_t *) (key); \
+    (gcm_aes_ctx)->rkeys = (struct aes128_ctx *) (cipher)->keys; \
+  }
+
+#define PPC_GCM_CRYPT(encrypt, rounds, ctx, length, dst, src) \
+  { \
+    size_t rem_len = 0; \
+    struct ppc_gcm_aes_context c; \
+    struct gcm_ctx *gctx = &(ctx)->gcm; \
+    GET_PPC_CTX(&c, gctx, &(ctx)->key, &(ctx)->cipher); \
+    if ((encrypt)) { \
+      _ppc_gcm_aes_encrypt(&c, (rounds), (&(ctx)->gcm)->ctr.b, (length), (dst), (src)); \
+    } else { \
+      _ppc_gcm_aes_decrypt(&c, (rounds), (&(ctx)->gcm)->ctr.b, (length), (dst), (src)); \
+    } \
+    rem_len = (length) % (GCM_BLOCK_SIZE * 8); \
+    (length) -= rem_len; \
+    gctx->data_size += (length); \
+    (dst) += (length); \
+    (src) += (length); \
+    (length) = rem_len; \
+  }
+#endif /* HAVE_NATIVE_AES_GCM_STITCH */
This looks a little awkward. I think it would be better to pass the various pointers needed by the assembly implementation as separate (register) arguments. Or pass the pointer to the struct gcm_aesxxx_ctx directly (with the disadvantage that assembly code needs to know corresponding offsets).
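For example, something along these lines (an illustration of the calling convention only, not a finished signature):

  /* Illustration only: every pointer the assembly needs is a separate
     argument, so with the ELFv2 ABI they all arrive in GPRs r3-r10 and
     no intermediate context struct is needed. Returning the number of
     bytes processed would also move the length bookkeeping into C. */
  size_t
  _nettle_ppc_gcm_aes_encrypt (const void *htable,   /* the H table */
                               uint8_t *x,           /* ghash accumulator */
                               uint8_t *ctr,         /* counter block */
                               size_t rounds,        /* AES round count */
                               const void *rkeys,    /* AES round keys */
                               size_t length, uint8_t *dst,
                               const uint8_t *src);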
--- a/powerpc64/machine.m4
+++ b/powerpc64/machine.m4
@@ -63,3 +63,40 @@ C INC_VR(VR, INC)
 define(`INC_VR',`ifelse(substr($1,0,1),`v',
 ``v'eval($2+substr($1,1,len($1)))',
 `eval($2+$1)')')
+
+C Adding state and round key 0
+C XOR_4RK0(state, state, rkey0)
+define(`XOR_4RK0',
+  `vxor $1, $1, $5
+  vxor $2, $2, $5
+  vxor $3, $3, $5
+  vxor $4, $4, $5')
+
+C Do 4 vcipher/vcipherlast
+C VCIPHER(vcipher/vcipherlast, state, state, rkey)
+define(`VCIPHER4',
+  `$1 $2, $2, $6
+  $1 $3, $3, $6
+  $1 $4, $4, $6
+  $1 $5, $5, $6')
I think this could be generalized to an OP_4WAY macro, used as
OP_4WAY(vxor/vcipher/vcipherlast, a, b, c, d, k)
One could also consider generalizing it to an arbitrary number of registers with an m4 loop, to have
OP_NWAY(op, k, a, b,..., x)
expand to
  op a, a, k
  op b, b, k
  ...
  op x, x, k
But that may be overkill if only 4-way and 8-way are used.
+C Adding multiplication product
+C ADD_PROD(c1, c2, a, b)
+define(`ADD_PROD',
+  `vxor $1,$1,$3
+  vxor $2,$2,$4')
Maybe rename it to GF_ADD; ADD_PROD is not so specific.
+C GF multiplication of L/M and data
+C GF_MUL(F, R, HL, HM, S)
+define(`GF_MUL',
+  `vpmsumd $1,$3,$5
+  vpmsumd $2,$4,$5')
Looks like GF_MUL is only used in the pattern GF_MUL; GF_MUL; ADD_PROD? So it could perhaps be combined into one macro, with a comment saying which operation it performs.
--- /dev/null
+++ b/powerpc64/p8/gcm-aes-decrypt.asm
[...]
+define(`SAVE_GPR', `std $1, $2(SP)')
+define(`RESTORE_GPR', `ld $1, $2(SP)')
I don't think these macros add much readability. One could possibly have macros that take a range of registers, but not sure that's worth the effort.
+.align 5
+L8x_round_loop1:
+    lxvd2x VSR(K),r11,RK
+    vperm K,K,K,LE_MASK
+    VCIPHER4(vcipher, S0, S1, S2, S3, K)
+    VCIPHER4(vcipher, S4, S5, S6, S7, K)
+    addi r11,r11,0x10
+    bdnz L8x_round_loop1
+
+    lxvd2x VSR(K),r11,RK
+    vperm K,K,K,LE_MASK
+    VCIPHER4(vcipherlast, S0, S1, S2, S3, K)
+    VCIPHER4(vcipherlast, S4, S5, S6, S7, K)
+
+    cmpdi LOOP, 0
+    beq do_ghash
+
+.align 5
+Loop8x_de:
Is there a good reason why you have another copy of the decrypt round loop, before the main loop?
Regards, /Niels
On Dec 11, 2023, at 10:32 AM, Niels Möller nisse@lysator.liu.se wrote:
Danny Tsen dtsen@us.ibm.com writes:
Here is version 2 of the AES/GCM stitched patch. The stitched code is all in assembly, and m4 macros are used. Overall performance improved by around 110% for encrypt and 120% for decrypt, respectively. Please see the attached patch and AES benchmark.
Thanks, comments below.
--- a/fat-ppc.c
+++ b/fat-ppc.c
@@ -226,6 +231,8 @@ fat_init (void)
          _nettle_ghash_update_arm64() */
       _nettle_ghash_set_key_vec = _nettle_ghash_set_key_ppc64;
       _nettle_ghash_update_vec = _nettle_ghash_update_ppc64;
+      _nettle_ppc_gcm_aes_encrypt_vec = _nettle_ppc_gcm_aes_encrypt_ppc64;
+      _nettle_ppc_gcm_aes_decrypt_vec = _nettle_ppc_gcm_aes_decrypt_ppc64;
     }
   else
     {
Fat setup is a bit tricky; here it looks like _nettle_ppc_gcm_aes_decrypt_vec is left undefined by the else clause. I would suspect that breaks when the extensions aren't available. You can test that with NETTLE_FAT_OVERRIDE=none.
Sure. I'll test that and fix it.
 gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
                    size_t length, uint8_t *dst, const uint8_t *src)
 {
+#if defined(HAVE_NATIVE_AES_GCM_STITCH)
+  if (length >= 128) {
+    PPC_GCM_CRYPT(1, _AES128_ROUNDS, ctx, length, dst, src);
+    if (length == 0) {
+      return;
+    }
+  }
+#endif /* HAVE_NATIVE_AES_GCM_STITCH */
+
   GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
 }
In a non-fat build, it's right to use a compile-time #if to select whether the optimized code should be called. But in a fat build, we'd need a valid function in all cases, doing different things depending on the runtime fat initialization. One could do that with two versions of gcm_aes128_encrypt (which is likely preferable if we do something similar for other archs that have separate assembly for aes128, aes192, etc). Or we would need to call some function unconditionally, which would be a nop if the extensions are not available. E.g., do something like
#if HAVE_NATIVE_fat_aes_gcm_encrypt
void
gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
                   size_t length, uint8_t *dst, const uint8_t *src)
{
  size_t done = _gcm_aes_encrypt (&ctx->key, &ctx->gcm.x, &ctx->gcm.ctr,
                                  _AES128_ROUNDS, &ctx->cipher.keys,
                                  length, dst, src);
  ctx->data_size += done;
  length -= done;
  if (length > 0)
    {
      src += done;
      dst += done;
      GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
    }
}
#endif
where the C-implementation of _gcm_aes_encrypt could just return 0.
And it's preferable that the same interface can be used on other archs, even if they don't do 8 blocks at a time like your ppc code.
--- a/gcm.h
+++ b/gcm.h
@@ -195,6 +195,47 @@ gcm_digest(struct gcm_ctx *ctx, const struct gcm_key *key,
                          (nettle_cipher_func *) (encrypt),     \
                          (length), (digest)))
+#if defined(HAVE_NATIVE_AES_GCM_STITCH)
+#define _ppc_gcm_aes_encrypt _nettle_ppc_gcm_aes_encrypt
+#define _ppc_gcm_aes_decrypt _nettle_ppc_gcm_aes_decrypt
+void
+_ppc_gcm_aes_encrypt (void *ctx, size_t rounds, uint8_t *ctr,
+                      size_t len, uint8_t *dst, const uint8_t *src);
+void
+_ppc_gcm_aes_decrypt (void *ctx, size_t rounds, uint8_t *ctr,
+                      size_t len, uint8_t *dst, const uint8_t *src);
+struct ppc_gcm_aes_context {
+  uint8_t *x;
+  uint8_t *htable;
+  struct aes128_ctx *rkeys;
+};
+#define GET_PPC_CTX(gcm_aes_ctx, ctx, key, cipher) \
+  { \
+    (gcm_aes_ctx)->x = (uint8_t *) &(ctx)->x; \
+    (gcm_aes_ctx)->htable = (uint8_t *) (key); \
+    (gcm_aes_ctx)->rkeys = (struct aes128_ctx *) (cipher)->keys; \
+  }
+
+#define PPC_GCM_CRYPT(encrypt, rounds, ctx, length, dst, src) \
+  { \
+    size_t rem_len = 0; \
+    struct ppc_gcm_aes_context c; \
+    struct gcm_ctx *gctx = &(ctx)->gcm; \
+    GET_PPC_CTX(&c, gctx, &(ctx)->key, &(ctx)->cipher); \
+    if ((encrypt)) { \
+      _ppc_gcm_aes_encrypt(&c, (rounds), (&(ctx)->gcm)->ctr.b, (length), (dst), (src)); \
+    } else { \
+      _ppc_gcm_aes_decrypt(&c, (rounds), (&(ctx)->gcm)->ctr.b, (length), (dst), (src)); \
+    } \
+    rem_len = (length) % (GCM_BLOCK_SIZE * 8); \
+    (length) -= rem_len; \
+    gctx->data_size += (length); \
+    (dst) += (length); \
+    (src) += (length); \
+    (length) = rem_len; \
+  }
+#endif /* HAVE_NATIVE_AES_GCM_STITCH */
This looks a little awkward. I think it would be better to pass the various pointers needed by the assembly implementation as separate (register) arguments. Or pass the pointer to the struct gcm_aesxxx_ctx directly (with the disadvantage that assembly code needs to know corresponding offsets).
I'll see which is the better option.
--- a/powerpc64/machine.m4
+++ b/powerpc64/machine.m4
@@ -63,3 +63,40 @@ C INC_VR(VR, INC)
 define(`INC_VR',`ifelse(substr($1,0,1),`v',
 ``v'eval($2+substr($1,1,len($1)))',
 `eval($2+$1)')')
+
+C Adding state and round key 0
+C XOR_4RK0(state, state, rkey0)
+define(`XOR_4RK0',
+  `vxor $1, $1, $5
+  vxor $2, $2, $5
+  vxor $3, $3, $5
+  vxor $4, $4, $5')
+
+C Do 4 vcipher/vcipherlast
+C VCIPHER(vcipher/vcipherlast, state, state, rkey)
+define(`VCIPHER4',
+  `$1 $2, $2, $6
+  $1 $3, $3, $6
+  $1 $4, $4, $6
+  $1 $5, $5, $6')
I think this could be generalized to an OP_4WAY macro, used as
OP_4WAY(vxor/vcipher/vcipherlast, a, b, c, d, k)
One could also consider generalizing it to an arbitrary number of registers with an m4 loop, to have
OP_NWAY(op, k, a, b,..., x)
expand to
  op a, a, k
  op b, b, k
  ...
  op x, x, k
But that may be overkill if only 4-way and 8-way are used.
This is only for PPC, not for other architectures, and it's defined in machine.m4. So I don't see the point.
+C Adding multiplication product
+C ADD_PROD(c1, c2, a, b)
+define(`ADD_PROD',
+  `vxor $1,$1,$3
+  vxor $2,$2,$4')
Maybe rename it to GF_ADD; ADD_PROD is not so specific.
+C GF multiplication of L/M and data
+C GF_MUL(F, R, HL, HM, S)
+define(`GF_MUL',
+  `vpmsumd $1,$3,$5
+  vpmsumd $2,$4,$5')
Looks like GF_MUL is only used in the pattern GF_MUL; GF_MUL; ADD_PROD? So it could perhaps be combined into one macro, with a comment saying which operation it performs.
This is to keep each macro flexible to use on its own, so I'd rather not combine them.
--- /dev/null
+++ b/powerpc64/p8/gcm-aes-decrypt.asm
[...]
+define(`SAVE_GPR', `std $1, $2(SP)')
+define(`RESTORE_GPR', `ld $1, $2(SP)')
I don't think these macros add much readability. One could possibly have macros that take a range of registers, but not sure that's worth the effort.
Probably not.
+.align 5
+L8x_round_loop1:
+    lxvd2x VSR(K),r11,RK
+    vperm K,K,K,LE_MASK
+    VCIPHER4(vcipher, S0, S1, S2, S3, K)
+    VCIPHER4(vcipher, S4, S5, S6, S7, K)
+    addi r11,r11,0x10
+    bdnz L8x_round_loop1
+
+    lxvd2x VSR(K),r11,RK
+    vperm K,K,K,LE_MASK
+    VCIPHER4(vcipherlast, S0, S1, S2, S3, K)
+    VCIPHER4(vcipherlast, S4, S5, S6, S7, K)
+
+    cmpdi LOOP, 0
+    beq do_ghash
+
+.align 5
+Loop8x_de:
Is there a good reason why you have another copy of the decrypt round loop, before the main loop?
I tested both. Yes, this improves performance by around 5-6%.
Thanks. -Danny
Regards, /Niels
-- Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677. Internet email is subject to wholesale government surveillance.
Hi Niels,
Here is another revised patch, tested with NETTLE_FAT_OVERRIDE. Same performance as the last version. I also added a new 917-byte test vector for the AES/GCM tests, to cover multiple blocks plus a partial block. Attached are the patch and the AES benchmark.
Thanks. -Danny
Danny Tsen dtsen@us.ibm.com writes:
Here is another revised patch, tested with NETTLE_FAT_OVERRIDE. Same performance as the last version. I also added a new 917-byte test vector for the AES/GCM tests, to cover multiple blocks plus a partial block. Attached are the patch and the AES benchmark.
Thanks. I haven't yet been able to look closer, but to make progress and start with the easy parts, I've committed the new test case. For next step, I'm considering the refactoring of ghash to use the new macros.
Regards, /Niels
Niels Möller nisse@lysator.liu.se writes:
For next step, I'm considering the refactoring of ghash to use the new macros.
I actually started with the macros relevant for the AES code. And it turned out to be rather easy to do the m4 loops to operate on an arbitrary list of registers. See branch ppc-aes-macros, in particular, https://gitlab.com/gnutls/nettle/-/blob/ppc-aes-macros/powerpc64/machine.m4?...
I'm not entirely happy with the naming (OP_YXX, OP_YXXX), which is intended to correspond both to the instruction pattern the macro expands to and to the expected order of arguments, and that really doesn't work that well. Suggestions welcome.
Regards, /Niels
Correction: the new test vector is 719 bytes, not 917.