Hi Niels,
Here is the v5 patch, updated based on your comments. Please review.
Thanks. -Danny
On Feb 14, 2024, at 8:46 AM, Niels Möller <nisse@lysator.liu.se> wrote:
Danny Tsen <dtsen@us.ibm.com> writes:
Here is the new patch v4 for the AES/GCM stitched implementation and benchmark, based on the current repo.
Thanks. I'm not able to read it all carefully at the moment, but I have a few comments, see below.
In the meantime, I've also tried to implement something similar for x86_64, see branch x86_64-gcm-aes. Unfortunately, I get no speedup; on the contrary, my stitched implementation seems considerably slower... But at least that helped me understand the higher-level issues better.
--- a/gcm-aes128.c
+++ b/gcm-aes128.c
@@ -63,14 +63,30 @@
 void
 gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
                    size_t length, uint8_t *dst, const uint8_t *src)
 {
-  GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
+  size_t done = _gcm_aes_encrypt (&ctx->key, &ctx->gcm.x.b, &ctx->gcm.ctr.b,
+                                  _AES128_ROUNDS, &ctx->cipher.keys, length, dst, src);
I know I asked you to explode the context into many separate arguments to _gcm_aes_encrypt. I'm now backpedalling a bit on that. For one, it's not so nice to have so many arguments that they can't be passed in registers. Second, when running a fat build on a machine where the needed instructions are unavailable, it's a bit of a waste to have to spend lots of instructions on preparing those arguments for calling a nop function. So to reduce overhead, I'm now leaning towards an interface like
/* To reduce the number of arguments (e.g., maximum of 6 register
   arguments on x86_64), pass a pointer to gcm_key, which really is a
   pointer to the first member of the appropriate gcm_aes*_ctx struct. */
size_t
_gcm_aes_encrypt (struct gcm_key *CTX, unsigned rounds,
                  size_t size, uint8_t *dst, const uint8_t *src);
That's not so pretty, but I think that is workable and efficient, and since it is an internal function, the interface can be changed if this is implemented on other architectures and we find out that it needs some tweaks. See https://git.lysator.liu.se/nettle/nettle/-/blob/x86_64-gcm-aes/x86_64/aesni_... for the code I wrote to accept that ctx argument.
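For the case where no stitched implementation is selected, the nop function I mention above could be as simple as the following untested sketch: it reports that no data was processed, so the caller does all the work via GCM_ENCRYPT.

size_t
_gcm_aes_encrypt (struct gcm_key *key, unsigned rounds,
                  size_t size, uint8_t *dst, const uint8_t *src)
{
  /* No stitched AES/GCM available; process nothing and let the
     caller fall back to the generic code. */
  return 0;
}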
It would also be nice to have a #define around the code calling _gcm_aes_encrypt, so that it is compiled only if (i) we have a non-trivial implementation of _gcm_aes_encrypt, or (ii) we're doing a fat build, which may select a non-trivial implementation of _gcm_aes_encrypt at run time.
+  ctx->gcm.data_size += done;
+  length -= done;
+  if (length > 0) {
Not sure if the check for length > 0 is needed. It is fine to call gcm_encrypt/GCM_ENCRYPT with length 0. There will be some overhead for a call with length 0, though, and that may be a more common case when _gcm_aes_encrypt is used?
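To make the two points above concrete, here is an untested sketch of how the calling code could look, with the #define guard and without the length check. HAVE_NATIVE_gcm_aes_encrypt is only a placeholder name for whatever condition configure ends up providing; a fat build would add its own test.

void
gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
                   size_t length, uint8_t *dst, const uint8_t *src)
{
#if HAVE_NATIVE_gcm_aes_encrypt  /* placeholder condition */
  /* The cast relies on the gcm_key being the first member of the
     context struct, as described above. */
  size_t done = _gcm_aes_encrypt ((struct gcm_key *) ctx, _AES128_ROUNDS,
                                  length, dst, src);
  ctx->gcm.data_size += done;
  length -= done;
  src += done;
  dst += done;
#endif
  GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
}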
+define(`SAVE_GPR', `std $1, $2(SP)')
+define(`RESTORE_GPR', `ld $1, $2(SP)')
I think the above two macros are unneeded; it's easier to read if std and ld are used directly.
+define(`SAVE_VR',
+  `li r11, $2
+   stvx $1, r11, $3')
+define(`RESTORE_VR',
+  `li r11, $2
+   lvx $1, r11, $3')
It would be nice if we could trim the use of vector registers so we don't need to save and restore lots of them. And if we need two instructions anyway, then maybe it would be clearer with PUSH_VR/POP_VR macros that also adjust the stack pointer and don't need an additional register for indexing?
+    C load table elements
+    li      r9,1*16
+    li      r10,2*16
+    li      r11,3*16
+    lxvd2x  VSR(H1M),0,HT
+    lxvd2x  VSR(H1L),r9,HT
+    lxvd2x  VSR(H2M),r10,HT
+    lxvd2x  VSR(H2L),r11,HT
+    li      r9,4*16
+    li      r10,5*16
+    li      r11,6*16
+    li      r12,7*16
+    lxvd2x  VSR(H3M),r9,HT
+    lxvd2x  VSR(H3L),r10,HT
+    lxvd2x  VSR(H4M),r11,HT
+    lxvd2x  VSR(H4L),r12,HT
I think it would be nicer to follow the style I tried to implement in my recent updates: use a few registers (e.g., r9-r11) as offsets, initialize them only once, and use them everywhere. E.g., in this case, the loading could be
    lxvd2x  VSR(H1M),0,HT
    lxvd2x  VSR(H1L),r9,HT
    lxvd2x  VSR(H2M),r10,HT
    lxvd2x  VSR(H2L),r11,HT
    addi    HT, HT, 64
    lxvd2x  VSR(H3M),0,HT
    lxvd2x  VSR(H3L),r9,HT
    lxvd2x  VSR(H4M),r10,HT
    lxvd2x  VSR(H4L),r11,HT
+    C do two 4x ghash
+    C previous digest combining
+    xxlor   vs0, VSR(S0), VSR(S0)
+    vxor    S0,S0,D
+    GF_MUL(F2, R2, H3L, H3M, S1)
+    GF_MUL(F, R, H4L, H4M, S0)
+    vxor    F,F,F2
+    vxor    R,R,R2
+    xxlor   VSR(S0), vs0, vs0
+    GF_MUL(F3, R3, H2L, H2M, S2)
+    GF_MUL(F4, R4, H1L, H1M, S3)
+    vxor    F3,F3,F4
+    vxor    R3,R3,R4
+    vxor    F,F,F3
+    vxor    D,R,R3
+    GHASH_REDUCE(D, F, POLY_L, R2, F2)    C R2, F2 used as temporaries
It may be possible to reduce the number of registers, without making the code slower, by accumulating differently. You ultimately accumulate the values into F and D before the final reduction. Maybe you don't need separate registers for all of F, F2, F3, F4, and all of D, R, R2, R3, R4?
If you have any short-lived registers used for the AES part, they could also overlap with short-lived registers used for ghash.
Regards,
/Niels

--
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.