Hi Niels,
Please let me know when you merge the code and we can work from there.
Thanks.
-Danny

________________________________
From: Niels Möller <nisse@lysator.liu.se>
Sent: Friday, February 23, 2024 1:07 AM
To: Danny Tsen <dtsen@us.ibm.com>
Cc: nettle-bugs@lists.lysator.liu.se <nettle-bugs@lists.lysator.liu.se>; George Wilson <gcwilson@us.ibm.com>
Subject: [EXTERNAL] Re: ppc64 micro optimization
Danny Tsen <dtsen@us.ibm.com> writes:
Here is the v5 patch, updated per your comments. Please review.
Thanks. I think this looks pretty good. Maybe I should commit it on a branch and we can iterate from there. I'll be on vacation and mostly offline next week, though.
--- a/gcm-aes128.c
+++ b/gcm-aes128.c
@@ -63,6 +63,11 @@ void
 gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
                    size_t length, uint8_t *dst, const uint8_t *src)
 {
+  size_t done = _gcm_aes_encrypt ((struct gcm_key *)ctx, _AES128_ROUNDS, length, dst, src);
+  ctx->gcm.data_size += done;
+  length -= done;
+  src += done;
+  dst += done;
   GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
 }
We should come up with some preprocessor things to completely omit the new code on architectures that don't have _gcm_aes_encrypt (possibly with some macro to reduce duplication). I think that's the main thing I'd like to have before merge. Otherwise, looks nice and clean.
Ah, and I think you could write &ctx->key instead of the explicit cast.
    C load table elements
    li     r9,1*16
    li     r10,2*16
    li     r11,3*16
    lxvd2x VSR(H1M),0,HT
    lxvd2x VSR(H1L),r9,HT
    lxvd2x VSR(H2M),r10,HT
    lxvd2x VSR(H2L),r11,HT
    addi   HT, HT, 64
    lxvd2x VSR(H3M),0,HT
    lxvd2x VSR(H3L),r9,HT
    lxvd2x VSR(H4M),r10,HT
    lxvd2x VSR(H4L),r11,HT
    li     r25,0x10
    li     r26,0x20
    li     r27,0x30
    li     r28,0x40
    li     r29,0x50
    li     r30,0x60
    li     r31,0x70

I still think there's opportunity to reduce the number of registers (and the corresponding load/store of callee-saved registers). E.g., here r9-r11 are used for the same thing as r25-r27.
.align 5
    C increase ctr value as input to aes_encrypt
    vaddudm S1, S0, CNT1
    vaddudm S2, S1, CNT1
    vaddudm S3, S2, CNT1
    vaddudm S4, S3, CNT1
    vaddudm S5, S4, CNT1
    vaddudm S6, S5, CNT1
    vaddudm S7, S6, CNT1

This is a rather long dependency chain; I wonder if you could make a measurable saving of a cycle or two by using additional CNT2 or CNT4 registers (if not, it's preferable to keep the current simple chain).
Regards, /Niels
-- Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677. Internet email is subject to wholesale government surveillance.