Hi Niels,
Please let me know when you merge the code and we can work from there.
Thanks.
-Danny
________________________________
From: Niels Möller nisse@lysator.liu.se
Sent: Friday, February 23, 2024 1:07 AM
To: Danny Tsen dtsen@us.ibm.com
Cc: nettle-bugs@lists.lysator.liu.se nettle-bugs@lists.lysator.liu.se; George Wilson gcwilson@us.ibm.com
Subject: [EXTERNAL] Re: ppc64 micro optimization
Danny Tsen dtsen@us.ibm.com writes:
Here is the v5 patch from your comments. Please review.
Thanks. I think this looks pretty good. Maybe I should commit it on a branch and we can iterate from there. I'll be on vacation and mostly offline next week, though.
--- a/gcm-aes128.c
+++ b/gcm-aes128.c
@@ -63,6 +63,11 @@
 void gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx, size_t length, uint8_t *dst, const uint8_t *src)
 {
+  size_t done = _gcm_aes_encrypt ((struct gcm_key *)ctx, _AES128_ROUNDS, length, dst, src);
+  ctx->gcm.data_size += done;
+  length -= done;
+  src += done;
+  dst += done;
   GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
 }
We should come up with some preprocessor things to completely omit the new code on architectures that don't have _gcm_aes_encrypt (possibly with some macro to reduce duplication). I think that's the main thing I'd like to have before merge. Otherwise, looks nice and clean.
Ah, and I think you could write &ctx->key instead of the explicit cast.
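As a rough sketch of how the two comments above could fit together: a single macro wraps the fast path, expands to nothing when no assembly routine is available, and passes &ctx->key rather than casting ctx. Everything here is a mock for illustration. The feature macro HAVE_NATIVE_gcm_aes_encrypt, the macro name GCM_AES_FAST_PATH, the struct layouts, and mock_gcm_aes_encrypt (which only copies whole blocks) are assumptions, not the names or code in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mock stand-ins for nettle's types; the layouts are illustrative only. */
struct gcm_key { uint8_t h[16]; };
struct gcm_state { size_t data_size; };
struct gcm_aes128_ctx
{
  struct gcm_key key;
  struct gcm_state gcm;
};

#define MOCK_AES128_ROUNDS 10

/* Hypothetical feature macro: 1 when an assembly routine exists for the
   target, 0 otherwise. Normally a configure script would set it. */
#ifndef HAVE_NATIVE_gcm_aes_encrypt
# define HAVE_NATIVE_gcm_aes_encrypt 1
#endif

#if HAVE_NATIVE_gcm_aes_encrypt
/* Mock of the accelerated routine: "processes" whole 16-byte blocks by
   copying them and returns the number of bytes handled. */
static size_t
mock_gcm_aes_encrypt (struct gcm_key *key, unsigned rounds,
                      size_t length, uint8_t *dst, const uint8_t *src)
{
  size_t done = length - length % 16;
  (void) key;
  (void) rounds;
  assert (done <= length);
  memcpy (dst, src, done);
  return done;
}

/* Fast path: advance the caller's length/src/dst past the bytes handled. */
# define GCM_AES_FAST_PATH(ctx, rounds, length, dst, src) do {          \
    size_t done_ = mock_gcm_aes_encrypt (&(ctx)->key, (rounds),         \
                                         (length), (dst), (src));       \
    (ctx)->gcm.data_size += done_;                                      \
    (length) -= done_;                                                  \
    (src) += done_;                                                     \
    (dst) += done_;                                                     \
  } while (0)
#else
/* No assembly routine: the fast path compiles away completely. */
# define GCM_AES_FAST_PATH(ctx, rounds, length, dst, src) do { } while (0)
#endif

/* Returns the number of bytes left over for the generic code path. */
static size_t
sketch_encrypt (struct gcm_aes128_ctx *ctx,
                size_t length, uint8_t *dst, const uint8_t *src)
{
  GCM_AES_FAST_PATH (ctx, MOCK_AES128_ROUNDS, length, dst, src);
  /* The generic GCM_ENCRYPT would handle the remaining bytes here. */
  return length;
}
```

With the macro in place, each of the gcm_aes*_encrypt functions would need only one extra line, and builds without the assembly routine would see no new code at all.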
+ C load table elements
+ li r9,1*16
+ li r10,2*16
+ li r11,3*16
+ lxvd2x VSR(H1M),0,HT
+ lxvd2x VSR(H1L),r9,HT
+ lxvd2x VSR(H2M),r10,HT
+ lxvd2x VSR(H2L),r11,HT
+ addi HT, HT, 64
+ lxvd2x VSR(H3M),0,HT
+ lxvd2x VSR(H3L),r9,HT
+ lxvd2x VSR(H4M),r10,HT
+ lxvd2x VSR(H4L),r11,HT
+ li r25,0x10
+ li r26,0x20
+ li r27,0x30
+ li r28,0x40
+ li r29,0x50
+ li r30,0x60
+ li r31,0x70
I still think there's an opportunity to reduce the number of registers (and the corresponding load-store of callee-save registers). E.g., here r9-r11 are used for the same thing as r25-r27.
+.align 5
+ C increase ctr value as input to aes_encrypt
+ vaddudm S1, S0, CNT1
+ vaddudm S2, S1, CNT1
+ vaddudm S3, S2, CNT1
+ vaddudm S4, S3, CNT1
+ vaddudm S5, S4, CNT1
+ vaddudm S6, S5, CNT1
+ vaddudm S7, S6, CNT1
This is a rather long dependency chain; I wonder if you could make a measurable saving of a cycle or two by using additional CNT2 or CNT4 registers (if not, it's preferable to keep the current simple chain).
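To make the dependency-chain point concrete, here is a small C model, offered only as a sketch. It treats each counter as a single 64-bit lane (vaddudm adds doublewords modulo 2^64) and ignores how the real code lays out the GCM counter within the vector. The constants cnt2 and cnt4 stand in for the hypothetical CNT2 and CNT4 registers. The serial version needs seven dependent additions to reach S7; the variant needs at most three, at the cost of two more constant registers.

```c
#include <assert.h>
#include <stdint.h>

/* Serial chain, as in the patch: each value depends on the previous one,
   so s[7] sits at the end of seven dependent additions. */
static void
ctr_chain (uint64_t s0, uint64_t s[8])
{
  int i;
  s[0] = s0;
  for (i = 1; i < 8; i++)
    s[i] = s[i - 1] + 1;
}

/* Variant with extra constants 2 and 4: the longest chain is three
   additions (s[0] -> s[1] -> s[3] -> s[7]). */
static void
ctr_tree (uint64_t s0, uint64_t s[8])
{
  const uint64_t cnt1 = 1, cnt2 = 2, cnt4 = 4;
  s[0] = s0;
  s[1] = s[0] + cnt1;   /* depth 1 */
  s[2] = s[0] + cnt2;   /* depth 1 */
  s[4] = s[0] + cnt4;   /* depth 1 */
  s[3] = s[1] + cnt2;   /* depth 2 */
  s[5] = s[1] + cnt4;   /* depth 2 */
  s[6] = s[2] + cnt4;   /* depth 2 */
  s[7] = s[3] + cnt4;   /* depth 3 */
}

/* Sanity check: both schedules must produce identical counter values. */
static void
check_equivalent (uint64_t s0)
{
  uint64_t a[8], b[8];
  int i;
  ctr_chain (s0, a);
  ctr_tree (s0, b);
  for (i = 0; i < 8; i++)
    assert (a[i] == b[i]);
}
```

Whether the shorter chain is actually faster depends on how well the additions already overlap with the surrounding AES rounds, so it would need to be measured on real hardware before giving up the simpler code.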
Regards,
/Niels

--
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.