Hi Niels,
Please let me know when you merge the code and we can work from there.
Thanks.
-Danny

________________________________
From: Niels Möller <nisse@lysator.liu.se>
Sent: Friday, February 23, 2024 1:07 AM
To: Danny Tsen <dtsen@us.ibm.com>
Cc: nettle-bugs@lists.lysator.liu.se <nettle-bugs@lists.lysator.liu.se>; George Wilson <gcwilson@us.ibm.com>
Subject: [EXTERNAL] Re: ppc64 micro optimization
Danny Tsen <dtsen@us.ibm.com> writes:
Here is the v5 patch, updated per your comments. Please review.
Thanks. I think this looks pretty good. Maybe I should commit it on a branch and we can iterate from there. I'll be on vacation and mostly offline next week, though.
--- a/gcm-aes128.c
+++ b/gcm-aes128.c
@@ -63,6 +63,11 @@ void
 gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
                    size_t length, uint8_t *dst, const uint8_t *src)
 {
+  size_t done = _gcm_aes_encrypt ((struct gcm_key *)ctx, _AES128_ROUNDS, length, dst, src);
+  ctx->gcm.data_size += done;
+  length -= done;
+  src += done;
+  dst += done;
   GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
 }
We should come up with some preprocessor things to completely omit the new code on architectures that don't have _gcm_aes_encrypt (possibly with some macro to reduce duplication). I think that's the main thing I'd like to have before merge. Otherwise, looks nice and clean.
Ah, and I think you could write &ctx->key instead of the explicit cast.
    C load table elements
    li     r9,1*16
    li     r10,2*16
    li     r11,3*16
    lxvd2x VSR(H1M),0,HT
    lxvd2x VSR(H1L),r9,HT
    lxvd2x VSR(H2M),r10,HT
    lxvd2x VSR(H2L),r11,HT
    addi   HT, HT, 64
    lxvd2x VSR(H3M),0,HT
    lxvd2x VSR(H3L),r9,HT
    lxvd2x VSR(H4M),r10,HT
    lxvd2x VSR(H4L),r11,HT
    li     r25,0x10
    li     r26,0x20
    li     r27,0x30
    li     r28,0x40
    li     r29,0x50
    li     r30,0x60
    li     r31,0x70

I still think there's opportunity to reduce the number of registers (and the corresponding load/store of callee-saved registers). E.g., here r9-r11 are used for the same thing as r25-r27.
.align 5
    C increase ctr value as input to aes_encrypt
    vaddudm S1, S0, CNT1
    vaddudm S2, S1, CNT1
    vaddudm S3, S2, CNT1
    vaddudm S4, S3, CNT1
    vaddudm S5, S4, CNT1
    vaddudm S6, S5, CNT1
    vaddudm S7, S6, CNT1

This is a rather long dependency chain; I wonder if you could make a measurable saving of a cycle or two by using additional CNT2 or CNT4 registers (if not, it's preferable to keep the current simple chain).
Regards, /Niels
-- Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677. Internet email is subject to wholesale government surveillance.