Hi Niels,
Here is the v5 patch, updated based on your comments. Please review.
Thanks. -Danny
On Feb 14, 2024, at 8:46 AM, Niels Möller <nisse@lysator.liu.se> wrote:
Danny Tsen <dtsen@us.ibm.com> writes:
Here is the new patch v4 for the AES/GCM stitched implementation and benchmark, based on the current repo.
Thanks. I'm not able to read it all carefully at the moment, but I have a few comments, see below.
In the meantime, I've also tried to implement something similar for x86_64, see branch x86_64-gcm-aes. Unfortunately, I get no speedup; on the contrary, my stitched implementation seems considerably slower... But at least that helped me understand the higher-level issues better.
--- a/gcm-aes128.c
+++ b/gcm-aes128.c
@@ -63,14 +63,30 @@
 void
 gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
                    size_t length, uint8_t *dst, const uint8_t *src)
 {
-  GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
+  size_t done = _gcm_aes_encrypt (&ctx->key, &ctx->gcm.x.b, &ctx->gcm.ctr.b,
+                                  _AES128_ROUNDS, &ctx->cipher.keys, length, dst, src);
I know I asked you to explode the context into many separate arguments to _gcm_aes_encrypt. I'm now backpedalling a bit on that. For one, it's not so nice to have so many arguments that they can't be passed in registers. Second, when running a fat build on a machine where the needed instructions are unavailable, it's a bit of a waste to have to spend lots of instructions on preparing those arguments for calling a nop function. So to reduce overhead, I'm now leaning towards an interface like
/* To reduce the number of arguments (e.g., maximum of 6 register
   arguments on x86_64), pass a pointer to gcm_key, which really is a
   pointer to the first member of the appropriate gcm_aes*_ctx struct. */
size_t
_gcm_aes_encrypt (struct gcm_key *CTX, unsigned rounds,
                  size_t size, uint8_t *dst, const uint8_t *src);
That's not so pretty, but I think that is workable and efficient, and since it is an internal function, the interface can be changed if this is implemented on other architectures and we find out that it needs some tweaks. See https://git.lysator.liu.se/nettle/nettle/-/blob/x86_64-gcm-aes/x86_64/aesni_... for the code I wrote to accept that ctx argument.
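For the case where no stitched implementation is selected, the nop function I mention above could be as simple as the following untested sketch: it reports that no data was processed, so the caller does all the work via GCM_ENCRYPT.

size_t
_gcm_aes_encrypt (struct gcm_key *key, unsigned rounds,
                  size_t size, uint8_t *dst, const uint8_t *src)
{
  /* No stitched AES/GCM available; process nothing and let the
     caller fall back to the generic code. */
  return 0;
}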
It would also be nice to have a #define around the code calling _gcm_aes_encrypt, so that it is compiled only if (i) we have a non-trivial implementation of _gcm_aes_encrypt, or (ii) we're doing a fat build, which may select a non-trivial implementation of _gcm_aes_encrypt at run time.
+  ctx->gcm.data_size += done;
+  length -= done;
+  if (length > 0) {
Not sure if the check for length > 0 is needed. It is fine to call gcm_encrypt/GCM_ENCRYPT with length 0. There will be some overhead for a call with length 0, though, and that may be a more common case when _gcm_aes_encrypt is used?
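To make the two points above concrete, here is an untested sketch of how the calling code could look, with the #define guard and without the length check. HAVE_NATIVE_gcm_aes_encrypt is only a placeholder name for whatever condition configure ends up providing; a fat build would add its own test.

void
gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
                   size_t length, uint8_t *dst, const uint8_t *src)
{
#if HAVE_NATIVE_gcm_aes_encrypt  /* placeholder condition */
  /* The cast relies on the gcm_key being the first member of the
     context struct, as described above. */
  size_t done = _gcm_aes_encrypt ((struct gcm_key *) ctx, _AES128_ROUNDS,
                                  length, dst, src);
  ctx->gcm.data_size += done;
  length -= done;
  src += done;
  dst += done;
#endif
  GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
}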
+define(`SAVE_GPR', `std $1, $2(SP)')
+define(`RESTORE_GPR', `ld $1, $2(SP)')
I think the above two macros are unneeded; it's easier to read if std and ld are used directly.
+define(`SAVE_VR',
+  `li r11, $2
+   stvx $1, r11, $3')
+define(`RESTORE_VR',
+  `li r11, $2
+   lvx $1, r11, $3')
It would be nice if we could trim the use of vector registers so we don't need to save and restore lots of them. And if we need two instructions anyway, then maybe it would be clearer with PUSH_VR/POP_VR macros that also adjust the stack pointer and don't need an additional register for indexing?
+    C load table elements
+    li      r9,1*16
+    li      r10,2*16
+    li      r11,3*16
+    lxvd2x  VSR(H1M),0,HT
+    lxvd2x  VSR(H1L),r9,HT
+    lxvd2x  VSR(H2M),r10,HT
+    lxvd2x  VSR(H2L),r11,HT
+    li      r9,4*16
+    li      r10,5*16
+    li      r11,6*16
+    li      r12,7*16
+    lxvd2x  VSR(H3M),r9,HT
+    lxvd2x  VSR(H3L),r10,HT
+    lxvd2x  VSR(H4M),r11,HT
+    lxvd2x  VSR(H4L),r12,HT
I think it would be nicer to follow the style I tried to implement in my recent updates: use a few registers (e.g., r9-r11) as offsets, initialize them only once, and use them everywhere. E.g., in this case, the loading could be
    lxvd2x  VSR(H1M),0,HT
    lxvd2x  VSR(H1L),r9,HT
    lxvd2x  VSR(H2M),r10,HT
    lxvd2x  VSR(H2L),r11,HT
    addi    HT, HT, 64
    lxvd2x  VSR(H3M),0,HT
    lxvd2x  VSR(H3L),r9,HT
    lxvd2x  VSR(H4M),r10,HT
    lxvd2x  VSR(H4L),r11,HT
+    C do two 4x ghash
+    C previous digest combining
+    xxlor   vs0, VSR(S0), VSR(S0)
+    vxor    S0,S0,D
+    GF_MUL(F2, R2, H3L, H3M, S1)
+    GF_MUL(F, R, H4L, H4M, S0)
+    vxor    F,F,F2
+    vxor    R,R,R2
+    xxlor   VSR(S0), vs0, vs0
+    GF_MUL(F3, R3, H2L, H2M, S2)
+    GF_MUL(F4, R4, H1L, H1M, S3)
+    vxor    F3,F3,F4
+    vxor    R3,R3,R4
+    vxor    F,F,F3
+    vxor    D,R,R3
+    GHASH_REDUCE(D, F, POLY_L, R2, F2)    C R2, F2 used as temporaries
It may be possible to reduce the number of registers, without making the code slower, by accumulating differently. You ultimately accumulate the values into F and D before the final reduction. Maybe you don't need separate registers for all of F, F2, F3, F4, and all of D, R, R2, R3, R4?
If you have any short-lived registers used for the AES part, they could also overlap with short-lived registers used for ghash.
Regards,
/Niels

--
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.