In preparation for merging the gcm-aes "stitched" implementation, I'm reviewing the existing ghash code; see the WIP branch "ppc-ghash-macros".
I've introduced a macro GHASH_REDUCE, for the reduction logic. Besides that, I've been able to improve scheduling of the reduction instructions (adding in the result of vpmsumd last seems to improve parallelism, some 3% speedup of gcm_update on power10, benchmarked on cfarm120). I've also streamlined the way load offsets are used, and trimmed the number of needed vector registers slightly.
For the AES code, I've merged the new macros (I settled on the names OPN_XXY and OPN_XXXY), no change in speed expected from that change.
I've also tried to understand the difference between AES encrypt and decrypt, where decrypt is much slower, and uses an extra xor instruction in the round loop. I think the reason for that is that other AES implementations (including the x86_64 and arm64 instructions, and Nettle's C implementation) expect the decryption subkeys to be transformed via the AES "MIX_COLUMN" operation, see https://gitlab.com/gnutls/nettle/-/blob/master/aes-invert-internal.c?ref_typ...
The powerpc64 vncipher instruction, on the other hand, really wants the original subkeys, not the transformed ones. So on power, it would be better to have a _nettle_aes_invert that is essentially a memcpy; the aes decrypt assembly code could then be reworked without the xors, and run at exactly the same speed as encryption. The current _nettle_aes_invert also changes the order of the subkeys, with a FIXME comment suggesting that it would be better to update the order in which keys are accessed in the aes decryption functions.
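For illustration, a trivial invert along those lines could look like this (a minimal sketch only; the name and exact signature here are illustrative, not Nettle's actual internals):

  /* Sketch: decrypt uses the unmodified round keys, so "inversion"
     degenerates into a plain copy; no MIX_COLUMN transform, no
     reordering. */
  #include <stdint.h>
  #include <string.h>

  static void
  aes_invert_trivial (unsigned rounds, uint32_t *dst, const uint32_t *src)
  {
    /* rounds + 1 round keys, 4 32-bit words each. */
    memcpy (dst, src, (rounds + 1) * 4 * sizeof (uint32_t));
  }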
Regards, /Niels
Hi Niels,
Thanks for merging the stitched implementation for PPC64, and for your detailed information and efforts.
Thanks. -Danny
Danny Tsen dtsen@us.ibm.com writes:
Thanks for merging the stitched implementation for PPC64, and for your detailed information and efforts.
We're not quite there yet, though. Do you think you could rebase your work on top of the recent changes? Sorry about the conflicts, but I think the new macros should fit well with what you need (feel free to add more macros where you find that useful).
Regards, /Niels
Hi Niels,
It may take a while before I digest the changes and redo the work; it could be another two weeks, or longer, since I will be taking time off and will be out of the country. Will do my best.
Thanks. -Danny
Niels Möller nisse@lysator.liu.se writes:
While the powerpc64 vncipher instruction really wants the original subkeys, not transformed. So on power, it would be better to have a _nettle_aes_invert that is essentially a memcpy, and then the aes decrypt assembly code could be reworked without the xors, and run at exactly the same speed as encryption.
I've tried this out, see branch https://git.lysator.liu.se/nettle/nettle/-/tree/ppc64-aes-invert. It appears to give the desired improvement in aes decrypt speed, making it run at the same speed as aes encrypt, which is a speedup of about 80% when benchmarked on power10 (the cfarm120 machine).
Current _nettle_aes_invert also changes the order of the subkeys, with a FIXME comment suggesting that it would be better to update the order keys are accessed in the aes decryption functions.
I've merged the changes to keep subkey order the same for encrypt and decrypt (so that the decrypt round loop uses subkeys starting at the end of the array), which affects all aes implementations except s390x, which doesn't need any subkey expansion. But I've deleted the sparc32 assembly rather than updating it.
Regards, /Niels
Hi Niels,
Here is the new patch v4 for AES/GCM stitched implementation and benchmark based on the current repo.
Thanks. -Danny
Danny Tsen dtsen@us.ibm.com writes:
Here is the new patch v4 for AES/GCM stitched implementation and benchmark based on the current repo.
Thanks. I'm not able to read it all carefully at the moment, but I have a few comments, see below.
In the meantime, I've also tried to implement something similar for x86_64, see branch x86_64-gcm-aes. Unfortunately, I get no speedup; on the contrary, my stitched implementation seems considerably slower... But at least that helped me understand the higher-level issues better.
--- a/gcm-aes128.c
+++ b/gcm-aes128.c
@@ -63,14 +63,30 @@
void
gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
                   size_t length, uint8_t *dst, const uint8_t *src)
{
  GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
  size_t done = _gcm_aes_encrypt (&ctx->key, &ctx->gcm.x.b, &ctx->gcm.ctr.b,
                                  _AES128_ROUNDS, &ctx->cipher.keys, length, dst, src);
I know I asked you to explode the context into many separate arguments to _gcm_aes_encrypt. I'm now backpedalling a bit on that. For one, it's not so nice to have so many arguments that they can't be passed in registers. Second, when running a fat build on a machine where the needed instructions are unavailable, it's a bit of a waste to have to spend lots of instructions on preparing those arguments for calling a nop function. So to reduce overhead, I'm now leaning towards an interface like
/* To reduce the number of arguments (e.g., maximum of 6 register
   arguments on x86_64), pass a pointer to gcm_key, which really is a
   pointer to the first member of the appropriate gcm_aes*_ctx
   struct. */
size_t
_gcm_aes_encrypt (struct gcm_key *CTX, unsigned rounds,
                  size_t size, uint8_t *dst, const uint8_t *src);
That's not so pretty, but I think that is workable and efficient, and since it is an internal function, the interface can be changed if this is implemented on other architectures and we find out that it needs some tweaks. See https://git.lysator.liu.se/nettle/nettle/-/blob/x86_64-gcm-aes/x86_64/aesni_... for the code I wrote to accept that ctx argument.
It would also be nice to have a #define around the code calling _gcm_aes_encrypt, so that it is compiled only if (i) we have a non-trivial implementation of _gcm_aes_encrypt, or (ii) we're a fat build, which may select a non-trivial implementation of _gcm_aes_encrypt at run time.
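For illustration, a rough sketch of what such a guard could look like in gcm-aes128.c (HAVE_NATIVE_GCM_AES is a placeholder symbol, not an actual configure macro):

  void
  gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
                     size_t length, uint8_t *dst, const uint8_t *src)
  {
  #if HAVE_NATIVE_GCM_AES
    /* Compiled only when a non-trivial _gcm_aes_encrypt exists, or in a
       fat build that may select one at run time. */
    size_t done = _gcm_aes_encrypt (&ctx->key, _AES128_ROUNDS,
                                    length, dst, src);
    ctx->gcm.data_size += done;
    length -= done;
    src += done;
    dst += done;
  #endif
    /* The generic path handles whatever (possibly all) data remains. */
    GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
  }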
  ctx->gcm.data_size += done;
  length -= done;
  if (length > 0) {
Not sure if the check for length > 0 is needed. It is fine to call gcm_encrypt/GCM_ENCRYPT with length 0. There will be some overhead for a call with length 0, though, which may be a more common case when _gcm_aes_encrypt is used?
define(`SAVE_GPR', `std $1, $2(SP)')
define(`RESTORE_GPR', `ld $1, $2(SP)')
I think the above two macros are unneeded, it's easier to read to use std and ld directly.
define(`SAVE_VR',
  `li r11, $2
   stvx $1, r11, $3')
define(`RESTORE_VR',
  `li r11, $2
   lvx $1, r11, $3')
It would be nice if we could trim the use of vector registers so we don't need to save and restore lots of them. And if we need two instructions anyway, then maybe it would be clearer with PUSH_VR/POP_VR that also adjusts the stack pointer, and doesn't need to use an additional register for indexing?
C load table elements
li r9,1*16
li r10,2*16
li r11,3*16
lxvd2x VSR(H1M),0,HT
lxvd2x VSR(H1L),r9,HT
lxvd2x VSR(H2M),r10,HT
lxvd2x VSR(H2L),r11,HT
li r9,4*16
li r10,5*16
li r11,6*16
li r12,7*16
lxvd2x VSR(H3M),r9,HT
lxvd2x VSR(H3L),r10,HT
lxvd2x VSR(H4M),r11,HT
lxvd2x VSR(H4L),r12,HT
I think it would be nicer to follow the style I tried to implement in my recent updates, using some registers (e.g., r9-r11) as offsets, initializing them only once, and using them everywhere. E.g., in this case, the loading could be:
lxvd2x VSR(H1M),0,HT
lxvd2x VSR(H1L),r9,HT
lxvd2x VSR(H2M),r10,HT
lxvd2x VSR(H2L),r11,HT
addi HT, HT, 64
lxvd2x VSR(H3M),0,HT
lxvd2x VSR(H3L),r9,HT
lxvd2x VSR(H4M),r10,HT
lxvd2x VSR(H4L),r11,HT
C do two 4x ghash
C previous digest combining
xxlor vs0, VSR(S0), VSR(S0)
vxor S0,S0,D
GF_MUL(F2, R2, H3L, H3M, S1)
GF_MUL(F, R, H4L, H4M, S0)
vxor F,F,F2
vxor R,R,R2
xxlor VSR(S0), vs0, vs0
GF_MUL(F3, R3, H2L, H2M, S2)
GF_MUL(F4, R4, H1L, H1M, S3)
vxor F3,F3,F4
vxor R3,R3,R4
vxor F,F,F3
vxor D,R,R3
GHASH_REDUCE(D, F, POLY_L, R2, F2)  C R2, F2 used as temporaries
It may be possible to reduce the number of registers, without making the code slower, by accumulating differently. You ultimately accumulate the values into F and D before the final reduction. Maybe you don't need separate registers for all of F, F2, F3, F4, and all of D, R, R2, R3, R4?
If you have any short-lived registers used for the AES part, they could also overlap with short-lived registers used for ghash.
Regards, /Niels
Hi Niels,
Here is the v5 patch based on your comments. Please review.
Thanks. -Danny
Danny Tsen dtsen@us.ibm.com writes:
Here is the v5 patch based on your comments. Please review.
Thanks. I think this looks pretty good. Maybe I should commit it on a branch and we can iterate from there. I'll be on vacation and mostly offline next week, though.
--- a/gcm-aes128.c
+++ b/gcm-aes128.c
@@ -63,6 +63,11 @@
void
gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
                   size_t length, uint8_t *dst, const uint8_t *src)
{
  size_t done = _gcm_aes_encrypt ((struct gcm_key *)ctx, _AES128_ROUNDS, length, dst, src);
  ctx->gcm.data_size += done;
  length -= done;
  src += done;
  dst += done;
  GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
}
We should come up with some preprocessor things to completely omit the new code on architectures that don't have _gcm_aes_encrypt (possibly with some macro to reduce duplication). I think that's the main thing I'd like to have before merge. Otherwise, looks nice and clean.
Ah, and I think you could write &ctx->key instead of the explicit cast.
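For what it's worth, a tiny self-contained illustration of why the two spellings are equivalent when the gcm_key is the first member (the struct bodies here are stand-ins, not Nettle's real definitions):

  #include <assert.h>
  #include <stdint.h>

  struct gcm_key { uint8_t table[256 * 16]; };              /* stand-in */
  struct gcm_aes128_ctx { struct gcm_key key; int rest; };  /* stand-in */

  int
  main (void)
  {
    struct gcm_aes128_ctx ctx;
    /* A pointer to the first member has the same address as the struct,
       so (struct gcm_key *) &ctx and &ctx.key refer to the same object. */
    assert ((void *) &ctx.key == (void *) &ctx);
    return 0;
  }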
C load table elements
li r9,1*16
li r10,2*16
li r11,3*16
lxvd2x VSR(H1M),0,HT
lxvd2x VSR(H1L),r9,HT
lxvd2x VSR(H2M),r10,HT
lxvd2x VSR(H2L),r11,HT
addi HT, HT, 64
lxvd2x VSR(H3M),0,HT
lxvd2x VSR(H3L),r9,HT
lxvd2x VSR(H4M),r10,HT
lxvd2x VSR(H4L),r11,HT
li r25,0x10
li r26,0x20
li r27,0x30
li r28,0x40
li r29,0x50
li r30,0x60
li r31,0x70
I still think there's opportunity to reduce the number of registers (and the corresponding load/store of callee-save registers). E.g., here r9-r11 are used for the same thing as r25-r27.
.align 5
C increase ctr value as input to aes_encrypt
vaddudm S1, S0, CNT1
vaddudm S2, S1, CNT1
vaddudm S3, S2, CNT1
vaddudm S4, S3, CNT1
vaddudm S5, S4, CNT1
vaddudm S6, S5, CNT1
vaddudm S7, S6, CNT1
This is a rather long dependency chain; I wonder if you could make a measurable saving of a cycle or two by using additional CNT2 or CNT4 registers (if not, it's preferable to keep the current simple chain).
Regards, /Niels
Hi Niels,
Please let me know when you merge the code and we can work from there.
Thanks. -Danny
Danny Tsen dtsen@us.ibm.com writes:
Please let me know when you merge the code and we can work from there.
Hi, I tried to apply and build with the v5 patch, and noticed some problems.
Declarations of _gcm_aes_encrypt / _gcm_aes_decrypt are missing. They can go in gcm-internal.h, like on this branch, https://git.lysator.liu.se/nettle/nettle/-/blob/x86_64-gcm-aes/gcm-internal.... Corresponding name mangling defines should also be in gcm-internal.h, not in the installed gcm.h header.
The file gcm-aes.c was missing in the patch. If the dummy C versions of _gcm_aes_*crypt are needed only for fat builds, maybe simplest to put the definitions in fat-ppc.c (maybe one can even use the same "return 0" dummy function for both encrypt and decrypt).
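For illustration, such a shared fallback could be as small as this (a sketch; the function name is made up, and in a fat build it would be wired up to both the encrypt and decrypt entry points):

  #include <stddef.h>
  #include <stdint.h>

  struct gcm_key;

  /* Returning 0 means "no blocks processed", so callers simply fall
     through to the generic C path. */
  size_t
  gcm_aes_crypt_dummy (struct gcm_key *key, unsigned rounds,
                       size_t size, uint8_t *dst, const uint8_t *src)
  {
    (void) key; (void) rounds; (void) size; (void) dst; (void) src;
    return 0;
  }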
It would also be nice if you could check that the new code is used and working in a non-fat build, configured with --disable-fat --enable-power-crypto-ext.
Regards, /Niels
Hi Niels,
My fault. I did not include gcm-aes-crypt.c in the patch. Here is the updated patch. Please apply this one and we can work from there.
Thanks. -Danny
Danny Tsen dtsen@us.ibm.com writes:
My fault. I did not include gcm-aes-crypt.c in the patch. Here is the updated patch. Please apply this one and we can work from there.
Thanks, now pushed onto a new branch ppc64-gcm-aes.
Regards, /Niels
Niels Möller nisse@lysator.liu.se writes:
Thanks, now pushed onto a new branch ppc64-gcm-aes.
I've now pushed some more changes to that branch: added gcm-internal.h, fixed the non-fat case, and moved around the nop definitions of the new functions.
Next, I'll have a look at register usage in the assembly code.
Regards, /Niels
Niels Möller nisse@lysator.liu.se writes:
Next, I'll have a look at register usage in the assembly code.
Below is an updated version of gcm-aes-encrypt.asm; it seems to work for me, and uses fewer of the regular registers. Some comments and questions:
1. What about the vsrX registers, 0 <= X < 32? They are used to copy values from and to the v registers (aka vsrX, 32 <= X < 64), e.g.,
xxlor vs1, VSR(S0), VSR(S0)
Can those registers be used freely, and how? If we can use them, we shouldn't need to save and restore any vector registers. They're not mentioned in powerpc64/README. Looking in the ELF v2 ABI spec, that seems to say that the low halves (which are used as floating point registers) are "volatile", but that's not quite enough?
2. From my reading of the ELF v2 ABI spec, there's a "protected zone" below the stack pointer that can be used freely for storage. Is that right? Or maybe that's only for the ELFv2 ABI?
3. Nit: In the copyright line, I'd like to delete the "All rights reserved" phrase. That's not in the copyright header of any other Nettle file, including those previously contributed by IBM.
Regards, /Niels
-----8<-------
C powerpc64/p8/gcm-aes-encrypt.asm

ifelse(`
   Copyright (C) 2023- IBM Inc. All rights reserved

   This file is part of GNU Nettle.

   GNU Nettle is free software: you can redistribute it and/or
   modify it under the terms of either:

     * the GNU Lesser General Public License as published by the Free
       Software Foundation; either version 3 of the License, or (at your
       option) any later version.

   or

     * the GNU General Public License as published by the Free
       Software Foundation; either version 2 of the License, or (at your
       option) any later version.

   or both in parallel, as here.

   GNU Nettle is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   General Public License for more details.

   You should have received copies of the GNU General Public License and
   the GNU Lesser General Public License along with this program.  If
   not, see http://www.gnu.org/licenses/.
')
C Register usage:
define(`SP', `r1')
define(`TOCP', `r2')

define(`HT', `r3')
define(`SRND', `r4')
define(`SLEN', `r5')
define(`SDST', `r6')
define(`SSRC', `r7')
define(`RK', `r8')
C r9-r11 used as constant indices.
define(`LOOP', `r12')

C
C vectors used in aes encrypt output
C

define(`K', `v1')
define(`S0', `v2')
define(`S1', `v3')
define(`S2', `v4')
define(`S3', `v5')
define(`S4', `v6')
define(`S5', `v7')
define(`S6', `v8')
define(`S7', `v9')

C
C ghash assigned registers and vectors
C

define(`ZERO', `v21')
define(`POLY', `v22')
define(`POLY_L', `v0')

define(`D', `v10')
define(`H1M', `v11')
define(`H1L', `v12')
define(`H2M', `v13')
define(`H2L', `v14')
define(`H3M', `v15')
define(`H3L', `v16')
define(`H4M', `v17')
define(`H4L', `v18')
define(`R', `v19')
define(`F', `v20')
define(`R2', `v21')
define(`F2', `v22')

define(`LE_TEMP', `v30')
define(`LE_MASK', `v31')

define(`CNT1', `v28')
define(`LASTCNT', `v29')
.file "gcm-aes-encrypt.asm"
.text
C size_t
C _gcm_aes_encrypt(struct gcm_key *key, size_t rounds,
C                  size_t len, uint8_t *dst, const uint8_t *src)
C

define(`FUNC_ALIGN', `5')
PROLOGUE(_nettle_gcm_aes_encrypt)
srdi. LOOP, SLEN, 7   C loop n 8 blocks
beq No_encrypt_out

C 288 byte "protected zone" is sufficient for storage.
stxv VSR(v20), -16(SP)
stxv VSR(v21), -32(SP)
stxv VSR(v22), -48(SP)
stxv VSR(v28), -64(SP)
stxv VSR(v29), -80(SP)
stxv VSR(v30), -96(SP)
stxv VSR(v31), -112(SP)

vxor ZERO,ZERO,ZERO
vspltisb CNT1, 1
vsldoi CNT1, ZERO, CNT1, 1   C counter 1

DATA_LOAD_VEC(POLY,.polynomial,r9)
IF_LE(`
li r9,0
lvsl LE_MASK,0,r9
vspltisb LE_TEMP,0x07
vxor LE_MASK,LE_MASK,LE_TEMP
')
xxmrghd VSR(POLY_L),VSR(ZERO),VSR(POLY)

C load table elements
li r9,1*16
li r10,2*16
li r11,3*16
lxvd2x VSR(H1M),0,HT
lxvd2x VSR(H1L),r9,HT
lxvd2x VSR(H2M),r10,HT
lxvd2x VSR(H2L),r11,HT
addi HT, HT, 64
lxvd2x VSR(H3M),0,HT
lxvd2x VSR(H3L),r9,HT
lxvd2x VSR(H4M),r10,HT
lxvd2x VSR(H4L),r11,HT
addi HT, HT, 4048 C Advance to point to the 'CTR' field in the context
lxvd2x VSR(D),r9,HT   C load 'X' pointer
C byte-reverse of each doubleword permuting on little-endian mode
IF_LE(`
vperm D,D,D,LE_MASK
')
lxvb16x VSR(S0), 0, HT C Load 'CTR'
sldi SLEN, LOOP, 7
addi LOOP, LOOP, -1
lxvd2x VSR(K),r11,HT   C First subkey
vperm K,K,K,LE_MASK

.align 5
C increase ctr value as input to aes_encrypt
vaddudm S1, S0, CNT1
vaddudm S2, S1, CNT1
vaddudm S3, S2, CNT1
vaddudm S4, S3, CNT1
vaddudm S5, S4, CNT1
vaddudm S6, S5, CNT1
vaddudm S7, S6, CNT1
vmr LASTCNT, S7   C save last cnt

OPN_XXY(vxor, K, S0, S1, S2, S3, S4, S5, S6, S7)

addi SRND, SRND, -1
mtctr SRND
addi RK, HT, 64   C Point at second subkey
.align 5
L8x_round_loop1:
lxvd2x VSR(K),0,RK
vperm K,K,K,LE_MASK
OPN_XXY(vcipher, K, S0, S1, S2, S3, S4, S5, S6, S7)
addi RK,RK,0x10
bdnz L8x_round_loop1

lxvd2x VSR(K),0,RK
vperm K,K,K,LE_MASK
OPN_XXY(vcipherlast, K, S0, S1, S2, S3, S4, S5, S6, S7)

cmpdi LOOP, 0
beq do_ghash

.align 5
Loop8x_en:
xxlor vs1, VSR(S0), VSR(S0)
xxlor vs2, VSR(S1), VSR(S1)
xxlor vs3, VSR(S2), VSR(S2)
xxlor vs4, VSR(S3), VSR(S3)
xxlor vs5, VSR(S4), VSR(S4)
xxlor vs6, VSR(S5), VSR(S5)
xxlor vs7, VSR(S6), VSR(S6)
xxlor vs8, VSR(S7), VSR(S7)

lxvd2x VSR(S0),0,SSRC
lxvd2x VSR(S1),r9,SSRC
lxvd2x VSR(S2),r10,SSRC
lxvd2x VSR(S3),r11,SSRC
addi SSRC, SSRC, 0x40
lxvd2x VSR(S4),0,SSRC
lxvd2x VSR(S5),r9,SSRC
lxvd2x VSR(S6),r10,SSRC
lxvd2x VSR(S7),r11,SSRC
addi SSRC, SSRC, 0x40
IF_LE(`OPN_XXXY(vperm, LE_MASK, S0,S1,S2,S3)')
xxlxor VSR(S0), VSR(S0), vs1
xxlxor VSR(S1), VSR(S1), vs2
xxlxor VSR(S2), VSR(S2), vs3
xxlxor VSR(S3), VSR(S3), vs4
IF_LE(`OPN_XXXY(vperm, LE_MASK, S4,S5,S6,S7)')
C do two 4x ghash
C previous digest combining
vxor D,S0,D

GF_MUL(F2, R2, H3L, H3M, S1)
GF_MUL(F, R, H4L, H4M, D)
vxor F,F,F2
vxor R,R,R2

GF_MUL(F2, R2, H2L, H2M, S2)
vxor F, F, F2
vxor R, R, R2
GF_MUL(F2, R2, H1L, H1M, S3)
vxor F, F, F2
vxor D, R, R2
GHASH_REDUCE(D, F, POLY_L, R2, F2) C R2, F2 used as temporaries
IF_LE(`OPN_XXXY(vperm, LE_MASK, S0,S1,S2,S3)')
stxvd2x VSR(S0),0,SDST
stxvd2x VSR(S1),r9,SDST
stxvd2x VSR(S2),r10,SDST
stxvd2x VSR(S3),r11,SDST
addi SDST, SDST, 0x40

xxlxor VSR(S4), VSR(S4), vs5
xxlxor VSR(S5), VSR(S5), vs6
xxlxor VSR(S6), VSR(S6), vs7
xxlxor VSR(S7), VSR(S7), vs8

C previous digest combining
vxor D,S4,D

GF_MUL(F2, R2, H3L, H3M, S5)
GF_MUL(F, R, H4L, H4M, D)
vxor F,F,F2
vxor R,R,R2

GF_MUL(F2, R2, H2L, H2M, S6)
vxor F, F, F2
vxor R, R, R2
GF_MUL(F2, R2, H1L, H1M, S7)
vxor F, F, F2
vxor D, R, R2
GHASH_REDUCE(D, F, POLY_L, R2, F2) C R2, F2 used as temporaries
IF_LE(`OPN_XXXY(vperm, LE_MASK, S4,S5,S6,S7)')
stxvd2x VSR(S4),0,SDST
stxvd2x VSR(S5),r9,SDST
stxvd2x VSR(S6),r10,SDST
stxvd2x VSR(S7),r11,SDST
addi SDST, SDST, 0x40

lxvd2x VSR(K),r11,HT   C First subkey
vperm K,K,K,LE_MASK

vaddudm S0, LASTCNT, CNT1
vaddudm S1, S0, CNT1
vaddudm S2, S1, CNT1
vaddudm S3, S2, CNT1
vaddudm S4, S3, CNT1
vaddudm S5, S4, CNT1
vaddudm S6, S5, CNT1
vaddudm S7, S6, CNT1
vmr LASTCNT, S7   C save last cnt to v29
OPN_XXY(vxor, K, S0, S1, S2, S3, S4, S5, S6, S7)
mtctr SRND
addi RK, HT, 64   C Point at second subkey
.align 5
L8x_round_loop2:
lxvd2x VSR(K),0,RK
vperm K,K,K,LE_MASK
OPN_XXY(vcipher, K, S0, S1, S2, S3, S4, S5, S6, S7)
addi RK,RK,0x10
bdnz L8x_round_loop2

lxvd2x VSR(K),0,RK
vperm K,K,K,LE_MASK
OPN_XXY(vcipherlast, K, S0, S1, S2, S3, S4, S5, S6, S7)
addi LOOP, LOOP, -1
cmpdi LOOP, 0
bne Loop8x_en

do_ghash:
xxlor vs1, VSR(S0), VSR(S0)
xxlor vs2, VSR(S1), VSR(S1)
xxlor vs3, VSR(S2), VSR(S2)
xxlor vs4, VSR(S3), VSR(S3)
xxlor vs5, VSR(S4), VSR(S4)
xxlor vs6, VSR(S5), VSR(S5)
xxlor vs7, VSR(S6), VSR(S6)
xxlor vs8, VSR(S7), VSR(S7)

lxvd2x VSR(S0),0,SSRC
lxvd2x VSR(S1),r9,SSRC
lxvd2x VSR(S2),r10,SSRC
lxvd2x VSR(S3),r11,SSRC
addi SSRC, SSRC, 0x40
lxvd2x VSR(S4),0,SSRC
lxvd2x VSR(S5),r9,SSRC
lxvd2x VSR(S6),r10,SSRC
lxvd2x VSR(S7),r11,SSRC
addi SSRC, SSRC, 0x40
IF_LE(`OPN_XXXY(vperm, LE_MASK, S0,S1,S2,S3)')
xxlxor VSR(S0), VSR(S0), vs1
xxlxor VSR(S1), VSR(S1), vs2
xxlxor VSR(S2), VSR(S2), vs3
xxlxor VSR(S3), VSR(S3), vs4
IF_LE(`OPN_XXXY(vperm, LE_MASK, S4,S5,S6,S7)')
C previous digest combining
vxor D,S0,D

GF_MUL(F2, R2, H3L, H3M, S1)
GF_MUL(F, R, H4L, H4M, D)
vxor F,F,F2
vxor R,R,R2

GF_MUL(F2, R2, H2L, H2M, S2)
vxor F, F, F2
vxor R, R, R2
GF_MUL(F2, R2, H1L, H1M, S3)
vxor F, F, F2
vxor D, R, R2
GHASH_REDUCE(D, F, POLY_L, R2, F2) C R2, F2 used as temporaries
IF_LE(`OPN_XXXY(vperm, LE_MASK, S0,S1,S2,S3)')
stxvd2x VSR(S0),0,SDST
stxvd2x VSR(S1),r9,SDST
stxvd2x VSR(S2),r10,SDST
stxvd2x VSR(S3),r11,SDST
addi SDST, SDST, 0x40

xxlxor VSR(S4), VSR(S4), vs5
xxlxor VSR(S5), VSR(S5), vs6
xxlxor VSR(S6), VSR(S6), vs7
xxlxor VSR(S7), VSR(S7), vs8

C previous digest combining
vxor D,S4,D

GF_MUL(F2, R2, H3L, H3M, S5)
GF_MUL(F, R, H4L, H4M, D)
vxor F,F,F2
vxor R,R,R2

GF_MUL(F2, R2, H2L, H2M, S6)
vxor F, F, F2
vxor R, R, R2
GF_MUL(F2, R2, H1L, H1M, S7)
vxor F, F, F2
vxor D, R, R2
GHASH_REDUCE(D, F, POLY_L, R2, F2) C R2, F2 used as temporaries
IF_LE(`OPN_XXXY(vperm, LE_MASK, S4,S5,S6,S7)')
stxvd2x VSR(S4),0,SDST
stxvd2x VSR(S5),r9,SDST
stxvd2x VSR(S6),r10,SDST
stxvd2x VSR(S7),r11,SDST

gcm_aes_out:
vaddudm LASTCNT, LASTCNT, CNT1   C increase ctr

C byte-reverse of each doubleword permuting on little-endian mode
IF_LE(`
vperm D,D,D,LE_MASK
')
stxvd2x VSR(D),r9,HT   C store digest 'D'

IF_LE(`
vperm LASTCNT,LASTCNT,LASTCNT,LE_MASK
')
stxvd2x VSR(LASTCNT), 0, HT   C store ctr

lxv VSR(v20), -16(SP)
lxv VSR(v21), -32(SP)
lxv VSR(v22), -48(SP)
lxv VSR(v28), -64(SP)
lxv VSR(v29), -80(SP)
lxv VSR(v30), -96(SP)
lxv VSR(v31), -112(SP)

mr r3, SLEN
blr

No_encrypt_out:
li r3, 0
blr
EPILOGUE(_nettle_gcm_aes_encrypt)

.data
C 0xC2000000000000000000000000000001
.polynomial:
.align 4
IF_BE(`
.byte 0xC2
.rept 14
.byte 0x00
.endr
.byte 0x01
',`
.byte 0x01
.rept 14
.byte 0x00
.endr
.byte 0xC2
')
Niels Möller nisse@lysator.liu.se writes:
Below is an updated version of gcm-aes-encrypt.asm, seems to work for me, and uses fewer of the regular registers. Some comments and questions:
What about the vsrX registers, 0 <= X < 32? They are used to copy values from and to the v registers (aka vsrX, 32 <= X < 64), e.g.,
xxlor vs1, VSR(S0), VSR(S0)
Can those registers be used freely, and how?
I've asked in a different forum, and as far as I understand, registers vs0-vs13 are free to use ("volatile"), because half of each corresponds to a volatile floating point register (fpr0-fpr13), while registers vs14-vs31 need to be saved and restored if used (the halves corresponding to fpr14-fpr31 are non-volatile, so in principle, it would be sufficient to save and restore just those halves).
- From my reading of the ELF v2 ABI spec, there's a "protected zone" below the stack pointer that can be used freely for storage. Is that right? Or maybe that's only for the ELFv2 ABI?
That appears to be the same in ELFv1 ABI, see https://refspecs.linuxfoundation.org/ELF/ppc64/PPC-elf64abi.html#STACK
One other question: In the counter updates,
C increase ctr value as input to aes_encrypt
vaddudm S1, S0, CNT1
vaddudm S2, S1, CNT1
vaddudm S3, S2, CNT1
vaddudm S4, S3, CNT1
vaddudm S5, S4, CNT1
vaddudm S6, S5, CNT1
vaddudm S7, S6, CNT1
shouldn't that be vadduwm (32-bit word addition, rather than 64-bit dword addition)? As I understand it, gcm uses a 32-bit counter, which should wrap around without any carry to higher bits if the initial value is just below 2^32.
Regards, /Niels
Niels Möller nisse@lysator.liu.se writes:
One other question: In the counter updates,
C increase ctr value as input to aes_encrypt
vaddudm S1, S0, CNT1
vaddudm S2, S1, CNT1
vaddudm S3, S2, CNT1
vaddudm S4, S3, CNT1
vaddudm S5, S4, CNT1
vaddudm S6, S5, CNT1
vaddudm S7, S6, CNT1
shouldn't that be vadduwm (32-bit word addition, rather than 64-bit dword addition)? As I understand it, gcm uses a 32-bit counter, which should wrap around without any carry to higher bits if the initial value is just below 2^32.
I've added tests that set the initial counter so that the four counter bytes wrap around 2^32, and I've verified that these instructions indeed need to be changed to vadduwm to get output that agrees with nettle's other gcm implementations.
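For reference, the counter semantics those tests exercise, written out in C (a sketch of the GCM rule, not Nettle's code): only the last 32 bits of the counter block are incremented, big-endian, wrapping modulo 2^32 with no carry into the upper 96 bits, which is why vadduwm is right and vaddudm is not.

  #include <stdint.h>

  static void
  gcm_ctr32_increment (uint8_t ctr[16])
  {
    uint32_t c = ((uint32_t) ctr[12] << 24) | ((uint32_t) ctr[13] << 16)
               | ((uint32_t) ctr[14] << 8)  |  (uint32_t) ctr[15];
    c++;                        /* wraps at 2^32; bytes 0..11 untouched */
    ctr[12] = c >> 24;
    ctr[13] = (c >> 16) & 0xff;
    ctr[14] = (c >> 8) & 0xff;
    ctr[15] = c & 0xff;
  }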
Another question on powerpc64 assembly: For the byte swapping, currently done using the vperm instruction and a mask word, is there any reason to not use the xxbrd instruction (VSX Vector Byte-Reverse Doubleword) instead? That applies to more functions than the new gcm-aes code.
Regards, /Niels
Niels Möller nisse@lysator.liu.se writes:
I've added tests that set the initial counter so that the four counter bytes wrap around 2^32, and I've verified that these instructions indeed need to be changed to vadduwm to get output that agrees with nettle's other gcm implementations.
I've committed those fixes, and a fix for big-endian support, on the branch ppc64-gcm-aes-rebased. I think that's now ready for merging.
I see some opportunities for further improvement, but that can be done after the merge, together with related fixes to the other ppc64 assembly files, for consistency.
Another question on powerpc64 assembly: For the byte swapping, currently done using the vperm instruction and a mask word, is there any reason to not use the xxbrd instruction (VSX Vector Byte-Reverse Doubleword) instead? That applies to more functions than the new gcm-aes code.
A closer look at the spec indicated that xxbrd is only available from power9 (i.e., if the processor supports VSX, *and* supports ISA 3.0, if I've understood it correctly).
I think it would be a good idea to consistently use pseudo-ops like
.machine "power8"
in the ppc assembly files, if that would let the assembler catch accidental use of unavailable instructions.
Regards, /Niels