In preparation for merging the gcm-aes "stitched" implementation, I'm reviewing the existing ghash code; see the WIP branch "ppc-ghash-macros".
I've introduced a macro GHASH_REDUCE, for the reduction logic. Besides that, I've been able to improve scheduling of the reduction instructions (adding in the result of vpmsumd last seems to improve parallelism, some 3% speedup of gcm_update on power10, benchmarked on cfarm120). I've also streamlined the way load offsets are used, and trimmed the number of needed vector registers slightly.
For the AES code, I've merged the new macros (I settled on the names OPN_XXY and OPN_XXXY), no change in speed expected from that change.
I've also tried to understand the difference between AES encrypt and decrypt, where decrypt is much slower, and uses an extra xor instruction in the round loop. I think the reason for that is that other AES implementations (including the x86_64 and arm64 instructions, and Nettle's C implementation) expect the decryption subkeys to be transformed via the AES "MIX_COLUMN" operation, see https://gitlab.com/gnutls/nettle/-/blob/master/aes-invert-internal.c?ref_typ...
The powerpc64 vncipher instruction, on the other hand, really wants the original subkeys, not the transformed ones. So on power, it would be better to have a _nettle_aes_invert that is essentially a memcpy; the aes decrypt assembly code could then be reworked without the xors, and run at exactly the same speed as encryption. The current _nettle_aes_invert also changes the order of the subkeys, with a FIXME comment suggesting that it would be better to update the order in which keys are accessed in the aes decryption functions.
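For illustration, a trivial invert along those lines could look like this (a minimal sketch only; the name and exact signature here are illustrative, not Nettle's actual internals):

  /* Sketch: decrypt uses the unmodified round keys, so "inversion"
     degenerates into a plain copy; no MIX_COLUMN transform, no
     reordering. */
  #include <stdint.h>
  #include <string.h>

  static void
  aes_invert_trivial (unsigned rounds, uint32_t *dst, const uint32_t *src)
  {
    /* rounds + 1 round keys, 4 32-bit words each. */
    memcpy (dst, src, (rounds + 1) * 4 * sizeof (uint32_t));
  }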
Regards, /Niels
Hi Niels,
Thanks for merging the stitched implementation for PPC64, and for your detailed information and efforts.
Thanks. -Danny
Danny Tsen dtsen@us.ibm.com writes:
Thanks for merging the stitched implementation for PPC64, and for your detailed information and efforts.
We're not quite there yet, though. Do you think you could rebase your work on top of the recent changes? Sorry about the conflicts, but I think the new macros should fit well with what you need (feel free to add more macros where you find that useful).
Regards, /Niels
Hi Niels,
It may take a while before I digest the changes and redo the work; it could be another two weeks, or longer, since I will be taking time off and will be out of the country. Will do my best.
Thanks. -Danny
Niels Möller nisse@lysator.liu.se writes:
While the powerpc64 vncipher instruction really wants the original subkeys, not transformed. So on power, it would be better to have a _nettle_aes_invert that is essentially a memcpy, and then the aes decrypt assembly code could be reworked without the xors, and run at exactly the same speed as encryption.
I've tried this out, see branch https://git.lysator.liu.se/nettle/nettle/-/tree/ppc64-aes-invert. It appears to give the desired improvement in aes decrypt speed, making it run at the same speed as aes encrypt, which is a speedup of about 80% when benchmarked on power10 (the cfarm120 machine).
Current _nettle_aes_invert also changes the order of the subkeys, with a FIXME comment suggesting that it would be better to update the order keys are accessed in the aes decryption functions.
I've merged the changes to keep subkey order the same for encrypt and decrypt (so that the decrypt round loop uses subkeys starting at the end of the array), which affects all aes implementations except s390x, which doesn't need any subkey expansion. But I've deleted the sparc32 assembly rather than updating it.
Regards, /Niels
Hi Niels,
Here is the new patch v4 for AES/GCM stitched implementation and benchmark based on the current repo.
Thanks. -Danny
Danny Tsen dtsen@us.ibm.com writes:
Here is the new patch v4 for AES/GCM stitched implementation and benchmark based on the current repo.
Thanks. I'm not able to read it all carefully at the moment, but I have a few comments, see below.
In the meantime, I've also tried to implement something similar for x86_64, see branch x86_64-gcm-aes. Unfortunately, I get no speedup; on the contrary, my stitched implementation seems considerably slower... But at least that helped me understand the higher-level issues better.
--- a/gcm-aes128.c
+++ b/gcm-aes128.c
@@ -63,14 +63,30 @@
void
gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
                   size_t length, uint8_t *dst, const uint8_t *src)
{
  GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
  size_t done = _gcm_aes_encrypt (&ctx->key, &ctx->gcm.x.b, &ctx->gcm.ctr.b,
                                  _AES128_ROUNDS, &ctx->cipher.keys, length, dst, src);
I know I asked you to explode the context into many separate arguments to _gcm_aes_encrypt. I'm now backpedalling a bit on that. For one, it's not so nice to have so many arguments that they can't be passed in registers. Second, when running a fat build on a machine where the needed instructions are unavailable, it's a bit of a waste to have to spend lots of instructions on preparing those arguments for calling a nop function. So to reduce overhead, I'm now leaning towards an interface like
/* To reduce the number of arguments (e.g., maximum of 6 register
   arguments on x86_64), pass a pointer to gcm_key, which really is a
   pointer to the first member of the appropriate gcm_aes*_ctx
   struct. */
size_t
_gcm_aes_encrypt (struct gcm_key *CTX, unsigned rounds,
                  size_t size, uint8_t *dst, const uint8_t *src);
That's not so pretty, but I think that is workable and efficient, and since it is an internal function, the interface can be changed if this is implemented on other architectures and we find out that it needs some tweaks. See https://git.lysator.liu.se/nettle/nettle/-/blob/x86_64-gcm-aes/x86_64/aesni_... for the code I wrote to accept that ctx argument.
It would also be nice to have a #define around the code calling _gcm_aes_encrypt, so that it is compiled only if (i) we have a non-trivial implementation of _gcm_aes_encrypt, or (ii) we're a fat build, which may select a non-trivial implementation of _gcm_aes_encrypt at run time.
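For illustration, a rough sketch of what such a guard could look like in gcm-aes128.c (HAVE_NATIVE_GCM_AES is a placeholder symbol, not an actual configure macro):

  void
  gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
                     size_t length, uint8_t *dst, const uint8_t *src)
  {
  #if HAVE_NATIVE_GCM_AES
    /* Compiled only when a non-trivial _gcm_aes_encrypt exists, or in a
       fat build that may select one at run time. */
    size_t done = _gcm_aes_encrypt (&ctx->key, _AES128_ROUNDS,
                                    length, dst, src);
    ctx->gcm.data_size += done;
    length -= done;
    src += done;
    dst += done;
  #endif
    /* The generic path handles whatever (possibly all) data remains. */
    GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
  }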
  ctx->gcm.data_size += done;
  length -= done;
  if (length > 0) {
Not sure if the check for length > 0 is needed. It is fine to call gcm_encrypt/GCM_ENCRYPT with length 0. There will be some overhead for a call with length 0, though, which may be a more common case when _gcm_aes_encrypt is used?
define(`SAVE_GPR', `std $1, $2(SP)')
define(`RESTORE_GPR', `ld $1, $2(SP)')
I think the above two macros are unneeded, it's easier to read to use std and ld directly.
define(`SAVE_VR',
  `li r11, $2
   stvx $1, r11, $3')
define(`RESTORE_VR',
  `li r11, $2
   lvx $1, r11, $3')
It would be nice if we could trim the use of vector registers so we don't need to save and restore lots of them. And if we need two instructions anyway, then maybe it would be clearer with PUSH_VR/POP_VR that also adjusts the stack pointer, and doesn't need to use an additional register for indexing?
C load table elements
li r9,1*16
li r10,2*16
li r11,3*16
lxvd2x VSR(H1M),0,HT
lxvd2x VSR(H1L),r9,HT
lxvd2x VSR(H2M),r10,HT
lxvd2x VSR(H2L),r11,HT
li r9,4*16
li r10,5*16
li r11,6*16
li r12,7*16
lxvd2x VSR(H3M),r9,HT
lxvd2x VSR(H3L),r10,HT
lxvd2x VSR(H4M),r11,HT
lxvd2x VSR(H4L),r12,HT
I think it would be nicer to follow the style I tried to implement in my recent updates, using some registers (e.g., r9-r11) as offsets, initializing them only once, and using them everywhere. E.g., in this case, the loading could be:
lxvd2x VSR(H1M),0,HT
lxvd2x VSR(H1L),r9,HT
lxvd2x VSR(H2M),r10,HT
lxvd2x VSR(H2L),r11,HT
addi HT, HT, 64
lxvd2x VSR(H3M),0,HT
lxvd2x VSR(H3L),r9,HT
lxvd2x VSR(H4M),r10,HT
lxvd2x VSR(H4L),r11,HT
C do two 4x ghash
C previous digest combining
xxlor vs0, VSR(S0), VSR(S0)
vxor S0,S0,D
GF_MUL(F2, R2, H3L, H3M, S1)
GF_MUL(F, R, H4L, H4M, S0)
vxor F,F,F2
vxor R,R,R2
xxlor VSR(S0), vs0, vs0
GF_MUL(F3, R3, H2L, H2M, S2)
GF_MUL(F4, R4, H1L, H1M, S3)
vxor F3,F3,F4
vxor R3,R3,R4
vxor F,F,F3
vxor D,R,R3
GHASH_REDUCE(D, F, POLY_L, R2, F2)  C R2, F2 used as temporaries
It may be possible to reduce the number of registers, without making the code slower, by accumulating differently. You ultimately accumulate the values into F and D before the final reduction. Maybe you don't need separate registers for all of F, F2, F3, F4, and all of D, R, R2, R3, R4?
If you have any short-lived registers used for the AES part, they could also overlap with short-lived registers used for ghash.
Regards, /Niels
Hi Niels,
Here is the v5 patch based on your comments. Please review.
Thanks. -Danny
Danny Tsen dtsen@us.ibm.com writes:
Here is the v5 patch based on your comments. Please review.
Thanks. I think this looks pretty good. Maybe I should commit it on a branch and we can iterate from there. I'll be on vacation and mostly offline next week, though.
--- a/gcm-aes128.c
+++ b/gcm-aes128.c
@@ -63,6 +63,11 @@
void
gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
                   size_t length, uint8_t *dst, const uint8_t *src)
{
  size_t done = _gcm_aes_encrypt ((struct gcm_key *)ctx, _AES128_ROUNDS, length, dst, src);
  ctx->gcm.data_size += done;
  length -= done;
  src += done;
  dst += done;
  GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
}
We should come up with some preprocessor things to completely omit the new code on architectures that don't have _gcm_aes_encrypt (possibly with some macro to reduce duplication). I think that's the main thing I'd like to have before merge. Otherwise, looks nice and clean.
Ah, and I think you could write &ctx->key instead of the explicit cast.
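For what it's worth, a tiny self-contained illustration of why the two spellings are equivalent when the gcm_key is the first member (the struct bodies here are stand-ins, not Nettle's real definitions):

  #include <assert.h>
  #include <stdint.h>

  struct gcm_key { uint8_t table[256 * 16]; };              /* stand-in */
  struct gcm_aes128_ctx { struct gcm_key key; int rest; };  /* stand-in */

  int
  main (void)
  {
    struct gcm_aes128_ctx ctx;
    /* A pointer to the first member has the same address as the struct,
       so (struct gcm_key *) &ctx and &ctx.key refer to the same object. */
    assert ((void *) &ctx.key == (void *) &ctx);
    return 0;
  }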
C load table elements
li r9,1*16
li r10,2*16
li r11,3*16
lxvd2x VSR(H1M),0,HT
lxvd2x VSR(H1L),r9,HT
lxvd2x VSR(H2M),r10,HT
lxvd2x VSR(H2L),r11,HT
addi HT, HT, 64
lxvd2x VSR(H3M),0,HT
lxvd2x VSR(H3L),r9,HT
lxvd2x VSR(H4M),r10,HT
lxvd2x VSR(H4L),r11,HT
li r25,0x10
li r26,0x20
li r27,0x30
li r28,0x40
li r29,0x50
li r30,0x60
li r31,0x70
I still think there's opportunity to reduce the number of registers (and the corresponding load/store of callee-save registers). E.g., here r9-r11 are used for the same thing as r25-r27.
.align 5
C increase ctr value as input to aes_encrypt
vaddudm S1, S0, CNT1
vaddudm S2, S1, CNT1
vaddudm S3, S2, CNT1
vaddudm S4, S3, CNT1
vaddudm S5, S4, CNT1
vaddudm S6, S5, CNT1
vaddudm S7, S6, CNT1
This is a rather long dependency chain; I wonder if you could make a measurable saving of a cycle or two by using additional CNT2 or CNT4 registers (if not, it's preferable to keep the current simple chain).
Regards, /Niels
Hi Niels,
Please let me know when you merge the code and we can work from there.
Thanks. -Danny
Danny Tsen dtsen@us.ibm.com writes:
Please let me know when you merge the code and we can work from there.
Hi, I tried to apply and build with the v5 patch, and noticed some problems.
Declarations of _gcm_aes_encrypt / _gcm_aes_decrypt are missing. They can go in gcm-internal.h, like on this branch, https://git.lysator.liu.se/nettle/nettle/-/blob/x86_64-gcm-aes/gcm-internal.... Corresponding name mangling defines should also be in gcm-internal.h, not in the installed gcm.h header.
The file gcm-aes.c was missing in the patch. If the dummy C versions of _gcm_aes_*crypt are needed only for fat builds, maybe simplest to put the definitions in fat-ppc.c (maybe one can even use the same "return 0" dummy function for both encrypt and decrypt).
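For illustration, such a shared fallback could be as small as this (a sketch; the function name is made up, and in a fat build it would be wired up to both the encrypt and decrypt entry points):

  #include <stddef.h>
  #include <stdint.h>

  struct gcm_key;

  /* Returning 0 means "no blocks processed", so callers simply fall
     through to the generic C path. */
  size_t
  gcm_aes_crypt_dummy (struct gcm_key *key, unsigned rounds,
                       size_t size, uint8_t *dst, const uint8_t *src)
  {
    (void) key; (void) rounds; (void) size; (void) dst; (void) src;
    return 0;
  }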
It would also be nice if you could check that the new code is used and working in a non-fat build, configured with --disable-fat --enable-power-crypto-ext.
Regards, /Niels
Hi Niels,
My fault. I did not include gcm-aes-crypt.c in the patch. Here is the updated patch. Please apply this one and we can work from there.
Thanks. -Danny
Danny Tsen dtsen@us.ibm.com writes:
My fault. I did not include gcm-aes-crypt.c in the patch. Here is the updated patch. Please apply this one and we can work from there.
Thanks, now pushed onto a new branch ppc64-gcm-aes.
Regards, /Niels
Niels Möller nisse@lysator.liu.se writes:
Thanks, now pushed onto a new branch ppc64-gcm-aes.
I've now pushed some more changes to that branch: added gcm-internal.h, fixed the non-fat case, and moved around the nop definitions of the new functions.
Next, I'll have a look at register usage in the assembly code.
Regards, /Niels
Niels Möller nisse@lysator.liu.se writes:
Next, I'll have a look at register usage in the assembly code.
Below is an updated version of gcm-aes-encrypt.asm; it seems to work for me, and uses fewer of the regular registers. Some comments and questions:
1. What about the vsrX registers, 0 <= X < 32? They are used to copy values from and to the v registers (aka vsrX, 32 <= X < 64), e.g.,
xxlor vs1, VSR(S0), VSR(S0)
Can those registers be used freely, and how? If we can use them, we shouldn't need to save and restore any vector registers. They're not mentioned in powerpc64/README. Looking in the ELF v2 ABI spec, that seems to say that the low halves (which are used as floating point registers) are "volatile", but that's not quite enough?
2. From my reading of the ELF v2 ABI spec, there's a "protected zone" below the stack pointer that can be used freely for storage. Is that right? Or maybe that's only for the ELFv2 ABI?
3. Nit: In the copyright line, I'd like to delete the "All rights reserved" phrase. That's not in the copyright header of any other Nettle file, including those previously contributed by IBM.
Regards, /Niels
-----8<-------
C powerpc64/p8/gcm-aes-encrypt.asm

ifelse(`
   Copyright (C) 2023- IBM Inc. All rights reserved

   This file is part of GNU Nettle.

   GNU Nettle is free software: you can redistribute it and/or
   modify it under the terms of either:

     * the GNU Lesser General Public License as published by the Free
       Software Foundation; either version 3 of the License, or (at your
       option) any later version.

   or

     * the GNU General Public License as published by the Free
       Software Foundation; either version 2 of the License, or (at your
       option) any later version.

   or both in parallel, as here.

   GNU Nettle is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   General Public License for more details.

   You should have received copies of the GNU General Public License and
   the GNU Lesser General Public License along with this program.  If
   not, see http://www.gnu.org/licenses/.
')
C Register usage:
define(`SP', `r1')
define(`TOCP', `r2')

define(`HT', `r3')
define(`SRND', `r4')
define(`SLEN', `r5')
define(`SDST', `r6')
define(`SSRC', `r7')
define(`RK', `r8')
C r9-r11 used as constant indices.
define(`LOOP', `r12')

C
C vectors used in aes encrypt output
C

define(`K', `v1')
define(`S0', `v2')
define(`S1', `v3')
define(`S2', `v4')
define(`S3', `v5')
define(`S4', `v6')
define(`S5', `v7')
define(`S6', `v8')
define(`S7', `v9')

C
C ghash assigned registers and vectors
C

define(`ZERO', `v21')
define(`POLY', `v22')
define(`POLY_L', `v0')

define(`D', `v10')
define(`H1M', `v11')
define(`H1L', `v12')
define(`H2M', `v13')
define(`H2L', `v14')
define(`H3M', `v15')
define(`H3L', `v16')
define(`H4M', `v17')
define(`H4L', `v18')
define(`R', `v19')
define(`F', `v20')
define(`R2', `v21')
define(`F2', `v22')

define(`LE_TEMP', `v30')
define(`LE_MASK', `v31')

define(`CNT1', `v28')
define(`LASTCNT', `v29')
.file "gcm-aes-encrypt.asm"
.text
C size_t
C _gcm_aes_encrypt(struct gcm_key *key, size_t rounds,
C                  size_t len, uint8_t *dst, const uint8_t *src)
C

define(`FUNC_ALIGN', `5')
PROLOGUE(_nettle_gcm_aes_encrypt)
srdi. LOOP, SLEN, 7   C loop n 8 blocks
beq No_encrypt_out

C 288 byte "protected zone" is sufficient for storage.
stxv VSR(v20), -16(SP)
stxv VSR(v21), -32(SP)
stxv VSR(v22), -48(SP)
stxv VSR(v28), -64(SP)
stxv VSR(v29), -80(SP)
stxv VSR(v30), -96(SP)
stxv VSR(v31), -112(SP)

vxor ZERO,ZERO,ZERO
vspltisb CNT1, 1
vsldoi CNT1, ZERO, CNT1, 1   C counter 1

DATA_LOAD_VEC(POLY,.polynomial,r9)
IF_LE(`
li r9,0
lvsl LE_MASK,0,r9
vspltisb LE_TEMP,0x07
vxor LE_MASK,LE_MASK,LE_TEMP
')
xxmrghd VSR(POLY_L),VSR(ZERO),VSR(POLY)

C load table elements
li r9,1*16
li r10,2*16
li r11,3*16
lxvd2x VSR(H1M),0,HT
lxvd2x VSR(H1L),r9,HT
lxvd2x VSR(H2M),r10,HT
lxvd2x VSR(H2L),r11,HT
addi HT, HT, 64
lxvd2x VSR(H3M),0,HT
lxvd2x VSR(H3L),r9,HT
lxvd2x VSR(H4M),r10,HT
lxvd2x VSR(H4L),r11,HT
addi HT, HT, 4048 C Advance to point to the 'CTR' field in the context
lxvd2x VSR(D),r9,HT   C load 'X' pointer
C byte-reverse of each doubleword permuting on little-endian mode
IF_LE(`
vperm D,D,D,LE_MASK
')
lxvb16x VSR(S0), 0, HT C Load 'CTR'
sldi SLEN, LOOP, 7
addi LOOP, LOOP, -1
lxvd2x VSR(K),r11,HT   C First subkey
vperm K,K,K,LE_MASK

.align 5
C increase ctr value as input to aes_encrypt
vaddudm S1, S0, CNT1
vaddudm S2, S1, CNT1
vaddudm S3, S2, CNT1
vaddudm S4, S3, CNT1
vaddudm S5, S4, CNT1
vaddudm S6, S5, CNT1
vaddudm S7, S6, CNT1
vmr LASTCNT, S7   C save last cnt

OPN_XXY(vxor, K, S0, S1, S2, S3, S4, S5, S6, S7)

addi SRND, SRND, -1
mtctr SRND
addi RK, HT, 64   C Point at second subkey
.align 5
L8x_round_loop1:
lxvd2x VSR(K),0,RK
vperm K,K,K,LE_MASK
OPN_XXY(vcipher, K, S0, S1, S2, S3, S4, S5, S6, S7)
addi RK,RK,0x10
bdnz L8x_round_loop1

lxvd2x VSR(K),0,RK
vperm K,K,K,LE_MASK
OPN_XXY(vcipherlast, K, S0, S1, S2, S3, S4, S5, S6, S7)

cmpdi LOOP, 0
beq do_ghash

.align 5
Loop8x_en:
xxlor vs1, VSR(S0), VSR(S0)
xxlor vs2, VSR(S1), VSR(S1)
xxlor vs3, VSR(S2), VSR(S2)
xxlor vs4, VSR(S3), VSR(S3)
xxlor vs5, VSR(S4), VSR(S4)
xxlor vs6, VSR(S5), VSR(S5)
xxlor vs7, VSR(S6), VSR(S6)
xxlor vs8, VSR(S7), VSR(S7)

lxvd2x VSR(S0),0,SSRC
lxvd2x VSR(S1),r9,SSRC
lxvd2x VSR(S2),r10,SSRC
lxvd2x VSR(S3),r11,SSRC
addi SSRC, SSRC, 0x40
lxvd2x VSR(S4),0,SSRC
lxvd2x VSR(S5),r9,SSRC
lxvd2x VSR(S6),r10,SSRC
lxvd2x VSR(S7),r11,SSRC
addi SSRC, SSRC, 0x40
IF_LE(`OPN_XXXY(vperm, LE_MASK, S0,S1,S2,S3)')
xxlxor VSR(S0), VSR(S0), vs1
xxlxor VSR(S1), VSR(S1), vs2
xxlxor VSR(S2), VSR(S2), vs3
xxlxor VSR(S3), VSR(S3), vs4
IF_LE(`OPN_XXXY(vperm, LE_MASK, S4,S5,S6,S7)')
C do two 4x ghash
C previous digest combining
vxor D,S0,D

GF_MUL(F2, R2, H3L, H3M, S1)
GF_MUL(F, R, H4L, H4M, D)
vxor F,F,F2
vxor R,R,R2

GF_MUL(F2, R2, H2L, H2M, S2)
vxor F, F, F2
vxor R, R, R2
GF_MUL(F2, R2, H1L, H1M, S3)
vxor F, F, F2
vxor D, R, R2
GHASH_REDUCE(D, F, POLY_L, R2, F2) C R2, F2 used as temporaries
IF_LE(`OPN_XXXY(vperm, LE_MASK, S0,S1,S2,S3)')
stxvd2x VSR(S0),0,SDST
stxvd2x VSR(S1),r9,SDST
stxvd2x VSR(S2),r10,SDST
stxvd2x VSR(S3),r11,SDST
addi SDST, SDST, 0x40

xxlxor VSR(S4), VSR(S4), vs5
xxlxor VSR(S5), VSR(S5), vs6
xxlxor VSR(S6), VSR(S6), vs7
xxlxor VSR(S7), VSR(S7), vs8

C previous digest combining
vxor D,S4,D

GF_MUL(F2, R2, H3L, H3M, S5)
GF_MUL(F, R, H4L, H4M, D)
vxor F,F,F2
vxor R,R,R2

GF_MUL(F2, R2, H2L, H2M, S6)
vxor F, F, F2
vxor R, R, R2
GF_MUL(F2, R2, H1L, H1M, S7)
vxor F, F, F2
vxor D, R, R2
GHASH_REDUCE(D, F, POLY_L, R2, F2) C R2, F2 used as temporaries
IF_LE(`OPN_XXXY(vperm, LE_MASK, S4,S5,S6,S7)')
stxvd2x VSR(S4),0,SDST
stxvd2x VSR(S5),r9,SDST
stxvd2x VSR(S6),r10,SDST
stxvd2x VSR(S7),r11,SDST
addi SDST, SDST, 0x40

lxvd2x VSR(K),r11,HT   C First subkey
vperm K,K,K,LE_MASK

vaddudm S0, LASTCNT, CNT1
vaddudm S1, S0, CNT1
vaddudm S2, S1, CNT1
vaddudm S3, S2, CNT1
vaddudm S4, S3, CNT1
vaddudm S5, S4, CNT1
vaddudm S6, S5, CNT1
vaddudm S7, S6, CNT1
vmr LASTCNT, S7   C save last cnt to v29
OPN_XXY(vxor, K, S0, S1, S2, S3, S4, S5, S6, S7)
mtctr SRND
addi RK, HT, 64   C Point at second subkey
.align 5
L8x_round_loop2:
lxvd2x VSR(K),0,RK
vperm K,K,K,LE_MASK
OPN_XXY(vcipher, K, S0, S1, S2, S3, S4, S5, S6, S7)
addi RK,RK,0x10
bdnz L8x_round_loop2

lxvd2x VSR(K),0,RK
vperm K,K,K,LE_MASK
OPN_XXY(vcipherlast, K, S0, S1, S2, S3, S4, S5, S6, S7)
addi LOOP, LOOP, -1
cmpdi LOOP, 0
bne Loop8x_en

do_ghash:
xxlor vs1, VSR(S0), VSR(S0)
xxlor vs2, VSR(S1), VSR(S1)
xxlor vs3, VSR(S2), VSR(S2)
xxlor vs4, VSR(S3), VSR(S3)
xxlor vs5, VSR(S4), VSR(S4)
xxlor vs6, VSR(S5), VSR(S5)
xxlor vs7, VSR(S6), VSR(S6)
xxlor vs8, VSR(S7), VSR(S7)

lxvd2x VSR(S0),0,SSRC
lxvd2x VSR(S1),r9,SSRC
lxvd2x VSR(S2),r10,SSRC
lxvd2x VSR(S3),r11,SSRC
addi SSRC, SSRC, 0x40
lxvd2x VSR(S4),0,SSRC
lxvd2x VSR(S5),r9,SSRC
lxvd2x VSR(S6),r10,SSRC
lxvd2x VSR(S7),r11,SSRC
addi SSRC, SSRC, 0x40
IF_LE(`OPN_XXXY(vperm, LE_MASK, S0,S1,S2,S3)')
xxlxor VSR(S0), VSR(S0), vs1
xxlxor VSR(S1), VSR(S1), vs2
xxlxor VSR(S2), VSR(S2), vs3
xxlxor VSR(S3), VSR(S3), vs4
IF_LE(`OPN_XXXY(vperm, LE_MASK, S4,S5,S6,S7)')
C previous digest combining
vxor D,S0,D

GF_MUL(F2, R2, H3L, H3M, S1)
GF_MUL(F, R, H4L, H4M, D)
vxor F,F,F2
vxor R,R,R2

GF_MUL(F2, R2, H2L, H2M, S2)
vxor F, F, F2
vxor R, R, R2
GF_MUL(F2, R2, H1L, H1M, S3)
vxor F, F, F2
vxor D, R, R2
GHASH_REDUCE(D, F, POLY_L, R2, F2) C R2, F2 used as temporaries
IF_LE(`OPN_XXXY(vperm, LE_MASK, S0,S1,S2,S3)')
stxvd2x VSR(S0),0,SDST
stxvd2x VSR(S1),r9,SDST
stxvd2x VSR(S2),r10,SDST
stxvd2x VSR(S3),r11,SDST
addi SDST, SDST, 0x40

xxlxor VSR(S4), VSR(S4), vs5
xxlxor VSR(S5), VSR(S5), vs6
xxlxor VSR(S6), VSR(S6), vs7
xxlxor VSR(S7), VSR(S7), vs8

C previous digest combining
vxor D,S4,D

GF_MUL(F2, R2, H3L, H3M, S5)
GF_MUL(F, R, H4L, H4M, D)
vxor F,F,F2
vxor R,R,R2

GF_MUL(F2, R2, H2L, H2M, S6)
vxor F, F, F2
vxor R, R, R2
GF_MUL(F2, R2, H1L, H1M, S7)
vxor F, F, F2
vxor D, R, R2
GHASH_REDUCE(D, F, POLY_L, R2, F2) C R2, F2 used as temporaries
IF_LE(`OPN_XXXY(vperm, LE_MASK, S4,S5,S6,S7)')
stxvd2x VSR(S4),0,SDST
stxvd2x VSR(S5),r9,SDST
stxvd2x VSR(S6),r10,SDST
stxvd2x VSR(S7),r11,SDST

gcm_aes_out:
vaddudm LASTCNT, LASTCNT, CNT1   C increase ctr

C byte-reverse of each doubleword permuting on little-endian mode
IF_LE(`
vperm D,D,D,LE_MASK
')
stxvd2x VSR(D),r9,HT   C store digest 'D'

IF_LE(`
vperm LASTCNT,LASTCNT,LASTCNT,LE_MASK
')
stxvd2x VSR(LASTCNT), 0, HT   C store ctr

lxv VSR(v20), -16(SP)
lxv VSR(v21), -32(SP)
lxv VSR(v22), -48(SP)
lxv VSR(v28), -64(SP)
lxv VSR(v29), -80(SP)
lxv VSR(v30), -96(SP)
lxv VSR(v31), -112(SP)

mr r3, SLEN
blr

No_encrypt_out:
li r3, 0
blr
EPILOGUE(_nettle_gcm_aes_encrypt)

.data
C 0xC2000000000000000000000000000001
.polynomial:
.align 4
IF_BE(`
.byte 0xC2
.rept 14
.byte 0x00
.endr
.byte 0x01
',`
.byte 0x01
.rept 14
.byte 0x00
.endr
.byte 0xC2
')
Niels Möller nisse@lysator.liu.se writes:
Below is an updated version of gcm-aes-encrypt.asm, seems to work for me, and uses fewer of the regular registers. Some comments and questions:
What about the vsrX registers, 0 <= X < 32? They are used to copy values from and to the v registers (aka vsrX, 32 <= X < 64), e.g.,
xxlor vs1, VSR(S0), VSR(S0)
Can those registers be used freely, and how?
I've asked in a different forum, and as far as I understand, registers vs0-vs13 are free to use ("volatile"), because half of each corresponds to a volatile floating point register (fpr0-fpr13), while registers vs14-vs31 need to be saved and restored if used (the halves corresponding to fpr14-fpr31 are non-volatile, so in principle, it would be sufficient to save and restore just those halves).
- From my reading of the ELF v2 ABI spec, there's a "protected zone" below the stack pointer that can be used freely for storage. Is that right? Or maybe that's only for the ELFv2 ABI?
That appears to be the same in ELFv1 ABI, see https://refspecs.linuxfoundation.org/ELF/ppc64/PPC-elf64abi.html#STACK
One other question: In the counter updates,
C increase ctr value as input to aes_encrypt
vaddudm S1, S0, CNT1
vaddudm S2, S1, CNT1
vaddudm S3, S2, CNT1
vaddudm S4, S3, CNT1
vaddudm S5, S4, CNT1
vaddudm S6, S5, CNT1
vaddudm S7, S6, CNT1
shouldn't that be vadduwm (32-bit word addition, rather than 64-bit dword addition)? As I understand it, gcm uses a 32-bit counter, which should wrap around without any carry to higher bits if the initial value is just below 2^32.
Regards, /Niels
Niels Möller nisse@lysator.liu.se writes:
One other question: In the counter updates,
C increase ctr value as input to aes_encrypt
vaddudm S1, S0, CNT1
vaddudm S2, S1, CNT1
vaddudm S3, S2, CNT1
vaddudm S4, S3, CNT1
vaddudm S5, S4, CNT1
vaddudm S6, S5, CNT1
vaddudm S7, S6, CNT1
shouldn't that be vadduwm (32-bit word addition, rather than 64-bit dword addition)? As I understand it, gcm uses a 32-bit counter, which should wrap around without any carry to higher bits if the initial value is just below 2^32.
I've added tests that set the initial counter so that the four counter bytes wrap around 2^32, and I've verified that these instructions indeed need to be changed to vadduwm to get output that agrees with nettle's other gcm implementations.
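For reference, the counter semantics those tests exercise, written out in C (a sketch of the GCM rule, not Nettle's code): only the last 32 bits of the counter block are incremented, big-endian, wrapping modulo 2^32 with no carry into the upper 96 bits, which is why vadduwm is right and vaddudm is not.

  #include <stdint.h>

  static void
  gcm_ctr32_increment (uint8_t ctr[16])
  {
    uint32_t c = ((uint32_t) ctr[12] << 24) | ((uint32_t) ctr[13] << 16)
               | ((uint32_t) ctr[14] << 8)  |  (uint32_t) ctr[15];
    c++;                        /* wraps at 2^32; bytes 0..11 untouched */
    ctr[12] = c >> 24;
    ctr[13] = (c >> 16) & 0xff;
    ctr[14] = (c >> 8) & 0xff;
    ctr[15] = c & 0xff;
  }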
Another question on powerpc64 assembly: For the byte swapping, currently done using the vperm instruction and a mask word, is there any reason to not use the xxbrd instruction (VSX Vector Byte-Reverse Doubleword) instead? That applies to more functions than the new gcm-aes code.
Regards, /Niels
Niels Möller nisse@lysator.liu.se writes:
I've added tests that set the initial counter so that the four counter bytes wrap around 2^32, and I've verified that these instructions indeed need to be changed to vadduwm to get output that agrees with nettle's other gcm implementations.
I've committed those fixes, and a fix for big-endian support, on the branch ppc64-gcm-aes-rebased. I think that's now ready for merging.
I see some opportunities for further improvement, but that can be done after the merge, together with related fixes to the other ppc64 assembly files, for consistency.
Another question on powerpc64 assembly: For the byte swapping, currently done using the vperm instruction and a mask word, is there any reason to not use the xxbrd instruction (VSX Vector Byte-Reverse Doubleword) instead? That applies to more functions than the new gcm-aes code.
A closer look at the spec indicated that xxbrd is only available from power9 (i.e., if the processor supports VSX, *and* supports ISA 3.0, if I've understood it correctly).
I think it would be a good idea to consistently use pseudo-ops like
.machine "power8"
in the ppc assembly files, if that would let the assembler catch accidental use of unavailable instructions.
Regards, /Niels