On Sun, May 9, 2021 at 11:19 AM Niels Möller <nisse@lysator.liu.se> wrote:
> Before doing the other modes, do you think you could investigate if memxor and memxor3 can be sped up? That should benefit many ciphers and modes, and give more relevant speedup numbers for specialized functions like aes cbc and aes ctr.
>
> The best strategy depends on whether or not unaligned memory access is possible and efficient. All current implementations do aligned writes to the destination area (and smaller writes if needed at the edges). The C implementation and several of the asm implementations also do aligned reads, and use shifting to get the inputs xored together at the right places.
>
> The x86_64 implementation, on the other hand, uses unaligned reads, since that seems just as efficient and reduces complexity quite a lot.
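To make the trade-off concrete, here is a minimal portable C sketch of the unaligned-read variant (an illustration only, not Nettle's actual memxor.c; memxor_sketch is a hypothetical name). It aligns the destination first, then lets memcpy handle possibly-unaligned source loads:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Illustration only: byte loop until dst is word-aligned, then a
       word-at-a-time main loop, then a byte tail. memcpy lets the
       compiler emit unaligned loads on targets where they are cheap. */
    static void *
    memxor_sketch (void *dst, const void *src, size_t n)
    {
      unsigned char *d = dst;
      const unsigned char *s = src;

      /* Align the destination so all wide writes are aligned. */
      while (n > 0 && ((uintptr_t) d % sizeof (uintptr_t)) != 0)
        {
          *d++ ^= *s++;
          n--;
        }
      /* Main loop: one word per iteration. */
      while (n >= sizeof (uintptr_t))
        {
          uintptr_t w;
          memcpy (&w, s, sizeof w);   /* possibly unaligned source read */
          *(uintptr_t *) d ^= w;      /* aligned destination write */
          d += sizeof w;
          s += sizeof w;
          n -= sizeof w;
        }
      /* Remaining bytes at the edge. */
      while (n-- > 0)
        *d++ ^= *s++;
      return dst;
    }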
> On all platforms I'm familiar with, assembly implementations can assume that it is safe to read a few bytes outside the edge of the input buffer, as long as those reads don't cross a word boundary (corresponding to the valgrind option --partial-loads-ok=yes).
>
> Ideally, memxor performance should be limited by memory/cache bandwidth (with data in L1 cache probably being the most important case; it looks like nettle-benchmark calls it with a size of 10 KB).
>
> Note that memxor3 must process data in descending address order, to support the call from cbc_decrypt, with overlapping operands.
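That descending-order requirement is easiest to see in a byte-wise sketch (an illustration only; memxor3_sketch is a hypothetical name): in the in-place cbc_decrypt call, the destination overlaps one source at a higher address, so working from the highest address down reads each overlapped byte before the writes reach it.

    #include <stddef.h>

    /* Illustration only: byte-wise memxor3 working from the highest
       address down. In the in-place cbc_decrypt call, dst overlaps
       source b with dst at the higher address (dst == b + block_size),
       so each b[i] is read before the descending writes reach it. */
    static void *
    memxor3_sketch (void *dst_v, const void *a_v, const void *b_v, size_t n)
    {
      unsigned char *dst = dst_v;
      const unsigned char *a = a_v;
      const unsigned char *b = b_v;

      while (n-- > 0)
        dst[n] = a[n] ^ b[n];
      return dst;
    }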
This is great information that I can keep in mind for future implementations. The s390x architecture offers the 'xc' instruction ("storage-to-storage XOR") with a maximum length of 256 bytes, but we can iterate it as many times as we need. I optimized memxor using that instruction, since it achieves optimal performance for this case; the patch is attached at the end of this message.

Unfortunately, I couldn't optimize memxor3 with 'xc': while it does support overlapping operands, it processes them from left to right, one byte at a time, which is incompatible with the descending-order requirement. Still, optimizing just memxor should give a good sense of how much it would improve the performance of the AES modes. CBC mode comes in handy here, since it uses memxor in both the encrypt and the decrypt operation (for decrypt, whenever the operands don't overlap). Here are the benchmark results for CBC mode:
                                    | AES-128 Encrypt | AES-128 Decrypt
 -----------------------------------+-----------------+-----------------
 CBC-Accelerator                    |    1.18 cpb     |    0.75 cpb
 Basic AES-Accelerator              |   13.50 cpb     |    3.34 cpb
 Basic AES-Accelerator with memxor  |   15.50 cpb     |    1.57 cpb

(cpb = cycles per byte)
I attribute the performance decrease with the optimized memxor to the overhead of the 'ex' instruction: the "xor_len" macro patches the length field of the 'xc' instruction and then fetches that instruction from memory in order to execute it. That happens for every single block, so the extra cycles per byte make sense. The decrypt operation does improve with the optimized memxor, but the CBC accelerator is still almost twice as fast. The encrypt operation, on the other hand, gets slower, and to be honest 15 cycles per byte is a lot for CBC mode, so we really need to consider the accelerators, since they offer optimal performance on this architecture.
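For context on why memxor shows up directly in these numbers: the generic CBC code performs one memxor per block. A simplified sketch along the lines of Nettle's cbc_encrypt (not the exact code; cbc_encrypt_sketch and cipher_func are illustrative names):

    #include <stdint.h>
    #include <string.h>
    #include <nettle/memxor.h>

    /* Simplified sketch along the lines of Nettle's cbc_encrypt: one
       memxor and one cipher invocation per block, so memxor speed feeds
       straight into the cycles-per-byte figures above. */
    typedef void cipher_func (const void *ctx, size_t length,
                              uint8_t *dst, const uint8_t *src);

    static void
    cbc_encrypt_sketch (const void *ctx, cipher_func *f, size_t block_size,
                        uint8_t *iv, size_t length,
                        uint8_t *dst, const uint8_t *src)
    {
      for (; length >= block_size;
           length -= block_size, src += block_size, dst += block_size)
        {
          memxor (iv, src, block_size);  /* chain: iv ^= plaintext block */
          f (ctx, block_size, dst, iv);  /* encrypt the chained block */
          memcpy (iv, dst, block_size);  /* ciphertext becomes next iv */
        }
    }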
Regards,
Mamone
---
 s390x/machine.m4 | 13 +++++++++++++
 s390x/memxor.asm | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 67 insertions(+)
 create mode 100644 s390x/memxor.asm
diff --git a/s390x/machine.m4 b/s390x/machine.m4
index acd5e26c..b94c408a 100644
--- a/s390x/machine.m4
+++ b/s390x/machine.m4
@@ -1,2 +1,15 @@
 C Register usage:
 define(`RA', `%r14')
+
+C XOR contents of two areas in storage with specific length
+C len cannot be assigned to general register 0
+C len <= 256
+C xor_len(dst, src, len, tmp_addr)
+define(`xor_len',
+`larl $4,18f		C address of the xc template below
+ aghi $3,-1		C xc encodes the operand length as len-1
+ jm 19f		C skip if len was zero
+ ex $3,0($4)		C execute xc with the patched length field
+ j 19f
+18: xc 0(1,$1),0($2)
+19:')
diff --git a/s390x/memxor.asm b/s390x/memxor.asm
new file mode 100644
index 00000000..178e68e9
--- /dev/null
+++ b/s390x/memxor.asm
@@ -0,0 +1,54 @@
+C s390x/memxor.asm
+
+ifelse(`
+   Copyright (C) 2021 Mamone Tarsha
+   This file is part of GNU Nettle.
+
+   GNU Nettle is free software: you can redistribute it and/or
+   modify it under the terms of either:
+
+     * the GNU Lesser General Public License as published by the Free
+       Software Foundation; either version 3 of the License, or (at your
+       option) any later version.
+
+   or
+
+     * the GNU General Public License as published by the Free
+       Software Foundation; either version 2 of the License, or (at your
+       option) any later version.
+
+   or both in parallel, as here.
+
+   GNU Nettle is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received copies of the GNU General Public License and
+   the GNU Lesser General Public License along with this program.  If
+   not, see http://www.gnu.org/licenses/.
+')
+
+.file "memxor.asm"
+
+.text
+
+C void * memxor(void *dst, const void *src, size_t n)
+
+PROLOGUE(nettle_memxor)
+	lgr	%r0,%r2			C save dst for the return value
+	srlg	%r5,%r4,8		C r5 = number of full 256-byte chunks
+	clgije	%r5,0,Llen		C skip the bulk loop if n < 256
+L256_loop:
+	xc	0(256,%r2),0(%r3)	C storage-to-storage xor of 256 bytes
+	aghi	%r2,256
+	aghi	%r3,256
+	brctg	%r5,L256_loop
+Llen:
+	risbg	%r5,%r4,56,191,0	C r5 = n mod 256, sets condition code
+	jz	Ldone
+	xor_len(%r2,%r3,%r5,%r1)	C xor the 1..255 remaining bytes
+Ldone:
+	lgr	%r2,%r0			C return dst
+	br	RA
+EPILOGUE(nettle_memxor)
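For anyone trying the patch, a throwaway correctness check against a byte-wise reference could look like the following (my own test, not part of the patch), exercising sizes on both sides of the 256-byte 'xc' chunking:

    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <nettle/memxor.h>

    /* Throwaway check: compare nettle_memxor against a byte-wise
       reference, with sizes chosen around the 256-byte chunk boundary. */
    int
    main (void)
    {
      static const size_t sizes[] = { 1, 17, 255, 256, 257, 512, 10240 };
      size_t i, j;
      for (i = 0; i < sizeof (sizes) / sizeof (sizes[0]); i++)
        {
          size_t n = sizes[i];
          unsigned char *dst = malloc (n);
          unsigned char *src = malloc (n);
          unsigned char *ref = malloc (n);
          assert (dst && src && ref);
          for (j = 0; j < n; j++)
            {
              dst[j] = (unsigned char) rand ();
              src[j] = (unsigned char) rand ();
              ref[j] = dst[j] ^ src[j];
            }
          memxor (dst, src, n);
          assert (memcmp (dst, ref, n) == 0);
          free (dst);
          free (src);
          free (ref);
        }
      printf ("memxor: all sizes ok\n");
      return 0;
    }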