On Sun, May 9, 2021 at 11:19 AM Niels Möller <nisse@lysator.liu.se> wrote:
> Before doing the other modes, do you think you could investigate if memxor and memxor3 can be sped up? That should benefit many ciphers and modes, and give more relevant speedup numbers for specialized functions like aes cbc and aes ctr.
>
> The best strategy depends on whether or not unaligned memory access is possible and efficient. All current implementations do aligned writes to the destination area (and smaller writes if needed at the edges). The C implementation and several of the asm implementations also do aligned reads, and use shifting to get the inputs xored together at the right places.
>
> The x86_64 implementation, on the other hand, uses unaligned reads, since that seems just as efficient and reduces complexity quite a lot.
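To make the trade-off concrete, here is a minimal portable C sketch of the unaligned-read variant (an illustration only, not Nettle's actual memxor.c; memxor_sketch is a hypothetical name). It aligns the destination first, then lets memcpy handle possibly-unaligned source loads:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Illustration only: byte loop until dst is word-aligned, then a
       word-at-a-time main loop, then a byte tail. memcpy lets the
       compiler emit unaligned loads on targets where they are cheap. */
    static void *
    memxor_sketch (void *dst, const void *src, size_t n)
    {
      unsigned char *d = dst;
      const unsigned char *s = src;

      /* Align the destination so all wide writes are aligned. */
      while (n > 0 && ((uintptr_t) d % sizeof (uintptr_t)) != 0)
        {
          *d++ ^= *s++;
          n--;
        }
      /* Main loop: one word per iteration. */
      while (n >= sizeof (uintptr_t))
        {
          uintptr_t w;
          memcpy (&w, s, sizeof w);   /* possibly unaligned source read */
          *(uintptr_t *) d ^= w;      /* aligned destination write */
          d += sizeof w;
          s += sizeof w;
          n -= sizeof w;
        }
      /* Remaining bytes at the edge. */
      while (n-- > 0)
        *d++ ^= *s++;
      return dst;
    }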
> On all platforms I'm familiar with, assembly implementations can assume that it is safe to read a few bytes outside the edge of the input buffer, as long as those reads don't cross a word boundary (corresponding to the valgrind option --partial-loads-ok=yes).
>
> Ideally, memxor performance should be limited by memory/cache bandwidth (with data in L1 cache probably being the most important case; it looks like nettle-benchmark calls it with a size of 10 KB).
>
> Note that memxor3 must process data in descending address order, to support the call from cbc_decrypt, with overlapping operands.
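That descending-order requirement is easiest to see in a byte-wise sketch (an illustration only; memxor3_sketch is a hypothetical name): in the in-place cbc_decrypt call, the destination overlaps one source at a higher address, so working from the highest address down reads each overlapped byte before the writes reach it.

    #include <stddef.h>

    /* Illustration only: byte-wise memxor3 working from the highest
       address down. In the in-place cbc_decrypt call, dst overlaps
       source b with dst at the higher address (dst == b + block_size),
       so each b[i] is read before the descending writes reach it. */
    static void *
    memxor3_sketch (void *dst_v, const void *a_v, const void *b_v, size_t n)
    {
      unsigned char *dst = dst_v;
      const unsigned char *a = a_v;
      const unsigned char *b = b_v;

      while (n-- > 0)
        dst[n] = a[n] ^ b[n];
      return dst;
    }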
This is great information that I can keep in mind for future implementations. The s390x architecture offers the 'xc' instruction ("storage-to-storage XOR") with a maximum length of 256 bytes, but we can iterate it as many times as we need. I optimized memxor using that instruction, since it achieves optimal performance for this case; the patch is attached at the end of this message.

Unfortunately, I couldn't optimize memxor3 with 'xc': while it does support overlapping operands, it processes them from left to right, one byte at a time, which is incompatible with the descending-order requirement. Still, optimizing just memxor should give a good sense of how much it would improve the performance of the AES modes. CBC mode comes in handy here, since it uses memxor in both the encrypt and the decrypt operation (for decrypt, whenever the operands don't overlap). Here are the benchmark results for CBC mode:
                                    | AES-128 Encrypt | AES-128 Decrypt
 -----------------------------------+-----------------+-----------------
 CBC-Accelerator                    |    1.18 cpb     |    0.75 cpb
 Basic AES-Accelerator              |   13.50 cpb     |    3.34 cpb
 Basic AES-Accelerator with memxor  |   15.50 cpb     |    1.57 cpb

(cpb = cycles per byte)
I attribute the performance decrease with the optimized memxor to the overhead of the 'ex' instruction: the "xor_len" macro patches the length field of the 'xc' instruction and then fetches that instruction from memory in order to execute it. That happens for every single block, so the extra cycles per byte make sense. The decrypt operation does improve with the optimized memxor, but the CBC accelerator is still almost twice as fast. The encrypt operation, on the other hand, gets slower, and to be honest 15 cycles per byte is a lot for CBC mode, so we really need to consider the accelerators, since they offer optimal performance on this architecture.
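For context on why memxor shows up directly in these numbers: the generic CBC code performs one memxor per block. A simplified sketch along the lines of Nettle's cbc_encrypt (not the exact code; cbc_encrypt_sketch and cipher_func are illustrative names):

    #include <stdint.h>
    #include <string.h>
    #include <nettle/memxor.h>

    /* Simplified sketch along the lines of Nettle's cbc_encrypt: one
       memxor and one cipher invocation per block, so memxor speed feeds
       straight into the cycles-per-byte figures above. */
    typedef void cipher_func (const void *ctx, size_t length,
                              uint8_t *dst, const uint8_t *src);

    static void
    cbc_encrypt_sketch (const void *ctx, cipher_func *f, size_t block_size,
                        uint8_t *iv, size_t length,
                        uint8_t *dst, const uint8_t *src)
    {
      for (; length >= block_size;
           length -= block_size, src += block_size, dst += block_size)
        {
          memxor (iv, src, block_size);  /* chain: iv ^= plaintext block */
          f (ctx, block_size, dst, iv);  /* encrypt the chained block */
          memcpy (iv, dst, block_size);  /* ciphertext becomes next iv */
        }
    }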
Regards,
Mamone
---
 s390x/machine.m4 | 13 +++++++++++++
 s390x/memxor.asm | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 67 insertions(+)
 create mode 100644 s390x/memxor.asm
diff --git a/s390x/machine.m4 b/s390x/machine.m4
index acd5e26c..b94c408a 100644
--- a/s390x/machine.m4
+++ b/s390x/machine.m4
@@ -1,2 +1,15 @@
 C Register usage:
 define(`RA', `%r14')
+
+C XOR contents of two areas in storage with specific length
+C len cannot be assigned to general register 0
+C len <= 256
+C xor_len(dst, src, len, tmp_addr)
+define(`xor_len',
+`larl $4,18f		C address of the xc template below
+ aghi $3,-1		C xc encodes the operand length as len-1
+ jm 19f		C skip if len was zero
+ ex $3,0($4)		C execute xc with the patched length field
+ j 19f
+18: xc 0(1,$1),0($2)
+19:')
diff --git a/s390x/memxor.asm b/s390x/memxor.asm
new file mode 100644
index 00000000..178e68e9
--- /dev/null
+++ b/s390x/memxor.asm
@@ -0,0 +1,54 @@
+C s390x/memxor.asm
+
+ifelse(`
+   Copyright (C) 2021 Mamone Tarsha
+   This file is part of GNU Nettle.
+
+   GNU Nettle is free software: you can redistribute it and/or
+   modify it under the terms of either:
+
+     * the GNU Lesser General Public License as published by the Free
+       Software Foundation; either version 3 of the License, or (at your
+       option) any later version.
+
+   or
+
+     * the GNU General Public License as published by the Free
+       Software Foundation; either version 2 of the License, or (at your
+       option) any later version.
+
+   or both in parallel, as here.
+
+   GNU Nettle is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received copies of the GNU General Public License and
+   the GNU Lesser General Public License along with this program.  If
+   not, see http://www.gnu.org/licenses/.
+')
+
+.file "memxor.asm"
+
+.text
+
+C void * memxor(void *dst, const void *src, size_t n)
+
+PROLOGUE(nettle_memxor)
+	lgr	%r0,%r2			C save dst for the return value
+	srlg	%r5,%r4,8		C r5 = number of full 256-byte chunks
+	clgije	%r5,0,Llen		C skip the bulk loop if n < 256
+L256_loop:
+	xc	0(256,%r2),0(%r3)	C storage-to-storage xor of 256 bytes
+	aghi	%r2,256
+	aghi	%r3,256
+	brctg	%r5,L256_loop
+Llen:
+	risbg	%r5,%r4,56,191,0	C r5 = n mod 256, sets condition code
+	jz	Ldone
+	xor_len(%r2,%r3,%r5,%r1)	C xor the 1..255 remaining bytes
+Ldone:
+	lgr	%r2,%r0			C return dst
+	br	RA
+EPILOGUE(nettle_memxor)
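For anyone trying the patch, a throwaway correctness check against a byte-wise reference could look like the following (my own test, not part of the patch), exercising sizes on both sides of the 256-byte 'xc' chunking:

    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <nettle/memxor.h>

    /* Throwaway check: compare nettle_memxor against a byte-wise
       reference, with sizes chosen around the 256-byte chunk boundary. */
    int
    main (void)
    {
      static const size_t sizes[] = { 1, 17, 255, 256, 257, 512, 10240 };
      size_t i, j;
      for (i = 0; i < sizeof (sizes) / sizeof (sizes[0]); i++)
        {
          size_t n = sizes[i];
          unsigned char *dst = malloc (n);
          unsigned char *src = malloc (n);
          unsigned char *ref = malloc (n);
          assert (dst && src && ref);
          for (j = 0; j < n; j++)
            {
              dst[j] = (unsigned char) rand ();
              src[j] = (unsigned char) rand ();
              ref[j] = dst[j] ^ src[j];
            }
          memxor (dst, src, n);
          assert (memcmp (dst, ref, n) == 0);
          free (dst);
          free (src);
          free (ref);
        }
      printf ("memxor: all sizes ok\n");
      return 0;
    }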