Re: S390x other modes and memxor

9 May 2021


Maamoun TK <maamoun.tk@googlemail.com> writes:
...
> This is great information that I can keep in mind for future
> implementations. The s390x architecture offers the 'xc' instruction
> ("Storage-to-storage XOR"), which handles at most 256 bytes per
> instruction, but we can do as many iterations as we need. I optimized
> memxor using that instruction, as it achieves the best performance for
> this case. I'll attach the patch at the end of this message.

Nice! I'd like to merge this as soon as the s390x ci is up and running
again.
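
To make the chunking idea concrete, here is a rough C sketch of a memxor
that works in pieces of at most 256 bytes, the limit quoted above for 'xc'.
It is not the attached patch: memxor_chunked, xor_chunk and XC_MAX_LEN are
made-up names, and the inner byte loop only stands in for what would be a
single 'xc' on s390x.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Largest span one 'xc' can cover, per the message above. */
#define XC_MAX_LEN 256

/* Stand-in for a single 'xc': XOR up to XC_MAX_LEN bytes of src into dst.
   Plain C here; the real thing would be one instruction. */
static void
xor_chunk(uint8_t *dst, const uint8_t *src, size_t n)
{
  assert(n <= XC_MAX_LEN);
  for (size_t i = 0; i < n; i++)
    dst[i] ^= src[i];
}

/* Hypothetical name. Same contract as memxor: dst ^= src over n bytes,
   operands assumed not to overlap. Returns dst. */
void *
memxor_chunked(void *dst_in, const void *src_in, size_t n)
{
  uint8_t *dst = dst_in;
  const uint8_t *src = src_in;

  /* Full 256-byte chunks first. */
  while (n >= XC_MAX_LEN)
    {
      xor_chunk(dst, src, XC_MAX_LEN);
      dst += XC_MAX_LEN;
      src += XC_MAX_LEN;
      n -= XC_MAX_LEN;
    }

  /* Then whatever is left over (fewer than 256 bytes). */
  if (n > 0)
    xor_chunk(dst, src, n);

  return dst_in;
}
```

Splitting into full chunks plus one tail keeps the loop body fixed-size,
which is presumably what one wants when the chunk is a single instruction.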
...
> Unfortunately, I couldn't manage to optimize memxor3 using the 'xc'
> instruction: although it supports overlapping operands, it processes
> them from left to right, one byte at a time.

Hmm, I wonder if there's some way to work around that.
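
The direction problem can be shown in plain C. The sketch below assumes the
overlap case of interest is a destination that sits above one of the
sources, as happens with in-place CBC decryption; that is my reading, not
something stated in the thread. xor3_forward and xor3_backward are
hypothetical names, not nettle functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Left to right, one byte at a time: the order described above for 'xc'. */
void
xor3_forward(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] = a[i] ^ b[i];
}

/* Right to left: every source byte is read before the store that could
   clobber it, as long as dst is at or above the overlapping source. */
void
xor3_backward(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
  while (n-- > 0)
    dst[n] = a[n] ^ b[n];
}
```

With dst = b + 16 and n = 32, the forward loop is only right for the first
16 bytes: by the time it reads b[16] it has already stored to that address
as dst[0]. The backward loop gives the intended result for all 32 bytes.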
...
> However, I think optimizing just memxor could give a good sense of how
> much it would increase the performance of the AES modes. CBC mode comes
> in handy here, since it uses memxor in the encrypt operation, and in the
> decrypt operation when the operands don't overlap. Here is the benchmark
> result of CBC mode:
> *-----------------------------------------------------------------------*
> |                                   | AES-128 Encrypt | AES-128 Decrypt |
> |-----------------------------------|-----------------|-----------------|
> | CBC-Accelerator                   |        1.18 cbp |        0.75 cbp |
> | Basic AES-Accelerator             |       13.50 cbp |        3.34 cbp |
> | Basic AES-Accelerator with memxor |       15.50     |        1.57     |
> *-----------------------------------------------------------------------*

This seems to confirm that cbc encrypt is the operation that gains the
most from assembly for the combined operation. That aes decrypt can also
gain a factor two in performance, does that mean that both aes-cbc and
memxor run at speed limited by memory bandwidth? And then the gain is
from one less pass loading and storing data from memory?
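
For reference, the two-pass structure behind that question looks roughly
like this in C for the non-overlapping decrypt case. It is a sketch under
assumptions: I am guessing that "Basic AES-Accelerator with memxor" means a
bulk block decrypt followed by a separate XOR pass. cbc_decrypt_sketch,
decrypt_func and toy_decrypt are made-up names, and toy_decrypt is a
byte-wise stand-in, not a cipher.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16

/* Hypothetical callback: decrypt length bytes (a multiple of BLOCK_SIZE),
   each block independently, from src to dst. */
typedef void decrypt_func(const void *ctx, size_t length,
                          uint8_t *dst, const uint8_t *src);

/* Toy stand-in for the block decrypt: subtract a key byte from each byte. */
void
toy_decrypt(const void *ctx, size_t length, uint8_t *dst, const uint8_t *src)
{
  uint8_t key = *(const uint8_t *) ctx;
  for (size_t i = 0; i < length; i++)
    dst[i] = (uint8_t)(src[i] - key);
}

static void
xor_bytes(uint8_t *dst, const uint8_t *src, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] ^= src[i];
}

/* CBC decrypt for dst and src that do not overlap. */
void
cbc_decrypt_sketch(const void *ctx, decrypt_func *f, uint8_t *iv,
                   size_t length, uint8_t *dst, const uint8_t *src)
{
  assert(length > 0 && length % BLOCK_SIZE == 0);

  /* Pass 1: block-decrypt the whole message. */
  f(ctx, length, dst, src);

  /* Pass 2: XOR block 0 with the IV and block i with ciphertext block
     i-1. This walks dst and src a second time. */
  xor_bytes(dst, iv, BLOCK_SIZE);
  xor_bytes(dst + BLOCK_SIZE, src, length - BLOCK_SIZE);

  /* The last ciphertext block becomes the next IV. */
  memcpy(iv, src + length - BLOCK_SIZE, BLOCK_SIZE);
}
```

A combined CBC operation would fold pass 2 into pass 1, which is the "one
less pass" over memory asked about above.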

What unit is "cbp"? If it's cycles per byte, 0.77 cycles/byte for memxor
(the cost of "Basic AES-Accelerator with memxor" minus the cost of
"CBC-Accelerator") sounds unexpectedly slow compared to, e.g., x86_64,
where I get 0.08 cycles per byte (regardless of alignment), or 0.64
cycles per 64-bit word.

Regards,
/Niels
-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
