serpent on x86_64, using sse2 instructions

Niels Möller nisse at lysator.liu.se
Thu Jun 30 21:21:46 CEST 2011


I've now tried to write some sse2 assembly code for serpent, to do four
blocks at a time in parallel (which helps for ecb and ctr mode, and cbc
decrypt, but not at all for cbc encrypt). This is the first time I've
tried to use these x86 instructions for anything.
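
To illustrate the four-blocks-at-a-time idea, here is a minimal sketch
in C with SSE2 intrinsics rather than assembly. It is not the code
described in this message. It applies the linear transformation from the
Serpent specification to four blocks at once, assuming the data has
already been arranged so that each xmm register holds the same 32-bit
word from four different blocks. The names lt_scalar, lt_sse2 and
lt_versions_agree are made up for this example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Rotate left, for one 32-bit word and for four words in an xmm register. */
#define ROL32(x, n) (((x) << (n)) | ((x) >> (32 - (n))))
#define ROL32X4(x, n) \
  _mm_or_si128(_mm_slli_epi32((x), (n)), _mm_srli_epi32((x), 32 - (n)))

/* Scalar reference: Serpent's linear transformation on one block. */
static void lt_scalar(uint32_t x[4])
{
  x[0] = ROL32(x[0], 13);
  x[2] = ROL32(x[2], 3);
  x[1] ^= x[0] ^ x[2];
  x[3] ^= x[2] ^ (x[0] << 3);
  x[1] = ROL32(x[1], 1);
  x[3] = ROL32(x[3], 7);
  x[0] ^= x[1] ^ x[3];
  x[2] ^= x[3] ^ (x[1] << 7);
  x[0] = ROL32(x[0], 5);
  x[2] = ROL32(x[2], 22);
}

/* Same transformation on four blocks at once.
   Assumed layout: w[i][j] is word i of block j. */
static void lt_sse2(uint32_t w[4][4])
{
  __m128i x0 = _mm_loadu_si128((const __m128i *) w[0]);
  __m128i x1 = _mm_loadu_si128((const __m128i *) w[1]);
  __m128i x2 = _mm_loadu_si128((const __m128i *) w[2]);
  __m128i x3 = _mm_loadu_si128((const __m128i *) w[3]);

  x0 = ROL32X4(x0, 13);
  x2 = ROL32X4(x2, 3);
  x1 = _mm_xor_si128(x1, _mm_xor_si128(x0, x2));
  x3 = _mm_xor_si128(x3, _mm_xor_si128(x2, _mm_slli_epi32(x0, 3)));
  x1 = ROL32X4(x1, 1);
  x3 = ROL32X4(x3, 7);
  x0 = _mm_xor_si128(x0, _mm_xor_si128(x1, x3));
  x2 = _mm_xor_si128(x2, _mm_xor_si128(x3, _mm_slli_epi32(x1, 7)));
  x0 = ROL32X4(x0, 5);
  x2 = ROL32X4(x2, 22);

  _mm_storeu_si128((__m128i *) w[0], x0);
  _mm_storeu_si128((__m128i *) w[1], x1);
  _mm_storeu_si128((__m128i *) w[2], x2);
  _mm_storeu_si128((__m128i *) w[3], x3);
}

/* Returns 1 if the SSE2 version matches the scalar one on four
   pseudo-random test blocks, 0 otherwise. */
static int lt_versions_agree(void)
{
  uint32_t w[4][4], ref[4][4];  /* ref[j] is block j in normal order */
  uint32_t v = 0x12345678;
  int i, j;

  for (i = 0; i < 4; i++)
    for (j = 0; j < 4; j++)
      {
        v = v * 1664525u + 1013904223u;  /* simple LCG for test data */
        w[i][j] = v;
        ref[j][i] = v;
      }

  lt_sse2(w);
  for (j = 0; j < 4; j++)
    lt_scalar(ref[j]);

  for (i = 0; i < 4; i++)
    for (j = 0; j < 4; j++)
      if (w[i][j] != ref[j][i])
        return 0;
  return 1;
}
```

Every step maps one-to-one from the scalar version to an instruction
working on four words, which is why this kind of cipher is easy to run
on several blocks in parallel.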

As far as I know, all existing x86_64 processors have the needed
instructions, so there are no configure time or run time tests of the
corresponding cpuid flags.
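
For reference, should a run time test ever be wanted (say, for a 32-bit
port), a minimal sketch could look like the following. It assumes GCC or
Clang on x86, since it relies on their __builtin_cpu_init and
__builtin_cpu_supports builtins. The name have_sse2 is made up for this
example.

```c
/* Returns 1 if the running cpu reports sse2 support, 0 otherwise.
   Assumes GCC or Clang targeting x86 or x86_64. */
static int have_sse2(void)
{
  __builtin_cpu_init();
  return __builtin_cpu_supports("sse2") != 0;
}
```

On x86_64 this should always return 1, since sse2 is part of the
baseline instruction set there.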

Speed is then close to aes128 (aes128 is slightly faster for encrypt,
and serpent slightly faster for decrypt), and a bit faster than the
other aes variants as well as camellia.

I imagine that on processors with whatever sse extension is needed to
get the 256-bit ymm registers, one can get almost twice the performance
for serpent, compared to the current code, which uses only the 128-bit
xmm registers.

And in the other direction, the code could easily be ported to 32-bit
x86 with sse2, with just a small penalty from the fewer registers.

Benchmark results on my laptop (intel SU4100):

         Algorithm        mode Mbyte/s cycles/byte cycles/block

            aes128 ECB encrypt   73.47       16.87       269.98
            aes128 ECB decrypt   71.73       17.28       276.53
            aes128 CBC encrypt   61.99       20.00       320.01
            aes128 CBC decrypt   70.52       17.58       281.29

            aes192 ECB encrypt   63.23       19.61       313.72
            aes192 ECB decrypt   61.86       20.04       320.69
            aes192 CBC encrypt   54.30       22.83       365.31
            aes192 CBC decrypt   61.04       20.31       324.97

            aes256 ECB encrypt   54.35       22.81       364.97
            aes256 ECB decrypt   54.28       22.84       365.42
            aes256 CBC encrypt   48.52       25.55       408.80
            aes256 CBC decrypt   53.56       23.15       370.34

       camellia128 ECB encrypt   57.53       21.55       344.80
       camellia128 ECB decrypt   57.52       21.55       344.86
       camellia128 CBC encrypt   51.99       23.84       381.51
       camellia128 CBC decrypt   56.72       21.86       349.74

       camellia192 ECB encrypt   43.05       28.80       460.72
       camellia192 ECB decrypt   43.01       28.82       461.16
       camellia192 CBC encrypt   39.94       31.04       496.68
       camellia192 CBC decrypt   42.53       29.15       466.40

       camellia256 ECB encrypt   43.03       28.81       461.00
       camellia256 ECB decrypt   43.02       28.82       461.14
       camellia256 CBC encrypt   39.90       31.07       497.11
       camellia256 CBC decrypt   42.61       29.10       465.54

        serpent256 ECB encrypt   68.88       18.00       287.98
        serpent256 ECB decrypt   79.81       15.53       248.54
        serpent256 CBC encrypt   23.20       53.43       854.86
        serpent256 CBC decrypt   78.25       15.84       253.50
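
The serpent256 CBC encrypt row stands out (53.43 cycles/byte against
15.84 for CBC decrypt, a factor of about 3.4) because of the data
dependency in cbc mode. A minimal sketch in plain C, with a toy
invertible byte mapping standing in for the block cipher (toy_encrypt
and toy_decrypt are made up for this example and are in no way secure):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 16

/* Toy stand-ins for a block cipher, NOT secure. */
static void toy_encrypt(uint8_t b[BLOCK])
{
  int i;
  for (i = 0; i < BLOCK; i++)
    b[i] = (uint8_t) ((b[i] ^ 0x5a) + i);
}

static void toy_decrypt(uint8_t b[BLOCK])
{
  int i;
  for (i = 0; i < BLOCK; i++)
    b[i] = (uint8_t) ((uint8_t) (b[i] - i) ^ 0x5a);
}

/* Encrypt in place. The input to block n depends on the ciphertext of
   block n-1, so the blocks must be processed one after the other. */
static void cbc_encrypt(uint8_t *data, size_t nblocks, const uint8_t iv[BLOCK])
{
  const uint8_t *prev = iv;
  size_t n;
  int i;
  for (n = 0; n < nblocks; n++)
    {
      uint8_t *cur = data + n * BLOCK;
      for (i = 0; i < BLOCK; i++)
        cur[i] ^= prev[i];
      toy_encrypt(cur);
      prev = cur;
    }
}

/* Decrypt from src to dst (must not overlap). */
static void cbc_decrypt(const uint8_t *src, uint8_t *dst, size_t nblocks,
                        const uint8_t iv[BLOCK])
{
  size_t n;
  int i;
  memcpy(dst, src, nblocks * BLOCK);

  /* Pass 1: all block decryptions are independent of each other, so
     groups of four could be handed to a parallel implementation. */
  for (n = 0; n < nblocks; n++)
    toy_decrypt(dst + n * BLOCK);

  /* Pass 2: xor with the previous ciphertext block (or the iv). */
  for (n = 0; n < nblocks; n++)
    {
      const uint8_t *prev = n ? src + (n - 1) * BLOCK : iv;
      for (i = 0; i < BLOCK; i++)
        dst[n * BLOCK + i] ^= prev[i];
    }
}

/* Returns 1 if decrypting an encrypted message gives back the original. */
static int cbc_roundtrip_ok(void)
{
  uint8_t iv[BLOCK], plain[4 * BLOCK], ct[4 * BLOCK], out[4 * BLOCK];
  size_t i;
  for (i = 0; i < BLOCK; i++)
    iv[i] = (uint8_t) (0xa0 + i);
  for (i = 0; i < sizeof(plain); i++)
    plain[i] = (uint8_t) (7 * i + 3);

  memcpy(ct, plain, sizeof(plain));
  cbc_encrypt(ct, 4, iv);
  cbc_decrypt(ct, out, 4, iv);
  return memcmp(out, plain, sizeof(plain)) == 0;
}
```

In cbc_encrypt the loop carries a dependency from one block to the next,
while the first pass of cbc_decrypt has none. The same independence
holds for ecb and ctr mode.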

It's conceivable that one can use simd instructions to get some speedup
also for aes and camellia, but that won't be almost trivial, the way it
is for serpent. (And for aes, I don't yet have any code using the
special instructions available in newer x86_64 processors.)

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.


