I've now tried to write some sse2 assembly code for serpent, processing four blocks at a time in parallel (which helps for ecb and ctr mode, and for cbc decrypt, but not at all for cbc encrypt, where each block depends on the previous ciphertext block). This is the first time I've used these x86 instructions for anything.
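To illustrate why cbc decrypt can be done several blocks at a time while cbc encrypt cannot, here is a minimal C sketch. The 32-bit "cipher" is a made-up toy stand-in (not serpent), and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy invertible 32-bit "block cipher"; a stand-in for illustration,
   NOT serpent. */
static inline uint32_t toy_encrypt(uint32_t k, uint32_t x)
{
  x ^= k;
  x = (x << 7) | (x >> 25);
  return x + k;
}

static inline uint32_t toy_decrypt(uint32_t k, uint32_t y)
{
  y -= k;
  y = (y >> 7) | (y << 25);
  return y ^ k;
}

/* CBC encrypt: the cipher input for block i includes ciphertext block
   i-1, so the cipher calls form a serial chain. */
static inline void cbc_encrypt(uint32_t k, uint32_t iv, size_t n,
                               uint32_t *dst, const uint32_t *src)
{
  uint32_t prev = iv;
  for (size_t i = 0; i < n; i++) {
    prev = toy_encrypt(k, src[i] ^ prev);
    dst[i] = prev;
  }
}

/* CBC decrypt: every cipher input is a ciphertext block that is already
   known, so four cipher calls can be issued independently (this is the
   part a four-way simd implementation would do in one go). The xor with
   the previous ciphertext block happens afterwards.
   Assumes n is a multiple of 4 and that dst and src do not overlap. */
static inline void cbc_decrypt(uint32_t k, uint32_t iv, size_t n,
                               uint32_t *dst, const uint32_t *src)
{
  assert(n % 4 == 0);
  for (size_t i = 0; i < n; i += 4) {
    uint32_t t[4];
    for (size_t j = 0; j < 4; j++)
      t[j] = toy_decrypt(k, src[i + j]);   /* independent of each other */
    for (size_t j = 0; j < 4; j++)
      dst[i + j] = t[j] ^ (i + j > 0 ? src[i + j - 1] : iv);
  }
}
```

In ecb and ctr mode the cipher inputs are likewise independent of each other, so the same grouping applies there.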
As far as I know, all existing x86_64 processors have the needed instructions, so there are no configure-time or run-time tests of the corresponding cpuid flags.
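For reference, if a run-time test were wanted (say, for a 32-bit port), it could look roughly like the sketch below. It assumes gcc or clang, whose `__builtin_cpu_supports` builtin reads the cpuid flags; the function name `have_sse2` is hypothetical and nothing like it is in the current code.

```c
/* Returns 1 if the cpu reports sse2 support, 0 otherwise.
   Assumes gcc or clang; on non-x86 targets it simply returns 0. */
static inline int have_sse2(void)
{
#if defined(__x86_64__) || defined(__i386__)
  __builtin_cpu_init();   /* harmless if already initialized */
  return __builtin_cpu_supports("sse2") != 0;
#else
  return 0;
#endif
}
```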
In the modes that can be parallelized, speed is then close to aes128 (aes128 is slightly faster for encrypt, and serpent slightly faster for decrypt), and a bit faster than the other aes variants as well as camellia.
I imagine that on processors with whatever sse extension is needed to get the 256-bit ymm registers, one can get almost twice the performance for serpent, compared to the current code, which uses only the 128-bit xmm registers.
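As a rough picture of what the xmm registers buy here: one 128-bit register holds the same 32-bit word from each of four blocks, so a rotate-and-xor step of the kind serpent is built from processes four blocks per instruction. The sketch below uses C intrinsics rather than assembly, and the step shown is illustrative only, not the actual round code; the function names are hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>  /* sse2 intrinsics */
#include <stdint.h>

/* Scalar reference: rotate one 32-bit word left by n, 0 < n < 32. */
static inline uint32_t rotl32(uint32_t x, int n)
{
  return (x << n) | (x >> (32 - n));
}

/* Rotate each of the four 32-bit lanes left by n, 0 < n < 32.
   sse2 has no rotate instruction, so use two shifts and an or. */
static inline __m128i rotl32x4(__m128i x, int n)
{
  assert(n > 0 && n < 32);
  return _mm_or_si128(_mm_sll_epi32(x, _mm_cvtsi32_si128(n)),
                      _mm_srl_epi32(x, _mm_cvtsi32_si128(32 - n)));
}

/* x0[i] and x1[i] hold the same word position of block i, i = 0..3.
   Computes x0[i] = rotl(x0[i], 13) ^ x1[i] for all four blocks at once. */
static inline void mix_four_blocks(uint32_t x0[4], const uint32_t x1[4])
{
  __m128i a = _mm_loadu_si128((const __m128i *) x0);
  __m128i b = _mm_loadu_si128((const __m128i *) x1);
  a = _mm_xor_si128(rotl32x4(a, 13), b);
  _mm_storeu_si128((__m128i *) x0, a);
}
```

With 256-bit registers the same instruction count would cover eight blocks, which is where the "almost twice" estimate comes from.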
And in the other direction, the code could easily be ported to 32-bit x86 with sse2, with just a small penalty from the fewer registers.
Benchmark results on my laptop (intel SU4100):
Algorithm    mode         Mbyte/s  cycles/byte  cycles/block

aes128       ECB encrypt    73.47        16.87        269.98
aes128       ECB decrypt    71.73        17.28        276.53
aes128       CBC encrypt    61.99        20.00        320.01
aes128       CBC decrypt    70.52        17.58        281.29

aes192       ECB encrypt    63.23        19.61        313.72
aes192       ECB decrypt    61.86        20.04        320.69
aes192       CBC encrypt    54.30        22.83        365.31
aes192       CBC decrypt    61.04        20.31        324.97

aes256       ECB encrypt    54.35        22.81        364.97
aes256       ECB decrypt    54.28        22.84        365.42
aes256       CBC encrypt    48.52        25.55        408.80
aes256       CBC decrypt    53.56        23.15        370.34

camellia128  ECB encrypt    57.53        21.55        344.80
camellia128  ECB decrypt    57.52        21.55        344.86
camellia128  CBC encrypt    51.99        23.84        381.51
camellia128  CBC decrypt    56.72        21.86        349.74

camellia192  ECB encrypt    43.05        28.80        460.72
camellia192  ECB decrypt    43.01        28.82        461.16
camellia192  CBC encrypt    39.94        31.04        496.68
camellia192  CBC decrypt    42.53        29.15        466.40

camellia256  ECB encrypt    43.03        28.81        461.00
camellia256  ECB decrypt    43.02        28.82        461.14
camellia256  CBC encrypt    39.90        31.07        497.11
camellia256  CBC decrypt    42.61        29.10        465.54

serpent256   ECB encrypt    68.88        18.00        287.98
serpent256   ECB decrypt    79.81        15.53        248.54
serpent256   CBC encrypt    23.20        53.43        854.86
serpent256   CBC decrypt    78.25        15.84        253.50
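The three numeric columns are related by simple arithmetic, sketched below. Two things here are inferred from the numbers rather than stated anywhere: a clock frequency of 1.3 GHz, and Mbyte meaning 2^20 bytes. The block size of 16 bytes applies to all three ciphers. The function names are hypothetical.

```c
#include <assert.h>

/* Cycles per byte from throughput; assumes 1 Mbyte = 2^20 bytes. */
static inline double cycles_per_byte(double mbyte_per_s, double clock_hz)
{
  assert(mbyte_per_s > 0);
  return clock_hz / (mbyte_per_s * 1048576.0);
}

/* All ciphers in the table use 16-byte blocks. */
static inline double cycles_per_block(double mbyte_per_s, double clock_hz)
{
  return 16.0 * cycles_per_byte(mbyte_per_s, clock_hz);
}
```

For example, `cycles_per_byte(73.47, 1.3e9)` gives about 16.87, matching the aes128 ECB encrypt row.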
It's conceivable that one can use simd instructions to get some speedup also for aes and camellia, but it's not going to be nearly as trivial as it is for serpent. (And for aes, I don't yet have any code using the special instructions available in newer x86_64 processors.)
Regards, /Niels