Simo Sorce simo@redhat.com writes:
Just for curiosity, would it be ok change, LE_READ_UINT64/LE_WRITE_UINT64 to use uint64_t and have the macro use __bswap64 if available or fall back to the original unoptimized code ?
I think LE_READ_UINT64 and related macros should be left as is, and used for unaligned reads and writes.
For xts_shift, there are three cases, and I think they could be handled like this:
1. Little-endian system: Use 64-bit reads and writes, no swaps needed.
2. Big-endian system, __builtin_bswap64 detected (HAVE_BUILTIN_BSWAP64): Use 64-bit reads and writes + __builtin_bswap64.
3. Big-endian system, no __builtin_bswap64. Here we can either use the current code, with byte accesses only. Or attempt to define byteswap without builtins and follow 2. I'd lean towards using the current code, unless there's some system where something more complicated is known to improve performance.
In case (1) or (2), use 64-bit operations exclusively, also for the carry logic. And then the macros and #ifdefs should be arranged to make it not too ugly. E.g., cases (1) and (2) could perhaps be combined, with a BSWAP64_IF_BE macro or the like. Or a macro like LE_READ_ALIGNED_UINT64 (with an uint64_t * argument)
Only current use of __builtin_bswap64 is in ctr.c, the ctr_fill16 function.
An older example is in salsa20-core-internal.c, with a LE_SWAP32 macro. That one could take advantage of __builtin_bswap32, if available. I don't know if the compiler can recognize the current expression as a byte swap.
And it probably is more important to optimize xts_shift if/when one also arranges the code to produces multiple blocks of T_k values at a time (say, 32 blocks, 512 bytes of stack-allocated temporary storage), without reading the intermediate values back from memory to registers. That has been a significant optimization for both ctr mode and cbc decrypt.
I haven't reviewed the new version of the patch yet, I hope to get to that in a day or two.
Regards, /Niels