Nikos Mavrogiannopoulos nmav@redhat.com writes:
On Mon, 2018-02-19 at 15:27 +0100, Niels Möller wrote:
I think I'd prefer allocating a uint64_t array (largest type used in nettle context structs), and leave to the compiler to figure out what alignment is needed and how to get it.
That way you get 8-byte alignment which is ok, but if you use it for aesni key state for example, it results to slower operations.
My concern now is not performance, but correctness on platforms that require proper data alignment.
But to nevertheless comment about performance, for the benefit of x86_64 simd instructions, I've been thinking about specifying 16-byte alignment for things like union nettle_block16 and aes subkeys, but I'm not aware of any nice and portable way to do that. And if we do it only for some of the possible compilers on a platform, we get an inconsistent ABI.
And I doubt it makes much difference for performance, the bottleneck for AES isn't the reading of subkeys (at least not since recent aesni changes which loads the subkeys into registers outside of the block loop). memxor is the only function in Nettle which is likely to be limited by memory accesses, and it uses its own logic to do aligned accesses.
Speaking of memxor, I wonder if one could take advantage of alignment info propagated by recent versions of gcc, and do something like
#define COMPILE_TIME_ZERO(x) (__builtin_constant_p((x) == 0) && (x) == 0))
inline void memxor (void *src, void *dst, size_t n) { if (COMPILE_TIME_ZERO((intptr_t) src & 7) && COMPILE_TIME_ZERO((intptr_t) dst & 7) && COMPILE_TIME_ZERO(n & 7)) memxor_words(...); else ... }
to skip some runtime alignment checks when alignment can be derived at compile time.
Said that, variable arrays and alloca() are ok when the input doesn't come externally but I'm not sure if we can enforce that in nettle. What about moving to malloc() unconditionally?
The intention is to use TMP_ALLOC only for allocations with small bounds, like digest or block sizes, other small buffers (e.g., 512 bytes used in ctr.c and cbc.c), and in the pss code, hash context sizes.
You fixed a few potentially larger allocations to use TMP_GMP_ALLOC some time ago, iirc.
Regards, /Niels