Speaking of umac, I'm also looking at the umac context structs, for potential micro-optimizations and fixes before they become part of the ABI.
Some fields, like nonce_length, index, and (for umac32 and umac64) nonce_low, fit in 16 or even 8 bits. So it might make sense to make them adjacent.
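As a sketch of what I mean (field names taken from the discussion above, sizes and the rest of the struct purely illustrative, not Nettle's actual layout), grouping the small fields lets them share a single 32-bit slot instead of each occupying a padded word:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Hypothetical sketch of a repacked umac context.  The real struct
   has more members; the point is only that index, nonce_length and
   nonce_low, placed adjacently, pack into four bytes. */
struct umac_ctx_sketch
{
  uint32_t l1_key[256];   /* placeholder size */
  uint64_t count;         /* block counter, see below */
  /* small fields grouped together: */
  uint16_t index;
  uint8_t nonce_length;
  uint8_t nonce_low;
};
```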
And on the other hand, the umac block count is currently a 32-bit unsigned number, and will wrap around after 2^32 blocks, or 2^42 bytes. Other hash functions typically support data sizes up to 2^64 (except sha512, which uses a 128-bit counter, which seems gross overkill).
For umac, the block counter is only needed to keep track of when to switch to different layer 2 hashing, and to keep track of odd and even blocks for poly128. So it could probably be made to work with only 16 bits and some saturation logic. But extending it to 64 bits seems simpler.
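To illustrate why extending to 64 bits is the simple option (names here are mine, not Nettle's): both uses of the counter stay trivial, and wraparound becomes a non-issue at 2^64 blocks.

```c
#include <stdint.h>

/* Hypothetical sketch of the two uses of the block counter named
   above, with a 64-bit count.  No saturation logic needed. */

/* Switch to the second-layer hashing once we've seen more than one
   block. */
static int
use_l2_hash (uint64_t count)
{
  return count > 0;
}

/* poly128 processes blocks in even/odd pairs; only the low bit of
   the count matters for that. */
static int
block_is_odd (uint64_t count)
{
  return (int) (count & 1);
}
```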
It would also be nice if we could force 16-byte alignment for the l1_key array (this is important for assembly routines), which would then imply 16-byte alignment for the complete context struct. That could help x86 sse2 assembly, and possibly also ARM, but I'm not sure the system there (primarily linker and malloc) really makes 16-byte alignment possible.
And it would also be good to get a reasonably large alignment for the block buffer.
In gcc, there's __attribute__ ((aligned (16))), but since this gets part of the ABI, we can't use it in public headers unless we can specify the same alignment for *all* reasonable compilers for the given architecture.
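To make the concern concrete, a per-compiler macro would look something like the following (the macro name is my invention; the point is that every compiler targeting a given ABI has to produce the same layout, or the struct layout, and hence the ABI, diverges):

```c
#include <stdint.h>

/* Hypothetical sketch of a per-compiler alignment request. */
#if defined(__GNUC__)
# define NETTLE_ALIGN16 __attribute__ ((aligned (16)))
#elif defined(_MSC_VER)
# define NETTLE_ALIGN16 __declspec (align (16))
#else
# define NETTLE_ALIGN16 /* no known way to request alignment */
#endif

struct aligned_ctx
{
  NETTLE_ALIGN16 uint32_t l1_key[256];
};
```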
Regards, /Niels
On Tue, Apr 16, 2013 at 11:55 AM, Niels Möller nisse@lysator.liu.se wrote:
It would also be nice if we could force 16-byte alignment for the l1_key
array (this is important for assembly routines), which would then imply 16-byte alignment for the complete context struct. That could help x86 sse2 assembly, and possibly also ARM, but I'm not sure the system there (primarily linker and malloc) really makes 16-byte alignment possible.
Would it make sense to force allocation of the context (i.e., no context on the stack) via ctx_alloc() function that will use posix_memalign or memalign?
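Something like the following minimal sketch, assuming a POSIX system (the function name ctx_alloc follows the suggestion above; it is not an existing Nettle interface):

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

/* Hypothetical ctx_alloc(): heap-only context allocation via
   posix_memalign, so alignment no longer depends on the ABI's
   stack rules.  Caller frees with free(). */
void *
ctx_alloc (size_t size)
{
  void *p;
  if (posix_memalign (&p, 16, size) != 0)
    return NULL;
  return p;
}
```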
Alternatively you could have a separate set of functions that would operate on aligned data.
regards, Nikos
Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com writes:
Would it make sense to force allocation of the context (i.e., no context on the stack) via ctx_alloc() function that will use posix_memalign or memalign?
I don't think so. That would be a departure from how Nettle's interfaces currently work: "no memory allocation".
As far as I understand, if we could tell the compiler that the structure must be 16-byte aligned, then it should arrange that also for stack allocated objects.
But maybe it won't be reliable. For example,
struct some_ctx *ctx = alloca (sizeof(*ctx));
is a valid use, which depends on what alignment alloca provides. I'm not sure exactly how that works, but the ABI typically specifies a required alignment for the stack pointer, and I suspect that (i) alloca won't round to a larger alignment than that, and (ii) the ABIs of relevant platforms are unlikely to specify an alignment larger than 8 bytes.
Another ugly alternative would be to allocate one or a few extra elements and align manually, something like
uint32_t a[SIZE + 3];
#define ALIGNED_A ((uint32_t *)(((uintptr_t) a + 15) & ~(uintptr_t) 15))
But that's *too* ugly, I think.
And I'm not sure how much difference it would really make to performance. I guess it's not worth doing unless there's a large, demonstrated gain in performance.
(And umac is not the only case where the x86 assembly files use movups/movupd where I'd prefer to use movaps/movapd).
Regards, /Niels
On Tue, Apr 16, 2013 at 1:08 PM, Niels Möller nisse@lysator.liu.se wrote:
Another ugly alternative would be to allocate one or a few extra elements and align manually, something like
uint32_t a[SIZE + 3];
#define ALIGNED_A ((uint32_t *)(((uintptr_t) a + 15) & ~(uintptr_t) 15))
But that's *too* ugly, I think.
Indeed, from what I see I don't think there is a non-ugly solution to that problem :) If you want to avoid the alignment trickery, you could provide aligned and unaligned versions of the functions and let the caller cope with the alignment.
And I'm not sure how much difference it would really make to performance. I guess it's not worth doing unless there's a large, demonstrated gain in performance.
The results will be very CPU-specific. If you have any benchmark or test code, I could test on i7 and amd 64 cpus.
regards, Nikos
Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com writes:
On Tue, Apr 16, 2013 at 1:08 PM, Niels Möller nisse@lysator.liu.se wrote:
And I'm not sure how much difference it would really make to performance. I guess it's not worth doing unless there's a large, demonstrated gain in performance.
The results will be very CPU-specific. If you have any benchmark or test code, I could test on i7 and amd 64 cpus.
No, I don't have any good benchmark. But maybe it matters mostly for code which is close to memory bandwidth limits.
Speaking of benchmarks, I've written some more umac assembly (not yet in the public repo, I'll try to get it in later today).
x86_64 (Intel i5, 3.4 GHz):
Algorithm       mode      Mbyte/s
sha256          update     286.04
sha512          update     433.52
umac32          update   17837.65
umac64          update    8364.80
umac96          update    6447.72
umac128         update    5270.74
ARM (Cortex-A9, 1 GHz):
Algorithm       mode      Mbyte/s
sha256          update      31.69
sha512          update      30.38
umac32          update     937.02
umac64          update     464.81
umac96          update     383.02
umac128         update     350.13
So umac128 seems to be an order of magnitude faster than sha2, at least on machines with decent multiplication performance.
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
Speaking of benchmarks, I've written some more umac assembly (not yet in the public repo, I'll try to get it in later today).
Pushed in now. I also updated http://www.lysator.liu.se/~nisse/nettle/plan.html
Regards, /Niels
nettle-bugs@lists.lysator.liu.se