 
Maamoun TK maamoun.tk@googlemail.com writes:

>> What's the speedup you get from assembly gcm_fill? I see the C implementation uses memcpy and WRITE_UINT32, and is likely significantly slower than the ctr_fill16 in ctr.c. But it could be improved using portable means. If done well, it should be a very small fraction of the cpu time spent for gcm encryption.
>
> I measured the execution time of both C and altivec implementations on POWER8 for 32,768 blocks (512 KB), repeated 10000 times and compiled with -O3
>
>   gcm_fill_c() took 0.000073 seconds to execute
>   gcm_fill_altivec() took 0.000019 seconds to execute
>
> As you can see, the function itself isn't time consuming at all and maybe optimizing it is not worth it,
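
For anyone who wants to repeat this kind of measurement, a minimal timing harness could look roughly like the sketch below. It is not the code used for the numbers above. The names block16, fill16_func, fill_bytewise and time_fill are made up for illustration, block16 only stands in for union nettle_block16, and clock() measures CPU time at coarse resolution, so the result is an average over many calls rather than a precise per-call figure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Stand-in for nettle's union nettle_block16 (assumed 16-byte layout). */
union block16 { uint8_t b[16]; uint64_t u64[2]; };

/* Hypothetical function type, modelled on the gcm_fill prototype. */
typedef void fill16_func(uint8_t *ctr, size_t blocks, union block16 *buffer);

/* Simple byte-wise counter fill, present only as something to time. */
static void fill_bytewise(uint8_t *ctr, size_t blocks, union block16 *buffer)
{
  uint32_t c = ((uint32_t)ctr[12] << 24) | ((uint32_t)ctr[13] << 16)
             | ((uint32_t)ctr[14] << 8) | (uint32_t)ctr[15];

  for (; blocks-- > 0; buffer++, c++)
    {
      memcpy(buffer->b, ctr, 12);
      buffer->b[12] = (uint8_t)(c >> 24);
      buffer->b[13] = (uint8_t)(c >> 16);
      buffer->b[14] = (uint8_t)(c >> 8);
      buffer->b[15] = (uint8_t)c;
    }
  ctr[12] = (uint8_t)(c >> 24);
  ctr[13] = (uint8_t)(c >> 16);
  ctr[14] = (uint8_t)(c >> 8);
  ctr[15] = (uint8_t)c;
}

/* Volatile sink, so the compiler cannot discard the filled buffer. */
static volatile uint8_t sink;

/* Average CPU seconds per call of f, over `repeats` calls.
   Returns a negative value on bad arguments or allocation failure. */
static double time_fill(fill16_func *f, size_t blocks, unsigned repeats)
{
  uint8_t ctr[16] = {0};
  union block16 *buffer;
  clock_t start, end;
  unsigned i;

  assert(f != NULL);
  if (blocks == 0 || repeats == 0)
    return -1.0;

  buffer = malloc(blocks * sizeof(*buffer));
  if (!buffer)
    return -1.0;

  start = clock();
  for (i = 0; i < repeats; i++)
    {
      f(ctr, blocks, buffer);
      sink = buffer[blocks - 1].b[15];
    }
  end = clock();

  free(buffer);
  return (double)(end - start) / CLOCKS_PER_SEC / repeats;
}
```

Calling time_fill(fill_bytewise, 32768, 10000) would mirror the 32,768 blocks and 10000 repetitions mentioned above; any other implementation with the same prototype can be passed in its place for comparison.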
Can you try the patch below? So far it's tested on little-endian (x86_64) only, and there the loop compiles to

  50:   89 c8                   mov    %ecx,%eax
  52:   4c 89 0a                mov    %r9,(%rdx)
  55:   48 83 c2 10             add    $0x10,%rdx
  59:   83 c1 01                add    $0x1,%ecx
  5c:   0f c8                   bswap  %eax
  5e:   48 c1 e0 20             shl    $0x20,%rax
  62:   4c 01 d0                add    %r10,%rax
  65:   48 89 42 f8             mov    %rax,-0x8(%rdx)
  69:   4c 39 c2                cmp    %r8,%rdx
  6c:   75 e2                   jne    50 <gcm_fill+0x20>

That should run in a few cycles per block (6 cycles per block, assuming dual issue and decent out-of-order capabilities). I would expect unrolling, to do multiple blocks in parallel, to give a large performance improvement only on strict in-order processors.

> but gcm_fill is part of AES_CTR, and what other libraries usually do is optimize AES_CTR as a whole, so I considered optimizing it to stay on the same track.

In Nettle, I try to take on the extra complexity of an assembler implementation only when there's a significant performance benefit.

Regards,
/Niels

diff --git a/gcm.c b/gcm.c
index cf615daf..71e9f365 100644
--- a/gcm.c
+++ b/gcm.c
@@ -334,6 +334,46 @@ gcm_update(struct gcm_ctx *ctx, const struct gcm_key *key,
 }

 static nettle_fill16_func gcm_fill;
+#if WORDS_BIGENDIAN
+static void
+gcm_fill(uint8_t *ctr, size_t blocks, union nettle_block16 *buffer)
+{
+  uint64_t hi, mid;
+  uint32_t lo;
+  size_t i;
+  hi = READ_UINT64(ctr);
+  mid = (uint64_t)READ_UINT32(ctr + 8) << 32;
+  lo = READ_UINT32(ctr + 12);
+
+  for (i = 0; i < blocks; i++)
+    {
+      buffer[i].u64[0] = hi;
+      buffer[i].u64[1] = mid + lo++;
+    }
+  WRITE_UINT32(ctr + 12, lo);
+
+}
+#elif HAVE_BUILTIN_BSWAP64
+/* Assume __builtin_bswap32 is also available */
+static void
+gcm_fill(uint8_t *ctr, size_t blocks, union nettle_block16 *buffer)
+{
+  uint64_t hi, mid;
+  uint32_t lo;
+  size_t i;
+  hi = LE_READ_UINT64(ctr);
+  mid = LE_READ_UINT32(ctr + 8);
+  lo = READ_UINT32(ctr + 12);
+
+  for (i = 0; i < blocks; i++)
+    {
+      buffer[i].u64[0] = hi;
+      buffer[i].u64[1] = mid + ((uint64_t)__builtin_bswap32(lo) << 32);
+      lo++;
+    }
+  WRITE_UINT32(ctr + 12, lo);
+}
+#else
 static void
 gcm_fill(uint8_t *ctr, size_t blocks, union nettle_block16 *buffer)
 {
@@ -349,6 +389,7 @@ gcm_fill(uint8_t *ctr, size_t blocks, union nettle_block16 *buffer)

   WRITE_UINT32(ctr + GCM_BLOCK_SIZE - 4, c);
 }
+#endif

 void
 gcm_encrypt (struct gcm_ctx *ctx, const struct gcm_key *key,
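
Since the patch has so far been run on little-endian only, a standalone check that the word-based fill agrees with the byte-wise one might be useful, in particular on a big-endian machine. The sketch below is one way to do that outside of Nettle. Everything in it is a stand-in invented for illustration: block16 for union nettle_block16, read_be32 and write_be32 for READ_UINT32 and WRITE_UINT32, a portable bswap32 in place of __builtin_bswap32, and a runtime endianness test in place of WORDS_BIGENDIAN.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for nettle's union nettle_block16 (assumed 16-byte layout). */
union block16 { uint8_t b[16]; uint64_t u64[2]; };

/* Hypothetical byte-wise big-endian helpers. */
static uint32_t read_be32(const uint8_t *p)
{
  return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
       | ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static void write_be32(uint8_t *p, uint32_t x)
{
  p[0] = (uint8_t)(x >> 24);
  p[1] = (uint8_t)(x >> 16);
  p[2] = (uint8_t)(x >> 8);
  p[3] = (uint8_t)x;
}

/* Portable byte swap, used instead of a compiler builtin. */
static uint32_t bswap32(uint32_t x)
{
  return (x >> 24) | ((x >> 8) & 0xff00u) | ((x << 8) & 0xff0000u) | (x << 24);
}

/* Runtime endianness test, used instead of a configure-time macro. */
static int is_little_endian(void)
{
  const uint16_t one = 1;
  uint8_t first;
  memcpy(&first, &one, 1);
  return first == 1;
}

/* Reference: copy the 12 constant bytes, store the counter byte by byte. */
static void fill_ref(uint8_t *ctr, size_t blocks, union block16 *buffer)
{
  uint32_t c = read_be32(ctr + 12);
  for (; blocks-- > 0; buffer++, c++)
    {
      memcpy(buffer->b, ctr, 12);
      write_be32(buffer->b + 12, c);
    }
  write_be32(ctr + 12, c);
}

/* Word-based variant: load the constant bytes once in native order,
   then combine them with the counter using shifts only. */
static void fill_words(uint8_t *ctr, size_t blocks, union block16 *buffer)
{
  uint64_t hi;
  uint32_t mid, lo = read_be32(ctr + 12);
  const int le = is_little_endian();
  size_t i;

  memcpy(&hi, ctr, 8);       /* bytes 0..7, native order */
  memcpy(&mid, ctr + 8, 4);  /* bytes 8..11, native order */

  for (i = 0; i < blocks; i++, lo++)
    {
      buffer[i].u64[0] = hi;
      buffer[i].u64[1] = le
        ? ((uint64_t)mid | ((uint64_t)bswap32(lo) << 32))
        : (((uint64_t)mid << 32) | lo);
    }
  write_be32(ctr + 12, lo);
}

enum { MAX_BLOCKS = 8 };

/* Returns 0 if both variants produce the same blocks and final counter. */
static int check_fill(uint32_t start, size_t blocks)
{
  uint8_t ctr_a[16], ctr_b[16];
  union block16 a[MAX_BLOCKS], b[MAX_BLOCKS];
  size_t i;

  assert(blocks <= MAX_BLOCKS);
  for (i = 0; i < 12; i++)
    ctr_a[i] = (uint8_t)(0xa0 + i);
  write_be32(ctr_a + 12, start);
  memcpy(ctr_b, ctr_a, 16);

  fill_ref(ctr_a, blocks, a);
  fill_words(ctr_b, blocks, b);

  return memcmp(a, b, blocks * sizeof(a[0])) != 0
      || memcmp(ctr_a, ctr_b, 16) != 0;
}
```

Calling check_fill with a start value such as 0xfffffffe also covers the case where the 32-bit counter wraps around, which both variants handle through plain uint32_t arithmetic.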