Maamoun TK <maamoun.tk@googlemail.com> writes:
+Lmod:
- C --- process the modulo bytes, padding the low-order bytes with zeros
- cmpldi LENGTH,0
- beq Ldone
- C load table elements
- li r8,1*TableElemAlign
- lxvd2x VSR(H1M),0,TABLE
- lxvd2x VSR(H1L),r8,TABLE
- C push every modulo byte to the stack and load them with padding into vector register
- vxor ZERO,ZERO,ZERO
- addi r8,SP,-16
- stvx ZERO,0,r8
+Lstb_loop:
- subic. LENGTH,LENGTH,1
- lbzx r7,LENGTH,DATA
- stbx r7,LENGTH,r8
- bne Lstb_loop
- lxvd2x VSR(C0),0,r8
It's always a bit annoying to have to deal with leftovers like this in the assembly code. Can we avoid having to store it to memory and read back? I can see three other approaches:
1. Loop, reading a byte at a time, and shift into a target register. I guess we would need to assemble the bytes in a regular register, and then transfer the final value to a vector register. Is that expensive?
2. Round the address down to make it aligned, read an aligned word and, only if needed, the next word, then shift and mask to get the needed bytes. I think it is fine to read a few bytes outside of the input area, as long as the reads do *not* cross any word boundary (and hence a potential page boundary). We do things like this in some other places, but there it is for reading unaligned data in general, not just leftover parts.
3. Adapt the internal C/asm interface, so that the assembly routine only needs to handle complete blocks. It could provide a gcm_gf_mul, and let the C code handle partial blocks using memxor + gcm_gf_mul.
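To make (1) concrete, here is a rough C model of the byte-at-a-time loop. It is only a sketch: the names block128 and load_partial_block are made up for illustration, and the big-endian placement of the bytes is an assumption about how the block is interpreted, not something taken from the patch. The assembly would do the same with two general registers and then move them to a vector register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type, not part of nettle: a 128-bit block held as two
   64-bit words, hi containing the first eight bytes of the block. */
struct block128 { uint64_t hi; uint64_t lo; };

/* Hypothetical helper: collect length (< 16) leftover bytes, one byte
   at a time, into a zero-initialized 128-bit value. Bytes that are
   not supplied stay zero, which gives the zero padding of the
   low-order bytes. Big-endian placement is an assumption. */
static struct block128
load_partial_block(const uint8_t *data, size_t length)
{
  struct block128 b = { 0, 0 };
  size_t i;

  assert(length < 16);
  for (i = 0; i < length; i++)
    {
      /* Byte i of the block lands at bit offset 8*(15-i) of the
         128-bit value, split over the two 64-bit halves. */
      if (i < 8)
        b.hi |= (uint64_t) data[i] << (8 * (7 - i));
      else
        b.lo |= (uint64_t) data[i] << (8 * (15 - i));
    }
  return b;
}
```

With three leftover bytes 01 02 03, this gives hi = 0x0102030000000000 and lo = 0. No store to the stack is needed; what remains open is the cost of moving the two words into a vector register.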
I would guess (1) or maybe (3) is the most reasonable. I don't think performance is that important, since it looks like for each message, this case can happen only for the last call to gcm_update and the last call to gcm_encrypt/gcm_decrypt.
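For (3), the C side could look roughly like the sketch below. The names sketch_gf_mul and sketch_hash are hypothetical, and the multiply is a plain bitwise reference version (following the GCM bit ordering in NIST SP 800-38D) standing in for whatever the assembly file would export. The xor loops stand in for memxor. The point is only that the partial block is handled by xoring in the available bytes and multiplying once more, so the assembly never sees it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16

/* Hypothetical stand-in for the exported multiply: x = x * h in
   GF(2^128), bitwise, using the GCM bit order where the most
   significant bit of byte 0 is the coefficient of degree 0. */
static void
sketch_gf_mul(uint8_t *x, const uint8_t *h)
{
  uint8_t z[BLOCK_SIZE] = { 0 };
  uint8_t v[BLOCK_SIZE];
  unsigned i, j, carry;

  memcpy(v, h, BLOCK_SIZE);
  for (i = 0; i < 128; i++)
    {
      /* If bit i of x is set, add the current multiple of h. */
      if (x[i / 8] & (0x80 >> (i % 8)))
        for (j = 0; j < BLOCK_SIZE; j++)
          z[j] ^= v[j];

      /* Multiply v by the polynomial "x": shift right by one bit and
         reduce with the constant 0xe1 in the first byte. */
      carry = v[BLOCK_SIZE - 1] & 1;
      for (j = BLOCK_SIZE - 1; j > 0; j--)
        v[j] = (uint8_t) ((v[j] >> 1) | (v[j - 1] << 7));
      v[0] >>= 1;
      if (carry)
        v[0] ^= 0xe1;
    }
  memcpy(x, z, BLOCK_SIZE);
}

/* Hypothetical C-side update: complete blocks and the final partial
   block both go through xor + multiply. Xoring only length bytes is
   the same as xoring a zero-padded block. */
static void
sketch_hash(uint8_t *x, const uint8_t *h, size_t length, const uint8_t *data)
{
  size_t i;

  for (; length >= BLOCK_SIZE; length -= BLOCK_SIZE, data += BLOCK_SIZE)
    {
      for (i = 0; i < BLOCK_SIZE; i++)
        x[i] ^= data[i];
      sketch_gf_mul(x, h);
    }
  if (length > 0)
    {
      assert(length < BLOCK_SIZE);
      for (i = 0; i < length; i++)
        x[i] ^= data[i];
      sketch_gf_mul(x, h);
    }
}
```

In a real version the loop over complete blocks would be the assembly routine, and only the final if-branch would stay in C.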
What about test coverage? It looks like we have test cases for sizes up to 8 blocks, and for partial blocks, so I guess that should be fine?
Regards,
/Niels