Maamoun TK maamoun.tk@googlemail.com writes:
On Sat, May 1, 2021 at 6:11 PM Niels Möller nisse@lysator.liu.se wrote:
Is https://git.lysator.liu.se/nettle/nettle/-/merge_requests/23 still the current code?
I've added the basic AES-192 and AES-256 too since there is no problem to test them all together.
Merged to the s390x branch now. Thanks for your patience.
For further improvement, it would be nice to have aesN_set_encrypt_key and aesN_set_decrypt_key be two entrypoints to the same function. But that will make the file replacement logic a bit more complex.
And maybe the public aes*_invert_key functions should be marked as deprecated (and deleted, next time we have an ABI break)? No other ciphers in Nettle have this feature, and it's not that useful for applications. From codesearch.debian.net, it looks like they are exposed by the Haskell and Rust bindings, though.
For the other modes,
Before doing the other modes, do you think you could investigate if memxor and memxor3 can be sped up? That should benefit many ciphers and modes, and give more relevant speedup numbers for specialized functions like aes cbc and aes ctr.
The best strategy depends on whether or not unaligned memory access is possible and efficient. All current implementations do aligned writes to the destination area (and smaller writes if needed at the edges). For the C implementation and several of the asm implementations, they also do aligned reads, and use shifting to get inputs xored together at the right places.
The x86_64 implementation, by contrast, uses unaligned reads, since that seems just as efficient and reduces complexity quite a lot.
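To illustrate the simpler of the two strategies, here is a minimal C sketch of the "aligned destination, possibly unaligned source" approach. It is not Nettle's memxor; the name sketch_memxor and the choice of a 64-bit word are assumptions made for illustration, and memcpy stands in for the unaligned load that an assembly version would do directly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch, not Nettle's memxor: dst[i] ^= src[i] for i < n. */
static void *
sketch_memxor(void *dst_in, const void *src_in, size_t n)
{
  unsigned char *dst = dst_in;
  const unsigned char *src = src_in;

  /* Leading edge: byte-sized writes until dst is word aligned. */
  while (n > 0 && ((uintptr_t) dst % sizeof(uint64_t)) != 0)
    {
      *dst++ ^= *src++;
      n--;
    }

  /* Main loop: dst is aligned here, src may not be. memcpy is the
     portable way to express a possibly unaligned word load in C;
     compilers typically turn it into a single load instruction. */
  while (n >= sizeof(uint64_t))
    {
      uint64_t s, d;
      memcpy(&s, src, sizeof(s));
      memcpy(&d, dst, sizeof(d));
      d ^= s;
      memcpy(dst, &d, sizeof(d));
      dst += sizeof(uint64_t);
      src += sizeof(uint64_t);
      n -= sizeof(uint64_t);
    }

  /* Trailing edge: remaining bytes, one at a time. */
  while (n > 0)
    {
      *dst++ ^= *src++;
      n--;
    }
  return dst_in;
}
```

On a platform where unaligned loads are slow or unavailable, the main loop would instead need aligned reads of src plus shifts to line the words up, which is where most of the extra complexity comes from.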
On all platforms I'm familiar with, assembly implementations can assume that it is safe to read a few bytes outside the edge of the input buffer, as long as those reads don't cross a word boundary (corresponding to valgrind option --partial-loads-ok=yes).
Ideally, memxor performance should be limited by memory/cache bandwidth, with data in L1 cache probably being the most important case. (It looks like nettle-benchmark calls it with a size of 10 KB.)
Note that memxor3 must process data in descending address order, to support the call from cbc_decrypt, with overlapping operands.
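A small sketch of why the direction matters, assuming the overlap is of the kind where the destination sits one block above one of the sources in the same buffer (which is my reading of the in-place cbc_decrypt case). The function sketch_memxor3 below is a hypothetical byte-wise stand-in, not Nettle's implementation.

```c
#include <stddef.h>

/* Hypothetical sketch, not Nettle's memxor3: dst[i] = a[i] ^ b[i],
   walking from the highest address down to the lowest. */
static void *
sketch_memxor3(void *dst_in, const void *a_in, const void *b_in, size_t n)
{
  unsigned char *dst = dst_in;
  const unsigned char *a = a_in;
  const unsigned char *b = b_in;

  /* Descending order: when dst lies above b in the same buffer, each
     byte of b is read before the write that would clobber it. */
  while (n-- > 0)
    dst[n] = a[n] ^ b[n];

  return dst_in;
}
```

With a call like sketch_memxor3(buf + 4, tmp + 4, buf, 8), an ascending loop would overwrite buf[4..7] before they are read as source bytes, while the descending loop reads them first.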
Regards, /Niels