I've merged a reorganization of the x86_64 aesni code into the master-updates branch for testing. This replaces the x86_64/aesni/aes-*crypt-internal.asm files with separate files for the different key sizes, as discussed earlier.
I've also implemented 2-way interleaving, i.e., processing 2 blocks at a time, which gave a nice speedup on the order of 15% in my tests. It may be worthwhile to go to 3-way or 4-way, but I don't plan to try that soon.
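For anyone unfamiliar with the term, here is a rough sketch in portable C of what interleaving means. It is not the actual assembly: toy_round is a hypothetical stand-in for one cipher round (it is not AES), and the names, the flat key layout, and the round count are assumptions made only for illustration. The point is the loop structure: each round is applied to two independent blocks back to back, so the two dependency chains can overlap in the pipeline.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_WORDS 4  /* one 16-byte block as four 32-bit words */
#define ROUNDS 10      /* assumed round count, for illustration only */

/* Hypothetical stand-in for one cipher round (NOT AES). In the real
   code this is where a single round instruction would be issued. */
static void
toy_round (uint32_t *b, const uint32_t *k)
{
  for (int i = 0; i < BLOCK_WORDS; i++)
    {
      uint32_t t = b[i] ^ k[i];
      b[i] = ((t << 5) | (t >> 27)) + k[(i + 1) % BLOCK_WORDS];
    }
}

/* Reference version: finish all rounds of one block before
   starting the next block. */
static void
encrypt_1way (const uint32_t *keys, size_t nblocks, uint32_t *data)
{
  assert (keys != NULL && data != NULL);
  for (size_t n = 0; n < nblocks; n++)
    for (int r = 0; r < ROUNDS; r++)
      toy_round (data + n * BLOCK_WORDS, keys + r * BLOCK_WORDS);
}

/* 2-way interleaved version: each round key is applied to two
   independent blocks before moving on to the next round. */
static void
encrypt_2way (const uint32_t *keys, size_t nblocks, uint32_t *data)
{
  size_t n = 0;
  assert (keys != NULL && data != NULL);
  for (; n + 2 <= nblocks; n += 2)
    for (int r = 0; r < ROUNDS; r++)
      {
        toy_round (data + n * BLOCK_WORDS, keys + r * BLOCK_WORDS);
        toy_round (data + (n + 1) * BLOCK_WORDS, keys + r * BLOCK_WORDS);
      }
  /* Odd block count: process the final block on its own. */
  if (n < nblocks)
    for (int r = 0; r < ROUNDS; r++)
      toy_round (data + n * BLOCK_WORDS, keys + r * BLOCK_WORDS);
}
```

Both functions produce the same output for any block count. Going to 3-way or 4-way would follow the same pattern, with a correspondingly longer tail case for the leftover blocks.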
Regards, /Niels