I used Apple's Shark tool, which displays load stalls, cycle counts, loop alignment and more. It warned about loading from and storing to the same memory address within a single PPC970 (G5) "bundle", and that made me reconsider the memory I/O in the loop.
This was with standard compiler flags, so I didn't get any loop unrolling etc., and there's probably plenty of headroom for further tuning. Another thing that may be beneficial is to special-case aligned memory buffers and read and write 32-bit chunks at a time instead of XORing individual bytes. That also applies to functions like memxor().
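To illustrate the aligned special case, here is a minimal sketch in C. It is not Nettle's memxor(). The name memxor_sketch is made up for this example, and it assumes a 4-byte alignment check on both pointers is what decides the fast path. It XORs 32 bits at a time when both buffers are aligned and handles everything else byte by byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not Nettle's memxor()): XOR n bytes of src into dst. */
static void memxor_sketch(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));

    /* Fast path: taken only when both pointers are 4-byte aligned. */
    if ((((uintptr_t)dst | (uintptr_t)src) & 3u) == 0) {
        while (n >= 4) {
            uint32_t a, b;
            /* memcpy keeps this free of aliasing problems; compilers
               normally turn each call into a single 32-bit load or store. */
            memcpy(&a, dst, 4);
            memcpy(&b, src, 4);
            a ^= b;
            memcpy(dst, &a, 4);
            dst += 4;
            src += 4;
            n -= 4;
        }
    }

    /* Remaining tail bytes, or the whole buffer if it was unaligned. */
    while (n > 0) {
        *dst++ ^= *src++;
        n--;
    }
}
```

Whether this actually beats the byte loop on a given CPU would have to be measured. Buffers with different misalignments would need shifting or a byte-wise prologue, which this sketch leaves out.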
Ideally, since there are trade-offs between different CPUs even within the same family (in the case of PPC: 604, G3, G4 (AltiVec), G5 (64-bit), etc.), the library should pick the best implementation at run time rather than at compile time. For Pike it's not feasible to require a G4 or G5, but we still want to use vectorized code when possible.
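Run-time selection could look roughly like the sketch below, which picks an implementation through a function pointer on first use. All names here are hypothetical. cpu_has_vector_unit() is a stub standing in for a platform-specific probe, and xor_vector_standin() is plain C standing in for what would be an AltiVec routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*xor_fn)(uint8_t *dst, const uint8_t *src, size_t n);

/* Portable fallback that runs on any CPU. */
static void xor_generic(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] ^= src[i];
}

/* Stand-in for a vectorized routine; real code would use AltiVec here. */
static void xor_vector_standin(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] ^= src[i];
}

/* Hypothetical probe: a real one would ask the OS or the CPU.
   This stub always reports "no vector unit". */
static int cpu_has_vector_unit(void)
{
    return 0;
}

/* Choose an implementation from the detected capability. */
static xor_fn select_xor(int has_vector)
{
    return has_vector ? xor_vector_standin : xor_generic;
}

/* Public entry point: resolves the implementation on the first call.
   Not thread-safe as written; the first call should happen before
   any threads are started. */
static void xor_dispatch(uint8_t *dst, const uint8_t *src, size_t n)
{
    static xor_fn impl = NULL;
    if (impl == NULL)
        impl = select_xor(cpu_has_vector_unit());
    assert(impl != NULL);
    impl(dst, src, n);
}
```

With this layout the same binary runs on a 604 or G3 and still uses the faster path on a G4 or G5, at the cost of one indirect call per invocation.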
/ Jonas Walldén
Previous text:
2004-02-05 13:05: Subject: Nettle
Nice, I'll try that. Have you looked at the assembler output? It would be interesting to know if there's any room for improvement by hand-tuning the assembler code.
/ Niels Möller (sharpening the red pen)