On Fri, 2015-01-16 at 22:18 +0100, Niels Möller wrote:
Nikos Mavrogiannopoulos <n.mavrogiannopoulos@gmail.com> writes:
A quick and dirty patch to enable SSE2 instructions for memxor() on Intel CPUs is attached. I tried to follow the logic in the fat.c file, but I may have missed something. I've not added memxor3() because it is actually slower with SSE2.
Cool!
SSE2: memxor aligned 26081.83 memxor unaligned 25893.69
No-SSE2: memxor aligned 17806.94 memxor unaligned 16581.48
How confident are you that the Intel vs AMD check is the right way to enable SSE2? I guess we could add a check on the particular CPU model later, if needed. Which model(s) did you benchmark on?
The benchmarks (if this is the same code I sent you a few years ago) were run on an Intel i7, an i5 and a Xeon. All of them showed an improvement. The benchmark above is from the i7.
As for it not improving on AMD, I have no more data than what I wrote to you last time (which was a few years ago). I have no idea whether newer AMD processors behave better.
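For reference, a minimal sketch of the kind of vendor check discussed above. It is not the code from the patch or from fat.c. It assumes a GCC or Clang toolchain that ships <cpuid.h>, and the names vendor_from_regs, detect_vendor and want_sse2_memxor are made up for illustration.

```c
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header (assumed toolchain) */
#endif

enum cpu_vendor { VENDOR_OTHER, VENDOR_INTEL, VENDOR_AMD };

/* CPUID leaf 0 returns a 12-byte vendor string in EBX, EDX, ECX
   (in that order), least significant byte first. */
static enum cpu_vendor
vendor_from_regs(uint32_t ebx, uint32_t edx, uint32_t ecx)
{
  uint32_t regs[3] = { ebx, edx, ecx };
  char id[13];
  int i;

  for (i = 0; i < 12; i++)
    id[i] = (char) ((regs[i / 4] >> (8 * (i % 4))) & 0xff);
  id[12] = '\0';

  if (strcmp(id, "GenuineIntel") == 0)
    return VENDOR_INTEL;
  if (strcmp(id, "AuthenticAMD") == 0)
    return VENDOR_AMD;
  return VENDOR_OTHER;
}

/* Query the running CPU; reports VENDOR_OTHER on non-x86 builds. */
static enum cpu_vendor
detect_vendor(void)
{
#if defined(__x86_64__) || defined(__i386__)
  unsigned int eax, ebx, ecx, edx;
  if (__get_cpuid(0, &eax, &ebx, &ecx, &edx))
    return vendor_from_regs(ebx, edx, ecx);
#endif
  return VENDOR_OTHER;
}

/* Hypothetical policy: use the SSE2 memxor only on Intel, since that
   is the only vendor with benchmark data in this thread. */
static int
want_sse2_memxor(enum cpu_vendor v)
{
  return v == VENDOR_INTEL;
}
```

A per-model check could later be layered on top by also decoding the family/model fields from CPUID leaf 1, if the vendor test turns out to be too coarse.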
It would be nice in a way if we could share code with x86_64/memxor.asm. E.g., by defining x86_64/fat/memxor-1.asm and x86_64/fat/memxor-2.asm which each include the same file with a different setting of USE_SSE2. But I haven't looked at that carefully, it might be better to have a unified x86_64/fat/memxor.asm with two entry points, like you do. I've also been considering m4 hacks to let a single fat .asm file include multiple other .asm files, or including the same file twice, without labels or m4 definitions colliding, but I'm not sure that's worth the effort. The foo-1.asm, foo-2.asm, ... scheme is a bit inelegant, but it is easy to understand.
I didn't like the code duplication either. I'm not very skilled in m4, but I thought that x86_64/ could include the fat variant and use the non-SSE2 variant.
The code in fat.c is quite elaborate in the cases it handles. The more functions are added, the more unmanageable the code will become. Would it make sense to restrict that support to the systems where ifunc is available? Then adding new optimized functions becomes very simple.
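To illustrate what an ifunc-only approach might look like, here is a rough sketch. It is not taken from fat.c. It assumes GCC or Clang on x86_64 with glibc, using the GNU ifunc attribute together with the __builtin_cpu_init() and __builtin_cpu_is() builtins, and it falls back to the plain C variant elsewhere. All function names are hypothetical, and memxor_wide is only a C stand-in for the SSE2 assembly routine.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>   /* on glibc systems this also defines __GLIBC__ */

typedef void *memxor_func(void *dst, const void *src, size_t n);

/* Portable byte-at-a-time variant. */
static void *
memxor_c(void *dst, const void *src, size_t n)
{
  uint8_t *d = dst;
  const uint8_t *s = src;
  size_t i;

  for (i = 0; i < n; i++)
    d[i] ^= s[i];
  return dst;
}

/* Stand-in for the SSE2 variant: XORs eight bytes at a time, then
   handles the tail bytewise. */
static void *
memxor_wide(void *dst, const void *src, size_t n)
{
  uint8_t *d = dst;
  const uint8_t *s = src;

  for (; n >= 8; n -= 8, d += 8, s += 8)
    {
      uint64_t a, b;
      memcpy(&a, d, 8);
      memcpy(&b, s, 8);
      a ^= b;
      memcpy(d, &a, 8);
    }
  while (n-- > 0)
    *d++ ^= *s++;
  return dst;
}

#if defined(__GLIBC__) && defined(__x86_64__) && defined(__has_attribute)
# if __has_attribute(ifunc)
#  define SKETCH_HAVE_IFUNC 1
# endif
#endif

#ifdef SKETCH_HAVE_IFUNC
/* Runs once, when the dynamic linker resolves sketch_memxor. */
static memxor_func *
resolve_memxor(void)
{
  __builtin_cpu_init();   /* required before other cpu builtins in a resolver */
  return __builtin_cpu_is("intel") ? memxor_wide : memxor_c;
}

void *sketch_memxor(void *dst, const void *src, size_t n)
  __attribute__((ifunc("resolve_memxor")));
#else
/* No ifunc available: always use the portable variant. */
void *
sketch_memxor(void *dst, const void *src, size_t n)
{
  return memxor_c(dst, src, n);
}
#endif
```

With this scheme each new optimized function would need only a resolver and one declaration, with no hand-written function pointers or initialization order to maintain. The cost is that platforms without ifunc get only the generic code.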
regards, Nikos