Nikos Mavrogiannopoulos <nmav@gnutls.org> writes:
The benchmarks (assuming this is the same as the older code I sent you a few years ago) were run on an Intel i7, an i5 and a Xeon. All of them showed an improvement. The benchmark above is from the i7.
As for it not improving on AMD, I have no more data than what I wrote you last time (which was a few years ago). I have no idea whether newer AMD processors behave better.
I don't remember much of this benchmarking (and things may have changed, anyway). I think I'm going to add an environment variable to override the CPU detection, so different variants can be checked easily at runtime. We'll see later on whether some finer granularity is needed.
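To make the override idea concrete, here is a minimal sketch in C. It is not taken from fat.c: the variable name EXAMPLE_FAT_OVERRIDE, the feature flag, the accepted values and the stubbed detection function are all assumptions made for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical feature flag; the real code may represent this differently. */
enum { FEATURE_SSE2 = 1 };

/* Stand-in for real cpuid-based detection (assumed to report SSE2 here). */
unsigned
detect_cpu_features (void)
{
  return FEATURE_SSE2;
}

/* Return the feature set to use. If the (hypothetical) environment
   variable EXAMPLE_FAT_OVERRIDE is set to a recognized value, it wins
   over detection; otherwise fall back to detect_cpu_features. */
unsigned
get_cpu_features (void)
{
  const char *s = getenv ("EXAMPLE_FAT_OVERRIDE");

  if (s)
    {
      if (!strcmp (s, "none"))
        return 0;
      if (!strcmp (s, "sse2"))
        return FEATURE_SSE2;
      /* Unrecognized value: ignore it and use normal detection. */
    }
  return detect_cpu_features ();
}
```

With something like this, running a benchmark once with EXAMPLE_FAT_OVERRIDE=none and once with EXAMPLE_FAT_OVERRIDE=sse2 would compare both variants on the same machine without rebuilding.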
I didn't like the duplication of code either. I'm not very skilled in m4, but I thought that x86_64/ could include the fat variant and use the non-sse2 variant.
I think I'd prefer to do it the other way around, with memxor-1.asm and memxor-2.asm both including x86_64/memxor.asm and just defining USE_SSE2 differently, leaving little actual code under fat/. Do you see any problem with that approach?
The code in fat.c is quite elaborate in the cases it handles. The more functions are added, the more unmanageable the code will become. Would it make sense to restrict that support to the systems where ifunc is available? Then the addition of new optimized functions becomes very simple.
I agree that as more functions are added, we need some macros for the boilerplate code. But I think that can be done without dropping support for the non-ifunc systems. Basically, use an alternative definition of your DEFINE_FAT_FUNC which defines a wrapper function and an init function, instead of a resolver function.
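A rough sketch of what such a non-ifunc definition could look like follows. The macro parameters, the _vec and _init naming, fat_init, and the example_xor functions are all assumptions for illustration; I haven't matched them against the actual DEFINE_FAT_FUNC in the patch.

```c
#include <stddef.h>

/* Selects implementations for all fat functions; defined further down. */
static void fat_init (void);

/* Non-ifunc variant (assumed signature): the public function calls
   through a pointer that initially refers to name##_init. The first
   call runs fat_init, which repoints the pointer at the selected
   implementation, and then forwards the call. */
#define DEFINE_FAT_FUNC(name, rtype, prototype, args)           \
  static rtype name##_init prototype;                           \
  static rtype (*name##_vec) prototype = name##_init;           \
  rtype name prototype                                          \
  {                                                             \
    return name##_vec args;                                     \
  }                                                             \
  static rtype name##_init prototype                            \
  {                                                             \
    fat_init ();                                                \
    return name##_vec args;                                     \
  }

/* Two stand-in implementations with identical behavior; in real code
   one would be the plain variant and one the SSE2 variant. */
static void *
example_xor_c (void *dst, const void *src, size_t n)
{
  unsigned char *d = dst;
  const unsigned char *s = src;
  size_t i;
  for (i = 0; i < n; i++)
    d[i] ^= s[i];
  return dst;
}

static void *
example_xor_alt (void *dst, const void *src, size_t n)
{
  return example_xor_c (dst, src, n);
}

DEFINE_FAT_FUNC (example_xor, void *,
                 (void *dst, const void *src, size_t n), (dst, src, n))

/* Stand-in for CPU detection; always picks the plain variant here. */
static int
example_have_sse2 (void)
{
  return 0;
}

static void
fat_init (void)
{
  example_xor_vec = example_have_sse2 () ? example_xor_alt : example_xor_c;
}
```

On an ifunc system the same macro invocation could instead expand to a resolver function, so the list of fat functions would only need to be written once. Note that fat_init must set every _vec pointer, otherwise the init function would call itself again.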
Regards, /Niels