Hello,

I've run some tests with memxor on an x86-64 machine. My results are:

* C implementation (compiled with gcc 4.4):
  Xoring in chunks of 32768 bytes: done. 50.09 Gb in 5.00 secs: 10.02 Gb/sec
  Xoring (unaligned) in chunks of 32768 bytes: done. 39.90 Gb in 5.00 secs: 7.98 Gb/sec

* ASM implementation:
  Xoring in chunks of 32768 bytes: done. 38.32 Gb in 5.00 secs: 7.66 Gb/sec
  Xoring (unaligned) in chunks of 32768 bytes: done. 30.16 Gb in 5.00 secs: 6.03 Gb/sec

It seems that on x86-64 the ASM version is slower than the C one. Moreover, I noticed that the loop unrolling techniques used in the C code have no visible performance benefit.
However, an SSE2 version of memxor (attached) increases performance by 30% or more on the same CPU.

* SSE2:
  Xoring in chunks of 32768 bytes: done. 69.94 Gb in 5.00 secs: 13.98 Gb/sec
  Xoring (unaligned) in chunks of 32768 bytes: done. 65.96 Gb in 5.00 secs: 13.19 Gb/sec

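For illustration only, here is a minimal sketch of what an SSE2-based memxor could look like. This is not the attached code: the function name memxor_sse2_sketch is hypothetical, and the sketch assumes unaligned loads and stores throughout rather than any alignment handling the real patch may do.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical name, not the attached implementation.
 * XORs n bytes of src into dst and returns dst. */
void *memxor_sse2_sketch(void *dst_in, const void *src_in, size_t n)
{
    unsigned char *dst = (unsigned char *)dst_in;
    const unsigned char *src = (const unsigned char *)src_in;

    /* Main loop: 16 bytes per iteration. The "u" variants accept
     * any alignment, so no separate aligned path is needed here. */
    while (n >= 16) {
        __m128i d = _mm_loadu_si128((const __m128i *)dst);
        __m128i s = _mm_loadu_si128((const __m128i *)src);
        _mm_storeu_si128((__m128i *)dst, _mm_xor_si128(d, s));
        dst += 16;
        src += 16;
        n -= 16;
    }

    /* Tail: remaining 0..15 bytes, one byte at a time. */
    while (n > 0) {
        *dst++ ^= *src++;
        n--;
    }
    return dst_in;
}
```

On x86-64, SSE2 is part of the base instruction set, so gcc should need no extra flags to build this; on 32-bit x86 it would need -msse2.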
regards,
Nikos