Now I'm confused. I do have the 45 and 66 MB/s figures in my *shell* buffer, but I can't reproduce the 66 MB/s figure. Perhaps that was with the buggy version of the code? Anyway, x86 performance for the C version doesn't matter that much anymore.
/ Niels Möller (sharpening the red pen)
Previous text:
2004-02-05 14:21: Subject: Nettle
On my laptop (Intel P4), I get an increase from 45 MB/s to 66 MB/s.
Does it matter whether si and sj are ints or uint8_t? I see no speed difference either way.
The inner loop gets compiled into the following (Intel, gcc-3.3, -O2):
.L28:
        incb    -13(%ebp)
        decl    %ebx
        movzbl  -13(%ebp), %edx
        movzbl  (%edx,%edi), %ecx
        addb    %cl, -14(%ebp)
        movzbl  -14(%ebp), %eax
        movzbl  (%eax,%edi), %eax
        movb    %al, (%edx,%edi)
        addb    %cl, %al
        movl    16(%ebp), %edx
        movzbl  %al, %eax
        movzbl  (%eax,%edi), %eax
        xorb    (%esi), %al
        incl    %esi
        movb    %al, (%edx)
        incl    %edx
        cmpl    $-1, %ebx
        movl    %edx, 16(%ebp)
        jne     .L28
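It seems gcc can't fit all variables into registers, hence the save and restore operations via %ebp: the two byte-sized indices (at -13(%ebp) and -14(%ebp)) and apparently the destination pointer (at 16(%ebp)) are kept on the stack and reloaded on every iteration.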
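For reference, here is a minimal sketch of the kind of C loop being discussed. It is the textbook RC4 (arcfour) update written with si and sj temporaries, not a copy of the Nettle source; the struct and function names (rc4_sketch_ctx, rc4_sketch_set_key, rc4_sketch_crypt) are made up for illustration. It shows why the register pressure arises: i, j, si, sj, the S table pointer, src, dst and length are all live inside the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical context; the names are illustrative, not Nettle's API. */
struct rc4_sketch_ctx {
  uint8_t S[256];
  uint8_t i, j;
};

/* Standard RC4 key schedule. */
static void
rc4_sketch_set_key(struct rc4_sketch_ctx *ctx,
                   size_t key_length, const uint8_t *key)
{
  unsigned i;
  uint8_t j = 0;

  assert(key_length > 0);

  for (i = 0; i < 256; i++)
    ctx->S[i] = (uint8_t) i;

  for (i = 0; i < 256; i++) {
    uint8_t t;
    j = (uint8_t) (j + ctx->S[i] + key[i % key_length]);
    t = ctx->S[i];
    ctx->S[i] = ctx->S[j];
    ctx->S[j] = t;
  }
  ctx->i = ctx->j = 0;
}

/* Inner loop: i and j wrap modulo 256 because they are uint8_t.
   si and sj may be declared as int or as uint8_t; the result is
   the same since the final index is masked to 8 bits. */
static void
rc4_sketch_crypt(struct rc4_sketch_ctx *ctx,
                 size_t length, uint8_t *dst, const uint8_t *src)
{
  uint8_t i = ctx->i;
  uint8_t j = ctx->j;

  while (length--) {
    unsigned si, sj;

    i++;
    si = ctx->S[i];
    j = (uint8_t) (j + si);
    sj = ctx->S[j];

    /* Swap S[i] and S[j]. */
    ctx->S[i] = (uint8_t) sj;
    ctx->S[j] = (uint8_t) si;

    *dst++ = *src++ ^ ctx->S[(si + sj) & 0xff];
  }
  ctx->i = i;
  ctx->j = j;
}
```

Counting the loop's live values against the handful of general-purpose registers available on 32-bit x86 (with %ebp in use as the frame pointer) makes the spills in the listing unsurprising.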
I wonder if my Intel books will ever arrive.
/ Niels Möller (sharpening the red pen)