On my laptop (Intel P4), I get an increase from 45 MB/s to 66 MB/s.
Does it matter whether si and sj are ints or uint8_t? I see no speed difference either way.
The inner loop gets compiled into (Intel, gcc-3.3, -O2):

    .L28:
        incb    -13(%ebp)
        decl    %ebx
        movzbl  -13(%ebp), %edx
        movzbl  (%edx,%edi), %ecx
        addb    %cl, -14(%ebp)
        movzbl  -14(%ebp), %eax
        movzbl  (%eax,%edi), %eax
        movb    %al, (%edx,%edi)
        addb    %cl, %al
        movl    16(%ebp), %edx
        movzbl  %al, %eax
        movzbl  (%eax,%edi), %eax
        xorb    (%esi), %al
        incl    %esi
        movb    %al, (%edx)
        incl    %edx
        cmpl    $-1, %ebx
        movl    %edx, 16(%ebp)
        jne     .L28
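It seems gcc can't fit all the variables into registers, hence the loads and stores via %ebp: i and j live in stack slots at -13(%ebp) and -14(%ebp), and dst is reloaded from and written back to 16(%ebp) on every iteration.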
I wonder if my intel books will ever arrive.
/ Niels Möller (sharpening the red pen)
Previous text:
2004-02-05 10:02: Subject: Nettle
I built it for OS X and installed it so Pikefarm can find it. Anyway, while looking at the code I optimized the RC4 function for better performance :-)
    void
    arcfour_crypt(struct arcfour_ctx *ctx, unsigned length,
                  uint8_t *dst, const uint8_t *src)
    {
      register uint8_t i, j;
      register int si, sj;

      /* Work on local copies of the indices */
      i = ctx->i;
      j = ctx->j;
      while(length--)
        {
          i++; i &= 0xff;
          si = ctx->S[i];
          j += si; j &= 0xff;
          /* Swap S[i] and S[j], keeping both values in locals */
          sj = ctx->S[i] = ctx->S[j];
          ctx->S[j] = si;
          *dst++ = *src++ ^ ctx->S[ (si + sj) & 0xff ];
        }
      ctx->i = i;
      ctx->j = j;
    }
This improved performance from ~25 MB/s to ~39 MB/s on my G4/500 (gcc 3.3), and from ~14 MB/s to ~17 MB/s on an x86/600 (gcc 2.95). Feel free to verify and incorporate it. (arcfour_stream() should be modified in the same way, of course.)
/ Jonas Walldén
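
For anyone who wants to check that a loop of this shape still produces correct RC4 output, here is a minimal self-contained sketch. It is not Nettle's code: struct rc4_ctx, rc4_set_key(), rc4_crypt() and rc4_selftest() are hypothetical names made up for this example, and the context layout is only assumed to resemble arcfour_ctx. The check uses the widely published vector key "Key", plaintext "Plaintext", ciphertext BBF316E8D940AF0AD3.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for Nettle's struct arcfour_ctx
   (assumed layout, not copied from the library). */
struct rc4_ctx
{
  uint8_t S[256];
  uint8_t i, j;
};

/* Standard RC4 key schedule. */
static void
rc4_set_key(struct rc4_ctx *ctx, size_t length, const uint8_t *key)
{
  unsigned k;
  uint8_t j = 0;

  for (k = 0; k < 256; k++)
    ctx->S[k] = (uint8_t) k;

  for (k = 0; k < 256; k++)
    {
      uint8_t t = ctx->S[k];
      j = (uint8_t) (j + t + key[k % length]);
      ctx->S[k] = ctx->S[j];
      ctx->S[j] = t;
    }
  ctx->i = ctx->j = 0;
}

/* Inner loop in the style of the posted patch: S[i] and S[j] are
   read once into locals, swapped, and reused for the output index. */
static void
rc4_crypt(struct rc4_ctx *ctx, size_t length,
          uint8_t *dst, const uint8_t *src)
{
  uint8_t i = ctx->i;
  uint8_t j = ctx->j;

  while (length--)
    {
      unsigned si, sj;

      i++;                      /* wraps mod 256, since i is uint8_t */
      si = ctx->S[i];
      j = (uint8_t) (j + si);
      sj = ctx->S[j];
      ctx->S[i] = (uint8_t) sj;
      ctx->S[j] = (uint8_t) si;
      *dst++ = *src++ ^ ctx->S[(si + sj) & 0xff];
    }
  ctx->i = i;
  ctx->j = j;
}

/* Returns 1 if the known-answer test passes, 0 otherwise. */
static int
rc4_selftest(void)
{
  static const uint8_t key[3] = { 'K', 'e', 'y' };
  static const uint8_t plain[9] =
    { 'P', 'l', 'a', 'i', 'n', 't', 'e', 'x', 't' };
  static const uint8_t expected[9] =
    { 0xBB, 0xF3, 0x16, 0xE8, 0xD9, 0x40, 0xAF, 0x0A, 0xD3 };
  struct rc4_ctx ctx;
  uint8_t out[9];

  rc4_set_key(&ctx, sizeof(key), key);
  rc4_crypt(&ctx, sizeof(plain), out, plain);
  return memcmp(out, expected, sizeof(expected)) == 0;
}
```

Calling rc4_selftest() from a small main() before and after changing the types of si and sj gives a quick correctness check to go with the throughput numbers.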