By the way, what CPU did you test on?
A 64-bit dual-core Core 2 CPU with 3 MB of cache (per CPU).
So it's a fairly modern CPU, and gcc is compiling for 64-bit targets.
Simply reading the whole string one long at a time takes 1.2 seconds (about 6 times faster).
So there are optimization possibilities. But Pike's default search is not really one of them.
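For reference, "reading the whole string one long at a time" could look roughly like the sketch below. This is not the code that was timed; the function name read_by_long and the XOR accumulator are assumptions made for illustration, the accumulator being there only so the compiler cannot drop the loads.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: touch every byte of buf by loading one
   unsigned long at a time. The XOR fold only exists so that the
   compiler cannot optimize the loads away. */
static unsigned long read_by_long(const char *buf, size_t len)
{
    unsigned long acc = 0, word;
    size_t i = 0;

    /* Whole words. memcpy sidesteps alignment and aliasing problems
       and is normally compiled down to a single load. */
    for (; i + sizeof word <= len; i += sizeof word) {
        memcpy(&word, buf + i, sizeof word);
        acc ^= word;
    }

    /* Leftover tail bytes, one at a time. */
    for (; i < len; i++)
        acc ^= (unsigned char)buf[i];

    return acc;
}
```

A pass like this only gives a rough lower bound on the time needed to look at all of the data; it does no comparison work at all.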
memmem is just as fast as my simple loop, but shorter. :-)
for( i=0,j=0; i<hlen; i++ )
{
  if( __builtin_expect(haystack[i] == needle[j], 0) )
  {
    j++;
    if( j == nlen ) break;
  }
  else
    j = 0;
}
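One thing to keep in mind when comparing the two: the loop above resets j on a mismatch without backing up i, so it can miss a needle whose prefix repeats (for example "aab" in "aaab"). Below is a minimal sketch of a variant that does back up, next to the equivalent memmem call. The names naive_search and memmem_search are made up for illustration, and memmem is assumed to be the GNU/BSD extension from <string.h>, not part of ISO C.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* memmem is a GNU extension, not ISO C */
#endif
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: naive search that backs up on a mismatch, so
   needles with a repeated prefix ("aab" in "aaab") are still found. */
static const char *naive_search(const char *haystack, size_t hlen,
                                const char *needle, size_t nlen)
{
    size_t i, j;

    if (nlen == 0)
        return haystack;

    for (i = 0, j = 0; i < hlen; i++) {
        if (haystack[i] == needle[j]) {
            if (++j == nlen)
                return haystack + i + 1 - nlen;   /* start of the match */
        } else {
            i -= j;   /* the loop's i++ then moves one past the failed start */
            j = 0;
        }
    }
    return NULL;
}

/* The same search through the library routine. */
static const char *memmem_search(const char *haystack, size_t hlen,
                                 const char *needle, size_t nlen)
{
    return memmem(haystack, hlen, needle, nlen);
}
```

Backing up makes the worst case quadratic, which the original loop avoids at the price of correctness on such inputs, so timings for the two are not directly comparable.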