I pretty often want to take a range out of a string that starts at one keyword and ends with another, typically when reading a config file or mining data out of an HTML page.
Like this:
"foo bar gazonk"["o".."gaz"];
Result: "oo bar gaz"
"foo bar gazonk"["o".."o"];
Result: "o"
"foo bar gazonk"["gurka".."o"];
Result: ""
"foo bar gazonk"["o".."gurka"];
Result: "oo bar gazonk"
"gazonk foo bar gazonk"["o".."gaz"];
Result: "onk foo bar gaz"
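The semantics implied by the examples above can be sketched as an ordinary function (in Python here, since the proposed operator doesn't exist anywhere); the name keyword_range is made up for illustration, and the edge-case behaviour is read off the examples, not taken from any implementation:

```python
def keyword_range(s, left, right):
    """Substring from the first occurrence of `left` through the end of
    the first occurrence of `right` at or after it (sketch of the
    proposed "range by keyword" operator)."""
    i = s.find(left)
    if i < 0:
        return ""            # left keyword missing: empty result
    j = s.find(right, i)     # search from the start of the left match
    if j < 0:
        return s[i:]         # right keyword missing: run to end of string
    return s[i:j + len(right)]

print(keyword_range("foo bar gazonk", "o", "gaz"))   # "oo bar gaz"
print(keyword_range("foo bar gazonk", "o", "o"))     # "o"
```

Note that the right-hand search starts at the start of the left match, which is why ["o".."o"] yields just "o" rather than running to a later "o".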
(Excuse me if I've missed that this suggestion has already been made.)
Sounds like a good idea to me. :)
I don't know when *I* would use it though.
I'd much rather see a search_reverse() that operates from the end of the string (or given data type). It would be a bit more code but the result would be more consistent and predictable.
I second that. I suspect the reason there is no such thing is that the low-level search implementation is pretty complex with all its optimizations, so doing the same thing for backward searching would result either in rather extensive code duplication or in some quite ugly macro trickery (on top of all the other macro madness already in there). At least that's what stopped me when I considered fixing it.
So it's a psychological problem.
I think it's best to just duplicate the code and reverse some indices. It's not like the search code will change all the time, and changing it in two places isn't a _real_ problem.
I seldom use reverse search though, and search(reverse()) works fine for me unless the string/array is huge, but even if it is, the reverse takes little time compared to the search.
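The search(reverse()) workaround mentioned above needs a little index arithmetic to map the result back into forward coordinates. A sketch (in Python; the function name is made up, and str.find/[::-1] stand in for Pike's search() and reverse()):

```python
def rsearch_via_reverse(haystack, needle):
    """Find the LAST occurrence of needle by searching the reversed
    string for the reversed needle, then mapping the index back."""
    i = haystack[::-1].find(needle[::-1])
    if i < 0:
        return -1
    # the reversed match at i ends len(needle) chars later, counted
    # from the right end of the original string
    return len(haystack) - len(needle) - i

print(rsearch_via_reverse("gazonk foo bar gazonk", "gaz"))  # 15
```

As noted, this allocates and copies the whole string (twice, counting the needle), which is what the cost discussion below is about.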
An odd alternative would be to let search's last argument be negative, so that search("foo", -1) starts at the last position and goes backwards. But... On the other hand, -1 could mean "start at the last position and go forward" too, and that's maybe better.
It's a code quality problem. For people like you who haven't taken a peek at the code (which is obvious since you say it's only a change in two places) and don't have to work with it, it could perhaps appear as a phantom problem, but for maintainers it's a real one.
Anyway, I didn't bring that up as a reason for not doing it, only as an explanation why it hasn't been done already. rsearch() will probably happen eventually. Most likely it just won't do all the optimizations search() is doing.
/.../ the reverse takes little time compared to the search.
Right.. If that were true, it'd be a very bad outcome indeed for all of Hubbe's efforts optimizing that beast.
On Sun, Nov 09, 2008 at 02:35:03PM +0000, Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
/.../ the reverse takes little time compared to the search.
Right.. If that were true, it'd be a very bad outcome indeed for all of Hubbe's efforts optimizing that beast.
if what were true? or is there a 'not' missing? (if it were not true that reverse() takes little time)
would a reverse_search() be much faster than search(reverse()) even if it is less optimized than search()?
greetings, martin.
I think it's obvious that reverse() is a lot slower than search(). reverse() has to allocate memory, then copy every char in the string, and then hash it into the string table.
A standard textbook Boyer-Moore implementation would on average have to go through half the haystack string, so it should (on average) be four times faster than reverse(): Reading n/2 chars vs reading and writing n chars. If search() is any slower than that (barring perhaps some extra startup overhead) then we should toss the implementation asap.
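For reference, the textbook idea can be sketched in a few lines; this is the simplified Horspool variant of Boyer-Moore, written in Python purely as an illustration of why the average case skips most of the haystack — it is not Pike's implementation:

```python
def horspool(haystack, needle):
    """Boyer-Moore-Horspool: on a mismatch, shift by how far the
    haystack char under the needle's last position is from the end
    of the needle, skipping whole windows at a time."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    # bad-character shift table built from all but the last needle char
    shift = {c: m - k - 1 for k, c in enumerate(needle[:-1])}
    i = 0
    while i <= n - m:
        if haystack[i:i + m] == needle:
            return i
        # chars absent from the needle allow a full shift of m
        i += shift.get(haystack[i + m - 1], m)
    return -1

print(horspool("foo bar gazonk", "gaz"))  # 8
```

With a needle whose last character is rare in the haystack, most iterations shift by the full needle length, which is where the "read about n/2 chars on average" figure comes from.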
On Sun, Nov 09, 2008 at 03:25:02PM +0000, Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
I think it's obvious that reverse() is a lot slower than search().
it should have been, but i read it the wrong way. i attributed 'beast' to reverse and not search.
greetings, martin.
I was going to say that reverse() probably did 32-bit or 64-bit memory operations but that wasn't true either. Anyway, plenty of vectorization possibilities in reverse(), search() and string hashing in general.
By the way, what CPU did you test on? On older hardware a program could benefit immensely by having cache hints inserted in loops like this, but I believe they identify streaming access and preload automatically nowadays.
By the way, what CPU did you test on?
A 64-bit dual-core Core 2 CPU with 3 MB cache (per CPU).
So it's a fairly modern cpu, and the gcc is compiling for 64-bit targets.
Simply reading the whole string one long at a time takes 1.2 seconds (about 6 times faster).
So there are optimization possibilities. But Pike's default search is not really one of them.
memmem is just as fast as my simple loop, but shorter. :-)
    /* Simple benchmark scan. Note: the original one-liner reset j to 0 on
       any mismatch, which misses overlapping matches (e.g. "aab" in "aaab");
       fixed here by rewinding i before restarting. */
    for (i = 0, j = 0; i < hlen; i++) {
        if (__builtin_expect(haystack[i] == needle[j], 0)) {
            if (++j == nlen)
                break;          /* match ends at index i */
        } else if (j) {
            i -= j;             /* rewind; the loop's i++ retries from the
                                   position after the failed match start */
            j = 0;
        }
    }
Sounds like we should toss the implementation asap then..
Here's a good overview of a lot of string search algorithms, with C code and pedagogic java applets showing the operation: http://www-igm.univ-mlv.fr/~lecroq/string/index.html
So someone just has to pick the best one and implement it, preferably with rsearch in mind from the start. Any takers?
(They sound mighty cool too: "Turbo Reverse Factor algorithm", "Backward Oracle Matching algorithm", or why not the "Not So Naive algorithm"?)
search is actually rather slow. It's slow enough that I think something just has to be wrong with it currently.
Simple loop, searching for a 32-byte (non-existing) string 1000 times in 6 MB of data: 6.96 s (average over 3 runs) (862 MB/s). This includes the time to initialize the strings, but that's rather trivial..
search() 1000 times in Pike using the same strings: over 205 seconds (30 MB/s)
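The throughput figures follow directly from the quoted sizes and times; a quick sanity check of the arithmetic:

```python
data_mb = 6.0       # haystack size in MB
runs = 1000         # number of searches

print(data_mb * runs / 6.96)   # simple loop: about 862 MB/s
print(data_mb * runs / 205)    # Pike search(): about 29 MB/s, quoted as 30
```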
It might be a code/cache-size vs. CPU speed problem, but not only. CPUs are rather fast compared to memory nowadays. Almost any code will run quickly if it fits in the code cache.
Sure, but I think it should be a function instead. For both search strings one would want to specify whether it's the first or the last occurrence that's relevant, or more generally the nth occurrence from the beginning or the end.
A function? Like this
multisearch("foobar", ({"o", "a"}));
Result: ({"fo", "ooba","ar"})
multisearch("foobar", ({"o","o", "a"}));
Result: ({"fo", "oo", "oba","ar"})
multisearch("foobar", ({"o","x", "a"}));
Result: ({"fo", "","ooba","ar"}) or: Result: ({"fo", "oobar","",""})
?
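The behaviour implied by these examples can be sketched as follows (Python; the function and its not-found handling — the first of the two alternatives proposed above, where a missing needle yields "" and the anchor is kept — are illustrative guesses, not an actual API):

```python
def multisearch(s, needles):
    """Piece k runs from the start of match k-1 (or the string start)
    through the end of match k; a trailing piece runs from the start
    of the last match to the end of the string."""
    pieces = []
    prev_start, prev_end = 0, 0
    for n in needles:
        i = s.find(n, prev_end)    # each needle is searched after the
        if i < 0:                  # end of the previous match
            pieces.append("")      # missing needle: empty piece,
            continue               # keep the previous anchor
        pieces.append(s[prev_start:i + len(n)])
        prev_start, prev_end = i, i + len(n)
    pieces.append(s[prev_start:])
    return pieces

print(multisearch("foobar", ["o", "a"]))  # ['fo', 'ooba', 'ar']
```

Consecutive pieces deliberately overlap at the match boundaries, which is what makes ({"o","o","a"}) give "fo", "oo", "oba", "ar".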
Seems pretty useful too, but I generally just want the range-operator.
For example:
if (email["\"".."\""] != "") name = unquote(email["\"".."\""]);
Or:
blocktext=config_string["blockstart".."blockend"];
That multisearch thing looks to me like an entirely different thing.
No, I'm talking about essentially the same method to pick out substrings as you, albeit a bit more generalized. It seems useful, but it shouldn't abuse the range operator syntax. I think one should be very restrictive with extending operators to do things they weren't meant to do. If the analogy isn't completely apt, then don't do it; you can just as well make it an ordinary function instead.
In this case it breaks since the ordinary range operator lets you express "give me the range that starts at the third position", but the analogous case here, namely "give me the substring that starts at the third occurrence of this string", can't be expressed.
Also, what you have described is that the range operator is extended to solve a special case that you found to be convenient right now (the left substring is the first match, and the right substring is - for no good reason - the first match after that one). That's not good enough to make it standard behavior.
pike-devel@lists.lysator.liu.se