I pretty often want to take a range out of a string that starts at one keyword and ends with another, typically when reading a config file or mining data out of an HTML page.
Like this:
"foo bar gazonk"["o".."gaz"];
Result: "oo bar gaz"
"foo bar gazonk"["o".."o"];
Result: "o"
"foo bar gazonk"["gurka".."o"];
Result: ""
"foo bar gazonk"["o".."gurka"];
Result: "oo bar gazonk"
"gazonk foo bar gazonk"["o".."gaz"];
Result: "onk foo bar gaz"
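The semantics implied by the examples above can be sketched as an ordinary function (in Python here, since the proposed operator doesn't exist anywhere); the name keyword_range is made up for illustration, and the edge-case behaviour is read off the examples, not taken from any implementation:

```python
def keyword_range(s, left, right):
    """Substring from the first occurrence of `left` through the end of
    the first occurrence of `right` at or after it (sketch of the
    proposed "range by keyword" operator)."""
    i = s.find(left)
    if i < 0:
        return ""            # left keyword missing: empty result
    j = s.find(right, i)     # search from the start of the left match
    if j < 0:
        return s[i:]         # right keyword missing: run to end of string
    return s[i:j + len(right)]

print(keyword_range("foo bar gazonk", "o", "gaz"))   # "oo bar gaz"
print(keyword_range("foo bar gazonk", "o", "o"))     # "o"
```

Note that the right-hand search starts at the start of the left match, which is why ["o".."o"] yields just "o" rather than running to a later "o".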
(Excuse me if I've missed that this suggestion has already been made.)
Sounds like a good idea to me. :)
I don't know when *I* would use it though.
I'd much rather see a search_reverse() that operates from the end of the string (or given data type). It would be a bit more code but the result would be more consistent and predictable.
I second that. I suspect the reason there is no such thing is that the low-level search implementation is pretty complex with all its optimizations, so doing the same thing for backward searching would result either in rather extensive code duplication or in some quite ugly macro trickery (on top of all the other macro madness already in there). At least that's what stopped me when I considered fixing it.
So it's a psychological problem.
I think it's best to just duplicate the code and reverse some indices. It's not like the search code will change all the time, and changing it in two places isn't a _real_ problem.
I seldom use reverse search though, and search(reverse()) works fine for me unless the string/array is huge, but even if it is, the reverse takes little time compared to the search.
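The search(reverse()) workaround mentioned above needs a little index arithmetic to map the result back into forward coordinates. A sketch (in Python; the function name is made up, and str.find/[::-1] stand in for Pike's search() and reverse()):

```python
def rsearch_via_reverse(haystack, needle):
    """Find the LAST occurrence of needle by searching the reversed
    string for the reversed needle, then mapping the index back."""
    i = haystack[::-1].find(needle[::-1])
    if i < 0:
        return -1
    # the reversed match at i ends len(needle) chars later, counted
    # from the right end of the original string
    return len(haystack) - len(needle) - i

print(rsearch_via_reverse("gazonk foo bar gazonk", "gaz"))  # 15
```

As noted, this allocates and copies the whole string (twice, counting the needle), which is what the cost discussion below is about.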
An odd alternative would be to let search's last argument be negative, so that search("foo", -1) starts at the last position and goes backwards. But... On the other hand, -1 could mean "start at the last position and go forward" too, and that's maybe better.
It's a code quality problem. For people like you who haven't taken a peek at the code (which is obvious since you say it's only a change in two places) and don't have to work with it, it could perhaps appear as a phantom problem, but for maintainers it's a real one.
Anyway, I didn't bring that up as a reason for not doing it, only as an explanation why it hasn't been done already. rsearch() will probably happen eventually. Most likely it just won't do all the optimizations search() is doing.
/.../ the reverse takes little time compared to the search.
Right.. If that were true, it'd be a very bad outcome indeed for all of Hubbe's efforts optimizing that beast.
On Sun, Nov 09, 2008 at 02:35:03PM +0000, Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
/.../ the reverse takes little time compared to the search.
Right.. If that were true, it'd be a very bad outcome indeed for all of Hubbe's efforts optimizing that beast.
if what were true? or is there a 'not' missing? (if it were not true that reverse() takes little time)
would a reverse_search() be much faster than search(reverse()) even if it is less optimized than search()?
greetings, martin.
I think it's obvious that reverse() is a lot slower than search(). reverse() has to allocate memory, then copy every char in the string, and then hash it into the string table.
A standard textbook Boyer-Moore implementation would on average have to go through half the haystack string, so it should (on average) be four times faster than reverse(): Reading n/2 chars vs reading and writing n chars. If search() is any slower than that (barring perhaps some extra startup overhead) then we should toss the implementation asap.
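For reference, the textbook idea can be sketched in a few lines; this is the simplified Horspool variant of Boyer-Moore, written in Python purely as an illustration of why the average case skips most of the haystack — it is not Pike's implementation:

```python
def horspool(haystack, needle):
    """Boyer-Moore-Horspool: on a mismatch, shift by how far the
    haystack char under the needle's last position is from the end
    of the needle, skipping whole windows at a time."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    # bad-character shift table built from all but the last needle char
    shift = {c: m - k - 1 for k, c in enumerate(needle[:-1])}
    i = 0
    while i <= n - m:
        if haystack[i:i + m] == needle:
            return i
        # chars absent from the needle allow a full shift of m
        i += shift.get(haystack[i + m - 1], m)
    return -1

print(horspool("foo bar gazonk", "gaz"))  # 8
```

With a needle whose last character is rare in the haystack, most iterations shift by the full needle length, which is where the "read about n/2 chars on average" figure comes from.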
On Sun, Nov 09, 2008 at 03:25:02PM +0000, Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
I think it's obvious that reverse() is a lot slower than search().
it should have been, but i read it the wrong way. i attributed 'beast' to reverse and not search.
greetings, martin.
I was going to say that reverse() probably did 32-bit or 64-bit memory operations but that wasn't true either. Anyway, plenty of vectorization possibilities in reverse(), search() and string hashing in general.
By the way, what CPU did you test on? On older hardware a program could benefit immensely by having cache hints inserted in loops like this, but I believe they identify streaming access and preload automatically nowadays.
By the way, what CPU did you test on?
A 64-bit dual-core Core 2 CPU with 3 MB cache (per CPU).
So it's a fairly modern cpu, and the gcc is compiling for 64-bit targets.
Simply reading the whole string one long at a time takes 1.2 seconds (about 6 times faster).
So there are optimization possibilities. But Pike's default search is not really one of them.
memmem is just as fast as my simple loop, but shorter. :-)
    /* Simple benchmark scan. Note: the original one-liner reset j to 0 on
       any mismatch, which misses overlapping matches (e.g. "aab" in "aaab");
       fixed here by rewinding i before restarting. */
    for (i = 0, j = 0; i < hlen; i++) {
        if (__builtin_expect(haystack[i] == needle[j], 0)) {
            if (++j == nlen)
                break;          /* match ends at index i */
        } else if (j) {
            i -= j;             /* rewind; the loop's i++ retries from the
                                   position after the failed match start */
            j = 0;
        }
    }
Sounds like we should toss the implementation asap then..
Here's a good overview of a lot of string search algorithms, with C code and pedagogic java applets showing the operation: http://www-igm.univ-mlv.fr/~lecroq/string/index.html
So someone just has to pick the best one and implement it, preferably with rsearch in mind from the start. Any takers?
(They sound mighty cool too: "Turbo Reverse Factor algorithm", "Backward Oracle Matching algorithm", or why not the "Not So Naive algorithm"?)
search is actually rather slow. It's slow enough that I think something just has to be wrong with it currently.
Simple loop, searching for a 32-byte (non-existing) string 1000 times in 6 MB of data: 6.96 s (average over 3 runs) (862 MB/s). This includes the time to initialize the strings, but that's rather trivial..
search() 1000 times in Pike using the same strings: over 205 seconds (30 MB/s)
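The throughput figures follow directly from the quoted sizes and times; a quick sanity check of the arithmetic:

```python
data_mb = 6.0       # haystack size in MB
runs = 1000         # number of searches

print(data_mb * runs / 6.96)   # simple loop: about 862 MB/s
print(data_mb * runs / 205)    # Pike search(): about 29 MB/s, quoted as 30
```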
It might be a code/cache-size vs. CPU speed problem, but not only. CPUs are rather fast compared to memory nowadays. Almost any code will run quickly if it fits in the code cache.
Sure, but I think it should be a function instead. For both search strings one would want to specify whether it's the first or the last occurrence that's relevant, or more generally the nth occurrence from the beginning or the end.
A function? Like this
multisearch("foobar", ({"o", "a"}));
Result: ({"fo", "ooba","ar"})
multisearch("foobar", ({"o","o", "a"}));
Result: ({"fo", "oo", "oba","ar"})
multisearch("foobar", ({"o","x", "a"}));
Result: ({"fo", "","ooba","ar"}) or: Result: ({"fo", "oobar","",""})
?
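The behaviour implied by these examples can be sketched as follows (Python; the function and its not-found handling — the first of the two alternatives proposed above, where a missing needle yields "" and the anchor is kept — are illustrative guesses, not an actual API):

```python
def multisearch(s, needles):
    """Piece k runs from the start of match k-1 (or the string start)
    through the end of match k; a trailing piece runs from the start
    of the last match to the end of the string."""
    pieces = []
    prev_start, prev_end = 0, 0
    for n in needles:
        i = s.find(n, prev_end)    # each needle is searched after the
        if i < 0:                  # end of the previous match
            pieces.append("")      # missing needle: empty piece,
            continue               # keep the previous anchor
        pieces.append(s[prev_start:i + len(n)])
        prev_start, prev_end = i, i + len(n)
    pieces.append(s[prev_start:])
    return pieces

print(multisearch("foobar", ["o", "a"]))  # ['fo', 'ooba', 'ar']
```

Consecutive pieces deliberately overlap at the match boundaries, which is what makes ({"o","o","a"}) give "fo", "oo", "oba", "ar".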
Seems pretty useful too, but I generally just want the range-operator.
For example:
if (email["\"".."\""] != "") name = unquote(email["\"".."\""]);
Or:
blocktext=config_string["blockstart".."blockend"];
That multisearch thing looks to me like an entirely different thing.
No, I'm talking about essentially the same method to pick out substrings as you, albeit a bit more generalized. It seems useful, but it shouldn't abuse the range operator syntax. I think one should be very restrictive with extending operators to do things they weren't meant to do. If the analogy isn't completely apt, then don't do it; you can just as well make it an ordinary function instead.
In this case it breaks since the ordinary range operator lets you express "give me the range that starts at the third position", but the analogous case here, namely "give me the substring that starts at the third occurrence of this string", can't be expressed.
Also, what you have described is that the range operator is extended to solve a special case that you found to be convenient right now (the left substring is the first match, and the right substring is - for no good reason - the first match after that one). That's not good enough to make it standard behavior.
pike-devel@lists.lysator.liu.se