The best would probably be a method like 'array(string,string) string_to_utf8_with_index( string input );'
that returns the utf8 string and a string (or an array of integers, but that would use even more memory) with the byte->character mapping.
[string utf8,string index] = string_to_utf8_with_index( data );
array(int) offsets = rows(index,regexp_function_utf8( utf8 ));
/ Per Hedbor ()
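The proposal above can be illustrated with a minimal sketch. It is written in Python rather than Pike, purely to show the idea: the function name is borrowed from the message, but the implementation, the helper search_char_offsets, and the extra end-of-string entry in the index are assumptions, not the code that was discussed. The index is a plain list of integers here, i.e. the more memory-hungry of the two representations mentioned.

```python
import re


def string_to_utf8_with_index(text):
    """Return (utf8, index), where index[b] is the character position
    that byte b of the UTF-8 encoding belongs to."""
    utf8 = bytearray()
    index = []
    for char_pos, ch in enumerate(text):
        encoded = ch.encode("utf-8")
        utf8 += encoded
        # Every byte of this character maps back to the same character.
        index.extend([char_pos] * len(encoded))
    # Extra entry so that an end-of-match byte offset can be mapped too.
    index.append(len(text))
    return bytes(utf8), index


def search_char_offsets(pattern, text):
    """Match on the UTF-8 bytes, then translate the byte offsets of the
    first match back to character offsets (hypothetical helper)."""
    utf8, index = string_to_utf8_with_index(text)
    match = re.search(pattern.encode("utf-8"), utf8)
    if match is None:
        return None
    return index[match.start()], index[match.end()]
```

For the input "smörgås" the UTF-8 string is nine bytes long, and a match on "gås" starts at byte 5, which the index translates back to character 4.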
Previous text:
2003-09-24 13:10: Subject: utf8_char_index
The main reason to make these functions go fast is actually to prevent people from using UTF8 internally and the PCRE regexp in UTF8 mode themselves... :)
/ Mirar
Hmm, it still doesn't work very well if the input data is huge... but I guess it won't be that much worse than the utf8 string.
The most important thing right now is a quick function for start_index, from character index to byte index, though.
/ Mirar
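A character-to-byte conversion of the kind asked for here could look roughly like the following. This is a Python sketch under stated assumptions, not the function that was eventually written: the names utf8_width and char_to_byte_index are hypothetical, and the input is assumed to contain no lone surrogates. It sums the encoded width of each preceding character instead of building the UTF-8 string first.

```python
def utf8_width(code_point):
    """Number of bytes UTF-8 uses for one code point."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4


def char_to_byte_index(text, char_index):
    """Byte offset, in the UTF-8 encoding of text, at which the
    character with index char_index starts."""
    if not 0 <= char_index <= len(text):
        raise IndexError("character index out of range")
    return sum(utf8_width(ord(ch)) for ch in text[:char_index])
```

The cost is linear in char_index, so this only helps when a single start offset is needed; for many lookups in the same string a precomputed table is likely the better trade.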
Previous text:
2003-09-24 13:14: Subject: utf8_char_index
The best would probably be a method like 'array(string,string) string_to_utf8_with_index( string input );'
that returns the utf8 string and a string (or an array of integers, but that would use even more memory) with the byte->character mapping.
[string utf8,string index] = string_to_utf8_with_index( data );
array(int) offsets = rows(index,regexp_function_utf8( utf8 ));
/ Per Hedbor ()
Another thing: 1) is anyone using Regexp.replace, and 2) how is it supposed to work? Is it supposed to work?
/ Mirar
Previous text:
2003-09-24 13:17: Subject: utf8_char_index
Hmm, it still doesn't work very well if the input data is huge... but I guess it won't be that much worse than the utf8 string.
The most important thing right now is a quick function for start_index, from character index to byte index, though.
/ Mirar
Is this good enough?
| > Regexp.PCRE("b[^-]*m")->replace("abam-boom-fooabadoom","gurka");
| Result: "agurka-gurka-fooagurka"
| > Regexp.PCRE("b[^-]*m")->replace("abam-boom-fooabadoom",
|     lambda(string s) { werror("%O\n",s); return "gurka"; });
| "bam"
| "boom"
| "badoom"
| Result: "agurka-gurka-fooagurka"
/ Mirar
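The behaviour shown in the session above (every non-overlapping match is replaced, and the replacement is either a fixed string or a function called with the matched text) can be mimicked with Python's re module. This is only an analogue for comparison: replace_all is a hypothetical helper, its argument order is an assumption, and Python's regexp engine is not PCRE.

```python
import re


def replace_all(pattern, subject, replacement):
    """Replace every non-overlapping match of pattern in subject.

    replacement is either a string, inserted literally, or a function
    that receives the matched text and returns its replacement."""
    if callable(replacement):
        return re.sub(pattern, lambda m: replacement(m.group(0)), subject)
    # Wrapping the string in a function stops re.sub from interpreting
    # backslash escapes or group references inside it.
    return re.sub(pattern, lambda m: replacement, subject)
```

Called with the pattern "b[^-]*m" and the subject "abam-boom-fooabadoom", the callback form sees "bam", "boom" and "badoom" in that order, matching the session above.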
Previous text:
2003-09-24 13:17: Subject: utf8_char_index
Another thing: 1) is anyone using Regexp.replace, and 2) how is it supposed to work? Is it supposed to work?
/ Mirar
I'd say it looks good anyway - probably better than it already was, if it even did work at all. :-)
/ Johan Sundström, Lysator
Previous text:
2003-09-24 13:35: Subject: utf8_char_index
Is this good enough?
| > Regexp.PCRE("b[^-]*m")->replace("abam-boom-fooabadoom","gurka");
| Result: "agurka-gurka-fooagurka"
| > Regexp.PCRE("b[^-]*m")->replace("abam-boom-fooabadoom",
|     lambda(string s) { werror("%O\n",s); return "gurka"; });
| "bam"
| "boom"
| "badoom"
| Result: "agurka-gurka-fooagurka"
/ Mirar
I'll commit it like that, when the compat issue is solved...
Right now I'm considering checking in a master with `() in dirnode and joinnode, which will solve the compatibility issues, but might break typechecking. Any vetoes? It only affects dir- and joinnodes.
/ Mirar
Previous text:
2003-09-24 13:59: Subject: utf8_char_index
I'd say it looks good anyway - probably better than it already was, if it even did work at all. :-)
/ Johan Sundström, Lysator
I still think it's better to leave Regexp alone and add PCRE somewhere else instead. Trying to get Regexp.PCRE to work smells too much like a can of worms to me.
/ Martin Stjernholm, Roxen IS
Previous text:
2003-09-24 14:02: Subject: utf8_char_index
I'll commit it like that, when the compat issue is solved...
Right now I'm considering checking in a master with `() in dirnode and joinnode, which will solve the compatibility issues, but might break typechecking. Any vetoes? It only affects dir- and joinnodes.
/ Mirar
I might give up and do that, yes... :/ But it feels like the wrong solution.
/ Mirar
Previous text:
2003-09-24 14:08: Subject: utf8_char_index
I still think it's better to leave Regexp alone and add PCRE somewhere else instead. Trying to get Regexp.PCRE to work smells too much like a can of worms to me.
/ Martin Stjernholm, Roxen IS
Where will the New Regexp Engine live in the module name space, when/if it arrives?
Perhaps we could create a new Rx module, and put PCRE as Rx.PCRE? Or do we want just "Rx" to resolve to a class or function, just like "Regexp" currently does?
/ Niels Möller (igelkottsräddare)
Previous text:
2003-09-24 14:08: Subject: utf8_char_index
I still think it's better to leave Regexp alone and add PCRE somewhere else instead. Trying to get Regexp.PCRE to work smells too much like a can of worms to me.
/ Martin Stjernholm, Roxen IS
I planned on using Rx on the top level. Since it has operator nodes that might be used frequently I want short names like Rx.or, Rx.and etc. I.e. the module name should preferably be short. Since the function names themselves are very common (there is e.g. a "map" function/class), the module can't be imported either.
A parent module could be imported though, so it's possible to name them something like AVeryLongModuleName.Rx.map and then suggest that the user imports AVeryLongModuleName and uses Rx.map.
/ Martin Stjernholm, Roxen IS
Previous text:
2003-09-24 22:00: Subject: utf8_char_index
Where will the New Regexp Engine live in the module name space, when/if it arrives?
Perhaps we could create a new Rx module, and put PCRE as Rx.PCRE? Or do we want just "Rx" to resolve to a class or function, just like "Regexp" currently does?
/ Niels Möller (igelkottsräddare)
What about "Automata" as a top-level name for regexps and other general state machinery?
/ Niels Möller (igelkottsräddare)
Previous text:
2003-09-25 13:19: Subject: utf8_char_index
I planned on using Rx on the top level. Since it has operator nodes that might be used frequently I want short names like Rx.or, Rx.and etc. I.e. the module name should preferably be short. Since the function names themselves are very common (there is e.g. a "map" function/class), the module can't be imported either.
A parent module could be imported though, so it's possible to name them something like AVeryLongModuleName.Rx.map and then suggest that the user imports AVeryLongModuleName and uses Rx.map.
/ Martin Stjernholm, Roxen IS
Actually I'm not overenthusiastic about trying to sort modules into deep hierarchies.
/ Martin Stjernholm, Roxen IS
Previous text:
2003-09-25 13:31: Subject: utf8_char_index
What about "Automata" as a top-level name for regexps and other general state machinery?
/ Niels Möller (igelkottsräddare)
It looks fairly straightforward to me. Always the longest match and no overlapping matches. What's the problem?
/ Martin Stjernholm, Roxen IS
Previous text:
2003-09-24 13:17: Subject: utf8_char_index
Another thing, is 1) anyone using Regexp.replace, and 2) how is it supposed to work? Is it supposed to work?
/ Mirar
pike-devel@lists.lysator.liu.se