utf8_char_index

Mirar ＠ Pike developers forum

24 Sep 2003

11:15 a.m.

The main reason to make these functions go fast is actually to prevent people from using UTF8 internally and the PCRE regexp in UTF8 mode themselves... :)

Per Hedbor () ＠ Pike (-) developers forum

24 Sep

11:15 a.m.

The best would probably be a method like 'array(string,string) string_to_utf8_with_index( string input );'

that returns the utf8 string and a string (or array with integers, but that would use even more memory) with the byte->character mapping.

[string index,string utf8] = string_to_utf8_with_index( data );
array(int) offsets = rows(index,regexp_function_utf8( utf8 ));

/ Per Hedbor ()
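
A minimal sketch of what such a helper could look like in Pike, under stated assumptions. The function is hypothetical: nothing in the thread says it was implemented, and the return order ({ index, utf8 }) simply follows the usage line in the post. The index is returned as a string whose element at byte position b is the character position that byte belongs to, so rows() can translate byte offsets from a UTF-8 regexp match back into character offsets.

```pike
// Hypothetical helper sketching the proposal; not an existing Pike function.
// Returns ({ index, utf8 }), where index[b] is the position in `input`
// of the character that byte b of `utf8` belongs to.
array(string) string_to_utf8_with_index(string input)
{
  String.Buffer utf8 = String.Buffer();
  array(int) index = ({});
  foreach (input; int pos; int ch) {
    // Encode one character at a time so its byte length is known.
    string enc = string_to_utf8(sprintf("%c", ch));
    utf8->add(enc);
    // Every byte of this character maps back to the same position.
    index += allocate(sizeof(enc), pos);
  }
  // Casting an array of integers to string packs them into a string,
  // which becomes a wide string for inputs longer than 255 characters.
  return ({ (string)index, utf8->get() });
}

int main()
{
  // "a", U+00E5 (two bytes in UTF-8), "b"
  [string index, string utf8] = string_to_utf8_with_index("a\345b");
  write("%O\n", (array)index);
  // Translate byte offsets 0 and 3 back to character offsets.
  write("%O\n", rows(index, ({ 0, 3 })));
  return 0;
}
```

Appending to the array for every character keeps the sketch short but is not how a fast version would be written. A real implementation would presumably build both strings in a single pass in C.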

Mirar ＠ Pike developers forum

11:20 a.m.

Hmm, it still doesn't work very well if the input data is huge... but I guess it won't be that much worse than the UTF-8 string.

The most important thing right now, though, is a quick function that converts start_index from a character index to a byte index.

/ Mirar
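
For illustration, a character-to-byte conversion can be sketched in plain Pike without building the UTF-8 string at all. The name char_to_byte_index is made up for this example, and it is not the utf8_char_index code discussed in the thread. It assumes code points in the range 0 to 0x10ffff.

```pike
// Hypothetical sketch; not the utf8_char_index implementation from the thread.
// Returns the byte offset in string_to_utf8(s) at which character number
// char_index starts. Assumes code points in the range 0..0x10ffff.
int char_to_byte_index(string s, int char_index)
{
  int bytes = 0;
  for (int i = 0; i < char_index && i < sizeof(s); i++) {
    int c = s[i];
    if (c < 0x80) bytes += 1;          // ASCII
    else if (c < 0x800) bytes += 2;    // up to U+07FF
    else if (c < 0x10000) bytes += 3;  // rest of the BMP
    else bytes += 4;                   // supplementary planes
  }
  return bytes;
}

int main()
{
  // "a", U+00E5, U+20AC, "b"
  string s = sprintf("a%c%cb", 0xe5, 0x20ac);
  string utf8 = string_to_utf8(s);
  for (int i = 0; i <= sizeof(s); i++) {
    int b = char_to_byte_index(s, i);
    // Decoding the UTF-8 tail from byte b should give the tail of s from i.
    write("char %d -> byte %d, consistent: %d\n",
          i, b, utf8_to_string(utf8[b..]) == s[i..]);
  }
  return 0;
}
```

The loop is linear in char_index, so it only addresses the memory concern above, not the cost of repeated lookups deep into a huge string.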

Mirar ＠ Pike developers forum

11:20 a.m.

Another thing: 1) is anyone using Regexp.replace, and 2) how is it supposed to work? Is it supposed to work?

/ Mirar

Mirar ＠ Pike developers forum

11:40 a.m.

Is this good enough?

| > Regexp.PCRE("b[^-]*m")->replace("abam-boom-fooabadoom","gurka");
| Result: "agurka-gurka-fooagurka"
| > Regexp.PCRE("b[^-]*m")->replace("abam-boom-fooabadoom",
|     lambda(string s) { werror("%O\n",s); return "gurka"; });
| "bam"
| "boom"
| "badoom"
| Result: "agurka-gurka-fooagurka"

/ Mirar

Johan Sundström, Lysator ＠ Pike developers forum

noon

I'd say it looks good anyway - probably better than it already was, if it even did work at all. :-)

/ Johan Sundström, Lysator

Mirar ＠ Pike developers forum

12:05 p.m.

I'll commit it like that, when the compat issue is solved...

Right now I'm considering checking in a master with `() in dirnode and joinnode, which will solve the compatibility issues but might break type checking. Any vetoes? It only affects dirnodes and joinnodes.

/ Mirar

Martin Stjernholm, Roxen IS ＠ Pike developers forum

12:10 p.m.

I still think it's better to leave Regexp alone and add PCRE somewhere else instead. Trying to get Regexp.PCRE to work smells too much like a can of worms to me.

/ Martin Stjernholm, Roxen IS

Mirar ＠ Pike developers forum

12:15 p.m.

I might give up and do that, yes... :/ But it feels like the wrong solution.

/ Mirar

Niels Möller (igelkottsräddare) ＠ Pike (-) developers forum

8:05 p.m.

Where will the New Regexp Engine live in the module name space, when/if it arrives?

Perhaps we could create a new Rx module, and put PCRE as Rx.PCRE? Or do we want just "Rx" to resolve to a class or function, just like "Regexp" currently does?

/ Niels Möller (igelkottsräddare)

Previous text:

...

2003-09-24 14:08: Subject: utf8_char_index

I still think it's better to leave Regexp alone and add PCRE somewhere else instead. Trying to get Regexp.PCRE to work smells too much like a can of worms to me.

/ Martin Stjernholm, Roxen IS

Martin Stjernholm, Roxen IS ＠ Pike developers forum

25 Sep

11:20 a.m.

I planned on using Rx on the top level. Since it has operator nodes that might be used frequently, I want short names like Rx.or, Rx.and, etc. That is, the module name should preferably be short. Since the function names themselves are very common (there is e.g. a "map" function/class), the module can't be imported either.

A parent module could be imported though, so it's possible to name them something like AVeryLongModuleName.Rx.map and then suggest that the user imports AVeryLongModuleName and uses Rx.map.

/ Martin Stjernholm, Roxen IS

Niels Möller (igelkottsräddare) ＠ Pike (-) developers forum

11:35 a.m.

What about "Automata" as a top-level name for regexps and other general state machinery?

/ Niels Möller (igelkottsräddare)

Martin Stjernholm, Roxen IS ＠ Pike developers forum

5:15 p.m.

Actually, I'm not overenthusiastic about trying to sort modules into deep hierarchies.

/ Martin Stjernholm, Roxen IS

Peter Bortas ＠ Pike developers forum

7:40 p.m.

/ Peter Bortas

Previous text:

...

2003-09-25 19:10: Subject: utf8_char_index

Actually, I'm not overenthusiastic about trying to sort modules into deep hierarchies.

/ Martin Stjernholm, Roxen IS

Martin Stjernholm, Roxen IS ＠ Pike developers forum

24 Sep

11:45 a.m.

It looks fairly straightforward to me. Always the longest match and no overlapping matches. What's the problem?

/ Martin Stjernholm, Roxen IS

Previous text:

...

2003-09-24 13:17: Subject: utf8_char_index

Another thing: 1) is anyone using Regexp.replace, and 2) how is it supposed to work? Is it supposed to work?

/ Mirar

pike-devel@lists.lysator.liu.se
