utf8_char_index

Mirar ＠ Pike developers forum

24 Sep 2003

11:25 a.m.

I need a hack for UTF-8 conversion: UTF-8 string byte offset <-> character number.

Does anyone know enough UTF-8 and feel up to it?

Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) ＠ Pike (-) developers forum

24 Sep 2003

11:25 a.m.

int ch=0;
while(p<offset) {
  switch(str[p++]) {
    case 0xc0..0xdf: p+=1; break;
    case 0xe0..0xef: p+=2; break;
    case 0xf0..0xf7: p+=3; break;
    case 0xf8..0xfb: p+=4; break;
    case 0xfc..0xfd: p+=5; break;
    case 0xfe..0xff: error("Invalid UTF-8!\n");
  }
  ch++;  // one character per lead byte; ch is the character number
}

/ Marcus Comstedt (ACROSS) (Hail Ilpalazzo!)
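
The loop above can be wrapped into a complete function. What follows is only a sketch under stated assumptions: the name utf8_char_index comes from the subject line, while the signature, the bounds check, the rejection of stray continuation bytes and the check that the offset falls on a character boundary are additions that are not in the original post.

```pike
// Hypothetical helper: number of characters that start before byte
// position `offset` in the UTF-8 encoded string `str`.
int utf8_char_index(string str, int offset)
{
  if (offset < 0 || offset > sizeof(str))
    error("Offset out of range.\n");

  int ch = 0, p = 0;
  while (p < offset) {
    switch (str[p++]) {
      // Added check: a continuation byte cannot start a character.
      case 0x80..0xbf: error("Unexpected continuation byte.\n");
      case 0xc0..0xdf: p += 1; break;
      case 0xe0..0xef: p += 2; break;
      case 0xf0..0xf7: p += 3; break;
      case 0xf8..0xfb: p += 4; break;
      case 0xfc..0xfd: p += 5; break;
      case 0xfe..0xff: error("Invalid UTF-8!\n");
    }
    ch++;
  }
  // Added check: the offset must not point into the middle of a sequence.
  if (p != offset)
    error("Offset is inside a multi-byte sequence.\n");
  return ch;
}

int main()
{
  // "a", U+00E5 (2 bytes), U+20AC (3 bytes), "b": 7 bytes in total.
  string utf8 = string_to_utf8(sprintf("a%c%cb", 0xe5, 0x20ac));
  foreach (({ 0, 1, 3, 6, 7 }), int offset)
    write("byte %d -> char %d\n", offset, utf8_char_index(utf8, offset));
  return 0;
}
```

For the sample string, byte offsets 0, 1, 3, 6 and 7 map to character numbers 0, 1, 2, 3 and 4.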

Per Hedbor () ＠ Pike (-) developers forum

11:30 a.m.

It's probably not possible to do it all that much faster than strlen(utf8_to_string(X[..offset-1])).

The only thing you can skip is the string generation. It's always O(n) to go from byte index to character index in UTF-8.

/ Per Hedbor ()
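
The two variants mentioned above can be sketched side by side. Both function names are hypothetical, and both assume valid UTF-8 and an offset that falls on a character boundary. The second one skips the string generation by counting every byte that is not a continuation byte (10xxxxxx); it is still O(n).

```pike
// Decode the prefix and count its characters. Pike ranges are inclusive,
// so the prefix before `offset` is str[..offset-1].
int char_index_by_decoding(string str, int offset)
{
  if (offset <= 0) return 0;
  return sizeof(utf8_to_string(str[..offset-1]));
}

// Same result without building the decoded string: each byte that is
// not a continuation byte starts a new character.
int char_index_by_counting(string str, int offset)
{
  int ch = 0;
  for (int p = 0; p < offset; p++)
    if ((str[p] & 0xc0) != 0x80) ch++;
  return ch;
}

int main()
{
  string utf8 = string_to_utf8(sprintf("a%c%cb", 0xe5, 0x20ac));
  foreach (({ 0, 1, 3, 6, 7 }), int offset)
    write("byte %d -> %d / %d\n", offset,
          char_index_by_decoding(utf8, offset),
          char_index_by_counting(utf8, offset));
  return 0;
}
```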

Mirar ＠ Pike developers forum

11:30 a.m.

Actually, I wish to do it on a whole list of numbers (at least 2), unfortunately unsorted. That could be optimized.
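
One way the list case could be optimized, as a sketch only: sort the offsets while remembering their original positions, then walk the string once so that no byte is scanned twice. The function name is hypothetical, and the code assumes valid UTF-8 and offsets that fall on character boundaries.

```pike
// Hypothetical batch helper: convert several byte offsets in one pass.
// The offsets may be unsorted; results come back in the input order.
array(int) utf8_char_indices(string str, array(int) offsets)
{
  array(int) sorted = copy_value(offsets);
  array(int) order = indices(offsets);
  // Sorts `sorted` in place and rearranges `order` the same way, so
  // order[i] is the original position of sorted[i].
  sort(sorted, order);

  array(int) result = allocate(sizeof(offsets));
  int ch = 0, p = 0;
  for (int i = 0; i < sizeof(sorted); i++) {
    // Continue from the previous offset instead of restarting at 0.
    for (; p < sorted[i]; p++)
      if ((str[p] & 0xc0) != 0x80) ch++;
    result[order[i]] = ch;
  }
  return result;
}

int main()
{
  string utf8 = string_to_utf8(sprintf("a%c%cb", 0xe5, 0x20ac));
  write("%O\n", utf8_char_indices(utf8, ({ 6, 0, 3 })));
  return 0;
}
```

The cost is one sort of the offset list plus a single scan up to the largest offset, instead of one scan per offset.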

It's the exec() function in pcre that gives useful indexes, and I was thinking of having the wide-string wrapper object return the correct result (= character offsets) from that function, not byte offsets.

I also need the reverse function for continued search (start_index).

/ Mirar
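
The reverse direction (character number to byte offset, as needed for start_index) could look roughly like this. The function name is hypothetical, and the sketch assumes valid UTF-8.

```pike
// Hypothetical helper: byte offset at which character number
// `char_index` starts in the UTF-8 encoded string `str`.
int utf8_byte_offset(string str, int char_index)
{
  int p = 0;
  while (char_index-- > 0) {
    if (p >= sizeof(str))
      error("Character index out of range.\n");
    // Skip the lead byte, then the continuation bytes that follow it.
    p++;
    while (p < sizeof(str) && (str[p] & 0xc0) == 0x80)
      p++;
  }
  return p;
}

int main()
{
  string utf8 = string_to_utf8(sprintf("a%c%cb", 0xe5, 0x20ac));
  foreach (({ 0, 1, 2, 3, 4 }), int ch)
    write("char %d -> byte %d\n", ch, utf8_byte_offset(utf8, ch));
  return 0;
}
```

For the sample string, character numbers 0 through 4 map to byte offsets 0, 1, 3, 6 and 7.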

Per Hedbor () ＠ Pike (-) developers forum

11:35 a.m.

It would also be possible to generate a byte-offset -> character-index table when generating the UTF-8 from the wide string. Lookups would then be O(1), but the table would use 4 extra bytes of memory for each byte of UTF-8.

/ Per Hedbor ()
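
A minimal sketch of such a table. For simplicity it scans the already encoded string rather than filling the table during encoding, the function name is hypothetical, and it assumes valid UTF-8. Note that a Pike array most likely costs more than 4 bytes per entry, so the memory figure above would only be reached with a table kept at the C level.

```pike
// Hypothetical: table[b] is the number of the character that byte b
// belongs to; the final entry covers the end-of-string offset.
array(int) utf8_offset_table(string utf8)
{
  array(int) table = allocate(sizeof(utf8) + 1);
  int ch = -1;
  for (int p = 0; p < sizeof(utf8); p++) {
    // A byte that is not a continuation byte starts a new character.
    if ((utf8[p] & 0xc0) != 0x80) ch++;
    table[p] = ch;
  }
  table[sizeof(utf8)] = ch + 1;
  return table;
}

int main()
{
  string utf8 = string_to_utf8(sprintf("a%c%cb", 0xe5, 0x20ac));
  array(int) table = utf8_offset_table(utf8);
  write("%O\n", table);
  // After the one-time O(n) build, each lookup is a plain array index.
  write("byte 3 -> char %d\n", table[3]);
  return 0;
}
```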

Mirar ＠ Pike developers forum

11:35 a.m.

I have the feeling someone *will* run this function on a 500 MB string someday, so it might be a bad idea. :)

/ Mirar

pike-devel@lists.lysator.liu.se