I need a hack for UTF-8 conversion: UTF-8 string byte offset <-> character number.
Does anyone know enough UTF-8 and feel up to it?
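int ch=0, p=0;
while(p<offset) {
  // The lead byte tells how many continuation bytes to skip.
  switch(str[p++]) {
    case 0xc0..0xdf: p+=1; break;
    case 0xe0..0xef: p+=2; break;
    case 0xf0..0xf7: p+=3; break;
    case 0xf8..0xfb: p+=4; break;
    case 0xfc..0xfd: p+=5; break;
    case 0xfe..0xff: error("Invalid UTF-8!\n");
  }
  ch++; // one character per lead byte; ch ends up as the character number
}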
/ Marcus Comstedt (ACROSS) (Hail Ilpalazzo!)
Previous text:
2003-09-24 12:20: Subject: utf8_char_index
I need a hack for UTF-8 conversion: UTF-8 string byte offset <-> character number.
Does anyone know enough UTF-8 and feel up to it?
/ Mirar
It's probably not possible to do it all that much faster than strlen(utf8_to_string(X[..offset])).
The only thing you can skip is the string generation. It's always O(n) to go from byte index to character index in UTF-8.
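To illustrate the point about skipping the string generation, here is a minimal sketch in C that only counts characters while scanning. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx. The function name is hypothetical, and the sketch assumes the input is valid UTF-8 and that the offset lies within the buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of Pike or PCRE): number of characters
 * that start within the first `offset` bytes of the UTF-8 buffer `s`.
 * Assumes valid UTF-8 and that `offset` does not exceed the buffer. */
size_t utf8_byte_to_char(const unsigned char *s, size_t offset)
{
    assert(s != NULL);
    size_t chars = 0;
    for (size_t i = 0; i < offset; i++) {
        /* Continuation bytes are 10xxxxxx; any other byte starts a character. */
        if ((s[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}
```

This is still O(n) in the offset, but it allocates nothing.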
/ Per Hedbor ()
Actually, I want to do it for a whole list of offsets (at least two), unfortunately unsorted. That could be optimized.
It's the exec() function in PCRE that returns the indexes, and I was thinking of having the widestring wrapper object return the correct result (character offsets) from that function, not byte offsets.
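One possible way to optimize the unsorted list, sketched in C: remember where each offset came from, sort by byte offset, and then walk the string once, writing each result back to its original slot. All names here are hypothetical and the code assumes valid UTF-8. Negative entries are left alone, since PCRE reports unset groups as -1.

```c
#include <assert.h>
#include <stdlib.h>

/* Pairs a byte offset with its position in the caller's array. */
struct offset_slot {
    int byte_off;
    size_t slot;
};

static int cmp_offset_slot(const void *a, const void *b)
{
    const struct offset_slot *x = a, *y = b;
    return (x->byte_off > y->byte_off) - (x->byte_off < y->byte_off);
}

/* Hypothetical helper: converts n byte offsets into character offsets in
 * place, scanning the first `len` bytes of `s` at most once.
 * Returns 0 on success, -1 if memory could not be allocated. */
int utf8_offsets_to_chars(const unsigned char *s, int len, int *offs, size_t n)
{
    assert(s != NULL && offs != NULL);
    if (n == 0)
        return 0;
    struct offset_slot *order = malloc(n * sizeof *order);
    if (order == NULL)
        return -1;
    for (size_t i = 0; i < n; i++) {
        order[i].byte_off = offs[i];
        order[i].slot = i;
    }
    qsort(order, n, sizeof *order, cmp_offset_slot);

    int pos = 0, chars = 0;
    for (size_t i = 0; i < n; i++) {
        int target = order[i].byte_off;
        if (target < 0)
            continue;               /* unset entry, keep as-is */
        if (target > len)
            target = len;           /* clamp to the end of the buffer */
        while (pos < target) {
            if ((s[pos] & 0xC0) != 0x80)
                chars++;            /* a non-continuation byte starts a character */
            pos++;
        }
        offs[order[i].slot] = chars;
    }
    free(order);
    return 0;
}
```

The cost is O(k log k) for sorting k offsets plus a single O(n) scan, rather than one scan per offset.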
I also need the reverse function for continued search (start_index).
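The reverse direction could look roughly like this in C: step over one lead byte and its continuation bytes per character until the requested character number is reached. The function name is hypothetical and valid UTF-8 is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: byte offset at which character number `char_index`
 * starts in the `len`-byte UTF-8 buffer `s`. Returns `len` when the buffer
 * holds `char_index` characters or fewer. Assumes valid UTF-8. */
size_t utf8_char_to_byte(const unsigned char *s, size_t len, size_t char_index)
{
    assert(s != NULL);
    size_t pos = 0;
    while (pos < len && char_index > 0) {
        pos++;                                        /* step over the lead byte */
        while (pos < len && (s[pos] & 0xC0) == 0x80)
            pos++;                                    /* and its continuation bytes */
        char_index--;
    }
    return pos;
}
```

The result is the kind of value a continued search would pass as its byte-level start index.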
/ Mirar
Previous text:
2003-09-24 12:25: Subject: utf8_char_index
It's probably not possible to do it all that much faster than strlen(utf8_to_string( X[..offset])).
The only thing you can skip is the string generation. It's always O(n) to go from byte index to character index in UTF-8.
/ Per Hedbor ()
It would also be possible to generate a byte-offset -> index table when generating the UTF-8 from the widestring. Then the lookup would be O(1), but the table would use 4 extra bytes of memory for each byte of UTF-8.
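A rough sketch of such a table in C follows. As a simplification it builds the table from the finished UTF-8 bytes rather than during encoding from the widestring, and the function name is hypothetical. Each entry is a uint32_t, which is where the 4 bytes per byte come from.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns a malloc'd table of len + 1 entries where
 * table[b] is the number of characters that start before byte offset b.
 * Returns NULL if allocation fails. The caller frees the table. */
uint32_t *utf8_build_index_table(const unsigned char *s, size_t len)
{
    assert(s != NULL);
    uint32_t *table = malloc((len + 1) * sizeof *table);
    if (table == NULL)
        return NULL;
    uint32_t chars = 0;
    for (size_t b = 0; b < len; b++) {
        table[b] = chars;
        if ((s[b] & 0xC0) != 0x80)
            chars++;                /* a non-continuation byte starts a character */
    }
    table[len] = chars;             /* entry for the end-of-buffer offset */
    return table;
}
```

After the one-time O(n) build, each byte offset converts with a single array lookup.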
/ Per Hedbor ()
Previous text:
2003-09-24 12:28: Subject: utf8_char_index
Actually, I want to do it for a whole list of offsets (at least two), unfortunately unsorted. That could be optimized.
It's the exec() function in PCRE that returns the indexes, and I was thinking of having the widestring wrapper object return the correct result (character offsets) from that function, not byte offsets.
I also need the reverse function for continued search (start_index).
/ Mirar
I have the feeling someone *will* run this function on a 500 MB string someday, so the table might be a bad idea. :)
/ Mirar
Previous text:
2003-09-24 12:30: Subject: utf8_char_index
It would also be possible to generate a byte-offset -> index table when generating the UTF-8 from the widestring. Then the lookup would be O(1), but the table would use 4 extra bytes of memory for each byte of UTF-8.
/ Per Hedbor ()
pike-devel@lists.lysator.liu.se