The best would probably be a method like 'array(string,string) string_to_utf8_with_index( string input );'
that returns the UTF-8 string together with a string holding the byte->character mapping (an array of integers would also work, but that would use even more memory).
[string index,string utf8] = string_to_utf8_with_index( data );
array(int) offsets = rows(index,regexp_function_utf8( utf8 ));
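
To make the idea concrete, here is a rough sketch of the mapping, written in Python rather than Pike purely for illustration. The function name mirrors the proposed one and is hypothetical, not an existing API. It uses a list of integers for the index (the memory-hungry variant mentioned above) and adds one trailing entry, which is my own assumption, so that a match ending at the very end of the string can be mapped too.

```python
import re


def string_to_utf8_with_index(text):
    """Hypothetical helper: return (index, utf8) where index[b] is the
    character position that byte b of the UTF-8 encoding belongs to."""
    utf8 = bytearray()
    index = []
    for char_pos, ch in enumerate(text):
        encoded = ch.encode("utf-8")
        utf8 += encoded
        # Every byte of this character maps back to the same character position.
        index.extend([char_pos] * len(encoded))
    # Assumed extra entry: lets an end-of-string byte offset map to len(text).
    index.append(len(text))
    return index, bytes(utf8)


data = "räksmörgås med ost"
index, utf8 = string_to_utf8_with_index(data)

# Stand-in for a regexp engine that works on UTF-8 bytes and reports byte offsets.
match = re.search(rb"med", utf8)
byte_offsets = [match.start(), match.end()]

# Same role as rows(index, ...) in the line above: look up each byte offset.
char_offsets = [index[b] for b in byte_offsets]
```

With this input the byte offsets differ from the character offsets because ä, ö and å each take two bytes, and data[char_offsets[0]:char_offsets[1]] gives back the matched text.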
/ Per Hedbor
Previous text:
2003-09-24 13:10: Subject: utf8_char_index
The main reason to make these functions go fast is actually to prevent people from using UTF8 internally and the PCRE regexp in UTF8 mode themselves... :)
/ Mirar